
Generative and Transformational Techniques in Software Engineering IV: International Summer School, GTTSE 2011, Braga, Portugal, July 3-9, 2011. Revised Papers

Lecture Notes in Computer Science 7680
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany


Ralf Lämmel, João Saraiva, Joost Visser (Eds.)

Generative and Transformational Techniques in Software Engineering IV

International Summer School, GTTSE 2011
Braga, Portugal, July 3-9, 2011
Revised Papers



Volume Editors

Ralf Lämmel
Universität Koblenz-Landau
FB4, Institut für Informatik
B 127, Universitätsstr. 1, 56070 Koblenz, Germany
E-mail: [email protected]

João Saraiva
Universidade do Minho
Departamento de Informática
Campus de Gualtar, 4710-057 Braga, Portugal
E-mail: [email protected]

Joost Visser
Software Improvement Group
P.O. Box 94914, 1090 GX Amsterdam, The Netherlands
E-mail: [email protected]

ISSN 0302-9743
e-ISSN 1611-3349
ISBN 978-3-642-35991-0
e-ISBN 978-3-642-35992-7
DOI 10.1007/978-3-642-35992-7
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2012955412

CR Subject Classification (1998): D.2, D.3, F.3, D.1, F.4.2, D.2.1

LNCS Sublibrary: SL 2 – Programming and Software Engineering

© Springer-Verlag Berlin Heidelberg 2013
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

The fourth instance of the International Summer School on Generative and Transformational Techniques in Software Engineering (GTTSE 2011) was held in Braga, Portugal, July 3–9, 2011.

The biannual, week-long GTTSE summer school brings together PhD students, lecturers, as well as researchers and practitioners who are interested in the generation and the transformation of programs, data, software models, data models, metamodels, documentation, and entire software systems. The GTTSE school draws from several areas of the broad software engineering and programming language communities, in particular: software reverse and re-engineering, model-driven software development, program calculation, generic language technology, generative programming, aspect-oriented programming, and compiler construction. The GTTSE school presents the state of the art in software language engineering and generative and transformational techniques in software engineering with coverage of foundations, methods, tools, and case studies.

The previous three instances of the school were held in 2005, 2007, and 2009, and their proceedings appeared as volumes 4143, 5235, and 6491 in Springer's LNCS series.

The GTTSE 2011 program offered seven long technical tutorials (approx. three hours of plenary time each), six short technical tutorials (approx. 90 minutes each, with 2 speakers in parallel), a special tutorial on communication in computer science (approx. three hours of plenary time), and another special tutorial on tooling research (approx. one hour of plenary time). All of these tutorials were given by renowned researchers in the extended GTTSE community. Typically, a tutorial combines foundations, methods, examples, and tool support. All tutorial presentations were invited by the organizers to complement each other in terms of the chosen application domains, case studies, and the underlying concepts.

The program of the school also included a participants workshop (or students workshop) to which all students had been asked to submit an extended abstract beforehand. The Organizing Committee reviewed these extended abstracts and invited 12 students to present their work at the workshop. The quality of this workshop was exceptional, and two awards were granted by a jury of senior researchers that was formed at the school. Three of the participants responded to the call for contributions to the proceedings; one of the submissions was accepted through peer review.

The program further included a hackathon to exercise technologies for language engineering, software generation, and transformation in the context of the community project 101companies. Junior and senior participants enjoyed this format; 10 teams submitted hackathon contributions. Another two awards were granted by a jury of senior researchers that was formed at the school.


The program of the school and additional resources remain available online.1

In this volume, you can find revised and extended lecture notes for six long tutorials, five short tutorials, and one peer-reviewed participant contribution. Each of the included long tutorial papers was reviewed by two members of the Scientific Committee of GTTSE 2011. Each of the included short tutorial papers was reviewed by three members. The tutorial papers were primarily reviewed to help the authors in compiling original, readable, and useful lecture notes. The submitted participant contributions were peer-reviewed with three reviews per paper. For all papers, two rounds of reviewing and revision were executed.

We are grateful to our sponsors for their support and to all lecturers and participants of the school for their enthusiasm and hard work in preparing excellent material for the school itself and for these proceedings. Thanks to their efforts the event was a great success, which we trust the reader finds reflected in this volume. Our gratitude is also due to all members of the Scientific Committee, who not only helped with the labor-intensive review process that substantially improved all contributions, but also sent their most appropriate PhD students to the school.

September 2012

Ralf Lämmel
João Saraiva
Joost Visser

1 http://gttse.wikidot.com/2011


Organization

GTTSE 2011 was hosted by the Departamento de Informática, Universidade do Minho, Braga, Portugal.

Program Chairs

Ralf Lämmel, Universität Koblenz-Landau, Germany
João Saraiva, Universidade do Minho, Braga, Portugal
Joost Visser, Software Improvement Group, Amsterdam, The Netherlands

Students’ Workshop Chairs

Joost Visser, Software Improvement Group, Amsterdam, The Netherlands
Eric Van Wyk, University of Minnesota, USA

Organization Chair

Jácome Cunha, Universidade do Minho, Portugal

Publicity Chair

Vadim Zaytsev, Centrum Wiskunde & Informatica, The Netherlands

Scientific Committee

Sven Apel, University of Passau, Germany
Arpad Beszedes, University of Szeged, Hungary
Mark van den Brand, TU Eindhoven, The Netherlands
Thomas Dean, Queen's University, Canada
Erik Ernst, University of Aarhus, Denmark
Anne Etien, Polytech'Lille, France
Jean-Marie Favre, OneTree Technologies, Luxembourg
Bernd Fischer, University of Southampton, UK
Dragan Gasevic, Athabasca University, Canada
Jeff Gray, University of Alabama, USA


Yann-Gaël Guéhéneuc, Ecole Polytechnique de Montreal, Canada
Martin Horauer, University of Applied Sciences Technikum Wien, Austria
Nigel Horspool, University of Victoria, Canada
Zhenjiang Hu, National Institute of Informatics, Japan
Jan Jurjens, TU Dortmund, Germany
Christian Lengauer, University of Passau, Germany
Andrea De Lucia, University of Salerno, Italy
Marjan Mernik, University of Maribor, Slovenia
Oscar Nierstrasz, University of Bern, Switzerland
Klaus Ostermann, University of Marburg, Germany
Jens Palsberg, UCLA, USA
Jeff Z. Pan, The University of Aberdeen, UK
Massimiliano Di Penta, University of Sannio, Italy
Alfonso Pierantonio, University of L'Aquila, Italy
Zoltan Porkolab, Eotvos Lorand University, Hungary
Markus Puschel, ETH Zurich, Switzerland
Andreas Prinz, University of Agder, Norway
Davide Di Ruscio, University of L'Aquila, Italy
Bran Selic, Malina Software Corp., Canada
Olaf Spinczyk, TU Dortmund, Germany
Perdita Stevens, University of Edinburgh, UK
Tarja Systa, Tampere University of Technology, Finland
Walid Taha, Halmstad University, Sweden
Peter Thiemann, University of Freiburg, Germany
Simon Thompson, University of Kent, UK
Laurence Tratt, Middlesex University, UK
Eric Van Wyk, University of Minnesota, USA
Daniel Varro, Budapest University of Technology and Economics, Hungary
Andreas Winter, Carl von Ossietzky University Oldenburg, Germany
Steffen Zschaler, King's College London, UK


Sponsoring Institutions

Departamento de Informática, Universidade do Minho


Table of Contents

Part I: Long Tutorials

Compilation of Legacy Languages in the 21st Century . . . . . . . . . . . . . . . . 1
Darius Blasband

Variation Programming with the Choice Calculus . . . . . . . . . . . . . . . . . . . . 55
Martin Erwig and Eric Walkingshaw

Leveraging Static Analysis in an IDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Robert M. Fuhrer

Differencing UML Models: A Domain-Specific vs. a Domain-Agnostic Method . . . 159
Rimon Mikhaiel, Nikolaos Tsantalis, Natalia Negara, Eleni Stroulia, and Zhenchang Xing

Model Management in the Wild . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Richard F. Paige, Dimitrios S. Kolovos, Louis M. Rose, Nikos Matragkas, and James R. Williams

Bidirectional by Necessity: Data Persistence and Adaptability for Evolving Application Development . . . 219
James F. Terwilliger

Part II: Short Tutorials

Requirements for Self-adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
Nelly Bencomo

Dynamic Program Analysis for Database Reverse Engineering . . . . . . . . . 297
Anthony Cleve, Nesrine Noughi, and Jean-Luc Hainaut

Model-Based Language Engineering with EMFText . . . . . . . . . . . . . . . . . . 322
Florian Heidenreich, Jendrik Johannes, Sven Karol, Mirko Seifert, and Christian Wende

Feature-Oriented Software Development: A Short Tutorial on Feature-Oriented Programming, Virtual Separation of Concerns, and Variability-Aware Analysis . . . 346
Christian Kastner and Sven Apel

Language and IDE Modularization and Composition with MPS . . . . . . . . 383
Markus Voelter


Part III: Participants Contributions

Tengi Interfaces for Tracing between Heterogeneous Components . . . . . . . 431
Rolf-Helge Pfeiffer and Andrzej Wąsowski

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449


Compilation of Legacy Languages in the 21st Century

Darius Blasband

RainCode, 45 rue de la Caserne, 1000 Brussels, Belgium
[email protected]
http://www.raincode.com

Abstract. This is the true story of the development of a PL/I compiler for Microsoft's .NET platform. This compiler uses a front-end originally designed for legacy modernization purposes. It was developed without any influence on the language design, which was thus imposed upon the development team. It targets a virtual machine with an architecture totally different from the one PL/I was designed for. The impact of these factors on the development and architecture is discussed.

More pragmatic concerns, such as compile-time performance, testing and quality control, emulating PL/I numeric data types, and the CICS and SQL extensions, are discussed as well.

1 Introduction

This paper is about the development of a compiler for PL/I, an old language that has enjoyed only limited academic scrutiny and that is often described as one of the most complex languages ever from a compiler writer's point of view. This introduction aims at setting the stage, and describes the language as well as the technical and industrial environment.

This paper will also emphasize all the issues that make the development of a compiler very different from a re-engineering or migration solution, even if, from a distance, one can be fooled into thinking that they should be similar, as they both ultimately translate code from some source language to some target language.

More specifically, the backtracking-based parsing techniques used for this compiler will be compared with GLR (introduced by Lang[56] and discovered independently by Tomita[67]), which is commonly used in re-engineering tools. Island grammars and their applicability to a compiler development project will be discussed as well.

1.1 An Introduction to PL/I

1.1.1 In the Beginning...

PL/I[50] was originally designed as a best of breed of the languages available at the time, a mixture between COBOL, FORTRAN and ALGOL. The intention was to allow programmers with a background in any of these languages to use PL/I with minimal additional effort. This school of thought has pretty well vanished by now, as the combination of such heterogeneous sets of features has proven toxic. It makes the language hard to compile, hard to maintain and counter-intuitive in places.

Even so, this odd origin explains quite a few of PL/I’s characteristics.

1.1.2 A Flavor of the Language

From samples such as the one displayed in figure 1, PL/I[50] looks like a fairly conventional block-structured language.

PRIME: PROC(N) RETURNS (FIXED BIN);
   FACTORCOUNT: PROC(N) RETURNS (FIXED BIN);
      DCL (N) FIXED BIN;
      DCL (COUNT,SQROOT,I) FIXED BIN;
      COUNT = 2;
      SQROOT = SQRT(N);
      DO I = 2 TO SQROOT;
         IF MOD(N,I) = 0 THEN
            COUNT = COUNT + 1;
      END;
      RETURN(COUNT);
   END FACTORCOUNT;

   DCL (N) FIXED BIN;
   IF (FACTORCOUNT(N) = 2) THEN
      RETURN(1);
   ELSE
      RETURN(0);
END PRIME;

Fig. 1. A sample PL/I program

However, there is far more to PL/I than a Pascal-like language with an aging syntax:

– Character variables can be defined with a fixed length, or can be marked explicitly as VARYING. Varying character variables are allocated with a fixed length, but include a two-byte prefix to indicate the number of characters currently in use.

– PL/I supports a wealth of numeric data types, including fixed decimals, fixed binaries, picture types and floating point. It also supports complex numbers, but those were mainly used for scientific applications, while the bulk of the legacy PL/I code our compiler is meant to address is made of common business data processing applications. Fixed types are defined by a number of digits (binary or decimal) and a position for an implicit decimal point. Complex conversion rules control how the compiler implicitly converts data from one type to another.

– Parameters are passed by reference. When a literal is passed as parameter to a procedure or function, or when there is a type mismatch between the actual and formal types, the compiler automatically defines a temporary variable matching the expected formal type, and converts the actual parameter into this temporary variable before passing it to the procedure or function. This apparently sensible approach means that a procedure or function which intends to perform changes on one of its parameters may in fact not change anything if the actual parameter provided to it has a slightly different data type. More recent versions of the PL/I language definition allow for explicit pass by value, but this does not have to be supported by a compiler aiming at porting legacy systems onto more modern platforms.

– PL/I provides an intricate I/O subsystem that covers versatile formatting options inherited from FORTRAN, as well as access to indexed sequential data files similar to COBOL's.

– Variables can be based on a pointer expression, meaning that they have no allocation of their own. The pointer expression is evaluated to compute the address where the variable must be read, as shown in figure 2, where a scaffolding of such based variables is used to have a single variable access result in an implicit multiple pointer dereference.

DCL BPTR POINTER;
DCL 1 MREC BASED(BPTR),
      2 SUBPTR POINTER;
DCL 1 UREC BASED (MREC.SUBPTR),
      2 UPTR POINTER;
DCL VAL CHAR(10) BASED (UREC.UPTR);
...
VAL = 'Hello'; /* Implicitly dereferences BPTR, SUBPTR and UPTR */

Fig. 2. Based variables

Page 14: Generative and Transformational Techniques in Software Engineering IV: International Summer School, GTTSE 2011, Braga, Portugal, July 3-9, 2011. Revised Papers

4 D. Blasband

– PL/I relies on exceptions extensively, up to a point where some information cannot be obtained by any other means (to test for the end of an input file, for instance). The default behavior for a number of PL/I exceptions, depending on whether the exception is fatal or not, is to execute the corresponding exception handler, return to the place where the exception was raised, and resume execution. In practice, it is common for PL/I exception handlers to set some error code, write some log information and exit the current scope, thereby overriding this return to the place where the exception was raised. This common practice is similar to exception handling as provided by more modern languages (such as Java and C#).

– PL/I also comes with a comprehensive preprocessor, closer in terms of scope to a macro assembler than to the limited facility used in C and C++. It allows for precompilation-time variables, functions, loops, etc.

– PL/I does not support an explicit boolean type (even though Algol 60 did), but it provides a versatile bit string data type, and it is common practice to use a string made of a single bit to represent a boolean value.

– PL/I does not provide any form of explicit type declarations to factorize the more physical properties of a data type into an abstract named entity that can be referred to when declaring variables. The only type structuring construct available is the LIKE clause, which allows a variable to be declared as of the same type as another.

– Variables can be defined implicitly, and their type will depend on the context (a pointer if the first use of the variable is a BASED clause, for instance) or on their name if the context does not provide any valuable hint.

– When allocated dynamically, a structure can be self-defining, in the sense that some fields of the structure (in figure 3, BUFLEN1 and BUFLEN2) can determine the size of other fields in the same structure (resp. BUFFER1 and BUFFER2), which are then dynamic. Since the fields must be allocated contiguously, the offsets of BUFLEN2 and BUFFER2 within the BUFFERS structure are dynamic as well. The two expressions on the left of the REFER keyword, BUFPARAM and BUFPARAM*2, give the initial values that must be used and stored into BUFLEN1 and BUFLEN2 when the BUFFERS structure is allocated.

DCL 1 BUFFERS BASED(PP),
      2 BUFLEN1 FIXED BIN(31,0),
      2 BUFFER1 CHAR (BUFPARAM REFER BUFLEN1),
      2 BUFLEN2 FIXED BIN(31,0),
      2 BUFFER2 CHAR (BUFPARAM*2 REFER BUFLEN2);
DCL BUFPARAM FIXED BIN(31,0);

Fig. 3. Self-describing structure


– PL/I supports controlled variables. They are similar in spirit to what C's runtime library (implemented by the malloc and free functions) provides, but with a twist, namely the fact that allocating the same controlled variable multiple times will implicitly link these allocations together. Freeing the allocated variable will equally implicitly reset it to the previous value of the controlled variable, as shown in figure 4 (a possible .NET emulation is sketched right after the figure).

DECLARE X FIXED BIN CONTROLLED;
ALLOCATE X;
X = 10;
ALLOCATE X;
X = 20;
FREE X;
/* X IS 10 AGAIN !!! */

Fig. 4. Stacked allocations
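The stacking behavior of figure 4 maps naturally onto a stack of storage cells. As a purely illustrative sketch in C# (one plausible emulation, not necessarily what the RainCode runtime actually does), a controlled variable could be modeled as follows:

using System.Collections.Generic;

// Illustrative emulation of a PL/I CONTROLLED variable: every ALLOCATE
// pushes a fresh storage cell, FREE pops it and exposes the previous
// allocation again, mimicking the behavior shown in figure 4.
class Controlled<T>
{
    private readonly Stack<T> generations = new Stack<T>();

    public void Allocate() { generations.Push(default(T)); }
    public void Free()     { generations.Pop(); }

    public T Value
    {
        get { return generations.Peek(); }
        set { generations.Pop(); generations.Push(value); }
    }
}

With this sketch, the sequence of figure 4 becomes x.Allocate(); x.Value = 10; x.Allocate(); x.Value = 20; x.Free(); after which x.Value is 10 again.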

The list of PL/I's features and oddities goes on and on. It is arguably one of the most complex languages ever designed, up to a point where one cannot help but wonder how the first implementers managed to write a compiler for such a language with as little as 32 or 64K of memory, which is far less than what any compiler uses nowadays for something as primitive as symbol tables.

Even PL/I's syntax raises serious issues. PL/I is a classical case of a language where keywords are not reserved. A statement such as the one shown in figure 5 is perfectly valid. There is nothing that prevents a user from declaring (implicitly or explicitly) a variable named IF, WHILE, etc.

IF IF=THEN THEN
   THEN = ELSE;
ELSE
   ELSE = THEN;

Fig. 5. PL/I's lexical ambiguities

This apparently unreasonable degree of liberty does come with a rationale. Given the very size of the PL/I language definition and its origins, it is reasonable to expect many users to rely on a subset of the language. It would be uncomfortable to see many possible names for user-defined identifiers considered illegal because they would clash with a part of the language definition one is not even aware of. This issue is made even more critical by the fact that PL/I supports abbreviations (in figure 1, DCL is an abbreviation for DECLARE, PROC is an abbreviation for PROCEDURE, etc.), thereby further increasing the already large number of (non-reserved) keywords in the language.


A language without reserved words is also more resilient to evolution. When the SELECT statement was introduced a few years after the first release of PL/I, no special attention was needed for programs with variables or procedures named SELECT.

1.1.3 Dialects

Because of its complexity and large scope, a large number of restricted PL/I dialects have been introduced to cater for specialized niches, mostly with an emphasis on system programming. IBM successively introduced PL/S[17], PL/X and PL/DS. Intel has been using and promoting PL/M[16] for system development on microprocessors since 1972.

1.1.4 Compiling PL/I

Quite unusually for a language with so little academic support or interest, a comprehensive report about the development of a production-quality PL/I compiler for the Multics operating system[32][14] is available[19]. For instance, it explains how one can cope with lexical ambiguities as shown in figure 5.

When parsing a statement, a preliminary screener checks whether the statement at hand is an assignment of the form X=... or X(...)=..., as it is the only statement that can start with an identifier. If it is an assignment, the two parts are analyzed as expressions and any leading keyword is understood as a user-defined identifier.

If it is not an assignment, the statement must start with a keyword, and a separate branch of the syntactical analyzer then checks for the syntax of the various other statements.
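In other words, the screener only has to inspect the first few tokens of a statement. A small C# sketch of the idea (purely illustrative, unrelated to the actual Multics or RainCode code) could look like this:

using System.Collections.Generic;

class StatementScreener
{
    // Decide whether a statement whose first token is an identifier is an
    // assignment (X=... or X(...)=...); otherwise it must start with a keyword.
    public static bool LooksLikeAssignment(IReadOnlyList<string> tokens)
    {
        // Form 1: IDENTIFIER = ...
        if (tokens.Count > 1 && tokens[1] == "=") return true;

        // Form 2: IDENTIFIER ( ... ) = ...   (subscripted target)
        if (tokens.Count > 1 && tokens[1] == "(") {
            int depth = 0, i = 1;
            for (; i < tokens.Count; i++) {
                if (tokens[i] == "(") depth++;
                else if (tokens[i] == ")" && --depth == 0) break;
            }
            return i + 1 < tokens.Count && tokens[i + 1] == "=";
        }
        return false;
    }
}

Applied to the statement of figure 5, the screener sees IF followed by another IF rather than by = or (, and therefore dispatches to the keyword-statement branch.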

This approach is similar to what our PL/I parser does in terms of backtracking on the input, except that the most specific interpretation of the input will always be tested first in DURA-generated parsers (see 1.2.3). If a lexeme can be understood as a keyword, this interpretation will be tested first. It is only if this path of analysis fails, or if the lexeme cannot be interpreted as a keyword, that an assignment will be tested for.

The ad hoc technique described above for the Multics compiler cannot be used in our context, as it does not integrate gracefully in our compiler-compiler architecture (it is probably better suited for tools that support predicates, such as ANTLR[62]), but it shows how PL/I's language design is less idiotic than it seems at first sight. The language grammar is such that lexical ambiguity can be addressed locally in the parser in a few well-identified spots.

The Multics PL/I report[19] also provides valuable data points, such as a total effort of 4 people for 18 months to complete the development and testing of the compiler. Of course, everything has changed since then (the hardware, the amount of memory one can use), the circumstances are different (generating code for a virtual machine vs. generating native code), the constraints are not the same (they focused on performance, we are far more relaxed in this area), and even the language is not really the same. The Multicians dealt with interrupts and multitasking, while they did not have to support SQL and CICS and numerous other extensions. More essentially, their compiler was not meant to compile large amounts of existing code, and was under no obligation to support whatever odd feature was available in another preexisting compiler.

Even so, it is a valuable data point. It clearly indicates that linear development according to the book will not do. Corners would have to be cut, and the success of this endeavor would depend on our ability to leverage existing components.

1.1.5 Multiple Uses

PL/I was meant to be a language that could cover any software development task, including business and scientific applications as well as system programming.

Oddly enough, it has also been mentioned that PL/I was meant to address the growing need for software applications that would span more than one of these domains. Whether this appreciation was true or not at the time, it is an opinion that has essentially disappeared by now.

1.1.6 A Controversial Language

All the defects and flaws of the language mentioned in this paper are not just wisdom obtained as a side effect of time passing by. PL/I has always been a source of heated controversy. As early as 1972, Dijkstra, in his ACM Turing Lecture, said[37]:

. . . PL/I, a programming language for which the defining documentation is of a frightening size and complexity. Using PL/I must be like flying a plane with 7000 buttons, switches and handles to manipulate in the cockpit. . . . if I have to describe the influence PL/I can have on its users, the closest metaphor that comes to my mind is that of a drug. . . . When FORTRAN has been called an infantile disorder, full PL/I, with its growth characteristics of a dangerous tumor, could turn out to be a fatal disease.

Also in 1972, Holt[44] advocates for the definition of a very limited subset of PL/I, as it provides a wealth of features which, according to the author, do more harm than good.

On the other hand, PL/I did get some traction in the market, as many programmers praised it for its flexibility, often in contrast to the rigidity and poor expressiveness of COBOL.

1.2 The Preexisting PL/I Infrastructure

This section describes the PL/I tool set (referred to as the "RainCode Engine for PL/I") we had before starting work on this compiler, and which served as the basis for this project. A number of design issues are shown to be inherited from this existing infrastructure, rather than being true decisions per se.


1.2.1 The RainCode Engine for PL/I

The RainCode Engine is a legacy modernization tool, which is available for a number of programming languages, including COBOL and PL/I.

It reads an input file, preprocesses it, and parses it into a parse tree, which is annotated with semantic information. This annotated parse tree is then given to user-written scripts that can operate on the tree, to find instances of specific patterns, measure complexity, count instances of constructs, produce various forms of outputs, etc. The RainCode Engine is thus an enabling technology that allows one to develop source code analysis and transformation tools while reusing an existing parsing infrastructure.

1.2.2 YAFL as the Implementation Language

The RainCode Engine is implemented in the YAFL programming language[24], together with compiler-compiler extensions[25] summarized briefly below.

YAFL is a statically typed, object-oriented, home-grown language that provides single inheritance, generics, interfaces and quantifiers[28]. It compiles to intermediate C code for performance and portability. Memory is managed automatically, using a precise garbage collector.

The YAFL compiler is extensible, in the sense that one can use it as a library, inherit from its classes and redefine behaviors to alter code generation or even support language extensions. It is a compile-time version of reflection[65], which should not be confused with its more commonly used runtime counterpart, which is also supported by YAFL using an ad hoc class library.

1.2.3 The Underlying Parsing Technology

RainCode is based on a parsing technology named DURA[25][26]. It is an SLR parser[35] generator with backtracking. SLR has been preferred over LALR[36] because of its simplicity, the extra recognition power of LALR being compensated by backtracking. DURA is similar to other backtracking parser technologies such as Lark[41] or btyacc[68], with two major differences.

First, the parsers generated by DURA cooperate with lexers built by a special lexer generator, named lexyt[25]. These lexers can backtrack as well, and address the case shown in figure 5 by stating that IF is a keyword that can backtrack to a plain identifier if the parser could not find a suitable way of understanding the input with an IF keyword at that position.

Second, the grammar is not given in a separate source file, but integrated into the YAFL source code (see figure 6) as grammar clauses in specifically marked non-terminal classes. DURA is thus little more than a YAFL compiler with extensions to support grammar rules in class definitions. YAFL supports a compiler extension mechanism[24], and even though it predates the very concept of DSL (Domain Specific Language), the techniques used to embed a language extension into an existing infrastructure are similar.

The grammar rule refers to class attributes. When synthesizing a rule, an instance of the enclosing class is created, and its attributes populated. DURA-generated parsers do more than recognize the input language: they build the corresponding strongly typed parse tree automatically.

This integration goes further than just synthesizing attributes: when reducing a non-terminal, a Commit method is called. This method can be redefined to implement parse-time semantic actions, or to provide predicates that control whether a given reduction is valid or not. If the Commit method returns FALSE, the reduction is considered invalid, the reduction is undone and the backtracking parser tries another way of mapping the input to a valid parse tree.

NONTERMINAL CLASS IfStatement;
INHERITS Statement;
GRAMMAR
   IfKeyword TheCondition ThenKeyword
      TheIfStatement
      (ElseKeyword TheElseStatement)?;
VAR
   TheCondition: Expression;
   TheIfStatement,
   TheElseStatement: Statement;
END IfStatement;

Fig. 6. The DURA definition for the PL/I IF statement

Similarly to what Lark and btyacc provide, DURA can be tweaked to consider a specific LR transition as a last resort only. All the other actions (shift and reduce) must have been tried first. Figure 7 shows the usage of the YIELD operator that can be used to lower the priority of a grammar rule.

1.2.4 The RainCode Scripting Language

The RainCode scripting language is interpreted and dynamically typed. It is specifically designed for convenience and expressiveness when performing analysis and transformation on DURA-generated parse trees.

It provides a number of built-in facilities, such as list/set handling, string handling, access to relational or XML data, etc. in the form of function libraries.

It also supports a multi-paradigm programming style. One can write procedural code, use quantifiers (not unlike YAFL's[28]) to express properties over sets, revert to a more functional programming style with lambda calculus, or even use Prolog[73]-like logic expressions with unification and backtracking.

Even though it is often used for small-scale ad hoc operations, the RainCode Scripting Language also supports the ability to develop larger transformation systems. One can divide a system into modules of manageable size, and have them call each other. One can also define libraries of reusable modules to be shared among multiple projects.


NONTERMINAL CLASS DoGroup;
INHERITS Group;
GRAMMAR
   DoHeader TheBody EndKeyword
      TheEndId? SemiColon;
   DoHeader TheBody YIELD;
   ...
END DoGroup;

NONTERMINAL CLASS DoHeader;
INHERITS BaseNonTerminal;
GRAMMAR
   DoKeyword SemiColon;
   ...

Fig. 7. DURA's YIELD operator


1.2.5 Statically Compiled vs. Dynamic Behavior

The RainCode scripting language is dynamic, and relies on the static object model built directly by the DURA-generated parser. The classes that are used to build the parse tree, as well as their grammar attributes, are made available automatically in the RainCode scripting language by means of YAFL's reflection API.

This means that the dynamic RainCode Scripting Language and the native YAFL implementation share the same object model, as they are mapped to each other. Whenever some new functionality is needed in a RainCode project, we have the choice of implementing it in the engine itself, in YAFL, or defining it externally, in the RainCode Scripting Language. The same information is available in both environments, structured in the same way, and the elements (classes, attributes, methods) have the same names.

It is then tempting to consider using the RainCode Scripting language to write at least parts of the compiler. It is easy to use for prototyping, it is flexible and performance would be adequate, mainly because the performance is dominated by parsing.

However, we consider the RainCode scripting language ill-suited for large-scale mission-critical products that must be deployed under the customer's responsibility and supervision. It really shines when it comes to the development of ad hoc source code analysis or transformation tasks, but it is too dynamic for comfort to be used to develop shrink-wrapped products.


Writing the compiler in the RainCode Scripting language has never been considered a viable option. It would provide early benefits because of its expressive power and flexibility, but would turn into an intractable burden later on, when the lack of compile-time typing would make bug fixing very tedious. The PL/I compiler is therefore written entirely in YAFL, statically compiled, type-checked at compile time, etc.

1.3 An Introduction to .NET

.NET1 is a framework developed by Microsoft that runs on Microsoft's Windows operating system. .NET includes a virtual machine referred to as the CLR (for Common Language Runtime), where programs run in a protected environment. .NET is the sole target environment for this PL/I compiler.

1.3.1 Managed vs. Non-managed Code

The CLR makes a clear distinction between two forms of code:

– Managed code is made of a high-level instruction set, converted at compile time or just in time into executable code for the hardware platform of choice. It is executed under the supervision of the CLR, and defers such issues as concurrency and memory management to it. Managed code applications provide guaranteed independence from each other, as well as memory safety by means of the garbage collector.

– Unmanaged code is platform-native code, which is not monitored and which runs under the user/developer's responsibility.

Managed code is preferred whenever applicable, for the benefits induced by the virtual machine, while unmanaged code can be used to integrate a legacy component written in some low-level language for which no .NET compiler exists, or when performance constraints make managed code impractical.

Even though it is based on an existing front-end, the PL/I compiler's code generator has been written from scratch. There is thus no legacy to accommodate, and the performance constraints do not justify reverting to unmanaged code. Our PL/I compiler generates managed code only.

1.3.2 Unsafe Code

The distinction between managed and unmanaged code should not be confused with safe vs. unsafe code. Unsafe code is managed code where low-level operations such as pointer operations are allowed. It still runs under the CLR, and it is still running the virtual machine's instruction set.

For obvious reasons, unsafe code should be avoided whenever possible. In the PL/I compiler, it has been used just twice, for performance reasons, in well-defined and localized methods in the runtime library. It is never used in code actually generated by the compiler.

1 All the information about the .NET framework is easily available, with more details than this paper will ever be able to list. Only the most relevant features of .NET are described here; the motivated reader can access infinitely more through the Microsoft web site.


private unsafe static int reverseInt32(int v)
{
    int vv = v;
    byte* l = (byte*)&vv;
    byte* r = l + 3;
    byte b;

    b = *l;
    *l = *r;
    *r = b;

    l++;
    r--;

    b = *l;
    *l = *r;
    *r = b;

    return vv;
}

Fig. 8. Unsafe code in the PL/I compiler's runtime library

One is given in figure 8. It deals with the endianness of binary numerics, by reversing the byte order of a .NET 32-bit integer. It is easy to assess that this method's usage of pointers is limited to the integer value it is applied to, and that in the absence of loops, this function can be asserted to be correct and safe.

This kind of localized optimization has resulted in a 15% performance improvement measured across all benchmarked PL/I modules.
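For comparison, the same byte reversal can also be written in purely safe managed code. The following is merely an illustrative, functionally equivalent sketch, not the version used in the runtime library, which favored the unsafe variant for performance:

private static int reverseInt32Safe(int v)
{
    // Reassemble the four bytes of the integer in the opposite order,
    // using shifts and masks instead of pointer arithmetic.
    uint u = (uint)v;
    return (int)((u >> 24) |
                 ((u & 0x00FF0000u) >> 8) |
                 ((u & 0x0000FF00u) << 8) |
                 (u << 24));
}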

1.3.3 How an Executable Can Be Generated

There are three ways one can generate a .NET portable executable:

– One can use .NET's reflection API to build assemblies out of classes, methods, and ultimately, atomic .NET statements. While this API can be used to build and execute programs on the fly, it also provides a convenient and strongly typed API for the backend of a compiler's code generator.

– One can generate the portable executable's binary format directly, as it is thoroughly and comprehensively documented.

– One can generate ILASM[11] source files, and use .NET's standalone assembler to generate portable executables.

The first option should be preferred whenever applicable, as it considerably simplifies the compiler's logistics. It reuses an existing framework to represent executables in a strongly typed fashion. Unfortunately, this option was not available to us, as these code generation APIs are supported under .NET, while our compiler is a native process (as it is based on the generation of intermediate C code for compilation). Building the sophisticated bridges required to interface .NET code with native code did not seem to be worth the trouble in this case.
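To make the first option concrete, here is a minimal, self-contained C# sketch using System.Reflection.Emit. It only illustrates what building code through the reflection API looks like; it is not part of the PL/I compiler, which, as explained above, could not take this route:

using System;
using System.Reflection.Emit;

class ReflectionEmitDemo
{
    static void Main()
    {
        // Build a small int Add(int, int) method at runtime.
        var add = new DynamicMethod("Add", typeof(int),
                                    new[] { typeof(int), typeof(int) });
        ILGenerator il = add.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);   // push the first argument
        il.Emit(OpCodes.Ldarg_1);   // push the second argument
        il.Emit(OpCodes.Add);       // pop both, push the sum
        il.Emit(OpCodes.Ret);       // return the value left on the stack

        var f = (Func<int, int, int>)add.CreateDelegate(typeof(Func<int, int, int>));
        Console.WriteLine(f(20, 22));   // prints 42
    }
}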

The second option is always available, but it requires quite some nitty-gritty low-level work, and also requires one to generate debugger support files (with the .PDB extension), while the two other ways listed above take care of that.

Time to market was essential when starting this project, and we opted for the third approach, by reusing the ILASM assembler to produce portable executables with minimal effort. It also presents the added advantage of providing an intermediate deliverable, namely the assembler source code, that can be read and checked for consistency.

The assembler provides adequate performance (see 2.2.1) and has proven reliable and versatile, except for a few minor bugs that ended up being circumvented by small changes in the code generator.

1.3.4 The Instruction Set

The MSIL instruction set is stack-based (just as Java's JVM), which considerably simplifies the task of writing compilers when compared with a register-based machine. Before performing an operation or calling a function, the various operands are pushed on the stack. There is no need for sophisticated register allocation algorithms, which are a significant part of any compiler development effort that targets real-world hardware.

The MSIL assembler source is structured by high-level constructs, such as class, method, exception block, etc., as opposed to a true assembler where one just builds lists of opcodes with arguments and a relocation table, and where the division into semantically meaningful constructs is implied by the compiler.
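The following toy C# sketch illustrates why a stack-based target needs no register allocation: code generation for an expression is a simple post-order walk where every node leaves its result on the evaluation stack. The node classes are illustrative only, not the RainCode compiler's actual classes; only the emitted mnemonics (ldc.i4, ldloc, add) are real MSIL opcodes:

using System.Collections.Generic;

abstract class Expr { public abstract void Emit(List<string> il); }

class IntConst : Expr
{
    public int Value;
    public override void Emit(List<string> il) { il.Add($"ldc.i4 {Value}"); }  // push a constant
}

class Local : Expr
{
    public int Slot;
    public override void Emit(List<string> il) { il.Add($"ldloc {Slot}"); }    // push a local variable
}

class Add : Expr
{
    public Expr Left, Right;
    public override void Emit(List<string> il)
    {
        Left.Emit(il);    // push the left operand
        Right.Emit(il);   // push the right operand
        il.Add("add");    // pop both, push the sum
    }
}

Emitting code for x + 3 (with x in local slot 0) simply yields ldloc 0, ldc.i4 3, add; no decision about which register holds which intermediate value ever has to be made.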

1.3.5 The PEVERIFY Verification Tool

PEVERIFY[15] is a semantic analysis tool that performs a number of checks on .NET assemblies.

It ensures that all the external symbols that are referred to are accessible. It also checks for consistent stack usage. Since all operations have a statically determined effect on the stack, PEVERIFY detects and reports cases where stack underflow or overflow can occur. This also means that the various paths leading to any given label must have the same effect on the stack. It also checks for type consistency. Just as PEVERIFY can assert the effect of every instruction on the stack depth, it can assert the type of whatever value is left on the stack, and thereby ensure that type usage is consistent.

The PEVERIFY tool allows for an early sanity check on the generated code, and allows for defects to be found statically, at compile time, rather than dynamically, at runtime, after painfully long debugging sessions.

Since it operates directly on compiled .NET assemblies, it is immune to the different ways one can produce such assemblies, directly or by going through intermediate MSIL code.

PEVERIFY makes a big difference. It is the tool that avoids the pain usually associated with the generation of native code, where even the most mundane mistake can go undetected until execution, with no useful diagnostic whatsoever. Even though generating code for .NET looks like native code generation, with labels, low-level opcodes and primitive operations, a tool such as PEVERIFY turns the experience closer to the comfort of a reasonably strongly typed intermediate language such as C or better, where the compiler for this intermediate language provides a useful level of validation.

1.4 The Industrial Context

This section focuses on the non-technical issues that have driven the development of this compiler, as well as a number of the resulting design decisions.

1.4.1 The Market Need

PL/I is old, cumbersome and not very portable. Even though there have been rumors about object-oriented extensions to the language, it is not taught any longer, and except for the oddball case, no new development is started in PL/I nowadays.

PL/I originally ran on mainframes only, and the largest PL/I portfolios are still running on that platform. Depending on PL/I thus translates to a double dependency, namely a dependency on the language in its own right, and a dependency on the underlying platform.

Organizations with a serious stake in PL/I which aim at moving away from their mainframe infrastructure are considering their options:

– Those with a portfolio small enough to rewrite it, or standard enough for the replacement by a package to be an option, have done so a long time ago.

– Automated migration delivers lukewarm results when applied to PL/I. Business processing applications should be migrated to COBOL, as it is the only language commonly available today that provides equivalent data types. One can of course always emulate those data types in any other language, but the effect on readability and maintainability is dreadful. See the author's personal favourite[27] for samples of the resulting code when migrating COBOL to Java. Besides data types, COBOL is also far too simplistic in its computing model to adequately represent the more sophisticated memory management features of PL/I. Experiments have shown that automated migration from PL/I to COBOL can achieve a 95% rate, in the sense that 5% of the programs contain constructs that cannot be reasonably mapped to COBOL.

Out of millions of lines of code, 5% of manual remediation is a huge project in its own right. Manually coding these 5% in COBOL is no trivial task, as they would have been processed automatically if it were at all possible. They typically represent statements that have no obvious counterpart in COBOL, and recoding them requires serious analysis and often a non-trivial redesign of the COBOL program at hand.

– Use a PL/I compiler for another platform, and port their code with minimal changes.


The last option has its flaws. It requires keeping some PL/I knowledge in house to maintain the migrated applications, which may not be an organization's favorite way to extend the life of their legacy applications. On the other hand, when code is translated to a target language that is very different from the source language (COBOL or PL/I to Java or C#, for instance), the resulting code is so awkward that it takes serious expertise in both the original and the target language for maintenance to be possible.

This approach also replaces the lock-in with the mainframe platform (hardware + operating system + compiler) vendor with a lock-in with the new compiler vendor. Even so, for large, complex and mission-critical systems, it is the approach that minimizes the technical risk.

1.4.2 A New Kind of Compiler

The PL/I compiler described in this document aims at compiling existing code rather than new code. The purpose is not to say that PL/I should be used to develop supposedly better .NET applications, but to take existing code, compile and run it on a new, cheaper and more flexible platform. It is not meant to promote a better PL/I, or a more concise PL/I, or even a different PL/I. It aims at mimicking what IBM's original PL/I compiler does, for better and for worse.

We'll refer to this compiler as a legacy compiler, as opposed to a development compiler. This different focus implies specific constraints. First, the source language cannot be changed. One must support what's out there. Requiring a source change, even a minor one, would need to be applied several thousand times in any sizable portfolio and would thereby seriously compromise the business value of this new compiler. Then, the original behavior of compiled programs must be maintained, no matter how idiotic it sometimes is (see 3.4.1, for instance). This goes as far as supporting the contradictory differences in behaviors of PL/I compilers. This constraint can be daunting, as witnessed by the huge set of command-line options supported by COBOL systems for Windows or open systems, that are used to emulate the behavior of a number of reference COBOL compilers.

A legacy compiler is not only about additional constraints. It is also about a different focus, which can make the compiler engineer's life easier. For instance, performance is usually not considered a critical issue any longer. Most data processing applications that must be ported using this compiler are almost exclusively I/O bound, and are moved from a platform where CPU cycles remain very expensive to a platform where they are orders of magnitude cheaper, as long as one can run multiple programs in parallel to serve a potentially large number of simultaneous users.

Being I/O bound means that the focus on performance is concentrated on the way the databases are being accessed, and more specifically, on using static SQL to access DB/2 or the relational database of choice.

A legacy compiler can also take advantage of specific features that are not commonly available in development compilers. It can populate a repository in the form of a relational database with data regarding program artifacts. This includes programs, procedures, the call graph, compilation errors, etc. It enables one to extract useful information from this repository by means of plain SQL statements, for inventory, complexity analysis, impact analysis, etc.
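As a purely illustrative example, an impact-analysis query against such a repository could be issued from any .NET program. The CALLS table with CALLER and CALLEE columns used below is a hypothetical schema, not the actual repository layout:

using System;
using System.Data.SqlClient;

class ImpactAnalysis
{
    static void Main()
    {
        // Hypothetical schema: CALLS(CALLER, CALLEE), one row per call site.
        const string sql =
            "SELECT DISTINCT CALLER FROM CALLS WHERE CALLEE = @proc";

        using (var connection = new SqlConnection("<connection string>"))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@proc", "FACTORCOUNT");
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));  // every program calling FACTORCOUNT
            }
        }
    }
}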

2 The Compiler’s Architecture

2.1 The Front-End

The object model for the parse trees produced by DURA-generated parsers is as object-oriented as can be. Encapsulation is enforced strictly, and explicit getters are required for non-terminal classes to give access to their content. For instance, in figure 6, the condition of the IfStatement class is not accessible from an outside class unless an explicit getter is provided, the rationale being that most of what one can do with the condition should be handled internally, within the IfStatement class.

This paradigm is in total contradiction with an ideal architecture where the compiler's code generator is an independent component that consumes the data structure built by the parser. Such a level of independence would require that all the attributes of all the non-terminal classes are made available with getters so that the code generator can access them. Given the size of the object model used to represent PL/I programs, this would have been totally impractical. A simpler and more radical approach was needed.

We finally decided to embed the code for the compiler as methods of the parse tree node datatypes, as a pragmatic and comfort-driven strategy. For instance, the Statement base class defines an abstract method named GenerateCode, that can then be redefined by the IfStatement class as defined above, to produce the appropriate target code, using the private attributes that implement the parse tree without requiring getters. The base class for expressions also provides a similar abstract method, to be redefined by all the different sorts of expressions defined in the PL/I grammar.
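A C# rendering of this pattern might look as follows. The actual compiler is written in YAFL, and the class and helper names below, including the IlWriter helper, are illustrative assumptions rather than the compiler's real types:

using System.Collections.Generic;

// Hypothetical minimal IL writer; the real compiler emits ILASM source text.
class IlWriter
{
    public List<string> Lines = new List<string>();
    private int labelCount;
    public string NewLabel() { return "L" + labelCount++; }
    public void Emit(string op) { Lines.Add("  " + op); }
    public void Label(string name) { Lines.Add(name + ":"); }
}

abstract class Expression { public abstract void GenerateCode(IlWriter il); }
abstract class Statement  { public abstract void GenerateCode(IlWriter il); }

class IfStatement : Statement
{
    // Private parse-tree attributes: no getters are needed, since code
    // generation happens inside the node class itself.
    private readonly Expression condition;
    private readonly Statement thenPart;
    private readonly Statement elsePart;   // may be null

    public IfStatement(Expression c, Statement t, Statement e)
    { condition = c; thenPart = t; elsePart = e; }

    public override void GenerateCode(IlWriter il)
    {
        string elseLabel = il.NewLabel();
        string endLabel = il.NewLabel();
        condition.GenerateCode(il);        // leaves a boolean on the stack
        il.Emit("brfalse " + elseLabel);
        thenPart.GenerateCode(il);
        il.Emit("br " + endLabel);
        il.Label(elseLabel);
        if (elsePart != null) elsePart.GenerateCode(il);
        il.Label(endLabel);
    }
}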

This radical option has of course one simple consequence, namely the fact that the RainCode Engine and the compiler have become intimately intertwined, and can barely be considered as separate products any longer. The compiler is, in fact, little more than the RainCode Engine for PL/I with a built-in code generation facility.

2.1.1 Backtracking Parsers vs. GLR

It is straightforward to ensure that backtracking parsers recognize essentially the same language as parallel parsers (Earley[38] or GLR[56][67]), which build multiple parse trees simultaneously in case of local ambiguity.

When in a state where more than a single action can be taken, GLR and Earley parsers fork and try the different actions in parallel, while backtracking parsers explore the first action, and try the subsequent ones in case of failure.

The only difference lies in the fact that backtracking parsers return the first parse tree that matches the input (unless one forces them to backtrack to return subsequent parse trees), while parallel parsers return a set of all the matching parse trees (or some data structure with shared components, semantically equivalent to such a set of parse trees). This property is especially important when processing natural languages, as all the valid parse trees may have to be returned. A program written in a real-world programming language should have a single interpretation, as it has a unique execution semantics. Having multiple parse trees means that the parser lacks discriminating power, and a separate disambiguating pass is required at the end of the parsing process to select the unique valid parse tree[54][72].

Having a single parse tree in the making at any time also allows for more comprehensive on-the-fly semantic analysis. The well publicized typedef issue[43] forces C, C++, Java and C# parsers to include at least a primitive form of symbol table at parse time to maintain the names of the valid types at any time. The sample shown in section 4.5 describes a similar issue with COBOL, where the parse tree depends on previous data declarations.

In a backtracking parser, one must support the ability to undo operations on the semantic analysis information as symbols are unreduced from the LR stack, but at least the semantic information is unique within the context of a parsing process.

When dealing with parallel parsers, each parallel parse must maintain its own, potentially different semantic information derived from a different understanding of the input. GLR implementations try to merge parallel parses when they reach a common state, but that becomes close to impossible if they all have to maintain separate semantic information.

The number of simultaneous parallel parsing threads cannot be reduced by thread merge, which means that it is likely to explode.

It is thus our opinion, even though it is not substantiated by first-hand experience, that GLR works best when there is no need to maintain semantic information at parse time, in which case it provides invaluable benefits.

When semantic information must be maintained, backtracking is a better way to cope with local ambiguities, as it allows one to deal with a single instance of the semantic information at any time.

2.1.2 Performance Issues

Backtracking is a contentious issue, as it is in theory potentially exponential. In practice though, this is far from being the case, as ambiguity in valid PL/I programs is very localized (see, for instance, the discussion in section 1.1.4 about the simple screener used by the Multics PL/I compiler to lift ambiguities at the statement level). Early experiments have shown that the DURA-generated parsers for PL/I could process about 50,000 lines of code per second (including preprocessing), which is slow compared to what a hand-crafted or a yacc[51]-generated parser can achieve, but fast enough to be practical for industrial use.

2.2 The Complete Process

The complete compilation process is fairly elaborate.

The source code is first preprocessed using the PL/I preprocessor included in the RainCode Engine for PL/I. It builds a preprocessed output, and maintains token per token information to allow for complete synchronization with the original source file. The output of the preprocessor is then parsed, using the backtracking parser. This parser, produced by DURA[25][26], builds a strongly typed parse tree.

On the valid parse tree, the tagger attaches all variable and procedure usages to the corresponding declaration, taking scoping and visibility rules into account.

Two generic walk-through processes are then run on the parse tree, calling the generic Validate and Validate2 methods on all the nodes, using DURA-generated visitors. The second Validate2 method is only needed when some validation process needs all the nodes to have been visited first, and cannot depend on the order of this visit in any way. The process is then divided into two parts, the first in Validate, the second in Validate2, which can rely on Validate having been run on all nodes.
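
As an illustration only (the actual visitors are generated by DURA), the two-pass arrangement amounts to something like the following C# sketch, where only the names Validate and Validate2 are taken from the text and everything else is hypothetical.

using System;
using System.Collections.Generic;

public abstract class Node
{
    private readonly List<Node> children = new List<Node>();
    public IEnumerable<Node> Children => children;
    public void Add(Node child) => children.Add(child);

    // First validation pass: may only rely on the node itself and its subtree.
    public virtual void Validate() { }

    // Second pass: may rely on Validate having been run on all nodes already.
    public virtual void Validate2() { }
}

public static class TreeWalker
{
    // Run the two passes over the whole tree, one after the other, so that
    // Validate2 never depends on the order in which nodes are visited.
    public static void ValidateAll(Node root)
    {
        Walk(root, n => n.Validate());
        Walk(root, n => n.Validate2());
    }

    private static void Walk(Node node, Action<Node> action)
    {
        action(node);
        foreach (var child in node.Children)
            Walk(child, action);
    }
}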

The tagged and validated parse tree is the input to the code generation process, which generates an intermediate tree structure that describes the program in .NET concepts (classes, methods, types, statements, etc.). A number of transformation, analysis and optimization tasks can be performed on this intermediate representation.

This intermediate tree is then serialized as an MSIL assembler source file. This pass is trivial, as the intermediate tree represents .NET concepts without much additional semantic information. The .NET assembler ILASM[11] is used to produce a .NET assembly out of the assembler source file, and the assembly is validated for consistency using the .NET PEVERIFY tool (See 1.3.5).

Each of these phases can cause errors, which result in the subsequent phases being canceled.

2.2.1 Performance

Early benchmarks showed that four of the phases (the first two and the last two) each accounted for about 25% of the execution time, with the other phases taking relatively negligible time.

In other words, performance is heavily dominated by the phases we can't really reduce: preprocessing and parsing depend on DURA parsers, which have been used in production for 15 years and in which all the low-hanging and not-so-low-hanging fruit has been picked long ago, while assembly and verification depend on external tools.

The bulk of the development effort for this compiler goes into the middle phases. Since we know that performance is driven by the other phases, we concentrate on clarity, maintainability and structure rather than execution speed, knowing that the impact on the final performance figures will be minimal.

2.2.2 Working on the Intermediate Representation

The intermediate representation of the PL/I compiler represents the major .NET concepts by a scaffolding of classes: modules, classes, functions, statements, function calls, etc.


Serialization of this intermediate representation into MSIL assembler is trivial, as the concepts map directly and the .NET stack semantics considerably simplifies the process. Generating the code for a + b is merely a matter of generating the code for a, then for b, followed by the appropriate opcode or function call to perform the addition depending on the types of a and b. More information about the operations that are performed on the intermediate representation can be found in section 4.2.
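
As a purely illustrative sketch (the node and helper names are assumptions, and the emitted text is simplified MSIL), the serialization of an addition node follows exactly that pattern:

using System.IO;

public abstract class ExprNode
{
    public abstract bool IsNativeInteger { get; }
    public abstract void Serialize(TextWriter msil);
}

public sealed class AddNode : ExprNode
{
    private readonly ExprNode left;   // a
    private readonly ExprNode right;  // b

    public AddNode(ExprNode left, ExprNode right)
    {
        this.left = left;
        this.right = right;
    }

    public override bool IsNativeInteger => left.IsNativeInteger && right.IsNativeInteger;

    public override void Serialize(TextWriter msil)
    {
        left.Serialize(msil);    // push the value of a
        right.Serialize(msil);   // push the value of b

        // Depending on the operand types, either a native opcode or a call to a
        // (hypothetical) runtime helper performs the addition.
        if (left.IsNativeInteger && right.IsNativeInteger)
            msil.WriteLine("    add");
        else
            msil.WriteLine("    call ...Arithmetic::Add(...)");  // placeholder for the helper call
    }
}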

3 Mapping PL/I to .NET

This section provides a partial list of issues encountered when mapping PL/I concepts to .NET.

3.1 Memory Model

From the outset, it is tempting to believe that PL/I has inherited a common block structure from its Algol ancestor, and that variables (local, static or parameters) can be mapped to the corresponding concept in the .NET world. This simplistic analysis soon shows its limits, as PL/I has a number of additional properties that make it harder to represent accurately enough in .NET:

– One can actually take the address of any variable, using a variety of mechanisms (explicit pointers, based variables, etc.)

– Types are defined in terms of their physical representation, and programs can and do rely on these properties. Consider for instance figure 9 where a 32 bit integer (expressed as a 31 bit integer, the sign taking one additional, implicit and mandatory bit) is redefined by a bit string, so that the same memory area can be viewed using two different types.

DCL AS_INT  FIXED BIN (31);
DCL AS_BITS BIT(31) BASED ADDR(AS_INT);

AS_INT = 8;        /* Implicitly sets AS_BITS  */

IF AS_BITS(3) THEN /* Implicitly tests AS_INT  */
  ...

Fig. 9. Redefining a numeric field with a bit string

This is not an example made up for the sake of making a theoretical point. Such redefinitions are part of every seasoned PL/I programmer's toolbox, and all sizable PL/I systems contain hundreds of instances of such constructs.


For this kind of behavior to be reproduced accurately, one must manage memory directly, and allocate variables in an array of bytes, essentially emulating the memory behavior of the original mainframe hardware platform. In practice, two aggregate data types provide this accuracy in the emulated behavior under .NET:

– An AddressSpace class encapsulates the array of bytes that serves as the address space for a set of PL/I programs running within the same thread. Each PL/I program invocation allocates its data within the AddressSpace. This allows parameters to be passed by reference, as they are allocated within a shared address space.

– A MemoryArea structure represents a generalized pointer, by referring to an address space, together with an offset and a length (a sketch of both classes follows this list). MemoryArea's are pervasive in the compiler's generated code and in the runtime library. They refer to the address space rather than referring to the byte array directly to allow for reallocation. Whenever the address space is full and additional allocation space is required, a new, bigger, byte array is allocated, and initialized with the content of the original byte array. Since the MemoryAreas do not refer to these byte arrays directly, but rather go through the AddressSpace instead, such reallocation can be performed simply and safely as long as it is not concurrent with the program execution thread. MemoryArea's are implemented by .NET structures rather than classes. Structures are composite data aggregates with a value semantics. They are stack-allocated to avoid the overhead induced by classes, both at allocation time and in terms of increased pressure on the garbage collector². The .NET standard class library makes extensive use of such structures, for Decimals[21] to implement fixed decimal numbers for accurate computation, for Point[22] to represent a point on a device for the graphics library, and more. See 5.5 for the description of a flaw in the way structures are supported by .NET.
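
The following C# sketch gives the general shape of these two aggregates; the AddressSpace and MemoryArea names come from the text, but the members and their exact signatures are assumptions, and the real classes obviously carry much more (heap management, growth policy, etc.).

using System;

public sealed class AddressSpace
{
    private byte[] memory = new byte[64 * 1024];

    public byte Get(int address) => memory[address];
    public void Set(int address, byte value) { memory[address] = value; }

    // Grow the underlying byte array; safe because MemoryAreas go through the
    // AddressSpace instead of holding on to the byte array itself.
    public void EnsureCapacity(int size)
    {
        if (size > memory.Length)
            Array.Resize(ref memory, Math.Max(size, memory.Length * 2));
    }
}

// A generalized pointer: address space + offset + length, implemented as a struct
// (value semantics, no additional pressure on the garbage collector).
public struct MemoryArea
{
    public readonly AddressSpace Space;
    public readonly int Offset;
    public readonly int Length;

    public MemoryArea(AddressSpace space, int offset, int length)
    {
        Space = space; Offset = offset; Length = length;
    }

    // Slice out a sub-area, the basic operation used for fields and array elements.
    public MemoryArea Substring(int offset, int length) =>
        new MemoryArea(Space, Offset + offset, length);
}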

Numeric data are then emulated to the bit level, including endianness and bit ordering. Failing that, an example such as the one given in figure 9 would behave differently from the mainframe.

COBOL compilers[8] aiming at the .NET platform have very similar requirements. COBOL and PL/I essentially use the same set of primitive data types (inherited from the mainframe architecture and instruction set). Quite logically, PL/I and COBOL compilers on non-mainframe platforms end up applying very similar solutions.

² Incidentally, there is no such value-based structure data type in Java, meaning that one must revert to plain classes for every composite piece of data. This has a serious impact on the practicality of the JVM as a target for compilation.

3.2 Thread Safety

Being thread-safe was not an original requirement for the PL/I compiler. Still, it has been taken into consideration based on the experience of how hard it is to implement post hoc in an originally non thread-safe system. It seemed both safer and simpler to integrate from day one, even if the need for it was not all that obvious in the beginning.

This soon proved a lucky gamble, as one of the first PL/I systems that was compiled using our compiler was deployed under IIS[12], which requires the deployed code to be 100% thread safe. IIS preallocates a number of service threads at startup, which handle HTTP requests as they come in.

On a compiler where execution performance is not the primary ambition, thread safety is reasonably easy to guarantee.

An ExecutionContext class encapsulates the thread-specific context information, and is systematically passed as a parameter to all the generated functions. It encapsulates PL/I global names, standard I/O streams, tracing information, the address space (See section 3.1), the heap management data structures, the CICS and SQL connectors, etc. It is essentially everything that would be managed globally in a single threaded environment. Going through the extra indirection of the ExecutionContext to access data and resources guarantees isolation between concurrent threads. ExecutionContext's and AddressSpace's are represented by separate classes for clarity of purpose only, as there is a one-to-one relationship between execution contexts and address spaces.
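
A skeletal version of the class might look as follows; the member list mirrors the enumeration above, while the member types and names are assumptions (the connector interfaces are sketched in 4.1.5).

using System.Collections.Generic;
using System.IO;

public interface ISqlEngine { /* sketched in 4.1.5 */ }
public interface ICicsEngine { /* sketched in 4.1.5 */ }

public sealed class ExecutionContext
{
    // Everything that would be global in a single-threaded environment:
    public AddressSpace AddressSpace { get; } = new AddressSpace();
    public Dictionary<string, MemoryArea> GlobalNames { get; } = new Dictionary<string, MemoryArea>();
    public TextWriter StandardOutput { get; set; } = System.Console.Out;
    public bool TracingEnabled { get; set; }

    // Pluggable connectors, one implementation per target engine (see 4.1.5).
    public ISqlEngine Sql { get; set; }
    public ICicsEngine Cics { get; set; }
}

// Every generated function receives the context explicitly, e.g.:
//     public static void SOME_PROCEDURE(ExecutionContext ctx, ...) { ... }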

3.3 Control Flow

Mapping PL/I's control flow to .NET primitives is straightforward in most places. PL/I procedures are mapped to .NET functions, most PL/I GOTO statements are mapped to .NET jump instructions, etc. There are just two cases where the mapping is non-trivial, namely nested procedures and non-local GO TO statements.

3.3.1 Nested Procedures

PL/I supports Algol's[60] nested procedures that are not supported natively by the .NET platform, which has more of a C[30] and C++[40] flavor.

There are a number of ways of supporting nested procedures, displays being the most commonly used one for Pascal[74]. YAFL[24] uses a scheme that is a variation on displays for cases where one has less control over the memory representation of the stack, parameters and locals (it is especially suited when generating intermediate C code). In .NET, one has even less control on memory mapping, on where locals are allocated, their address, etc. We eventually opted for an even simpler scheme that does not require pointers nor tables of pointers to locals.

When a procedure p uses a parameter from an enclosing procedure q, the parameter is added implicitly to all the procedures found in call chains between p and q.

The careful reader may have noticed that there is no explicit constraint regarding p being nested within q, but it is implied by the fact that p uses a parameter of q, which is only possible when p is nested within q, which in turn implies that all the procedures in the call chains between p and q are nested within q as well.

Original code:

OUTERMOST: PROC (A,B,C);
  LOCAL1: PROC;
    ... B ...
  END LOCAL1;
  LOCAL2: PROC(D);
    ... B ...
    CALL LOCAL3(..);
  END LOCAL2;
  LOCAL3: PROC(E);
    LOCAL4: PROC;
      ... A ... E ...
    END LOCAL4;
    CALL LOCAL4;
  END LOCAL3;
  CALL LOCAL1;
  CALL LOCAL2(...);
  CALL LOCAL3(...);
END OUTERMOST;

Flattened code with minimal parameter lists:

LOCAL1: PROC(B);
  ... B ...
END LOCAL1;
LOCAL2: PROC(D,A,B);
  ... B ...
  CALL LOCAL3(...,A);
END LOCAL2;
LOCAL3: PROC(E,A);
  CALL LOCAL4(E,A);
END LOCAL3;
LOCAL4: PROC(E,A);
  ... A ... E ...
END LOCAL4;
OUTERMOST: PROC(A,B,C);
  CALL LOCAL1(B);
  CALL LOCAL2(...,A,B);
  CALL LOCAL3(...,A);
END OUTERMOST;

Flattened code with systematic concatenation of inherited parameter lists:

LOCAL1: PROC(A,B,C);
  ... B ...
END LOCAL1;
LOCAL2: PROC(D,A,B,C);
  ... B ...
  CALL LOCAL3(...,A,B,C);
END LOCAL2;
LOCAL3: PROC(E,A,B,C);
  CALL LOCAL4(A,B,C,E);
END LOCAL3;
LOCAL4: PROC(A,B,C,E);
  ... A ... E ...
END LOCAL4;
OUTERMOST: PROC(A,B,C);
  CALL LOCAL1(A,B,C);
  CALL LOCAL2(...,A,B,C);
  CALL LOCAL3(...,A,B,C);
END OUTERMOST;

Fig. 10. Implicit parameters in nested procedures

Figure 10 shows an example where the implicit parameters induced by procedure nesting are made explicit, and the nested procedures are then flattened, as they don't rely implicitly on their enclosing procedure any longer. These flattened procedures can then be mapped trivially to .NET functions. (This figure shows the effect of these implicit parameters on PL/I code. Our implementation operates on the compiler's intermediate representation rather than actually altering the PL/I parse tree directly, but the principles are identical.)

This is one of the very few cases where an approach used for a compilation problem can be reused more or less as is for a language migration solution.


Expanding the parameter lists as explained above is a viable solution whenever a language that supports nested procedures must be converted to a language that does not. It minimizes disruption, and allows for readable and manageable code. One can even argue that the result is more maintainable than the original code, as it replaces an implicit parameter reference by an explicit one. By building these parameter lists from the call graph, only the minimal set of parameters is computed for each procedure, rather than concatenating the parameter lists from all enclosing procedures without further ado, as depicted in figure 10.
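
One possible way to compute these minimal parameter lists from the call graph is a simple fixpoint: a procedure needs an outer parameter if it uses it directly, or if it calls a procedure that needs it and does not declare it itself. The C# sketch below is hypothetical (the actual computation operates on the compiler's intermediate representation), but it reproduces the parameter lists of the second listing of figure 10.

using System.Collections.Generic;
using System.Linq;

public sealed class Procedure
{
    public string Name;
    public HashSet<string> OwnFormals = new HashSet<string>();       // parameters it declares itself
    public HashSet<string> UsedOuter = new HashSet<string>();        // outer parameters it uses directly
    public List<Procedure> Callees = new List<Procedure>();
    public HashSet<string> ImplicitFormals = new HashSet<string>();  // computed result
}

public static class ParameterLifting
{
    // Propagate needed outer parameters up the call graph until nothing changes.
    public static void ComputeImplicitFormals(IEnumerable<Procedure> procedures)
    {
        var procs = procedures.ToList();
        foreach (var p in procs)
            p.ImplicitFormals.UnionWith(p.UsedOuter);

        bool changed = true;
        while (changed)
        {
            changed = false;
            foreach (var p in procs)
                foreach (var callee in p.Callees)
                    foreach (var needed in callee.ImplicitFormals.ToList())
                        // The caller must supply the parameter unless it declares it itself.
                        if (!p.OwnFormals.Contains(needed) && p.ImplicitFormals.Add(needed))
                            changed = true;
        }
    }
}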

This simple scheme, which applies to formals, cannot be extended to locals. While PL/I procedures typically have a handful of formals, they can have dozens or even hundreds of local variables. Passing them one by one as additional parameters would be unreasonable.

One must then revert to displays (or a semantically equivalent implementation) or to globals.

3.3.2 Non-local Goto's

.NET's branching instructions are limited in the sense that the jump and the target label must be defined in the same function. On the other hand, PL/I supports non-local GO TO statements, where one can bypass the stack of called procedures, and exit from one or more scopes in the process.

In the sample shown in figure 11, the GO TO statement is used to exit the PROCESS_RECORD and PROCESS_FILE procedures before actually branching to the FATAL_ERROR label. This example is typical of cases where these non-local GO TO statements are used, namely as a form of exception handling without returning to the exception site. Non-local GO TO statements as supported by PL/I are limited in the sense that they can only be used to exit one or more scopes; they cannot be used to enter scopes, as it would raise intractable consistency issues.

As shown in C#-like pseudo code in figure 11, this functionality is provided by throwing a NonLocalGotoException. A unique numeric code is attached to each label which is the target of such a non-local GO TO statement (FATAL_ERRORcode in the example). This code is attached to the exception, and tested when catching the exception to check for the label one must transfer to.

This scheme is less efficient than a plain jump instruction as provided by the .NET platform, but then again, it is (or should be) used in exceptional cases, and the true impact on performance should be limited in practice.

Besides, the original PL/I implementation of non-local GO TO statements is likely to be more expensive than plain GO TOs, as they require some housekeeping to be performed to unroll the procedure call stack.

3.4 Data Types

3.4.1 A Little Tale about Numeric Precision

PL/I fixed numbers have a number of – binary or decimal – digits d and a scale s (which indicates the number of – again, binary or decimal – digits that must be considered to be on the right side of the implicit decimal point).


The original PL/I code:

PROCESS_TRAN: PROC;
  INITIALIZATION: PROC;
    ...
  END INITIALIZATION;
  PROCESS_RECORD: PROC;
    ...
    IF ERROR THEN
      GO TO FATAL_ERROR;
  END PROCESS_RECORD;
  PROCESS_FILE: PROC;
    <Open File>
    DO WHILE
      END_OF_FILE = 1;
      CALL PROCESS_RECORD;
    END;
  END PROCESS_FILE;
  CALL INITIALIZATION;
  CALL PROCESS_FILE;
  GOTO EXIT;
FATAL_ERROR:
  <Write Log>
EXIT:;
END PROCESS_TRAN;

The C#-like pseudo code generated for it:

final static int FATAL_ERRORcode = 9801;

public static void INITIALIZATION()
{ ...
}

public static void PROCESS_RECORD()
{ ...
  if (error)
    throw new NonLocalGotoException(FATAL_ERRORcode);
}

public static void PROCESS_FILE()
{ ...
  // This function does not contain any label which is the target of a
  // non-local GO TO statement. It does not have to capture NonLocalGotoException
}

public static void PROCESS_TRAN()
{ try
  { INITIALIZATION();
    PROCESS_FILE();
    goto EXIT;
FATAL_ERROR:
    <Write Log>
EXIT:;
  }
  catch (NonLocalGotoException e)
  { if (e.getLabel() == FATAL_ERRORcode)
      goto FATAL_ERROR;
    rethrow e;
  }
}

Fig. 11. Non-local GOTO statement


A fixed binary number k with d digits and a scale of s is thus a binary number n stored in d bits such that k = n / 2^s.

One must be able to convert fixed binary numbers to their decimal representation, to store them into fixed decimal variables or simply to display or print them.

In other words, we are looking for an integer j such that k = n / 2^s ≈ j / 10^r, where r is the scale of the target decimal number. This approximation must be the same as what is performed by the IBM compiler in terms of rounding and/or truncation.

No single rounding or truncation scheme when computing n / 2^s seemed to match the mainframe compiler behaviour in all cases.

Finally, we came across a page on IBM's website[5], which candidly explains how conversion to decimal numbers is performed by older PL/I compilers. If our fixed binary number k with d1 digits and a scale of s1 is to be converted to a fixed decimal number with d2 digits and a scale of s2, we have:

    k = n / 2^s1 = (n · 5^s1) / (2^s1 · 5^s1) = (n · 5^s1) / 10^s1        (1)

In other words, by taking the readily available integer value n, and by multiplying it by 5^s1, one gets the integer representation of a fixed decimal with scale s1. Then, converting k to a decimal representation can be performed by multiplying n by 5^s1 (s1 has a range of 0 to 31, so the powers of 5 can be precomputed in a table for efficient access), then changing the scale from s1 to s2 (which is a trivial matter when dealing with decimal numbers, as one must just move the decimal point by s2 − s1 positions).

IBM's website[5] indicates that older versions of their PL/I compiler multiplied by 5^s1 + 1 instead of multiplying by 5^s1, introducing a minor positive bias. After experimenting to find which values of s1 required the bias – as it would have been insane to have it for all possible values of s1, multiplying by 6 instead of 5 for s1 = 1 – we implemented the same bias for s1 > 13 and got absolutely identical results as those produced on the mainframe.
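
The resulting conversion routine is short. The following C# sketch uses assumed names and the bias threshold of 13 mentioned above; only the production of the integer representation with scale s1 is shown, since rescaling to s2 is a matter of shifting the decimal point.

using System.Numerics;

public static class FixedBinaryConversion
{
    // Powers of five for scales 0..31, precomputed once.
    private static readonly BigInteger[] PowersOfFive = BuildPowersOfFive();

    private static BigInteger[] BuildPowersOfFive()
    {
        var table = new BigInteger[32];
        table[0] = BigInteger.One;
        for (int i = 1; i < table.Length; i++)
            table[i] = table[i - 1] * 5;
        return table;
    }

    // Convert the integer representation n of a FIXED BIN value with scale s1 into
    // the integer representation of a FIXED DEC value with the same scale s1, using
    //     n / 2^s1 = (n * 5^s1) / 10^s1
    // Older IBM compilers multiplied by 5^s1 + 1 for larger scales; that bias is
    // reproduced here for strict equivalence with the mainframe results.
    public static BigInteger ToDecimalRepresentation(long n, int s1)
    {
        BigInteger factor = PowersOfFive[s1];
        if (s1 > 13)
            factor += 1;        // reproduce the documented positive bias
        return n * factor;
    }
}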

We very knowingly reproduced what, at the end of the day, can be considered, by IBM's own admission, a weakness or even a bug. This is typical of a legacy compiler as introduced in 1.4.2: being equivalent to the original system is even more important than being mathematically correct.

3.5 Avoiding Excessive Allocations

As a platform, .NET is designed to handle string objects efficiently. They are immutable, so that they can always be passed by reference with the absolute guarantee that the callee cannot alter the string's content, thereby reducing the need for systematic copies for the sake of the implementation of a safe value semantics.

As a corollary, string literals are allocated at class initialization time and can be used repeatedly without requiring additional allocations.


.NET programmers then use strings liberally. They are efficient and supported by a convenient as well as expressive syntax. It may be tempting for the compiler writer to use them whenever possible, but this may soon prove to be a poor design decision. PL/I variables (including strings, varying or not) are allocated within the AddressSpace's byte array (See 3.1). Converting them to .NET strings is straightforward, but requires a new allocation on every evaluation, increasing the pressure on the garbage collector.

The impact of such allocations is reduced in the PL/I compiler by allowing a number of operations such as assignments and comparisons to be performed directly on the byte array, in the form of MemoryAreas, without extracting the corresponding .NET strings and incurring the extraneous memory allocations.
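
For instance, a comparison between two character fields can be carried out byte by byte on the MemoryAreas themselves, along the lines of the hypothetical helper below (which reuses the MemoryArea sketch of section 3.1), rather than by materializing two .NET strings first.

public static class MemoryAreaOperations
{
    // Compare two areas as raw byte sequences, padding the shorter one with the
    // given padding byte (e.g. 0x40 for an EBCDIC space), without allocating
    // any .NET string.
    public static int Compare(MemoryArea left, MemoryArea right, byte padding)
    {
        int max = System.Math.Max(left.Length, right.Length);
        for (int i = 0; i < max; i++)
        {
            byte l = i < left.Length ? left.Space.Get(left.Offset + i) : padding;
            byte r = i < right.Length ? right.Space.Get(right.Offset + i) : padding;
            if (l != r)
                return l < r ? -1 : 1;
        }
        return 0;
    }
}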

3.6 Irregular Assignments

We have been raised with modern and reasonably well-designed languages, together with expectations regarding the regularity of the behavior of common constructs such as assignments. For instance, when facing a statement such as a := b one expects to evaluate b, and assign it to a without further ado. We also expect the left part of the assignment to support more sophisticated constructs, as in a.c(10) := b (just as in Fortran and COBOL, extracting elements from a PL/I array or passing parameters to a function or procedure is performed using parentheses, introducing an ambiguity) to provide more flexibility in specifying the target variable for the assignment.

The source:

DCL 1 PERSON,
      2 AGE PIC '99',
      2 FIRSTNAME CHAR(10) VARYING,
      2 LASTNAME CHAR(10) VARYING,
      2 ADDRESS CHAR(30) VARYING;

STRING(PERSON) =
    '45John Fitzgerald Kennedy';

DISPLAY(PERSON.AGE);
DISPLAY('[' !! PERSON.FIRSTNAME !! ']');
DISPLAY('[' !! PERSON.LASTNAME !! ']');
DISPLAY('[' !! PERSON.ADDRESS !! ']');

The resulting output:

45
[John Fitzg]
[erald Kenn]
[edy]

Fig. 12. The PL/I STRING builtin function

These constructions produce what we'll refer to as a descriptor (the wording is vague on purpose, as the descriptor can be very different things depending on the language at hand. It can be a pointer – or lvalue – in C, a pointer and a length in COBOL, a pointer, a length and a bit offset in PL/I, and could very well even contain type information in some languages, where the target type to use for the assignment can change dynamically as part of the evaluation of the left component).

The assignment is then merely a matter of evaluating the right part's value, evaluating the left part's descriptor, and assigning the right part onto the descriptor.

Given a character buffer declaration as

DCL BUFF CHAR(100);

PL/I allows for a substring assignment, as in:

SUBSTR(BUFF,1,5) = 'XXXXX';

which assigns the first 5 characters of the BUFF character buffer and leaves the 95 other ones unaltered. (For the record, a statement such as BUFF = 'XXXXX' pads the remaining 95 characters of BUFF with spaces.)

To implement this, it is sufficient to have a function that computes the resulting descriptor. It does not require a complete overhaul of the common scheme for assignments as described above.

Things get more complicated when dealing with the STRING PL/I builtin. When used as an expression, as in A = STRING(B); it returns a character string representation of the argument it is given, by concatenating sub-elements over arrays and composite structures. When used on the left side of the assignment, it assigns fields one by one.

For instance, figure 12 shows a PL/I code fragment where using the STRING

builtin on the left side of the assignment results in slices of the string on theright side of the assignment, assigned to each of the structure’s fields. As shownin figure 13, this requires ad hoc code generation, as the mapping is not a simplephysical one, but requires specific treatment for VARYING strings and the lengthcounters.

As explained in the introduction to PL/I in 1.1.2, varying strings are prefixed by two bytes that indicate their currently used length.

An assignment with a call to the STRING builtin as the left side of the assignment cannot be compiled with a separate evaluation of the value on the right side, and the assignment descriptor on the left side, as the assignment must split the value to assign varying strings and set their length appropriately.

In other words, such an assignment is not a true assignment. It must be detected as a specific pattern at compile-time, and ad hoc code must be generated, to evaluate the right side, divide it into pieces, and assign each of these pieces to the appropriate field of the left side.

3.7 Visual Studio Integration

Visual Studio is Microsoft's IDE (Integrated Development Environment). Having the ability to develop, maintain and debug programs in Visual Studio is a must for any compiler targeting the .NET platform.


AGE:        '4' '5'
FIRSTNAME:  0x0 0xA 'J' 'o' 'h' 'n' ' ' 'F' 'i' 't' 'z' 'g'
LASTNAME:   0x0 0xA 'e' 'r' 'a' 'l' 'd' ' ' 'K' 'e' 'n' 'n'
ADDRESS:    0x0 0x3 'e' 'd' 'y'

Fig. 13. The physical mapping after the assignment

Visual Studio integration covers a comprehensive set of features:

– Language-sensitive color coding and outlining
– Project and program settings
– Compiling from within Visual Studio, recovering errors if any
– Debugging
– Code completion and disambiguation (under the name Intellisense)

Our PL/I compiler supports all of these, except for code completion, which requires partial and permissive parsing; this has been considered unreasonable given how hard it is already to parse PL/I precisely, as demonstrated in this document.

Of all the features listed above, the debugger support is the only one that has a direct impact on the compiler as such. The others are developed as .NET components that interact directly with Visual Studio.

3.7.1 Debugger Support

Visual Studio supports two modes for debugging.

In native mode, the process to debug is a .NET process, where functions are .NET functions and where the standard .NET debugger can be reused for everything that has to do with execution control (breakpoints, steps, call stack, etc.).

One can also use Visual Studio as a thin user interface, where everything else is under the control of the debugged application. This mode, based on a custom debug engine, allows Visual Studio to be used for interpreted or native languages, where the standard .NET facility to control execution and breakpoints cannot be used.

As our compiler generates managed .NET code, interoperability with other .NET languages is an important issue, even at the debugging level. We therefore opted for the former approach, so that, at the very least, we reuse the wealth of functionality provided out of the box by the .NET debug engine and allow for seamless multi-language debugging.


Since PL/I variables do not map trivially to .NET variables nor to understandable .NET types, they must be published by separate means. The generated code must provide for the availability of the list of the variables that are accessible in any given scope, as well as the ability to get or set their value.

As shown in figure 2, evaluating the address of a PL/I variable may require an arbitrarily large number of dereferences, each one with different and potentially dynamic offsets. In order to avoid duplicating this logic, the compiler generates a .NET access function for each variable. These access functions just return the address of the variable, and their code is produced by reusing the compiler's ability to generate variable accesses as used in plain PL/I code. This ensures consistency between the variables as accessed by user-written PL/I code and the debugger. This idea of promoting the consistency between the compiler and the debugger environment is similar to what is described by Kadhim[53] to have a debugger's expression evaluator reuse the compiler's constant folding evaluator, replacing access to variables with their actual value. This ability was not available to us, as our compiler is a native process, which does not integrate seamlessly in a .NET environment.

The compiler generates code to publish the variables' names and structure in a way that is compatible with the debugger's expression evaluator, and the debugger then uses .NET's reflection to call access functions (derived from each variable's name).

4 Tricks of the Trade

This section lists real-world issues that we encountered during this project, together with the corresponding solutions, focusing on the odd, the unusual or the barely mentioned in the literature.

4.1 Mixed Language Support

Just like COBOL, PL/I is often used in conjunction with embedded languages, CICS and SQL being by very far the most prevalent ones. According to Capers Jones[52], about 30% of the software systems in production today in the USA are made of programs with more than one programming language.

4.1.1 A Short Introduction to CICS

CICS[49][2] is a transaction manager owned by IBM and running on its mainframe platform, aimed at supporting high-volume online systems such as ATMs, industrial production systems, airline reservation systems, etc. CICS provides a large number of services, ranging from transaction synchronization to session management, message queuing, terminal management, and more.

Applications running under CICS must be written specifically for this platform, as some of the facilities offered by the language of choice (COBOL and PL/I being the most common ones) cannot be used directly. File or terminal input/output, dynamic memory allocation and other primitives must go through CICS using CICS-specific verbs.

Even though CICS was first released in the late sixties, it is still heavily used by a large number of big organizations across the world. It is maintained actively, and the newest developments include support for Java and Enterprise Java Beans[7]. Numerous other companies (Clerity/DELL[3], HTWC[10], Oracle[18]) offer CICS emulators for other platforms than the IBM mainframe so that CICS applications can be rehosted with minimal effort.

From a programmatic point of view, the most common way of writing CICS applications is to embed CICS statements in the source code, and use a CICS precompiler to recognize these statements and replace them by calls to the CICS runtime.

4.1.2 Using Precompilers

The common way of dealing with embedded languages such as CICS and SQL is to have a precompiler, which will recognize the statements of interest and replace them by some code that will deliver the required functionality, typically by one or more calls to a runtime component. Statements of interest are lexically delimited for easy recognition, using EXEC and END-EXEC for COBOL, EXEC and semicolon for PL/I, as in:

EXEC SQL SELECT COUNT(*)
           INTO :CUSTCOUNT
           FROM CUSTTAB
          WHERE NAME=:HNAME;

Fig. 14. Embedded SQL in PL/I code

where the colon-prefixed :HNAME and :CUSTCOUNT refer to host variables, or, in other words, parameters to be passed to the SQL statement before execution. :HNAME is an input parameter that controls the nature of the SQL statement to execute, while :CUSTCOUNT is an output parameter that receives a result provided by the database engine.

The replacement code is far more complex than the original high-level statement in the embedded language. This difference in size justifies the use of the precompiler. Writing the same code manually would be cumbersome and unproductive.

Precompilers also perform perfunctory analysis on the source program, recognizing macroscopic constructs (data division and working storage section in COBOL, the head of the top-level procedure in PL/I) as they have to be able to insert variable declarations in the source program text.

Such textual preprocessors have been used for decades, as they allow for a clear separation of concerns between the business logic and the external interface (CICS or SQL), and shield the compiler from the extra complexity induced by tens (for SQL) or even hundreds (for CICS) of additional statements to support. It also allows the database or the transaction processing monitor (See 4.1.1 for a short introduction on this topic) to evolve without impacting the compiler at all. New versions of the precompilers can be released without requiring a synchronized release of the supporting compiler, and different vendors (especially for relational databases) each have their own precompiler that replaces SQL statements by calls to the runtime of the relational database at hand.

4.1.3 Pragmatics

Precompilers also have shortcomings. First, adding a pass to the compilation process has a negative effect on compile-time performance, and as explained in 2.2.1, it is an area where our compiler is already suboptimal.

Debugging also becomes a pain, as the source code which is actually compiled by the compiler is not the source code as maintained by the developer. SQL and CICS statements are replaced by long sequences of cryptic calls to runtime functions that make no sense whatsoever for the developer.

This issue of desynchronization between the source code which is passed to the compiler and the source code which is maintained by the developer is not a new one. All languages with preprocessors suffer from the same problem (C/C++, COBOL and PL/I), and commonly address it by maintaining synchronization information, in the form of #line directives in C, so that the original as opposed to the preprocessed source code can be shown when debugging. This thus requires detailed knowledge of the preprocessor(s) to be used on the debugger side. Some commercial compilers[13] provide hooks and APIs for third party preprocessors to integrate gracefully so that debugging can then happen at the most relevant level of abstraction.

Finally, the SQL preprocessor (this does not apply to CICS) is significantly more involved than a superficial reading of the above may suggest. It needs to perform at least a basic form of host language analysis, if only to find the types of host variables (such as :HNAME in figure 14). The SQL preprocessor, which was supposed to ignore everything except the constructs of interest – not unlike island grammars, as described by Moonen[59] – ends up needing some form of detailed parsing of the host language at hand.

4.1.4 Integrating SQL and CICS Support in the Compiler

In order to address these shortcomings, the PL/I compiler recognizes SQL and CICS extensions directly, as if they were part of the PL/I grammar.

The PL/I, SQL and CICS grammars go through static – i.e. compile-time – composition to produce a parser that recognizes SQL and CICS statements as plain PL/I statements. This is made possible by lexical backtracking, which allows a CICS or SQL keyword to backtrack to a plain identifier if found outside a CICS or SQL context.

This integration allows the original source with embedded SQL and CICS statements to be compiled directly, allowing for a smoother debugging process and faster compilation.


4.1.5 Deferring the Implementation to the Runtime

Precompilers come with specific CICS and SQL implementations and, quite logically, they insert code which is specific to these implementations.

Our context is slightly different, as the CICS and SQL extensions are recognized by the compiler. Generating different code depending on the target SQL or CICS implementation at hand would be possible but cumbersome. It would require a serious overhaul of the compiler's code generator whenever a new SQL or CICS implementation is to be supported.

Instead, the compiler generates calls to a component attached to the current ExecutionContext (See 3.2) and allows for a specific CICS or SQL implementation to be plugged into the runtime, without affecting the compiler's code generator. In other words, the generated code is the same for all the SQL and CICS engines one wishes to support. It is the runtime that must be given a specific CICS or SQL implementation, to target a different relational database or transaction processing monitor.
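
Concretely, this can be pictured as a pair of connector interfaces that the generated code reaches through the ExecutionContext, fleshing out the empty placeholders used in the sketch of section 3.2. The interfaces and method names below are purely hypothetical; they only convey how a new database or transaction monitor can be supported by providing another implementation, without touching the code generator.

// Hypothetical connector interfaces attached to the ExecutionContext (see 3.2).
public interface ISqlEngine
{
    void Start(MemoryArea sqlca);                 // begin a statement, binding the SQLCA
    void BindParameter(MemoryArea hostVariable);  // host variables such as :HNAME, :CUSTCOUNT
    void Execute(string statementId);             // run the statement at hand
}

public interface ICicsEngine
{
    void Link(string program, MemoryArea commarea);
    void Send(MemoryArea map);
}

// The generated code is the same whatever the engine:
//     ctx.Sql.Start(sqlca);
//     ctx.Sql.BindParameter(age);
//     ctx.Sql.Execute("AGE_CURS_OPEN");
// Supporting another relational database or transaction monitor is then a matter
// of providing another ISqlEngine or ICicsEngine implementation at run time.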

Implementing a feature in manually written code (as in such a runtime plugin) is orders of magnitude simpler than working on a compiler to produce the same functionality. The delegation of the CICS and SQL services to the runtime is thus an effective way of opening the compiler to multiple databases and transaction processing monitors.

The original source code:

FIRST: PROC(COMAREA_PTR)
       OPTIONS(MAIN);
  DCL 1 SQLCA ... ;
  DCL AGE FIXED BIN (15,0);
  EXEC SQL DECLARE AGE_CURS CURSOR FOR
       SELECT FNAME, LNAME FROM
       PERSON WHERE AGE = :AGE;
  LOCPROC: PROC;
    /* Local redefinitions */
    DCL 1 SQLCA ... ;
    DCL AGE FIXED BIN (15,0);
    EXEC SQL OPEN AGE_CURS;
    WHILE SQLCA.SQLCODE = 0 DO;
      EXEC SQL FETCH AGE_CURS INTO ...
      ..
    END;
    EXEC SQL CLOSE AGE_CURS;
  END LOCPROC;
END FIRST;

The precompiled source code:

FIRST: PROC(COMAREA_PTR)
       OPTIONS(MAIN);
  DCL 1 SQLCA ... ;
  DCL AGE FIXED BIN (15,0);
  LOCPROC: PROC;
    /* Local redefinitions */
    DCL 1 SQLCA ... ;
    DCL AGE FIXED BIN (15,0);
    CALL SQLSTART(SQLCA, ...);
    CALL SQLPARAM(ADDR(AGE),...);
    CALL SQLEXEC(...);
    WHILE SQLCA.SQLCODE = 0 DO;
      CALL SQLSTART(SQLCA, ...);
      ..
    END;
    CALL SQLSTART(SQLCA, ...);
    ...
  END LOCPROC;
END FIRST;

Fig. 15. Embedded SQL in nested PL/I procedures



4.1.6 The Subtleties of Semantic Analysis

The textual replacement performed by a SQL preprocessor implies odd semantics, which has to be emulated by our compiler for functional equivalence.

The expansion of all SQL statements refers to a SQLCA data structure (SQL communication area) which contains the error code and messages upon termination of the SQL statement.

This reference is lexical: the precompiler just generates accesses to a variable named SQLCA, and leaves the responsibility of dealing with it to the host language compiler. A similar mechanism is used for host variable parameters, as shown in figure 15, where the original PL/I code with embedded SQL statements is shown side by side with a simplified version of what the precompiled source may look like.

The SQL cursor declaration generates no code whatsoever. It is a compile-time declaration, which will only result in code being generated when the cursor is being opened, read and closed. The scope where this cursor is being opened may differ significantly from the cursor declaration's. Figure 15 shows how the precompiled code will rely on PL/I's scoping rules, and refers to a local SQLCA variable when opening a cursor, as well as local parameters for host variables if defined. This means that in order to emulate the exact behavior of the SQL precompiler, one must attach the most local SQLCA declaration to every SQL statement, and perform semantic analysis on the SQL statement in the context where the cursor is being opened, in addition to the context where the cursor is being defined.

The cursor definition must thus be checked semantically where it is declared, as this is exactly the existing precompiler's behavior, and it must be checked again where it is opened, as this is required to produce code that will behave the same as the precompiled code.

4.1.7 Related Work

There have been a number of recent publications about the analysis of mixed language programs. Synytskyy et al.[66] describe an effort aimed at parsing mixed programs containing HTML, ASP.NET and Visual Basic code, using island grammars[59] and TXL[33].

Even though their results are impressive, they just cannot be applied to a compiler. They claim that their parsing tools are robust, but in this context, robustness must be understood as the ability to survive faulty code and to allow the analysis to proceed with whatever code (islands) has been recognized, ignoring the rest as ocean.

Robustness for a compiler is an essential property, but it holds a totally different meaning. It implies that the compiler will not crash nor loop on any input, and will provide the best possible error message to allow the developer to diagnose and correct the error. It does not mean that parts of the code are going to be ignored because they have not been recognized as being of interest for the task at hand.


More generally, one can question the relevance of island grammars for all but very superficial analysis that can survive some level of imprecision. One depends extensively on the accuracy of the definition of islands, as any flaw may go unnoticed, the unrecognized construct simply being ignored as ocean.

Sloane describes a technique[64] to embed a domain specific language in Haskell[48], using the language's ability to be extended with user-defined operators and language constructions. This is a fertile research domain for Haskell (the author's favorite being Haskore[47] to represent music concepts), but this technique is hard to extrapolate to other implementation languages. Using Haskell in this way also implies compromises and restrictions with regard to the syntax of the language to integrate.

Other efforts in the area of language extension or embedding include JastAdd[39], Silver[70], or Stratego[29]. These projects have demonstrated the practicality and usefulness of language extensions, but how they would relate to the PL/I case with embedded SQL and CICS statements is unclear.

They mostly concentrate on modern languages such as Java, which has a reasonably clean syntax, powerful abstractions and adequately defined semantics. PL/I has none of these valuable properties. Expecting the nice use cases on Java to extrapolate to PL/I without serious validation may be a bit optimistic.

Even more importantly, these efforts concentrate on the ability to integrate extensions to a host language from the ground up, where the extension designer is not restricted in any way. Whether these projects would be able to cope with ill-designed existing extensions, without having the ability to change anything about them, is an open issue as well.

4.2 The Virtue of Intermediate Representations

This section lists a few of the many places in the compiler where intermediate representations have contributed to the flexibility and robustness of the compiler.

4.2.1 Bidirectional Type Coercion

PL/I specifies that integer literals are to be represented as fixed binary values, with a scale depending on the magnitude of the integer at hand (meaning for instance that 7 and 0007 are not represented by the same data type).

On the other hand, indexes in arrays must be converted to plain integers (as the computation that produces the offset of the array element is native .NET code, which can only deal with native types).

Figure 16 shows how a PL/I expression such as A(7) is represented in the compiler's intermediate representation. ArrayElement's are binary nodes (in the sense of having two sub-nodes) that extract an element from an array; cast nodes wrap their operand in a type conversion. Figure 16 also shows how the double cast is simplified in the intermediate representation, recognizing that casting from native integers to a fixed decimal data type, followed by a cast back to native integers, amounts to a nil operation.

Before simplification:

  ArrayElement
    ... A ...
    cast to .NET int
      cast to FIXED DEC(1,0)
        Int: 7

After simplification:

  ArrayElement
    ... A ...
    Int: 7

Fig. 16. Simplifying scaffoldings of casts

The intermediate fixed decimal node is required because there are cases in PL/I where 7 and 007 are not the same. When such a constant is passed as a parameter to a function, a temporary variable is defined with the type attached to the literal expression (FIXED DEC(1) for the former, FIXED DEC(3) for the latter).

This case is referred to as bidirectional type coercion: a chain of casts starts and ends with the same type, and the intermediate types introduce no semantic change, for instance, if one can ensure that they don't reduce precision.

Without this provision, a cast from a float to an integer followed by a cast back to a float would be wrongly recognized as a nil operation, neglecting the impact induced by truncation.
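
A sketch of the simplification on a hypothetical cast node of the intermediate representation; CanRepresentAllValuesOf stands for whatever precision analysis the actual compiler performs.

public abstract class IrType
{
    // True if every value of 'other' survives a round trip through this type.
    public abstract bool CanRepresentAllValuesOf(IrType other);
}

public abstract class IrNode
{
    public abstract IrType Type { get; }
}

public sealed class CastNode : IrNode
{
    public IrNode Operand { get; }
    private readonly IrType targetType;
    public override IrType Type => targetType;

    public CastNode(IrNode operand, IrType targetType)
    {
        Operand = operand;
        this.targetType = targetType;
    }

    // Collapse cast(cast(x : T, U) : T) back to x when the intermediate type U can
    // represent every value of T, so that the round trip cannot change the value.
    public IrNode Simplify()
    {
        if (Operand is CastNode inner
            && Equals(inner.Operand.Type, targetType)
            && inner.Type.CanRepresentAllValuesOf(targetType))
            return inner.Operand;
        return this;
    }
}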

4.2.2 Variations on Constant Folding

The compiler's intermediate representation supports constant folding, with two twists.

Deferred constant folding is used to express numeric expressions that are known to yield a constant, but where some of the (constant) components are not known when the expression is elaborated. For instance, given a scope that contains three variables f, g and h, with resp. sizes s_f, s_g and s_h, their offsets are resp. o_f = 0, o_g = o_f + s_f and o_h = o_g + s_g. At some stage, the compiler needs o_h, but o_g and o_f have not been evaluated yet, and restructuring the compiler so that all the required information is computed in the appropriate order would be very cumbersome. The compiler then builds an expression of the form o_h = ⟨o_g⟩ + s_g, where ⟨o_g⟩ is a placeholder which is known to be a constant integer, its exact value being filled in whenever o_g is actually computed. At code generation time, ⟨o_g⟩ + s_g is evaluated, and the value placed in the placeholder is used to yield a constant value. An attempt to evaluate ⟨o_g⟩ before the placeholder is filled with a constant value results in a compilation error.

An optimization implemented for Multics's PL/I compiler[19] is supported, namely to always move the constant part of any expression to the same side (operator commutativity permitting) to allow for better optimizations (c1 + a is systematically stored as a + c1 if c1 is constant, a + c1 + b is stored as a + b + c1, etc.). All constant sub-expressions are grouped on the same side of the tree: a + b + c1 + d + e + f + c2 + g is transformed into a + b + d + e + f + g + c1 + c2, where c1 + c2 can then be simplified.
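
A minimal sketch of the placeholder mechanism, with assumed names; the real intermediate representation is of course richer, and integrates this with the rest of the constant folding machinery.

public sealed class ConstantPlaceholder
{
    private long? value;   // filled in once the constant is actually known

    public bool IsFilled => value.HasValue;
    public void Fill(long v) { value = v; }

    public long Value =>
        value ?? throw new System.InvalidOperationException(
            "placeholder evaluated before its constant value was filled in");
}

// A deferred sum such as o_h = <o_g> + s_g keeps the placeholder by reference,
// and only turns into a plain constant when it is evaluated at code generation time.
public sealed class DeferredSum
{
    private readonly ConstantPlaceholder placeholder;
    private readonly long knownPart;

    public DeferredSum(ConstantPlaceholder placeholder, long knownPart)
    {
        this.placeholder = placeholder;
        this.knownPart = knownPart;
    }

    // Throws (i.e. a compilation error is reported) if the placeholder is still empty.
    public long Evaluate() => placeholder.Value + knownPart;
}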


4.2.3 Substrings of Substrings

The most common operation performed on MemoryArea's (See 3.1) is the slice, or substring, which returns a MemoryArea starting at an offset, with a given length. For instance, when facing a designator of the form a.b, fetching b is merely a substring of a denoted s(a, o_b, s_b), where s represents the substring operation, o_b represents the offset of b within a, and s_b represents the size of b expressed in bytes.

Similarly, fetching an array element a(i) is a substring denoted s(a, (i − 1) · s_a, s_a) if the array a starts with index 1, and s_a is the size of a's elements expressed in bytes. In fact, even fetching a local or global parameter is little more than extracting a MemoryArea from the local or global data space allocated to the procedure at hand, also represented by a MemoryArea. Substrings of substrings are thus pervasive throughout the generated code, and can be simplified by applying a rewriting rule of the form:

    s(s(a, o1, s1), o2, s2) = s(a, o1 + o2, s2)

o1 + o2 can often be simplified by constant folding. Even if it can't, performing the addition separately and reducing the depth of the nested substring also improves performance.
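
Expressed on a hypothetical substring node of the intermediate representation (reusing the IrNode and IrType sketch of 4.2.1), the rewriting rule reads as follows; offsets are themselves IR expressions, so o1 + o2 may or may not fold to a constant afterwards.

public sealed class SumNode : IrNode
{
    public IrNode Left { get; }
    public IrNode Right { get; }
    public override IrType Type => Left.Type;
    public SumNode(IrNode left, IrNode right) { Left = left; Right = right; }
}

public sealed class SubstringNode : IrNode
{
    public IrNode Source { get; }
    public IrNode Offset { get; }
    public IrNode Length { get; }
    public override IrType Type => Source.Type;

    public SubstringNode(IrNode source, IrNode offset, IrNode length)
    {
        Source = source; Offset = offset; Length = length;
    }

    // s(s(a, o1, s1), o2, s2)  =>  s(a, o1 + o2, s2), applied repeatedly while
    // simplifying the intermediate representation.
    public IrNode Simplify()
    {
        if (Source is SubstringNode inner)
            return new SubstringNode(
                inner.Source,
                new SumNode(inner.Offset, Offset),   // o1 + o2, possibly constant-folded later
                Length);
        return this;
    }
}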

4.3 Spurious Error Message Limitation

Improving error reporting is a pervasive concern in compiler design, even if mainly concentrated on parsing. Most compiler-compilers provide mechanisms to improve the relevance and usability of error messages produced during parsing, automatically or through user hints. This PL/I compiler uses the simplistic mechanism provided by DURA[25] to implement parse time error recovery.

Non-terminals that can be used for error recovery are marked explicitly as such. When the parser is blocked because it has no possible action (shift or reduce) in its current state when facing the current token, it skips tokens until it finds one which is in the FOLLOW set attached to a partially recognized non-terminal which is marked as supporting error recovery. This non-terminal is then reduced forcefully, and the parser continues its normal processing.

This scheme depends on the discriminating power of the FOLLOW sets. It works best if a non-terminal's FOLLOW set is a good oracle for its reduction. This brute force approach works much better on flat languages such as COBOL than on highly nested block-structured languages such as C and PL/I, as it provides no provision to deal with nesting, nor to reduce at the most appropriate level if multiple reduction levels are available.

Error recovery is not limited to parsing. It is common for many production-level compilers to list dozens of error messages from a single actual error. This forces users to concentrate on the first error message, assuming that the source for all the reported errors is likely to be the initial one, fix the underlying problem and recompile. The mere number of spurious error messages makes any more structured approach pointless.


When facing an error during semantic analysis or code generation, it is very hard to assert whether it is the result of a separately reported error. It is thus tempting to report it, just to be on the safe side. The sole drawback of this conservative approach is the large number of spurious error messages.

The PL/I compiler described in this document uses an effective, even if primitive, scheme to reduce spurious error messages. Errors are not reported in the void. They are attached to a non-terminal, which then provides them with a position in the original unprocessed source code, that can be reported to the user as useful positioning information. Error reporting can be made lazy, so that an error is reported only if the non-terminal it is attached to does not have attached errors already, nor any of its recursively reachable subnodes. The recursive walk through the subnodes does induce a performance penalty, but it is negligible in the total compilation time. Besides, the walk-through is performed by efficient visitors[9] generated by DURA.

This technique allows the compiler writer to report errors conservatively, without having to care whether they are the effect of previously reported errors or not. Only one error (the first and innermost) is reported for any subtree.

These attached errors can also be queried explicitly when spurious error messages are not always attached to the same non-terminal or subtree. For instance, this can be used to report an error when using a variable only if the matching variable declaration has no previously attached error.
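
A sketch of the lazy reporting scheme, assuming error objects are attached to parse tree nodes as described above; HasAttachedErrors and the recursive walk are hypothetical (the real walk-through uses DURA-generated visitors).

using System.Collections.Generic;
using System.Linq;

public class ParseNode
{
    public readonly List<string> AttachedErrors = new List<string>();
    public readonly List<ParseNode> Children = new List<ParseNode>();

    // True if this node, or any node reachable below it, already carries an error.
    public bool HasAttachedErrors() =>
        AttachedErrors.Count > 0 || Children.Any(c => c.HasAttachedErrors());

    // Lazy reporting: only attach the error if the subtree is not already known to
    // be erroneous, so that a single root cause is reported exactly once.
    public void ReportError(string message)
    {
        if (!HasAttachedErrors())
            AttachedErrors.Add(message);
    }
}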

4.3.1 Related Work

The rather primitive syntactical error recovery mechanism described above for DURA is similar to what yacc[51] provides, but there are more sophisticated approaches, such as Burke and Fisher[31], which tries all possible sequences of single token deletion, insertion and replacement on a fixed size window before selecting the sequence that seems to maximize the chances of successful resynchronization of the parser. This kind of sophisticated algorithm does not map efficiently to a parser with a backtracking lexer, as the very definition of a token (in terms of position and length) may vary.

Incidentally, integrating a Burke and Fisher error recovery algorithm does not fit easily into a GLR parser either. Checking for successful resynchronization may require multiple forks on the LR DAG of stack slices. Early papers about Elkhound[57] mention support for a Burke and Fisher error recovery scheme as subject to further work, but there hasn't been any publication describing a working implementation since.

One can speculate that the scannerless variant of GLR [71] (commonly referred to as SGLR) will be even less suited to Burke and Fisher, if only because SGLR does not even support the token concept to start with, turning the computation of all the possible token operations over a non-trivial range into a combinatorial issue.

De Jonge et al. [34] propose an SGLR-specific error recovery scheme that combines several techniques to reduce the scope of error repair, based on the recognition of regions that can be skipped if they contain an error. These regions are based on indentation, which makes them sensitive to programming style.


Even though this technique, based on bridge parsing [61], compares favourably to the JDT [6] Java parser in terms of precision and relevance of the synchronization, it probably cannot be applied conveniently to the case at hand. First, region recognition is based on reliable tokenization (braces for Java), while PL/I does not even have reserved words (See figure 5) and the block delimiters BEGIN or DO and END can even be omitted in some cases (See figure 18). Then, indentation style in PL/I is not as standardized as it is in Java, making it less suitable as a source of information for reliable block detection. More generally, this technique based on bridge parsing seems to be aimed at robust parsing to be performed on the fly within an IDE (which further validates the comparison with JDT). In such a context, programs are parsed incrementally during editing, more often in an incorrect state than not, and the ability to extract meaningful partial information is essential. Error recovery in a compiler deals with correct input most of the time, and partial parse trees are pointless. If an error is detected during parsing, the compiler stops and does not even attempt to go further and generate code that would be incorrect in any case.

Beyond parsing, Norman Ramsey [63] describes a technique applied to compilers implemented using a functional programming style, where a specific value is used to keep track of intermediate results that are unavailable because of an already reported error, allowing for systematic treatment and recovery if needed, without the generation of any additional error message.

Older contributions [46] suggest that the symbol table could hold error markers, so that entries marked as having an error do not induce any additional error message. This scheme is limited in scope, as some errors have no direct relationship with entries in the symbol table.

4.4 A Testing Infrastructure

For anyone who has been exposed to PL/I, and as this document (hopefully) makes abundantly clear, it is a very complex language. It is also a very poorly defined one, where a number of behaviors are not explicitly described anywhere except by reference compilers.

In such a context, it would have been foolish to start working on a new compiler by relying on the available documentation and ad hoc testing only. Something more structured is required, for us as well as for the first customers. Big organizations that have sizable PL/I portfolios are very risk averse, and one must come up with a serious story regarding quality and testing before they even consider using a compiler that does not have a long history of successful use in production.

Over 25% of the budget for the development of this PL/I compiler was allocated to the development of a complete test infrastructure that would allow us to guarantee a decent level of quality even for the very first customers. This test infrastructure allows us to define tests as one or more PL/I source programs, as well as data files if necessary. Each test comes with an expected result (compilation failure, compilation success, execution output), so that the tool can run all the tests unattended, and report any discrepancy between the expected and actual results.
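
In outline, the unattended loop is straightforward; the sketch below uses Python and a hypothetical command-line driver plic, and glosses over data files, time-outs and reporting details.

import subprocess

def run_test(test):
    # test is a dictionary with "name", "sources", "expected" and, when the
    # test is expected to run, the name of the produced "executable".
    compile_result = subprocess.run(["plic"] + test["sources"],
                                    capture_output=True, text=True)
    if compile_result.returncode != 0:
        actual = ("compilation failure", compile_result.stdout)
    else:
        run_result = subprocess.run([test["executable"]],
                                    capture_output=True, text=True)
        actual = ("execution output", run_result.stdout)
    return actual == test["expected"]

def run_suite(tests):
    # Any discrepancy between expected and actual results is reported.
    return [t["name"] for t in tests if not run_test(t)]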

4.4.1 Versions

A simple conditional construct allows one to define multiple versions for tests, each with its own expected result, etc. This allows for the convenient definition of multiple combinations, by testing different data types or any other slight variations on an original test.

IMP_VAR : PROC OPTIONS(MAIN);
.IF 0
.END
.IF 1
DCL I FIXED BIN(31,0);
.END
I = 2;
DISPLAY(’I=’ !! I);
CALL MYPROC2;
MYPROC2 : PROC;
.IF 2
DCL I FIXED BIN(31,0);
.END
DISPLAY(’I=’ !! I);
I = 4;
DISPLAY(’I=’ !! I);
END MYPROC2;
END IMP_VAR;

Fig. 17. Versioned test

This versioning facility structures the tests hierarchically, by keeping the various forms used to test a given feature together.
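
The expansion of such a versioned test into its individual variants is mechanical. The following Python sketch, with illustrative names and a simplified reading of the .IF n / .END syntax of figure 17, keeps the common lines for every version and the guarded lines only for the matching version.

def expand_version(lines, version):
    out, keep = [], True
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(".IF"):
            keep = int(stripped.split()[1]) == version   # start of a guarded block
        elif stripped == ".END":
            keep = True                                  # back to common code
        elif keep:
            out.append(line)
    return out

def expand_all(lines, versions):
    # One PL/I source per version, each compiled and run with its own
    # expected result.
    return {v: expand_version(lines, v) for v in versions}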

4.4.2 Keeping Track of Regressions

In an ideal world, one would want 100% of all the tests to run successfully before delivering a new release, as a way of ensuring that there is no known bug nor limitation in the compiler. In practice, a more nuanced approach is necessary, as some tests are entered in the test infrastructure as a way of keeping track of future work, even though they depend on a feature that is only planned, and that will not be made available for months. Imposing a 100% success rate as a matter of principle would be vastly suboptimal, as it would prevent one from using the test infrastructure as a tool to keep track of these planned future developments.


On the other hand, it is equally unreasonable to accept failed tests without further ado, as it defeats the purpose of this test infrastructure altogether.

In order to address this concern, the test infrastructure detects and reports regressions, which are tests that have succeeded at least once in the past, and which currently fail. The policy about pre-delivery testing has thus moved from a 100% success rate to a 0% regression rate.
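
Regression detection only requires remembering which tests have ever succeeded. A possible sketch, assuming a simple JSON file as persistent history (the actual tool may store this differently):

import json

def detect_regressions(results, history_file="test_history.json"):
    # results maps test names to True (passed) or False (failed).
    try:
        with open(history_file) as f:
            ever_succeeded = set(json.load(f))
    except FileNotFoundError:
        ever_succeeded = set()

    regressions = [name for name, ok in results.items()
                   if not ok and name in ever_succeeded]

    ever_succeeded |= {name for name, ok in results.items() if ok}
    with open(history_file, "w") as f:
        json.dump(sorted(ever_succeeded), f)

    return regressions   # pre-delivery policy: this list must be empty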

4.4.3 Comparing Results with the Mainframe

PL/I’s documentation is nowhere close to exhaustive. There are many questions that cannot be answered by any other means than actual tests using the original mainframe PL/I compiler.

The test infrastructure allows individual tests to require a validation on the mainframe (some tests, such as the ones that exercise the interface between PL/I code and C# code, obviously do not require such mainframe-based validation). The regression testing tool can then generate a self-contained z/OS JCL [20] that can be uploaded to the mainframe, where it runs as a whole. The output of this JCL can then be downloaded and read by the regression testing tool, which then ensures that whatever compiles on the mainframe compiles under Windows and .NET, and that the execution results are identical as well.

4.4.4 Relaxing Comparisons

This scheme for comparing the results obtained with our compiler and the mainframe compiler has proven invaluable in detecting some hard-to-find differences of behavior. In places, it has turned out to be too pedantic about equivalence, and needed to be relaxed.

Since a variable has an address and a size, one can get quite some information about it even without the matching declaration, for instance by checking its physical representation. If needed, this technique can be used to reverse-engineer a data type in the absence of an explicit declaration.

Such techniques do not apply to the types returned by functions, as they are not necessarily mapped to a memory area one can inspect. One can display the returned value and make guesses based on the format of the output (the number of positions, rounding, etc.), but that usually is not very conclusive.

When dealing with a user-defined function, this is barely an issue, as the function definition states the returned type explicitly or, by default, receives one based on its name, using the same mechanism that allows implicit variables to be typed. On the other hand, PL/I also comes with a large number of builtin functions (or builtins) that address a number of issues (memory representations, trigonometry, string handling, etc.).

These builtins are supported by our compiler, mimicking the returned data types as announced by the mainframe PL/I compiler. The compiler even provides a command-line option that allows one to set the precision of some of these builtins, emulating a similar option available for the mainframe compiler.

It soon appeared that a number of tests that display values returned by builtins produced different output depending on whether our compiler or the original mainframe compiler is being used. These differences were limited to numeric formatting issues only, suggesting that the builtins return the same values but with different types. This again demonstrates that the IBM compiler’s documentation is not a totally reliable source of information when it comes to the exact behaviour of the original mainframe compiler.

Reverse-engineering the exact types as returned by the builtins would have been tedious, imperfect (as explained above) and ultimately probably not worth the trouble. In order to bump into this difference of behaviour, one must do exactly what the offending tests do, namely display the value returned by the builtin without further ado. Any other usage of the builtin, like assignment to a variable or use in some computation, would induce no difference, since the discrepancy is limited to data types only.

Displaying the result of builtins directly is rare enough in production-level code. We opted for the ability to mark individual tests as requiring a more relaxed comparison between the outputs, ignoring white space and thereby being more forgiving of type discrepancies and the resulting differences in formatting.
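
The relaxed mode boils down to comparing outputs token by token rather than character by character. A minimal sketch, assuming the per-test flag is available as a boolean:

def outputs_match(ours, mainframe, relaxed=False):
    if relaxed:
        # Collapse all white space, so that pure formatting differences caused
        # by slightly different builtin return types are ignored.
        return ours.split() == mainframe.split()
    return ours == mainframe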

4.5 Dealing with Context Sensitivity

The parsing techniques available today deal with context-free languages only. Still, parsing real-world languages commonly requires some level of context-sensitivity. The most documented example of this is the infamous typedef problem, which shows how some knowledge about the currently available data types is required to parse C, C++, Java or even C# accurately [43].

Another, less documented example of this intrusion of semantics into syntax is COBOL, where a condition such as

IF A = B OR C THEN

...

can be a shorthand for

IF A = B OR A = C THEN

...

or may have to be understood as

IF (A = B) OR C THEN

...

if C is a level 88 indicator variable, in effect representing what one would call a boolean in a more modern language.

Such cases can be addressed by maintaining semantic information at parse time and feeding this information back into the parser or lexer, or by using a GLR [56][67] parser that builds a DAG of all the possible interpretations of the input, using a separate pass to remove the branches that are incorrect.
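
For the COBOL example above, the semantic feedback amounts to a symbol-table lookup at the point where the condition is built. The sketch below is a simplified Python illustration with hypothetical names and a tuple-based representation of conditions.

def expand_condition(subject, comparand, extra, symbol_table):
    # IF A = B OR C ...
    if symbol_table.get(extra) == "level-88":
        # C is a condition name: read the input as (A = B) OR C
        return ("OR", ("=", subject, comparand), extra)
    # otherwise C abbreviates a comparison: A = B OR A = C
    return ("OR", ("=", subject, comparand), ("=", subject, extra))

# expand_condition("A", "B", "C", {"C": "level-88"}) yields the first reading.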

As shown below, a GLR-based solution works best when the variants are local and do not introduce a combinatorial explosion of the possible interpretations of the input.


4.5.1 Structural Context-Sensitivity

In PL/I’s case, a more structural form of context-sensitivity has to be addressed. The END keyword is used to close a block as well as a procedure, as in figure 18 (specifying the name of the procedure when closing it is optional; one could have written END; instead of END MAX;).

MAX: PROC(I,J) RETURNS (FIXED BIN);
DCL I FIXED BIN;
DCL J FIXED BIN;
IF J > I THEN
DO;
RETURN(J);
END;
ELSE
DO;
RETURN (I);
END;
END MAX;

MAX: PROC(I,J) RETURNS (FIXED BIN);
DCL I FIXED BIN;
DCL J FIXED BIN;
IF J > I THEN
DO;
RETURN(J);
END;
ELSE
DO;
RETURN (I);
END MAX;

Fig. 18. A PL/I procedure with and without balanced blocks

An END keyword with the procedure name specified implicitly closes any open block that could terminate with an END, as shown in the second piece of code in figure 18. This form introduces a mismatch between the number of PROC and DO keywords on one side, and the number of END keywords on the other. This case can be supported by a combination of two tricks:

– The Commit (See 1.2.3) method for the procedure ensures that, if present, the final identifier following the END keyword matches the procedure name. The verification of this matching property is mentioned by Hopcroft et al. [45] as equivalent to the context-sensitive grammar a^n b^n (a|b)* a^n b^n. Having a^n b^n with the same n on the two extremities cannot be enforced by a context-free parser alone.

– A grammar rule with low priority (enforced by the YIELD operator, see figure 7) that allows a DO block to be reduced without the final END keyword.

To ensure that such an open-ended DO block does not successively reduce all its viable prefixes (DO S1 S2 ... Sn-2 Sn-1 Sn, then DO S1 S2 ... Sn-2 Sn-1, then DO S1 S2 ... Sn-2, etc.), the Commit method also ensures that this open-ended form only occurs when the following lexeme is valid (END, ELSE, OTHERWISE or WHEN). This can be seen as an ad hoc restriction on the non-terminal’s FOLLOW set for some of its possible derivations.
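
The two checks can be pictured as follows; this is a language-agnostic Python sketch of what the Commit methods verify, not the generated code itself, and all names are illustrative.

def commit_procedure(proc_name, end_name):
    # Accept the reduction only if the optional identifier following END
    # matches the procedure name.
    return end_name is None or end_name == proc_name

OPEN_ENDED_FOLLOWERS = {"END", "ELSE", "OTHERWISE", "WHEN"}

def commit_open_ended_do(next_lexeme):
    # A DO group reduced without its final END is only accepted when the next
    # lexeme can legitimately follow such an open-ended group, which prevents
    # reducing every viable prefix of the group.
    return next_lexeme in OPEN_ENDED_FOLLOWERS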

The procedure names at the top and at the bottom of the procedure are both part of the same grammar rule, so that ensuring that they are equal is easy: they are synthesized together in the same non-terminal. It is thus easy to ensure that they match in a Commit method defined in this non-terminal class.

Things get slightly more complicated when a similar treatment must be performed on plain statements, such as the multi-way test provided by the SELECT statement as described in figure 19.

Statement ⇒ IfStatement
          ⇒ DoStatement
          ⇒ CallStatement
          ⇒ SelectStatement
          ⇒ LabeledStatement
          . . .
LabeledStatement ⇒ Label “:” Statement
SelectStatement ⇒ SELECT “(” Expression “)”
                  WhenClause* OtherwiseClause?
                  END Label? “;”

Fig. 19. The statements in the PL/I grammar

The ability to close all open blocks with a single END clause that is available to procedures as described above applies to SELECT statements as well.

The END SLAB; clause closes all the open DO blocks, by specifying the same identifier that is used as label for the SELECT statement. On the other hand, unlike what happened with procedures, the label of the statement is not reduced together with its END clause, making the verification of their equality cumbersome. The opening and closing labels are not reduced in the same non-terminal, so that they cannot be checked for equality in a simple Commit method.

One solution would have been to distribute the leading label to all statement non-terminals, so that it gets reduced together with the end label, if any. This would then replace the LabeledStatement non-terminal class by a situation where all statements (IfStatement, DoStatement, etc.) support a prefixing label. This is not a very appealing solution, as it increases the entropy of the entire grammar to cope with a local parsing issue.

The solution that was finally implemented is to extract the context to test against directly from the LR stack. In other words, a SELECT statement with an END clause followed by an identifier is only reduced if the LR stack contains a matching label in the appropriate position to be attached to the SELECT statement.


SLAB:
SELECT (XVAR)
WHEN (1)
...
WHEN (2)
...
WHEN (3)
DO;
...
DO;
...
END;
...
END;
END SLAB;

SLAB:
SELECT (XVAR)
WHEN (1)
...
WHEN (2)
...
WHEN (3)
DO;
...
DO;
...
END SLAB;

Fig. 20. A PL/I SELECT statement with and without balanced blocks

Extracting data directly from the LR stack is not a common way of dealing with parsers, but it could be formalized as a way of querying the partial parse currently active when reducing a non-terminal. In any case, it is much cleaner, simpler and safer than the alternative, namely, maintaining global structures to keep track of the very same information.
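
As an illustration only (the stack entries and their attributes are hypothetical), the check performed when reducing a SELECT statement closed by END identifier; could look like this:

def commit_select_end(lr_stack, end_label):
    # Walk down from the top of the stack to the SELECT keyword being reduced;
    # the entry just below it is where a leading label of the SELECT statement
    # would sit.
    for i in range(len(lr_stack) - 1, -1, -1):
        if lr_stack[i].kind == "SELECT":
            below = lr_stack[i - 1] if i > 0 else None
            return (below is not None
                    and below.kind == "Label"
                    and below.name == end_label)
    return False   # no enclosing label: the END label cannot match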

4.5.2 GLR’s Combinatorial Explosion

The approach described above aims at deciding on the most appropriate course of action as early as possible, namely at parse time. The alternative to this approach is to keep track of a larger number of possible parse trees in the form of a GLR DAG, but that could prove toxic.

GLR builds a DAG that synthesizes all valid parse trees, leaving the responsibility of reducing this DAG to (at most) a single valid parse tree to a separate process commonly referred to as a disambiguation filter [54][55]. Since GLR’s main property is to perform a single pass forward on the input, defining priorities in a way similar to the YIELD operator (See figure 7) would make no sense whatsoever. All possible evolutions must be tested simultaneously, as GLR provides no way to backtrack and try another, less plausible analysis path. In fact, there is no such thing as a less plausible analysis path as far as GLR is concerned.

A grammar rule such as the one shown in figure 7 then causes a combinatorial explosion of the number of possible parse trees. Since a PL/I DO group can have a closing END clause, but may as well not have it, a GLR parser must fork and consider reducing the DO group on every statement it includes. When DO groups are nested, the number of possible actions that can be performed grows exponentially.


Therefore, parsing PL/I with GLR and using a post-parse disambiguation filter only is doomed to fail. Filtering out all invalid interpretations of the input is straightforward, but GLR parsers are not discriminating enough. They keep track of far too many incorrect, partial or complete parse trees to be practical on real-world PL/I systems.

This can only work if the GLR parser at hand provides hooks to allow disambiguation filters to be applied during parsing on partial parse trees, so that the DAG can be reduced earlier and the combinatorial explosion can be avoided. For instance, such a filter would be able to ensure that a reduction of a DO group without the matching END clause is only possible if the current token is of a given class, as described in 4.5.1.

4.5.3 Related Work

As far as the author is aware, the only other PL/I parser commonly available on the market as a reusable component is provided by CoCoLab [4]. This section is based on [42].

IdentNt ⇒ Ident
        ⇒ DO
        ⇒ IF
        ⇒ THEN
        ⇒ ELSE
        ⇒ BEGIN
        . . .

Fig. 21. CoCoLab PL/I parser’s definition of an identifier

CoCoLab’s PL/I parser is generated using Lark [41], which can generate a parser and a strongly typed parse tree in a number of languages, including C [30], Modula II [75], Eiffel [58] and Ada [23]. It uses a grammar rule as depicted in figure 21 to deal with lexical ambiguity as presented in the sample in figure 5. It thus allows any of the language’s keywords to be used whenever an identifier is required. The parser then relies on Lark-generated parsers’ ability to backtrack to consider alternate interpretations of the input if needed. This trick allows the lexer to be simpler, since all the backtracking takes place at the parser level.

This lack of backtracking at the lexical level is only made possible by the fact that the language allows for a single decomposition into lexemes, as opposed to Fortran, where a more sophisticated form of backtracking is required [25].

SQL and CICS are dealt with separately, by running ad hoc parsers on the extracted statements, then plugging the resulting partial parse trees into the main program’s parse tree. Unbalanced blocks, as shown in figure 20, are addressed by ad hoc semantic actions to make sure the appropriate level of nesting is reduced. (Whether these semantic actions are implemented using one of the formalisms supported by the Cocktail toolbox, or directly in one of the supported programming languages, is not clear at this stage. Should the latter be true, this would of course restrict the usage of this PL/I parser to this language.)

CoCoLab’s PL/I parser has been used to implement a number of industrial tools. According to Grosch [42], there is no intrinsic limitation that prevents it from being used as a front end for a compiler, even though it has not been so far.

5 Lessons Learned

5.1 Intermediate Representations

Having workable intermediate representations between the PL/I parse tree and the IL assembler source file is the single most important design decision in this entire project.

Generating assembler directly would have been more comfortable in the beginning, but the intermediate representations proved invaluable in surviving design imperfections. It would have been unreasonably optimistic to believe that we could foresee all the issues we would be facing in the course of the project. The intermediate representations are a flexible place where design changes can be applied late in the project, far more flexible than the parse tree, for instance, which is constrained by the grammar and can only undergo limited structural changes.

5.2 Strong Typing

Compilers are complex pieces of software, where developers can use all the help they can get. A language that guarantees static type safety is an essential asset. Interfaces are used liberally to define cross-cutting concerns, classes that share an essential property (having a type, being callable, requiring a separate .NET function, etc.) while they are in totally different parts of the inheritance tree, which is induced by DURA and the PL/I grammar.

This does not preclude us from providing generic walkthrough services on the parse tree (subnodes, ancestor nodes, enclosing statement, procedure or program, etc.), but type consistency errors should be caught as early as possible, long before starting to test.

5.3 The Regression Testing Infrastructure

This is less about lessons learned than about applying past experience, gained the hard way.

The regression testing infrastructure is a serious investment in its own right, but when dealing with non-trivial software development projects, every penny put in structured and automated testing as opposed to tedious and manual testing processes is money well spent. Supporting a systematic process to compare the results of our compiler with what the original mainframe compiler produces makes it even more valuable.

This infrastructure turned out to be a true deliverable: our first few customers are perfectly aware of the fact that the compiler does not have a long history of production-level usage. They are typical mainframe shops, conservative and risk-averse.

Having a complete and compelling story to tell about how we ensure that the compiler is of the highest possible quality, even in its earliest releases, has proven a very potent sales argument.

5.4 Reusing a Parser Designed for Modernization

It has long been advocated that parsing technologies developed for compilers were impractical for the implementation of modernization tools [69]. This PL/I compiler has gone the opposite route, namely, reusing a parser primarily designed for modernization in a compiler context.

With the benefit of hindsight, we identified three critical success factors in this endeavor.

5.4.1 Performance

Modernization parsers generally focus more on versatility and flexibility than performance, mainly based on the flawed assumption that modernization tasks are not performance-critical. In practice, modernization tasks are run routinely on portfolios made of thousands of programs and millions of lines of code. Running them in days or hours rather than weeks makes a big difference, especially when a modernization process is rerun numerous times in the course of a project, before the final delivery transformation.

A parser to be used in a compiler aimed at porting mainframe code should have at least comparable performance from the mainframe user’s point of view, considering that a mainframe CPU is shared among large numbers of users, even though it is still full factors slower than what a hand-crafted parser could achieve.

5.4.2 Logistics

A compiler is meant to be used continuously by a large number of users, as opposed to a modernization tool, which is aimed at being used sporadically by a small number of highly educated users, who can live with more or less comfortable infrastructures. To be integrated in a workable compiler, a parser must be robust and self-contained. It should not load grammar files or even transition tables on the fly. Grammar composition, if any, to deal with SQL and CICS for instance, must have happened at compile time. As far as the user is concerned, the parser must be a non-issue.

For the compiler user, the mere ability to tweak the grammar is irrelevant. On the other hand, having a compiler stop with some cryptic error message because on-the-fly grammar composition failed is unacceptable.


Error reporting must be usable and consistent. If preprocessing is performed by the compiler, reported line numbers must refer to the original line numbers as opposed to the preprocessed line numbers.

In a nutshell, the fact that the parser was originally designed for modernization should have no impact whatsoever on the way it is integrated in the resulting compiler.

5.4.3 Strong Typing

A parser can easily produce a generic tree in memory, as a Lisp-like list of lists, or any similar representation. Some early legacy transformation tools even flattened parse trees onto a relational database. For reasonably simple transformation tasks, this is perfectly adequate, but when dealing with a production-level compiler, one must have a more robust, strongly typed and scalable data structure. Strong typing allows for both more efficient memory occupation and compile-time validation.

Strong typing also implies using a language that supports static type checking, as opposed to a dynamic environment such as the RainCode Engine for PL/I together with the RainCode scripting language to implement the compiler (as a matter of fact, we even implemented a prototype of a tiny subset of the PL/I compiler using the RainCode scripting language, just to demonstrate the feasibility of the intermediate generation of IL assembler code), but being Turing complete is only part of the story. Compilers are hard. Developing a robust one requires static validation and a fully typed language, far beyond what a dynamically typed, fully interpreted language can provide. Such languages are better suited to implementing short-lived transformations and analysis tasks.

On the contrary, it has always been the author’s belief that providing an Ada [23]-only API for ASIS [1] has been one of the main reasons for the limited market acceptance of this API. A static, rigorous language such as Ada is a great tool to send rockets into space, but totally inadequate to develop ad hoc analysis tools. ASIS has been used successfully as a technical backbone for full-fledged source code analysis or transformation products, where one can justify the high cost of integration of such a rigid component. It has failed in providing the developer with the ability to write simple queries or short-lived tools on the fly to fulfill a local and immediate need.

5.5 Generating Code for a Virtual Machine

This paper shows abundantly how generating code for a virtual machine such as .NET is orders of magnitude simpler and safer than generating native code for a physical processor. The .NET instruction set is designed to make the compiler writer’s life easier, and utilities such as the PEVERIFY tool (see section 1.3.5) allow one to statically detect numerous bugs that would have required lengthy debugging sessions when targeting a more conventional platform.

Many details are abstracted away by the virtual machine, which is the whole point of using it in the first place. How these abstractions are mapped to a physical machine is usually not documented, and for a reason, as one wants to be able to change this mapping at any time.

In places, this mapping is excessively naive, and since it is not documented, it can only be explored by some tedious trial-and-error process. .NET structures (See 3.1) do not map directly to hardware registers, so temporary structure values are allocated on the stack, and references to these allocated structures are passed to represent structure values. The caveat of the current implementation lies in the fact that such stack-allocated structures are never reused across statements of a function, resulting in a stack usage that grows quasi-linearly with the number of structure-type variables used within the function. This has required a serious overhaul of the compiler’s code generator to rely on structure references rather than structure values whenever possible.

Abstraction is essential. Sometimes, though, the nature of what is being abstracted away can cause serious problems.

6 Further Work

The compiler described in this document is now used to port production-level PL/I programs to the .NET platform. It supports a large subset of the language, but it is not complete yet, as it has been developed incrementally, to support the features that were needed to compile existing portfolios as they were processed. The full language definition is unreasonably large, and there is no point in starting with the reference manual and implementing it cover to cover. This task would be made even harder by the fact that multiple (and conflicting) reference documents exist.

Based on future deployments of ported portfolios, a number of extensions are currently being planned:

– More CICS verbs must be supported at the compiler level, and translated to calls to the runtime environment, as only a subset is supported at this time. SQL support is comprehensive enough as it is, as the various SQL statements that can be performed only use a limited set of calls to the runtime.

– PL/I’s controlled allocations must be supported (See section 1.1.2). The compiler must also be extended to support PL/I’s ability to define fixed-size areas, and allocate memory within these areas. Such areas can be deallocated as a whole, implicitly deallocating all the variables that have been allocated in the area in a single operation.

– Our preprocessor also needs to be extended in places, to support more preprocessing-level builtin functions.

– The techniques and tools presented in this document are being used to develop compilers for languages other than PL/I, demonstrating their usefulness beyond the scope of this language. These new compilers will share PL/I’s computing model and runtime environment to allow for reuse and interoperability across languages.


7 Conclusion

This paper describes a compiler development project in a constrained environment. The product had to be available on the market within months, much less than what the literature suggests as a reasonable development time for such a tool.

Reusing a proven-in-use parsing infrastructure provided enormous benefits, but that was only made possible by the firm foundations it was built on. Promoting (or demoting, depending on the point of view) a modernization parser to a compilation parser can be done, but requires a number of basic properties to be checked for.

More generally, the differences between a compiler and a migration tool have been emphasized. It is the author’s strong opinion that they are intrinsically different, and extrapolating techniques from one area to the other is more likely to fail than not.

This paper also describes an alternate way of dealing with embedded languages such as CICS and SQL from a compiler’s perspective (even though similar approaches have been used in the modernization realm, where dealing with the output of a preprocessor obviously is not an option), by avoiding external precompilers and by extending the compiler to support the extensions as first-class citizens of the grammar.

The implementations of the CICS and SQL services are pluggable in the runtime, making them an order of magnitude simpler and cheaper to interface than when they are handled at the compilation or precompilation level.

Generating code for a VM designed for compiler writers greatly simplifies the design of the compiler, by providing powerful abstractions (such as stack-based evaluation or composite structures) and tools (such as PEVERIFY – See 1.3.5).

This kind of non-performance-critical compiler will become more common, especially for VM-targeted languages. An ever-increasing part of the execution time is spent in external components (databases, network latencies, runtime libraries), up to a point where the generated code’s performance is getting less and less important. When targeting a virtual machine architecture such as .NET, this trend is emphasized by the fact that one’s mere ability to optimize beyond the trivial is seriously hampered by the level of abstraction provided by the VM.

Acknowledgments. The author wishes to thank the anonymous reviewers for their thorough and constructive comments that have improved this paper in style as well as content.

Ralf Lämmel, from the University of Koblenz-Landau, has been supportive and helpful throughout the process, providing numerous references and comments before, during and after GTTSE.

Josef Grosch, of CoCoLab, graciously answered all the author’s questions regarding how Lark [41] had been used to parse PL/I, and the tricks that had been necessary to address the numerous oddities in the language.

More generally, the author wishes to express his gratitude to his excellent colleagues Boris Pereira, Dirk Amadori, Nicolas Louvet, Ludovic Langevine, Yannick Barthol, Laurent Ferier and Maxime Van Assche for their dedication to the success of this project. If anything, this project is the demonstration of what teamwork and great people can achieve.

The author’s partners in crime in RainCode, Alain Corchia, Juan Diez Perez and Stanislas Pinte, must be thanked for their willingness to put up with some of the wildest ideas one can come up with, and their patience and forgiveness when these ideas prove plain wrong.

Last but not least, this project would not have been possible without the unconditional support of Lars Mikaelsson and Guy Van Roy from Microsoft, and the equally unconditional trust in our abilities repeatedly demonstrated by Robert Elgaard, from SDC in Denmark. He is the one who originally decided to go for a compiler that did not yet exist, based on little more evidence than minimal prototypes, early blueprints, architectural ideas, and a serious amount of enthusiasm and optimism.

References

1. ASIS Working Group (ASISWG), http://www.sigada.org/WG/asiswg/ (last visited: June 1, 2012)
2. CICS, http://en.wikipedia.org/wiki/CICS (last visited: June 1, 2012)
3. Clerity Solutions’ website, http://www.clerity.com (last visited: June 1, 2012)
4. The CoCoLab website, http://www.cocolab.com (last visited: June 1, 2012)
5. Conversions from scaled FIXED BINARY, http://publib.boulder.ibm.com/infocenter/ratdevz/v7r5/index.jsp?topic=/com.ibm.ent.pl1.zos.doc/topics/ibmm2mst195.html (last visited: December 6, 2011)
6. Eclipse Java development tools (JDT), http://www.eclipse.org/jdt/ (last visited: June 1, 2012)
7. Enterprise Java Beans, http://jcp.org/en/jsr/detail?id=318 (last visited: June 1, 2012)
8. Fujitsu NetCOBOL for .NET, http://www.netcobol.com/products/Fujitsu-NetCOBOL-for-.NET/overview
9. Hierarchical Visitor Pattern, http://c2.com/cgi/wiki?HierarchicalVisitorPattern (last visited: June 1, 2012)
10. HTWC’s website, http://www.htwc.com (last visited: June 1, 2012)
11. ILASM, http://msdn.microsoft.com/en-us/library/496e4ekx.aspx (last visited: June 1, 2012)
12. Internet Information Services, http://www.microsoft.com/windowsserver2008/en/us/internet-information-services.aspx (last visited: June 1, 2012)
13. MicroFocus COBOL SQL Option Preprocessor, http://supportline.microfocus.com/documentation/books/sx40sp2/spsqlp.htm
14. Multics, http://en.wikipedia.org/wiki/Multics (last visited: June 1, 2012)
15. PEVerify Tool, http://msdn.microsoft.com/en-us/library/62bwd2yd.aspx (last visited: June 1, 2012)
16. PL/M, http://en.wikipedia.org/wiki/PL/M (last visited: June 1, 2012)
17. PL/S, http://en.wikipedia.org/wiki/PL/S (last visited: June 1, 2012)
18. Tuxedo ART, http://www.oracle.com/us/products/middleware/tuxedo/tuxedo-11g-feature-066057.html (last visited: June 1, 2012)


19. The Multics PL/1 Compiler (1969), http://www.multicians.org/pl1-raf.html (last visited: June 1, 2012)
20. z/OS V1R7.0 MVS JCL Reference, International Business Machines, 1988 (2006) (last visited: June 1, 2012)
21. Decimal structure (2010), http://msdn.microsoft.com/en-us/library/system.decimal.aspx (last visited: June 1, 2012)
22. Point structure (2010), http://msdn.microsoft.com/en-us/library/system.windows.point.aspx (last visited: June 1, 2012)
23. Ada. Reference Manual for the Ada Programming Language, ANSI/MIL-std 1815-a. U.S. Department of Defense (1983)
24. Blasband, D.: The YAFL Programming Language, 2nd edn., PhiDaNi Software (1994)
25. Blasband, D.: Automatic analysis of ancient languages. PhD thesis, Université Libre de Bruxelles (2000)
26. Blasband, D.: Parsing in a hostile world. In: WCRE, pp. 291–300 (2001)
27. Blasband, D.: Hard facts vs soft facts. In: Hassan, A.E., Zaidman, A., Penta, M.D. (eds.) WCRE, pp. 301–304. IEEE (2008)
28. Blasband, D., Real, J.-C.: All-purpose quantifiers in an OO language. In: Proceedings of TOOLS Asia 1998 (1998)
29. Bravenboer, M., Kalleberg, K.T., Vermaas, R., Visser, E.: Stratego/XT 0.17. A language and toolset for program transformation. Sci. Comput. Program. 72(1-2), 52–70 (2008)
30. Kernighan, B.W., Ritchie, D.M.: The C Programming Language. Prentice-Hall (1989)
31. Burke, M., Fisher Jr., G.A.: A practical method for syntactic error diagnosis and recovery. In: Proceedings of the SIGPLAN 1982 Symposium on Compiler Construction, pp. 67–78. ACM (1982)
32. Corbato, F.J., Vyssotsky, V.A.: Introduction and overview of the Multics system. In: AFIPS Conf. Proc., vol. 27, pp. 185–196 (1965)
33. Cordy, J.R.: The TXL source transformation language. Sci. Comput. Program. 61(3), 190–210 (2006)
34. de Jonge, M., Nilsson-Nyman, E., Kats, L.C.L., Visser, E.: Natural and Flexible Error Recovery for Generated Parsers. In: van den Brand, M., Gasevic, D., Gray, J. (eds.) SLE 2009. LNCS, vol. 5969, pp. 204–223. Springer, Heidelberg (2010)
35. DeRemer, F.: Simple LR(k) grammars. Communications of the ACM 14(7), 453–460 (1971)
36. DeRemer, F., Pennello, T.J.: Efficient computation of LALR(1) lookahead sets. ACM Transactions on Programming Languages and Systems 4(4), 615–649 (1982)
37. Dijkstra, E.W.: The humble programmer. Commun. ACM 15(10), 859–866 (1972)
38. Earley, J.: An efficient context-free parsing algorithm. Communications of the ACM 13(2) (1970)
39. Ekman, T., Hedin, G.: The JastAdd extensible Java compiler. In: Gabriel, R.P., Bacon, D.F., Lopes, C.V., Steele Jr., G.L. (eds.) OOPSLA, pp. 1–18. ACM (2007)
40. Ellis, M.A., Stroustrup, B.: The Annotated C++ Reference Manual. Addison-Wesley, Reading (1990) ISBN 0-201-51459-1
41. Grosch, J.: Lark - An LR(1) Parser Generator With Backtracking. Technical report, CoCoLab - Datenverarbeitung (April 1998)
42. Grosch, J.: Personal communication (2012)
43. Herman, D.: The C Typedef Parsing Problem (2009), http://calculist.blogspot.com/2009/02/c-typedef-parsing-problem.html (last visited: June 1, 2012)


44. Holt, R.C.: Teaching the fatal disease: (or) introductory computer programming using PL/I. SIGPLAN Not. 8, 8–23 (1973)
45. Hopcroft, J., Ullman, J.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley (1979)
46. Horning, J.J.: What the Compiler Should Tell the User. In: Bauer, F.L., Griffiths, M., Hornig, J.J., McKeeman, W.M., Waite, W.M., DeRemer, F.L., Hill, U., Koster, C.H.A., Poole, P.C. (eds.) CC 1974. LNCS, vol. 21, pp. 525–548. Springer, Heidelberg (1974)
47. Hudak, P., Makucevich, T., Gadde, S., Whong, B.: Haskore music notation - an algebra of music. J. of Functional Programming 6(3), 465–483 (1996)
48. Hutton, G.: Programming in Haskell. Cambridge Univ. Press, Cambridge (2007)
49. International Business Machines. CICS, http://www-4.ibm.com/software/ts/cics/ (last visited: June 1, 2012)
50. International Business Machines Corp., OS and DOS PL/1 Language Reference Manual (1981)
51. Johnson, S.C.: YACC — Yet another compiler - compiler. Computing Science Technical Report No. 32, Bell Laboratories, Murray Hill, N.J. (1975)
52. Jones, C.: The Year 2000 Software Problem - Quantifying the Costs and Assessing the Consequences. Addison-Wesley (1998) ISBN 978-0201309645
53. Kadhim, B.M.: Debugger generation in a compiler generation system. PhD thesis, University of Colorado (1998)
54. Klint, P., Visser, E.: Using Filters for the Disambiguation of Context-free Grammars. Technical Report P9426, Programming Research Group, University of Amsterdam (December 1994)
55. Klint, P., Visser, E.: Using filters for the disambiguation of context-free grammars (March 16, 1994)
56. Lang, B.: Deterministic Techniques for Efficient Non-Deterministic Parsers. In: Loeckx, J. (ed.) ICALP 1974. LNCS, vol. 14, pp. 255–269. Springer, Heidelberg (1974)
57. McPeak, S., Necula, G.C.: Elkhound: A Fast, Practical GLR Parser Generator. In: Duesterwald, E. (ed.) CC 2004. LNCS, vol. 2985, pp. 73–88. Springer, Heidelberg (2004)
58. Meyer, B.: Eiffel: The Language. Prentice-Hall (1992) ISBN 0-13-247925-7
59. Moonen, L.: Generating robust parsers using island grammars. In: WCRE, pp. 13–22 (2001)
60. Naur, P., et al.: Report on the algorithmic language ALGOL 60. Communications of the ACM 3(5), 299–314 (1960)
61. Nilsson-Nyman, E., Ekman, T., Hedin, G.: Practical Scope Recovery Using Bridge Parsing. In: Gasevic, D., Lämmel, R., Van Wyk, E. (eds.) SLE 2008. LNCS, vol. 5452, pp. 95–113. Springer, Heidelberg (2009)
62. Parr, T.J., Quong, R.W.: ANTLR: A Predicated-LL(k) Parser Generator. Software - Practice and Experience 25(7), 789–810 (1995)
63. Ramsey, N.: Eliminating spurious error messages using exceptions, polymorphism, and higher-order functions. Dept of Computer Science, University of Virginia (1996)
64. Sloane, A.M.: Post-design domain-specific language embedding: A case study in the software engineering domain. In: HICSS, p. 281 (2002)
65. Smith, B.C.: Reflection and semantics in a procedural language. Technical Report TR-272. MIT, Cambridge, MA (1982)
66. Synytskyy, N., Cordy, J.R., Dean, T.R.: Robust multilingual parsing using island grammars. In: CASCON, pp. 266–278. IBM (2003)


67. Tomita, M.: An efficient context-free parsing algorithm for natural languages. IJCAI 2, 756–764 (1985)
68. Vadim Maslov, C.D.: BTYacc – Backtracking yacc – home page, http://www.siber.com/btyacc/ (last visited: June 1, 2012)
69. van den Brand, M., Sellink, M.P.A., Verhoef, C.: Current parsing techniques in software renovation considered harmful. In: IWPC, p. 108. IEEE Computer Society (1998)
70. Van Wyk, E., Krishnan, L., Bodin, D., Schwerdfeger, A.: Attribute Grammar-Based Language Extensions for Java. In: Bateni, M. (ed.) ECOOP 2007. LNCS, vol. 4609, pp. 575–599. Springer, Heidelberg (2007)
71. Visser, E.: Scannerless generalized-LR parsing. Technical Report P9707, Programming Research Group, University of Amsterdam (July 1997)
72. Wagner, T.A., Graham, S.L.: Incremental analysis of real programming languages. In: PLDI, pp. 31–43 (1997)
73. Clocksin, W.F., Mellish, C.S.: Programming in Prolog, 4th edn. Springer (1994) ISBN 3-540-58350-5
74. Wirth, N.: The design of a PASCAL compiler. Software–Practice and Experience 1(4), 309–333 (1971)
75. Wirth, N.: Programming in Modula II, 4th edn. Springer (1988) ISBN 3-540-50150-9


Variation Programming with the Choice Calculus*

Martin Erwig and Eric Walkingshaw

School of EECS
Oregon State University

* This work is partially supported by the Air Force Office of Scientific Research under the grant FA9550-09-1-0229 and by the National Science Foundation under the grant CCF-0917092.

Abstract. The choice calculus provides a language for representing and transforming variation in software and other structured documents. Variability is captured in localized choices between alternatives. The space of all variations is organized by dimensions, which provide scoping and structure to choices. The variation space can be reduced through a process of selection, which eliminates a dimension and resolves all of its associated choices by replacing each with one of their alternatives. The choice calculus also allows the definition of arbitrary functions for the flexible construction and transformation of all kinds of variation structures. In this tutorial we will first present the motivation, general ideas, and principles that underlie the choice calculus. This is followed by a closer look at the semantics. We will then present practical applications based on several small example scenarios and consider the concepts of “variation programming” and “variation querying”. The practical applications involve work with a Haskell library that supports variation programming and experimentation with the choice calculus.

1 Introduction

Creating and maintaining software often requires mechanisms for representing variation. Such representations are used to solve a diverse set of problems, such as managing revisions over time, implementing optional features, or managing several software configurations. Traditionally, research in each of these areas has worked with different variation representations, obfuscating their similarities and making the sharing of results difficult. The choice calculus [12] solves this by providing a formal model for representing and reasoning about variation that can serve as an underlying foundation for all kinds of research on the topic [10].

More specifically and relevant to the central topics of this summer school, the choice calculus supports both generative and transformational techniques in the area of software engineering. The generative aspect is obvious: The representation of variation in software supports, through a process of selection, the generation of specific variants of that software.

How a variation representation can support transformations may be less obvious. To explain the relationship, we first point out that transformations can be distinguished into two kinds: (A) simple and automatic transformations, and (B) complicated and (at least partially) manual transformations. The first kind of transformation is the one we love: We have a representation of the transformation that we can apply as often as we want to produce some desired output from all kinds of inputs in an instant.

However, the second kind of transformation is also ubiquitous in software engineering. Consider, for example, the editing of software in response to changed requirements or bug reports. Such a transformation often requires many changes in different parts of a software system and involves the creation of a network of interdependent changes. If not done carefully, inconsistencies and other errors can be introduced, which may necessitate further costly and time-consuming editing. This kind of transformation is much more arduous than the automatic kind, but is nevertheless quite common. Moreover, since it is so complicated to deal with, it is even more deserving of attention.

A structured variation representation can support complicated transformations as follows. First, we can embed variation in the software artifact at all those places where changes are required. By creating a new variant we keep the original version and so always have a consistent version to fall back on. This benefit is also provided by traditional version control systems. However, the representations provided by these tools are usually quite impoverished (line-based patches), making it difficult to view multiple independent changes in context or apply changes in different orders.

Second, a structured variation representation supports exploratory editing of software artifacts. Whenever a particular change can be applied in several different ways, we can represent several alternatives and delay a decision, which might depend on other changes not even made at this point.

Ultimately, a variation representation supports the integrated representation of a set of closely related programs, a concept we have identified as program fields [11]. Program fields are essentially an extensional representation of a set of programs together with a set of direct transformations between them. Under this view, applying transformations is expressed by trading decisions about which changes to apply. We will illustrate this aspect later with examples.

We will start the tutorial in Section 2 by discussing the requirements of a variation representation and then illustrating how these requirements are realized in the choice calculus, which provides a generic annotation language that can be applied to arbitrary object languages. Specifically, we will demonstrate how we can synchronize variation in different parts of an object program through the concept of choices that are bound by dimensions. We will also show how this representation supports modularity as well as dependent variation. In addition, we will discuss the need for a construct to explicitly represent the sharing of common parts in a variation representation. The behavior of the sharing construct introduced by the choice calculus poses some challenges for the transformation of variational artifacts. We will therefore ignore the sharing representation in the later parts of the tutorial that are concerned with variation programming.

The most basic operation on a variation representation is the selection of a particular variant. In Section 3 we will define the semantics of the choice calculus, which essentially defines a mapping from decisions to plain object programs. The semantics is also implemented as part of the domain-specific language that we use for variation programming and often serves as a useful tool to understand variation representations.

The semantics is essentially based on a function for eliminating dimensions and associated choices. And even though choice elimination is an essential component of the choice calculus, it is only one very simple example from a set of many interesting operations on variation structures. More sophisticated operations can be defined once we integrate the choice calculus representation into an appropriate metaprogramming environment. We will present such an integration of the choice calculus into Haskell in Section 4. We will discuss several different approaches to such an integration and choose one that is simple but powerful.

This integration provides the basis for writing programs to query, manipulate, and analyze variation structures. We call this form of writing programs that exploit variation structures variation programming. Variation programming embodies the transformational aspects of a static variation representation. We will introduce the basic elements of variation programming with programs on variational lists in Section 5. We will illustrate how to generalize “standard” list functions to work on variational lists and also develop functions that manipulate the variational structure of lists in a purposeful manner.

In Section 6 we consider the application of variation programming to variational programs (the maintenance of variational software). We use an extremely simplified version of Haskell for that purpose.

This tutorial is full of languages. Understanding which languages are involved, what roles they play, and how they are related to one another is important to keep a clear view of the different representations and their purpose and how variation programming works in the different scenarios. Here is a brief summary of the languages involved.

– The choice calculus is a generic language that can be applied to, or instantiated by, different object languages. Specifically, given an object language L, we write V(L) for the result of L’s integration with the choice calculus.

– Object languages, such as list data structures or Haskell, are placed under variation control by integrating their representation with the choice calculus.

– Variational languages are the result of the combination of an object language with the choice calculus. We write VL for the variational version of the object language L, that is, we have VL = V(L). For example, we have the variational languages VList = V(List) and VHaskell = V(Haskell).

– We are using Haskell as a metalanguage to do variation programming, and we represent the choice calculus, all object languages, and variational languages as data types in Haskell to facilitate the writing of variation programs.

Finally, in this tutorial we assume some basic familiarity with Haskell, that is, knowledge of functions and data types and how to represent languages as data types. Knowledge of monads and type classes is useful, but not strictly required.

2 Elements of the Choice Calculus

In this section we will introduce and motivate the concepts and constructs of the choice calculus. We use a running example of varying a simple program in the object language of Haskell, but the choice calculus is generic in the sense that it can be applied to any tree-structured document.

Consider the following four implementations of a Haskell function named twice that returns twice the value of its argument.


twice x = x+x twice y = y+y

twice x = 2*x twice y = 2*y

These definitions vary in two independent dimensions with two possibilities each. The first dimension of variation is in the name of the function’s argument: those in the left column use x and those in the right column use y. The second dimension of variation is in the arithmetic operation used to implement the function: addition in the top row and multiplication in the bottom.

We can represent all four implementations of twice in a single choice calculus expression, as shown below.

dim Par〈x,y〉 in
dim Impl〈plus, times〉 in
twice Par〈x,y〉 = Impl〈Par〈x,y〉+Par〈x,y〉, 2*Par〈x,y〉〉

In this example, we begin by declaring the two dimensions of variation using the choice calculus dim construct. For example, dim Par〈x,y〉 declares a new dimension Par with tags x and y, representing the two possible parameter names. The in keyword denotes the scope of the declaration, which extends to the end of the expression if not explicitly indicated otherwise (for example, by parentheses).

We capture the variation between the different implementations in choices that are bound by the declared dimensions. For example, Par〈x,y〉 is a choice bound by the Par dimension with two alternatives, x and y. Note that x and y are terms in the object language of Haskell (indicated by typewriter font), while the tags x and y are identifiers in the metalanguage of the choice calculus (indicated by italics).

Each dimension represents an incremental decision that must be made in order to resolve a choice calculus expression into a concrete program variant. The choices bound to that dimension are synchronized with this decision. This incremental decision process is called tag selection. When we select a tag from a dimension, the corresponding alternative from every bound choice is also selected, and the dimension declaration itself is eliminated. For example, if we select the y tag from the Par dimension (Par.y), we would produce the following choice calculus expression in which the Par dimension has been eliminated and each of its choices has been replaced by its second alternative.

dim Impl〈plus, times〉 in
twice y = Impl〈y+y, 2*y〉

If we then select Impl.times, we produce the variant of twice in the lower-right corner of the above grid of variants.

In the above examples, the choice calculus notation is embedded within the syntax of the object language. This embedding is not a textual embedding in the way that, for example, the C Preprocessor’s #ifdef statements are integrated with program source code. Instead, choices and dimensions operate on an abstract-syntax tree view of the object language. This imposes constraints on the placement and structure of choices and dimensions. For example, every alternative of a choice must be of the same syntactic category. When it is necessary to do so, we represent the underlying tree structure of the object language explicitly with ⟪ ⟫-brackets. For example, we might render the AST for twice x = x+x as =⟪twice,x,+⟪x,x⟫⟫, that is, the definition is represented as a tree that has the = operation at the root and three children: (1) the name of the function twice, (2) its parameter x, and (3) the RHS, which is represented by another tree with root + and two children that are both given by x. Usually we stick to concrete syntax, however, for readability.

Returning to our choice calculus expression encoding all four variants of the function twice, suppose we add a third option z in the parameter name dimension. We show this extension below, where newly added tags and alternatives are underlined.

dim Par〈x,y,z〉 in
dim Impl〈plus, times〉 in
twice Par〈x,y,z〉 = Impl〈Par〈x,y,z〉+Par〈x,y,z〉, 2*Par〈x,y,z〉〉

Exercise 1. How many variants does this choice calculus expression represent? Extend the example to declare a new independent dimension, FunName, that is used to vary the name of the function between twice and double. Now how many variants are encoded?

As you can see, the above extension with tag z required making the same edit to several identical choices. As programs get larger and more complex, such repetitive tasks become increasingly prone to editing errors. Additionally, we often want to share a subexpression between multiple alternatives of the same choice. For example, a program that varies depending on the choice of operating system, say Windows, Mac, and Linux, might have many choices in which the cases for Mac and Linux are the same since they share a common heritage in Unix. It would be inconvenient, error prone, and inefficient to duplicate the common code in each of these cases.

As a solution to both of these problems, the choice calculus provides a simple sharing mechanism. Using this, we can equivalently write the above variational program as follows.

dim Par〈x,y,z〉 in
dim Impl〈plus, times〉 in
share v = Par〈x,y,z〉 in
twice v = Impl〈v+v, 2*v〉

Note that now we need only extend the dimension with the new tag z and add the z alternative once. The choice calculus variable v stores the result of this choice and is referenced in the definition of twice. Because sharing is expanded only after all dimensions and choices have been resolved, the following expression encodes precisely the same variants as the above.

dim Impl〈plus, times〉 in
share v = (dim Par〈x,y,z〉 in Par〈x,y,z〉) in
twice v = Impl〈v+v, 2*v〉

This feature provides a convenient way to limit the scope of a dimension to a single choice. We call such dimensions atomic, a concept that will be revisited in Section 4.


Exercise 2. Extend the above choice calculus expression to include a second function thrice that triples the value of its input, and that varies synchronously in the same dimensions as twice. That is, a selection of Impl.plus and Par.x (followed by share-variable expansion) should produce the following expression.

twice x = x+x

thrice x = x+x+x

Exercise 3. Modify the expression developed in Exercise 2 so that the implementation methods of the two functions vary independently. (Hint: Since dimensions are locally scoped, you can reuse the dimension name Impl.) Finally, extend thrice’s Impl dimension to include an option that implements thrice in terms of twice.

Dimensions can also be dependent on a decision in another dimension. For example, consider the following three alternative implementations of twice, where those in the top row implement the function with a lambda expression, while the one in the bottom row uses Haskell’s operator section notation to define the function in a pointfree way (that is, without explicitly naming the variable).

twice = \x -> 2*x twice = \y -> 2*y

twice = (2*)

Again we have two dimensions of variation. We can choose a pointfree representation or not, and we can again choose the parameter name. In this case, however, it doesn’t make sense to select a parameter name if we choose the pointfree style, because there is no parameter name! In other words, the parameter name dimension is only relevant if we choose “no” in the pointfree dimension. In the choice calculus, a dependent dimension is realized by nesting it in an alternative of another choice, as demonstrated below.

dim Pointfree〈yes,no〉 in
twice = Pointfree〈(2*), share v = (dim Par〈x,y〉 in Par〈x,y〉) in \v -> 2*v〉

If we select Pointfree.yes, we get the variant twice = (2*), with no more selections to make. However, if we select Pointfree.no we must make a subsequent selection in the Par dimension in order to fully resolve the choice calculus expression into a particular variant.
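For instance, selecting Pointfree.no eliminates the Pointfree declaration and keeps the second alternative of its choice, leaving

twice = share v = (dim Par〈x,y〉 in Par〈x,y〉) in \v -> 2*v

and a subsequent selection of Par.x, followed by share-variable expansion, yields the variant twice = \x -> 2*x from the grid above.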

Throughout this discussion we have implicitly taken the “meaning” of a choice calculus expression to be the variants that it can produce. In the next section we formalize this notion by presenting a formal semantics for choice calculus expressions.

3 Syntax and Semantics of the Choice Calculus

Although much of this tutorial will focus on a domain-specific embedded language (DSEL) for variation research, one of the most important goals of the choice calculus is to serve as a formal model of variation that can support a broad range of theoretical research. Before moving on, therefore, we will briefly discuss the formal syntax and semantics of choice calculus expressions. Because the DSEL is based on the choice calculus, these details will be helpful throughout the rest of this tutorial.

e ::= a⟪e, . . . ,e⟫            Object Structure
    | dim D〈t, . . . ,t〉 in e   Dimension
    | D〈e, . . . ,e〉            Choice
    | share v = e in e          Sharing
    | v                         Reference

Fig. 1. Choice calculus syntax

The syntax of choice calculus expressions follows from the discussion in the previous section and is provided explicitly in Figure 1. There are a few syntactic constraints on choice calculus expressions not expressed in the grammar. First, all tags in a single dimension must be pairwise different so they can be uniquely referred to. Second, each choice D〈en〉 must be within the static scope of a corresponding dimension declaration dim D〈tn〉 in e. That is, the dimension D must be defined at the position of the choice, and the dimension must have exactly as many tags as the choice has alternatives. Finally, each sharing variable reference v must be within the scope of a corresponding share expression defining v.

Exercise 4. Which of the following are syntactically valid choice calculus expressions?

(a) dim D〈t1, t2, t3〉 in (dim D〈t1, t2〉 in D〈e1,e2,e3〉)
(b) share v = D〈e1,e2〉 in (dim D〈t1, t2〉 in v)
(c) dim D〈t1, t2, t3〉 in (share v = D〈e1,e2,e3〉 in (dim D〈t1, t2〉 in v))

The object structure construct is used to represent the artifact that is being varied, for example, the AST of a program. Therefore, a choice calculus expression that consists only of structure expressions is just a regular, unvaried artifact in the object language. We call such expressions plain. While the structure construct provides a generic tree representation of an object language, we could imagine expanding this construct into several constructs that more precisely capture the structure of a particular object language. This idea is central to the implementation of our DSEL, as we’ll see in the next section. Also, we often omit the brackets from the leaves of structure expressions. So we write, for example, +⟪x,x⟫ rather than +⟪x⟪⟫,x⟪⟫⟫ to represent the structure of the expression x+x explicitly.

In the previous section we introduced tag selection as a means to eliminate a dimension of variation. We write ⌊e⌋D.t for the selection of tag t from dimension D in expression e. Tag selection consists of (1) finding the first declaration dim D〈tn〉 in e′ in a preorder traversal of e, (2) replacing every choice bound by the dimension in e′ with its ith alternative, where i is the index of t in tn, and (3) removing the dimension declaration. Step (2) of this process is called choice elimination, written ⌊e′⌋D.i (where the tag name has been replaced by the relevant index), and defined formally in Figure 2. This definition is mostly straightforward, replacing a matching choice with its ith alternative and otherwise propagating the elimination downward. Note, however, that propagation also ceases when a dimension declaration of the same name is encountered; this maintains the static scoping of dimension names.

⌊a⟪e1, . . . ,en⟫⌋D.i = a⟪⌊e1⌋D.i, . . . ,⌊en⌋D.i⟫

⌊dim D′〈tn〉 in e⌋D.i = dim D′〈tn〉 in e           if D = D′
⌊dim D′〈tn〉 in e⌋D.i = dim D′〈tn〉 in ⌊e⌋D.i      otherwise

⌊D′〈e1, . . . ,en〉⌋D.i = ⌊ei⌋D.i                          if D = D′
⌊D′〈e1, . . . ,en〉⌋D.i = D′〈⌊e1⌋D.i, . . . ,⌊en⌋D.i〉     otherwise

⌊share v = e in e′⌋D.i = share v = ⌊e⌋D.i in ⌊e′⌋D.i

⌊v⌋D.i = v

Fig. 2. Choice elimination

Exercise 5. Given e = dim A〈a1,a2〉 in A〈A〈1,2〉,3〉, what is the result of the selection ⌊e⌋A.a1? Is it possible to select the plain expression 2?

By repeatedly selecting tags from dimensions, we will eventually produce a plain expression. We call the selection of one or several tags collectively a decision, and a decision that eliminates all dimensions (and choices) from an expression a complete decision. Conceptually, a choice calculus expression then represents a set of plain expressions, where each is uniquely identified by the complete decision that must be made in order to produce it. We therefore define the semantics domain of choice calculus expressions to be a mapping from complete decisions to plain expressions.

We write ⟦e⟧ to indicate the semantics of expression e. We represent the denotation of e (that is, the mapping from decisions to plain expressions) as a set of pairs, and we represent decisions as n-tuples of dimension-qualified tags. For simplicity and conciseness, we enforce in the definition of the semantics that tags are selected from dimensions in a fixed order, the order that the dimension declarations are encountered in a preorder traversal of the expression (see [12] for a discussion of this design decision). For instance, in the following example, tags are always selected from dimension A before dimension B.

⟦dim A〈a1,a2〉 in A〈1, dim B〈b1,b2〉 in B〈2,3〉〉⟧ =
    {(A.a1,1), ((A.a2,B.b1),2), ((A.a2,B.b2),3)}

Note that dimension B does not appear at all in the decision of the first entry in this denotation since it is eliminated by the selection of the tag A.a1.

Exercise 6. Write the semantics of the above expression if the tag ordering constraint is removed.


Vρ(a⟪⟫) = {((), a⟪⟫)}

Vρ(a⟪en⟫) = {(δn, a⟪e′n⟫) | (δ1,e′1) ∈ Vρ(e1), . . . , (δn,e′n) ∈ Vρ(en)}

Vρ(dim D〈tn〉 in e) = {((D.ti,δ), e′) | i ∈ {1, . . . ,n}, (δ,e′) ∈ Vρ(⌊e⌋D.i)}

Vρ(share v = e1 in e2) = ⋃ {{(δ1δ2, e′2) | (δ2,e′2) ∈ Vρ⊕(v,e′1)(e2)} | (δ1,e′1) ∈ Vρ(e1)}

Vρ(v) = {((), ρ(v))}

Fig. 3. Computing the semantics of a choice calculus expression e, ⟦e⟧ = V∅(e)

Finally, we provide a formal definition of the semantics of choice calculus expressions in terms of a helper function V in Figure 3. The parameter to this function, ρ, is an environment, implemented as a stack, mapping share-variables to plain expressions. The semantics of e is then defined as an application of V with an initially empty environment, that is, ⟦e⟧ = V∅(e).

The definition of V relies on a somewhat dense notation, so we will briefly describe the conventions, then explain each case below. We use δ to range over decisions, concatenate decisions δ1 and δ2 by writing δ1δ2, and use δn to represent the concatenation of decisions δ1, . . . ,δn. Similarly, lists of expressions en can be expanded to e1, . . . ,en, and likewise for lists of tags tn. We associate v with e in environment ρ with the notation ρ ⊕ (v,e), and look up the most recent expression associated with v by ρ(v).

For structure expressions there are two sub-cases to consider. If the expression is a leaf, then the expression is already plain, so the result is an empty decision (represented by the nullary tuple ()) mapped to that leaf. Otherwise, we recursively compute the semantics of each subexpression and, for each combination of entries (one from each recursive result), concatenate the decisions and reconstruct the (now plain) structure expression.

On a dimension declaration, we select each tag ti in turn, computing the semantics of ⌊e⌋D.i and prepending D.ti to the decision of each entry in the result. Note that there is no case for choices in the definition of V. Since we assume that all choices are bound, all choices will be eliminated by selections invoked at their binding dimension declarations. In the event of an unbound choice, the semantics are undefined.

Exercise 7. Extend V to be robust with respect to unbound choices. That is, unbound choices should be preserved in the semantics, as demonstrated in the following example.

⟦A〈dim B〈b1,b2〉 in B〈1,C〈2,3〉〉,4〉⟧ =
    {(B.b1,A〈1,4〉), (B.b2,A〈C〈2,3〉,4〉)}

The case for sharing computes the semantics of the bound expression e1, then computes the semantics of the scope e2 with each variant e′1 of e1 added to the environment ρ, in turn. Each resulting expression e′2 is then associated with the combined decision that produces it. References to share-bound variables simply look up the corresponding plain expression in ρ.


In our work with the choice calculus, we have identified a set of semantics-preserving transformation laws for choice calculus expressions and related notions of representative normal forms with desirable properties (such as minimizing redundancy) [12]. This is the theoretical groundwork for a comprehensive theory of variation that can be reused by tool developers and other researchers. In the next section we switch gears by introducing a more exploratory thrust of this research: a variation programming language, based on the choice calculus, for representing and manipulating variation.

4 A Variation DSEL in Haskell

The choice calculus, as presented in the previous two sections, is an entirely static representation. It allows us to precisely specify how a program varies, but we cannot use the choice calculus itself to edit, analyze, or transform a variational program. In the previous section we supplemented the choice calculus with mathematical notation to define some such operations, for example, tag selection. In some regards, math is an ideal metalanguage since it is infinitely extensible and extremely flexible: we can define almost any operation we can imagine. However, it’s difficult to test an operation defined in math or to apply it to several examples quickly to observe its effect. In other words, it’s hard to play around with math. This is unfortunate, since playing around can often lead to challenged assumptions, clever insights, and a deeper understanding of the problem at hand.

In this section, we introduce a domain-specific embedded language (DSEL) in Haskell for constructing and manipulating variational data structures. This DSEL is based on the choice calculus, but is vastly more powerful since we have the full power of the metalanguage of Haskell at our disposal. Using this DSEL, we can define all sorts of new operations for querying and manipulating variation. Because the operations are defined in Haskell, certain correctness guarantees are provided by the type system, and most importantly, we can actually execute the operations and observe the outputs. Through this DSEL we can support a hands-on, exploratory approach to variation research.

In the rest of this tutorial we will be exploring the interaction of variation representations and functional programming. Combining these ideas gives rise to the notion of variation programming, an idea that is explored more thoroughly in Sections 5 and 6.

In the DSEL, both the variation representation and any particular object language are represented as data types. The data type for the generic variation representation is given below. As you can see, it adapts the dimension and choice constructs from the choice calculus into Haskell data constructors, Dim and Chc. The Obj constructor will be explained below. In this definition, the types Dim and Tag are both synonyms for the predefined Haskell type String.

data V a = Obj a
         | Dim Dim [Tag] (V a)
         | Chc Dim [V a]

The type constructor name V is intended to be read as “variational”, and the type parameter a represents the object language to be varied. So, given a type Haskell representing Haskell programs, the type V Haskell would represent variational Haskell programs (see Section 6).


The Obj constructor is roughly equivalent to the object structure construct from the choice calculus. However, here we do not explicitly represent the structure as a tree, but rather simply insert an object language value directly. An important feature of the DSEL is that it is possible for the data type representing the object language to itself contain variational types (created by applying the V type constructor to its argument types), and operations written in the DSEL can query and manipulate these nested variational values generically. This is achieved through the use of the “scrap your boilerplate” (SYB) library [19], which imposes a few constraints on the structure of a. These constraints will be described in Section 5.1. In the meantime, we will only use the very simple object language of integers, Int, which cannot contain nested variational values.

One of the advantages of using a metalanguage like Haskell is that we can define functional shortcuts for common syntactic forms. In Haskell, these are often called “smart constructors”. For example, we define the following function atomic for defining atomic dimensions (a dimension with a single choice as an immediate subexpression).

atomic :: Dim -> [Tag] -> [V a] -> V a

atomic d ts cs = Dim d ts $ Chc d cs
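For example, evaluating a small application of atomic in GHCi and pretty printing the result (as we do throughout this tutorial) should produce something like the following; the exact rendering depends on the pretty printer.

> atomic "A" ["a1","a2"] [Obj 1, Obj 2]
dim A<a1,a2> in A<1,2>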

Exercise 8. Define the following smart constructors:

(a) dimA :: V a -> V a, which declares a dimension A with tags a1 and a2

(b) chcA :: [V a] -> V a, which constructs a choice in dimension A

These smart constructors will be used in examples throughout this section.

Note that we have omitted the sharing-related constructs from the definition of V. This decision was made primarily for two reasons. First, some of the sharing benefits of the choice calculus share construct are provided by Haskell directly, for example, through Haskell’s let and where constructs. In fact, sharing in Haskell is much more powerful than in the choice calculus since we can also share values via functions. Second, the inclusion of an explicit sharing construct greatly complicates some important results later. In particular, we will show that V is a monad, while it is unclear whether this is true when V contains explicit sharing constructs. Several other operations are also much more difficult to define with explicit sharing.

There are, however, advantages to the more restricted and explicit form of sharing provided by the choice calculus. The first is perhaps the most obvious: since sharing is handled at the metalanguage level in the DSEL, it introduces redundancy when resolved into the variation representation (the V data type). This puts an additional burden on users to not introduce update anomalies and makes operations on variational data structures necessarily less efficient.

A more subtle implication of the metalanguage-level sharing offered by the DSEL is that we lose the choice calculus’s property of static (syntactic) choice scoping. In the choice calculus, the dimension that binds a choice can always be determined by examining the context that the choice exists in; this is not the case in the DSEL. For example, in the following choice calculus expression, the choice in A is unbound.

share v = A〈1,2〉 in dim A〈a1,a2〉 in v


Meanwhile, in the corresponding DSEL expression, the choice in A is bound by the dimension surrounding the variable reference. This is demonstrated by evaluating the following DSEL expression (for example, in GHCi), and observing the pretty-printed output.

> let v = chcA [Obj 1, Obj 2] in dimA v

dim A<a1,a2> in A<1,2>

In effect, in the choice calculus, sharing is expanded after dimensions and choices are resolved, while in the DSEL sharing is expanded before.

Exercise 9. Compare the semantics of the following expression if we expand sharing before dimensions and choices are resolved, with the semantics if we expand sharing after dimensions and choices are resolved.

share v = (dim A〈a1,a2〉 in A〈1,2〉) in (v,v)

The result in either case is a mapping with pairs of integers such as (2,2) in its range.

The lack of static choice scoping, combined with the more unrestricted form of sharing offered by Haskell functions, also opens up the possibility for choice capture. This is where a choice intended to be bound by one dimension ends up being bound by another. As an example, consider the following operation insertA that declares a dimension A, then inserts a choice in A into some expression, according to the argument function.

insertA :: (V Int -> V Int) -> V Int

insertA f = dimA (f (chcA [Obj 1, Obj 2]))

The author of this operation probably expects that the inserted choice will be bound by the dimension declared in this definition, but if the argument function also declares a dimension A, the choice could be captured, as demonstrated below.

> insertA (\v -> Dim "A" ["a3","a4"] v)

dim A<a1,a2> in dim A<a3,a4> in A<1,2>

Now the choice is bound by the dimension in the argument, rather than the intended dimension declared in the insertA function.

Despite all of these qualms, however, the additional power and simpler variation model that results from the off-loading of sharing to the metalanguage makes possible a huge variety of operations on variational expressions. Exploring these operations will form the bulk of the remainder of this tutorial. Supporting this additional functionality while maintaining the structure, safety, and efficiency of the choice calculus’s sharing constructs remains an important open research problem.

An important feature of the V data type is that it is both a functor and a monad. Functors and monads are two of the most commonly used abstractions in Haskell. By making the variation representation an instance of Haskell’s Functor and Monad type classes, we make a huge body of existing functions and knowledge instantly available from within our DSEL, greatly extending its syntax. Functors are simpler than (and indeed a subset of) monads, so we will present the Functor instance first, below. The Functor class contains one method, fmap, for mapping a function over a data structure while preserving its structure.

fmap :: Functor f => (a -> b) -> f a -> f b

For V, this operation consists of applying the mapped function f to the values stored at Obj nodes, and propagating the calls into the subexpressions of Dim and Chc nodes.

instance Functor V where
  fmap f (Obj a)      = Obj (f a)
  fmap f (Dim d ts v) = Dim d ts (fmap f v)
  fmap f (Chc d vs)   = Chc d (map (fmap f) vs)

Consider the following variational integer expression ab, where dimB and chcB are smart constructors similar to dimA and chcA.

> let ab = dimA $ chcA [dimB $ chcB [Obj 1, Obj 2], Obj 3]

> ab

dim A<a1,a2> in A<dim B<b1,b2> in B<1,2>,3>

Using fmap, we can, for example, increment every object value in a variational integer expression.

> fmap (+1) ab

dim A<a1,a2> in A<dim B<b1,b2> in B<2,3>,4>

Or we can map the function odd :: Int -> Bool over the structure, producing a variational boolean value of type V Bool.

> fmap odd ab

dim A<a1,a2> in A<dim B<b1,b2> in B<True,False>,True>

Exercise 10. Write an expression that maps every integer i in ab to a choice between i and i+1. What is the type of the resulting value?

The definition of the Monad instance for V is similarly straightforward. The Monad type class requires the implementation of two methods: return for injecting a value into the monadic type, and >>= (pronounced “bind”) for sequentially composing a monadic value with a function that produces another monadic value.

return :: Monad m => a -> m a

(>>=) :: Monad m => m a -> (a -> m b) -> m b

The monad instance definition for the variational type constructor V is as follows. The return method is trivially implemented by the Obj data constructor. For >>=, at an Obj node, we simply return the result of applying the function to the value stored at that node. For dimensions and choices, we must again propagate the bind downward into subexpressions.


instance Monad V where
  return = Obj
  Obj a     >>= f = f a
  Dim d t v >>= f = Dim d t (v >>= f)
  Chc d vs  >>= f = Chc d (map (>>= f) vs)

The effect of a monadic bind is essentially to replace every value in the structure with another monadic value (of a potentially different type) and then to flatten the results. The concatMap function on lists is a classic example of this pattern (though the order of arguments is reversed). In the context of variation representations, we can use this operation to introduce new variation into a representation. For example, consider again the expression ab. We can add a new dimension S, indicating whether or not we want to square each value (the line break in the output was inserted manually).

> Dim "S" ["n","y"] $ ab >>= (\i -> Chc "S" [Obj i, Obj (i*i)])

dim S<n,y> in dim A<a1,a2> in

A<dim B<b1,b2> in B<S<1,1>,S<2,4>>,S<3,9>>

Each value in the original expression ab is expanded into a choice in dimension S. The resulting expression remains of type V Int. Compare this to the result of Exercise 10.

Finally, the DSEL provides several functions for analyzing variational expressions. For example, the function freeDims :: V a -> Set Dim returns the set of all free (unbound) dimensions in a given variational expression. Several other basic static analyses are also provided. Significantly, a semantics function for variational expressions, sem, is provided. This is based on the semantics of the choice calculus from the previous section. Similarly, the semantics of a variational expression of type V a is a mapping from decisions (lists of qualified tags) to plain expressions of type a. More commonly, we use a function psem which computes the semantics of an expression and pretty prints the results. For example, the pretty printed semantics of the expression ab are shown below.

> psem ab

[A.a1,B.b1] => 1

[A.a1,B.b2] => 2

[A.a2] => 3

Each entry in the semantics is shown on a separate line, with a decision on the left of each arrow and the resulting plain expression on the right.
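As a small illustration of the analysis functions, freeDims reports choices that are not bound by an enclosing dimension declaration. A rough sketch of its use (the exact printed form of the resulting set may differ):

> freeDims (Chc "A" [Obj 1, Obj 2])
fromList ["A"]
> freeDims ab
fromList []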

While this section provided a brief introduction to some of the features provided by the DSEL, the following sections on variational programming will introduce many more. In particular, Section 5.1 will describe how to make a non-trivial data type variational, and Section 5.2 and Section 5.3 will present a subset of the language designed for the creation of complex editing operations on variational expressions.

5 Variational Lists

We start exploring the notion of variation programming with lists, which are a simple but expressive and pervasive data structure. The familiarity with lists will help us to identify important patterns when we generalize traditional list functions to the case of variational lists. The focus on a simple data structure will also help us point out the added potential for variation programming. We present variation programming with lists in several steps.

First, we explain the data type definition for variational lists and present several examples together with some helper functions in Section 5.1. Second, we develop variational versions of a number of traditional list functions in Section 5.2. We can observe that, depending on the types involved, certain patterns of recursion become apparent. Specifically, we will see that, depending on the role variation plays in the types of the defined functions, variational parts have to be processed either using fmap, effectively treating them in a functorial style, or using >>=, treating them as monadic values. In Section 5.3 we turn our attention to editing operations for variational lists. While the adapted traditional list functions naturally produce variational data structures (such as lists, numbers, etc.), these arise from variations already present in the argument lists and are thus more of a side effect. In contrast, list editing operations introduce or change variation structure purposefully. In Section 5.4 we present some comments and observations on the different programming styles employed in Sections 5.2 and 5.3.

As a motivating example we consider how to represent menu preferences using choices and dimensions. Suppose that we prefer to order meat or pasta as the main course in a restaurant and that with meat we always order french fries on the side. Also, if we order pasta, we may have cake for dessert. Using the choice calculus we can represent these menu options as follows (here ε represents an empty token that, when selected, does not appear in the list as an element but rather disappears).

dim Main〈meat,pasta〉 in
Main〈[Steak,Fries],[Pasta,dim Dessert〈yes,no〉 in Dessert〈Cake,ε〉]〉

Here we have used a simple list notation as an object language. This notation leaves open many questions, such as how to nest lists and how to compose a variational list and a list without variations. We will look at these questions in more detail in the following.

5.1 Representing Variational Lists

Lists are typically represented using two constructors, for empty lists and for adding single elements to lists. Since lists are the most important data structure in functional programming, they are predefined in Haskell and supported through special syntax. While this is nice, it prevents us from changing the representation to variational lists. Therefore, we have to define our own list representation first, which we then can extend in a variety of ways to discuss the transition to variational lists.

A standard definition of lists is as follows.

data List a = Cons a (List a)
            | Empty

To create variational lists using the V data type, we have to apply V somewhere in this definition. One possibility is to apply V to a, thus making the elements in a list variable.


data List a = Cons (V a) (List a)
            | Empty

While this definition is quite convenient¹ as far as varying elements is concerned, it does not allow us to vary lists themselves. For example, we cannot represent a list whose first element is 1 and whose tail is either [2] or [3,4].

This limitation results from the fact that we cannot have a choice (or any other variational construct) in the second argument of Cons. This shortcoming can be addressed by throwing in another V type constructor.

data List a = Cons (V a) (V (List a))
            | Empty

This representation avoids the above problem and is indeed the most general representation imaginable. However, the problem with this representation is that it is too general. There are two major drawbacks. First, the representation makes the definitions of functions cumbersome since it requires processing two variational types for one constructor. More importantly, the way our DSEL is implemented does not allow the application of V to different types in the same data type, and thus cannot deal with the shown definition of List. This limitation is a consequence of the employed SYB library [19].²

A drawback of either of the two previous approaches is that changing the type of existing constructors may break existing code. This aspect matters when variational structure is added to existing data structures. In such a situation we would like to be able to continue using existing functions without the need for any changes in existing code.

Therefore, we choose the following representation in which we simply add a new constructor, which serves as a hook for any form of variation to be introduced into lists. This definition yields what we call an expanded list, where “expanded” means that it can contain variational data. However, this expansion is not enough; we also need a type for variational lists, that is, lists that are the object of the V type constructor. We introduce a type abbreviation for this type. The two types List a and VList a for expanded and variational lists, respectively, depend mutually on one another and together accomplish through this recursion the lifting of the plain list data type into its fully variational version.

type VList a = V (List a)

data List a = Cons a (List a)
            | Empty
            | VList (VList a)

We are using the convention to use the same name for the additional constructor as for the variational type, in this case VList. This helps to keep the variational code more organized, in particular in situations where multiple variational data types are used.

¹ Moreover, if this definition were all we needed, we could apply it directly to the predefined Haskell lists.

² It is possible to lift this constraint, but doing so requires rather complex generic programming techniques that would make the library much more difficult to use.


list :: List a -> VList a

list = Obj

single :: a -> List a

single a = Cons a Empty

many :: [a] -> List a

many = foldr Cons Empty

vempty :: VList a

vempty = list Empty

vsingle :: a -> VList a

vsingle = list . single

vcons :: a -> VList a -> VList a

vcons x = list . Cons x . VList

vlist :: [a] -> VList a

vlist = list . many

Fig. 4. Auxiliary functions for variational lists

With the chosen definition for the data type List we can represent the variational list for our menu choices as follows. First, we introduce a data type for representing food items.

data Food = Steak | Pasta | Fries | Cake

Note that for the above data type we also derive instances for Eq, Show, Data, and Typeable. Instances of Data and Typeable are required for the SYB library to work. Every data type in this tutorial that will be used with the V type constructor also derives instances for these classes, although we don’t show this explicitly each time.
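Spelled out once, the declaration with the derived instances would look roughly as follows (deriving Data and Typeable requires the DeriveDataTypeable extension).

data Food = Steak | Pasta | Fries | Cake
  deriving (Eq, Show, Data, Typeable)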

We also introduce a few auxiliary functions that help make the writing of variational lists more concise, see Figure 4. For example, vempty represents an empty variational list, vsingle constructs a variational list containing one element, and vcons takes an element and adds it to the beginning of a variational list. The function vlist transforms a regular Haskell list into a VList, which lets us reuse Haskell list notation in constructing VLists. All three definitions are based on corresponding List versions and use the synonym list for Obj, which lifts an object language expression into a variational expression. The function list is more concrete than Obj in the sense that it explicitly tells us that a List value is lifted to the variational level. It can also be understood as indicating, within a variational expression: “look, here comes an ordinary list value”. We use similar synonyms for other object languages (for example, int or haskell), and we will even use the synonym obj for generic values.
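With these helpers we can, for instance, express the list mentioned earlier whose first element is 1 and whose tail is either [2] or [3,4]. The following GHCi session is a sketch; the exact pretty-printed output may differ slightly.

> psem (vcons 1 (atomic "A" ["a1","a2"] [vlist [2], vlist [3,4]]))
[A.a1] => [1;2]
[A.a2] => [1;3;4]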

Exercise 11. The function vcons shown in Figure 4 adds a single (non-variational) element to a variational list. Define a function vvcons that adds a choice (that is, a variational element) to a variational list. (Hint: Since you have to deal with two occurrences of the V constructor, you might want to exploit the fact that V is a monad.)

Using these operations, we can give the following definition of the menu plan as a variational list.


type Menu = VList Food

dessert :: Menu

dessert = atomic "Dessert" ["yes","no"] [vsingle Cake,vempty]

menu :: Menu

menu = atomic "Main" ["meat","pasta"]

[vlist [Steak,Fries],Pasta ‘vcons‘ dessert]

We can examine the structure we have built by evaluating menu (again, the line break was inserted manually).

> menu

dim Main<meat,pasta> in

Main<[Steak;Fries],[Pasta;dim Dessert<yes,no> in Dessert<[Cake],[]>]>

Note that we have defined the pretty printing for the List data type to be similar to ordinary lists, except that we use ; to separate list elements. In this way we keep a notation that is well established but also provides cues to differentiate between lists and variational lists.

Since the presence of nested dimensions complicates the understanding of variational structures, we can use the semantics of menu to clarify the represented lists.

> psem menu

[Main.meat] => [Steak;Fries]

[Main.pasta,Dessert.yes] => [Pasta;Cake]

[Main.pasta,Dessert.no] => [Pasta]

Exercise 12. Change the definition of menu so that we can choose dessert also for a meat main course. There are two ways of achieving this change: (a) by copying the dessert dimension expression into the other choice, or (b) by lifting the dimension declaration out of the main choice.

Before we move on to discuss variational list programs, we show a couple of operations to facilitate a more structured construction of variational lists. These operations are not very interesting from a transformational point of view, but they can be helpful in decomposing the construction of complicated variational structures into an orderly sequence of steps.

This doesn’t seem to be such a big deal, but if we take a closer look at the definition of menu shown above, we can observe that we have employed in this simple example alone five different operations to construct lists, namely vsingle, vempty, vlist, vcons, and []. To decide which operation to use where requires experience (or extensive consultation with the Haskell type checker).

In the construction of menu we can identify two patterns that seem to warrant support by specialized operations. First, the definition of dessert is an instance of a dimension representing that something is optional. We can therefore define a function opt for introducing an optional feature in a straightforward way as follows.


opt :: Dim -> a -> VList a

opt d x = atomic d ["yes","no"] [vsingle x,vempty]

Second, the definition of menu was given by separating the tags and the lists they label. A more modular definition can be given if we define the two different menu options separately and then combine them into a menu. To do that we introduce some syntactic sugar for defining tagged variational lists.

type Tagged a = (Tag,V a)

infixl 2 <:

(<:) :: Tag -> V a -> Tagged a

t <: v = (t,v)

Next we can define an operation alt for combining a list of tagged alternatives into a dimension.

alt :: Dim -> [Tagged a] -> V a

alt d tvs = atomic d ts vs where (ts,vs) = unzip tvs

With the help of opt, <:, and alt we can thus give the following, slightly more modular definition of menu.

dessert = opt "Dessert" Cake

meat = "meat" <: vlist [Steak,Fries]

pasta = "pasta" <: Pasta ‘vcons‘ dessert

menu = alt "Main" [meat,pasta]

This definition produces exactly the same (syntactic) variational list as the definition given above.

5.2 Standard Variational List Functions

Among the most common functions for lists are functions to transform lists or to aggregate them. In the following we will first illustrate how to implement some of these functions directly using pattern matching and recursion. We will later introduce more general variational list functions, such as map and fold.

Let us start by implementing the function len to compute the length of a variational list. The first thing to realize is that the return type of the function is not just Int but rather V Int since the variation in a list may represent lists of different lengths. The implementation can be performed by pattern matching: The length of an empty list is zero. However, we must be careful here to not just return 0 since the return type of the function requires a V Int value. We therefore have to lift the 0 into the V type using the constructor Obj, for which we also provide the abbreviation int (see the discussion of list above, and note that we could also use the return method from Monad for this). The length of a non-empty list is given by the length of the tail plus one. Again, because of the structured return type of len we cannot simply add one to the result of the recursive call. Since len xs can produce, in general, an arbitrarily complex variation expression over integers, we have to make sure to add one to all variants, which can be accomplished by the function fmap. Finally, to compute the length of a list whose representation is distributed within a V structure, we have to carry the len computation to all the lists in the V representation. One could think of doing that again using fmap. However, looking at the involved types tells us that this is not the right approach, because one would end up with a bunch of V Int values scattered all over a V value. What we need instead is a single V Int value. So we are given a V (List a) value vl and a function len of type List a -> V Int to produce a value of type V Int. If we abstract from the concrete types a little bit by replacing List a by a, V by m, and Int by b, we see that to combine vl and len we need a function of the following type.

m a -> (a -> m b) -> m b

As we know (or otherwise could find out quickly using Hoogle [15]), this is exactly the type of the monadic bind operation, which then tells us the implementation for the last case. Thinking about it, applying len to vl using monadic bind makes a lot of sense since our task in this case is to compute variational data in many places and then join or merge them into the existing variational structure of vl.

len :: List a -> V Int

len Empty = int 0

len (Cons _ xs) = fmap (+1) (len xs)

len (VList vl) = vl >>= len

Now if we try to apply len to one of the variational lists defined in Section 5, we find that the types do not match up. While len is a function that works for lists that contain variational parts, it still expects an expanded list as its input. It seems we need an additional function that can be applied to values of type V (List a).

In fact, we have defined such a function already in the third case of len, and we could simply reuse that definition. Since it turns out that we need to perform such a lifting into a V type quite often, we define a general function for that purpose.

liftV :: (a -> V b) -> V a -> V b

liftV = flip (>>=)

As is apparent from the type and implementation (and also from the discussion of the third case of len), the liftV function is essentially the bind operation of the V monad.

With liftV we obtain the required additional version of the function len.

vlen :: VList a -> V Int

vlen = liftV len

We generally use the following naming convention for functions. Given a function f whose input is of type T, we use the name vf for its lifted version that works on values of type V T.

We can now test the definition of vlen by applying it to the example list menu defined in Section 5.1.

> vlen menu

dim Main<meat,pasta> in Main<2,dim Dessert<yes,no> in Dessert<2,1>>

As expected, the result is a variational expression over integers. We can obtain a more concise representation by computing the semantics of this expression.


> psem $ vlen menu

[Main.meat] => 2

[Main.pasta,Dessert.yes] => 2

[Main.pasta,Dessert.no] => 1

Exercise 13. Implement the function sumL :: List Int -> V Int using pattern matching and recursion. Then define the function vsum :: VList Int -> V Int.

We have explained the definition of len in some detail to illustrate the considerations that led to the implementation. We have tried to emphasize that the generalization of a function definition for ordinary lists to variational lists requires mostly a rigorous consideration of the types involved. In other words, making existing implementations work for variational data structures is an exercise in type-directed programming in which the types dictate (to a large degree) the code [39].

Before moving on to defining more general functions on variational lists, we will consider the definition of list concatenation as an example of another important list function. This will highlight an important pattern in the generalization of list functions to the variational case.

The definitions for the Empty and Cons cases are easy and follow the definition for ordinary lists, that is, simply return the second list or recursively append it to the tail of the first, respectively. However, the definition for a variational list is not so obvious. If the first list is given by a variation expression, say vl, we have to make sure that we append the second list to all lists that are represented in vl. In the discussion of the implementation of len we have seen that we have, in principle, two options to do that, namely fmap and >>=. Again, a sharp look at what happens to the involved types will tell us what the correct choice is. For the concatenation of lists we can observe that the result type stays the same, that is, it is still a value of type List a, which means that we can traverse vl and apply the function cat with its second argument fixed to all lists that we encounter. This can be accomplished by the function fmap. The situation for len was different because its result was a variational type, which required the flattening of the resulting cascading V structures through >>=.

cat :: List a -> List a -> List a

cat Empty r = r

cat (Cons a l) r = Cons a (l `cat` r)
cat (VList vl) r = VList (fmap (`cat` r) vl)

As for len, we also need a version of cat that works for variational lists.³ A simple solution is obtained by simply lifting the variational list arguments into the List type using the VList constructor, which facilitates the application of cat.

vcat :: VList a -> VList a -> VList a

vcat l r = list $ cat (VList l) (VList r)

³ Remember that List a represents only the expanded list type and that VList a is the variational list type.


To show vcat in action, assume that we extend Food by another constructor Sherry, which we use to define the following variational list representing a potential drink before the meal.

aperitif :: VList Food

aperitif = opt "Drink" Sherry

When we concatenate the two lists aperitif and menu, we obtain a variational list that contains a total of six different variants. Since the evaluation of vcat duplicates the dimensions in menu, the resulting term structure becomes quite difficult to read and understand. We therefore show only the semantics of the result.

> psem $ vcat aperitif menu

[Drink.yes,Main.meat] => [Sherry;Steak;Fries]

[Drink.yes,Main.pasta,Dessert.yes] => [Sherry;Pasta;Cake]

[Drink.yes,Main.pasta,Dessert.no] => [Sherry;Pasta]

[Drink.no,Main.meat] => [Steak;Fries]

[Drink.no,Main.pasta,Dessert.yes] => [Pasta;Cake]

[Drink.no,Main.pasta,Dessert.no] => [Pasta]

Exercise 14. Define the function rev for reversing expanded lists. You may want to use the function cat in your definition. Also provide a definition of the function vrev for reversing variational lists. Before testing your implementation, try to predict what the result of the expression vrev menu should be.

All of the examples we have considered so far have lists as arguments. Of course, programming with variational lists should integrate smoothly with other, non-variational types. To illustrate this, we present the definition of the functions nth and vnth to compute the nth element of a variational list (recall that we use obj as a synonym for Obj, to maintain letter-case consistency with list and int).

nth :: Int -> List a -> V a

nth _ Empty = undefined

nth 1 (Cons x _) = obj x

nth n (Cons _ xs) = nth (n-1) xs

nth n (VList vl) = vl >>= nth n

We can observe that the integer parameter is passed around unaffected through the variational types. The lifting to variational lists is straightforward.

vnth :: Int -> VList a -> V a

vnth n = liftV (nth n)

We also observe that the computation of nth can fail. This might be more annoying than for plain lists because in general the length of the lists in a variational list expression is not obvious. Specifically, the length can vary! Therefore, it is not obvious what argument to call vnth with. For example, the following computation produces the expected result that the first item in a menu list is either Steak or Pasta.


> vnth 1 menu

dim Main<meat,pasta> in Main<Steak,Pasta>

However, since there is no second item for the Main.pasta, Dessert.no list, the computation vnth 2 menu fails. This is a bit disappointing since for some variants a second list element does exist. A definition for nth/vnth using a V (Maybe a) result type seems to be more appropriate. We leave the definition of such a function as an exercise. As another exercise consider the following task.

Exercise 15. Define the function filterL :: (a -> Bool) -> List a -> List a and give a definition for the corresponding function vfilter that operates on variational lists.

The final step in generalizing list functions is the definition of a fold operation (and possibly other generic list processing operations) for variational lists. The definition for fold can be easily obtained by taking the definition of len (or sumL from Exercise 13) and abstracting from the aggregating function +.

fold :: (a -> b -> b) -> b -> List a -> V b

fold _ b Empty = obj b

fold f b (Cons a l) = fmap (f a) (fold f b l)

fold f b (VList vl) = vl >>= fold f b

With fold we should be able to give more succinct definitions for functions such as len, which is indeed the case.

len :: List a -> V Int

len = fold (\_ s->succ s) 0
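Other standard list functions can be adapted following the same type-directed recipe. As one more sketch (the names mapL and vmap are introduced here for illustration), a map over expanded lists keeps the result type List b, so the variational case uses fmap, just as in cat.

mapL :: (a -> b) -> List a -> List b
mapL _ Empty      = Empty
mapL f (Cons a l) = Cons (f a) (mapL f l)
mapL f (VList vl) = VList (fmap (mapL f) vl)

vmap :: (a -> b) -> VList a -> VList b
vmap f = list . mapL f . VList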

Finally, we could also consider recursion on multiple variational lists. We leave this as an exercise.

Exercise 16. Implement the function zipL :: List a -> List b -> List (a,b) and give a definition for the corresponding function vzip that operates on variational lists.

As an example application of vzip, consider the possible meals when two people dine.

> psem $ vzip menu menu

[Main.meat,Main.meat] => [(Steak,Steak);(Fries,Fries)]

[Main.meat,Main.pasta,Dessert.yes] => [(Steak,Pasta);(Fries,Cake)]

[Main.meat,Main.pasta,Dessert.no] => [(Steak,Pasta)]

[Main.pasta,Main.meat,Dessert.yes] => [(Pasta,Steak);(Cake,Fries)]

[Main.pasta,Main.meat,Dessert.no] => [(Pasta,Steak)]

[Main.pasta,Main.pasta,Dessert.yes,Dessert.yes] => [(Pasta,Pasta);

(Cake,Cake)]

[Main.pasta,Main.pasta,Dessert.yes,Dessert.no] => [(Pasta,Pasta)]

[Main.pasta,Main.pasta,Dessert.no] => [(Pasta,Pasta)]


Now, this looks a bit boring. Maybe we could consider filtering out some combinations that are considered "bad" for some reason, for example, when somebody has dessert while the other person is still having the main course. We might also consider a more relaxed definition of vzip in which one person can have pasta and dessert and the other person can have pasta and no dessert. Note that while we can select this possibility in the above semantics, the corresponding variant does not reflect this since when two lists of differing lengths are zipped, the additional elements of the longer list are discarded.
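To make the filtering idea concrete, here is a small sketch (our own, not from the original text) that assumes the vfilter function from Exercise 15, the vzip function from Exercise 16, and the Obj type and menu value from Section 5.1; the helpers isDessert and okPair are hypothetical.

isDessert :: Obj -> Bool
isDessert Cake = True
isDessert _    = False

-- keep only pairs where both diners are in the same phase of the meal
okPair :: (Obj,Obj) -> Bool
okPair (a,b) = isDessert a == isDessert b

politeMeals = vfilter okPair (vzip menu menu)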

5.3 Edit Operations for Variational Lists

The menu example that we introduced in Section 5.1 was built in a rather ad hoc fashion in one big step from scratch. More realistically, variational structures develop over time, by dynamically adding and removing dimensions and choices in an expression, or by extending or shrinking choices or dimensions. More generally, the rich set of laws that exists for the choice calculus [12] suggests a number of operations to restructure variation expressions by moving around choices and dimensions. Specifically, operations for the factoring of choices or the hoisting of dimensions reflect refactoring operations (that is, they preserve the semantics of the transformed variation expression). These are useful for bringing expressions into various normal forms.

In this section we will present several operations that can be used for the purpose of evolving variation representations. Most of these operations will be generic in the sense that they can be applied to other variational structures, and we will actually reuse some of them in Section 6.

As a motivating example let us assume that we want, in our dinner decisions, to think first about the dessert and not about the main course. To obtain an alternative list representation with the Dessert dimension at the top we could, of course, build a new representation from scratch. However, this approach does not scale very well, and the effort quickly becomes prohibitive as the complexity of the variational structures involved grows. An alternative, more flexible approach is to take an already existing representation and transform it accordingly. In our example, we would like to split the declaration part off of a dimension definition and move it to the top level. This amounts to the repeated application of commutation rules for dimensions [12]. We can break down this operation into several steps as follows. Assume e is the expression to be rearranged and d is the name of the dimension declaration that is to be moved.

(1) Find the dimension d that is to be moved.

(2) If the first step is successful, cut out the found dimension expression Dim d ts e' and remember its position, which can be done in a functional setting through the use of a context c, that is, an expression with a hole that is conveniently represented by a function.

(3) Keep the scope of the found dimension declaration, e', at its old location, which can be achieved by applying c to e'.

(4) Finally, move the declaration part of the dimension definition to the top level, which is achieved by wrapping it around the already changed expression obtained in the previous step; that is, we produce the expression Dim d ts (c e').

To implement these steps we need to solve some technically challenging problems.


For example, finding a subexpression in an arbitrary data type expression, removing it, and replacing it with some other expression requires some advanced generic programming techniques. To this end we have employed the SYB [19] and the "scrap your zipper" [1] libraries for Haskell, which allow us to implement such generic transformation functions. Since a detailed explanation of these libraries and how the provided functions work is beyond the scope of this tutorial, we will only briefly mention what the functions will do as we encounter them. The approach is based on a type C a, which represents a context in a type V a. Essentially, a value of type C a represents a pointer to a subexpression of a value of type V a, which lets us extract the subexpression and also replace it. A context is typically the result of an operation to locate a subexpression with a particular property. We introduce the following type synonym for such functions.

type Locator a = V a -> Maybe (C a)

The Maybe type indicates that a search for a context may fail. As a generic function to locate subexpressions and return a matching context, we provide the following function find that locates the first occurrence of a subexpression that satisfies the given predicate. A predicate in this context means a boolean function on variational expressions.

type Pred a = V a -> Bool

The function find performs a preorder traversal of the expression and thus locates the topmost, leftmost subexpression that satisfies the predicate.

find :: Data a => Pred a -> Locator a

The Data class constraint is required for the underlying zipper machinery in the implementation of find. The function find already realizes the first step of the transformation sequence needed for refactoring the representation of the variational list menu. All we need is a predicate to identify a particular dimension d, which is quite straightforward to define.

dimDef :: Dim -> Pred a

dimDef d (Dim d' _ _) = d == d'

dimDef _ _ = False
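As a small usage sketch (ours, not from the original text), find and dimDef can be combined to locate a dimension declaration; the binding below assumes the menu expression from Section 5.1 and yields Nothing if the dimension does not occur.

-- locate the declaration of the Dessert dimension in menu (or Nothing)
dessertCtx = find (dimDef "Dessert") menu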

The second step of cutting out the dimension is realized by the function extract, which conveniently returns as a result a pair consisting of the context and the subexpression sitting in the context. The function extract is an example of a class of functions that split an expression into two parts, a context plus some additional information about the expression in the hole. In the specific case of extract that information is simply the expression itself. This level of generality is sufficient for this tutorial, and we therefore represent this class of functions by the following type.

type Splitter a = V a -> Maybe (C a,V a)

The definition of extract uses find to locate the context and then simply extracts the subexpression stored in the context using the predefined zipper function getHole.

extract :: Data a => Pred a -> Splitter a

extract p e = do c <- find p e
                 h <- getHole c
                 return (c,h)


The third step of applying the context to the scope of the dimension expression requires the function <@, whose definition is based on elementary zipper functions that we don't show here.

(<@) :: Data a => C a -> V a -> V a

The function <@ can also be understood as inserting the second argument into the hole of the first.
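One way to see the interplay of extract and <@ is the following round-trip property; the sketch below is ours (it is not stated as code in the original text) and uses fromMaybe from Data.Maybe to fall back to the original expression if the extraction fails.

import Data.Maybe (fromMaybe)

-- plugging the extracted subexpression back into its context
-- should reconstruct the original expression
roundTrip :: Data a => Pred a -> V a -> V a
roundTrip p e = fromMaybe e $ do
  (c,h) <- extract p e
  return (c <@ h)

Intuitively, roundTrip p e should always equal e.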

Finally, we can combine all these functions and define a function hoist for hoisting dimension declarations. To avoid having to deal with error handling when we call hoist, we return the original expression as a default in case the hoisting process fails at some stage.

hoist :: Data a => Dim -> V a -> V a

hoist d e = withFallback e $ do
  (c,Dim _ ts e') <- extract (dimDef d) e
  return (Dim d ts (c <@ e'))

Note that the function withFallback is simply a synonym for fromMaybe. We can apply hoist to menu to obtain a different choice calculus representation, which produces the expected result (the line break was manually inserted).

> hoist "Dessert" menu

dim Dessert<yes,no> in dim Main<meat,pasta> in

Main<[Steak;Fries],[Pasta;Dessert<[Cake],[]>]>

There are two obvious shortcomings of the current definition for hoist. One problem is that moving the dimension might capture free Dessert choices.4 The other problem is that the Dessert decision might be made for nothing since it does not have an effect when the next decision in Main is to select meat.

The first problem can be easily addressed by extending the definition of hoist by a check for capturing unbound Dessert choices that returns failure (that is, Nothing) if d occurs free anywhere in e. This failure will be caught eventually by the withFallback function, which ensures that the original expression e is returned instead.

safeHoist :: Data a => Dim -> V a -> V a

safeHoist d e = withFallback e $ do
  (c,Dim _ ts e') <- extract (dimDef d) e
  if d `Set.member` freeDims e
    then Nothing
    else return (Dim d ts (c <@ e'))

The function freeDims returns a set (as Haskell's Data.Set) of the dimension names of unbound choices, as described in Section 4.

4 Capturing free desserts actually sounds quite appealing from an application point of view. :)


Exercise 17. The implementation of safeHoist prevents the lifting of the dimension d if this would cause the capture of d choices.

(a) What other condition could cause, at least in principle, the hoisting of a dimension to be unsafe (in the sense of changing the semantics of the variational list)?

(b) Why don't we have to check for this condition in the implementation of safeHoist? (Hint: Revisit the description of how the function find works.)

The second problem with the definition of hoist can be seen if we compare the semantics of the menu with the hoisted dimension with the semantics of the original expression (shown in Section 5.1).

dMenu = hoist "Dessert" menu

> psem $ dMenu

[Dessert.yes,Main.meat] => [Steak;Fries]

[Dessert.yes,Main.pasta] => [Pasta;Cake]

[Dessert.no,Main.meat] => [Steak;Fries]

[Dessert.no,Main.pasta] => [Pasta]

It is clear that the Dessert decision has no effect if the Main decision is meat. The reason for this is that the Dessert choice appears only in the pasta choice of the Main dimension. We can fix this by moving the Main choice plus its dimension declaration into the no alternative of the Dessert choice.

This modification is an instance of the following slightly more general transformation schema, which applies in situations in which a choice in dimension B is available only in one of the alternatives of all choices in another dimension A. (Here we show for simplicity the special case in which A has only one choice with two alternatives.) Such an expression can be transformed so that the selection of b1 is guaranteed to have an effect, that is, we effectively trigger the selection of a2 by copying the alternative, because the selection of a1 would leave the decision to pick b1 without effect.

dim B〈b1,b2〉 in dim A〈a1,a2〉 in A〈[a1],[a2;B〈b1,b2〉]〉
    ⇒  dim B〈b1,b2〉 in B〈[a2;b1],dim A〈a1,a2〉 in A〈[a1],[a2;b2]〉〉

Note that the selection of b2 does not have this effect since we can still select between a1 and a2 in the transformed expression. This transformation makes the most sense in the case when B represents an optional dimension, that is, b1 = yes, b2 = no, and b2 = ε, because in this case the selection of b2 = no makes no difference, no matter whether we choose a1 or a2.

This transformation can be extended to the case in which A has more than two alternatives and more than one choice, which requires, however, that each A choice contains the B choice in the same alternative k.

We will next define a function that can perform the required transformation automatically. For simplicity we assume that the choice in b to be prioritized (corresponding to the choice in B above) is contained in the second alternative of the choice in a (which corresponds to A above).


prioritize :: Data a => Dim -> Dim -> V a -> V a

prioritize b a e = withFallback e $ do
  (dA,ae) <- extract (dimDef a) e
  (cA,Chc _ [a1,a2]) <- extract (chcFor a) ae
  (cB,Chc _ [b1,b2]) <- extract (chcFor b) a2
  return $ dA <@ (Chc b [cB <@ b1,cA <@ (Chc a [a1,cB <@ b2])])

The function works as follows. Much like most transformations, it will first decompose the expression to be transformed into a collection of (nested) contexts and expressions, which are then used to build the result expression. Specifically, we first find the location of the dimension definition for a and remember it in the form of a context dA. Next, we find the context cA of the a choice. Finally, we find the choice to be prioritized in the second alternative of the a choice, a2. Both choices are found using the extract function with the predicate chcFor that finds a particular choice, similar to the dimDef predicate.

chcFor :: Dim -> Pred a
chcFor d (Chc d' _) = d == d'
chcFor _ _ = False

Having thus isolated all the required subexpressions, we can assemble the result by applying the contexts following the RHS of the above transformation schema.

Note that this transformation does not preserve the semantics; in fact, the reason for applying it is that it makes the semantics more compact. The transformation is, however, variant preserving; that is, no variants are added or removed, only the decisions to reach the variants have changed. This can be best seen by comparing the semantics of dMenu shown above with the semantics of dMenu with the Dessert choice prioritized over the Main choice.

> psem $ prioritize "Dessert" "Main" dMenu

[Dessert.yes] => [Pasta;Cake]

[Dessert.no,Main.meat] => [Steak;Fries]

[Dessert.no,Main.pasta] => [Pasta]

The prioritization of the Dessert choice has removed the effectively unavailable decision for meat in the case of yes for Dessert.

Exercise 18. The implementation of prioritize assumes that the choice to be lifted is located in the second alternative of the choice in a. Generalize the implementation of prioritize so that the choice in b can be lifted out of either alternative.

As a final example we illustrate how to combine the two previously defined transformations. In terms of the choice calculus, combining dimension hoisting and choice prioritization leads to a transformation that we call dependency inversion.

dim A〈a1,a2〉 in A〈[a1],[a2;dim B〈b1,b2〉 in B〈b1,b2〉]〉
    ⇒  dim B〈b1,b2〉 in B〈[a2;b1],dim A〈a1,a2〉 in A〈[a1],[a2;b2]〉〉

Reusing the definitions for hoist and prioritize, the definition of inversion is rather straightforward.


invert :: Data a => Dim -> Dim -> V a -> V a

invert b a = prioritize b a . hoist b

The definition of invert demonstrates that we can build more complicated variation programs out of simpler components and thus illustrates the compositional nature of our variation DSEL.
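As a small usage sketch (ours, not in the original text), invert performs the dessert-first restructuring of the original menu value from Section 5.1 in one step.

-- equivalent to: prioritize "Dessert" "Main" (hoist "Dessert" menu)
dessertFirst = invert "Dessert" "Main" menu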

5.4 Variation Programming Modes

To close this section, we share a few thoughts on the nature of variation programming. The two sections 5.2 and 5.3 have illustrated that making data structures variational leads to two different programming modes or attitudes. On the one hand, the focus can be on manipulating the data structure itself, in which case the variational parts are just maintained but not essentially changed. This is what Section 5.2 was all about. On the other hand, the focus can be on changing the variation in the data structure, in which case the existing represented objects are kept mostly intact. This is what Section 5.3 was concerned with.

The different ways of processing edits to data structures have been classified under the name of persistence [25]. Imperative languages typically support no persistence, that is, edits to data structures are destructive and make old versions inaccessible. In contrast, data structures in functional languages are by default fully persistent, that is, old versions are in principle always accessible as long as a reference to them is kept. (There are also the notions of partial persistence and confluent persistence that are not of interest here.) Variational data structures add a new form of persistence that we call controlled persistence because it gives programmers precise control over what versions of a data structure to keep and how to refer to them. In contrast to all other forms of persistence (or non-persistence), which happen rather automatically, controlled persistence requires a conscious effort on the part of the programmer to create and retrieve different versions of a data structure, and it keeps information about the versions around for the programmer to see and exploit.

6 Variational Software

The motivation for the choice calculus was the representation of variation in software, and having uncovered some basic principles of variation programming in Section 5, we are finally in a position to look at how we can put the choice calculus to work, through variation programming, on variational software.

As a running example we pick up the twice example that was introduced earlier in Section 2. We will introduce a representation of (a vastly simplified version of) the object language Haskell in Section 6.1, together with a number of supporting functions. After that we will consider in Section 6.2 several simple example transformations for variational Haskell programs.

6.1 Representing Variational Haskell

Following the example given in Section 5.1 we will first introduce a data type definition for representing Haskell programs and then extend it to allow for variations.


Because of the limitations of our current library that are imposed by the use of the SYB library [19], we have to make a number of simplifying assumptions and compromises in our definition. One constraint is that within a data type definition the V type constructor can be applied only on one type. This has several implications. First, we cannot spread the definition of Haskell over several data types. We actually would have liked to do that and have, for example, different data types for representing expressions and declarations (for values, functions, types, etc.). Since this is not possible, we are forced to represent function definitions using a Fun constructor as part of the expression data type. But this is not all. Ordinarily, we would represent parameters of a function definition by simple strings. However, since we want to consider as an example the renaming of function parameters, we would have to represent variational parameters by a type V String or so, which is unfortunately not possible since we have committed the V type constructor to the expression data type already. The solution to this problem is to represent function parameters also as expressions. Although we can ensure through the use of smart constructors that we build only function definitions that use variable names as parameters, this forced restriction on the representation is less than ideal.

Therefore, for the purpose of this tutorial, we will work with the following data type for representing Haskell expressions and programs.

data Haskell = App Haskell Haskell
             | Var Name
             | Val Int
             | Fun Name [Haskell] Haskell Haskell
             ...

As we did with lists, we can now add a constructor for introducing variational expressions.

type VHaskell = V Haskell

data Haskell = App Haskell Haskell
             | Var Name
             | Val Int
             | Fun Name [Haskell] Haskell Haskell
             ...
             | VHaskell VHaskell

Before we construct the representation of the variational twice function, we introduce a few more abbreviations and auxiliary functions to make the work with variational Haskell programs more convenient.

First, we introduce a function that turns a string that represents a binary function into a constructor for building expressions using that function. Consider, for example, the following simple Haskell expression.

2*x

When we try to represent this expression with the above data type, we have quite some work to do. First, we have to turn 2 and x into Haskell expressions using the constructors Val and Var, respectively. Then we have to use the App constructor twice to form the application. In other words, we have to write the following expression.


haskell :: Haskell -> VHaskell
haskell = Obj

choice :: Dim -> [Haskell] -> Haskell
choice d = VHaskell . Chc d . map haskell

(.+) = op "+"
(.*) = op "*"

x,y,z :: Haskell
[x,y,z] = map Var ["x","y","z"]

Fig. 5. Auxiliary functions for variational Haskell programs

App (App (Var "*") (Val 2)) (Var "x")

The function op defined below performs all the necessary wrapping for us automatically. (Less importantly, it also adds enclosing parentheses around the function name, which is exploited by the pretty printer to produce an infix representation.)

op :: Name -> Haskell -> Haskell -> Haskell

op f l r = App (App (Var ("(" ++ f ++ ")")) l) r

In Figure 5 we also define two infix operators that are defined as an abbreviation for a call to the op function. These are not essential but will make the twice example look even nicer. There we also introduce names for a few variable references. Moreover, in addition to the haskell synonym for the Obj constructor we also provide a smart constructor to build choices of Haskell expressions more directly.
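As a tiny illustration (ours, not from the original text), the abbreviations from Figure 5 let us build the expression 2*x without spelling out the App nesting by hand.

-- same value as App (App (Var "(*)") (Val 2)) (Var "x")
twoTimesX :: Haskell
twoTimesX = Val 2 .* x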

Finally, we define a function fun, which provides an abbreviation for the Fun constructor.

fun :: Name -> [Haskell] -> Haskell -> VHaskell

fun n vs e = haskell $ Fun n vs e withoutScope

withoutScope :: Haskell

withoutScope = Var ""

In particular, fun constructs a function definition with an empty scope, since in our example we are interested only in the definition of twice and not its uses.

With all these preparations, we can now represent the variational definition of twice in our DSEL as follows.

twice = Dim "Par" ["x","y"]
      $ Dim "Impl" ["plus","times"]
      $ fun "twice" [v] i
  where v = choice "Par" [x,y]
        i = choice "Impl" [v .+ v, Val 2 .* v]

For comparison, here is again the definition given in Section 2.

dim Par〈x,y〉 in
dim Impl〈plus,times〉 in
twice Par〈x,y〉 = Impl〈Par〈x,y〉+Par〈x,y〉,2*Par〈x,y〉〉

To check that this definition mirrors the one given in Section 2, we can evaluate twice (the line breaks were added manually).


> twice

dim Par<x,y> in

dim Impl<plus,times> in

twice Par<x,y> = Impl<Par<x,y>+Par<x,y>,2*Par<x,y>>

To check that this definition actually represents the desired four different implementations of twice we can compute its semantics.

> psem twice

[Par.x,Impl.plus] => twice x = x+x

[Par.x,Impl.times] => twice x = 2*x

[Par.y,Impl.plus] => twice y = y+y

[Par.y,Impl.times] => twice y = 2*y

Looking back at the definition of twice, notice how we have used Haskell's where clause to factor out parts of the definition. Whereas the definition of i is not really essential, the definition of v is, in fact, needed to avoid the copying of the parameter choice. In Section 2 we have seen how the share construct of the choice calculus facilitates the factorization of common subexpressions. We have earlier said that, for technical reasons, the current realization of the choice calculus as a Haskell DSEL does not support sharing, but we can see here that the situation is not completely dire since we can simulate the missing sharing of the choice calculus (at least to some degree) using Haskell's let (or where) bindings. Here is a slightly changed definition of the twice function that comes close to the example given in Section 2.

twice = Dim "Par" ["x","y"] $
        Dim "Impl" ["plus","times"] $
        let v = choice "Par" [x,y] in
        fun "twice" [v] (choice "Impl" [v .+ v, Val 2 .* v])

But recall from Section 4 that there is an important difference between Haskell's let and the share construct of the choice calculus, and that is the time when bindings are expanded. In the choice calculus shared expressions will be expanded only after all dimensions have been eliminated using tag selection, whereas in Haskell the expansion always happens before any selection.

6.2 Edit Operations for Variational Haskell

As an example for an editing operation we consider the task of turning a plain function definition into a variational one. To this end, we start with the plain variant of twice with parameter name x and implemented by +, and add dimensions to it.

xp = fun "twice" [x] (x .+ x)

Let us first consider the variation of the parameter name. In order to generalize the current definition xp, we need to do the following two things.

(1) Add a dimension declaration for Par.
(2) Replace references to x by choices between x and y.


The first step is easy and simply requires the addition of a dimension declaration (using the Dim constructor). The second step requires a traversal of the abstract syntax tree representing twice and the application of a transformation at all places where a variable x is encountered. This can be accomplished by employing the everywhere traversal function of the SYB library [19]. All we need is the definition of a transformation that identifies the occurrence of x variables and replaces them by choices. Such a transformation is indeed easy to define.5

addPar :: Haskell -> Haskell

addPar (Var "x") = choice "Par" [x,y]

addPar e = e

We can use this transformation as an argument for the everywhere traversal. Since everywhere is a generic function that must be able to traverse arbitrary data types and visit and inspect values of arbitrary types, the transformation passed to it as an argument must be a polymorphic function. The SYB library provides the function mkT that performs this task; that is, it generalizes the type of a function into a polymorphic one. We can therefore define the transformation to turn the fixed x variables in twice into choices between x and y as follows.

varyPar :: VHaskell -> VHaskell

varyPar = Dim "Par" ["x","y"] . everywhere (mkT addPar)

We can confirm that varyPar has indeed the desired effect.

> varyPar xp

dim Par<x,y> in twice Par<x,y> = Par<x,y>+Par<x,y>

A limitation of the shown transformation is that it renames all found variable names x and not just the parameter of twice. In this example, this works out well, but in general we have to limit the scope of the transformation to the scope of the variable declaration that is being varied. We can achieve this using the function inRange that we will introduce later. See also Exercise 21.

The next step in generalizing the function definition is to replace the addition-based implementation by a choice between addition and multiplication. This transformation works in exactly the same way, except that the function for transforming individual expressions has to do a more elaborate form of pattern matching on the expressions.

addImpl :: Haskell -> Haskell

addImpl e@(App (App (Var "(+)") l) r)
  | l == r = choice "Impl" [e, Val 2 .* r]
addImpl e = e

With addImpl we can define a transformation similar to varyPar that adds the variation of the implementation method as a new dimension.

5 Here the fact that we have to represent parameters as expressions comes to our advantage since we do not have to distinguish the different occurrences of variables (definition vs. use) and can deal with both cases in one equation.


varyImpl :: VHaskell -> VHaskell

varyImpl = Dim "Impl" ["plus","times"] . everywhere (mkT addImpl)

To verify the effect of varyImpl we can apply it directly to xp or to the variational program we have already obtained through varyPar xp.

> varyImpl xp

dim Impl<plus,times> in twice x = Impl<x+x,2*x>

> varyImpl (varyPar xp)

dim Impl<plus,times> in

dim Par<x,y> in

twice Par<x,y> = Impl<Par<x,y>+Par<x,y>,2*Par<x,y>>

We can see that the latter expression is not the same as twice since the dimensions occur in a different order. However, if we reverse the order of application for the two variation-adding transformations, we can verify that they indeed produce the same result as the hand-written definition for twice.

> varyPar (varyImpl xp) == twice

True

Exercise 19. One might think that even though the two expressions twice and varyImpl (varyPar xp) are not syntactically equal, their semantics might be, because, after all, they really represent the same variations. Explain why this is, in fact, not the case.

As a final example we consider the task of extending the parameter dimension by another option z, as we have illustrated in Section 2. This transformation involves the following steps.

(1) Extend the tags of the dimension declaration for Par by a new tag z.
(2) Extend all Par choices that are bound by the dimension declaration by a new alternative z.

The first step is rather straightforward and can be implemented using a similar approach to what we have done in Section 5.3, namely by extracting the definition, manipulating it, and putting it back.

However, the change to all bound choices is more complicated. This is because it is not sufficient to find one choice (or even a fixed number of choices), and we can't therefore simply reuse the extract function for this purpose. To deal with a variable number of choices we define a function inRange that applies a transformation to selective parts of a variational expression. More specifically, inRange takes a transformation f and two predicates on variational expressions, begin and end, that mark regions of the expression in which f is to be applied; that is, inRange effectively applies f to all nodes in the expression that are "between" nodes for which begin is true and nodes for which end is not true. The function works as follows. The expression to be transformed is traversed until a node is encountered for which the begin predicate yields True. Then the traversal continues and the transformation f is applied to all nodes encountered on the way until a node is found for which the predicate end yields True. In that case the traversal continues, applying f to other siblings of the matching end node, but does not descend beneath that node. When all descendants of a begin-matching node have been transformed or terminated by an end-matching node, the traversal continues until another node matching begin is found.

inRange :: Data a => (V a -> V a) -> (Pred a,Pred a) -> V a -> V a

Even though the implementation for inRange is quite elegant and not very complicated, we do not show it here (just as for find and <@) because it is based on navigational functions from the underlying zipper library. The interested reader can find the definition in the accompanying source code.

With the help of the function inRange we can now implement the transformation for extending a dimension. This function takes four parameters: the name of the dimension, the new tag, a function to extend the bound choices, and the expression in which to perform the update. It works as follows. First, we locate the definition of the dimension d to be extended and remember the position in the context c. We then perform the extension of all choices bound by d by applying the function inRange to the scope of the found dimension, e. Finding all the relevant choices is accomplished by the two predicates that are passed as arguments to inRange. The first, chcFor d, finds choices in the scope of d, and the second, dimDef d, stops the transformation at places where another dimension definition for d ends the scope. In this way the shadowing of dimension definitions is respected. Finally, we construct the result by inserting a dimension declaration with the new tag t and the changed expression e' into the context c.

extend :: Data a => Dim -> Tag -> (V a -> V a) -> V a -> V a

extend d t f e = withFallback e $ do
  (c, Dim _ ts e) <- extract (dimDef d) e
  let e' = f `inRange` (chcFor d,dimDef d) $ e
  return (c <@ Dim d (ts++[t]) e')

All we need now for the extension of twice by a new option z is a function for extending choices by new expression alternatives. This can be easily done using the following function addAlt.

addAlt :: V a -> V a -> V a

addAlt a (Chc d as) = Chc d (as ++ [a])

We can extend the variational expression twice as planned by employing extend and addAlt.

twiceZ :: VHaskell

twiceZ = extend "Par" "z" (addAlt (haskell z)) twice

To check whether the function works as expected, we can evaluate twiceZ.

> twiceZ

dim Par<x,y,z> in

dim Impl<plus,times> in

twice Par<x,y,z> = Impl<Par<x,y,z>+Par<x,y,z>,2*Par<x,y,z>>


Exercise 20. Define a function swapOptions that exchanges the two tags of a binary dimension and the corresponding alternatives in all bound choices.

Exercise 21. Define a function renamePar that adds a choice of parameter names to the definition of a specific function f by creating a dimension and corresponding choices that store the existing parameter name and a newly given name. Be careful to extend only those parameter names that are bound by f.

The function should be defined so that the expression renamePar xp "x" "y" produces the same result as varyPar xp.

The ability to programmatically edit variation representations is an important aspect of variation programming and our DSEL, one that we have barely scratched the surface of in this section. Identifying, characterizing, and implementing editing operations is also an important area for future research since it directly supports the development of tools for managing and manipulating variation.

7 Further Reading

In this section we provide some pointers to related work in the area of representing and transforming software variation. The purpose of this section is not to discuss the related work in depth or present a detailed comparison with the material presented in this tutorial, but rather to point to several important works in the literature concerning variation representation.

In general, the field of software configuration management (SCM) is concerned with managing changes in software systems and associated documents [38]. It is a subfield of the more general area of configuration management [21], which encompasses the theory, tools, and practices used to control the development of complex systems. Among the different kinds of SCM tools, revision control systems [26] are probably the most widely used; they manage changes to software and documents over time [37] and serve as repositories to facilitate collaboration [5]. In the context of revision control systems, the requirement to work on software in parallel with many developers leads to the problem of having to merge different versions of software [22]. As one interesting example of the many approaches in this field, the Darcs versioning system [8] provides a formalized [30] merge operation that can combine patches from separate branches.

The field of feature-oriented software development (FOSD) [2] takes the view that each piece of software offers a specific set of features and that these features can be modeled and implemented, at least to some degree, independently of one another. The goal is to represent features in a way that allows software to be assembled mostly automatically from these features. Features are a specific way of expressing variation in software, and approaches to FOSD are thus relevant and an interesting source of ideas for variation representation and transformation.

On a very high level, features and their relationships are described with the help of feature models, which can be expressed as diagrams [16], algebras [14], propositional formulas [3] (and more). Feature models describe the structure of software product lines (SPLs) [27, 29].

Approaches to the implementation of features can be roughly categorized into three different kinds.

First, annotative approaches express variation through a separate language. The most well-known annotative tool is the C Preprocessor (CPP) [13], which supports variation through #ifdef annotations, macro-expansion, etc. [35]. Even though very popular, the use of CPP often leads to code that is hard to understand [34]. A principal problem of CPP is that it cannot provide any kind of syntactic correctness guarantees for the represented variations, and consequently one can find many ill-formed variants in CPP-annotated software [20]. Other annotative approaches that, unlike CPP, respect the abstract syntax of the underlying object language and guarantee syntactic correctness of software variants include the CIDE tool [17], the TaP ("tag and prune") strategy [6], and the choice calculus on which this tutorial is based.

Second, probably the most popular approach in the area of FOSD is the compositional approach, in which features are implemented as separate building blocks that can be composed into programs. By selecting different sets of features, different program variants are created. This idea is often realized through extensions to object-oriented languages, such as mixins [4, 7], aspects [9, 18, 23], or both [24].

Third, in the metaprogramming approach, one encodes variability using metaprogramming features [31, 32] of the object language itself. Typical examples can be found in the realm of functional programming languages, such as MetaML [36], Template Haskell [33], or Racket [28].

8 Concluding Remarks

In this tutorial we have presented a formal model for representing variation and a DSEL that partially implements this model and extends it to the new domain of variation programming. We have illustrated variation programming with two extended examples: variational lists and variational Haskell programs. We would like to conclude with two final, take-home points about the motivation behind this research.

First, variation is a fact of software engineering life, but the current tools for managing this variation are often inadequate. We believe that the path to better support for variation is through a better understanding of the problems and the development of clear and reusable solutions. These things can only be achieved by establishing a simple, sound, and formal foundation on which a general theory of variation can be built. The choice calculus is a structured and flexible representation for variation that can serve as this foundation.

Second, in addition to the simple selection of variants, a structured variation representation offers many other opportunities for queries and transformations. In other words, the potential exists for variation programming. By integrating the representation offered by the choice calculus into a programming environment, this can be achieved. We have used Haskell for this purpose, but many other embeddings are conceivable.


References

1. Adams, M.D.: Scrap Your Zippers – A Generic Zipper for Heterogeneous Types. In: ACM SIGPLAN Workshop on Generic Programming, pp. 13–24 (2010)
2. Apel, S., Kästner, C.: An Overview of Feature-Oriented Software Development. Journal of Object Technology 8(5), 49–84 (2009)
3. Batory, D.: Feature Models, Grammars, and Propositional Formulas. In: Obbink, H., Pohl, K. (eds.) SPLC 2005. LNCS, vol. 3714, pp. 7–20. Springer, Heidelberg (2005)
4. Batory, D., Sarvela, J.N., Rauschmayer, A.: Scaling Step-Wise Refinement. IEEE Trans. on Software Engineering 30(6), 355–371 (2004)
5. Bernstein, P.A., Dayal, U.: An Overview of Repository Technology. In: Int. Conf. on Very Large Databases, pp. 705–712 (1994)
6. Boucher, Q., Classen, A., Heymans, P., Bourdoux, A., Demonceau, L.: Tag and Prune: A Pragmatic Approach to Software Product Line Implementation. In: IEEE Int. Conf. on Automated Software Engineering, pp. 333–336 (2010)
7. Bracha, G., Cook, W.: Mixin-Based Inheritance. In: ACM SIGPLAN Int. Conf. on Object-Oriented Programming, Systems, Languages, and Applications, pp. 303–311 (1990)
8. Darcs, darcs.net
9. Elrad, T., Filman, R.E., Bader, A.: Aspect-Oriented Programming: Introduction. Communications of the ACM 44(10), 28–32 (2001)
10. Erwig, M.: A Language for Software Variation. In: ACM SIGPLAN Conf. on Generative Programming and Component Engineering, pp. 3–12 (2010)
11. Erwig, M., Walkingshaw, E.: Program Fields for Continuous Software. In: ACM SIGSOFT Workshop on the Future of Software Engineering Research, pp. 105–108 (2010)
12. Erwig, M., Walkingshaw, E.: The Choice Calculus: A Representation for Software Variation. ACM Trans. on Software Engineering and Methodology 21(1), 6:1–6:27 (2011)
13. GNU Project: The C Preprocessor. Free Software Foundation (2009), gcc.gnu.org/onlinedocs/cpp/
14. Höfner, P., Khedri, R., Möller, B.: Feature Algebra. In: Misra, J., Nipkow, T., Sekerinski, E. (eds.) FM 2006. LNCS, vol. 4085, pp. 300–315. Springer, Heidelberg (2006)
15. Hoogle, http://haskell.org/hoogle/
16. Kang, K.C., Cohen, S.G., Hess, J.A., Novak, W.E., Peterson, A.S.: Feature-Oriented Domain Analysis (FODA) Feasibility Study. Technical Report CMU/SEI-90-TR-21, Software Engineering Institute, Carnegie Mellon University (November 1990)
17. Kästner, C., Apel, S., Kuhlemann, M.: Granularity in Software Product Lines. In: IEEE Int. Conf. on Software Engineering, pp. 311–320 (2008)
18. Kiczales, G., Hilsdale, E., Hugunin, J., Kersten, M., Palm, J., Griswold, W.G.: Getting Started with AspectJ. Communications of the ACM 44(10), 59–65 (2001)
19. Lämmel, R., Peyton Jones, S.: Scrap Your Boilerplate: A Practical Design Pattern for Generic Programming. In: ACM SIGPLAN Workshop on Types in Language Design and Implementation, pp. 26–37 (2003)
20. Liebig, J., Kästner, C., Apel, S.: Analyzing the Discipline of Preprocessor Annotations in 30 Million Lines of C Code. In: Int. Conf. on Aspect-Oriented Software Development, pp. 191–202 (2011)
21. MacKay, S.A.: The State of the Art in Concurrent, Distributed Configuration Management. Software Configuration Management: Selected Papers SCM-4 and SCM-5, 180–194 (1995)
22. Mens, T.: A State-of-the-Art Survey on Software Merging. IEEE Trans. on Software Engineering 28(5), 449–462 (2002)
23. Mezini, M., Ostermann, K.: Conquering Aspects with Caesar. In: Int. Conf. on Aspect-Oriented Software Development, pp. 90–99 (2003)
24. Mezini, M., Ostermann, K.: Variability Management with Feature-Oriented Programming and Aspects. ACM SIGSOFT Software Engineering Notes 29(6), 127–136 (2004)
25. Okasaki, C.: Purely Functional Data Structures. Cambridge University Press, Cambridge (1998)
26. O'Sullivan, B.: Making Sense of Revision-Control Systems. Communications of the ACM 52, 56–62 (2009)
27. Parnas, D.L.: On the Design and Development of Program Families. IEEE Trans. on Software Engineering 2(1), 1–9 (1976)
28. PLT: Racket (2011), racket-lang.org/new-name.html
29. Pohl, K., Böckle, G., van der Linden, F.: Software Product Line Engineering: Foundations, Principles, and Techniques. Springer, Heidelberg (2005)
30. Roundy, D.: Darcs: Distributed Version Management in Haskell. In: ACM SIGPLAN Workshop on Haskell, pp. 1–4 (2005)
31. Sheard, T.: A Taxonomy of Meta-Programming Systems, web.cecs.pdx.edu/~sheard/staged.html
32. Sheard, T.: Accomplishments and Research Challenges in Meta-programming. In: Taha, W. (ed.) SAIG 2001. LNCS, vol. 2196, pp. 2–44. Springer, Heidelberg (2001)
33. Sheard, T., Peyton Jones, S.L.: Template Metaprogramming for Haskell. In: ACM SIGPLAN Workshop on Haskell, pp. 1–16 (2002)
34. Spencer, H., Collyer, G.: #ifdef Considered Harmful, or Portability Experience With C News. In: USENIX Summer Technical Conference, pp. 185–197 (1992)
35. Stallman, R.M.: The C Preprocessor. Technical report, GNU Project, Free Software Foundation (1992)
36. Taha, W., Sheard, T.: MetaML and Multi-Stage Programming with Explicit Annotations. Theoretical Computer Science 248(1-2), 211–242 (2000)
37. Tichy, W.F.: Design, Implementation, and Evaluation of a Revision Control System. In: IEEE Int. Conf. on Software Engineering, pp. 58–67 (1982)
38. Tichy, W.F.: Tools for Software Configuration Management. In: Int. Workshop on Software Version and Configuration Control, pp. 1–20 (1988)
39. Wadler, P.: Theorems for Free! In: Conf. on Functional Programming and Computer Architecture, pp. 347–359 (1989)

Appendix: Solutions to Exercises

Exercise 1

The choice calculus expression represents all of the following definitions.

twice x = x+x twice y = y+y twice z = z+z

twice x = 2*x twice y = 2*y twice z = 2*z

This gives us six total variants. Adding another dimension with two tags for the function name produces the following twelve variants.

twice x = x+x twice y = y+y twice z = z+z

twice x = 2*x twice y = 2*y twice z = 2*z

double x = x+x double y = y+y double z = z+z

double x = 2*x double y = 2*y double z = 2*z


Exercise 2

We can simply add the definition of thrice using an Impl choice for the implementation method as follows.

dim Par〈x,y,z〉 in
dim Impl〈plus,times〉 in
share v = Par〈x,y,z〉 in
twice v = Impl〈v+v,2*v〉
thrice v = Impl〈v+v+v,3*v〉

Exercise 3

Here we create a second Impl dimension with three tags, and use a corresponding choice with three alternatives in the definition of thrice.

dim Par〈x,y,z〉 in
dim Impl〈plus,times〉 in
share v = Par〈x,y,z〉 in
twice v = Impl〈v+v,2*v〉
dim Impl〈plus,times,twice〉 in
thrice v = Impl〈v+v+v,3*v,v+twice v〉

Exercise 4

(a) Invalid
(b) Invalid
(c) Valid

Exercise 5

The result is 1 since the selection recursively descends into the chosen alternative with the same index, which is also the reason that it is not possible to select 2.

Exercise 6

When the ordering constraint is removed, we obtain an additional four entries for tuples which have a B tag in their first component.

〚dim A〈a1,a2〉 in A〈1,dim B〈b1,b2〉 in B〈2,3〉〉〛 =
  {(A.a1,1),((A.a2,B.b1),2),((A.a2,B.b2),3),
   ((B.b1,A.a1),1),((B.b1,A.a2),2),((B.b2,A.a1),1),((B.b2,A.a2),3)}

We can observe that the selection of either B tag has no influence on the result when A.a1 is chosen as the second tag, which reflects the fact that the B dimension is dependent on the selection of A.a2.


Exercise 7

The definition of V for choices is very similar to the case for trees.

Vρ(D〈e1,...,en〉) = {(δ1 · · · δn, D〈e′1,...,e′n〉) | (δ1,e′1) ∈ Vρ(e1), ..., (δn,e′n) ∈ Vρ(en)}

Exercise 8

The definitions can be obtained directly by partial application of the Dim and Chc constructors.

dimA = Dim "A" ["a1","a2"]

chcA = Chc "A"

Exercise 9

Expanding sharing before dimensions and choices are resolved duplicates the A dimension and will thus produce two independent decisions that result in a semantics with four variants.

〚share v = (dim A〈a1,a2〉 in A〈1,2〉) in (v,v)〛 =
  {((A.a1,A.a1),(1,1)),((A.a1,A.a2),(1,2)),
   ((A.a2,A.a1),(2,1)),((A.a2,A.a2),(2,2))}

Conversely, if we expand sharing after dimensions and choices are resolved, we get only one dimension, which leads to the following semantics.

〚share v = (dim A〈a1,a2〉 in A〈1,2〉) in (v,v)〛 =
  {(A.a1,(1,1)),(A.a2,(2,2))}

Exercise 10

The easiest solution is to employ the fmap function using an anonymous function to map an integer to a choice and apply it to ab.

> fmap (\i -> Chc "A" [Obj i, Obj (i+1)]) ab

dim A<a1,a2> in A<dim B<b1,b2> in B<A<1,2>,A<2,3>>,A<3,4>>

The type of the result is V (V Int).

Exercise 11

The monadic instance for V lets us combine the variational value and the variational list using a standard monadic approach. Here we employ the do notation in the definition.

vvcons :: V a -> VList a -> VList a

vvcons vx vl = do {x <- vx; vcons x vl}
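For comparison (this reformulation is ours, not part of the original solution), the same function can be written with an explicit bind instead of do notation.

-- equivalent definition using >>= directly
vvcons' :: V a -> VList a -> VList a
vvcons' vx vl = vx >>= \x -> vcons x vl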


Exercise 12

We show the code for approach (b). In our definition we can reuse the function vvcons defined in Exercise 11.

fullMenu :: Menu

fullMenu = Dim "Main" ["meat","pasta"] $
           Dim "Dessert" ["yes","no"] $
           chc "Main" [Obj Steak,Obj Pasta] `vvcons`
           chc "Dessert" [vsingle Cake,vempty]

Exercise 13

We can observe that the type of sumL is similar to that of len (it is an instance), which indicates that the function definition will have the same structure.

sumL :: List Int -> V Int

sumL Empty = obj 0

sumL (Cons x xs) = fmap (x+) (sumL xs)

sumL (VList vl) = vl >>= sumL

The definition for vsum is obtained through simple lifting.

vsum :: VList Int -> V Int

vsum = liftV sumL

Exercise 14

The type of rev indicates that it preserves the overall structure of the list values to be processed. Therefore, the last case can be defined using fmap.

rev :: List a -> List a

rev Empty = Empty

rev (Cons x xs) = rev xs `cat` single x

rev (VList vl) = VList (fmap rev vl)

The definition for vrev can also use the fmap function.

vrev :: VList a -> VList a

vrev = fmap rev

Exercise 15

The definition for filterL has in principle the same type structure, at least as far as the transformed list is concerned, and therefore follows the same pattern as the definition for rev.

filterL :: (a -> Bool) -> List a -> List a

filterL p Empty = Empty

filterL p (Cons x xs) | p x       = Cons x (filterL p xs)
                      | otherwise = filterL p xs

filterL p (VList vl) = VList (fmap (filterL p) vl)


The definition for vfilter should be obvious given the solution for vrev.

vfilter :: (a -> Bool) -> VList a -> VList a

vfilter p = fmap (filterL p)

Exercise 16

The interesting cases in the definition of zipL are the last two, where zipL, partially applied to one list, is distributed over the elements of the respective other list using fmap.

zipL :: List a -> List b -> List (a,b)

zipL Empty ys = Empty

zipL xs Empty = Empty

zipL (Cons x xs) (Cons y ys) = Cons (x,y) (zipL xs ys)

zipL (VList vl) ys = VList (fmap (`zipL` ys) vl)
zipL xs (VList vl') = VList (fmap (xs `zipL`) vl')

The definition for vzip simply injects the result of applying zipL, which is of type List (a,b), into the type VList (a,b).

vzip :: VList a -> VList b -> VList (a,b)

vzip vl vl' = list $ zipL (VList vl) (VList vl')

Exercise 17

(a) Another potential problem for hoisting is the reordering of dimensions. Consider, for example, the following variational list that contains two occurrences of an A dimension.

> dimA $ chc'A [1,2] `vvcons` (dimA $ vsingle 9)

dim A<a1,a2> in A<[1;dim A<a1,a2> in [9]],[2;dim A<a1,a2> in [9]]>

The semantics reveals that the decision in the second, rightmost dimension does not really have any effect on the plain results, which is not surprising since the dimension binds no choice.

[A.a1,A.a1] => [1;9]

[A.a1,A.a2] => [1;9]

[A.a2,A.a1] => [2;9]

[A.a2,A.a2] => [2;9]

Now consider the following variation of the above expression in which the rightmost A dimension has been lifted to the top level.

> dimA $ dimA $ chc'A [1,2] `vvcons` (vsingle 9)

dim A<a1,a2> in dim A<a1,a2> in A<[1;9],[2;9]>

This expression can be the result of hoisting the rightmost occurrence of the A dimension. This hoisting does not capture any free choices, but it does reorder the two dimensions, which leads to a different semantics.


[A.a1,A.a1] => [1;9]

[A.a1,A.a2] => [2;9]

[A.a2,A.a1] => [1;9]

[A.a2,A.a2] => [2;9]

(b) We don't have to check for reordering since we always find the topmost, leftmost dimension definition, which, when hoisted, cannot swap positions with other dimensions of the same name since there are none on the path from the root to the topmost, leftmost dimension definition.

Exercise 18

Instead of extracting the choice in dimension b directly from a2, in prioritize' we attempt to extract it first from a1, then from a2. In order to make this definition more concise, we introduce several helper functions in the body of prioritize'. The functions fstAlt and sndAlt describe how to reassemble the alternatives if the choice in b is found in the first or second alternative, respectively. The tryAlt function takes one of these functions as an argument, along with the corresponding alternative, and tries to find a choice in dimension b. If it succeeds, it will return the reassembled expression, otherwise it will return Nothing. Finally, in the last line of the function, we employ the standard mplus function from the MonadPlus type class to combine the results of the two applications of tryAlt. This will return the first of the two applications that succeeds, or Nothing if neither succeeds (in which case, the fallback expression e will be returned from prioritize').

prioritize’ :: Data a => Dim -> Dim -> V a -> V a

prioritize' b a e = withFallback e $ do
  (dA,ae) <- extract (dimDef a) e
  (cA,Chc _ [a1,a2]) <- extract (chcFor a) ae
  let fstAlt cB b1 b2 = [cA <@ Chc a [cB <@ b1,a2],cB <@ b2]
  let sndAlt cB b1 b2 = [cB <@ b1,cA <@ Chc a [a1,cB <@ b2]]
  let tryAlt f ai = do
        (cB,Chc _ [b1,b2]) <- extract (chcFor b) ai
        return $ dA <@ Chc b (f cB b1 b2)
  tryAlt fstAlt a1 `mplus` tryAlt sndAlt a2

Note that this function still makes a few assumptions, such as that the involved dimensions are binary (contain two options), and that a choice in dimension b is contained in only one of the two alternatives. Making this function more robust is left as an exercise for the especially thorough reader.

Exercise 19

Even though the two expressions produce the same variants, the ordering of the tags will be different in the decisions (the domain of the mapping yielded by the semantics). That is, in the semantics of twice the tags in the Par dimension appear first in each decision, while in varyImpl (varyPar xp) the Impl tags appear first.


Exercise 20

This editing function can be defined in a similar way to extend. First we will find the relevant dimension, then swap the alternatives of all bound choices. We again reuse extract and inRange for these tasks.

swapOptions :: Data a => Dim -> V a -> V a
swapOptions d e = withFallback e $ do
  (c, Dim _ [t,u] e) <- extract (dimDef d) e
  let e' = swapAlts `inRange` (chcFor d,dimDef d) $ e
  return (c <@ Dim d [u,t] e')

The helper function swapAlts, passed as the transformation function to inRange, exchanges the two alternatives of a binary choice.

swapAlts :: V a -> V a

swapAlts (Chc d [a,b]) = Chc d [b,a]

Note that these definitions assume that the dimension to be swapped is binary, and that all bound choices have the appropriate number of tags. A pattern-matching error will occur if these assumptions do not hold, though the solution can easily be made more robust.

Exercise 21

The renamePar operation is similar to the addPar operation defined at the beginning of Section 6.2. The difference is that instead of replacing every variable x with a choice between x and y, we want to find the first function definition in the expression, and apply our changes to this definition only. The following helper function is used by extract to find the first function definition in a VHaskell expression.

firstFun :: Pred Haskell

firstFun (Obj (Fun _ _ _ _)) = True

firstFun _ = False

A second helper function, renameRef, serves as a generalized version of addPar that takes two variable names as parameters. The first is the name of the variable to change; the second is the new variable name.

renameRef :: Name -> Name -> Haskell -> Haskell
renameRef x y (Var x')
  | x == x'   = choice "Par" [Var x,Var y]
  | otherwise = Var x'
renameRef _ _ e = e

Finally, we are able to define renamePar as follows. After finding the first function definition f, we apply the edit described by renameRef to the arguments and body of f. We do not apply the changes to the scope of f, thereby isolating the change to the first function definition only.


renamePar :: VHaskell -> Name -> Name -> VHaskell
renamePar e x y = withFallback e $ do
  (c, Obj (Fun f as b scope)) <- extract firstFun e
  let as' = everywhere (mkT (renameRef x y)) as
  let b' = everywhere (mkT (renameRef x y)) b
  return (c <@ Dim "Par" [x,y] (Obj (Fun f as' b' scope)))

This function could be generalized in several ways, such as by introducing parameters for the new dimension name or for the name of the function to apply the change to (rather than just the first one found).


Leveraging Static Analysis in an IDE

Robert M. Fuhrer

Google, 76 9th Avenue, New York, NY, 10017, [email protected]

Abstract. In recent years, Integrated Development Environments (IDEs) have risen from nicety to an essential part of most programming language tool-chains. Indeed, they are widely seen as critical to the widespread adoption of a new programming language. This is due in part to the emergence of a higher-level dialogue between a developer and his code, made possible by advanced tooling that aids in navigating, understanding, and manipulating code. In turn, much of this advanced tooling relies heavily on various forms of static analysis.

Unfortunately, many practitioners of static analysis methods are not well skilled in incorporating their analyses into an IDE context. The result is often high-powered tools that lack key usability characteristics, and thus fall short of their potential to assist developers.

This tutorial attempts to help bridge the skill gap by describing several applications of static analysis within an IDE setting. We describe the computation and presentation of type hierarchy information, data flow information in the form of def/use chains, the use of a type inferencing engine to detect type-related code "smells", and the use of memory effects analysis to determine the safety of certain parallelization-related refactorings.

Keywords: static analysis, IDEs, software development tooling.

1 Introduction

In recent years, Integrated Development Environments (IDEs) have risen from nicety to an essential part of most programming language tool-chains. Indeed, IDEs are widely seen as critical to the widespread adoption of a new programming language, so that the language and IDE are released together. In recognition of this fact, operating system vendors such as Microsoft and Apple now offer high-quality IDEs for their respective platforms, to lower the barrier to adoption. In the embedded space, the success of the Arduino [21,3] platform seems to be due in part to the ready availability of its IDE, which simplifies and largely hides the fairly complex cross-platform toolchain, so that even hobbyists are able to program a variety of these useful devices with ease.

The ascendancy of IDEs is also due in part to the emergence of a class of advanced tools that enable a higher-level dialogue between a developer and his code. Eclipse, IntelliJ's IDEA and Microsoft's Visual Studio all feature advanced tools for visualization, error detection, performance analysis and code manipulation. These tools help shift the developer's focus from low-level textual editing operations to higher-level properties (e.g. aspects of program correctness such as type or thread safety, or aspects of program structure such as encapsulation or reusability) and activities (e.g. restructuring via refactoring). The result is an environment that actively aids developers in navigating, understanding, debugging and even manipulating their code. In turn, much of this advanced tooling relies heavily on various forms of static analysis.

Unfortunately, many practitioners of static analysis methods are not well versed in exposing or incorporating their analyses in an IDE context. An all-too-frequent result is a high-powered tool that lacks key usability characteristics or fails to integrate well into the task work-flow. As a simple example, many tools perform sophisticated analyses, but present their results as a textual dump to a file or window, requiring the developer to correlate the results manually to the source code from which they derive. The work-flows that ensue are tedious and error-prone, and fall short of the tools' potential to assist developers.

This tutorial attempts to help bridge the skill gap by describing several applications of static analysis within an IDE setting. In each case, we present the underlying analysis, and show how the resulting information can be exposed in a suitable way in the IDE, say as highlights in a textual source editor, or as a distinct view that focuses on a specific aspect of the program's structure or behavior. In particular, we describe the computation and presentation of type hierarchy information, data flow information in the form of def/use chains, the use of a type inferencing engine to detect type-related code "smells", and the use of memory effects analysis to determine the safety of certain parallelization-related refactorings.

In each subsequent section, we present the basic formalisms needed to understand the analysis being performed, along with the IDE infrastructure needed to expose or make use of the information in some suitable manner. For space reasons, the sections do not present every bit of code necessary to complete the implementation, but in each case it should be fairly obvious how to fill in the missing bits. Instead, the functionality shown is basic, but still should serve as a reasonable foundation for a more full-featured, robust implementation.

Likewise, given the intended emphasis of this tutorial, we make no attempt to present sophisticated and highly precise analyses; rather, the analyses described are simple but useful. This will allow us to show a more complete picture of how the pieces fit together, with most of the important details intact.

Eclipse [10] is used throughout as the base application/user-interface framework, since it is open-source, and provides a reasonable set of APIs necessary to build such tools. In some sections, APIs are also used from both the Java Development Toolkit (JDT [11]) and the IDE Meta-tooling Platform (IMP [7]), an eclipse.org project that provides a language-independent framework for building language-specific IDEs. Nevertheless, although Eclipse is used as a demonstration vehicle in this tutorial, there is little or no inherent bias in the techniques shown toward any specific IDE. In other words, the techniques do not make extensive demands of the IDE framework, and are thus applicable to many existing extensible programming environments. In the few cases where such demands may not be directly met by other environments, indications are given as to how to adapt the techniques accordingly.

The outline of the remainder of the paper is as follows. Section 2 briefly discusses some related work. Section 3 presents the basic IDE facilities used by the techniques of subsequent sections. Section 4 describes the extraction of type inheritance information from Java source code and its presentation in a tree view. Section 5 details the computation of local use-def relationships within Java methods, and their presentation by means of textual highlights in the source editor. Section 6 shows a basic memory effects analysis that is used as the underpinning for a simple parallelization refactoring for the X10 [8] language. Next, Section 7 presents a type inference engine for Java, and its use in a type-related "code smell" detector. Finally, we make some concluding remarks in Section 8.

2 Related Work

There are several existing mechanisms for expressing static analyses that are relevant in the IDE context; we briefly review a few here. Rascal [20] and Stratego [6] are based on declarative specifications of syntax and semantics, while JastAdd [12] uses Java-based attribute grammars for the same purpose. All three of these alternatives have also been used to provide key IDE services such as syntax highlighting, content completion, and content outlining, but offer only a restricted subset of the services that full-fledged IDEs typically provide. The declarative nature of the formalisms they use has the very appealing property that various details of representation and API are hidden from the IDE developer. Further, Rascal and Stratego are interpreted, removing compilation from the edit-test-debug cycle, which makes them especially attractive for prototyping analyses.

Of these alternatives, arguably only JastAdd is suitable for implementing a full-blown compiler per se, which is clearly a critical part of an IDE for any given language.1

Also, all three require that the parser, AST, and related components be implemented using the fixed representations provided by their respective frameworks. This makes them unsuitable for use with existing compiler front-ends and AST hierarchies, forcing the developer to re-implement the many subtle details of a language's syntax and semantics. Aside from the engineering cost, this also has the side-effect of introducing the possibility of deviation between the compiler's and the IDE's interpretation of a given piece of source text.

Hence, the strategy used throughout this paper is to use components that allow the use of an arbitrary implementation strategy for the front-end on which the analyses rely. This strategy trades broad applicability to languages and the potential to harness static analysis in a wider variety of user interface affordances, for a sometimes nontrivial cost of additional implementation effort. Our intention is that the prime focus of the paper, that of integrating static analyses in the rich and user-focused setting of an IDE, is nevertheless well addressed. A more detailed comparison of the techniques described herein to these alternative approaches is beyond the scope of this paper.

3 Core IDE Facilities

This section describes some basic underlying IDE functionality used throughout this tutorial, including the IMP parsing API, the IMP Program Database (PDB), and IMP's indexing facility. Although IMP's implementation is necessarily Eclipse-specific, analogous functionality could easily be built for other IDE frameworks, such as Visual Studio or IDEA. As such, the techniques described here can be applied to other frameworks, given the capabilities described in Section 3.1.

1 JastAdd has in fact been used to implement a fully-compliant compiler for Java 1.5.

3.1 Basic IDE Requirements

One should be able to replicate most of the results described herein using any development platform that provides:

– the kind of building blocks common to most general-purpose user interface (UI) toolkits:
  • configurable menus,
  • list, tree (and perhaps, graph) widgets,
  • access to the view that currently has "focus" (i.e., the view that receives keystrokes),
  • access to the (possibly unsaved changes to the) contents in a text editor buffer,
  • access to the currently-selected region of an open text editor,
  • the ability to register for notification of changes to the current selection in a text editor, and
  • the ability to apply differing text attributes to various regions of the text in a text editor;
– the ability to annotate regions of any given source file with error/information messages,
– the ability to register for notification of changes to a given set of resources (e.g. source files or folders), and
– (optional) the ability to augment the set of initiated build activities.

3.2 Parsing

IMP provides a standard interface for interacting with any source text parser, IParseController. Its primary entry point is parse(...), which is responsible for parsing source text and producing a corresponding abstract syntax tree (AST). In some implementations, the AST is simply the raw result of parsing, but to ease the support of various advanced IDE functions, the AST has typically also been subjected to semantic processors such as name resolution or type inference. These processors "decorate" the AST nodes with additional information, such as the entity to which a reference binds, or the inferred type of an expression node.

Note however that IMP does not define or prescribe any interface for the AST as a whole nor for the individual AST nodes. Moreover, it does not assume that the aforementioned AST decorations exist. Rather, all of its IDE services treat the AST and each node as opaque objects that various language-specific service implementations are responsible for interpreting. This characteristic lack of bias makes it possible to integrate existing compiler front-ends into the kind of high-functionality IDEs that IMP is designed to support.


3.3 The IMP Program Database (PDB)

The Program Database is a repository of information about program artifacts – source code, libraries, and so on – known to the IDE.2 Information in the form of facts is extracted from program artifacts by fact generators and fact updaters, or refined from other facts by analyzers, and stored in one or more FactBases. The information is then used to drive further analyses and various IDE views, as shown in Figure 3.1.

[Figure: the PDB sits between workspace files/directories, fact generators/updaters, and IDE views; resource changes trigger fact creations/updates, and views issue fact queries against one or more FactBases.]

Fig. 3.1. The PDB and its role in the IDE

As shown in Figure 3.2, a PDB FactBase comprises a set of facts. A fact is a pair, consisting of a fact key and a fact value. The fact key is itself a pair of a fact type and a fact context. Some relevant interfaces are shown in Listing 3.1.

The fact type identifies the kind of information, for example, a type hierarchy, a set of symbol references, a call graph of a given precision, a control-flow graph, and so on. Each fact type has an associated schema specifying the structure of the values of all facts of that type, expressed as a Type. The fact value can be taken from a rich set of data types, including Booleans, strings, integers, reals, numeric ranges, source locations, tuples, sets, lists, relations and maps. In addition, the PDB API provides a set of powerful operators over these types, including projection, selection, join, composition, closure, and the like. In fact, many classical analyses can be expressed solely using these high-level operators.

The fact context typically identifies the space of program elements that the fact describes. The base interface IFactContext is just a tag interface; all information that identifies the relevant set of program elements is exposed by various sub-interfaces, such as ISourceEntityContext.

2 Artifacts are "known" to the IDE by virtue of being stored in the IDE's set of working files or directories, or by being referenced by same.


public interface IFactKey {
    Type getType();
    IFactContext getContext();
}

public interface IFactContext { }

public interface ISourceEntityContext extends IFactContext {
    ISourceEntity getEntity();
}

Listing 3.1. PDB Fact Interfaces

[Figure: a FactBase contains a set of facts, each a pair of a key { type: T, context: C } and a value V.]

Fig. 3.2. PDB FactBase structure

Often, the context designates the root entity that contains all resources to be included in the fact's information, e.g., a project or source folder. ISourceEntityContext expresses the set of resources involved in terms of an IMP ISourceEntity, which may refer to the workspace, a project, a folder, or a (set of) source file(s). Other kinds of contextual information are possible, however, such as a set of program entry points to be used as the roots of a call graph.

Using the PDB. The PDB framework decouples fact clients from fact producers. Clients that wish to use PDB facts simply query a given FactBase for the fact with a given type and context. If the FactBase already contains such a fact, its value is returned; otherwise, if a suitable fact producer exists, it is invoked, and the resulting fact value is returned.

The FactBase acts as the locus for fact management, and in general, clients are responsible for managing the lifetime of a FactBase. This arrangement leaves the clients, who are likely the only entities with enough information to determine a fact's life-span, in control of its life-span.


public interface IFactGenerator {
    IValue generate(Type type, IFactContext context,
                    Map<IResource, IndexedDocumentDescriptor> workingCopies)
        throws AnalysisException;
}

public interface IFactUpdater {
    void update(FactBase factBase, Type type, IFactContext context,
                IResource res, int changeKind,
                Map<IResource, IndexedDocumentDescriptor> workingCopies)
        throws AnalysisException;

    void update(FactBase factBase, Type type, IFactContext context,
                IResource res,
                Map<IResource, IndexedDocumentDescriptor> workingCopies)
        throws AnalysisException;
}

Listing 3.2. Fact Producer Interfaces

Thus, if a fact is to be long-lived, it is simply placed within a long-lived FactBase. When the clients of a given FactBase are done with the facts in that FactBase, they simply discard references to that FactBase, and all associated data is reclaimed.

PDB Fact Producers. Fact producers are registered with the PDB framework via an IMP-supplied Eclipse extension point.3 Each fact producer extension identifies the set of fact types that it is capable of producing, along with a generator factory that is responsible for creating a fact producer for a given fact type. The only requirement of a fact producer is that it can consume a fact key, and create a suitable PDB IValue. With this design, arbitrary analysis engines can be used to contribute facts to a PDB FactBase for use by arbitrary client tools.

As shown in Listing 3.2, there are two principal producer-related interfaces: IFactGenerator and IFactUpdater. The former is intended to create a fact with a given key when no such fact yet exists. The latter is called in response to resource changes (e.g., file creation, modification, and deletion) or source editor document changes. In both cases, the producer updates the value associated with the given fact accordingly.
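For concreteness, the following is a minimal sketch of a generator for a hypothetical "declared type names" fact. The interface signatures are those of Listings 3.1 and 3.2; resourcesOf(), parse(), Ast and ValueFactory are illustrative placeholders standing in for whatever front-end and PDB value-construction APIs a real implementation would use.

// Sketch only: produce a set-valued fact containing the names of all types
// declared in the resources covered by the fact's context.
public class DeclaredTypesFactGenerator implements IFactGenerator {
    public IValue generate(Type type, IFactContext context,
                           Map<IResource, IndexedDocumentDescriptor> workingCopies)
            throws AnalysisException {
        ISourceEntityContext sec = (ISourceEntityContext) context;
        Set<String> typeNames = new HashSet<String>();
        // resourcesOf() and parse() are assumed helpers (not IMP API): enumerate the
        // source files under the context's root entity and obtain an AST for each,
        // preferring an open editor's working copy when one exists.
        for (IResource res : resourcesOf(sec.getEntity())) {
            Ast ast = parse(res, workingCopies.get(res));
            typeNames.addAll(ast.declaredTypeNames());
        }
        // ValueFactory stands in for the PDB's value-construction API.
        return ValueFactory.setOfStrings(typeNames);
    }
}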

3.4 Indexing and the PDB

IMP's indexing facility automatically tracks resource and source editor changes that are relevant to a set of facts for which indexing has been requested, and invokes the proper fact producers as needed to keep the facts in sync.

3 An Eclipse extension point is a named entity that manages a set of implementers of a given extension. Extension points are used liberally throughout Eclipse to identify contributors of both user-interface elements such as menus, key-bindings, views, and non-UI-related contributions such as project builders.


Each instance of the indexing class Indexer maintains its own FactBase, along with a set of keys for facts that are to be kept up-to-date with respect to resource changes. Clients add to this list by calling the method keepFactUpdated(), passing a fact key. In the current implementation, such keys must have contexts of type ISourceEntityContext so that the indexer can determine which resources affect the given fact.

As of this writing, although the PDB provides notifications to fact updaters indicating what relevant source artifacts have changed, it does not offer any help in computing the updated fact, nor in determining what portions of a given fact's value might be reused when computing the update. We refer the reader to the considerable existing research in the area of incremental static analysis [17,22,31,16].

To handle cases where a given kind of project-scope index (say, of the set of all defined types) is used extensively by the IDE, IMP also supports a mechanism for clients to specify that a given type of fact should be created automatically for new projects. This off-loads resource event notifications and processing from language-specific code to the IMP framework.

4 Computing and Presenting Type Hierarchies

One of the most-used bits of machinery in a modern IDE is the type hierarchy, as shown in Figure 4.1, which helps navigate large object-oriented programming code-bases. This view shows the inheritance relationships among classes and interfaces that are vital to understanding the program's structure.

Fig. 4.1. Screenshot of an Eclipse type hierarchy view


In this section, we present a means of extracting such information from the source code, updating the information in response to source changes, and presenting the information in a suitable view. The IMP API described also provides support for persisting the extracted information to disk, in order to avoid unnecessary recomputation of infrequently-changed portions of the program. However, the use of that mechanism is straightforward, and is not described here.

4.1 Presenting Multiple and Interface Inheritance

In many object-oriented languages, the presence of multiple inheritance (as in C++) or interfaces (as in Java) gives rise to a set of entities that do not form a proper hierarchy.4

In such cases, a strict tree view obviously does not suffice to present all of the inheritance relationships, which form a directed acyclic graph (DAG) rather than a tree. As a result, a more general graph view is required.

In spite of this fact, some environments (such as Eclipse's Java Development Toolkit) accept the inaccuracy inherent in the use of a tree view, in order to gain a more visually compact and more easily navigable presentation, as well as much faster rendering. This is typically accomplished by duplicating portions of the graph where multiple inheritance appears. For example, if the view depicts the super-types of a given type T, and multiple inheritance paths exist from T to some base type B, B and its super-types will appear multiple times in the "tree" of T's super-types, once for each such inheritance path.

4.2 Type Hierarchy Computation

The type inheritance information must be extracted from both the source code and the relevant application and/or language libraries. To do this, we use the IMP Program Database (PDB) and the associated indexing mechanism, along with language-specific code for

– parsing the source code and producing a corresponding Abstract Syntax Tree (AST), and
– visiting the AST, extracting the inheritance relationships, and recording them in the PDB.

The following subsections give some necessary background on the API used for computing type hierarchies.

Type Inheritance Extraction. The basic tasks needed to implement the extraction of the type inheritance relationships are:

– Obtain an AST for each resource to be analyzed.
– Run a visitor over each AST to find all of the supertype declarations, and record them in a PDB fact.

4 In Java, the set of classes does in fact form a hierarchy, but the set of classes and interfaces does not. Thus if the focus is restricted to classes per se, a tree presentation is entirely appropriate and accurate.


For the first task we implement an IFactGenerator, whose principal method is generate(). This is shown in Listing A.1, and iterates over the resources in the fact context, obtaining an AST for each one. Note that the AST is either obtained from the file, or from the "working copy", which corresponds to the contents of a currently-open editor document, if one exists for that resource.

To actually create fact values of the appropriate types, we must first declare those types, as shown in Listing A.2. A so-called alias type is simply a mechanism for giving a type with a specific interpretation a name, for clarity's sake. In that sense, it is very much analogous to a C-language typedef.

For the second task, we simply implement a visitor over the AST. In this example, we have used the Polyglot compiler framework [27], whose AST class hierarchy supports the visitor design pattern. In this simple example, the only place where inheritance information appears is in ClassDecl entities. The implementation is shown in Listing A.3.
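Roughly, such a visitor has the following shape. The sketch below assumes Polyglot-style accessors (name(), superClass(), interfaces()); exact signatures vary between Polyglot versions, and packaging the collected pairs into a PDB relation value is left to the fact generator.

// Sketch only: harvest (subtype, supertype) name pairs from ClassDecl nodes.
// The visitor method and ClassDecl accessors follow Polyglot conventions but
// should be checked against the Polyglot version actually in use.
public class InheritanceVisitor extends NodeVisitor {
    private final List<String[]> fPairs = new ArrayList<String[]>();

    public Node leave(Node old, Node n, NodeVisitor v) {
        if (n instanceof ClassDecl) {
            ClassDecl cd = (ClassDecl) n;
            String sub = cd.name();
            if (cd.superClass() != null) {
                fPairs.add(new String[] { sub, cd.superClass().toString() });
            }
            for (Object tn : cd.interfaces()) {   // each element is a TypeNode
                fPairs.add(new String[] { sub, tn.toString() });
            }
        }
        return n;
    }

    public List<String[]> inheritancePairs() { return fPairs; }
}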

4.3 Type Hierarchy Presentation

As part of its user interface framework, Eclipse provides a generic tree view which, like many Eclipse viewer-based views, can be instantiated with just two entities: one to define the contents of the view, and one to define the presentation characteristics of items appearing in the view. The former is known as an IContentProvider, or, in the tree case, an ITreeContentProvider specifically. The latter entity is known as a label provider, and implements the interface ILabelProvider.

Listing A.4 shows the source code for a very simple view that makes use of the content and label providers given in Listing A.5.

The content provider is given the Map from ITuple to ITuple as its input by the TypeHierarchyView via a call to setInput(). It then simply exposes the sub-/super-type relationships recorded in that map in its implementation of the getChildren() and related interface methods.

The label provider uses the fully-qualified type name, as stored in the first component of the ITuple that represents the type in the map, as the text label for a given type. Custom element images are optional, and are omitted in this implementation for brevity.
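For readers without Listings A.4/A.5 at hand, the following simplified sketch conveys the shape of the two providers, using a plain Map from supertype name to subtype names in place of the ITuple-based facts used by the actual implementation.

import java.util.*;
import org.eclipse.jface.viewers.*;

// Sketch only: content/label providers over a supertype -> subtypes map.
class HierarchyContentProvider implements ITreeContentProvider {
    private Map<String, List<String>> fSubtypes = Collections.emptyMap();

    public void inputChanged(Viewer viewer, Object oldInput, Object newInput) {
        fSubtypes = newInput != null ? (Map<String, List<String>>) newInput
                                     : Collections.<String, List<String>>emptyMap();
    }
    public Object[] getElements(Object input) {
        // Roots: types that never appear as anyone's subtype.
        Set<String> roots = new HashSet<String>(fSubtypes.keySet());
        for (List<String> subs : fSubtypes.values()) roots.removeAll(subs);
        return roots.toArray();
    }
    public Object[] getChildren(Object parent) {
        List<String> subs = fSubtypes.get(parent);
        return subs != null ? subs.toArray() : new Object[0];
    }
    public boolean hasChildren(Object element) { return getChildren(element).length > 0; }
    public Object getParent(Object element) { return null; }   // not needed for display
    public void dispose() { }
}

class HierarchyLabelProvider extends LabelProvider {
    public String getText(Object element) { return element.toString(); }   // fully-qualified name
}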

This view is currently quite basic, and shows the entire type hierarchy in the given context, starting from the root type(s) at all times. One obvious and important extension would permit the user to specify a type that roots a sub-hierarchy to display. This is fairly straightforward, given the infrastructure shown thus far.

5 Data-Flow Analysis and Editor Presentation of Use/Def Chains

A critical aspect of reasoning about the semantics of a given piece of code is determining which value definitions (say, variable assignments) affect which references. The classical means of reasoning about such relationships is cast in terms of so-called "reaching definitions" or "use-def" chains [26]. This section presents a simple analysis to compute this information intraprocedurally, along with a simple technique for presenting the information in an IDE context. As with the other sections, the emphasis here is not on a precise analysis that handles all the complexities of "real" languages, but rather on a basic analysis that may serve as a good foundation, along with the necessary interconnect and user interface componentry that makes this information useful to the developer.

This section presents an encapsulation of the analysis that triggers the use-def analysis via a simple user-invokable gesture, and presents the resulting information in the form of textual highlights in the source editor. In particular, we describe how to define a modal button that toggles the display of the use-def information, and how the analysis interacts with source editor text selections. In this mode, when the user clicks on a local variable reference, the reaching definitions are highlighted; likewise, when a local variable value definition is selected, the references that can "see" that definition are highlighted.

In the following, we use the term symbol to mean a name that "binds" (or "resolves") to a specific variable declaration.5 This concept is distinct from the syntactic notion of identifier, a sequence of source-text characters conforming to a pattern prescribed by the language specification. As such, not all occurrences of a given identifier, say x, in a given piece of source text, necessarily correspond to the same symbol. In particular, many languages limit the visibility of a local variable declaration to the innermost-surrounding scope (delimited by, say, curly braces). Thus, an identifier x can refer to one variable declaration when it appears in one scope, and to another declaration when appearing in a sibling (i.e. neither parent nor child) scope.

This distinction between symbol and identifier is particularly important in languages that, like Java or C, permit "shadowing" declarations, in which a symbol in an inner scope can be declared with the same name as that of a symbol declared in an outer scope. Thus, more than one symbol with a given name may be in-scope at any point in the source text, which could lead to ambiguity. In such cases, occurrences of the given identifier in the inner scope are specified to bind to the symbol declared in the inner scope, thus hiding the outer-scope symbol, and avoiding ambiguity. See, for example, Section 3.4 of [9] for a more thorough discussion.
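For instance, in the following small Java fragment the single identifier x names two distinct symbols: the field declared in C and the local declared in m(), with the local shadowing the field inside the method body.

class C {
    int x = 1;                        // one symbol: the field C.x

    void m() {
        int x = 2;                    // a second symbol: the local x, shadowing the field
        System.out.println(x);        // binds to the local symbol (prints 2)
        System.out.println(this.x);   // explicitly names the field symbol (prints 1)
    }
}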

In the classical program analysis literature, the term "definition" refers to an occurrence of a given symbol that corresponds to an operation that can change the value of the associated variable in memory. Common operations of this sort include assignment, initialization, and increment. Other references such as argument passing in call-by-value semantics, or use as an operand to a binary arithmetic operator, or use on the right-hand side of an assignment operator do not affect the variable's value, and are thus not considered definitions.

In the example Java class shown in Figure 5.1, the initialization of variable x is a definition, as is the assignment to x inside the for-loop body, as well as the increment of i in the for-loop header. By contrast, the reference to x on the right-hand side of the assignment, and the reference as an argument to the call to println, are not definitions.

5 Note that the term symbol as used here could refer to an arbitrary "l-value", including an array element, as in a[i], or a pointer dereference, as in *p. We do not address such possibilities here, for brevity's sake. Consult the considerable literature [2,32,18,24,4] exploring the many variations on pointer, alias and array analysis, which are capable of much greater precision in dealing with such complex references.


class Foo {
    public void foo() {
        int x = 5;
        int y = 12;
        y = 17;
        for (int i = 0; i < 5; i++) {
            x = x + y;
        }
        System.out.println(x);
    }
}

Fig. 5.1. Example use-def relationships

The "reaching definitions" for a given symbol reference X consist of the set of definitions for that symbol that may have been the last definition executed at the time the program evaluates that reference X. Figure 5.1 shows a small example Java class for which the use-def relationships have been indicated by means of arrows from def to the corresponding possible uses.

More specifically, the building blocks involved in implementing this functionality are:

– the Abstract Syntax Tree (AST), which is analyzed to determine the structure and semantics of the source code,
– document listeners, which are notified when an editor's source text changes,
– selection listeners, which are notified when the user selects some source text in an editor,
– the reaching definitions analyzer itself, and
– annotations, which are regions of source text with associated attributes, such as font, color, or style.

We assume in the following that each identifier reference in the AST has been resolved to a specific symbol (i.e. declaration). (We then say that all references in the AST have been "resolved".) If the AST does not already have this information, it can be produced by a usually relatively straightforward analysis, as defined by the binding rules of the language in question. Given this information, each reference AST node can then be directly annotated with its binding, if the AST API permits such annotation. Alternatively, a set of tables external to the AST can be formed, typically one per scope, to expose the identifier-to-symbol bindings defined in each scope.

5.1 Analysis Formulation

We cast the analysis problem in terms of the classic "reaching definitions" analysis, expressed over AST nodes. This roughly follows a simplified form of the constraint formulation given in [26]. Specifically, for each AST node n, we define the set of definitions that reach n as follows:

RD(n) = {(v, n′) | def of variable v at node n′ reaches n} (1)

We will see below that the notion of "reaches" incorporates data- and control-flow information derived from the AST. Essentially, each pair (v, n′) embodies the fact that the definition of variable v at node n′ was the last definition of v to have been executed when node n is executed.

We then obtain the use-def information we seek by post-processing the results of the reaching definitions analysis to produce a pair of maps from AST nodes to AST nodes. Here, the notation (ref v, n) identifies an AST node n that references variable v, while (def v, n) identifies a node n that defines variable v.

– UD(ref v, n) = {(v, n′) | (v, n′) ∈ RD(n)}
– DU(def v, n) = {(v, n′) | (v, n) ∈ RD(n′)}

Intuitively, UD(ref v, n) maps a reference to variable v at AST node n to the set of definitions that "reach" n. Conversely, DU(def v, n) maps a definition of symbol v at node n to the set of references of v that it reaches.

Now, borrowing from [25], we define a control-flow graph over AST nodes (at a sufficiently fine granularity), and define the "entry" of an AST node n to be the point in program execution just after any predecessor of n has completed, but before n itself has executed. Similarly, we define the "exit" of n to be the execution point just after n has completed, but before any successor of n has begun.

To compute the reaching definitions information, we start with the definitions of the following terms that will appear in our constraint system:

term          meaning
RDentry(n)    the set of definitions reaching the entry of AST node n
RDexit(n)     the set of definitions leaving the exit of AST node n
(v, n)        a definition of variable v at AST node n
(v, ∗)        a definition of variable v at any AST node
S \ S′        set difference: {t | t ∈ S ∧ t ∉ S′}

Using the above terms, we can now define constraints that look like so, with the given corresponding interpretations:

constraint              interpretation
RD(n) ⊆ RD(n′)          the set of reaching definitions of AST node n is a subset of that of AST node n′
(v, n) ∈ RDentry(n′)    the definition of variable v at AST node n reaches AST node n′

The first set of constraints we write is for the data flow through side-effecting operators:


construct   constraints                                      description
v = E       (v, v = E) ∈ RDexit(v = E)                       def: definition of value for v reaches exit
v = E       RDentry(v = E) \ {(v, ∗)} ⊆ RDexit(v = E)        kill: anything not killed by definition reaches exit
v++         (v, v++) ∈ RDexit(v++)                           similar to def rule for assignment
v++         RDentry(v++) \ {(v, ∗)} ⊆ RDexit(v++)            similar to kill rule for assignment
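For example, instantiating these rules for the assignment x = x + y inside the loop of Figure 5.1 yields the constraints (x, x = x + y) ∈ RDexit(x = x + y) and RDentry(x = x + y) \ {(x, ∗)} ⊆ RDexit(x = x + y): the assignment itself reaches the node's exit, any earlier definition of x (such as the initialization int x = 5) is killed, and definitions of other variables (such as y = 17) pass through unchanged.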

Next, we consider control-flow statements. In general, if statement S flows to statement S′, then the following constraint applies:

RDexit(S) ⊆ RDentry(S′) (2)

All of the subsequent constraints will result from straightforward applications of this rule. Thus, the control-flow through statement blocks of the form

{S1; S2; S3; . . . ; Si; Si+1} (3)

give rise to the constraints:

1. RDexit(S1) ⊆ RDentry(S2)
2. RDexit(S2) ⊆ RDentry(S3)
3. . . .
4. RDexit(Si) ⊆ RDentry(Si+1)

For simple for-loops, the situation is as shown in Figure 5.2.

[Figure: control flow among the init, cond, body, and update parts of a for(init; cond; update) loop, with its edges numbered.]

Fig. 5.2. Control flow of a for-loop

Such a structure gives rise to the following constraints, which are obtained by applying equation 2 to each control-flow edge in turn:

1. RDentry(for) ⊆ RDentry(init)
2. RDexit(init) ⊆ RDentry(cond)
3. RDexit(cond) ⊆ RDentry(body)
4. RDexit(body) ⊆ RDentry(update)
5. RDentry(cond) ⊆ RDexit(for)


[Figure: Java source file → Parser → AST → Constraint Generator (ASTVisitor) → reaching-defs constraints → Constraint Solver → map from AST nodes to reaching definitions.]

Fig. 5.3. Architecture of reaching definitions analysis

5.2 Analysis Implementation

Now that we have the necessary constraint structure, we can proceed with an implementation of the analysis engine, which we will later integrate with the necessary UI apparatus. The basic architecture is shown in Figure 5.3.

The architecture is a fairly general one for static analysis: first, source files are parsed and semantically processed to obtain ASTs with binding and type information. Next, the ASTs are "distilled" into an intermediate form that is more amenable to the analysis; in this case, the form is that of a set of constraints. The representation we use is in effect a control-flow graph in which the nodes are in fact AST nodes. For some analyses, it may be convenient to first convert the ASTs into another form such as three-address code and/or static-single assignment (SSA) form. This is unnecessary for the present analysis, however, and so we generate constraints directly from the AST itself.6

It should be noted that the current analysis is local and in fact intraprocedural, and moreover, ignores aliasing, for the sake of a more self-contained exposition. Similarly, the control-flow graph is an approximation that ignores edges due to exceptional control flow (that is, edges introduced by try/catch/finally constructs in Java). Although this latter simplification certainly introduces a degree of unsoundness, the essence of the flow of the analysis remains largely the same as in the more complex and robust case.

We first define the representation for our constraint system. Listing B.1 shows the classes used for this purpose. The classes depicted in Listing B.1 are in fact much more general than needed for reaching definitions analysis, since we will use a variation on them in Section 7 that implements a type inference engine. Listing B.2 shows the classes specifically intended for our reaching definitions analysis.

6 The process is quite similar for analyzing byte-code or other object code forms.


The classes shown in Listing B.2 correspond to two types of entities: (a) classic control-flow graph entry/exit nodes, and (b) representations of value definitions. There are two flavors of value definitions: one to represent definitions of specific variables at specific locations (see the 2-argument constructor that takes both a variable binding and an ASTNode), and one to represent any definition of a given variable at any location. The latter is used in certain constraints to kill definitions entering a given AST node.

We use a factory to encapsulate the creation of the terms that appear in the various constraint classes, as shown in Listings B.3 and B.4. Following the formulation described earlier, a ConstraintTerm is used to refer to the entry/exit points of each AST node (e.g. RDentry(n)), as well as the value definitions (e.g., (v, n)). Among other things, this factory ensures that a unique ConstraintTerm is created to represent any AST node and value definition, wherever referenced in a constraint.

The AST visitor class (ConstraintVisitor) is generic, and relies on the domain-specific constraint creator sub-class, in this case RDConstraintCreator, to create constraint terms for the various language constructs. The API of this class appears in Listing B.5. The two primary methods are the handlers for assignments, which appears in Listing B.6, and for for-loops, which is shown in Listing B.7. The assignment handler determines whether the left-hand-side is a local variable reference; if so, it creates a suitable constraint. If not, it creates a "pass-through" constraint which simply propagates the reaching definitions from the right-hand side to the left-hand side. It should be easy to see that the for-loop handler is a straightforward realization of the strategy described above.

5.3 Solving the Constraints

The basic flow for solving the constraints is a straightforward iterative work-list algorithm, and consists of three steps:

– Build the constraint graph.
– Initialize the estimates of the solution for each node.
– Process the work-list.

Each node in the constraint graph is associated with an "estimate" of the solution that evolves as the solution algorithm progresses. This estimate is, naturally, a set of value definitions that reach that point in the program's control flow graph. As seen in Listing B.9, the initial estimate given to definition nodes is a singleton set containing the variable being assigned, while the initial estimate given to other nodes is simply the empty set.

The code in Listing B.8 builds the constraint graph from the constraints, and constructs a map that records the constraints in which each term appears. This is used to efficiently locate the constraints that must be examined when a term estimate changes during solving. The reason for this additional complexity is that in our fairly generic constraint framework, constraint terms in any specific constraint domain can have arbitrary sub-structure (e.g., unary and binary operators). Thus, caching this referencing information avoids the repeated examination of arbitrary term structures during constraint solution.


The constraint solution algorithm is shown in Listing B.10. This is a simple work-list algorithm in which the work-list items are constraint terms whose estimate has changed, and the estimates are sets of reaching value definitions. One such term is selected, and any associated constraints are inspected. For each unsatisfied constraint, the estimates are adjusted to maintain the necessary subset relationship implied by the constraint, and if any estimate has changed, the corresponding constraint term is added back to the work list. The algorithm terminates when no such work items remain in the work list. It is not hard to see that constraint term estimates increase monotonically in size, so the algorithm is guaranteed to terminate after a finite number of iterations.
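The following compact, self-contained sketch conveys the shape of such a solver for pure subset constraints; the actual Listing B.10 additionally handles membership constraints and the richer term structure discussed above. All class names here (SubsetConstraint, WorklistSolver) are illustrative.

import java.util.*;

// Sketch only: a generic work-list solver for subset constraints over sets of definitions.
final class SubsetConstraint<T> {
    final T source, target;                        // requires: estimate(source) ⊆ estimate(target)
    SubsetConstraint(T source, T target) { this.source = source; this.target = target; }
}

final class WorklistSolver<T, D> {
    Map<T, Set<D>> solve(Collection<T> terms,
                         List<SubsetConstraint<T>> constraints,
                         Map<T, Set<D>> initialEstimates) {
        // Index: which constraints mention a given term as their source.
        Map<T, List<SubsetConstraint<T>>> usedIn = new HashMap<>();
        for (SubsetConstraint<T> c : constraints)
            usedIn.computeIfAbsent(c.source, k -> new ArrayList<>()).add(c);

        Map<T, Set<D>> estimate = new HashMap<>();
        for (T t : terms)
            estimate.put(t, new HashSet<>(initialEstimates.getOrDefault(t, Collections.emptySet())));

        Deque<T> worklist = new ArrayDeque<>(terms);   // every term's estimate is "new" initially
        while (!worklist.isEmpty()) {
            T changed = worklist.removeFirst();
            for (SubsetConstraint<T> c : usedIn.getOrDefault(changed, Collections.emptyList())) {
                // Enforce estimate(source) ⊆ estimate(target); re-queue the target if it grew.
                if (estimate.get(c.target).addAll(estimate.get(c.source)))
                    worklist.addLast(c.target);
            }
        }
        return estimate;   // monotone growth over finite sets guarantees termination
    }
}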

5.4 Mapping Reaching Definition Sets into Use-Def Information

The following definition computes the references to a given definition from the reaching-definitions information computed above. As described earlier, this is used when the IDE user selects a variable value definition, in order to highlight the portions of the text corresponding to those references.

refsTo(def d) = { node r | r is a reference node ∧ var(r) = var(d) ∧ d ∈ RD(r) } (4)

This information is calculated by the code in Listing B.11.

The dual of the above is computed by the following, which computes the value definitions that can reach a given variable reference, using the same reaching-definitions information computed earlier:

defsOf(ref r) = { d | d ∈ RD(r) ∧ var(d) = binding(r) } (5)

The implementation is shown in Listing B.12.
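Both mappings are simple filters over the solver's output. The following sketch shows the idea with illustrative type parameters (Sym, Def, Ref); the actual Listings B.11/B.12 operate directly on AST nodes and their bindings.

import java.util.*;

// Sketch only: deriving use-def / def-use information from reaching-definitions results.
final class UseDefMaps<Sym, Def, Ref> {
    // refsTo(d): all references bound to d's variable that d reaches (eq. 4)
    Set<Ref> refsTo(Def d, Sym varOfD,
                    Map<Ref, Sym> bindingOf, Map<Ref, Set<Def>> reachingDefs) {
        Set<Ref> result = new HashSet<>();
        for (Map.Entry<Ref, Set<Def>> e : reachingDefs.entrySet()) {
            Ref r = e.getKey();
            if (varOfD.equals(bindingOf.get(r)) && e.getValue().contains(d))
                result.add(r);
        }
        return result;
    }

    // defsOf(r): all definitions of r's variable that reach r (eq. 5)
    Set<Def> defsOf(Ref r, Map<Ref, Sym> bindingOf, Map<Def, Sym> varOf,
                    Map<Ref, Set<Def>> reachingDefs) {
        Set<Def> result = new HashSet<>();
        for (Def d : reachingDefs.getOrDefault(r, Collections.emptySet()))
            if (bindingOf.get(r).equals(varOf.get(d)))
                result.add(d);
        return result;
    }
}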

5.5 User Interface Integration

The desired user interface integration is depicted in Figure 5.5. There are four basic ingredients:

– Create a toolbar Action to toggle the highlight uses/defs mode (shown in Figure 5.4).
– Re-run the reaching-definitions analysis whenever the editor broadcasts a source document change notification.
– Update the source highlighting when the editor's text selection changes.
– Create a set of Annotations to indicate the desired source text highlights, and convey that set to the source editor.

Eclipse provides a standard mechanism for arbitrary clients to obtain notification of textual changes to open editor documents, based on the Observer design pattern. Likewise, Eclipse defines a notification mechanism for clients to observe changes in selections made by the user in various views and editors.


Fig. 5.4. Screenshot of use-def editor mode button

Fig. 5.5. Screenshot of use-def editor highlighting

(Again, these basic services are provided by most general-purpose user interface frameworks, so that integration with other IDEs will follow the same general pattern.) To effect these connections to the Eclipse framework, we must:

– Create a document listener and register it with the appropriate editor.
– Create a selection listener and register it with the editor.

The Workbench Action. Listing B.13 gives a somewhat simplified version of an Action class that registers the appropriate document and selection listeners. This action must be registered with the Eclipse platform via an extension like the one that appears in Listing B.14.

Use-Def Listeners. The document listener shown in Listing B.15 is trivial, since it has only one task: to invalidate any previously-computed analysis results. It does so by simply nulling out the cached AST for the compilation unit of the editor that currently has focus. This has the effect of forcing the recomputation of the reaching-definitions information.

The selection listener is shown in Listing B.16; it responds to selection changes by recomputing the annotations (highlights) based on the new selection. The following sub-section describes the computation of annotations from the reaching-definitions information created by the analysis described earlier.
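Roughly, the two listeners look like the following. UseDefController, invalidateAst() and highlightFor() are illustrative names standing in for the controller logic of Listings B.15–B.17, not Eclipse API.

import org.eclipse.jface.text.DocumentEvent;
import org.eclipse.jface.text.IDocumentListener;
import org.eclipse.jface.text.ITextSelection;
import org.eclipse.jface.viewers.ISelectionChangedListener;
import org.eclipse.jface.viewers.SelectionChangedEvent;

// Sketch only: the two listeners wired up by the workbench action.
interface UseDefController { void invalidateAst(); void highlightFor(int offset); }

class UseDefDocumentListener implements IDocumentListener {
    private final UseDefController fController;
    UseDefDocumentListener(UseDefController c) { fController = c; }

    public void documentAboutToBeChanged(DocumentEvent event) { /* nothing to do */ }

    public void documentChanged(DocumentEvent event) {
        fController.invalidateAst();               // force re-parse/re-analysis on next selection
    }
}

class UseDefSelectionListener implements ISelectionChangedListener {
    private final UseDefController fController;
    UseDefSelectionListener(UseDefController c) { fController = c; }

    public void selectionChanged(SelectionChangedEvent event) {
        if (event.getSelection() instanceof ITextSelection) {
            int offset = ((ITextSelection) event.getSelection()).getOffset();
            fController.highlightFor(offset);      // recompute and submit annotations
        }
    }
}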

Annotation Management. As described earlier, annotations are essentially styled regions of text in an Eclipse text editor. Each annotation has an associated type that is used primarily to distinguish annotations maintained by different clients.


Each editor maintains a set of such annotations, which can be obtained and manipulated by arbitrary clients using standard API.

The flow of the top-level method called by our selection listener is straightforward, and is shown in Listing B.17. Essentially, the AST for the source text in the current editor is retrieved (or recomputed if needed). Next, the AST node corresponding to the current text selection is determined. An instance of the aforementioned UseDefAnalyzer is created, and the selected AST node is passed in to start the analysis. The result of the analysis is a set of AST nodes, which are then mapped back into source text positions via the call to convertNodesToPositions(). (This latter method is not shown, but simply accesses positional information embedded in each AST node.)

These positions in turn are used to create the editor annotations via a call to convertPositionsToAnnotationMap(), shown in Listing B.18, which are finally given to the editor via the annotation model by submitAnnotations() (shown in Listing B.19).
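The essence of both steps is small, as the following sketch suggests. The annotation type string "usedef.highlight" is illustrative and would need to match an annotation type contributed in plugin.xml so that the editor knows how to render it.

import java.util.*;
import org.eclipse.jface.text.Position;
import org.eclipse.jface.text.source.Annotation;
import org.eclipse.jface.text.source.IAnnotationModel;

// Sketch only: turn source positions into editor annotations and hand them to the editor.
class UseDefAnnotations {
    private static final String TYPE = "usedef.highlight";   // illustrative annotation type id
    private final List<Annotation> fCurrent = new ArrayList<Annotation>();

    Map<Annotation, Position> convertPositionsToAnnotationMap(List<Position> positions) {
        Map<Annotation, Position> result = new HashMap<Annotation, Position>();
        for (Position p : positions)
            result.put(new Annotation(TYPE, false, "use-def highlight"), p);
        return result;
    }

    void submitAnnotations(IAnnotationModel model, Map<Annotation, Position> annotations) {
        for (Annotation old : fCurrent)            // drop highlights from the previous selection
            model.removeAnnotation(old);
        fCurrent.clear();
        for (Map.Entry<Annotation, Position> e : annotations.entrySet()) {
            model.addAnnotation(e.getKey(), e.getValue());
            fCurrent.add(e.getKey());
        }
    }
}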

Some additional machinery is needed to track which editor currently has focus. This makes use of the standard Eclipse API IPartListener for listening to events that govern a view's lifecycle, including opening, closing, activating and deactivating. Such code is fairly straightforward, and omitted for brevity's sake.

6 Effects Analysis and Refactoring

The goal of this section is to describe the implementation of a refactoring that parallelizes a selected piece of code to run asynchronously with respect to its lexically-surrounding context. A refactoring is defined as a behavior-preserving source-to-source transformation [14]. In general, behavior preservation is ensured by a set of preconditions associated with the transformation, based on the semantics of the code being manipulated.

The language used for this example is X10 [8], a language intended for programming on parallel hardware of various sorts, from loosely-coupled clusters of nodes to tightly-coupled arrangements of processors as found in, say, GPU-based architectures. To that end, X10 extends the Java language's object model and core semantics with a small number of constructs specifically targeting concurrency. The core parallelism constructs of interest here are async S and finish S, for arbitrary statement S. The former construct causes S to be executed asynchronously with respect to its surrounding context (say, by forking another thread); the latter "joins" with all nested asyncs (including those occurring within any methods called from S).

6.1 Refactoring

The specific refactoring we present here is a "loop-flat" parallelization that wraps each iteration of a loop (hence the "flat" qualifier) in a distinct asynchronous activity.7

7 A more general transformation groups iterations into "chunks", each of which can be executed in parallel, while individual iterations within a chunk are executed serially. See the rich literature on loop transformations [23,1,29,30] for more information.


The transformation schema is illustrated in Figure 6.1. Syntactically, the transformation is trivial, requiring only the insertion of finish and async keywords. The async operator introduces the desired parallelization, while the finish operator provides the necessary synchronization at the loop's end.

for (p in r) S   =>   finish for (p in r) async S

Fig. 6.1. Loop-flat parallelization transformation schema

Verifying the preconditions to ensure behavior preservation is more complicated, and involves a memory effects analysis to determine what memory locations are touched by the pieces of code being parallelized. Naturally, if a mutable memory location is touched by multiple iterations, the async-introduced reordering could change the loop's behavior.

Thus, a key portion of this transformation's preconditions is that the memory effect of the loop body statement S must "commute with itself". The term "commutes" as applied to the pair of statements S, S′ means that the statements can be executed in either order, with the same effect on the state of memory. This is clearly true if it can be proved that the memory effects of S and S′ are disjoint, i.e., if the statements do not touch the same memory locations. It may also be true if the effects are not disjoint (e.g., where accumulators or proper reduction operators are used), but for simplicity's sake, we consider here only the case of disjoint effects.

For this particular transformation, S[p1/p] and S[p2/p] must commute, for distinct values p1, p2 taken from the iteration domain r of the loop.8 We use abstract domain values p1 and p2, so that reasoning about their relative ordering applies to any two loop iterations.

Again, in order to show the analysis in context, we use a somewhat simple-minded analysis approach that is in effect a syntax-directed translation from AST nodes to a symbolic effects representation. This produces, for each AST node, the effects on memory of the corresponding program fragment. The effects representation itself is fairly simple, and is well-suited to array-intensive code, but would require nontrivial extension to adequately handle heap-intensive code or code that performs arbitrary index arithmetic. Given the above-mentioned restriction, an important property of the effects representation is that it be easy to determine pair-wise disjointness.

We restrict the analysis in several ways for the sake of simplicity of presentation:

– The analysis is alias-oblivious.
– The analysis is intraprocedural.

8 Roughly, the notation S[x/y] means S with every occurrence of y replaced by x.


– The analysis is only able to handle field references through immutable object references.

An effect e consists of a triple of sets of locations read, written, and updated:

e = {R,W,U} (6)

Locations in each of the sets R, W, U are represented symbolically, as abstract memory locations, e.g. v or a[p]. The degree of symbolic abstraction is key to the precision of the analysis.9 Specifically, the safety of the transformation relies on the analysis avoiding false negatives, in which two locations are stated to be distinct but in fact coincide. Likewise, the usefulness of the refactoring relies on the analysis avoiding false positives, in which two locations appear to be the same but are not. In that case, the analysis concludes that the memory effects do not commute, thus disallowing the transformation, where in fact the effects do commute.

Naturally, the effects computation must be conservative; that is, the computed effect of a construct C must contain all locations that C may touch, in any possible execution of C. To that end, we employ a so-called "bottom" effect for terms that the analysis is not prepared to model directly. The bottom effect represents the set of all possible memory locations, expressing complete uncertainty as to the effect of any so-marked construct. The bottom effect does not commute with any other effect (including itself), and thus safely forces the transformation's preconditions to fail when encountered during the analysis.
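
To make the preceding description concrete, the following is a minimal sketch of such an effect representation. It is deliberately not the Effects API used in the listings below: the class and method names (SimpleEffect, commutesWith, touched) are invented for illustration, symbolic locations are shown as plain strings, and the string comparison merely stands in for a real symbolic disjointness test (e.g., proving that a[p1] and a[p2] cannot coincide when p1 and p2 are distinct).

import java.util.HashSet;
import java.util.Set;

class SimpleEffect {
    // The distinguished bottom effect: "may touch anything".
    static final SimpleEffect BOTTOM = new SimpleEffect(true);

    final Set<String> reads   = new HashSet<String>(); // R
    final Set<String> writes  = new HashSet<String>(); // W
    final Set<String> updates = new HashSet<String>(); // U
    private final boolean isBottom;

    SimpleEffect() { this(false); }
    private SimpleEffect(boolean bottom) { isBottom = bottom; }

    private Set<String> touched() {
        Set<String> all = new HashSet<String>(reads);
        all.addAll(writes);
        all.addAll(updates);
        return all;
    }

    // Following the simplification adopted in the text: two effects commute
    // when they are provably disjoint; bottom commutes with nothing, not
    // even itself.
    boolean commutesWith(SimpleEffect other) {
        if (isBottom || other.isBottom) return false;
        for (String loc : touched()) {
            if (other.touched().contains(loc)) return false;
        }
        return true;
    }
}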

Consider the following simple array-intensive loop:

for (p: Point in a.region) {
    a[p] *= k;
}

This is a standard iteration idiom in X10 that enumerates each point in the N-dimensional index domain of the array a (expressed as a.region), and binds each successive domain value to the loop induction variable p. The (possibly multi-dimensional) index point p can be used directly to index the array a, as shown in the example.

To parallelize such a loop, we must first compute the effect ebody of the loop body, and then ask whether the effect ebody commutes with itself. More precisely, we determine whether the effect ebody[p1/p] (that is, the effect obtained from ebody by substituting p1 for p throughout) commutes with ebody[p2/p] for arbitrary distinct values p1, p2 taken from the loop induction domain a.region. In this case, the answer is yes, because we can directly prove that symbolic locations a[p1] and a[p2] are disjoint when p1 ≠ p2.

6.2 Effects Analysis Implementation

The analysis proceeds by constructing a representation of the memory effects of a given AST node n by visiting the node and its sub-structure, and building an effect bottom-up from the leaves.

9 There are many possible symbolic abstractions, each with their own strengths and weaknesses. See for example [5,15,13] for details.


The code in Listings C.1 through C.10 makes use of various APIs from the X10 compiler front-end, including its AST class hierarchy and type system. However, the X10 compiler is an instantiation of the Polyglot compiler framework, so that most of the API involved is in fact not specific to X10. That said, some modest liberties have been taken with certain APIs (e.g. those from the Effects class) to omit gratuitous complications while preserving the spirit of the actual implementation.

The main class of interest is the EffectsVisitor class, shown in Listing C.1. This class uses a field fEffects to maintain a map from AST nodes to their effects. The class extends Polyglot's NodeVisitor class; as such, its most interesting method is the interface method leave(...). That method consists of a case analysis that dispatches to the appropriate handler method based on the AST node type. Listing C.2 shows the utility method followedBy(), which computes a compound effect consisting of two effects in sequence, and is used by the various overloaded computeEffect(...) methods.

The code to compute the effects for various kinds of assignments appears in Listing C.3. This is straightforward, for the most part, except for the use of the "bottom" effect to model assignments to mutable fields. (Handling this properly would likely require some form of alias analysis, which would introduce unnecessary complications into the presentation, and is thus outside the scope of this paper.)

The computation of effects for local and field references is given in Listing C.4. The only interesting accommodation here is to ignore any effects on val variables, which are immutable.

The computation of effects for various expressions is given in Listing C.5. The effects of binary expressions are the simple composition of the effects of the operands, since the operators themselves introduce no additional effects. Some unary operators, notably the incrementing/decrementing ones, do introduce additional write effects, and hence require special treatment beyond that of the operands.

The effects of control-flow statements like if or for are straightforward, and are given in Listing C.6.

The effect of a block is created by stripping all effects on locations that are local to that block, as depicted in Listing C.7. This is accomplished by existentially quantifying the block's effect over the locally-declared variables of that block.

6.3 Refactoring Mechanics

All Eclipse refactorings follow the same protocol, established by the base Refactoring class:

– Quickly check initial preconditions before posting any dialogs; e.g., determine whether the selected node is of the correct type.

– Collect any necessary additional information from the user, e.g., the target name for a Rename refactoring.

– Perform detailed precondition checking to determine whether the transformation is safe.

– Produce a Change object that identifies the detailed textual changes to be made if the user accepts the refactoring.


Each of these steps corresponds to an abstract method on Refactoring. E.g., the initial condition check is performed by checkInitialConditions(), shown for our parallelization refactoring in Listing C.8. In this case, we need only check that the selected node has the right type and that it is not already parallelized, via a simple structural query.
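
For orientation, a bare-bones skeleton of such a Refactoring subclass is sketched below. This is illustrative only: the actual parallelization refactoring appears in Listings C.8 through C.10, and the user-input step is normally handled by the refactoring's wizard pages rather than by this class.

import org.eclipse.core.runtime.CoreException;
import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.OperationCanceledException;
import org.eclipse.ltk.core.refactoring.Change;
import org.eclipse.ltk.core.refactoring.Refactoring;
import org.eclipse.ltk.core.refactoring.RefactoringStatus;

class SkeletonRefactoring extends Refactoring {
    public String getName() { return "Skeleton Refactoring"; }

    // Step 1: cheap structural checks, before any dialog is shown
    public RefactoringStatus checkInitialConditions(IProgressMonitor pm)
            throws CoreException, OperationCanceledException {
        return new RefactoringStatus(); // an OK status
    }

    // Detailed, possibly analysis-based precondition checking
    public RefactoringStatus checkFinalConditions(IProgressMonitor pm)
            throws CoreException, OperationCanceledException {
        return new RefactoringStatus();
    }

    // Describe the textual edits to perform if the user accepts
    public Change createChange(IProgressMonitor pm)
            throws CoreException, OperationCanceledException {
        return null; // a real refactoring returns a (tree of) Change objects
    }
}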

More detailed preconditions are then checked in the next step, as shown in Listing C.9. This invokes the effects analysis described above, which produces for the loop body a representation of all of its memory effects. As described earlier, the transformation's precondition requires that the loop body effect (call it e) commute with itself when quantified with two distinct domain values for the loop induction variable (call it i). In other words, the effect e[i1/i] must be disjoint with the effect e[i2/i], for distinct i1 and i2. This logic is encapsulated within the call to Effect.commutesWithForAll(). If the commutativity test fails, the check is essentially performed a second time, but this time to collect all of the interfering effects, via a call to Effect.interferenceWithForAll(), whose results are used to inform the user of the reason for the precondition failure.

The result of the Refactoring method createChange() is a Change object, which encapsulates the complete set of textual changes necessary to perform the transformation. In general, the changes may span many files, though in this case the transformation is limited in scope to the file containing the selected loop. The Change object is in fact a tree of Change objects, whose structure reflects the nature of the changes. Moreover, individual changes are typically given meaningful labels that will appear in the user interface, and help the user understand the reason for the individual modifications.

In this case, the transformation as shown in Listing C.10 is trivial, consisting solely of the addition of two keywords (finish and async) at the appropriate text offsets. These offsets are simply obtained from the AST nodes for the loop and the loop body, respectively. However, if the loop is already wrapped inside a finish, the code omits the additional finish, which, although not harmful, would be unsightly.

7 Type Analysis and Code Smells

Code smells are generally defined as outwardly visible manifestations of non-functional characteristics of code that are undesirable from a design or maintenance perspective [14, Chapter 3]. Smells usually correspond to structural properties, and are therefore somewhat deeper than, say, style- or format-related aspects. Example smells include duplicated code, overly long methods, tight coupling between nominally distinct components, lack of encapsulation, and so on. The particular smell used here as a running example, Overly-Specific Variable, illustrates the implementation of a type analysis engine, and a distinct means of exposing the resulting information to the developer.

In a language with inheritance, an Overly-Specific Variable is one that is declared to be of a type that is more specific than is needed in order to satisfy the uses of that variable. In object-oriented terms, all of the members accessed via that variable are defined by super-types of the variable's declared type. In such a case, the variable's declaration could be modified to specify a more general type.


class Foo {
    public ArrayList toList(String[] args) {
        ArrayList l1 = new ArrayList();
        for (int i = 0; i < args.length; i++) {
            l1.add(args[i]);
        }
        return l1;
    }
    public void foo() {
        List l2 = toList(new String[] { "a", "b", "c" });
        for (Iterator it = l2.iterator(); it.hasNext(); ) {
            System.out.println(it.next());
        }
    }
}

Listing 7.1. Example of Overly-Specific Variables

We cast it here in an object-oriented context, but the smell applies equally well to languages with parametric polymorphism, such as ML or Haskell.10

Consider the Java class in Listing 7.1. The local variable l1 in method toList() is declared as an ArrayList, but can instead be declared as the more general type List, which provides all of the API used by toList(), namely, the List.add() method. Likewise, the return type of method toList() can be declared as List. Using the most general type possible in declarations is considered good coding practice, as it permits the use of any given code fragment in as many situations as possible.
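
For concreteness, the generalized declarations would read as follows; this is merely an illustrative "after" view of Listing 7.1, not output produced by the tooling described below.

class Foo {
    public List toList(String[] args) {         // return type generalized to List
        List l1 = new ArrayList();              // declared type generalized to List
        for (int i = 0; i < args.length; i++) {
            l1.add(args[i]);
        }
        return l1;
    }
    public void foo() {
        List l2 = toList(new String[] { "a", "b", "c" });
        for (Iterator it = l2.iterator(); it.hasNext(); ) {
            System.out.println(it.next());
        }
    }
}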

To simplify smell detector implementation and integration, IMP provides a small framework to encapsulate the registration and execution of smell detectors for Java, along with some basic utilities to present the identified code smells in the IDE. It also provides helper classes to implement code transformations that remediate detected smells. The architecture is shown in Figure 7.1.

The framework is pluggable, so that clients can register Java smell detectors at various granularities (workspace, project, source file, class and member) that get automatically instantiated and called as part of the build process. A sample smell detector registration appears in Listing D.1. It gives the human-readable name of the detector, along with the fully-qualified name of the implementation class.

The Java smell detector framework requires that all detector implementation classes implement one of the following IMP-defined interfaces:

– IFieldSmellDetector
– IMethodSmellDetector
– ITypeSmellDetector
– IUnitSmellDetector
– IPackageSmellDetector
– IProjectSmellDetector

10 In such languages, this smell manifests as a function having an overly-specific explicit type for a formal argument.


interface ISmellDetector {
    // Used as the value of the marker type attribute.
    // Identifies the marker as describing a code smell instance.
    static final String k_smellMarkerType =
        "org.eclipse.imp.smelldetector.smellmarker";

    // The kind attribute's value identifies the particular smell
    // described by this marker.
    static final String k_smellMarkerKindAttribute =
        "org.eclipse.imp.smelldetector.smellmarkerkind";

    String getName();
}

interface IFieldSmellDetector extends ISmellDetector {
    void runOn(FieldDeclaration field, ICompilationUnit icu, IFile file);
}

interface IMethodSmellDetector extends ISmellDetector {
    void runOn(MethodDeclaration method, ICompilationUnit icu, IFile file);
}

interface ITypeSmellDetector extends ISmellDetector {
    void begin(TypeDeclaration type, ICompilationUnit icu, IFile file);
    void end(TypeDeclaration type, ICompilationUnit icu, IFile file);
}

Listing 7.2. Smell Detector Interfaces

Each interface defines one or more methods that are invoked at appropriate points as each project and its constituent compilation unit ASTs are traversed. Listing 7.2 depicts the field, method, and type detector interfaces, along with the base detector interface. The base interface defines two string constants. One, k_smellMarkerType, identifies a marker as being associated with a detected smell; the other, k_smellMarkerKindAttribute, is a marker attribute key that is used to hold the specific kind of smell detected. The smell kind helps identify the set of quick fixes that can remediate that kind of smell.

7.1 Implementing the Smell Detector

The steps required to implement a smell detector are:

Step 1 Create a new plug-in project (or re-use an existing plug-in).
Step 2 Add a plug-in dependency for the smelldetector framework plug-in.


[Figure 7.1 is a diagram. It shows the IMP SmellDetector Manager invoking the registered Smell Detectors over the relevant workspace resources and their ASTs; the detectors create Resource Markers, which appear in the Problems View; a Marker Resolution Generator supplies Marker Resolutions (quick fixes), which appear in the context menu of a marker and apply the resulting resource changes.]

Fig. 7.1. Architecture of smell detection framework

Step 3 Create an extension of the smelldetector extension point.
Step 4 Create a class implementing ISmellDetector or some sub-interface.
Step 5 Create the analyzer implementation.
Step 6 Create the remediator as a class implementing IMarkerResolutionGenerator.

The following sub-sections will focus on the implementation of the detector interfaces, and of the analyzer and remediator.

7.2 Detector Interface Implementation

As shown in Listing D.2, the main responsibilities of the compilation unit detector are to (a) invoke the analysis engine, and (b) examine the analysis engine's output to create a smell marker for each overly-specific variable detected. The engine's output consists of a map from compilation units to information about the overly-specific variables detected in that unit. The latter information is represented as a map from a ConstraintTerm (i.e., a variable) to the set of types that that variable may be declared as. This implementation simply picks any upper bound (i.e., most general type) from the TypeSet computed by the analysis engine, and sets the NEW_TYPE attribute on the smell marker. NEW_TYPE is used by the remediator to identify the new declared type of the given variable, if the user accepts the quick-fix.
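
As a rough sketch of part (b), marker creation might look like the following. The runOn(...) signature, the analyzeUnit(...) helper, and the anyUpperBound() accessor are assumptions standing in for the framework's and engine's actual API; the real detector is the one in Listing D.2.

class OverlySpecificVariableDetector implements IUnitSmellDetector {
    public String getName() { return "Overly-Specific Variable"; }

    // Assumed signature, by analogy with the other detector interfaces.
    public void runOn(CompilationUnit unit, ICompilationUnit icu, IFile file) {
        // (a) run the type-constraint analysis for this unit (assumed helper)
        Map<ConstraintTerm, TypeSet> overlySpecific = analyzeUnit(icu);

        // (b) create one smell marker per overly-specific variable
        for (Map.Entry<ConstraintTerm, TypeSet> e : overlySpecific.entrySet()) {
            try {
                IMarker m = file.createMarker(k_smellMarkerType);
                m.setAttribute(k_smellMarkerKindAttribute, getName());
                m.setAttribute(IMarker.MESSAGE,
                        "Variable can be declared with a more general type");
                // consumed by the remediator when the quick-fix is accepted;
                // anyUpperBound() is an assumed TypeSet accessor
                m.setAttribute("NEW_TYPE", e.getValue().anyUpperBound().getName());
            } catch (CoreException ex) {
                // skip markers that cannot be created
            }
        }
    }

    private Map<ConstraintTerm, TypeSet> analyzeUnit(ICompilationUnit icu) {
        // invoke the constraint generator and solver described in Section 7.3
        return Collections.emptyMap();
    }
}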

7.3 Analysis Implementation

As shown in Figure 7.2, the analysis implementation re-uses the constraint framework presented earlier in Section 5, instantiated over the domain of sets of types (TypeSets).


[Figure 7.2 is a diagram of a pipeline: Source Files are fed to the Parser, producing ASTs; the Constraint Generator (an ASTVisitor) turns the ASTs into Type Constraints; the Type Constraint Solver produces a Map: var → TypeSet.]

Fig. 7.2. Architecture of type inference framework

That is, the solution to the constraint problem will consist of a mapping from variables to the set of types that each variable could legally be declared as without altering the behavior of the program. One significant difference from the previous use-def instantiation of the constraint framework is that type inference is a whole-program analysis. As a result, the set of constraints created here will represent all of the compilation units in the given project, rather than just one as in the previous case.

The heart of the analysis engine borrows from a formalism of Palsberg and Schwartzbach [28] that captures the type relationships among the constructs in a given program. Its original purpose was type checking a Smalltalk-like language, in order to prove that certain kinds of errors cannot occur at run-time (e.g., that no "message not understood" errors are sent), and thereby optimize its implementation. Later work [33] adapted and extended the formalism to capture the type semantics of Java with parametric types.

Figure 7.3 describes the notation used to represent constraint terms corresponding to the type of various kinds of program entities, while Figure 7.4 describes the notation of the type constraints themselves. For any instance of a given kind of program construct, zero or more type constraints are generated that ensure that any satisfying type assignment preserves the program's type-correctness. A subset of the generation rules is shown in Figure 7.5.

For example, the first constraint generation rule states that for each assignment expression appearing in the program being analyzed, a constraint is generated to assert that the type of the right-hand side expression must be a subtype of the declared type of the assignment's left-hand side. This rule follows directly from the specified semantics of the Java assignment operator. Likewise, the last rule ensures three type-correctness properties of a method call: (1) that the type of the method call expression is that of the method's return type, (2) that the type of each actual parameter in the call is a subtype of the respective formal parameter type, and (3) that the receiver expression is of a sub-type of the type that declares the target method. For a more complete description of the necessary typing rules, see [33].
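
Applied to Listing 7.1, these rules yield constraints such as the following (an illustrative subset, written in the notation of Figures 7.3 and 7.4, with initialized declarations treated like assignments):

    [new ArrayList()] ≤ [l1]       (from the declaration of l1)
    [l1] ≤ Decl(add)               (from the call l1.add(args[i]))
    [l1] ≤ [toList]                (from return l1)
    [toList(...)] ≡ [toList]       (from the call to toList() in foo())
    [toList(...)] ≤ [l2]           (from the declaration of l2)

Roughly speaking, nothing here forces [l1] to be ArrayList: any type that declares add() and satisfies the return constraint will do, which is why the solver's TypeSet for l1 (and for the return type of toList()) includes the more general type List.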

To instantiate the previously-described constraint solution framework with the present type-constraint formulation, we must do the following:


[E]           the type of expression E
[M]           the return type of method M
[F]           the type of field F
Decl(M)       the type that defines member M
Param(M, i)   the type of the ith formal parameter of method M

Fig. 7.3. Notation for type constraint terms

For any constraint terms t1, t2 and literal type T:

t1 = t2   the type of t1 must be the same as that of t2
t1 < t2   the type of t1 must be a proper subtype of that of t2
t1 ≤ t2   either t1 = t2 or t1 < t2
t1 ≡ T    t1 is defined to be T

Fig. 7.4. Notation for type constraints

– Define representations for the various type operators appearing in Figure 7.4, as kinds of ConstraintOperator. Listing 7.3 gives the relevant definitions.

– Define appropriate ConstraintTerms and a "canonicalizing" term factory. Canonicalization is necessary to ensure that constraints that are intended to refer to a common entity refer to the exact same ConstraintTerm. Since this analysis is flow-insensitive, all references to a given variable use the same ConstraintTerm.11 The constraint term classes appear in Listing D.4, while Listing D.5 depicts the term factory implementation.

– Define a ConstraintCreator that generates a set of Constraints for each type of program construct, according to the rules in Figure 7.5. This is sketched in Listing D.6.

– Define a ConstraintSolver sub-class that initializes the type estimates of all ConstraintTerms to appropriate values. Listing D.7 shows the key parts of the implementation.

– Override ConstraintSolver.solveConstraints() to handle the various type operators. The implementation is given in Listing D.8. It is important to note that enforceSubtypes() forces the satisfaction of a constraint strictly by removing types from TypeSets. As a result, TypeSets always monotonically decrease in size, which guarantees termination. It is also worth noting that the algorithm assumes that the original program type-checks successfully; otherwise, additional checking would be required to make the algorithm terminate if no solution can be found.

11 For a flow-sensitive analysis, each reference to a given variable at a distinct program point would use a distinct ConstraintTerm.


For program expressions E, E1, . . . , En:

construct                                     constraint(s)
assignment E1 = E2                            [E2] ≤ [E1]
access E.f to field F                         [E.f] ≡ [F]
                                              [E] ≤ Decl(F)
return E in method M                          [E] ≤ [M]
method M in type T                            Decl(M) ≡ T
this in method M                              [this] ≡ Decl(M)
direct call E.m(E1, . . . , En) to method M   [E.m(E1, . . . , En)] ≡ [M]
                                              [Ei] ≤ [Param(M, i)]
                                              [E] ≤ Decl(M)

Fig. 7.5. Type constraint generation rules

class TypeOperator extends ConstraintOperator {
    private TypeOperator() { }
    static final TypeOperator Subtype = new TypeOperator();
    static final TypeOperator Supertype = new TypeOperator();
    static final TypeOperator ProperSubtype = new TypeOperator();
    static final TypeOperator ProperSupertype = new TypeOperator();
    static final TypeOperator Equals = new TypeOperator();
}

Listing 7.3. Type constraint operators

7.4 Smell Remediation

Given the analysis implementation, there are just three additional steps required to provide remediation for a code smell:

– Create a "resolution generator" class that returns one or more candidate resolutions for a given smell marker.

– Register the resolution generator class using an org.eclipse.ui.ide.markerResolution extension.

– Implement a "marker resolution" that, given an appropriate kind of smell marker, modifies the proper parts of the program source code to remediate the smell.

These steps materialize the flow depicted in the right-hand half of Figure 7.1. The first step is trivial, and is illustrated in Listing D.9. Next, a resolution generator extension like the one in Listing D.10 specifies a matching criterion to identify the markers to be associated with the given resolution generator. Naturally, this criterion should be satisfied by markers created by the IUnitDetector that appeared in Listing D.2. The Eclipse platform takes care of invoking the resolution generator for all matching markers. The final remediation component is the marker resolution itself, shown in Listing D.11, whose run() method encapsulates the remediation that rewrites the source AST. This step is performed using the Eclipse JDT's AST rewrite engine, which, unlike raw text modification, provides a mechanism for creating and altering AST trees that guarantees a structurally well-formed result.
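
As an indication of what that last step involves, the following is a minimal sketch of replacing a variable's declared type with the JDT ASTRewrite API; the surrounding wiring (how the declaration node and the marker reach this method) is assumed, and the NEW_TYPE attribute is the one described in Section 7.2.

import org.eclipse.core.resources.IMarker;
import org.eclipse.core.runtime.CoreException;
import org.eclipse.jdt.core.dom.AST;
import org.eclipse.jdt.core.dom.Type;
import org.eclipse.jdt.core.dom.VariableDeclarationStatement;
import org.eclipse.jdt.core.dom.rewrite.ASTRewrite;
import org.eclipse.jface.text.BadLocationException;
import org.eclipse.jface.text.IDocument;
import org.eclipse.text.edits.TextEdit;

class GeneralizeTypeSketch {
    void generalizeDeclaredType(VariableDeclarationStatement decl, IMarker marker,
            IDocument document) throws CoreException, BadLocationException {
        AST ast = decl.getAST();
        ASTRewrite rewrite = ASTRewrite.create(ast);

        // e.g. "List"; a qualified name would need ast.newName(...) instead
        String newTypeName = (String) marker.getAttribute("NEW_TYPE");
        Type newType = ast.newSimpleType(ast.newSimpleName(newTypeName));

        // Record the replacement; ASTRewrite does not mutate the original AST
        rewrite.replace(decl.getType(), newType, null);

        // Convert the recorded changes into a text edit and apply it
        TextEdit edit = rewrite.rewriteAST(document, null);
        edit.apply(document);
    }
}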


8 Conclusion

This paper presented four distinct scenarios for incorporating different types of static analysis into IDE-based tools:

– computation and use of type inheritance relationships in a type hierarchy view
– computation of local use-def relationships and presentation as textual highlights in a source editor
– computation of memory effects for use in a parallelization refactoring
– the use of type inference to detect type-related "code smells"

Each of these applications requires somewhat different analysis techniques. In the first case, type hierarchy analysis is essentially a straightforward extraction of semantic information (inheritance relationships) from the source text. In the second case, the computation of use-def information involves inspecting control- and data-flow information extracted from the AST in order to produce a constraint system that can be solved with a simple work-list algorithm. The simplified memory effects analysis computes effects by a bottom-up syntax-directed translation. Finally, the type-related code smell detector performs a syntax-directed translation to type constraint rules, which are solved by a constraint solution engine that is mostly identical to that used for the use-def analysis. The smell detector is invoked as part of the build process, and integrates into the IDE in the same basic manner as an ordinary compiler. In several cases, the algorithms presented were simplified, but any of a host of more precise or performant algorithms could be substituted into the same framework.

Likewise, the various types of information were presented in very different manners, each appropriate to the nature and scope of the information being relayed to the developer. The type hierarchy was presented in a separate IDE view, updated as the source artifacts change. Local use-def relationships were presented as textual highlights in the source editor, updated as the source is changed. Memory effects were used to validate a user-proposed parallelization refactoring, and reflected back to the user in case the proposed transformation is deemed unsafe. Finally, type-related code smells were presented as "markers" decorating the source text, which are typically displayed as part of a "Problems" or "Errors" view.

The algorithmic implementations shown are imperative in nature, and are thus somewhat more verbose than would be the case with some competing frameworks. However, this is mitigated by the fact that a significant part of the development expense is in fact incurred on behalf of fairly generic and reusable components, such as the constraint representations and constraint solution engines described in Section 5 and Section 4, and the memory effects framework of Section 6, which can be reused for a variety of useful code-manipulation refactorings. The same is true of portions of the IDE integration code, which can be reused to expose a variety of kinds of analysis information. For example, the vast majority of the mechanism to produce editor source highlighting is available as a much-abstracted "occurrence marking" service in IMP. In other words, this paper in part describes a nascent framework of reusable analysis and integration components that can serve many purposes.

Another source of complexity in the presentation is due to the explicit manipulation of so-called "extension points" to plug into the Eclipse IDE. This is easily bypassed


by hiding such details under a simpler declarative specification, as is done by IMP's preference page domain-specific language (DSL) compiler, by the Spoofax [19] DSL environment (which builds on IMP), or by Rascal's DSL environment.

More importantly, though, the particular implementation details are not of primary concern; the more important lesson is how to provide value to the IDE user in various ways from static analysis of the source code.

Though the implementations are couched in terms of an Eclipse integration, most of the necessary services that are required from the IDE are common to any extensible application framework. The remainder are typically provided by other IDE platforms, such as Microsoft's Visual Studio or NetBeans. As a result, the IDE extensions explored in this paper can be readily adapted to those contexts as well.

We hope that these exemplars give the aspiring software tool developer insight into the wide range of possibilities that these powerful components can address, and bring greater value to the IDE user in understanding, navigating, and manipulating their code effectively.

A Type Hierarchy Code Listings

class FactGenerator implements IFactGenerator {
    IValue generate(Type type, IFactContext context,
                    Map<IResource, DocumentDescriptor> workingCopies) {
        ISourceEntityContext sec = (ISourceEntityContext) context;
        Set<ICompilationUnit> srcs = sec.getEntity().getAllSources();
        MyFactVisitor visitor = new MyFactVisitor();
        IMessageHandler msgHandler = new SavingMessageHandler();
        IProgressMonitor progMon = new NullProgressMonitor();

        for (ICompilationUnit unit : srcs) {
            IResource rsrc = unit.getResource();
            ASTNode ast;
            if (workingCopies.containsKey(rsrc)) {
                ast = (ASTNode) workingCopies.get(rsrc).getAST();
            } else {
                ast = (ASTNode) unit.getAST(msgHandler, progMon);
            }
            ast.accept(visitor);
        }
        return visitor.getValue();
    }
}

Listing A.1. Type Hierarchy Fact Generator

class MyFactTypes {
    private final TypeFactory tf = TypeFactory.getInstance();
    public final Type superTypes = createSchema();

    public MyFactTypes() {
        TypeStore ts = new TypeStore();
        Type typeName = tf.aliasType(ts, "typeName", tf.stringType());
        Type type = tf.tupleType(typeName, tf.sourceLocationType());
        this.superTypes = tf.relType(type, type);
    }
}

Listing A.2. Type Hierarchy Schemas

class MyFactVisitor implements Visitor {
    private ValueFactory vf = ValueFactory.getInstance();
    // Maps types from the compiler front-end to ITuple repn
    private Map<Type, ITuple> frontEndTypeToValue = new HashMap();
    private IWriter writer = hierType.writer(vf);

    public MyFactVisitor() { }

    public void visit(ClassDecl cd) {
        ITuple thisType = findOrCreate(cd.type());
        ITuple baseType = findOrCreate(cd.supertype());
        writer.insert(vf.tuple(thisType, baseType));
    }
    public IValue getValue() {
        return writer.done(); // signals value is completely created
    }
    private ITuple findOrCreate(Type type) {
        // look up in frontEndTypeToValue and create if missing
    }
}

Listing A.3. Type Inheritance Extraction Visitor

class TypeHierarchyView extends ViewPart {
    private final Map<ITuple, Set<ITuple>> fHier;

    public TypeHierarchyView(IRelation superTypes) {
        digest(superTypes);
    }

    private void digest(IRelation superTypes) {
        for (ITuple sup : superTypes) {
            ITuple baseType = sup.get(0);
            ITuple superType = sup.get(1);
            if (!fHier.containsKey(baseType)) {
                fHier.put(baseType, new HashSet());
            }
            fHier.get(baseType).add(superType);
        }
    }

    public void createPartControl(Composite parent) {
        Tree tree = new Tree(parent);
        TreeViewer viewer = new TreeViewer(tree);

        viewer.setLabelProvider(new HierLabelProvider());
        viewer.setContentProvider(new HierContentProvider());
        viewer.setInput(fHier);
    }
}

Listing A.4. Type Hierarchy View

class HierContentProvider implements ITreeContentProvider {
    private Map<ITuple, Set<ITuple>> fHier;

    public void inputChanged(Viewer v, Object oldInput, Object newInput) {
        fHier = (Map<ITuple, Set<ITuple>>) newInput;
    }
    public Object[] getChildren(Object o) {
        ITuple type = (ITuple) o;
        Set<ITuple> children = fHier.get(type);
        return children.toArray();
    }
    // ...
}

class HierLabelProvider implements ILabelProvider {
    public String getText(Object o) {
        ITuple type = (ITuple) o;
        return ((IString) type.get(0)).getValue();
    }
    public Image getImage(Object o) {
        return null;
    }
}

Listing A.5. Type Hierarchy View Providers

B Use-Def Code Listings

class ConstraintVisitor extends ASTVisitor {
    // traverse AST and generate constraints
    ConstraintCreator fCreator;
    List<Constraint> fConstraints = new HashSet();

    ConstraintVisitor(ConstraintCreator cc) {
        fCreator = cc;
    }

    boolean visit(ArrayAccess access) {
        fConstraints.addAll(fCreator.create(access));
    }
    boolean visit(Assignment assign) {
        fConstraints.addAll(fCreator.create(assign));
    }
    // ...
}

abstract class ConstraintCreator {
    // generate constraints for each language construct
    abstract List<Constraint> create(ArrayAccess);
    abstract List<Constraint> create(Assignment);
    abstract List<Constraint> create(ConditionalExpression);
    abstract List<Constraint> create(MethodDeclaration);
    abstract List<Constraint> create(MethodInvocation);
    // ...
}

Listing B.1. Constraint Representation

class NodeLabel extends ConstraintTerm {
    ASTNode fNode;
    NodeLabel(ASTNode node) { fNode = node; }
}
class EntryLabel extends NodeLabel { // RDentry[n]
    EntryLabel(ASTNode node) { super(node); }
    public String toString() {
        return "RD@entry[" + fNode + "]";
    }
}
class ExitLabel extends NodeLabel { // RDexit[n]
    ExitLabel(ASTNode node) { super(node); }
    public String toString() {
        return "RD@exit[" + fNode + "]";
    }
}

class DefinitionLiteral extends ConstraintTerm { // (v,n)
    IVariableBinding fVarBinding; ASTNode fLabel;
    DefinitionLiteral(IVariableBinding v) { // (v,*)
        this(v, null);
    }
    DefinitionLiteral(IVariableBinding v, ASTNode n) {
        fVarBinding = v;
        fLabel = n;
    }
    public String toString() {
        return "(" + fVarBinding + "," + fLabel + ")";
    }
}
class SubsetOperator extends ConstraintOperator { }

Listing B.2. Representation classes for reaching definitions analysis

class RDConstraintTermFactory {
    // ... implementation given separately ...
    ConstraintTerm createEntryLabel(ASTNode node);               // RDentry[n]
    ConstraintTerm createExitLabel(ASTNode node);                // RDexit[n]
    ConstraintTerm createDefinitionLiteral(IVariableBinding v,
                                           ASTNode n);           // (v,n)
    ConstraintTerm createDefinitionWildcard(IVariableBinding v); // (v,*)
}

Listing B.3. Reaching definitions constraint factory

class RDConstraintTermFactory {
    // Responsible for canonicalizing constraint terms
    Map<ASTNode, ConstraintTerm> fTermMap;

    ConstraintTerm createEntryLabel(ASTNode n) {
        ConstraintTerm t = fTermMap.get(n);
        if (t == null)
            fTermMap.put(n, t = new EntryLabel(n));
        return t;
    }

    Map<IVariableBinding, Map<ASTNode, DefinitionLiteral>> fVarMap =
        new LinkedHashMap(); // a linked hash-map for determinism

    ConstraintTerm createDefinitionLiteral(IVariableBinding b, ASTNode n) {
        Map<ASTNode, DefinitionLiteral> label2DefLit = (Map) fVarMap.get(b);

        if (label2DefLit == null)
            fVarMap.put(b, label2DefLit = new LinkedHashMap());

        DefinitionLiteral d = label2DefLit.get(n);

        if (d == null) {
            d = new DefinitionLiteral(b, n);
            label2DefLit.put(n, d);
        }
        return d;
    }
    // ... similar methods for creating other ConstraintTerm types
}

Listing B.4. Reaching definitions constraint term factory


class RDConstraintCreator extends ConstraintCreator {
    RDConstraintTermFactory fFactory;

    // convenience method
    Constraint newSubsetConstraint(ConstraintTerm l, ConstraintTerm r) {
        return new Constraint(l, r, SubsetOperator.getInstance());
    }

    //
    // one method per language construct to generate constraints
    //
    List<Constraint> create(Assignment a) {
        // ... see Listing B.6 ...
    }

    List<Constraint> create(ForStatement f) {
        // ... see Listing B.7 ...
    }

    // ... constraint generation for other language constructs ...
}

Listing B.5. Reaching definitions constraint generation class

class RDConstraintCreator extends ConstraintCreator {
    // ...
    public List<Constraint> create(Assignment assign) {
        // Restriction: only handle local variables (intraprocedural)
        Expression lhs = assign.getLeftHandSide();
        Expression rhs = assign.getRightHandSide();

        IVariableBinding varBinding = getLocalBinding(lhs);
        if (varBinding == null)
            return passThrough(assign);

        ConstraintTerm
            assignEntry = fVarFactory.createEntryLabel(assign),
            def = fVarFactory.createDefinition(varBinding, assign),
            defWild = fVarFactory.createDefinition(varBinding), // (v,*)
            rdExit = fVarFactory.createExitLabel(assign),
            diff = new ReachingDefsDifference(assignEntry, defWild);

        List<Constraint> result = new ArrayList<Constraint>();

        // (v, v=E) ⊆ RDexit[v=E]
        result.add(newSubsetConstraint(def, rdExit));
        // RDentry[v=E] \ {(v,*)} ⊆ RDexit[v=E]
        result.add(newSubsetConstraint(diff, rdExit));
        return result;
    }

    private IVariableBinding getLocalBinding(Expression lhs) {
        // if LHS isn't a simple name, it can't be a local variable
        if (lhs.getNodeType() != ASTNode.SIMPLE_NAME)
            return null;

        SimpleName name = (SimpleName) lhs;
        IBinding nameBinding = name.resolveBinding();

        // if name isn't a variable reference, ignore it
        if (nameBinding.getKind() != IBinding.VARIABLE)
            return null;

        IVariableBinding varBinding = (IVariableBinding) nameBinding;

        // if variable reference refers to a field, ignore it
        if (varBinding.isField())
            return null;

        return varBinding;
    }
}

Listing B.6. Constraint generation for assignments

class RDConstraintCreator extends ConstraintCreator {
    // ...
    public List<Constraint> create(ForStatement forStmt) {
        // Simplification: assume exactly one init expr, a condition,
        // and exactly one update expr
        Statement body = forStmt.getBody();
        Expression cond = forStmt.getExpression();
        List<Expression> inits = forStmt.initializers();
        List<Expression> updates = forStmt.updaters();
        Expression init = (Expression) inits.get(0);     // 1 init
        Expression update = (Expression) updates.get(0); // 1 update
        List<Constraint> result = new ArrayList();

        ConstraintTerm forEntry = fVarFactory.createEntryLabel(forStmt);
        ConstraintTerm forExit = fVarFactory.createExitLabel(forStmt);
        ConstraintTerm initEntry = fVarFactory.createEntryLabel(init);
        ConstraintTerm initExit = fVarFactory.createExitLabel(init);
        ConstraintTerm condEntry = fVarFactory.createEntryLabel(cond);
        ConstraintTerm condExit = fVarFactory.createExitLabel(cond);
        ConstraintTerm updateEntry = fVarFactory.createEntryLabel(update);
        ConstraintTerm updateExit = fVarFactory.createExitLabel(update);
        ConstraintTerm bodyEntry = fVarFactory.createEntryLabel(body);
        ConstraintTerm bodyExit = fVarFactory.createExitLabel(body);

        result.add(newSubsetConstraint(forEntry, initEntry));   // 1.
        result.add(newSubsetConstraint(initExit, condEntry));   // 2.
        result.add(newSubsetConstraint(condExit, bodyEntry));   // 3.
        result.add(newSubsetConstraint(bodyExit, updateEntry)); // 4.
        result.add(newSubsetConstraint(updateExit, condEntry)); // 5.
        result.add(newSubsetConstraint(condExit, forExit));     // 6.

        return result;
    }
}

Listing B.7. Constraint generation for for-loops


class ConstraintGraph {
    List<Constraint> fConstraints;
    Set<ConstraintTerm> fAllTerms;
    Map<ConstraintTerm, List<Constraint>> fEdgeMap;

    class TermDecorator implements ITermProcessor {
        Constraint fConstraint;
        void setConstraint(Constraint c) { fConstraint = c; }
        public void processTerm(ConstraintTerm term) {
            addToEdgeList(term, fConstraint);
            fAllTerms.add(term);
        }
    }
    void initialize() { // turn Constraints into graph
        TermDecorator decorator = new TermDecorator();
        for (Constraint c : getConstraints()) {
            ConstraintTerm lhs = c.getLeft();
            ConstraintTerm rhs = c.getRight();

            decorator.setConstraint(c);
            lhs.processTerms(decorator);
            rhs.processTerms(decorator);
        }
    }
}

Listing B.8. Constraint graph construction

void initializeEstimates() {
    for (ConstraintTerm t : graph.getVariables()) {
        if (t instanceof DefinitionLiteral)
            setEstimate(t, new DefinitionSet(t));
        else
            setEstimate(t, new DefinitionSet());
    }
}

Listing B.9. Initializing term estimates

void solveConstraints() {
    while (!workList.empty()) {
        ConstraintTerm t = workList.pop();
        for (Constraint c : getConstraintsInvolving(t)) {
            satisfyConstraint(c);
        }
    }
}
void satisfyConstraint(IConstraint c) {
    ConstraintTerm lhs = c.getLHS();
    ConstraintTerm rhs = c.getRHS();
    DefinitionSet lhsEst = getEstimate(lhs);
    DefinitionSet rhsEst = getEstimate(rhs);
    if (!rhsEst.containsAll(lhsEst))
        setEstimate(rhs, rhsEst.unionWith(lhsEst));
}

Listing B.10. Solving the constraints

Set<ASTNode> findRefsToDef(ASTNode def, IEstimateEnvironment reachingDefs) {
    Set<ASTNode> result = new HashSet();

    ASTNode method = getOwningMethod(def);
    SimpleName name = (SimpleName) ((Assignment) def).getLeftHandSide();

    final IVariableBinding defBinding =
        (IVariableBinding) name.resolveBinding();
    final DefinitionLiteral defLit =
        new DefinitionLiteral(defBinding, def);

    // Search AST for variable references that refer to def
    method.accept(new ASTVisitor() {
        public boolean visit(SimpleName node) {
            if (!Bindings.equals(node.resolveBinding(), defBinding))
                return false;

            DefinitionSet rds =
                reachingDefs.getEstimate(
                    fVarFactory.createEntryLabel(node));

            if (rds.contains(defLit))
                result.add(node);
            return false;
        }
    });
    return result;
}

Listing B.11. Calculating references to a given value definition

Set<ASTNode> findDefsForRef(ASTNode ref, IVariableBinding varBinding,
                            IEstimateEnvironment rds) {
    DefinitionSet defs =
        rds.getEstimate(fVariableFactory.createEntryLabel(ref));
    final Set<ASTNode> result = new HashSet();

    for (DefinitionLiteral d : defs) {
        if (Bindings.equals(varBinding, d.getVarBinding()))
            result.add(d.getLabel());
    }
    return result;
}

Listing B.12. Calculating value definitions that reach a given reference

class MarkUseDefsAction implements IWorkbenchWindowActionDelegate {
    boolean fInstalled = false;
    AbstractTextEditor fEditor;
    IDocumentListener fDocumentListener =
        new MDUDocumentListener();
    ISelectionChangedListener fSelectListener =
        new MDUSelectionListener(document);

    public void run(IAction action) {
        fEditor = (AbstractTextEditor) PlatformUI.getWorkbench().
            getActiveWorkbenchWindow().getActivePage().getActiveEditor();

        IDocument doc =
            getDocumentProvider().getDocument(getEditorInput());

        if (!fInstalled) {
            registerListeners(doc);
            fInstalled = true;
        } else {
            unregisterListeners(doc);
            fInstalled = false;
        }
    }

    void registerListeners(IDocument document) {
        getSelProvider().addSelectionChangedListener(fSelectListener);
        document.addDocumentListener(fDocumentListener);
    }
    void unregisterListeners(IDocument document) {
        getSelProvider().removeSelectionChangedListener(fSelectListener);
        document.removeDocumentListener(fDocumentListener);
    }
    ISelectionProvider getSelProvider() {
        return fEditor.getSelectionProvider();
    }
    IDocumentProvider getDocProvider() {
        return fEditor.getDocumentProvider();
    }
}

Listing B.13. Workbench action class

<extension point="org.eclipse.ui.actionSets">
    <actionSet id="demo.analysisActions"
               label="Analysis Actions"
               visible="true">
        <action
            class="demo.MarkUseDefsAction"
            icon="icons/mark_usedefs.gif"
            id="demo.markUseDefsAction"
            label="&amp;Toggle Mark Uses/Defs"
            state="false"
            style="toggle"
            toolbarPath="analysisGroup"
            tooltip="Displays uses/defs of the selected variable"/>
    </actionSet>
</extension>

Listing B.14. Registering the “Mark Use-Defs” action

class MarkDefsUseAction {
    // ...
    CompilationUnit fCompilationUnit = null; // AST cache

    // ... a nested class, since it needs access to the
    // field fCompilationUnit ...
    class MDUDocumentListener implements IDocumentListener {
        public void documentAboutToBeChanged(DocumentEvent e) {
            // ... do nothing ...
        }

        public void documentChanged(DocumentEvent event) {
            // Invalidate the AST cache so that the source
            // gets re-analyzed
            fCompilationUnit = null;
        }
    }
}

Listing B.15. Use-Defs document listener


class MarkDefsUseAction {
    // ...

    // ... a nested class, since it needs access to the
    // field fCompilationUnit ...
    class MDUSelectionListener implements ISelectionChangedListener {
        private final IDocument fDocument;

        private MDUSelectionListener(IDocument document) {
            fDocument = document;
        }

        public void selectionChanged(SelectionChangedEvent e) {
            ISelection selection = e.getSelection();

            if (selection instanceof ITextSelection) {
                ITextSelection textSel = (ITextSelection) selection;

                int offset = textSel.getOffset();
                int length = textSel.getLength();

                recomputeAnnotationsForSelection(offset, length, fDocument);
            }
        }
    }
}

Listing B.16. Use-Defs selection listener

class MarkDefsUseAction {
    // ...
    void recomputeAnnotationsForSelection(int offset, int length,
                                          IDocument document) {
        IAnnotationModel annotationModel =
            fDocumentProvider.getAnnotationModel(getEditorInput());

        // Get AST for the editor doc & find the selected ASTNode
        // The following uses the JDT's ASTParser class to parse
        // if needed.
        CompilationUnit cu = getCompilationUnit();
        ASTNode selNode = NodeFinder.perform(cu, offset, length);

        // Call the analyzer described earlier
        UseDefAnalyzer uda = new UseDefAnalyzer(cu);
        Set<ASTNode> usesDefs = uda.findUsesDefsOf(selNode);

        // Convert ASTNodes to document positions (offset/length)
        Position[] positions = convertNodesToPositions(usesDefs);

        submitAnnotations(
            convertPositionsToAnnotationMap(positions, document),
            annotationModel);
    }
}

Listing B.17. Computing the annotations for a given selection

class MarkDefsUseAction {
    // ...
    Map<Annotation, Position>
    convertPositionsToAnnotationMap(Position[] positions, IDocument document) {
        Map<Annotation, Position> posMap =
            new HashMap(positions.length);

        // map each position into an Annotation object
        for (int i = 0; i < positions.length; i++) {
            Position pos = positions[i];

            try {
                // Create Annotation consisting of source text itself
                String message = document.get(pos.offset, pos.length);

                posMap.put(new Annotation("demo.useDefAnnotation",
                                          false, message),
                           pos);
            } catch (BadLocationException ex) {
                // This should never happen; positions are from AST!
                continue;
            }
        }
        return posMap;
    }
}

Listing B.18. Converting positions to Annotations

class MarkDefsUseAction {
    // ...
    void submitAnnotations(Map<Annotation, Position> annoMap,
                           IAnnotationModel annModel) {
        Object lockObject = getLockObject(annModel);

        synchronized (lockObject) {
            if (annModel instanceof IAnnotationModelExtension) {
                // THE EASY WAY with the more functional API
                IAnnotationModelExtension ame =
                    (IAnnotationModelExtension) annModel;

                ame.replaceAnnotations(fOldAnnotations, annoMap);
            } else {
                // THE HARD WAY: remove existing annotations one
                // by one, and add new annotations one by one
                removeExistingOccurrenceAnnotations();

                for (Map.Entry<Annotation, Position> e : annoMap.entrySet()) {
                    annModel.addAnnotation(e.getKey(), e.getValue());
                }
            }
        }
    }
}

Listing B.19. Submitting annotations to the editor’s annotation model

C Effects-Based Refactoring Code Listings

public class EffectsVisitor extends NodeVisitor {
    private final Map<Node, Effect> fEffects = new HashMap();

    public Node leave(Node parent, Node old, Node n, NodeVisitor v) {
        Effect result = null;
        if (old instanceof Async) {
            Async async = (Async) old;
            result = computeEffect(async);
        } else if (old instanceof Unary) {
            result = computeEffect((Unary) old);
        } else if (old instanceof Binary) {
            result = computeEffect((Binary) old);
        } else if (old instanceof Call) {
            result = Effects.makeBottomEffect();
        } else if (old instanceof LocalAssign) {
            result = computeEffect((LocalAssign) old);
        } else if (old instanceof ArrayAssign) {
            result = computeEffect((ArrayAssign) old);
        } else if (old instanceof FieldAssign) {
            result = computeEffect((FieldAssign) old);
        } else if (old instanceof Block) {
            result = computeEffect((Block) old);
        } else if (old instanceof ForLoop) {
            result = computeEffect((ForLoop) old);
        } else if (old instanceof If) {
            result = computeEffect((If) old);
        } else if (old instanceof Field) {
            result = computeEffect((Field) old);
        } else if (old instanceof Local) {
            result = computeEffect((Local) old);
        } else if (old instanceof LocalDecl) {
            result = computeEffect((LocalDecl) old);
        }
        fEffects.put(old, result);
        return super.leave(parent, old, n, v);
    }
    // ...
}

Listing C.1. Effects visitor

private Effect followedBy(Effect e1, Effect e2) {
    if (e1 == null) return e2;
    if (e2 == null) return e1;
    return e1.followedBy(e2, fMethodContext);
}

Listing C.2. Utility methods for effects computation

private Effect computeEffect(LocalAssign la) {
    Effect result = null;
    Local l = la.local();
    Expr rhs = la.right();

    if (isMutable(l)) {
        Effect rhsEff = fEffects.get(rhs);
        result = rhsEff;
    } else {
        Effect rhsEff = fEffects.get(rhs);
        Effect writeEff = Effects.makeEffect(Effects.FUN);
        writeEff.addWrite(Effects.makeLocalLoc(l));
        result = followedBy(rhsEff, writeEff);
    }
    return result;
}

private Effect computeEffect(FieldAssign fa) {
    Effect result = null;
    Receiver target = fa.target();
    Expr rhs = fa.right();

    if (isMutable(f)) {
        Effect rhsEff = fEffects.get(rhs);
        Effect writeEff = Effects.makeEffect(Effects.FUN);
        writeEff.addWrite(Effects.makeFieldLoc(target, fi));
        result = followedBy(rhsEff, writeEff);
    } else {
        return Effects.makeBottomEffect();
    }
    return result;
}

Listing C.3. Computing the effects of assignments

private Effect computeEffect(Local local) {
    Effect result;

    if (isMutable(local.localInstance())) {
        // ignore "effects" on immutable variables
        result = null;
    } else {
        result = Effects.makeEffect(Effects.FUN);
        result.addRead(Effects.makeLocalLoc(local));
    }
    return result;
}

private Effect computeEffect(Field field) {
    Effect result = Effects.makeEffect(Effects.FUN);

    result.addRead(Effects.makeFieldLoc(field.target(), field));
    return result;
}

Listing C.4. Computing the effects of references

private Effect computeEffect(Unary unary) {
    Effect result;
    Expr opnd = unary.expr();
    Operator op = unary.operator();
    Effect opndEff = fEffects.get(opnd);

    if (op == Unary.BIT_NOT || op == Unary.NOT ||
        op == Unary.NEG || op == Unary.POS) {
        result = opndEff;
    } else {
        // one of the unary inc/dec ops
        Effect write = Effects.makeEffect(Effects.FUN);
        write.addAtomicInc(opnd);

        if (op == Unary.POST_DEC || op == Unary.POST_INC) {
            result = opndEff.followedBy(write);
        } else {
            result = write.followedBy(opndEff);
        }
    }
    return result;
}

private Effect computeEffect(Binary binary) {
    Effect result;
    Expr lhs = binary.left();
    Expr rhs = binary.right();
    Effect lhsEff = fEffects.get(lhs);
    Effect rhsEff = fEffects.get(rhs);

    result = followedBy(lhsEff, rhsEff);
    return result;
}

Listing C.5. Computing the effect of expressions

private Effect computeEffect(If n) {
    Effect condEff = fEffects.get(n.cond());
    Effect thenEff = fEffects.get(n.consequent());
    Effect elseEff = (n.alternative() != null) ?
        fEffects.get(n.alternative()) : null;

    return followedBy(followedBy(condEff, thenEff), elseEff);
}

private Effect computeEffect(ForLoop forLoop) {
    Effect bodyEff = fEffects.get(forLoop.body());
    // Abstract any effects involving the loop induction var
    return bodyEff.forall(forLoop.formal());
}

Listing C.6. Computing the effect of control-flow statements

private Effect computeEffect(Block b) {
    Effect result = null;
    // aggregate effects of the individual statements.
    // prune effects on local vars whose scope is this block.
    List<LocalDecl> blockDecls = collectDecls(b);
    for (Stmt s : b.statements()) {
        Effect stmtEffect = fEffects.get(s);
        Effect filteredEffect =
            removeLocalVarsFromEffect(blockDecls, stmtEffect);
        result = followedBy(result, filteredEffect);
    }
    return result;
}

private Effect removeLocalVarsFromEffect(List<LocalDecl> decls,
                                         Effect effect) {
    Effect result = effect;
    for (LocalDecl ld : decls) {
        if (isMutable(ld)) {
            Expr init = ld.init();
            result = result.exists(Effects.makeLocalLoc(localName), init);
        } else {
            result = result.exists(Effects.makeLocalLoc(localName));
        }
    }
    return result;
}

private List<LocalDecl> collectDecls(Block b) {
    List<LocalDecl> result = new LinkedList<LocalDecl>();
    for (Stmt s : b.statements()) {
        if (s instanceof LocalDecl) {
            result.add((LocalDecl) s);
        }
    }
    return result;
}

Listing C.7. Computing the effect of a block

public class LoopFlatParallelizationRefactoring
    extends X10RefactoringBase {

  private ForLoop fLoop;

  public LoopFlatParallelizationRefactoring(ITextEditor editor) {
    super(editor);
  }

  public RefactoringStatus checkInitialConditions(IProgressMonitor pm)
      throws CoreException, OperationCanceledException {
    if (fSourceAST == null) {
      return createFatalErrorStatus("syntax errors");
    }
    // N.B.: fSelNodes set by superclass ctor from selection
    if (fSelNodes.size() != 1) {
      return createFatalErrorStatus("Select a loop statement.");
    }
    Node node= fSelNodes.get(0);
    if (!(node instanceof ForLoop)) {
      return createFatalErrorStatus("Must select a for-loop.");
    }
    fLoop = (ForLoop) node;


    fPathComputer = new NodePathComputer(fSourceAST, fLoop);
    fContainingMethod= fPathComputer.findEnclosingNode(fLoop,
        MethodDecl.class);

    if (loopHasAsync(fLoop)) {
      return createFatalErrorStatus("Loop body is already " +
          "contained within an async.");
    }
    return createOkStatus();
  }
  // ...
}

Listing C.8. Loop-flat parallelization initial precondition checking

public RefactoringStatus checkFinalConditions(IProgressMonitor pm)
    throws CoreException, OperationCanceledException {
  try {
    Stmt loopBody = fLoop.body();
    Formal loopVar = fLoop.formal();

    EffectsVisitor effVisitor = new EffectsVisitor();
    loopBody.visit(effVisitor);
    Effect bodyEff = effVisitor.getEffectFor(loopBody);

    boolean commutes;
    Set<Pair<Effect, Effect>> interference = null;

    commutes = bodyEff.commutesWithForall(loopVar);

    if (!commutes) {
      // Compute the set of interfering effects
      interference = bodyEff.interferenceWithForall(loopVar);
    }
    if (!commutes) {
      if (bodyEff == Effects.BOTTOM_EFFECT) {
        return createErrorStatus("Can't prove loop body commutes.");
      } else {
        fConsoleStream.println("These effects don't commute:");
        for (Pair<Effect, Effect> p : interference) {
          fConsoleStream.println(p.fst + " and " + p.snd);
        }
        Pair<Effect, Effect> first = interference.iterator().next();
        return createErrorStatus("Loop body does not commute, " +
            "e.g. " + first.fst + " and " + first.snd);
      }
    }
    return createOkStatus();
  } catch (Exception e) {
    return createFatalStatus("Exception occurred in analysis: " +
        e.getMessage());
  }
}

Listing C.9. Loop-flat parallelization detailed precondition checking


public Change createChange(IProgressMonitor pm)
    throws CoreException, OperationCanceledException {
  CompositeChange outerChange = new CompositeChange("Loop Flat");
  TextFileChange tfc =
      new TextFileChange("Add 'async' to loop body", fSourceFile);

  tfc.setEdit(new MultiTextEdit());

  createAddAsyncChange(tfc);

  if (!loopIsWrappedWithFinish()) {
    createAddFinishChange(tfc);
  }
  outerChange.add(tfc);

  fFinalChange = outerChange;
  return fFinalChange;
}

private void createAddAsyncChange(TextFileChange tfc) {
  int asyncOffset = fLoop.body().position().offset();
  tfc.addEdit(new InsertEdit(asyncOffset, "async "));
}

private void createAddFinishChange(TextFileChange tfc) {
  int forStart = fLoop.position().offset();
  tfc.addEdit(new InsertEdit(forStart, "finish "));
}

private boolean loopHasAsync(ForLoop loop) {
  Stmt loopBody = loop.body();

  if (loopBody instanceof Async) {
    return true;
  }
  if (loopBody instanceof Block) {
    Block block = (Block) loopBody;
    List<Stmt> blockStmts = block.statements();

    if (blockStmts.size() == 1 &&
        blockStmts.get(0) instanceof Async) {
      return true;
    }
  }
  return false;
}

Listing C.10. Loop-flat parallelization Change object creation


D Type Smells Code Listings

<extension point="org.eclipse.imp.smelldetector.detectors">
  <detector name="Overly-Specific Variable"
            class="org.smellsrus.OverlySpecificVariable"/>
</extension>

Listing D.1. Example Smell Detector Extension

class OverlySpecificDetector extends SmellDetectorBase
    implements IUnitSmellDetector, IProjectSmellDetector {

  void unitBegin(CompilationUnit unitAST, ICompilationUnit unit,
                 IFile file) {
    OverlySpecificAnalyzer analyzer =
        new OverlySpecificAnalyzer(unit);

    Map<ICompilationUnit,Map<ConstraintTerm,TypeSet>> unitMap =
        analyzer.computeOverlySpecificVariables();

    for(ICompilationUnit icu: unitMap.keySet()) {
      Map<ConstraintTerm,TypeSet> termMap = unitMap.get(icu);
      // Each entry in termMap is an overly-specific variable
      for(ConstraintTerm t: termMap.keySet()) {
        TypeSet ts = termMap.get(t);
        IMarker m = createMarker(file,
            t.toString() + " could be " + ts.enumerate(),
            "demo.overlySpecificVar", // SMELL_KIND
            ...);
        // Crude: pick any upper bound in the result TypeSet
        m.setAttribute(NEW_TYPE,
            ts.getUpperBound().anyMember().getQualifiedName());
      }
    }
  }
}

Listing D.2. Unit Detector Implementation

class OverlySpecificAnalyzer {
  Map<ICompilationUnit, Map<ConstraintTerm, TypeSet>>
      computeOverlySpecificVariables() {
    collectConstraints();
    solveConstraints();

    Map<ICompilationUnit, Map<ConstraintTerm, TypeSet>>
        unitMap = new HashMap();

    // Examine estimates to determine which variables are
    // more specific than necessary.
    for(ConstraintTerm n: constraintGraph.getNodes()) {


      TypeSet est = getEstimate(n);

      // If the declared type is more specific than
      // necessary, add the variable to the result map.
      if (estimateMoreGeneralThanDecl(est, n)) {
        ICompilationUnit icu = n.getCompilationUnit();
        Map<ConstraintTerm,TypeSet> termMap =
            getOrMakeEntry(unitMap, icu);

        termMap.put(n, est);
      }
    }
    return unitMap;
  }
}

Listing D.3. Overly-specific Analyzer

// For scalability's sake: save just enough information
// to locate the corresponding AST node, but don't save
// the AST node itself.
class ParameterVariable extends ConstraintTerm {
  ICompilationUnit fCU;
  String fMethodKey;
  int fParamIdx;

  ParameterVariable(IMethodBinding method, int idx,
                    ICompilationUnit cu) {
    fCU= cu;
    fMethodKey= method.getKey();
    fParamIdx= idx;
  }
}

class ReturnVariable extends ConstraintTerm {
  ICompilationUnit fCU;
  String fMethodKey;

  ReturnVariable(IMethodBinding method,
                 ICompilationUnit cu) {
    fCU= cu;
    fMethodKey= method.getKey();
  }
}

Listing D.4. Type constraint term classes

class TypeConstraintTermFactory implements ConstraintTermFactory {
  // ``Canonicalize'' constraint terms, e.g.:
  // Flow insensitive => all refs to a given variable map to


  //   the same ConstraintTerm
  // Flow sensitive => each ref to a given variable maps to
  //   a different ConstraintTerm
  Map<Object, ConstraintTerm> fCTMap;

  ConstraintTerm createExpressionVariable(Expression e) { // [e]
    Object key;
    switch (e.getNodeType()) {
      case ASTNode.NAME:
      case ASTNode.FIELD_ACCESS:
        // Flow insensitive: all references map to the same
        // ConstraintTerm, so use the binding as the key.
        key = e.resolveBinding();
        break;
      default:
        // Any other Expression gets a unique ConstraintTerm;
        // use its source location as the key.
        key = new CompilationUnitRange(e);
        break;
    }
    ConstraintTerm t = fCTMap.get(key);
    if (t == null)
      fCTMap.put(key, t = new ExpressionVariable(e));
    return t;
  }
  // ... similar methods, creating other ConstraintTerm types ...
  ConstraintTerm createTypeVariable(Type T) { ... }             // T
  ConstraintTerm createDeclaringTypeVariable(IBinding b) {      // Decl[b]
    ...
  }
  ConstraintTerm createParamVariable(IMethodBinding m, int i) { // [Param(m, i)]
    ...
  }
  ConstraintTerm createReturnVariable(IMethodBinding m) {       // [m]
    ...
  }
}

Listing D.5. Type constraint term factory

class TypeConstraintCreator {
  ConstraintTermFactory fFactory;

  List<Constraint> create(Assignment a) { // [rhs] <= [lhs]
    return new Constraint(
        fFactory.createExpressionVariable(a.getRHS()),
        TypeOperator.Subtype,
        fFactory.createExpressionVariable(a.getLHS()));
  }

  List<Constraint> create(MethodInvocation inv) {
    List<Constraint> result = new List<Constraint>();


    IMethodBinding method = inv.resolveBinding();
    ITypeBinding methodOwner = method.getDeclaringType();
    List<Expression> args = method.getArguments();

    // [rcvr] <= Decl[method]
    result.add(new Constraint(
        fFactory.createExprVariable(inv.getReceiver()),
        TypeOperator.Subtype,
        fFactory.createDeclTypeVariable(methodOwner)));

    // [arg #i] <= [Param(method, i)]
    for(int i=0; i < args.size(); i++)
      result.add(new Constraint(
          fFactory.createExpressionVariable(args.get(i)),
          TypeOperator.Subtype,
          fFactory.createParmVariable(method, i)));

    // [rcvr.m(...)] = [M]
    return result;
  }

  List<Constraint> create(MethodDeclaration d) {
    /* preserve override relationships, ... */
  }
  //
}

Listing D.6. Type constraint generation

class ConstraintSolver {
  void initializeTypeEstimates() {
    for(ConstraintTerm t: graph.getNodes()) {
      if (t instanceof ExpressionVariable) {
        if (t is a ctor call, literal, or cast)
          setEstimate(t, t.getDeclaredType());
        else
          setEstimate(t, TypeUniverse.instance());
      } else if (t.isConstantType()) {
        setEstimate(t, t.getDeclaredType());
      } else if (t.isBinaryMember()) {
        // don't report smells on a non-source class
        setEstimate(t, t.getDeclaredType());
      } else {
        // let the inferencer figure out the right type
        setEstimate(t, TypeUniverse.instance());
      }
    }
  }
  // ...
}

Listing D.7. Type estimate initialization


class ConstraintSolver {
  // ...
  void solveConstraints() {
    while (!workList.empty()) {
      ConstraintTerm t = workList.pop();
      for(c: getConstraintsInvolving(t)) {
        lhs = c.getLHS();
        rhs = c.getRHS();
        if (c.getOperator().isSubtype())
          enforceSubtype(lhs, rhs);
        else if (c.getOperator().isEquals())
          unify(lhs, rhs);
      }
    }
  }

  void enforceSubtype(ConstraintTerm lhs, ConstraintTerm rhs) {
    TypeSet lhsEst = getEstimate(lhs);
    TypeSet rhsEst = getEstimate(rhs);
    TypeSet lhsSuper = lhsEst.superTypes();
    TypeSet rhsSub = rhsEst.subTypes();
    if (!rhsSub.containsAll(lhsEst))
      setEstimate(lhs, lhsEst.xsectWith(rhsSub));
    if (!lhsSuper.contains(rhsEst))
      setEstimate(rhs, rhsEst.xsectWith(lhsSuper));
  }
}

Listing D.8. Type constraint solution

class OverlySpecificResolutionGenerator
    extends ResolutionGeneratorBase
{
  public IMarkerResolution[] getResolutions(IMarker m) {
    IMarkerResolution resolution = new OverlySpecificResolution();

    return new IMarkerResolution[] { resolution };
  }
}

Listing D.9. Smell resolution generator

<extension point="org.eclipse.ui.ide.markerResolution">
  <markerResolutionGenerator
      markerType="org.eclipse.imp.smells.smellMarker"
      class="org.smellsrus.OverlySpecificResolutionGenerator">
    <attribute
        name="smellType"
        value="demo.overlySpecificVar">


    </attribute>
  </markerResolutionGenerator>
</extension>

Listing D.10. Resolution generator extension

class OverlySpecificResolution extends MarkerResolutionBase {
  public String getLabel() {
    return "Make type as general as possible";
  }

  public void run(IMarker m) {
    IFile file = (IFile) m.getResource();
    ICompilationUnit icu = getCUForFile(file);
    CompilationUnit astUnit = createASTForCU(icu);

    ASTNode typeNode = findASTNodeForMarker(m);
    ASTRewrite rewriter = ASTRewrite.create(typeNode.getAST());

    String newTypeStr = (String) m.getAttribute(NEW_TYPE);
    Name newTypeName = ASTNodeFactory.newName(ast, newTypeStr);
    Type newTypeNode = ast.newSimpleType(newTypeName);

    rewriter.replace(typeNode, newTypeNode);
    performRewrite(file, rewriter);
  }
}

Listing D.11. Smell resolution

References

1. Allen, R., Callahan, D., Kennedy, K.: Automatic decomposition of scientific programs for parallel execution. In: Proceedings of the 14th ACM Symposium on Principles of Programming Languages, POPL 1987, pp. 63–76. ACM Press (1987)

2. Andersen, O.: Program Analysis and Specialization for the C Programming Language. Ph.D. thesis, University of Copenhagen, Copenhagen, Denmark (1994)

3. Banzi, M.: Getting Started with Arduino. Make: Books, vol. 11 (2008)

4. Berndl, M., Lhotak, O., Qian, F., Hendren, L., Umanee, N.: Points-to analysis using BDDs. In: Proceedings of the 2003 ACM Conference on Programming Language Design and Implementation, PLDI 2003, pp. 103–114. ACM, New York (2003)

5. Boyland, J.: Checking Interference with Fractional Permissions. In: Cousot, R. (ed.) SAS 2003. LNCS, vol. 2694, pp. 1075–1075. Springer, Heidelberg (2003)

6. Bravenboer, M., Kalleberg, K.T., Vermaas, R., Visser, E.: Stratego/XT 0.17. A language and toolset for program transformation. Science of Computer Programming 72(1-2), 52–70 (2008)

7. Charles, P., Fuhrer, R.M., Sutton Jr., S.M., Duesterwald, E., Vinju, J.J.: Accelerating the creation of customized, language-specific IDEs in Eclipse. In: OOPSLA, pp. 191–206 (2009)


8. Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., von Praun, C., Sarkar, V.: X10: an object-oriented approach to non-uniform cluster computing. SIGPLAN Not. 40, 519–538 (2005)

9. Friedman, D.P., Wand, M., Haynes, C.T.: Essentials of Programming Languages, 2nd edn. MIT Press, Cambridge (2001)

10. Eclipse, http://www.eclipse.org/

11. Eclipse Java Development Tools, http://www.eclipse.org/jdt/

12. Ekman, T., Hedin, G.: The JastAdd extensible Java compiler. In: Proceedings of the 2007 ACM Conference on Object-Oriented Programming Systems, Languages and Applications, OOPSLA 2007, pp. 1–18. ACM, New York (2007)

13. Fink, S.J., Knobe, K., Sarkar, V.: Unified Analysis of Array and Object References in Strongly Typed Languages. In: Palsberg, J. (ed.) SAS 2000. LNCS, vol. 1824, pp. 155–174. Springer, Heidelberg (2000)

14. Fowler, M.: Refactoring: Improving the Design of Existing Code. Addison-Wesley (1999)

15. Greenhouse, A., Boyland, J.: An Object-Oriented Effects System. In: Guerraoui, R. (ed.) ECOOP 1999. LNCS, vol. 1628, pp. 668–668. Springer, Heidelberg (1999)

16. Hedin, G.: Incremental Semantic Analysis. Ph.D. thesis, Lund University, Lund, Sweden (1992)

17. Heering, J., Klint, P., Rekers, J.: Lazy and incremental program generation. ACM Trans. Program. Lang. Syst. 16(3), 1010–1023 (1994)

18. Heintze, N., Tardieu, O.: Ultra-fast aliasing analysis using CLA: a million lines of C code in a second. In: Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, PLDI 2001, pp. 254–263. ACM, New York (2001)

19. Kats, L.C.L., Visser, E.: The Spoofax language workbench. Rules for declarative specification of languages and IDEs. In: Rinard, M. (ed.) Proceedings of the 2010 ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2010, Reno, NV, USA, pp. 444–463 (October 2010)

20. Klint, P., van der Storm, T., Vinju, J.J.: Rascal: A domain specific language for source code analysis and manipulation. In: Ninth IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), pp. 168–177. IEEE Computer Society (2009)

21. Kushner, D.: The Making of Arduino. IEEE Spectrum, 1–2 (2011)

22. Maddox, W.H.: Incremental static semantic analysis. Ph.D. thesis, University of California at Berkeley, Berkeley, CA, USA (1998), UMI Order No. GAX98-03284

23. McKinley, K.S., Carr, S., Tseng, C.W.: Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems 18(4), 424–453 (1996)

24. Milanova, A., Rountev, A., Ryder, B.G.: Parameterized object sensitivity for points-to analysis for Java. ACM Trans. Softw. Eng. Methodol. 14(1), 1–41 (2005)

25. Morgenthaler, J.D.: Static analysis for a software transformation tool. Ph.D. thesis, University of California at San Diego, La Jolla, CA, USA (1998), UMI Order No. GAX98-04509

26. Nielson, F., Nielson, H.R., Hankin, C.: Principles of Program Analysis. Springer (2005)

27. Nystrom, N., Clarkson, M.R., Myers, A.C.: Polyglot: An Extensible Compiler Framework for Java. In: Hedin, G. (ed.) CC 2003. LNCS, vol. 2622, pp. 138–152. Springer, Heidelberg (2003)

28. Palsberg, J., Schwartzbach, M.I.: Object-oriented type inference. In: Conference Proceedings on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA 1991, pp. 146–161. ACM, New York (1991)

29. Pouchet, L.N., Bastoul, C., Cohen, A., Cavazos, J.: Iterative optimization in the polyhedral model: part II, Multidimensional Time. Conference on Programming Language Design and Implementation 43(6), 90–100 (2008)


30. Pouchet, L.N., Bondhugula, U., Bastoul, C., Cohen, A., Ramanujam, J., Sadayappan, P., Vasilache, N.: Loop transformations: convexity, pruning and optimization. In: Proceedings of the 38th Annual ACM Symposium on Principles of Programming Languages, POPL 2011, vol. 46, pp. 549–562. ACM (2011)

31. Reps, T., Teitelbaum, T., Demers, A.: Incremental context-dependent analysis for language-based editors. ACM Trans. Program. Lang. Syst. 5(3), 449–477 (1983)

32. Steensgaard, B.: Points-to analysis in almost linear time. In: Proceedings of the 23rd ACM Symposium on Principles of Programming Languages, POPL 1996, pp. 32–41. ACM, New York (1996)

33. Tip, F., Fuhrer, R.M., Kiezun, A., Ernst, M.D., Balaban, I., Sutter, B.D.: Refactoring using type constraints. ACM Trans. Program. Lang. Syst. 33, 9:1–9:47 (2011)


Differencing UML Models: A Domain-Specific vs. a Domain-Agnostic Method

Rimon Mikhaiel1, Nikolaos Tsantalis2, Natalia Negara1, Eleni Stroulia1, and Zhenchang Xing3

1 Computing Science Department, University of Alberta, Edmonton, AB, T6G 2E8, Canada
2 Department of Computer Science and Software Engineering, Concordia University, Montreal, QC, H3G 1M8, Canada
3 Department of Computer Science, School of Computing, National University of Singapore, 13 Computing Drive, Singapore 117417
{rimon,negara,stroulia}@ualberta.ca, [email protected], [email protected]

Abstract. Comparing software artifacts to identify their similarities and differences is a task ubiquitous in software engineering. Logical-design comparison is particularly interesting, since it can serve multiple purposes. When comparing the as-intended vs. the as-implemented designs, one can evaluate implementation-to-design conformance. When comparing newer code versions against earlier ones, one may better understand the development process of the system, recognize the refactorings it has gone through and the qualities motivating them, and infer high-order patterns in its history. Given its importance, design differencing has been the subject of much research and a variety of algorithms have been developed to compare different types of software artifacts, in support of a variety of different software-engineering activities. Our team has developed two different algorithms for differencing logical-design models of object-oriented software. Both algorithms adopt a similar conceptual model of UML logical designs (as containment trees); however, one of them is heuristic whereas the other relies on a generic tree-differencing algorithm. In this paper, we describe the two approaches and we compare them on multiple versions of an open-source software system.

Keywords: UML, software differencing, software evolution.

1 Introduction

Differencing of software artifacts is a task essential to a variety of software-engineering activities, and a multitude of its instances can be found in a range of well-recognized areas of software-engineering research. Alternative designs are compared to each other in order to recognize their differences and assess their relative merits. Design models are compared against code in order to evaluate implementation-to-design conformance. Newer code versions are compared against earlier ones when submitted to a shared repository, in order to recognize potentially conflicting edits and properly merge them. Code fragments are compared against each other to recognize


“clones”, i.e., lexically and syntactically similar code that could potentially be abstracted into a single “named” and reusable code structure. Component interfaces are matched against queries in order to enable the discovery and selection of reusable components.

In this paper, we explore the problem of object-oriented design differencing. Recognizing the differences between two object-oriented designs is essential for the following tasks.

(a) To understand (at a high level of abstraction) the evolution between two versions of a software system, one may reverse engineer the corresponding design versions and compare the as-implemented design of the software.

(b) To analyze the long-term evolution of a system and its constituent components and recognize interesting restructuring and expansion phases, one can repeatedly perform the above analysis over a sequence of subsequent software versions.

(c) To recognize the progress of the development team towards implementing the software design, one may again reverse engineer the design of the code base and compare it against the intended design of the system.

(d) Finally, to merge out-of-sync versions of software, one has to compare the merge candidates.

We have chosen to focus on design-level differencing for several reasons. First, design provides a high-level, yet information-rich, abstraction of the software implementation, essential for understanding complex systems. Second, there is a standard representation of object-oriented design (namely UML and XMI), which is available in the context of many development environments, thus enabling the study of our methods and tools in a broad range of contexts. Third, high-level abstraction makes possible the comparison of design documents (high-level description) against source code (low-level implementation). Finally, by adopting logical UML models as the underlying representation of the artifacts to be compared, we have the option of expanding our study to other types of UML models representing requirements (use cases), dynamic behaviors (sequence diagrams) and physical architecture (component diagrams).

Having committed to a particular representation of software, the question becomes how to design an algorithm for comparing instances of this representation. In principle, there are two different methodological approaches to addressing this question. On one hand, one can design an algorithm specific to the adopted representation, aware of the semantics of the modeled elements. The advantages of such domain-specific approaches are that they usually produce intuitive results, since the understanding of the representation semantics is “embedded” in the algorithm design, and their process is usually straightforward to follow and explain. Their major disadvantage is that they are not easy to generalize or to migrate to other representations of software. The alternative is to map the software representation to a more abstract representation (such as strings, trees, or graphs) for which differencing algorithms already exist and to configure these more general algorithms to somehow take into account the semantics of the domain. This approach is clearly more generalizable than domain-specific methods, since one can imagine multiple mappings of the same algorithm to multiple software representations; however it is likely to suffer from unintuitive results since the


complex semantics of the domain have to be abstracted into a small set of elements and their relations.

Our team has been exploring these two alternative methodologies in the context of the PhD theses of Zhenchang Xing [28] and Rimon Mikhaiel [10]. In this paper, we describe in detail VTracker, and summarize our understanding of the relative advantages and disadvantages of the two algorithms through an extensive comparison on multiple versions of an open-source system. The paper is organized as follows. We first present UMLDiff, a domain-specific algorithm for differencing UML class models. Next, we discuss VTracker, an extended tree-differencing algorithm that can be systematically configured with a domain-specific cost function in order to be applied to tree-like representations in different domains. In presenting the two algorithms, we comparatively discuss their workflow and assumptions with respect to the cost functions they use to compare the various software elements. Next, we present an extensive experiment where both algorithms have been applied to recognize the changes that occurred in multiple successive versions of an open-source system in order to compare their accuracy, efficiency and scalability. Finally, we review a set of use cases where this type of differencing can be applied for a variety of maintenance activities.

2 UMLDiff

UMLDiff is an algorithm designed to compare software systems, in terms of their UML logical models. The algorithm takes as input two directed graphs, G(V, E), corresponding to the models of the systems to be compared. The vertex set, V, of each such graph contains the elements of the system’s UML logical model; the edge set, E, contains the relations among them. The model elements and the possible relations among them are shown in Tables 1 and 2.

Given two versions of a software system and the graphs G1(V1, E1) and G2(V2, E2), corresponding to their UML logical-design models, UMLDiff essentially maps the two model graphs by computing the intersection and difference sets between (V1, V2) and (E1, E2). More specifically, (V1–V2) and (E1–E2) are the sets of removed model elements and relations, (V1∩V2) and (E1∩E2) are the sets of the mapped elements and relations, and (V2–V1) and (E2–E1) are the sets of the added model elements and relations.
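For illustration only, this set arithmetic can be sketched in a few lines of Java; the generic element type T, the class name ModelDelta, and the use of plain object equality are placeholders for UMLDiff's element-mapping machinery described in Section 2.1:

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch (not UMLDiff's implementation): once element identity
// across the two versions is established, the removed, added and mapped sets
// are plain set differences and intersections.
class ModelDelta<T> {
    final Set<T> removed; // V1 - V2
    final Set<T> added;   // V2 - V1
    final Set<T> mapped;  // V1 ∩ V2

    ModelDelta(Set<T> v1, Set<T> v2) {
        removed = new HashSet<>(v1);
        removed.removeAll(v2);
        added = new HashSet<>(v2);
        added.removeAll(v1);
        mapped = new HashSet<>(v1);
        mapped.retainAll(v2);
    }
}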

UMLDiff is a heuristic tree-differencing algorithm, relying on the fact that the composition relations (see Table 3) induce a spanning tree on the directed graph of the system’s UML logical model. The UML semantics guarantees that all model elements can be visited by traversing the containment hierarchy, starting from the top-level subsystem (corresponding to the system as a whole), and that the children of their containing parent are unique in terms of their names. There are four logical levels in which all types of model elements belong (see Table 3): subsystem (including the top-level subsystem) > package > (class, interface) > (attribute, operation). Note that the model elements of type subsystem, package, class and interface may contain same-type elements.


Table 1. Types of Elements in the UML Logical Model

Metaclass <<Stereotype>>: Description

Subsystem: A subsystem is a grouping of model elements.
Package: A package is a grouping of model elements (Java specific).
Class: A class declares a collection of attributes, operations and methods, to describe the structure and behavior of a set of objects; it acts as the namespace for various elements defined within its scope, i.e. classes and interfaces.
Interface: An interface is a named set of operations that characterize the behavior of an element.
DataType: A data type is a type whose values have no identity.
Attribute: An attribute is a named piece of the declared state of a classifier, which refers to a static feature of a model element. An attribute may have an initValue specifying the value of the attribute upon initialization.
Operation <<create>>, <<initialize>>: An operation is a service that can be requested from an object to effect behavior, which refers to a dynamic feature of a model element.
Method: A method is the implementation of an operation.
Parameter: A parameter is a declaration of an input/output argument of an operation.
Exception: An exception is a signal raised by an operation.
Reception: A reception is a behavioral feature; the classifier containing the feature reacts to the signal designated by the reception feature.

Table 2. Types of Relations among the Elements of a UML Logical Model

Metaclass <<Stereotype>>: Description

Generalization: A generalization is a taxonomic relation between a more general element (parent) and a more specific element (child).
Abstraction <<realize>>: An abstraction is a dependency relation; it relates two (sets of) elements representing the same concept.
Usage <<call>>, <<send>>, <<instantiate>>, <<read>>, <<write>>: A usage is a dependency relation in which one element requires another element (or set of elements) for its full implementation or operation.
Association: An association is a declaration of a semantic relation between classifiers that can be of three different kinds: 1) ordinary association, 2) composite aggregate, and 3) shareable aggregate.

Table 3. Composition Relations over the Elements of the UML Logical Models

Element type: Types of the element's children

Top-level Subsystem: Subsystem and Package; ProgrammingLanguageDataType; Class and Interface whose isFromModel=false
Subsystem: Subsystem and Package
Package: Package, Class and Interface
Class: Class and Interface; Attribute, Operation, Operation<<create>>, Operation<<initialize>>
Interface: Class and Interface, Operation
Attribute: N/A
Operation: Parameter


2.1 The UMLDiff Algorithm

Given two input graphs, UMLDiff starts by comparing their vertices, i.e., mapping the elements of the first model to “same” elements of the second model. Once this process has been completed, it proceeds to analyze the relations of the two graphs.

2.1.1 Mapping Elements

UMLDiff traverses the containment-spanning trees of the two compared models, descending from one logical level to the next, in both trees at the same time. It starts at the top-level subsystems that correspond to the two system models and progresses down to subsystems, packages, classes and interfaces, and finally, attributes and operations. At each level, it compares all elements at that level from version 1, e_i^1, to all elements of version 2, e_j^2, and recognizes pairs of "same" elements, i.e., elements that correspond to the same design-model concept.

Similarity for UMLDiff is established on the basis of two criteria: (a) lexical similarity, i.e., a metric of the lexical distance between the identifiers of two same-level elements, and (b) structure similarity, i.e., a metric of the degree to which the two compared elements are related in the same ways to other elements that have already been established to be the same.

Name similarity is a “safe” indicator that e1 and e2 are the same entity: in our experience with several case studies, very rarely is a model element removed and a new element is added to the system with the same name but different element type and different behavior. UMLDiff recognizes same-name model elements of the same type first and uses them as initial “landmarks” to subsequently recognize renamed and moved elements.
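As a rough, hypothetical sketch of this initial landmark pass (the ModelElement interface and its accessors are invented here for illustration and are not UMLDiff's actual data structures):

import java.util.List;
import java.util.Map;

// Hypothetical model interface, for illustration only.
interface ModelElement {
    String name();
    String kind();                 // e.g., "package", "class", "operation"
    List<ModelElement> children(); // containment children
}

// Initial landmark pass: at each containment level, elements of the same kind
// with identical (case-insensitive) names are mapped directly, and the
// traversal then descends into each mapped pair.
class LandmarkMatcher {
    void matchSameNameChildren(ModelElement p1, ModelElement p2,
                               Map<ModelElement, ModelElement> mapping) {
        for (ModelElement c1 : p1.children()) {
            for (ModelElement c2 : p2.children()) {
                if (c1.kind().equals(c2.kind())
                        && c1.name().equalsIgnoreCase(c2.name())
                        && !mapping.containsKey(c1)
                        && !mapping.containsValue(c2)) {
                    mapping.put(c1, c2);                    // new landmark
                    matchSameNameChildren(c1, c2, mapping); // descend a level
                    break;
                }
            }
        }
    }
}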

Within each level, after all same-name elements have been recognized, UMLDiff attempts to recognize renamed and/or moved elements at that level. When a model element is renamed or moved – frequent changes in the context of object-oriented refactorings – its relations to other elements tend to remain the same, for the most part. For example when an operation moves, it still reads/writes the same attributes and it calls (and is called by) the same operations. Therefore, by comparing the relations of two same-type model elements, UMLDiff infers renamings and moves: the two compared elements are the same, if they share “enough” relations to elements that have already been established to be the same, even though their names (in the case of renamings) and/or their parent (containing) model elements are different (in the case of moves).

The knowledge that two model elements are essentially the same, in spite of having been renamed or moved, is added to the current set of mapped elements, and is used later on to further match other not-yet-mapped elements. This process continues until the leaf level of the two spanning trees has been reached and all possible corresponding pairs of model elements have been identified.

Given two renaming or move candidates, UMLDiff computes their structural similarity as the cardinality of the intersection of their corresponding related-element sets (see Section 2.2.2 for details). Given the sets of elements that are connected to the two compared candidates with a given relation type, UMLDiff identifies the common


subset of elements that have already been mapped. Therefore, if most of the model elements related to two candidates were also renamed and/or moved and cannot be established as “same”, the UMLDiff structure-similarity heuristic will fail. If, on the other hand, a set of related elements were renamed or moved but enough model elements related to the affected set remained the “same”, it would be possible to recognize this systematic change.

The structure-similarity metric fails when global renamings are applied, i.e., renamings to meet a new naming convention, for example. In such cases, there may be so many elements affected that the initial round of recognizing “same” elements based on name similarity may not produce enough mapped elements, to be used as landmarks for structure similarity. To address this problem, UMLDiff can be configured with a user-provided string transformation – introducing a prefix or appending a suffix, or replacing a certain substring – to be applied to the names of the model elements of one of the compared versions, before the differencing process. To further accelerate the recognition of “same” elements, UMLDiff propagates operation renamings along the inheritance hierarchy, i.e., it assumes that if an operation o1 in a class c1 has been renamed to o2, then all its implementations in the subclasses of c1

have also been similarly renamed. Finally, as each round of recognition of “same” elements based on structure

similarity establishes more landmarks on the basis of which new elements can be recognized as structurally similar, UMLDiff can be configured to go through multiple rounds of renaming and move identification, until no more new renamed and/or moved elements can be found or it finishes the user-specified number of iterations.

2.1.2 Mapping Relations

Once UMLDiff has completed mapping the sets of model elements, V1 and V2, it proceeds to map the relation sets, E1 and E2, by comparing the relations of all pairs of model elements (v1, v2), where v2=null if v1 is removed and v1=null if v2 is added. The relations from (to) a removed model element are all removed and the relations from (to) an added model element are all added. For a pair of mapped elements (v1, v2), they may have matched, newly added, and/or removed relations. Note that a removed (added) relation between two model elements does not indicate any of the elements it relates being removed (added).

Finally, UMLDiff detects the redistribution of the semantic behavior among operations, in terms of usage-dependency changes, and computes the changes to the attributes of all pairs of mapped model elements.

2.1.3 Configuration Parameters

The UMLDiff differencing process is configured through the following set of parameters.

1. The LexicalSimilarityMetric specifies which of three alternative lexical-similarity metrics (Char-LCS, Char-Pair, and Word-LCS) will be used by UMLDiff.


2. The RenameThreshold and MoveThreshold specify the minimum similarity values between two model elements in the two compared versions in order for them to be considered as the same conceptual element renamed or moved. UMLDiff allows multiple rounds (MaxRenameRound and MaxMoveRound) of renaming and/or move identification in order to recover as many renamed and moved entities as possible.

3. The ConsiderCommentSimilarity parameter defines whether the similarity of the comments of the model elements should also be taken into account when comparing two elements, if the compared elements have an initial overall similarity value above the MinThreshold. This threshold prevents model elements with very low name- and structure-similarity from qualifying as renamings or moves just because of their similar comments.

4. The ConsiderTransclosureUsageSimilarity parameter controls whether the similarity of the transitive usage dependencies between two compared operations may also be used to assess their structural similarity.

5. At the end of the differencing process, UMLDiff can be instructed whether or not to compute the usage dependency changes for all model elements and analyze the redistribution of operation behavior.

2.2 Assessing Similarity

In the above section, we have described how UMLDiff maps elements relying on two heuristics – lexical and structure similarity. In this section we delve deeper into the details of how exactly lexical and structure similarity are computed. The equations specifying these computations are intuitively motivated and have been tuned through substantial experimentation. These computations are fundamentally heuristic, tailored to the idiosyncrasies of the UML domain and our intuitions and understanding of the practices of developers in naming identifiers.

2.2.1 Lexical Similarity

To assess the similarity of the identifiers of (and the textual comments associated with) two compared model elements, UMLDiff integrates three metrics of string similarity: (a) the longest common character subsequence (Char-LCS); (b) the longest common token subsequence (Word-LCS); and (c) the common adjacent character pairs (Char-Pair). All these metrics are computationally inexpensive to calculate, given the usually small length of the names and comments of model elements. They are also case insensitive, since it is common to misspell words with the wrong case or to modify them with just case changes. They are all applicable to name similarity, while only Char-LCS and Word-LCS may be applied to compute comment similarity. Irrespective of the specific metric used, let us first describe what exactly UMLDiff considers as the "identifier" of each model-element type.

The lexical similarity of operations is calculated as the product of their identifier similarity and their parameter-list similarity. In turn, the similarity of two parameter


lists is computed based on the Jaccard coefficient of the two bags of data types of the operations’ parameters, i.e. the intersection of two bags of parameter types divided by the union of two bags of parameter types.
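A minimal sketch of such a bag-based Jaccard coefficient, assuming parameter types are represented simply as strings (this is not the UMLDiff code):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Jaccard coefficient over bags (multisets) of parameter type names:
// |intersection| / |union|, where the intersection takes the minimum and the
// union the maximum of each type's multiplicity.
class BagJaccard {
    static double similarity(List<String> types1, List<String> types2) {
        Map<String, Integer> b1 = toBag(types1);
        Map<String, Integer> b2 = toBag(types2);
        Map<String, Integer> all = new HashMap<>(b1);
        b2.forEach((t, n) -> all.merge(t, n, Math::max));
        int inter = 0, union = 0;
        for (Map.Entry<String, Integer> e : all.entrySet()) {
            int n1 = b1.getOrDefault(e.getKey(), 0);
            int n2 = b2.getOrDefault(e.getKey(), 0);
            inter += Math.min(n1, n2);
            union += Math.max(n1, n2);
        }
        // Two empty bags are treated as identical here; this edge case is a
        // choice not specified in the text.
        return union == 0 ? 1.0 : (double) inter / union;
    }

    private static Map<String, Integer> toBag(List<String> types) {
        Map<String, Integer> bag = new HashMap<>();
        for (String t : types) bag.merge(t, 1, Integer::sum);
        return bag;
    }
}

For example, comparing the type bags {int, String} and {String, int, int} yields an intersection of size 2 and a union of size 3, i.e., a similarity of 2/3.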

For packages, we split package names into a set of words by “.”, and then compute the lexical similarity of packages using the similarity equations defined below. The similarity of the comments associated with two model elements is only consulted when both elements have associated comments (i.e., the UMLDiff parameter ConsiderCommentSimilarity is true) and the initial overall similarity metric between these elements is greater than the UMLDiff parameter MinThreshold.

The longest common character subsequence (Char-LCS) algorithm [15] is frequently used to compare strings. Word-LCS applies the same LCS algorithm, using words instead of characters as the basic constituents of the compared strings. The names of model elements are split into a sequence of words, using dots, dashes, underscores and case switching as delimiters. Comments are split into words using space as the sole delimiter. The actual metric used for assessing LCS-similarity is shown in Equation 1.

Char/Word-LCS(s1, s2) = 2 * length(LCS(s1, s2)) / (length(s1)+length(s2)), where LCS() and length() are based on the type of token considered, i.e., characters or words.

Equation 1
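For concreteness, a case-insensitive Char-LCS similarity in the spirit of Equation 1 might be sketched as follows (Word-LCS is analogous, operating on the word sequences produced by the splitting rules described above rather than on characters):

// Character-level LCS similarity per Equation 1 (illustrative sketch).
class LcsSimilarity {
    static double charLcs(String s1, String s2) {
        // Standard dynamic-programming table for the LCS length.
        int[][] lcs = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                if (Character.toLowerCase(s1.charAt(i - 1))
                        == Character.toLowerCase(s2.charAt(j - 1))) {
                    lcs[i][j] = lcs[i - 1][j - 1] + 1;
                } else {
                    lcs[i][j] = Math.max(lcs[i - 1][j], lcs[i][j - 1]);
                }
            }
        }
        int total = s1.length() + s2.length();
        return total == 0 ? 1.0 : 2.0 * lcs[s1.length()][s2.length()] / total;
    }
}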

LCS reflects the lexical similarity between two strings, but it is not very robust to changes of word order, which is common with renamings. To address this problem, we have defined the third lexical-similarity metric in terms of how many common adjacent character pairs are contained in the two compared strings. The pairs(x) function returns the pairs of adjacent characters in a string x. By considering adjacent characters, the character ordering information is, to some extent, taken into account. The Char-Pair similarity metric, which is a value between 0 and 1, is computed according to Equation 2.

Char-Pair(s1, s2 ) = 2 * |pairs(s1)∩pairs(s2)| / (|pairs(s1)|+|pairs(s2)|).

Equation 2
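Similarly, a small sketch of the Char-Pair metric of Equation 2 (again illustrative; pairs are compared case-insensitively and duplicate pairs are counted with multiplicity):

import java.util.ArrayList;
import java.util.List;

// Common-adjacent-character-pair similarity per Equation 2 (sketch).
class CharPairSimilarity {
    static double charPair(String s1, String s2) {
        List<String> p1 = pairs(s1), p2 = pairs(s2);
        int total = p1.size() + p2.size();
        if (total == 0) return 1.0;
        int common = 0;
        List<String> remaining = new ArrayList<>(p2);
        for (String p : p1) {
            if (remaining.remove(p)) common++; // consume each match only once
        }
        return 2.0 * common / total;
    }

    private static List<String> pairs(String s) {
        List<String> result = new ArrayList<>();
        String lower = s.toLowerCase();
        for (int i = 0; i + 1 < lower.length(); i++) {
            result.add(lower.substring(i, i + 2));
        }
        return result;
    }
}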

2.2.2 Structure Similarity

Table 4 lists the relations that UMLDiff examines to compute the structure similarity between two model elements of the same type. The top-level subsystems, corresponding to the two compared versions of a UML logical model, are always assumed to match. The structure similarity of subsystems, packages, classes and interfaces is determined based on (a) the elements they contain, (b) the elements they use, and (c) the elements that use them. The structure similarity of attributes is determined by the operations that read and write them, and their initialization expressions. The structure similarity of operations is determined by the parameters they declare, their outgoing usage dependencies (including the attributes they read and write, the operations they call, and the classes/interfaces they create), and their incoming usage dependencies (including the attributes (through their initValue) and the operations that call them).


Table 4. The UML relations for computing structure similarity

Element type: Types of relations

Subsystem: [namespace – ownedElement]; incoming and outgoing usage
Package: [namespace – ownedElement]; incoming and outgoing usage
Class, Interface: [namespace – ownedElement] and [owner – feature]; incoming and outgoing usage
Attribute: Usage<<read>>, Usage<<write>> and inherent Attribute.initValue
Operation: [BehaviorFeature – parameter] and [typedParameter – type]; outgoing usage: Usage<<read>>, Usage<<write>>, Usage<<call>>, Usage<<instantiate>>; incoming usage: Usage<<call>>

The structure similarity of two compared elements is a measure of the overlap

between the sets of elements to which the compared elements are related. The intersection of the two related-element sets contains the pairs of model elements that are related to the compared elements (with the same relation type) and have already been mapped. In effect, this intersection set incorporates knowledge of any “known landmarks” to which both compared model elements are related.

Given two model elements of the same type, v1 and v2, let Set1 and Set2 be their related-element sets, the structure similarity between v1 and v2 according to a given group of relations is a normalized value (between 0 and 1) as computed according to Equation 3.

StructureSimilarity = matchcount / (matchcount + addcount + removecount), where the matchcount, addcount, and removecount are the cardinalities of [Set1 ∩ Set2], [Set2 – Set1], [Set1 – Set2] respectively.

Equation 3
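Equation 3 amounts to the following computation over the two related-element sets, assuming the elements of the first set have already been translated to their mapped counterparts in the second version (a sketch; UMLDiff additionally weighs usage dependencies by their count tag and handles the empty-set case as described below):

import java.util.HashSet;
import java.util.Set;

// Structure similarity per Equation 3:
// matchcount / (matchcount + addcount + removecount).
class StructureSimilarity {
    static <T> double compute(Set<T> related1, Set<T> related2) {
        Set<T> common = new HashSet<>(related1);
        common.retainAll(related2);                     // Set1 ∩ Set2
        int matchCount = common.size();
        int removeCount = related1.size() - matchCount; // Set1 - Set2
        int addCount = related2.size() - matchCount;    // Set2 - Set1
        int denom = matchCount + addCount + removeCount;
        // NOTE: UMLDiff does not simply return a constant when both sets are
        // empty; see the name-similarity fallback discussed below.
        return denom == 0 ? 0.0 : (double) matchCount / denom;
    }
}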

For a usage dependency, its count tag, which indicates the number of times that it appears between the client and supplier elements, is used to compute its matchcount, addcount, and removecount.

The similarity of the parameter lists of two operations is based on the names and types of their parameters. The computation of parameter-list similarity is insensitive to the order of parameters. For non-return parameters, if neither of the two operations is overloading, the matchcount for a pair of same-name parameters is 1. If either of the two compared operations is overloading, the types of the two same-name parameters are further examined, in order to distinguish the overloading methods from each other, which often declare the same-name parameters but with different parameter types. In the case of overloading, if the same-name parameters are of mapped types, their matchcount is 1; otherwise, their matchcount is 0.5. For the return parameters, if their types are mapped, the matchcount is 1; else it is set at 0. If the type of the return parameter of both operations is void, the matchcount for the return parameter is 0.

The similarity of the initValue of two compared attributes is computed in the same way as the outgoing usage similarity between two operations. The initValue-similarity value is added to the overall matchcount of the Usage<<write>> similarity between two attributes.


Determining the similarity when both related model-element sets are empty is challenging, when, for example, two operations are not called by any other operations. In such cases, setting the structure similarity to be by default 0 or 1 is not desirable: without any explicit evidence of similarity, assuming that the structure is completely the same or completely different may skew the subsequent result. Therefore, in such cases, UMLDiff uses the name similarity with an increasing exponent. The effect is dampened as more empty sets are encountered. For example, when computing the structure similarity of two operations in the order of their parameter-list, outgoing usage and incoming usage similarities, if the two compared operations declare no parameters, have return type void, and have no outgoing and incoming usage dependencies, UMLDiff returns name-similarity^1 for comparing parameter-list similarity, name-similarity^2 for outgoing usage similarity, and name-similarity^3 for incoming usage similarity.
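A tiny sketch of that fallback (illustrative only): the k-th comparison that encounters two empty related-element sets contributes the name similarity raised to the k-th power, so the name's influence fades with each successive empty comparison.

// Empty-set fallback sketch: name-similarity^1 for the first empty comparison,
// name-similarity^2 for the second, and so on.
class EmptySetFallback {
    static double contribution(double nameSimilarity, int emptySetOrdinal) {
        return Math.pow(nameSimilarity, emptySetOrdinal);
    }
}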

2.2.3 Overall Similarity Assessment

Given two model elements e1 and e2 of the same type, their overall similarity metric, used for determining potentially renamed and moved model elements, is computed according to Equation 4, below.

SimilarityMetric = (lexical-similarity + Σ_N structure-similarity) / (lexical-similarity + N), where lexical-similarity = name-similarity + comment-similarity, and N is the number of different types of structure similarities computed for a given type of model elements, as defined in Table 2.

Equation 4
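A direct transcription of Equation 4 could look like the following sketch (the overload and move adjustments described next are not included):

import java.util.List;

// Overall similarity per Equation 4: lexical similarity is name-similarity +
// comment-similarity; the N structure similarities of the element's type are
// summed into the same ratio.
class OverallSimilarity {
    static double compute(double nameSim, double commentSim,
                          List<Double> structureSims) {
        double lexical = nameSim + commentSim;
        double sumStructure = 0.0;
        for (double s : structureSims) sumStructure += s;
        int n = structureSims.size();
        // Callers should guard against lexical + n == 0.
        return (lexical + sumStructure) / (lexical + n); // Equation 4
    }
}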

The value of Σ_N structure-similarity is adjusted in the following cases. When comparing two operations, if any of them is overloaded, Σ_N structure-similarity is multiplied by the parameter-list similarity of the compared operations in order to distinguish the overloading operations from each other, which often have similar usage dependencies but with different parameters.

When determining the potential moves of attributes and operations, if the declaring classes/interfaces of the compared attributes/operations are not related through inheritance, containment, or usage relations, the value of Σ_N structure-similarity is multiplied by the overall similarity of the classes in which the compared attributes/operations are declared, and divided by the product of the numbers of all the not-yet-mapped model elements with the same name and type as the two compared elements. This is designed to improve the low precision when identifying attribute and operation moves.

UMLDiff uses two user-defined thresholds (RenameThreshold and MoveThreshold): two model elements are considered as the “same” element renamed or moved when their overall similarity metric is above the corresponding threshold. If, for a given element in one version, there are several potential mappings above the user-specified threshold in the other version, the one with the highest similarity score is chosen. The higher the threshold is, the stricter the similarity requirement is. The smaller the threshold is, the riskier the renamings-and-moves recognition process is.


3 VTracker

The VTracker algorithm is designed to compare XML documents, based on a tree-differencing paradigm. It calculates the minimum edit distance between two labeled ordered trees, given a cost function for different edit operations (e.g. change, deletion, and insertion). Essentially, VTracker views XML documents as partially ordered trees, since XML elements contain other XML elements and the order of contained elements within a container does not matter, unless these elements are contained in a special ordered container. Given that UML logical models can be represented in XMI, i.e., an XML-based syntax, the problem of UML logical-model differencing can be reduced to XML-document differencing and VTracker can be applied to it.

VTracker is based on Zhang-Shasha's tree-edit distance algorithm [30], which calculates the minimum edit distance between two trees T1 and T2 (see footnote 1), given a cost function for the different edit operations (e.g. change, deletion, and insertion), with complexity O(|T1|^(3/2) * |T2|^(3/2)) according to the analysis of Dulucq and Tichit [1].

Intuitively, given two trees, the Zhang-Shasha algorithm identifies the minimum cost of mapping the nodes of the two trees to each other, considering the following three options, illustrated in Figure 1.a:

(a) the cost of mapping the root nodes of the two trees plus the cost of mapping the remaining forests to each other (assuming that the root nodes of the two trees are comparable);

(b) the cost of deleting the root of the first tree plus the cost of mapping the remaining forest to the entire second tree (assuming that the root of the first tree was newly inserted in the second tree); and

(c) the cost of deleting the root of the second tree plus the cost of mapping the entire first tree against the remaining forest of the second tree (assuming that the root of the first tree is missing in the second tree).

The VTracker algorithm for calculating the edit distance between two trees rooted at nodes x and y, respectively, is shown in pseudocode in Algorithm 1. The algorithm assumes that nodes are numbered in a post-order manner, where a parent node is visited after all its children, from left to right, have been recursively visited. The process, as shown in lines 5-8, starts by determining the span of each node (x and y); the span of node x includes all the nodes from the left-most descendant of x up to x itself (the root of the sub-tree), plus a "dummy" node representing the void node, which is given index zero, while the left-most node gets index one. The algorithm proceeds to progressively calculate the edit distance between portions (forests) of both trees. For example, fdist[i][j] is the distance between the first forest (including all the nodes in the first tree up to and including the node with index i) and the second forest (including all the nodes in the second tree up to and including the node with index j). The process keeps adding a single node to each of the compared forests (lines 10 to 13) and assessing the cost, until it reaches the point where both sides are no longer forests but the complete trees.

1 We use T1 and T2 to refer both to the trees and to their numbers of nodes.


Input: trees T1 and T2
01 DECLARE matrix tdist with size [|T1|+1] * [|T2|+1]
02 DECLARE matrix fdist with size [|T1|+1] * [|T2|+1]
03 FUNCTION treeDistance (x, y)
04 START
05   lmx = lm1(x)                      // left-most node of the sub-tree rooted at x
06   lmy = lm2(y)                      // left-most node of the sub-tree rooted at y
07   span1 = x - lmx + 2               // size of sub-tree x + 1
08   span2 = y - lmy + 2               // size of sub-tree y + 1
09   fdist[0][0] = 0
10   FOR i = 1 TO span1 - 1            // set the first column (deletion costs)
11     fdist[i][0] = fdist[i-1][0] + cost(lmx+i-1, -1, i, 0)
12   FOR j = 1 TO span2 - 1            // set the first row (insertion costs)
13     fdist[0][j] = fdist[0][j-1] + cost(-1, lmy+j-1, 0, j)
14   k = lmx
15   l = lmy
16   FOR i = 1 TO span1 - 1
17     FOR j = 1 TO span2 - 1
18       IF lm1(k) = lmx AND lm2(l) = lmy
19       THEN                          // tree edit distance
20         fdist[i][j] = min(fdist[i-1][j] + cost(k,-1,i,j),
                             fdist[i][j-1] + cost(-1,l,i,j),
                             fdist[i-1][j-1] + cost(k,l,i,j))
21         tdist[k][l] = fdist[i][j]
22       ELSE                          // forest edit distance
23         m = lm1(k) - lmx
24         n = lm2(l) - lmy
25         fdist[i][j] = min(fdist[i-1][j] + cost(k,-1,i,j),
                             fdist[i][j-1] + cost(-1,l,i,j),
                             fdist[m][n] + tdist[k][l])
26       l++                           // next node of the second tree
27     k++ ; l = lmy                   // next node of the first tree; reset l for the new row
28   RETURN tdist[x][y]
29 END

Algorithm 1: The Zhang-Shasha Tree Comparison

At line 9, the algorithm starts by initializing fdist[0][0], i.e., the cost of transforming a void forest into another void forest, to zero. In lines 10 and 11 it calculates the deletion costs of the various forests of the first tree, which progressively leads to the cost of deleting the whole first tree. Similarly, the algorithm calculates the insertion costs in lines 12 and 13. At this point it has calculated the cost of mapping the two trees through the drastic change of deleting all the nodes of the first one and adding all the nodes of the second.

Then, beginning at line 18, the algorithm starts adding one node to each tree and calculating the distance between the resulting forests. In each step, if both sides have one full sub-tree, it applies the tree distance mechanism; otherwise it uses the forest edit distance mechanism (illustrated in Figure 1.b), where it chooses the minimum cost option of the three below:

• The cost of mapping node x to node y plus the cost of matching the remaining forests to each other.

• The cost of deleting node x plus the cost of matching the remaining forest of the first tree against the entire second tree.


• The cost of inserting node y plus the cost of matching the entire first tree against the remaining forest of the second tree.

Fig. 1. Visualization of the Zhang-Shasha algorithm [30]: (a) tree-edit distance; (b) forest-edit distance
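
For readers who prefer executable code over pseudocode, the following is a minimal Python sketch of the plain Zhang-Shasha recurrence with unit costs (our own illustration; it contains none of the VTracker extensions discussed below, and all names are ours). Trees use the (label, children) tuple form of the earlier sketch.

def annotate(tree):
    # Post-order labels and, per node, the post-order index of its
    # left-most leaf descendant.
    labels, lml = [], []
    def walk(node):
        label, children = node
        child_lml = [walk(c) for c in children]
        idx = len(labels)
        labels.append(label)
        lml.append(child_lml[0] if child_lml else idx)
        return lml[idx]
    walk(tree)
    return labels, lml

def keyroots(lml):
    # Nodes that have no ancestor with the same left-most leaf descendant.
    seen, roots = set(), []
    for i in range(len(lml) - 1, -1, -1):
        if lml[i] not in seen:
            seen.add(lml[i])
            roots.append(i)
    return sorted(roots)

def tree_edit_distance(t1, t2):
    l1, lm1 = annotate(t1)
    l2, lm2 = annotate(t2)
    td = [[0] * len(l2) for _ in range(len(l1))]      # tree-to-tree distances

    def treedist(i, j):
        m, n = i - lm1[i] + 2, j - lm2[j] + 2
        fd = [[0] * n for _ in range(m)]              # forest distances
        ioff, joff = lm1[i] - 1, lm2[j] - 1
        for x in range(1, m):
            fd[x][0] = fd[x - 1][0] + 1               # deletions
        for y in range(1, n):
            fd[0][y] = fd[0][y - 1] + 1               # insertions
        for x in range(1, m):
            for y in range(1, n):
                if lm1[x + ioff] == lm1[i] and lm2[y + joff] == lm2[j]:
                    change = 0 if l1[x + ioff] == l2[y + joff] else 1
                    fd[x][y] = min(fd[x - 1][y] + 1, fd[x][y - 1] + 1,
                                   fd[x - 1][y - 1] + change)
                    td[x + ioff][y + joff] = fd[x][y]  # two full sub-trees aligned
                else:
                    p = lm1[x + ioff] - 1 - ioff
                    q = lm2[y + joff] - 1 - joff
                    fd[x][y] = min(fd[x - 1][y] + 1, fd[x][y - 1] + 1,
                                   fd[p][q] + td[x + ioff][y + joff])

    for i in keyroots(lm1):
        for j in keyroots(lm2):
            treedist(i, j)
    return td[-1][-1]

t1 = ("Class", [("Attribute", []), ("Operation", [("Parameter", [])])])
t2 = ("Class", [("Operation", [("Parameter", []), ("Parameter", [])])])
print(tree_edit_distance(t1, t2))   # 2: delete Attribute, insert one Parameter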

VTracker extends the Zhang-Shasha algorithm in four important ways. First, it uses an affine-cost policy, which adjusts the cost of each operation if it happens in the vicinity of many similar operations. The affine-cost computation algorithm is discussed in Section 3.1.

Second, unlike the Zhang-Shasha algorithm, which assumes “pure” tree structures, VTracker allows for cross-references between nodes of the compared trees, which is essential for comparing XML documents that use the ID and IDREF attributes. VTracker considers the existence of these references in two different situations during the matching process. First, it considers referenced elements as being a part of the referring elements’ structure (see Section 3.2); when two nodes are being compared, VTracker considers all their children irrespective of whether they are defined in the context of their parent nodes or referenced by them. Additionally, through its “context-aware matching” process, VTracker considers not only the internal structure of the compared elements but also the context in which they are used, namely the elements by which they are being referenced.
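
As a small illustration of the cross-references involved, the following Python snippet (with made-up element and attribute names; not the actual XMI syntax) builds an index over id attributes and follows an IDREF-style attribute to the element it designates, so that the referenced definition can be treated as part of the referring element's structure.

import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<model>'
    '<Type id="t1" name="Number"/>'
    '<Attribute attrName="value" typeRef="t1"/>'
    '</model>'
)
index = {e.get("id"): e for e in doc.iter() if e.get("id") is not None}

attr = doc.find("Attribute")
referenced = index[attr.get("typeRef")]   # follow the reference, as VTracker conceptually does
print(referenced.get("name"))             # -> Number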

Third, in a post-processing step, VTracker applies a simplicity-based filter to discard the more unlikely solutions from the solution set produced during the tree-alignment phase (see Section 3.3).

Finally, in addition to being applied with the default cost function, which assigns the same cost to addition/deletion/change operations, VTracker can be configured with a domain-specific cost function (see Section 3.4), constructed through an initial bootstrapping step in which VTracker, with the default cost function, is applied to comparing the forest of elements from the XML Schema Definition of the domain to itself.

3.1 Cost Computation

The original Zhang-Shasha algorithm assumes that the cost of any deletion/insertion operation is independent of the context in which the operation is applied: the cost of a node insertion/deletion is the same, irrespective of whether or not that node's children are also deleted or inserted. As a result, the Zhang-Shasha algorithm considers as equally expensive two different scripts with the same number and types of edits, with no preference to the script that may include all the changes within the same locality. Such behavior is unintuitive: a set of changes within the same sub-tree is more likely than the same set of changes dispersed across the whole tree. Since the parent-child relation within the tree is likely to represent a semantic relation in the domain, whether it is composition (the parent contains the child), or inheritance (the parent is a super-type of the child), or association (the parent uses/refers to the child), it is more likely than not that changes in one participant of the relation will affect the other. This is why changes are likely to be clustered together around connected nodes, as opposed to “hitting” a number of unrelated nodes.

In order to produce more intuitive tree-edit sequences, VTracker uses an affine-cost policy. In VTracker, a node's deletion/insertion cost is context sensitive: if all of a node’s children are also candidates for deletion, this node is more likely to be deleted as well, and then the deletion cost of that node should be less than the regular deletion cost. The same is true for the insertion cost.
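
A purely numerical illustration of the effect (the cost constants are invented; the text does not give VTracker's actual values): five deletions that form one connected sub-tree are cheaper than five deletions scattered over unrelated nodes, because the nodes whose children are already being deleted receive the discounted cost.

# Hypothetical constants, chosen only for this illustration.
STANDARD_DELETION_COST = 1.0
DISCOUNTED_DELETION_COST = 0.4

scattered = 5 * STANDARD_DELETION_COST                              # 5.0
clustered = STANDARD_DELETION_COST + 4 * DISCOUNTED_DELETION_COST   # 2.6
print(scattered, clustered)   # the clustered edit script is preferred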

As shown in Algorithm 2 below, the cost function accepts four parameters. The first two parameters, x and y, represent the absolute indexes of the two nodes being considered within the two full trees; the other two parameters, i and j, represent their local indexes within the two sub-trees being compared, which help to determine the context of the edit operation. A delete operation is denoted by y = -1, and an insert operation is denoted by x = -1; otherwise, it is a matching operation and the objective is to assess how much it will cost to transform node x into node y. As shown in the GetDeletionCost function, to assess the cost of deleting a certain node, the algorithm first checks whether the node is eligible for a discounted affine cost; otherwise the standard edit cost is used. The GetInsertionCost function is analogous to the deletion one.

FUNCTION Cost (x, y, i, j)
START
  IF y = -1 THEN
    RETURN GetDeletionCost (x, i, j)
  ELSEIF x = -1 THEN
    RETURN GetInsertionCost (y, i, j)
  ELSE
    RETURN MappingCost (x, y, i, j)
  ENDIF
END

FUNCTION GetDeletionCost (x, i, j)
START
  IF IsDeleteAffineEligible (i, j) THEN
    RETURN DISCOUNTED_DELETION_COST   // the whole tree is deleted
  ELSE
    RETURN STANDARD_DELETION_COST
  ENDIF
END

Algorithm 2: Calculating Costs

Algorithm 3 explains the logic of calculating the cost of transforming node x to node y, i.e., the cost of mapping nodes x and y. Normally, a NodeDistance function is used to reflect the domain logic of assessing the cost of node x being transformed to node y. However, if either of the two nodes x or y has a reference to another node, a different mechanism is used. This mechanism follows the reference to the referred-to node. Consider for example the case where node x has no references, while node y is a reference to node z. In order to assess the similarity between nodes x and y, we actually need to assess the similarity between node x and node z. To that end, the treeDistance algorithm, described in Algorithm 1, is used to assess the similarity between the sub-tree rooted at x and the sub-tree rooted at z. This mechanism is explained in more detail in Section 3.2.

FUNCTION MappingCost (x, y, i, j)
START
  newX = x
  newY = y
  IF x has a reference THEN newX = referenced Id ENDIF
  IF y has a reference THEN newY = referenced Id ENDIF
  IF x <> newX OR y <> newY THEN
    RETURN (treeDistance(newX, newY) /
            (TreeDeletionCost(newX) + TreeInsertionCost(newY)))
           * STANDARD_CHANGE_COST
  ELSE
    RETURN NodeDistance(x, y)
  ENDIF
END

Algorithm 3: Cost, in the presence of References

FUNCTION IsDeleteAffineEligible (i, j)
START
  IF j = 0 THEN                       // the whole tree is to be deleted
    RETURN true
  ELSE
    // Cost of matching the sub-forest is the actual cost minus
    // the cost of matching the remaining forests to each other
    costSubForest = fdist[i-1][j] - fdist[lm1(i)-1][j]
    // Cost of deleting everything minus
    // the cost of matching the remaining forests to each other
    costDelSubForest = fdist[i-1][0] - fdist[lm1(i)-1][0]
    IF costSubForest = costDelSubForest THEN
      RETURN true
    ELSE
      RETURN false
    ENDIF
  ENDIF
END

Algorithm 4: Affine Costs

3.2 Reference-Aware Edit Distance

Tree-edit distance algorithms only consider node-containment relationships, i.e., parent nodes containing child nodes. VTracker, designed for XML documents, is not a pure tree-differencing algorithm; it is aware of other relations between XML elements that are represented as additional references between the corresponding tree nodes. This feature is very important, since most XML documents reuse element definitions, thus implying references from an element to the original element definition. The Zhang-Shasha algorithm simply ignores such references. In VTracker, such reference structure is considered in an integrated manner within the tree-edit distance calculation process.


A typical interpretation of such references is that the referenced element structure is meant to be entirely copied at the reference location; but, to avoid potential inconsistencies through cloning and local changes, elements are reused through a reference to one common definition. VTracker compares tree nodes by traversing the containment structure until it encounters a reference. It then recursively follows the reference structure as if it were a part of the current containment structure, until it reaches a previously examined node; then it backtracks, recording all the performed calculations for future use by other nodes referring to the same node.

The question then becomes: “how should the cost function be adjusted in order to compute the differences of two nodes in terms of the similarities and differences of the elements they contain and refer to?” As shown in Algorithm 3 above, the definition of the cost function is changed when one of the nodes is a reference to another node. If either or both nodes are references (i.e., have nothing but references), then the cost of changing one into the other is the tree edit-distance between the referenced tree structures. Let us assume that node x refers to node x’ and node y refers to node y’. The cost of changing node x to node y is the tree-edit distance between the sub-tree rooted at x’ and the sub-tree rooted at y’. Additionally, a normalization step is essential here, because the tree-edit distance between x’ and y’ can vary according to the size of the two trees. Our approach divides the calculated edit distance between the two referenced sub-trees by the cost of deleting both of them, which is the maximum possible cost. In this sense, the normalized cost always ranges from 0 (in case of a perfect match) to 1 (in case of totally different structures). Finally, the normalized edit distance is scaled against the maximum possible cost of change, i.e., a normalized cost of 1.0 should be scaled to the maximum cost of changing two nodes to each other. This step is necessary to ensure that the calculated change cost is in harmony with other calculated change costs.
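
A small worked example of this normalization, with invented numbers: if the edit distance between the two referenced sub-trees is 3, deleting the first costs 5 and inserting the second costs 7, then the normalized distance is 3/12 = 0.25, which is finally scaled by the maximum change cost (mirroring Algorithm 3).

tree_distance = 3.0        # treeDistance(x', y'), invented value
deletion_cost = 5.0        # TreeDeletionCost(x'), invented value
insertion_cost = 7.0       # TreeInsertionCost(y'), invented value
STANDARD_CHANGE_COST = 1.0

normalized = tree_distance / (deletion_cost + insertion_cost)   # 0.25, always in [0, 1]
mapping_cost = normalized * STANDARD_CHANGE_COST                # 0.25
print(mapping_cost)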

In addition to taking into account efferent relations, i.e., references from the compared nodes to other nodes, VTracker also considers the afferent relations of the compared elements, i.e., their “usage context”: the nodes that refer to the compared elements. In a post-calculation process, usage-context distance measures are calculated and combined with the standard tree-edit distance measures into a new context-aware tree-edit distance measure. For each pair of nodes x and y, we establish two sets, context1(x) = {v | v→x} and context2(y) = {w | w→y}, that include the nodes from which x and y are referenced, respectively. The usage-context distance between x and y is then calculated as the Levenshtein edit distance [6] between context1(x) and context2(y), where the distance between any two elements of these sets is the tree-edit distance between the corresponding sub-trees. Finally, the consolidated context-aware tree-edit distance measure is the average of the usage-context distance and the tree-edit distance measure.
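
The following Python sketch shows one way to assemble such a context-aware measure (our own illustration; it assumes the sets of referring nodes can be ordered into sequences, and all names are invented): a Levenshtein-style sequence distance whose element-level cost is itself a tree-edit distance, averaged with the plain tree-edit distance of the compared nodes.

def sequence_edit_distance(a, b, element_distance, indel=1.0):
    # Levenshtein-style dynamic program with a pluggable substitution cost.
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + indel
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + indel,
                          d[i][j - 1] + indel,
                          d[i - 1][j - 1] + element_distance(a[i - 1], b[j - 1]))
    return d[m][n]

def context_aware_distance(x, y, tree_distance, referrers_of_x, referrers_of_y):
    usage_context = sequence_edit_distance(referrers_of_x, referrers_of_y,
                                           element_distance=tree_distance)
    return (usage_context + tree_distance(x, y)) / 2.0

# Toy usage with a trivial "tree" distance over string labels:
toy_distance = lambda a, b: 0.0 if a == b else 1.0
print(context_aware_distance("x", "y", toy_distance, ["m1", "m2"], ["m1", "m3"]))  # 1.0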

3.3 Simplicity Heuristics

Frequently, the differencing process may be unable to produce a unique edit script, as there may be multiple scripts that transform one tree to the other with the same minimum cost. VTracker uses three simplicity heuristics to discard the more unlikely solutions from the result set.


The path-minimality criterion eliminates “long paths”. When there is more than one different path with the same minimum cost, the one with the least number of deletion and/or insertion operations is preferable.

The vertical simplicity heuristic eliminates any edit sequences that contain “non-contiguous similar edit operations”. Intuitively, this rule assumes that a contiguous sequence of edit operations of the same type essentially represents a single mutation or refactoring on a segment of neighboring nodes. Thus, when there are multiple different edit-operation scripts with the same minimum cost, and the same number of operations, the one with the least number of changes (refractions) of edit-operation types along a tree branch is preferable.

Finally the horizontal simplicity criterion is implemented by counting the number of horizontal refraction points, found when a node suffers an edit operation different from the one applied to its sibling. Therefore, a solution where the same operation is applied to (most of) a node’s children is preferable to another where the same children suffer different types of edit operations.
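
To make the two simplicity criteria concrete, here is a small Python sketch (ours; the data structures are simplified) that counts vertical refraction points along parent-child edges and horizontal refraction points between adjacent siblings, given the edit operation assigned to each node.

def vertical_refractions(ops, parent):
    # ops: node -> "insert" | "delete" | "change" | "match"
    # parent: node -> its parent (None for the root)
    return sum(1 for node, p in parent.items()
               if p is not None and ops[node] != ops[p])

def horizontal_refractions(ops, children):
    # children: node -> ordered list of child nodes
    count = 0
    for kids in children.values():
        for left, right in zip(kids, kids[1:]):
            if ops[left] != ops[right]:
                count += 1
    return count

# Toy example: the root is kept, two children are deleted, one is changed.
ops = {"root": "match", "a": "delete", "b": "delete", "c": "change"}
parent = {"root": None, "a": "root", "b": "root", "c": "root"}
children = {"root": ["a", "b", "c"]}
print(vertical_refractions(ops, parent), horizontal_refractions(ops, children))  # 3 1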

3.4 Schema-Driven Synthesized Cost Function

The VTracker algorithm is generic, i.e., it is designed to compare XML documents in general and not XMI documents specifically. However, in order to produce accurate solutions that are intuitive to domain experts, VTracker needs to be equipped with a domain-specific cost function that captures the understanding of subject-matter experts of what constitutes similarity and difference among elements in the given domain. Lacking such knowledge, a standard cost function can always be used as a default, which may however sometimes yield less accurate and non-intuitive results. To address the challenge of coming up with a “good” domain-specific cost function, we have developed a method for synthesizing a cost function from the domain’s XML schema, relying on the assumption that the XML schema captures in its syntax a (big) part of the domain’s semantics. Essentially, VTracker assumes that the designers of the domain schema use their understanding of the domain semantics to identify the basic domain elements and to organize related elements into complex ones.

In addition to the domain-specific or default cost functions, VTracker uses additional cost functions to handle node-level cost assessment. For example, VTracker uses the Levenshtein string edit distance [6] to measure the distance between any two literal values, such as node names, attribute names or values, text node contents, etc.

4 Comparison of the UMLDiff vs. VTracker Methodologies

UMLDiff and VTracker have both been applied to the task of recognizing design-level differences between subsequent system versions. In this section we review some interesting methodological differences between the two of them.


They both conceptualize logical-design models of object-oriented software as trees. The parent-child relationship between tree nodes corresponds (a) to the instances of the composition relations in UMLDiff and (b) to the XMI containment relations in VTracker. The two sets of relations are essentially the same. Practically, UMLDiff is applied to a database of “design facts” extracted through a process that analyzes a system’s source code; therefore UMLDiff always takes into account exactly the same relations. VTracker, on the other hand, takes as input two XML documents of any type; to be applied to the task of UML model comparison, in principle, it should be provided with the XMI representation of the model. In practice, however, VTracker’s computation requires too much memory and therefore it cannot be applied to the complete raw XMI representations of large systems. It therefore has to be applied to a filtered version of the XMI, and care has to be taken as to which elements of the XMI syntax are preserved to be considered by VTracker. Through experimentation during the development of the WebDiff system [14], we have discovered that VTracker works well when applied to XML composition models of single classes, and to inheritance models. When multiple classes are compared at the same time, the mapping of tree elements becomes more complex and the computation tends to become impractical. Performance is at the crux of the difference between the two approaches. By restricting itself to a consistent representation of the same design facts, UMLDiff can make assumptions about what to consider comparing and how. VTracker does not always get applied to the same types of XML documents, and, as a result, in its particular application, one has to trade off the “richness” of the model representation against efficiency.

Both UMLDiff and VTracker can be aware of additional types of relations, like association and inheritance, between logical-model elements. UMLDiff exploits these relations while calculating the structure-similarity metric between same-type elements that are considered as candidates for move or renaming. With VTracker there are two options. Assuming containment as the primary relation defining the tree structure, additional edges between model elements can be introduced to reflect these other relations. This approach enables VTracker to consider these relations through its usage-context and reference-aware matching features; however, it has a substantial negative impact on its performance. In our experimentation with VTracker to date, we have developed parallel representations of the logical model, each one considering one of these relations separately, resulting in separate containment, inheritance and association trees, each one to be compared with the corresponding tree of the second logical model.

UMLDiff and VTracker exhibit interesting similarities and differences in terms of their similarity/cost functions for comparing model elements.

• They both combine metrics of lexical and structure similarity.

• We have experimented with a variety of lexical similarity metrics for comparing identifiers in UMLDiff. VTracker, by default, assigns 0 to the distance between two elements when their labels (i.e., identifiers) are the same and 1 when not, and can be configured to use the Levenshtein distance [6] for these labels.

• The function for UMLDiff’s structural similarity assessment was “hand crafted” after much experimentation. VTracker’s cost function is by default very simple (all change operations have the same cost) and has been extended with the affine policy and domain-specific weight calculation.

To study in detail the similarities and differences of the two approaches we performed an extensive experiment, where the two methods have been applied to recognize the changes that occurred in multiple successive versions of an open-source system. More specifically, the experiment is driven by three research questions:

1. How does the generic differencing algorithm perform (in terms of precision and recall) compared to the tailor-made one in the examined differencing problem?

2. Is the generic differencing algorithm efficient and scalable in the examined differencing problem?

3. Does the additional effort required for the configuration of the generic differencing algorithm make it an acceptable solution for the examined differencing problem?

To answer the aforementioned research questions we performed a direct comparison of VTracker with UMLDiff against a manually obtained gold standard. In the following subsections, we describe in detail the process that has been applied in order to conduct this experiment.

4.1 Specification of XML Input for VTracker

As we have already mentioned above, VTracker is a tree-differencing algorithm, potentially able to handle any kind of XML document. Nevertheless, the particulars of the XML schema of the documents to be compared can have substantial implications for the accuracy and efficiency of VTracker. Therefore, it is very important to come up with an appropriate XML representation of the design elements and relationships in an object-oriented software system. To this end, we have divided the object-oriented design model into three distinct hierarchical structures, implied by the three different dependency relationships (design aspects) specified by the Unified Modeling Language (UML).

• Containment: A hierarchical structure representing the containment relationships between a class and its members (i.e., operations and attributes declared within the body of the class).

• Inheritance: A hierarchical structure representing inheritance relationships (including both generalization and realization relationships) between classes.

• Usage: A hierarchical structure representing the usage dependencies among an operation and other operations and/or attributes (i.e., operation calls and attribute accesses within the body of the operation).

We applied VTracker on the three aforementioned design aspects separately for each class of the examined system. This divide-and-conquer approach leads to the construction of XML trees with a smaller number of nodes compared to the alternative approach of using a single XML tree for all design aspects together or for all the classes of the examined system. A direct consequence of this approach is the improvement of efficiency, due to the significant reduction in the size of the trees being compared. An indirect consequence is the improvement of accuracy, since the possibility of extracting incorrect node matches is smaller when the number of node combinations that have to be compared is smaller.

The process of generating the XML input files for VTracker is performed in the following steps:

1. The source code of the two compared versions is parsed and analyzed in order to extract the structure and the relationships between the source code elements of the underlying design models.

2. For each class being present in both versions, we generate a pair of XML files (one for each compared version) for each one of the examined design aspects (i.e., containment, inheritance and usage).

Figure 2 shows a pair of generated XML files regarding the containment design aspect of class PaintItem in versions 1.0.5 and 1.0.6 of JFreeChart. The hierarchical structure of the XML files represents the containment relationships that exist between the source-code elements declared in the given class. For example, operation PaintItem() and attributes paint and value are members of class PaintItem, while parameters paint and value belong to operation PaintItem().

Fig. 2. XML representation of class PaintItem for containment: (a) Version 1.0.5; (b) Version 1.0.6


In Figure 2, one can observe that the parameter types of an operation are represented both as attributes of the Operation node, as well as attributes of the Parameter child nodes. The motivation behind this apparent duplication of information is to further improve the accuracy of VTracker when trying to match overloaded operations (i.e., operations having the same name but a different number or types of parameters). By including them as attributes of the Operation node, we give to these attributes an increased weight (compared to the weight that they normally have as attributes of the Parameter child nodes) and thus we can avoid the problematic situation of mapping incorrectly a set of overloaded operations in the first tree to the corresponding set of overloaded operations in the second tree.

Figure 2 shows that two changes occurred in class PaintItem between versions 1.0.5 and 1.0.6. The type of the attribute value as well as the type of the parameter value in operation PaintItem() have been changed from Number to double (the changes are highlighted in yellow). The XML files regarding the inheritance and usage design aspects are structured in a similar manner.
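
As a rough, hypothetical approximation of the XML of Figure 2 (the node types Class, Operation, Parameter and Attribute and the identifier attributes className, operationName, paramName and attrName follow Table 5; paramType, attrType and the duplicated paramTypes attribute are our assumptions), the containment representation of PaintItem could be produced along these lines:

import xml.etree.ElementTree as ET

def paint_item(value_type):
    cls = ET.Element("Class", {"className": "PaintItem"})
    op = ET.SubElement(cls, "Operation", {
        "operationName": "PaintItem",
        # Parameter types duplicated on the Operation node to give them extra
        # weight when matching overloaded operations (see the discussion above).
        "paramTypes": "Paint," + value_type,
    })
    ET.SubElement(op, "Parameter", {"paramName": "paint", "paramType": "Paint"})
    ET.SubElement(op, "Parameter", {"paramName": "value", "paramType": value_type})
    ET.SubElement(cls, "Attribute", {"attrName": "paint", "attrType": "Paint"})
    ET.SubElement(cls, "Attribute", {"attrName": "value", "attrType": value_type})
    return cls

v105 = paint_item("Number")   # version 1.0.5
v106 = paint_item("double")   # version 1.0.6: attribute and parameter types changed
print(ET.tostring(v105, encoding="unicode"))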

4.2 Configuration of VTracker

The configuration of VTracker plays an important role on the accuracy of the technique, since it affects the weights assigned to the attributes of the nodes during the pair-wise matching process. The configuration process is very straightforward, since it only requires the specification of two properties.

The first property is idAttributeName for which we have to specify the most important attribute (i.e., id attribute) for each type of node in the compared trees. The specified attributes are assigned a higher weight compared to the other attributes of each node type. Practically, this means that if the ID attribute of a node is changed, then the two versions of the node are considered less similar than if another attribute was changed.

The second property is changePropagationParent for which we have to specify the node types that should be reported as changed if at least one of their child nodes is added, removed or changed. This feature allows us to identify that a node has changed because of changes propagated from its children, even if the parent node itself is unchanged. For example, an operation node should be considered as changed if one of its parameters has been renamed even if this specific change has no effect on the attributes of the operation node.

Table 5 shows the configuration properties that we have specified for the XML files corresponding to the containment design aspect (as shown in the example of Figure 2).

Table 5. Configuration of VTracker for the containment design aspect

Property                   Value(s)
idAttributeName            Class     => className
                           Operation => operationName
                           Parameter => paramName
                           Attribute => attrName
changePropagationParent    Operation
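
One way to render the same configuration programmatically (purely illustrative; the actual VTracker configuration file format is not shown in this chapter) is:

# Hypothetical rendering of Table 5 as a Python dictionary.
vtracker_containment_config = {
    "idAttributeName": {
        "Class": "className",
        "Operation": "operationName",
        "Parameter": "paramName",
        "Attribute": "attrName",
    },
    "changePropagationParent": ["Operation"],
}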


4.3 Extraction of True Occurrences

In order to compute the accuracy (i.e., precision and recall) of a differencing technique we need to determine first the actual changes that occurred between different versions of the examined artifact and consider them as the set of true occurrences. Within the context of object-oriented design differencing we consider the following types of design changes per design aspect. For containment:

• Addition/deletion of an operation or an attribute.

• Change of an operation, which includes any kind of change in its signature (i.e., change of visibility, addition/deletion of modifiers, change of return type, renaming of the operation’s name, change in the order of parameters, change in the types of parameters and addition/deletion of parameters).

• Change of an attribute, which includes change of the attribute’s visibility, addition/deletion of modifiers, change of the attribute’s type and renaming of the attribute’s name.

For inheritance:

• Addition/deletion/change of the class being extended by a given class.

• Addition/deletion of an interface being implemented by a given class.

For usage:

• Addition/deletion of an operation call or attribute access within the body of an operation.

• Change of an attribute access. This type of change refers to attribute accesses which either correspond to changed attribute declarations or have been replaced with accesses to other attributes (possibly declared in a different class) having the same type.

For the extraction of true occurrences we have followed a procedure that ensures, to a large extent, a reliable and unbiased comparison of the examined differencing approaches. Two of the authors of the paper have independently compared the source code of all JFreeChart classes throughout successive stable versions2.

The comparison has been performed with the help of a source-code differencing tool offered by the Eclipse IDE. The employed tool provides a more sophisticated view of the performed changes in the sense that it is able to associate a change with the context of the source code element where the change occurred. In contrast to traditional text differencing tools, the Eclipse differencing tool offers an additional view as the one illustrated in Figure 3 showing the changes that were performed in class PaintItem between versions 1.0.5 and 1.0.6.

Fig. 3. Differencing view offered by the Eclipse IDE

In this view, the listed class members are those on which changes have been performed between the two compared versions. Furthermore, the plus (+) and minus (-) symbols indicate that a change occurred in the signature of the corresponding class member (the plus symbol is used to represent the previous value of the changed class member, while the minus symbol is used to represent the next value of the changed class member). The absence of a symbol indicates that the change occurred within the body of the corresponding class member, thus not affecting its signature. By double-clicking on the elements shown in the differencing view, it is possible to directly inspect the changes on the actual source code and make a safer conclusion about the nature of each change. This differencing view feature offered by the Eclipse IDE made significantly easier, faster and more accurate the manual inspection of the changes that occurred throughout the evolution of JFreeChart. Clearly, this type of inspection is prohibitively time consuming, which is why automated differencing methods are being developed.

The two authors examined 14 successive version pairs for containment and inheritance (ranging from version 1.0.0 to version 1.0.13) and 8 successive version pairs for usage (ranging from version 1.0.0 to version 1.0.8). The reason for selecting a smaller number of version pairs for usage is that the number of usage changes per version pair is significantly larger, thus making the examination of all version pairs impossible. Furthermore, it is significantly harder to manually inspect usage changes, since they occur within the body of the operations, which in turn may have a complex structure and a large number of overlapping changes. Finally, the reason for selecting this specific version range is that the classes in versions prior to 1.0.0 are placed in a completely different package structure making difficult their mapping to the new package structure (introduced after version 1.0.0).³ Moreover, we have selected the latest versions in the evolution of JFreeChart (until the last/current version 1.0.13), since they cover a more mature development phase of the examined project. Furthermore, they contain a larger number of larger classes, which allows us to test the scalability of the examined differencing techniques.

After the completion of the independent comparison of all classes throughout the aforementioned versions, the two authors merged their results by reaching a common consensus in the cases of a different change interpretation. The cases that required a more careful re-examination usually involved operations or attributes that have been actually renamed. In some of these cases, one of the authors interpreted the change as a deletion of a class member and an addition of a new one, while the other author interpreted it as a change to the same class member.

The number of true occurrences for each type of change per design aspect (i.e., containment, inheritance and usage) is shown in Tables 6, 7, 8 respectively. As it can be observed from the tables, most of the actually performed changes are additions, especially in the containment and inheritance aspects. This is not surprising, since JFreeChart is a Java library that is used by client applications for creating and displaying charts. Consequently, its developers tried to maintain a consistent public interface throughout its evolution without performing several deletions and signature changes.

2 http://sourceforge.net/projects/jfreechart/
3 Note that UMLDiff can handle this type of overall source-code reorganization, as it is capable of recognizing class moves across packages. VTracker is also, in principle, capable; however, the time complexity of comparing whole system structures as trees is prohibitive.

Table 6. True Occurrences for containment (operations and attributes)

Versions        Added   Removed   Changed   Added   Removed   Changed
                oper.   oper.     oper.     attr.   attr.     attr.
1.0.0-1.0.1      10       0         0         1        0         0
1.0.1-1.0.2      60       0         0        17        1         0
1.0.2-1.0.3      86       3         2        29        0        16
1.0.3-1.0.4      70       1         3         9        1         0
1.0.4-1.0.5      85       0         5        11        1         1
1.0.5-1.0.6      78       7         2        22        1         2
1.0.6-1.0.7     125       0         3        50        3         2
1.0.7-1.0.8      36       0         0         6        0         0
1.0.8-1.0.8a      4       0         0         0        0         0
1.0.8a-1.0.9     15       1         1         0        0         0
1.0.9-1.0.10     94       0         3        11        0         6
1.0.10-1.0.11   117       0         1        41        4         3
1.0.11-1.0.12    45       2         0        11        1         4
1.0.12-1.0.13   160       4         6        50        2         0
TOTAL           985      18        26       258       14        34

Table 7. True Occurrences for inheritance (generalizations and realizations)

Versions        Added    Removed   Changed   Added     Removed
                gener.   gener.    gener.    realiz.   realiz.
1.0.0-1.0.1       1        0         0          2         0
1.0.1-1.0.2       3        0         0          3         0
1.0.2-1.0.3      16        0         0         23         0
1.0.3-1.0.4       5        0         0         17         1
1.0.4-1.0.5       3        0         0          5         0
1.0.5-1.0.6       6        0         0         11         0
1.0.6-1.0.7      18        0         0         52         0
1.0.7-1.0.8       0        0         0          0         0
1.0.8-1.0.8a      0        0         0          0         0
1.0.8a-1.0.9      0        0         0          0         0
1.0.9-1.0.10      4        0         0         35        18
1.0.10-1.0.11     6        0         0         23         0
1.0.11-1.0.12     0        0         0          0        18
1.0.12-1.0.13     9        0         0         30         0
TOTAL            71        0         0        201        37


Table 8. True Occurrences for usage (operation calls and attribute accesses)

Versions       Added    Removed   Changed   Added      Removed    Changed
               oper.    oper.     oper.     attr.      attr.      attr.
               calls    calls     calls     accesses   accesses   accesses
1.0.1-1.0.2     119       31        25         51          6          0
1.0.2-1.0.3     306       99        47         72         31        134
1.0.3-1.0.4     180       23        18         82         15          0
1.0.4-1.0.5     143      102        64        109         14         11
1.0.5-1.0.6     266       97        85         36         20          5
1.0.6-1.0.7     210       74        46        106         28         13
1.0.7-1.0.8      84      223       115         21          2          0
TOTAL          1324      650       400        489        117        164

4.4 Evaluation of Precision and Recall

In order to evaluate the accuracy of the two examined differencing approaches, we should compare the set of true occurrences with the results reported by each tool. For this purpose, we have defined a common report format per design aspect (i.e., containment, inheritance and usage) in order to make easier the comparison of the results reported by each tool with the set of true occurrences. Next, we generated human readable textual descriptions of the true occurrences for each examined version pair of JFreeChart and per design aspect (based on the common report format). Finally, we transformed the output produced by each tool to the common report format. In particular, we have created a parser that goes through the changes reported in the edit scripts produced by VTracker and generates a report per design aspect following the common format rules. Additionally, we executed a set of appropriate queries on the database tables where UMLDiff stores the change facts of interest and transformed the results of the queries into the common report format.

The source code required for the replication of the experiment along with the gold standard containing the actual changes that occurred between the successive versions of JFreeChart and the edit scripts produced by VTracker and UMLDiff are available online4.

For the computation of precision and recall we need to define and quantify three measures, namely:

• True Positives (TP): the number of true occurrences reported by each examined tool.

• False Positives (FP): the number of false occurrences reported by each examined tool.

• False Negatives (FN): the number of true occurrences not reported by each examined tool.

After determining the values for the three aforementioned measures the accuracy of each examined tool can be computed based on the following formulas:

Precision = TP / (TP + FP)                                (1)

Recall    = TP / (TP + FN)                                (2)

4 http://hypatia.cs.ualberta.ca/~vtracker/
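
Operationally, once both the gold standard and a tool's report are expressed in the common format, the two measures reduce to simple set arithmetic, as in the following sketch (the change descriptions are invented):

def precision_recall(reported, truth):
    tp = len(reported & truth)    # true positives
    fp = len(reported - truth)    # false positives
    fn = len(truth - reported)    # false negatives
    return tp / (tp + fp), tp / (tp + fn)

truth = {"add Operation foo()", "remove Attribute bar", "change Operation baz()"}
reported = {"add Operation foo()", "change Operation baz()", "add Attribute qux"}
print(precision_recall(reported, truth))   # (0.666..., 0.666...)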

In Tables 9, 10, 11 we present the results of precision and recall for the containment, inheritance and usage design aspects, respectively.

Table 9. Precision (P) and recall (R) per type of change for containment

                       VTracker           UMLDiff
                       P (%)    R (%)     P (%)    R (%)
Added operations        100      100       99.4     97.6
Removed operations      100      100       55.5     83.3
Changed operations      100      100       100      100
Added attributes        98.4     98        98.4     98
Removed attributes      75       64.3      64.7     78.6
Changed attributes      83.3     88.2      91.9     100

Table 10. Precision (P) and recall (R) per type of change for inheritance

                           VTracker           UMLDiff
                           P (%)    R (%)     P (%)    R (%)
Added generalizations       100      100       100      100
Removed generalizations     N/A      N/A       N/A      N/A
Changed generalizations     N/A      N/A       N/A      N/A
Added realizations          100      100       84.4     100
Removed realizations        100      100       N/A      0

N/A: not applicable due to zero by zero division.

Table 11. Precision (P) and recall (R) per type of change for usage

                             VTracker           UMLDiff
                             P (%)    R (%)     P (%)    R (%)
Added operation calls         99       93.6      83.6     87.7
Removed operation calls       99.3     88.5      99.6     92
Changed operation calls       79.7     100       100      82.2
Added attribute accesses      99.8     97.1      98.5     95.3
Removed attribute accesses    99       88        98.9     77.8
Changed attribute accesses    92.1     100       100      6.1

4.4.1 VTracker

As shown in Table 9, VTracker demonstrated an absolute precision and recall in identifying the actual changes that occurred in operations, but failed to identify correctly some changes which were related to attributes. In total, VTracker missed 4 changes in attributes:


• In versions 1.0.4-1.0.5 and class AbstractBlock, the attribute border with type BlockBorder was changed to attribute frame with type BlockFrame. This double change (i.e., attribute renaming and type change) was reported as a removal of attribute border from version 1.0.4 and an addition of attribute frame in version 1.0.5.

• In versions 1.0.10-1.0.11 and class XYDrawableAnnotation, the attributes width and height were renamed to displayWidth and displayHeight. VTracker produced an incorrect mapping of the renamed attributes with other attributes of the class.

• In versions 1.0.10-1.0.11 and class PaintScaleLegend, the static and final attribute SUBDIVISIONS was changed to non-static and non-final attribute subdivisions. This change was reported as a removal of the original attribute and an addition of a new one.

Moreover, VTracker reported erroneously 6 cases of attribute changes that were actually removals of fields from previous versions and additions of new ones.

Table 10 shows that VTracker demonstrated an absolute precision and recall in identifying inheritance related changes.

Finally, VTracker demonstrated a relatively high precision and recall in identifying usage-related changes (see Table 11). The lowest percentage is observed in the precision for changed operation calls (79.7%). This is due to a significant number of cases that were identified as changed operation calls, while actually they correspond to removals of operation calls from previous versions (usually by deleting code fragments within the body of the operations) and additions of new operation calls.

4.4.2 UMLDiff

In general, UMLDiff demonstrated a high precision and recall in identifying containment related changes (Table 9). In comparison with VTracker, UMLDiff performed better in the identification of changed attributes. This means that the use of domain-specific heuristics (e.g., by combining attribute usage information) can lead to better results especially with respect to the renaming of attributes.

As shown in Table 10, UMLDiff failed to identify correctly all removals of realizations. Moreover, the realizations that were supposed to be reported as removed were actually reported as added (false positives). As a result, this situation had also a negative impact on the precision of added realizations. All problematic cases refer to subclasses that implemented a list of interfaces in a previous version that were removed in the next version. However, the same list of interfaces was implemented by their superclasses in both previous and next versions. We believe that this inaccuracy is caused by the fact that UMLDiff computes and reports transitively all inheritance relationships (i.e., the generalizations and realization relationships of a superclass are also considered as direct relationships for all of its subclasses).

Regarding usage-related changes, UMLDiff demonstrated a low recall in identifying changed attribute accesses (6.1%). All problematic cases refer to accesses of attributes that were renamed or whose type has changed between two versions. Possibly, UMLDiff considers that the access itself does not change when the attribute that it refers to is changed.


4.4.3 Comparison of Overall Accuracy

In Table 12 we present the overall precision and recall (i.e., over all types of changes) per design aspect. It is obvious that VTracker demonstrated better overall precision and recall in all examined design aspects. This result can be mainly attributed to the fact that VTracker performed better on the changes related to operations and operation calls (especially to the operations and operation calls that have been added, Tables 9 and 11), whose number is significantly larger compared to the other types of changes (Tables 6 and 8), and thus its overall precision and recall was positively affected.

Table 12. Overall precision and recall

               VTracker           UMLDiff
               P (%)    R (%)     P (%)    R (%)
Containment     99       98.9      97.7     97.4
Inheritance     100      100       88.1     88.1
Usage           95.6     94        91.8     84.4

It is very important to note that the improved accuracy in the results of VTracker was achieved by using the default implementation of the tree-differencing algorithm and without performing any kind of tuning of the default comparator or similarity function. As already explained in Sections 4.1 and 4.2, we used VTracker “out of the box” (so to speak), simply defining the XML input format for each examined design aspect and specifying the required configuration options. The obtained experimental results on the identification of design changes in object-oriented models open the way for the application of VTracker (and possibly other domain-independent differencing approaches) to other software engineering differencing problems whose artifacts can be represented in the form of XML.

4.5 Evaluation of Efficiency and Scalability

In order to assess the efficiency and scalability of VTracker, we have measured the CPU time required in order to compare the set of XML file pairs corresponding to all the classes of JFreeChart in a given version pair. We performed this analysis for all 14 examined version pairs (starting from version 1.0.0 until version 1.0.13) and per design aspect separately. The measurements have been performed on a MacBookPro5,1 (Intel Core 2 Duo 2.4 GHz and 4 GB DDR3 SDRAM). The results of the analysis are shown in Figure 4.

Fig. 4. CPU time (in seconds) per examined version pair and design aspect for VTracker

As it can be observed from Figure 4, the inheritance design aspect requires the least amount of CPU time (ranging from 16 to 20 seconds for all the classes in a given version pair), the containment design aspect requires a larger amount of CPU time (ranging from 300 to 458 seconds), while the usage design aspect requires the largest amount of CPU time (ranging from 3843 to 6292 seconds, approximately 64 to 105 minutes). From a more detailed analysis of the results, we can conclude that there is an almost linear relation between the size of the compared trees (in terms of the number of their nodes) and the time required for their comparison. For example, when the size of the compared trees is increased by 10 times, the time required for their comparison is also increased by 10 times. This outcome may initially not seem intuitive, since the problem of matching ordered labeled trees is quadratic to the number of nodes by nature. However, VTracker applies a set of heuristics (described in Section 3) that make the performance of the tree-differencing algorithm linear for a major part of the matching problem and quadratic for the rest.

Another interesting observation is that the time required for the analysis of a version pair increases as JFreeChart evolves. This phenomenon can be attributed to two reasons: first because the number of classes increased as the project evolved, and second because the size of some classes increased as the project evolved.

Additionally, we have measured the CPU time required by UMLDiff for the comparison of each JFreeChart version pair in order to provide a direct comparison of efficiency between the two differencing approaches. Figure 5 shows the CPU time required for the comparison of each JFreeChart version pair by VTracker and UMLDiff, respectively. In the case of VTracker, the given CPU time is actually the sum of the CPU times required for differencing each design aspect (Figure 4).

Fig. 5. CPU time (in seconds) per examined version pair for VTracker and UMLDiff

As it can be observed from Figure 5, UMLDiff performed better in every examined JFreeChart version pair and required on average 27% less CPU time compared to VTracker, even though it considered all design aspects in the same context. The separation of the three design aspects is necessary to make the use of VTracker feasible for large systems (otherwise it suffers from insufficient-memory problems and fails). This simplification of the problem also has a positive impact on the accuracy of VTracker, which is, however, difficult to quantify. From VTracker’s efficiency analysis per design aspect, we estimated that the comparison of the XML files representing usage constitutes 93% of the total CPU time. The XML files for the usage design aspect have exactly the same structure as the XML files for containment (Figure 2), with the addition of nodes representing operation calls and attribute accesses (as children of the Operation nodes). As a result, the XML files for the usage aspect contain a significantly larger number of nodes and their alignment requires significantly more processing time, since matching is performed at two levels (i.e., the Operation level and the OperationCall and AttributeAccess level). However, the fact that VTracker can analyze each design aspect separately makes it a more efficient solution for the detection of API-level changes (i.e., changes in the public interface of the examined classes that can be detected by analyzing the containment and inheritance design aspects).

4.6 Threats to Validity

Let us now consider the various threats to the validity of our experiment and findings. In principle, the internal validity of our experiment could potentially be threatened by erroneous application of tools and incorrect observations and interpretations by the experimenters. On the other hand, the threats to external validity of the conducted experiment are associated with factors that could limit the generalization of the results to other examined projects, differencing algorithms and domains.

4.6.1 Internal Validity

The first threat to the internal validity of the conducted experiment is related to the determination of true occurrences. Obviously, the extracted set of true occurrences affects the computation of both precision and recall and consequently could also affect the conclusions of the experiment. This threat was alleviated by two means. First, the extraction of design changes was performed independently by two of the authors and their results were merged by reaching a common consensus in the cases of a different change interpretation. In this way, we tried to eliminate the bias in the interpretation of changes. Second, the authors inspected the changes with the help of a sophisticated source code differencing tool offered by the Eclipse IDE. This tool made easier and more accurate the inspection and interpretation of changes in comparison to generic text differencing tools, which are not able to associate a change with the context of the source code element where the change occurred. In this way, we tried to eliminate human errors in the process of manually identifying source code changes.

The second threat to the internal validity of the conducted experiment is related to the correct and proper use of the examined differencing tools. Obviously, this could affect the results being reported by the examined tools and consequently the conclusions of the experiment. This threat was alleviated by taking advice directly from the developers of the tools (who are also authors of this paper) on how to properly configure, execute and collect the change information. More specifically, the developer of UMLDiff (Xing) specified the queries required for the extraction of the examined design changes from the database in which the change facts are stored. Furthermore, the developer of VTracker (Mikhaiel) gave advice towards the construction of XML input files that optimize the accuracy and efficiency of VTracker, the proper configuration of VTracker for the employed XML schema representation, and finally the correct parsing of the produced edit script describing the changes.

4.6.2 External Validity

Regarding the generalization of the results to other projects, we have selected an open-source project, namely JFreeChart, which has been widely used as a case study in several empirical studies and source code differencing experiments in particular. Therefore, it can be considered as a rather representative and suitable project for this kind of experiments. However, it should be noted that JFreeChart is a project that evolved mostly by adding new features and fixing bugs. Moreover, due to the fact that it is a library, it has not been subject to a large number of refactoring activities (a heavily refactored library would cause several compilation problems to already existing client applications). Obviously, the presence of complicated refactorings in the evolution of a project would have a significant impact on the accuracy of any differencing technique. As a result, we cannot claim that the results can be generalized to any kind of software projects (e.g., frameworks, APIs, applications).

Regarding the generalization of the results to other differencing algorithms, we have compared a generic domain-agnostic algorithm (VTracker) with a domain-specific algorithm (UMLDiff), which is considered as the state-of-the-art in the domain of object-oriented model differencing. Several prior experimental studies [19], [25] have demonstrated a high accuracy for UMLDiff in accordance with the results of this experiment. Therefore, it can be considered as one of the best differencing algorithms in its domain.

Finally, regarding the generalization of the results to other domains, we have selected a domain, namely object-oriented design models, which is very rich in terms of model elements and relationships among them. As a result, we could assume that our generic algorithm would demonstrate a similar performance in domains having a similar or lower complexity, such as Web service specification documents in the form of WSDL files. However, this assumption needs to be empirically validated with further experiments.


5 Related Work

The general area of software-model differencing is quite vast. A pretty comprehensive overview can be found in Chapter 2 of Xing’s thesis [28]. In this paper, we eclectically review the most relevant work (Section 5.1) and we discuss the work of our own team building on UMLDiff and VTracker (Section 5.2).

5.1 Object-Oriented Design Differencing

Object-oriented software systems are better understood in terms of structural and behavioral models, such as UML class and sequence models. UML modeling tools often store UML models in XMI (XML Metadata Interchange) format for data-interchange purposes. XML-differencing tools (such as DeltaXML5), applied to these readily available XMI representations, report changes of XML elements and attributes, ignoring the domain-specific semantics of the concepts represented by these elements. VTracker, with its domain-aware affine cost function and its ability to take references into account, addresses exactly this problem of domain-aware XML differencing. VTracker (and its precursor algorithms) has in fact been applied to other domains, including HTML comparison [9], RNA alignment [7], and WSDL comparison [8, 31].

In the context of UML differencing, several UML modeling tools come with their own UML-differencing methods [2, 11]. Each of these tools detects differences between subsequent versions of UML models, assuming that these models are manipulated exclusively through the tool in question, which manages persistent identifiers for all model elements. Relying on consistent and persistent identifiers is clearly not possible if the development team uses a variety of tools, which is usually the case.

More generally, on the subject of reasoning about similarities and differences between UML models, we should mention Egyed's work [3] on a suite of rule-, constraint-, and transformation-based methods for checking the consistency of the evolving UML diagrams of a software system. Similarly, Selonen et al. [13] have developed a method for UML transformations, including differencing.

Kim et al. [5] developed a method for object-oriented software differencing that works at the level of the source code itself (and does not require its design model). The algorithm takes as input two versions of a program and starts by comparing the method headers from each program version, identifying the ones that best match at the lexical level, based on a set of matching rules and a similarity threshold. The algorithm iteratively and greedily selects the best rule to apply to identify the next pair of matching methods in order to maximize the total number of matches. This idea was later extended to LSDiff (Logical Structural Diff) [4], which involves more rules.

More recently, Xing [29] proposed GenericDiff, a general framework for model comparison.

5 Mosell EDM Ltd: http://www.deltaxml.com

GenericDiff represents a domain-independent approach to model differencing that is nevertheless aware of domain-specific properties and syntax. In this approach, the domain-specific inputs are separated from the general graph-matching process and are encoded using composite numeric vectors and a pair-up graph, which allows the domain-specific properties and syntax to be handled uniformly during matching. GenericDiff is similar to VTracker in that both model the subject systems in terms of a more abstract representation; they differ in that GenericDiff adopts a bipartite-graph model whereas VTracker adopts a tree model.

5.2 Work Building on UMLDiff and VTracker

In this section, we review research from our team, building on UMLDiff and VTracker for different use cases in design differencing: (a) understanding the design changes between two versions of a system; (b) analyzing the evolution history of a system and its constituent components; (c) comparing the intended vs. the as-implemented design of a system; and (d) merging out-of-sync versions of software.

Both UMLDiff and VTracker have been applied to the task of UML-design differencing. UMLDiff was implemented in the context of JDEvAn [27], an Eclipse plugin that the developer can invoke to query a pre-computed database of design changes and the analyses based on them. The envisioned usage of UMLDiff in the context of JDEvAn was that it would be applied as an off-line process to pairs of "stable" releases of the system as a whole, and its results would be made available to developers in the context of their development tasks, e.g., when looking at the recent changes of an individual class or reviewing the refactorings across the system during the most recent releases.

VTracker, on the other hand, was implemented as a service accessible through WebDiff [14], a web-based user interface. In the context of the WebDiff portal, VTracker can be applied to any level of logical models, including models of systems, packages or individual classes. Table 13 below identifies the publications in which these studies are described in detail.

Table 13. Studies with UMLDiff and VTracker

                                      UMLDiff/JDEvAn                  VTracker/WebDiff
  design changes                      [19], [25]                      [14]
  longitudinal class/system analysis  [16], [17], [18], [21], [23]
  design vs. code differencing                                        [14]
  refactoring and merging             [22], [24], [26]

5.2.1 Longitudinal Analysis of Individual Classes and the Overall System

Ever since Lehman and Belady first formulated the "Laws of Software Evolution" in 1974, describing the balance between forces driving new software development and forces that slow down progress and increase the brittleness of a system, software-engineering research has been investigating different metrics and methods for analyzing evolution to recognize the specific forces at play at a particular point in the life of a system.

Relying on UMLDiff, we developed a method for analyzing the long-term evolution history of a system as a whole, its individual classes, and related class collections, based on metrics summarizing its design-level changes. Given a sequence of UML class models, extracted from a corresponding sequence of code releases, we can use UMLDiff to extract the design-level changes between each pair of subsequent code releases, to construct a sequence of system-wide system-change transactions and class-specific class-change transactions.

To analyze potential co-evolution patterns between sets of system classes [18, 23], we first discretized the class-change transactions into a sequence of 0s (when there was no change to the class) and 1s (when there was at least some change to the class). In a subsequent experiment, we conducted a more refined discretization process, classifying the collection of changes that each class suffered into one of five discrete categories, depending on whether it exhibited a high, low, or average number of element additions, deletions, and changes. We then applied the Apriori association-rule mining algorithm to recognize sets of coevolving classes (as itemsets). Recognizing coevolving classes is interesting since co-evolution implies design dependencies among the coevolving classes; when such dependencies are undocumented, they are likely to be unintentional and possibly undesirable. In fact, co-evolution is frequently referred to as a "bad design smell", implying the need for refactoring.

In addition to co-evolution, we have explored two more types of analyses of longitudinal design evolution. We used phasic analysis to recognize distinct phases in the discretized evolution profile of a design entity, whether it is the system as a whole or an individual class. Intuitively, a phase consists of a consecutive sequence of system versions, all of which exhibit similar classifications of changes. Identifying a phase in a class-evolution profile may provide some insight regarding the development goals during the corresponding period. We further used Gamma analysis to recognize recurring patterns in the relative order of phases in an evolution profile, such as the consistent precedence of one phase type over another. Different process models advocate distinct orderings of activities in the project lifecycle; Gamma analysis can reveal such consistent relative orderings and, thus, hint at the adopted process model. In particular, Gamma analysis provides a measure of the general order of elements in a sequence and a measure of the distinctiveness or overlap of element types.

Finally, we developed a set of special-purpose queries [22, 24] to the design-changes database to extract information about combinations of design-level changes that are characteristic of refactorings.

5.2.2 Design vs. Code Differencing

We have experimented with reflexion, i.e., the comparison of the design as intended against the design as implemented in the code (extracted through reverse-engineering tools), using VTracker through the WebDiff portal. It is interesting to note here that although both UMLDiff and VTracker are equally applicable to this task, pragmatically VTracker is a better choice. Since UMLDiff is implemented as a Java-based program accessing a database of extracted design-level facts, applying it to this task would require developing a parser for XMI to extract the relevant design facts from a UML design and store them in the JDEvAn [27] database for UMLDiff. VTracker, on the other hand, requires as input XML documents that are readily available as the products of either a design tool or a reverse-engineering tool.

5.2.3 Software Merging

A particularly interesting case of software merging is that of migrating applications to newer versions of libraries and/or frameworks. Applications built on reusable component frameworks are subject to two independent, and potentially conflicting, evolution processes. The application evolves in response to the specific requirements and desired qualities of the application's stakeholders. On the other hand, the evolution of the component framework is driven by the need to improve the framework functionality and quality while maintaining its generality. Thus, changes to the component framework frequently change its API, on which its client applications rely, and, as a result, these applications break.

Relying on UMLDiff, the Diff-CatchUp tool [26] tackles the API-evolution problem in the context of reuse-based software development: it automatically recognizes the API changes of the reused framework and proposes plausible replacements for the "obsolete" API, based on working examples from the framework code base. The fundamental intuition behind this work is that when a new version of the framework is developed, it is usually accompanied by a test suite that exercises it. This test suite constitutes an example of how to use the new framework version and can serve as a reference for other client applications that need to migrate.

6 Summary and Conclusion

In this paper, we reviewed two different algorithms and their corresponding tool implementations for object-oriented design differencing, a task that is essential for the purposes of (a) recognizing design-level changes between two versions of a software system; (b) comparing the intended design of a system against its as-implemented design; (c) analyzing the long-term evolution of a system and its constituent components; and (d) merging out-of-sync versions of software.

UMLDiff and VTracker assume the same basic conceptual model of UML models, namely, as trees, where nodes correspond to design elements, their children correspond to the elements’ contents, and additional edges connect them to other “related” design elements. The actual representations on which the two algorithms operate are different. UMLDiff works on a database of design facts, precisely reflecting the UML relations in the system. VTracker works on XML documents and primarily exploits and relies on the tree structure of these documents, as opposed to the semantics of the underlying UML relations they represent. Together, they give us an interesting test-bed on which to study software-model differencing in general.

In order to compare the two approaches, we first extracted the actual design changes that occurred between successive versions of the JFreeChart open-source project and used them as the set of true occurrences. This gold standard has been made publicly available and can serve as a benchmark for the evaluation of other differencing techniques, as well as for the replication of the conducted experiment. Based on the extracted set of true occurrences, we computed the precision and recall of VTracker and UMLDiff and compared their accuracy for several types of changes within three design aspects, namely containment, inheritance and usage. In general, VTracker proved to be more accurate than UMLDiff over most types of changes per design aspect, despite being domain-independent. UMLDiff performed better than VTracker only in the identification of changed attributes. The experimental results open the way for the application of VTracker to other software engineering differencing problems whose artifacts can be represented in the form of XML.

Finally, we performed an efficiency analysis based on the CPU time required by VTracker and UMLDiff for the comparison of all classes per version pair of JFreeChart. We concluded that VTracker's performance is comparable to that of UMLDiff, as VTracker required on average 27% more CPU time. Additionally, the analysis showed an almost linear relation between the size of the compared trees (in terms of the number of their nodes) and the time required for their comparison; thus, the VTracker algorithm can be efficiently applied to problem domains of even larger size.

The fundamental contribution of this study is that it demonstrates VTracker's relevance to software differencing, as a flexible and effective tool for recognizing changes in software evolution. In the future, we plan to apply VTracker to more instances of this general problem, by developing more XML representations of software, towards producing a general software-differencing service.

Acknowledgements. Many more people have been involved in this work over the years during which these algorithms were being developed and evaluated, including Brendan Tansey, Ken Bauer, Marios Fokaefs and Fabio Rocha. Their contributions towards this body of work have been invaluable and we are grateful for them. This work has been supported by NSERC, AITF (former iCORE) and IBM.

References

1. Dulucq, S., Tichit, L.: RNA secondary structure comparison: exact analysis of the Zhang–Shasha tree edit algorithm. Theoretical Computer Science 306(1-3), 471–484 (2003)

2. Comparing and merging UML models in IBM Rational Software Architect, http://www-128.ibm.com/developerworks/rational/library/05/712_comp/

3. Egyed, A.: Scalable consistency checking between diagrams - The VIEWINTEGRA approach. In: Proceedings of the 16th International Conference on Automated Software Engineering, pp. 387–390 (2001)

4. Kim, M., Notkin, D.: Discovering and Representing Systematic Code Changes. In: Proceedings of the 31st International Conference on Software Engineering, pp. 309–319 (2009)

5. Kim, M., Notkin, D., Grossman, D.: Automatic Inference of Structural Changes for Matching Across Program Versions. In: Proceedings of the 29th International Conference on Software Engineering, pp. 333–343 (2007)

6. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)

7. Mikhaiel, R., Lin, G., Stroulia, E.: Simplicity in RNA Secondary Structure Alignment: Towards biologically plausible alignments. In: Post Proceedings of the IEEE 6th Symposium on Bioinformatics and Bioengineering, pp. 149–158 (2006)

8. Mikhaiel, R., Stroulia, E.: Examining Usage Protocols for Service Discovery. In: Dan, A., Lamersdorf, W. (eds.) ICSOC 2006. LNCS, vol. 4294, pp. 496–502. Springer, Heidelberg (2006)

9. Mikhaiel, R., Stroulia, E.: Accurate and Efficient HTML Differencing. In: Proceedings of the 13th International Workshop on Software Technology and Engineering Practice, pp. 163–172 (2005)

10. Mikhaiel, R.: Comparing XML Documents as Reference-aware Labeled Ordered Trees, PhD Thesis, Computing Science Department, University of Alberta (2011)

11. Ohst, D., Welle, M., Kelter, U.: Difference tools for analysis and design documents. In: Proceedings of the 19th International Conference on Software Maintenance, pp. 13–22 (2003)

12. Schofield, C., Tansey, B., Xing, Z., Stroulia, E.: Digging the Development Dust for Refactorings. In: Proceedings of the 14th International Conference on Program Comprehension, pp. 23–34 (2006)

13. Selonen, P., Koskimies, K., Sakkinen, M.: Transformations between UML diagrams. Journal of Database Management 14(3), 37–55 (2003)

14. Tsantalis, N., Negara, N., Stroulia, E.: WebDiff: A Generic Differencing Service for Software Artifacts. In: Proceedings of the 27th IEEE International Conference on Software Maintenance, pp. 586–589 (2011)

15. Wagner, R.A., Fischer, M.J.: The string-to-string correction problem. Journal of the ACM 21(1), 168–173 (1974)

16. Xing, Z., Stroulia, E.: Understanding Phases and Styles of Object-Oriented Systems’ Evolution. In: Proceedings of the 20th International Conference on Software Maintenance, pp. 242–251 (2004)

17. Xing, Z., Stroulia, E.: Understanding Class Evolution in Object-Oriented Software. In: Proceedings of the 12th International Workshop on Program Comprehension, pp. 34–45 (2004)

18. Xing, Z., Stroulia, E.: Data-mining in Support of Detecting Class Co-evolution. In: Proceedings of the 16th International Conference on Software Engineering & Knowledge Engineering, pp. 123–128 (2004)

19. Xing, Z., Stroulia, E.: UMLDiff: an algorithm for object-oriented design differencing. In: Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, pp. 54–65 (2005)

20. Xing, Z., Stroulia, E.: Towards Experience-Based Mentoring of Evolutionary Development. In: Proceedings of the 21st IEEE International Conference on Software Maintenance, pp. 621–624 (2005)

21. Xing, Z., Stroulia, E.: Analyzing the Evolutionary History of the Logical Design of Object-Oriented Software. IEEE Trans. Software Eng. 31(10), 850–868 (2005)

22. Xing, Z., Stroulia, E.: Refactoring Practice: How it is and How it Should be Supported - An Eclipse Case Study. In: Proceedings of the 22nd IEEE International Conference on Software Maintenance, pp. 458–468 (2006)

23. Xing, Z., Stroulia, E.: Understanding the Evolution and Co-evolution of Classes in Object-oriented Systems. International Journal of Software Engineering and Knowledge Engineering 16(1), 23–52 (2006)

24. Xing, Z., Stroulia, E.: Refactoring Detection based on UMLDiff Change-Facts Queries. In: Proceedings of the 13th Working Conference on Reverse Engineering, pp. 263–274 (2006)

25. Xing, Z., Stroulia, E.: Differencing logical UML models. Autom. Softw. Eng. 14(2), 215–259 (2007)

26. Xing, Z., Stroulia, E.: API-Evolution Support with Diff-CatchUp. IEEE Trans. Software Eng. 33(12), 818–836 (2007)

27. Xing, Z., Stroulia, E.: The JDEvAn tool suite in support of object-oriented evolutionary development. In: Proceedings of the 30th International Conference on Software Engineering (ICSE 2008 Companion), pp. 951–952 (2008)

28. Xing, Z.: Supporting Object-Oriented Evolutionary Development by Design Evolution Analysis, PhD Thesis, Computing Science Department, University of Alberta (2008)

29. Xing, Z.: Model Comparison with GenericDiff. In: Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering, pp. 135–138 (2010)

30. Zhang, K., Shasha, D.: Simple fast algorithm for the editing distance between trees and related problems. SIAM Journal on Computing 18(6), 1245–1262 (1989)

31. Fokaefs, M., Mikhaiel, R., Tsantalis, N., Stroulia, E., Lau, A.: An Empirical Study on Web Service Evolution. In: Proceedings of the IEEE International Conference on Web Services, ICWS 2011, pp. 49–56 (2011)

Model Management in the Wild

Richard F. Paige, Dimitrios S. Kolovos, Louis M. Rose, Nikos Matragkas, and James R. Williams

Department of Computer Science, University of York, UK
{paige,dkolovos,louis,nikos,jw}@cs.york.ac.uk

Abstract. Model management is the discipline of manipulating models in support of Model-Driven Engineering (MDE) scenarios of interest. Model management may be supported and implemented via different operations – e.g., model transformation, validation, and code generation. We motivate the concepts and processes of model management, present a practical model management framework – Epsilon – and briefly indicate how it has been used to solve significant model management problems 'in the wild'.

1 Introduction

Model-Driven Engineering (MDE) requires organisations to invest substantial effort into constructing models and supporting the modelling process. Models range from design artefacts (e.g., in UML, SysML, or bespoke domain-specific languages), to requirements, to documentation and reports, to what-if analyses, and beyond. Constructing models is expensive and time consuming, but it is an investment. What return do organisations get on that investment? How do they extract value from their models? Or, to put it another way: once models have been constructed, what can you do with them?

Model management is the discipline of manipulating models. In this paper, we will see different ways in which engineers may want to manipulate models, based on real scenarios from real engineering projects1. We take a broad and inclusive interpretation of model: they capture phenomena of interest, and are constructed for a purpose. The language used for modelling (whether it is general-purpose, like UML, or domain-specific) is a secondary issue. However, as we shall see, we require models to be well-formed and implemented by tools – as such, models need to have a metamodel (i.e., a language definition) that distinguishes well-formed models from ill-formed ones.

Model management requires a clear conceptual basis, a sound theory, and practical and scalable tools that enable and implement these foundations. The focus of this paper, and of our research on MDE in the last six years, has been on developing a clear conceptual basis and corresponding practical tool support; the tools allow us to experiment with ideas in an efficient and flexible way.

To start answering the key question – what can you do with models once they have been constructed? – let's examine several realistic scenarios.

1 Though names and some technical details have been changed or simplified.

These scenarios have all been derived from real projects involving modelling and model management.

1.1 Transformation Scenario

A common model management scenario is that involving different kinds of model transformation. In such scenarios, models have been constructed in some suitable language, and need to be transformed to models in a different language2. A typical type of transformation is to generate code (sometimes called model-to-text transformation, but there are many other kinds – [1] is a comprehensive reference). A transformation is thus a model management operation that enables application of a new task, once target models have been produced.

Here is a concrete example. In the MADES project3, two technical organisations – EADS and TXT – have specific modelling requirements for building their embedded systems. These requirements effectively require the use of UML profiles, including MARTE, as these languages are already familiar to engineers and provide comprehensive support for the kinds of concepts that are needed in-house. However, these models are not sufficient for their full engineering processes: the MARTE models must be transformed to enable other tasks, including verification and platform-specific code generation.

Figure 1 illustrates one transformation scenario in MADES. In this scenario, hardware diagrams (expressed in MARTE) are transformed to a hardware architecture description (in this particular case), which is a different kind of model in a different modelling language (one, in fact, constructed to support transformation to a 'virtual platform').

Fig. 1. Model transformation to hardware architecture for a virtual platform

From this description, different platform-specific descriptions can be generated. This latter type of transformation is usually called code generation, but in general it is a model-to-text transformation. The MADES project provides a number of code generators for different platforms, as well as model-to-text transformations to support verification in a number of ways.

2 Other scenarios – such as those where source and target language are the same – are of course possible.

3 http://www.mades.eu

1.2 Modification and Validation

Our second scenario is rather different. Consider the model of a sensor array illustrated in Figure 2; cubes, diamonds and circles represent physical components in the array (e.g., a sensor) while triangles represent communication between components (e.g., via wired or wireless networking). The details are unimportant; effectively, we have a collection of sensors, which measure certain phenomena. Sensors are connected via networking capability, including routers and 'Hubs', which also provide connection to the outside world. The array is meant to be fault tolerant, able to provide functionality (i.e., measure phenomena) even if components fail, network connections drop (e.g., due to a severed wire) or environmental conditions prevent harvesting of data at regular times. It is also to be constructed by domain experts (surveying engineers) with IT skills, but no software engineering expertise.

Fig. 2. Sensor array model

Given such a model, in our scenario we want to modify the model to simulate different kinds and quantities of component and connector failure. At the same time, after modifications, we need to validate the model to ensure that certain minimal structural properties still hold. For example, we may only want to simulate conditions under which three or fewer sensors fail. We could write a model management operation to modify the model to simulate failure of a random number of sensors, and then validate the model that is produced. Only those models satisfying this "three or fewer failures" scenario are kept.

1.3 Structure of the Paper

The structure of the paper is as follows. These scenarios, and others, will be used to introduce model management. We will then move to a presentation of the conceptual foundations of model management, including examples of models and key requirements for model management. We will then discuss the core characteristics of model management operations, and will start to illustrate how these are supported in the Epsilon platform. We will then present Epsilon's support for some of the most common model management tasks, including model transformation, code generation and validation. We will then discuss more advanced model management topics, including comparison of models and migration of models, along with support for chaining operations together.

2 Foundations of Model Management

As we said earlier, model management is the discipline of manipulating models. Model management can support manipulation of models in different ways (including, as we have seen, transformation, code generation and validation). We now describe the foundations of model management – i.e., the shared characteristics of all forms of model manipulation. This conceptual model will be used to help introduce a concrete realisation – the Epsilon platform. In essence, we are trying to understand the commonalities of all model management tasks, so that we can investigate this core in detail.

What tasks can be performed on models? A useful specification of core tasks can be taken from an analysis of the literature on database management systems:

– create new models and model elements that conform to a metamodel;
– read or query models, e.g., to project out information of interest to specific stakeholders. Specific examples of queries include boolean queries to determine whether two or more models are consistent, and queries to select a subset of modelling elements satisfying a particular property;
– update models, e.g., changing the properties of a model, adding elements to a model;
– delete models and model elements.

The scenarios that we have considered already – i.e., model transformation, code generation and validation – can all be expressed in terms of combinations of the above tasks.

However, the above model is coarse-grained, and also does not make explicit the fact that many of our model management scenarios involve manipulating several models at the same time (e.g., model transformations often involve two or more models). Additionally, the above specification of tasks is repository-centric, focusing on requirements for storing and retrieving models rather than on requirements for fine-grained manipulation of models.

An alternative specification of core tasks on models, which we have developed after careful analysis of a number of model management platforms (see [9]), is as follows. Model management tasks are based on the following primitives:

– navigating models to be able to identify and extract model elements of interest;
– modifying models (e.g., add elements, delete elements, change elements);
– accessing multiple models simultaneously, so as to support both inter- and intra-model operations (e.g., merging, transformation).

This alternative specification can be used to specify and implement a variety of concrete model management tasks, including the aforementioned transformation, code generation and validation; this will be demonstrated in the following sections. We do not claim that this set of tasks is complete – it has been derived from analysing model management platforms, observation of tasks carried out by engineers using models, and theoretical concerns. Extensibility of this set of tasks will be discussed later in the paper.

We now introduce a practical model management framework, Epsilon, which implements the conceptual model above, and which will be used throughout the tutorial to illustrate model management concepts and principles.

3 Introduction to Epsilon

In this section we discuss a platform for model management in more detail. Epsilon [6] is both a platform of task-specific model management languages, and a framework for implementing new model management languages by exploiting the existing ones. It is a component of the Eclipse Modelling Project4. Epsilon provides a set of inter-related languages that can be used to implement and execute various model management tasks. The current architecture of Epsilon is illustrated in Figure 3.

Epsilon consists of a set of languages, a connectivity framework (more on that later), and some additional tools to help ease development. Each language has further development tools (e.g., syntax-aware editors). Each language aims to support a particular model management task. More specifically, there is a language for direct manipulation of models (EOL) [9], as well as languages for model merging (EML) [8], model comparison (ECL) [12], model-to-model transformation (ETL) [13], model validation (EVL) [11], model-to-text transformation (EGL) [20], model migration (Flock) [19], and unit testing of model management operations (EUnit).

There's a lot to take in from Figure 3, and we will look at parts of Epsilon in more detail over the course of the paper. For now, we focus on the foundation of Epsilon: the core language, EOL, the Epsilon Object Language. EOL directly supports all of the core primitives described earlier. In particular, it supports navigation of models (via OCL-like expressions and queries), modification of models (via assignment statements), and multiple-model access capabilities. EOL is reused in all of the other languages in the platform. Indeed, the patterns of reuse of EOL in Epsilon are diverse and interesting, and are described in detail in [16].

4 http://www.eclipse.org/epsilon

[Figure: layered architecture of Epsilon – the Epsilon Model Connectivity layer (with EMF, MDR, XML and Z drivers), the Epsilon Object Language built on top of it, and the task-specific languages built on EOL: M2M transformation (ETL), M2T transformation (EGL), model comparison (ECL), model merging (EML), model refactoring (EWL), model validation (EVL), model migration (Flock), and unit testing (EUnit).]

Fig. 3. Epsilon model management platform

3.1 EOL Core Features

EOL reuses a significant part of the Object Constraint Language (OCL), including model querying operations such as select() and iterate(). It defines variables similarly, and has an almost identical type system. Unlike OCL, it supports direct access to multiple models simultaneously. To do this, each model is given a unique name, and the metamodel to which the model conforms must also be specified. Access to a specific metaclass of a model is performed via the ! operator (which is adopted from ATL [5], though in ATL the name of a metamodel is used as an identifier).

So, if UML2 is a UML 2.x model, UML2!Class will return the Class metaclass reference. UML2!Class.all() will return all the instances of the Class metaclass that are contained in UML2. If there are conflicting metaclass names, the full path, e.g., UML!Foundation::Core::Class, can be used.
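
As a minimal sketch of our own (not from the paper), the ! operator combines naturally with OCL-style queries; the model name UML2 and the name and isAbstract features are assumptions about the loaded metamodel:

-- Illustrative sketch only: model name and features are assumptions
var abstractClasses := UML2!Class.all().select(c | c.isAbstract);
for (c in abstractClasses) {
  (c.name + ' is abstract').println();
}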

OCL is a declarative language; EOL adds operational features. This was a pragmatic decision. When using OCL we often find ourselves writing many deeply nested and complicated queries that are difficult to write, parse, understand and debug. As a result we included statement sequencing and blocks. Statements can be composed using ; and blocks are delineated using { and }.

All operations and navigations in EOL are invoked using the '.' operator. For example, EOL provides a built-in print() operation that generates and outputs a String representation of the object to which it is applied.

OCL expressions cannot create, update or delete model elements; for most model management tasks this feature is essential. As a result we include the assignment operator := which performs assignments of values to variables and model element features: e.g., object.name := 'Something'. EOL also extends the built-in collection types (e.g., Bag, Sequence, Set) with operations (such as the add(Any) and remove(Any) operations) that can modify the contents of the collection to which they are applied. EOL also provides element creation and deletion features (new and delete(), respectively).

Finally, EOL provides a notion of user-defined operation; such operations can be used in other EOL programs, but also in any other Epsilon language (e.g., transformations). Such operations are analogous to OCL helpers, but (as described earlier) can refer to multiple models, and can be imported via the import statement. As an example, an operation that checks if a UML!ModelElement has a specific stereotype is displayed in Listing 1.1.

Listing 1.1. EOL Operation example

operation UML!ModelElement hasStereotype(name : String) : Boolean {
  return self.stereotype.exists(st : UML!Stereotype | st.name = name);
}
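
As a brief usage sketch of our own (the stereotype name 'persistent' is purely hypothetical), such an operation can then be called from any EOL expression over the UML model:

-- Hypothetical usage of the operation in Listing 1.1
var persistent := UML!ModelElement.all().select(e | e.hasStereotype('persistent'));
('Found ' + persistent.size() + ' persistent elements').println();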

An example of an EOL program is given in Listing 1.2. The program creates a new object-oriented model and stores it in variable m. It populates the model with five packages, then creates three classes per package. Finally, the program assigns a random superclass to each class in the model.

Listing 1.2. EOL example

var m : new Model;
m.name := 'm';

-- Create five packages
for (i in Sequence{1..5}) {
  var package : Package := new Package;
  package.name := 'p' + i;
  package.package := m;

  -- Create three classes in each package
  for (j in Sequence{1..3}) {
    var class : Class := new Class;
    class.name := 'c' + i + '_' + j;
    class.isAbstract := false;
    class.package := package;
  }
}

-- Assign random supertypes to the classes created
for (c in Class.all) {
  c.extends := Class.all.random();
}

EOL comes with supporting tools, as a set of plug-ins for Eclipse, including an editor, perspectives, wizards, and launching configurations that allow developers to use the language in real problems. More on the architecture of Epsilon will come during the tutorial.

Next, we briefly discuss three other languages in Epsilon: the transformation language (ETL), the model-to-text language (EGL), and the validation language (EVL).

4 Key Languages in Epsilon

Some model management tasks arise more frequently than others. In our experience, the three most common tasks are transformation, generation of text, and validation of models. Epsilon provides three different languages to support each of these tasks: ETL (for model-to-model transformation), EGL (for text generation) and EVL (for validation). Each of these languages reuses EOL – both conceptually and in implementation terms – to provide basic navigation, expression, and querying facilities. Effectively, each language uses EOL to specify fine-grained logic for the concrete tasks that are being carried out.

4.1 Epsilon Generation Language (EGL)

EGL is a template-based model-to-text transformation language; it reuses EOL entirely. A full description, with a number of examples, can be found in [20]. EGL is based on a notion of section, from which templates are constructed. Sections are either static (content appears verbatim in generated text) or dynamic (content is executable code, written in EOL, that is used to control generated text). EGL provides a new object, out, which can be used specifically in dynamic sections; this allows operations to be applied to generated text (such as appending strings).

The concrete syntax of EGL is fairly standard, when contrasted with other template-based code generation languages; however, some syntax is inherited directly from EOL. The block syntax [% %] is used to delimit a dynamic section. Any text not enclosed in such a block is static, and is copied to the output text. Listing 1.3 (from [20]) illustrates the use of dynamic and static sections to form an EGL template.

Listing 1.3. A basic EGL template.

[% for (i in Sequence{1..5}) { %]
i is [%=i%]
[% } %]

[%=expr%] is shorthand for [% out.print(expr); %]; this appends expr to the output generated by the transformation. The out keyword also provides println(Object) and chop(Integer) methods, which can be used to construct text with linefeeds, and to remove a specified number of characters from the end of the generated text.

EGL makes idiomatic several other model-to-text transformation patterns that we have observed in practice, such as protected regions (which provide a mechanism for preserving hand-written text), and beautification (for formatting generated text).

4.2 Epsilon Transformation Language (ETL)

ETL provides model-to-model (M2M) transformation support in Epsilon. The idea with such transformations is that they allow interoperation between languages. So, for example, a typical M2M transformation might be to transform a UML model into a relational database model. Most often, in M2M transformations the source and target languages differ (two special cases of M2M transformation, where source and target languages are identical, are refactoring transformations and queries).

ETL is a hybrid transformation language; it supports both declarative and imperative language features for expressing transformations. It is also a rule-based language, in that the logic of the transformation is expressed as a set of rules that, effectively, describe how target models are produced from source models. The behaviour of these rules is expressed directly in EOL.

ETL has a number of distinctive features (most of which are described in [13]). Some key features include: the ability to transform arbitrary numbers of source models into an arbitrary number of target models; automatic generation of trace-links as a side-effect of the transformation process; and the ability to specify whether, for each target model, its contents should be preserved by the transformation (or overwritten by the transformation).

Listing 1.4 gives an example of an ETL rule.

Listing 1.4. ETL example

rule Tree2Node
  transform t : Tree!Tree
  to n : Graph!Node {

  n.label := t.label;

  if (t.parent.isDefined()) {
    var edge := new Graph!Edge;
    edge.source := n;
    edge.target := t.parent.equivalent();
  }
}

In this example, we have two very simple metamodels/languages: a Tree language and a Graph language. We want to transform each Tree element to a Node in the Graph, and to create an edge that connects the node with the equivalent of its parent. The rule starts with its name, and then specifies that it can transform elements of type Tree in the Tree model into elements of type Node in the Graph model. It then copies the label of the tree to the label of the node. Then, if the parent of the source tree exists, a new edge is created in the graph. The source of this edge is the new Node, and the target is the equivalent of the parent of the tree. The syntax equivalent() is important in ETL: it is effectively a way of calculating the elements that have been produced by applying a transformation rule. Thus, t.parent.equivalent() returns whatever t.parent has been transformed into. (In general, of course, you may have multiple equivalents of a single model element; ETL provides support for gathering these and selecting the ones that you want.)

The execution semantics of ETL programs deserves a brief comment: ETL rules are executed in the order that they are expressed; rules cannot invoke other rules. Thus, there is, behind the scenes, a rule scheduler that orchestrates the rules and executes them. The scheduler also ensures that implicit cycles in the rule structures do not lead to infinite looping (so, model elements are transformed exactly and only once).

A final point: ETL supports interactive transformations, i.e., transformations that interact with a user. This, for example, allows a user to direct a transformation, provide values to store in target models, etc. More information can be found in [13].
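
To connect with the validation example in the next subsection, here is a minimal ETL rule in the same style as Listing 1.4 (our own sketch; the OO and DB model names, their Class and Table types, and the 'T_' naming convention are assumptions mirroring Listing 1.5):

rule Class2Table
  transform c : OO!Class
  to t : DB!Table {

  -- Assumed naming convention: a table T_X for each class X
  t.name := 'T_' + c.name;
}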

4.3 The Epsilon Validation Language (EVL)

Often, after carrying out a model management task (or even after constructing a model), you want to be sure that your models satisfy properties that you judge to be desirable. For example, you may want to ensure that your models are well-formed, or that they obey various stylistic rules and naming conventions. Perhaps you want to ensure that a recently modified model is consistent (in some way) with other models that have not been modified. Such scenarios are supported in Epsilon via the validation language, EVL.

EVL provides a mechanism for writing constraints and critiques on one or more models. Constraints are properties that must hold of a set of models; critiques are desirable properties. These both take the form of a set of expressions on a set of models; the expressions are written using EOL, and can take advantage of all of EOL's features and concepts. Additionally, EVL allows specification of fixes, which are EOL programs that can be invoked on failure of a constraint or a critique. A fix should modify models so that the constraints or critiques are thereafter satisfied.

An example of an EVL program is given in Listing 1.5. The program is the classical example of validating an object-oriented model against a relational database model. Suppose that an OO class model has been constructed, and then an engineer has run an ETL program to produce a relational database model (consisting of tables, columns, keys, etc.). What guarantee is there that the target database model is sound and well-formed? This can be checked using, in part, the EVL program below. The program starts with a description of its context, i.e., the type of instances on which the program will be executed. This example applies to Class instances belonging to an OO model. The program consists of one constraint, TableExists, which first specifies a check condition to be satisfied; this condition is implemented in EOL. Next, the program specifies a message that will be displayed during the validation process; if the condition fails, the message is displayed. Finally, a fix block is provided that can be invoked (on demand, or automatically) so as to repair the model when the condition fails. fix blocks can be arbitrarily complicated EOL programs. This particular EOL program creates a new table with a default name and adds it to the database model. It's possible to have as many fix blocks as you like; if a choice needs to be made, the engineer will be asked to select one (and, if necessary, provide input).

Listing 1.5. EVL example

context OO!Class {

  constraint TableExists {

    // The condition that needs to be satisfied:
    // for every class X there is a table named T_X
    check : DB!Table.all.select(t | t.name = "T_" + self.name).size() > 0

    // The message that is displayed to the user
    // if the check part returns false
    message : "No table found for class " + self.name

    // This is an optional fix which the user may want to invoke
    // to fix this inconsistency
    fix {

      title : 'Add missing table'

      do {
        var table = new DB!Table;
        table.name = "T_" + self.name;
        DB!Database.all.first().contents.add(table);
      }
    }
  }
}

The paper [11] provides further examples and details on EVL, including a discussion of how EVL can be integrated with different editors (particularly those conforming to a model-view-controller architecture). One particular example illustrates how EVL can be integrated with arbitrary GMF editors, such that failed constraints or critiques are indicated directly in the GMF panel (with customisable icons), and the engineer can interact with the concrete syntax directly to invoke fix blocks.
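
For completeness, a critique has the same shape as a constraint but expresses a desirable rather than a mandatory property. The following is our own sketch (not from [11]); it assumes that EOL strings offer a firstToUpperCase() operation, and it flags OO classes whose names are not capitalised:

context OO!Class {

  critique NameIsCapitalised {

    // firstToUpperCase() is an assumed EOL string helper
    check : self.name = self.name.firstToUpperCase()

    message : 'Class ' + self.name + ' should start with an upper-case letter'
  }
}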

4.4 Application of Key Languages

We have experience of applying the key Epsilon languages in numerous industrial and realistic scenarios, including the ones described in the introduction. Another important application of these languages is via the EuGENia toolset5 [14]. EuGENia is an application that automatically generates the models required by the Eclipse Graphical Modeling Framework (GMF) to produce a graphical editor for a modelling language. Given an annotated metamodel (implemented using Ecore), EuGENia automatically generates the .gmfgraph, .gmftool and .gmfmap models needed to implement a GMF editor. EuGENia provides high-level annotations that protect the user from the complexity of GMF – particularly, from the need to create intermediate models and from having to understand the process by which GMF generates these intermediate models and the final editor. EuGENia is designed to lower the entrance barrier to GMF, and it can be used to produce a basic editor, as well as a final, polished version.

How does EuGENia work? It is a chain of model transformations of different kinds. All the transformations are implemented using EOL for efficiency and simplicity, though some of the steps in the chain could be implemented using other Epsilon languages.

EuGENia defines a set of simple GMF-specific annotations, to indicate how different elements of your modelling language should be represented in the editor's concrete syntax. Your modelling language is specified in Ecore, and the EuGENia annotations are applied to Ecore elements (for example, you might specify using an annotation that a particular language element is to be depicted using a rectangle or a dashed line). The annotated metamodel is then passed as input to a model-to-model transformation (called Ecore2GMF) to generate the tooling, graph and mapping GMF models. This is done in one step. After generating these intermediate models, we use a standard built-in GMF transformation (GMFGen) to build the GMF generator model. EuGENia then takes this and applies an update-in-place transformation (FixGMFGen) to capture some of the graphical syntax configuration options. Additional polishing, as discussed in [14], can be applied at this stage. The whole process is illustrated in Figure 4, taken from [14].

EuGENia is thus a substantial model transformation (approximately 1000 lines of code) that does exactly what MDE is supposed to do: it abstracts away unnecessary complexity and hides it from the end-user.

5 http://www.eclipse.org/gmt/epsilon/doc/eugenia/

Fig. 4. The EuGENia transformation workflow [14]

4.5 Summary

This section has given a brief overview of some of the key Epsilon languages, illustrating their important features, and briefly indicating how they have been derived from the core Epsilon language. We also illustrated briefly how these key languages have been used in practice, to automate the process of building GMF editors.

In the next section, we briefly summarise some of the other languages in Epsilon, as well as indications of how Epsilon languages can be and have been used together, often to support scenarios like those discussed earlier. We will also briefly describe the technical architecture of Epsilon, including its connectivity layer and its mechanisms for scalability.

5 Advanced Concepts

Epsilon provides broad model management capabilities, and we have so far seen some of its most widely used features. In this section we will summarise some additional languages and tasks supported by Epsilon, and will give a brief overview of Epsilon's technical architecture.

5.1 Model Comparison

Model comparison involves calculating matching elements between models. What constitutes matching elements is generally problem- or domain-specific; at its most general we say that matching elements are those that are involved in a relationship of interest. An analogy can be drawn to the equality operator = (or ==) in many programming languages: what does equality mean? It depends on the types of the entities involved, and, ultimately, what we want to do with the results of testing for equality. Model comparison, thus, is an operation applied to two or more models that calculates matching model elements.

When might we want to use model comparison? A common scenario is to support so-called model differencing, which is important in providing version control support for MDE. Before calculating the differences between two models, it is usually necessary to identify the pairs of elements that correspond. We might also use comparison before merging homogeneous models (e.g., because we are trying to reconcile two branches of a versioning tree). In this situation, it is essential that matching elements are calculated so that we do not duplicate them in the merged model. Another scenario of use arises with testing of transformations: given a test model, we run a transformation, then compare its results with 'expected' output models.

Epsilon provides a task-specific language, the Epsilon Comparison Language (ECL), to allow engineers to define and execute arbitrary notions of model comparison. Engineers can specify precise comparison algorithms in a rule-based fashion; on execution, an ECL program produces pairs of matching elements between different models. Notably, ECL allows definition of comparison algorithms with arbitrary logic: models can be compared in arbitrary ways, e.g., using model identifiers, using names, or using similarity-based comparison. ECL is independent of the logic used for comparison; engineers specify their own comparison algorithms using ECL rules.

Like other Epsilon languages, ECL is an extension of EOL and uses its syntax and features to encode the logic of the comparison algorithms. Of particular note is that, because of EOL's support for invoking native (Java) code, external comparison solutions – such as string matching libraries, or fuzzy matching libraries – can easily be used.

An example of an ECL program is in Listing 1.6. Like ETL, ECL is rule-based, but the rules define matching logic rather than transformative logic. This particular program compares a System model with a Vocabulary model. The scenario we are using is one where we eventually want to merge an Entity model with a Vocabulary model. Entity models have names, a boolean property and a reference to their enclosing System. Vocabulary models consist of a set of Terms, where each Term has a name and a set of Aliases (Aliases merely include names). The first step in merging is to compare models; the matching that is produced from the comparison (which is encoded as a set of trace-links) can then be fed to a merging program which will reconcile any differences between the models.

Listing 1.6. ECL example

rule MatchSystemWithVocabulary
  match s : Source!System
  with v : Vocabulary!Vocabulary {

  compare {
    return true;
  }
}

rule MatchEntityWithTerm
  match s : Source!Entity
  with t : Vocabulary!Term {

  compare {
    return s.name = t.name or
      t.'alias'.exists(a | a.name = s.name);
  }
}

The ECL program consists of two rules. Similar to ETL transformation rules, each rule has a name and a set of arguments; these arguments indicate the models that are being compared. The first rule in the program compares a System with a Vocabulary model. As these are just containers, any pair of such elements always match; more fine-grained comparisons (evidenced by other rules) can refine such a match.

The second rule in the ECL program compares an Entity with a Term. The comparison logic, expressed in the compare block, is an EOL program that returns a boolean. The expression states that an Entity and a Term match if their names match, or if they have a shared alias.

The example we have shown illustrates comparison in terms of names; ECL is not restricted to this (or, indeed, to any comparison logic). Numerous examples of using ECL (including with external libraries like Simmetrics, illustrating similarity-based matching) can be found in [7]. That paper also discusses accessing the internal representation of a matching through trace-links.
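
To give a flavour of what similarity-based matching involves, the following plain-Java sketch (not taken from the paper; Simmetrics and similar libraries expose richer metrics through their own, different APIs) computes a normalised edit-distance score of the kind a compare block could call through EOL's native Java interface:

    // Illustrative only: a simple normalised Levenshtein similarity.
    public final class NameSimilarity {

        // Returns a score in [0, 1]; 1.0 means the two strings are identical.
        public static double similarity(String a, String b) {
            if (a.isEmpty() && b.isEmpty()) {
                return 1.0;
            }
            int distance = levenshtein(a.toLowerCase(), b.toLowerCase());
            return 1.0 - (double) distance / Math.max(a.length(), b.length());
        }

        private static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) {
                prev[j] = j;
            }
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }
    }

A compare block could then accept a match whenever such a score exceeds a chosen threshold, rather than requiring exact name equality.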

5.2 Model Migration

MDE offers a number of benefits (see, e.g., [21]), but of course introduces new challenges and difficulties. One challenge that has recently been investigated is that of model migration. Effectively, this is a problem of managing change in the development lifecycle. Change, as we all know (e.g., from the agile methods literature), is problematic in software engineering: changing requirements can have ripple effects downstream in the development process, and in the worst case may involve complete redesign, re-validation and re-certification. The situation in MDE is no better; in fact, it is arguably worse, because there are more artefacts that can change (models, metamodels, operations that depend on both, editors, etc.), and these artefacts are often tightly coupled. As a result, there has been much research into novel techniques for managing change in MDE.

A specific change problem is that of model migration. Metamodels (languages) can change, particularly when domain-specific languages are used and are in the early stages of development. (General-purpose languages like UML can also change; such change may occur more slowly, and perhaps more predictably, but the impact on tools and development processes may be more significant as a result of its wider use.) When a metamodel changes, everything that depends on it needs to be updated, including models. Thus, model migration is the process of updating models to conform to changes to a metamodel.

Model migration is effectively a transformation problem, from one version of a metamodel to another. However, it is a very specific instance of a transformation problem, with its own patterns and logic. A detailed analysis of the patterns arising in model migration was presented in [19]; one key element of model migration transformations is that, invariably, many rules need to be written simply to copy model elements that do not need to change at all (because the corresponding parts of the metamodel from which these elements are instantiated have not changed). Writing such copying rules by hand is error-prone (not to mention incredibly boring).

Epsilon offers a task-specific language for model migration: Flock. Flock is a transformation language tailored to model migration, and supports a novel copying algorithm that eliminates the need to write excessive copying rules. Flock is rule-based, like ETL, and supports two types of rules: migrate rules (which migrate elements to a new metamodel) and delete rules (which remove elements that are no longer needed because their corresponding metamodel elements have disappeared).

When a Flock program is executed on a model, the following three steps take place:

1. An analysis of the Flock program determines which model elements must be deleted, which must be retyped, and which can be copied directly to the migrated model.

2. The original model elements are copied to the migrated model using the conservative copy algorithm [19]. Essentially, conservative copy ensures that all necessary data is copied over from the original model; any data that, due to the changes to the metamodel, is no longer relevant to the model is not copied automatically.

3. Finally, the user-defined migration logic is executed to update the migrated model.


To facilitate this last step, Flock provides two variables, original and migrated, to be used in the body of any migration rule. Further, like ETL, it defines an equivalent() operation, which can be called on any original model element and returns the equivalent migrated model element.

An example of a Flock program is in Listing 1.7. This is a traditional example: migrating a Petri net model where the metamodel has changed. The original metamodel had elements representing a Net, a Place and a Transition. The new metamodel introduces an Arc element; there are two kinds of Arcs: PTArcs (from a Place to a Transition) and TPArcs (from a Transition to a Place). The migration logic is straightforward: each Place and Net in the original model is migrated to the new model, as the metamodel components for these elements have not changed. The original model's concept of a Transition is migrated to two Arcs: a PTArc and a TPArc. Once again, we use the Epsilon built-in operation equivalent().

Listing 1.7. Flock example

migrate Transition {
    for (source in original.src) {
        var arc = new Migrated!PTArc;
        arc.src = source.equivalent();
        arc.dst = migrated;
        arc.net = migrated.net;
    }

    for (destination in original.dst) {
        var arc = new Migrated!TPArc;
        arc.src = migrated;
        arc.dst = destination.equivalent();
        arc.net = migrated.net;
    }
}

A comparison of Epsilon Flock with other model migration solutions, particularly the COPE tool [4] and AML [2], can be found in [18].

5.3 Using Multiple Epsilon Languages Together

Some applications of Epsilon (and model management) require application of just one Epsilon language. For example, we have provided support to companies requiring document generation, via application of EGL templates to bespoke domain-specific languages. More complicated applications of Epsilon have required use of two or more languages.

A good example is in [17], which presented an automated safety analysis technique called FPTC. FPTC is a technique that is applied to architectural models of systems: models of components and connectors. Components may be hardware devices or software; connectors may be hardware-based or protocols.


These models are annotated with safety-specific information, focusing on the failure behaviour of the individual components and connectors in the system. For example, we may know (through our domain expertise) that a particular type of hardware sensor delivers its data late 0.05% of the time; that is, it exhibits late failure behaviour. After annotating components and connectors with failure information, FPTC, the safety analysis technique, can be applied to the model. What FPTC produces automatically is the whole-system failure behaviour, i.e., the visible effects of component or connector behaviour on system outputs. In this way, FPTC can be used to understand how a system will respond to a particular type of internal failure, and, moreover, how it would respond to replacing a component or a connector with a different one. In this manner, FPTC supports plug-and-play safety analysis.

FPTC is implemented as a workflow that chains together different Epsilon operations. The details of how FPTC has been implemented can be found in [17]; here, we instead say a few words about the workflow mechanisms of Epsilon. Model management workflows are implemented using Apache Ant tasks. Each Epsilon language has a corresponding Ant task, which can be used to identify the models to which an Epsilon program is applied, identify results, and expose internal model management information (like trace-links). Listing 1.8 contains a sample workflow that chains together different Epsilon operations.

Listing 1.8. Workflow example.

<project default="main">
  <target name="main">

    <epsilon.emf.loadModel
      name="MyUMLModel"
      modelfile="sample.uml"
      metamodeluri="http://www.eclipse.org/uml2/2.1.0/UML"
      read="true"
      store="false"/>

    <epsilon.eol>
      MyUMLModel!Class.all.size().println();
      <model ref="MyUMLModel"/>
    </epsilon.eol>

    <epsilon.evl src="ValidateInhouseStyle.evl">
      <model ref="MyUMLModel"/>
    </epsilon.evl>

  </target>
</project>

6 http://ant.apache.org/


The first Ant task loads an EMF model (which happens to be a UML 2.1.0 model). The second task executes an EOL program on said model, which prints the number of classes in the model. The final task executes a stored EVL program that runs a set of constraints on the UML model, checking whether or not the model satisfies a set of stylistic rules.

A number of Ant tasks have been developed; they also provide powerful support for accessing traceability information and for transactions, which are necessary for large-scale and reliable tasks.

5.4 Architecture of Epsilon

Fig. 3, presented earlier, shows the conceptual architecture of Epsilon. We have talked about some of the Epsilon languages. All Epsilon languages depend on EOL, and use it for expressing operation behaviour. For example, ETL uses EOL to express the logical behaviour of transformation rules; EUnit, a testing language that we have not discussed in this tutorial, uses EOL to express unit tests on model management operations. Though all languages in Epsilon depend directly on EOL (and, indeed, the execution tools for Epsilon languages all reuse those for EOL), the ways in which the Epsilon languages reuse EOL vary considerably. We have identified several different language reuse mechanisms:

– extension and specialisation, i.e., where the language and tools of one language are inherited and thereafter reused by a second language;
– annotation (a special form of extension), where a new language is formed by addition of lightweight annotations to an existing language;
– preprocessing, i.e., where a new language is implemented as a preprocessor, generating output in the form of another Epsilon language.

In Epsilon, only one language, EGL, has been implemented as a preprocessor to EOL. Annotations have been used to implement EUnit (the testing language). All other languages, including Flock, have been implemented using extension and specialisation of EOL. The advantages and disadvantages of each approach to reuse are summarised in [16].

Another important element of the conceptual and technical infrastructure of Epsilon is that its languages are, to a first approximation, technology-agnostic: that is, the model management operations that are written in the Epsilon languages are independent of the technology used to represent and store models. Thus, in most cases, an operation written in any Epsilon language can be used to manipulate models stored in Ecore, MDR, XML, and any technology for which a driver can be supplied. Drivers are encapsulated in the 'middleware' layer of Epsilon, the Epsilon Model Connectivity (EMC) layer. EMC sits between EOL and the modelling technologies themselves, and provides a level of indirection between EOL operations and instructions, and those required to access, create, update and delete elements from the model itself.

If Epsilon does not yet support a modelling technology (current drivers include EMF/Ecore, MDR, CZT, and plain XML), the platform can be extended to support new types of models without having to significantly change existing Epsilon programs. Extension requires the implementation of a new EMC driver. Each driver provides an implementation of the interface IModel, as well as extensions to the EMC Eclipse extension points so that there is proper integration with the user interface. This interface requires methods for querying, type interrogation, loading/storing, disposal and traversal to be implemented for the new kind of model. Implementing this interface is generally straightforward, but does require some familiarity with Eclipse and, more generally, with reflection in Java. Full details can be found in the Epsilon book [15] (Chapter 2).
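
As a rough, hypothetical sketch of the shape of such a driver (the method names below are illustrative stand-ins; the real IModel interface in Epsilon has its own, different signatures, documented in [15]):

    import java.util.Collection;

    // Hypothetical stand-in for an EMC-style driver interface; not the real IModel.
    public interface SimpleModelDriver {
        void load();                                       // load the model from its store
        void store();                                      // persist any changes
        void dispose();                                    // release resources when done
        Collection<Object> allContents();                  // traversal: every element in the model
        Collection<Object> getAllOfType(String typeName);  // querying by type
        String typeNameOf(Object element);                 // type interrogation
        Object createInstance(String typeName);            // create a new element
        void deleteElement(Object element);                // delete an existing element
    }

Implementing operations along these lines for a new storage technology is what makes every Epsilon language immediately usable against models held in that technology.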

5.5 Other Concepts

We have not touched on all of Epsilon's concepts, nor all of its applications. In particular, we have not discussed model merging (via EML) [8] or update-in-place transformations (via EWL) [10], nor testing of model management operations via EUnit [3]. Nor have we delved into additional advanced features of EOL, including interactive model management, dynamic properties, Epsilon's native interface with Java, and more. The references give numerous pointers to these and other applications and advanced topics.

6 Outlook

Epsilon is a dynamic project: we have many users (both researchers and practitioners), an active team of developers and internal users, and a lively community. Development of Epsilon is proceeding in a number of ways: on new applications, technical improvements, and standardisation. Epsilon is moving to the Eclipse Modelling Project (EMP) and out of the research incubator sub-project. Effort is also underway on providing more support for very large models (and, correspondingly, model management operations on very large models). We are also applying Epsilon in novel domains, including support for validation of railway interlocking models, support for Through Life Capability Management and decision support, and also combining model management with optimisation techniques for interactive systems.

An obvious concern that end-users may have with Epsilon is that, to support complex model management tasks, a number of different languages may need to be learned and applied, which may both increase the learning curve and dissuade some potential users. In practice, we have not found this to be the case. First, the Epsilon languages are closely related (both syntactically and semantically), and after learning the first language (which is typically EOL), successive languages are easier to adopt. Second, each language is cohesive and has a clear scope and domain of applicability; as such, we argue that it is easier to understand when to apply an Epsilon language than more general-purpose model management frameworks. Finally, we believe that the architecture of Epsilon reflects the essential complexity of model management. Whether an engineer uses one general-purpose language or several task-specific languages is not really the point; the point is that model management involves diverse problems with diverse requirements, and engineers should be able to choose the most appropriate approach to solve each problem. We would argue that the Epsilon approach provides richer, more task-specific and more modular approaches to solving such problems.

References

1. Czarnecki, K., Helsen, S.: Feature-based survey of model transformation approaches. IBM Systems Journal 45(3), 621–646 (2006)

2. Garcés, K., Jouault, F., Cointe, P., Bézivin, J.: Managing Model Adaptation by Precise Detection of Metamodel Changes. In: Paige, R.F., Hartman, A., Rensink, A. (eds.) ECMDA-FA 2009. LNCS, vol. 5562, pp. 34–49. Springer, Heidelberg (2009)

3. García-Domínguez, A., Kolovos, D.S., Rose, L.M., Paige, R.F., Medina-Bulo, I.: EUnit: A Unit Testing Framework for Model Management Tasks. In: Whittle, J., Clark, T., Kühne, T. (eds.) MODELS 2011. LNCS, vol. 6981, pp. 395–409. Springer, Heidelberg (2011)

4. Herrmannsdoerfer, M., Benz, S., Juergens, E.: COPE - Automating Coupled Evolution of Metamodels and Models. In: Drossopoulou, S. (ed.) ECOOP 2009. LNCS, vol. 5653, pp. 52–76. Springer, Heidelberg (2009)

5. Jouault, F., Allilaire, F., Bézivin, J., Kurtev, I., Valduriez, P.: ATL: a QVT-like transformation language. In: OOPSLA Companion, pp. 719–720 (2006)

6. Kolovos, D.S.: Extensible Platform for Specification of Integrated Languages for mOdel maNagement Project Website (2007), http://www.eclipse.org/gmt/epsilon

7. Kolovos, D.S.: Establishing Correspondences between Models with the Epsilon Comparison Language. In: Paige, R.F., Hartman, A., Rensink, A. (eds.) ECMDA-FA 2009. LNCS, vol. 5562, pp. 146–157. Springer, Heidelberg (2009)

8. Kolovos, D.S., Paige, R.F., Polack, F.A.C.: Merging Models with the Epsilon Merging Language (EML). In: Wang, J., Whittle, J., Harel, D., Reggio, G. (eds.) MoDELS 2006. LNCS, vol. 4199, pp. 215–229. Springer, Heidelberg (2006)

9. Kolovos, D.S., Paige, R.F., Polack, F.A.C.: The Epsilon Object Language (EOL). In: Rensink, A., Warmer, J. (eds.) ECMDA-FA 2006. LNCS, vol. 4066, pp. 128–142. Springer, Heidelberg (2006)

10. Kolovos, D.S., Paige, R.F., Polack, F., Rose, L.M.: Update transformations in the small with the Epsilon Wizard Language. Journal of Object Technology 6(9), 53–69 (2007)

11. Kolovos, D.S., Paige, R.F., Polack, F.A.C.: On the Evolution of OCL for Capturing Structural Constraints in Modelling Languages. In: Abrial, J.-R., Glässer, U. (eds.) Rigorous Methods for Software Construction and Analysis. LNCS, vol. 5115, pp. 204–218. Springer, Heidelberg (2009)

12. Kolovos, D.S., Paige, R.F., Polack, F.A.C.: Model comparison: a foundation for model composition and model transformation testing. In: Proc. GaMMa, pp. 13–20. ACM (2006)

13. Kolovos, D.S., Paige, R.F., Polack, F.A.C.: The Epsilon Transformation Language. In: Vallecillo, A., Gray, J., Pierantonio, A. (eds.) ICMT 2008. LNCS, vol. 5063, pp. 46–60. Springer, Heidelberg (2008)

14. Kolovos, D.S., Rose, L.M., Abid, S.B., Paige, R.F., Polack, F.A.C., Botterweck, G.: Taming EMF and GMF Using Model Transformation. In: Petriu, D.C., Rouquette, N., Haugen, Ø. (eds.) MODELS 2010, Part I. LNCS, vol. 6394, pp. 211–225. Springer, Heidelberg (2010)

15. Kolovos, D.S., Rose, L.M., Paige, R.F.: The Epsilon Book. University of York (2011)

16. Paige, R.F., Kolovos, D.S., Rose, L.M., Drivalos, N., Polack, F.A.C.: The design of a conceptual framework and technical infrastructure for model management language engineering. In: ICECCS, pp. 162–171 (2009)

17. Paige, R.F., Rose, L.M., Ge, X., Kolovos, D.S., Brooke, P.J.: FPTC: Automated Safety Analysis for Domain-Specific Languages. In: Chaudron, M.R.V. (ed.) MODELS 2008. LNCS, vol. 5421, pp. 229–242. Springer, Heidelberg (2009)

18. Rose, L.M., Herrmannsdoerfer, M., Williams, J.R., Kolovos, D.S., Garcés, K., Paige, R.F., Polack, F.A.C.: A Comparison of Model Migration Tools. In: Petriu, D.C., Rouquette, N., Haugen, Ø. (eds.) MODELS 2010, Part I. LNCS, vol. 6394, pp. 61–75. Springer, Heidelberg (2010)

19. Rose, L.M., Kolovos, D.S., Paige, R.F., Polack, F.A.C.: Model Migration with Epsilon Flock. In: Tratt, L., Gogolla, M. (eds.) ICMT 2010. LNCS, vol. 6142, pp. 184–198. Springer, Heidelberg (2010)

20. Rose, L.M., Paige, R.F., Kolovos, D.S., Polack, F.A.C.: The Epsilon Generation Language. In: Schieferdecker, I., Hartman, A. (eds.) ECMDA-FA 2008. LNCS, vol. 5095, pp. 1–16. Springer, Heidelberg (2008)

21. Schmidt, D.C.: Guest editor's introduction: Model-driven Engineering. IEEE Computer 39(2), 25–31 (2006)


Bidirectional by Necessity: Data Persistence and Adaptability for Evolving Application Development

James F. Terwilliger

Microsoft Corporation

Abstract. Database-backed applications are ubiquitous. They have common requirements for data access, including a bidirectional requirement that the application and database must have schemas and instances that are synchronized with respect to the mapping between them. That synchronization must hold under both data updates (when an application is used) and schema evolution (when an application is versioned). The application developer treats the collection of structures and constraints on application data (collectively called a virtual database) as indistinguishable from a persistent database. To have such indistinguishability, that virtual database must be mapped to a persistent database by some means. Most application developers resort to constructing such a mapping from custom-built middleware because available solutions are unable to embody all of the necessary capabilities. This paper returns to first principles of database application development and virtual databases. It introduces a tool called a channel, comprised of incremental atomic transformations with known and provable bidirectional properties, that supports the implementation of virtual databases. It uses channels to illustrate how to provide a singular mapping solution that meets all of the outlined requirements for an example application.

1 Introduction

The persistent data application is a staple of software development. A client application written in some programming language (usually object-oriented in contemporary systems) presents data to a user, who may update that data as well. That data is stored in a relational database, whose schema may be designed entirely independently from the application. This paradigm is ubiquitous at this point, and as such, ample software tooling support has been developed to support it. There are graphical designers for the user interface, model builders to design workflow or class diagrams, integrated development environments for the design of the code, and data persistence tools to handle the retrieval and updating of data between the application and the database.

An application typically has some local understanding of its data that conforms to a schema. That schema may be explicit in the form of a local data cache, or it may be implicitly present in the form of a programming interface. That schema likely also has constraints over its data in terms of valid data states, as well as some form of referential integrity between entities. In short, an application's local data schema has many of the same conceptual trappings as the schema of a relational database. Thus, one can consider the local schema to be a virtual database.

The structure and semantics of that virtual database may differ significantly from the actual schema of the data where it is stored. Despite these differences, the application developer has certain expectations of data as it moves to and from persistence. For instance, when the application constructs data, it assumes that it will be able to retrieve it again, and in the same form as it was created. In short, the designer of an application expects to be able to treat a virtual database as if it were indistinguishable from a real database with persistence. Tools such as query-defined views, object-relational mappers (e.g., [28,31,34]), extract-transform-load scripts (e.g., [42]), and research prototypes (e.g., [8,9]) fulfill some but not all of the requirements of a virtual database. For instance, query-defined views provide a simple way to transform data in a database to match an application's schema, but only provide limited support for update and no support for relationships expressed between views. As a result, the typical application developer combines multiple data access components bound together with additional program code, with no assurances of correctness and an unknown amount of maintenance cost. Most data access frameworks have minimal support for the evolution of an application's schema over multiple versions, resulting in additional manual maintenance and testing cost.

This paper views the problem of persistent data in applications and virtual databases from first principles: it first looks at the requirements of application development and execution, then develops one possible tool, called a channel, capable of fulfilling those requirements.

1.1 Scenario Requirements

The development and usage lifecycles of the database-backed application induce a number of requirements that any data access solution (or whatever collection of data access solutions and manual processes an application employs) must fulfill. The requirements come about from how an application is designed initially, how it is used in a production environment, and how it is maintained over time across versions.

One-Way Roundtripping. When the user inputs data into the application, or sees a complete entity in the view of the application, the user expects that data to be unchanged when persisted to the database and retrieved again. The same property can also hold in the opposite direction, where data in the database must be unchanged by a trip to and from the application, but there are a variety of reasons why that property need not be respected. A prime example of why database-centered roundtripping need not be respected is the situation where access to the database is recorded for security purposes. In this case, a set of operations that has the aggregate effect of leaving the application model unchanged will in fact change the database state.

Object-Relational Mapping. A data persistence tool must provide a solution to the impedance mismatch. Object-centered constructs like class hierarchies and collection-valued properties must have a mapping to relational storage.

Relational-Relational Mapping. The application and storage schemas may have been designed independently and thus have arbitrarily different structures. Once the impedance mismatch has been overcome, the persistence software must be able to accommodate the remaining differences between relational schemas. For instance, an application's database tables may be represented as key-attribute-value triples because the number of attributes per table would otherwise be too large.

Business Logic. The relationship between schemas may exist to simply restructure data into a different form without altering information capacity. However, it may also include business-specific rules. For instance, an application may require that no data ever be deleted from the database but rather be "deprecated" to maintain audit trails.

So-called CRUD Operations. Applications require the ability to Create a new entity, Retrieve an individual entity based on key value, Update an entity's properties, and Delete a given entity.

Bonus: Arbitrary Query and Set-Based Update. Many applications, though not all, require the ability to perform arbitrary queries against their application schema. Other applications may require the ability to update or delete entities based on arbitrary conditions instead of key values. These features are sometimes not required by the user interface of the application, but rather by some internal processing or workflow within the application.

Bonus: Evolution of the Application Schema. Different versions of an application will likely have different models of data, to varying degrees. As that data schema evolves, the schema and instances of the persistent data store, as well as the mapping between the two schemas, must evolve with it. Most data access frameworks do not account for such evolution automatically and must be edited manually, but such evolution is the natural byproduct of application versioning.
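
Taken together, the CRUD and query requirements describe the data-access surface the application expects from its virtual database. A minimal, hypothetical sketch of that surface (the type and method names here are placeholders, not an API from the paper):

    import java.util.List;
    import java.util.Optional;
    import java.util.function.Predicate;
    import java.util.function.UnaryOperator;

    // Placeholder generic interface: Entity is an application type, Key its key type.
    public interface VirtualTable<Entity, Key> {
        void create(Entity e);                             // Create
        Optional<Entity> retrieve(Key key);                // Retrieve by key value
        void update(Entity e);                             // Update an entity's properties
        void delete(Key key);                              // Delete by key value
        List<Entity> query(Predicate<Entity> condition);   // bonus: arbitrary query
        int updateWhere(Predicate<Entity> condition,
                        UnaryOperator<Entity> change);     // bonus: set-based update
    }

The remainder of the paper argues that what sits behind such a surface should be a declaratively specified mapping rather than hand-written middleware.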

1.2 The Status Quo

Database virtualization mechanisms present a perspective on persistent data that is different from the actual physical structures, to match the model an application presents to a user. Virtualization can mask certain data for security, simplify structure for simpler querying, allow existing programs to operate over a revised physical structure, and so forth. Various virtualization mechanisms for databases have been proposed over the decades, and are in use in production environments; the most well-known is relational views, expressed as named relational queries.

Query-defined views, e.g., as specified in SQL, are highly expressive (especially for read-only views) and offer an elegant implementation that can leverage the optimizer in the DBMS. But query-defined views fall short of the above requirements for several reasons. First, while the view update problem has been well studied, there is no support for expressing schema modifications (Data Definition Language statements, or DDL), including key and foreign key constraints, against a view. If an application's demands on its view schema change, the developer has no recourse but to manually edit the physical schema and mapping.

Second, even if DBMSs in common use supported the full view update capability described in the research literature (e.g., [4,10,16,22]), database applications would still require more. The relationship between an application's view and physical schemas may require discriminated union, value transformation according to functions or lookup tables, or translation between data and metadata, as in the case where persistence is in the form of key-attribute-value triples. Business logic like the "data deprecating" example above is not handled either. None of these transformations are supported by updatable views as currently implemented by database systems; the final two are not considered by the research literature; and deprecation is not expressible in SQL without stored procedures, triggers, or advanced features like temporal capabilities (e.g., [24]).

Because there are not yet tools that support true virtual databases, applications requiring anything more than a trivial mapping to storage often have custom-crafted solutions built from SQL, triggers, and program code. This approach is maximally expressive, using a general-purpose language, but presents an interface of one or more read/update routines with pre-defined queries, far from being indistinguishable from a real database. A programming language-based approach is not well suited for declarative specification, analysis, simplification, and optimization of the virtualization mapping. Thus, there is essentially no opportunity to formally reason about the properties of a database virtualization expressed in middleware, in particular, to prove that information is preserved through the virtualization.

1.3 Related Tools with Different Requirements

There are other applications for relational-to-relational mappings that are not directly related to application development and have different requirements. For instance, a federated database system may use relational mappings to connect constituent data sources to a single integrated system (e.g., [26]). In such a system, if you consider the collective set of data sources as a single integrated source, the mapping still needs to support query operations, but not necessarily update operations. Such mappings need not handle business logic. Some federated languages like Both-As-View (BAV) can also support schema evolution of the federated schema in a semi-automated fashion.

Another related technology is data exchange, where a mapping is used to describe how to translate data from one schema to another. In this scenario, the mapping may be lossy in both directions, since the schemas on either end of the mapping need not have been developed with any common requirements [2]. In this scenario, most of the assumptions about requirements from the first section are inapplicable. For instance, queries against the target schema may be incomplete due to incomplete mapping information. Instance updates are not a requirement, though effort has been made to investigate scenarios where mappings may be invertible with varying degrees of loss. Schema evolution of either schema has been considered, but evolutions are allowed to be lossy, as they do not propagate to the partner schema.

1.4 Overview

The narrative thread of this paper follows a simple theme: devising a complete solution to a problem. Section 2 will introduce an example of an application model and a database, both derived from a real-world application. The majority of the paper addresses the technical challenges posed by that application. Finally, Section 8 describes how to implement the application using the created tools.


Section 3 explores further the notion of a virtual database and one possible tool to implement database virtualization called a channel. Section 4 continues the discussion of channels by defining channel transformations that operate over relational data only, and gives fleshed-out examples of two such transformations. Section 5 introduces the notion of a channel transformation whose behavior may be non-deterministic. Section 6 defines a channel transformation that maps object-oriented, hierarchical data to relational data. Finally, Section 9 gives some insights on further reading on channels, related work, and future directions for investigation.

Some of the work presented in this paper has been drawn from two conference papers: one specifically on relational mappings [41], and one on object-relational mappings [37]. The material in Section 5 has not yet appeared in any publication (outside of dissertation work [36]). The relational work was done as part of a larger context of graphical application development, where the schema for an application is derived from its user interface so that alternative artifacts such as query interfaces may be automatically generated from the user interface as well. Additional work has been published on the general framework for user interface development [39] as well as on giving the user the ability to create permanent, redistributable anchors to data as displayed in a user interface [40]. The ability to use an application interface and schema as a first-class entity is enabled by database virtualization, and channels are one way to effect that property.

2 Example Scenario

To motivate the discussion, consider an application whose object-oriented model of data is pictured in Figure 1. This sample application is a reduction from a real-world application written in the field of clinical endoscopy for the purpose of maintaining an electronic medical record. The primary data to be tracked in the application is clinical procedure data. In the full application, procedure data is tracked from pre-procedure through post-procedure, including the state of the patient prior to the procedure, pain management regimens, the length of the procedure, the treatments and therapies applied, and post-operative instructions, care, and outcomes. In Figure 1, a simplified form of the data to be tracked can be found in the Procedure class hierarchy. Note that in the real application, the number of attributes per procedure is in the range of several hundred.

There are also relationships established between the procedures and the people involved, both the patients and the clinical staff performing the procedure or doing pre- or post-procedure assessments. The people involved are represented in the Person hierarchy, and the relationships between procedures and people are represented by the foreign keys in the diagram, shown as dotted lines. For instance, since there is exactly one patient associated with a procedure, but a given patient may undergo many procedures, the relationship between patients and procedures is a one-to-many relationship. In the figure, "Navigation Properties" are merely shortcuts that allow object-oriented languages a means to traverse relationships; for instance, to find the patient associated with a procedure, one need only access the Patient property of a procedure.

Fig. 1. The object-oriented schema of the example application

Figure 2 shows the schema for the relational storage for the application schema in Figure 1. The relational schema in the figure is also a simplification of the actual relational schema used in the clinical application. The class hierarchy Person in Figure 1 maps to the two tables Patient and Staff in Figure 2 (this mapping style is often called "table per concrete class"). The mapping between the Procedure hierarchy and its tables is much less straightforward, but encompasses the tables Procedure, TextData, and NumericData.

Fig. 2. The relational schema of the example application, serving as the persistent storage for the schema in Figure 1

To illustrate how the mapping works for procedure data, consider an example instance as shown in Figure 3. The patient object and the staff member object in that example (Figure 3(a)) map cleanly to individual rows in relations (Figure 3(b)). The procedure, however, is split across three tables: one to hold the "basic" information about the procedure, including the kind of the procedure (also known as a discriminator column), one to hold the remaining text-valued attributes of the procedure, and one to hold the remaining number-valued procedure attributes. The text and number value tables have been "pivoted"; each row in the table corresponds to an attribute in the original procedure.

Also, for each text or number attribute, three additional columns have been added: a start time (VTStart, or Valid Time Start), an end time (VTEnd, or Valid Time End), and an indicator of which user last updated that row. These three columns are the result of business logic in the application. Every time one of these attributes is edited in the application, the old value of the attribute is still kept around, but the "end time" is set, in essence deprecating the value. Then, the new updated value is added as a new row. In Figure 3(b), one can see that user "travis" has updated two of the values in the original procedure to become the values as seen in the current state of the objects in Figure 3(a).
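
For intuition only, the following is a minimal JDBC sketch of this deprecate-then-insert rule for one text-valued attribute. It is not the application's actual code (the real system performs this logic in stored procedures, as noted below), and the table and column names follow the figures and should be read as illustrative:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public final class AttributeUpdater {
        // Deprecate the current value of a text attribute, then record the new one.
        public static void updateTextAttribute(Connection con, int procedureId,
                String property, String newValue, String user) throws SQLException {
            // Close out the currently valid row for this attribute ...
            try (PreparedStatement close = con.prepareStatement(
                    "UPDATE TextData SET VTEnd = CURRENT_TIMESTAMP "
                    + "WHERE ProcedureID = ? AND Property = ? AND VTEnd IS NULL")) {
                close.setInt(1, procedureId);
                close.setString(2, property);
                close.executeUpdate();
            }
            // ... then add the new value as a fresh, open-ended row.
            try (PreparedStatement insert = con.prepareStatement(
                    "INSERT INTO TextData (ProcedureID, Property, TextValue, VTStart, LastModifiedBy) "
                    + "VALUES (?, ?, ?, CURRENT_TIMESTAMP, ?)")) {
                insert.setInt(1, procedureId);
                insert.setString(2, property);
                insert.setString(3, newValue);
                insert.setString(4, user);
                insert.executeUpdate();
            }
        }
    }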

In the original application upon which this example is based, the mapping between the two schemas comprises several different technologies:

– Stored procedures to handle business logic for updating rows in the manner described in Figure 3, and to return only non-deprecated data
– In-memory programming logic to handle the re-assembly of procedure data into objects, and also to break a procedure apart into the various tables when writing to the database

– Manual editing effort when evolving the application to add new properties to an existing procedure kind, or to add another procedure kind altogether
– Extract-Transform-Load (ETL) scripts for translating the data into a more human-readable form for reporting purposes (i.e., arbitrary queries)

Relative to the requirements laid out in Section 1.1, notice that this example has the following characteristics:

– One-Way Roundtripping: The user of the application will have the expectation that new patients, staff members, or procedures can be added in the tool, and subsequent retrievals will pull the correct data. The same can be said about updated patients, etc. The same is not true in reverse; someone could technically add a new procedure to the database manually, but if the user pulls that procedure into the application, changes an attribute, and then changes it back, the database may not end up in the same state, as the values in the "LastModifiedBy" fields may be different.
– Object-Relational Mapping: The procedure class hierarchy must be mapped down to tables in some manner.
– Relational-Relational Mapping: The way that the procedure hierarchy is mapped to tables is far more complex than is necessary. In particular, the data is partitioned and pivoted.
– Business Logic: The "VTStart", "VTEnd", and "LastModifiedBy" columns contain data not present in the application model, and are populated based on business rules about data retention.
– CRUD Operations: The application operating over this schema must be able to create, retrieve, update, and delete individual patients, staff, and procedures.


Fig. 3. Examples of instances in the example application, both in its object representation (a) and its relational storage (b)

– Queries and Set-Based Updates: The application includes the capability to search for patients, staff, or procedures based on complex search criteria built from a graphical interface. The application can also edit existing staff information based on employment data changes.
– Application Schema Evolution: The schema for clinical data changes over time, potentially frequently, as more or different data is collected during procedures. These changes directly impact the application schema, and when they happen, the database must adapt to compensate.

Section 8 revisits this example, demonstrating how to accomplish the same mapping but with additional capabilities and none of the tools in the list above.


3 An Introduction to Virtual Databases and Channels

This paper demonstrates how to support virtual databases that are indistinguishable from a "real" database in the same way that a virtual machine is indistinguishable from a hardware machine. This capability requires that the user (e.g., an application developer) be able to issue queries, DML operations (insert, update, and delete), as well as DDL operations (to define and modify both schema and constraints) against the application's virtual schema. The candidate tool described in this paper for supporting virtual databases is called a channel. One constructs a channel by composing atomic schema transformations called channel transformations (CT's), each of which is capable of transforming arbitrary queries, data manipulation statements, schema evolution primitives, and referential integrity constraints addressing the input schema (called the virtual schema) into equivalent constructs against its output schema (called the physical schema). Our approach is similar to Relational Lenses [8] in that one constructs a mapping out of atomic transformations. Lenses use a state-based approach that resolves an updated instance of a view schema with a physical schema instance, whereas a channel translates query, DML, and DDL update statements directly.

This section defines an initial set of CT's that cover a large number of database restructuring operations seen in practice. Later sections show how CT's can be formally defined by describing how they transform the full range of query, DML, and DDL statements. The framework includes a definition of correctness criteria for CT's that guarantees indistinguishability. All CT's must support one-way invertibility, where operations issued against the input database, after being propagated to the output database, have the same effect (as observed from all operations issued against the input database) as if the operations had been issued against a materialized instance of the input database.

3.1 Channels and Channel Transformations

A channel transformation (CT) is a uni-directional mapping from an input (virtual) schema S to a physical schema S̄ that encapsulates an instance transformation. A CT represents an atomic unit of transformation that is known to be updatable. A channel is built by composing CT's. A channel is defined by starting with the virtual schema and applying transformations one at a time until the desired physical schema is achieved, which explains the naming conventions of the transformations. Figure 4 shows a graphical representation of how applications interact with a virtual schema connected to a physical one through a channel.

Formally, a CT is a 4-tuple of functions (S, I, Q, U), each of which translates statements expressed against the CT's virtual schema into statements against its physical schema. Let S be the set of possible relational schemas and D be the set of possible database instances. Let Q be the set of possible relational algebra queries. Let U be the set of possible database update statements, both data (DML) and schema (DDL), as listed in Table 1. Let [U] be the set of finite lists of update statements, i.e., an element of [U] is a transaction of updates. Finally, let ε represent an error state.

– Function S is a schema transformation S : S → S ∪ {ε}. A channel transformation may have prerequisites on the virtual schema s, where S(s) = ε if those prerequisites are not met. Function S must be injective (1-to-1) whenever S(s) ≠ ε.


Fig. 4. A channel connecting a virtual schema (which fields all operations from application services) to a concrete, physical one

– Function I is an instance transformation I : S × D → D, defined on pairs of input (s, d) where S(s) ≠ ε and instance d conforms to schema s. Function I must be injective on its second argument, and output a valid instance of S(s).
– Function Q is a query transformation Q : S × Q → Q, defined on pairs of input (s, q) where S(s) ≠ ε and query q is valid over schema s, i.e., the query executed on an instance of the schema would not return errors. Function Q must be injective on its second argument, and output a valid query over S(s).
– Function U is an update transformation U : S × [U] → [U] ∪ {ε}, defined on pairs of input (s, u) where S(s) ≠ ε and update transaction u is valid over schema s, i.e., each update in the transaction references existing schema elements and, when executed on schema s, does not cause errors or schema conflicts (e.g., renaming a column of a table to a new name that conflicts with an existing column). Function U must be injective on its second argument when U(s, u) ≠ ε, and output a valid update transaction over S(s). Expression U(s, u) evaluates to the error state ε if u applied to s produces schema s′ where S(s′) = ε.

The function S (and function I) provides the semantics for a CT in terms of translating a virtual schema (and an instance of it) into a physical schema (and an instance of it). These functions are not used in any implementation, but allow one to reason about the correctness of functions Q and U. Neither query nor update functions require a database instance as input; a CT directly translates the statements themselves.
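
To make the shape of a CT concrete, here is a minimal sketch of the 4-tuple as a programming interface. The paper defines CT's purely mathematically; the type and method names below are hypothetical, and Optional.empty() stands in for the error state ε:

    import java.util.List;
    import java.util.Optional;

    // Placeholder types; a real implementation would model relational schemas,
    // instances, relational-algebra queries, and the update statements of Table 1.
    interface Schema {}
    interface Instance {}
    interface Query {}
    interface Update {}

    // A channel transformation as a 4-tuple of translators (S, I, Q, U).
    public interface ChannelTransformation {
        Optional<Schema> schema(Schema s);                          // S : S -> S ∪ {ε}
        Instance instance(Schema s, Instance d);                    // I : S × D -> D
        Query query(Schema s, Query q);                             // Q : S × Q -> Q
        Optional<List<Update>> update(Schema s, List<Update> txn);  // U : S × [U] -> [U] ∪ {ε}
    }

A channel is then just a composition of such transformations, applied in order from the virtual schema toward the physical one.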


The associated functions of a channel transformation must satisfy the following commutativity properties, where q(d) means executing query q over instance d, and u(s) means executing update transaction u on schema s:

– For a virtual schema s, a concrete instance d of s, and a query q, let q̄ = Q(s, q) (the translated query) and d̄ = I(s, d) (the translated instance). Then, q̄(d̄) = q(d). In other words, translating a query and then executing the result on the translated instance will produce the same result as running the query on an instance of the virtual schema (Figure 5(a)).
– For a virtual schema s and a valid update transaction u against s, let s̄ = S(s) (the translated schema) and ū = U(s, u) (the translated update). Then, ū(s̄) = S(u(s)). Running a translated update against a translated schema is equivalent to running the update first, then translating the result (Figure 5(b)).
– For a virtual schema s and a valid update transaction u against s, for each table t ∈ s, let qt be the query SELECT * FROM t. Let s̄ = S(s) (the translated schema) and ū = U(s, u) (the translated update). Finally, let q̄t,u = Q(u(s), qt), the result of translating query qt on schema s after it has been updated by u. Then, q̄t,u(ū(s̄)) ≡ qt(u(s)). Running a translated query against a translated schema that has been updated by a translated update is equivalent to running the query locally after a local update (Figure 5(c)).

Fig. 5. Three commutativity diagrams that must be satisfied for CT's that have defined instance-at-a-time semantics

The last commutativity property abuses notation slightly by allowing queries to run on a schema instead of a database instance, but the semantics of such an action are straightforward. If s is a schema and qt is the query SELECT * FROM t for t ∈ s, then qt(s) ≡ t, and more complicated queries build on that notion recursively. The notation allows us to reason about queries and updates without referring to database instances by treating a single update statement as interchangeable with the effect it has on an instance:

– If u = I(t, C, Q), then qt(u(s)) ≡ t ∪ Q (all of the rows that were in t plus the new rows Q added)
– If u = D(t, F), then qt(u(s)) ≡ σ¬F t (all of the rows that were in t that do not satisfy conditions F)
– If u = AC(t, C, D), then qt(u(s)) ≡ t × ρ1→C{null} (a new column has been added with all null values)
– If u = DE(t, C, E) for a key column C, then qt(u(s)) ≡ σC≠E t (delete rows that have the dropped element for column C)

In addition to the commutativity properties, function U must have the following properties:

– U(s, u) = ε ⇐⇒ S(u(s)) = ε. Function U returns an error if and only if applying the update transaction to the virtual schema results in the schema no longer meeting the transformation's schema preconditions.
– If U(s, u) ≠ ε and d is an arbitrary instance of schema s, U(s, u)(I(s, d)) = ε ⇐⇒ u(d) = ε. Applying a transaction to an instance returns an error in case of a primary or foreign key violation. This property ensures that a violation occurs on the output instance if and only if a violation would occur if a materialized instance of the virtual schema were updated. Note that such a violation occurs when the transaction is executed rather than when it is translated.

4 Transformations over Relational Schemas

This section and the next few sections continue the discussion of channels and channel transformations by considering channel transformations whose input and output schemas are both relational. A relational CT translates a relational schema into another relational schema, a relational algebra query into another relational algebra query, and relational update statements into relational update statements. A CT is named based on its effect on artifacts in the direction of its operation, even though CT's are applied from the application schema toward the physical schema. For instance, the "HMerge" CT describes a horizontal merging of tables from the virtual schema into a table in the physical schema. Examples of relational CT's include the following transformations, where all parameters with an overbar represent constructs in the CT's output and those with a vector notation (e.g., T⃗) are tuples:

– VPartition(T, f, T̄1, T̄2) distributes the columns of table T into two tables, T̄1 and T̄2. Key columns of T appear in both output tables, and a foreign key is established from T̄2 to T̄1. Non-key columns that satisfy predicate f are in T̄1, while the rest are in T̄2.
– VMerge(T1, T2, T̄) vertically merges into table T̄ two tables T1 and T2 that are related by a one-to-one foreign key.
– HPartition(T, C) horizontally partitions the table T based on the values in column C. The output tables are named using the domain elements of column C.
– HMerge(f, T̄, C̄) horizontally merges all tables whose schema satisfies predicate f into a new table T̄, adding a column C̄ that holds the name of the table from which each row came.
– Apply(T, C, C̄, f, g) applies an invertible function f with inverse g to each row in the table T. The function input is taken from columns C, and output is placed in columns C̄.
– Unpivot(T, A, V, T̄) transforms a table T from a standard one-column-per-attribute form into key-attribute-value triples, effectively moving column names into data values in a new column A (which is added to the key) with corresponding data values placed in column V. The resulting table is named T̄.
– Pivot(T, A, V, T̄) transforms a table T in generic key-attribute-value form into a form with one column per attribute. Column A must participate in the primary key of T and provides the names for the new columns in T̄, populated with data from column V. The resulting table is named T̄.

These informal definitions of CT's describe what each "does" to a fully materialized instance of a virtual schema, but a virtual schema is virtual and thus stateless. Thus, a CT maintains the operational relationship between input and output schemas by translating all operations expressed against the virtual schema into equivalent operations against the physical schema. Two of these CT's, Horizontal Merge and Pivot, will be used as running examples through the next few sections.

Fig. 6. Instances transformed by an HMerge CT (a) and a Pivot CT (b)

Example: HMerge. The HMerge transformation takes a collection of tables with identically named and union-compatible primary keys and produces their outer union, adding a discriminator column C̄ to give each tuple provenance information. Any table in the virtual schema that does not satisfy predicate f is left unaltered¹. Figure 6(a) shows an example of an HMerge CT.

Let the CT for HMerge( f , T ,C) be the 4-tuple HM = (SHM, IHM, QHM,UHM). Let T f

be the set of all tables in virtual schema s that satisfy predicate f and Cols(t) be theset of columns for table t. Define SHM on schema s as follows: replace tables T f withtable T with columns (

⋃t∈T f Cols(t)) ∪ {C}, the union of all columns from the source

1 The predicate parameter for HMerge is described only informally in this paper. One suchexample would be “Table has prefix P ”, which is the predicate in Figure 6(a).


Table 1. Relational DML and DDL statements supported by channel transformations

Statement | Formalism | Explanation of Variables
Insert | I(T,C,Q) | Insert rows into table T into columns C, using the values of C from the rows in Q. The value of Q may be a table constant or a query result.
Update | U(T,F,C,Q) | Update rows in table T that satisfy all equality conditions F specified on key columns. Non-key columns C hold the new values specified by query or constant Q. Query Q may refer to the pre-update row values as constants. Not all key columns need to have a condition.
Delete | D(T,F) | Delete rows from table T that satisfy all equality conditions F specified on key columns. Not all key columns need to have a condition.
Add Table | AT(T,C,D,K) | Add new table T, whose columns C have domains D, with key columns K ⊆ C.
Rename Table | RT(To,Tn) | Rename table To to be named Tn. Throw error if Tn already exists.
Drop Table | DT(T) | Drop the table named T.
Add Column | AC(T,C,D) | Add to table T a column named C with domain D.
Rename Column | RC(T,Co,Cn) | In table T, rename the column Co to be named Cn. Throw error if Cn already exists.
Change Column Facet | CP(T,C,F,V) | For the column C of the table T, change its domain facet F to have the value V. Common facets include whether a column is nullable, the column's maximum length if the column has a string-valued domain, or the precision and scale if the column is numeric-valued.
Drop Column | DC(T,C) | In table T, drop the non-key column C.
Add Element | AE(T,C,E) | In table T, in column C, add a new possible domain value E.
Rename Element | RE(T,C,Eo,En) | In table T, in column C, rename domain element Eo to be named En. Throw error if En conflicts with an existing element.
Drop Element | DE(T,C,E) | In table T, in column C, drop the element E from the domain of possible values.
Add Foreign Key | FK(F|T.X → G|T′.Y) | Add a foreign key constraint from columns T.X to columns T′.Y, so that for each tuple t ∈ T, if t satisfies conditions F and t[X] ≠ null, there must be a tuple t′ ∈ T′ such that t′ satisfies conditions G and t[X] = t′[Y].
Drop Foreign Key | DFK(F|T.X → G|T′.Y) | Drop the constraint imposed by the enclosed statement.
Add Constraint | Check(Q1 ⊆ Q2) | Add a check constraint so that the result of query Q1 must always be a subset of the results of query Q2. This constraint is also called a Tier 3 FK.
Drop Constraint | DCheck(Q1 ⊆ Q2) | Remove the check constraint between the results of queries Q1 and Q2.
Loop | Loop(t,Q,S) | For each tuple t returned by query Q, execute transaction S.
Error | Error(Q) | Execute query Q, and raise an error if any rows are returned.

Define IHM on schema s and instance d by replacing the instances of Tf in d with ⊎t∈Tf (t × {(name(t))}), where ⊎ is outer union with respect to column name (as opposed to column position) and name(t) represents the name of the table t as a string value.

Example: Pivot. Recall that a Pivot CT takes four arguments: T (the table to be pivoted), A (a column in the table holding the data that will be pivoted to form column names in the result), V (the column in the table holding the data to populate the pivoted columns), and T (the name of the resulting table). Let the channel transformation for Pivot(T, A, V, T) be the 4-tuple PV = (SPV, IPV, QPV, UPV). An example instance transformation appears in Figure 6(b).

Let SPV be defined on schema s by removing table T (which has key columns K and non-key columns N, where A ∈ K and V ∈ N), and replacing it with T with key columns (K − {A}) and non-key columns ((N − {V}) ∪ Dom(A)). Dom(A) represents the domain of possible values of column A (not the values present in any particular instance); therefore, the output of SPV(s) is based on the domain definition for A as it appears in the schema. The new columns for each element in Dom(A) have domain Dom(V). If A is not present or not a key column, or if Dom(A) has any value in common with an input column of T (which would cause a name conflict in the output), then SPV(s) = ε.

Let IPV be defined on schema s and instance d by replacing the instance of T in d with ↗Dom(A);A;V T, where ↗ is an extended relational algebra operator that performs a pivot, detailed in the next section. Formally, Dom(A) could be any finite domain; practically speaking, PV would only be applied where Dom(A) is some small, meaningful set of values such as the months of the year or a set of possible stock ticker names.

4.1 Translating Queries

Each CT receives queries, expressed in extended relational algebra addressing the CT's virtual schema, and produces queries expressed in extended relational algebra addressing its physical schema. The query language accepted by a channel includes the eight standard relational algebra operators (σ, π, ×, ⋈, ∪, ∩, −, and ÷),² the rename operator (ρ), table and row constants, plus:

– Left outer join (⟕) and left antisemijoin (▷) [6]
– Pivot (↗C;A;V): For a set of values C on which to pivot, pivot column A, and pivot-value column V (translating a relation from key-attribute-value triples into a normalized, column-per-attribute form)
– Unpivot (↙C;A;V), the inverse operation to pivot
– Function application (αI,O,f): Apply function f iteratively on all rows, using columns I as input and placing the result in output columns O

The pivot query operator is defined as:

↗C;A;V Q ≡ (πcolumns(Q)−{A,V} Q) ⟕ (ρV→C1 πcolumns(Q)−{A} σA=C1 Q) ⟕ · · · ⟕ (ρV→Cn πcolumns(Q)−{A} σA=Cn Q), for {C1, . . . , Cn} = C

Fig. 7. An example of an instance transformed by the pivot query operator ↗{Sp,Su,F,W};Period;Price, first broken down into intermediate relations that correspond to the set of non-pivoted columns (a) and the subsets of rows corresponding to each named value in the pivot column "Period" (b–e). The pivoted instances are then outer joined with the first instance (Keys) to produce the pivot table (f).

² For a good primer on the relational data model and relational algebra, consider [15].


Note that the pivot (and unpivot) query operators above have an argument giving the precise values on which to pivot (or columns to unpivot, respectively); as a result, both query operators have fixed input and output schemas. This flavor of the pivot and unpivot operator is consistent with implementations in commercial databases (e.g., the PIVOT ON clause in SQL Server [29]). Contrast this property with the Pivot CT, where the set of output columns is dynamic. Some research extensions to SQL, including SchemaSQL [23] and FISQL [45,46], introduce relation and column variables that can produce the effect of dynamic pivoting, but over unconstrained or manually constrained domains. The Pivot CT lies between these two cases in functionality, where the set of pivot columns is dynamic but still constrained.

Figure 7 shows an example instance transformed by a pivot operator ↗C;A;V, with the transformation broken down into stages. First, columns A and V are dropped using the project operator, with only the key for the pivoted table remaining. Then, for each value C in the set C, the instance ρV→C πcolumns(Q)−{A} σA=C Q is constructed, consisting of all rows in the instance that have value C in the pivot column A, with the "value" column V renamed to C to disambiguate it from other value columns in the pivot table. Finally, each resulting table is left-outer-joined against the key table, filling the key table out with a column for each value C.

A pivot operator is useful in the algebra because, like joins, there are well-known N log N algorithms involving a sort of the instance followed by a single pass to fill out the pivot table. Some details of the pivot query operator are left aside, such as what to do if there exist multiple rows in the instance with the same key-attribute pair, since the exact semantics of these cases have no bearing on the operation of a channel (Wyss and Robertson have an extensive formal treatment of Pivot [45]).

The unpivot query operator is defined as follows:

↙C;A;V Q ≡ ⋃C∈C (ρC→V πcolumns(Q)−(C−{C}) σC≠null(Q) × ρ1→A(name(C)))

where name(C) represents the name of attribute C as a constant (to disambiguate it from a reference to instance data).
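As a concrete illustration of the two operators, the following Python sketch (hypothetical helper functions over relations modeled as lists of dictionaries, not the paper's code) mirrors the staged construction of Figure 7: collect the non-pivoted key columns, fill one output column per pivot value, and, for unpivot, emit one key-attribute-value row per non-null cell.

# Sketch of the pivot and unpivot query operators over in-memory relations.
# Rows are dicts and None plays the role of null.
def pivot(rows, pivot_values, attr_col, value_col):
    key_cols = [c for c in (rows[0].keys() if rows else []) if c not in (attr_col, value_col)]
    table = {}                                    # key tuple -> output row (duplicates collapse)
    for r in rows:
        k = tuple(r[c] for c in key_cols)
        out = table.setdefault(k, {**{c: r[c] for c in key_cols},
                                   **{v: None for v in pivot_values}})
        if r[attr_col] in pivot_values:
            out[r[attr_col]] = r[value_col]       # fill the column named by the attribute value
    return list(table.values())

def unpivot(rows, unpivot_cols, attr_col, value_col):
    result = []
    for r in rows:
        keys = {c: r[c] for c in r if c not in unpivot_cols}
        for c in unpivot_cols:
            if r.get(c) is not None:              # null cells produce no triple
                result.append({**keys, attr_col: c, value_col: r[c]})
    return result

stock = [{"Name": "IBM", "Period": "Sp", "Price": 19},
         {"Name": "IBM", "Period": "Su", "Price": 22},
         {"Name": "MSFT", "Period": "Su", "Price": 31}]
pivoted = pivot(stock, ["Sp", "Su", "F", "W"], "Period", "Price")
# e.g. {'Name': 'IBM', 'Sp': 19, 'Su': 22, 'F': None, 'W': None}
recovered = unpivot(pivoted, ["Sp", "Su", "F", "W"], "Period", "Price")   # back to triples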

Each CT translates a query, including any query appearing as part of a DML or DDL statement, in a fashion similar to view unfolding. That is, function Q looks for all references to tables in the query and translates them in-place as necessary.

As an example, consider QPV, the query translation function for Pivot, which translates all references to table T into ↙Dom(A);A;V T. That is, the query translation introduces an unpivot operator into the query to effectively undo the action that the Pivot CT performs on instances. Of particular note is that the first parameter to ↙ is populated by the CT with the elements in the domain of column A at the time of translation. Thus, the queries generated by the Pivot transformation will always reference the appropriate columns in the pivoted logical schema, even as elements are added to or deleted from the domain of the attribute column in the virtual schema, and thus columns are added or deleted from the physical schema. (Pivot will process the DDL statements for adding or dropping domain elements; see Section 4.3 for an example.)

Because the set of columns in T without V is a superkey, there can never be two rows with the same key-attribute combination; thus, unlike the pivot relational query operator in general, the Pivot CT need not deal with duplicate key-attribute pairs.


HMerge Translation of Queries. Function QHM translates all references to a table t ⊨ f into the expression πCols(t) σC=t T. That is, QHM translates a table reference t into a query that retrieves all rows from the merged table that belong to virtual schema table t, as a selection condition on the provenance column and a projection down to the columns in the virtual schema for t.

To prove that function QHM respects the commutativity properties, one must show that the translation effectively undoes the outer-union operation, which follows from relational algebra equivalences.
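At the instance level, the rewrite amounts to a selection on the provenance column followed by a projection, as in this small Python sketch (illustrative names, using the data of Figure 9):

# Sketch of the effect of Q_HM's rewrite: recover virtual table t from the merged table.
def hmerge_virtual_table(merged_rows, provenance_col, table_name, table_cols):
    return [{c: r.get(c) for c in table_cols}            # projection onto Cols(t)
            for r in merged_rows
            if r[provenance_col] == table_name]          # selection C = t

person = [{"FName": "Bob", "LName": "Smith", "T": "P_Client", "Age": 19, "Cert": None},
          {"FName": "Ted", "LName": "Jones", "T": "P_Staff", "Age": None, "Cert": "T"}]
p_staff = hmerge_virtual_table(person, "T", "P_Staff", ["FName", "LName", "Cert"])
# -> [{'FName': 'Ted', 'LName': 'Jones', 'Cert': 'T'}]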

4.2 Translating DML Statements

The set of update statements accepted by a channel is shown in Table 1. A channel transformation supports the insert, update, and delete DML statements. Update and delete conditions must be equality conditions on key attributes, and updates are not allowed on key attributes, assuming that the application will issue a delete followed by an insert. Channels also support a loop construct, denoted as Loop(t, Q, S), similar to a cursor: t is declared as a row variable that loops through the rows of the result of Q. For each value t takes on, the sequence of statements S executes. Statements in S may be any of the statements from Table 1 and may use the variable t as a row constant. Using Loop, one can mimic the action of arbitrary update or delete conditions by using a query to retrieve the key values for rows that match the statement's conditions, then issuing an update or delete for each qualifying row. Channels support an error statement Error(Q) that aborts the transaction if the query Q returns a non-empty result.

A complete definition of the update translation function U includes computation of U(I(T,C,Q)), U(D(T,F)), etc. for each statement in Table 1 for arbitrary parameter values. The results are concatenated based on the original transaction order to form the output transaction. For instance, for an update function U, if U(s, [u1]) = [u1, u2] and U(s, [u2]) = [u3, u4, u5], then U(s, [u1, u2]) = [u1, u2, u3, u4, u5]. An error either on translation by U (i.e., U evaluates to ε on a given input) or during execution against the instance aborts a transaction.
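A minimal Python sketch of this composition follows (illustrative names; a failed translation is modeled here as a None result):

# Sketch: the output transaction is the in-order concatenation of the per-statement
# translations; a failed translation (epsilon) aborts the whole transaction.
def translate_transaction(translate_statement, schema, transaction):
    output = []
    for stmt in transaction:
        result = translate_statement(schema, stmt)
        if result is None:
            raise RuntimeError("transaction aborted: statement could not be translated")
        output.extend(result)
    return output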

u = I(Stock, {Name, Period, Price}, {(MSFT, Sp, 31), (Apple, Su, 52)})
UPV(u) = Error(σName=MSFT ∧ Sp≠null Stock),
         U(Stock, {Name = MSFT}, Sp, 31),
         I(Stock, {Name, Su}, {(Apple, 52)})

Fig. 8. An example of inserts translated by a Pivot CT

HMerge Translation of Inserts. Let UHM be the update function for the transformation HMerge(f, T, C), and let s be its virtual schema. Define the action of UHM on an Insert statement I(t, C, Q) where t ⊨ f as follows:


UHM(s, I(t, C, Q)) = I(T, C ∪ {C}, Q × {name(t)})

where name(t) is the string-valued name of table t. The translation takes all rows Q that are to be inserted into table t and attaches the value for the provenance column in the output. An example is shown in Figure 9. Since the output consists entirely of insert statements, proving that UHM respects the commutativity properties for insert statements reduces to showing that the newly added rows, when queried, appear in the virtual schema in their original form. In short, one must show that πCols(t) σC=t(Q × {name(t)}) = Q, which can be shown to be true by relational equivalences.
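An instance-level Python sketch of this translation (illustrative names; statements encoded as tuples) reproduces the behavior shown in Figure 9:

# Sketch of U_HM on an Insert: attach the provenance value name(t) to each inserted row
# and redirect the insert to the merged table.
def hmerge_translate_insert(insert, merged_table, provenance_col):
    table, cols, rows = insert                     # I(t, C, Q), with Q as a list of row dicts
    new_rows = [{**r, provenance_col: table} for r in rows]
    return ("I", merged_table, cols + [provenance_col], new_rows)

u = ("P_Staff", ["FName", "LName", "Cert"], [{"FName": "Gail", "LName": "Brown", "Cert": "T"}])
translated = hmerge_translate_insert(u, "Person", "T")
# -> ('I', 'Person', ['FName', 'LName', 'Cert', 'T'],
#     [{'FName': 'Gail', 'LName': 'Brown', 'Cert': 'T', 'T': 'P_Staff'}])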

u = I(P_Staff, {FName, LName, Cert}, {(Gail, Brown, T)})
UHM(u) = I(Person, {FName, LName, T, Age, Cert}, {(Gail, Brown, P_Staff, null, T)})

Fig. 9. An example of an insert statement translated by an HMerge CT

Pivot Translation of Inserts. Now consider UPV(I(T, C, Q)), pushing an insert statement through a Pivot. Each tuple (K, A, V) inserted into the virtual schema consists of a key value, an attribute value, and a data value; the key value uniquely identifies a row in the pivoted table, and the attribute value specifies the column in the pivoted table. UPV thus transforms the insert statement into an update statement that updates column A for the row with key K to be value V, if the row exists. In Figure 8, the inserted row with Name = "MSFT" corresponds to a key value already found in the physical schema; that insert row statement therefore translates to an update in the physical schema. The other row, with Name = "Apple", does not correspond to an existing key value, and thus translates to an insert.

The Pivot CT adds an error statement to see if there are any key values in common between the new rows and the existing values in the output table, and if so, returns an error, as this situation indicates that a primary key violation would have occurred in a materialized virtual schema. Next, using a Loop statement, for each row in Q that corresponds to an existing row in the output table, generated statements find the correct row and column and set its value. A final insert statement finds the rows in Q that do not correspond to existing rows in the output table, pivots those, and inserts them.

Let s be the virtual schema of the CT Pivot(T, A, V, T). Define the action of the CT's update function UPV on an insert DML statement I(T, C, Q) as follows:

UPV(s, I(T, C, Q)) =
Error((πKeys(T) Q) ∩ πKeys(T) ↙Dom(A);A;V (πCols(T)(Q ⋈ T))),
  (check that inserted rows do not collide with existing data)
∀a∈Dom(A) Loop(t, σA=a Q ⋈ (πKeys(T) T),
  U(T, ∀c∈Keys(T) c = t[c], {a}, πV t)),
  (update each row whose key is already present)
I(T, Cols(T), ↗Dom(A);A;V (Q ▷ (πKeys(T) T)))
  (inserts for non-existent rows)
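The following Python sketch (illustrative names) applies the same logic directly to a materialized pivoted instance rather than emitting statements, matching the behavior described for Figure 8: a collision raises an error, an existing key becomes an update of one cell, and a new key becomes an insert of a freshly pivoted row.

# Instance-level sketch of U_PV on inserted (key, attribute, value) triples.
def pivot_apply_insert(physical, key_col, pivot_values, triples):
    """physical: dict mapping key -> pivoted row; triples: rows inserted in the virtual schema."""
    for key, attr, value in triples:
        row = physical.get(key)
        if row is None:
            row = {key_col: key, **{v: None for v in pivot_values}}
            physical[key] = row                               # insert for a non-existent key
        elif row.get(attr) is not None:
            raise ValueError("primary key violation in the virtual schema")   # Error statement
        row[attr] = value                                     # update the targeted cell

stock = {"MSFT": {"Name": "MSFT", "Sp": None, "Su": 35, "F": None, "W": None}}
pivot_apply_insert(stock, "Name", ["Sp", "Su", "F", "W"],
                   [("MSFT", "Sp", 31), ("Apple", "Su", 52)])
# MSFT's Sp cell is filled in; a new pivoted row appears for Apple.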


4.3 Translating DDL Statements

Table 1 includes the full list of supported schema and constraint update statements. The domain-element DDL statements are unique to our approach. If a domain element E in column C is dropped, and C is not a key column, then any row that had a C value of E will have that value set to null. However, if instead C is a key attribute, then any such row will be deleted. In addition, the Rename Element DDL statement will automatically update an old domain value to the new one. Since renaming an element can happen on any column, key or non-key, renaming elements is a way to update key values in place. Note that the set of changes in Table 1 is complete in that one can evolve any relational schema S to any other relational schema S′ by dropping any elements they do not have in common and adding the ones unique to S′ (a similar closure argument has been made for object-oriented models, e.g., [5]).
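A short Python sketch of these instance-level semantics (illustrative names, rows as dicts):

# Sketch: Drop Element nulls out non-key occurrences and deletes rows for key columns;
# Rename Element rewrites matching values in place.
def drop_element(rows, column, element, is_key):
    if is_key:
        return [r for r in rows if r.get(column) != element]   # delete matching rows
    for r in rows:
        if r.get(column) == element:
            r[column] = None                                   # null out non-key values
    return rows

def rename_element(rows, column, old, new):
    for r in rows:
        if r.get(column) == old:
            r[column] = new                                    # in-place value update
    return rows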

u = AT(P_Admin, {FName, LName, Pay}, {string, string, int})
UHM(u) = AE(Person, T, P_Admin), AC(Person, Pay, int)

Fig. 10. An Add Table statement translated by HMerge

HMerge Translation of Add Table. Let UHM be the update function for transformation HMerge(f, T, C), and let s be its virtual schema. Define the action of UHM on an Add Table statement for a table t that satisfies f as follows:

UHM(s, AT(t, C, D, K)) =
If T exists, then AE(T, C, t), and for each column c in table t,
  (∄s ⊨ f . c ∈ Cols(s)) → AC(T, c, Dom(c))
If T not yet created, then
  AT(T, C ∪ {C}, D ∪ {{name(t)}}, K ∪ {C})

If the merged table already exists in the physical schema, the function adds a new domain element to the provenance column to point to rows coming from the new table. Then, the function UHM adds any columns that are unique to the new table. If the new table is the first merged table, the output table is created using the input table as a template. An example is shown in Figure 10, assuming tables "P_Client" and "P_Staff" already exist in the virtual schema.

HMerge Translation of Add Column. Let UHM be the update function for the transformation HMerge(f, T, C), and let s be its virtual schema. Define the action of UHM on an Add Column statement AC(t, C, D) for one of the merged tables t ⊨ f as follows:


UHM(s, AC(t, C, D)) =
If C is not a column in any other merged table besides t, then AC(T, C, D)
If C exists in another merged table t′, and t′.C has a different domain, then ε (abort: union compatibility violated)
If C exists in other merged table(s), all with the same domain, then ∅ (leave output unchanged)
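The two translations above can be sketched in Python as follows (statements encoded as tuples, all names illustrative; the key columns in the example Add Table statement are an assumption). Applied to the existing Person table, the Add Table case reproduces the output of Figure 10.

# Sketch of U_HM on Add Table and Add Column. merged_cols maps each column of the
# merged table to its domain, or is None if the merged table does not exist yet.
def hmerge_translate_add_table(stmt, merged_table, provenance_col, merged_cols):
    t, cols, domains, keys = stmt                              # AT(t, C, D, K)
    if merged_cols is None:
        # First merged table: create the merged table with the provenance column in the key;
        # the provenance column's domain starts as the singleton {name(t)}.
        return [("AT", merged_table, cols + [provenance_col],
                 domains + [[t]], keys + [provenance_col])]
    out = [("AE", merged_table, provenance_col, t)]            # new provenance value for t
    for c, d in zip(cols, domains):
        if c not in merged_cols:                               # column unique to the new table
            out.append(("AC", merged_table, c, d))
    return out

def hmerge_translate_add_column(stmt, merged_table, merged_cols):
    t, c, d = stmt                                             # AC(t, C, D)
    if c not in merged_cols:
        return [("AC", merged_table, c, d)]
    if merged_cols[c] != d:
        raise ValueError("union compatibility violated")       # epsilon: abort
    return []                                                  # same column, same domain: no-op

person_cols = {"FName": "string", "LName": "string", "T": "string", "Age": "int", "Cert": "bool"}
u = ("P_Admin", ["FName", "LName", "Pay"], ["string", "string", "int"], ["FName", "LName"])
hmerge_translate_add_table(u, "Person", "T", person_cols)
# -> [('AE', 'Person', 'T', 'P_Admin'), ('AC', 'Person', 'Pay', 'int')]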

Pivot Translation of Drop Element. Let UPV be the update function for the transformation Pivot(T, A, V, T), and let s be its virtual schema. Define the action of UPV on Drop Element DDL statements as follows:

UPV(s, DE(T, C, E)) =
If C = A, then DC(T, E)
Else if C = V, then ∀c∈Dom(A) DE(T, c, E)
Else, DE(T, C, E)

If dropping an element from the attribute column, translate into a Drop Column. If dropping an element from the value column, translate into Drop Element statements for each pivot column. Otherwise, leave the statement unaffected (also leave unaffected any Drop Element statement on tables other than T). An example of Drop Element translation is in Figure 11.

u = DE(Stock, Period, Sp) (equivalent to DELETE WHERE Period = "Sp")
UPV(u) = DC(Stock, Sp)

Fig. 11. An example of a Drop Element statement translated by a Pivot CT

Let us use this translation as an example of how to prove the correctness of a CT's update translation, in particular dropping an element from the "attribute" column A in the input to the Pivot CT. Proving the first property is omitted for space and left as an exercise for the reader. To prove the second commutativity property, one must demonstrate that adding the pivot table with the element still present and then dropping the element yields the same result as pushing the table's schema through without the element.

Proposition: Let s be a schema with T undefined. Then:
UPV(s, {AT(T, C ∪ {A}, D ∪ {D′ − {E}}, K ∪ {A})})
= UPV(s, {AT(T, C ∪ {A}, D ∪ {D′}, K ∪ {A}), DE(T, A, E)}).

Proof: UPV(s, AT(T, C ∪ {A}, D ∪ {D′ − {E}}, K ∪ {A}))
= AT(T, (C − {V}) ∪ (D′ − {E}), (D − {Dom(V)}) ∪ {∀a∈D′−{E} Dom(V)}, K)
  (Push the Add Table statement through the Pivot)
= AT(T, (C − {V}) ∪ D′, (D − {Dom(V)}) ∪ {∀a∈D′ Dom(V)}, K), DC(T, E)
  (DDL equivalence)
= UPV(s, {AT(T, C ∪ {A}, D ∪ {D′}, K ∪ {A}), DE(T, A, E)})
  (View the statements in their pre-transformation image)

Finally, one needs to prove the commutativity property from Figure 5(c):

Proposition: Let s be a schema with T defined. Then:
QPV(DE(T, C, E)(s), qT)(DC(T, E)(SPV(s))) = qT(DE(T, C, E)(s))

Proof: QPV(DE(T, C, E)(s), qT)(DC(T, E)(SPV(s)))
= (↙Dom(A)−{E};A;V T)(DC(T, E)(SPV(s)))
  (Transforming the query qT, but on a schema where column A has lost element E)
= ↙Dom(A)−{E};A;V πCols(T)−{E} T
  (Dropping a column has the effect of projecting it away)
= σA≠E ↙Dom(A);A;V T
  (Extended relational algebra equivalence for unpivot)
= σA≠E qT(s)
  (Pull the query back through the transformation on the original schema)
= qT(DE(T, C, E)(s))
  (The effect of a Drop Element statement on a key column is to delete all rows with that value) □

4.4 Translating Foreign Keys

Consider three levels, or tiers, of referential integrity, offering a trade-off between expressive power and efficiency. A Tier 1 foreign key is a standard foreign key in the traditional relational model. A Tier 3 foreign key Check(Q1 ⊆ Q2) is a containment constraint between two arbitrary queries. A Tier 2 foreign key falls between the two, offering more expressiveness than an ordinary referential integrity constraint but with efficient execution.

A Tier 2 foreign key statement FK(F|T.X → G|U.Y) is equivalent to the statement Check(σF πX T ⊆ σG πX U), where Y is a (not necessarily proper) subset of the primary key columns of table U, and F and G are sets of conditions on key columns (for their respective relations) with AND semantics. The statement FK(true|T.X → true|U.Y) is therefore a Tier 1 foreign key (a foreign key in the traditional sense) if Y is the key for table U.

To translate FK (and DFK) statements, one can leverage the insight that any FK statement can be restated as a Check statement. The Check (and DCheck) statements have behavior specified as queries, so their translation follows directly from query translation. It becomes immediately clear why additional levels of referential integrity are required; if one specifies a standard integrity statement FK(true|T.X → true|U.Y) against a virtual schema, its image in the physical schema may involve arbitrarily complex queries.


A foreign key constraint in the standard relational model is a containment relationship between two queries, πC T ⊆ πK T′, where C is the set of columns in T comprising the foreign key and K is the key for T′. Figure 12(a) shows a traditional foreign key between two tables. Figure 12(b) shows the same two tables and foreign key after the target table of the foreign key has been horizontally merged with other tables. The foreign key now points to only part of the key in the target table and only a subset of the rows, a situation that is not expressible using traditional relational foreign keys. Figure 12(c) shows the same tables as Figure 12(a), but this time, the target table has been pivoted. Now, the "target" of the foreign key is a combination of schema and data values.

Thus, propagating an ordinary foreign key through a CT may result in a containment query involving arbitrary extended relational algebra. It is possible to translate a foreign key constraint Q1 ⊆ Q2 through a CT simply by translating queries Q1 and Q2. However, one can observe that in many cases, the translated query is of the form πC′ σF′ T′ or even πC′ T′, though not necessarily covering a table's primary key. A containment constraint using these simple queries may be enforced by triggers with reasonable and predictable performance.

Table 1 lists the two statements that can establish integrity constraints, FK and Check. The update function U for a CT translates a Check statement by translating its constituent queries via the CT's query translation function Q. Note that as a consequence, if a CT translates an FK statement into a Tier 3 foreign key requiring a Check statement, it will stay a Check statement through the rest of the channel.

There are two additional statements listed in Table 1 that drop referential integrity constraints, DFK and DCheck. A CT translates these statements in the same fashion as their "add" analogs.

Tiered Foreign Keys. A Tier 1 foreign key defined from columns T.X to table T′ with primary key Y is equivalent to the following logical expression:

∀t∈T t[X] ≠ null → ∃t′∈T′ t[X] = t′[Y].

A Tier 2 foreign key statement FK(F|T.X → G|T′.Y) is equivalent to the following logical expression:

∀t∈T (t ⊨ F ∧ t[X] ≠ null) → ∃t′∈T′ (t[X] = t′[Y] ∧ t′ ⊨ G),

where Y is a (not necessarily proper) subset of the primary key columns of table T′, and F and G are sets of conditions on key columns (for their respective relations) with AND semantics. Figure 12(b) shows an example of a Tier 2 foreign key enforced on table instances, and the statement used to create the foreign key.

The foreign key FK(true|T.X → true|T′.Y) is precisely a Tier 1 foreign key when Y is the primary key for T′. One can represent Tier 1 FKs using the Tier 2 FK syntax FK(F|T.X → G|T′.Y) because it simplifies the description of a CT, and because it is trivial to check at runtime whether F and G are empty and Y is a key for T′. Thus, our implementation can determine at runtime when a Tier 2 FK can be implemented in a database as a Tier 1 FK (a standard relational foreign key).

A Tier 3 foreign key is a containment constraint between two queries Q and Q′ in arbitrary relational algebra over a single schema, expressed as Check(Q ⊆ Q′).
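A small Python sketch of checking a Tier 2 foreign key on in-memory instances follows (illustrative names; the data mirrors Figure 12(b)):

# Sketch: verify sigma_F pi_X T is contained in sigma_G pi_Y T', with null source
# values exempt, as in the Tier 2 definition above.
def tier2_fk_holds(t_rows, x_cols, cond_f, tprime_rows, y_cols, cond_g):
    targets = {tuple(r[c] for c in y_cols) for r in tprime_rows if cond_g(r)}
    for r in t_rows:
        if not cond_f(r):
            continue
        value = tuple(r[c] for c in x_cols)
        if any(v is None for v in value):        # null source values are exempt
            continue
        if value not in targets:
            return False
    return True

sales = [{"ID": 1, "Item": "Soup", "Vendor": "A"}, {"ID": 2, "Item": "Bread", "Vendor": "B"}]
all_items = [{"Item": "Soup", "Vendor": "A", "Type": "Food"},
             {"Item": "Bread", "Vendor": "B", "Type": "Food"},
             {"Item": "Yarn", "Vendor": "B", "Type": "Textile"}]
# FK(true | Sales.(Item, Vendor) -> Type=Food | AllItems.(Item, Vendor))
ok = tier2_fk_holds(sales, ["Item", "Vendor"], lambda r: True,
                    all_items, ["Item", "Vendor"], lambda r: r["Type"] == "Food")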


(a) FK(true | Sales.(Item, Vendor) → true | Food.(Item, Vendor))
(b) FK(true | Sales.(Item, Vendor) → Type=Food | AllItems.(Item, Vendor)); note: the row (3, 645, Yarn, B) cannot be inserted into Sales, since the qualifying row in AllItems does not meet the condition Type=Food specified in the FK.
(c) Check(πItem,Vendor Sales ⊆ πItem,Vendor ↙{A,B};Vendor;Stock Food)

Fig. 12. Examples of Tier 1 (a), Tier 2 (b), and Tier 3 (c) foreign keys

The example in Figure 12(c) can be expressed as a Tier 3 foreign key, where the target of the foreign key is a pivoted table. Since Tier 3 FKs may be time-consuming to enforce, a channel designer should take note of when a CT demotes a Tier 1 or 2 foreign key to Tier 3 (i.e., any time a Check statement appears in the logic for translating an FK statement) and consider the trade-off.

Tier 2 FK as a Trigger. A Tier 2 foreign key FK(F|T.X → G|T′.Y) can be enforced in a standard relational database using triggers, specifically insert and update triggers on the source table T and a delete trigger on the target table T′:

begin insert trigger (T)
  if new tuple[X] is null, accept insert            (null source values are exempt)
  if new tuple satisfies conditions F
    for each tuple t' in T'
      if t'[Y] = new tuple[X] and t' satisfies G
        accept insert
    reject insert
  accept insert
end trigger

(update trigger follows same pattern as insert)


begin delete trigger (T')
  if deleted tuple satisfies conditions G
    for each tuple t in T
      if t[X] = deleted tuple[Y] and t satisfies F
        delete tuple t
end trigger

The worst-case performance for enforcing a Tier 2 foreign key is that tables T and T′ must be scanned once. The best-case scenario is that there is an index on T.X and T′.Y, and the triggers may be able to operate using index-only scans.

HMerge Translation of Tier 2 FK. Let UHM be the update function for the transformation HMerge(f, T, C), and let s be its virtual schema. Define the action of UHM on a Tier 1 or 2 foreign key as follows:

UHM(s, FK(F|T.X → G|T′.Y)) =
If T ⊨ f and T′ ⊭ f, then FK(F ∧ (C = T)|T.X → G|T′.Y)
Else, if T ⊭ f and T′ ⊨ f, then FK(F|T.X → G ∧ (C = T′)|T.Y)
Else, if T ⊨ f and T′ ⊨ f, then FK(F ∧ (C = T)|T.X → G ∧ (C = T′)|T.Y)
Else, FK(F|T.X → G|T′.Y)

This result follows from query translation: one can translate the fragment into its Tier 3 equivalent, translate the two constituent queries through QHM, then translate the result back to an equivalent Tier 2 fragment to arrive at the result above. Note that the translation of a Tier 2 FK through a Horizontal Merge results in a Tier 2 foreign key.

Pivot Translation of Tier 2 FK. Let UPV be the update function for the transformation Pivot(Tp, A, V, Tp), and let s be its virtual schema. The action of UPV has several cases based on the tables, columns, and conditions in a Tier 1 or 2 foreign key definition; for brevity, here are two of the interesting cases:

Case 1: T = Tp, T′ ≠ Tp, and A ∈ X. One of the source columns is pivoted (this is the case demonstrated in Figure 12(c)).
UPV(s, FK(F|T.X → G|T′.Y)) = Check(πX σF ↙Cols(Tp)−Keys(Tp);A;V Tp ⊆ πY σG T′)
Figure 12(c) is such a case, where the target of the foreign key references the pivot attribute column, so a Check statement is needed to describe the integrity constraint over the logical schema.

Case 2: T = Tp, T′ ≠ Tp, ∃(c=v)∈F c = A, V ∈ X, and A ∉ X. The source table is pivoted, there is a condition on the pivot attribute column, and the value column V participates in the foreign key.


UPV(s, FK(F|T.X → G|T′.Y)) = FK(F − {(c = v)} | Tp.((X − {V}) ∪ {v}) → G|T′.Y)

The result is a single FK involving only one pivoted column v in the source table, matching the original condition on column A.

5 Business Logic and Channel Transformations

To fully support the property of indistinguishability, a channel must encapsulate all of the data and query transformations that occur between an application's virtual schema and its physical schema. Such transformations may include business logic that is typically found in data access layers or stored procedures. While the restructuring CT's in Section 4 are defined on materialized instances, other kinds of business logic may be non-deterministic.

For instance, consider an application designer who would like all tables in the database to include a column whose value for a given row is the user that last edited the data in that row. The application does not display that data, but reports may be run over the database to look for patterns in data usage based on user. Such a transformation can still be defined as a CT, even considering the non-determinism of the operation. Such business logic transformations include:

– Adorn(T, e, A, C) adds columns A to table T. The columns hold the output of function e, which returns the state of environment variables. Values in A are initialized with the current value of e on insert, and refreshed on update whenever any values in columns C change.

– Trace(T, T) monitors all operations to any tables in the set T and records them to table T. For instance, given a delete operation to table T1, Trace would insert into table T a row containing a textual representation of the statement as well as the time of the operation and the current user. The CT could be further parameterized to include the set of statements to monitor, the exact names of the columns in the table T, etc.

– Audit(T, B, E) adds columns B and E to table T, corresponding to a lifespan (i.e., valid time) for each tuple. Rows inserted at time t have (B, E) set to (t, null). For rows deleted at t, set E = t. For updates at time t, clone the row; set E = t for the old row, and set (B, E) = (t, null) for the new row. The virtual schema instance corresponds to all rows from the output database where E = null.

5.1 Business Logic Transformations and Correctness

As noted in Section 3.1, channel transformations are ordinarily defined in terms of four functions corresponding to their effect on schemas, instances, queries, and updates. Recall that the instance function I is never implemented, but serves as part of the semantic definition of the transformation. For all of the restructuring transformations listed in Section 4, the function I makes sense because each restructuring CT is fully deterministic. Thus its effect on a fully materialized instance is the most logical way to think about its definition.


For a Business Logic CT (BLCT), defining such a transformation is not as important given that its definition is non-deterministic. Said another way, whereas the most natural way to describe the operation of a restructuring CT is to describe its operation on instances, the most natural way to describe the operation of a business logic CT is to describe its operation on updates. Note that the descriptions of Adorn and Audit in the introduction to this section are indeed given in terms of how they operate on updates.

BLCT's are still defined as a tuple of functions S, I, Q, and U. BLCT's must still satisfy the correctness properties in Figure 5. The instance function I is still defined for a BLCT, but its output may contain "placeholders" that would be filled in with the current value of some variable or environment at the time of execution. However, where the restructuring CT's may use the properties in Figure 5 to infer an update transformation from an instance transformation, one may use Figure 5 to infer I from U by defining I as if it were a sequence of insert statements inserting all of its rows.

5.2 Example Transformation: Audit

The Audit transformation is primarily a way to ensure that no data in a database is ever overwritten or deleted. As a motivating application of the transformation, consider an application where the data in the database must be audited over time to detect errors that may have been overwritten, or to allow data recovery in the event that data is accidentally overwritten or deleted. This operation has also been called soft delete in the database literature [1].

Audit is in many ways a way to compensate for the lack of first-class temporal support in database systems. There have been many extensions to relational systems that add temporal features (e.g., [24,35]). Such systems maintain a queryable history of the data in the database, and therefore allow users to write queries that ask questions like, "what would have been the answer to query Q as of time T?", or "what is the result of this aggregate operation over time?" However, since such systems have typically stayed in the research realm and because some applications' needs in the temporal area are fairly small, many applications merely implement some of the temporal functionality as part of their data access layer. The Audit transformation serves this same purpose.

The rest of this section discusses the Audit transformation in depth in two ways. First, the section provides a detailed discussion of a simple version that meets the transformation's basic requirements as an insulator from DML update and delete statements. Second, the section provides insights into how to construct a more complete version of the transformation that may go beyond the needs of many applications but covers more advanced scenarios.

5.3 Translating Schema

As mentioned in the description of the CT, Audit requires two additional columns to be added to any audited table. Each row in the physical schema will have a value for column B indicating when the row comes into existence, and a (possibly null) value for column E indicating when the row is no longer valid. Temporally speaking, the two values for B and E put together form an interval closed on the left and open on the right, and possibly unbounded on the right if E = null, representing the case where the row is still current.

In addition to the schema translation S adding two extra columns, it also adds column B to the primary key of the table. The need for an altered primary key is motivated by the fact that a single row with primary key K may map to multiple rows in the physical schema. Each of those rows, however, must have a unique value for B, since each row must come into existence at a different point in time.

A simplified version of Audit may stop here, as it provides the proper client-side effects. Specifically, if the only access to the database is through the channel, no additional modifications are necessary to the schema. However, it is important to note at this stage that merely adding B to the schema of the output table does not prevent a malicious or unknowing user from creating a database instance in the physical schema that is nonsensical. Consider, for example, the following two tuples:

[A = 10, Z = Bob, B = 5, E = 10]

[A = 10, Z = Alice, B = 7, E = 12]

Treating the values for B and E as intervals, the two tuples above have overlapping lifespans for the row with key value A = 10, implying that at a given point in time, two tuples with the same key existed. Similarly, consider the following pair of tuples:

[A = 10, Z = Bob, B = 5, E = null]

[A = 10, Z = Alice, B = 7, E = null]

For this pair, the key value A = 10 corresponds to two different "active" rows, which violates our invariants. Finally, consider this tuple:

[A = 10, Z = Alice, B = 7, E = 3]

This tuple is invalid as it has ended before it began. Such constraints are indeed handled by true temporal databases, but are not enforced in a relational database without triggers, such as the following example, where "PK" is the set of key columns not including the lifespan origin point B:

begin insert trigger (T)
  if (new tuple[E] is not null and new tuple[E] <= new tuple[B])
    reject insert                          (lifespan would end before it begins)
  for each tuple t in T
    if t[PK] = new tuple[PK]
      if new tuple[E] is null and t[E] is null
        reject insert                      (two open-ended rows for the same key)
      else if new tuple[E] is null and t[E] is not null
        if new tuple[B] < t[E]
          reject insert
      else if new tuple[E] is not null and t[E] is null
        if t[B] < new tuple[E]
          reject insert
      else
        if not (t[E] <= new tuple[B] or new tuple[E] <= t[B])
          reject insert                    (closed lifespans overlap)
  accept insert
end trigger

(update trigger follows same pattern as insert)

This kind of trigger cannot be mimicked by any of the constraints considered in Table 1. Therefore, to account for this kind of constraint, one would need to add a new kind of constraint to Table 1, in which case all CT's would need to "learn" how to translate such constraints, and one would need a set of constraints that is closed under all CT's. Given the difficulty and computational complexity of this constraint, most applications that employ an Audit transformation opt not to include the constraint and instead rely on the application and data access layers to produce correct data.³

5.4 Translating DML Statements

True to the "nothing deleted" spirit of the transformation, the most interesting operation of the Audit CT is on update statements. Informally, the update function U has the following effect on DML statements:

– For insert statements, add the current system time as the value for the "begin" timestamp column B.

– For delete statements, change the statement to an update statement that sets the "end" timestamp column E to the current system time for all affected rows.

– For update statements, take all affected rows and clone them, where the new cloned row has a current timestamp for column B. Then, set column E for the old rows to the current time as well.

Formally, the function U can be defined as follows:

UAudit(T, I(t, C, Q)) = I(T, C ∪ {B, E}, Q × {(Now(), null)})
UAudit(T, D(t, F)) = Loop(t, σF∧E=null T, {
    U(T, ∀c∈Keys(T) c = t[c], {E}, {Now()})})
UAudit(T, U(t, F, C, Q)) = Loop(t, σF∧E=null T, {
    U(T, ∀c∈Keys(T) c = t[c], {E}, {Now()}),
    I(T, Cols(T), πCols(T)−{B,E} t × {(Now(), null)})})

In the above, Now() represents the current timestamp at the time of execution. Examples of the action of an Audit CT on an insert statement can be found in Figure 13(a), while Figure 13(b) shows the same CT operating on an update and a delete.
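An instance-level Python sketch of the three translations (illustrative names; rows as dicts, with B and E as the lifespan columns) makes the clone-and-close behavior of Figure 13 explicit:

# Sketch of the Audit CT's effect on DML, applied directly to an instance.
from datetime import datetime

def audit_insert(rows, new_row):
    rows.append({**new_row, "B": datetime.now(), "E": None})

def audit_delete(rows, key_pred):
    now = datetime.now()
    for r in rows:
        if key_pred(r) and r["E"] is None:
            r["E"] = now                                          # deprecate, never delete

def audit_update(rows, key_pred, changes):
    now = datetime.now()
    for r in list(rows):
        if key_pred(r) and r["E"] is None:
            rows.append({**r, **changes, "B": now, "E": None})    # clone carrying new values
            r["E"] = now                                          # close the old row

def audit_query(rows, visible_cols):
    """Q_Audit: only current (E = null) rows, projected onto the virtual columns."""
    return [{c: r[c] for c in visible_cols} for r in rows if r["E"] is None]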

³ This pattern of relying on correct application and data access behavior is a common one, and will be revisited in the section on object-relational mappings.


u = I(Person, {FName, LName, Age, Cert}, {(Gail, Brown, 28, T)})
UAudit(u) = I(Person, {FName, LName, Age, Cert, B, E}, {(Gail, Brown, 28, T, Now(), null)})

u1 = U(Person, {FName = Ted, LName = Jones}, Cert, T)
u2 = D(Person, {FName = Sal, LName = Thomas})
UAudit(u1) = U(Person, {FName = Ted, LName = Jones, E = null}, E, Now()),
             I(Person, {FName, LName, Age, Cert, B, E}, {(Ted, Jones, 32, T, Now(), null)})
UAudit(u2) = U(Person, {FName = Sal, LName = Thomas, E = null}, E, Now())

Fig. 13. The action of the Audit transformation on an insert operation (a) as well as an update and a delete operation (b)

5.5 Translating Instances and Queries

The instance transformation IAudit is defined as discussed above, where IAudit processes the instance as if it were a sequence of insert statements. In this case, that means appending to each row in the instance of table T the value Now() for B and null for E.

Query translation for the Audit CT is straightforward. If T is the relational algebra query for all data in audited table T, then:

QAudit(T) = πC σE=null T

where C in this case is the set of columns of T on the transformation's input side. In essence, a query for any data in T will return only the non-deprecated rows in the data store, and only the columns that are visible in the input. Note here that it is trivial to prove the property in Figure 5(a), since QAudit effectively shears off the columns that are added by IAudit.


5.6 Translating DDL Statements

The Audit transformation as described in this section is simplified in that its action softens only the impact of DML statements. This definition may be sufficient for many applications; however, this section discusses what changes would be necessary to modify the definition and operation of the Audit transformation to account for DDL transformations as well. Such handling would allow the user to, for instance, drop a column from the virtual schema but retain that column in the physical schema.

Even the simplified version of the Audit transformation, before taking into account any schema modification handling, must make two accommodations for DDL statements:

– Translates an Add Table statement to add the two temporal "endpoint" columns to the table's definition, where the starting point is also added to the table's primary key.

– Checks to make sure that an Add Column statement does not collide with either endpoint column, and if so, returns an error.

All other table, column, and element DDL statements would pass through unaltered for the simple version of Audit. However, consider the implications of taking the "never lose data" property to include schema modifications:

– When dropping a table, the table itself must somehow be deprecated. The table has been dropped from the virtual schema, so the application is no longer aware of it. However, Audit must define some behavior for the case where the application tries to add a new table with the same name and a possibly different schema.

– When dropping a column, the column must somehow be deprecated. Just as with dropping a table, the column disappears from the application schema and is thus invisible, but must be insulated from being overwritten by a subsequent add column call.

– When dropping an element, the domain element must somehow be deprecated. The semantics for dropped elements can be preserved here: for a key column, treat the drop as a deleted row, and for a non-key column, as an update to null. To allow CT's later in the channel to preserve the semantics, Audit would need to change the drop element operation into Update and Delete statements, since the element itself would not be able to be dropped (or else the original value would be lost).

– Audit would also need to maintain a policy for renamed elements. Audit could either allow the operation to complete without change, or alternatively treat the operation like it treats a DML update, keeping both the old and the new elements and "deprecating" the old ones.

The full version of Audit would then need to maintain a list of all deleted or renamed schema artifacts, implying that those would need to be added to the parameters of the transformation. The physical schema of the CT would always be the union of all of the schemas that had appeared over time. True temporal systems would be able to manage multiple versions of schemas internally rather than require a single unified schema.


5.7 Translating Foreign Keys

Section 4.4 defines an approach where foreign key translation follows directly from query translation. With the simple form of Audit, the same logic applies. Consider translation of a foreign key statement FK(F|T.X → G|T′.Y), where the table T′ is run through an Audit CT. Simply by running the query translation QAudit and simplifying the resulting relational algebra, the constraint becomes Check(σF πX T ⊆ σG πX σE=null T′). This statement essentially says that affected tuples in T must match against "current", non-deprecated tuples in T′. Said another way, if one only considers non-deprecated tuples, the foreign key statement is still enforced as before.

Just as in Section 5.3, the result of the above foreign key translation keeps foreign keys enforced for all operations on current data, but leaves historical data unenforced. Additional constraints would be necessary to enforce constraints on the historical data as well, which would mean that foreign keys would need additional processing beyond what is inferred from just query translation.

6 Object-Relational Mappings

The previous sections describe a methodology to handle relational-to-relational transformations. However, in the typical application development scenario, the client development model and data schema are object-oriented.⁴ Some method is required to transform the object-oriented client schema into a relational one that meets the same requirements as the relational-only transformations. There are two noteworthy solution strategies in the design space:

1. Construct a language of transformations where each individual transformation describes an incremental change of a specific object-oriented artifact into a relational one. For instance, a transformation may dictate that a given collection-valued property should be mapped to a given table with a foreign key relationship.

2. Construct a single transformation whose logic is powerful enough to dictate all of the details about how to translate an entire object-oriented schema into a relational one.

Each option has trade-offs. For instance, option 1 above is well documented and studied as part of the DB-MAIN project [18]. DB-MAIN as a methodology allows the data developer to dictate the transformation procedure from one model to another one step at a time, incrementally translating artifacts in the first model into artifacts in the second. The translation from object-oriented to relational schemas is a supported special case of that framework.

Option 2 above is well implemented in tool support from object-relational mapping tools. Object-relational mapping tools (ORMs) have become a central fixture in application programming over relational databases as a way to bridge the gap between the two data models. Such tools typically have a single indivisible declarative mapping layer that handles all translations from object classes to relations in a single declarative pass. The ORM uses this mapping to translate queries and updates against the model into semantically equivalent ones against the relational database.

⁴ We assume that the object-oriented data model is in particular the Entity Data Model [7], which is similar to the Entity Relationship model extended with generalization hierarchies.

A key differentiator between these two options is which design dimension holds more complexity. For option 1, each individual transformation is small and composable in nature, and thus easy to understand. However, because single transformations are not expected to bring an entire schema from one model to another, schemas on both the input and output of a transformation may not conform to a single model. DB-MAIN compensates for this situation by formally treating each transformation as a mapping to and from a single unified model that contains the constructs from all relevant source and target schemas. As a result, for a DB-MAIN-style transformation to fit into the channels framework, it would need to be defined on the full set of incremental transformations over the generalized model.

For option 2, the design complexity lies in the transformation operation itself. The system only requires as input the set of incremental transformations over an object-oriented schema. With a single transformation being assigned the monolithic task of the full model translation, the difficulty becomes determining how that complex and highly configurable transformation reacts to schema evolution inputs.

The rest of this part of the paper describes a technique that makes option 2 tractable to fit in the channels framework. Given incremental schema evolution primitives, an object-relational mapper can automatically modify itself and its physical schema.

6.1 Query and Update Statements

Table 2 shows a list of transformations similar to that of Table 1, only the statements represent DML and DDL actions in an object-oriented environment. Recall that the operations in Table 1 have a closure property: any CT that transforms a relational schema into another relational schema must be closed with respect to that set of statements under its update transformation U. An ORM modeled as a CT must be closed in the sense that it must translate statements from Table 2 into statements from Table 1. Table 2 is also complete in the same sense as Table 1 is complete, where one can get from any schema to any other schema incrementally using add and drop statements, with the other statements serving as shorter and better instance-preserving shortcuts [5] (e.g., Move Property, which could be modeled as a drop followed by an add).

There is one additional catch to be noted when considering an ORM as a CT, and that is query translation. The query language for a relational CT is simply relational algebra, and thus can be modeled as SQL. The output query language from an ORM is indeed SQL or relational algebra, but the input language must be a query language over objects. There are many possible extensions to SQL that can allow it to handle querying over object data; this paper assumes that the query language is Entity SQL or ESQL, a language that accompanies the Microsoft Entity Framework [28].

ESQL has the property that it can be translated through an ORM into traditional SQL whenever the ORM exhibits a valid mapping, but with one catch. Because ESQL is a query language over objects, it is allowed to return data as objects and also to construct objects on the fly. Traditional SQL lacks that capability. As such, whenever an input query requires an object constructor, the Entity Framework translates the query into SQL that includes additional columns that encode how to construct the objects when the objects are returned. For instance, consider the following simple ESQL query over the example in Figure 1:

SELECT p FROM Person

The set of all Person objects is a polymorphic set that includes patients and staff, and the query itself returns objects. The generated SQL will include extra constant columns that dictate for each row what kind of object it is, and when the data is returned, the Entity Framework will construct objects of the correct type for each row.⁵ This paper assumes that query translation proceeds in that fashion.

6.2 The Impedance Mismatch and Schema Evolution

The primary goal of any ORM is to overcome the impedance mismatch, the difference in data models that must be overcome to allow full-fidelity communication between objects and relations. Therefore, a primary feature of an ORM is to map object-oriented constructs to relational ones. For any given construct that requires mapping, there are often several ways to accomplish the task. For instance, given a class hierarchy, there are three basic ways to map that hierarchy to relations:

– Table-per-Type (TPT), in which each type in the hierarchy is assigned its own table containing only the non-inherited properties of that type.

– Table-per-Concrete Class (TPC), in which each non-abstract type in the hierarchy is assigned its own table with all properties of that type, inherited or otherwise.

– Table-per-Hierarchy (TPH), in which the entire hierarchy is mapped to a single table.

Examples of each of these mapping schemes can be found later in this section; a more complete treatment of the options for mapping objects to relations may be found in [13], among other places. Mapping schemes may also be mixed and matched in the same hierarchy. In that case, handling schema evolution becomes non-trivial. Consider an application with a model as shown in Figure 14. A new application version may require that new types be added to the type hierarchy, including the three shown in the figure. With ORM tools, these conceptual model artifacts must map to persistent storage. Thus, one must determine what the mapping should be and what changes should be made to the physical schema.

If the entire hierarchy in Figure 14 is mapped to storage using a consistent pattern, for instance by mapping the entire hierarchy to a single table (as Ruby on Rails does [34]) or instead by mapping each type to its own table, then it is natural to map the new type using that same pattern (regardless of where in the hierarchy of Figure 14 one chooses to add the type). However, for more complex mappings, especially ones that do not employ a uniform mapping pattern, the answer is more complicated. In Figure 14, the choice of mapping and physical storage may differ for each of the three locations for adding a new type.

Fig. 14. A type hierarchy with three locations for adding a new type to the hierarchy

To treat an ORM as a channel transformation, the correct means for propagating a schema change from a conceptual model to a physical database must be determined by using the existing mapping to guide future incremental changes, even when the mapping scheme is not uniform across a hierarchy. If there is a consistent pattern in the immediate vicinity of the change, then that pattern is preserved after the change. As a special case, if an entire hierarchy is mapped using a single scheme, then new artifacts continue to be mapped using that scheme. Given a list of incremental conceptual model changes and the previous version of the model and mapping:

1. Create a representation of the mapping called a mapping relation that lends itself to analysis, then

2. For each model change, effect changes to the mapping, to the store model, and to any physical databases that conform to the store model, and finally

3. Translate the mapping relation changes into changes to the original mapping.

There are many ORM tools available today, including TopLink [31], Hibernate [20], Ruby on Rails [34], and Entity Framework (EF) [7]. Each of these tools has different syntax and expressive power for mapping conceptual models to databases, but none provides a method for propagating conceptual model changes into mapping or store changes. In the research project MeDEA, given an incremental change to a conceptual model, a developer chooses a rule to guide mapping evolution [12]. However, the new mapping need not be consistent with the existing mapping; if such consistency is important, the developer must maintain intricate knowledge of the mapping. By contrast, the technique presented here provides the developer with an automated model design experience in which changed artifacts are mapped consistently with existing artifacts. Therefore, the requirements of using an ORM as a CT are fulfilled, as it can faithfully translate schema evolution primitives without requiring intervention.

One final note about mappings and constraints: as with the Audit transformation from Section 5, object-relational mappers are very good at ensuring a consistent client-side experience. However, they often fall short (intentionally) in enforcing constraints on the database that ensure that direct access to the database does not violate any assumptions. Each of the three mapping paradigms (Table-per-Type, Table-per-Concrete Class, and Table-per-Hierarchy) has such assumptions that could be expressed using triggers but are often costly to enforce. For example, if the class hierarchies on the object side of the mapping do not allow multiple inheritance:

– For TPT mappings, if table T and table T′ represent types in the object space that are siblings in the class hierarchy, then there must not be any overlapping keys in common between T and T′.

– For TPC mappings, if T represents the set of all tables that map to a given class hierarchy, the union of the primary keys from all tables in T must be a distinct set.

– For TPH mappings, there must be a mutual exclusion property enforced. If a table T represents an entire class hierarchy, given a row r in the table, r[c] must be null for any column c that does not map to a property of the class to which r's instance belongs.

– Foreign key mappings may become incomplete in translation. For instance, consider the association between types Procedure and Patient in Figure 1. Because the Person hierarchy is mapped TPC, it happens that the association becomes a foreign key in Figure 2. However, if Person were instead mapped as TPH, the association would become a foreign key that would only point to part of the table that stores all Person instances. Note that in this situation, Tier 2 or 3 foreign keys are expressive enough to model the relationship.

Most object-relational mapping systems consider these limitations to be acceptable. However, just as the statement Check(Q1 ⊆ Q2) formalizes a kind of constraint that can be expressed as a trigger, so too could the necessary constraints for full fidelity of these mappings be formalized. For example, the needed constraint for full fidelity of TPC mappings could be expressed as a statement Disjoint(Q), which takes a set of queries and ensures that their results do not overlap. Such a statement could be processed through a CT using the query pipeline in much the same way as Check statements. This paper leaves such extensions as future work.
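As a rough illustration of what a Disjoint(Q) statement could check, the Python sketch below treats each query as a callable returning a set of key tuples and raises an error when any two results overlap. This encoding is an assumption of the sketch, not a construct defined by the paper.

def check_disjoint(queries):
    """Raise an error if any two query results overlap.  Each query is a
    callable returning a set of key tuples (an assumed encoding)."""
    results = [set(q()) for q in queries]
    for i, a in enumerate(results):
        for b in results[i + 1:]:
            overlap = a & b
            if overlap:
                raise ValueError("Disjointness violated on keys: %r" % (overlap,))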

6.3 The Mapping Relation

Given the conceptual model from Figure 14, consider a mapping that combines several schemes for the hierarchy, as in Figure 15. The mapping between the conceptual model (a) and its physical storage (b) has the following characteristics:

– The types Thing, Company, and Person are mapped using the TPT scheme, where each type maps to its own table and hierarchical relationships are modeled using foreign keys.

– The type Partner is mapped using the TPC scheme relative to type Company, where each type still maps to its own table, but the child type Partner maps all of its properties derived from Company.

– The types Person, Student, and Staff are mapped using the TPH scheme, with the entire sub-hierarchy mapped to a single table. Furthermore, the types map columns according to their domain, minimizing the number of columns needed in table TPerson.


Table 2. Object-oriented DML and DDL statements supported by channel transformations

Statement | Formalism | Explanation of Variables
Insert | I(T, Q) | Insert objects into entity set T from the result of Q. The value of Q may be a constant or a query result.
Update | U(T, F, C, Q) | Update objects in entity set T that satisfy all equality conditions F specified on key properties. Non-key properties C hold the new values specified by query or constant Q. Query Q may refer to the pre-update object values as constants. Not all key properties need to have a condition.
Delete | D(T, F) | Delete objects from entity set T that satisfy all equality conditions F specified on key properties. Not all key properties need to have a condition.
Add Simple Type | AT(T, P, D, A) | Add new type T without a key, whose properties P have domains D. Parameter A is a Boolean value indicating whether the type is abstract.
Add Identifiable Type | AT(T, P, D, K, A) | Add new type T, whose properties P have domains D, with key properties K ⊆ P. Parameter A is a Boolean value indicating whether the type is abstract.
Add Derived Type | AT(T, T′, P, D, A) | Add new type T, derived from T′, whose new (non-inherited) properties P have domains D. Domains may include references to other types. Parameter A is a Boolean value indicating whether the type is abstract.
Rename Type | RT(To, Tn) | Rename type To to be named Tn. Throw error if Tn already exists.
Change Type Abstraction | CT(T, A) | Change type T to be abstract if A = true, concrete otherwise.
Drop Type | DT(T) | Drop the type named T. Throw an error if type T has any derived types (only drop if the type is a child type).
Add Property | AP(T, P, D) | Add to type T a property named P with domain D.
Rename Property | RP(T, Po, Pn) | In type T, rename the property Po to be named Pn. Throw error if Pn already exists.
Move Property | MP(T, T′, P) | Move the non-key property P from type T to type T′, provided that T is either a descendant or an ancestor of T′. Any instance that has property P both before and after the move will not lose data.
Change Property Facet | CP(T, P, F, V) | For the property P of the type T, change its domain facet F to have the value V.
Drop Property | DP(T, P) | In type T, drop the non-key property P.
Add Element | AE(T, P, E) | In type T, in property P, add a new possible domain value E.
Rename Element | RE(T, P, Eo, En) | In type T, in property P, rename domain element Eo to be named En. Throw error if En conflicts with an existing element.
Drop Element | DE(T, P, E) | In type T, in property P, drop the element E from the domain of possible values.
Add Association | AA(A, F|T ↔C G|T′) | Add an association type named A between types T and T′, where the association is limited by conditions F on type T and by conditions G on type T′. The association has cardinality C, which is one of the following options: one-to-one, zero-or-one-to-one, one-to-many, zero-or-one-to-many, or many-to-many.
Rename Association | RA(A, A′) | Renames an association A to A′. Throws an error if there is already an association or type named A′.
Change Association Cardinality | CA(A, C) | Changes the cardinality of the association A to be C.
Drop Association | DA(A) | Drops the association named A.
Loop | Loop(t, Q, S) | For each object t returned by query Q, execute transaction S.
Error | Error(Q) | Execute query Q, and raise an error if the result is non-empty.

For this mapping, there is no single mapping scheme for the entire hierarchy. However, one can make some observations about the three different locations from Figure 14, specifically regarding the types that are "nearby":

– Location 1 has sibling Partner and parent Company, mapped TPC.
– Location 2 has siblings Company and Person and parent Thing, mapped TPT.
– Location 3 has parent Student, in a sub-hierarchy of types mapped TPH.

Fig. 15. Example client (a) and store (b) models, with a mapping between them

Using this informal reasoning, one can argue that types added at locations 1, 2, and 3 should be mapped using TPC, TPT, and TPH respectively. Formally speaking, we have two challenges to overcome. First, we need a definition of what it means to be "nearby" in a hierarchy. We address this in Section 7. Second, while TPC, TPT, and TPH are well-known and well-understood concepts to developers, they may not be expressed directly in the mapping language, so some analysis is needed to recognize them in a mapping. Moreover, finer-grained mapping notions like column mapping are not present in any available ORM.

Different object-relational mapping tools have different methods of expressing mappings. We assume in our work and our running example that the mappings are specified using EF, whose mappings need to be analyzed to identify mapping schemes like TPT, TPC, or TPH. In EF, a mapping is a collection of mapping fragments, each of which is an equation between select-project queries. Each fragment takes the form π_P σ_θ(E) = π_C σ_θ′(T), where P is a set of properties of client-side entity E, C is a set of columns of table T, and θ and θ′ are conditions over E and T respectively. Conditions θ and θ′ may be of the form c = v for column or property c and value v, c IS NULL, c IS NOT NULL, type tests IS T or IS ONLY T for type T, or conjunctions of such conditions (see the note below).

Our mapping evolution work uses a representation of an O-R mapping called a mapping relation, a relation M with the following eight attributes:

– CE, CP, CX: Client entity type, property, conditions
– ST, SC, SX: Store table, column, conditions
– K: a flag indicating if the property is part of the key
– D: the domain of the property

(Note: the formal specification in [28] allows disjunction in conditions. This paper considers only conjunction because it simplifies exposition and because EF as implemented only allows conjunction. The techniques described in this paper are applicable when disjunction is allowed, with some adaptation.)


A mapping relation is a pivoted form of mapping, where each row represents a property-to-property mapping for a given set of conditions. As an example, consider the model pair and mapping shown in Figure 15. One can express the mapping in the figure using Entity Framework as follows:

– π_{ID, Name}(Thing) = π_{EID, EName}(TEntity)

– π_{ID, Contact} σ_{IS ONLY Company}(Thing) = π_{BID, CName}(TCorp)

– π_{ID, Contact, CEO} σ_{IS Partner}(Thing) = π_{RID, Contact, CEO}(TPartner)

– π_{ID, DOB} σ_{IS ONLY Person}(Thing) = π_{PID, BDay} σ_{Type = "Person"}(TPerson)

– π_{ID, DOB, Stipend, Major, Status} σ_{IS ONLY Student}(Thing) = π_{PID, BDay, Integer1, String1, Integer2} σ_{Type = "Student"}(TPerson)

– π_{ID, DOB, Office, Title, Salary} σ_{IS ONLY Staff}(Thing) = π_{PID, BDay, String1, String2, Integer1} σ_{Type = "Staff"}(TPerson)

One can translate an EF mapping fragment π_P σ_F(E) = π_C σ_G(T) into rows in the mapping relation as follows: for each property p ∈ P, create the row (E′, p, F′, T, c, G, k, d), where:

– E′ is the entity type that participates in the IS or IS ONLY condition of F, or E if no such conditions exist
– F′ is the set of conditions F with any IS or IS ONLY condition removed
– c is the column that matches p in the order of projected columns
– k is the boolean indicating whether the property is a key property
– d is a string value indicating the domain (i.e., data type) of the property
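A minimal Python sketch of this per-fragment translation follows. The Fragment record, its condition encoding, and the is_key/domain_of callbacks are assumptions of the sketch rather than the actual EF mapping format.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Fragment:
    """Assumed encoding of one EF mapping fragment (not the real EF format)."""
    entity: str                          # E on the fragment's client side
    client_props: List[str]              # projected client properties, in order
    client_conds: List[Tuple[str, str]]  # e.g. ("IS ONLY", "Company") or ("=", "...")
    table: str                           # T on the store side
    store_cols: List[str]                # projected store columns, in order
    store_conds: List[Tuple[str, str]]

def fragment_to_rows(frag, is_key, domain_of):
    """Produce one mapping-relation row per projected property."""
    type_tests = [c for c in frag.client_conds if c[0] in ("IS", "IS ONLY")]
    entity = type_tests[0][1] if type_tests else frag.entity
    other_conds = [c for c in frag.client_conds if c[0] not in ("IS", "IS ONLY")]
    return [{"CE": entity, "CP": prop, "CX": other_conds,
             "ST": frag.table, "SC": col, "SX": frag.store_conds,
             "K": is_key(entity, prop), "D": domain_of(entity, prop)}
            for prop, col in zip(frag.client_props, frag.store_cols)]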

To translate an entire EF mapping to a mapping relation instance, one performs the above translation on each constituent mapping fragment. Table 3 shows the mapping relation for the models and mapping in Figure 15.

Table 3. The mapping relation for the models and mapping in Figure 15 (column CX not shown, since the mapping has no client conditions)

CE | CP | ST | SC | SX | K | D
Thing | ID | TEntity | EID | — | Yes | Guid
Thing | Name | TEntity | EName | — | No | Text
Company | ID | TCorp | BID | — | Yes | Guid
Company | Contact | TCorp | CName | — | No | Text
Partner | ID | TPartner | RID | — | Yes | Guid
Partner | Contact | TPartner | Contact | — | No | Text
Partner | CEO | TPartner | CEO | — | No | Text
Person | ID | TPerson | PID | Type=Person | Yes | Guid
Person | DOB | TPerson | BDay | Type=Person | No | Date
Student | ID | TPerson | PID | Type=Student | Yes | Guid
Student | DOB | TPerson | BDay | Type=Student | No | Date
Student | Stipend | TPerson | Integer1 | Type=Student | No | Integer
Student | Major | TPerson | String1 | Type=Student | No | Text
Student | Status | TPerson | Integer2 | Type=Student | No | Integer
Staff | ID | TPerson | PID | Type=Staff | Yes | Guid
Staff | DOB | TPerson | BDay | Type=Staff | No | Date
Staff | Office | TPerson | String1 | Type=Staff | No | Text
Staff | Title | TPerson | String2 | Type=Staff | No | Text
Staff | Salary | TPerson | Integer1 | Type=Staff | No | Integer


The rows in the mapping relation do not need to maintain IS or IS ONLY conditions because they are intrinsic in the mapping relation representation. The IS condition is satisfied by any instance of the specified type, while the IS ONLY condition is only satisfied by an instance of the type that is not also an instance of any derived type. In the mapping relation, the IS condition is represented by rows in the relation where non-key entity properties have exactly one represented row (e.g., Thing.Name in Table 3). The IS ONLY condition is represented by properties that are mapped both by the declared type and by derived types (e.g., Company.Contact and Partner.Contact in Table 3).

7 Similarity and Local Scope

The approach is to identify patterns that exist in the mapping in the local scope of the schema objects being added or changed. Before defining what local scope means, one must first define what it means for two types in a hierarchy to be similar. The desired notion of similarity formalizes the following notions:

– An entity type is most like its siblings.

– Two entity types X and Y, neither a descendant of the other, are more similar to each other than to their least common ancestor.

– If entity type X is a descendant of entity type Y, then X is more similar to any of Y's descendants than to Y, but more similar to Y than to any of Y's ancestors, siblings, or siblings' descendants.

One can formalize these notions by assigning to each type in a hierarchy a pair of integers (m, n) relative to a given entity type E0 that belongs to the hierarchy (or is just added to it) according to the following algorithm:

1. Assign the pair (0, 0) to type E0 and all of its siblings.

2. For each type E with assigned pair (m, n), if E's parent is unassigned, assign to it the pair (m + 2, n). Repeat until no new pair assignments can be made.

3. For each type E with assigned pair (m, n), assign the pair (m + 1, n) to any of E's siblings that have no assigned pair. Apply this rule once for each type that has assigned pairs from step 2.

4. For each type E, if E has no pair and E's parent has the pair (m, n), assign to E the pair (m, n + 1). Repeat until no new pair assignments can be made.

Once the above steps have been completed, every type in the hierarchy will be assigned a pair, like those shown in Figure 16. The priority score P(E, E0) for an entity type E in a hierarchy relative to E0 is computed from its pair (m, n) as P(E, E0) = 1 + m − 2^(−n). The priority score imposes a partial ordering on entity types, as indicated by the second (lower) number on each node in Figure 16.
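The following Python sketch implements steps 1 through 4 and the priority score for a hierarchy given as a child-to-parent dictionary. The encoding and the function names are illustrative assumptions of this sketch, not artifacts of the paper.

def assign_pairs(parent, e0):
    """Assign a similarity pair (m, n) to every type, relative to e0.
    `parent` maps each type name to its parent (None for the root)."""
    children = {}
    for t, p in parent.items():
        children.setdefault(p, []).append(t)

    def siblings(t):
        return [s for s in children.get(parent[t], []) if s != t]

    pairs = {e0: (0, 0)}
    for s in siblings(e0):                       # step 1: E0 and its siblings
        pairs[s] = (0, 0)

    ancestors, t = [], e0                        # step 2: walk up from E0
    while parent[t] is not None:
        m, n = pairs[t]
        t = parent[t]
        pairs[t] = (m + 2, n)
        ancestors.append(t)

    for a in ancestors:                          # step 3: ancestors' siblings
        m, n = pairs[a]
        for s in siblings(a):
            if s not in pairs:
                pairs[s] = (m + 1, n)

    changed = True                               # step 4: fill in all descendants
    while changed:
        changed = False
        for t in parent:
            if t not in pairs and parent[t] in pairs:
                m, n = pairs[parent[t]]
                pairs[t] = (m, n + 1)
                changed = True
    return pairs

def priority(pair):
    m, n = pair
    return 1 + m - 2 ** (-n)                     # P(E, E0) = 1 + m - 2^(-n)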

Using the priority score, one can formalize the local scope Φ(E0) of an entity type E0 as follows. Let H = [E1, E2, . . .] be the ordered list of entity types Ei in E0's hierarchy such that σ_{CE=Ei}(M) ≠ ∅ (i.e., there exists mapping information; some types may be abstract and not have any mapping defined). List H is sorted on priority score, so P(Ei, E0) ≤ P(Ei+1, E0) for all indexes i. Then:


Fig. 16. An example of similarity pairs and the numeric order of hierarchy nodes relative to a given node (in bold outline)

– If |H| ≤ 2, then Φ(E0) = H.

– If |H| > 2, then construct Φ(E0) by taking the first two elements in H, plus any elements with the same priority score as either of those elements.

This construction of local scope satisfies the informal notions introduced earlier. For instance, if an entity type E has priority score x relative to E0, then every sibling Es of E also has priority score x unless Es is an ancestor of E0. Thus, if E ∈ Φ(E0), then any sibling E′ of E that has associated mappings is also in Φ(E0).
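Continuing the previous sketch, a hedged implementation of the local scope Φ(E0) might look as follows, where mapped_types stands in for the set of entity types that have at least one row in the mapping relation (an assumed input).

def local_scope(parent, e0, mapped_types):
    """Phi(E0): the mapped types closest to e0 by priority score."""
    pairs = assign_pairs(parent, e0)
    h = sorted((t for t in mapped_types if t in pairs),
               key=lambda t: priority(pairs[t]))
    if len(h) <= 2:
        return set(h)
    keep = {priority(pairs[h[0]]), priority(pairs[h[1]])}
    return {t for t in h if priority(pairs[t]) in keep}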

7.1 Mapping Patterns

Using the mapping relation and the local scope function Φ, we can use the mapping itself as data to mine for the various mapping schemes. A mapping pattern is a query Q+ that probes for the existence of the requested mapping scheme and returns either true or false. The first set of patterns searches for one of the three prominent hierarchy mapping schemes mentioned in Section 6.3, given a local scope Φ(E) for an entity type E:

Table-per-Hierarchy (TPH): Map an entity type E and its children to a single table T. Given local scope Φ(E), the TPH pattern tests whether the entities in E's scope map to exactly one store table. We test that all rows in the mapping relation matching the local scope have the same value for the mapped table ST:

Q+_TPH ≡ (|π_ST σ_{CE∈Φ(E)}(M)| = 1).

Table-per-Type (TPT): Given an entity type E and a child type E′, map them to tables T and T′ respectively, with properties of E mapped to T and properties of E′ not present in E mapped to T′. Given local scope Φ(E), we define the TPT pattern as a query that tests for two properties. First, any pair of entity types in scope will have non-overlapping sets of mapped tables. Second, if A is the least common ancestor of all entity types in Φ(E), then for each entity type in scope that is not the common ancestor, the non-key properties of A are not re-mapped (i.e., there are no matching rows in the mapping relation):

Q+_TPT ≡ (∀ E′, E′′ ∈ Φ(E): π_ST σ_{CE=E′}(M) ∩ π_ST σ_{CE=E′′}(M) = ∅)
       ∧ (∀ E′ ∈ Φ(E), ∀ P ∈ NKP(A): (E′ ≠ A) → |σ_{CP=P} σ_{CE=E′}(M)| = 0).

where NKP(E) is the set of declared non-key properties for entity type E (i.e., not including properties derived from ancestors).

Table-per-Concrete Class (TPC): Given an entity type E and a child type E′, map them to tables T and T′ respectively, with properties of E mapped to T and properties of E′ (including properties inherited from E) mapped to T′. We define the TPC pattern as the same tests as TPT, except that all entity types in scope must map the non-key properties of the common ancestor A:

Q+_TPC ≡ (∀ E′, E′′ ∈ Φ(E): π_ST σ_{CE=E′}(M) ∩ π_ST σ_{CE=E′′}(M) = ∅)
       ∧ (∀ E′ ∈ Φ(E), ∀ P ∈ NKP(A): |σ_{CP=P} σ_{CE=E′}(M)| > 0).
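A sketch of the three hierarchy-pattern probes over a mapping relation held as a list of dictionaries (keys CE, CP, ST, SC, SX, K, D) appears below. The nkp callback and the restriction of the disjointness test to distinct pairs of types are assumptions made for the sketch.

def tables_of(m, entity):
    return {r["ST"] for r in m if r["CE"] == entity}

def is_tph(m, scope):
    # all rows for entities in scope map to one and the same store table
    return len({r["ST"] for r in m if r["CE"] in scope}) == 1

def tables_disjoint(m, scope):
    scope = list(scope)
    return all(tables_of(m, a).isdisjoint(tables_of(m, b))
               for i, a in enumerate(scope) for b in scope[i + 1:])

def is_tpt(m, scope, ancestor, nkp):
    # disjoint tables, and no type other than the ancestor re-maps the
    # ancestor's non-key properties
    return tables_disjoint(m, scope) and all(
        not any(r["CE"] == e and r["CP"] == p for r in m)
        for e in scope if e != ancestor for p in nkp(ancestor))

def is_tpc(m, scope, ancestor, nkp):
    # disjoint tables, and every type in scope maps every non-key property
    # of the ancestor
    return tables_disjoint(m, scope) and all(
        any(r["CE"] == e and r["CP"] == p for r in m)
        for e in scope for p in nkp(ancestor))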

If we find an instance of the TPH scheme, we can use column mapping patterns to distinguish further how the existing mapping reuses store columns. Column mapping patterns do not use local scope, but rather look at the entire mapping table for all entities that map to a given table; we expand the set of considered entities to all entities because the smaller scope is not likely to yield enough data to exhibit a pattern:

Remap by column name (RBC): If types E and E′ are cousin types in a hierarchy (cousin types belong to the same hierarchy, but neither is a descendant of the other), and both E and E′ have a property named P with the same domain, then E.P and E′.P are mapped to the same store column. This scheme maps all properties with like names to the same column, and is the scheme that Ruby on Rails uses by convention [34]. Given hierarchy table T, the RBC pattern is:

Q+_RBC ≡ (∃ C ∈ π_SC σ_{ST=T} σ_{¬K}(M): |σ_{CP∈NKP(CE)} σ_{ST=T ∧ SC=C}(M)| > 1)
       ∧ (∀ C ∈ π_SC σ_{ST=T} σ_{¬K}(M): |π_CP σ_{CP∈NKP(CE)} σ_{ST=T ∧ SC=C}(M)| = 1).

That is, check that some store column C is mapped to more than one client property, and that all client properties CP that map to a given store column C have the same name.

Remap by domain (RBD): If types E and E′ are cousin types in a hierarchy, let P be the set of all properties of E with domain D (including derived properties), and P′ be the set of all properties of E′ with the same domain D. If C is the set of all columns to which any property in P or P′ maps, then |C| = max(|P|, |P′|). In other words, the mapping maximally re-uses columns to reduce table size and increase table value density, even if properties with different names map to the same column. Said another way, if one were to add a new property P0 to an entity type mapped using the TPH scheme, map it to any column C0 such that C0 has the same domain as P0 and is not currently mapped by any property in any descendant type, if any such column exists. Given hierarchy table T, the RBD pattern is:

Q+_RBD ≡ (∃ C ∈ π_SC σ_{ST=T} σ_{¬K}(M): |σ_{CP∈NKP(CE)} σ_{ST=T ∧ SC=C}(M)| > 1)
       ∧ (∀ X ∈ π_D σ_{ST=T ∧ ¬K}(M), ∃ E ∈ π_CE σ_{ST=T}(M): |π_CP σ_{CE=E ∧ ST=T ∧ D=X ∧ ¬K}(M)| = |π_SC σ_{ST=T ∧ D=X ∧ ¬K}(M)|).

There is at least one store column C that is remapped, and for each domain D, there is some client entity E that uses all available columns of that domain.

Fully disjoint mapping (FDM): If types E and E′ are cousin types in a hierarchy, the non-key properties of E map to a set of columns disjoint from the non-key properties of E′. This pattern minimizes ambiguity of column data provenance: given a column c, all of its non-null data values belong to instances of a single entity type. Given hierarchy table T, the FDM pattern is:

Q+_FDM ≡ (∀ C ∈ π_SC σ_{ST=T} σ_{¬K}(M): |σ_{CP∈NKP(CE)} σ_{ST=T ∧ SC=C}(M)| = 1).

Each store column C is uniquely associated with a declared entity property CP.
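The column-mapping probes can be sketched in the same style as the hierarchy probes. For brevity, the sketch below ignores the declared-versus-inherited (NKP) restriction and assumes the key flag K is a boolean; both simplifications are assumptions of this rewrite.

def nonkey_rows(m, table):
    return [r for r in m if r["ST"] == table and not r["K"]]

def is_fdm(m, table):
    # each non-key store column is mapped by exactly one row
    rows = nonkey_rows(m, table)
    return all(sum(1 for x in rows if x["SC"] == r["SC"]) == 1 for r in rows)

def is_rbc(m, table):
    # at least one column is shared by several properties, and every column is
    # mapped only by properties that all carry the same name
    rows = nonkey_rows(m, table)
    cols = {r["SC"] for r in rows}
    shared = any(sum(1 for r in rows if r["SC"] == c) > 1 for c in cols)
    uniform = all(len({r["CP"] for r in rows if r["SC"] == c}) == 1 for c in cols)
    return shared and uniform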

In addition to hierarchy and column mapping schemes, other transformations may exist between client types and store tables. For instance:

Horizontal partitioning (HP): Given an entity type E with a non-key property P, one can partition instances of E across tables based on values of P.

Store-side constants (SSC): One can assign a column to hold a particular constant. For instance, one can assign to column C a value v that indicates which rows were created through the ORM tool. Thus, queries that filter on C = v eliminate any rows that come from an alternative source.

Strictly speaking, we do not need patterns for these final two schemes; the algorithm for generating new mapping relation rows (Section 7.2) carries such schemes forward automatically. Other similar schemes include vertical partitioning and merging, determining whether a TPH hierarchy uses a discriminator column (as opposed to patterns of NULL and NOT NULL conditions), and association inlining (i.e., whether one-to-one and one-to-many relationships are represented as foreign key columns on the tables themselves or in separate tables).

Note that each group of patterns is not complete on its own. The local scope of an entity may be too small to find a consistent pattern or may not yield a consistent pattern (e.g., one sibling is mapped TPH, while another is mapped TPC). In our experience, the developer is most likely to encounter this situation during bootstrapping, when the client model is first being built. Most mappings we see are totally homogeneous, with entire models following the same scheme. Nearly all the rest are consistent in their local scope (specifically, all siblings are mapped identically). However, for completeness in our implementation, we have chosen the following heuristics for the rare case when consistency is not present: If we do not see a consistent hierarchy mapping scheme (e.g., TPT), we rely on a global default given by the user (similar to [12]). If we do not see a consistent column mapping scheme, we default to the disjoint pattern. If we do not see consistent condition patterns like store constants or horizontal partitioning, we ignore any store and client conditions that are not relevant to TPH mapping.


7.2 Evolving a Mapping

Once it is known that a pattern is present in the mapping, one can then effect an incremental change to the mapping and the store based on the nature of the change. The changes in Table 2 fall into four categories, based on the nature of the change and its effect on the mapping relation.

Constructive Changes. For these changes, new rows may be added to the mapping relation, but existing rows are left alone. For example, consider the cases of adding a new derived type to a hierarchy, or adding a new property to an existing type. Setting an abstract entity type to be concrete is also a change of this kind.

Adding a New Type to the Hierarchy: When adding a new type to a hierarchy, one must answer three questions: what new tables must be created, what existing tables will be re-used, and which derived properties must be re-mapped. For clarity, we assume that declared properties of the new type will be added as separate "add property" actions. When a new entity type E is added, we run algorithm AddNewEntity:

1. AddNewEntity(E):
2.   k ← a key column for the hierarchy
3.   𝒢 ← γ_CX σ_{CP=k ∧ CE∈Φ(E)}(M), where γ_CX groups rows of the mapping relation according to their client conditions
4.   If ∃ i: |π_CE(G_i)| ≠ |Φ(E)|, then 𝒢 ← {σ_{CP=k ∧ CE∈Φ(E)}(M)} (if there is no consistent horizontal partition across entity types, then just create one large partition, ignoring client-side conditions)
5.   For each G ∈ 𝒢:
6.     If Q+_TPT(G): (the TPT pattern is found when run just on the rows in G)
7.       For each property P ∈ Keys(E) ∪ NKP(E):
8.         Add NewMappingRow(GenerateTemplate(G, P), E)
9.     If Q+_TPH(G) or Q+_TPC(G):
10.      A ← the common ancestor of Φ(E)
11.      For each property P ∈ Keys(E) ∪ ⋂_{e∈𝔼} NKP(e), where 𝔼 is the set of all entities between E and A in the hierarchy, inclusive:
12.        Add NewMappingRow(GenerateTemplate(G, P), E)

Function GenerateTemplate(R, P) is defined as follows: we create a mapping template T as a derivation from a set of existing rows R, limited to those where CP = P. For each column C ∈ {CE, CP, ST, SC}, set T.C to be X if ∀ r ∈ R: r.C = X. Thus, for instance, if there is a consistent pattern mapping all properties called ID to columns called PID, that pattern is continued. Otherwise, set T.C = ⊗, where ⊗ is a symbol indicating a value to be filled in later.

For condition column CX (and SX), template generation follows a slightly different path. For any condition C = v, C IS NULL, or C IS NOT NULL that appears in every CX (or SX) field in R (treating a conjunction of conditions as a list that can be searched), where the value v is the same for each, add the condition to the template. If each row r ∈ R contains an equality condition C = v, but the value v is distinct for each row r, add the condition C = ⊗ to the template. Ignore all other conditions.
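A sketch of GenerateTemplate over rows encoded as dictionaries follows. The condition encoding (a dictionary from column name to an equality value or a NULL/NOT NULL marker) and the carrying forward of K and D are assumptions of the sketch, not the paper's representation.

WILDCARD = "<fill-in>"        # stands for the ⊗ placeholder

def generate_template(rows, prop):
    rows = [r for r in rows if r["CP"] == prop]
    template = {}
    for col in ("CE", "CP", "ST", "SC", "K", "D"):
        vals = {r[col] for r in rows}
        template[col] = vals.pop() if len(vals) == 1 else WILDCARD
    for cond_field in ("CX", "SX"):
        out = {}
        shared = set.intersection(*(set(r[cond_field]) for r in rows)) if rows else set()
        for c in shared:
            vals = {r[cond_field][c] for r in rows}
            # identical condition in every row: keep it as-is; the same column
            # with differing equality values per row: keep the column with a
            # placeholder value (the paper restricts this to equality conditions)
            out[c] = vals.pop() if len(vals) == 1 else WILDCARD
        template[cond_field] = out
    return template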


Table 4 shows an example of generating a mapping template for a set of rows corresponding to a TPH relationship; the rows for the example are drawn from Table 3, with additional client and store conditions added to illustrate the effect of the algorithm acting on a single horizontal partition and a store constant. Note that the partition conditions and store conditions translate to the template; note also that the name of the store column remains consistent even though it is not named the same as the client property.

Table 4. Creating the mapping template for a type added using a TPH scheme, over a single horizontal partition where "Editor=Tom" and with a store-side constant "Source=A"; the final row shows the template filled in for a new type Alumnus

CE | CP | CX | ST | SC | SX | K | D
Person | ID | Editor=Tom | TPerson | PID | Type=Person AND Source=A | Yes | Guid
Student | ID | Editor=Tom | TPerson | PID | Type=Student AND Source=A | Yes | Guid
Staff | ID | Editor=Tom | TPerson | PID | Type=Staff AND Source=A | Yes | Guid
⊗ | ID | Editor=Tom | TPerson | PID | Type=⊗ AND Source=A | Yes | Guid
Alumnus | ID | Editor=Tom | TPerson | PID | Type=Alumnus AND Source=A | Yes | Guid

The function NewMappingRow(F, E) takes a template F and fills it in with details from E. Any ⊗ values in CE, CX, ST, and SX are filled with value E. Translating these new mapping table rows back to an EF mapping fragment is straightforward. For each horizontal partition, take all new rows collectively and run the algorithm from Section 6.3 backwards to form a single fragment.

Adding a New Property to a Type: When adding a new property to a type, one has a different pair of questions to answer: which descendant types must also remap the property, and to which tables must a property be added. The algorithm for adding property P to type E is similar to adding a new type:

– For each horizontal partition, determine the mapping scheme for Φ(E).

– If the local scope has a TPT or TPC scheme, add a new store column and a new row that maps to it. Also, for any child types whose local scope is mapped TPC, add a column and map to it as well.

– If the local scope has a TPH scheme, detect the column remap scheme. If remapping by name, see if there are other properties with the same name, and if so, map to the same column. If remapping by domain, see if there is an available column with the same domain and map to it. Otherwise, create a new column and map to it. Add a mapping row for all descendant types that are also mapped TPH.

Translating these new mapping rows backward to the existing EF mapping fragments is straightforward. Each new mapping row may be translated into a new item added to the projection list of a mapping fragment. For a new mapping row N, find the mapping fragment that maps σ_{N.CX}(N.CE) = σ_{N.SX}(N.ST) and add N.CP and N.SC to the client and store projection lists respectively.
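Focusing on the TPH case, the following sketch (reusing the probes defined in the earlier sketches; all helper names and the list-of-dict mapping relation are assumptions) shows how a new property might choose its store column under the remap-by-name and remap-by-domain schemes.

def choose_tph_column(m, table, prop, domain, table_columns):
    """Pick a store column for new property `prop` on TPH table `table`.
    `table_columns` maps existing column names to their domains.  Returns None
    when a fresh column should be created and mapped instead."""
    if is_rbc(m, table):
        # remap by name: reuse the column already mapped by a like-named property
        for r in m:
            if r["ST"] == table and r["CP"] == prop and r["D"] == domain:
                return r["SC"]
    elif not is_fdm(m, table):
        # remap by domain: reuse a column of the right domain that no property
        # currently maps to (simplified from the descendant-only rule above)
        used = {r["SC"] for r in m if r["ST"] == table}
        for col, dom in table_columns.items():
            if dom == domain and col not in used:
                return col
    return None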

Manipulative Changes. One can change individual attributes, or "facets", of artifacts. Examples include changing the maximum length of a string property or the nullability of a property. For such changes, the mapping relation remains invariant, but is used to guide changes to the store.


Consider a scenario where the user wants to increase the maximum length of the property Student.Major to 50 characters from 20. One can use the mapping relation to effect this change as follows. First, if E.P is the property being changed, issue the query π_{ST,SC} σ_{CE=E ∧ CP=P}(M), finding all columns that property E.P maps to (there may be more than one if there is horizontal partitioning). Then, for each result row t, issue the query Q = π_{CE,CP} σ_{ST=t.ST ∧ SC=t.SC}(M), finding all properties that map to the same column. Finally, for each query result, set the maximum length of the column t.SC in table t.ST to be the maximum length of all properties in the result of query Q.

For the Student.Major example, the property only maps to a single column, called TPerson.String1. All properties that map to TPerson.String1 are shown in Table 5. If Student.Major changes to length 50, and Staff.Office has maximum length 40, then TPerson.String1 must change to length 50 to accommodate. However, if TPerson.String1 has a length of 100, then it is already large enough to accommodate the wider Major property.
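The facet-change propagation just described might be sketched as follows; prop_length and alter_column are assumed callbacks into the client model and the store, not functions defined by the paper.

def widen_string_property(m, entity, prop, new_len, prop_length, alter_column):
    """Propagate a larger maximum length from a client property to the store.
    prop_length(entity, prop) gives a property's current maximum length;
    alter_column(table, column, length) widens a store column."""
    new_lengths = {(entity, prop): new_len}
    targets = {(r["ST"], r["SC"]) for r in m
               if r["CE"] == entity and r["CP"] == prop}
    for table, col in targets:
        sharers = {(r["CE"], r["CP"]) for r in m
                   if r["ST"] == table and r["SC"] == col}
        required = max(new_lengths.get(ep, prop_length(*ep)) for ep in sharers)
        # the store may skip the ALTER when the column is already at least
        # `required` wide (e.g., a String1 already sized at 100)
        alter_column(table, col, required)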

Destructive Changes. For changes of this kind, rows may be removed from the mapping relation, but no rows are changed or added. Setting a concrete entity type to be abstract also qualifies in this category.

Consider as an example dropping a property from an existing type. Dropping a property follows the same algorithm as changing that property's domain from the previous section, except that the results of the query Q are used differently. If query Q returns more than one row, that means multiple properties map to the same column, and dropping one property will not require the column to be dropped. However, if r is the row corresponding to the dropped property, then we issue a statement that sets r.SC to NULL in table r.ST for all rows that satisfy r.SX. So, dropping Student.Major will execute UPDATE TPerson SET String1 = NULL WHERE Type='Student'. If query Q returns only the row for the dropped property, then we delete the column (see the note below). In both cases, the row r is removed from M. We refer to the process of removing the row r and either setting values to NULL or dropping a column as DropMappingRow(r).

Table 5. A listing of all properties that share the same mapping as Student.Major

CE | CP | ST | SC | SX | K | D
Student | Major | TPerson | String1 | Type=Student | No | Text
Staff | Office | TPerson | String1 | Type=Staff | No | Text

Refactoring Changes. Renaming constructs, moving a property, and changing an association's cardinality fit into this category. Changes of this kind may result in arbitrary mapping relation changes, but such changes are often similar to (and thus re-use logic from) changes of the other three kinds. For example, consider the case of moving a property.

(Note: whether to actually delete the data, drop the column from storage, or just remove it from the storage model available to the ORM is a policy matter. One possible implementation would issue Drop Column statements.)


Moving a property from a type to a child type: If entity type E has a property P and a child type E′, it is possible using a visual designer to specify that the property P should move to E′. In this case, all instances of E′ should keep their values for property P, while any instance of E that is not an instance of E′ should drop its P property. This action can be modeled using analysis of the mapping relation M as well. Assuming for brevity that there are no client-side conditions, the property movement algorithm is as follows:

1. MoveClientProperty(E, P, E′):
2.   r0 ← σ_{CE=E ∧ CP=P}(M) (without client conditions, this is a single row)
3.   If |σ_{CE=E′ ∧ CP=P}(M)| = 0: (E′ is mapped TPT relative to E)
4.     AddProperty(E′, P) (act as if we are adding property P to E′)
5.   For each r ∈ σ_{(CE=E′ ∨ CE∈Descendants(E′)) ∧ CP=P}(M):
6.     UPDATE r.ST SET r.SC = (r.ST ⋈ r0.ST).(r.SC) WHERE r.SX
7.   E⁻ ← all descendants of E, including E but excluding E′ and its descendants
8.   For each r ∈ σ_{CE∈E⁻ ∧ CP=P}(M):
9.     DropMappingRow(r) (drop the mapping row and effect changes to the physical database per the Drop Property logic in the previous case)

8 Main Example, Revisited

With the machinery of channels in hand, one can now return to the example introduced in Section 2 and demonstrate how to construct a mapping that satisfies all of the requirements laid out in Section 1.1. Starting with the object-oriented application schema in Figure 1:

1. Apply an ORM CT that maps the Person hierarchy using the TPC mapping pattern, and the Procedure hierarchy using the TPH mapping pattern with the Reuse-by-Name paradigm.
2. Vertically partition the Procedure table to save off all columns with a text domain (except the few core attributes) into the table TextValues.
3. Unpivot the table TextValues.
4. Audit the table TextValues, then adorn it with a column with the current user.
5. Vertically partition the Procedure table to save off all columns with a numeric domain (except the few core attributes) into the table NumericValues.
6. Unpivot the table NumericValues.
7. Audit the table NumericValues, then adorn it with a column with the current user.

Given the steps above, it is a straightforward task to translate each step into CT's to form a channel. With a channel so defined, the application using said channel has the same business logic and data mapping as before, and the same query and data update capabilities. In addition, the application can now evolve its schema either at design time or at run time, and can perform arbitrary query or set-based updates, capabilities that it did not have before without manual intervention.

Note that the solution above starts with an ORM CT, followed by a sequence of relational-only CT's. An alternative approach may instead consider CT's that operate on and also produce object-oriented schemas; while not discussed in this paper, one can certainly define CT's that operate over the statements in Table 2 rather than Table 1. There is no unique solution to developing a suitable mapping out of CT's, and whether one can define an optimization framework over CT's is an open and interesting question. At the very least, it is possible to define algebraic re-writing rules over some CT's as well as cost estimates over the impact of CT's on instances and queries [36].

9 Further Reading and Future Directions

This paper has centered on the concept of database virtualization, where an application schema may be treated as if it were the physical storage for that application. Virtualization isolates the application from several complicating levels of data independence, including changes in data model, significant data restructuring, and business logic. Enabling an application with database virtualization provides the application with the bidirectionality it requires to operate without risk of data loss while allowing schema evolution as the application evolves. The paper introduces the notion of a channel, a mapping framework composed of atomic transformations, each of which has provable bidirectional properties that are amenable to the requirements of the application.

Though a wealth of research has been done on schema evolution [33], very little has been done on the co-evolution of schemas connected by a mapping. Channels offer such a solution.

The work on object-relational mapping has been implemented and demonstrated [38], but work is ongoing. For instance, a prominent feature of the Entity Framework (and possibly other mapping frameworks as well) is compilation of a high-level formal specification into other artifacts. Mapping compilation provides several benefits, including precise mapping semantics and a method to validate that a mapping can round-trip client states. The computational cost for compiling and validating a mapping can become large for large models, and is worst-case exponential in computational complexity [28]. An active area of research is to translate incremental changes to a model into incremental changes to the relational algebra trees of the compiled query and update views, with results that are still valid and consistent with the corresponding mapping and store changes.

Incremental or Transformation-Based Mappings. Channels are by no means the only language that has been devised to construct a mapping between two schemas from atomic components. One such framework, DB-MAIN, has already been referred to in Section 6 as a language for mitigating the effect of translating between instances of different metamodels a step at a time [18]. What follows are alternative incrementally specified mapping languages, each introduced for a different scenario.

Both Relational Lenses [8] and PRISM [9] attempt to create an updatable schema mapping out of components that are known to be updatable. Instead of translating update statements, a lens translates database state, resolving the new state of the view instance with the old state of the logical instance. Some recent research has been performed investigating varieties of lenses that operate on descriptions of edits instead of full states [11,21]. PRISM maps one version of an application's schema to another using discrete steps, allowing DML statements issued by version X of an application to be rewritten to operate against version Y of its database. While more complex transformations such as pivot have not been explored in either language, it may be possible to construct such operators in those tools; like channels, the key contribution of those tools is not the specific set of operators, but rather the abstractions they use and the capabilities they offer. The key difference between channels and these approaches is that neither Lenses nor PRISM can propagate schema modifications or constraint definitions through a mapping.

Both-as-View (BAV) [25] describes the mapping between global and local schemas in a federated database system as a sequence of discrete transforms that add, modify, or drop tables according to transformation rules. Because relationships in these approaches are expressed using views, processing of updates is handled in a similar fashion as in the materialized view [17] and view-updatability literature [10]. The ability to update through views, materialized or otherwise, depends on the query language. Unions are considered difficult, and pivots are not considered. Schema evolution has also been considered in the context of BAV [26], though some evolutions require human involvement to propagate through a mapping.

An extract-transform-load workflow is a composition of atomic data transformations (called activities) that determine the flow of data through a system [42]. Papastefanatos et al. addressed schema evolution in a workflow by attaching policies to activities. Policies semi-automatically adjust each activity's parameters based on schema evolution primitives that propagate through activities [32].

This collection of transformation-based mapping techniques covers a wide selection of model management scenarios. In addition, there is significant overlap in the expressive power of these techniques. For instance, each of the above (including channels) is capable of expressing a horizontal merge operation, even if that specific transformation has not yet been defined in the literature for each tool (e.g., horizontal merge can be defined as a relational lens, even though the literature does not explicitly do so). An interesting and open question is whether one can construct a unifying framework to compare, contrast, and taxonomize these tools. Pointfree calculus and data refinement [30] offer one possible underlying formalism for such a framework.

Monolithic Mappings. An alternative approach to mapping schemas is a declarative specification, compiled into routines that describe how to transfer data from one schema to the other. Some tools compile mappings into a one-way transformation, as exemplified by data exchange tools (e.g., Clio [14]). In data exchange, data flow is uni-directional, so updatability is not generally a concern, though recent research has attempted to provide a solution for inverting mappings [3].

Schema evolution has been considered in a data exchange setting, modeled either as incremental client changes [43] or where evolution is itself represented as a mapping [47]. Both cases focus on "healing" the mapping between schemas, leaving the non-evolved schema invariant. New client constructs do not translate to new store constructs, but rather add quantifiers or Skolem functions to the mapping, which means new client constructs are not persisted. Complex restructuring operations, especially ones like pivot and unpivot that have a data-metadata transformation component, are especially rare in data exchange (Clio is the exception [19]) because of the difficulty in expressing such transformations declaratively.

NoSQL. No contemporary discussion of application development can go without at least mentioning the wide variety of tools commonly referred to as noSQL. noSQL is a vague term essentially meaning a modern database management system that has in some way broken away from the assumptions of relational database systems. A noSQL system may have a query language, but it is not SQL. It may have an underlying data model, but it may not be relational, and is almost certainly not in first normal form. Such systems have become commonplace in internet applications and other applications where access to large, scalable quantities of data needs to be very fast but need not have the same consistency requirements as relational systems provide.

There is no standard language or model among noSQL systems. One model that is shared among many self-identifying noSQL systems is the key-value store. Such a store operates much like system memory, where the key is an address and the value is the data at that address. Depending on the system, the data in the value may be highly non-normalized or atomic in nature, possibly containing references to other keys. Data in this format can be easily partitioned and accessed across a substantial number of nodes based on a partition of the key space. Recently, some effort has been made to establish links between relational and key-value stores, asserting that the two models are in fact mathematical duals of one another, and therefore not only could one query language be used to standardize access to noSQL systems, but the same language may be targeted at relational systems as well [27].

Notable Future Directions. The mapping relation is a novel method of expressing an O-R mapping, and as such, it may have desirable properties of its own that are as yet unstudied. For instance, it may be possible to express constraints on a mapping relation instance that can validate a mapping's round-tripping properties; such constraints would be useful given the high potential cost of validating an object-relational mapping.

The overall technique presented in this paper allows for client-driven evolution of application artifacts; the application schema changes, and the mapping and storage change to accommodate, if necessary. One additional dimension of changes to consider is the set of changes one can make to the mapping itself while leaving the client invariant. One possible way to handle the evolution of a channel involves translating the difference between the old channel and the new one into its own "upgrade" channel. An alternative possibility is to transform each inserted, deleted, or modified CT into DML and DDL. For instance, an inserted Pivot transformation would generate a Create Table statement (to generate the new version of the table), an insert statement (to populate the new version with the pivoted version of the old data), and a Drop Table statement (to drop the old version), each pushed through the remainder of the channel [36].

There remains a possibility that the mapping relation technique may have other applications outside of object-relational mappings. The mapping relation is a way to take a "monolithic" operation like an object-relational mapping and make it amenable to analysis for patterns, assuming that such patterns may be identified in the relationship between the source and target metamodels. An interesting and unanswered question is whether a similar technique can be applied to a data exchange setting. One would need to define patterns over the expressible mappings, and a mapping table representation for first-order predicate calculus, in which case similar techniques could be developed.

The set of CT’s presented in this paper is not intended to be a closed set. Whilethe requirements laid out in Section 1.1 are generally applicable to applications andtheir data sources, the exact set of CT’s needed will likely be vastly different from oneapplication to another. The set of CT’s presented here are inspired by an examinationof commercially-available software packages and have been implemented but also for-mally proven. Formal proofs are not likely to be acceptable or sufficient should onewant to enable the individual developer to implement their own CT’s and thus create anopen ecosystem of CT’s. An open area of research is what the implementation contractof a CT should be, and what algorithms may serve as a suitable “certification” processfor a candidate CT.


Requirements for Self-adaptation*

Nelly Bencomo

INRIA Paris - Rocquencourt
Domaine de Voluceau, B.P. 105
78153 Le Chesnay, France
[email protected]

Abstract. Self-adaptation is emerging as an increasingly important capability for many applications, particularly those deployed in dynamically changing environments, such as ecosystem monitoring and disaster management. One key challenge posed by Self-Adaptive Systems (SAS) is the need to handle changes to the requirements and corresponding behavior of a SAS in response to varying environmental conditions during runtime. In this paper, we discuss the role of uncertainty in such systems and the associated research challenges, and present results from our experience in tackling those challenges. We also review different modeling techniques for the development of self-adaptive systems, with specific emphasis on goal-based techniques.

Keywords: Requirements, reflection, run-time, self-adaptive system.

1 Introduction

Traditionally, an important aim for requirements engineering (RE) has been to understand the problem domain with the purpose of formulating the requirements model of the system that will be developed. Such a model describes the goals, domain assumptions and requirements. Tacitly, it is assumed that the environmental context is static enough, and can be understood sufficiently well, to permit the requirements model for a workable solution to be formulated with confidence. However, in practice, environmental contexts are seldom static over long periods, and they may not be easy to understand completely. Nonetheless, RE offers a range of techniques capable of mitigating or avoiding these problems, provided change happens slowly enough to allow developers to evaluate the implications and take appropriate action.

More and more, however, systems are being commissioned for problem contexts that are subject to change over short periods and in ways that are poorly understood. To some extent, this is possible because self-adaptation infrastructures have noticeably improved. For example, currently, middleware infrastructures allow software components providing different functionality or quality of service to be substituted at runtime. Complementing these new technologies

* This work has been supported in part by Marie-Curie Fellowship [email protected].


is a problem-driven motivation steered by a range of demanding real problems related to the disaster management and smart energy domains. The common factor in each of these problem domains is the potential environmental contexts, which cannot be considered static and are hard to understand. Self-adaptation capability will become an increasingly required system property. The above makes it crucial that software systems are conceived in such a way that they are aware of their own requirements during execution; this is what we call [email protected] [1].

The main idea behind [email protected] is that self-adaptive systems should be able to discover, reason about and manage requirements during runtime. One key contribution to the achievement of this is the work on requirements monitoring [2]. Requirements monitoring is necessary because deviations between the system's run-time behaviour and the requirements model may trigger the need for a system modification [3]. Such deviations need to be correlated with the current state of the environment so that the reasons can be diagnosed and the appropriate adaptations executed. We argue that if systems need to adapt dynamically in order to maintain satisfaction of their goals, requirements engineering ceases to be a purely static, off-line activity and becomes a runtime one too. This is because design-time decisions about the requirements need to be made on incomplete and uncertain knowledge about the application domain and the stakeholders' goals. In order to be able to support requirements-aware systems, requirements for self-adaptive systems need to be run-time entities that can be reasoned over at run-time.

A requirements-aware system should be able to introspect about its requirements in the same way that reflective systems permit introspection about their architectural configuration [4]. Implicit in the ability for a system to introspect on its requirements model is the representation of that model at run-time [5].

Inevitably, self-adaptive systems may require behavior to change, or new behavior to be added, at runtime. As explained above, these changes and new behaviors may be (partially) unknown at design time. This means that new software artifacts that were not envisioned at design time may have to be generated at runtime. New techniques for synthesis or generation of software using runtime models during execution are needed. Therefore, the research communities of Generative Techniques in Software Engineering and Self-adaptive Systems need to cooperate towards the achievement of such techniques. The latter is the main motivation of this paper.

Different research challenges need to be addressed for requirements to become useful run-time entities and for self-adaptive systems capable of operating resiliently in volatile and poorly-understood environments to become a reality. In this paper we discuss research initiatives to tackle these challenges.

The paper is structured as follows. In Section 2 we provide background and motivate the need for requirements-aware systems. Section 3 describes the basis of our view on goal-based requirements and how it is relevant to self-adaptive systems in the case of foreseen and foreseeable adaptations. Section 4 enumerates the research challenges and discusses, for each challenge, the research results achieved so far


and the relevant research directions. Section 5 describes related work. Section 6 concludes the paper.

2 Background and Motivation

There are two primary drivers for self-adaptive systems: improvements in technology, which are making self-adaptive systems easier to develop, and the emergence of problems for which self-adaptation offers the most viable solution. The two drivers are mutually reinforcing.

These new research advances and technologies provide programmers with more powerful programming frameworks and run-time infrastructures that support self-adaptation. Relevant research initiatives include adaptive architectures [6] such as Rainbow [7], OpenCom [8], the work by Peyman et al. [9] and GridStix [10]. Using OpenCom, for example, a programmer can define the architecture and compose it using a set of substitutable components from the component libraries. Adaptation policies (event-condition-action rules) define the circumstances under which components can be substituted, while the architecture is constrained to ensure that only valid component configurations are created. Using this model, adaptive applications such as GridStix can be constructed. GridStix [10] is a sensor grid that adapts dynamically as the river it monitors changes state. Hence, for example, the system can switch between components that implement Bluetooth, IEEE 802.11b or GPRS communications technologies according to the demands imposed by battery health, water depth and resilience. Context-aware systems have also been proposed to provide the ability for a software system to adapt itself at runtime to cope with changes in its environment and user needs; they too have contributed to improving current run-time infrastructures.
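To illustrate the shape of such event-condition-action policies, the following is a minimal sketch in Java. All types, names and threshold values are hypothetical; this is not the OpenCom or GridStix API, only an illustration of how a rule might tie an environment event and a condition to a component substitution.

    import java.util.function.Predicate;

    // Illustrative event-condition-action rule; all types are hypothetical,
    // not part of OpenCom or the GridStix middleware.
    final class EcaRule {
        final String event;                      // e.g. "ENVIRONMENT_SAMPLE"
        final Predicate<Environment> condition;  // when does the rule fire?
        final Runnable action;                   // the component substitution to perform

        EcaRule(String event, Predicate<Environment> condition, Runnable action) {
            this.event = event; this.condition = condition; this.action = action;
        }

        void onEvent(String occurred, Environment env) {
            if (event.equals(occurred) && condition.test(env)) action.run();
        }
    }

    // Simplified snapshot of the monitored environment (assumed fields).
    record Environment(double batteryLevel, double waterDepthMetres) {}

    class EcaDemo {
        public static void main(String[] args) {
            // Switch from IEEE 802.11b to Bluetooth when battery health is poor
            // and the river is shallow enough for short-range communication.
            EcaRule saveEnergy = new EcaRule(
                "ENVIRONMENT_SAMPLE",
                env -> env.batteryLevel() < 0.3 && env.waterDepthMetres() < 0.5,
                () -> System.out.println("substitute WiFi component with Bluetooth component"));

            saveEnergy.onEvent("ENVIRONMENT_SAMPLE", new Environment(0.2, 0.4)); // condition holds: rule fires
        }
    }

In the real middleware the action would be a reconfiguration performed through the component framework rather than a print statement.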

In parallel with the fact that technology to provide self-adaptation has improved, self-adaptation has emerged as a design strategy to mitigate maintenance costs in systems where factors such as mission-criticality or remoteness make off-line adaptation unfeasible. Such systems range from enterprise systems, where scale and complexity are the main drivers for self-adaptation, to embedded systems, where remoteness and inaccessibility drive the choice of design strategy. Furthermore, when uncertainty about the environmental context is unavoidable, self-adaptation may offer the only viable solution.

Self-adaptation assumes that a system is able to monitor the environment, detect changes and react accordingly. If the environment is well enough understood, the appropriate adaptation can be decided and specified, because the relationship between environment and system behaviour can be identified and specified at design time. Where the environment is poorly understood, however, that relationship cannot be known with certainty and so the decision of how to react is hard to make. To tackle the problem it is necessary to monitor the system to discover when its behaviour deviates from the requirements model. Monitoring requirements satisfaction is hard: the requirements themselves may be imprecise softgoals or non-functional requirements, and they may not be measurable directly.


Despite these difficulties, significant progress has been made on requirements monitoring [2,11,12]. However, on the closely related issue of how to take corrective actions to reconcile system behaviour with the requirements model when monitoring detects a deviation, research is still immature. Some progress has been made in the domain of web services, where infrastructure standards help service discovery and dynamic (re-)binding. But even a well-defined web service infrastructure where service specifications can be queried on-line does not help with reasoning over the requirements model for a composed service. For example, switching to a functionally equivalent web service as an alternative to one whose response time has become unacceptable may impact other requirements, such as those relating to cost or security. Certainly, such issues can be resolved off-line, particularly if the monitoring data helps resolve some of the environmental uncertainty. However, such off-line resolution is possible only because the developers have access to the full requirements models, over which they can reason and reach informed resolution decisions.

The next section presents our view of self-adaptive systems; we elaborate on how to tackle the uncertainty of SASs, i.e., adaptations that must work in scenarios that are not completely predictable. The examples and research results given will be used to elaborate further ideas on how to tackle the research challenges presented in this paper.

3 Goal-Based Requirements for Self-adaptive Systems: Foreseen and Foreseeable Adaptations

Changes that a self-adaptive system is designed to tolerate can be foreseen, foreseeable or unforeseen [13].

Our initial research efforts on SASs were concerned with requirements modeling for systems dealing with foreseen change [14]. Where change is foreseen, the set of contexts that the system may encounter is known at design time. In this case, a SAS can be defined as a set of pre-determined system configurations that define the system's behaviour in response to changes of environmental context. Thus, there is little or no uncertainty about the nature of the system's environment and, if the system is developed to high quality standards, satisfaction of its requirements should be deterministic.

More recently we have addressed systems dealing with change that can be defined as foreseeable. In this case, the key challenge is uncertainty: at design time some features of the problem domain are unknown, perhaps even unknowable. Crucially, the fact of this uncertainty can be recognized, offering the possibility of mitigating it by resolving the uncertainty at runtime. The uncertainty associated with foreseeable change typically forces the developers to make assumptions in order to define the means to achieve the system's requirements. This is the basis of our ideas on [email protected] [15], further explained in Section 3.2.

Systems dealing with unforeseen change are outside our scope; they are more properly a topic for artificial intelligence research and pose a different order of challenge for self-adaptation.


In the rest of this section we describe our view of self-adaptive systems; specifically, Sections 3.1 and 3.2 elaborate further on foreseen and foreseeable adaptations, respectively. These research results will then be used to elaborate further ideas on how to tackle the research challenges presented in this paper.

3.1 Foreseen Adaptations

The requirements specification approach we have followed so far is goal-driven and characterizes the environment as a finite set of stable states subject to events that cause transitions between states. A self-adaptive system can be modeled as a collection of target systems [16], each of which corresponds to, and operates within, a state of the environment [14]. The concerns modelled correspond to levels of analysis that represent particular concerns of a self-adaptive system: the behaviour of the set of target systems, the requirements for how the self-adaptive system adapts from one target system to the next, and the requirements for the adaptive infrastructure. We make each concern the focus of a different model, or group of models, that we use to visualize and analyze the system requirements. Level 1 is analogous to the analysis needed for a conventional, non-adaptive system. A separate Level 1 model is needed for each stable state of the environment, that is, for each target system. Level 1 models must specify the variables in the environment to monitor that represent the triggers for adaptation. Level 2 is concerned with decision making and has no direct analogue in conventional systems. Level 2 helps the analyst focus on understanding the requirements for adaptation by defining adaptation scenarios. The analyst must identify the values of the monitored environment variables that prompt transitions between stable states, specify the target systems that represent the start and end points of adaptations, and specify the form of the adaptation. Level 3 analysis is concerned with identification of the adaptive infrastructure needed to enable self-adaptation. Level 3 is not relevant for this paper.

We have successfully applied our approach to different case studies. One of these case studies is GridStix [17], an adaptive flood warning system, documented in [14]. We describe only a subset of the models for the case study; a fuller description is found in [14,18]. The first task at Level 1 was to identify the high-level, global goals of GridStix: the main goal Predict Flooding, and three softgoals that describe required qualities, Fault Tolerance, Energy Efficiency and Prediction Accuracy. Next, states of the river environment were identified, each of which could be treated as a discrete domain for a target system. In GridStix, these represented a three-fold classification of the river state: S1: Normal or quiescent, where depth and flow rate are both within bounds that indicate no risk of flood; S2: Alert, where the depth was within acceptable bounds but the flow rate had increased significantly, possibly presaging the onset of the next state; and S3: Emergency.
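The numeric thresholds behind this classification are not given in this paper, so the following Java fragment is only a sketch of the three-fold classification with placeholder values.

    // Hypothetical classifier for the river states S1-S3; the threshold values are
    // placeholders, not the ones used in the real GridStix deployment.
    enum RiverState { S1_NORMAL, S2_ALERT, S3_EMERGENCY }

    class RiverClassifier {
        static final double DEPTH_LIMIT = 1.5;   // metres, assumed
        static final double FLOW_LIMIT  = 2.0;   // metres/second, assumed

        static RiverState classify(double depth, double flowRate) {
            if (depth > DEPTH_LIMIT)   return RiverState.S3_EMERGENCY; // flooding imminent
            if (flowRate > FLOW_LIMIT) return RiverState.S2_ALERT;     // flow up, depth still acceptable
            return RiverState.S1_NORMAL;                               // quiescent
        }

        public static void main(String[] args) {
            System.out.println(classify(0.6, 2.4)); // prints S2_ALERT
        }
    }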

The next stage in the analysis was aimed at discovering the application requirements for each target system. This necessitated development of an i* strategic rationale model (SR model) for each of the three target systems. We have used the i* notation [19]. SR models in i* help the analyst reason about the requirements needed to address each environment variability. In this notation, target systems are depicted as agents, represented by dashed circles that delimit the scope of the agents' responsibilities. Flow rate and Depth are modeled as resources. Inside the agent boundaries, each target system is depicted as a set of tasks (the hexagons) that help to satisfy the Predict Flooding goal. The solid arrow arcs represent means-ends relationships, while the arcs with bars at one end represent task decompositions. An open arrow arc represents a qualitative assessment of the extent to which a task contributes to the satisfaction of a softgoal; it can be annotated as hurt or help. To illustrate this, consider the SR model for the Normal state in Figure 1.

Fig. 1. Behaviour model of environment variant Normal
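As a rough illustration of the model elements just described (and only that; this is not the official i* metamodel), a run-time representation of goals, tasks, softgoals and contribution links might be sketched in Java as follows.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal, illustrative in-memory representation of the i* elements used here:
    // a goal, the tasks that operationalize it, softgoals, and contribution links.
    class GoalModel {
        enum Contribution { MAKE, HELP, HURT, BREAK }

        record Softgoal(String name) {}
        record ContributionLink(String task, Softgoal softgoal, Contribution value) {}

        final String goal;                                       // e.g. "Predict Flooding"
        final List<String> tasks = new ArrayList<>();            // means to achieve the goal
        final List<ContributionLink> links = new ArrayList<>();  // trade-off annotations

        GoalModel(String goal) { this.goal = goal; }

        public static void main(String[] args) {
            GoalModel s1 = new GoalModel("Predict Flooding");
            Softgoal energy = new Softgoal("Energy Efficiency");
            Softgoal fault  = new Softgoal("Fault Tolerance");
            s1.tasks.add("Use SP Topology");
            // The shortest-path topology helps energy efficiency but hurts fault tolerance.
            s1.links.add(new ContributionLink("Use SP Topology", energy, Contribution.HELP));
            s1.links.add(new ContributionLink("Use SP Topology", fault,  Contribution.HURT));
        }
    }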

A key feature of i* is that it supports reasoning about softgoal trade-offs early in the analysis. The key aspect to note here is that the SR models allow the identification of tasks that specify the means by which goals are to be accomplished. These tasks correspond to architecturally relevant concerns (as does the spanning tree explained below). According to the context, tasks may satisfy some softgoals better than others. As the river changes, the trade-offs between softgoals change, and this impacts the best combination of tasks (architectural decisions) to accomplish the Predict Flooding goal.

Fig. 2. Behaviour model of environment variant Emergency

Construction of the Level 1 models revealed a conflict among the softgoals identified: Energy efficiency is not easily reconciled with Prediction accuracy and Fault tolerance. For S1 (see Figure 1), Energy efficiency was considered to have a higher priority than Prediction accuracy and Fault tolerance. This was resolved by using single-node flow measurement, which provides a less accurate prediction capability than processing an array of sensor data, but is more energy-efficient. When the river is quiescent, single-node flow measurement was judged to provide adequate forewarning of a change in river state. Similarly, with the river quiescent, there is little risk of node failure, so resilience was also traded off against low energy consumption. This was the rationale for specifying a relatively efficient shortest-path network topology. These softgoal trade-offs are reflected by the hurt and help relationships with the softgoals.

A different balance of trade-offs was used among the softgoals for S2 and S3. In S3: Emergency, for example, the water depth has increased to the point where nodes are threatened by submersion or debris, so a fewest-hop spanning tree (Use FH topology) is specified for the network topology. Fewest-hop networks are more resilient, though less energy-efficient, than shortest-path network topologies. The result of this choice was to strengthen Fault tolerance at the expense of Energy efficiency. Similarly, multi-node digicam was chosen for this target system (see Figure 2).


Note how the trade-offs between conflicting softgoals have required architectural decisions. Using the SP or FH topology for the spanning tree, or single-node or multi-node digicam for the image flow calculation, has proven to have different architectural impacts on the running system.

Similarly, for S2: Alert a balance of trade-offs was used, and the resulting strategy is shown in Figure 3.

Fig. 3. Behaviour model of environment variant Alert

The Level 2 models identified and modeled adaptation scenarios that specify transitions between the steady-state systems S1, S2, and S3. Figure 4 depicts the adaptation scenario for adapting from S1 to S2 as the river state transitions between the domains Normal and Alert.

In addition to specifying the source and target systems, each adaptation scenario must address three concerns that determine when and how to adapt: what data to monitor; what changes in the monitored data trigger the adaptation; and how the adaptation is effected. Each of these three concerns is conceptualized as the responsibility of a role of the adaptation infrastructure: Monitoring mechanism, Decision-making mechanism and Adaptation mechanism, respectively. The Decision-making mechanism, depending upon the source system S1, determines when GridStix needs to adapt from S1 to S2. This adaptation was specified to occur on a significant increase in river flow but before any significant depth increase. The Adaptation mechanism had to satisfy the goal Effect adaptation by performing the tasks Replace single-node flow processing with distributed flow processing and Replace SP tree with FH tree, which defined, at a high level, the difference between the S1 and S2 behavior models. The Decision-making mechanism is always concerned with the source system, and the Adaptation mechanism is concerned with the target system.

Fig. 4. Level 2: S1 to S2 Adaptation Model
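A minimal sketch of the three roles for the S1-to-S2 scenario, assuming hypothetical interfaces and placeholder trigger values (the concrete values used in GridStix are not given here), could look like this in Java.

    // Illustrative separation of the three adaptation-scenario roles; the interfaces
    // are hypothetical, not the GridStix middleware API.
    interface MonitoringMechanism { double observeFlowRate(); double observeDepth(); }
    interface DecisionMechanism   { boolean shouldAdapt(double flowRate, double depth); }
    interface AdaptationMechanism { void effectAdaptation(); }

    class S1ToS2Scenario {
        // Decision for S1 -> S2: significant flow increase before any significant depth increase.
        static final DecisionMechanism DECISION =
            (flow, depth) -> flow > 2.0 && depth < 1.5;   // placeholder thresholds

        static final AdaptationMechanism ADAPTATION = () -> {
            // High-level difference between the S1 and S2 behaviour models, as described above.
            System.out.println("Replace single-node flow processing with distributed flow processing");
            System.out.println("Replace SP tree with FH tree");
        };

        static void step(MonitoringMechanism monitor) {
            if (DECISION.shouldAdapt(monitor.observeFlowRate(), monitor.observeDepth())) {
                ADAPTATION.effectAdaptation();
            }
        }
    }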

3.2 Foreseeable Adaptations

As discussed earlier, the challenging characteristic of SASs is that of uncertainty; a full understanding of all the environmental contexts they will encounter at runtime may be unobtainable at design time. Thus assumptions may have to be made that risk being wrong, and this may lead to problems at runtime. For example, a particular environmental context may be assumed to have particular characteristics and the system's behaviour defined accordingly; if the context turns out to have different characteristics, the system may behave in a way that is inappropriate. This has led us to exploit the concept of markers of uncertainty [15]. A marker of uncertainty serves as an explicit marker of an unknown that forces the developer to make an assumption. We have implemented markers of uncertainty using claims at runtime. A benefit of using claims to represent design-time assumptions is that the uncertainty is bounded, and thus the risk of the system behaving inappropriately may be mitigated by monitoring, claim and goal evaluation, and adaptation. Our solution uses the i* goal models explained above and claim refinement models, as depicted in Figure 5 (a slightly improved version of the goal-based model for S3, enhanced with the use of claims).

In the previous section we treated GridStix as a system that was subject only to foreseen change. From subsequent experience gained from GridStix's deployment it became clear that GridStix could be characterized as a system subject to foreseeable change. For example, the effects of the turbulent water surface on radio wave propagation were unknown to the designers when GridStix was originally conceived. The fact that this is likely to have an effect on the system when using low-power radio for communication is now known, although exactly how and when it will have an effect is not easy to predict. Accordingly, we knew that the goal models explained earlier embodied a number of assumptions that arose from imperfect understanding of how the system would be affected by its environment. An example of how such uncertainty appears in the goal models is the supposition that node failure is likely to lead to fragmentation of the network if a shortest-path (SP) spanning tree is selected, rather than the normally more resilient fewest-hop (FH) spanning tree. This assumption is supported by experience in other network installations, but it isn't necessarily true of every network. In our application of claims, we therefore attached a claim to the goal model to make our uncertainty about the choice of spanning tree explicit: the claim SP too risky for S3. The other two claims, Bluetooth too risky for S3 and Single node image processing not accurate enough for S3, represent similar instances of uncertainty. The claims SP too risky for S3 and Bluetooth too risky for S3 also served the purpose of prioritizing the Fault Tolerance softgoal over Energy Efficiency. They change the values of the hurt contribution links connecting the Use Bluetooth and Use SP Topology tasks to the Fault Tolerance softgoal to break, so favouring selection of their alternative, more fault-tolerant operationalizations (FH and WiFi) when the resulting goal model is evaluated.

Each claim was decomposed in a Claim-Refinement Model (CRM, not shown here) to derive monitorables. For example, the claim SP too risky for S3 decomposed into two AND-ed subclaims: Faults Likely and SP is Less Resilient than FH. The semantics of these monitorables were that SP too risky for S3 would be held to be true only if the frequency of node failures in the environmental context managed by S3 was above a threshold value, and if the frequency of network fragmentation when the network was configured with a fewest-hop spanning tree was below a threshold value. These threshold values were defined by the requirements engineer as utility functions. They represented, respectively, the frequencies at which it was expected that node failures would lead to significant loss of data and at which network fragmentation exceeded that estimated for a network configured with a shortest-path spanning tree. Both frequency values are monitorable by collecting run-time data over a given period, again defined by the requirements engineer. The monitors were hand-coded to collect the data, evaluate it and generate an event when the criteria for falsifying a claim were met.
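The following sketch illustrates how such a claim monitor might evaluate the two subclaims against thresholds; the threshold values and the callback are assumptions, standing in for the utility functions and the event generation described above.

    // Illustrative claim monitor for "SP too risky for S3"; not the hand-coded
    // GridStix monitors. Thresholds are placeholders for the utility functions.
    class ClaimMonitor {
        static final double NODE_FAILURE_THRESHOLD  = 0.1;  // failures per hour, assumed
        static final double FRAGMENTATION_THRESHOLD = 0.05; // fragmentations per hour, assumed

        private final Runnable onFalsified;

        ClaimMonitor(Runnable onFalsified) { this.onFalsified = onFalsified; }

        // Called at the end of each monitoring window with the collected frequencies.
        void evaluate(double nodeFailureFrequency, double fhFragmentationFrequency) {
            boolean faultsLikely    = nodeFailureFrequency > NODE_FAILURE_THRESHOLD;
            boolean fhMoreResilient = fhFragmentationFrequency < FRAGMENTATION_THRESHOLD;
            // The claim holds only if both subclaims hold; otherwise an event is
            // generated so that the goal model can be re-evaluated.
            if (!(faultsLikely && fhMoreResilient)) {
                onFalsified.run();
            }
        }

        public static void main(String[] args) {
            ClaimMonitor monitor =
                new ClaimMonitor(() -> System.out.println("claim 'SP too risky for S3' falsified"));
            monitor.evaluate(0.02, 0.01); // few node failures observed: the claim is falsified
        }
    }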


Fig. 5. GridStix goal model for the variant Emergency (S3)

One of the claim-triggered adaptations observed in GridStix concerned the variation point Transmit Data. As described above, the effect of the claim Bluetooth too risky for S3 was for the runtime goal-based reasoner, when initially invoked at design time, to select the Use WiFi operationalization and, via the adaptive middleware, to configure the GridStix nodes to use their IEEE 802.11b wireless communications hardware. At some point during runtime, however, the claim Bluetooth is less Resilient than WiFi is falsified by monitoring. This generates a Monitoring Policy that is detected by GridStix's middleware, which in turn invokes the runtime goal-model reasoner.

Contribution links in the goal models are changed accordingly and, after a re-evaluation of the goal model by the runtime reasoner, Use Bluetooth is selected as the operationalization of Transmit Data. This is a change of operationalization from the Use WiFi selected at design time. The effect is to generate a new adaptation policy for the GridStix middleware runtime engine, which in turn adapts the GridStix component configuration to make the GridStix nodes use their Bluetooth instead of their IEEE 802.11b wireless communications hardware. We have observed that this claim-driven adaptation from WiFi to the less power-hungry Bluetooth contributed to making the GridStix nodes survive for longer under certain of the simulated conditions. Therefore, we can conclude that the use of claims had an effect on the improvement in the performance of GridStix.


4 Research Challenges

Given the above, we have identified the following areas of research:

– Dealing with uncertainty
– Run-time representations of requirements
– Evolution of the requirements model and synchronization with the architecture
– Dynamic generation of software

The research challenges identified are discussed in the rest of this section. For each challenge, we present a short description of what the challenge is about, motivations raised by the examples presented in Section 3, and our efforts to tackle the challenge. Efforts by other authors are also partially presented, together with some open questions and research directions.

4.1 Dealing with Uncertainty

A key challenge when developing self-adaptive systems is how to tame uncertainty about the environment. Requirements models must be able to tolerate uncertainty. The monitoring, reasoning and adaptive capabilities of the system explained earlier help tolerate uncertainty. However, in this process conflicts will arise, as every adaptive strategy may involve a different set of trade-offs. Therefore, the requirements models also need an explicit representation of where uncertainty exists, in order to know which requirements can be traded off in favour of critical requirements, and under what circumstances. The adaptive reasoning mechanism needs to be capable of dealing with these conflicts, and of reasoning with imperfect knowledge.

It is important to understand that uncertainty and change are related, but distinct, concepts. An environment that changes, but for which the nature of those changes is known, can be handled using standard development techniques such as defining adaptation trigger conditions, as explained in Section 3.1. In such cases, requirements awareness is not strictly necessary. More interesting, however, are cases where the environment changes in ways that cannot be predicted. For example, in the case study of the flood warning system GridStix, we demonstrated how the system can improve its performance by making decisions at runtime about what configuration to use during an emergency, given the fact that more sunlight has recharged the batteries of the nodes of the infrastructure (a situation that would not have been clearly predicted during design time). In these kinds of situations it is not adequate to define adaptation triggers, because the correct triggering conditions cannot be anticipated at design time. An alternative solution is therefore required: either one that learns new triggering conditions at run-time or, as proposed in this paper, a higher-level adaptation mechanism in which requirements themselves are represented at run-time, monitored, and traded off against each other, if necessary, when unexpected contextual changes take place.


We believe that RE should consider a move away from binary satisfaction conditions for requirements to more nuanced notions of requirements conformance. As an example of why this is necessary, consider a self-adaptive system with two overarching requirements: to perform a given task well and to perform it efficiently. Furthermore, not all requirements have equal standing. If the environment changes unexpectedly, for instance, it may be wise temporarily not to satisfy a non-critical requirement if it means that a critical requirement will continue to be satisfied. To address such issues, we call for research into how existing requirements languages and methodologies can be extended so that self-adaptive systems have run-time flexibility to temporarily ignore some requirements in favour of others; that is, we envisage run-time trade-offs of requirements being made as the environment changes. As a first step, we have developed the RELAX requirements language for adaptive systems [20,21]. RELAX defines a vocabulary for specifying varying levels of uncertainty in natural language requirements and has a formal semantics defined in terms of fuzzy branching temporal logic. This allows a requirements engineer to specify ideal cases but leaves a self-adaptive system the flexibility to trade off requirements at run-time as environmental conditions change, i.e., certain requirements can be temporarily RELAX-ed. As a very simple example, consider a protocol that synchronizes various computing devices in a smart office environment. One requirement for such a system might be:

The synchronization process SHALL be initiated when the device owner enters the room and at 30 minute intervals thereafter.

RELAX provides a process that assists a requirements engineer in deciding whether a requirement should be RELAX-ed. In this case, s/he might decide that the hard thresholds are not crucial and RELAX the requirement to:

The synchronization process SHALL be initiated AS EARLY AS POSSIBLE AFTER the device enters the room and AS CLOSE AS POSSIBLE TO 30 minute intervals thereafter.

Given a set of RELAX-ed requirements, they can be traded off at run-time. For example, critical requirements would not be RELAX-ed, whereas less critical ones would be; in this case, the self-adaptive system can autonomously decide to temporarily not fully satisfy such requirements. RELAX provides a set of well-defined operators (e.g., AS EARLY AS POSSIBLE and AFTER above) which can be used to construct flexible requirements in a well-defined way. It also offers a way to model the key properties of the environment that will affect adaptation.
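As noted below, no runtime monitor for RELAX-ed requirements exists yet, so the following is only a sketch of how a fuzzy satisfaction degree for "AS CLOSE AS POSSIBLE TO 30 minute intervals" might be computed; the linear membership function and the tolerance value are assumptions, not the RELAX semantics.

    // Illustrative fuzzy evaluation of "AS CLOSE AS POSSIBLE TO 30 minute intervals".
    // Not an implementation of RELAX; the membership function is assumed.
    class RelaxedIntervalRequirement {
        static final double IDEAL_MINUTES = 30.0;
        static final double TOLERANCE     = 15.0;  // beyond this the degree drops to 0 (assumed)

        // Returns a satisfaction degree in [0,1]: 1 when the interval is exactly 30 minutes,
        // decreasing linearly with the deviation, 0 once the deviation exceeds the tolerance.
        static double satisfaction(double observedIntervalMinutes) {
            double deviation = Math.abs(observedIntervalMinutes - IDEAL_MINUTES);
            return Math.max(0.0, 1.0 - deviation / TOLERANCE);
        }

        public static void main(String[] args) {
            System.out.println(satisfaction(30.0)); // 1.0, fully satisfied
            System.out.println(satisfaction(40.0)); // about 0.33, partially satisfied
            System.out.println(satisfaction(50.0)); // 0.0, not satisfied
        }
    }

Such graded values, rather than a binary pass/fail, are what would allow a less critical requirement to be temporarily traded off against a critical one.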

Research Directions

Although there is a formal semantics, in terms of fuzzy logic, for RELAX, there is no implementation yet that actually monitors RELAX-ed requirements at run-time. This therefore is a clear avenue for immediate research. Fuzzy logic is not the only formalism that could be used to reason about uncertainty in the environment, of course. Numerous mathematical and logical frameworks exist for reasoning about uncertainty [22]. For example, probabilistic model checkers have been used to specify and analyse properties of probabilistic transition systems [23], and Bayesian networks enable reasoning over probabilistic causal models [24]. However, only limited attention has been paid so far to the treatment of uncertainty in requirements engineering models. Our ongoing work has the objective of developing extensions to goal-oriented requirements modeling languages to support modeling and reasoning about uncertainty in design-time and run-time models; initial results are in [15]. In the longer term, research on self-adaptive systems, and RE in particular, needs a theory of uncertainty. Given such a theory, requirements for self-adaptive systems could be related to the uncertainty in the environment and could be monitored or adapted according to that uncertainty. Other fields of study offer possible starting points for such a theory, for example risk analysis for possible security issues in software-intensive systems, risk assessments in engineering disciplines [25], the economics of uncertainty [26], and uncertainty in management theory [27], as well as well-known mathematical models of uncertainty such as Bayesian networks. All of these fields have developed theories for dealing with uncertainty in their respective domains. An interesting longer-term research question is to distill some of this thinking and incorporate it into requirements engineering for self-adaptive systems.

4.2 Run-Time Representations of Requirements

To be able to fully exploit requirements monitoring and allow the system to reason about its requirements, it will be necessary to hold the requirements models in memory. The runtime representation of requirements should allow the running system itself to evaluate goal satisfaction in real time and to propagate the effects of, for example, falsified domain assumptions. We argue that goal-based models are suitable for the runtime representation needed for requirements models.

Architectural reflection [8,28] offers a pointer to how requirements may become run-time artifacts. Architectural reflection allows introspection of the underlying component-based structures. An architecture meta-model can be used to obtain the current architecture information and determine the next valid step in the execution of the system. Specifically, the architecture meta-model provides access to the component graph, where components are nodes and bindings are arcs. Inspection is achieved by traversing the graph, and adaptation/extension is realized by inserting or removing nodes or arcs. Such extensions and changes are reflected in the system during run-time. Crucially, this meta-model supports reasoning about the architecture of the system. We argue that the same principles can be applied to allow introspection and reasoning based on (meta-)models of requirements at run-time. The mechanisms for achieving this are explored in the next section. Introspection would offer the ability for a run-time requirements entity to reveal information about itself and hence allow the system to reason about its requirements.
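A minimal sketch of such an architecture meta-model, with components as nodes and bindings as arcs (illustrative only, not the meta-object protocol of any particular middleware), follows.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Minimal illustrative architecture meta-model: components are nodes, bindings are arcs.
    // Inspection traverses the graph; adaptation inserts or removes nodes and arcs.
    class ArchitectureMetaModel {
        private final Map<String, Set<String>> bindings = new HashMap<>();

        void addComponent(String name)      { bindings.putIfAbsent(name, new HashSet<>()); }
        void removeComponent(String name)   { bindings.remove(name); bindings.values().forEach(s -> s.remove(name)); }
        void bind(String from, String to)   { addComponent(from); addComponent(to); bindings.get(from).add(to); }
        void unbind(String from, String to) { bindings.getOrDefault(from, new HashSet<>()).remove(to); }

        // Inspection: which components does 'name' depend on?
        Set<String> boundTo(String name)    { return bindings.getOrDefault(name, Set.of()); }

        public static void main(String[] args) {
            ArchitectureMetaModel m = new ArchitectureMetaModel();
            m.bind("FloodPredictor", "WiFiTransport");
            // Adaptation reflected in the running system: swap the transport component.
            m.unbind("FloodPredictor", "WiFiTransport");
            m.bind("FloodPredictor", "BluetoothTransport");
            System.out.println(m.boundTo("FloodPredictor")); // [BluetoothTransport]
        }
    }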

RE is concerned with the identification of the goals to be achieved by the system, the operationalization of such goals as specifications of services and their constraints, and the assignment of responsibilities for services among the agents [29] (i.e. human, physical, and software components) forming the system. As discussed in Section 3, goals can be operationalized in many different ways, and the RE process allows us to explore the choices, detect conflicts between requirements and select the preferred choice by assessing the effects on the system and its context. The selection of an appropriate set of choices is essential to the success of a system. However, as was also described in Section 3 for the flood warning system GridStix, inherent uncertainty about the environment and behavior may make it impossible to anticipate all the exceptional circumstances. In contrast to the assumptions made during the specification of the system, the conditions of execution may change unexpectedly, manifesting unforeseen obstacles [30]. As a result, the selection of the right set of choices may, in the case of SASs, need to be delayed until run-time, when the system can reason and make choices informed by concrete data sensed from the environment [2].

Dynamic assessment of and reasoning about requirements imply a run-time representation of the system requirements (i.e. a run-time requirements model [31]) that is rich enough to support the wide range of run-time analyses concerning stakeholders' goals, software functional and non-functional requirements, alternative choices, domain assumptions, scenarios, risks, obstacles, and conflicts. Such a run-time representation will drive the way a system can reason about and assess requirements during run-time and, crucially, will underpin the other challenges described in this section. To support such dynamic assessment of requirements, language features found in the goal-oriented requirements modeling languages KAOS [29] and i* [19] hold particular promise. KAOS is particularly useful here as it integrates the intentional, structural, functional, and behavioral aspects of a system, and offers formal semantics that would allow automated reasoning over goals. Work in [32] about requirements reasoning may be relevant.

As explained in Section 3.2, a way to achieve a run-time representation of requirements is to base it on goal-based RE. Specifically, in [15] we have demonstrated how to maintain goal-based models in memory to support reasoning at runtime. Runtime representations of design assumptions are made explicit using the concept of claims in goal models at design time. Using what we call claim refinement models (CRMs), we have defined a semantics for claims in terms of their impact on the alternative strategies that can be used to pursue the goal of the system. The impact is calculated in terms of satisfaction and trade-offs of the system's non-functional requirements (i.e. softgoals). Crucially, when during runtime the executing system detects that a given claim is falsified, the system may adapt to an alternative goal realization strategy that may be more suitable for the new contextual conditions. Importantly, our approach tackles uncertainty, i.e. the new goal operationalization strategy may imply a new configuration of components that was not necessarily foreseen during design time. With such potentially unforeseen behavior, self-explanation capabilities are crucial.

Research Directions

In particular, there is a need for language support for representing, traversing and manipulating instances of a meta-model for goal modeling, for example based on the KAOS meta-model. The meta-model could be provided as a set of built-in constructs in a programming language, or alternatively be provided in the form of (e.g.) a library. Crucially, the meta-model must provide a way to represent and maintain relationships between requirements and agents and the inter-relationships between requirements, to dynamically reassign the goals to different agents or to move to alternative goals in the goal tree. In other words, and in contrast to previous work [33], we envision that this representation must take place in such a way that it is not only readily understandable by humans but also easily manipulable by the system itself. This will allow the persons responsible for maintaining software to query the software (as opposed to externally stored documentation) to determine requirements-relevant information, such as: What are the sub-goals of a goal? Which agents are responsible for achieving the goal? What assumptions are associated with a goal? In some cases, the software itself would also be able to use this information to guide its own adaptation. The fact that humans would be able to query the requirements model and its relation to the run-time behavior may be more important than just letting the software do so. The benefits of being able to easily maintain and retrieve up-to-date requirements models go beyond self-adaptation. For example, we consider that self-explanation is an important aspect of self-adaptive systems [34]. SASs may need to explain their behavior either because their user does not understand the current behavior (e.g. the reason why the system is adapting in a given way) or because maintenance is being carried out.
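The kind of introspection interface envisaged here might be sketched as follows; the method names are hypothetical and merely mirror the questions above, and they do not correspond to an existing KAOS library.

    import java.util.List;

    // Illustrative query interface over a run-time goal model; names are hypothetical.
    interface RequirementsIntrospection {
        List<String> subGoalsOf(String goal);          // What are the sub-goals of a goal?
        List<String> responsibleAgents(String goal);   // Which agents are responsible for achieving it?
        List<String> assumptionsOf(String goal);       // What assumptions (claims) are associated with it?
    }

Both a maintainer (through some front end) and the running system itself could issue such queries against the in-memory requirements model.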

4.3 Evolution of the Requirements Models and Its Synchronization with the Architecture

Introspection on requirements models alone is not enough. The system should also be able to identify alternative solution strategies. The system, therefore, needs to be aware of the capabilities of its own adaptive machinery, and this awareness needs to be in sync with the runtime requirements models. In the case that the system behaviour deviates from the requirements models and triggers a search for suitable corrective actions, the range of available adaptations (for example, in the form of component substitutions) can be evaluated and their different trade-offs balanced to find the most suitable of the available solutions. Once identified, the adaptation can be enacted and monitored to evaluate its effect on the behaviour of the system.

Requirements reflection enables self-adaptive systems to revise and re-evaluate design-time decisions at run-time, when more information can be acquired about them by observing their own behaviour. We therefore see two research issues here. One is the evolution of the requirements models themselves and the maintenance of consistency between the different views during this evolution. In order to do this it is necessary to specify how the system's requirements can evolve dynamically and to specify abstract adaptation thresholds that allow for uncertainty and unanticipated environmental conditions [35,21]. Unfortunately, to our knowledge none of the existing techniques deal with this degree of evolution, incomplete information, or uncertainty.

Fig. 6. Synchronization between run-time requirements and the architecture

The second research issue is the need to maintain the synchronization of the run-time requirements model and the software architecture as either the requirements are changed from above or the architecture is changed from below. Current work on computational reflection offers a potential way to structure the run-time relationship between the requirements model and the architecture. Traditionally, reflective architectures are organized into two causally connected layers [8]: the base layer, which consists of the actual running architecture, and the meta-layer, which consists of meta-objects, accessible through a meta-object protocol (MOP), for dynamically manipulating the running architecture. We envision a similar strategy for achieving requirements reflection: a base layer consisting of run-time requirements objects (i.e. the requirements models) and a meta-layer allowing dynamic access to and manipulation of requirements objects (i.e. stakeholders' goals, goal refinements, alternative choices, domain assumptions, etc.). This way of structuring requirements reflection therefore leads to two strata, one for requirements and one for architecture, each encompassing a causally connected base and meta-layer. As in the case of the traditional architecture meta-model (which offers operations over components and connectors), we can define primitives for the goal-based requirements meta-model that allow the meta-level to modify the base-level in the requirements stratum. These primitives might include add requirement, delete requirement, replace requirement, add goal, delete goal, replace goal, obtain agent from goal, and assign agent to goal. A library of requirements model transformation operators, in the spirit of [36], would then be defined on top of these primitive operations. The rich catalogue of model transformation patterns for goal refinement, conflict resolution and obstacle resolution associated with the KAOS language [29] may provide the basis for defining this library. It would also be complemented with operators for resolving inconsistencies between multiple views, in the spirit of Xlinkit, or techniques for automatically fixing inconsistencies in UML models [37]. Figure 6 summarizes the proposed structure.
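A sketch of a meta-object protocol for the requirements stratum, derived directly from the primitives listed above (the signatures themselves are hypothetical), might look as follows.

    // Illustrative meta-object protocol for the requirements stratum; based on the
    // primitive operations named in the text, with assumed signatures.
    interface RequirementsMop {
        void addRequirement(String requirement);
        void deleteRequirement(String requirement);
        void replaceRequirement(String oldRequirement, String newRequirement);
        void addGoal(String goal);
        void deleteGoal(String goal);
        void replaceGoal(String oldGoal, String newGoal);
        String obtainAgentFromGoal(String goal);
        void assignAgentToGoal(String agent, String goal);
    }

A library of model transformation operators, such as those for goal refinement or obstacle resolution, would then be composed from these primitives.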

The structures in Figure 6 would require coordination between the upper requirements stratum and the lower architecture stratum. As a simple example, if a goal is changed in the upper stratum, then the running system may identify a set of components in the architecture to replace. Our solution for [email protected], explained in Section 3.2, gives an example of how this synchronization between runtime requirements and the architecture of the running system can be achieved.

Research Directions

Changes in the software architecture should be monitored to ensure that the requirements are not broken, and changes to the requirements at run-time should be reflected in the running system through dynamic generation of changes to the software architecture (see the last challenge and its relevance for the research community of Generative Techniques in Software Engineering). For this to be possible there needs to be a tight semantic integration between requirements and architecture models. While there are many similarities between requirements engineering models and architecture description languages, subtle semantic differences between existing languages make the relation between the two kinds of models complex [38]. Integration between requirements and architecture is already an urgent area for research. It is particularly important for requirements-aware systems that progress is made.

4.4 Dynamic Generation of Software

The uncertainty that attends understanding of changeable environments means that new, unforeseen variants of behavior may be needed during execution. Crucially, as these new variants were not necessarily explicitly specified in advance, new approaches will be needed to generate them on the fly. We encourage researchers of the community of Generative and Transformational Techniques in Software Engineering to tackle this essential challenge.

As explained earlier, advances have been made in the use of runtime requirements models, particularly in the area of adaptive systems [39,15] and requirements-aware systems. However, the research topic of synthesis or generation of software using runtime models during execution has been neglected. We argue that runtime requirements models (and runtime models in general) can also support the run-time synthesis of software that will be part of the executing system. Examples of the artifacts that could be generated at runtime are policy-based adaptations generated from architecture-based models, as in [40]. That approach uses runtime models to generate the adaptation logic (i.e. reconfiguration scripts) to reconfigure the system by comparing the current configuration of the running system with a composed model representing the target configuration. In this paper, as explained in Section 3.1, we have presented early results on generating policy-based adaptations from higher levels of abstraction (i.e. requirements models).

Next, we explain the mapping from goals to architectural design that was done using foreseen adaptations in Section 3.1. The mapping and generation of policies during runtime explained in Section 3.2 bases its logic on the same mapping; therefore, the rationale of the mapping explained here is essentially the same as when generating adaptation policies at runtime. The Level 1 goal-based models described earlier have guided the design of the architecture-based models for each target system. Similarly, the goal-based models associated with the transitions (the models at Level 2) have guided the construction of models of the dynamic fluctuation of the environment and contexts, and their impact on the variation of the architecture of the applications during execution. Such architectural models have been constructed using the model-based tool Genie [41,42].

Figure 7 shows a Genie model that specifies the transitions between the target systems (bottom right-hand corner) of GridStix. Each transition (arc) describes when and how to dynamically switch from one target system to another (see the details of the adaptation policies in the figure). The architectural concerns Routing Protocol and Network encompass the network topologies (SP and FH) and the communication technologies Bluetooth (denoted as BT) and WiFi (denoted as WF). Each target system shows a pair of choices: Normal (SP, BT), Alert (SP, WF), and Emergency (FH, WF). Furthermore, from these transition models Genie allows the generation of different artifacts, e.g. the adaptation policies that will guide the adaptation of the system during execution and the configurations of components associated with each target system. These artifacts can be dynamically inserted during runtime by the GridStix platform [17]. More information is found in [41,42].

Combining the three-level analysis models with adaptive middleware platforms, we have provided a model-driven approach for the development of the adaptive application GridStix that also covers runtime. Earlier we demonstrated that the goal models can be maintained at runtime. Therefore, the information needed to generate the policies shown in Figure 7 and explained above is accessible to the system, which can generate the adaptation policies on the fly and use them at runtime using GridStix's capabilities for architectural adaptation. From the goal-based models and the architectural decisions explained above, the models for transitioning the system during execution and their adaptation policies were derived during design and also at runtime.
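The following sketch illustrates how such a policy might be generated on the fly once the runtime reasoner has selected a new operationalization. The XML element names loosely follow the policy fragments shown in Figure 7; the generator class itself is hypothetical and not part of Genie or the GridStix platform.

    // Illustrative on-the-fly generation of an adaptation policy of the kind shown
    // in Figure 7; element names are normalized from the figure, the rest is assumed.
    class PolicyGenerator {
        static String generate(String concern, String eventType, String reconfigurationAction) {
            return """
                   <ReconfigurationRule>
                     <Concern>%s</Concern>
                     <Events>
                       <Event><Type>%s</Type><Value>true</Value></Event>
                     </Events>
                     <ReconfigurationAction>
                       <Name>%s</Name>
                     </ReconfigurationAction>
                   </ReconfigurationRule>
                   """.formatted(concern, eventType, reconfigurationAction);
        }

        public static void main(String[] args) {
            // Emitted after the runtime reasoner re-evaluates the goal model and selects
            // the WiFi operationalization for the transition from S1 to S2.
            System.out.println(generate("Network", "HIGH_FLOW", "Reconfiguration.WiFi"));
        }
    }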

In Section 3.2 we have demonstrated how the initial requirements may change as the application evolves. This may be due to a more accurate set of requirements being gleaned from the operation of the deployed system, or to new capabilities being introduced into the system. In the specific case of the flood monitoring sensor network application explained earlier, the requirements initially produced in the development lifecycle led to a set of six foreseen adaptations between three software configurations [18]. As explained in Section 3.2, finding new information about the environment may introduce a new software configuration (a new target system) and new transitions into the application, as well as affecting some of the other transitions.

Fig. 7. Genie models for transitioning between target systems, showing traceability from the goal-based models to policy rules. (The figure shows the three target systems S1: Normal (SP/BT), S2: Alert (SP/WF) and S3 (FH/WF), the transition events between them (flood predicted, high flow and their negations), the associated goal models, and XML adaptation policies, e.g., for the S1-to-S2 and S1-to-S3 transitions, each specifying the concern affected (Network or Routing Protocol), the triggering event and the reconfiguration action.)

Different EU research projects have been working on early ideas about the dynamic generation of software from runtime models. The CONNECT project is researching how runtime models can be used to generate connectors at the level of middleware and applications [43]. The EU DiVA project contributed seminal ideas on the generation of adaptation policies [40]. However, more research effort is still needed.

Research Directions

We have shown that the running system may find it appropriate to reconfigure itself to a new target system that was not foreseen and therefore was not validated before. Consequently, performing validation and verification at runtime is an area of research that needs to be explored. Dynamic generation of software brings up different research questions:

How are the generative technologies used during design different from the synthesis techniques needed when using runtime models to guide adaptation and, in general, during execution? Can we use the same current technologies for dynamic model synthesis? Are they suitable?

How can we achieve reversible model transformations to deal with synchronization issues between the runtime model and the running system, and between the development models and runtime models?


What are the methods for specifying semantics suited to automated interpretation (i.e., interpretation performed at runtime)?

So, even though the idea may be seen as old, combining runtime models for dynamic generation with modern computing technology not only amplifies the benefits but also creates new opportunities. These questions are just a few starting points for research in this exciting topic, with potentially fruitful results for software engineering.

5 Related Work

In the approach recommended by [16] and [44], and followed by [45], as well as in our own work on LoREM [14] and partially shown here, the environment of a self-adaptive system is treated as a finite set of discrete contexts, with a conceptually independent system specified for each. The authors of [45] avoid the explicit enumeration of alternatives using modularization techniques based on aspects. However, reasoning is achieved at the architectural level, in contrast to our view, where it is done at the requirements level. Closer to our own work is that of Sykes et al. [46], which tackles a different but related problem from a different angle. In [46], the system is able to reconfigure its component configurations at runtime to optimize non-functional requirements (NFR) satisficement, prescribing dedicated monitors for specific NFRs to promote generality. While in our case we use runtime requirements models, Sykes et al. work at the architectural level.

Lapouchnian [47] uses alternative paths in goal models to derive a set of possible system behaviors that are annotated in the goal model. As in our case in this paper, Lapouchnian uses the idea of variation points to indicate explicit alternatives to operationalize a goal. Awareness Requirements (or AwReqs) are defined as requirements that refer to the success or failure of other requirements. AwReqs can refer to goals, tasks, quality constraints and domain assumptions, and offer high-level monitoring capabilities that can be used to determine satisfaction levels for AwReqs. However, [47] and [48] do not provide a runtime representation of the requirements as we emphasize. Lapouchnian et al. [47] focus on reasoning about partial satisfaction of requirements, as does [49], which formalizes a means for representing partial goal satisfaction based on KAOS [29]. In this way, both approaches differ from our own and from that of Sykes et al., where the focus is on optimizing NFR trade-offs as the environment changes.

DeLoach and Miller [50] explore how to maintain a runtime representation of goals. However, they do not deal with the runtime representation of softgoals or goal realization strategies. Instead, the running system interacts with the runtime goal model to trigger an update of the status of a goal. Thus a goal can be triggered to go from active to achieved, failed, obviated, or removed. This supports understanding of what the system is doing, but not reasoning about goal satisfaction.

Chen et al. [51] maintain goal models at runtime to support reasoning about tradeoff decisions that are aimed at achieving survivability assurance. As in our case, Chen et al.'s live goal models postpone the necessary quality tradeoff decisions until runtime. Unlike our work, however, they deal with functional (hard) goals, disconnecting them from the goal model in order to spare resources.

Uncertainty in adaptive systems has been tackled by FLAGS [39], RELAX [52], and Loki [53]. RELAX and FLAGS adopt fuzziness to express uncertain requirements and allow small and transient deviations. In FLAGS the feasibility of adaptive goals has been demonstrated using a particular underlying adaptive machinery: a service-based architecture. RELAX is also used in [21] to specify more flexible requirements within a goal model to handle uncertainty during design time. In FLAGS, adaptive goals are used for the adaptation of service compositions that are aware of their own degree of satisfaction during runtime. Adaptations are triggered by violated goals; the goal model is modified accordingly to keep a coherent view of the system and apply adaptation policies on the running system. Finally, Loki is an approach that automatically discovers combinations of environmental states that give rise to violations of the requirements of self-adaptive systems. Loki is used during design time.

6 Conclusions

In this paper we have argued that self-adaptive systems should be aware of their requirements. Our motivation for advocating requirements-awareness is that self-adaptive systems are increasingly being tasked with operating in volatile and poorly understood environments. At design time, there is sufficient uncertainty about the environment that requirements engineers can only hypothesize about the states and events that the system may encounter at run-time. Because so much rests on conjecture due to this incomplete information, a system needs the ability to self-adapt to cope with unforeseen or partly understood events if it is to be adequately resilient to unanticipated environmental contexts. We have shown how a number of advances have been made in software engineering, and specifically in RE, to support this vision. However, much research remains to be done, as the research area of self-adaptive systems that tackle uncertainty is still in its infancy. Our proposal represents a way to advance this research area. Requirements-awareness requires that the requirements models cease to be strictly off-line, passive entities and become run-time objects that can be queried and manipulated to (e.g.) re-assign goal satisfaction responsibility between different agents as the needs of the fluctuating environmental context dictate. Implicit in requirements-awareness is that the system's architecture and requirements models are synchronized, since different architectural configurations often imply different trade-offs, particularly in terms of soft goal satisfaction. Such trade-offs often necessitate the resolution of conflicting goals. Uncertainty and the scale of the possible solution space preclude enumeration and resolution of such conflicts at design time, so the necessary resolution reasoning needs to occur at run-time. Underpinning these principles is a need to be able to express the uncertainty that exists, in terms of what it is that is uncertain and the boundaries of what is acceptable in terms of goal satisficement, when unanticipated events occur and conflicting goals need to be traded off.


Furthermore, a self-adaptive system is likely to exhibit emergent behaviour. New techniques for generating software artifacts will be needed. Concretely, we should be able to dynamically generate new system capabilities according to new situations that may be unforeseen in the original design of the system. The latter may require changing the metamodel while the system is running (using runtime models). Developers need to be able to trace the origin of this behaviour, and users need to gain confidence in the system.

The machinery for self-adaptation already exists and is increasingly being deployed in systems with limited generative capabilities. Generative Techniques in Software Engineering (GTSE) have an important role to play in the development of future self-adaptive systems. We hope we have motivated researchers in the area of GTSE to work with the self-adaptive systems community to improve the current state of the art.

Acknowledgments. Thanks to Pete Sawyer, Jon Whittle, Betty Cheng, Emmanuel Letier, Anthony Finkelstein, and Kris Welsh, who have been co-authors of different papers that have been part of the basis of the material presented here.

References

1. Bencomo, N., Whittle, J., Sawyer, P., Finkelstein, A., Letier, E.: Requirements reflection: requirements as runtime entities. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, ICSE 2010, vol. 2, pp. 199–202 (2010)

2. Fickas, S., Feather, M.: Requirements monitoring in dynamic environments. In: Second IEEE International Symposium on Requirements Engineering, RE 1995 (1995)

3. Feather, M., Fickas, S., van Lamsweerde, A., Ponsard, C.: Reconciling system requirements and runtime behavior. In: Proceedings of Ninth International Workshop on Software Specification and Design, pp. 50–59 (April 1998)

4. Capra, L., Blair, G., Mascolo, C., Emmerich, W., Grace, P.: Exploiting reflection in mobile computing middleware. ACM SIGMOBILE Mobile Computing and Communications Review 6(4), 34–44 (2002)

5. Sawyer, P., Bencomo, N., Whittle, J., Letier, E., Finkelstein, A.: Requirements-aware systems: A research agenda for RE for self-adaptive systems. In: IEEE International Conference on Requirements Engineering, pp. 95–103 (2010)

6. Kramer, J., Magee, J.: Self-managed systems: an architectural challenge. In: 2007 Future of Software Engineering, FOSE 2007, pp. 259–268. IEEE Computer Society (2007)

7. Garlan, D., Cheng, S.W., Huang, A.C., Schmerl, B., Steenkiste, P.: Rainbow: Architecture-based self-adaptation with reusable infrastructure. IEEE Computer 37(10), 46–54 (2004)

8. Coulson, G., Blair, G., Grace, P., Joolia, A., Lee, K., Ueyama, J., Sivaharan, T.: A generic component model for building systems software. ACM Transactions on Computer Systems (February 2008)


9. Oreizy, P., Gorlick, M.M., Taylor, R.N., Heimbigner, D., Johnson, G., Medvidovic, N., Quilici, A., Rosenblum, D.S., Wolf, A.L.: An architecture-based approach to self-adaptive software. IEEE Intelligent Systems and Their Applications 14(3), 54–62 (1999)

10. Hughes, D., Greenwood, P., Coulson, G., Blair, G., Pappenberger, F., Smith, P., Beven, K.: Gridstix: Supporting flood prediction using embedded hardware and next generation grid middleware. In: 4th International Workshop on Mobile Distributed Computing (MDC 2006), Niagara Falls, USA (2006)

11. Robinson, W.: A requirements monitoring framework for enterprise systems. Requirements Engineering 11(1), 17–41 (2005)

12. Baresi, L., Ghezzi, C., Guinea, S.: Smart monitors for composed services. In: Proceedings of the 2nd International Conference on Service Oriented Computing, ICSOC 2004, pp. 193–202. ACM, New York (2004)

13. Andersson, J., Lemos, R., Malek, S., Weyns, D.: Modeling Dimensions of Self-Adaptive Software Systems. In: Cheng, B.H.C., de Lemos, R., Giese, H., Inverardi, P., Magee, J. (eds.) Software Engineering for Self-Adaptive Systems. LNCS, vol. 5525, pp. 27–47. Springer, Heidelberg (2009)

14. Goldsby, H.J., Sawyer, P., Bencomo, N., Hughes, D., Cheng, B.H.: Goal-based modeling of dynamically adaptive system requirements. In: 15th Annual IEEE International Conference on the Engineering of Computer Based Systems, ECBS (2008)

15. Welsh, K., Sawyer, P., Bencomo, N.: Towards requirements aware systems: Run-time resolution of design-time assumptions. In: 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), Lawrence, KS, USA, pp. 560–563 (2011)

16. Zhang, J., Cheng, B.H.: Model-based development of dynamically adaptive software. In: International Conference on Software Engineering (ICSE 2006), China (2006)

17. Hughes, D., Bencomo, N., Blair, G.S., Coulson, G., Grace, P., Porter, B.: Exploiting extreme heterogeneity in a flood warning scenario using the gridkit middleware. In: Middleware (Companion), pp. 54–57 (2008)

18. Sawyer, P., Bencomo, N., Hughes, D., Grace, P., Goldsby, H.J., Cheng, B.H.C.: Visualizing the analysis of dynamically adaptive systems using i* and DSLs. In: REV 2007: 2nd Intl. Workshop on Requirements Engineering Visualization, India (2007)

19. Yu, E.S.K.: Towards modeling and reasoning support for early-phase requirements engineering. In: Proceedings of the 3rd IEEE International Symposium on Requirements Engineering (RE 1997), Washington, DC, USA (1997)

20. Whittle, J., Sawyer, P., Bencomo, N., Cheng, B.H., Bruel, J.M.: RELAX: Incorporating uncertainty into the specification of self-adaptive systems. In: 17th IEEE International Requirements Engineering Conference, RE 2009 (2009)

21. Cheng, B.H., Sawyer, P., Bencomo, N., Whittle, J.: A Goal-Based Modeling Approach to Develop Requirements of an Adaptive System with Environmental Uncertainty. In: Schurr, A., Selic, B. (eds.) MODELS 2009. LNCS, vol. 5795, pp. 468–483. Springer, Heidelberg (2009)

22. Halpern, J.Y.: Reasoning about uncertainty. MIT Press (2005)

23. Kwiatkowska, M., Norman, G., Parker, D.: Probabilistic symbolic model checking with PRISM: A hybrid approach. International Journal on Software Tools for Technology Transfer (STTT), 52–66 (2002)

24. Fenton, N.E., Neil, M.: Making decisions: using Bayesian nets and MCDA. Knowl.-Based Syst. 14(7), 307–325 (2001)


25. Stewart, M., Melchers, R.: Probabilistic Risk Assessment of Engineering Systems. Springer (2007)

26. Gollier, C.: The Economics of Risk and Time. MIT (2001)

27. Courtney, H.: 20/20 Foresight: Crafting Strategy in an Uncertain World. Harvard Business School Press (2001)

28. Maes, P.: Computational reflection. PhD thesis, Vrije Universiteit (1987)

29. van Lamsweerde, A.: Requirements Engineering: From System Goals to UML Models to Software Specifications. John Wiley & Sons (2009)

30. van Lamsweerde, A., Letier, E.: Handling obstacles in goal-oriented requirements engineering. IEEE Trans. Software Eng. 26(10), 978–1005 (2000)

31. Blair, G., Bencomo, N., France, R.B.: Models@ run.time. Computer 42(10), 22–27 (2009)

32. Goknil, A., Kurtev, I., Berg, K.: A Metamodeling Approach for Reasoning about Requirements. In: Schieferdecker, I., Hartman, A. (eds.) ECMDA-FA 2008. LNCS, vol. 5095, pp. 310–325. Springer, Heidelberg (2008)

33. Dardenne, A., van Lamsweerde, A., Fickas, S.: Goal-directed requirements acquisition. Science of Computer Programming, 3–50 (1993)

34. Bencomo, N., Welsh, K., Sawyer, P., Whittle, J.: Self-explanation in adaptive systems. In: 17th IEEE International Conference on Engineering of Complex Computer Systems, ICECCS (2012)

35. Cheng, B.H.C., Atlee, J.M.: Research directions in requirements engineering. In: FOSE 2007, pp. 285–303 (2007)

36. Johnson, W.L., Feather, M.: Building an evolution transformation library. In: Proceedings of the 12th International Conference on Software Engineering, ICSE 1990, pp. 238–248. IEEE Computer Society Press, Los Alamitos (1990)

37. Egyed, A.: Fixing inconsistencies in UML design models. In: ICSE, pp. 292–301 (2007)

38. Letier, E., Kramer, J., Magee, J., Uchitel, S.: Deriving event-based transition systems from goal-oriented requirements models. Autom. Softw. Eng. 15(2), 175–206 (2008)

39. Baresi, L., Pasquale, L.: Fuzzy goals for requirements-driven adaptation. In: 18th International IEEE Requirements Engineering Conference, RE 2010 (2010)

40. Morin, B., Fleurey, F., Bencomo, N., Jezequel, J.M., Solberg, A., Dehlen, V., Blair, G.: An Aspect-Oriented and Model-Driven Approach for Managing Dynamic Variability. In: Czarnecki, K., Ober, I., Bruel, J.-M., Uhl, A., Volter, M. (eds.) MODELS 2008. LNCS, vol. 5301, pp. 782–796. Springer, Heidelberg (2008)

41. Bencomo, N., Grace, P., Flores, C., Hughes, D., Blair, G.: Genie: Supporting the model driven development of reflective, component-based adaptive systems. In: ICSE 2008 - Formal Research Demonstrations Track (2008)

42. Bencomo, N., Blair, G.: Using Architecture Models to Support the Generation and Operation of Component-Based Adaptive Systems. In: Cheng, B.H.C., de Lemos, R., Giese, H., Inverardi, P., Magee, J. (eds.) Software Engineering for Self-Adaptive Systems. LNCS, vol. 5525, pp. 183–200. Springer, Heidelberg (2009)

43. Issarny, V., Bennaceur, A., Bromberg, Y.-D.: Middleware-Layer Connector Synthesis: Beyond State of the Art in Middleware Interoperability. In: Bernardo, M., Issarny, V. (eds.) SFM 2011. LNCS, vol. 6659, pp. 217–255. Springer, Heidelberg (2011)

44. Berry, D., Cheng, B., Zhang, J.: The four levels of requirements engineering for and in dynamic adaptive systems. In: 11th International Workshop on Requirements Engineering: Foundation for Software Quality (REFSQ 2005), Porto, Portugal (2005)


45. Morin, B., Barais, O., Nain, G., Jezequel, J.M.: Taming dynamically adaptive systems using models and aspects. In: International Conference on Software Engineering, ICSE (2009)

46. Sykes, D., Heaven, W., Magee, J., Kramer, J.: Exploiting non-functional preferences in architectural adaptation for self-managed systems. In: Proceedings of the 2010 ACM Symposium on Applied Computing, SAC 2010, pp. 431–438. ACM, New York (2010)

47. Lapouchnian, A.: Exploiting Requirements Variability for Software Customization and Adaptation. PhD thesis, University of Toronto (2011)

48. Silva Souza, V.E., Lapouchnian, A., Robinson, W.N., Mylopoulos, J.: Awareness requirements for adaptive systems. Technical report, University of Trento (2010)

49. Letier, E., van Lamsweerde, A.: Reasoning about partial goal satisfaction for requirements and design engineering. In: Proc. of 12th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 53–62 (2004)

50. DeLoach, S.A., Miller, M.: A goal model for adaptive complex systems. International Journal of Computational Intelligence: Theory and Practice 5(2) (2010)

51. Chen, B., Peng, X., Yu, Y., Zhao, W.: Are your sites down? Requirements-driven self-tuning for the survivability of web systems. In: 19th International Conference on Requirements Engineering (2011)

52. Whittle, J., Sawyer, P., Bencomo, N., Cheng, B.H.C., Bruel, J.M.: RELAX: a language to address uncertainty in self-adaptive systems requirements. Requir. Eng. 15(2), 177–196 (2010)

53. Ramirez, A.J., Jensen, A.C., Cheng, B.H.C., Knoester, D.B.: Automatically exploring how uncertainty impacts behavior of dynamically adaptive systems. In: 26th IEEE/ACM International Conference on Automated Software Engineering, ASE, pp. 568–571 (2011)


Dynamic Program Analysis for Database Reverse Engineering

Anthony Cleve, Nesrine Noughi, and Jean-Luc Hainaut

PReCISE Research Center, University of Namur, Belgium

{acl,nno,jlh}@info.fundp.ac.be

Abstract. The maintenance and evolution of data-intensive systems should ideally rely on complete and accurate database documentation. Unfortunately, this documentation is often missing or, at best, outdated. Database redocumentation, a process also known as database reverse engineering, then comes to the rescue. This process typically involves the elicitation of implicit schema constructs, that is, data structures and constraints that have been incompletely translated into the operational database schema. In this context, the SQL statements executed by the programs may be a particularly rich source of information. SQL APIs come in two variants, namely static and dynamic. The latter is intensively used in object-oriented and web applications, notably through the ODBC and JDBC APIs. While the static analysis of SQL queries has long been studied, coping with automatically generated SQL statements requires other weapons. This tutorial provides an in-depth exploration of the use of dynamic program analysis as a basis for reverse engineering relational databases. It describes and illustrates several automated techniques for capturing the trace of the SQL-related events occurring during the execution of data-intensive programs. It then presents and evaluates several heuristics and techniques supporting the automatic recovery of implicit schema constructs from SQL execution traces. Other applications of SQL execution trace analysis are also identified.

1 Introduction

Software system maintenance is mainly a matter of understanding. Beyond the preliminary steps, namely requirements collection, analysis, design and coding, most, if not all, after-birth system engineering processes require an in-depth understanding of each system component in order to refactor it, to make it evolve, to migrate it to a new platform or to integrate it into a larger system. While documenting was long claimed to be the first and most important activity in software engineering, we may observe that it has now become outdated, economically unsustainable and psychologically unbearable. Indeed, development teams seldom have time to write and maintain precise, complete and up-to-date documentation. Therefore, many complex software systems lack the documentation that would be necessary for their maintenance and evolution.

As a consequence, understanding the objectives and the internals of an existing (and undocumented) software artefact must be obtained in another way, mainly through the examination of the artefact itself. For example, careful analysis of the source code of a program leads to a deep understanding of how it works internally and, as a second stage, of its external functional and non-functional specifications.

The concept of artefact understanding, be it at the technical (internal) or conceptual (external) level, is the very objective of reverse engineering processes. "Reverse engineering is the process of analyzing a subject system to identify the system's components and their interrelationships and create representations of the system in another form or at a higher level of abstraction." [1]. The same definition applies to databases, human-computer interfaces, object class systems, web services, APIs and the like.

The problem of database reverse engineering happens to be particularly complex, due to prevalent development practices. First of all, many databases have not been developed in a disciplined way, that is, from a preliminary conceptual schema of the database structure and constraints. This was already true for old systems, but loose empirical design approaches remain widespread for modern databases due, notably, to time constraints, poor database education and the increasing use of object-oriented middleware that tends to consider the database as the mere implementation of program classes. Secondly, the logical (platform-dependent) schema, which is supposed to be derived from the conceptual schema and to translate all its semantics, generally misses several conceptual constructs. This is due to several reasons, among others the poor expressive power of legacy data modeling languages and the laziness, awkwardness or illiteracy of some programmers [2].

From all this, it results that the logical schema often is incomplete and that the DDL¹ code that expresses the database schema in physical constructs ignores important structures and properties of the data. The missing constructs are called implicit, in contrast with the explicit constructs that are declared in the DDL code. Several field experiments and projects have shown that as much as half of the semantics of the data structures is implicit. Therefore, merely parsing the DDL code of the database, or, equivalently, extracting the physical schema from the system tables, sometimes provides barely half the actual data structures and integrity constraints.

¹ Data Description Language, or Data Definition Language.

Fortunately, data-intensive software systems exhibit an interesting symmetrical property due to the mutual dependency of the database and the programs. When no useful documentation is available, it appears that (1) understanding the database schema is necessary to understand the programs and, conversely, (2) understanding what the programs are doing on the data considerably helps in understanding the properties of the data. Program source code has long been considered a complex but rich information source to redocument database schemas. Even in ancient programs based on standard file data managers, identifying and analysing the code sections devoted to the validation of data before storing them in a file allows developers to detect implicit constructs as important as actual record decomposition, uniqueness constraints, referential integrity or enumerated field domains. In addition, navigation patterns in source code can help identify such important constructs as semantic associations between record types.

When, as has been common for more than two decades, data are managed by relational database management systems (DBMS), the database/program interactions are performed through the SQL language and protocols. Based on the relational algebra and the relational calculus, SQL is a high-level language that allows programmers to describe in a declarative way the properties of the data they instruct the DBMS to provide them with. In contrast, navigational DMLs (also called one-record-at-a-time DMLs) access the data through procedural code that specifies the successive operations necessary to get these data. Therefore, a single SQL statement can be the declarative equivalent of a procedural section of several hundreds of lines of code. Understanding the semantics of an SQL statement is often much easier than that of this procedural fragment. The analysis of SQL statements in application programs is, unsurprisingly, a major program understanding technique in database reverse engineering [3–7].

In our previous work [7], we first proposed a static program analysis approach, taking as input the source code of the programs. This approach aims at detecting and exploiting the dataflow dependencies that hold within and between (successive) SQL queries. Industrial data reverse engineering projects have shown that this approach, and its supporting tools, allow the recovery of implicit knowledge on the database structures and constraints such as undeclared foreign keys, finer-grained decomposition and more expressive names for tables and columns [8].

Unfortunately, static program analysis techniques are, by nature, limited to what is statically detectable. In this tutorial, we will therefore explore the use of dynamic program analysis techniques for reverse engineering relational databases [9, 10]. Those techniques, taking program executions as main inputs, allow the analysis of data-intensive programs in the presence of automatically generated SQL queries (e.g., Java/JDBC or PHP/MySQL).

The tutorial notes are structured as follows. After describing some basic concepts in Section 2, we identify and describe, in Section 3, a set of techniques for capturing the SQL queries executed by an application program at runtime. Section 4 investigates the use of SQL execution trace analysis in support of implicit schema construct elicitation. It particularly focuses on undeclared foreign key detection, based on heuristics combining intra-query and inter-query dependencies. In Section 5, we present an initial case study, where those heuristics are used to detect implicit referential constraints in a real-life web application. Section 6 summarizes related work and further readings. Other possible applications of SQL execution trace analysis are identified in Section 7, and concluding remarks are given in Section 8.

2 Preliminaries

2.1 Implicit Schema Constructs

In order to illustrate the concept of implicit schema constructs, Figure 1 depicts three database schemas: (A) a conceptual schema expressed in the Entity-Relationship formalism, (B) a logical relational schema that would translate schema (A) with high fidelity, and (C) the actual logical relational schema, which hides three implicit foreign keys. Schema (C) corresponds to the data structures and constraints explicitly declared in the DDL code, the latter being the only available documentation of the database. Schemas (A) and (B) are not available (yet); they must be recovered through the database reverse engineering process.


Fig. 1. (A) a conceptual schema, (B) a logical relational schema equivalent to (A), and (C) a logical relational schema hiding three implicit foreign keys. (The schemas involve the tables CUSTOMER(code, name, address, phone), ORDERS(num, date, cuscode), DETAIL(prodref, ordnum, quantity) and PRODUCT(reference, price); in schema (B) the foreign keys ORDERS.cuscode -> CUSTOMER.code, DETAIL.ordnum -> ORDERS.num and DETAIL.prodref -> PRODUCT.reference are explicitly declared, whereas in schema (C) they are left implicit.)

2.2 Program-Database Dependencies

The programs that manipulate a database are strongly dependent on the (implicit) database structures and constraints. From a purely structural point of view, the database queries executed by the program should strictly comply with the database schema. For instance, each table name occurring in the from clause of an SQL query must correspond to a table declared in the schema. Similarly, each column name occurring in the select clause must correspond to a column of one of the tables referenced in the from clause.

More importantly, the database schema also influences the logic of the programs. First, it suggests a set of possible data navigation paths, mainly through (procedural) joins between inter-related tables. The cardinalities defined in the schema correspond to the presence of loops (multivalued values) or conditional statements (optional values). For instance, starting from a given CUSTOMER, you may search all her ORDERS, then, for each of them, you can retrieve all the ordered PRODUCTs, etc. Second, the schema imposes some constraints on the way programs can update data. The programs and the data management system usually share the responsibility of assessing data integrity. We can distinguish two main approaches for program-side validation:

– Reactive validation, where verification queries are systematically executed before executing a database modification that could challenge data consistency. If the verification query fails, then the data modification is not performed. For instance, the user must enter the code of the product she wants to order; if the specified code does not correspond to an existing product, the order is not inserted.

– Proactive validation, enforcing data integrity rules during the query construction process itself, in such a way that the executed database operations never violate integrity constraints. This strategy is typically implemented through user interface restrictions. For instance, the user must select the products she wants to order in a list.


A typical example is the management of implicit foreign keys between two relational tables. All the programs updating the contents of those tables must make sure to preserve data consistency against this implicit referential constraint. In our example of Figure 1(C), this should involve, for instance, the execution of a validation query selecting rows from table CUSTOMER (resp. ORDERS) before modification operations on table ORDERS (resp. CUSTOMER). As an illustration, Listing 2.1 shows a sample program fragment involving data validation. Before inserting a new row in table ORDERS, the program checks that the value of variable NewCusCode corresponds to an existing customer reference. Another example, in the other direction, would be the deletion of a customer. In that case, the program could follow the delete-no-action mode, by aborting the deletion of the customer in case the latter is still referenced by rows of table ORDERS.

Listing 2.1 Sample pseudo-code fragment with implicit referential constraint validation

select count(*) into :NbrCust
from CUSTOMER
where code = :NewCusCode;
if (NbrCust == 0){
  print('unknown customer !');
}
else{
  insert into ORDERS(num, date, cuscode)
  values(:NewNum, :NewDate, :NewCusCode);
}
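The delete-no-action behaviour described above can be sketched in the same pseudo-code style; this fragment is our own illustration rather than one of the original listings:

select count(*) into :NbrOrders
from ORDERS
where cuscode = :CusCode;
if (NbrOrders > 0){
  print('customer still referenced by orders !');
}
else{
  delete from CUSTOMER
  where code = :CusCode;
}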

2.3 Exploiting Program-Database Dependencies

Since, as just explained above, the programs make use of and manage (possibly implicit) schema constructs, their analysis may lead to the identification of lost knowledge on the schema. The most important implicit constructs that can be identified in relational schemas [11] include:

– Exact column structure. Compound and multivalued columns (that would have been translated into row and array complex types in SQL3) are often anonymously represented by the concatenation of their elementary values.

– Unique keys of tables. One or more columns implicitly serve as unique identifier.

– Unique keys of multivalued fields. This property is particularly important in implicitly strongly structured columns.

– Foreign keys. Each value of a column, or of a set of columns, is processed as a reference to a record in another file.

– Functional dependencies. The values of a column can depend on the values of other columns that have not been declared or elicited as a candidate key. This pattern has often been used in older databases for performance reasons.

– Value domains. A more precise definition of the domain of a field can be discovered by data and program analysis. Identifying enumerated domains is particularly important.


– Meaningful names. Proprietary naming standards (or, worse, the absence thereof) may lead to cryptic component names. However, the examination of the program variables and electronic form fields in/from which column values are moved can suggest more meaningful names.

2.4 Dynamic SQL

The SQL code fragments of Listing 2.1 are expressed in static SQL, a variant of the language in which the SQL statements are hard-coded in the source program. There is another family of SQL interfaces, called dynamic SQL, with which the SQL statements are built at runtime and sent to the database server through a specific API. Typically, these programs build each query as a character string, then ask the DBMS to prepare the query (i.e., to compile it) and finally execute it. The only moment at which the SQL query actually exists is at runtime, when, or just before, the query string is sent to the DBMS for compilation and/or execution.

Dynamic SQL, or call-level interface (CLI), was standardized in the eighties and implemented by most relational DBMS vendors. The most popular DBMS-independent APIs are ODBC, proposed by Microsoft, and JDBC, proposed by SUN. Dynamic SQL provides a high level of flexibility, but the application programs that use it may be difficult to analyse and to understand. Most major DBMSs, such as Oracle and DB2, include interfaces for both static and dynamic SQL.

The ODBC and JDBC interfaces provide several query patterns, differing notably in the binding technique. The most general form is illustrated in Listing 2.2. Line 1 creates a database connection con. Line 2 builds the SQL query in host variable query. This statement includes the input placeholder ?, which will be bound to an actual value before execution. Line 3 creates and prepares statement stmt from string query. This statement is completed in Line 4, by which the first (and unique) placeholder is replaced by the value of variable cusCode. The statement can then be executed (Line 5), which creates the resulting set of rows rset. Method next of rset positions its cursor on the first/next row (Line 6), while Line 7 extracts the first (and unique) output value specified in the query select list and stores it in host variable orderNum.

Listing 2.2 Standard JDBC database interaction

1 Connection con = driverMgr.getConnection(url, login, passwd);
2 String query = "select Num from ORDERS where cuscode = ?";
3 PreparedStatement stmt = con.prepareStatement(query);
4 stmt.setInt(1, cusCode);
5 ResultSet rset = stmt.executeQuery();
6 while (rset.next()){
7   orderNum = rset.getInt(1);
    ...
  }


3 SQL Statement Capturing Techniques

In this section, we identify and describe six different techniques to capture the SQL statements executed by an application program. These techniques are intended to capture the behaviour of the client/server system at runtime. More detail about them can be found in [8].

DBMS logs. The easiest technique is to use the query logs that are provided by the database management system. Most database servers store the requests received from the client application programs in a specific file or table. For example, MySQL writes, in the order it received them, all the client queries in its general query log. Each record comprises the client process id, the timestamp of query reception and the text of the query as it was executed, the input variables being replaced with their instant values. As compared to the program instrumentation technique, the DBMS log does not provide program points and can be processed off-line only. This technique does not require any modification or recompilation of the program source code. However, the trace is usually poor: it contains the executed queries, but it includes neither the results of those queries nor any source code location information allowing the executed queries to be mapped to the program source code files. Also, the trace is possibly polluted by queries that are not interesting for the reverse engineer, such as the queries accessing the system tables.
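For concreteness, with MySQL the general query log can be switched on at runtime roughly as follows; this is our own sketch, the log file path is hypothetical, and the exact layout of the log lines varies across MySQL versions:

SET GLOBAL general_log_file = '/var/log/mysql/query.log';
SET GLOBAL general_log = 'ON';

An excerpt of the resulting trace might then resemble:

080620 13:05:42    12 Query    select num from ORDERS where cuscode = 'C400'
080620 13:05:42    12 Query    select name, address from CUSTOMER where code = 'C400'

Each line carries the reception time, the connection (thread) id, the command type and the query text with the instant values of its input variables.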

Tracing stored procedures. SQL procedures are centrally stored in the database and can be invoked by any client program. By building SQL procedures equivalent to the most used SQL queries, and by replacing some or all SQL statements in programs by the invocation of these SQL procedures, we provide an ad hoc API that can be augmented with tracing instructions that log SQL statement executions. This technique can be considered in architectures that already rely on SQL procedures. When client programs include explicit SQL statements, it entails in-depth and complex code modification. However, since it replaces complex input and output variable binding with mere procedure arguments, this reengineering can lead to better code that will be easier to maintain and evolve.
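A minimal sketch of such a tracing procedure is given below; it is not part of the original tutorial, and the procedure name and the SQL_TRACE table are hypothetical. The procedure logs its own invocation in a trace table before running the wrapped query:

create procedure ordersOfCustomer(in p_cuscode char(12))
begin
  -- record the statement instance with its instant parameter value
  insert into SQL_TRACE(logged_at, statement_text)
  values (now(), concat('select num from ORDERS where cuscode = ', p_cuscode));
  -- execute the wrapped query
  select num from ORDERS where cuscode = p_cuscode;
end

A client program then simply calls ordersOfCustomer(...) instead of embedding the query, so the tracing logic stays on the database side.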

API substitution. If the source code of the client-side database API is available, which is the case for the ODBC and JDBC drivers of open source DBMSs, additional tracing statements can be inserted directly in this API. The API can then be recompiled and bound to the client applications. The latter need not be modified. This technique allows the recording of the SQL statement instances as well as their results, but it cannot log any source code location information.

API overloading. In case the client-side API is not open source, the API overloading technique consists in encapsulating (part of) the API within dedicated classes which provide similar public methods but produce, in addition, the required SQL execution trace. For instance, we could write our own Statement, PreparedStatement and ResultSet classes which, in turn, make use of the corresponding JDBC classes, as shown in Listing 3.1. This technique allows the production of SQL execution traces of similar richness to the ones obtained with the API substitution technique, but it requires some minor program adaptation (illustrated in Listing 3.2) and recompilation.

Listing 3.1 Creating an intermediate API for API overloading.

package myAPI;
...
public class Statement{
  java.sql.Statement stat;
  public Statement(java.sql.Connection con){
    stat = con.createStatement();
  }
  public ResultSet executeQuery(String sql){
    log.traceQuery(sql);
    return new ResultSet(stat.executeQuery(sql));
  }
  ...
}

Listing 3.2 Program adaptation for API overloading.

import java.sql.Statement;                     import myAPI.Statement;
import java.sql.ResultSet;                     import myAPI.ResultSet;
...                                    ==>     ...
Statement sta = con.createStatement();         Statement sta = new Statement(con);
ResultSet rsl = sta.executeQuery(q);           ResultSet rsl = sta.executeQuery(q);

Program instrumentation. The capture of a dynamic SQL statement is performed by a dedicated code section inserted before the program point of this statement. Similarly, the result of an SQL statement will be captured by a code section inserted after it. This technique requires code analysis to identify and decode database API statements and entails source code modification and recompilation. It provides a temporal list of statement instances. In the example of Listing 3.3, the tracing statement writes in the log file the source code location id (132), the event type (SQLexec), the statement object id (hashCode()) followed by the SQL statement or the output variable contents. According to the information that needs to be extracted from the trace, the program id, process id and/or timestamp can be output as well.

Aspect-based tracing. Aspect-based tracing consists in specifying the tracing functionality separately by means of aspects, without any alteration of the original source code. An aspect typically consists of pointcuts and associated advices, which can be seen as program-side triggers. A pointcut picks out certain join points in the program flow, which are well-defined moments in the execution of a program, like method call, method execution or object instantiation. An advice is associated to a pointcut. It declares that certain code should execute at each of the join points specified by the pointcut. The advice code can be executed before, after, or around the specified join point. Aspect-oriented support is available for several programming languages, among which Java, C, C++, ... and COBOL [12].


Listing 3.3 Logging SQL operations using program instrumentation.

Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery(SQLstmt);
SQLlog.write("132;SQLexec;"+stmt.hashCode()+";"+SQLstmt);
rs.next();
vName = rs.getString(1);
SQLlog.write("133;SQLgetS1;"+rs.getStatement().hashCode()+";"+vName);
vSalary = rs.getInt(2);
SQLlog.write("134;SQLgetI2;"+rs.getStatement().hashCode()+";"+vSalary);

Listing 3.4 shows a simple tracing aspect, written in AspectJ [13], that logs (1) the execution of SQL queries (without statement preparation²), and (2) the extraction of the query results. The first pointcut (lines 5-6) refers to the invocation of method executeQuery of class Statement. Before each occurrence of query execution, the advice (lines 8-13) writes a log entry indicating (1) the class name, (2) the line of code, (3) the object identifier of the statement and (4) the query string itself. The second pointcut (lines 15-16) is associated to a get method invoked on an instance of class ResultSet. The corresponding advice (lines 18-27) logs (1) the source code location, (2) the statement object identifier, (3) the name of the get method invoked, (4) the name or index of the corresponding column and (5) the result value itself.

Listing 3.4 Tracing SQL operations with aspects

1  public aspect SQLTracing {
2
3    private MySQLLog log = new MySQLLog();
4
5    pointcut queryExecution(String query):
6      call(ResultSet Statement.executeQuery(String)) && args(query);
7
8    before(String query): queryExecution(query){
9      String file = thisJoinPoint.getSourceLocation().getFileName();
10     int LoC = thisJoinPoint.getSourceLocation().getLine();
11     Statement statement = (Statement) thisJoinPoint.getTarget();
12     log.traceQuery(file, LoC, statement.hashCode(), query);
13   }
14
15   pointcut resultExtraction(ResultSet rSet) :
16     call(** ResultSet.get*(**)) && target(rSet);
17
18   Object around(ResultSet rSet) throws SQLException : resultExtraction(rSet){
19     String file = thisJoinPoint.getSourceLocation().getFileName();
20     int LoC = thisJoinPoint.getSourceLocation().getLine();
21     String methodName = thisJoinPoint.getSignature().getName();
22     Object colNameOrInd = thisJoinPoint.getArgs()[0];
23     Object res = proceed(rSet);
24     Statement stat = rSet.getStatement();
25     log.traceResult(file, LoC, stat.hashCode(), methodName, colNameOrInd, res);
26     return res;
27   }
28 }

² The way of tracing SQL query executions is a little bit more complicated in the presence of statement preparation. We refer to [9] for further details.
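As a rough illustration of what footnote 2 alludes to, and not as the approach of [9], a hedged AspectJ sketch for the prepared-statement case could capture the query text at preparation time and associate it with the statement object; the aspect name and the logging call are our own assumptions:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.IdentityHashMap;
import java.util.Map;

public aspect PreparedSQLTracing {

  // remembers the SQL text passed at preparation time, per statement object
  private Map<PreparedStatement, String> queries =
      new IdentityHashMap<PreparedStatement, String>();

  pointcut preparation(String query):
    call(PreparedStatement Connection.prepareStatement(String)) && args(query);

  after(String query) returning (PreparedStatement stmt): preparation(query){
    queries.put(stmt, query);
  }

  pointcut execution(PreparedStatement stmt):
    call(ResultSet PreparedStatement.executeQuery()) && target(stmt);

  before(PreparedStatement stmt): execution(stmt){
    // the instant values bound through set*(..) calls would still have to be
    // captured by additional pointcuts to rebuild the complete statement instance
    System.out.println("executing: " + queries.get(stmt));
  }
}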


4 SQL Trace Analysis for Database Reverse Engineering

Examination of static SQL queries is one of the most powerful techniques to elicit implicit database schema constructs and constraints, among which undeclared foreign keys, identifiers and functional dependencies [14–17]. In this section, we discuss the use of dynamic analysis of automatically generated SQL queries for such a database reverse engineering task. In particular, we show how SQL execution traces can serve as a basis to formulate hypotheses on the existence of undeclared foreign keys³. These hypotheses will still need to be validated afterwards (e.g., via data analysis or user/programmer interviews).

We distinguish two different approaches to detect implicit referential constraints between columns of distinct or identical tables. One can observe either the way such referential constraints are used or the way they are managed.

– Referential constraint usage consists in exploiting the referential constraint. For instance, within the same execution, an output value o1 of an SQL statement s1 querying table T1 is used as an input value of another SQL statement s2 accessing another table T2. A more direct usage of a foreign key consists of a join of T1 and T2 within a single query. In both cases, this suggests the existence of an implicit foreign key between tables T1 and T2.

– Referential constraint management aims at verifying that the referential constraint keeps being respected when updating the database. For instance, before modifying the contents of a table T2 (by an insert or update statement s2), the program executes a verification query q1 on table T1. According to the result of q1, s2 is executed or not. When both q1 and s2 are executed, they include at least one common input value. Similarly, when deleting a row of a table T1 using a delete statement d2, one observes that the program also deletes a possibly empty set of rows of another table T2 via another delete statement d1 (procedural delete cascade).

4.1 Heuristics for Implicit Foreign Key Detection

We have identified three main heuristics for implicit foreign key constraint detection from SQL execution traces, namely joins, output-input dependency and input-input dependency. Below, we further define and illustrate them.

Notations. Let q be an SQL query occurring in an execution trace. Let t be a table of a relational database.

– q.match⁴ denotes the set of couples of columns (c1, c2) whose values are matched in an equality condition of q;
– q.in denotes the set of input values of q;
– q.out denotes the set of output values of q;
– q.seq denotes the sequence number of q in its trace;
– t.cols denotes the set of columns of t.

³ However, a similar discussion can be developed for the other implicit constructs.
⁴ The q.match relationship is symmetric.


Joins. Most SQL joins rely on the matching of a foreign key with its target primary key. For instance, let us consider the following query, which one could find in an SQL execution trace:

select num, cuscode, name
from ORDERS, CUSTOMER
where code = cuscode

It corresponds to a standard join, where several table names appear in the from clause of the query. It combines the rows of those tables, typically based on a join condition which expresses the equality of the primary key (code) and foreign key (cuscode) values. When primary keys have been recovered, from the physical schema or as implicit constructs, then the join conditions provide strong evidence for implicit foreign keys.

Definition 1. An SQL query q contains a join of two tables t1 and t2 iff ∃(c1, c2) ∈ q.match such that c1 ∈ t1.cols ∧ c2 ∈ t2.cols.

It is important to mention that not all SQL joins necessarily correspond to the matching of a foreign key value with a primary key value. Several counter-examples can be observed. A typical case consists in joining two tables on their foreign keys (none of them being a candidate key) to a third table, a dubious pattern known as the connection trap [18].
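With the schema of Figure 1(C), for instance, the following query (our own illustration) joins two occurrences of DETAIL on their foreign key columns to PRODUCT in order to find pairs of orders containing the same product; the matched columns reference the same target table, but no foreign key holds between them, so Definition 1 alone would raise a false hypothesis here:

select D1.ordnum, D2.ordnum
from DETAIL D1, DETAIL D2
where D1.prodref = D2.prodref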

Output-Input Dependency. An output-input dependency occurs when an SQL query uses as input some of the results of a previous SQL query of the same program execution.

Definition 2. An SQL query q2 is output-input dependent on another SQL query q1 iff q2.in ∩ q1.out ≠ ∅ ∧ q2.seq > q1.seq.

In the presence of foreign keys, be they implicit or not, output-input dependencies can be observed in several navigational programming patterns. For instance, in a procedural join between the source and target tables of a foreign key, the value of the foreign key column(s) is used to retrieve the target row using a subsequent query. Conversely, the value of the identifier of a given target row can be used to extract all the rows referencing it. For instance, the program retrieves a customer, before searching for all her recent orders.

Listing 4.1 shows an example of an output-input dependency between two successive SQL queries. In this example, the program retrieves the name and address of the customer who placed a particular order it has just retrieved. We see that the output value of column ORDERS.cuscode of the first query is the same as the input value of column CUSTOMER.code in the second query.

Input-Input Dependency. In an SQL execution trace, an input-input dependency holds between two successive SQL queries that share common input values.

Definition 3. An SQL query q1 is input-input dependent on another SQL query q2 iff q1.in ∩ q2.in ≠ ∅.


Listing 4.1 An SQL execution trace fragment with an output-input dependency

...
select cuscode from ORDERS O where num = 5789
getInt(1) = C400
select name, address from CUSTOMER where code = C400
...

The presence of input-input dependencies in SQL execution traces constitutes another strong indicator of the presence of foreign keys. Several data manipulation patterns for referential constraint management make intensive use of input-input dependent queries. Among the most popular examples, the delete cascade mechanism, which consists in deleting all referencing rows before deleting a target row, makes use of delete queries that share a common input value: the primary/foreign key value of the target rows to be deleted.

A second example is the check-before-insert pattern, which aims at preserving a referential integrity constraint when inserting rows in the database. When inserting a row in a referencing table, the program first checks that the provided foreign key value is valid, i.e., that it corresponds to the primary key value of an existing row in the target table. Similar patterns can be observed in delete and update procedures.

Listing 4.2 An SQL execution trace fragment with an input-input dependency

...
select count(*) from CUSTOMER where code = C400
getInt(1) = 1
insert into ORDERS(num, date, cuscode) values (456, '2008-06-20', C400)
...
select count(*) from CUSTOMER where code = C152
getInt(1) = 0
...
select count(*) from CUSTOMER where code = C251
getInt(1) = 1
insert into ORDERS(num, date, cuscode) values (457, '2008-06-20', C251)
...

As an illustration, we consider the execution trace given in Listing 4.2. This trace strongly suggests the existence of an implicit foreign key between column cuscode of table ORDERS and column code of table CUSTOMER. Indeed, each row insertion in table ORDERS is preceded by the execution of a validation query that (1) counts the number of rows of table CUSTOMER having c as value of column code – where c corresponds to the value of column cuscode of the inserted row of ORDERS – and (2) returns 1 as a result. In other words, the program checks that the provided value of column cuscode does correspond to the primary key (code) value of an existing customer. This SQL trace fragment is actually an instance of the check-before-insert pattern described above.


4.2 On-the-Fly Query Imbrication Detection

Dynamic analysis may also be used to automatically detect potential dependencies between successive query executions on-the-fly. A possible technique, summarized in Listing 4.3, consists in analyzing the imbrication relationship between the SQL statements. A query q2 is said to be nested if its execution is performed before the result set of the preceding query q1 has been completely emptied. Such a situation strongly suggests that a data dependency holds between the output values of q1 and the input values of q2 or, in other words, that q2 is most probably output-input dependent on q1.

Detecting nested queries can be performed as follows. During the execution of the program, one maintains a stack of executed queries. Each time a new query is executed, the latter is pushed on top of the stack. Conversely, each time a query result set becomes empty (i.e., when the invocation of ResultSet.next() returns false), one removes the top element from the query stack. Thus, if the stack is not empty when executing a new query q, it means that q is nested. We can then make the hypothesis that there exists an output-input dependency between the query at the top of the current stack and query q. Unfortunately, this technique fails in case the program does not use the complete result set of (some of) the queries it executes. However, such cases can be considered infrequent.

Listing 4.3 On-the-fly detection of nested SQL queries

1. Before each invocation to Statement.executeQuery(q2):

Require: A query (q2), a stack of executed queries with non-empty result set (queryStack).
Ensure: Detect the imbrication of q2, if any.
  if queryStack ≠ ∅ then
    q1 ← top(queryStack)
    reportImbrication(q1, q2)
  end if
  queryStack ← push(queryStack, q2)

2. After each invocation to ResultSet.next():

Require: A result set (r), a stack of executed queries (queryStack).
Ensure: A possibly updated query stack.
  if r.next() == false then
    queryStack ← pop(queryStack)
  end if
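A possible Java rendering of Listing 4.3 is sketched below, under the assumption that the two methods are invoked by the tracing instrumentation (for instance an aspect woven around Statement.executeQuery() and ResultSet.next()); the class and method names are ours, not part of the original tool.

import java.util.ArrayDeque;
import java.util.Deque;

public class NestedQueryDetector {

  // Stack of queries whose result sets have not been exhausted yet.
  private final Deque<String> queryStack = new ArrayDeque<>();

  // Step 1: called before each Statement.executeQuery(q2).
  public void beforeExecuteQuery(String q2) {
    if (!queryStack.isEmpty()) {
      // q2 starts while the top query's result set is still open:
      // report a potential output-input dependency between the two.
      reportImbrication(queryStack.peek(), q2);
    }
    queryStack.push(q2);
  }

  // Step 2: called after each ResultSet.next(), with the value it returned.
  public void afterResultSetNext(boolean hasNext) {
    if (!hasNext && !queryStack.isEmpty()) {
      queryStack.pop(); // the top query's result set has been completely emptied
    }
  }

  private void reportImbrication(String q1, String q2) {
    System.out.println("nested query detected: [" + q2 + "] inside [" + q1 + "]");
  }
}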

As an example, let us consider Listing 4.4, where the query executed at line 21 (q2) is nested with respect to the query executed at line 6 (q1). We notice that the output value of the first executed query (stored in variable code) is used as input value of the second query (via variable cusCode). This potential output-input dependency will be detected by our algorithm of Listing 4.3, since q1 will be on the top of the query stack when executing q2. Indeed, each execution of q2 originates from an iteration of


the while loop in lines 7–10, the body of which aims at processing a row belonging to the result set of q1.

Listing 4.4 Example of nested queries

1  public Vector<String> getOrdersOfCity(String vCity) throws SQLException{
2    Vector<String> ordersOfCity = new Vector<String>();
3    String query = "select code from CUSTOMER where city = ?";
4    PreparedStatement stmt = con.prepareStatement(query);
5    stmt.setString(1, vCity);
6    ResultSet rsl = stmt.executeQuery();
7    while (rsl.next()){
8      int code = rsl.getInt("code");
9      ordersOfCity.addAll(getOrdersFromCustomer(code));
10   }
11   rsl.close();
12   stmt.close();
13   return ordersOfCity;
14 }
15
16 public Vector<String> getOrdersFromCustomer(int cusCode) throws SQLException{
17   Vector<String> ordersOfCust = new Vector<String>();
18   String query = "select num, date from ORDERS";
19   query = query + " where cuscode = " + cusCode;
20   Statement stmt = con.createStatement();
21   ResultSet rsl = stmt.executeQuery(query);
22   while (rsl.next()){
23     int ordNum = rsl.getInt(1);
24     String ordDate = rsl.getString("date");
25     ordersOfCust.add("num = " + ordNum + ", cuscode = " + cusCode + ", date = " + ordDate);
26   }
27   rsl.close();
28   stmt.close();
29   return ordersOfCust;
30 }

5 Initial Case Study

In this section, we briefly present an initial case study evaluating the use of dynamic program analysis for implicit foreign key detection5. The study is based on WebCampus6, an e-learning application that is in use at the University of Namur. This application is written in PHP and manipulates a MySQL database. It consists of more than 1000 source code files, with a total size of about 460 KLOC. The main database of WebCampus is made up of 33 tables and 198 columns, used to store data about university faculties and departments, the available online courses they offer, the course users, etc.

The DDL code of the database does not explicitly declare any foreign key, due to the use of the MyISAM storage engine, which does not support foreign key management. However, the WebCampus developers – one of them participated in the case study – know all the implicit foreign key constraints. So, in this case study, the target of the reverse engineering process is known in advance, in such a way that we can rigorously evaluate the effectiveness of dynamic program analysis as a basis for the detection of implicit foreign keys.

5 More details about this case study are available in [10].
6 See http://webcampus.fundp.ac.be


The case study involves the following two main steps. First, a set of SQL execution traces corresponding to typical interaction scenarios within WebCampus is collected. Second, these SQL execution traces are analyzed in order to detect implicit foreign key candidates.

Trace Collection. The collected SQL traces report on 14 distinct execution scenarios, which translate the most typical operations carried out by WebCampus users on a regular basis. Trace collection was achieved via source code instrumentation. This proved straightforward since only a few source code modules are in charge of accessing the WebCampus database7. The collected traces are stored in a relational database composed of two tables: one containing all the SQL queries executed during each scenario, and one containing all the results of those queries. Table 1 provides some figures about the traces obtained. It indicates, for each scenario, the number and the nature of the corresponding queries and query results.

Table 1. Size of collected SQL traces, by execution scenario.

Execution scenario        # of queries  # of select  # of insert  # of delete  # of update  # of query results

register user                       27           24            3            0            0                 163
add course manager                 194          190            4            0            0               2 391
add course user                    155          151            4            0            0               1 908
create course                       29           20            9            0            0                 299
delete course                      132          123            1            7            0               1 700
delete course user                  84           83            0            1            0                 996
delete department                   37           37            0            0            0                 419
install applet                      88           82            4            0            2                 721
install tool                     2 169        2 039          126            4            0              24 180
uninstall applet                    78           68            0            9            1                 573
uninstall tool                   1 896        1 888            0            8            0              22 419
register to a course                64           63            1            0            0                 708
register to Webcampus               32           30            2            0            0                 184
unregister from course              19           17            1            1            0                 155

Total                            5 004        4 815          155           30            3              56 816

Trace Analysis. The goal of the trace analysis process was to find indications of undeclared foreign keys in the SQL execution traces. This process took the physical schema of the main WebCampus database (without any foreign key) as a basis to systematically analyze the contents of the collected traces. This analysis process was supported by a dedicated trace analyzer, implemented as a Java plugin of the DB-MAIN CASE environment [19]. This plugin takes as input (1) a relational database schema and (2) a set of SQL traces stored in a relational database in the format described above. The analyzer returns a set of potential implicit foreign keys together with the number of corresponding joins, output-input dependencies, and input-input dependencies between reading queries, that can be found in the input execution traces.

7 But those data access modules are called from almost all modules of the application.


Results. In order to evaluate the recall of our dynamic analysis technique, we compare the output produced by the trace analyzer with the set of known implicit foreign keys of the main WebCampus schema. In order to better interpret this recall value, we first need to evaluate the richness of the collected SQL traces. The left part of Table 2 indicates, for each implicit foreign key fk from table t1 to table t2, (1) the number of queries referencing t1, (2) the number of queries referencing t2, and (3) the number of distinct scenarios where both t1 and t2 are accessed. From the latter, we derive that only 27 implicit foreign keys of the schema were potentially detectable in the SQL traces considered. Indeed, the minimal requirement for detecting an undeclared foreign key t1 → t2 in an SQL trace is that both t1 and t2 must be involved in at least one execution scenario considered. If this is the case, then the SQL trace obtained can contain indications of the foreign key. The right part of Table 2 summarizes the indications of implicit foreign keys that have been found in the SQL trace. For each undeclared foreign key (t1 → t2), we provide (1) the number of SQL joins between t1 and t2, (2) the number of output-input dependencies between a query q1 accessing t2 (resp. t1) and a subsequent query q2 accessing t1 (resp. t2), and (3) the number of input-input dependencies between a query q1 accessing t2 (resp. t1) and a subsequent select query q2 accessing t1 (resp. t2).

From these statistics, we observe that we found evidence for 23 implicit foreign keys (those with a check mark in the first column of Table 2), which represents a recall of about 65% of the total number of implicit foreign keys in the main WebCampus database. This represents about 85% of the foreign keys identified as potentially detectable in the collected traces. Let us now evaluate the precision that we reached in this case study. We also extracted three unexpected foreign key hypotheses, all based on the presence of SQL joins. It turned out that two of them correspond to actual implicit foreign keys that were not part of the list initially provided by the WebCampus developers. The third hypothesis is erroneous and therefore constitutes a false positive. Several joins are made between tables notify and course_user based on their respective column user_id. Both notify.user_id and course_user.user_id reference a third table user, as detected by our analyzer (see Table 2). This case actually corresponds to an instance of the connection trap pattern described above.

In summary, the use of dynamic analysis allowed us to correctly detect 25 implicit foreign keys in the database schema of WebCampus, which corresponds to a recall of 71% (25/37). Considering that only 29 implicit foreign keys were potentially detectable (covered by our execution scenarios), we would reach a recall of 86% (25/29). In terms of precision, only one hypothesis turned out to be erroneous (a false positive), which results in a precision of 96% (25/26). Such false-positive foreign keys do not pose a critical problem in practice since they would most probably be invalidated by other techniques, such as schema and data analysis.

Discussion. The results obtained during the case study clearly confirm that SQL execution traces may contain indications of the presence of implicit foreign keys. They also suggest that dynamic analysis of SQL statements can support the detection of implicit foreign keys with a satisfactory level of recall and precision.


Table 2. SQL trace analysis results (recall)

Columns: (a) # of queries accessing t1; (b) # of queries accessing t2; (c) # of execution scenarios involving both t1 and t2; (d) # of joins between t1 and t2; (e) # of output-input dependencies query(t1) ↔ query(t2); (f) # of input-input dependencies query(t1) ↔ query(t2). A check mark (✓) in the first column indicates an implicit foreign key for which evidence was found.

  Implicit foreign key (t1 → t2)                 (a)    (b)   (c)    (d)    (e)   (f)
  class → class                                    0      0     0      0      0     0
✓ course → faculty                              1927   1840    12   1832     87     0
✓ course → right_profile                        1927     51    12      0      1     0
✓ course_user → course                            55   1927     9      9     47     5
✓ course_user → right_profile                     55     51     9      0      9     0
✓ course_user → user                              55     41     9      7     50     4
  desktop_portlet_data → user                      0     41     0      0      0     0
✓ dock → module                                  244    496    14     58    297     0
✓ faculty → faculty                             1840   1840    12      3      0     0
✓ course_addons → course                        1838   1927     9      0   3842     0
  course_program → course                          0   1927     0      0      0     0
✓ user_addons → user                              11     41     3      0     15     0
  user_addons → program                           11      5     3      0      0     0
✓ im_message → course                              3   1927     3      0      5     0
✓ im_message_status → user                         4     41     3      0      2     0
  im_message_status → im_message                   4      3     3      0      0     2
✓ im_recipient → user                              4     41     3      0      2     0
  im_recipient → im_message                        4      3     3      0      0     6
✓ log → user                                       3     41     3      0      1     0
✓ module_contexts → module                        38    496    14     34     11     0
✓ module_info → module                            17    496     4     13     11     0
✓ notify → course                                  9   1927     6      0      3     0
✓ notify → user                                    9     41     6      0     11     0
  notify → course_tool                             9    452     2      0      0     0
  property_definition → user_property              0      1     0      0      0     0
  rel_class_user → user                            0     41     0      0      0     0
  rel_class_user → class                           0      0     0      0      0     0
  rel_course_class → class                         1      0     0      0      0     0
✓ rel_course_class → course                        1   1927     1      0      3     0
✓ right_rel_profile_action → course              161   1927     6      0     22     0
✓ right_rel_profile_action → right_profile       161     51     7      0    711     0
✓ right_rel_profile_action → right_action        161    178     7     32    810     8
  sso → user                                       0     41     0      0      0     0
✓ tracking_event → user                            3     41     2      0      2     0
✓ user_property → user                             1     41     1      1      0     0

Notation: query(t) = arbitrary SQL query on table t; select(t) = select query on table t.


This case study allows us to say that if there is an implicit foreign key, then the SQL execution traces will most probably contain indications of this foreign key, provided that both tables involved appear in the execution scenarios considered. As usual in dynamic program analysis, the main difficulty remains the identification of relevant execution scenarios based on the knowledge of the analyzed system. In our case, we considered a large set of scenarios in order to reach a good coverage of the database schema. However, the set of selected execution scenarios was not sufficient to reach a recall of 100%. This can be explained by several non-exclusive reasons. First, nothing guarantees that the WebCampus application actually exploits all the structures of the database. Second, we did not consider all possible execution scenarios of WebCampus. Third, it is likely that we did not cover all possible interesting execution paths of each execution scenario considered. In the context of this study, an execution path is considered as interesting if it involves the execution of successive inter-dependent database queries accessing different tables. Precisely evaluating the coverage of execution scenarios and input data is a non-trivial problem that has been extensively studied [20]. Several techniques have been proposed to support this evaluation in the particular case of data-intensive applications [21]. Last, even for each execution path we followed, it is obvious that we did not consider all possible combinations of input data and database state.

6 Related Work and Further Readings

In this section, we briefly summarize some related work and/or further readings with respect to this tutorial.

Database Reverse Engineering. The need for precise methods to reconstruct the documentation of a database has been widely recognized in the eighties under the name database reverse engineering. The first approaches were based on simple rules that work nicely with databases designed in a systematic way [22–25]. A second generation of methodologies coped with physical schemas resulting from empirical design, in which practitioners tend to apply non-standard and undisciplined techniques. More complex design rules were identified and interpreted [2], and, based on them, structured and comprehensive approaches were developed [26] while the first industrial tools appeared (e.g., Bachman's Reengineering Tool). Many contributions were published in the nineties, addressing practically all the legacy technologies and exploiting such sources of information as application source code and database contents. Among synthesis publications, we mention [27], the first tentative history of this discipline.

These second generation approaches were faced with two kinds of problems induced by empirical design. The first problem is the one we addressed in this tutorial: the recovery of implicit constructs [11, 14, 15, 28–31]. Their elicitation requires the analysis of such complex information sources as the source code of the application programs (and particularly the DML8 statements), the contents of the database, the user interface and, as in this tutorial, the SQL execution trace. The second problem is that of the semantic interpretation of logical schemas that may include non-standard data structures.

8 Data Manipulation Language.


Implicit Construct Elicitation. Discovering implicit constructs is usually based on ad hoc techniques depending on the nature and reliability of the information sources. In this tutorial, we have considered the use of dynamic program analysis techniques, combining the capture and analysis of SQL execution traces. Other techniques do exist, among which:

– Schema analysis [24, 32, 33]. Spotting similarities in names, value domains and representative patterns may help identify hidden constructs such as foreign keys.

– Data analysis [34–37]. Mining the database contents can be used in two ways. Firstly, to discover implicit properties, such as functional dependencies and foreign keys. Secondly, to check hypothetical constructs that have been suggested by the other techniques. Considering the combinatorial explosion that threatens the first approach, data analysis is most often applied to check the existence of formerly identified patterns.

– Screen/report layout analysis [38–40]. Forms, reports and dialog boxes are user-oriented views on the database. They exhibit spatial structures (e.g., data aggregates), meaningful names, explicit usage guidelines and, at runtime, data population and error messages that, combined with dataflow analysis, provide much information on hidden data structures and properties.

– Source code analysis [3, 7, 41, 42]. Even simple analyses, such as dataflow graph exploration, can bring valuable information on field structure and meaningful names. More sophisticated techniques such as dependency analysis and program slicing can be used to identify complex constraint checking or foreign keys. SQL statement examination is one of the most powerful variants of source code analysis.

It is important to note that none of the above techniques can guarantee in an absolute way the presence or the absence of implicit schema constructs, but they can all contribute to a better knowledge of the hidden components and properties of a database schema.

SQL Statement Analysis. Most of the previous approaches to SQL statement analysis [3, 6, 7, 43] rely on static program analysis techniques. Petit et al. [3] present a technique for extracting an Entity-Relationship schema from an operational relational database. The enrichment of the raw schema benefits from the analysis of the SQL queries available in the application programs. In particular, joins are seen as heuristics for the detection of implicit dependencies between the columns of distinct tables. Willmor et al. [6] propose an approach to program slicing in the presence of database states. In particular, they introduce two forms of data dependencies related to database queries. The first category, called program-database dependencies, accounts for interactions between program statements and database statements. The database-database dependencies capture the situation in which the execution of a database statement affects the behaviour of another database statement. We also consider the latter kind of dependencies, but they are extracted from SQL execution traces rather than from the source code. van den Brink et al. [43] present a tool-supported method for quality assessment of SQL statements. The initial phase of the method consists in extracting the


SQL statements from the source code using control and dataflow analysis techniques. Similarly, Ngo and Tan [44] make use of symbolic execution to extract database interaction points from web applications. Based on a case study, they showed that their method is able to extract about 80% of such interactions.

Dynamic Program Analysis. Dynamic program analysis has long been considered a valuable technique for supporting program understanding tasks. We refer to [45] for a recent survey of this active research domain. In contrast, the use of dynamic program analysis in the context of database understanding has not been sufficiently investigated so far. Our recent work [9, 10], which we have summarized and further illustrated through this tutorial, modestly aims at making a first step in this direction.

Dynamic Analysis of SQL Statements. Dynamic analysis of SQL statements has already been used by other authors, but for different purposes than database reverse engineering. Debusmann and Geihs [46] present an aspect-based method for the instrumentation of application components. This method is used in the context of runtime system monitoring. They measure, in particular, the response time of database queries. Del Grosso et al. [47] propose an approach to identifying application features in data-intensive programs, with the ultimate goal of exporting those features as services. The method consists in clustering the set of SQL queries collected from program-database interactions. Yang et al. [48] make use of the aspect-based tracing method introduced in [9] to support feature model recovery. Their experiments show that static analysis techniques would have been inapplicable in this context. The WAFA approach [49], by Alalfi et al., is dedicated to program comprehension. It combines static and dynamic program analysis techniques for achieving a fine-grained analysis of database interactions in web applications. The key idea is to automatically recover the link between SQL execution instances and the original statement source. A similar method was proposed in previous work [9] in the case of Java systems.

7 Future Research Directions

This tutorial has illustrated the use of dynamic analysis of SQL statements in support of database reverse engineering. In fact, analyzing SQL execution traces has potentially a much wider range of applications. We identify a few of them in this section.

Program comprehension. SQL execution trace analysis could be used, possibly combined with other techniques, to better understand the data manipulation behaviour of programs. The dynamic analysis techniques presented in this tutorial allow a better understanding of a database schema, through the identification of dependencies between successive query executions. In the case of program comprehension, the interpretation of those dependencies will also be required in order to incrementally extract the data manipulation workflow followed by the executed program. For instance, if each row insertion in table ORDERS depends on the result of a select query on table CUSTOMER, there is a high probability that the latter is a validation query, making sure that the new


order corresponds to an existing customer in the database. Ultimately, we could better understand the process of placing an order, by discovering that it requires the initial successful identification of the customer.

Bug detection. Another interesting application that one could further investigate concerns the identification of unsafe data access paths [50], i.e., buggy program fragments where an implicit integrity constraint is not correctly managed. In the case of implicit foreign keys, one could detect in an SQL trace that an insert, delete or update statement is performed without prior verification of the referential constraint. In this case, the analysis would be based on the absence of output-input and input-input dependency under particular scenarios. In our running example, one would detect, for instance, that a new row is inserted in table ORDERS via a query that does not depend on a previous reading query on table CUSTOMER.
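A minimal sketch of such a check is given below, reusing the kind of trace model assumed in the earlier sketches (queries with input/output value sets and a sequence number, extended with a table name and a statement kind); the class, record and string constants are hypothetical. An insert into the referencing table is flagged when no earlier query on the referenced table shares one of its input values.

import java.util.List;
import java.util.Set;

public class UnsafeInsertDetector {

  // Hypothetical trace entry: position, accessed table, statement kind and value sets.
  record TracedQuery(int seq, String table, String kind, Set<String> in, Set<String> out) {}

  // Reports inserts into 'referencingTable' (e.g. ORDERS) that are not preceded by any
  // query on 'referencedTable' (e.g. CUSTOMER) sharing one of their input values.
  static void reportUnsafeInserts(List<TracedQuery> trace,
                                  String referencingTable, String referencedTable) {
    for (TracedQuery q : trace) {
      if (!"insert".equals(q.kind()) || !q.table().equals(referencingTable)) {
        continue;
      }
      boolean validated = trace.stream()
          .filter(p -> p.seq() < q.seq() && p.table().equals(referencedTable))
          .anyMatch(p -> p.in().stream().anyMatch(q.in()::contains)
                      || p.out().stream().anyMatch(q.in()::contains));
      if (!validated) {
        System.out.println("possibly unsafe insert at position " + q.seq());
      }
    }
  }
}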

Impact analysis. SQL statement log analysis provides partial information that can be used to better analyze the links between the programs and the database. A simple though quite useful piece of derived information is the usage matrix [51], which specifies which tables and which columns each program unit uses and modifies and, thanks to dynamic analysis, at which frequency. Such dynamic information is very useful for determining the exact impact of a database schema change on application programs. This impact is usually defined as the number of queries that become invalid due to the schema change. But even when only one query in the programs becomes invalid, the impact of the change may still be high if that query is executed very frequently. Conversely, a large set of queries may be impacted by the schema change, but the actual impact is low if all those queries are located in an obsolete program that is never executed anymore.
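A table-level usage matrix of this kind could be aggregated from an execution trace roughly as sketched below; the TraceEntry record and its fields are assumptions introduced for illustration, not part of the tooling described above.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UsageMatrix {

  // Hypothetical trace entry: which program unit executed a query on which table.
  record TraceEntry(String programUnit, String table) {}

  // Counts, for each (program unit, table) pair, how many traced queries were observed.
  static Map<String, Map<String, Integer>> build(List<TraceEntry> trace) {
    Map<String, Map<String, Integer>> matrix = new HashMap<>();
    for (TraceEntry e : trace) {
      matrix.computeIfAbsent(e.programUnit(), unit -> new HashMap<>())
            .merge(e.table(), 1, Integer::sum);
    }
    return matrix;
  }
}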

System monitoring. SQL execution traces constitute a data collection that can be mined to extract aggregated information and statistics on the behaviour of the program as far as database interaction is concerned. According to the additional data recorded with these statements, useful information can be derived such as the database failure rate at each program point (update rejected, access denied, empty result set, etc.), most frequently used SQL forms, complexity of SQL statements, programming idiosyncrasies, awkward and inefficient SQL patterns and non-standard syntax (inducing portability vulnerability). These derived data can be used to monitor the program behaviour and performance. The analysis of selection criteria in select, delete and update statements can be used to define and tune the data access mechanisms such as indexes and clusters, or even to suggest merging two tables for performance reasons.

Security management. A well-known vulnerability in database applications is the so-called SQL code injection [52]. It occurs when an external user is requested to provide data (typically a user ID and password) that are injected at query-building time into an incomplete SQL query. However, the user actually provides, instead of the expected data, spurious but syntactically valid data in such a way that a malicious query is formed. This query is then executed with privileges that the user has not been granted. Most


common attack detection techniques rely on the analysis of the value provided by the user, but dynamically analysing the actual (vs intended) SQL query may prove easier and more reliable [53].

8 Conclusions

Database and program evolution processes should ideally rely on a complete and accurate database documentation. The latter usually includes the conceptual schema, which formalizes the semantics of the data, and the logical schema, which translates the former according to an operational database model. In many cases, however, these schemas are missing, or, at best, incomplete and outdated. Their reconstruction, a process called database reverse engineering, requires DDL code analysis but, more importantly, the elicitation of implicit database structures and constraints.

In this context, we have developed automated, dynamic program analysis techniques in support of database redocumentation. Those techniques are based on the identification of intra-query and inter-query dependencies in SQL execution traces. Analyzing those dependencies makes it possible, in a second stage, to reveal implicit links (1) between schema constructs and (2) between schema constructs and program variables.

We believe that dynamic program analysis constitutes a very promising technique for future research in data-intensive system understanding and evolution. In particular, the increasing use of Object-Relational Mapping (ORM) technologies like Hibernate, and the emergence of NoSQL database platforms, may lead to a new generation of legacy systems, where database-program interactions are even more dynamic, where database structures and constraints are even more implicit, and, therefore, where the co-evolution of databases and programs becomes even more costly, time-consuming and error-prone.

Acknowledgments. We would like to thank Jean-Roch Meurisse for his fruitful collaboration during the WebCampus case study. We also thank the organizing committee of the GTTSE summer school series, as well as all the tutorialists and participants for their contribution to an exciting 2011 edition. Last but not least, we thank the anonymous reviewers for their constructive feedback, which helped us to significantly improve the quality of these tutorial notes.

References

1. Chikofsky, E.J., Cross, J.H.: Reverse engineering and design recovery: A taxonomy. IEEE Software 7(1), 13–17 (1990)

2. Blaha, M.R., Premerlani, W.J.: Observed idiosyncracies of relational database designs. In: Proc. of the Second Working Conference on Reverse Engineering (WCRE 1995), p. 116. IEEE Computer Society, Washington, DC (1995)

3. Petit, J.M., Kouloumdjian, J., Boulicaut, J.F., Toumani, F.: Using Queries to Improve Database Reverse Engineering. In: Loucopoulos, P. (ed.) ER 1994. LNCS, vol. 881, pp. 369–386. Springer, Heidelberg (1994)


4. Andersson, M.: Searching for semantics in Cobol legacy applications. In: Data Mining and Reverse Engineering: Searching for Semantics, IFIP TC2/WG2.6 Seventh Conference on Database Semantics (DS-7). IFIP Conference Proceedings, vol. 124, pp. 162–183. Chapman & Hall (1998)

5. Embury, S.M., Shao, J.: Assisting the comprehension of legacy transactions. In: Proc. of the 8th Working Conference on Reverse Engineering (WCRE 2001), p. 345. IEEE Computer Society, Washington, DC (2001)

6. Willmor, D., Embury, S.M., Shao, J.: Program slicing in the presence of a database state. In: ICSM 2004: Proceedings of the 20th IEEE International Conference on Software Maintenance, pp. 448–452. IEEE Computer Society, Washington, DC (2004)

7. Cleve, A., Henrard, J., Hainaut, J.L.: Data reverse engineering using system dependency graphs. In: Proc. of the 13th Working Conference on Reverse Engineering (WCRE 2006), pp. 157–166. IEEE Computer Society, Washington, DC (2006)

8. Cleve, A.: Program Analysis and Transformation for Data-Intensive System Evolution. PhD thesis, University of Namur (October 2009)

9. Cleve, A., Hainaut, J.L.: Dynamic analysis of SQL statements for data-intensive applications reverse engineering. In: Proc. of the 15th Working Conference on Reverse Engineering, pp. 192–196. IEEE Computer Society (2008)

10. Cleve, A., Meurisse, J.R., Hainaut, J.L.: Database semantics recovery through analysis of dynamic SQL statements. Journal on Data Semantics 15, 130–157 (2011)

11. Hainaut, J.L.: Introduction to database reverse engineering. LIBD Publish. (2002), http://www.info.fundp.ac.be/ dbm/publication/2002/DBRE-2002.pdf

12. Lämmel, R., De Schutter, K.: What does aspect-oriented programming mean to Cobol? In: Proc. of Aspect-Oriented Software Development (AOSD 2005), pp. 99–110. ACM Press (March 2005)

13. Kiczales, G., Hilsdale, E., Hugunin, J., Kersten, M., Palm, J., Griswold, W.G.: An Overview of AspectJ. In: Lee, S.H. (ed.) ECOOP 2001. LNCS, vol. 2072, pp. 327–353. Springer, Heidelberg (2001)

14. Petit, J.M., Toumani, F., Kouloumdjian, J.: Relational database reverse engineering: A method based on query analysis. Int. J. Cooperative Inf. Syst. 4(2-3), 287–316 (1995)

15. Lopes, S., Petit, J.M., Toumani, F.: Discovery of "Interesting" Data Dependencies from a Workload of SQL Statements. In: Zytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 430–435. Springer, Heidelberg (1999)

16. Tan, H.B.K., Ling, T.W., Goh, C.H.: Exploring into programs for the recovery of data dependencies designed. IEEE Trans. Knowl. Data Eng. 14(4), 825–835 (2002)

17. Tan, H.B.K., Zhao, Y.: Automated elicitation of inclusion dependencies from the source code for database transactions. Journal of Software Maintenance 15(6), 379–392 (2003)

18. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)

19. DB-MAIN: The DB-MAIN official website (2011), http://www.db-main.be

20. Zhu, H., Hall, P.A.V., May, J.H.R.: Software unit test coverage and adequacy. ACM Comput. Surv. 29, 366–427 (1997)

21. Kapfhammer, G.M., Soffa, M.L.: A family of test adequacy criteria for database-driven applications. In: Proc. of the 9th European Software Engineering Conference Held Jointly with 11th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ESEC/FSE-11, pp. 98–107. ACM, New York (2003)

22. Casanova, M.A., De Sa, J.E.A.: Mapping uninterpreted schemes into entity-relationship diagrams: two applications to conceptual schema design. IBM J. Res. Dev. 28(1), 82–94 (1984)


23. Davis, K.H., Arora, A.K.: A methodology for translating a conventional file system into an entity-relationship model. In: Proc. of the Fourth International Conference on Entity-Relationship Approach, pp. 148–159. IEEE Computer Society, Washington, DC (1985)

24. Navathe, S.B., Awong, A.M.: Abstracting relational and hierarchical data with a semantic data model. In: Proc. of the Sixth International Conference on Entity-Relationship Approach (ER 1987), pp. 305–333. North-Holland Publishing Co., Amsterdam (1988)

25. Johannesson, P.: A method for transforming relational schemas into conceptual schemas. In: Proc. of the Tenth International Conference on Data Engineering (ICDE 1994), pp. 190–201. IEEE Computer Society, Washington, DC (1994)

26. Hainaut, J.L., Englebert, V., Henrard, J., Hick, J.M., Roland, D.: Database reverse engineering: From requirements to CARE tools. Automated Software Engineering 3, 9–45 (1996)

27. Davis, K.H., Aiken, P.H.: Data reverse engineering: A historical survey. In: Proc. of the Seventh Working Conference on Reverse Engineering (WCRE 2000), p. 70. IEEE Computer Society, Washington, DC (2000)

28. Hainaut, J.L., Chandelon, M., Tonneau, C., Joris, M.: Contribution to a theory of database reverse engineering. In: Proc. of the IEEE Working Conf. on Reverse Engineering, pp. 161–170. IEEE Computer Society Press, Baltimore (1993)

29. Signore, O., Loffredo, M., Gregori, M., Cima, M.: Reconstruction of ER Schema from Database Applications: a Cognitive Approach. In: Loucopoulos, P. (ed.) ER 1994. LNCS, vol. 881, pp. 387–402. Springer, Heidelberg (1994)

30. Yang, H., Chu, W.C.: Acquisition of entity relationship models for maintenance – dealing with data intensive programs in a transformation system. J. Inf. Sci. Eng. 15(2), 173–198 (1999)

31. Shao, J., Liu, X., Fu, G., Embury, S.M., Gray, W.A.: Querying Data-Intensive Programs for Data Design. In: Dittrich, K.R., Geppert, A., Norrie, M. (eds.) CAiSE 2001. LNCS, vol. 2068, pp. 203–218. Springer, Heidelberg (2001)

32. Markowitz, V.M., Makowsky, J.A.: Identifying extended entity-relationship object structures in relational schemas. IEEE Trans. Softw. Eng. 16(8), 777–790 (1990)

33. Premerlani, W.J., Blaha, M.R.: An approach for reverse engineering of relational databases. Commun. ACM 37(5), 42–49 (1994)

34. Chiang, R.H.L., Barron, T.M., Storey, V.C.: Reverse engineering of relational databases: extraction of an EER model from a relational database. Data Knowl. Eng. 12(2), 107–142 (1994)

35. Lopes, S., Petit, J.M., Toumani, F.: Discovering interesting inclusion dependencies: application to logical database tuning. Inf. Syst. 27(1), 1–19 (2002)

36. Yao, H., Hamilton, H.J.: Mining functional dependencies from data. Data Min. Knowl. Discov. 16(2), 197–219 (2008)

37. Pannurat, N., Kerdprasop, N., Kerdprasop, K.: Database reverse engineering based on association rule mining. CoRR abs/1004.3272 (2010)

38. Choobineh, J., Mannino, M.V., Tseng, V.P.: A form-based approach for database analysis and design. Communications of the ACM 35(2), 108–120 (1992)

39. Terwilliger, J.F., Delcambre, L.M.L., Logan, J.: The User Interface Is the Conceptual Model. In: Embley, D.W., Olivé, A., Ram, S. (eds.) ER 2006. LNCS, vol. 4215, pp. 424–436. Springer, Heidelberg (2006)

40. Ramdoyal, R., Cleve, A., Hainaut, J.-L.: Reverse Engineering User Interfaces for Interactive Database Conceptual Analysis. In: Pernici, B. (ed.) CAiSE 2010. LNCS, vol. 6051, pp. 332–347. Springer, Heidelberg (2010)

41. Di Lucca, G.A., Fasolino, A.R., de Carlini, U.: Recovering class diagrams from data-intensive legacy systems. In: Proc. of the 16th IEEE International Conference on Software Maintenance (ICSM 2000), p. 52. IEEE Computer Society (2000)

42. Henrard, J.: Program Understanding in Database Reverse Engineering. PhD thesis, University of Namur (2003)


43. van den Brink, H., van der Leek, R., Visser, J.: Quality assessment for embedded SQL. In: Proc. of the 7th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM 2007), pp. 163–170. IEEE Computer Society (2007)

44. Ngo, M.N., Tan, H.B.K.: Applying static analysis for automated extraction of database interactions in web applications. Inf. Softw. Technol. 50(3), 160–175 (2008)

45. Cornelissen, B., Zaidman, A., van Deursen, A., Moonen, L., Koschke, R.: A systematic survey of program comprehension through dynamic analysis. IEEE Trans. Software Eng. 35(5), 684–702 (2009)

46. Debusmann, M., Geihs, K.: Efficient and Transparent Instrumentation of Application Components Using an Aspect-Oriented Approach. In: Brunner, M., Keller, A. (eds.) DSOM 2003. LNCS, vol. 2867, pp. 209–220. Springer, Heidelberg (2003)

47. Del Grosso, C., Di Penta, M., García Rodríguez de Guzmán, I.: An approach for mining services in database oriented applications. In: Proceedings of the 11th European Conference on Software Maintenance and Reengineering (CSMR 2007), pp. 287–296. IEEE Computer Society (2007)

48. Yang, Y., Peng, X., Zhao, W.: Domain feature model recovery from multiple applications using data access semantics and formal concept analysis. In: Proc. of the 16th International Working Conference on Reverse Engineering (WCRE 2009), pp. 215–224. IEEE Computer Society (2009)

49. Alalfi, M., Cordy, J., Dean, T.: WAFA: Fine-grained dynamic analysis of web applications. In: Proc. of the 11th International Symposium on Web Systems Evolution (WSE 2009), pp. 41–50. IEEE Computer Society (2009)

50. Cleve, A., Lemaitre, J., Hainaut, J.L., Mouchet, C., Henrard, J.: The role of implicit schema constructs in data quality. In: Proc. of the 6th International Workshop on Quality in Databases (QDB 2008), pp. 33–40 (2008)

51. Deursen, A.V., Kuipers, T.: Rapid system understanding: Two Cobol case studies. In: Proc. of the 6th International Workshop on Program Comprehension (IWPC 1998), p. 90. IEEE Computer Society (1998)

52. Merlo, E., Letarte, D., Antoniol, G.: Insider and outsider threat-sensitive SQL injection vulnerability analysis in PHP. In: Proc. Working Conf. Reverse Engineering (WCRE), pp. 147–156. IEEE Computer Society, Washington, DC (2006)

53. Halfond, W.G.J., Orso, A.: Combining static analysis and runtime monitoring to counter SQL-injection attacks. In: WODA 2005: Proceedings of the Third International Workshop on Dynamic Analysis, pp. 1–7. ACM, New York (2005)


Model-Based Language Engineering with EMFText

Florian Heidenreich, Jendrik Johannes, Sven Karol, Mirko Seifert, and Christian Wende

Institut für Software- und Multimediatechnik, Technische Universität Dresden, D-01062 Dresden, Germany

{florian.heidenreich,jendrik.johannes,sven.karol,mirko.seifert,c.wende}@tu-dresden.de

Abstract. Model-based techniques are in wide-spread use for the design and implementation of domain specific languages (DSLs) and their tooling. The Eclipse Modeling Framework (EMF) is a frequently used environment for model-based language engineering. With its underlying modelling language Ecore, its XML serialisation support and its versatile extensibility it provides a solid grounding for many task-specific language development tools. In this tutorial, we give an introduction to model-based language engineering using EMFText, which allows users to develop powerful textual editors for Ecore-based DSLs that are tightly integrated with the EMF.

1 Introduction

EMFText [1] is a tool for defining textual syntax for Ecore-based metamodels [2]. It enables developers to define their own textual languages—be it domain specific languages (e.g., a language for describing forms) or general purpose languages (e.g., Java)—and generates accompanying tool support for these languages. It provides a syntax specification language from which it generates textual editors and components to load and store textual model instances. Editors generated by EMFText provide many advanced features that are known from, e.g., the Eclipse Java editor. This includes code completion (with customisable completion proposals), customisable syntax and occurrence highlighting via preference pages, advanced bracket handling, code folding, hyperlinks and text hovers for quick navigation, an outline view and instant error reporting.

In the first part of this tutorial, we give an introduction to EMFText, its language development process and develop a basic domain-specific language (DSL) for describing form sheets. In this part, we also discuss how DSL semantics can be specified with Reference Attribute Grammars (RAGs) [3]. Then, we dive into the advanced features of EMFText and show how we have developed model-based language tooling for a general-purpose language—namely Java—with EMFText and provide insights on how such a language can easily be extended by new features and integrated with other languages.


Table 1. Typical formalisms in language engineering and their counterparts in EMFText

Concern             Typical specification formalisms              Realisation in EMFText
Concrete syntax     Context-free grammars (CFGs),                 Concrete Syntax Specification
                    Extended Backus–Naur Form (EBNF)              Language (CS)
Abstract syntax     Tree grammars, abstract CFGs                  Ecore diagrams or textecore
Static semantics    Reference attribute grammars (RAGs)           Default reference resolving or
                                                                  RAGs based on JastEMF
Dynamic semantics   Translational, operational,                   Typically operational (manually
                    denotational semantics                        implemented interpreters) or
                                                                  translational via model transformations

2 Developing Languages with EMFText

In this section we shortly introduce basic concerns of textual language development and how they are addressed in EMFText. Afterwards, we discuss the general EMFText development process for languages and interactive editor tooling and investigate the most compelling features of EMFText.

2.1 Basic Language Development Concerns

Typical concerns in the development of textual languages are concrete and abstract syntax, static semantics and dynamic semantics (cf. Table 1).

Concrete Syntax denotes the physical representation of artefacts in a certain language L as streams of characters. Typically, textual concrete syntax is specified using a plain context-free grammar (CFG) or an equivalent formalism with syntactic sugar, e.g., Extended Backus-Naur Form (EBNF). EMFText provides the Concrete Syntax Specification Language (CS), whose core is based on EBNF. A CS specification is the central artefact when implementing textual languages with EMFText. In fact, it is a simple but rich syntax specification language that follows the concept of convention over configuration. This allows for very compact and intuitive specifications, but still supports tweaking specifics where needed. Since the CS itself is implemented in EMFText, it has a powerful editor which analyses the specification in the background while the user is typing. A number of analyses inform developers about potential errors in the specification—like missing EBNF rules for certain types in the metamodel, violated lower/upper bounds in the metamodel or overlapping regular expressions. With EMFText, an initial CS specification for a DSL can be generated for any given Ecore-based metamodel [1]. One can generate a syntax that conforms to the Human-Usable Textual Notation (HUTN) [4] standard, a Java-style syntax, or a custom syntax configured by using the custom syntax wizard. In all cases, the initial, generated specification of the syntax can be further tailored towards specific needs. Furthermore, the CS provides an import mechanism that supports specification of a single text syntax for multiple related Ecore models and also allows for modularisation and composition of CS specifications.


Abstract Syntax denotes the representation of artefacts in L as data structure (e.g., an abstract syntax tree (AST)) for further processing. Typical formalisms are again (abstract) CFGs or tree grammars. Abstract syntax of modelling languages is specified using languages like the Meta-Object Facility (MOF) [5] and Essential MOF (EMOF). In EMFText, we rely on the EMOF realisation of the Eclipse Modeling Framework (EMF), which is called Ecore. Roughly speaking, an Ecore model is a restricted kind of class diagram with alternative graphical and textual notations. In this paper, we will use textecore—an EMFText-based notation for Ecore-based abstract syntax. Like the CS, textecore has been implemented using EMFText. Thus, language developers are supported by a powerful editor and validation algorithms.

Static Semantics of a language L covers typical static analysis algorithms like name analysis or static type resolution. A well-known specification formalism for static semantics are RAGs. Essentially, RAGs are an extension to context-free grammars which allows language developers to specify data flow and computations over the nodes in a syntax tree. In this paper, we will use two alternative approaches for static semantics. The first is a default name resolution mechanism for models with globally unique names, which is available out of the box for any syntax in EMFText. Also, external references are resolved automatically, if URIs point to the referenced elements. More complex resolution mechanisms can be realized by implementing generated resolving methods. As a second approach, we will use JastEMF—a tool for specifying static semantics of Ecore models using RAGs.

Dynamic Semantics is the "meaning" of artefacts in L, e.g., the execution of a program, compilation to less abstract target languages or even a rendered image. Typical approaches to language semantics can be operational (e.g., writing interpreters in a general purpose language), translational (e.g., writing a code generator or compiler back-end translating to some target platform) or denotational (e.g., a formal mathematical specification [6] or an interpreter implemented in some functional programming language). EMFText provides support for operational and translational semantics. Interpretation can be realised using Java code by implementing the generated interpreter stubs. EMFText also provides a post processor API including an Eclipse extension point. Post processors registered via that extension point are called immediately after static semantics evaluation and are ideal for implementing model-to-model/model-to-text transformations, model validation or consistency checks. EMFText also generates a builder stub that can be used by developers to easily hook their own background build task into Eclipse. For example, builders can process model instances on changes to automatically produce derived resources when needed or to trigger some model transformations in a concurrent thread.

2.2 The EMFText Language Development Process

Generating an advanced Eclipse editor for a new language with EMFText just requires a few specifications and generation steps.


Fig. 1. The iterative EMFText language development process as a workflow model based on the Business Process Model and Notation (BPMN) [7]

The basic language development process of EMFText is depicted in Fig. 1. It is an iterative process and consists of the following basic tasks:

(1) specifying a language metamodel using Ecore/textecore,
(2) specifying the concrete syntax using the CS,
(3) optionally specifying static semantics using reference resolvers or JastEMF,
(4) optionally implementing dynamic semantics, e.g., by using the EMFText interpreter stubs or a post processor,
(5) generating and (optionally) customising the editor tooling, e.g., by tailoring code completion, syntax highlighting, attaching quick fixes or implementing some refactoring operations.

A language specification in EMFText consists (at least) of an Ecore metamodel and a concrete syntax specification (cf. Tasks (1) and (2)). Taking these specifications, the EMFText generator derives an advanced textual editor that uses a likewise generated parser and printer to parse textual artefacts to EMF models or to print EMF models to language expressions respectively. Out of the box, the editor has standard features such as syntax highlighting in standard colors, basic code completion and reference resolution. For parser generation, EMFText relies on ANTLR [8], which uses a recursive descent parsing approach. Depending on the objectives at hand, this may already fulfill the requirements of the language developer, e.g., if the editor should only be an add-on to an existing DSL which already has an existing implementation modulo editing support in a modern IDE like Eclipse. If required, all parts of the generated editor can be tweaked using code generation options and by extending it at the provided extension points (cf. Task (5)). For example, EMFText provides means to attach quick fixes to reported problems, which then can be fixed by the developer in a convenient way. Note that the code that is generated does not contain dependencies to a runtime environment. This implies that generated language tooling can be deployed in environments where EMFText is not available and that future compatibility issues can be avoided. Since EMFText is also a framework for developing complete DSL implementations using model-based techniques, language semantics is optionally supported using Java code and/or JastEMF RAGs (cf. Tasks (3) and (4)).


1  FORM "GTTSE'11 Questionnaire"
2    GROUP "General Questions"
3      ITEM "Name" : FREETEXT
4      ITEM "Age" : NUMBER
5      ITEM "Gender" : CHOICE ("Male", "Female")
6
7    GROUP "Research Program"
8      ITEM "Do you enjoy the GTTSE'11 research program?" : DECISION ("Yes", "No")
9
10     ITEM "How many tutorials have you attended so far?" : NUMBER
11
12   GROUP "Food and Drinks"
13     ITEM "Preferences" : CHOICE ("All", "Vegetarian", "Vegan")
14
15     ITEM "Does the menu match your eating preferences?" : DECISION ("Yes", "No")
16
17     ITEM "Do you like Vinho Verde?"
18       : CHOICE multiple ("It's great!", "It's great for lunch!", "It's OK.")

Listing 1.1. An example form.

In the following sections, we explain and exemplify each of these tasks and the involved specifications by developing a simple DSL—forms.

3 Creating a DSL with EMFText: forms

In this section, we create a DSL for specifying form sheets like questionnaires, tax forms or surveys. The example covers all mandatory and optional tasks of the language development process discussed in the previous section.

Listing 1.1 contains an example specification for a questionnaire as it could have been handed to participants of the GTTSE summer school. The form has a range of question items, which are arranged in related groups. As an example, consider the items grouped under research program (cf. lines 7–10) asking participants to evaluate the quality of the program and count the number of attended tutorials.

To ease the creation, maintenance and evaluation of such form sheets, the summer school organisers may have decided to develop a textual editor and a corresponding metamodel using the EMF and EMFText. A further issue is how the questionnaires are presented to the participants and how they could be filled out most comfortably. In this tutorial, we discuss two alternatives. The first is the generation of a printable PDF document that can be added to the conference package given to the participants at the beginning of the summer school. The second option is to render the form as an interactive website.

3.1 Specifying the forms Metamodel

To kick-start the development of a new language you can use the EMFText project wizard which initialises a new EMFText project containing a metamodel folder that holds an initial metamodel and syntax specification.


 1 package forms                                  // package name
 2   forms                                        // namespace prefix
 3   "http://www.emftext.org/language/forms"      // namespace URI
 4 {
 5   class Form {
 6     attribute EString caption (0..1);
 7     containment reference Group groups (1..-1);
 8   }
 9   class Group {
10     attribute EString name (1..1);
11     containment reference Item items (1..-1);
12     reference Form root (1..1);
13   }
14   class Item {
15     attribute EString text (0..1);
16     attribute EString explanation (0..1);
17     containment reference ItemType itemType (1..1);
18     reference Option dependentOf (0..-1);
19   }
20   abstract class ItemType {}
21   class FreeText extends ItemType {}
22   class Date extends ItemType {}
23   class Number extends ItemType {}
24   class Choice extends ItemType {
25     attribute EBoolean multiple (0..1);
26     containment reference Option options (1..-1);
27   }
28   class Decision extends ItemType {
29     containment reference Option options (2..2);
30   }
31   class Option {
32     attribute EString id (0..1);
33     attribute EString text (0..1);
34   }
35 }

Listing 1.2. Metamodel of the forms language.

As EMFText is tightly integrated with the EMF, language metamodels are specified using the Ecore metamodelling language. The metamodel specifies the abstract syntax of the new language. It can be built from classes with attributes that are related using references. References are further distinguished into containment references and non-containment references. It is important to notice this difference, as both reference types have different semantics in EMF and are also handled differently in EMFText. Containment references are used to relate a parent model element and a child model element that is declared in the context of the parent element. An example, which can be found for instance in object-oriented programming languages, is the declaration of a method within the body of a class declaration. Non-containment references are used to relate a model element with an element that is declared in a remote subtree. A common example in programming languages is a method call in a code block that relates to a method declaration via a non-containment reference.

To define a metamodel for a language, we have to consider the concepts this language deals with, how they interrelate and what attributes they have. In the following, we discuss the concepts of the forms language—as they may be derived from the example in Listing 1.1—and how they can be represented by metamodelling concepts.

– A Form (class) has a caption (attribute) and contains (containment reference) a number of question Groups (class).
– Each Group has a name (attribute) and contains (containment reference) a number of question Items (class).
– Each Item has a question text (attribute) and an explanation (attribute).
– There are various Types (class) of question items with regard to the answer values they expect: e.g., Text questions (subclass), Date questions (subclass), Number questions (subclass), Choices (subclass), or Decisions (subclass).
– Choices and Decisions declare (containment reference) a number of selection Options (class).
– There may be question Items that are dependent on (non-containment reference) the selection of a particular Option in another Item, e.g., a question that asks for the age of your children only if you previously selected that you have some.

Listing 1.2 depicts a textual representation of the corresponding EMF metamodel (specified using textecore). Since Ecore metamodels are organised in packages, the specification starts with a package declaration including the package name and the namespace URI, which is used by the EMF to register the package in a package registry (lines 1–3). The rest of the specification contains the definitions of the above-mentioned forms concepts and refines their multiplicities and types. Attributes are denoted by the keyword attribute, the following attribute type, the feature name and a cardinality. For example, consider the definition of caption in line 6. It has the data type EString (which corresponds to java.lang.String in generated Java code) and is optional. Similarly, non-containment references are denoted by the keyword reference, while containment references are additionally marked by containment. Examples are dependentOf in line 18 and itemType in line 17.

For a more detailed introduction to the basics of Ecore metamodelling we refer to [2].

3.2 Specifying a Concrete Syntax for forms

After defining the metamodel, we can start specifying a concrete syntax. The concrete syntax specification defines the textual representation of all metamodel concepts. For that purpose, EMFText provides the CS language. As a starting point, EMFText provides a syntax generator that can automatically create a CS specification conforming to the HUTN standard from the language metamodel.

Listing 1.3 depicts a CS specification for the forms language that defines the syntax used in the example form in Listing 1.1. It consists of five sections:

– In the first section (lines 1–3), the file extension for registration in Eclipse is defined (line 1), the specification is bound to the metamodel by its namespace URI (line 2), and a start symbol is defined (line 3).


 1 SYNTAXDEF forms
 2 FOR <http://www.emftext.org/language/forms>
 3 START Form
 4
 5 OPTIONS {
 6   overrideBuilder = "false";
 7 }
 8
 9 TOKENS {
10   DEFINE MULTIPLE $'multiple'|'MULTIPLE'$;
11 }
12
13 TOKENSTYLES {
14   "TEXT" COLOR #da0000;
15   "FORM", "ITEM", "CHOICE", "DATE", "FREETEXT", "NUMBER",
16   "DECISION", "GROUP" COLOR #000000, BOLD;
17   "ONLY", "IF" COLOR #da0000, BOLD;
18 }
19
20 RULES {
21   Form ::= "FORM" caption['"','"'] !1 groups*;
22   Group ::= !0 "GROUP" name['"','"'] !0 items*;
23   Item ::= "ITEM" text['"','"'] ( explanation['"','"'] )?
24     ("ONLY" "IF" dependentOf[])? ":" itemType !0;
25   Choice ::= "CHOICE" (multiple[MULTIPLE])? "(" options ("," options)* ")";
26   Option ::= ( id[] ":")? text['"','"'];
27   Date ::= "DATE";
28   FreeText ::= "FREETEXT";
29   Number ::= "NUMBER";
30   Decision ::= "DECISION" "(" options "," options ")";
31 }

Listing 1.3. Concrete syntax specification of the forms language.

– In the second section (lines 5–7), various code generation options can be configured. For example, line 6 configures that the builder class should not be regenerated if it already exists.

– In the third section (lines 9–11), tokens to be recognised by the lexical analyser are defined. In this grammar, only one kind of token is specified (line 10, MULTIPLE). However, EMFText has some built-in token definitions for whitespace and identifiers (called TEXT).

– In the fourth section (lines 13–18), token styles are defined that customise syntax highlighting, e.g., by defining a certain colour or a font face.

– In the fifth section (lines 20–31), the rules for the language syntax are specified. Details about the syntax rules will be given below.

The syntax specification rules used in the CS language are given in EBNF to support arbitrary context-free languages. They are meant to define syntax for EMF-based metamodels and, thus, are specifically related to the Ecore concepts. Therefore, the CS language provides Ecore-specific specialisations of classic EBNF constructs like terminals and nonterminals. This specialisation enables EMFText to provide advanced support during syntax specification, e.g., errors and warnings if the specification is inconsistent with the metamodel. Furthermore, it enables the EMFText parser generator to derive a parser that directly instantiates EMF models from artefacts in the specified language.


In the following, we summarise the most important constructs found in the CS language and their relation to Ecore metamodels. Each syntax construct will be related to examples taken from Listing 1.3. For an extensive overview of the syntax specification language we refer to the EMFText documentation.

Rules. A CS rule is always related (by its name) to a specific metaclass. It defines the syntactic representation of instances of this metaclass, attribute values and references. All syntax rules are collected in the rules section of the CS file. Various common EBNF constructs like keywords, terminals, non-terminals, multiplicities (?, +, *), alternatives (|), or sub-rules are available. For example, Form ::= ...; in line 21 of Listing 1.3 defines the syntax of the Form metaclass.

Keywords. Keywords are purely syntactic elements that are mainly used to distinguish and mark up particular language concepts. Examples are "FORM", "GROUP", "ONLY", "IF".

Terminals. Terminals specify the symbolic representation of attribute values or non-containment references. They can be recognised by the corresponding feature name that is followed by square brackets. Within these square brackets, the name of a token type, or a prefix and a suffix that must surround symbolic values, can be given. If nothing is given, the default TEXT token is assumed. In the case of non-containments, this value is later resolved to the actual element. Examples for attribute features are: id[] (the value of the id attribute is determined by a TEXT token, cf. line 26), multiple[MULTIPLE] (the value of the multiple attribute is determined by a MULTIPLE token which is automatically mapped to true if present and false otherwise, cf. line 25), name['"','"'] (the value of the name attribute is determined by an arbitrary string between double quotes, cf. line 22). Example for non-containment features: dependentOf[]—a placeholder for this reference is determined by a TEXT token and later, during reference resolution, replaced by a reference to an instance object (cf. line 24).

Nonterminals (Containment References). Nonterminals are used in rule bodies to specify the syntactic representation for containment references of the according metaclass. They use the reference name without brackets. During parsing, a recursive descent parser descends into the syntax rule specified for the class the containment reference points to. This is in line with the semantics of containment references as used in metamodels. An example is groups at the end of line 21. Note that groups refers to the corresponding containment reference of the Form metaclass (cf. line 7 in Listing 1.2).

Printing Markup. Printing markup is used to customise the behaviour of the generated printer. This is useful to achieve a particular standard layout for printed language expressions. Two forms of printing markup are supported: blank markup, #<n>, which prints a given number of blanks, and line break markup, !<n>, which introduces a line break followed by a given number of tab characters. As an example, consider lines 21 and 22: before the groups are serialised the printer emits a new line and adds a tab as basic indention for all groups (!1). Consequently, using !0, a line break and the current indention is added before the group items are printed. Note that if not specified otherwise (via #<n>), the printer emits a single blank between two tokens.

3.3 Implementing Static Semantics with JastEMF

In fact, metamodels only declare language concepts while means to implement their semantics are missing. A common example usually occurs when non-containment references need to be resolved in textual languages, i.e., some name analysis has to be provided. To achieve this, EMFText provides an API for reference resolution—for each cross-reference declared in a CS specification, template methods can be implemented in generated ReferenceResolver classes. By default, EMFText tries to resolve non-containment references on a unique key–value basis. In the forms DSL, the dependentOf reference (cf. line 18 in Listing 1.2) between Items and Options is resolved automatically by EMFText, which is possible because options have unique ids. However, when it comes to more complex language constructs such as nested scopes and static typing, EMFText does not provide appropriate formalisms to specify these rules.
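To give an impression of what such a resolution could look like, the following sketch resolves an Option by its id by searching the containing Form. The class name and method signature are illustrative only and do not reproduce the exact interface generated by EMFText; what is assumed are the standard EMF calls (EcoreUtil, eAllContents) and the default getters derived from the metamodel in Listing 1.2.

import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.util.EcoreUtil;
import java.util.Iterator;

public class DependentOfResolver {

  // Resolve the symbolic name stored for the dependentOf reference of 'item'
  // to the Option whose id matches, searching the whole containing Form.
  public Option resolve(String optionId, Item item) {
    EObject root = EcoreUtil.getRootContainer(item); // the containing Form
    for (Iterator<EObject> it = root.eAllContents(); it.hasNext();) {
      EObject candidate = it.next();
      if (candidate instanceof Option
          && optionId.equals(((Option) candidate).getId())) {
        return (Option) candidate;
      }
    }
    return null; // unresolved; a real resolver would report an error instead
  }
}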

To overcome this issue and to provide an appropriate and usable approach for specifying metamodel semantics, we developed the JastEMF tool [9], which integrates RAGs based on the JastAdd tool [10] with the EMF [11,12]. An RAG specifies the computation of semantic values over the syntax trees of a target language. In contrast to attributes as structural features in metamodels (i.e., EAttributes in Ecore), attributes in Attribute Grammars (AGs) are signatures with semantic functions. Semantic functions are always specified with respect to a certain production in the CFG, while the evaluation takes place in the context of each node derived by the corresponding production. Computed values can be distributed bottom-up or top-down to parent or child nodes via attributes. Using AGs for semantics specification has several advantages: since the advent of AGs in 1968 [13], the basic formalism has been extended and improved several times. RAGs are one such extension, allowing references to be passed as values through attributes, which makes them suitable for computing cross-references over object-oriented ASTs [14,3]. Furthermore, best practices have emerged and can be applied to DSL development [15]. Efficient evaluation algorithms for AGs have been developed, and several efficient tools with very different implementation approaches are available. For example, Eli [16] is a compiler construction framework for Unix that generates C code. Silver [17] provides a functional attribution language and has a Java backend. Kiama [18] is an AG library and embedded DSL for the Scala programming language.

JastEMF reuses the JastAdd tool and, thus, inherits its features and specification languages. It generates demand-driven attribute evaluators that are woven into the generated AST classes and is seamlessly integrated with Java. Furthermore, semantics can be specified in an extensible fashion using aspect modules. This extensibility motivates the application of attribute grammars when specifying semantics for language families [19]. A tutorial on JastAdd RAGs and their


 1 Form ::= <caption:String> groups:Group*;
 2 Group ::= <name:String> items:Item*;
 3 Item ::= <text:String> <explanation:String> <dependentOfName:String>
 4   itemType:ItemType;
 5
 6 abstract ItemType;
 7 FreeText:ItemType;
 8 Choice:ItemType ::= <multiple:boolean> options:Option*;
 9 Date:ItemType;
10 Number:ItemType;
11 Decision:ItemType ::= options:Option*;
12
13 Option ::= <id:String> <text:String>;

Listing 1.4. The forms metamodel as JastAdd AST grammar.

usage can be found in [20]. Detailed information on how to set up a JastEMF semantics modelling project is available at the JastEMF website.

Relating RAGs and Metamodels. JastEMF bridges the gap between the generated Java code of JastAdd and the EMF. This is possible because JastAdd generates a hierarchy of Java classes from an AST grammar, which is quite similar to the EMF models and code.

To align the JastAdd AST and the Ecore metamodel, JastEMF applies the following mapping. Each class in the original metamodel is mapped to a production nonterminal in the AST grammar. Furthermore, each non-derived EAttribute of a metaclass is mapped to a terminal in the grammar. Finally, containment references are mapped to right-hand side nonterminals. All these elements belong to the syntactic interface of the metamodel. Listing 1.4 shows the AST grammar derived from the forms metamodel in Listing 1.2. Note that in line 3 the terminal dependentOfName was added to make the name of the referenced option available for AG computations. Other parts of the metamodel belong to the semantic interface, such as EAttributes marked as derived in the metamodel, non-containment references and operations [12].

Developing an Attribution for the Forms Example. In this section, we exemplify the application of JastEMF by developing an RAG-based name analysis for the forms language. To this end, we employ different kinds of attribute grammar concepts:

Synthesised attributes are used to model bottom-up data flow. A synthesised attribute is always defined with respect to a nonterminal (classes in our case) and may only depend on inherited attributes of its nonterminal, synthesised values of direct children, or terminals (EAttributes). In JastAdd/JastEMF, they are identified by the keyword syn.

Inherited attributes model top-down data flow in AST structures. They are always defined with respect to the context of a right-hand side nonterminal (i.e., a containment reference in our case) and may only depend on inherited values of the left-hand side nonterminal (the containing class of the containment


 1 //attribute declarations belong to the semantics interface
 2 inh Form ASTNode.form();
 3 syn EList Item.dependentOf();
 4
 5 //declarations of "helper" attributes
 6 inh EList ASTNode.LookUpOption(String optionName);
 7 coll EList<Option> Form.Options() [new BasicEList()]
 8   with add;
 9
10 //attribute equations
11 Option contributes this to Form.Options() for form();
12 eq Form.getgroups(int index).form() = this;
13 eq Item.dependentOf() = LookUpOption(getdependentOfName());
14 eq Form.getgroups(int index).LookUpOption(String optionName){
15   EList result = new BasicEList();
16   for(Option option:Options()){
17     if(optionName.equals(option.getid()))
18       result.add(option);
19   }
20   return result;
21 }

Listing 1.5. A Simple Name Analysis for the forms Example.

reference), synthesised values of siblings, or terminals. In JastAdd/JastEMF, they are identified by the keyword inh.

Collection attributes are used to collect values which can be freely distributed in an AST structure. In JastAdd/JastEMF, collection attributes are identified by the keyword coll.

Reference attributes go beyond the standard AG definition by additionally allowing references to existing AST nodes (i.e., objects) to be passed through attributes. In JastAdd/JastEMF, reference attributes can be synthesised, inherited or collection attributes.

Equations are the actual semantic functions. They specify how an attribute has to be computed. In JastAdd/JastEMF, equations are identified by the keyword eq and may be given by a Java expression or method body.

The forms language has two features that belong to the semantics interface—the above-mentioned non-containment reference dependentOf and a non-containment reference from Groups to the forms root. We consider both as belonging to the name analysis aspect, since dependentOf will point to an Option which is referenced by name in the form root. Listing 1.5 contains the complete name analysis for our example. The algorithm replaces the default algorithm generated by EMFText by a much more declarative JastAdd specification. Note that, besides the attributes in the semantic interface, we use the Options() and LookUpOption() attributes as helpers (1) to collect all Option instances in a given Form at its root node and (2) to let all nodes in a model inherit references to Options for a given name. The actually implemented semantic functions can be found in lines 11–21. Line 11 contains the specification for the Options attribute, which tells the evaluator to add all Option objects to the Options list at the Form root object. Line 12 contains the specification for the inherited form attribute. It tells the evaluator to compute the value via the Form root object



Fig. 2. The GTTSE questionnaire in Listing 1.1 rendered as a PDF document

Fig. 3. The GTTSE questionnaire rendered as a website for mobile devices

by directly passing the object to the group children and to all indirect descendants. Line 13 contains the specification for the dependentOf reference which uses the LookUpOption attribute. LookUpOption is specified in lines 14–21 as a Java method body. The algorithm traverses the collected Option objects and adds them to the computation result if the id is equivalent to the passed reference name.

The presented name analysis is rather simple, since it actually just maps names in Items to Option ids. However, since we now have a more declarative specification, the algorithm can easily be modified and extended. For example, one might consider extending the DSL with more complex visibility constraints such as declare-before-use constraints or shadowing declarations. We may also add further semantic aspects, e.g., checking reachable questions under specific constraints, or simply let the AG control code generation.

3.4 Interpretation and Compilation of Language Instances

Up to this point, we defined the forms metamodel, a concrete syntax and static semantics. To fulfill the requirements stated at the beginning of this section, we realised two translational approaches for rendering forms as PDFs or websites. First, we implemented a code generator that produces PDF documents that look quite similar to German tax sheets (cf. Fig. 2). The PDFs could easily be printed on paper and handed to the participants. The generator uses Java Emitter Templates (JET) [21] to generate style sheets in the Extensible Stylesheet Language (XSL) [22] with formatting rules that can be rendered to various binary output formats, e.g., PDF. Second, we implemented a code generator that produces HTML forms that can be rendered on the small screens of mobile phones (cf. Fig. 3). The HTML generator also uses JET to generate the forms representation that can be rendered by a web browser. The implementations of the generators are available in the EMFText language zoo [23].

Besides the transformations discussed above, we also implemented a Java interpreter for forms instances. It executes a Java-based user interface on demand, where users can fill in form data in a wizard-based manner. The interpreter will be used in the forms and Java integration scenario presented in Section 4.
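The actual interpreter offers a wizard-style user interface; as a rough illustration of the underlying idea—walking the forms model and collecting answers—a much simplified, console-based sketch could look as follows. The accessor names follow EMF's default getter conventions for the metamodel in Listing 1.2 and are assumptions, not the published interpreter API.

import java.util.Scanner;

public class SimpleFormInterpreter {

  public void interprete(Form form) {
    Scanner in = new Scanner(System.in);
    System.out.println(form.getCaption());
    for (Group group : form.getGroups()) {
      System.out.println("== " + group.getName());
      for (Item item : group.getItems()) {
        System.out.print(item.getText());
        ItemType type = item.getItemType();
        if (type instanceof Choice) {
          // print the available options of a choice item
          System.out.print(" (one of:");
          for (Option option : ((Choice) type).getOptions()) {
            System.out.print(" '" + option.getText() + "'");
          }
          System.out.print(")");
        }
        System.out.print(": ");
        String answer = in.nextLine(); // a real interpreter would validate and store this
        System.out.println("  -> recorded: " + answer);
      }
    }
  }
}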

3.5 Generating and Customising the Language Tooling

The metamodel code is generated by the EMF Java code generator [2]. To this end, each Ecore metamodel is accompanied by a generator model. The generator model is used to configure various options for EMF code generation (e.g., the targeted version of the Java runtime). From the root element of the generator model, the generation of Java code implementing the metamodel specification can be started. By default, the generated files can be found in the src folder of the metamodel plug-in, but this can also be configured in the generator model.
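For the Form metaclass from Listing 1.2, the EMF code generator essentially produces an interface of the following shape (plus an implementation class, a package interface and a factory, not shown here). The sketch reflects EMF's usual generation pattern; the names are derived from the metamodel rather than copied from the generated plug-in.

import org.eclipse.emf.common.util.EList;
import org.eclipse.emf.ecore.EObject;

public interface Form extends EObject {
  String getCaption();           // attribute EString caption (0..1)
  void setCaption(String value);
  EList<Group> getGroups();      // containment reference Group groups (1..-1);
                                 // multi-valued features get only a getter
}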

Given a correct and complete syntax specification, the EMFText code generator can be used. There are two alternative ways to do this: manually from within Eclipse or using an Apache Ant task (which is explained in the EMFText documentation). Manual code generation can be triggered from the context menu of the concrete syntax specification. This starts the EMFText code generator, which produces a number of plug-ins: org.emftext.language.forms, which is the basic plug-in of the language, org.emftext.language.forms.resources.forms, which contains the generated parser, printer and various infrastructure for the forms language, and org.emftext.language.forms.resources.forms.ui, which contains all generated classes related to the Eclipse-based user interface. Besides the files implementing the language tooling, a number of extension points specific for the language are generated to the schema folder. They can be used to further customise the language tooling. For details we refer to the EMFText documentation.

The previous steps are mandatory to generate an initial implementation of the basic language tooling. The generated text editor already comes with a number of advanced editing features that greatly help with editing language expressions. However, there are various ways to make the language tooling more useful. EMFText helps developers in customising their language tooling with a number of additional functions, ranging from semantic validation of language expressions, language compilation and language interpretation to editor functions like folding, custom quickfixes, extended code completion, refactoring, custom icons and more.

To create an instance of the forms language, users can either deploy the generated plug-ins or run them directly as a new Eclipse Application out of the active workspace. Figure 4 shows a screenshot of the developed editor running in Eclipse. On the left side, the package explorer shows the artifacts in the gttse-questionnaire project. The gttse.forms file is currently open


Fig. 4. The generated and customised editor tooling in action

and shown in the central editor view, which is under control of the plug-ins generated by EMFText as described above. Its contents are equivalent to those in Listing 1.1 at the beginning of this section. Amongst others, the package explorer also contains the generated artifacts such as the PDF file shown in Fig. 2 and the website in Fig. 3. The re-generation of the files is triggered automatically by an EMFText builder.

The Outline view shows the instantiated forms model using tree widgets and custom icons. Note that editor and outline are linked: if an element in the editor is marked by the user, the corresponding model element is highlighted in the tree and vice versa. Furthermore, the Properties view shows attribute values of the currently selected element.

In the next section, we discuss the integration of the forms language and the Java interpreter with the Java programming language.

4 Integrating DSLs and GPLs Using EMFText

As already discussed in the previous sections, DSL code is typically executed by an interpreter or by compilation to a general purpose language (GPL) program—which is then interpreted or compiled itself—instead of being translated directly to machine code. This creates a methodical and technical gap between DSL and target GPL that implies several drawbacks:

– Developers are required to use different tool machinery for DSLs and GPLs.
– Implicit references between DSL and GPL code are hard to track and may cause inconsistencies.
– DSLs cannot directly reuse (parts of) the concepts in a GPL.
– Naïve embeddings of DSL code (e.g., in Strings) do not provide means for syntactic and semantic checking.
– Interpreted DSL code is hard to debug if the interpreter itself has no debugging support.
– Generated DSL code is hard to read, debug and maintain.

We aim at alleviating these drawbacks by a seamless integration of DSLs and GPLs. In this part of the tutorial we demonstrate how EMFText can be employed to close the methodological and technical gap and realise these different integration scenarios in a coherent way. First, we discuss the Java Model Parser and Printer (JaMoPP) project. JaMoPP contributes a complete Ecore-based metamodel for Java 5 [24], a complete EMFText-based parser and printer for Java source code, and an implementation of static semantics analysis. This enables the application of Model-Driven Software Development (MDSD) tool machinery to Java source code. Afterwards, we will discuss and exemplify how the JaMoPP metamodel and grammar can be reused and the EMFText tooling can be used to generate advanced textual editors for two different language integration scenarios based on the forms language.

4.1 JaMoPP: The Java Model Parser and Printer

This section introduces the different parts of JaMoPP in detail. First, we discuss the Ecore metamodel for Java. Next, we present details of an EMFText syntax specification for Java, static semantics analysis in Java models, how JaMoPP integrates Java class files, printing of Java code from models, and how JaMoPP is seamlessly integrated with EMF tooling. Finally, we discuss a number of basic applications that were enabled by integrating Java and MDSD tooling.

The JaMoPP Metamodel. There is a huge number of tools that operate on Java programs, but it turned out that few of them have an explicit metamodel of the language.

The Java Language Specification (JLS) [24] itself does not provide a formal metamodel of Java. Existing Java parsers (e.g., javac or the Eclipse Java Development Tools (JDT)) have internal metamodels written in Java. One implementation that is closest to a standardised solution is the Java 5 implementation of the Stratego/XT system [25]. However, none of these implementations provides an integration with standard metamodelling tools. The Java metamodels published by the OMG [26], the MoDisco [27] project, or the SPOON [28] project are based on a standardised metamodelling language (in particular Ecore), but are rather incomplete.


Thus, we decided to compare the existing metamodels, extract commonalities and extend them to fully support the JLS. JaMoPP defines 80 abstract and 153 concrete classes, which are divided into 18 packages. It contains all elements of the Java language (e.g., classifiers, imports, types, modifiers, members, statements, variables, expressions and literals) and in particular those that were introduced with the release of Java 5 (e.g., annotations and generics). The complete metamodel is available online at the JaMoPP website [29].

JaMoPP Syntax, Static Semantics and EMF Integration. To generate a Java parser and printer, we provided an EMFText syntax specification for the JaMoPP metamodel. The complete syntax specification is available on the JaMoPP website. The generation, implementation and integration of a parser, a printer and reference resolvers for Java are explained next.

Parsing of Java source code is based on an ANTLR recursive descent parser generated by EMFText. The back-tracking mechanism of ANTLR allowed us to specify a complete Java grammar according to the specification and generate the parser from it.

Reference resolving in Java source code corresponds to static name analysis in Java models. EMFText generates reference resolvers for all non-containment references, which replace symbolic names in the tree-structured parse model by explicit links to the corresponding elements. They augment the basic AST instance instantiated by the parser to obtain a complete graph-based instance of the Java metamodel. To account for Java's scoping, naming and typing rules, the reference resolvers were refined manually.

Because of the high fragmentation of a Java program into several Java source files, there are many non-containment references that span multiple resources (e.g., imported Classifiers). JaMoPP uses a global registry, which corresponds to a Java classpath, to map resources to their physical locations. This registry is used to find cross-referenced model resources on demand. To access resources that are only available in byte code (e.g., libraries), we use the BCEL [30] byte code parser and translate the output of the BCEL parser into an instance of the JaMoPP metamodel.
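As a rough illustration of this byte code path, the sketch below reads a compiled class with BCEL. ClassParser, JavaClass and Method are part of BCEL's published API, while the translation into JaMoPP metamodel elements is only indicated by a comment and is not reproduced here.

import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.Method;
import java.io.IOException;

public class ByteCodeReader {

  public void read(String classFilePath) throws IOException {
    // parse the .class file into BCEL's representation
    JavaClass clazz = new ClassParser(classFilePath).parse();
    System.out.println("class " + clazz.getClassName()
        + " extends " + clazz.getSuperclassName());
    for (Method method : clazz.getMethods()) {
      // at this point JaMoPP creates the corresponding metamodel elements
      System.out.println("  " + method.getName() + method.getSignature());
    }
  }
}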

Printing Java source files is the inverse process to parsing. EMFText generates a printer from the syntax specification that contains a print method for each concrete metaclass. According to the CS rule that belongs to a class, the printer emits keywords for model elements and the values of element attributes, and recursively calls subsequent methods to print contained elements. Besides memorising the original position of the elements in source code, EMFText remembers whitespace and comments during parsing and uses them for formatting the output when re-printing a model. Printing directives defined in the JaMoPP CS specification are only used for freshly introduced concepts, e.g., a new method that has been added by a model transformation.

Tool Integration. Ecore-based modelling languages and tools are integrated into the Eclipse platform by the EMF. New languages can be transparently


1 class EmbeddedForm extends java::types::TypedElement,
2                            java::instantiations::Instantiation {
3   containment reference forms::Form form (1..1);
4   attribute EString name (1..1);
5 }

Listing 1.6. Additional metaclasses for formsembedded.

integrated into this infrastructure by implementing EMF's Resource interface. JaMoPP provides a JavaResource for *.java and *.class files that makes use of the generated parser and printer, the reference resolvers and the byte code parser to load and store Java models. Thus, despite their specific text syntax, Java programs can be handled like any other models by EMF-based tools.

Treating Java programs as models revealed a number of benefits: 1) Java programs can be transformed to models and, thus, benefit from tools that are only available at the modelling level. 2) Modelling tools to construct and manipulate models can be used to create Java programs. 3) Both directions can be used in a tightly integrated fashion to enable full round-tripping for Java programs and models. This enables the application of model-to-model transformations and sophisticated language extensions [31] for Java code generation, the application of model analysis and constraint languages for source code analysis (e.g., OCL [32]), or the application of tools for graphical model visualisation and editing (e.g., GMF) to Java programs. For a whole list of applications, descriptions and running examples, please refer to [33] and the applications website [34] of JaMoPP.
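To illustrate benefit 1), the following sketch loads a Java source file as an EMF model. The ResourceSet calls are standard EMF API; it is assumed (and flagged in the comments) that JaMoPP's resource factory has already been registered for the java file extension, which outside of Eclipse requires an additional setup step described in the JaMoPP documentation.

import org.eclipse.emf.common.util.URI;
import org.eclipse.emf.ecore.resource.Resource;
import org.eclipse.emf.ecore.resource.ResourceSet;
import org.eclipse.emf.ecore.resource.impl.ResourceSetImpl;

public class LoadJavaAsModel {

  public static void main(String[] args) {
    ResourceSet resourceSet = new ResourceSetImpl();
    // Assumption: JaMoPP's JavaResource factory is registered for "java" files
    // (inside Eclipse this happens automatically via an extension point).
    Resource resource = resourceSet.getResource(
        URI.createFileURI("src/Example.java"), true);
    // The root element is a compilation unit from the JaMoPP metamodel and can
    // now be processed by any EMF-based tool.
    System.out.println(resource.getContents().get(0));
  }
}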

4.2 Integrating forms and JaMoPP

In this section we demonstrate how EMFText supports different language integration scenarios. In detail, we discuss two integration scenarios of practical importance: making a DSL available in GPL code and embedding GPL expressions into DSLs. We employ metamodel and grammar extension to implement integrated languages that reuse and integrate parts of the forms DSL and Java. Note that the idea of integrating DSLs and GPLs is not new and was discussed by many authors (e.g., [35,36,37,38,39]).

Formsembedded: A forms Extension for Java. This example demonstrates a language integration scenario that allows for embedding DSL code into GPLs by integrating the forms language with JaMoPP. It enables a convenient and domain-specific representation of forms in Java programs. A custom builder is used to normalise a program with embedded forms to plain Java code.

To realise this extension, we first create a custom metamodel that extends the JaMoPP metamodel with metaclasses from forms. As depicted in Listing 1.6, it introduces the metaclass EmbeddedForm as a subclass of the metaclasses TypedElement and Instantiation from the JaMoPP metamodel. This means that embedded forms are typed and can be declared wherever an Instantiation expression (e.g., a constructor call) is expected in Java. In


 1 SYNTAXDEF formsembedded
 2 FOR <http://www.emftext.org/language/formsembedded>
 3 START java.Containers.CompilationUnit
 4 IMPORTS {
 5   forms : <http://www.emftext.org/language/forms> WITH SYNTAX forms <.../forms.cs>
 6   java : <http://www.emftext.org/java> WITH SYNTAX java <.../java.cs>
 7 }
 8 TOKENS {
 9   REDEFINE forms.QUOTED_34_34 AS QUOTED java.STRING_LITERAL;
10   REDEFINE forms.TEXT AS IDENTIFIER java.IDENTIFIER;
11   REDEFINE forms.LINEBREAK AS LINEBREAKS java.LINEBREAKS;
12 }
13 RULES {
14   EmbeddedForm ::= "#form" "{" form "}";
15 }

Listing 1.7. Additional productions for formsembedded.

 1 public class Example {
 2   public void showForm() {
 3     Form f = #form {
 4       FORM "An embedded form"
 5       GROUP "Personal Questions"
 6       ITEM "Firstname" : FREETEXT
 7       ITEM "Lastname" : FREETEXT
 8       ITEM "Age" : NUMBER
 9     };
10     new FormInterpreter().interprete(f); // interpret form
11 }}

Listing 1.8. An embedded form.

addition, an EmbeddedForm has a name and declares a containment reference to define the actual Form.

Next, we need to define the concrete syntax for embedding forms in Java (cf. Listing 1.7). We again give a syntax specification that imports both the Java and the forms syntax specification (lines 4–7). It reuses all imported productions, and we only need to provide a custom syntax for the newly introduced metaclass. The syntax rule introduces a new keyword (#form) for forms declarations and allows the definition of the actual instance between curly brackets (line 14). For the form definition the original forms language syntax is reused. As EmbeddedForm is defined as a subclass of Java Instantiations, this syntax rule is handled as an alternative to other instantiation syntax rules. To circumvent problems resulting from token conflicts in the integrated grammars, we redefine some of the forms token definitions in the TOKENS section to use their Java counterparts instead.

The specification of an embedded form in a Java file is exemplified in Listing 1.8. We implemented a custom builder to process formsembedded specifications. It extracts an embedded form specification to a plain forms file. Next, the embedded form is replaced with Java code that calls the forms interpreter with the extracted forms file.
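For the class in Listing 1.8, the normalised result produced by such a builder would be plain Java along the following lines. The loader helper and the file name are hypothetical and only stand in for whatever mechanism the builder generates to obtain the extracted forms model; the FormInterpreter call is taken from the listing above.

public class Example {
  public void showForm() {
    // the embedded #form block was extracted to a separate .forms file by the builder
    Form f = FormsLoader.load("Example_showForm.forms"); // hypothetical helper
    new FormInterpreter().interprete(f); // interpret form
  }
}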

Javaforms: Use Java Statements to Declare Pre-conditions for Form Items. This integration example shows how parts of GPL syntax can be integrated and reused in a DSL. We realise an integration of Java expressions to define pre-conditions in forms.


1 class ConditionalItem extends forms::Item, java::variables::LocalVariable {
2   containment reference java::expressions::Expression condition (1..1);
3 }

Listing 1.9. Metamodel for javaforms in textecore.

1 ConditionalItem ::=
2   (name[IDENTIFIER] "=")?
3   "ITEM" text[STRING_LITERAL]
4   ( explanation[STRING_LITERAL] )?
5   ("ONLY" "IF"
6     ( "(" condition:java.Expressions.AssignmentExpression? ")" |
7       dependentOf[IDENTIFIER] )
8   )?
9   ":" itemType !0;

Listing 1.10. Concrete syntax for javaforms.

1 FORM "GTTSE’11 Questionnaire"2 GROUP "General Questions"3 age = ITEM "Age" : NUMBER4 GROUP "Food and Drinks"5 ITEM "Do you like Vinho Verde?"6 ONLY IF (age>18) : CHOICE true ("It’s great!","It’s great for lunch!","It’s OK.")

Listing 1.11. A form with a conditional item.

Listing 1.9 depicts the metamodel for javaforms. It integrates parts of the JaMoPP Java metamodel with the forms metamodel. To enable the definition of pre-conditions for items, javaforms introduces the metaclass ConditionalItem as a subclass of Item from the forms metamodel. Thus, a ConditionalItem can be defined wherever a form Item is expected. In addition, ConditionalItem is a subclass of LocalVariable from the Java metamodel and can, thus, be used to hold values during the execution of the form. Each ConditionalItem can contain a Java Expression as pre-condition.

The additional syntax rule for the newly defined metaclass is given in Listing 1.10. Note that the token redefinitions are exactly the same as in Listing 1.7 and are therefore excluded. Each ConditionalItem can define a name that is used to refer to it within a form. In addition to a simple dependency, ConditionalItems can also specify a boolean Java expression that needs to be satisfied for the ConditionalItem to be displayed. Such Java expressions can access and evaluate the runtime value of any other form item by the given name. The rest of the syntax corresponds to the syntax for conventional form Items. As Java is meant to be embedded into forms, the syntax specification defines Form from the forms language as start symbol.

The application of javaforms is demonstrated in Listing 1.11. For the evaluation of javaforms specifications we implemented a custom builder. This builder feeds the javaforms specification to a code generator and generates Java code that renders and evaluates the given form.


4.3 More JaMoPP Integration Examples

Beyond these didactic examples of language integration, we applied JaMoPP to implement more practical and sophisticated approaches:1

EJava is based on JaMoPP and Ecore and can be used to specify EOperations externally with Java. This way, hand-written and generated code is cleanly separated and checked for potential compilation problems.

JavaTemplate extends the JaMoPP Java grammar with generic template concepts (e.g., placeholders, loops, conditions) and adapts JaMoPP's static semantics analysis to obtain a safe Java template language [40].

PropertiesJava is an experimental extension of the JaMoPP Java syntax that allows the definition of C#-like properties.

JavaBehaviourForUML is an integration of Unified Modeling Language (UML) class diagrams and JaMoPP. It is tightly integrated with the graphical UML editor provided by the MDT-UML2 project [41].

5 Related Work

In this section, we give an overview of the closest EMFText competitors.

Xtext [42] is the standard textual modelling framework in Eclipse. At its core, Xtext is very similar to EMFText. It has its own modular syntax specification language, uses ANTLR as parser generator and generates powerful editors. However, historically there are different philosophies behind both tools. The CS language of EMFText was designed to be compact and declarative. It follows the convention-over-configuration principle by providing several defaults like the automatic derivation of a HUTN-based or Java-style CS for a given Ecore metamodel. Also, EMFText is well integrated into the Eclipse UI by providing actions and wizards that can be executed from context menus and file menus (e.g., the parser and printer generator is started by just one click in the context menu). In contrast, the Xtext specification language is more flexible and closer to the generated ANTLR grammar. It supports syntactic predicates and a restricted form of semantic actions to influence AST construction. Additionally, if a metamodel is not present, Xtext can derive one from the grammar specified. To generate the parser, Xtext users usually have to specify an extra workflow file, which configures the whole generation process.

Other tools implementing textual editors for EMF are the Textual Editing Framework (TEF) [43] and Textual Concrete Syntax (TCS) [44]. While TCS also uses ANTLR to generate the parser, TEF uses an LR bottom-up parsing algorithm, which allows it to handle left recursion.

MontiCore [45] is a tool for generating textual editors for Eclipse. It provides a modular integrated concrete syntax and abstract syntax specification language. It also uses ANTLR to generate the parser and provides its own context-aware lexical analyser. For static semantics, MontiCore supports AGs.

1 All examples can be found in the EMFText Concrete Syntax Zoo [23].


Spoofax [46] is a language workbench for Eclipse that relies on a scannerless generalised LR (SGLR) parsing algorithm and the Stratego language for static semantics. In comparison to standard LL and LR algorithms, the SGLR approach has several benefits with respect to grammar modularisation, since it avoids conflicts between token definitions and can handle ambiguous context-free grammars.

Another, completely different approach is the projectional editing framework JetBrains MPS [47]. Projection here means that the editor consists of graphical shapes that are mapped to a model. Hence, projectional editing is a mixture of textual and graphical editing. Since no parsing technology is used, no syntax conflicts occur. However, projectional editing feels different and is more restricted in comparison to the plain textual approaches. Also, comparing different versions of a projectional model can be an issue, since a simple text diff may not work reasonably well.

More detailed and comparative overviews of most of the tools in this section and others can be found in [48] and [49].

6 Conclusion

The goal of this tutorial was to introduce model-based language engineering with EMFText. We gave an introduction to EMFText and developed a basic DSL for describing forms. Then, we showed how we developed JaMoPP, an implementation of the Java language, and showed how DSLs and GPLs can easily be extended with new language features and integrated into other languages.

Acknowledgement. This research has been co-funded by the European Social Fund and the Federal State of Saxony within the project ZESSY #080951806.

References

1. Heidenreich, F., Johannes, J., Karol, S., Seifert, M., Wende, C.: Derivation and Refinement of Textual Syntax for Models. In: Paige, R.F., Hartman, A., Rensink, A. (eds.) ECMDA-FA 2009. LNCS, vol. 5562, pp. 114–129. Springer, Heidelberg (2009)

2. Steinberg, D., Budinsky, F., Paternostro, M., Merks, E.: Eclipse Modeling Framework, 2nd edn. Pearson Education (2008)

3. Hedin, G.: Reference Attributed Grammars. Informatica 24(3), 301–317 (2000)
4. Object Management Group: Human Usable Textual Notation (HUTN) Specification. Final Adopted Specification ptc/02-12-01 (2002)
5. Meta-Object Facility (MOF) Core Specification. Version 2.0 (January 2006)
6. Mosses, P.D.: Denotational semantics. In: Van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science, vol. B, pp. 575–631. MIT Press (1990)
7. Object Management Group: Business Process Model and Notation (BPMN) Specification. Version 2.0 (January 2011)
8. ANother Tool for Language Recognition (ANTLR), http://www.antlr.org/


9. JastEMF website, http://www.jastemf.org/
10. JastAdd website, http://www.jastadd.org/
11. Bürger, C., Karol, S.: Towards Attribute Grammars for Metamodel Semantics. Technical Report TUD-FI10-03, Technische Universität Dresden (March 2010)

12. Bürger, C., Karol, S., Wende, C., Aßmann, U.: Reference Attribute Grammars for Metamodel Semantics. In: Malloy, B., Staab, S., van den Brand, M. (eds.) SLE 2010. LNCS, vol. 6563, pp. 22–41. Springer, Heidelberg (2011)

13. Knuth, D.E.: Semantics of context-free languages. Theory of Computing Systems 2(2), 127–145 (1968)

14. Grosch, J.: Object-Oriented Attribute Grammars. Technical report, CoCoLab Datenverarbeitung, Aachen (August 1990)

15. Paakki, J.: Attribute grammar paradigms—high-level methodology in language implementation. ACM Comput. Surv. 27(2), 196–255 (1995)

16. Gray, R.W., Levi, S.P., Heuring, V.P., Sloane, A.M., Waite, W.M.: Eli: a complete, flexible compiler construction system. Commun. ACM 35(2), 121–130 (1992)

17. Wyk, E.V., Bodin, D., Gao, J., Krishnan, L.: Silver: an Extensible Attribute Grammar System. Electron. Notes Theor. Comput. Sci. 203(2), 103–116 (2008)

18. Sloane, A.M., Kats, L.C.L., Visser, E.: A Pure Object-Oriented Embedding of Attribute Grammars. Electron. Notes Theor. Comput. Sci. 253(7), 205–219 (2010)

19. Ekman, T., Hedin, G.: The JastAdd Extensible Java Compiler. SIGPLAN Not. 42(10), 1–18 (2007)

20. Hedin, G.: An Introductory Tutorial on JastAdd Attribute Grammars. In: Fernandes, J.M., Lämmel, R., Visser, J., Saraiva, J. (eds.) GTTSE 2009. LNCS, vol. 6491, pp. 166–200. Springer, Heidelberg (2011)

21. Java Emitter Templates (JET), http://www.eclipse.org/modeling/m2t/?project=jet

22. World Wide Web Consortium: Extensible Stylesheet Language (XSL) Specification. Recommendation 1.1 (December 2006)

23. EMFText Concrete Syntax Zoo, http://www.emftext.org/index.php/EMFText_Concrete_Syntax_Zoo

24. Gosling, J., Joy, B., Steele, G., Bracha, G.: Java(TM) Language Specification. Addison-Wesley Professional (2005)

25. Bravenboer, M., Kalleberg, K.T., Vermaas, R., Visser, E.: Stratego/XT 0.17. A Language and Toolset for Program Transformation. Science of Computer Programming 72(1-2), 52–70 (2008)

26. Object Management Group: Metamodel and UML Profile for Java and EJB Specification, Version 1.0. formal/2004-02-02 (2004)

27. The MoDisco project, http://www.eclipse.org/MoDisco/
28. Pawlak, R.: Spoon: Compile-time Annotation Processing for Middleware. IEEE Distributed Systems Online 7(11) (2006)
29. JaMoPP website, http://jamopp.org/
30. Byte Code Engineering Library (Apache Commons BCEL), http://commons.apache.org/bcel/
31. Heidenreich, F., Johannes, J., Seifert, M., Wende, C., Böhme, M.: Generating Safe Template Languages. In: Proc. of the 8th Int'l Conf. on Generative Programming and Component Engineering (GPCE 2009). ACM (2009)

32. Seifert, M., Samlaus, R.: Static Source Code Analysis using OCL. In: Cabot, J., Van Gorp, P. (eds.) Proc. of the MoDELS 2008 Workshop on OCL Tools: From Implementation to Evaluation and Comparison, OCL 2008 (2008)


33. Heidenreich, F., Johannes, J., Seifert, M., Wende, C.: JaMoPP: The Java Model Parser and Printer. Technical Report TUD-FI09-10, Technische Universität Dresden (August 2009)

34. JaMoPP applications website, http://jamopp.org/applications/
35. Van Deursen, A., Klint, P., Visser, J.: Domain-specific Languages: An Annotated Bibliography. ACM Sigplan Notices 35(6), 26–36 (2000)
36. Nystrom, N., Clarkson, M., Myers, A.: Polyglot: An Extensible Compiler Framework for Java. In: Hedin, G. (ed.) CC 2003. LNCS, vol. 2622, pp. 138–152. Springer, Heidelberg (2003)

37. Bravenboer, M., de Groot, R., Visser, E.: MetaBorg in Action: Examples of Domain-Specific Language Embedding and Assimilation Using Stratego/XT. In: Lämmel, R., Saraiva, J., Visser, J. (eds.) GTTSE 2005. LNCS, vol. 4143, pp. 297–311. Springer, Heidelberg (2006)

38. Van Wyk, E., Krishnan, L., Bodin, D., Schwerdfeger, A.: Attribute Grammar-Based Language Extensions for Java. In: Bateni, M. (ed.) ECOOP 2007. LNCS, vol. 4609, pp. 575–599. Springer, Heidelberg (2007)

39. Cunningham, H.: A Little Language for Surveys: Constructing an Internal DSL in Ruby. In: Proceedings of ACM-SE 2008, pp. 282–287. ACM (2008)

40. Heidenreich, F., Johannes, J., Seifert, M., Wende, C., Böhme, M.: Generating Safe Template Languages. In: Proceedings of GPCE 2009. ACM Press (2009)

41. Model Development Tools UML implementation, http://wiki.eclipse.org/MDT/UML2

42. Xtext – textual modelling framework (March 2012), http://www.eclipse.org/Xtext/

43. Textual Editing Framework (TEF), http://www2.informatik.hu-berlin.de/sam/meta-tools/tef/index.html

44. Textual Concrete Syntax (TCS), http://www.eclipse.org/gmt/tcs/45. Krahn, H., Rumpe, B., Volkel, S.: MontiCore: a framework for compositional de-

velopment of domain specific languages. International Journal on Software Toolsfor Technology Transfer (STTT) 12(5), 353–372 (2010)

46. Kats, L.C., Visser, E.: The Spoofax language workbench: rules for declarative specification of languages and IDEs. In: Proceedings of OOPSLA 2010, pp. 444–463. ACM (2010)

47. Voelter, M.: Language and IDE modularization, extension and composition with MPS. In: Pre-Proceedings GTTSE 2011, pp. 395–431 (2011)

48. Goldschmidt, T., Becker, S., Uhl, A.: Classification of Concrete Textual Syntax Mapping Approaches. In: Schieferdecker, I., Hartman, A. (eds.) ECMDA-FA 2008. LNCS, vol. 5095, pp. 169–184. Springer, Heidelberg (2008)

49. Merkle, B.: Textual modeling tools: overview and comparison of language workbenches. In: Proceedings of SPLASH 2010, pp. 139–148. ACM, New York (2010)


Feature-Oriented Software Development
A Short Tutorial on Feature-Oriented Programming, Virtual Separation of Concerns, and Variability-Aware Analysis*

Christian Kästner1 and Sven Apel2

1 Philipps University Marburg, Germany
2 University of Passau, Germany

Abstract. Feature-oriented software development is a paradigm for the construction, customization, and synthesis of large-scale and variable software systems, focusing on structure, reuse and variation. In this tutorial, we provide a gentle introduction to software product lines, feature-oriented programming, virtual separation of concerns, and variability-aware analysis. We provide an overview, show connections between the different lines of research, and highlight possible future research directions.

1 Introduction

Feature-oriented software development (FOSD) is a paradigm for the construction, customization, and synthesis of large-scale software systems. The concept of a feature is at the heart of FOSD. A feature is a unit of functionality of a software system that satisfies a requirement, represents a design decision, and provides a potential configuration option. The basic idea of FOSD is to decompose a software system in terms of the features it provides. The goal of the decomposition is to construct well-structured variants of the software that can be tailored to the needs of the user and the application scenario. Typically, from a set of features, many different software variants can be generated that share common features and differ in other features. The set of software systems generated from a set of features makes up a software product line [28, 75].

FOSD aims essentially at three properties: structure, reuse, and variation. Developers use the concept of a feature to structure the design and code of a software system. Features are the primary units of reuse in FOSD. The variants of a software system vary in the features they contain. FOSD shares goals with other software development paradigms, such as stepwise and incremental software development [74,98], aspect-oriented software development [36], component-based software engineering [88], and alternative flavors of software product line engineering [28,75], the differences of which are discussed elsewhere [4]. Historically,

* These tutorial notes share text with previous publications on feature-oriented software development [3,4,47,49].



static int __rep_queue_filedone(dbenv, rep, rfp)
    DB_ENV *dbenv;
    REP *rep;
    __rep_fileinfo_args *rfp; {
#ifndef HAVE_QUEUE
    COMPQUIET(rep, NULL);
    COMPQUIET(rfp, NULL);
    return (__db_no_queue_am(dbenv));
#else
    db_pgno_t first, last;
    u_int32_t flags;
    int empty, ret, t_ret;
#ifdef DIAGNOSTIC
    DB_MSGBUF mb;
#endif
    // over 100 further lines of C code
#endif
}

Fig. 1. Code excerpt of Oracle’s Berkeley DB

Historically, FOSD has emerged from different lines of research in programming languages, software architecture, and modeling; it combines results from feature modeling, feature interaction analysis, and various implementation forms for features [4].

In practice, software product lines are often implemented with build systems and conditional compilation. Hence, developers see code fragments as exemplified in Figure 1, in which code fragments belonging to features are wrapped by #ifdef and #endif directives of the C preprocessor. For a given feature selection, the preprocessor generates tailored code by removing code fragments that are not needed. Such preprocessor usage is dominant in practice; for example, in HP's product line of printer firmware over 2,000 features are implemented this way, and in the Linux kernel over 10,000 features. Although common, such implementations are rather ad hoc, violate the principle of separation of concerns, and are error-prone and difficult to debug; preprocessors are heavily criticized in the literature [1, 32, 34, 49, 86, and others]. Especially if features are scattered and tangled in large-scale programs (or even at smaller scale, as illustrated with the embedded operating system FemtoOS in Fig. 2), such problems quickly become apparent.

FOSD generally seeks more disciplined forms of feature implementation that are easier to maintain and to reason about. Researchers have investigated different strategies for better feature implementations. In this tutorial, we describe two important approaches. First, feature-oriented programming follows a language-based composition approach, in which features are implemented in separate implementation units and composed on demand. In contrast, work on virtual separation of concerns stays close to the annotation-based approach of preprocessors, but builds upon a disciplined foundation and provides tool support for reasoning and navigation.

The ability to combine features and derive different variants yields enormous flexibility but also introduces additional problems related to complexity. From n features, we can derive up to 2^n distinct variants (with 33 features, that is more than the number of humans on the planet; with 320 features, more than the estimated number of atoms in the universe).
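As a quick sanity check of these comparisons (our arithmetic, not part of the original text): 2^33 ≈ 8.6 × 10^9, which indeed exceeds the roughly 7 × 10^9 humans alive around 2011, and 2^320 ≈ 2 × 10^96, far beyond the commonly cited estimate of about 10^80 atoms in the observable universe.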


Fig. 2. Preprocessor directives in the code of FemtoOS: black lines represent preprocessor directives such as #ifdef, white lines represent C code; comment lines are not shown [49]

Instead of a single product, product-line developers implement millions of variants in parallel. To support them in dealing with this complexity and to prevent or detect errors (even those that occur only in one variant with a specific feature combination, out of millions), many researchers have proposed means for variability-aware analysis that lifts existing analyses to the product-line world. So far, variability-aware analysis has been explored, for example, for type checking, parsing, model checking, and verification. Instead of analyzing each of millions of variants in a brute-force fashion, variability-aware analysis seeks mechanisms to analyze the entire product line. We introduce the idea behind variability-aware analysis and illustrate it with the example of type checking, both for annotations and composition.

This tutorial gives a gentle introduction to FOSD. It is structured as follows: First, we introduce the basics of software product lines, such as feature models and the process of domain engineering. Second, we exemplify feature-oriented programming with FeatureHouse to separate the implementation of features into distinct modules. Third, we introduce the idea of virtual separation of concerns, an approach that,


instead of replacing preprocessors, disciplines them and provides mechanisms to emulate modularity through dedicated tool support. Finally, we introduce variability-aware analysis by means of the example of type checking and illustrate the general concept behind it.

In contrast to our previous survey on feature-oriented software development [4], which connected different works around the FOSD community, in this tutorial we take a more practical approach, focus on concepts relevant for implementers, and recommend relevant tools. Additionally, we repeat all relevant background about product-line engineering and feature modeling to make the tutorial more self-contained. Furthermore, we provide a broader picture and a new classification for variability-aware analysis strategies.

2 Software Product Lines: The Basics

Traditionally, software engineering has focused on developing individual software systems, one system at a time. A typical development process starts with analyzing the requirements of a customer. After several development steps – typically some process of specification, design, implementation, testing, and deployment – a single software product is the result. In contrast, software product line engineering focuses on the development of multiple similar software systems in one domain from a common code base [14, 75]. Although the resulting software products are similar, they are each tailored to the specific needs of different customers or to similar but distinct use cases. We call a software product derived from a software product line a variant.

Bass et al. define a software product line as “a set of software-intensive systems sharing a common, managed set of features that satisfy the specific needs of a particular market segment or mission and that are developed from a common set of core assets in a prescribed way” [14]. The idea of developing a set of related software products in a coordinated fashion (instead of each starting from scratch or copying and editing from a previous product) can be traced back to concepts of program families [42, 74].

Software product lines promise several benefits compared to individual development [14, 75]: Due to co-development and systematic reuse, software products can be produced faster, with lower costs, and at higher quality. A decreased time to market allows companies to adapt to changing markets and to move into new markets quickly. Especially in embedded systems, in which resources are scarce and hardware is heterogeneous, efficient variants can be tailored to a specific device or use case [19, 75, 80, 91]. Many companies report significant benefits from software product lines. For example, Bass et al. summarize that, with software product lines, Nokia can produce 30 instead of previously 4 phone models per year; Cummins, Inc. reduced the development time for the software for a new diesel engine from one year to one week; Motorola observed a 400% increase in productivity; and so forth [14].


2.1 Domain Engineering and Application Engineering

The process of developing an entire software product line instead of a single application is called domain engineering. A software product line must fulfil not only the requirements of a single customer but the requirements of multiple customers in a domain, including both current customers and potential future customers. Hence, in domain engineering, developers analyze the entire application domain and its potential requirements. From this analysis, they determine commonalities and differences between potential variants, which are described in terms of features. Finally, developers design and implement the software product line such that different variants can be constructed from common and variable parts.

In this context, a feature is a first-class domain abstraction, typically an end-user visible increment in functionality. In addition to features that add functionality, it is also common to have alternative features for the same functionality with different nonfunctional properties (e.g., a fast versus an energy-saving sorting algorithm). We discuss different notions of the term “feature” elsewhere [4].

Czarnecki and Eisenecker distinguish between problem space and solution space [30]. The problem space comprises domain-specific abstractions that describe the requirements on a software system and its intended behavior. Domain analysis, as a part of domain engineering, takes place in the problem space, and its results are documented in terms of features. The solution space comprises implementation-oriented abstractions, such as code artifacts. Between features in the problem space and artifacts in the solution space, there is a mapping that describes which artifact belongs to which feature. Depending on the implementation approach and the degree of automation, this mapping can have different forms and complexities, from simple implicit mappings based on naming conventions to complex machine-processable rules encoded in generators, including preprocessors and composition tools [30].

Application engineering is the process of deriving a single variant tailored to the requirements of a specific customer from a software product line, based on the results of domain engineering. Ideally, the customer's requirements can be mapped to features identified during domain engineering (problem space), so that the variant can be constructed from existing common and variable parts of the product line's implementation (solution space). FOSD strives for a form of product-line development in which all implementation effort is part of domain engineering, so that application engineering can be reduced to requirements analysis and automated code generation.

Typically, a software product line targets a specific domain, such as operating systems for mobile phones, control software for diesel engines, and embedded databases. The scope of a software product line describes which variability is offered and which kind of variants the product line can produce. A software product line with a narrow scope is easier to develop, but less flexible (it provides only few, very similar variants). The wider the scope is, the higher is the development effort, but the more flexibility a software product line can offer. Selecting the right scope of a product line is a difficult design, business, and strategy decision.


[Figure: in the problem space, domain analysis (incl. scoping, variability modeling) turns domain knowledge into features, and requirements analysis (a.k.a. product derivation) maps customer needs and new requirements to a feature selection; in the solution space, domain implementation (models, source code, ...) produces common implementation artifacts, and variant configuration/generation (incl. variant testing) derives a variant; a mapping connects features to implementation artifacts, spanning domain engineering and application engineering.]

Fig. 3. An (idealized) overview of domain engineering and application engineering (adapted from [30] to FOSD)

In practice, the scope is often iteratively refined; domain engineering and application engineering are rarely strictly sequential and separated steps. For example, it is common not to implement all features upfront, but incrementally, when needed. Furthermore, requirements identified in domain engineering may be incomplete, so new requirements arise in application engineering, which developers must either feed back into the domain-engineering process or address with custom development during the application engineering of a specific variant [30].

Domain engineering and application engineering describe a general process framework, as summarized in Figure 3. For each step, different approaches, formalisms, and tools can be used. For example, there are different product-line–scoping approaches (see a recent survey [45]), different domain analysis methods [30, 40, 46, 75, and many others], different mechanisms to model variability (see Sec. 2.2), different implementation mechanisms (our focus in Sec. 3 and 4), and different approaches to derive a variant based on customer requirements [78, 82, and others].

2.2 Variability Modeling

During domain analysis, developers determine the scope of the software product line and identify its common and variable features, which they then document in a variability model.


We introduce variability models because they are central not only for documenting variability in the problem space, but also for many implementation approaches, for automated reasoning and error detection, and for automated generation of variants. There are several different variability-modeling approaches (see Chen et al. [26] for an overview). We focus on FODA-style feature models [30, 46], because they are well known and broadly used in research and practice; other variability models can be used similarly.

A feature model describes a set of features in a domain and their relationships. It describes which features a product line provides (i.e., its scope), which features are optional, and in which combinations features can be selected in order to derive variants. With a selection of features (a subset F of all features), we can specify a variant (e.g., “the database variant for Linux, with transactions, but without a B-Tree”). Not all feature combinations may make sense; for example, two features representing different operating systems might be mutually exclusive. A feature model describes such dependencies. A feature selection that fulfils all constraints is valid (“F is valid”).

In practice, feature models contain hundreds or thousands of features.1 The number of potential variants can grow exponentially with the number of features. In theory, a software product line with n independent optional features can produce 2^n variants. In practice, many dependencies between features reduce the number of valid feature selections, but nevertheless, most software product lines give rise to millions or billions of valid feature selections.

A typical graphical representation of features and their dependencies is a feature diagram [46], as exemplified in Figure 4. A feature diagram represents features in a hierarchy. Different edges between features describe their relationships: A filled bullet describes that a feature is mandatory and must be selected whenever its parent feature is selected. In contrast, a feature connected with an empty bullet is optional. Multiple child features connected with an empty arc are alternative (mutually exclusive); exactly one child feature needs to be selected when the parent feature is selected. From multiple child features connected with a filled arc, at least one must be selected, but it is also possible to select more than one. Dependencies that cannot (or should not) be expressed with the hierarchical structure may be provided as additional cross-tree constraints in the form of a propositional formula. In Figure 4, we show nine features from the core of a fictional database product line. Each variant must contain the features Database, Base, OS, and Storage, but feature Transactions is optional, so variants may or may not include it; each variant must have exactly one operating-system feature, either Windows or Linux; each variant must contain at least one storage structure; finally, a cross-tree constraint specifies that Transactions are supported only if also feature B-Tree is selected. In this example, ten feature selections are valid.

1 For example, Bosch's product line of engine-control software has over 1,000 features [87], HP's Owen product line has about 2,000 features [79], and the Linux kernel has over 10,000 features [90].


Fig. 4. Feature-diagram example of a small database product line

As an alternative to the graphical notation, dependencies between features can be expressed entirely by a propositional formula. Each feature corresponds to a Boolean variable that is true when selected and false otherwise. The propositional formula evaluates to true for all valid feature selections. Feature diagrams can be transformed into propositional formulas with some simple rules [15]. For example, the feature diagram from Figure 4 is equivalent to the following propositional formula:

Database ∧ (Base ⇔ Database) ∧ (OS ⇔ Database) ∧
(Transactions ⇒ Database) ∧ (Storage ⇔ Database) ∧
(Windows ∨ Linux ⇔ OS) ∧ ¬(Windows ∧ Linux) ∧
(List ∨ B-Tree ⇔ Storage) ∧ (Transactions ⇒ B-Tree)

Representing feature models as propositional formulas has the advantage that we can reason about them automatically, which is essential for variability-aware analysis, as we discuss in Section 5. With simple algorithms or with automated reasoning techniques – including Boolean-satisfiability-problem solvers (SAT solvers), constraint-satisfaction-problem solvers, and binary decision diagrams – we can efficiently answer a series of questions, including “Has this feature model at least one valid selection (i.e., is the formula satisfiable)?” and “Is there a valid feature selection that includes feature X but not feature Y?” Even though some of these algorithms are NP-complete, SAT solvers and other reasoners can answer queries efficiently for practical problems, even for very large feature models [67, 68, 94]. For further details, see a recent survey on automated analysis operations and tools [18].
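To make the propositional encoding tangible, the following minimal sketch (our own illustration, not part of FeatureIDE or any tool discussed here) enumerates all 2^9 selections over the nine features of Figure 4 and counts those that satisfy the formula above; it confirms the ten valid selections mentioned earlier. Real reasoners hand the formula to a SAT solver instead of enumerating, which scales to feature models with thousands of features.

public class FeatureModelCheck {
    // Feature indices: 0=Database, 1=Base, 2=OS, 3=Transactions, 4=Storage,
    // 5=Windows, 6=Linux, 7=List, 8=BTree
    static boolean valid(boolean[] f) {
        boolean database = f[0], base = f[1], os = f[2], transactions = f[3], storage = f[4];
        boolean windows = f[5], linux = f[6], list = f[7], btree = f[8];
        return database
            && (base == database)                 // Base <=> Database (mandatory)
            && (os == database)                   // OS <=> Database (mandatory)
            && (!transactions || database)        // Transactions => Database (optional)
            && (storage == database)              // Storage <=> Database (mandatory)
            && ((windows || linux) == os)         // Windows v Linux <=> OS
            && !(windows && linux)                // alternative group: not both
            && ((list || btree) == storage)       // List v B-Tree <=> Storage (or-group)
            && (!transactions || btree);          // cross-tree: Transactions => B-Tree
    }

    public static void main(String[] args) {
        int count = 0;
        for (int bits = 0; bits < (1 << 9); bits++) {
            boolean[] f = new boolean[9];
            for (int i = 0; i < 9; i++) {
                f[i] = ((bits >> i) & 1) == 1;    // decode one feature selection
            }
            if (valid(f)) {
                count++;
            }
        }
        System.out.println(count + " valid feature selections"); // prints 10
    }
}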

Tooling. There are many languages and tools to manage feature models or draw feature diagrams, ranging from dozens of academic prototypes to fully fledged commercial systems such as Gears2 and pure::variants.3 For a research setting, we recommend FeatureIDE, an Eclipse plugin that (among others) provides a sophisticated graphical feature-model editor and supporting tools [57].

2 http://www.biglever.com/solution/product.html
3 http://www.pure-systems.com; a limited community edition is available free of charge, and the authors are open for research collaborations.


Our graphics of feature diagrams (Fig. 3 and 4) have been exported from FeatureIDE. FeatureIDE includes many facilities for reasoning about features using a SAT solver, following the described translation to propositional formulas. FeatureIDE is open source, and also isolated parts such as the reasoning engine can be reused; contributions are encouraged. FeatureIDE is available at http://fosd.net/fide.

2.3 What Is Feature-Oriented Software Development?

The concept of a feature is useful to describe commonalities and variabilities in the analysis, design, and implementation of software systems. FOSD is a paradigm that encourages the systematic application of the feature concept in all phases of the software life cycle. Features are used as first-class entities to analyze, design, implement, customize, debug, or evolve a software system. That is, features not only emerge from the structure and behavior of a software system (e.g., in the form of the software's observable behavior), but are also used explicitly and systematically to define variabilities and commonalities, to facilitate reuse, to structure software along these variabilities and commonalities, and to guide the testing process. A distinguishing property of FOSD is that it aims at a clean (ideally one-to-one) mapping between the representations of features across all phases of the software life cycle. That is, features specified during the analysis phase can be traced through design and implementation.

The idea of FOSD was not proposed as such in the first place but emerged from the different uses of features. Our main goal is to convey the idea of FOSD as a general development paradigm. The essence of FOSD can be summarized as follows: on the basis of the feature concept, FOSD facilitates the structure, reuse, and variation of software in a systematic and uniform way.

3 Feature-Oriented Programming

The key idea of feature-oriented programming is to decompose a system's design and code along the features it provides [16, 77]. Feature-oriented programming follows a disciplined language-oriented approach, based on feature composition.

3.1 Collaboration-Based Design

A popular technique for decomposing feature-oriented systems is collaboration-based design [85]. In Figure 5, we show a sample collaboration-based design of a simple object-oriented expression evaluator. A collaboration is a set of program elements that cooperate systematically to implement a feature. In an object-oriented world, a collaboration typically comprises multiple classes and even only fragments of classes. The top-most collaboration (Expr) consists of three classes: Expr, an abstract class for representing expressions, Val for representing literals, and Add for representing addition. Each class defines a single operation toString for pretty printing. The collaboration Eval adds the new operation eval, which evaluates an expression. Evaluation is a crosscutting concern because eval must be defined by adding a method to each of the three classes. A collaboration bundles these changes.


[Figure: collaboration Expr introduces class Expr (String toString()), class Val (int val, Val(int), String toString()), and class Add (Expr a, Expr b, Add(Expr, Expr), String toString()), connected by inheritance; collaboration Eval contains refines class Expr, refines class Val, and refines class Add, each adding int eval() and connected to the base classes by refinement.]

Fig. 5. Collaboration-based design of a simple expression evaluator

3.2 Feature Modules

In feature-oriented programming, each collaboration implements a feature and is called a feature module [10, 16]. Different combinations of feature modules satisfy different needs of customers or application scenarios. Figure 5 illustrates how features crosscut the given hierarchical (object-oriented) program structure. In contemporary feature-oriented–programming languages and tools, such as AHEAD [16], FeatureC++ [9], FeatureHouse [7], or Fuji [8], collaborations are represented by file-system directories, called containment hierarchies, and classes and their refinements are stored in files. Features are selected by name via command-line parameters or graphical tools. In Figure 6, we show a snapshot of the containment hierarchies and the feature model of the simple expression evaluator in FeatureIDE.
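For orientation, the containment hierarchies of the two collaborations from Figure 5 could be laid out roughly as follows (a hypothetical sketch; the actual layout shown in Figure 6 is managed by FeatureIDE, with Jak sources stored in .jak files):

features/
  Expr/  Expr.jak  Val.jak  Add.jak   (base classes of collaboration Expr)
  Eval/  Expr.jak  Val.jak  Add.jak   (class refinements of collaboration Eval)

Selecting features Expr and Eval then composes the corresponding directories file by file.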

A feature module refines the content of a base program either by adding new elements or by modifying and extending existing elements. The order in which features are applied is important; earlier features in the sequence may add elements that are refined by later features.

3.3 Jak

Jak is an extension of Java for feature-oriented programming [16]. Figure 7 depicts the Jak implementation of an extended version of the collaboration-based design of Figure 5.

Feature Expr represents the base program. It defines class Expr, along with two terms: Val for integer literals and Add for addition. It also defines a single operation toString for pretty printing.

Feature Eval adds the new operation eval, which evaluates an expression. The feature module contains three class refinements (partial classes, using the keyword refines) that extend other classes by introducing additional methods. During composition, a class is composed with all its refinements.


Fig. 6. Containment hierarchy (left) and feature model (right) of the expression-evaluator example


Feature Mult introduces the new class Mult and refines a previously defined method in class Add to fix operator precedence. Refining a method is similar to method overriding; the new version of the method may call the old version using Jak's keyword Super.

Finally, features Eval and Mult are each designed to extend Expr. However, they are not completely orthogonal. The combination of a new variant and a new operation creates a “missing piece” that must be filled in to create a complete program. We thus define an additional feature, called lifter [77] or derivative [65], that defines how each feature should be extended in the presence of the others. The derivative ‘Mult#Eval’ is present when both features Mult and Eval are present.

3.4 AHEAD

AHEAD is an architectural model of feature-oriented programming [16]. With AHEAD, each feature is represented by a containment hierarchy, which is a directory that maintains a substructure organizing the feature's artifacts (cf. Fig. 6). Composing features means composing containment hierarchies and, to this end, composing corresponding artifacts recursively by name and type (see Fig. 10 for an example), much like the mechanisms of hierarchy combination [70, 89], mixin composition [20, 24, 37, 38, 85], and superimposition [21, 22].


Feature Expr

abstract class Expr {
  abstract String toString();
}
class Val extends Expr {
  int val;
  Val(int n) { val = n; }
  String toString() { return String.valueOf(val); }
}
class Add extends Expr {
  Expr a; Expr b;
  Add(Expr e1, Expr e2) { a = e1; b = e2; }
  String toString() { return a.toString() + "+" + b.toString(); }
}

Feature Eval refines Expr

refines class Expr {
  abstract int eval();
}
refines class Val {
  int eval() { return val; }
}
refines class Add {
  int eval() { return a.eval() + b.eval(); }
}

Feature Mult refines Expr

class Mult extends Expr {
  Expr a; Expr b;
  Mult(Expr e1, Expr e2) { a = e1; b = e2; }
  String toString() { return "(" + a.toString() + "*" + b.toString() + ")"; }
}
refines class Add {
  String toString() { return "(" + Super().toString() + ")"; }
}

Derivative Mult#Eval

refines class Mult {
  int eval() { return a.eval() * b.eval(); }
}

Fig. 7. A solution to the “expression problem” in Jak
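To illustrate the composition semantics, the following sketch shows roughly what class Add amounts to when features Expr, Eval, and Mult are all selected. This is our illustrative reading of the refinements in Figure 7 (assuming the correspondingly composed classes Expr and Val), not the literal output of the Jak composition tools, which may differ in detail; Super().toString() is resolved to the previous definition of toString.

// Hypothetical result of composing class Add from features Expr, Eval, and Mult
class Add extends Expr {
  Expr a; Expr b;
  Add(Expr e1, Expr e2) { a = e1; b = e2; }
  String toString() {
    // Mult's refinement wraps the original definition introduced by feature Expr
    return "(" + (a.toString() + "+" + b.toString()) + ")";
  }
  int eval() { return a.eval() + b.eval(); } // introduced by feature Eval
}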

In contrast to these earlier approaches, for each artifact type, a different implementation of the composition operator ‘•’ has to be provided in AHEAD (i.e., different tools that perform the composition, much like Jak for Java artifacts). The background is that a complete software system does not just involve Java code. It also involves many non-code artifacts. For example, the simple expression evaluator of Figure 7 may be paired with a grammar specification, providing concrete syntax for expressions, and documentation in XHTML. For grammar specifications and XML-based languages, the AHEAD tool suite has dedicated composition tools.

Bali. Bali is a tool for synthesizing program-manipulation tools on the basis of extensible grammar specifications [16]. It allows a programmer to define a grammar and to refine it subsequently, in a similar fashion to class refinements in Jak.


Feature Expr

Expr: Val | Expr Oper Expr;
Oper: '+';
Val: INTEGER;

Feature Mult refines Expr

Oper: Super.Oper | '*';

Fig. 8. A Bali grammar with separate features for addition and multiplication

Figure 8 shows a grammar and a grammar refinement that correspond to the Jak program above. The base program defines the syntax of arithmetic expressions that involve addition only. We then refine the grammar by adding support for multiplication.

Bali is similar to Jak in its use of the keyword Super: the expression Super.Oper refers to the original definition of Oper.

Xak. Xak is a language and tool for composing various kinds of XML documents [2]. It enhances XML by a module structure useful for refinement. This way, a broad spectrum of software artifacts can be refined à la Jak (e.g., UML diagrams, build scripts, service interfaces, server pages, or XHTML).

Figure 9 depicts an XHTML document that contains documentation for our expression evaluator. The base documentation file describes addition only, but we refine it to add a description of evaluation and multiplication as well. The tag xak:module labels a particular XML element with a name that allows the element to be refined by subsequent features. The tag xak:extends overrides an element that has been named previously, and the tag xak:super refers to the original definition of the named element, just like the keyword Super in Jak and Bali.

AHEAD Tool Suite. Jak, Xak, and Bali are each designed to work with a particular kind of software artifact. The AHEAD tool suite brings these separate tools together into a system that can handle many different kinds of software artifacts.

In AHEAD, a piece of software is represented as a directory of files. Composing two directories together will merge subdirectories and files with the same name. AHEAD will select different composition tools for different kinds of files. Merging Java files will invoke Jak to refine the classes, whereas merging XML files will invoke Xak to combine the XML documents, and so on, as illustrated in Figure 10.

3.5 FeatureHouse

Recently, following the philosophy of AHEAD, the FeatureHouse tool suite has been developed, which allows programmers to enhance given languages rapidly with support for feature-oriented programming (e.g., C#, C, JavaCC, Haskell, Alloy, and UML [7]).


Feature Expr

<html xmlns:xak="http://www.onekin.org/xak" xak:artifact="Expr" xak:type="xhtml">
  <head><title>A Simple Expression Evaluator</title></head>
  <body bgcolor="white">
    <h1 xak:module="Contents">A Simple Expression Evaluator</h1>
    <h2>Supported Operations</h2>
    <ul xak:module="Operations">
      <li>Addition of integers</li>
      <!-- a description of how integers are added -->
    </ul>
  </body>
</html>

Feature Eval refines Expr

<xak:refines xmlns:xak="http://www.onekin.org/xak" xak:artifact="Eval" xak:type="xhtml">
  <xak:extends xak:module="Contents">
    <xak:super xak:module="Contents"/>
    <h2>Evaluation of Arithmetic Expressions</h2>
    <!-- a description of how expressions are evaluated -->
  </xak:extends>
</xak:refines>

Feature Mult refines Expr

<xak:refines xmlns:xak="http://www.onekin.org/xak" xak:artifact="Mult" xak:type="xhtml">
  <xak:extends xak:module="Operations">
    <xak:super xak:module="Operations"/>
    <li>Multiplication of integers</li>
    <!-- a description of how integers are multiplied -->
  </xak:extends>
</xak:refines>

Fig. 9. A Xak/XHTML document with separate features for addition, evaluation, and multiplication

FeatureHouse is a framework for software composition supported by a corresponding tool chain. It provides facilities for feature composition based on a language-independent model of software artifacts and an automatic plugin mechanism for the integration of new artifact languages. FeatureHouse improves over prior work on AHEAD in that it implements language-independent software composition.

Feature Structure Trees. FeatureHouse relies on a general model of the structure of software artifacts, called the feature structure tree (FST) model. An FST represents the essential structure of a software artifact and abstracts from language-specific details. For example, an artifact written in Java contains packages, classes, methods, and so forth, which are represented by nodes in its FST; a Haskell program contains equations, algebraic data types, type classes, etc., which contain further elements; a makefile or build script consists of definitions and rules that may be nested.

Each node of an FST has (1) a name that is the name of the corresponding structural element and (2) a type that represents the syntactic category of the corresponding structural element. For example, a Java class Foo is represented by a node Foo of type Java class. Essentially, an FST is a stripped-down abstract syntax tree (AST): it contains only information that is necessary for the specification of the modular structure of an artifact and for its composition with other artifacts.


Fig. 10. Composing containment hierarchies by superimposition [16]

[Figure: the FST of class Val in feature Expr (children val, Val(int), toString()) is superimposed with the FST of class Val in feature Eval (child eval()), yielding the composed FST of Val with children val, eval(), Val(int), and toString().]

Fig. 11. Superimposition of feature structure trees (excerpt of the expression example)

The inner nodes of an FST denote modules (e.g., classes and packages) and the leaves carry the modules' content (e.g., method bodies and field initializers). We call the inner nodes nonterminals and the leaves terminals. For illustration, in Figure 11, we depict on the left side the FST concerning class Val of feature Expr.

Which code elements are represented as inner nodes and leaves? This depends on the language and on the level of granularity at which software artifacts are to be composed [50]. Different granularities are possible and might be desired in different contexts. For Java, we could represent only packages and classes but not methods or fields as FST nodes (a coarse granularity), or we could also represent statements or expressions as FST nodes (a fine granularity). In any case, the structural elements not represented in the FST are text content of terminal nodes (e.g., the body of a method). In our experience, the granularity of Figure 11 is usually sufficient for the composition of Java artifacts.

Superimposition. The composition of software artifacts proceeds by the superimposition of the corresponding FSTs, denoted by ‘•’. Much like in AHEAD, two FSTs are superimposed by merging their nodes, identified by their names, types, and relative positions, starting from the root and descending recursively. Figure 11 illustrates the process of FST superimposition with the expression example (only concerning class Val).


Generally, the composition of two leaves of an FST that contain further content demands a special treatment. The reason is that the content is not represented as a subtree but as plain text. Method bodies are composed differently from fields, Haskell functions, or Bali grammar productions. The solution is that, depending on the artifact language and node type, different rules for the composition of terminals are used. Often simple rules such as replacement, concatenation, specialization, or overriding suffice, but the approach is open to more sophisticated rules known from multi-dimensional separation of concerns [71] or software merging [69]. For example, we merge two method bodies via overriding, in which Super defines how the bodies are merged, much like in Jak.
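To convey the flavor of this mechanism, here is a minimal sketch in Java of FST superimposition (our own simplification, not FeatureHouse's actual implementation): nonterminal nodes are merged by name and type, descending recursively, and terminal content is resolved by a placeholder rule that a real implementation would replace with language-specific composition rules (e.g., method overriding with Super).

import java.util.ArrayList;
import java.util.List;

class FstNode {
    final String name;
    final String type;
    String content;                          // non-null only for terminals, e.g., a method body
    final List<FstNode> children = new ArrayList<>();

    FstNode(String name, String type) {
        this.name = name;
        this.type = type;
    }

    // Superimpose b onto a: matching children (same name and type) are composed
    // recursively; children that occur in only one tree are copied unchanged.
    static FstNode superimpose(FstNode a, FstNode b) {
        FstNode result = new FstNode(a.name, a.type);
        if (a.content != null || b.content != null) {
            // Placeholder terminal rule ("later feature wins"); a real tool dispatches
            // on artifact language and node type, e.g., overriding via Super().
            result.content = (b.content != null) ? b.content : a.content;
            return result;
        }
        for (FstNode ac : a.children) {
            FstNode match = find(b, ac.name, ac.type);
            result.children.add(match == null ? ac : superimpose(ac, match));
        }
        for (FstNode bc : b.children) {
            if (find(a, bc.name, bc.type) == null) {
                result.children.add(bc);     // element newly introduced by the later feature
            }
        }
        return result;
    }

    private static FstNode find(FstNode parent, String name, String type) {
        for (FstNode child : parent.children) {
            if (child.name.equals(name) && child.type.equals(type)) {
                return child;
            }
        }
        return null;
    }
}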

Generation and Automation. New languages can be plugged easily into FeatureHouse. The idea is that, although artifact languages are very different, the process of software composition by superimposition is very similar. For example, the developers of AHEAD/Jak [16] and FeatureC++ [9] have extended the artifact languages Java and C++ by constructs (e.g., refines or Super) and mechanisms for composition. They have each implemented a parser, a superimposition algorithm, and a pretty printer4 – all specific to the artifact language. We have introduced the FST model to be able to express superimposition independently of an artifact language [11].

In FeatureHouse, we automate the integration of further languages and base it largely on the languages' grammars. This allows us to generate most of the code that must otherwise be provided and integrated manually (parser, adapter, pretty printer) and to experiment with different representations of software artifacts. Our tool FSTGenerator expects the grammar of the language to be integrated in a specific format, called FeatureBNF, and generates a parser, adapter, and pretty printer accordingly. Using a grammar written in FeatureBNF, FSTGenerator generates (a) an LL(k) parser that directly produces FST nodes and (b) a corresponding pretty printer. After the generation step, composition proceeds as follows: (1) the generated parser receives artifacts written in the target language and produces one FST per artifact; (2) FeatureHouse performs the composition; and (3) the generated pretty printer writes the composed artifacts to disk. For the composition of the content of terminal nodes, we have developed and integrated a library of composition rules (e.g., rules for method overriding or for the concatenation of the statements of two constructors). Figure 12 illustrates the interplay between FSTGenerator and FeatureHouse.

A detailed description of FSTGenerator and FeatureBNF is available elsewhere [7].

Tooling. Both AHEAD5 and FeatureHouse6 are available for experimentation, including several examples. Both are command-line tools.

4 With ‘pretty printer’ we refer to a tool, also known as unparser, that takes a parse tree or an FST and generates source code.

5 http://www.cs.utexas.edu/users/schwartz/ATS.html
6 http://fosd.net/fh


[Figure: FSTGenerator reads a FeatureBNF grammar for an artifact language (Java, C, C#, JavaCC, Haskell, Alloy, ...) and generates a parser and a pretty printer; the parser turns source code into FSTs, the FSTComposer composes the FSTs using a library of composition rules, and the pretty printer writes the composed source code.]

Fig. 12. The architecture of FeatureHouse

FeatureIDE provides a graphical front end in Eclipse, with corresponding editors for Jak, a mapping from features to feature modules, automatic composition of selected features in the background, generation of collaboration diagrams, and much more [57, 62]. FeatureIDE ships with AHEAD and FeatureHouse and several example projects, ready to explore. After a developer graphically configures the desired features, FeatureIDE automatically calls the corresponding composition tools. For developers familiar with Eclipse, it is likely the easiest way to try AHEAD or FeatureHouse. Recently, Batory even contributed a video tutorial on FeatureIDE.7

4 Virtual Separation of Concerns

Recently, several researchers have taken a different path to tackle more disciplined product-line implementations. Instead of inventing new languages and tools that support feature decomposition, they stay close to the concept of conditional compilation with preprocessors, but improve it at a tooling level. The goal is to keep the familiar and simple mechanisms of annotating code fragments in a common implementation (e.g., as with the C preprocessor), but to emulate modularity with tool support and to provide navigation facilities as well as error diagnostics. We work around the limitations for which traditional preprocessors are typically criticized.

4.1 Variability Implementation with Preprocessors

Conditional-compilation mechanisms of preprocessors provide an easy strategy to implement compile-time variability in product lines. The concept is simple: Developers annotate code fragments with feature expressions. Subsequently, the preprocessor removes certain annotated code fragments before compilation, depending on the feature selection.

7 http://www.cs.utexas.edu/users/dsb/cs392f/Videos/FeatureIDE/



To introduce preprocessors, we exemplify a preprocessor-based implementation of the “expression problem” from Figure 7 in Figure 13. We use the preprocessor Antenna,8 which was developed for Java code on mobile platforms. Conditional compilation in Antenna uses almost the same notation as in the C preprocessor, but preprocessor directives are written in comments, to not interfere with existing tool support for Java code. Variable code fragments are framed with #ifdef and #endif directives. In a feature-oriented context, the #ifdef directives refer to features from the feature model. If the corresponding feature is not selected, the code fragment between the #ifdef and the #endif directive is removed before compilation. Furthermore, #ifdef directives may be nested, so that code is only included if multiple features are selected; for example, Line 45 in Figure 13 is only included if features Mult and Eval are both selected (equivalent to the derivative modules discussed in Section 3.3).

4.2 Disciplined Preprocessor Usage

A main problem of traditional (lexical) preprocessors, such as the C preprocessor, is that they are oblivious to the underlying host language and the variability specification. It is possible to annotate individual tokens such as a closing bracket, leading to hard-to-find syntax errors. For the same reason, parsing unpreprocessed code for analysis is a difficult task (a parser can hardly foresee all possibilities of how the preprocessor is used) [17, 39, 56, 64, 73]. The mapping between features in a feature model and #ifdef flags is not checked; hence, a typo in a flag name leads to never compiling this code fragment [90]. In feature-oriented programming, these problems do not occur, because the underlying language allows only disciplined usage, but preprocessors are a different story. Overall, the flexibility of lexical preprocessors allows undisciplined use that is hard to understand, to debug, and to analyze.

To overcome the above problems, we require a disciplined use of preprocessors. With disciplined use, we mean that annotations (in the simplest form, #ifdef flags) must correspond to feature names in a feature model and that annotations align with the syntactic structure of the underlying language [50, 54, 64]. For example, annotating an entire statement or an entire function is considered disciplined; the annotation aligns with the language constructs of the host language. In contrast, we consider annotating an individual bracket or just the return type of a function as undisciplined. In Figure 14, we illustrate several examples of disciplined and undisciplined annotations from the code of the text editor vim. A restriction to disciplined annotations enables easy parsing of the source code [17, 64, 66] and hence makes the code available to automated analysis (including variability-aware analysis, as discussed in Sec. 5). Code with disciplined annotations can be represented in the choice calculus [33], which opens the door for formal reasoning and for developing a mathematical theory of annotation-based FOSD. As a side effect, it guarantees that all variants are syntactically correct [54].

8 http://antenna.sourceforge.net/


1  abstract class Expr {
2    abstract String toString();
3    //#ifdef EVAL
4    abstract int eval();
5    //#endif
6  }
7
8  class Val extends Expr {
9    int val;
10   Val(int n) { val = n; }
11   String toString() { return String.valueOf(val); }
12   //#ifdef EVAL
13   int eval() { return val; }
14   //#endif
15 }
16
17 class Add extends Expr {
18   Expr a; Expr b;
19   Add(Expr e1, Expr e2) { a = e1; b = e2; }
20   String toString() {
21     StringBuffer r = new StringBuffer();
22     //#ifdef MULT
23     r.append("(");
24     //#endif
25     r.append(a.toString());
26     r.append("+");
27     r.append(b.toString());
28     //#ifdef MULT
29     r.append(")");
30     //#endif
31     return r.toString();
32   }
33   //#ifdef EVAL
34   int eval() { return a.eval() + b.eval(); }
35   //#endif
36 }
37
38 //#ifdef MULT
39 class Mult extends Expr {
40   Expr a; Expr b;
41   Mult(Expr e1, Expr e2) { a = e1; b = e2; }
42   String toString() { return "(" + a.toString() + "*" + b.toString() + ")";
43   }
44   //#ifdef EVAL
45   int eval() { return a.eval() * b.eval(); }
46   //#endif
47 }
48 //#endif

Fig. 13. A preprocessor-based implementation of the “expression problem” from Figure 7


void tcl_end() {
#ifdef DYNAMIC_TCL
    if (hTclLib) {
        FreeLibrary(hTclLib);
        hTclLib = NULL;
    }
#endif
}

disciplined annotation

int n = NUM2INT(num);
#ifndef FEAT_WINDOWS
w = curwin;
#else
for (w = firstwin; w != NULL; w = w->w_next, --n)
#endif
    if (n == 0)
        return window_new(w);

undisciplined annotation (for wrapper)

if (char2cells(c) == 1
#if defined(FEAT_CRYPT) || defined(FEAT_EVAL)
    && cmdline == 0
#endif
    )

undisciplined annotation at expression level

if (!ruby_initialized) {
#ifdef DYNAMIC_RUBY
    if (ruby_enabled(TRUE)) {
#endif
        ruby_init();

undisciplined annotation (if wrapper)

Fig. 14. Examples of disciplined and undisciplined annotations in vim [64]


There are different ways to enforce annotation discipline. For example, we can introduce conditional compilation facilities into a programming language, instead of using an external preprocessor, as done in D9 and rbFeatures [41]. Similarly, syntactic preprocessors allow only transformations based on the underlying structure [23, 66, 97]. Alternatively, we can check discipline after the fact by running additional analysis tools (however, even though Linux has a script to check preprocessor flags against a feature model, Tartler et al. report several problems in Linux with incorrect config flags, as the tool is apparently not used [90]). Finally, in our tool CIDE, we map features to code fragments entirely at the tool level, such that the tool allows only disciplined annotations; hence, a developer is not able to make an undisciplined annotation in the first place [50].

Enforcing annotation discipline limits the expressive power of annotations and may require somewhat higher effort from developers, who need to rewrite some code fragments. Nevertheless, experience has shown that the restriction to disciplined annotations is not a serious limitation in practice [17, 54, 64, 96]. Developers can usually rewrite undisciplined annotations locally into disciplined ones – there is even initial research to automate this process [39, 56]. Furthermore, developers usually prefer disciplined annotations anyway (and sometimes, e.g., in Linux, have corresponding guidelines), because they understand the threats to code comprehension from undisciplined usage. Liebig et al. have shown that 84% of all #ifdef directives in 40 substantial C programs are already in a disciplined form [64]. So, we argue that enforcing discipline, at least for new projects, should be a viable path that eliminates many problems of traditional preprocessors.
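As a concrete illustration of such a local rewrite (our own hypothetical example in Antenna-style Java, not taken from vim; the names isPrintable, cmdlineActive, and the flag CRYPT are made up), an expression-level annotation can typically be turned into a disciplined, statement-level one:

// Undisciplined: the annotation covers only part of an expression.
boolean accept = isPrintable(c)
    //#ifdef CRYPT
    && !cmdlineActive
    //#endif
    ;

// Disciplined: the annotation wraps complete statements only.
boolean accept = isPrintable(c);
//#ifdef CRYPT
accept = accept && !cmdlineActive;
//#endif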

9 http://www.digitalmars.com/d/


Disciplined usage of annotations opens annotation-based implementations to many forms of analysis and tool support, some of which we describe in the following. Many of them would not have been possible with traditional lexical preprocessors.

4.3 Views

One of the key motivations for modularizing features (for example, with feature-oriented programming) is that developers can find all code of a feature in one spot and reason about it without being distracted by other concerns. Clearly, a scattered, preprocessor-based implementation, as in Figure 2, does not support this kind of lookup and reasoning, but the core question “what code belongs to this feature?” can still be answered by tool support in the form of views [44, 58, 84].

With relatively simple tool support, it is possible to create an (editable) view on the source code by hiding all irrelevant code of other features. In the simplest case, we hide files from the file browser in an IDE. Developers will only see files that contain code of certain features selected interactively by the user. This way, developers can quickly explore all code of a feature without global code search.

In addition, views can filter code within a file (technically, this can be implemented like code folding in modern IDEs).10 In Figure 15, we show an example of a code fragment and a view on its feature Transaction (TXN). Note that we cannot simply remove everything that is not annotated by #ifdef directives, because we could end up with completely unrelated statements. Instead, we need to provide some context (e.g., in which class and method the statement is located); in Figure 15, we highlight the context information in gray and italic font. Interestingly, similar context information is also present in modularized implementations in the form of class refinements, method signatures, pointcuts, or extension points.

Beyond views on one or more individual features, (editable) views on variants are possible [13, 43, 58]. That is, a tool can show the source code that would be generated for a given feature selection and hide all remaining code of unselected features. With such a view, a developer can explore the behavior of a variant when multiple features interact, without being distracted by code of unrelated features. This goes beyond the power of physical separation with tools such as FeatureHouse, with which the developer has to reconstruct the behavior of multiple components/plug-ins/aspects in her mind. Especially when many fine-grained features interact, from our experience, views can be a tremendous help. Nevertheless, some desirable properties such as separate compilation or modular type checking cannot be achieved with views.

10 Although editable views are harder to implement than read-only views, they are more useful since users do not have to go back to the original code to modify it. Implementations of editable views have been discussed intensively in work on database or model-roundtrip engineering. Furthermore, a simple but effective solution, which we apply in our tools, is to leave a marker indicating hidden code [50]. Thus, modifications occur before or after the marker and can be unambiguously propagated to the original location.


class Stack implements IStack {
  void push(Object o) {
    //#ifdef TXN
    Lock l = lock(o);
    //#endif
    //#ifdef UNDO
    last = elementData[size];
    //#endif
    elementData[size++] = o;
    //#ifdef TXN
    l.unlock();
    //#endif
    fireStackChanged();
  }
  //#ifdef TXN
  Lock lock(Object o) {
    return LockMgr.lockObject(o);
  }
  //#endif
  ...
}

(a) original (all features selected)

class Stack [] {
  void push([]) {
    Lock l = lock(o);
    []
    l.unlock();
    []
  }
  Lock lock(Object o) {
    return LockMgr.lockObject(o);
  }
  []
}

(b) view on TXN (hidden code is indicated by ‘[]’, necessary context information is shown in gray italics)

Fig. 15. View emulates separation of concerns [47]

Hence, views can emulate some advantages of separating features as in feature-oriented programming. Developers can quickly explore all code of a feature and can deliberately navigate between features by switching between different views. We have implemented the described views in our tool CIDE [50]. Instead of a physical separation of features into separate files or directories, views provide a virtual separation, hence the name virtual separation of concerns.

4.4 Coping with Obfuscated Source Code

Traditional preprocessors have a reputation for obfuscating source code such that the resulting code is difficult to read and maintain. The reason is that preprocessor directives and statements of the host language are intermixed. When reading source code, many #ifdef and #endif directives distract from the actual code and can destroy the code layout (with cpp, every directive must be placed on its own line). There are cases in which preprocessor directives entirely obfuscate the source code, as illustrated in Figure 16 (see Footnote 11) and in our previous FemtoOS example in Figure 2. Furthermore, nested preprocessor directives and multiple directives belonging to different features, as in Figure 1, are other typical causes of obfuscated code.

11 In the example in Figure 16, preprocessor directives are used for Java code at a fine granularity [50], annotating not only statements but also parameters and parts of expressions. We need to add eight additional lines just for preprocessor directives. Together with additional necessary line breaks, we need 21 instead of 9 lines for this code fragment.


class Stack {
  void push(Object o
    //#ifdef TXN
    , Transaction txn
    //#endif
  ) {
    if (o==null
      //#ifdef TXN
      || txn==null
      //#endif
    ) return;
    //#ifdef TXN
    Lock l=txn.lock(o);
    //#endif
    elementData[size++] = o;
    //#ifdef TXN
    l.unlock();
    //#endif
    fireStackChanged();
  }
}

Fig. 16. Java code obfuscated by fine-grained annotations with cpp

class Stack {
  void push(Object o, Transaction txn) {
    if (o==null || txn==null) return;
    Lock l=txn.lock(o);
    elementData[size++] = o;
    l.unlock();
    fireStackChanged();
  }
}

[Figure: the code fragments ‘, Transaction txn’, ‘|| txn==null’, ‘Lock l=txn.lock(o);’, and ‘l.unlock();’ are highlighted with the background color assigned to feature Transaction.]

Fig. 17. Annotated code represented by background color instead of textual annotation [49]

While language-based mechanisms such as feature-oriented programming avoid this obfuscation by separating feature code, researchers have explored several ways to improve the representation in the realm of preprocessors: First, textual annotations with a less verbose syntax that can be used within a single line could help, and can be used with many tools. Second, views can help programmers to focus on the relevant code, as discussed above. Third, visual means can be used to differentiate annotations from source code: just as some IDEs for PHP use different font styles or background colors to emphasize the difference between HTML and PHP in a single file, different graphical means can be used to distinguish preprocessor directives from the remaining source code. Finally, it is possible to eliminate textual annotations altogether and use the representation layer to convey annotations, as we show next.

In our tool CIDE, we abandoned textual annotations in favor of background colors to represent annotations [50]. For example, all code belonging to feature Transaction is highlighted with the background color red. Using the representation layer, our example from Figure 16 also becomes much shorter, as shown in Figure 17. The use of background colors mimics our initial steps of marking features on printouts with colored text markers and can easily be implemented, since the background color is not yet used in most IDEs. Instead of background colors, the tool Spotlight uses colored lines next to the source code [29]. Background colors and lines are especially helpful for long and nested annotations, which may otherwise be hard to track. We are aware of some potential problems of using colors (e.g., humans are only able to distinguish a certain number of colors), but still, there are many interesting possibilities to explore; for example, usually a few colors for the features a developer currently focuses on are sufficient. Recently, the tool FeatureCommander combined background colors, lines, and several further enhancements in a way that scales for product lines with several hundred features [35].



further enhancements in a way that scales for product lines with several hundred features [35].

Despite all visual enhancements, there is one important lesson: Using preprocessors does not require giving up modularity altogether; rather, it frees programmers from the burden of having to physically modularize everything. Typically, most of a feature's code will still be implemented mostly modularly, by a number of modules or classes, but additional statements for method invocations may be scattered in the remaining implementation as necessary. In most implementations, there are rarely annotations from more than two or three features on a single page of code [47].

4.5 Summary

There are many directions in which we can improve annotation-based implementations without replacing them with alternative implementation approaches such as feature-oriented programming. Disciplined annotations remove many low-level problems and open the implementation for further analysis; views emulate modularity by providing a virtual separation of concerns; and visualizations reduce code cluttering. At the same time, we keep the flexibility and simplicity of preprocessors: Developers still just mark and optionally remove code fragments from a common implementation.

Together, these improvements can turn traditional preprocessors into a viable alternative to composition-based approaches such as feature-oriented programming. Still, there are trade-offs: For example, virtual separation does not support true modularity and corresponding benefits such as separate compilation, whereas compositional approaches have problems at a fine granularity. Even combining the two approaches may yield additional benefits. We have explored these differences and synergies elsewhere [47,48]. Recently, we have also explored automated transformations between the two representations [51]. We cannot make a recommendation for one or the other approach; we believe that much (empirical) evaluation is still necessary. Currently, we are exploring both paths in parallel.

Tooling. Basic preprocessors are widely available for most languages. For Java, Antenna is a good choice, for which tool integration in Eclipse and NetBeans is also available. Most advanced concepts discussed here have been implemented in our tool CIDE as an Eclipse plugin (footnote 12). CIDE uses the feature-model editor and reasoning engine from FeatureIDE. CIDE is open source and comes with a number of examples and a video tutorial. Visualizations have been explored further in View Infinity (footnote 13) and FeatureCommander (footnote 14), the latter of which comes with Xenomai (a real-time extension for Linux with 700 features) as an example. For graphical models, FeatureMapper (footnote 15) provides similar functionality.

12 http://fosd.net/cide
13 http://fosd.net/vi
14 http://fosd.net/fc
15 http://featuremapper.org/



5 Variability-Aware Analysis

The analysis of product lines is difficult. The exponential explosion (up to 2^n variants for n features) makes a brute-force approach infeasible. At the same time, checking only sampled variants or variants currently shipped to customers leads to the effect that errors can lurk in the system for a long time. Errors are detected late, only when a specific feature combination is requested for the first time (when the problem is more expensive to find and fix). While this may work for in-house development with only a few products per year (e.g., software bundled with a hardware product line), checking variants in isolation obviously does not scale, especially for systems in which users can freely select features (e.g., Linux).

Variability-aware analysis is the idea of lifting an analysis mechanism from a single system to the product-line world. Variability-aware analysis extends traditional analysis by reasoning about variability. Hence, instead of checking variants, variability is checked locally where it occurs inside the product-line implementation (without variant generation). Variability-aware analysis has been proposed for many different kinds of analysis, including type checking [5,53,92], model checking [12,27,60,76], theorem proving [95], and parsing [56]; other kinds of analyses can probably be lifted similarly. There are very different strategies, but the key idea is usually similar. We will illustrate variability-aware analysis with type checking, first for annotation-based implementations, then for composition-based ones. Subsequently, we survey different general strategies.

5.1 Type Checking Annotation-Based Implementations

To illustrate variability-aware type checking, we use the trivial hello-world program with three features shown in Figure 18. From this program, we can generate eight different variants (with any combination of WORLD, BYE, and SLOW). Quite obviously, some of these programs are incorrect: Selecting neither WORLD nor BYE leads to a dangling variable access in the println parameter (msg has not been declared); selecting both WORLD and BYE leads to a variable declared twice.

To detect these errors with a brute-force approach, we would simply generate and type check all eight variants individually. While brute force seems acceptable in this example, it clearly does not scale for implementations with many features. Instead, variability-aware type checking uses a lifted type system that takes variability into account.

As a first step, we need to reason about conditions under which certain code fragments are included. Czarnecki and Pietroszek coined the term presence condition to describe, as a propositional formula, the condition under which a code fragment is included (the code line is included iff the presence condition of that line evaluates to true) [31]. In our example, the formulas are trivial: WORLD for Line 4, BYE for Line 7, SLOW ∧ WORLD for Line 12, and true for all other lines. With more complex #ifdef conditions and nesting, the formulas become more complex, as described in detail elsewhere [83].
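To make the derivation of presence conditions more concrete, the following sketch (our own illustration, not code from any of the cited tools; all names are made up) assigns a presence condition to every line of a file by keeping a stack of the currently open conditional-compilation conditions; the presence condition of a line is the conjunction of the conditions on the stack. It only recognizes the two directive forms used in Figure 18 (#ifdef X and #if defined(X) && defined(Y)); a real implementation must also handle #else, #elif, macro expansion, and arbitrary expressions, as discussed in [83].

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: annotate each non-directive line with the conjunction
// of all enclosing #ifdef conditions (its presence condition).
public class PresenceConditions {
  public static List<String> annotate(List<String> lines) {
    Deque<String> open = new ArrayDeque<>();   // currently open conditions
    List<String> result = new ArrayList<>();
    for (String line : lines) {
      String t = line.trim();
      if (t.startsWith("#ifdef ")) {
        open.push(t.substring("#ifdef ".length()).trim());
      } else if (t.startsWith("#if ")) {
        // e.g. "#if defined(SLOW) && defined(WORLD)" becomes "SLOW && WORLD"
        open.push(t.substring("#if ".length()).replace("defined(", "").replace(")", "").trim());
      } else if (t.startsWith("#endif")) {
        open.pop();
      } else {
        String pc = open.isEmpty() ? "true" : String.join(" && ", open);
        result.add(pc + " : " + line);   // presence condition : source line
      }
    }
    return result;
  }
}

Applied to the code in Figure 18, this annotates Line 4 with WORLD, Line 7 with BYE, Line 12 with SLOW && WORLD, and all remaining lines with true.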



 1  #include <stdio.h>
 2
 3  #ifdef WORLD
 4  char *msg = "Hello World\n";
 5  #endif
 6  #ifdef BYE
 7  char *msg = "Bye bye!\n";
 8  #endif
 9
10  main() {
11  #if defined(SLOW) && defined(WORLD)
12    sleep(10);
13  #endif
14
15    println(msg);
16  }

Fig. 18. Hello-world example with annotations

Now, we can formulate type rules based on presence conditions. For example, whenever we find an access to a local variable, we need to make sure that we can reach at least one declaration. In our example, we require that the presence condition of accessing msg (i.e., true) implies the presence condition of either declaration of msg (i.e., WORLD and BYE): true ⇒ (WORLD ∨ BYE). Since this formula is not a tautology, we detect that a variant selecting neither feature is not type correct. Similar reachability conditions for function calls are straightforward and uninteresting, because the target declaration in a header file has presence condition true. As an additional check, we require that multiple definitions with the same name must be mutually exclusive: ¬(WORLD ∧ BYE). This check reports an error for variants with both features. If the product line has a feature model describing the valid variants, we are only interested in errors in valid variants. By using a representation of the feature model as a propositional formula fm (translations are straightforward, cf. Sec. 2.2), we check only variants that are valid with respect to the feature model: fm ⇒ (true ⇒ (WORLD ∨ BYE)) and fm ⇒ ¬(WORLD ∧ BYE), as illustrated in Figure 19.

Fig. 19. Constraints in the hello-world example (the code of Figure 18, annotated with the following constraints):

fm ⇒ (true ⇒ (WORLD ∨ BYE))      (the access to msg in Line 15 must reach a declaration)
fm ⇒ (true ⇒ true)               (the call to println in Line 15 reaches a declaration with presence condition true)
fm ⇒ (SLOW ∧ WORLD ⇒ true)       (the call to sleep in Line 12 reaches a declaration with presence condition true)
fm ⇒ ¬(WORLD ∧ BYE)              (the two declarations of msg in Lines 4 and 7 must be mutually exclusive)



Abstracting from the example, we can define generic reachability and uniqueness conditions. A reachability condition between a caller and multiple targets is:

fm ⇒ ( pc(caller) ⇒ ⋁_{t ∈ targets} pc(t) )

where pc denotes a presence condition. The uniqueness condition that enforces that no variant defines multiple definitions is:

fm ⇒ ⋀_{d1, d2 ∈ definitions, d1 ≠ d2} ¬( pc(d1) ∧ pc(d2) )

Even for complex presence conditions and feature models, we can check whether these constraints hold efficiently with SAT solvers (Thaker et al. provide a good description of how to encode and implement this [92]) (footnote 16).
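To illustrate how such a check can be executed, the following sketch (our own illustration; it is not the encoding of Thaker et al. [92]) represents presence conditions and the feature model as predicates over a feature selection and verifies a formula by enumerating all selections over a given feature universe. The enumeration stands in for the SAT query; a real implementation would instead pass the negated formula to a SAT solver.

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: a propositional formula is a predicate over the set of
// selected features; validity is checked by enumerating all feature selections.
public class VariabilityChecks {

  // True iff 'formula' holds under every selection drawn from 'features'.
  static boolean isValid(List<String> features, Predicate<Set<String>> formula) {
    int n = features.size();
    for (long bits = 0; bits < (1L << n); bits++) {
      Set<String> selection = new HashSet<>();
      for (int i = 0; i < n; i++)
        if ((bits & (1L << i)) != 0) selection.add(features.get(i));
      if (!formula.test(selection)) return false;   // counterexample found
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> features = List.of("WORLD", "BYE", "SLOW");
    Predicate<Set<String>> fm = s -> true;               // empty feature model: all combinations valid
    Predicate<Set<String>> world = s -> s.contains("WORLD");
    Predicate<Set<String>> bye = s -> s.contains("BYE");

    // Reachability of msg: fm => (true => (WORLD or BYE)) -- fails for the empty selection
    System.out.println(isValid(features, s -> !fm.test(s) || world.test(s) || bye.test(s)));
    // Uniqueness of msg: fm => not(WORLD and BYE) -- fails when both are selected
    System.out.println(isValid(features, s -> !fm.test(s) || !(world.test(s) && bye.test(s))));
  }
}

A feature model that, for example, makes WORLD and BYE mandatory alternatives (exactly one of them selected) would turn both checks into tautologies, mirroring the fm-guarded constraints above.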

So, how does variability-aware type checking improve over the brute-force approach? Instead of just checking reachability and unique definitions in a single variant, we formulate conditions over the space of all variants. The important benefit of this approach is that we check variability locally, where it occurs. In our example, we do not need to check the combinations of SLOW and BYE, which are simply not relevant for typing. Technically, variability-aware type checking requires lookup functions to return all possible targets and their presence conditions. Furthermore, we might need to check alternative types of a variable. Still, in large systems, we do not check the surface complexity of 2^n variants, but analyze the source code more closely to find essential complexity, where variability actually matters. We cannot always avoid exponential blowup, but practical source code is usually well behaved and has comparably little local variability. Also, caching of SAT-solver queries is a viable optimization lever. Furthermore, the reduction to SAT problems enables efficient reasoning in practice, even in the presence of complex presence conditions and large feature models [53, 67, 92].

In prior work, we have described variability-aware type checking in more detail and with more realistic examples; we have formalized the type system and proven it sound (when the type system judges a product line as well-typed, all variants are well-typed); and we have provided experience from practice [53].

5.2 Type Checking Composition-Based Implementations

The same concept of introducing variability into type checking can also be applied to feature-oriented programming. To that end, we first need to define a type system for our new language (as, for example, FFJ [6]) and then make it variability-aware by introducing reachability checks (as, for example, FFJPL [5]).

16 Other logics and other solvers are possible, but SAT solvers seem to provide a sweet spot between performance and expressiveness [67].



Since the type-checking mechanisms are conceptually similar for annotation-based and composition-based product lines, we restrict our explanation to a simple example of an object store with two basic implementations (example from [93]), each of which can be extended with a feature AccessControl, in Figure 20. Lookup of function calls works across feature boundaries, and checking presence conditions is reduced to checking relationships between features.

Feature SingleStore

class Store {
  private Object value;
  Object read() { return value; }
  void set(Object nvalue) { value = nvalue; }
}

Feature MultiStore

class Store {
  private LinkedList values = new LinkedList();
  Object read() { return values.getFirst(); }
  Object[] readAll() { return values.toArray(); }
  void set(Object nvalue) { values.addFirst(nvalue); }
}

Feature AccessControl

refines class Store {
  private boolean sealed = false;
  Object read() {
    if (!sealed) { return Super().read(); }
    else { throw new RuntimeException("Access denied!"); }
  }
  Object[] readAll() {
    if (!sealed) { return Super().readAll(); }
    else { throw new RuntimeException("Access denied!"); }
  }
  void set(Object nvalue) {
    if (!sealed) { Super(Object).set(nvalue); }
    else { throw new RuntimeException("Access denied!"); }
  }
}

Fig. 20. Checking whether references to read and readAll are well-typed in all valid products

fm ⇒ (AccessControl ⇒ MultiStore)
fm ⇒ (AccessControl ⇒ SingleStore ∨ MultiStore)

More interestingly, the separation of features into distinct modules allows us to check some constraints within a feature. Whereas the previous approaches assume a closed world in which all features are known, separation of features encourages modular type checking in an open world. As illustrated in Figure 21, we can perform checks regarding fragments that are local to the feature. At the same time, we derive interfaces, which specify the constraints that have to be checked against other features. To check constraints between features, we can use brute force (check on composition) or just another variability-aware mechanism.

Modular type checking paves the road to true feature modularity, in which we distinguish between the public interface of a feature and private hidden



[Figure 21, left: the feature modules SingleStore, MultiStore, and AccessControl, with the same code as in Figure 20.]

Interface of SingleStore

provides Object read();
provides void set(Object);

Interface of MultiStore

provides Object read();
provides Object[] readAll();
provides void set(Object);

Interface of AccessControl

requires Object read();
requires Object[] readAll();
requires void set(Object);

Fig. 21. References to field sealed can be checked entirely within feature AccessControl (left); references to read and readAll cut across feature boundaries and are checked at composition time based on the features' interfaces (right)



implementations. Modular analysis of a feature reduces analysis effort, because we need to check each feature's internals only once and need to check only interfaces against interfaces of other features (checking interfaces usually is much faster than checking the entire implementation). Furthermore, we might be able to establish guarantees about features without knowing all other features (open-world reasoning). For an instantiation of modular type checking of features, see the work on gDeep [3] and delta-oriented programming [81]. Li et al. explored a similar strategy for model checking [63].
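As a rough illustration of interface-based composition checking, the following sketch (our own simplification; member signatures are modeled as plain strings, and the interfaces mirror Figure 21) reduces each feature to its provides/requires interface and accepts a composition only if every required signature is provided by some selected feature.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: composition checking compares feature interfaces only,
// instead of re-checking the features' internal implementations.
public class InterfaceComposition {

  record FeatureInterface(String name, Set<String> provides, Set<String> requires) {}

  // Returns the required signatures that no selected feature provides.
  static Set<String> missingRequirements(List<FeatureInterface> selection) {
    Set<String> provided = new HashSet<>();
    for (FeatureInterface f : selection) provided.addAll(f.provides());
    Set<String> missing = new HashSet<>();
    for (FeatureInterface f : selection)
      for (String req : f.requires())
        if (!provided.contains(req)) missing.add(req);
    return missing;
  }

  public static void main(String[] args) {
    FeatureInterface single = new FeatureInterface("SingleStore",
        Set.of("Object read()", "void set(Object)"), Set.of());
    FeatureInterface multi = new FeatureInterface("MultiStore",
        Set.of("Object read()", "Object[] readAll()", "void set(Object)"), Set.of());
    FeatureInterface access = new FeatureInterface("AccessControl",
        Set.of(), Set.of("Object read()", "Object[] readAll()", "void set(Object)"));

    System.out.println(missingRequirements(List.of(single, access)));  // [Object[] readAll()]
    System.out.println(missingRequirements(List.of(multi, access)));   // []
  }
}

Composing SingleStore with AccessControl reports readAll() as unsatisfied, which corresponds to the constraint fm ⇒ (AccessControl ⇒ MultiStore) from Figure 20; composing MultiStore with AccessControl satisfies all requirements.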

5.3 Analysis Strategies

In general, we see three different strategies of how we can approach variability-aware analysis:

– Brute-force strategy. We check variants individually with standard analysis techniques. We can try to reduce effort by sampling relevant variants and focusing on certain coverage heuristics. For example, pair-wise feature coverage samples a small number of variants in the hope of discovering all problems related to the interaction of pairs of features [72]. Especially for testing and measurement, approaches to select suitable variants have been explored [59, 72, 75, 82].

– Family-based strategy. We check the whole product line at once, as outlined for type checking above. We assume a closed world in which we know the implementation of all features and their relationships. The family-based strategy has been explored extensively for type checking and model checking [5, 12, 27, 31, 53, 60, 76, 92].

– Feature-based strategy. We check each feature in isolation as far as possible. Modular feature checks do not require implementation details of other features. For noncompositional properties that cannot be checked locally, we derive interfaces or constraints that must be checked when composing two features (per variant, or using a brute-force or family-based strategy). Modular checks avoid re-performing, for each variant, certain checks that are local to individual features; the strategy is suited especially well if features are already separated. It has been explored, for example, for type checking, model checking, and verification [3, 63, 81, 95].

These strategies can be applied to different forms of implementation and different kinds of analysis. Of course, the strategies can be combined. For details on these strategies, their combinations, and a survey of existing analysis techniques, see the recent report by Thüm et al. [93].

Tooling. Most variability-aware analyses we are aware of are in the state of research prototypes. See the corresponding references for further information. Our environment for virtual separation of concerns, CIDE, contains a variability-aware type system that covers large parts of Java. The safegen tool implements part of a variability-aware type system for the feature-oriented language Jak and



is available as part of the AHEAD tool suite. We are currently in the process of integrating such a type system into the Fuji compiler for feature-oriented programming in Java (footnote 17), and afterward into FeatureIDE, and we are developing a type system for C code with #ifdefs as part of the TypeChef project (footnote 18).

6 Open Challenges

So far, we have illustrated different strategies to implement features in product lines. They all encourage disciplined implementations that alleviate many problems traditionally associated with product-line implementations. Nevertheless, there are many open challenges.

A core challenge is the exponential explosion of the number of variants. The more features a product line supports, the more complex interaction patterns can occur that challenge maintenance tasks and quality assurance tasks. Although we have outlined possible strategies for variability-aware analysis, they cannot (yet) fully replace sophisticated software testing methods known from single-program development.

Feature interactions are especially problematic. A feature interaction occurs when two features behave differently in combination than they behave in isolation. A standard example is the two features flood control and fire alarm in home-automation software that work well in isolation, but when combined, flood control may accidentally turn off sprinklers activated when a fire was detected [61]. When feature interactions are known, there are several implementation strategies, for example, with additional derivative modules or nested preprocessor directives [55]. However, feature interactions can be difficult to detect, specify, and check against. Calder et al. provide a deeper introduction to the topic [25]. Many problems in product lines are caused by feature interactions.

Furthermore, both feature-oriented programming and preprocessor-based implementations have been criticized for neglecting modularity and overly relying on structures of the implementation. Although feature modules localize all feature code, only a few approaches provide explicit interfaces that could enforce information hiding. We discuss this issue in detail elsewhere [52].

In general, FOSD also requires variability management as an essential task of project management. Developers should not add features just because they can. Variability should always serve a purpose for the project, such as answering customer demands for tailor-made products, serving a broader market segment, or preparing for potential customers. Variability adds effort, complexity, and costs for development, maintenance, and quality assurance. If (compile-time) variability is not really needed, it might be best to develop a traditional single program and use conventional development and testing approaches. However, if variability adds value to the project, as discussed in Section 2, the disciplined implementation approaches of FOSD discussed in this tutorial may provide a good balance between gained variability and required effort and costs.

17 http://fosd.net/fuji
18 https://github.com/ckaestne/TypeChef



7 Conclusion

With this tutorial, we have introduced FOSD. Beginning with basic concepts from the field of software product line engineering, we have introduced two approaches to FOSD: feature-oriented programming à la AHEAD and FeatureHouse, and virtual separation of concerns. Subsequently, we have introduced the subfield of variability-aware analysis, which highlights a promising avenue of further work. We have covered only the basic concepts and a few methods, tools, and techniques, with a focus on techniques that can be readily explored. For further information, we recommend a recent survey, which also covers related areas including feature interactions, feature design, optimization, and FOSD theories [4, 49].

Acknowledgements. Kästner's work is supported by the European Research Council, grant #203099 'ScalPL'. Apel's work is supported by the German DFG grants AP 206/2, AP 206/4, and LE 912/13.

References

1. Adams, B., Van Rompaey, B., Gibbs, C., Coady, Y.: Aspect mining in the presenceof the C preprocessor. In: Proc. AOSD Workshop on Linking Aspect Technologyand Evolution (LATE), pp. 1–6. ACM Press (2008)

2. Anfurrutia, F.I., Díaz, Ó., Trujillo, S.: On Refining XML Artifacts. In: Baresi, L.,Fraternali, P., Houben, G.-J. (eds.) ICWE 2007. LNCS, vol. 4607, pp. 473–478.Springer, Heidelberg (2007)

3. Apel, S., Hutchins, D.: A calculus for uniform feature composition. ACM Trans.Program. Lang. Syst. (TOPLAS) 32(5), 1–33 (2010)

4. Apel, S., Kästner, C.: An overview of feature-oriented software development. J.Object Technology (JOT) 8(5), 49–84 (2009)

5. Apel, S., Kästner, C., Größlinger, A., Lengauer, C.: Type safety for feature-orientedproduct lines. Automated Software Engineering 17(3), 251–300 (2010)

6. Apel, S., Kästner, C., Lengauer, C.: Feature Featherweight Java: A calculus forfeature-oriented programming and stepwise refinement. In: Proc. Int’l Conf. Gen-erative Programming and Component Engineering (GPCE), pp. 101–112. ACMPress (2008)

7. Apel, S., Kästner, C., Lengauer, C.: FeatureHouse: Language-independent, auto-mated software composition. In: Proc. Int’l Conf. Software Engineering (ICSE), pp.221–231. IEEE Computer Society (2009)

8. Apel, S., Kolesnikov, S., Liebig, J., Kästner, C., Kuhlemann, M., Leich, T.: Ac-cess control in feature-oriented programming. Science of Computer Programming(Special Issue on Feature-Oriented Software Development) 77(3), 174–187 (2012)

9. Apel, S., Leich, T., Rosenmüller, M., Saáke, G.: FeatureC++: On the Symbiosisof Feature-Oriented and Aspect-Oriented Programming. In: Glück, R., Lowry, M.(eds.) GPCE 2005. LNCS, vol. 3676, pp. 125–140. Springer, Heidelberg (2005)

10. Apel, S., Leich, T., Saake, G.: Aspectual feature modules. IEEE Trans. Softw. Eng.(TSE) 34(2), 162–180 (2008)



11. Apel, S., Lengauer, C.: Superimposition: A Language-Independent Approach toSoftware Composition. In: Pautasso, C., Tanter, É. (eds.) SC 2008. LNCS, vol. 4954,pp. 20–35. Springer, Heidelberg (2008)

12. Apel, S., Speidel, H., Wendler, P., von Rhein, A., Beyer, D.: Detection of featureinteractions using feature-aware verification. In: Proc. Int’l Conf. Automated Soft-ware Engineering (ASE), pp. 372–375. IEEE Computer Society (2011)

13. Atkins, D.L., Ball, T., Graves, T.L., Mockus, A.: Using version control data toevaluate the impact of software tools: A case study of the Version Editor. IEEETrans. Softw. Eng. (TSE) 28(7), 625–637 (2002)

14. Bass, L., Clements, P., Kazman, R.: Software Architecture in Practice. Addison-Wesley, Boston (1998)

15. Batory, D.: Feature Models, Grammars, and Propositional Formulas. In: Obbink,H., Pohl, K. (eds.) SPLC 2005. LNCS, vol. 3714, pp. 7–20. Springer, Heidelberg(2005)

16. Batory, D., Sarvela, J.N., Rauschmayer, A.: Scaling step-wise refinement. IEEETrans. Softw. Eng. (TSE) 30(6), 355–371 (2004)

17. Baxter, I., Mehlich, M.: Preprocessor conditional removal by simple partial evalu-ation. In: Proc. Working Conf. Reverse Engineering (WCRE), pp. 281–290. IEEEComputer Society (2001)

18. Benavides, D., Seguraa, S., Ruiz-Cortés, A.: Automated analysis of feature models20 years later: A literature review. Information Systems 35(6), 615–636 (2010)

19. Beuche, D., Papajewski, H., Schröder-Preikschat, W.: Variability management withfeature models. Sci. Comput. Program. 53(3), 333–352 (2004)

20. Bono, V., Patel, A., Shmatikov, V.: A Core Calculus of Classes and Mixins. In:Guerraoui, R. (ed.) ECOOP 1999. LNCS, vol. 1628, pp. 43–66. Springer, Heidelberg(1999)

21. Bosch, J.: Super-imposition: A component adaptation technique. Information andSoftware Technology (IST) 41(5), 257–273 (1999)

22. Bouge, L., Francez, N.: A compositional approach to superimposition. In: Proc.Symp. Principles of Programming Languages (POPL), pp. 240–249. ACM Press(1988)

23. Brabrand, C., Schwartzbach, M.I.: Growing languages with metamorphic syntaxmacros. In: Proc. Workshop on Partial Evaluation and Semantics-Based ProgramManipulation (PEPM), pp. 31–40. ACM Press (2002)

24. Bracha, G., Cook, W.: Mixin-based inheritance. In: Proc. Int’l Conf. Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), pp.303–311. ACM Press (1990)

25. Calder, M., Kolberg, M., Magill, E.H., Reiff-Marganiec, S.: Feature interaction: Acritical review and considered forecast. Computer Networks 41(1), 115–141 (2003)

26. Chen, L., Babar, M.A., Ali, N.: Variability management in software product lines:A systematic review. In: Proc. Int’l Software Product Line Conference (SPLC), pp.81–90. Carnegie Mellon University (2009)

27. Classen, A., Heymans, P., Schobbens, P.-Y., Legay, A., Raskin, J.-F.: Model check-ing lots of systems: Efficient verification of temporal properties in software productlines. In: Proc. Int’l Conf. Software Engineering (ICSE), pp. 335–344. ACM Press(2010)

28. Clements, P., Northrop, L.: Software Product Lines: Practices and Patterns.Addison-Wesley, Boston (2001)

29. Coppit, D., Painter, R., Revelle, M.: Spotlight: A prototype tool for software plans.In: Proc. Int’l Conf. Software Engineering (ICSE), pp. 754–757. IEEE ComputerSociety (2007)



30. Czarnecki, K., Eisenecker, U.: Generative Programming: Methods, Tools, and Ap-plications. ACM Press/Addison-Wesley, New York (2000)

31. Czarnecki, K., Pietroszek, K.: Verifying feature-based model templates against well-formedness OCL constraints. In: Proc. Int’l Conf. Generative Programming andComponent Engineering (GPCE), pp. 211–220. ACM Press (2006)

32. Ernst, M., Badros, G., Notkin, D.: An empirical analysis of C preprocessor use.IEEE Trans. Softw. Eng. (TSE) 28(12), 1146–1170 (2002)

33. Erwig, M., Walkingshaw, E.: The choice calculus: A representation for softwarevariation. ACM Trans. Softw. Eng. Methodol. (TOSEM) 21(1), 6:1–6:27 (2011)

34. Favre, J.-M.: Understanding-in-the-large. In: Proc. Int’l Workshop on ProgramComprehension, p. 29. IEEE Computer Society (1997)

35. Feigenspan, J., Schulze, M., Papendieck, M., Kästner, C., Dachselt, R., Köppen,V., Frisch, M.: Using background colors to support program comprehension insoftware product lines. In: Proc. Int’l Conf. Evaluation and Assessment in SoftwareEngineering (EASE), pp. 66–75 (2011)

36. Filman, R.E., Elrad, T., Clarke, S., Aksit, M. (eds.): Aspect-Oriented Software Development. Addison-Wesley, Boston (2005)

37. Findler, R., Flatt, M.: Modular object-oriented programming with units and mixins. In: Proc. Int'l Conf. Functional Programming (ICFP), pp. 94–104. ACM Press (1998)

38. Flatt, M., Krishnamurthi, S., Felleisen, M.: Classes and mixins. In: Proc. Symp.Principles of Programming Languages (POPL), pp. 171–183. ACM Press (1998)

39. Garrido, A.: Program Refactoring in the Presence of Preprocessor Directives. PhDthesis, University of Illinois at Urbana-Champaign (2005)

40. Griss, M.L., Favaro, J., d’ Alessandro, M.: Integrating feature modeling with theRSEB. In: Proc. Int’l Conf. Software Reuse (ICSR), p. 76. IEEE Computer Society(1998)

41. Günther, S., Sunkle, S.: Feature-oriented programming with Ruby. In: Proc. GPCEWorkshop on Feature-Oriented Software Development (FOSD), pp. 11–18. ACMPress (2009)

42. Habermann, A.N., Flon, L., Cooprider, L.: Modularization and hierarchy in a fam-ily of operating systems. Commun. ACM 19(5), 266–272 (1976)

43. Heidenreich, F., Şavga, I., Wende, C.: On controlled visualisations in software prod-uct line engineering. In: Proc. SPLC Workshop on Visualization in Software Prod-uct Line Engineering (ViSPLE), pp. 303–313. Lero (2008)

44. Janzen, D., De Volder, K.: Programming with Crosscutting Effective Views. In:Vetta, A. (ed.) ECOOP 2004. LNCS, vol. 3086, pp. 197–220. Springer, Heidelberg(2004)

45. John, I., Eisenbarth, M.: A decade of scoping – a survey. In: Proc. Int’l SoftwareProduct Line Conference (SPLC), pp. 31–40. Carnegie Mellon University (2009)

46. Kang, K., Cohen, S.G., Hess, J.A., Novak, W.E., Peterson, A.S.: Feature-OrientedDomain Analysis (FODA) Feasibility Study. Technical Report CMU/SEI-90-TR-21, SEI, Pittsburgh, PA (1990)

47. Kästner, C.: Virtual Separation of Concerns. PhD thesis, University of Magdeburg(2010)

48. Kästner, C., Apel, S.: Integrating compositional and annotative approaches forproduct line engineering. In: Proc. GPCE Workshop on Modularization, Composi-tion and Generative Techniques for Product Line Engineering, pp. 35–40. Univer-sity of Passau (2008)

49. Kästner, C., Apel, S.: Virtual separation of concerns – A second chance for prepro-cessors. Journal of Object Technology (JOT) 8(6), 59–78 (2009)



50. Kästner, C., Apel, S., Kuhlemann, M.: Granularity in software product lines. In:Proc. Int’l Conf. Software Engineering (ICSE), pp. 311–320. ACM Press (2008)

51. Kästner, C., Apel, S., Kuhlemann, M.: A model of refactoring physically and vir-tually separated features. In: Proc. Int’l Conf. Generative Programming and Com-ponent Engineering (GPCE), pp. 157–166. ACM Press (2009)

52. Kästner, C., Apel, S., Ostermann, K.: The road to feature modularity? In: Proceed-ings of the Third Workshop on Feature-Oriented Software Development (FOSD),pp. 5:1–5:8. ACM Press (September 2011)

53. Kästner, C., Apel, S., Thüm, T., Saake, G.: Type checking annotation-based prod-uct lines. ACM Trans. Softw. Eng. Methodol. (TOSEM) 21(3), 14:1–14:39 (2012)

54. Kästner, C., Apel, S., Trujillo, S., Kuhlemann, M., Batory, D.: Guaranteeing Syn-tactic Correctness for All Product Line Variants: A Language-Independent Ap-proach. In: Oriol, M., Meyer, B. (eds.) TOOLS EUROPE 2009. LNBIP, vol. 33,pp. 175–194. Springer, Heidelberg (2009)

55. Kästner, C., Apel, S., ur Rahman, S.S., Rosenmüller, M., Batory, D., Saake, G.:On the impact of the optional feature problem: Analysis and case studies. In: Proc.Int’l Software Product Line Conference (SPLC), pp. 181–190. Carnegie MellonUniversity (2009)

56. Kästner, C., Giarrusso, P.G., Rendel, T., Erdweg, S., Ostermann, K., Berger, T.:Variability-aware parsing in the presence of lexical macros and conditional com-pilation. In: Proc. Int’l Conf. Object-Oriented Programming, Systems, Languagesand Applications (OOPSLA), pp. 805–824. ACM Press (October 2011)

57. Kästner, C., Thüm, T., Saake, G., Feigenspan, J., Leich, T., Wielgorz, F., Apel, S.:FeatureIDE: Tool framework for feature-oriented software development. In: Proc.Int’l Conf. Software Engineering (ICSE), pp. 611–614. IEEE Computer Society(2009)

58. Kästner, C., Trujillo, S., Apel, S.: Visualizing software product line variabilities insource code. In: Proc. SPLC Workshop on Visualization in Software Product LineEngineering (ViSPLE), pp. 303–313. Lero (2008)

59. Kim, C.H.P., Batory, D.S., Khurshid, S.: Reducing combinatorics in testing productlines. In: Proc. Int’l Conf. Aspect-Oriented Software Development (AOSD), pp. 57–68. ACM Press (2011)

60. Lauenroth, K., Pohl, K., Toehning, S.: Model checking of domain artifacts in prod-uct line engineering. In: Proc. Int’l Conf. Automated Software Engineering (ASE),pp. 269–280. IEEE Computer Society (2009)

61. Lee, J.J., Kang, K.C., Kim, S.: A Feature-Based Approach to Product Line Pro-duction Planning. In: Nord, R.L. (ed.) SPLC 2004. LNCS, vol. 3154, pp. 183–196.Springer, Heidelberg (2004)

62. Leich, T., Apel, S., Marnitz, L.: Tool support for feature-oriented software devel-opment: FeatureIDE: An eclipse-based approach. In: Proc. OOPSLA Workshop onEclipse Technology eXchange (ETX), pp. 55–59. ACM Press (2005)

63. Li, H.C., Krishnamurthi, S., Fisler, K.: Modular verification of open features us-ing three-valued model checking. Automated Software Engineering 12(3), 349–382(2005)

64. Liebig, J., Kästner, C., Apel, S.: Analyzing the discipline of preprocessor annota-tions in 30 million lines of C code. In: Proc. Int’l Conf. Aspect-Oriented SoftwareDevelopment (AOSD), pp. 191–202. ACM Press (2011)

65. Liu, J., Batory, D., Lengauer, C.: Feature oriented refactoring of legacy applications.In: Proc. Int’l Conf. Software Engineering (ICSE), pp. 112–121. ACM Press (2006)



66. McCloskey, B., Brewer, E.: ASTEC: A new approach to refactoring C. In: Proc.Europ. Software Engineering Conf./Foundations of Software Engineering (ES-EC/FSE), pp. 21–30. ACM Press (2005)

67. Mendonça, M., Wąsowski, A., Czarnecki, K.: SAT-based analysis of feature modelsis easy. In: Proc. Int’l Software Product Line Conference (SPLC), pp. 231–240.Carnegie Mellon University (2009)

68. Mendonça, M., Wąsowski, A., Czarnecki, K., Cowan, D.D.: Efficient compilationtechniques for large scale feature models. In: Proc. Int’l Conf. Generative Program-ming and Component Engineering (GPCE), pp. 13–22. ACM Press (2008)

69. Mens, T.: A state-of-the-art survey on software merging. IEEE Trans. Softw. Eng.(TSE) 28(5), 449–462 (2002)

70. Ossher, H., Harrison, W.: Combination of inheritance hierarchies. In: Proc. Int’lConf. Object-Oriented Programming, Systems, Languages and Applications (OOP-SLA), pp. 25–40. ACM Press (1992)

71. Ossher, H., Tarr, P.: Hyper/J: Multi-dimensional separation of concerns for Java.In: Proc. Int’l Conf. Software Engineering (ICSE), pp. 734–737. ACM Press (2000)

72. Oster, S., Markert, F., Ritter, P.: Automated Incremental Pairwise Testing of Soft-ware Product Lines. In: Bosch, J., Lee, J. (eds.) SPLC 2010. LNCS, vol. 6287, pp.196–210. Springer, Heidelberg (2010)

73. Padioleau, Y.: Parsing C/C++ Code without Pre-processing. In: de Moor, O.,Schwartzbach, M.I. (eds.) CC 2009. LNCS, vol. 5501, pp. 109–125. Springer, Hei-delberg (2009)

74. Parnas, D.L.: On the design and development of program families. IEEE Trans.Softw. Eng. (TSE) 2(1), 1–9 (1976)

75. Pohl, K., Böckle, G., van der Linden, F.J.: Software Product Line Engineering:Foundations, Principles and Techniques. Springer, Heidelberg (2005)

76. Post, H., Sinz, C.: Configuration lifting: Verification meets software configuration.In: Proc. Int’l Conf. Automated Software Engineering (ASE), pp. 347–350. IEEEComputer Society (2008)

77. Prehofer, C.: Feature-Oriented Programming: A Fresh Look at Objects. In: Ak-sit, M., Auletta, V. (eds.) ECOOP 1997. LNCS, vol. 1241, pp. 419–443. Springer,Heidelberg (1997)

78. Rabiser, R., Grünbacher, P., Dhungana, D.: Supporting product derivation byadapting and augmenting variability models. In: Proc. Int’l Software Product LineConference (SPLC), pp. 141–150. IEEE Computer Society (2007)

79. Refstrup, J.G.: Adapting to change: Architecture, processes and tools: A closerlook at HP’s experience in evolving the Owen software product line. In: Proc. Int’lSoftware Product Line Conference, SPLC (2009), Keynote presentation

80. Rosenmüller, M., Apel, S., Leich, T., Saake, G.: Tailor-made data management forembedded systems: A case study on Berkeley DB. Data and Knowledge Engineering(DKE) 68(12), 1493–1512 (2009)

81. Schaefer, I., Bettini, L., Damiani, F.: Compositional type-checking for delta-oriented programming. In: Proc. Int’l Conf. Aspect-Oriented Software Development(AOSD), pp. 43–56. ACM Press (2011)

82. Siegmund, N., Rosenmüller, M., Kuhlemann, M., Kästner, C., Apel, S., Saake,G.: SPL Conqueror: Toward optimization of non-functional properties in softwareproduct lines. Software Quality Journal - Special issue on Quality Engineering forSoftware Product Lines (in press, 2012)

83. Sincero, J., Tartler, R., Lohmann, D., Schröder-Preikschat, W.: Efficient extrac-tion and analysis of preprocessor-based variability. In: Proc. Int’l Conf. GenerativeProgramming and Component Engineering (GPCE), pp. 33–42. ACM Press (2010)



84. Singh, N., Gibbs, C., Coady, Y.: C-CLR: A tool for navigating highly configurablesystem software. In: Proc. AOSD Workshop on Aspects, Components, and Patternsfor Infrastructure Software (ACP4IS), p. 9. ACM Press (2007)

85. Smaragdakis, Y., Batory, D.: Mixin layers: An object-oriented implementation tech-nique for refinements and collaboration-based designs. ACM Trans. Softw. Eng.Methodol. (TOSEM) 11(2), 215–255 (2002)

86. Spencer, H., Collyer, G.: #ifdef considered harmful or portability experience withC news. In: Proc. USENIX Conf., pp. 185–198. USENIX Association (1992)

87. Steger, M., Tischer, C., Boss, B., Müller, A., Pertler, O., Stolz, W., Ferber, S.:Introducing PLA at Bosch Gasoline Systems: Experiences and Practices. In: Nord,R.L. (ed.) SPLC 2004. LNCS, vol. 3154, pp. 34–50. Springer, Heidelberg (2004)

88. Szyperski, C.: Component Software: Beyond Object-Oriented Programming, 2ndedn. Addison-Wesley, Boston (2002)

89. Tarr, P., Ossher, H., Harrison, W., Sutton Jr., S.M.: N degrees of separation: Multi-dimensional separation of concerns. In: Proc. Int’l Conf. Software Engineering(ICSE), pp. 107–119. IEEE Computer Society (1999)

90. Tartler, R., Lohmann, D., Sincero, J., Schröder-Preikschat, W.: Feature consistencyin compile-time-configurable system software: Facing the Linux 10,000 feature prob-lem. In: Proc. European Conference on Computer Systems (EuroSys), pp. 47–60.ACM Press (2011)

91. Tešanović, A., Sheng, K., Hansson, J.: Application-tailored database systems: Acase of aspects in an embedded database. In: Proc. Int’l Database Engineering andApplications Symposium, pp. 291–301. IEEE Computer Society (2004)

92. Thaker, S., Batory, D., Kitchin, D., Cook, W.: Safe composition of product lines. In:Proc. Int’l Conf. Generative Programming and Component Engineering (GPCE),pp. 95–104. ACM Press (2007)

93. Thüm, T., Apel, S., Kästner, C., Kuhlemann, M., Schaefer, I., Saake, G.: Analysisstrategies for software product lines. Technical Report FIN-004-2012, School ofComputer Science, University of Magdeburg (April 2012)

94. Thüm, T., Batory, D., Kästner, C.: Reasoning about edits to feature models. In:Proc. Int’l Conf. Software Engineering (ICSE), pp. 254–264. IEEE Computer So-ciety (2009)

95. Thüm, T., Schaefer, I., Kuhlemann, M., Apel, S.: Proof composition for deduc-tive verification of software product lines. In: Proc. Int’l Workshop on Variability-Intensive Systems Testing, Validation & Verification (VAST), pp. 270–277. IEEEComputer Society (2011)

96. Vittek, M.: Refactoring browser with preprocessor. In: Proc. European Conf. onSoftware Maintenance and Reengineering (CSMR), pp. 101–110. IEEE ComputerSociety (2003)

97. Weise, D., Crew, R.: Programmable syntax macros. In: Proc. Conf. ProgrammingLanguage Design and Implementation (PLDI), pp. 156–165. ACM Press (1993)

98. Wirth, N.: Program development by stepwise refinement. Commun. ACM 14(4),221–227 (1971)


Language and IDE Modularization and Composition with MPS

Markus Voelter

Oetztaler Strasse 38, Stuttgart, [email protected]

Abstract. Modularization and composition of languages and their IDEs is an important building block for working efficiently with domain-specific languages. Traditionally this has been a challenge because many grammar formalisms are not closed under composition, hence syntactic composition of languages can be challenging. Composing semantics can also be hard, at least in the general case. Finally, a lot of existing work does not consider IDEs for the composed languages. This paper illustrates how JetBrains MPS addresses language and IDE modularization and composition based on a projectional editor and modular type systems and transformations. The paper also classifies composition approaches according to the dependencies between the composed languages and whether syntactic composition is supported. Each of the approaches is illustrated with an extensive example implementation in MPS.

1 Introduction

Programmers typically use general purpose languages (GPLs) for developing software systems. The term general-purpose refers to the fact that they can be used for any programming task. They are Turing complete, and provide means to build custom abstractions using classes, higher-order functions, or logic predicates, depending on the particular language. Traditionally, a complete software system has been implemented using a single GPL, plus a number of configuration files. However, more recently this has started to change; systems are built using a multitude of languages.

One reason is the rising level of complexity of target platforms. For example, web applications consist of business logic on the server, a database backend, business logic on the client as well as presentation code on the client, most of these implemented with their own set of languages. A particular language stack could use Java, SQL, JavaScript and HTML. The second reason driving multi-language programming is the increasing popularity of domain-specific languages (DSLs). These are specialized, often small languages that are optimized for expressing programs in a particular application domain. Such an application domain may be a technical domain (e.g. database querying, user interface specification or scheduling) or a business domain (such as insurance contracts, refrigerator cooling algorithms or state-based programs in embedded systems). DSLs support these domains more efficiently than GPLs because they provide




linguistic abstractions for common idioms encountered in these domains. Using custom linguistic abstractions makes the code more concise, more suitable for formal analysis, verification, transformation and optimization, and more accessible to non-programmer domain experts.

The combined use of multiple languages in a single system raises the question of how the syntax, semantics, and the development environments (IDEs) of the various languages can be integrated. As we discuss in Section 6, each of these aspects has its own challenges and has been addressed to various degrees. Syntactic composition has traditionally been hard [26]. In particular, retaining decent IDE support (such as code completion, syntax coloring, static error checking, refactoring or debugging) in the face of syntactically composed languages is a challenge and hence is often not supported for a particular combination of languages. In some rare cases, syntactic integration between specific pairs of languages has been built, for example, embedded SQL in Java [31]. A more systematic approach for language and IDE modularization and composition is required. Language and IDE modularization and composition addresses the following concerns:

– The concrete and the abstract syntax of the two languages have to be composed. This may require the embedding of one syntax into another one. This, in turn, requires modular syntax definitions.

– The static semantics (constraints and type system) have to be integrated. For example, existing operators have to be overridden for new types.

– The execution semantics have to be combined as well. In practice, this may mean mixing the code generated from the composed languages, or composing the generators or interpreters.

– Finally, the IDE that provides code completion, syntax coloring, static checks and other relevant services has to be composed as well.

In this paper we focus on JetBrains MPS (footnote 1) to demonstrate language composition approaches. MPS is a projectional editor and no grammars or parsers are used. Instead, editing gestures directly modify the abstract syntax tree (AST), and the representation on the screen is projected from the changing AST. Consequently, MPS' main contribution to language composition addresses the syntax and IDE aspects.

1.1 Contribution and Structure of the Paper

In this paper we make the following contributions. First, we identify four different composition approaches (Referencing, Extension, Reuse and Embedding) and classify them regarding dependencies and syntactic mixing. Second, we demonstrate how to implement these four approaches with JetBrains MPS. We emphasize syntax and IDE, but also discuss type systems and transformation. While other, parser-based approaches can do language composition to some extent as

1 http://www.jetbrains.com/mps/



well, it is especially simple to do with projectional editors. So our third contribution is an implicit illustration of the benefits of using projectional editors in the context of language composition, based on the MPS example.

The paper is structured as follows. In Section 1.3 we define terms and concepts used throughout the paper. Section 1.4 introduces the four composition approaches discussed in this paper, and provides a rationale for why we discuss those four approaches, and not others. We then explain how projectional editors work in general, and how MPS works specifically (Section 2). We develop the core language which acts as the basis for the modularization and composition examples in Section 3. This section also serves as a brief tutorial on language definition in MPS. The main part of the paper is Section 4, which shows the implementation of the four composition approaches in MPS. Section 5 discusses what works well and what could be improved in MPS with regards to language and IDE modularization and composition. We conclude the paper with related work (Section 6) and a short summary.

1.2 Additional Resources

The example code used in this paper can be found at github (footnote 2) and works with MPS 2.0. A set of screencasts that walk through the example code is available on Youtube (footnote 3). This paper is not a complete MPS tutorial. We refer to the Language Workbench Competition (LWC 11) MPS tutorial (footnote 4) for details.

1.3 Terminology

Programs are represented in two ways: concrete syntax (CS) and abstract syntax (AS). Users use the CS as they write or change programs. The AS is a data structure that contains all the data expressed with the CS, but without the notational details. The AS is used for analysis and downstream processing of programs. A language definition comprises CS and AS, as well as rules for mapping one to the other. Parser-based systems map the CS to the AS. Users interact with a stream of characters, and a parser derives the abstract syntax tree (AST) by using a grammar. Projectional editors go the other way round. User editing gestures directly change the AST, the concrete syntax being a mere projection that looks (and mostly feels) like text. MPS is a projectional editor.

The AS of programs is primarily a tree of program elements. Every element (except the root) is contained by exactly one parent element. Syntactic nesting of the CS corresponds to a parent-child relationship in the AS. There may also be any number of non-containing cross-references between elements, established either directly during editing (in projectional systems) or by a linking phase that follows parsing.

A program may be composed from several program fragments that may reference each other. Each fragment f is a standalone AST. In file-based tools, a fragment corresponds to a file. E_f is the set of program elements in a fragment.

2 https://github.com/markusvoelter/MPSLangComp-MPS2.0
3 http://www.youtube.com/watch?v=lNMRMZk8KBE
4 http://code.google.com/p/mps-lwc11/wiki/GettingStarted



A language l defines a set of language concepts C_l and their relationships. We use the term concept to refer to CS, AS plus the associated type system rules and constraints, as well as a definition of its semantics. In a fragment, each program element e is an instance of a concept c defined in a language l. We define the concept-of function co to return the concept of which a program element is an instance: co(element) ⇒ concept. Similarly we define the language-of function lo to return the language in which a given concept is defined: lo(concept) ⇒ language. Finally, we define a fragment-of function fo that returns the fragment that contains a given program element: fo(element) ⇒ fragment.

We also define the following sets of relations between program elements. Cdn_f (short for children) is the set of parent-child relationships in a fragment f. Each c ∈ Cdn_f has the properties parent and child. Refs_f (short for references) is the set of non-containing cross-references between program elements in a fragment f. Each reference r ∈ Refs_f has the properties from and to, which refer to the two ends of the reference relationship. Finally, we define an inheritance relationship that applies the Liskov Substitution Principle [30] to language concepts: a concept c_sub that extends another concept c_super can be used in places where an instance of c_super is expected. Inh_l (short for inheritances) is the set of inheritance relationships for a language l.

An important concern in language and IDE modularization and composition is the notion of independence. An independent language does not depend on other languages. It can be defined as follows:

∀r ∈ Refs_l | lo(r.to) = lo(r.from) = l    (1)
∀s ∈ Inh_l | lo(s.super) = lo(s.sub) = l    (2)
∀c ∈ Cdn_l | lo(c.parent) = lo(c.child) = l    (3)

An independent fragment is one where all references stay within the fragment:

∀r ∈ Refs_f | fo(r.to) = fo(r.from) = f    (4)

We distinguish homogeneous and heterogeneous fragments. A homogeneous fragment is one where all elements are expressed with the same language:

∀e ∈ E_f | lo(e) = l    (5)
∀c ∈ Cdn_f | lo(c.parent) = lo(c.child) = l    (6)
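To make these definitions tangible, the following sketch (our own, with made-up names; it is not part of MPS) models elements, concepts, languages, and fragments as plain objects and expresses the independent-fragment condition (4) and the homogeneity conditions (5) and (6) as queries over the reference and containment relations; the language-independence conditions (1)–(3) would be phrased analogously over a language's concept definitions.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the terminology: co(e) is e.concept, lo(c) is
// c.language, fo(e) is e.fragment; Cdn and Refs are the children and
// references lists of each element.
public class Terminology {

  static class Language { final String name; Language(String name) { this.name = name; } }
  static class Concept  { final Language language; Concept(Language l) { this.language = l; } }
  static class Fragment { final List<Element> elements = new ArrayList<>(); }

  static class Element {
    final Concept concept;                               // co(element)
    final Fragment fragment;                              // fo(element)
    final List<Element> children = new ArrayList<>();     // Cdn: containment
    final List<Element> references = new ArrayList<>();   // Refs: cross-references
    Element(Concept c, Fragment f) { concept = c; fragment = f; f.elements.add(this); }
    Language lo() { return concept.language; }            // lo(co(element))
  }

  // (4) Independent fragment: all references stay within the fragment.
  static boolean isIndependentFragment(Fragment f) {
    return f.elements.stream()
        .allMatch(e -> e.references.stream().allMatch(t -> t.fragment == f));
  }

  // (5)+(6) Homogeneous fragment: all elements are expressed in language l.
  static boolean isHomogeneousFragment(Fragment f, Language l) {
    return f.elements.stream().allMatch(e -> e.lo() == l);
  }
}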

As elaborated by Harel and Rumpe in [19], the execution semantics of a language l are defined by mapping the syntactic constructs of l to concepts from the semantic domain S of the language. Different representations of S and the mapping l → S exist. Harel and Rumpe prefer to use mathematical formalisms as S because their semantics is well known, but acknowledge that other formalisms are useful as well. In this paper we consider the semantics of a language l to be defined via a transformation that maps a program expressed in l to a program in another language l2 that has the same observable behavior. The observable behavior can be determined in various ways, for example using a sufficiently large



set of test cases. A discussion of alternative ways to define language semantics is beyond the scope of this paper, and, in particular, we do not discuss interpreters as an alternative to transformations. This decision is driven partly by the fact that, in our experience, transformations are the most widely used approach for defining execution semantics.

The paper emphasizes IDE modularization and composition in addition to language modularization and composition. When referring to IDE services, we mean syntax highlighting, code completion and static error checking. Other concerns are relevant in IDEs, including refactoring, quick fixes, support for testing, debugging and version control integration. Most of these are supported by MPS in a modular and composable way (the exceptions are profiling, which is not supported, and debugging, which is supported but at too low a level of abstraction); we do not discuss those aspects in this paper to keep the paper at a reasonable length.

1.4 Classification of Composition Approaches

In this paper we identify the following four modularization and composition approaches: Referencing, Extension, Reuse and Embedding. Below is an intuitive description of each approach; stricter definitions follow in the remainder of the paper.

– Referencing. Referencing refers to the case where a program is expressed in two languages A and B, but the parts expressed in A and B are kept in separate homogeneous fragments (files), and only name-based references connect the fragments. The referencing language has a direct dependency on the referenced language. An example for this case is a language that defines user interface (UI) forms for data structures defined by another language. The UI language references the data structures defined in a separate program.

– Extension. Extension also allows a dependency of the extending language on the extended language (also called base language). However, in this case the code written in the two languages resides in a single, heterogeneous fragment, i.e. syntactic composition is required. An example is the extension of Java or C with new types, operators or literals.

– Reuse. Reuse is similar to Referencing in that the respective programs reside in separate fragments and only references connect those fragments. However, in contrast to Referencing, no direct dependencies between the languages are allowed. An example would be a persistence mapping language that can be used together with different data structure definition languages. To make this possible, it cannot depend on any particular data definition language.

■ Embedding. Embedding combines the syntactic integration of Extension with the independence of Reuse: independent languages can be used in the same heterogeneous fragment. An example is embedding a reusable expression language into another DSL. Since neither of the two composed languages may depend on the other, the same expression language can be embedded into different DSLs, and a specific DSL could integrate different expression languages.


Fig. 1. We distinguish the four modularization and composition approaches regarding fragment structure and language dependencies. The dependencies dimension captures whether the languages have to be designed for a specific composition partner or not. Fragment structure captures whether the composition approach supports mixing of the concrete syntax of the composed languages or not.

As can be seen from the above descriptions, we distinguish the four approaches regarding fragment structure and language dependencies, as illustrated in Fig. 1 (other classifications have been proposed; they are discussed in Section 6). Fig. 2 shows the relationships between fragments and languages in these cases. We used these two criteria as the basis for this paper because we consider them essential for the following reasons. Language dependencies capture whether a language has to be designed with knowledge about a particular composition partner in order to be composable with that partner. It is desirable in many scenarios that languages be composable without prior knowledge about possible composition partners. Fragment structure captures whether the composed languages can be syntactically mixed. Since modular concrete syntax can be a challenge, this is not always easily possible, though often desirable.

Fig. 2. The relationships between fragments and languages in the four composition approaches. Boxes represent fragments, rounded boxes are languages. Dotted lines are dependencies, solid lines references/associations. The shading of the boxes represents the two different languages.

1.5 Case Study

In this paper we illustrate the language and IDE modularization and composition approaches with MPS based on a set of example languages. At the center is a simple entities language. We then build additional languages to illustrate the composition approaches introduced above (Fig. 3). The uispec language illustrates Referencing with entities. relmapping is an example of Reuse with separated generated code. rbac illustrates Reuse with intermixed generated code. uispec_validation demonstrates Extension (of the uispec language) and Embedding with regard to the expressions language. We also show Extension by extending MPS' built-in BaseLanguage, a variant of Java.

Fig. 3. entities is the central language. uispec defines UI forms for the entities. uispec_validation adds validation rules, and embeds a reusable expressions language. relmapping provides a reusable database mapping language; relmapping_entities adapts it to the entities language. rbac is a reusable language for specifying access control permissions; rbac_entities adapts this language to the entities language.

2 How MPS Works

The JetBrains Meta Programming System5 is a projectional language workbench available as open source software under the Apache 2.0 license. The term Language Workbench was coined by Martin Fowler [16]. He defines a language workbench as a tool with the following characteristics:

1. Users can freely define languages which are fully integrated with each other.
2. The primary source of information is a persistent abstract representation.
3. A DSL is defined in three main parts: schema, editor(s), and generator(s).
4. Language users manipulate a DSL through a projectional editor.
5. A language workbench can persist incomplete or contradictory information.

MPS exhibits all of these characteristics. MPS' most distinguishing feature is its projectional editor. This means that all text, symbols, and graphics are projected, and not parsed. Projectional editing is well known from graphical modeling tools (UML, Entity-Relationship, State Charts). In those tools only the model structure is persisted, often using XML or a database. For editing purposes, graphical editors project the abstract syntax using graphical shapes. Users use mouse gestures and keyboard actions tailored to graphical editing to modify the model structure directly. While the CS of the model does not have to be stored because it is specified as part of the language definition and hence known by the projection engine, graphical modeling tools usually also store information about the visual layout.

5 http://jetbrains.com/mps


Projectional editing can also be used for textual syntax. However, since the projection looks like text, users expect editing gestures known from ”real text” to work. MPS achieves this quite well (it is beyond the scope of this paper to describe how). The following is a list of benefits of projectional editing:

– No grammar or parser is required. Editing directly changes the underlying structure. Projectional editors can handle unparseable code. Language composition is made easy, because it cannot result in ambiguous grammars.

– Graphical, symbolic, tabular and textual notations can be mixed and combined, and they can be defined with the same formalism and approach. For example, a graphical tool for editing state machines can embed a textual expression language for editing the guard conditions on transitions6.

– Since projectionally defined languages always need an IDE for editing (to do the projection), language definition and composition always imply IDE definition and composition. The IDE provides code completion, error checking and syntax highlighting for any valid language, composed or not.

– Since the model is stored independent of its concrete notation, it is possible to represent the same model in different ways simply by providing several projections. Different viewpoints of the overall program can be stored in one model, but editing can still be viewpoint-specific. It is also possible to store out-of-band data (i.e. annotations on the core model). Examples of the latter include documentation, pointers to requirements (traceability) or feature dependencies in the context of product lines.

Projectional editors also have drawbacks. The first one is that editing the projected representation, as opposed to ”real text”, needs some time to get used to. Without specific customization, every program element has to be selected from a drop-down list to be ”instantiated”. However, MPS provides editor customizations to enable an editing experience that resembles modern IDEs that use automatically expanding code templates. This makes editing in MPS quite convenient and productive in all but the most exceptional cases. The second drawback is that models are not stored as readable text, but rather as an XML-serialized AST. Integrating XML files with an otherwise ASCII-based development infrastructure can be a challenge. MPS addresses the most critical aspect of this drawback by supporting diff and merge on the level of the projected CS. A final drawback is that MPS is not based on any industry standards. For example, it does not rely on EMF7 or another widely used modeling formalism. However, since MPS' meta-metamodel is extremely close to EMF Ecore, it is trivial to build an EMF exporter. Also, other language workbenches do not support portability of the language definition beyond the AS either, and exporting the AS is trivial in terms of implementation effort.

6 Intentional's Domain Workbench has demonstrated this repeatedly, for example in [48]. As of 2012, MPS can do text, symbols (such as big sum signs or fraction bars) and tables. Graphics will be supported in 2013.

7 http://eclipse.org/emf


MPS has been designed to work with sets of integrated languages. This makes MPS particularly well suited to demonstrate language and IDE modularization and composition techniques. In particular, the following three characteristics are important in this context:

■ Composable Syntax. Depending on the particular composition approach, composition of the CS is required. In traditional, grammar-based systems, combining independently developed grammars can be a problem: many grammar classes are not closed under composition, and various invasive changes (such as left-factoring or redefinition of terminals or non-terminals), or unwanted syntactic escape symbols are required [26]. As we will see, this is not the case in MPS. Arbitrary languages can be combined syntactically.

■ Extensible Type Systems. All composition techniques require some degree of type system extension or composition8. MPS' type system specification is based on declarative typing rules that are executed by a solver. This way, additional typing rules for additional language concepts can be defined without invasively changing the existing typing rules of the composed languages.

■ Modular Transformation Framework. Transformations can be defined separately for each language concept. If a new language concept is added via a composition technique, the transformation for this new concept is modular. If an existing transformation must be overridden or a certain program structure must be treated specially, a separate transformation for these cases can be written, and, using generator priorities, it can be configured to run before an existing transformation.

The examples discussed in this paper will elaborate on these characteristics. This is why, for each technique, we discuss structure and syntax, type system and transformation concerns.

3 Implementing a DSL with MPS

This section illustrates the definition of a language with MPS. Like other language workbenches, MPS comes with a set of DSLs for language definition, with a separate DSL for each language aspect such as structure, editor, type system and generator, as well as things like quick fixes or refactorings. MPS is bootstrapped, so these DSLs are built (and can be extended) with MPS itself.

We illustrate language definition with MPS based on a simple entities language. Example code is shown below. Modules are root nodes that live as top-level elements in models. According to the terminology introduced in Section 1.3, root nodes (and their descendants) are considered fragments.

8 Note that the term type system really just refers to the actual type calculation and checks, as well as other constraints on the program. Resolution of symbol names is handled differently and not part of the type system.


module company

entity Employee {
  id : int
  name : string
  role : string
  worksAt : Department
  freelancer : boolean
}

entity Department {
  id : int
  description : string
}

■ Structure and Syntax. Language definition starts with the AS, referred to as structure in MPS. Fig. 4 shows a UML diagram of the entities language AS. The following code shows the definition of the Entity concept9. Entity extends BaseConcept, the top-level concept similar to java.lang.Object in Java. It implements the INamedConcept interface to inherit a name property. It declares a list of children of type Attribute in the attributes role. A concept may also have references to other concepts (as opposed to children).

concept Entity extends BaseConcept implements INamedConcept
  is root: true
  children:
    Attribute attributes 0..n

Fig. 4. The abstract syntax of the entities language. An Entity has Attributes which have a Type and a name. EntityType extends Type and references Entity. This adapts entities to types (cf. the Adapter pattern [18]). Concepts like EntityType which have exactly one reference are called smart references and are treated specially by MPS: instead of proposing to explicitly instantiate the reference concept and then selecting the target, the code completion menu shows the possible targets of the reference directly. The reference concept is implicitly instantiated once a target is selected.
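
For illustration, the remaining concepts from Fig. 4 could be declared roughly as follows (a sketch in the same notation as the Entity definition above; the actual declarations are not shown in the paper):

concept Attribute extends BaseConcept implements INamedConcept
  children:
    Type type 1

concept EntityType extends Type
  references:
    Entity entity 1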

Editors in MPS are based on cells. Cells are the smallest unit relevant for projection. Consequently, defining an editor consists of arranging cells and defining their content. Different cell types are available. Fig. 5 explains the editor for Entity. The editors for the other concepts are defined similarly.

9 In addition to properties, children and references, concept definitions can have more characteristics such as concept properties or concept links. However, these are not needed for this example, so we do not show them here. The code above shows all the characteristics used in this example.

Fig. 5. The editor for Entity. The outermost cell is a vertical list [/ .. /]. In the first line, we use a horizontal list [> .. <] that contains the keyword entity, the name property and an opening curly brace. In the second line we use indentation –-> and a vertical arrangement of the contents of the attributes collection (> .. <). Finally, the third line contains the closing curly brace.

■ Type System. As we have explained in Section 2, language developers specify typing rules for language concepts. To calculate and check types for a program, MPS ”instantiates” these rules for each instance of the concept, resulting in a set of type equations for a program. These equations contain type values (such as int) as well as type variables (for program elements whose type has not yet been calculated). MPS then solves the set of type equations, trying to assign type values to the type variables in a way such that all the equations for a program are free of contradictions. If a contradiction arises, this is flagged as a typing error. For the entities language, we specify two simple typing rules. The first one specifies the type of a Type (it answers the question of what the type system's type should be for, e.g., int if it is used in an attribute such as int age;):

rule typeof_Type for Type as t {
  typeof(t) :==: t.copy;
}

This rule has the name typeof_Type and applies to the language concept Type (int or string are subconcepts of the abstract Type concept). The typeof(...) operator creates a type variable associated with a program element, in this case, with an instance t of Type. We calculate that type by cloning t itself. In other words, if the type system engine needs to find out the type of an int program element, that type is int as well. This may be a bit confusing, because instances of Type (and its subconcepts) play two roles. First, they are part of the program itself if they are explicitly specified in attributes (int age;). Second, they are also the objects with which the type system engine works. By cloning the program element, we express that types represent themselves in the type system engine (we need a clone since for technical reasons we cannot return the program element itself). Another typing rule defines the type of the Attribute as a whole to be the type of the attribute's type property:

rule typeof_Attribute for Attribute as a {
  typeof(a) :==: typeof(a.type);
}

This rule answers the question of what the type system engine's type should be for instances of Attribute. Note how this equation has two type variables and no type values. It simply propagates the type of the attribute's specified type (the int in int age;) to be the type of the overall attribute. Note how the two typing rules discussed in this section play together to calculate the type of the Attribute from the type specified by the user: first we calculate the type of the specified type (a clone of itself) and then we propagate this type to the type of the whole attribute.
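
As a concrete illustration (our own example, not taken from the paper), consider the attribute int age; and how the solver combines the two rules:

// program element:    int age;
// typeof_Type:        typeof(int) :==: int          (a clone of the int node)
// typeof_Attribute:   typeof(age) :==: typeof(int)
// solver result:      typeof(age)  =   int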

There is one more rule we have to define. It is not strictly part of the type calculation. It is a constraint that checks the uniqueness of attribute names for any given Entity:

checking rule check_Entity for Entity as e {
  set<string> names = new hashset<string>;
  foreach a in e.attributes {
    if (names.contains(a.name)) {
      error "duplicate attribute name" -> a;
    }
    names.add(a.name);
  }
}

This rule does not establish typing equations; it just checks a property of the program (note the checking in the rule header). It checks attribute name uniqueness based on a set of the names. It reports an error if it finds a duplicate. It annotates the error with the attribute a, so the editor can highlight the respective program element. Note how, in the case of the typing rules shown above, we do not have to perform the check and report an error ourselves. This is done implicitly by the type system engine.

■ Generator. From entities models we generate Java Beans. Since Java is available in MPS (called the BaseLanguage), the generation is actually a model-to-model transformation: from the entities model we generate a Java model. MPS supports several kinds of transformations. The default case is the template-based transformation which maps ASTs onto other ASTs. Alternatively, one can use an API to manually construct the target tree. Finally, the textgen DSL is available to generate ASCII text (at the end of the transformation chain). Throughout this paper we use the template-based approach.
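
To give an impression of the transformation target, the Java Bean generated for the Employee entity looks roughly like the following (a sketch; the actual generated code may differ in details such as visibility or ordering):

public class Employee {
  private int id;
  private String name;
  private String role;
  private Department worksAt;
  private boolean freelancer;

  public int getId() { return id; }
  public void setId(int newValue) { this.id = newValue; }

  public String getName() { return name; }
  public void setName(String newValue) { this.name = newValue; }

  // getters and setters for role, worksAt and freelancer follow the same pattern
}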

MPS templates look like text generation templates known from tools such as Xpand10, Jet11 or StringTemplate12 since they use the CS of the target language in the template. However, that CS is projected like any other program, and the IDE can provide support for the target language in the template (we discuss details on support for the target language in templates in Related Work, Section 6). This also means that the template code itself must be valid in terms of the target language.

Template-based generators consist of mapping configurations and templates. Mapping configurations define which elements are processed by which templates. For the entities language, we need a root mapping rule and reduction rules.

10 http://www.eclipse.org/modeling/m2t/?project=xpand
11 http://www.eclipse.org/modeling/m2t/?project=jet
12 http://www.stringtemplate.org/


Root mapping rules create new root nodes from existing root nodes (they map a fragment to another fragment). In our case we generate a Java class from an Entity. Reduction rules are in-place transformations. Whenever the engine encounters an instance of the specified source concept somewhere in a model, it replaces the element with the result of the associated template. In our case we reduce the various types (int, string, etc.) to their Java counterparts. Fig. 6 shows a part of the entities mapping configuration.

Fig. 6. The mapping configuration for the entities language. The root mapping rule for Entity specifies that instances of Entity should be transformed with the map_Entity template (which produces a Java class and is shown in Fig. 7). The reduction rules use inline templates, i.e. the template is embedded in the mapping configuration. For example, the IntType is replaced with the Java int and the EntityRefType is reduced to a reference to the class generated from the target entity. The ->$ is a so-called reference macro. It contains code (not shown) that ”rewires” the reference (that points to the Double class in the template code) to a reference to the class generated from the target entity.

Fig. 7 shows the map_Entity template. It generates a complete Java class from an input Entity. To understand how templates work in MPS we discuss in more detail the generation of Java fields for each Entity Attribute:

– Developers first write structurally correct example code in the target language. To generate a field into a class for each Attribute of an Entity, one would first add a field to a class (see aField in Fig. 7).

– Then macros are attached to those program elements in the example code that have to be replaced with elements from the input model during the transformation. In the Attribute example in Fig. 7 we first attach a LOOP macro to the whole field. It contains an expression node.attributes; where node refers to the input Entity (this code is entered in the Inspector window and is not shown in the screenshot). This expression returns the set of Attributes from the current Entity, making the LOOP iterate over all attributes of the entity and create a field for each of them.

– At this point, each created field would be identical to the example code to which we attached the LOOP macro (private int aField;). To make the generated field specific to the particular Attribute we iterate over, we use more macros. A COPY_SRC macro is used to transform the type. COPY_SRC copies the input node (the inspector specifies the current attribute's type as the input here) and applies reduction rules (those defined in Fig. 6) to map types from the entities language to Java types. We then use a property macro (the $ sign around aField) to change the name property of the field we generate to the name of the Attribute we currently transform.

Fig. 7. The template for creating a Java class from an Entity. The generated class contains a field, a getter and a setter for each of the Attributes of the Entity. The running text explains the details.

Instead of mixing template code and target language code (and separating them with some kind of escape character) we annotate macros to regular, valid target language code. Macros can be attached to arbitrary program elements. This way, the target language code in templates is always structurally correct, but it can still be annotated to control the transformation. Annotations are a generic MPS mechanism not specific to transformation macros and are discussed in Section 4.5.

4 Language Composition with MPS

In this section we discuss the four language and IDE modularization and composition techniques introduced in Section 1.4, plus an additional one that works only with a projectional editor such as MPS. For the first four, we provide a concise prose definition plus a set of formulas. We then illustrate each technique with a detailed MPS-based example based on the entities language.

4.1 Language Referencing

Language Referencing enables homogeneous fragments with cross-references among them, using dependent languages (Fig. 8).

Fig. 8. Referencing: Language l2 depends on l1, because concepts in l2 reference concepts in l1. (We use rectangles for languages, circles for language concepts, and UML syntax for the lines: dotted = dependency, arrows = associations, hollow-triangle-arrow for inheritance.)

A fragment f2 depends on f1. f2 and f1 are expressed with languages l2 and l1, respectively. We call l2 the referencing language, and l1 the referenced language. The referencing language l2 depends on the referenced language l1 because at least one concept in l2 references a concept from l1. While equations (2) and (3) (from Section 1.3) continue to hold, (1) does not. Instead

∀r ∈ Refsl2 | lo(r.from) = l2 ∧ (lo(r.to) = l1 ∨ lo(r.to) = l2)   (7)

From a CS perspective, such a reference is a simple identifier (possibly with dots). This terminal can easily be redefined in the referencing language and does not require reusing and embedding non-terminals from the referenced language. Hence no syntactic composition is required in this case.

As an example, for Referencing we define a language uispec for defining UI forms for entities. Below is an example. This is a homogeneous fragment, expressed only in the uispec language. Only the identifiers of the referenced elements (such as Employee.name) have been added to the referencing language as discussed in the previous paragraph. However, the fragment is dependent, since it references elements from another fragment (expressed in the entities language).

form CompanyStructure
  uses Department
  uses Employee
  field Name: textfield(30) -> Employee.name
  field Role: combobox(Boss, TeamMember) -> Employee.role
  field Freelancer: checkbox -> Employee.freelancer
  field Office: textfield(20) -> Department.description


Fig. 9. The abstract syntax of the uispec language. Dotted boxes represent classes from another language (here: the entities language). A Form contains EntityReferences that connect to an entities model. A Form also contains Fields, each referencing an Attribute from an Entity and containing a Widget.

■ Structure and Syntax. The AS for the uispec language is shown in Fig. 9. The uispec language extends13 the entities language. This means that concepts from the entities language can be used in the definition of the uispec language. A Form owns a number of EntityReferences, which in turn reference an Entity. Below is the definition of the Field concept. It has a label property, owns a Widget and refers to the Attribute it edits.

concept Field extends BaseConcept
  properties:
    label : string
  children:
    Widget widget 1
  references:
    Attribute attribute 1

Note that there is no composition of concrete syntax, since the programs written in the two composed languages remain separated into their own fragments. No ambiguities or clashes between names of concepts may occur in this case.

■ Type System. There are limitations regarding which widget can be used with which attribute type. The typing rule below implements these checks and is defined in the uispec language. It references types from the entities language. We use a checking rule to illustrate how constraints can be written that do not use the inference engine introduced earlier.

checking rule checkTypes for Field as field {
  node<Widget> w = field.widget;
  node<Type> t = field.attribute.type;
  if (w.isInstanceOf(CheckBoxWidget) && !(t.isInstanceOf(BooleanType))) {
    error "checkbox can only be used with booleans" -> w;
  }
  if (w.isInstanceOf(ComboWidget) && !(t.isInstanceOf(StringType))) {
    error "combobox can only be used with strings" -> w;
  }
}

13 MPS uses the term ”extension” whenever the definition of one language uses or refers to concepts defined in another language. This is not necessarily an example of language Extension as defined in this paper.


■ Generator. The defining characteristic of Referencing is that the two languages only reference each other, and the instance fragments are dependent, but homogeneous. No syntactic integration is necessary. In this example, the generated code exhibits the same separation. From a Form we generate a Java class that uses Java Swing to render the UI. It uses the Beans generated from the entities: they are instantiated, and the setters are called. The generators are separate but they are dependent, since the uispec generator knows about the names of the generated Java Beans, as well as the names of the setters and getters. This dependency is realized by defining a set of behaviour methods on the Attribute concept that are called from both generators (the colon in the code below represents the node cast operator and binds tightly; the code casts the Attribute's parent to Entity and then accesses the name property).

concept behavior Attribute {
  public string qname()      { this.parent : Entity.name + "." + this.name; }
  public string setterName() { "set" + this.name.toFirstUpper(); }
  public string getterName() { "get" + this.name.toFirstUpper(); }
}
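
To sketch what the uispec generator produces, the class generated from the CompanyStructure form might look roughly like this (hypothetical code; only the use of Swing widgets, the generated Beans and the setter/getter naming via the behaviour methods above are taken from the paper):

import javax.swing.JPanel;
import javax.swing.JTextField;

public class CompanyStructureForm {
  private final Employee aEmployee = new Employee();
  private final JTextField widget0 = new JTextField(30); // field Name -> Employee.name

  public JPanel render() {
    JPanel panel = new JPanel();
    panel.add(widget0);
    // ... widgets for the other fields
    return panel;
  }

  public void save() {
    // the setter name "setName" is computed at generation time via Attribute.setterName()
    aEmployee.setName(widget0.getText());
  }
}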

4.2 Language Extension

Language Extension enables heterogeneous fragments with dependent languages (Fig. 10). A language l2 extending l1 adds additional language concepts to those of l1. We call l2 the extending language, and l1 the base language. To allow the new concepts to be used in the context of l1, some of them typically extend concepts in l1. While l1 remains independent, l2 is dependent on l1:

∃i ∈ Inh(l2) | i.sub = l2 ∧ (i.super = l2 ∨ i.super = l1)   (8)

A fragment f contains language concepts from both l1 and l2:

∀e ∈ Ef | lo(e) = l1 ∨ lo(e) = l2 (9)

In other words, f is heterogeneous. For heterogeneous fragments (3) does not hold anymore, since

∀c ∈ Cdnf | (lo(co(c.parent)) = l1 ∨ lo(co(c.parent)) = l2) ∧ (lo(co(c.child)) = l1 ∨ lo(co(c.child)) = l2)   (10)

Fig. 10. Extension: l2 extends l1. It provides additional concepts B3 and B4. B3 extends A3, so it can be used as a child of A2, plugging l2 into the context provided by l1. Consequently, l2 depends on l1.


Note that copying a language definition and changing it does not constitute a case of Extension, because the approach would not be modular; it is invasive. Also, native interfaces that support calling one language from another (such as calling C from Perl or Java) are not Extension; rather they are a form of language Referencing. The fragments remain homogeneous.

As an example we extend the MPS base language with block expressions and placeholders. These concepts make writing generators that generate base language code much simpler. Fig. 11 shows an example. We use a screenshot instead of text because we use non-textual notations and color.

Fig. 11. Block Expressions (rendered with a shaded background) are basically anonymous inline methods. Upon transformation, an actual method is generated that contains the block content, and the block expression is replaced with a call to this generated method. Block expressions are used mostly when implementing generators; this screenshot shows a generator that uses a block expression.

A block expression is a block that can be used where an Expression is expected [6]. It can contain any number of statements; yield can be used to ”return values” from the block. A block expression can be seen as an ”inlined method” or a closure that is defined and called directly. The generator of the block expression from Fig. 11 transforms it into a method and a call to it:

aEmployee.setName( retrieve_name(aEmployee, widget0) );
...

public String retrieve_name(Employee aEmployee, JComponent w) {
  String newValue = ((JTextField) w).getText();
  return newValue;
}

■ Structure and Syntax. The jetbrains.mps.baselanguage.exprblocks language extends MPS' BaseLanguage. The block expression is used in places where the base language expects an Expression, so a BlockExpression extends Expression. Consequently, fragments that use the exprblocks language can now use BlockExpressions in addition to the concepts provided by the base language. The fragments become heterogeneous.

concept BlockExpression extends Expression implements INamedConcept
  children:
    StatementList body 1

■ Type System. The type of the yield statement is the type of the expression that is yielded, specified by typeof(aYield) :==: typeof(aYield.result) (the type of yield 1; is int, because the type of 1 is int). Since the BlockExpression is used as an expression, it has to have a type as well: the type of the BlockExpression is the common super type of the types of all the yields:

var resultType;
for (node<BlockExpressionYield> y : blockExpr.descendants<BlockExpressionYield>) {
  resultType :>=: typeof(y.result);
}
typeof(blockExpr) :==: resultType;

This code iterates over all yield statements in a block expression and establishes an equation between the current yield's type and a type variable resultType. It uses the :>=: operator to express that resultType must be the same as or a supertype of the type of each yield. The only way to make all of these equations true (which is what the type system solver attempts to do) is to assign the common supertype of all yield types to resultType. We then associate this resultType with the type of the overall block expression.

■ Generator. The generator reduces BlockExpressions to BaseLanguage. It transforms a heterogeneous fragment (BaseLanguage and exprblocks) to a homogeneous fragment (BaseLanguage only). The first step is the creation of the additional method for the block expression, as shown in Fig. 12 and Fig. 13.

Fig. 12. We use a weaving rule to create an additional method for a block expression. A weaving rule processes an input element (a BlockExpression) by creating another element in a different location. The context function defines the target location. In this example, it simply gets the class in which we have defined the particular block expression, so the additional method is generated into that same class. The called template weaveBlockExpression is shown in Fig. 13.

The template shown in Fig. 13 creates the method. The mapping label (b2M) creates a mapping between the BlockExpression and the created method. We will use this label to refer to this generated method when we generate the method call that replaces the BlockExpression (Fig. 14).

Fig. 13. This generator template creates a method from the block expression. It uses COPY_SRC macros to replace the dummy string type with the computed return type of the block expression, inserts a computed name, adds a parameter for each referenced variable outside the block, and inserts all the statements from the block expression into the body of the method. The b2M (block-to-method) mapping label is used later when generating the call to this generated method (shown in Fig. 14).

Fig. 14. Here we generate the call to the method generated in Fig. 13. We use the mapping label b2M to refer to the correct method (not shown; happens inside the reference macro). We pass in the variables from the call's environment as actual arguments using the LOOP and COPY_SRC macros.

Another concept introduced by the exprblocks language is the PlaceholderStatement. It extends Statement so it can be used in function bodies. It is used to mark locations at which subsequent generators can add additional code. These subsequent generators will use a reduction rule to replace the placeholder with whatever they want to put at this location. It is a means of building extensible generators, as we will see later.

In the classification (Section 1.4) we mentioned that we consider language restriction as a form of Extension. To illustrate this point we prevent the use of return statements inside block expressions (the reason for this restriction is that the way we generate from the block expressions cannot handle return statements). To achieve this, we add a can be ancestor constraint to the BlockExpression:

can be ancestor:
  (operationContext, scope, node, childConcept, link)->boolean {
    childConcept != concept/ReturnStatement/;
  }

The childConcept variable represents the concept of which an instance is about to be added under a BlockExpression. The constraint expression has to return true if the respective childConcept is valid in this location. We return true if the childConcept is not a ReturnStatement. Note how this constraint is written from the perspective of the ancestor (the BlockExpression). MPS also supports writing constraints from the perspective of the child. This is important to keep dependencies pointing in the right direction.

Extension comes in two flavors. One feels like Extension, and the other one feels more like Embedding. In this section we have described the Extension flavor: we provide (a little, local) additional syntax to an otherwise unchanged language (block expressions and placeholders). The programs still essentially look like Java programs, and in a few particular places, something is different. Extension with Embedding flavor is where we create a completely new language, but use some of the syntax provided by a base language in that new language. For example, we could create a state machine language that reuses Java's expressions in guard conditions. This use case feels like Embedding (we embed syntax from the base language in our new language), but in terms of our classification (Section 1.4) it is still Extension. Embedding would prevent a dependency between the state machine language and Java.

4.3 Language Reuse

Language Reuse enables homogeneous fragments with independent languages. Given are two independent languages l2 and l1 and two fragments f2 and f1. f2 depends on f1, so that

∃r ∈ Refsf2 | fo(r.from) = f2 ∧ (fo(r.to) = f1 ∨ fo(r.to) = f2)   (11)

Since l2 is independent, its concepts cannot directly reference concepts in l1. This makes l2 reusable with different languages, in contrast to language Referencing, where concepts in l2 reference concepts in l1. We call l2 the context language and l1 the reused language.

A way of realizing dependent fragments with independent languages is using an adapter language lA (cf. [18]) that contains concepts that extend concepts in l2 and reference concepts in l1 (Fig. 15). One could argue that in this case Reuse is just a combination of Referencing and Extension. This is true from an implementation perspective, but it is worth describing as a separate approach because it enables the combination of two independent languages with an adapter after the fact, so no pre-planning during the design of l1 and l2 is necessary.

Fig. 15. Reuse: l1 and l2 are independent languages. Within an l2 fragment, we still want to be able to reference concepts in a fragment expressed with l1. To do this, an adapter language lA is added that uses Extension and Referencing to adapt l1 to l2.

Reuse covers the case where a language has been developed independent of its reuse context. The respective fragments remain homogeneous. We cover two alternative cases: in the first one (a persistence mapping language) the generated code is separate from the code generated from the entities language. The second one (a language for role-based access control) describes the case where the generated code has to be ”woven into” the entities code.


Separated Generated Code. relmapping is a reusable language for mapping arbitrary data to relational tables. It supports the definition of relational table structures, but leaves the actual mapping to the source data unspecified. When the language is adapted to a specific context, this one mapping has to be provided. The left side of the code below shows the reusable part. A database is defined that contains tables with columns. Columns have (database-specific) data types. On the right side we show the database definition code when it is reused with the entities language; each column is mapped to an entity attribute.

database CompanyDB                  database CompanyDB
  table Departments                   table Departments
    number id                           number id           <- Department.id
    char descr                          char descr          <- Department.description
  table People                        table People
    number id                           number id           <- Employee.id
    char name                           char name           <- Employee.name
    char role                           char role           <- Employee.role
    char isFreelancer                   char isFreelancer   <- Employee.freelancer

■ Structure and Syntax. Fig. 16 shows the structure of the relmapping language. The abstract concept ColumnMapper serves as a hook: if we reuse this language in a different context, we extend this hook in a context-specific way.

Fig. 16. A Database contains Tables which contain Columns. A column has a name and a type. A column also has a ColumnMapper. This is an abstract concept that determines where the column gets its data from. It is a hook intended to be specialized in sublanguages, specific to the particular Reuse context.

The relmapping_entities language extends relmapping and adapts it for reuse with the entities language. To this end, it provides a subconcept of ColumnMapper, the AttributeColMapper, which references an Attribute from the entities language as a means of expressing the mapping from the attribute to the column. The relmapping language projects the column mapper (and its context-specific subconcepts) on the right of the column definition, resulting in heterogeneous fragments.
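
The adapter concept itself is small. Its declaration presumably resembles the following sketch (in the concept notation used earlier; the reference role attribute is the one used in the typeMappedToDB implementation below):

concept AttributeColMapper extends ColumnMapper
  references:
    Attribute attribute 1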

■ Type System. The type of a column is the type of its type property. In addition, the type of the column must also conform to the type of the column mapper, so the concrete subtype must provide a type mapping as well. This ”typing hook” is implemented as an abstract behaviour method typeMappedToDB on the ColumnMapper. The typing rules then look as follows:


typeof(column) :==: typeof(column.type);
typeof(column.type) :==: typeof(column.mapper);
typeof(columnMapper) :==: columnMapper.typeMappedToDB();

The AttributeColMapper concept implements this method by mapping IntType to NumberType, and everything else to CharType:

public node<> typeMappedToDB() overrides ColumnMapper.typeMappedToDB {
  node<> attrType = this.attribute.type;
  if (attrType.isInstanceOf(IntType)) { return new node<NumberType>(); }
  return new node<CharType>();
}

■ Generator. The generated code is also separated into a reusable base class, generated by the generator of the relmapping language, and a context-specific subclass, generated by relmapping_entities. The generic base class contains code for creating the tables and for storing data in those tables. It contains abstract methods for accessing the data to be stored in the columns. The dependency structure of the generated fragments, as well as the dependencies of the respective generators, resembles the dependency structure of the languages: the generated fragments are dependent, and the generators are dependent as well (they share the name and implicitly the knowledge about the structure of the class generated by the reusable relmapping generator).

public abstract class CompanyDBBaseAdapter {

  private void createTableDepartments() { /* SQL to create the Departments table */ }

  private void createTablePeople() { /* SQL to create the People table */ }

  public void storeDepartments(Object applicationData) {
    Insert i = new Insert("Departments");
    i.add( "id", getValueForDepartments_id(applicationData));
    i.add( "descr", getValueForDepartments_descr(applicationData));
    i.execute();
  }

  public void storePeople(Object applicationData) { /* like above */ }

  public abstract String getValueForDepartments_id(Object applicationData);
  public abstract String getValueForDepartments_descr(Object applicationData);
  // abstract getValue methods for the People table
}

The subclass (shown below), generated by the relmapping_entities generator, implements the abstract methods defined by the generic superclass. The interface, represented by the applicationData object, has to be generic so any kind of user data can be passed in. Note how this class references the Beans generated from the entities.

public class CompanyDBAdapter extends CompanyDBBaseAdapter {
  public String getValueForDepartments_id(Object applicationData) {
    Object[] arr = (Object[]) applicationData;
    Department o = (Department) arr[0];
    return String.valueOf(o.getId());
  }
  public String getValueForDepartments_descr(Object applicationData) {
    Object[] arr = (Object[]) applicationData;
    Department o = (Department) arr[0];
    return o.getDescription();
  }
}
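
A hypothetical usage of the generated classes (not shown in the paper) illustrates the generic applicationData interface: the caller wraps the entity beans in an Object[], which the getValueFor... methods unwrap. This assumes the Bean generated for Department exposes setId and setDescription:

Department dep = new Department();
dep.setId(42);
dep.setDescription("Research");

CompanyDBAdapter adapter = new CompanyDBAdapter();
adapter.storeDepartments(new Object[] { dep }); // arr[0] is unwrapped by getValueForDepartments_*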

Interwoven Generated Code. rbac is a language for specifying role-based access control for entities. The code below shows an example fragment. Note how it references entities (Department) and attributes (Employee.name) from an entities fragment.

users:
  user mv : Markus Voelter
  user ag : Andreas Graf
  user ke : Kurt Ebert
roles:
  role admin      : ke
  role consulting : ag, mv
permissions:
  admin, W      : Department
  consulting, R : Employee.name

■ Structure and Syntax. The structure is shown in Fig. 17. Like relmapping, rbac provides a hook Resource to adapt it to context languages. The sublanguage rbac_entities provides two subconcepts of Resource, namely AttributeResource to refer to an Attribute, and EntityResource to refer to an Entity, to define permissions for entities and their attributes.

Fig. 17. Language structure of the rbac language. An RBACSpec contains Users, Roles and Permissions. Users can be members of several roles. A permission assigns a role and a right (read, write) to a Resource (such as an Entity or an Attribute).

■ Type System. No type system rules apply here, because none of the concepts added by the rbac language are typed or require constraints regarding the types in the entities language.

■ Generator. What distinguishes this case from the relmapping case is that the code generated from the rbac_entities language is not separated from the code generated from the entities (we cannot use the convenient base class/subclass approach). Instead, a permission check is required inside the setters of the Java Beans. Here is some example code:


public void setName(String newValue) {
  // check permission (contributed by the rbac_entities language)
  if (!new RbacSpecEntities().hasWritePermission("Employee.name")) {
    throw new RuntimeException("no permission");
  }
  this.name = newValue;
}
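
The RbacSpecEntities class queried above is itself generated from the rbac_entities fragment. The paper only shows the call site; a purely hypothetical sketch of the generated class could look like this (how the current user and role are determined is not specified in the paper):

public class RbacSpecEntities {
  // permissions from the example fragment: admin may write Department,
  // consulting may read Employee.name
  public boolean hasWritePermission(String resource) {
    String role = currentRole();
    return role.equals("admin") && resource.startsWith("Department");
  }

  private String currentRole() {
    return "admin"; // placeholder: would come from the session or security context
  }
}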

The generated fragment is homogeneous (it is all Java code), but it is multi-sourced, since several generators contribute to the same fragment. To implement this, several approaches are possible:

– We could use AspectJ14. This way, we could generate separate Java artifacts (all single-sourced) and then use the aspect weaver to ”mix” them. While this would be a simple approach in terms of MPS (because we only generate single-sourced artifacts), it fails to illustrate advanced MPS generator concepts. So we do not use this approach here.

– An interceptor framework (see the Interceptor pattern in [9]) could be added to the generated Java Beans, with the generated code contributing specific interceptors (effectively building a custom aspect-oriented programming (AOP) solution). We will not use this approach either, for the same reason we do not use AspectJ in this paper.

– We could ”inject” additional code generation templates into the existing entities generator from the rbac_entities generator. This would make the generators woven as opposed to just dependent. However, weaving generators in MPS is not supported, so we cannot use this approach.

– We could define a hook in the generated Java Beans code and then have the rbac_entities generator contribute code to this hook. This is the approach we will use. The generators remain dependent because they share knowledge about the way the hook works.

Notice that only the AspectJ solution would work without any pre-planning from the perspective of the entities language, because it avoids mixing the generated code artifacts (it is handled by AspectJ). All other solutions require the original entities generator to ”expect” extensions. In our case we have modified the entities generator to generate a PlaceholderStatement (Fig. 18) into the setters. The placeholder acts as a hook at which subsequent generators can add statements. While we have to pre-plan that we want to extend the generator in this location, we do not have to predefine how.
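
Conceptually, after the entities generator has run but before the rbac_entities generator reduces the placeholder, a setter looks roughly like this (illustrative only; the placeholder is an AST node, rendered here as a comment, and its identifier is hypothetical):

public void setName(String newValue) {
  // <PlaceholderStatement: preSet>  -- reduced (or removed) by downstream generators
  this.name = newValue;
}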

The rbac_entities generator contains a reduction rule for PlaceholderStatements. If the generator encounters a placeholder (that has been put there by the entities generator) it replaces it with code that checks for the permission (Fig. 19). To make this work we have to specify in the generator priorities that this generator runs strictly after the entities generator (since the entities generator has to create the placeholder before it can be replaced) and strictly before the BaseLanguage generator (which transforms BaseLanguage code into Java text for compilation). Priorities specify a partial ordering (cf. the strictly before and strictly after) on generators and can be set in a generator properties dialog. Note that specifying the priorities does not introduce additional language dependencies; modularity is retained.

14 http://www.eclipse.org/aspectj/

Fig. 18. This generator fragment creates a setter method for each attribute of an Entity. The LOOP iterates over all Attributes. The $ macro computes the name of the method, and the COPY_SRC macro on the argument type computes the type. The placeholder is used later to insert the permission check.

Fig. 19. This reduction rule replaces PlaceholderStatements with a permission check. Using the condition, we only match those placeholders whose identifier is pre-set (notice how we have defined this identifier in the template shown in Fig. 18). The inserted code queries another generated class that contains the actual permission check. A runtime exception is thrown if the permission check fails.

4.4 Language Embedding

Language Embedding enables heterogeneous fragments with independent languages. It is similar to Reuse in that there are two independent languages l1 and l2, but instead of establishing references between two homogeneous fragments, we now embed instances of concepts from l2 in a fragment f expressed with l1:

∀c ∈ Cdnf | lo(co(c.parent)) = l1 ∧ (lo(co(c.child)) = l1 ∨ lo(co(c.child)) = l2)   (12)

Unlike language Extension, where l2 depends on l1 because concepts in l2 extend concepts in l1, there is no such dependency in this case. Both languages are independent. We call l2 the embedded language and l1 the host language. Again, an adapter language can be used to achieve this (we describe an Embedding without adapters in Section 4.5). However, in this case concepts in lA don't just reference concepts from l1. Instead, they contain them (similar to Fig. 15, but with a containment link between B5 and A3).

As an example we embed an existing expressions language into the uispec language without modifying either the uispec language or the expressions language, since, in the case of Embedding, neither of them may have a dependency on the other. Below is an example program using the resulting language; it uses expressions after the validate keyword:

form CompanyStructure
  uses Department
  uses Employee
  field Name: textfield(30) -> Employee.name
    validate lengthOf(Employee.name) < 30
  field Role: combobox(Boss, TeamMember) -> Employee.role
  field Freelancer: checkbox -> Employee.freelancer
    validate if (isSet(Employee.worksAt)) Employee.freelancer == true
             else Employee.freelancer == false
  field Office: textfield(20) -> Department.description

■ Structure and Syntax. We create a new language uispec_validation that extends uispec and also extends expressions. Fig. 20 shows the structure. To be able to use the expressions, the user has to instantiate a ValidatedField instead of a Field. ValidatedField is also defined in uispec_validation and is a subconcept of Field.

Fig. 20. The uispec_validation language defines a subtype of uispec.Field that contains an Expression from a reusable expressions language. The language also defines a couple of additional expressions, including the AttributeRefExpr, which can be used to refer to attributes of entities.

To support the migration of existing models that use Field instances, we provide an intention. An Intention (known as a Quick Fix in Eclipse) is an in-place model transformation that can be triggered by the user by selecting it from the Intentions menu accessible via Alt-Enter. This particular intention is defined for a Field, so the user can press Alt-Enter on a Field and select Add Validation15. This transforms an existing Field into a ValidatedField, so that a validation expression can be entered. The core of the Intention is the following script, which performs the actual transformation:

15 We could alternatively also implement a way for people to just type validate on the right side of a field to trigger this transformation.


execute(editorContext, node)->void {
  node<ValidatedField> vf = node.replace with new(ValidatedField);
  vf.widget = node.widget;
  vf.attribute = node.attribute;
  vf.label = node.label;
}

As mentioned, the uispec_validation language extends the uispec and expressions languages. ValidatedField has a property expr that contains the actual Expression. As a consequence of polymorphism, we can use any existing subconcept of Expression defined in the expressions language here. So without doing anything else, we could write 20 + 40 > 10, since integer literals and the + operator are defined as part of the embedded expressions language. However, to write anything useful, we have to be able to reference entity attributes from within expressions. To achieve this, we create the AttributeRefExpr as shown in Fig. 20. We also create LengthOfExpr and IsSetExpression as further examples of how to adapt an embedded language to its new context (the uispec and entities languages in the example). The following is the structure definition of the LengthOfExpr:

concept LengthOfExpr extends Expression
  properties:
    alias = lengthOf
  children:
    Expression expr 1

Note how it defines an alias. The alias is used to pick the concept from the code completion menu. If the user is in expression context, he must type the alias of a concept to pick it from the code completion menu. Typically, the alias is similar to the leading keyword of the concept's CS. The LengthOfExpr is projected as lengthOf(something), so by choosing the alias to also be lengthOf, the concept can be entered naturally.

The AttributeRefExpr references entity attributes. However, it may only reference attributes of entities that are used in the Form within which we define the validation expression. The code below defines the necessary scoping rule:

(model, scope, referenceNode, enclosingNode) -> sequence<node<>> {
  nlist<Attribute> res = new nlist<Attribute>;
  node<Form> form = enclosingNode.ancestor<Form>;
  for (node<EntityReference> er : form.usedEntities) {
    res.addAll(er.entity.attributes);
  }
  return res;
}

Notice that the actual syntactic embedding of the expressions in the uispec_validation language is not a problem because of how projectional editors work. No ambiguities may arise. We simply add a child of type Expression to the ValidatedField concept.

■ Type System. Primitive types such as int and string are defined in the entities language and in the reusable expression language. Although they have the same names, they are not the same concepts, so the two sets of types must be mapped. For example, the type of the IsSetExpression is expressions.BooleanType so it fits in with the expressions language. The type of the LengthOfExpr, which takes an AttrRefExpression as its argument, is expressions.IntType. The type of an attribute reference is the type of the attribute's type property, as in typeof(attrRef) :==: typeof(attrRef.attr.type). However, consider the following code:

field Freelancer: checkbox -> Employee.freelancer
  validate if (isSet(Employee.worksAt))
           then Employee.freelancer == false
           else Employee.freelancer == true

This code states that if the worksAt attribute of an employee is set, then its freelancer attribute must be false, else it must be true. It uses the == operator from the expressions language. However, that operator expects two expressions.BooleanType arguments, but the type of the Employee.freelancer is entities.BooleanType. In effect, we have to override the typing rules for the expressions language's == operator. In the expressions language, we define overloaded operation rules. We specify the resulting type for an EqualsExpression depending on its argument types. Below is the code in the expressions language that defines the resulting type to be boolean if the two arguments are expressions.BooleanType:

operation concepts: EqualsExpression
left operand type:  new node<BooleanType>()
right operand type: new node<BooleanType>()
operation type: (op, leftOperandType, rightOperandType)->node<> {
  new node<BooleanType>;
}

This overloaded operation specification is integrated with the inference-based typing rules using the following code:

rule typeof_BinaryExpression for BinaryExpression as binExpr {
  node<> opType = operation type( binExpr , left , right );
  if (opType != null) {
    typeof(binExpr) :==: opType;
  } else {
    error "operator " + binExpr.concept.name + " cannot be applied to operands "
          + left.concept.name + "/" + right.concept.name -> binExpr;
  }
}

To override these typing rules for entities.BooleanType, we simply provide another overloaded operation specification in the uispec_validation language:

operation concepts: EqualsExpression
one operand type: new node<BooleanType>   // this is the entities.BooleanType!
operation type: (op, leftOperandType, rightOperandType)->node<> {
  node<BooleanType>;                      // expressions.BooleanType
}

■ Generator. For the generator we can use the following two alternative approaches. We can use the expressions language's existing to-text generator and wrap the expressions in some kind of TextWrapperStatement. A wrapper is necessary because we cannot simply embed text in BaseLanguage — this would not work structurally. Alternatively, we can write a (reusable) transformation from expressions to BaseLanguage; these rules would be used as part of the transformation of uispec_validation code to BaseLanguage. Since many DSLs will map code to BaseLanguage, it is worth the effort to write a reusable generator from expressions to BaseLanguage expressions. We choose this second alternative.

Fig. 21. A number of reduction rules that map the reusable expressions language to BaseLanguage (Java). Since the languages are very similar, the mapping is trivial. For example, a PlusExpression is mapped to a + in Java; the left and right arguments are reduced recursively through the COPY_SRC macro.

The actual expressions defined in the expressions language and those of BaseLanguage are almost identical, so this generator is trivial. We create a new language project expressions.blgen and add reduction rules. Fig. 21 shows some of these reduction rules.

We also need reduction rules for the new expressions added in the uispec_validation language (AttrRefExpression, isSetExpression, LengthOfExpr). Those rules are defined in uispec_validation. As an example, Fig. 22 shows the rule for handling the AttrRefExpression. The validation code itself is "injected" into the UI form via the same placeholder reduction as in the case of the rbac_entities language.

Just as in the discussion on Extension (Section 4.2), we may want to use constraints to restrict the embedded language in the context of a ValidatedField. Consider the case where we wanted to embed the expressions part of C instead of the expressions language. C comes with all kinds of operators relating to pointers, bit shifting and other C-specifics that are not relevant in the validation of UI fields. In this case we may want to use a can be ancestor constraint to restrict the use of those operators in the validation expressions.


Fig. 22. References to entity attributes are mapped to a call to their getter method. The template fragment (inside the <TF .. TF>) uses reference macros (->$) to "rewire" the reference to the Java Bean instance, and the toString method call to a call to the getter.
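To make the effect of these reduction rules more concrete, the following is a hypothetical sketch of the kind of Java (BaseLanguage) code they could produce for the validation lengthOf(Employee.name) < 30. The class and method names are illustrative assumptions, not actual generator output; Employee stands for the Java Bean generated from the entities language.

// Hypothetical sketch of generated validation code; names are illustrative.
// The AttrRefExpression Employee.name is rewired to a getter call on the
// Java Bean instance, LengthOfExpr becomes a length() call, and the
// comparison is reduced to the corresponding Java operator.
public class CompanyStructureValidation {
    public boolean validateName(Employee employee) {
        return employee.getName().length() < 30;
    }
}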

As a consequence of MPS' projectional editor, no ambiguities may arise if multiple independent languages are embedded (the same discussion applies to the case where a base language is extended with several independently developed extensions at the same time). Let us consider the potential cases:

Same Concept Name: Embedded languages may define concepts with the same name as the host language. This will not lead to ambiguity because concepts have a unique ID as well. A program element will use this ID to refer to the concept whose instance it represents.

Same Concrete Syntax: The projected representation of a concept is not relevant to the functioning of the editor. The program would still be unambiguous to MPS even if all elements had the same notation. Of course it would be confusing to the users (users can always see the qualified name of the instantiated concept in the Inspector as a means of disambiguation).

Same Alias: If two concepts that are valid at the same location use the same alias, then, as the user types the alias, it is not clear which of the two concepts should be instantiated. This problem is solved by MPS opening the code completion window and requiring the user to explicitly select which alternative to choose. Once the user has made the decision, the unique ID is used to create an unambiguous program tree.

4.5 Language Annotations

In a projectional editor, the CS of a program is projected from the AST. A projectional system always goes from AS to CS, never from CS to AS (as parsers do). This means that the CS does not have to contain all the data necessary to build the AST (which, in the case of parsers, is necessary). This has two consequences:

– A projection may be partial. The AS may contain data that is not shown in the CS. The information may, for example, only be changeable via intentions (see Section 4.4), or the projection rule may project some parts of the program only in some cases, controlled by some kind of configuration.

– It is also possible to project additional CS that is not part of the CS definition of the original language. Since the CS is never used as the information source, such additional syntax does not confuse the tool (in a parser-based tool the grammar would have to be changed to take into account this additional syntax to not derail the parser).

In this section we discuss the second alternative. It represents a variant of Embedding: no dependencies, but syntactic composition. The mechanism MPS uses for this is called annotations, which we have seen when we introduced templates (Section 3): an annotation can be attached to arbitrary program elements and can be shown together with the CS of the annotated element. In this section we use this approach to implement an alternative approach for the entity-to-database mapping. Using this approach, we can store the mapping from entity attributes to database columns directly in the Entity, resulting in the following code:

module company

entity Employee {
  id         : int -> People.id
  name       : string -> People.name
  role       : string -> People.role
  worksAt    : Department -> People.departmentID
  freelancer : boolean -> People.isFreelancer
}

entity Department {
  id          : int -> Departments.id
  description : string -> Departments.descr
}

This is a heterogeneous fragment, consisting of code from entities, as well as the annotations (e.g. -> People.id). From a CS perspective, the column mapping is embedded in the Entity. In the AST the mapping information is also actually stored in the entities model. However, the definition of the entities language does not know that this additional information is stored and projected "inside" entities. The entities language is not modified.

■ Structure and Syntax. We define an additional language relmapping_annotations which extends the entities language as well as the relmapping language. In this language we define the following concept:

concept AttrToColMapping extends NodeAnnotation
  references:
    Column column 1
  properties:
    role = colMapping
  concept links:
    annotated = Attribute

The AttrToColMapping concept extends NodeAnnotation, a concept predefined by MPS. Concepts that extend NodeAnnotation have to provide a role property and an annotated concept link. As we have said above, structurally, an annotation is a child of the node it annotates. So the Attribute has a new child of type AttrToColMapping, and the reference that contains the child is called @colMapping — the value of the role property prepended with @. The annotated concept link points to the concept to which this annotation can be added. AttrToColMappings can be attached to instances of Attribute.

While structurally the annotation is a child of the annotated node, the relationship is reversed in the CS: the editor for AttrToColMapping wraps the editor for Attribute, as Fig. 23 shows. Since the annotation is not part of the original language, it cannot just be "typed in"; instead it must be attached to nodes via an Intention.

Fig. 23. The editor for the AttrToColMapping embeds the editor of the concept it is annotated to (using the attributed node cell). It then projects the reference to the referenced column. This way the editor of the annotation controls whether and how the editor of the annotated element is projected.

It is possible to define the annotation target to be BaseConcept, which means the annotation can be attached to any program element. This is useful for generic metadata such as documentation, requirements traces or presence conditions in product line engineering (we describe this in [54] and [52]). MPS' template language uses this approach as well. Note that this is a way to support Embedding generically, without the use of an adapter language. The reason why this generic approach is useful mostly for metadata is related to semantics: since the annotations can be composed with any other language without an adapter, the semantics must be generic as well, i.e. not related to any particular target language. This is true for the generic metadata mentioned above.
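As an illustration of such a generic annotation, the following sketch shows how a simple documentation annotation that can be attached to any program element might be declared. The concept name and role are assumptions made up for this example, not an existing MPS language; the documentation text itself would be stored in an additional property of the concept.

concept DocumentationAnnotation extends NodeAnnotation
  properties:
    role = doc
  concept links:
    annotated = BaseConcept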

■ Type System. The same typing rules are necessary as in relmapping_entities described previously. They reside in relmapping_annotations.

■ Generator. The generator is also similar to the one for relmapping_entities. It takes the entities model as the input, and then uses the column mappings in the annotations to create the entity-to-database mapping code.
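As an illustration only, the following Java sketch shows the kind of persistence code such a generator could emit for the Employee entity, using the table and column names from the annotations shown above. The use of plain JDBC and all class and method names are assumptions; the paper does not show the generated code.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch of generated entity-to-database mapping code for the
// Employee entity; the column names come from the @colMapping annotations,
// everything else (plain JDBC, identifiers) is an assumption.
public class EmployeeMapper {
    public void insert(Connection connection, Employee employee) throws SQLException {
        String sql = "INSERT INTO People (id, name, role, departmentID, isFreelancer) "
                   + "VALUES (?, ?, ?, ?, ?)";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setInt(1, employee.getId());
            stmt.setString(2, employee.getName());
            stmt.setString(3, employee.getRole());
            stmt.setInt(4, employee.getWorksAt().getId());
            stmt.setBoolean(5, employee.isFreelancer());
            stmt.executeUpdate();
        }
    }
}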

5 Discussion

In this section we discuss limitations of MPS in the context of language and IDE modularization and composition and propose an approach for improving some of these shortcomings. We also look at real-world use of MPS.

5.1 Limitations

The paper paints a very positive picture about the capabilities of MPS regarding language and IDE modularization and composition. However, there are some limitations and shortcomings in the system. Most of them are not conceptual problems but missing features, and problems have been solved ad hoc as they arose. A consistent, unified approach is sometimes missing. I propose such an approach in Section 5.2.

■ Syntax. The examples in this paper show that meaningful language and IDE modularization and composition is possible with MPS. The challenge of grammar composition is not an issue in MPS, since no grammars and parsers are used. The fact that we hardly ever discuss syntactic issues in the above discussions is testament to this. Potential ambiguities are resolved by the user as he enters the program (discussed at the end of Section 4.4) — once entered, a program is always unambiguous. The luxury of not running into syntactic composition issues comes at the price of the projectional editor (we have discussed the drawbacks of projectional editors in Section 2).

One particular shortcoming of MPS is that it is not possible to override the projection rule of a concept in a sublanguage (this feature is on the roadmap for MPS 3.0). If this were possible, ambiguities for the user in terms of the CS could be solved by changing the notation (or color or font) of existing concepts if they are used together with a particular other language. Such a new CS would be defined in the respective adapter language.

■ IDE. This paper emphasizes IDE composition in addition to language composition. Regarding syntax highlighting, code completion, error marks on the program and intentions, all the composition approaches automatically compose those IDE aspects. No additional work is necessary by the language developer. However, there are additional concerns an IDE may address, including version control integration, profiling and debugging. Regarding version control integration, MPS provides diff/merge for most of today's version control systems on the level of the projected syntax — including for heterogeneous fragments. No support for profiling is provided, although a profiler for language implementations is on the roadmap. MPS comes with a debugging framework that lets language developers create debuggers for languages defined in MPS. However, this framework is relatively low-level and does not provide specific support for language composition and heterogeneous fragments. As part of the mbeddr project [53], which develops an extensible version of the C programming language on top of MPS, we have developed a framework for extensible C debuggers. Developers of C extensions can easily specify how the extension integrates into the C debugger so that debugging on the syntax of the extension becomes possible for heterogeneous fragments. We are currently in discussions with JetBrains to make the underlying extensible debugging framework part of MPS. Debuggers for DSLs have also been discussed by Visser et al. in [29] and by Wu et al. in [55].

■ Evolution. Composing languages leads to coupling. In the case of Referencing and Extension the coupling is direct; in the case of Reuse and Embedding the coupling is indirect via the adapter language. As a consequence of a change of the referenced/base/context/host language, the referencing/extending/reused/embedded language may have to change as well. MPS, at this time, provides no automatic way of versioning and migrating languages, so co-evolution has to be performed manually. In particular, a process discipline must be established in which dependent languages are migrated to new versions of a changed language they depend on.

■ Type System. Regular typing rules cannot be overridden in a sublanguage. Only the overloaded operations container can be overloaded (as their name suggests) from a sublanguage. As a consequence, it requires some thought when designing a language to make the type system extensible in meaningful ways.

■ Generators. Language designers specify a partial ordering among generators using priorities. It is not easily possible to "override" an existing generator, but generators can run before or after existing ones. Generator extension is not possible directly. This is why we use the placeholders that are put in by earlier generators to be reduced by later ones. Obviously, this requires pre-planning on the part of the developer of the generator that adds the placeholder.

5.2 A Unified Approach

Looking at the limitations discussed in the previous subsection, it is clear that a consistent approach for addressing the modularization, extension and composition of all language aspects would be useful. In this section we propose such a unified approach based on the principles of component-based design [50]. In this approach, all language aspects would use components as the core structural building block. Components have facets and a type. The type of the component determines the kinds of facets it has. A facet is a kind of interface that exposes the (externally visible) ingredients of the component. A component of type structure exposes language concepts. A component of type editor exposes editors, type type system exposes type system rules, and so on. To support modularization, a component (in a sublanguage) can specify an advises relationship to another component (from a super language). Then each of the facets can determine which facets from the advised component it wants to preempt, enhance or override (a sketch of this structure follows the list below):

– preemption means that the respective behavior is contributed before the behavior from the base language. A generator may use this to reduce an element before the original generator gets a chance to reduce it.

– enhancement means that the sublanguage component is executed after the advised component from the base language. Notice that for declarative aspects where ordering is irrelevant, preempt and enhance are exchangeable.

– overriding means that the original facet is completely shadowed by the new one. This could be used to define a new editor for an existing concept.
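The following Java sketch illustrates the proposed structure; it is a thought experiment, not existing MPS functionality, and all names are assumptions made up for this illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

enum AdviceKind { PREEMPT, ENHANCE, OVERRIDE }

// A facet exposes one externally visible ingredient of a language aspect,
// e.g. an editor, a typing rule or a reduction rule.
interface Facet {
    String name();
}

// A component groups facets; its type (structure, editor, type system, ...)
// determines which kinds of facets it may expose.
interface Component {
    String type();
    List<Facet> facets();
}

// An 'advises' relationship: a sublanguage component declares, per facet of
// the advised base-language component, whether it preempts, enhances or
// overrides that facet.
final class Advice {
    final Component advised;
    final Map<String, AdviceKind> perFacet = new HashMap<>();

    Advice(Component advised) {
        this.advised = advised;
    }

    void advise(String facetName, AdviceKind kind) {
        perFacet.put(facetName, kind);
    }
}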

This approach would provide the same way of packaging behavior for all language aspects, as well as a single way of changing that behavior in a sublanguage. To control the granularity at which preemption, enhancement or overriding is performed, the base language designer would have to group the structures or behaviors into suitably cut facets. This amount of pre-planning is acceptable: it is just as in object-oriented programming, where behavior that should be overridable has to be packaged into its own method.

The approach could be taken further. Components could be marked as abstract and define a set of parameters for which values need to be provided by non-abstract sub-components. A language is abstract as long as it has at least one abstract component for which no concrete sub-component is provided. Component parameters could even be usable in structure definitions, for example as the base concept; this would make a language extension parameterizable regarding the base language it extends.

5.3 Real-World Use of MPS

The examples in this paper are toy examples — the simplest possible languages that can illustrate the composition approaches. However, MPS scales to realistic systems, both in terms of language complexity and in terms of program size. The composition techniques — especially those involving syntactic composition — are used in practice. We illustrate this with two examples: embedded software and web applications.

■ Embedded Software. Embedded systems are becoming more software intensive and the software becomes more complex. Traditional embedded system development approaches use a variety of tools for various aspects of the system, making tool integration a major challenge. Some of the specific problems of embedded software development include the limited capability for meaningful abstraction in C, some of C's "dangerous" features (leading to various coding conventions such as Misra-C [20]), the proprietary and closed nature of modeling tools, the integration of models and code, traceability to requirements, long build times as well as management of product line variability. The mbeddr project16 addresses these challenges using incremental, modular extension of C with domain-specific language concepts. mbeddr uses Extension to add interfaces and components, state machines, and measurement units to C. mbeddr is based on MPS, so users of mbeddr can build their own Extensions. mbeddr implements all of C in less than 10,000 lines of MPS code. Scalability tests have shown that mbeddr scales to at least 100,000 lines of equivalent C code. A detailed description, including more details on language and program sizes and implementation effort, can be found in [53].

■ Web Development. JetBrains' YouTrack issue tracking system is an interactive web application with many UI features known from desktop applications. YouTrack is developed completely with MPS and comprises thousands of Java classes, web page templates and other artifacts. The effort for building the necessary MPS-based languages will be repaid by future applications that build on the same web platform architecture and hence use the same set of languages. Language Extension and Embedding is used to provide an integrated web development environment17.

16 http://mbeddr.com
17 http://www.jetbrains.com/mps/docs/MPS_YouTrack_case_study.pdf


For example, the dnq language extends Java class definitions with all the information necessary to persist instances in a database via an object-relational mapper. This includes real associations (specifying navigability and composition vs. reference) or length specifications for string properties. dnq also includes a collections language which supports the manipulation of collections in a way similar to .NET's Linq [34]. Other languages include webr, a language used for implementing interactions between the web page and the backend. It supports a unified programming model for application logic on the server and on the browser client. webr also provides first-class support for controllers. For example, controllers can declare actions and attach them directly to events of UI components. webr is well-integrated with dnq. For example, it is possible to use a persistent entity as a parameter to a page. The database transaction is automatically managed during request processing.

In email communication with the author, JetBrains reported significant improvements in developer productivity for web applications. In particular, the time for new team members to become productive on the YouTrack team is reported to have been reduced from months to a few weeks, mostly because of the very tight integration in a single language of the various aspects of web application development.

6 Related Work

This paper addresses language and IDE modularization and composition with MPS, a topic that touches on many different areas. In this section we discuss related work focusing on modular grammars and parsers, projectional editing, modular compilers and modular IDEs. We conclude with a section on related work that does not fit these categories.

6.1 Modular Grammars and Parsers

As we have seen in this paper, modular composition of concrete syntax is the basis for several of the approaches to language composition. Hence we start by discussing modularization and composition of grammars.

In [26] Kats, Visser and Wachsmuth describe nicely the trade-offs with non-declarative grammar specifications and the resulting problems for composition of independently developed grammars. Grammar formalisms that cover only subsets of the class of context-free grammars are not closed under composition: resulting grammars are likely to be outside of the respective grammar class. Composition (without invasive change) is prohibited. Grammar formalisms that implement the full set of context-free grammars do not have this problem and support composition much better. In [47] Schwerdfeger and Van Wyk also discuss the challenges in grammar composition. They also describe a way of verifying early (i.e. before the actual composition attempt) whether two grammars are composable or not.

An example of a grammar formalism that supports the full set of context-free grammars is the Syntax Definition Formalism [22]. SDF is implemented with scannerless GLR parsers. Since it parses tokens and characters in a context-aware fashion, there will be no ambiguities if grammars are composed that both define the same token or production in different contexts. This makes it possible, for example, to embed SQL into Java (as Bravenboer et al. discuss in [31]). However, if the same syntactic form is used by the composed grammars in the same location, then some kind of disambiguation is necessary. Such disambiguations are typically called quotations and antiquotations and are defined in a third grammar that defines the composition of two other independent grammars (discussed in [7]). The SILVER/COPPER system described by van Wyk in [56] solves these ambiguities via disambiguation functions written specifically for each combination of ambiguously composed grammars. Note that in MPS such disambiguation is never necessary. We discuss the potential for ambiguity and the way MPS solves the problem at the end of Section 4.4.

Given a set of extensions for a language, SILVER/COPPER allows users to include a subset of these extensions into a program as needed (this has been implemented for Java in AbleJ [58] and for SPIN's Promela language in AbleP [32]). A similar approach is discussed for an SDF-based system in [8]. However, ad-hoc inclusion only works as long as the included extensions (which have presumably been developed independently from each other) are not ambiguous with regard to each other. In case of ambiguities, disambiguations have to be defined as described above.

Polyglot, an extensible compiler framework for Java [40], also uses an extensible grammar formalism and parser to support adding, modifying or removing productions and symbols defined in a base grammar. However, since Polyglot uses LALR grammars, users must make sure manually that the base language and the extension remain in the LALR subclass.

In Section 3 we mentioned that MPS' template language provides IDE support for the target language in the template. In traditional text-generation template languages this is typically not supported because it requires support for language composition: the target language must be embedded in the template language. However, there are examples of template languages that support this. Not surprisingly they are built on top of modular grammar formalisms. An example is the Repleo template language [1] which is built on SDF. However, as explained in the discussion on SDF above, SDF requires the definition of an additional grammar that defines how the host grammar (template language in this case) and the embedded grammar (target language) fit together: for all target language non-terminals where template code should be allowed, a quotation has to be defined. MPS does not require this. Any target language can be marked up with template annotations. No separate language has to be defined for the combination of template language and target language.

6.2 Projectional Editing

Projectional editing (also known as structural editing) is an alternative approach for handling the relationship between CS and AS, i.e. it is an alternative to parsing. As we have seen, it simplifies modularization and composition.


Projectional editing is not a new idea. An early example is the Incremental Programming Environment (IPE, [33]). It uses a structural editor for users to interact with the program and then incrementally compiles and executes the resulting AST. It supports the definition of several notations for the same program as well as partial projections. However, the projectional editor forces users to build the program tree top-down. For example, to enter 2 + 3 users first have to enter the + and then fill in the two arguments. This is very tedious and forces users to be aware of the language structure at all times. MPS in contrast goes a long way in supporting editing gestures that much more resemble text editing, particularly for expressions. IPE also does not address language modularity. In fact it comes with a fixed, C-like language and does not have a built-in facility to define new languages. It is not bootstrapped. Another projectional system is GANDALF [39]. Its ALOEGEN component generates projectional editors from a language specification. It has the same usability problems as IPE. This is nicely expressed in [42]: Program editing will be considerably slower than normal keyboard entry although actual time spent programming non-trivial programs should be reduced due to reduced error rates.

The Synthesizer Generator described in [45] also supports structural editing. However, at the fine-grained expression level, textual input and parsing is used. This removes many of the advantages of projectional editing in the first place, because simple language composition at the expression level is prohibited. MPS does not use this "trick", and instead supports projectional editing also on the expression level, with convenient editing gestures. We have seen in this paper that extensions of expressions are particularly important to tightly integrate an embedded language with its host language.

Bagert and Friesen describe a multi-language syntax-directed editor in [4]. However, this tool supports only Referencing; syntactic composition is not supported.

The Intentional Domain Workbench [48] is another contemporary projectional editor that has been used in real projects. An impressive demonstration of its capabilities can be found in an InfoQ presentation titled Domain Expert DSL18.

6.3 Modular Compilers

Modular compilers make use of modular parsers and add modular specification of semantics, including static semantics (constraints and type systems) as well as execution semantics.

Many systems describe static semantics using attribute grammars. Attribute grammars associate attributes with AST elements. These attributes can capture arbitrary data about the element (such as its type). Examples of systems that make use of attribute grammars for type computation and type checking include SILVER ([56], mentioned above), JastAdd [21] and LISA ([36], discussed in more detail in the next section). Forwarding (introduced in [57]) is a mechanism that improves the modularity of attribute grammars by delegating the look-up of an attribute value to another element.

18 http://www.infoq.com/presentations/DSL-Magnus-Christerson-Henk-Kolk

While MPS’ type system specification language can be seen as associating atype attribute with AST elements using the typeof function, MPS’ type systemis different from attribute grammars. Attribute values are calculated by explicitlyreferring to the values of other attributes, often recursively. MPS’ type systemrules are declarative: users specify typing rules for language concepts and MPS”instantiates” each rule for each AST element. A solver then solves all typeequations in that AST. This way, the typing rules of elements contributed bylanguage extensions can implicitly affect the overall typing of the program.

As we have seen, for language Extension the execution semantics is defined via transformation to the base language. In [56], van Wyk discusses under which circumstances such transformations are valid: the changes to the overall AST must be local. No global changes are allowed, to avoid unintended interactions between several independently developed extensions used in the same program. In MPS such purely local changes are performed with reduction rules. In our experience, it is also feasible to add additional elements to the AST in select places. In MPS, this is achieved using weaving rules. However, in both cases (local reduction and selective adding) there is no way to detect in advance whether using two extensions in the same program will lead to conflicts.

More formal ways of defining semantics include denotational semantics, operational semantics and a mapping to a formally defined action language. These have been modularized to make them composable. For example, Mosses describes modular structural operational semantics [38] and language composition by combining action semantics modules [11].

Aspect orientation supports the modularization of cross-cutting concerns. This has also been applied to language development. For example, in [43] Rebernak et al. discuss AspectLISA and AspectG. AspectLISA supports adding new, cross-cutting attribute grammar attributes into a LISA language definition. AspectG allows weaving additional action code into ANTLR grammars. Note that both AspectLISA and AspectG address semantics and do not support aspect-oriented extension of the concrete syntax.

6.4 Modular IDEs

Based on the fundamentals that enable modular syntax and semantics, we now look at tools that, from a language definition, also create a language-aware editor.

Among the early examples are the Synthesizer Generator [45], mentioned above, as well as the Meta Environment [27]. The latter provides an editor for languages defined via ASF+SDF, i.e. it is parser-based. More recent tools in the ASF+SDF family include Rascal [28] and Spoofax [25]. Both provide Eclipse-based IDE support for languages defined via SDF. In both cases the IDE support for the composed languages is still limited (for example, at the time of this writing, Spoofax only provides syntax highlighting for an embedded language, but no code completion), but will be improved. For implementing semantics, Rascal uses a Java-like language that has been extended with features for program construction, transformation and analyses. Spoofax uses term rewriting based on the Stratego [5] language. An interesting tool is SugarJ [15], also based on SDF, which supports library-based language extension. Spoofax-based IDE support is discussed in [14].

SmartTools [2] supports generating editors for XML schemas. Based on assigning UI components to AS elements, it can project an editor for programs. However, this projectional editor does not try to emulate text-like editing as MPS does, so there is no convenient way for editing expressions. To do this, a grammar-based concrete syntax can be associated with the AS elements defined in the schema. Based on this definition, SmartTools then provides a text-based representation for the language. However, this prevents syntax composition, and SmartTools only supports homogeneous files. Different UI components and grammars can be defined for the same AS, supporting multi-notation editing. Static semantics is implemented based on the Visitor pattern [18]. SmartTools provides support for much of the infrastructure and makes using Visitors simple. For transformation, SmartTools provides Xpp, a transformation language that provides a more concise syntax for XSLT-based XML transformations.

LISA [36] (mentioned earlier) supports the definition of language syntax and semantics (via attribute grammars) in one integrated specification language. It then derives, among other things, a syntax-aware text editor for the language, as well as various graphical and structural viewing and editing facilities. Users can use inheritance and aspect-orientation to define sub-grammars. The use of this approach for incremental language development is detailed in [37]. However, users have to make sure manually that sub-grammars remain unambiguous with respect to the base grammar. The same is true for the combination of independently developed grammars. LISA supports interactive debugging and program state visualization by interpreting programs based on the semantic parts of the language specification.

Eclipse Xtext19 generates sophisticated text editors from an EBNF-like language specification. Syntactic composition is limited since Xtext is based on ANTLR [41], which is a two-phase LL(k) parser. It is possible for a language to extend one other language. Concepts from the base language can be used in the sub language and it is possible to redefine grammar rules defined in the base language. Combination of independently defined extensions or Embedding is not supported. Xtext's abstract syntax is based on EMF Ecore20, so it can be used together with any EMF-based model transformation and code generation tool (examples include Xpand, ATL, and Acceleo, all located at the Eclipse Modeling site21). Static semantics is based on constraints written in Java or on third-party frameworks that support declarative description of type systems such as Xtext Typesystem22 or XSemantics23. Xtext comes with Xbase, an expression language that can be used as the base language for custom DSLs. Xbase also comes with an interpreter and compiler framework that makes creating type systems, interpreters and compilers for DSLs that extend Xbase relatively simple.

19 http://eclipse.org/Xtext
20 http://eclipse.org/emf
21 http://eclipse.org/modeling
22 http://code.google.com/a/eclipselabs.org/p/xtext-typesystem/
23 http://xsemantics.sourceforge.net/

The Helvetia system [44] by Renggli et al. supports language extension of Smalltalk with an approach where the host language (Smalltalk) is also used for defining the extensions. The authors argue that the approach is independent of the host language and could be used with other host languages as well. While this is true in principle, the implementation strategy heavily relies on aspects of the Smalltalk system that are not present for other languages. Also, since extensions are defined in the host language, the complete implementation would have to be redone if the approach were to be used with another host language. This is particularly true for IDE support, where the Smalltalk IDE is extended using this IDE's APIs. The approach discussed in this paper does not have these limitations: MPS provides a language-agnostic framework for language and IDE extension that can be used with any language, once the language is implemented in MPS.

Cedalion [46] is a host language for defining internal DSLs. It uses a projectional editor and semantics based on logic programming. Both Cedalion and MPS aim at combining the best of both internal DSLs (combination and extension of languages, integration with a host language) and external DSLs (static validation, IDE support, flexible syntax). Cedalion starts out from internal DSLs and adds static validation and projectional editing, the latter avoiding ambiguities resulting from composed syntaxes. MPS starts from external DSLs and adds modularization, and, as a consequence of implementing base languages with the same tool, optional tight integration with general purpose host languages.

For a general overview of language workbenches, please refer to the Language Workbench Competition24. Participating tools have implemented a common example language and document the implementation. This serves as a good tutorial of each tool and makes the tools comparable. As of June 2012, the site contains 15 submissions.

6.5 Other Related Work

In this paper we classify language composition approaches based on syntactic mixing and language dependencies. Other classifications have been proposed, for example by Mernik et al. [35]. Their classification includes Extension (concepts are added to a language, similar to Extension as defined in this paper) and Restriction (concepts are removed from a language). The latter can actually be seen as a form of Extension: to restrict a language, we create an Extension that prohibits the use of some language concepts in certain contexts. We discuss this at the end of Section 4.2. Mernik et al. also propose Piggybacking and Pipelining. We do not discuss Pipelining in this paper, because it does not compose languages, it just chains their transformations. Piggybacking refers to a language reusing concepts from an existing language. This corresponds to Extension with embedding flavor. In [13], Erdweg et al. also propose a classification. Extension is the same as in our paper. They also consider Restriction as a form of Extension, where the extension restricts the use of certain language concepts. They call Unification what we call Embedding: two independent languages are used together in the same fragment. The two languages are combined without an invasive change to either of them. Each of the languages may have to be extended to "interface" with the other one. Erdweg and his colleagues also discuss what they call Extension Composition. This addresses the question of how several extensions can be used together in a program. Erdweg et al. distinguish two cases: incremental extension (where an extension l2 is built on top of another extension l1 that is based on some base language lb), as well as extension unification, where two languages l2 and l1 both extend a base language lb, and still can be used together in the same program. MPS supports both of these. In fact, for extension unification, there isn't even a need to define explicitly a unification of l1 and l2. The two extensions can be used in the same program "just so", as long as the semantics do not clash (see our discussion about transformation locality above with respect to [56]). We discuss these features of MPS in [53], which addresses the modular extension of C. Erdweg et al. also introduce the term Self-Extension, which describes the case where extensions are developed by means of the base language itself, an approach which is used by internal DSLs (see below) and is beyond the scope of this paper.

24 http://languageworkbenches.net

We already discussed the language modularization and composition approaches proposed by Mernik et al. [35] in Section 1.4. In the Helvetia paper [44], Renggli and his colleagues introduce three different flavors of language Extension. A pidgin creatively bends the existing syntax of the host language to extend its semantics. A creole introduces completely new syntax and custom transformations back to the host language. An argot reinterprets the semantics of valid host language code. In terms of this classification, both Extension and Embedding are creoles.

The notion of incremental extension of languages was first popularized in the context of Lisp, where definition of language extensions to solve problems in a given domain is a well-known approach. Guy Steele's Growing a Language keynote explains the idea well [49]. Sergey Dmitriev discusses the idea of language and IDE extension in his article on Language Oriented Programming [10], which uses MPS as the tool to achieve the goal.

Macro Systems support the definition of additional syntax for existing languages. Macro expansion maps the new syntax to valid base language code, and this mapping is expressed with special host language constructs instead of a separate transformation language. Macro systems differ with regard to the degree of freedom they provide for the extension syntax, and whether they support extensions of type systems and IDEs. The most primitive macro system is the C preprocessor, which performs pure text replacement during macro expansion. The Lisp macro system is more powerful because it is aware of the syntactic structure of Lisp code. An example of a macro system with limited syntactic freedom is the Java Syntactic Extender [3], where each macro has to begin with a unique name, and only a limited set of syntactic shapes is supported. In OpenJava [51], the locations where macros can be added are limited. More fine-grained Extensions, such as adding a new operator, are not possible. SugarJ, discussed above, can be seen as a sophisticated macro system that avoids these limitations.

A particular advantage of projectional editing is that it can combine several notational styles in one fragment; examples include text, tables and symbols (fraction bars, square roots or big sums). All of these notations are seamlessly integrated in one fragment and can be defined with the same formalism, as part of the same language (as mentioned earlier, MPS supports text, tables and symbols; graphics will be supported in 2013). Other approaches for integrating different notational styles exist. For example, Engelen et al. [12] discuss integrating textual and graphical notations based on grammar-based and Eclipse modeling technologies. However, such an approach requires dealing with separate tools for the graphical and the textual aspects, leading to a high degree of accidental complexity in the resulting implementation and mismatches in the resulting tool, as the author knows from personal experience.

Internal DSLs are languages whose programs reside within programs expressed in a general purpose host language. In contrast to the Embedding approach discussed in this paper, the DSL syntax and semantics are also defined with this same host language (as explained by Martin Fowler in his DSL book [17]). Suitable host languages are those that provide a flexible syntax, as well as meta programming facilities to support the definition of new abstractions with a custom concrete syntax. For example, Hofer et al. describe internal DSLs in Scala [23]. The landmark work of Hudak [24] introduces internal DSLs as language extensions of Haskell. While Haskell provides advanced concepts that enable creating such DSLs, they are essentially just libraries built with the host language and are not first-class language entities: they do not define their own syntax, compiler errors are expressed in terms of the host language, no custom semantic analyses are supported and no specific IDE support is provided. Essentially all internal DSLs expressed with dynamic languages such as Ruby or Groovy, but also those built with static languages such as Scala, suffer from these limitations. Since we consider IDE modularization and composition essential, we do not address internal DSLs in this paper.

7 Summary

MPS is a powerful environment for language engineering, in particular where modular language and IDE composition is concerned. We have seen in this paper how the challenges of composing the concrete syntax are solved by MPS and how it is also capable of addressing modularity and composition of type systems and generators. Code completion, syntax highlighting and error marks for composed languages are provided automatically in MPS. The major drawback of MPS is its non-trivial learning curve. Because it works so differently from traditional language engineering environments, and because it addresses so many aspects of languages (incl. type systems, data flow and refactorings), mastering the tool takes a significant investment in terms of time: experience shows that ca. 4 weeks are necessary. I hope that in the future this investment will be reduced by better documentation and better defaults, to keep simple things simple and complex things tractable. There are initial ideas on how this could be done.

References

1. Arnoldus, J., Bijpost, J., van den Brand, M.: Repleo: a syntax-safe template engine. In: Consel, C., Lawall, J.L. (eds.) 6th International Conference on Generative Programming and Component Engineering, GPCE 2007, pp. 25–32. ACM, Salzburg (2007)

2. Attali, I., Courbis, C., Degenne, P., Fau, A., Parigot, D., Pasquier, C.: SmartTools: A Generator of Interactive Environments Tools. In: Wilhelm, R. (ed.) CC 2001. LNCS, vol. 2027, pp. 355–360. Springer, Heidelberg (2001)

3. Bachrach, J., Playford, K.: The Java syntactic extender (JSE). In: OOPSLA 2001: Proceedings of the 16th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (2001)

4. Bagert, D.J., Friesen, D.K.: A multi-language syntax-directed editor. In: Davis, P., McClintock, V. (eds.) Proceedings of the 15th ACM Annual Conference on Computer Science, St. Louis, Missouri, USA, February 16-19, pp. 300–302. ACM (1987)

5. Bravenboer, M., Kalleberg, K.T., Vermaas, R., Visser, E.: Stratego/XT 0.17. A language and toolset for program transformation. Science of Computer Programming 72(1-2), 52–70 (2008)

6. Bravenboer, M., Vermaas, R., Vinju, J.J., Visser, E.: Generalized Type-Based Disambiguation of Meta Programs with Concrete Object Syntax. In: Gluck, R., Lowry, M. (eds.) GPCE 2005. LNCS, vol. 3676, pp. 157–172. Springer, Heidelberg (2005)

7. Bravenboer, M., Visser, E.: Concrete syntax for objects: domain-specific language embedding and assimilation without restrictions. In: Vlissides, J.M., Schmidt, D.C. (eds.) Proceedings of the 19th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2004, pp. 365–383. ACM, Vancouver (2004)

8. Bravenboer, M., Visser, E.: Designing Syntax Embeddings and Assimilations for Language Libraries. In: Giese, H. (ed.) MODELS 2008. LNCS, vol. 5002, pp. 34–46. Springer, Heidelberg (2008)

9. Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P., Stal, M.: Pattern-Oriented Software Architecture: A System of Patterns. Wiley (1996)

10. Dmitriev, S.: Language Oriented Programming: The Next Programming Paradigm (2004), http://www.onboard.jetbrains.com/is1/articles/04/10/lop/mps.pdf

11. Doh, K.-G., Mosses, P.D.: Composing programming languages by combining action-semantics modules. Science of Computer Programming 47(1), 3–36 (2003)

12. Engelen, L., van den Brand, M.: Integrating Textual and Graphical Modelling Languages. Electronic Notes in Theoretical Computer Science 253(7), 105–120 (2010)

13. Erdweg, S., Giarrusso, P.G., Rendel, T.: Language composition untangled. In: Proceedings of Workshop on Language Descriptions, Tools and Applications, LDTA (to appear, 2012)


14. Erdweg, S., Kats, L.C.L., Kastner, C., Ostermann, K., Visser, E.: Growing a Language Environment with Editor Libraries. In: Denney, E., Schultz, U.P. (eds.) Proceedings of the 10th ACM International Conference on Generative Programming and Component Engineering (GPCE 2011), pp. 167–176. ACM, New York (2011)

15. Erdweg, S., Rendel, T., Kastner, C., Ostermann, K.: SugarJ: library-based syntactic language extensibility. In: Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA 2011, pp. 391–406. ACM, New York (2011)

16. Fowler, M.: Language Workbenches: The Killer-App for Domain Specific Languages? (2005), http://www.martinfowler.com/articles/languageWorkbench.html

17. Fowler, M.: Domain-Specific Languages. Addison Wesley (2010)

18. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design patterns: elements of reusable object-oriented software. Addison-Wesley Professional (1995)

19. Harel, D., Rumpe, B.: Meaningful Modeling: What's the Semantics of "Semantics"? IEEE Computer 37(10), 64–72 (2004)

20. Hatton, L.: Safer language subsets: an overview and a case history, MISRA C. Information & Software Technology 46(7), 465–472 (2004)

21. Hedin, G., Magnusson, E.: JastAdd – an aspect-oriented compiler construction system. Science of Computer Programming 47(1), 37–58 (2003)

22. Heering, J., Hendriks, P.R.H., Klint, P., Rekers, J.: The syntax definition formalism SDF – reference manual. SIGPLAN 24(11), 43–75 (1989)

23. Hofer, C., Ostermann, K., Rendel, T., Moors, A.: Polymorphic embedding of DSLs. In: Smaragdakis, Y., Siek, J.G. (eds.) Proceedings of the 7th International Conference on Generative Programming and Component Engineering, GPCE 2008, Nashville, TN, USA, October 19-23, pp. 137–148. ACM (2008)

24. Hudak, P.: Modular Domain Specific Languages and Tools. In: Proceedings of the 5th International Conference on Software Reuse, ICSR 1998. IEEE Computer Society, Washington, DC (1998)

25. Kats, L.C.L., Visser, E.: The Spoofax language workbench: rules for declarative specification of languages and IDEs. In: Cook, W.R., Clarke, S., Rinard, M.C. (eds.) Proceedings of the 25th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2010, pp. 444–463. ACM, Reno/Tahoe (2010)

26. Kats, L.C.L., Visser, E., Wachsmuth, G.: Pure and declarative syntax definition: paradise lost and regained. In: Cook, W.R., Clarke, S., Rinard, M.C. (eds.) Proceedings of the 25th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2010, pp. 918–932. ACM, Reno/Tahoe (2010)

27. Klint, P.: A Meta-Environment for Generating Programming Environments. ACM Transactions on Software Engineering Methodology 2(2), 176–201 (1993)

28. Klint, P., van der Storm, T., Vinju, J.J.: RASCAL: A Domain Specific Language for Source Code Analysis and Manipulation. In: Ninth IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2009, Edmonton, Alberta, Canada, September 20-21, pp. 168–177. IEEE Computer Society (2009)

29. Lindeman, R.T., Kats, L.C.L., Visser, E.: Declaratively Defining Domain-Specific Language Debuggers. In: Denney, E., Schultz, U.P. (eds.) Proceedings of the 10th ACM International Conference on Generative Programming and Component Engineering (GPCE 2011), pp. 127–136. ACM, New York (2011)

30. Liskov, B., Wing, J.M.: A Behavioral Notion of Subtyping. ACM Transactions on Programming Languages and Systems 16(6), 1811–1841 (1994)

Page 439: Generative and Transformational Techniques in Software Engineering IV: International Summer School, GTTSE 2011, Braga, Portugal, July 3-9, 2011. Revised Papers

Language and IDE Modularization and Composition with MPS 429

31. Bravenboer, M., Dolstra, E., Visser, E.: Preventing injection attacks with syn-tax embeddings. In: Consel, C., Lawall, J.L. (eds.) 6th International Conferenceon Generative Programming and Component Engineering, GPCE 2007, pp. 3–12.ACM, Salzburg (2007)

32. Mali, Y., Van Wyk, E.: Building Extensible Specifications and Implementations ofPromela with AbleP. In: Groce, A., Musuvathi, M. (eds.) SPIN Workshops 2011.LNCS, vol. 6823, pp. 108–125. Springer, Heidelberg (2011)

33. Medina-Mora, R., Feiler, P.H.: An Incremental Programming Environment. IEEETrans. Software Eng. 7(5), 472–482 (1981)

34. Meijer, E., Beckman, B., Bierman, G.M.: LINQ: reconciling object, relations andXML in the .NET framework. In: Chaudhuri, S., Hristidis, V., Polyzotis, N. (eds.)Proceedings of the ACM SIGMOD International Conference on Management ofData, Chicago, Illinois, USA, June 27-29, p. 706. ACM (2006)

35. Mernik, M., Heering, J., Sloane, A.M.: When and how to develop domain-specificlanguages. ACM Computing Surveys 37(4), 316–344 (2005)

36. Mernik, M., Lenic, M., Avdicausevic, E., Zumer, V.: LISA: An Interactive Envi-ronment for Programming Language Development. In: Horspool, R.N. (ed.) CC2002. LNCS, vol. 2304, pp. 1–4. Springer, Heidelberg (2002)

37. Mernik, M., Zumer, V.: Incremental programming language development. Com-puter Languages, Systems & Structures 31(1), 1–16 (2005)

38. Mosses, P.D.: Modular structural operational semantics. Journal of Logic and Al-gebraic Programming 60-61, 195–228 (2004)

39. Notkin, D.: The GANDALF project. Journal of Systems and Software 5(2), 91–105(1985)

40. Nystrom, N., Clarkson, M.R., Myers, A.C.: Polyglot: An Extensible CompilerFramework for Java. In: Hedin, G. (ed.) CC 2003. LNCS, vol. 2622, pp. 138–152.Springer, Heidelberg (2003)

41. Parr, T.J., Quong, R.W.: ANTLR: A Predicated-LL(k) Parser Generator. Software:Practice and Experience 25(7), 789–810 (1995)

42. Porter, S.W.: Design of a syntax directed editor for psdl. Master’s thesis, NavalPostgraduate School, Monterey, CA, USA (1988)

43. Rebernak, D., Mernik, M., Wu, H., Gray, J.G.: Domain-specific aspect languages formodularising crosscutting concerns in grammars. IEE Proceedings - Software 3(3),184–200 (2009)

44. Renggli, L., Gırba, T., Nierstrasz, O.: Embedding Languages without BreakingTools. In: D’Hondt, T. (ed.) ECOOP 2010. LNCS, vol. 6183, pp. 380–404. Springer,Heidelberg (2010)

45. Reps, T.W., Teitelbaum, T.: The Synthesizer Generator. In: Proceedings of theFirst ACM SIGSOFT/SIGPLAN Software Engineering Symposium on PracticalSoftware Development Environments, pp. 42–48. ACM, New York (1984)

46. Rosenan, B.: Designing language-oriented programming languages. In: SPLASH2010: Proceedings of the ACM International Conference Companion on ObjectOriented Programming Systems Languages and Applications Companion. ACM,New York (2010)

47. Schwerdfeger, A., Van Wyk, E.: Verifiable composition of deterministic grammars.In: Hind, M., Diwan, A. (eds.) Proceedings of the 2009 ACM SIGPLAN Confer-ence on Programming Language Design and Implementation, PLDI 2009, Dublin,Ireland, June 15-21, pp. 199–210. ACM (2009)

Page 440: Generative and Transformational Techniques in Software Engineering IV: International Summer School, GTTSE 2011, Braga, Portugal, July 3-9, 2011. Revised Papers

430 M. Voelter

48. Simonyi, C., Christerson, M., Clifford, S.: Intentional software. In: Tarr, P.L., Cook,W.R. (eds.) Proceedings of the 21th Annual ACM SIGPLANConference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2006,Portland, Oregon, USA, October 22-26, pp. 451–464. ACM (2006)

49. Steele, G.L.: Growing a Language. Higher-Order and Symbolic Computation 12(3),221–236 (1999)

50. Szyperski, C.A.: Component software - beyond object-oriented programming.Addison-Wesley-Longman (1998)

51. Tatsubori, M., Chiba, S., Killijian, M.-O., Itano, K.: OpenJava: A Class-BasedMacro System for Java. In: Cazzola, W., Houmb, S.H., Tisato, F. (eds.) Reflectionand Software Engineering. LNCS, vol. 1826, pp. 117–133. Springer, Heidelberg(2000)

52. Voelter, M.: Implementing Feature Variability for Models and Code with Projec-tional Language Workbenches. In: Proceedings of the Second International Work-shop on Feature-Oriented Software Development (2010)

53. Voelter, M., Ratiu, D., Schaetz, B., Kolb, B.: mbeddr: an Extensible C-based Pro-gramming Language and IDE for Embedded Systems. In: Systems, Programming,Languages and Applications: Software for Humanity, SPLASH/Wavefront (2012)

54. Voelter, M., Visser, E.: Product Line Engineering using Domain-Specific Lan-guages. In: de Almeida, E.S., Kishi, T. (eds.) 15th International Software ProductLine Conference (SPLC), pp. 70–79. CPS (2011)

55. Wu, H., Gray, J., Mernik, M.: Grammar-driven generation of domain-specific lan-guage debuggers. SPE 38(10), 1073–1103 (2008)

56. Van Wyk, E., Bodin, D., Gao, J., Krishnan, L.: Silver: an Extensible AttributeGrammar System. Electronic Notes in Theoretical Computer Science 203(2), 103–116 (2008)

57. Van Wyk, E., de Moor, O., Backhouse, K., Kwiatkowski, P.: Forwarding in At-tribute Grammars for Modular Language Design. In: CC 2002. LNCS, vol. 2304,pp. 128–142. Springer, Heidelberg (2002)

58. Van Wyk, E., Krishnan, L., Bodin, D., Schwerdfeger, A.: Attribute Grammar-Based Language Extensions for Java. In: Bateni, M. (ed.) ECOOP 2007. LNCS,vol. 4609, pp. 575–599. Springer, Heidelberg (2007)

Page 441: Generative and Transformational Techniques in Software Engineering IV: International Summer School, GTTSE 2011, Braga, Portugal, July 3-9, 2011. Revised Papers

Tengi Interfaces for Tracing between Heterogeneous Components

Rolf-Helge Pfeiffer and Andrzej Wąsowski

IT University of Copenhagen, Software Development Group
{ropf,wasowski}@itu.dk

Abstract. Contemporary software systems comprise many heterogeneous artifacts: some expressed in general programming languages, some in visual and textual domain-specific languages, and some in ad hoc textual formats. During construction of a system these diverse artifacts become interrelated. Only a few formats, typically general programming languages, provide an interface description mechanism able to specify software component boundaries. Unfortunately, these interface mechanisms cannot express relations for components containing heterogeneous artifacts. We introduce Tengi, a tool that allows for the definition of software components containing heterogeneous artifacts. Tengi interfaces link components containing different textual and visual software development artifacts, ranging from high-level specification documents to low-level implementation documents. We formally define and implement Tengi interfaces, a component algebra, and operations on them, and present a case study demonstrating Tengi's capabilities.

1 Introduction

Contemporary software systems are constructed out of a multitude of heterogeneous artifacts. A software system (...) consists of a number of separate programs, configuration files, (...) system documentation [21]. These artifacts contain information at different abstraction levels, in various languages, and may be tied to different development phases. Still, they form a single whole, and thus each of them provides a different view on parts or aspects of the system. The development artifacts are related either by directly referencing each other or by referring to the same aspect of a system. Some of these relations may be explicit: source code in a general-purpose language usually contains explicit references to other software components or to methods called. Other relations may be implicit. For example, visual models and the code generated from them are both descriptions of the same system aspect at different abstraction levels, but the detailed relation is hidden in the code generator. Some artifact relations can even remain completely undocumented, stored only in human memory; for instance, requirements documents are sometimes directly translated to source code without recording any traces from them. Explicit or not, software developers continuously have to reason about and navigate across such relations, and this creates difficulties. For example, [15] points out that maintaining consistency between the kernel variability model and the source code is a major challenge in the Linux kernel project.

This difficulty calls for investigating language-oblivious tools that allow specifying components comprising heterogeneous artifacts, including definition of links across languages and formats, and allowing monitoring of, and navigation along, such links. The challenge in designing such tools lies in the tension between the generic and the specific. Heterogeneous components, and even more so relations between them, are often domain specific and thus intrinsically hard to support with generic tools. In this paper we take up the challenge of constructing such a generic tool, which is able to capture domain-specific component relations. To do so, we address two questions: how to specify component boundaries for heterogeneous components, and how to technically link the components to these specifications?

Component boundaries can be specified by interfaces, which are abstract descriptions of the way in which a component relates to its context. We consider anything from files to folder structures as components. Artifacts in software development are files, or multiple files that are used together.

We present Tengi¹, a toolkit for defining, reusing, and relating software components by means of specifying interfaces for artifacts. Artifacts can be expressed in various languages at different levels of abstraction, ranging from high-level specification documents to low-level implementation documents, and from development artifacts expressed in textual as well as in visual languages. Tengi, implemented as an Eclipse plug-in, extends numerous Eclipse editors with the ability to define ports on the edited artifacts. Further, it provides a language for specifying dependencies between these ports as interface specifications resembling contracts. Operators are provided for automatic checking of component compatibility and refinement and for composition of components.

Let us illustrate the problem of interrelated heterogeneous artifacts and Tengi's use with a small example. Figure 2 shows a requirements document for an aspect of a simple application, implemented using Java classes (not shown yet). How do we record the knowledge that this specification fragment is implemented exactly by the three classes? Tengi provides a traceability mechanism based on a simple component algebra. Instead of explicitly declaring traces between the requirements document and the Java classes, with Tengi a user can define ports in any documents (also free-text documents like the one in Fig. 2). These ports are available in Tengi interfaces. Links or traces are realized by the algebra operations on such interfaces. A Tengi interface for the requirements document in Fig. 2 would provide a port for a certain requirement, and Java classes implementing this requirement would require this port in a Tengi interface.

We use Eclipse to implement Tengi, as Eclipse is a prime representative of modern Integrated Development Environments (IDEs). However, neither the problem nor the principal solution discussed in this paper is Eclipse specific.

¹ Tengi, Icelandic for interface, was chosen to avoid conflicts with all other kinds of interfaces appearing frequently in computer science.


Fig. 1. Examples of software system artifacts: a requirements document fragment on top, a fragment of an analysis document in formal BON (bottom left), and a UML state machine (bottom right). Concepts referring to each other are illustrated with red lines.

We proceed by motivating our work in Sect. 2 with an example of a heterogeneous software system. This system is also used for the case study in Sect. 5, which illustrates how to apply Tengi to the development of a heterogeneous software system. Section 3 introduces Tengi's component algebra, followed by a detailed account of Tengi's internals in Sect. 4. We finish with a discussion of Tengi, related work, and conclusions in Sections 6–8.

2 Running Example

We use a small system as our running example, also for the case study in Sect. 5. The system is a clone of a Commodore 64 video game, Bouncy Cars (http://www.gb64.com/game.php?id=1049). It was developed as an exercise in a graduate course on modeling at the IT University of Copenhagen [1]. The task was to specify and implement an automatically verifiable, small-sized, object-oriented (OO) application. The system is specified using the BON method [22]. BON supports informal and formal, textual and visual specification of structural and behavioral properties of OO systems. Visual BON is similar to UML diagrams, including constraints not unlike OCL constraints.

Our version of Bouncy Cars is an example of a heterogeneous software system. It comprises artifacts in several languages, at different levels of abstraction:

– A requirements document. A regular text file containing the exercise task in natural language.
– A high-level analysis document. This is an informal system specification in informal textual BON.
– More concrete design documents. There are design documents in formal textual and visual BON giving the system design in formal textual BON. The formal BON is refined from the former informal BON specification. Furthermore, a UML state machine specifies the system's behavior. The UML diagram was not strictly necessary, but we have used it to replace the standard BON event chart in order to expand the number of involved languages.
– Implementation artifacts. Multiple JML-annotated Java classes [13] implement the system specification.

Fig. 2. The requirements document assignment.txt with two marked ports in the document and the Tengja dictionary (below)

Fig. 3. Interface for the document shown in Figure 2:

TENGI assignment ENTITY "assignment.txt" [
  IN: {}; CONSTRAINT: true;
  OUT: { informal analysis, formal design };
  CONSTRAINT: informal analysis & formal design;
] {
  LOCATOR informal analysis IN "assignment.txt" OFFSET 6692 LENGTH 179;
  LOCATOR formal design IN "assignment.txt" OFFSET 7112 LENGTH 106;
}

The Bouncy Cars example contains artifacts in natural language and in six software languages. The requirements and high-level analysis documents are more abstract than the design documents and implementation artifacts. Figure 1 shows three artifacts: a fragment of the requirements document in natural language in the top part of the figure; a fragment of an analysis document in formal textual BON in the bottom left part; and a UML state machine specifying behavior in the bottom right part. The three artifacts describe different views on the system, at different abstraction levels. All three artifacts are implicitly interrelated. They refer to shared concepts from different viewpoints. For example, the requirements document, the design documents in formal BON, and the UML state machine all refer to a concept "game". Furthermore, both the formal BON and the UML state machine artifacts refer to the concept "level". Figure 1 illustrates these relations by red arrows between the shared concepts.

The main challenge in development of heterogeneous systems is caused by the implicit nature of relations across artifacts and languages. They exist in human minds, the minds of the developers, but they are not explicitly available for computers to reason about. Imagine that a new Bouncy Cars developer deletes the GAME class in the formal BON specification in Fig. 1. The system is now incomplete, and other colleagues who require this class for their work will face errors. For instance, code generators consuming the BON specification will produce incorrect results. These errors could be avoided if suitable warning messages about the impact of changes were produced early on. This, however, requires making cross-language relations explicit and using tools to reason about them.


Fig. 4. Excerpt of the meta-model of the Tengi interface DSL

In this paper we set out to address this issue by investigating and implementing interfaces which allow for linking or tracing information across components containing heterogeneous artifacts.

3 Tengi Concepts

This section introduces the notions used in Tengi and Tengi's component algebra. Two artifacts are heterogeneous if they are instances of different meta-models or if there exists no meta-model to describe them (the terms meta-model and language grammar are used synonymously, since they can be mapped to each other in the considered domain [3]). For example, a program artifact in Java and one in C# are heterogeneous, but a UML class diagram and an arbitrary visual domain-specific language (DSL) are also heterogeneous. In particular, there exist development artifacts that are heterogeneous to others due to a lack of a meta-model, e.g., simple text files.

3.1 Tengi Interfaces

We consider anything from files to folder structures as components. We specify component boundaries by Tengi interfaces. Interfaces are abstract descriptions of the way in which a component relates to its context. In Tengi interfaces this relation is expressed using ports, which could be anything from communication channels to cross-file references. Tengi interface ports are just abstract names that can be related to each other and to the artifacts. Tengi interfaces consider static, development-time properties of components only.

Tengi provides an interface description DSL for heterogeneous artifacts corresponding to the meta-model in Fig. 4. In the following we illustrate an example of such an interface and provide a formal definition.

Figure 3 shows an example of a Tengi interface for the required tasks of Fig. 2. This interface simply specifies two ports in assignment.txt, which correspond to the requirement of an informal analysis and a formal design. Both of them are output ports, meaning that they are provided to the component's context. Furthermore, any concretization of this interface will have to provide informal analysis and formal design, pointing to the locations in its components where these ports are realized. We chose to avoid constructing more complicated interfaces for the sake of simplicity of the example. Ports, classified into inputs and outputs, provide an alias for a corresponding location. They characterize what information is provided by a component (output ports) or what information is required from the environment (input ports). Ports in the meta-model are represented by the class PortSpec, and the division into input and output ports is manifested by the containment relations inputPorts and outputPorts, see Fig. 4. Semantically, ports are Boolean variables. Assignment of true to a port means that it is 'present', otherwise it is 'absent'. Constraints, implemented by PortSpec in Fig. 4, are propositional statements that raise the expressiveness of an interface. The default constraint is true, which for outputs means that nothing is guaranteed to be provided, and for inputs that nothing is required by the component. Both input and output ports can be constrained, see the containment relations in constraint and out constraint in Fig. 4. A locator links a port to a physical location in the file system. A physical location is specified by a path to a file, an offset, and the length of the marked information, see class Locator in Fig. 4.

Tengi relies on physical locations for the following reasons: (i) Since we provide interfaces for heterogeneous artifacts, we want the locators to be as general as possible. Physical locations are advantageous due to their meta-model independence. That is, new languages can be used with Tengi without modifying it. (ii) It is important that Tengi indicates the locators visually, raising the developer's awareness of important dependencies. This is naturally done with physical locators. (iii) Furthermore, Tengi allows for the evolution of artifacts referred to by locators. For example, a locator can be moved if the file containing it has been edited. This is currently automatically supported for artifacts using text editors. We intend to investigate technologies that would support other evolution scenarios. Since locators relate to physical locations in files, Tengi interfaces can be considered lexical interfaces.

Definition 1. T = (I, O, ϕ, ψ) is an interface iff I is a set of input ports, O is a set of output ports, and I ∩ O = ∅; ϕ is a propositional constraint over I (required), which constrains the valid input port combinations; and ψ is a propositional constraint over O (provided), which constrains the valid output port combinations. Denote the set of all ports as P = I ∪ O.
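For illustration, Definition 1 can be rendered as a small Java sketch in which ports are plain strings and the constraints ϕ and ψ are Boolean predicates over port assignments. This is only an expository model, not code from the Tengi implementation; the names Formula, Locator, and TengiInterface are chosen here for clarity.

import java.util.*;

// A propositional port constraint, evaluated over a truth assignment of port names.
interface Formula {
    boolean eval(Map<String, Boolean> assignment);
}

// A physical location of a port in an artifact: file path, offset, and length (cf. Locator in Fig. 4).
final class Locator {
    final String file; final int offset; final int length;
    Locator(String file, int offset, int length) {
        this.file = file; this.offset = offset; this.length = length;
    }
}

// An interface T = (I, O, phi, psi) as in Definition 1, plus the locators that realize its ports.
final class TengiInterface {
    final Set<String> inputs;            // I
    final Set<String> outputs;           // O
    final Formula required;              // phi, a constraint over I
    final Formula provided;              // psi, a constraint over O
    final Map<String, Locator> locators; // port name -> physical location

    TengiInterface(Set<String> in, Set<String> out, Formula phi, Formula psi, Map<String, Locator> locs) {
        if (!Collections.disjoint(in, out))
            throw new IllegalArgumentException("I and O must be disjoint");
        inputs = in; outputs = out; required = phi; provided = psi; locators = locs;
    }

    Set<String> ports() {                // P = I ∪ O
        Set<String> p = new HashSet<>(inputs);
        p.addAll(outputs);
        return p;
    }
}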

3.2 Operations on Tengi Interfaces

Composition. We say that interfaces T1 = (I1, O1, ϕ1, ψ1) and T2 = (I2, O2, ϕ2, ψ2) are composable iff I1 ∩ I2 = O1 ∩ O2 = ∅. Composable interfaces (and thus their components) can be composed. The interface of the composition is defined as an interface T = T1 ⊕ T2 = (I, O, ϕ, ψ), where I = (I1 ∪ I2) \ (O1 ∪ O2) and O = O1 ∪ O2. The intuition is that all ports provided (outputs) by T1 and T2 remain provided by the composition T, but the required inputs that are provided within the composition itself are no longer required; hence the set difference in computing the input set. The constraints over input and output ports are given by (i) ϕ = ∃(I1 ∪ I2) ∩ O. ϕ1 ∧ ϕ2 and (ii) ψ = ∀I. (ϕ1 ∧ ϕ2) → (ψ1 ∧ ψ2), where the existential elimination of a variable x ∈ X from a formula ϕ over variables X is the formula ∃x. ϕ = ϕ[0/x] ∨ ϕ[1/x], which extends to ∃A. ϕ = ∃x1. · · · ∃xn. ϕ for a set of variables A = {x1, . . . , xn} ⊆ X. Dually, the universal elimination of x from ψ is the formula ∀x. ψ = ψ[0/x] ∧ ψ[1/x], generalizing to ∀A. ψ = ∀x1. · · · ∀xn. ψ for the same set of variables A. Intuitively, point (i) means that inputs required by the components are still required by the composition, except for the part of the constraint that has been satisfied. Point (ii) states that the component may provide any combination of outputs such that, regardless of which inputs are given (provided they satisfy the required constraint), this combination can still be delivered. Two interfaces are compatible if their output constraint ψ is satisfiable. This corresponds to the requirement that a precondition of a procedure be consistent. We only require satisfiability (and not validity) in order to achieve an optimistic notion of composition [4], in which a component is useful as long as there exists a context with which it is consistent. When composing two interfaces, their locator lists are simply concatenated. Tengi implements composition using an Xpand [2] template, i.e., by composing the syntactical representations of interfaces.
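A sketch of the composition operator on the TengiInterface type from the previous listing is given below. It follows the textual definitions literally, using brute-force quantifier elimination (∃x.ϕ = ϕ[0/x] ∨ ϕ[1/x], ∀x.ψ = ψ[0/x] ∧ ψ[1/x]); the actual tool composes the syntactic interface representations with Xpand and reasons about constraints with BDDs, and the class name TengiAlgebra is ours.

import java.util.*;

final class TengiAlgebra {
    // Existential elimination: (∃x. f) = f[0/x] ∨ f[1/x].
    static Formula exists(String x, Formula f) {
        return a -> f.eval(with(a, x, false)) || f.eval(with(a, x, true));
    }
    // Universal elimination: (∀x. f) = f[0/x] ∧ f[1/x].
    static Formula forall(String x, Formula f) {
        return a -> f.eval(with(a, x, false)) && f.eval(with(a, x, true));
    }
    private static Map<String, Boolean> with(Map<String, Boolean> a, String x, boolean v) {
        Map<String, Boolean> b = new HashMap<>(a); b.put(x, v); return b;
    }

    // T = T1 ⊕ T2 for composable interfaces, i.e. I1 ∩ I2 = O1 ∩ O2 = ∅.
    static TengiInterface compose(TengiInterface t1, TengiInterface t2) {
        if (!Collections.disjoint(t1.inputs, t2.inputs) || !Collections.disjoint(t1.outputs, t2.outputs))
            throw new IllegalArgumentException("interfaces are not composable");

        Set<String> o = new HashSet<>(t1.outputs); o.addAll(t2.outputs);               // O = O1 ∪ O2
        Set<String> i = new HashSet<>(t1.inputs); i.addAll(t2.inputs); i.removeAll(o); // I = (I1 ∪ I2) \ (O1 ∪ O2)

        Formula both = a -> t1.required.eval(a) && t2.required.eval(a);                // phi1 ∧ phi2
        Set<String> internal = new HashSet<>(t1.inputs); internal.addAll(t2.inputs); internal.retainAll(o);
        Formula phi = both;
        for (String x : internal) phi = exists(x, phi);                                // phi = ∃(I1∪I2)∩O. phi1 ∧ phi2

        Formula obligation = a -> !(t1.required.eval(a) && t2.required.eval(a))
                                  || (t1.provided.eval(a) && t2.provided.eval(a));     // (phi1∧phi2) → (psi1∧psi2)
        Formula psi = obligation;
        for (String x : i) psi = forall(x, psi);                                       // psi = ∀I. (phi1∧phi2) → (psi1∧psi2)

        Map<String, Locator> locs = new HashMap<>(t1.locators);                        // locator lists are concatenated
        locs.putAll(t2.locators);
        return new TengiInterface(i, o, phi, psi, locs);
    }
}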

Subtyping (or refinement) is a binary relation that allows comparing interfaces, in a fashion similar to an object-oriented generalization hierarchy. We say that T1 is a subtype of T2 iff (i) I1 = I2 and O1 = O2, and (ii) ϕ1 → ϕ2 and ψ2 → ψ1. Presently, checks of propositional statements in Tengi are implemented using binary decision diagrams (BDDs) [5]. The subtyping definition is somewhat rigid in that it requires that both interfaces completely agree on their input and output alphabets. This is not a limitation: if we want to place a subtype interface in a context of the supertype, we basically need to add extra constraints setting the unused inputs and outputs of the context to false.
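The subtype check can be spelled out in the same style. Tengi discharges the two implications with BDDs; the sketch below instead enumerates all assignments over the shared port alphabet, which is only feasible for small examples, and the method name isSubtypeOf is ours.

import java.util.*;

final class TengiSubtyping {
    // t1 is a subtype of t2 iff I1 = I2, O1 = O2, phi1 → phi2 is valid, and psi2 → psi1 is valid.
    static boolean isSubtypeOf(TengiInterface t1, TengiInterface t2) {
        if (!t1.inputs.equals(t2.inputs) || !t1.outputs.equals(t2.outputs)) return false;
        List<String> ports = new ArrayList<>(t1.ports());
        return valid(ports, a -> !t1.required.eval(a) || t2.required.eval(a))   // phi1 → phi2
            && valid(ports, a -> !t2.provided.eval(a) || t1.provided.eval(a));  // psi2 → psi1
    }

    // Validity checked by enumerating all 2^n assignments of the given ports (brute force, expository only).
    static boolean valid(List<String> ports, Formula f) {
        int n = ports.size();
        for (long bits = 0; bits < (1L << n); bits++) {
            Map<String, Boolean> a = new HashMap<>();
            for (int i = 0; i < n; i++) a.put(ports.get(i), ((bits >> i) & 1) == 1);
            if (!f.eval(a)) return false;
        }
        return true;
    }
}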

Conformance checking is a check of an interface against one or more development artifacts. More precisely, all development artifacts are checked as to whether they provide the information specified in the corresponding Tengi interfaces. An interface conforms to the corresponding artifacts iff for each of its locators there exists a marker on the appropriate file at the appropriate physical location.
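In Tengi, conformance is checked against the markers that the tool places in the Eclipse workspace. As a rough, Eclipse-free approximation, one can at least verify that every locator still points into an existing file and that the recorded region fits inside it; the following sketch does only that and uses the file size in bytes as a stand-in for the text length.

import java.io.IOException;
import java.nio.file.*;

final class TengiConformance {
    // An interface conforms only if every locator's region [offset, offset + length) lies within an existing file.
    // Note: a byte count approximates the text length here; the real check is against editor markers.
    static boolean conforms(TengiInterface t) throws IOException {
        for (Locator loc : t.locators.values()) {
            Path p = Paths.get(loc.file);
            if (!Files.exists(p)) return false;
            long size = Files.size(p);
            if (loc.offset < 0 || loc.length < 0 || loc.offset + loc.length > size) return false;
        }
        return true;
    }
}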

This component algebra, albeit simple, exhibits all the crucial properties expected of such an algebra: (i) the composition operator is associative and commutative; (ii) the composition operator is optimistic [4]; (iii) the refinement relation is a preorder (reflexive and transitive); (iv) composition satisfies independent implementability [4], i.e., it is possible to replace an interface by any of its refinements in any context without breaking compatibility (as long as the refinement does not introduce clashing names, a technicality caused by the fact that all port names are global).


Fig. 5. A visual model element, the corresponding data model element, their textual representations, and their relations highlighted

4 Tengi Tool Details

Tengi works with all textual artifacts in Eclipse and, most importantly, with EMF- and GMF-based models. EMF and GMF are the modeling components of the Eclipse DSL toolkit supporting visual modeling. Generally, Eclipse provides three different kinds of model editors: diagram editors (as part of GMF), structured tree editors (used by EMF), and text editors. Diagram editors and tree editors allow interacting with visual syntax. Text editors allow for editing models in serialization syntax or other textual representations. The XML Metadata Interchange (XMI) format is used to persist models in Eclipse. To separate a model's visual information from the actual data model, Eclipse spreads their persistent representations over two files. These are integrated together by modeling editors following the MVC pattern [18]. When an editor is opened, the visual information model is loaded first, then the data model is loaded, and both are interpreted before the model is presented to the user. This technicality is the reason why a visual diagram, such as the one shown in Fig. 5 (top left), which in a physical world would appear on paper, is stored in two files, see Fig. 5 (bottom).

Figure 5 illustrates this technicality. A visual BON model displayed by a GMF editor is shown in the top left. Right next to it we show its data model presented in an EMF structured tree editor. Below each of these views, you can find its corresponding serialization syntax in a text editor.

Now assume that we want to define a Tengi interface for the BON model in the top left of Fig. 5. What is the information that needs to be specified in the locator to define a port for the BON class Level (highlighted by a red rectangle)? The corresponding elements in serialization syntax are highlighted by red rectangles. In general, it is not trivial to provide physical file locators for elements of visual languages.


Fig. 6. Interfaces for all the artifacts in the case study project

Tengi supports computing physical locators for visual model elements automatically, using its traceability component Tengja² [17]. With Tengja it requires just a button click to move from a marked element to the persistent models opened in text editors, with the highlighted text corresponding to the original model element. This functionality is instantly available for all DSLs defined with Ecore, and for all GMF- and EMF-generated DSL editors. Tengja establishes the connections, i.e., the traceability links between model elements in visual syntax and their corresponding serialization syntax, and highlights these elements.

But how does Tengja bridge the gap between the visual layout representation, its visual concrete syntax, and the persistent textual representation, the serialization syntax?

Technically, Tengja is an extension to Eclipse which recovers the links between the abstract and concrete syntax and the serialization syntax of models by observing the persistence mechanism. Since Eclipse's standard persistence mechanism obscures traces, and since we aim at a reusable and non-invasive tracing toolkit, we settle on observing the standard persistence mechanism with an aspect, recording the context elements and linking them to the generated syntax. The aspect observes the top-most traversing method and its callees in org.eclipse.emf.ecore.xmi.impl. It observes the sequence of model elements that get treated in the control flow of these methods, and keeps track of start and stop positions in the generated stream of text in serialization syntax for each model element. Subsequently, it maps model elements to indices in the generated serialization stream. Thereby, we can trace each model element to its textual representation and establish an explicit mapping between them. The mapping is then exposed to the development environment via the Tengja dictionary and can be used in Tengi interfaces.
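The offset-recording idea behind the aspect can be illustrated independently of EMF with a hypothetical observer that is notified before and after each element is written to the serialization stream and records the resulting spans in a dictionary. The API below (SerializationObserver, elementStart, elementEnd) is invented for this illustration and is not the interface of the actual Tengja aspect.

import java.util.*;

// Hypothetical illustration of Tengja's bookkeeping: map each serialized element to the
// start/stop positions it occupies in the generated text (comparable to OFFSET/LENGTH in a LOCATOR).
final class SerializationObserver {
    static final class Span {
        final int start, stop;
        Span(int start, int stop) { this.start = start; this.stop = stop; }
        int length() { return stop - start; }
    }

    private final Deque<Integer> starts = new ArrayDeque<>();
    private final Map<Object, Span> dictionary = new LinkedHashMap<>();

    // Called just before an element is written; positionInStream is the current length of the output.
    void elementStart(Object element, int positionInStream) {
        starts.push(positionInStream);
    }

    // Called just after the element (including its children) has been written.
    void elementEnd(Object element, int positionInStream) {
        dictionary.put(element, new Span(starts.pop(), positionInStream));
    }

    // The resulting dictionary: model element -> location in the serialized text.
    Map<Object, Span> dictionary() { return Collections.unmodifiableMap(dictionary); }
}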

Tengja allows marking arbitrary model elements in Ecore-based visual models and navigating from the respective element to all other related model elements and textual representations in abstract syntax, visual concrete syntax, and serialization syntax, and furthermore, persisting those connections or traceability links in a global locator dictionary. To define locators in Tengi interfaces, this dictionary can be used to drag single entries into the interface definition.

² Tengja, Icelandic for connect, was chosen to avoid conflicts with "connects", "connections", and "connectors" appearing frequently in MDE literature.


Fig. 7. Excerpt of the analysis document in informal BON (bouncycars informal.bon) with a marked port on top and, below, the corresponding Tengi interface (bouncycars informal.tengi)

Fig. 8. Excerpt of the design document in formal BON (bouncycars formal.bon) with a marked port on top and, below, the corresponding Tengi interface (bouncycars formal.tengi)

5 Case Study

This section demonstrates how Tengi is used in a project containing multiple heterogeneous artifacts, how the Tengi interfaces are defined, and what the results of applying operators to them are. We use the Bouncy Cars project introduced in Sect. 2. Notably, we successfully apply Tengi to textual and visual languages and editors developed independently of this work by other authors.

Figure 6 presents an overview of the entire project using the composition structure of Tengi interfaces. Rectangles represent Tengi interfaces. The interfaces in the bottom row correspond directly to the individual artifacts of the kinds listed above. We construct the abstract interface specification for the entire Bouncy Cars project using stepwise bottom-up composition with the ⊕ composition operator introduced in Sect. 3. The Tengi interfaces for the basic components (files) are presented as follows: Figure 3 shows the interface for the requirements document assignment.txt, itself presented in Fig. 2; interfaces for the informal and formal textual BON specifications are found in the bottom of Fig. 7 and in Fig. 8, respectively; Fig. 13 shows interfaces for the visual BON model and its corresponding data model, see Sect. 4; the interface for the UML state machine is in Fig. 10; and Fig. 11 shows interfaces for the Java classes Car.java, Level.java, and Game.java. All file paths in interfaces in this paper are abbreviated to avoid clutter. Complete model files are available at www.itu.dk/people/ropf/src/tengi.


Fig. 9. UML state machine model bouncyCars.umlstm with two elements marked as ports, which appear in the Tengja dictionary

Fig. 10. Composition of interfaces for the UML state machine and its corresponding data model (Fig. 9):

TENGI visual uml ENTITY "(bouncyCars uml.tengi+bouncyCars umlstm.tengi)" [
  IN: { }; CONSTRAINT: true;
  OUT: { level uml data, game uml data, game uml vis, level uml vis };
  CONSTRAINT: level uml data & game uml data & level uml vis & game uml vis;
] {
  LOCATOR level uml data IN "bouncyCars.uml" OFFSET 1313 LENGTH 79;
  LOCATOR game uml data IN "bouncyCars.uml" OFFSET 975 LENGTH 1469;
  LOCATOR game uml vis IN "bouncyCars.umlstm" OFFSET 335 LENGTH 5333;
  LOCATOR level uml vis IN "bouncyCars.umlstm" OFFSET 2385 LENGTH 941;
}

All basic components listed above provide views on the same domain, the Bouncy Cars game, from the point of view of different abstraction levels. That is, they all contain pieces of information that are related to each other. For example, all of the basic components care about a "game" that contains multiple "levels", and some of them tell something about a "car". Similarly, the state Level in bouncyCars.umlstm and the class Level in bouncyCars.bonide diagram refer to each other, but there is no explicit link that allows for automatic reasoning over such relations. Tengi interfaces establish such a link.

Let us examine a bit closer the interfaces of the files bouncycars formal.bon and bouncyCars.umlstm (Fig. 8–9). The interfaces are shown in Figures 8 and 10, respectively. The first one states that the component bouncycars formal.bon provides, among others, a port level form bon that refers via its locator to the specification of a class Level. The Tengi interface for the UML state machine (Fig. 10) requires, amongst others, the formal specification of Level in BON (level form bon), to provide the state Level via two new ports level uml data and level uml vis. These are then used to trace the refinement further to the Java implementation in other interfaces.

The Tengi interface textual bon.tengi is a simple example of refinement (subtyping) of assignment txt.tengi. Both interfaces provide the ports informal analysis and formal design; the former, since it corresponds to the high-level requirements document, is more abstract, the latter more concrete. This means that textual bon.tengi provides both the informal analysis and the formal design and explicitly indicates, by means of locators, where these are placed in the model.

The composition of all interfaces in the case study results in the synthesized interface presented in Fig. 14. The overall interface shows no inputs and thus no constraints on inputs. This is expected, as the entire system is supposed to be complete and should not require anything. We also remark that the output constraint warrants satisfaction of informal analysis and formal design, which can be traced all the way back to the initial requirement.

This case study demonstrates that Tengi allows defining interfaces, and thereby components, for heterogeneous development artifacts (here free-text files, GMF and EMF models, and Java source code), and further to process such interfaces using appropriate interface operations.
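To connect the case study back to the algebra sketches from Sect. 3, the following usage example builds two strongly simplified interfaces (fictitious port names with underscores and made-up locators for the textual BON files), checks the refinement between them, and composes one of them with a hypothetical consumer of its ports. It only exercises the expository Java listings given earlier, not the Tengi tool itself.

import java.util.*;

final class CaseStudySketch {
    public static void main(String[] args) {
        Set<String> none = Set.of();
        Set<String> reqPorts = Set.of("informal_analysis", "formal_design");
        Formula top = a -> true;
        Formula bothReqs = a -> a.getOrDefault("informal_analysis", false)
                             && a.getOrDefault("formal_design", false);

        // Simplified stand-in for assignment_txt.tengi (cf. Fig. 3): no inputs, provides both requirement ports.
        TengiInterface assignment = new TengiInterface(none, reqPorts, top, bothReqs,
                Map.of("informal_analysis", new Locator("assignment.txt", 6692, 179),
                       "formal_design",     new Locator("assignment.txt", 7112, 106)));

        // Simplified stand-in for textual_bon.tengi: same alphabet, same guarantee, fictitious locators.
        TengiInterface textualBon = new TengiInterface(none, reqPorts, top, bothReqs,
                Map.of("informal_analysis", new Locator("bouncycars_informal.bon", 0, 1),
                       "formal_design",     new Locator("bouncycars_formal.bon", 0, 1)));

        System.out.println(TengiSubtyping.isSubtypeOf(textualBon, assignment));  // true: the refinement holds

        // A hypothetical consumer that requires both ports and provides an implementation port.
        TengiInterface consumer = new TengiInterface(reqPorts, Set.of("code"), bothReqs,
                a -> a.getOrDefault("code", false), Map.of());
        TengiInterface composed = TengiAlgebra.compose(assignment, consumer);
        System.out.println(composed.inputs);   // [] : the required inputs are satisfied within the composition
        System.out.println(composed.outputs);  // the three output ports of the composition
    }
}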


Fig. 11. Tengi interfaces for the Java classes:

TENGI Game ENTITY "Game.java" [
  IN: { game uml data, game uml vis, game bon data, game bon vis };
  CONSTRAINT: (game uml data & game uml vis & game bon data & game bon vis);
  OUT: { game java };
  CONSTRAINT: game java;
] {
  LOCATOR game uml data IN "bouncyCars.uml" OFFSET 975 LENGTH 1469;
  LOCATOR game uml vis IN "bouncyCars.umlstm" OFFSET 335 LENGTH 5333;
  LOCATOR game bon data IN "bouncyCars.bonide" OFFSET 1302 LENGTH 2100;
  LOCATOR game bon vis IN "bouncyCars.bonide diagram" OFFSET 19177 LENGTH 31651;
  LOCATOR game java IN "Game.java" OFFSET 94 LENGTH 2228;
}

TENGI Level ENTITY "Level.java" [
  IN: { level uml data, level uml vis, level bon data, level bon vis };
  CONSTRAINT: (level uml data & level uml vis & level bon data & level bon vis);
  OUT: { level java };
  CONSTRAINT: level java;
] {
  LOCATOR level uml data IN "bouncyCars.uml" OFFSET 1313 LENGTH 79;
  LOCATOR level uml vis IN "bouncyCars.umlstm" OFFSET 2385 LENGTH 941;
  LOCATOR level bon data IN "bouncyCars.bonide" OFFSET 274 LENGTH 1023;
  LOCATOR level bon vis IN "bouncyCars.bonide diagram" OFFSET 630 LENGTH 18540;
  LOCATOR level java IN "Level.java" OFFSET 43 LENGTH 1511;
}

The interface specifications, particularly an interface's provisions and requirements, not only define components but also provide traceability links, by marking port locations explicitly and by interrelating ports using the constraints and the component algebra operators.

In this section we have constructed the Tengi interfaces in a bottom-up fashion, starting with the interfaces for basic components. This is not generally required, as Tengi allows definition of components of any granularity.

6 Discussion

Currently, Tengi allows for the following:

– Defining Tengi interfaces using a textual DSL. The tool provides an appropriate editor with syntax highlighting, live validation, and code completion.
– Applying operations to Tengi interfaces, i.e., composition, subtype checking, compatibility checking, and conformance checking.
– Establishing links between visual model elements and their serialization syntax and organizing them in a global dictionary.
– Highlighting of information referred to by Tengja locators in textual and graphical editors (except for tree viewers).

Tengi itself relies on Eclipse's model-driven software development tools. For example, the interface editor was generated using Xtext. That is, Tengi interfaces are internally represented by Ecore-based models. The composition operation is implemented via an Xpand template. Xtext and Xpand are both parts of the Eclipse Modeling Project [6]. Furthermore, interface operations are implemented using binary decision diagrams (BDDs), in particular the JavaBDD [23] library, for the representation of the port specification constraints.


Fig. 12. A visual BON model for the Game which contains levels

Fig. 13. Interfaces for the visual model and its data model (Fig. 12):

TENGI bouncyCars bonide diagram ENTITY "bouncyCars bonide diagram.tengi" [
  IN: { game bon data, level bon data };
  CONSTRAINT: (game bon data & level bon data);
  OUT: { game bon vis, level bon vis };
  CONSTRAINT: (game bon vis & level bon vis);
] {
  LOCATOR game bon vis IN "bouncyCars.bonide diagram" OFFSET 19177 LENGTH 31651;
  LOCATOR game bon data IN "bouncyCars.bonide" OFFSET 1302 LENGTH 2100;
  LOCATOR level bon vis IN "bouncyCars.bonide diagram" OFFSET 630 LENGTH 18540;
  LOCATOR level bon data IN "bouncyCars.bonide" OFFSET 274 LENGTH 1023;
}

TENGI bouncyCars bonide ENTITY "bouncyCars bonide.tengi" [
  IN: { }; CONSTRAINT: true;
  OUT: { game bon data, level bon data };
  CONSTRAINT: (game bon data & level bon data);
] {
  LOCATOR game bon data IN "bouncyCars.bonide" OFFSET 1302 LENGTH 2100;
  LOCATOR level bon data IN "bouncyCars.bonide" OFFSET 274 LENGTH 1023;
}

The technically most advanced part of Tengi is its traceability mechanism, Tengja, which allows linking physical locations in files to model elements in modeling editors by applying suitable aspects to Eclipse editors. Tengja, described in more detail in a preliminary version of this work [17], modifies the standard serialization mechanism of Eclipse using aspect-oriented programming. The Tengja aspect observes model serialization to establish the physical positions of model elements in files in a meta-model-independent manner. Thus, users of Tengi are not required to manually change modeling and programming editors to allow for visualization of ports.

Tengi is generally applicable to development projects that are executed in the Eclipse IDE. However, any other modern IDE with support for visual models could have been chosen to serve as the platform for Tengi. As mentioned earlier, Tengi is able to deal with all textual development artifacts as well as visual models that are EMF/GMF based. To our understanding, this covers the most important artifacts in current software development projects. Supporting new artifact types would require extending the Tengi tool to deal with the artifact's specific editor, since Tengi distinguishes and handles artifacts based on their specific editor.

Tengi interfaces are separate from, i.e., non-invasive to, the corresponding development artifacts. We could have investigated an invasive approach. That would mean that information that should appear in development artifacts would be directly marked within the development artifact. We decided against this approach to make the use of Tengi optional and to ease adoption in legacy projects. Further, non-invasive component definition approaches can be researched more easily, since existing projects do not need to be inherently modified. A drawback of choosing a non-invasive approach is that it requires the use of additional tools in the development process; here it is the use of the Eclipse IDE with our plug-ins.


Fig. 14. The interface synthesized for the BouncyCars project:

TENGI bouncycars ENTITY "Car.java+Game.java+Level.java+bouncyCars bonide diagram.tengi+bouncyCars bonide.tengi+bouncyCars uml.tengi+bouncyCars umlstm.tengi+bouncycars formal.bon+bouncycars informal.bon" [
  IN: { }; CONSTRAINT: true;
  OUT: { car java, game java, level java, game bon vis, level bon vis, game bon data, level bon data, level uml data, game uml data, game uml vis, level uml vis, formal design, level form bon, car form bon, game form bon, car inform bon, game inform bon, informal analysis, level inform bon };
  CONSTRAINT: car java & game java & level java & game bon vis & level bon vis & game bon data & level bon data & level uml data & game uml data & game uml vis & level uml vis & formal design & level form bon & car form bon & game form bon & car inform bon & game inform bon & informal analysis & level inform bon;
] {
  LOCATOR car java IN "bouncycars/Car.java" OFFSET 65 LENGTH 2471;
  ...
  LOCATOR car inform bon IN "bouncycars informal.bon" OFFSET 2326 LENGTH 787;
}

As described in Sect. 3, Tengi allows for the specification of ports for arbitrary information in development artifacts. It might be a shortcoming that such ports are presently untyped and that it is thereby possible to construct Tengi interfaces which relate information that should not be related. On the other hand, we think that untyped ports are advantageous, since they do not restrict developers in the specification of interfaces and allow Tengi interfaces to be applied in various settings and environments and under various requirements. For example, with untyped ports it is possible that in one development project Tengi interfaces relate only documentation artifacts, relating whole chapters to each other, whereas another development project relates only method names of Java classes to each other.

7 Related Work

The composition operator in Tengi's algebra is a simplified and regularized version of the algebra presented in [11], originally inspired by the input/output interfaces in [4]. Unlike [11], there is no concept of meta-interfaces in Tengi, since Tengi regards all software development artifacts as first-level artifacts. Also, this version of the component algebra does not reason about internal dependencies between outputs and inputs within a component.

efficiently treated using state of the art technologies like SAT-solving or BDDs. Itwas not our objective to create a very rich component algebra. One starting pointto get an overview of this research area is the anthology by Liu and Jifeng [14],which discusses languages beyond propositional logics.Static interrelations of heterogeneous software development artifacts are cur-

rently not widely discussed. The work of Henriksson et al. [9] is very close toours. They provide an approach to add modularity to arbitrary languages byenriching their grammars or meta-models with variation points. That is theyprovide an invasive modularization support. Also Heidenreich et al. [8] take a

Page 455: Generative and Transformational Techniques in Software Engineering IV: International Summer School, GTTSE 2011, Braga, Portugal, July 3-9, 2011. Revised Papers

Tengi Interfaces for Tracing between Heterogeneous Components 445

similar route. Both works require an extension of a language’s specifications tosupport modularity. First, the described mechanism is language focused, i.e.,each new language’s grammar needs to be modified before supporting modu-larization support, and second the described approach is invasive in the sensethat no separate interfaces are constructed but the artifacts itself define theirprovisions and requirements.Current traceability solutions like macromodels [19], mega-models [12], trace

models [7], and relation models [16] rely on an explicit model containing tracesbetween different model elements. Such explicit models can be regarded as com-posed or “wired” interfaces where the trace ends are ports of interfaces. Differ-ently, to Tengi all these solutions interrelate models, whereas Tengi abstractseven more by concentrating on visual and textual artifacts in their textual repre-sentation. Similarly, SmartEMF [10] checks cross-reference constraints to supportdevelopers and cross-references may be regarded as interface ports of implicit in-terfaces. The present paper can be seen as a generalization of [10] in the (specific)sense that Tengi could also be used to address the same problem.

8 Conclusion and Future Work

This paper presented Tengi, a tool that allows for the construction of components of heterogeneous development artifacts using interfaces. Tengi interfaces rely on ports to physical locations. Combined with the presented component algebra, such ports describe relations between the heterogeneous artifacts themselves. The tool provides a textual DSL for defining interfaces for heterogeneous software development artifacts, an appropriate editor including syntax highlighting, live validation, and code completion, and operations on the interfaces. Furthermore, the tool includes Tengja, a mechanism for connecting visual model elements with their serialization syntax and thereby enabling their integration into a global, IDE-wide locator dictionary, so that they can be used in Tengi interfaces. The tool is integrated into the Eclipse IDE as a plug-in. To demonstrate the abilities and advantages of our tool we provided a case study that applies Tengi in the development process of a small-sized software system.

In the future we will continue developing Tengi. We want to investigate the use of structured locators. We intend to use query languages and express locators as queries for particular information. This is not trivial, since we would still like to support evolution of development artifacts with interfaces, which requires being able to evolve queries in parallel. It is much simpler to track the evolution of a physical location than of complex structures defined by queries. We will address this issue by investigating heterogeneous development artifacts with respect to commonalities and differences in their structure. This will result in more development artifacts being usable in Tengi and a standard mechanism for registering new development artifacts with Tengi. We consider evaluating the tool in a real-world software development scenario to understand its impact on developers and on the quality of the software produced.


Acknowledgements. The assignment of Fig. 2 is due to Joe Kiniry, who also introduced Pfeiffer to BON. We thank Ralph Skinner for developing a GMF-based development environment for BON [20], and for supporting us in its use. We also thank the GTTSE reviewers for their constructive comments on earlier versions of this paper.

References

1. Advanced Models and Programs, Course Homepage (2010), http://www.itu.dk/research/pls/wiki/index.php/AMP-Spring2010
2. Xpand (May 2010), http://wiki.eclipse.org/Xpand
3. Alanen, M., Porres, I.: A Relation Between Context-Free Grammars and Meta Object Facility Metamodels. Tech. rep., Turku Centre for Computer Science (2003)
4. de Alfaro, L., Henzinger, T.A.: Interface Theories for Component-Based Design. In: Henzinger, T.A., Kirsch, C.M. (eds.) EMSOFT 2001. LNCS, vol. 2211, pp. 148–165. Springer, Heidelberg (2001)
5. Bryant, R.E.: Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers 8, 677–691 (1986)
6. Gronback, R.C.: Eclipse Modeling Project: A Domain-Specific Language Toolkit. Addison-Wesley (2009)
7. Guerra, E., de Lara, J., Kolovos, D.S., Paige, R.F.: Inter-modelling: From Theory to Practice. In: Petriu, D.C., Rouquette, N., Haugen, Ø. (eds.) MODELS 2010, Part I. LNCS, vol. 6394, pp. 376–391. Springer, Heidelberg (2010)
8. Heidenreich, F., Johannes, J., Zschaler, S.: Aspect Orientation for Your Language of Choice. In: Workshop on Aspect-Oriented Modeling (AOM at MoDELS) (2007)
9. Henriksson, J., Johannes, J., Zschaler, S., Aßmann, U.: Reuseware - Adding Modularity to Your Language of Choice. Journal of Object Technology 6(9) (2007)
10. Hessellund, A., Czarnecki, K., Wąsowski, A.: Guided Development with Multiple Domain-Specific Languages. In: Engels, G., Opdyke, B., Schmidt, D.C., Weil, F. (eds.) MODELS 2007. LNCS, vol. 4735, pp. 46–60. Springer, Heidelberg (2007)
11. Hessellund, A., Wąsowski, A.: Interfaces and Metainterfaces for Models and Metamodels. In: Czarnecki, K., Ober, I., Bruel, J.-M., Uhl, A., Völter, M. (eds.) MODELS 2008. LNCS, vol. 5301, pp. 401–415. Springer, Heidelberg (2008)
12. Jouault, F., Vanhooff, B., Bruneliere, H., Doux, G., Berbers, Y., Bezivin, J.: Inter-DSL Coordination Support by Combining Megamodeling and Model Weaving. In: Proceedings of the 2010 ACM Symposium on Applied Computing (2010)
13. Leavens, G.T., Cheon, Y.: Design by Contract with JML (2004)
14. Liu, Z., Jifeng, H. (eds.): Mathematical Frameworks for Component Software: Models for Analysis and Synthesis. Springer (2007)
15. Lotufo, R., She, S., Berger, T., Czarnecki, K., Wąsowski, A.: Evolution of the Linux Kernel Variability Model. In: Bosch, J., Lee, J. (eds.) SPLC 2010. LNCS, vol. 6287, pp. 136–150. Springer, Heidelberg (2010)
16. Pfeiffer, R.H., Wąsowski, A.: Taming the Confusion of Languages. In: Proceedings of the 7th European Conference on Modelling Foundations and Applications (2011)
17. Pfeiffer, R.H., Wąsowski, A.: An Aspect-based Traceability Mechanism for Domain Specific Languages. In: ECMFA Traceability Workshop (2010)
18. Reenskaug, T.M.H.: Models - Views - Controllers (1979), http://heim.ifi.uio.no/~trygver/1979/mvc-2/1979-12-MVC.pdf
19. Salay, R., Mylopoulos, J., Easterbrook, S.: Using Macromodels to Manage Collections of Related Models. In: van Eck, P., Gordijn, J., Wieringa, R. (eds.) CAiSE 2009. LNCS, vol. 5565, pp. 141–155. Springer, Heidelberg (2009)
20. Skinner, R.: An Integrated Development Environment for BON. Master's thesis, School of Computer Science and Informatics, University College Dublin (2010)
21. Sommerville, I.: Software Engineering, 8th edn. International Computer Sciences Series. Addison Wesley, Harlow (2006)
22. Walden, K., Nerson, J.M.: Seamless object-oriented software architecture: analysis and design of reliable systems. Prentice-Hall, Inc. (1995)
23. Whaley, J.: JavaBDD Project Homepage (March 2012), javabdd.sourceforge.net/


Author Index

Apel, Sven 346
Bencomo, Nelly 271
Blasband, Darius 1
Cleve, Anthony 297
Erwig, Martin 55
Fuhrer, Robert M. 101
Hainaut, Jean-Luc 297
Heidenreich, Florian 322
Johannes, Jendrik 322
Karol, Sven 322
Kastner, Christian 346
Kolovos, Dimitrios S. 197
Matragkas, Nikos 197
Mikhaiel, Rimon 159
Negara, Natalia 159
Noughi, Nesrine 297
Paige, Richard F. 197
Pfeiffer, Rolf-Helge 431
Rose, Louis M. 197
Seifert, Mirko 322
Stroulia, Eleni 159
Terwilliger, James F. 219
Tsantalis, Nikolaos 159
Voelter, Markus 383
Walkingshaw, Eric 55
Wąsowski, Andrzej 431
Wende, Christian 322
Williams, James R. 197
Xing, Zhenchang 159