[ieee 2008 15th working conference on reverse engineering (wcre) - antwerp, belgium...

Error Correcting Graph Matching Application to Software Evolution

Segla Kpodjedo∗, Filippo Ricca∗∗, Philippe Galinier∗ and Giuliano Antoniol∗

[email protected], [email protected], [email protected], [email protected]

∗ SOCCER Lab. – DGIGL, Ecole Polytechnique de Montreal, Quebec, Canada

∗∗ Unita CINI at DISI, University of Genoa, Italy.

Abstract

Graph representations and graph algorithms are widely adopted to model and solve problems in many different areas, from telecommunications to bioinformatics to civil and software engineering. Many software artefacts, such as class diagrams, can be thought of as graphs; thus, many software evolution problems can be reformulated as graph matching problems.

In this paper, we investigate the applicability of an error-correcting graph matching algorithm to object-oriented software evolution and report results, obtained on a small system (the LaTazza application), that support the applicability and usefulness of our proposal.

Keywords: Software evolution, Error-Correcting Graph Matching (ECGM) algorithm, tunnel.

1 Introduction and Problem Statement

Graph representations are well suited for modeling all kinds of real-life objects and problems, including software engineering problems. When two given software artefacts are represented as graphs, one legitimate question is how similar (quantitatively and qualitatively) they are. An intuitive way to answer this question is to match, with respect to some constraints, the nodes and edges of the first graph to the nodes and edges of the second graph.

Exact matching, which requires a strict correspondence between the two objects being matched or their subparts, often fails to provide exploitable results, and there is a need to resort to approximate graph matching. Briefly, approximate graph matching algorithms allow matching two nodes even when doing so violates constraints such as the edge-preservation constraint (exact correspondence of edges) or any other characteristic such as node/edge labels, weights, etc. Instead of being forbidden, such constraint violations are assigned a penalty that depends on the specific problem and desired results. Thus, a best matching is one that minimizes the overall penalty cost. Unfortunately, finding a best matching is known to be NP-hard [7], and optimal algorithms suffer from prohibitive computation times on medium and large graphs.

In this paper, we investigate the application of a Google-inspired Error-Correcting Graph Matching (ECGM) algorithm to Object-Oriented (OO) software evolution. Our algorithm, based on Tabu search, overcomes some limitations of previous traceability recovery work such as [1]. Indeed, previously known approaches hardly ever model, in a consistent and elegant way, the class diagram distortion caused by software evolution, such as the addition or deletion of relations between classes.

To demonstrate the feasibility and usefulness of ECGM algorithms for studying software evolution, we applied the algorithm detailed in [9] to the evolution of the LaTazza application (a small application of 17 Java classes and 37 relations for a total of 6184 Lines of Code).

The primary contributions of this paper can be summarized as follows. The paper proposes an ECGM algorithm to study software artefact evolution. It applies this ECGM algorithm to reconstruct class diagram evolution and reports validated results on a small application.

This paper is organized as follows. Section 2 presents related work, while Section 3 summarizes our ECGM algorithm. Section 4 describes the case study and briefly reports the obtained results. Finally, Section 5 concludes and outlines future work.

2 Related Work

The problem of studying subsequent releases of a software system and evaluating version deltas has already been addressed, with different algorithms, by Antoniol et al. in [1]. Similar to our proposal, the method presented in [1] recovers the design from the code in an intermediate representation and compares it with subsequent software releases. Unlike our approach, the problem was tackled with a maximum match algorithm [6] applied to a bipartite graph. Nodes in the bipartite graph are the classes of the two releases, and the similarity between them is derived from class and attribute/method names by means of string edit distances [8]. It is worth noting that the authors of [1] did not address the evolution of relations, and no details were provided on how relations between classes can be modeled.

2008 15th Working Conference on Reverse Engineering

1095-1350/08 $25.00 © 2008 IEEE

DOI 10.1109/WCRE.2008.48

Xing and Stroulia [15] propose an algorithm named UMLDiff for matching different versions of an application using several class metrics. Similarly to our algorithm, it takes as input two UML class diagrams reverse engineered from two corresponding code versions, and it produces as output a tree of structural changes that reports the differences between the two design versions.

Bouktif et al. [2] proposed an approach, based on dynamic time warping, to discover co-changing files in CVS repositories, i.e., files that change together (almost) all the time. The approach tries to answer the important and valuable question: “if this particular file changes, what other files should change?” The same question was also addressed by Zimmermann et al. in [16].

Previous work showed that simple measures, such as size and complexity, can be used to discover interesting facts in the evolution of software systems. By collecting and analyzing these measures over time, it is possible to automatically detect, for example, “jumps” in the complexity of a software system [10] and architectural changes [13]. The same strategy was followed by Vasa et al. in [14], where different measures were extracted for each class to study the evolution of 12 OO software systems. The main results of the study are that: (i) little code is modified over time, (ii) size and complexity measures rapidly stabilize and (iii) popular classes (i.e., classes with greater fan-in) are more likely to change.

Overall, no previous work applies approximate, error-tolerant graph matching to software evolution; to the best of the authors’ knowledge, previous work also lacks a comprehensive way to model edge evolution.

3 Background

Class diagrams can be thought of as labelled graphs, with nodes being the classes and edges representing the relations between classes. Labels on edges can specify the type of the edge (i.e., association, aggregation or inheritance), while node labels can specify properties such as the class name. Given two class diagrams of the same software at different stages of evolution, as illustrated in Figure 1, we are interested in finding a mapping, i.e., a correspondence, between them. A solution can be represented as a correspondence table linking classes of the two diagrams and specifying added or deleted classes.

Figure 1. Example of class diagrams to be matched

To apply ECGM algorithms to study software evolution, we envisage the following steps. First, software artefacts, class diagrams in our case, are represented as graphs. Once the graphs are available, we build a mapping between them via an ECGM algorithm. We are interested in finding an optimal or near-optimal mapping, i.e., a mapping minimizing a cost function representative of the problem at hand. We resort to metaheuristics, more precisely a Tabu search algorithm, to search for optimal or near-optimal solutions. To guide the search toward regions containing promising solutions and to speed up computation, we consider local and global information about the nodes of the graphs. Our algorithm exploits similarities of nodes based both on their number of edges and on their hierarchical position in the overall graph structure. This latter heuristic is implemented via a Google-inspired, PageRank-based algorithm [9].

3.1 The Error Correcting Graph Matching Model

A graph with labels from two finite alphabets of symbols $\Sigma_V$ (vertices’ labels) and $\Sigma_E$ (edges’ labels) is defined as a triple $(V, L_V, L_E)$, where $V$ is the finite set of elements, called nodes or vertices; $L_V : V \to \Sigma_V$ is the node labelling function and $L_E : V \times V \to \Sigma_E$ is the edge labelling function. Let $g_1 = (V_1, L_{V_1}, L_{E_1})$ and $g_2 = (V_2, L_{V_2}, L_{E_2})$ be two graphs. An ECGM from $g_1$ to $g_2$ is a bijective function $m : \hat{V}_1 \to \hat{V}_2$, where $\hat{V}_1 \subseteq V_1$ and $\hat{V}_2 \subseteq V_2$. We say that node $x \in \hat{V}_1$ is matched to node $y \in \hat{V}_2$ if $m(x) = y$. Furthermore, any node from $V_1 - \hat{V}_1$ is said to be deleted from $g_1$, and any node from $V_2 - \hat{V}_2$ is said to be inserted in $g_2$ under $m$.
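The definition above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the `LabelledGraph` class, the `split_matching` helper, the node identifiers and the dictionary-based representation are hypothetical choices made here, not the data structures of the algorithm in [9].

```python
from dataclasses import dataclass, field


@dataclass
class LabelledGraph:
    """A labelled graph (V, L_V, L_E) stored as two dictionaries."""
    node_labels: dict                                # L_V: node -> label
    edge_labels: dict = field(default_factory=dict)  # L_E: (u, v) -> label


def split_matching(g1, g2, m):
    """Return (matched pairs, nodes deleted from g1, nodes inserted in g2).

    `m` maps a subset of g1's nodes to a subset of g2's nodes and must be
    injective, so that it is a bijection between the two matched subsets.
    """
    if len(set(m.values())) != len(m):
        raise ValueError("m is not injective")
    if not set(m) <= set(g1.node_labels):
        raise ValueError("m uses a node that is not in g1")
    if not set(m.values()) <= set(g2.node_labels):
        raise ValueError("m uses a node that is not in g2")
    deleted = set(g1.node_labels) - set(m)            # unmatched nodes of g1
    inserted = set(g2.node_labels) - set(m.values())  # unmatched nodes of g2
    return sorted(m.items()), deleted, inserted


# Toy graphs; node ids and edge types are invented for illustration.
g1 = LabelledGraph(
    {"a": "TheClient", "b": "Ticket", "c": "WinningOrder"},
    {("a", "b"): "association", ("b", "c"): "aggregation"},
)
g2 = LabelledGraph(
    {"x": "Client", "y": "MyTicket", "z": "C"},
    {("x", "y"): "association", ("y", "z"): "inheritance"},
)
pairs, deleted, inserted = split_matching(g1, g2, {"a": "x", "b": "y"})
print(pairs, deleted, inserted)
```

Representing the matching as a dictionary makes the "at most one partner per node" constraint easy to check: keys are unique by construction, and uniqueness of values is verified explicitly.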

More formal definitions of ECGM can be found in [4]. In essence, any ECGM can be thought of as a set of edit operations that transform a given graph $g_1$ into another graph $g_2$.


We call node matching a couple $(n_1, m(n_1)) \in V_1 \times V_2$. An ECGM solution, called a matching, is then a set of such couples, with the constraint that a node is matched to at most one node. Penalties are assigned to every distortion introduced by the solution. We divide the edit operations leading to distortions into node/edge deletions, node/edge insertions and node/edge matching errors. Given $(n_1, m(n_1))$, a node matching error refers to the dissimilarity between $n_1$ and $m(n_1)$. Edge matching refers to any edge replacement from $V_1 \times V_1$ to $V_2 \times V_2$. Two types of edge matching errors are to be considered: replacing a missing edge by an existing edge, or vice versa (structural error), and replacing one edge by another with a different label (label error). As a result, there are seven possible edit operations or distortions, and each one is assigned a cost that depends on the problem at hand. In summary, any ECGM cost function can be parameterised by the seven cost values of the seven edit operations:

• node matching, deletion and insertion: $c_{nm}$, $c_{nd}$, $c_{ni}$;

• edge deletion and insertion applied to edges of deleted and added nodes: respectively $c_{ed}$ (cost of deleting an edge of a node deleted from $g_1$) and $c_{ei}$ (cost of adding an edge of a node added into $g_2$); and

• edge matching: edge structural error $c_{ems}$, when an edge is inserted/deleted between two matched nodes, and edge label error $c_{eml}$ (for example, an association is mapped into an aggregation).

Often, the costs of adding and deleting a node (or an edge) can be considered identical, so there is no need to specify two different values $c_{nd}$, $c_{ni}$ (or $c_{ed}$, $c_{ei}$). Writing $c_{no}$ and $c_{eo}$ for these shared node and edge costs, five positive real values suffice to define a cost function: $(c_{nm}, c_{no}, c_{eo}, c_{ems}, c_{eml})$.
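To illustrate how the five cost values could combine into a single penalty, here is a minimal Python sketch. It is one plausible reading of the edit operations listed above, not the cost function of [9]: the function name `matching_cost`, the directed-edge dictionaries and the `dissim` callback (node dissimilarity in [0, 1]) are assumptions made for this example.

```python
def matching_cost(nodes1, edges1, nodes2, edges2, m, costs, dissim):
    """Total penalty of matching `m` (dict: node of g1 -> node of g2).

    edges1/edges2 map (source, target) pairs to an edge label.
    costs = (c_nm, c_no, c_eo, c_ems, c_eml).
    dissim(n1, n2) returns a value in [0, 1], 0 meaning identical nodes.
    """
    c_nm, c_no, c_eo, c_ems, c_eml = costs
    total = 0.0

    # Node matching errors, scaled by how dissimilar the matched nodes are.
    total += sum(c_nm * dissim(n1, n2) for n1, n2 in m.items())

    # Deleted nodes (unmatched in g1) and inserted nodes (unmatched in g2).
    matched2 = set(m.values())
    total += c_no * (len(set(nodes1) - set(m)) + len(set(nodes2) - matched2))

    # Edges of g1: either mapped onto g2 or lost with a deleted node.
    seen2 = set()
    for (u, v), label in edges1.items():
        if u in m and v in m:
            image = (m[u], m[v])
            if image not in edges2:
                total += c_ems          # edge deleted between matched nodes
            else:
                seen2.add(image)
                if edges2[image] != label:
                    total += c_eml      # same edge, different relation type
        else:
            total += c_eo               # edge of a deleted node

    # Edges of g2 that are not the image of an edge of g1.
    for (x, y) in edges2:
        if (x, y) in seen2:
            continue
        if x in matched2 and y in matched2:
            total += c_ems              # edge inserted between matched nodes
        else:
            total += c_eo               # edge of an inserted node
    return total


# Toy example with invented graphs and the cost values reported in Section 4.2.
example_cost = matching_cost(
    ["a", "b", "c"], {("a", "b"): "association", ("b", "c"): "aggregation"},
    ["x", "y", "z"], {("x", "y"): "aggregation", ("y", "z"): "inheritance"},
    {"a": "x", "b": "y"},
    (50, 25, 10, 30, 25),
    lambda n1, n2: 0.2,   # constant dissimilarity, for illustration only
)
print(example_cost)
```

In this toy case the penalty is made of two node matching errors (2 × 50 × 0.2), one deleted and one inserted node (2 × 25), one label error (25) and two edges attached to unmatched nodes (2 × 10).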

3.2 Modeling software evolution as an ECGM

A straightforward mapping of a class diagram into a graph may disregard elements of classes such as the class name and the number of attributes and methods. However, these are important elements in software evolution, and they have to be modeled as node properties and matched by the ECGM algorithm. In this preliminary study, for each class, we considered only a subset of possible class characteristics: the class name and the number of attributes and methods. For example, we did not consider method signatures, attribute names and types, and so on; we believe that if a simplified representation produces interesting results, a richer model is likely to improve the quality of the results further.

Considering the graphs of Figure 1, the optimal solution is to match “TheClient” to “Client”, “Ticket” to “MyTicket”, “Lottery” to “Lottery” and “FreeticketLaw” to “TicketLaw” (potential node matching errors); to delete “WinningOrder” (node deletion) and any of its adjacent edges (edge deletion); and to insert “C” (node insertion) and any of its adjacent edges (edge insertion). As for edge matching, in the example, the relation between “Lottery” and “Ticket” is substituted by the one between “Lottery” and “MyTicket”.

When we match “Ticket” to “MyTicket”, we have to pay a penalty expressing the graph differences and the classes’ internal dissimilarity. Given two classes and their features (label, number of methods, number of attributes), $v_1(l_1, \#m_1, \#a_1)$ and $v_2(l_2, \#m_2, \#a_2)$, we compute their internal similarity as follows¹:

$$\mathrm{intSim}(v_1, v_2) = 1 - \left[\, l_w \times \frac{\mathrm{Levenshtein}(l_1, l_2)}{\max(\mathrm{length}(l_1), \mathrm{length}(l_2))} + m_w \times \frac{|\#m_1 - \#m_2|}{\max(\#m_1, \#m_2)} + a_w \times \frac{|\#a_1 - \#a_2|}{\max(\#a_1, \#a_2)} \,\right]$$

where $l_w$, $m_w$ and $a_w$ are the weights for labels, methods and attributes, respectively. They are real numbers between 0 and 1 such that $l_w + m_w + a_w = 1$. As a consequence, $\mathrm{intSim}(v_1, v_2)$ is a real number in $[0, 1]$, with 1 being the maximum similarity. These values are used to compute the node matching error: a similarity of 0 incurs the highest cost, $c_{nm}$, and a similarity of 1 incurs no cost.
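The internal similarity can be sketched in a few lines of Python. The default weights below are the values reported in Section 4.2 expressed as fractions (0.70, 0.25, 0.05); the method and attribute counts in the example are invented, and treating a 0/0 ratio as 0 (two classes with no methods, say) is an assumption, since the text does not cover that case.

```python
def levenshtein(s, t):
    """Edit distance: minimum number of single-character insertions,
    deletions or substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute
        prev = curr
    return prev[-1]


def ratio(diff, largest):
    """Normalised difference; taken to be 0 when both values are 0 (assumption)."""
    return diff / largest if largest else 0.0


def int_sim(v1, v2, lw=0.70, mw=0.25, aw=0.05):
    """Internal similarity of two classes given as (label, #methods, #attributes)."""
    l1, m1, a1 = v1
    l2, m2, a2 = v2
    return 1.0 - (lw * ratio(levenshtein(l1, l2), max(len(l1), len(l2)))
                  + mw * ratio(abs(m1 - m2), max(m1, m2))
                  + aw * ratio(abs(a1 - a2), max(a1, a2)))


# "Ticket" vs "MyTicket" with hypothetical method/attribute counts.
sim = int_sim(("Ticket", 4, 2), ("MyTicket", 5, 2))
print(round(sim, 3))
```

Here the name term contributes 0.70 × 2/8, the method term 0.25 × 1/5 and the attribute term nothing, so the two classes remain fairly similar despite the renaming.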

When the internal similarity of classes is taken into account, we need to add $l_w$, $m_w$ and $a_w$ to the weights previously presented. Our final cost function is then represented by $(c_{nm}, c_{no}, c_{eo}, c_{ems}, c_{eml}, l_w, m_w, a_w)$. Note that these cost values should be chosen carefully, as they are strongly related to the tackled problem and desired results [5]. Section 4 provides details about the process we applied to tune costs for class diagram evolution.

3.3 Tabu Search Algorithm

Our ECGM algorithm relies on a Tabu Search (TS) algorithm guided by global information on the nodes, obtained from the PageRank algorithm, and by local node features such as the number of edges.

Given a function f (cost function) to be minimized (or maximized) over some set S (the search space), a local search technique starts from some initial feasible point (solution) in the search space and proceeds iteratively (moves) from one point in S to another (a neighbour) until some termination criterion is met. There is no guarantee of obtaining an optimal solution, as the search may get trapped in local optima, but some techniques help to avoid local optima and find good solutions. For instance, to prevent cycles in the search, TS introduces one or several tabu lists used to exclude moves which would tend to make the search process go back to a previously visited solution.

¹ Levenshtein($l_1$, $l_2$) is the Levenshtein distance, or edit distance, between the two strings $l_1$ and $l_2$. It is given by the minimum number of operations (insertion, deletion or substitution of a single character) needed to transform one string into the other.
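The general principle can be illustrated with a generic tabu search skeleton in Python. This is a textbook-style sketch on a toy problem, not the ECGM-specific search of [9]; the function signature, the fixed-length tabu list and the aspiration rule (a tabu move is allowed if it improves on the best cost seen so far) are assumptions chosen for brevity.

```python
from collections import deque


def tabu_search(initial, cost, neighbours, tenure=5, max_iters=100):
    """Generic tabu search; returns the best solution seen and its cost.

    neighbours(s) returns (move, new_solution) pairs; a move stays tabu
    until `tenure` later moves have pushed it out of the list.
    """
    current = best = initial
    best_cost = cost(initial)
    tabu = deque(maxlen=tenure)        # short-term memory of recent moves
    for _ in range(max_iters):
        candidates = [(cost(s), move, s) for move, s in neighbours(current)
                      if move not in tabu or cost(s) < best_cost]  # aspiration
        if not candidates:
            break
        # Take the best admissible neighbour, even if it is worse than the
        # current solution: this is what lets the search leave local optima.
        c, move, current = min(candidates, key=lambda t: t[0])
        tabu.append(move)
        if c < best_cost:
            best, best_cost = current, c
    return best, best_cost


# Toy landscape: position 1 is a local optimum, position 5 the global one.
values = [5, 3, 4, 6, 2, 1, 7]


def step_neighbours(x):
    # A move is identified by the position it leads to.
    return [(y, y) for y in (x - 1, x + 1) if 0 <= y < len(values)]


result = tabu_search(0, lambda x: values[x], step_neighbours)
print(result)
```

A plain descent started at position 0 would stop at position 1 (cost 3); the tabu list prevents stepping back, so the search climbs over the bump at position 3 and reaches position 5 (cost 1).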

For ECGM, a move is either adding a new match or deleting one which is in the current solution. Before matching two nodes, one should consider their similarity. We introduce here the external similarity of nodes. The external similarity should consider the whole graph structure and the positions of the considered nodes in their respective graphs. Local features of a node, such as its incoming and outgoing edges, are undoubtedly first-hand and helpful information. The problem is that local information is not representative of the position of a node in a graph, since many nodes might share the same information, e.g., the number of incoming and outgoing edges. To obtain this kind of global information, we rely on PageRank, described in the next paragraph.

PageRank [3], one of the main components behind the first versions of Google, measures the relative importance of each element of a hyperlinked set and assigns it a numerical weight. In essence, the more references (incoming arcs) an element (vertex) gets from other elements (preferably important ones), the more importance it deserves. Further details can be found in [11]. Using PageRank, we can easily compute, for each vertex of a given graph, a metric representative of the global structure. Once combined with the local metrics, this metric gives a more accurate assessment of the structural similarity of two nodes, which is then used to guide the TS search.
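As a reminder of how such a global score can be obtained, the following Python sketch computes PageRank by power iteration on a small directed graph. The damping factor 0.85, the fixed iteration count and the handling of nodes without outgoing edges are conventional choices assumed here; the text does not say which variant or parameters [9] uses, nor how the score is combined with the local metrics.

```python
def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed graph.

    edges is an iterable of (source, target) pairs. Nodes without outgoing
    edges spread their rank evenly over all nodes (a common convention).
    """
    nodes = list(nodes)
    n = len(nodes)
    out = {v: [] for v in nodes}
    for u, v in edges:
        out[u].append(v)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out[v])
        # Every node receives the teleportation share plus the dangling mass.
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += damping * rank[u] / len(out[u])
        rank = new
    return rank


# Toy graph: A is referenced by B, C and D, and references B in turn.
ranks = pagerank("ABCD", [("B", "A"), ("C", "A"), ("D", "A"), ("A", "B")])
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this example C and D have the same local features (one outgoing edge, no incoming edge) and the same score, while A and B, which would look similar if only their degree were compared to each other's neighbours, are separated by their position in the graph.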

4 Applying our algorithm to LaTazza

As a first step toward larger applications, we apply our algorithm to the LaTazza application in order to experimentally tune the parameters of the cost function, i.e., $(c_{nm}, c_{no}, c_{eo}, c_{ems}, c_{eml}, l_w, m_w, a_w)$, using 5 versions of the application.

The LaTazza application, also used in other case studies (see for example [12]), is a simple Java application developed by master students of the Software Engineering course of DISI (University of Genova). It implements a beverage vending machine. During LaTazza development (the project lasted about 4 months), the application evolved through 5 versions. The system, in its final version, consists of 17 Java classes and 37 relations (associations, aggregations and generalizations) for a total of 6184 Lines of Code.

4.1 Execution

We applied our algorithm to the 5 versions of LaTazza. The ECGM algorithm, coded in C++, was compiled with g++ and run on a standard Linux Red Hat Advanced Server version 4 configuration.

4.2 Tuning the parameters

Applying ECGM to software evolution requires tuning the cost parameters in order to improve the accuracy of the implemented algorithm. The iterative procedure that we followed to tune the parameters consists of the following steps:

1. determine/tune $(c_{nm}, c_{no}, c_{eo}, c_{ems}, c_{eml}, l_w, m_w, a_w)$ and set I = 1;

2. if I = 5, stop; otherwise, run the algorithm on the two subsequent versions $(V_I, V_{I+1})$ of LaTazza;

3. check the results; if they are satisfactory, set I = I + 1 and go to step 2; otherwise, go to step 1.

The steps of the procedure were repeated until we were satisfied with the outputs of the algorithm. Finally, after 4 trials, we adopted $(c_{nm}, c_{no}, c_{eo}, c_{ems}, c_{eml}, l_w, m_w, a_w) = (50, 25, 10, 30, 25, 70, 25, 5)$ as the cost parameters for our empirical study. Regarding the internal dissimilarity of nodes, these parameters give the most importance to the name of the class (70%). The maximum cost of a node matching error ($c_{nm}$) equals the cost of removing and adding a node ($c_{no} + c_{no}$), and the penalties for edge matching errors ($c_{ems}$, $c_{eml}$) are significantly higher than those assigned to adding/removing unmatched edges ($c_{eo}$). Thus, the algorithm judges bad correspondences to be worse than missing correspondences.

4.3 Preliminary Results

With those cost parameters, obtained by trial and error, we were able to retrace the evolution of LaTazza without any error. This evolution involved an extensive renaming of the classes (a translation from English to Italian), modifications of relations between classes, insertions of new classes and deletions of old ones.

Moreover, we considered the last version of LaTazza as the starting point and applied our algorithm to the previous versions. A thread starts from a class of the last snapshot of LaTazza, and its length (maximum = 5) depends on whether the class is matched in the previous versions. If a class undergoes high distortion of itself and its structural relations, it will not be matched. When the algorithm is not able to find a class correspondence with a previous snapshot, the thread is stopped. The length of a thread is the number of times a last-version class is matched in its past. In this study, we did not consider threads beginning at a version other than the last. The classes matched in all the versions are those which maintained stable relations (associations, inheritances and aggregations), with small or no distortion, throughout the evolution of LaTazza. Ten of the 17 classes in the last version of LaTazza were in this stable part, which we call the tunnel; that is, 59% of the classes belong to the tunnel. Regarding the edges, 16 of them (out of the 37 edges in the last LaTazza snapshot) are in the tunnel, and 13 kept the same value throughout it (i.e., the type of the relation did not change across the snapshots). The classes present in the tunnel (see Figure 2) were indeed the backbone of the LaTazza application.
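The notions of thread and tunnel can be sketched in Python as follows. The representation is an assumption: each matching is a dictionary from the classes of one version to their correspondents in the preceding version, and the thread length counts successful backward matches only (whether the last version itself is counted is not specified in the text). The class names in the example are hypothetical.

```python
def thread_lengths(last_version_classes, backward_matchings):
    """Follow each class of the last version back through earlier versions.

    backward_matchings[k] maps a class of a version to its match in the
    preceding version; a missing key means no correspondence was found.
    Returns {class: number of consecutive successful matches}.
    """
    lengths = {}
    for cls in last_version_classes:
        current, length = cls, 0
        for matching in backward_matchings:
            if current not in matching:
                break                  # the thread stops here
            current = matching[current]
            length += 1
        lengths[cls] = length
    return lengths


def tunnel(last_version_classes, backward_matchings):
    """Classes of the last version that are matched in every earlier version."""
    lengths = thread_lengths(last_version_classes, backward_matchings)
    return {c for c, n in lengths.items() if n == len(backward_matchings)}


# Three hypothetical versions: v3 -> v2 -> v1 (names are illustrative only).
last = ["Cliente", "Biglietto", "Rapporto"]
matchings = [
    {"Cliente": "Client", "Biglietto": "Ticket"},   # v3 -> v2
    {"Client": "TheClient"},                        # v2 -> v1
]
print(thread_lengths(last, matchings))
print(tunnel(last, matchings))
```

With this representation, the share of classes in the tunnel is simply the size of the returned set divided by the number of classes in the last version.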

Figure 2. LaTazza’s tunnel (green classes)

5 Conclusion

We believe that the evolution of object-oriented programs can be studied effectively with the Tabu-search-based Error-Correcting Graph Matching algorithm. This work represents a first attempt at evaluating such an algorithm by means of a small case study.

This work is only a first step toward a more ambitious project, i.e., reformulating the robustness to change of software micro-architectures as an error-tolerant graph matching problem. Future work will be devoted to (i) considering the signature in its entirety (not only the number of attributes and methods), (ii) evaluating the scalability of our algorithm on a bigger case study (e.g., Mozilla) and finally (iii) comparing our ECGM algorithm with the maximum match algorithm proposed in [1].

References

[1] G. Antoniol, G. Canfora, G. Casazza, and A. D. Lucia. Maintaining traceability links during object-oriented software evolution. Software - Practice and Experience, 31:331–355, April 2001.

[2] S. Bouktif, Y.-G. Gueheneuc, and G. Antoniol. Extracting change-patterns from cvs repositories. In WCRE ’06: Proceedings of the 13th Working Conference on Reverse Engineering, pages 221–230, Los Alamitos CA USA, 2006. IEEE Computer Society.

[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, 1998.

[4] H. Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recogn. Lett., 18(9):689–694, 1997.

[5] H. Bunke. Error correcting graph matching: On the influence of the underlying cost function. IEEE Trans. Pattern Anal. Mach. Intell., 21(9):917–922, 1999.

[6] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.

[7] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18:265–294, 2004.

[8] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York, 1997.

[9] S. Kpodjedo, P. Galinier, and G. Antoniol. A google-inspired error correcting graph matching algorithm. Technical Report EPM-RT-2008-06, available at https://web.soccerlab.polymtl.ca/repos/soccer-lab/technical-reports/EPM-2008-06.pdf, Ecole Polytechnique de Montreal, 06 2008.

[10] M. Godfrey and Q. Tu. Growth, evolution, and structural change in open source software. In Int. Workshop on Principles of Software Evolution, Vienna Austria, Sept. 2001.

[11] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.

[12] F. Ricca, M. D. Penta, M. Torchiano, P. Tonella, M. Ceccato, and A. Visaggio. Are fit tables really talking? a series of experiments to understand whether fit tables are useful during evolution tasks. In International Conference on Software Engineering, pages 361–370. IEEE Computer Society Press, 2008.

[13] R. Vasa, J. Schneider, C. Woodward, and A. Cain. Detecting structural changes in object-oriented software systems. In ISESE 05: Proceedings of 4th International Symposium on Empirical Software Engineering, pages 463–470, Los Alamitos CA USA, 2005. IEEE Computer Society.

[14] R. Vasa, J. Schneider, and O. Nierstrasz. The inevitable stability of software change. In Proceedings of the 23rd International Conference on Software Maintenance, pages 413–422, Los Alamitos CA USA, 2007. IEEE Computer Society.

[15] Z. Xing and E. Stroulia. Umldiff: an algorithm for object-oriented design differencing. In ASE ’05: Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering, pages 54–65, New York, NY, USA, 2005. ACM.

[16] T. Zimmermann, P. Weisgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. In Proceedings of the 26th International Conference on Software Engineering, pages 563–572. IEEE Computer Society, 2004.
