[ieee 2009 16th working conference on reverse engineering - lille, france (2009.10.13-2009.10.16)]...

Computing Structural Types of Clone Syntactic Blocks

Ettore Merlo, Thierry Lavoie

Department of Computer and Software Engineering

Ecole Polytechnique de Montreal, Montreal, Quebec

{ettore.merlo, thierry-m.lavoie}@polymtl.ca

Abstract: A clone classification scheme is presented based on the structure of the Abstract Syntax Tree (AST) of a system and on the similarity measures between syntactic blocks of source code. Syntactic blocks in a system may represent classes, methods, statement blocks, and so on. An inclusion relation may exist between the source code lines of some of these blocks, depending on the syntactic structure of the source code. For example, a block corresponding to a method body may contain several possibly nested statement blocks.

This paper introduces an algorithm to identify different types of clone relations between blocks that are either method bodies or statement blocks.

Clone relation types between these blocks are interesting because they indicate properties of the structural relation of these clones and may give hints on re-factoring opportunities.

The proposed structural type clone classification scheme has been investigated on two open source Java systems, Tomcat and Eclipse. Experimental results are presented. Execution time performance of clone classification has been measured and reported. Results and further proposed research are discussed.

Keywords: clone types, software similarity, clone detection, software re-factoring, open source analysis.

I. INTRODUCTION

Metrics-based clone detection has often been used in the context of functions or methods as clone candidates [1], [2], although syntactic blocks have also been analyzed in [3]. A set inclusion relation of source code syntactic blocks can be easily extracted from an AST representation of code or from the token sequence representation of it. The representation of this set inclusion relation between blocks is a forest of trees that we call the Block Inclusion Forest (BIF). Extraction of the BIF from the AST is linear in the size of the AST and indirectly linear in the size of a system (LOC).
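As an illustration of how the inclusion relation can be recovered, the following is a minimal sketch and not the extraction procedure used in this paper. It assumes that each block of a single file is given as a hypothetical `(id, first_line, last_line)` triple, that blocks are properly nested, and that a block ending on the line where another one starts is its sibling rather than its parent. The function name `build_bif` is introduced here for illustration only. The scan itself is linear; the sort is only needed when blocks do not already arrive in source order, as they would from an AST traversal.

```python
from typing import Dict, List, Optional, Tuple

Block = Tuple[str, int, int]  # (hypothetical id, first line, last line)

def build_bif(blocks: List[Block]) -> Dict[str, Optional[str]]:
    """Return a parent map: block id -> enclosing block id, or None for roots."""
    parent: Dict[str, Optional[str]] = {}
    stack: List[Block] = []  # chain of blocks that are still open
    # Outer blocks first: earlier start, and for equal starts the longer block.
    for block in sorted(blocks, key=lambda b: (b[1], -b[2])):
        name, start, _end = block
        # Blocks that end at or before this start cannot enclose the new block.
        while stack and stack[-1][2] <= start:
            stack.pop()
        parent[name] = stack[-1][0] if stack else None
        stack.append(block)
    return parent

# A method body spanning lines 1 to 7 with two nested blocks.
blocks = [("A1-7", 1, 7), ("A2-4", 2, 4), ("A4-6", 4, 6)]
print(build_bif(blocks))
# {'A1-7': None, 'A2-4': 'A1-7', 'A4-6': 'A1-7'}
```

Line granularity is a simplification; token or character offsets would avoid the ambiguity of two blocks sharing a line.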

In this paper, two structural classes of blocks are defined on the BIF: blocks at the root of some inclusion tree are labelled as methods and blocks deeper in the BIF are simply labelled as blocks. In most cases this is the immediate and obvious definition, but issues raised by Java inner classes will be addressed in Section V.

From this simple structural labelling of code fragments, three relation types can be defined on the clone similarity relation based on the combination of structural labelling of blocks: method to method (MM), block to block (BB), and block to method (BM) together with its symmetric method to block (MB).

The proposed structural type clone classification scheme has been evaluated on two open source Java systems, Tomcat and Eclipse. Experimental results are presented and discussed.

The clone relation used in this paper is based on our previous work [1], [2], [4], [5], [6], [7]. Other approaches for clone analysis have been presented in [8], [9], [10], [11], [12], [13]. Empirical studies and evaluation of clone detection approaches can be found in [3], [14], [15], [16], [17], [18]. Scalability of clone detection approaches has been addressed in [19], [20], while clones and software evolution have been investigated in [21], [22]. Comprehensive surveys on the clone detection literature can be found in [23], [24]. Section II introduces an example to illustrate the presented algorithm and definitions. Section III describes in detail the proposed algorithm to compute structural clone pair types. Section IV describes the experiments, set-up, and results; Section V discusses results and issues, and Section VI concludes this paper.

II. EXAMPLE

Suppose that a system is composed of methods A, B, C, D, and E whose code structure is depicted in Figs. 1 and 2. Suppose also that method A is similar to B and C is similar to D.

    1 A(Z...) {          1 B(Z...) {
    2   if (W...) {      2   if (W...) {
    3     X...           3     X...
    4   } else {         4   } else {
    5     Y...           5     Y...
    6   }                6   }
    7 }                  7 }

Figure 1: Structure of methods A and B

    1 C(Z...) {          1 D(Z...) {          1 E(Z...) {
    2   if (W...) {      2   if (W...) {      2   X...
    3     X...           3     X...           3 }
    4   }                4   }
    5 }                  5 }

Figure 2: Structure of methods C, D, and E

2009 16th Working Conference on Reverse Engineering

1095-1350/09 $25.00 © 2009 IEEE DOI 10.1109/WCRE.2009.33

Blocks in this example are identified by the method name and the beginning and ending lines of the enclosing curly brackets in the code. A2−4 identifies the block enclosed in curly brackets from line 2 to 4 in method A. Fig. 3 shows the Block Inclusion Forest (BIF) of methods A, B, C, D, and E. Nodes in Fig. 3 are depicted as polygons to emphasize the similarity relation between blocks. Methods are identified by the block representing the method body. Thus, method A is identified by block A1−7 and so on for the other methods.

    A1−7: A2−4, A4−6
    B1−7: B2−4, B4−6
    C1−5: C2−4
    D1−5: D2−4
    E1−3

Figure 3: Sub-tree nesting structure (each root block is listed with its nested blocks)

Fig. 4 shows the clusters cl0, cl1, cl2, cl3 of similar fragments. We can observe in this figure that blocks A1−7 and B1−7 belong to cluster cl0; C1−5 and D1−5 belong to cluster cl1; A2−4, B2−4, C2−4, D2−4, and E1−3 belong to cluster cl2; and A4−6 and B4−6 belong to cluster cl3.

Figure 4: Clone partitions

Clone pair (A1−7, B1−7) is an MM relation because its elements are methods and they don’t have parents in the BIF.

Clone pair (A2−4, C2−4) is a BB relation because the parents of its elements exist and are not clones. A1−7, the parent of A2−4, is in cluster cl0 and C1−5, the parent of C2−4, is in cluster cl1.

Clone pair (A2−4, B2−4) belonging to cluster cl2 is considered redundant and is not labelled as a proper BB relation because the parents of its elements exist and are clones. A1−7, the parent of A2−4, and B1−7, the parent of B2−4, are in the same cluster cl0.

Clone pair (B2−4, E1−3) is of BM type and, similarly, pair (E1−3, B2−4) is of MB type, because B2−4 is a block at some nesting level and E1−3 is a method.

III. STRUCTURAL TYPE CLONE CLASSIFICATION

Suppose we have the BIF of a system and that the similarity relation between code fragments is an equivalence relation clone(fi, fj) between code fragments fi and fj, which produces a partition of all fragments into mutually exclusive clusters clk. Equivalence properties of clone detection are often satisfied by approaches based on AST, metrics, prefix and suffix similarity, and so on. Pairs of clones belonging to the same cluster are partitioned into classes BB, BM, MB, or MM according to types computed by the algorithm in Fig. 5. Type classification is performed pair by pair because fragments in the same cluster may belong to different types as described in Section II; for example, a block may participate at the same time in both a BB and a BM relation. Therefore, type classification has an overall quadratic component in the size of clusters and its worst-case complexity is O(n^2), where n is the number of analyzed blocks.
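The quadratic bound can be made explicit. The following is a sketch under the assumption that unordered pairs are enumerated within each cluster; the notation |cl_k| for the size of cluster cl_k is introduced here and does not appear elsewhere in the paper.

```latex
\[
  \#\text{pairs}
  \;=\; \sum_{k} \binom{|cl_k|}{2}
  \;=\; \sum_{k} \frac{|cl_k|\,(|cl_k| - 1)}{2}
  \;\le\; \frac{n(n-1)}{2}
  \;=\; O(n^2),
  \qquad \text{where } \sum_{k} |cl_k| = n .
\]
```

For the example of Section II, the cluster sizes are 2, 2, 5, and 2 (n = 11), which gives 1 + 1 + 10 + 1 = 13 unordered pairs to classify, against an upper bound of 55 if all eleven blocks fell into a single cluster.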

     1 cloneType ← computeCloneType(BIF, clId, vi, vj)
     2   pi = BIF.parent(vi)
     3   pj = BIF.parent(vj)
     4   if pi ≠ UNDEF
     5     if pj ≠ UNDEF
     6       if clId[pi] ≠ clId[pj]
     7         return BB
     8       else
     9         return R
    10     else
    11       return BM
    12   else
    13     if pj ≠ UNDEF
    14       return MB
    15     else
    16       return MM

Figure 5: Single structural clone typing (SSCT)

Fig. 5 reports the algorithm for determining the type of one clone relation between syntactic blocks. First, the parents pi and pj in the BIF of blocks vi and vj are computed (lines 2 and 3). Then, reasoning about parents is performed (lines 4, 5, and 13) to determine clone relation types. If a block’s parent is undefined in the BIF structure, the block represents a method. When both blocks are methods (line 16) an MM type is returned. When only one parent is undefined, BM (line 11) or MB (line 14) types are returned depending on the pair component that is a block. When both blocks are not method bodies, i.e., they are not at the root of a tree in the BIF, an additional test is performed (line 6) to check whether or not the blocks’ parents are clones. A BB type is returned only if the blocks’ parents are not clones, since in the opposite case the relation would be structurally redundant and labeled R. Algorithm SSCT in Fig. 5 contains a short sequence of conditions and statements with no loops. Its complexity can be considered O(1).
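The typing rule can be transcribed almost directly into executable form. The sketch below is one possible Python rendering, not the implementation used for the experiments. It assumes that the BIF is available as a parent map in which `None` plays the role of UNDEF, and that `cl_id` maps every block to its cluster identifier. The block names follow the example of Section II.

```python
from typing import Dict, Optional

def compute_clone_type(parent: Dict[str, Optional[str]],
                       cl_id: Dict[str, str],
                       vi: str, vj: str) -> str:
    """Structural type of the clone pair (vi, vj): MM, BB, BM, MB, or R."""
    pi, pj = parent[vi], parent[vj]
    if pi is not None:
        if pj is not None:
            # Both are nested blocks: redundant when their parents are clones too.
            return "BB" if cl_id[pi] != cl_id[pj] else "R"
        return "BM"  # vi is a nested block, vj is a method body
    if pj is not None:
        return "MB"  # vi is a method body, vj is a nested block
    return "MM"      # both are roots of the BIF

# BIF and clusters of the Section II example.
parent = {"A1-7": None, "B1-7": None, "C1-5": None, "D1-5": None, "E1-3": None,
          "A2-4": "A1-7", "A4-6": "A1-7", "B2-4": "B1-7", "B4-6": "B1-7",
          "C2-4": "C1-5", "D2-4": "D1-5"}
cl_id = {"A1-7": "cl0", "B1-7": "cl0", "C1-5": "cl1", "D1-5": "cl1",
         "A2-4": "cl2", "B2-4": "cl2", "C2-4": "cl2", "D2-4": "cl2", "E1-3": "cl2",
         "A4-6": "cl3", "B4-6": "cl3"}

print(compute_clone_type(parent, cl_id, "A1-7", "B1-7"))  # MM
print(compute_clone_type(parent, cl_id, "A2-4", "C2-4"))  # BB
print(compute_clone_type(parent, cl_id, "A2-4", "B2-4"))  # R
print(compute_clone_type(parent, cl_id, "B2-4", "E1-3"))  # BM
print(compute_clone_type(parent, cl_id, "E1-3", "B2-4"))  # MB
```

The five calls reproduce the pair types discussed in Section II.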


IV. EXPERIMENTS AND RESULTS

Experiments have been performed on two open-source Java systems, Tomcat and Eclipse, and have been executed on an Intel Core 2 Duo, 3.0 GHz clock, 3 GB RAM, under Linux Fedora 8. Code has been compiled with g++ 4.1.2. Tomcat [25] is an implementation of the Java Servlet and the Java Server Pages technologies and is widely used to power different kinds of web-based systems. Eclipse [26] is a complete IDE to develop Java applications. The size, number of methods, number of blocks and average nesting levels of blocks in the BIF for Tomcat and Eclipse are reported in Table I. In this table and in all reported figures and experiments, data refer to blocks larger than 6 LOC, which is a threshold that has been chosen as a lower bound of significance for fragment sizes. Methods are considered as being at nesting level 0. Nesting levels of blocks in the BIF start at 1 and increase as nesting does. Syntactic analysis of investigated systems has been performed using Eclipse to extract the BIF and block metrics.

    System               Tomcat   Eclipse
    Version              5.5      3.3
    LOC                  130K     1.3M
    Methods              5047     60326
    Blocks               5538     32113
    BIF Average Nesting  0.89     0.62

Table I: System features
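The nesting levels reported in Table I can be derived from the BIF by following parent links. The sketch below assumes a parent map with `None` for method bodies and averages over all nodes, methods included at level 0. The averaging convention is an assumption: the paper does not state it, although the values below 1 in Table I are consistent with it. The function names are hypothetical.

```python
from typing import Dict, Optional

def nesting_level(parent: Dict[str, Optional[str]], block: str) -> int:
    """Depth in the BIF: 0 for a method body, 1 for its directly nested blocks."""
    level = 0
    while parent[block] is not None:
        block = parent[block]
        level += 1
    return level

def average_nesting(parent: Dict[str, Optional[str]]) -> float:
    """Mean nesting level over every node of the forest (assumed convention)."""
    return sum(nesting_level(parent, b) for b in parent) / len(parent)

# Forest of the Section II example: five method bodies and six nested blocks.
parent = {"A1-7": None, "B1-7": None, "C1-5": None, "D1-5": None, "E1-3": None,
          "A2-4": "A1-7", "A4-6": "A1-7", "B2-4": "B1-7", "B4-6": "B1-7",
          "C2-4": "C1-5", "D2-4": "D1-5"}
print(round(average_nesting(parent), 2))  # 0.55
```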

Block similarity has been computed using CLAN [3]. The following metrics have been computed for each block in the investigated systems: number of statements, number of branches (IF, CASE, etc.), number of loops (FOR, WHILE, DO, etc.), number of calls, number of parameters (zero for nested blocks, possibly non-zero for methods).

    System    MM        BB        BM
    Tomcat    3,859     10,413    1,991
    Eclipse   324,656   190,464   175,607
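To make the role of the metric vector concrete, the sketch below groups blocks whose five metrics are identical. This is an illustrative assumption, not a description of CLAN: the similarity criterion actually applied by the tool is not detailed in this paper. Exact equality is chosen here only because it yields an equivalence relation, as required in Section III. The type `BlockMetrics`, the function `cluster_by_metrics`, and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple

class BlockMetrics(NamedTuple):
    statements: int
    branches: int
    loops: int
    calls: int
    parameters: int  # zero for nested blocks

def cluster_by_metrics(metrics: Dict[str, BlockMetrics]) -> List[List[str]]:
    """Group block ids whose metric vectors are identical."""
    groups: Dict[BlockMetrics, List[str]] = defaultdict(list)
    for block_id, vector in metrics.items():
        groups[vector].append(block_id)
    # Only groups with at least two members contain clone pairs.
    return [ids for ids in groups.values() if len(ids) > 1]

metrics = {
    "f": BlockMetrics(3, 1, 0, 2, 1),
    "g": BlockMetrics(3, 1, 0, 2, 1),
    "h": BlockMetrics(5, 0, 1, 0, 0),
}
print(cluster_by_metrics(metrics))  # [['f', 'g']]
```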

Table II: Block type cardinalities

Table II shows the block type cardinality figures for Tomcat and Eclipse. Clone pairs offer interesting opportunities for re-factoring [4] since they are identical or parametric [3]. Reported figures seem quite high, but it should be noted that they refer to pairs of clones and there is a quadratic influence in the figures with respect to the number of involved blocks. Nevertheless, the numbers are interesting for the three types and may offer interesting re-factoring opportunities.

Table III reports statistical parameters of the distribution of the block sizes corresponding to pairs reported in Table II. Sizes are measured in Lines of Code (LOC) as reported by the wc command under Linux.

            Tomcat (LOC)      Eclipse (LOC)
    Type    avg     max       avg     max
    MM      9.82    208       9.27    723
    BB      7.97    56        8.08    600
    BM      7.80    32        7.67    36

Table III: Block sizes

[Plot: number of pairs (logarithmic axis, 1 to 1,000,000) against block size in LOC (5 to 40), with one series each for Eclipse and Tomcat.]

Figure 6: Tomcat and Eclipse BM size distribution

Fig. 6 shows the size distribution of the blocks belonging to the BM type reported in Table II for Tomcat and Eclipse. Diagrams in this figure are not exactly smooth and discontinuities seem to appear. Details of higher peaks of frequencies at specific sizes have not been investigated in this paper, but it would be interesting to study them. In any case, the sizes of involved blocks are significant and could offer real and useful information to help evaluation of re-factoring opportunities.

In Fig. 7 the body of method release, lines 320 to 344 in file ApplicationFilterConfig.java in the catalina/core directory in Tomcat, is reported; it is similar to a block in lines 370 to 394 in method setFilterDef from the same file.

Table IV reports cumulative execution times of clone analysis and clone pairs structural type classification for Tomcat and Eclipse. Parsing and metrics extraction execution times are not reported.

    System    Time (s)
    Tomcat    0.3
    Eclipse   2.2

Table IV: Execution times

V. DISCUSSION

Results show that clone relation structural typing can be evaluated using a common desktop computer.

(a) Method clone (b) Block clone

Figure 7: Method-to-block clone relation

The presented approach identifies thousands of MM, BB, and BM pairs for Tomcat and Eclipse. Some form of automated processing like that presented in [4] should be investigated to handle this large amount of information. Feedback to developers using visualization is useful, but developers may be overwhelmed by the volume of opportunities. It should be noted that the presented figures regarding pairs of clones are high because the number of pairs of clones grows with a quadratic trend with respect to the number of distinct blocks that are clones. Nevertheless, the number of distinct clone blocks is high and automated processing of findings is still helpful.

The presented results also depend on several specific factors used in the experiments and discussed below.

Investigated systems are written in Java. Furthermore, only two specific systems with specific structural nesting and cloning features have been investigated. Results using other object oriented languages or procedural languages may be different and should be investigated. Investigation should also be carried out on more Java systems for different nesting and cloning characteristics.

The Java inner class programming construct presents an anomaly of classification between blocks and methods. Methods from inner classes are actually methods that are nested in other methods from other classes. For the presented experiments and results, methods from inner classes have been labeled as blocks at a nesting level higher than zero.

Investigation of structural clone relation types has been based on software metrics and on the specific metrics mentioned in Section IV. Other approaches, based for example on string matching, prefix or suffix trees, or graph matching, may produce slightly different clusters of similar fragments. In addition, only blocks bigger than 6 LOC have been considered for experiments. No results have been produced for blocks of smaller sizes.

Metrics-based block analysis is efficient because metrics for clone analysis are compositional and a single pass through the AST is sufficient to annotate it with the metrics value for each sub-tree. It should be noted that the choice of metrics has been based on the authors’ previous experience, but the metrics presented in [9], [6] could be used almost interchangeably.
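Compositionality means that the metrics of a sub-tree are the sum of the node's own contribution and the metrics of its children, so one post-order traversal annotates every sub-tree. The sketch below illustrates this on a toy tree. The `Node` class, the node kinds, and the restriction to three of the metrics are assumptions made for brevity and do not reflect the Eclipse-based extraction used in the experiments.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    kind: str  # hypothetical node kinds: "if", "for", "call", "block", ...
    children: List["Node"] = field(default_factory=list)
    metrics: Tuple[int, int, int] = (0, 0, 0)  # (branches, loops, calls)

BRANCH_KINDS = {"if", "case"}
LOOP_KINDS = {"for", "while", "do"}

def annotate(node: Node) -> Tuple[int, int, int]:
    """Post-order pass: own contribution plus the metrics of all children."""
    branches = int(node.kind in BRANCH_KINDS)
    loops = int(node.kind in LOOP_KINDS)
    calls = int(node.kind == "call")
    for child in node.children:
        b, l, c = annotate(child)
        branches, loops, calls = branches + b, loops + l, calls + c
    node.metrics = (branches, loops, calls)
    return node.metrics

tree = Node("block", [
    Node("if", [Node("call")]),
    Node("for", [Node("call"), Node("call")]),
])
print(annotate(tree))            # (1, 1, 3)
print(tree.children[1].metrics)  # (0, 1, 2)
```

Each node is visited once, so the cost is linear in the size of the tree.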

VI. CONCLUSION

In this paper, the position in the Block Inclusion Forest (BIF) of blocks involved in clone relations was investigated from the perspective of three types of structural block combinations: method to method (MM), block to block (BB), and block to method (BM) with its symmetric method to block (MB) type.

Presented relation types may be interesting because of their potential impact on clone re-factoring: block to block relation pairs could be candidates for wrapper synthesis; block to method or the symmetric method to block relation pairs may be almost ready-to-re-factor opportunities; and method to method ones may require some interface merging prior to re-factoring.

The presented structural type clone classification scheme has been evaluated on two open source Java systems, Tomcat and Eclipse. Experimental results show that thousands of MM, BB, and BM pairs have been identified for Tomcat and Eclipse. These figures represent a significant amount of appealing re-factoring opportunities.

Future research may follow the perspectives of extending experiments to larger and more diversified systems to better assess performance. Programming languages other than Java could also be investigated. Automated or semi-automated re-factoring approaches based on structural types of clone relations should also be addressed.


ACKNOWLEDGEMENTS

This research has been funded by the Natural Sciences and Engineering Research Council of Canada under the Discovery Grants Program. The authors wish to thank Melissa Mongeau for her contribution to Eclipse based metrics extraction.

REFERENCES

[1] J. Mayrand, C. Leblanc, and E. Merlo, “Experiment on the automatic detection of function clones in a software system using metrics,” in Proceedings of the International Conference on Software Maintenance - IEEE Computer Society Press, Monterey, CA, Nov 1996, pp. 244–253.

[2] E. Merlo, G. Antoniol, M. D. Penta, and F. Rollo, “Linear complexity object-oriented similarity for clone detection and software evolution analysis,” in Proceedings of the International Conference on Software Maintenance - IEEE Computer Society Press. IEEE Computer Society Press, 2004, pp. 412–416.

[3] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo, “Comparison and evaluation of clone detection tools,” IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 577–591, 2007.

[4] M. Balazinska, E. Merlo, M. Dagenais, B. Lagu, and K. Kontogiannis, “Advanced clone-analysis as a basis for object-oriented system refactoring,” in Proc. Working Conference on Reverse Engineering (WCRE). IEEE Computer Society Press, 2000, pp. 98–107.

[5] S. Bouktif, G. Antoniol, M. Neteler, and E. Merlo, “A novel approach to optimize clone refactoring activity,” in Genetic and Evolutionary Computation Conference (GECCO). ACM Press, 2006, pp. 1037–1043.

[6] K. Kontogiannis, R. De Mori, R. Bernstein, M. Galler, and E. Merlo, “Pattern matching for clone and concept detection,” Journal of Automated Software Engineering, vol. 3, pp. 77–108, March 1996.

[7] E. Merlo, “Detection of plagiarism in university projects using metrics-based spectral similarity,” in Duplication, Redundancy, and Similarity in Software, ser. Dagstuhl Seminar Proceedings, R. Koschke, E. Merlo, and A. Walenstein, Eds., no. 06301. Dagstuhl, Germany: IBFI, 2007.

[8] B. Baker, “Finding clones with dup: Analysis of an experiment,” IEEE Transactions on Software Engineering, 2007.

[9] I. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier, “Clone detection using abstract syntax trees,” in Proceedings of the International Conference on Software Maintenance - IEEE Computer Society Press, 1998, pp. 368–377.

[10] S. Ducasse, O. Nierstrasz, and M. Rieger, “On the effectiveness of clone detection by string matching,” International Journal on Software Maintenance and Evolution: Research and Practice, 2006.

[11] T. Kamiya, S. Kusumoto, and K. Inoue, “Ccfinder: A multi-linguistic token-based code clone detection system for large scale source code,” IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654–670, 2002.

[12] Z. Li, S. Lu, S. Myagmar, and Y. Zhou, “Cp-miner: Finding copy-paste and related bugs in large-scale software code,” IEEE Transactions on Software Engineering, pp. 1–17, 2006.

[13] A. Marcus and J. I. Maletic, “Identification of high-level concept clones in source code,” in ASE ’01: Proceedings of the 16th IEEE International Conference on Automated Software Engineering. Washington, DC, USA: IEEE Computer Society, 2001, p. 107.

[14] R. Al-Ekram, C. Kapser, R. Holt, and M. Godfrey, “Cloning by accident: An empirical study of source code cloning across software systems,” in International Symposium on Empirical Software Engineering, 2005.

[15] L. Aversano, L. Cerulo, and M. D. Penta, “How clones are maintained: An empirical study,” in European Conference on Software Maintenance and Reengineering, 2007.

[16] R. Falke, P. Frenzel, and R. Koschke, “Empirical evaluation of clone detection using syntax suffix trees,” Empirical Software Engineering Journal, vol. 13, no. 6, pp. 601–643, 2008.

[17] M. Kim, V. Sazawal, D. Notkin, and G. Murphy, “An empirical study of code clone genealogies,” in European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2005.

[18] C. K. Roy and J. R. Cordy, “An empirical study of function clones in open source software,” in Proceedings of the Working Conference on Reverse Engineering, 2008.

[19] H. Basit, S. Pugliesi, W. Smyth, A. Turpin, and S. Jarzabek, “Efficient token based clone detection with flexible tokenization,” in European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2007.

[20] Z. Jiang and A. Hassan, “A framework for studying clones in large software systems,” in Workshop on Source Code Analysis and Manipulation, 2007.

[21] G. Antoniol, U. Villano, E. Merlo, and M. D. Penta, “Analyzing clone evolution in the linux kernel,” Information and Software Technology, pp. 755–765, 2002.

[22] E. Duala-Ekoko and M. Robillard, “Tracking code clones in evolving software,” in Proceedings of the International Conference on Software Engineering. IEEE Computer Society Press, 2007.

[23] U. of Alabama at Birmingham, “Clone literature,” http://students.cis.uab.edu/tairasr/clones/literature.

[24] C. Roy and J. Cordy, “A survey on software clone detection research,” School of Computing, Queen’s University, Tech. Rep. Technical Report 2007-541, November 2007.

[25] “Tomcat,” http://tomcat.apache.org.

[26] “Eclipse,” http://www.eclipse.org.
