An Empirical Study of Function Clones in Open Source Software
DESCRIPTION
This is a presentation on a research paper in which the authors built a tool called NICAD.
TRANSCRIPT
An Empirical Study of Function Clones in Open Source Software
Chanchal K. Roy and James R. Cordy
Queen’s University
Presenter: MF Khan
Outline
• Introduction
• NICAD Overview
• Experimental Setup
• Experimental Results
• Conclusions
• Discussion
2
Introduction
• Code Clone/Clone
– Reusing a code fragment by copying and pasting, with or without minor modifications
• Benefits
– Software maintenance (bug detection)
• History
– Several techniques were proposed
– Lack of in-depth comparative studies on cloning in a variety of systems
3
Introduction (Cont)
• NICAD
– In-depth study of function cloning in 15+ C and Java systems, including Apache and the Linux kernel
– Accurate detection of near-miss function clones
– Focuses on its worth in detecting copy/pasted near-miss clones by using pretty-printing, code normalization, and filtering
– Lightweight, using simple text-line comparison
– Capable of detecting clones in very large systems in different languages
4
NICAD Overview
• Three phases of clone detection
– Extraction
All potential clones are identified and extracted: all functions and methods in C and Java, with their original source coordinates.
– Comparison (Determination of Clones)
Potential clones are clustered and compared. The pretty-printed potential clones are compared line by line, text-wise, using the longest common subsequence (LCS) algorithm.
5
NICAD Overview
Unique Percentage of Items (UPI)
If the UPI for both line sequences is zero or below a certain threshold, the potential clones are considered to be clones.
– Reporting
Results from NICAD are reported in XML database form and as interactive HTML.
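The comparison step can be sketched in a few lines of Python. This is only an illustration, not NICAD's actual implementation: the function names (`lcs_length`, `upi`, `is_clone_pair`) are hypothetical, and the sketch assumes that the UPI of a line sequence is the fraction of its lines that are not part of the LCS, and that both sequences are non-empty.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists of lines."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def upi(lines, common):
    """Assumed definition: fraction of lines NOT shared with the other sequence."""
    return (len(lines) - common) / len(lines)


def is_clone_pair(a, b, threshold):
    """Both sequences must have a UPI at or below the threshold."""
    common = lcs_length(a, b)
    return upi(a, common) <= threshold and upi(b, common) <= threshold


# Two pretty-printed functions that differ in a single line.
f1 = ["int f (int x)", "{", "x = x + 1;", "return x;", "}"]
f2 = ["int f (int x)", "{", "x = x + 2;", "return x;", "}"]

print(is_clone_pair(f1, f2, 0.0))  # exact-clone threshold
print(is_clone_pair(f1, f2, 0.3))  # near-miss threshold
```

Here one line out of five is unique in each function, so the UPI is 0.2 for both: the pair is rejected at threshold 0.0 and accepted at threshold 0.3.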
6
Experimental Setup
The paper applies NICAD to find function clones in a number of open source systems.
It then introduces a set of metrics to analyze the results.
7
Experimental Setup
Subject Systems: 10 C and 7 Java systems
8
Clone Definition
• Non-empty functions of at least 3 LOC
• In pretty-printed format
• Different Unique Percentage of Items (UPI) thresholds are used to find exact and near-miss clones
• E.g.
– If the UPI threshold is 0.0: only exact clones are reported
– If the UPI threshold is 0.10: two functions are treated as clones when at most 10% of the lines in each are unique
9
Validation of Clones
• Validating the detected clones is a two-step process
• 1: NICAD's interactive HTML output
– Gives an overall view of the original source of the clone classes
• 2: XML output
– Used to compare, pairwise, the original source of the functions in each clone class
– Uses Linux diff to determine the textual similarity of the original source
10
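The pairwise check in step 2 can be approximated as follows. This sketch does not call Linux diff; it swaps in Python's standard `difflib.SequenceMatcher` as a stand-in for a textual similarity measure, and the function name `pairwise_similarity` and the dictionary layout of a clone class are assumptions made for illustration.

```python
import difflib
from itertools import combinations


def pairwise_similarity(clone_class):
    """Compare every pair of functions in one clone class.

    clone_class maps a function label to its original source text.
    Returns {(label_a, label_b): similarity ratio in [0, 1]}.
    """
    result = {}
    for (name_a, src_a), (name_b, src_b) in combinations(clone_class.items(), 2):
        matcher = difflib.SequenceMatcher(None, src_a.splitlines(), src_b.splitlines())
        result[(name_a, name_b)] = matcher.ratio()
    return result


# Hypothetical clone class with two functions that differ in one line.
clone_class = {
    "a.c:f": "int f(int x) {\n  x += 1;\n  return x;\n}",
    "b.c:g": "int f(int x) {\n  x += 2;\n  return x;\n}",
}

for pair, ratio in pairwise_similarity(clone_class).items():
    print(pair, round(ratio, 2))
```

A ratio close to 1 suggests the detected clones really are textually similar in their original, unnormalized form.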
Metrics and Visualizations
• Total Cloned Methods (TCM)
– Gives the overall cloning statistics
• Files Associated with Clones (FAWC)
– Overall localization of clones
– From a software maintenance point of view, a lower value of FAWCP is desirable. Why?
– If clones are localized to certain specific files, they may be easier to maintain
– Still, one can't say which files contain the majority of clones in the system
11
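A rough sketch of how the two metrics above could be computed. The exact formulas are assumptions: TCM is taken here as the percentage of all methods that are cloned, and FAWC as the percentage of all files that contain at least one cloned method. The function names and the `(file, method)` tuple representation are hypothetical.

```python
def tcm_percent(methods, cloned):
    """Assumed TCM: percentage of all methods that take part in a clone."""
    return 100.0 * len(cloned) / len(methods)


def fawc_percent(files, cloned):
    """Assumed FAWC: percentage of files containing at least one cloned method."""
    files_with_clones = {file for file, _ in cloned}
    return 100.0 * len(files_with_clones) / len(files)


# Hypothetical system: three files, four methods, two of them cloned.
files = ["a.c", "b.c", "c.c"]
methods = [("a.c", "f"), ("a.c", "g"), ("b.c", "h"), ("c.c", "k")]
cloned = {("a.c", "f"), ("a.c", "g")}

print(tcm_percent(methods, cloned))   # half of the methods are cloned
print(fawc_percent(files, cloned))    # but they sit in one file out of three
```

In this toy system the cloning rate is high, yet the clones are confined to a single file, which is the situation the slide describes as easier to maintain.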
Metrics and Visualizations
• Cloned Ratio of File for Methods (CRFM)
– With CRFM we attempt to discover highly cloned files
– Computed for a particular file (f)
• Profile of Cloning Locality w.r.t. Methods (PCLM)
– Kapser and Godfrey provide three location-based categories of function clones
– 1: In the same file 2: In the same directory 3: In different directories
12
Experimental Results
13
1. More function cloning in open source Java than in C. On average about 15% (7.2% w.r.t. LOC).
2. The effect of increasing the UPI threshold is almost identical.
Detail Overview
14
1. Several of the C systems have <10% cloned functions.
Java systems are consistent in cloning.
Clone Associated Files
15
Clone Associated Files
• FAWC addresses the issue of what portion of the files in a system is associated with clones.
• From a software maintenance point of view, a system with more clones that are associated with only a few files is in some sense better than a system with fewer clones scattered over many files.
16
Profiles of Cloning Density
• It tells us which files are highly cloned or which files contain the majority of clones
17
That means scattered files and more near-miss clones.
Profile of Cloning Density
18
The assumption is that cloned methods in high-density cloned files were intentionally copied and pasted.
Profile of Cloning Localization
19
The location of a clone pair is a factor in software maintenance.
Except for Linux, there are no exact clones (UPI threshold 0.0) in C.
When the UPI threshold is 0.3, on average 45.9% (49.0% LOC) of clone pairs in C occur.
Conclusion
• NICAD is capable of accurately finding
1. Exact function clones
2. Near-miss function clones
20
Discussion
21
• What is the definition of a clone?
• What is the definition of near-miss clones?
• Why is Weltab higher in slide 14?
• What if we use C++ or C#?
• What will happen if we use a smaller clone granularity, such as begin-end blocks?
Thank you.
22