Detecting Clones across Microsoft .NET Programming Languages (WCRE 2012)
DESCRIPTION
This presentation was given at the Working Conference on Reverse Engineering (WCRE 2012). The paper title is "Detecting Clones across Microsoft .NET Programming Languages".
Abstract: The Microsoft .NET framework and its language family focus on multi-language development to support interoperability across several programming languages. The framework allows for the development of similar applications in different languages through the reuse of core libraries. As a result of such multi-language development, the identification and traceability of similar code fragments (clones) becomes a key challenge. In this paper, we present a clone detection approach for the .NET language family. The approach is based on the Common Intermediate Language, which is generated by the .NET compiler for the different languages within the .NET framework. In order to achieve an acceptable recall while maintaining the precision of our detection approach, we define a set of filtering processes to reduce noise in the raw data. We show that these filters are essential for Intermediate Language-based clone detection, without significantly affecting the precision of the detection approach. Finally, we study the quantitative and qualitative performance aspects of our clone detection approach. We evaluate the number of reported candidate clone-pairs, as well as the precision and recall (using manual validation) for several open source cross-language systems, to show the effectiveness of our proposed approach.
TRANSCRIPT
Detecting Clones across Microsoft .NET Programming Languages
Farouq Al-omari, Iman Keivanloo, Chanchal K. Roy, Juergen Rilling
Working Conference on Reverse Engineering, Kingston, Canada, 18 October 2012 – I. Keivanloo
Contact: [email protected]
This is not the original version presented at the WCRE 2012 conference (animations etc. are omitted).
Clones (Mergesort)
[Figure: Mergesort implementations on "the C# planet" and on "other planets"]
Clone Detection across Languages: General Solution
• C#
• VB.NET
• J#
• F#
• COBOL (.NET)
• Java
Compilation → Intermediate Language (IL) (low level)
The solution is to work on the IL instead of dealing with several source languages.
Clone Detection across Languages using IL: Is there any chance it works?
• Up to 3 times more cloned fragments detected using IL
Input data type: CIL vs. source code
Dataset   CIL: # Clone Classes   CIL: # Clone Fragments   Source Code: # Clone Classes   Source Code: # Clone Fragments
ASXGUI    9                      393                      69                             261
Mono      37                     4373                     369                            1523
Clone Detection across Languages using IL: Observed Challenges (using an example)
VB.NET:
    Sub Main()
        Dim x As Integer
        x = 10
        If x < 0 Then
            x += 1
        Else
            Console.WriteLine("Positive number")
        End If
    End Sub
C#:
    static void Main(string[] args)
    {
        int x = 10;
        if (x < 0)
            x++;
        else
            Console.WriteLine("Positive number");
    }
Clone Detection across Languages using IL: Observed Challenges (using an example)
[Figure: the same VB.NET and C# snippets, each shown next to the IL generated from it]
Clone Detection across Languages using IL: Observed Challenges
1. Larger, unpredictable size at the IL level [Keivanloo IWSC'12]
2. Higher dissimilarity at the IL level
Observed Challenge #2: High Dissimilarity (Noise)
• Sample IL
    IL_000c: ldloc.0
    IL_000d: ldc.i4.1
    IL_000e: add.ovf
    IL_000f: stloc.0
    IL_0010: br.s IL_0024
    IL_0012: nop
    IL_0013: ldstr "Positive number"
    IL_0018: call void [mscorlib]System.Console::WriteLine(string)
Major noise types:
• Line numbers
• Pointers to line numbers
• Push, Pop …
• Detailed data type information
Clone Detection across Languages using IL: The Core Solution
• The challenge: noise
• Solution: data cleansing (filtering out noise)
• Why? To increase recall
Source Code → IL + noise → Filters → IL - noise
Filters for Noise Reduction: Our Filter Set
Filter    Before filtering (example)                                        After filtering   Description
Filter 1  IL_0003: stloc.0                                                  stloc.0           IL_0003 is the instruction address
Filter 2  brtrue.s IL_0015                                                  brtrue.s          IL_0015 is the address of the branch destination
Filter 3  ldarg 3                                                           ldarg             The values 3 and 1 are argument numbers
          starg 1                                                           starg
Filter 4  ldc.i4.s 10                                                       ldc.i4.s          10 is the number pushed onto the stack
Filter 5  ldstr "Positive number"                                           ldstr             "Positive number" is the printed string constant
Filter 6  stloc 7                                                           stloc             7 is the variable index
Filter 7  ldc.i4.s 10                                                       ldc               i4 denotes the int32 data type in CIL; s denotes the short form
Filter 8  IL_0011: add                                                      add
          IL_0012: stloc.0                                                  stloc
          IL_0013: br.s IL_0020                                             br
          IL_001a: call void [mscorlib]System.Console::WriteLine (string)   call
Note that "Filter 8" is just a nickname. Refer to the Filter 8 description section for more details.
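The filter table above can be illustrated with a few lines of Python. This is only a minimal sketch, not the authors' implementation: the function names `strip_address`, `strip_operands`, `opcode_only`, and `filter_il` are hypothetical, and the regular expression assumes IL lines in the `IL_xxxx: opcode operands` form shown on the slides.

```python
import re

# Leading instruction address such as "IL_0003: " (what Filter 1 removes).
ADDRESS = re.compile(r"^\s*IL_[0-9a-fA-F]{4}:\s*")


def strip_address(line: str) -> str:
    """Drop the leading instruction address (roughly Filter 1)."""
    return ADDRESS.sub("", line).strip()


def strip_operands(line: str) -> str:
    """Keep only the opcode, dropping branch targets, indexes, and constants
    (roughly Filters 2 to 6)."""
    body = strip_address(line)
    return body.split(None, 1)[0] if body else ""


def opcode_only(line: str) -> str:
    """Keep only the base opcode, e.g. 'stloc.0' -> 'stloc'
    (roughly the combined 'Filter 8' row)."""
    return strip_operands(line).split(".", 1)[0]


def filter_il(lines):
    """Apply the most aggressive filter to every non-empty IL line."""
    return [opcode_only(line) for line in lines if line.strip()]


sample = [
    "IL_0011: add",
    "IL_0012: stloc.0",
    "IL_0013: br.s IL_0020",
    "IL_001a: call void [mscorlib]System.Console::WriteLine (string)",
]
print(filter_il(sample))  # ['add', 'stloc', 'br', 'call']
```

Keeping each filter as a separate function makes it possible to switch filters on and off individually, which is what a per-filter contribution study would need.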
Clone Detection across Languages using IL: Filtering Advantage (Recall Improvement)
For the VB.NET and C# example above:
• Before filtering noise: ~50% similarity
• After filtering noise: ~90% similarity
Disadvantage of Noise Reduction: Danger!
• Data loss
• What if we remove important data during data cleansing?
• Filtering might mislead the detection by making non-cloned pairs identical
Possible negative effect on precision
[Figure: Filtering Color Data]
RQ: Are They (Filters) Dangerous? Evaluation Preparation
1. Filter contribution formula: (formula not reproduced in this transcript)
2. Dataset preparation:
   – Controlled dataset (iText.NET J#), 25 pairs × 3 languages
     1. The cloned dataset (VB-C#, VB-J#, and C#-J#)
     2. The non-cloned dataset (VB-C#, VB-J#, and C#-J#)
RQ: Are They (Filters) Dangerous? Filter Contribution, Study #1
• Are they harmful? The answer is NO. Based on the graphs, the filters do not remove a similar amount of data from actual clones and from non-cloned code fragments.
[Figure: filter contribution for the cloned dataset vs. the non-cloned dataset, with the values 0.3 and 0.2 marked and the annotation "A strong threshold for the judge to decide"]
RQ: Are They (Filters) Dangerous? Filter Contribution, Study #2
• Are they useful? The answer is YES. Based on the given figure, our filters help to discriminate between actual clones and non-cloned fragments, so it is possible to separate them with high confidence using the chosen threshold.
RQ: Are They (Filters) Dangerous? Filter Contribution, Study #3
• Does filtering make actual clone pairs and non-cloned pairs similar? We used Chernoff faces (glyphs) to see whether the filters make non-cloned pairs look like cloned code. Each face represents a pair. As you can see, the faces in Group A differ from those in Group B in most cases.
Final conclusion: the filters help to discriminate between cloned and non-cloned fragments.
An Interesting Unexpected Discovery: Language Dependency!
Corresponding faces in each group are not similar, even though all of them are extracted from a single language (IL). Look especially at the C#-J# faces: all of them differ from the other groups. This is an interesting discovery: the original high-level programming language affects similarity at the IL level.
Clone Detection across Languages using IL: Our Clone Detection Framework
[Figure: framework pipeline]
• Input: .NET Code (Source Code → MS .NET → EXE & DLL → IlDasm.exe → CIL (plain text))
• CIL Manipulation for Clone Detection: Proposed Filtering Mechanism
• Clone Detection Algorithms: SimHash-based (from SimCad), LCS-based (from NiCad), Levenshtein Distance-based
• Clone Analysis: Clone Clusters, Merging, Source Code Mapping
• Reporting: Report (CIL), Report (Src Code)
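One of the comparison options named in the framework is Levenshtein distance. The sketch below shows how two filtered IL fragments could be scored, assuming each fragment is a list of opcode tokens and that similarity is normalized as 1 - distance / max(length). That normalization, the function names `levenshtein` and `similarity`, and the two opcode sequences are assumptions for illustration and are not taken from the paper.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical filtered opcode sequences for two fragments.
vb_ops = ["ldc", "stloc", "ldloc", "ldc", "bge", "ldloc", "ldc",
          "add", "stloc", "br", "ldstr", "call", "ret"]
cs_ops = ["ldc", "stloc", "ldloc", "ldc", "bge", "ldloc", "ldc",
          "add", "stloc", "br", "ldstr", "call", "nop", "ret"]

score = similarity(vb_ops, cs_ops)
# 0.7 is the value the slides label "High"; treating it as a lower bound
# on similarity is an assumption of this sketch.
is_candidate = score >= 0.7
print(round(score, 3), is_candidate)
```

Comparing token lists rather than raw text means one extra instruction costs one edit regardless of how long its operand text was before filtering.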
The Selected Datasets for Performance Evaluation
System       Language   Files   LOC      Methods
ASXGUI 2.5   VB.NET     47      32,594   303
ASXGUI 3.0   C#         19      2,088    78

System       Language   Files   LOC      Methods
Mono 2.10    VB.NET     375     -        -
Mono 2.10    C#         57      -        -
Total                   432     -        4998

System       Language   Files   LOC      Methods
iText        C#         -       -        -
iText.NET    J#         -       -        -
Total                   2.5 K   600 K

4th dataset: the iText.NET dataset from the 1st case study. We used part of the iText.NET library to create our last dataset. This dataset contains source code related to iText.NET API usage written in three languages (C#, J#, and VB.NET). This feature makes the dataset an important resource for our study, since it allowed us to create a small (75 clone pairs) but controlled dataset (i.e., all actual cross-language clones are aligned, tagged, and known), creating a unique oracle for further analysis. We use this oracle to obtain precise recall and precision measures, since the number of actual clones is known. This is in contrast to the other datasets, where recall and precision cannot be computed as precisely, since the actual number of clone pairs is unknown.
Clone Detection across Languages using IL: Our Clone Detection Framework Performance
Pay attention to the changes within 0.6 … 0.8
Clone Detection across Languages using IL: Our Clone Detection Framework
• 2K clone pairs manually investigated
• Thresholds: 0.6 Normal, 0.7 High, 0.8 Extreme
• Precision: the optimum, considering the trade-off between precision and recall, was achieved using Levenshtein Distance-based comparison with the High threshold (80% TP)
• Recall (iText.NET API): 76% using the High threshold across the three languages (C#, J#, and VB.NET)
TP = {E and S}
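With a controlled oracle such as the iText.NET dataset, precision and recall follow from simple set arithmetic. The sketch below uses made-up pair identifiers purely for illustration; the function name `precision_recall` is hypothetical and none of the numbers are results from the study.

```python
def precision_recall(reported, oracle):
    """Return (precision, recall) of reported clone pairs against known true pairs."""
    reported, oracle = set(reported), set(oracle)
    true_positives = reported & oracle
    precision = len(true_positives) / len(reported) if reported else 0.0
    recall = len(true_positives) / len(oracle) if oracle else 0.0
    return precision, recall


# Hypothetical example: each pair is (fragment in language A, fragment in language B).
oracle = {("vb:m1", "cs:m1"), ("vb:m2", "cs:m2"),
          ("vb:m3", "cs:m3"), ("vb:m4", "cs:m4")}
reported = {("vb:m1", "cs:m1"), ("vb:m2", "cs:m2"), ("vb:m3", "cs:m9")}

p, r = precision_recall(reported, oracle)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

Recall needs the full set of true pairs, which is why it can only be measured precisely on a dataset where all actual clones are known.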
An Interesting Clone Detected by Our Approach
C#:
    private static string filename_nodir(string name)
    {
        int slash = -1, len = name.Length;
        for (int i = 0; i < len; i++)
        {
            string sub = name.Substring(i, 1);
            if (sub == "\\" || sub == "/") slash = i;
        }
        slash++;
        return name.Substring(slash, len - slash);
    }
VB.NET:
    Function Filename_Nodir() As String
        Dim intFileName As Integer, intSlash As Integer, strFilename As String
        strFileName = editvid.video
        For intFilename = 1 To len(strFileName)
            If mid(strfilename, intfilename, 1) = "\" Or mid(strfilename, intfilename, 1) = "/" Then
                intslash = intFilename
            End If
        Next
        Return mid(strFileName, intSlash + 1, len(strFilename) - intSlash)
    End Function
* The matching algorithm was limited to the content of the method bodies shown in the boxes on the slide (it was NOT aware that the method names are the same).
Summary
• The first comprehensive research focusing on (1) .NET clone detection, (2) across programming languages, and (3) using Intermediate Language
• Identified challenges in cross-language clone detection using IL
[Figure: the clone detection framework pipeline shown earlier]
Related Publication
Iman Keivanloo, Chanchal K. Roy, Juergen Rilling, "Java Bytecode Clone Detection via Relaxation on Code Fingerprint and Semantic Web Reasoning," 6th International Workshop on Software Clones (IWSC), 2012.
Any questions?
Contact: [email protected]