Similarity of Source Code in the Presence of Pervasive Modifications [SCAM'16]
Similarity of Source Code in the Presence of Pervasive Modifications
Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark
Centre for Research on Evolution, Search and Testing (CREST), Dept. of Computer Science, UCL, London, UK
Similarity of Source Code in the Presence of Pervasive Modifications — C. Ragkhitwetsagul, J. Krinke, D. Clark — CREST, UCL, UK
Pervasive Modifications
/* ORIGINAL */
private static int partition(Comparable[] a, int lo, int hi) {
    int i = lo;
    int j = hi+1;
    Comparable v = a[lo];
    while (true) {
        while (less(a[++i], v)) {
            if (i == hi) break;
        }
        while (less(v, a[--j])) {
            if (j == lo) break;
        }
        if (i >= j) break;
        exch(a, i, j);
    }
    exch(a, lo, j);
    return j;
}
/* PERVASIVELY MODIFIED CODE */
private static int partition(int[] bob, int left, int right) {
    int x = left;
    int y = right+1;
    for (;;) {
        while (less(bob[left], bob[--y]))
            if (y == left) break;
        while (less(bob[++x], bob[left]))
            if (x == right) break;
        if (x >= y) break;
        swap(bob, y, x);
    }
    swap(bob, y, left);
    return y;
}
From: https://www.princeton.edu/pr/pub/integrity/pages/plagiarism/
Pervasive Modifications
Changes affecting many locations in the whole method, file, or project
Examples: layout changes, identifier renaming, API changes, refactoring
Occur in code cloning, software plagiarism, and software evolution
Do not include (strong) code obfuscation
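To make the effect of identifier renaming concrete, here is a minimal Python sketch that compares two simplified fragments modelled on the partition example. The fragments, the keyword list, and the `normalise` helper are illustrative assumptions and not part of the study; `difflib` is used because it is one of the general-purpose measures evaluated later in the talk.

```python
import difflib
import re

# Assumption: a tiny keyword list, just enough for the fragments below.
JAVA_KEYWORDS = {"int", "if", "break", "return", "while", "for"}

def tokens(code):
    """Split code into identifiers/keywords, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def normalise(code):
    """Replace every non-keyword identifier with the placeholder ID."""
    return ["ID" if re.match(r"[A-Za-z_]", t) and t not in JAVA_KEYWORDS else t
            for t in tokens(code)]

def ratio(a, b):
    """Similarity in [0, 1] between two token sequences."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

# Simplified fragments (hypothetical): same structure, different names.
original = "int i = lo; int j = hi + 1; if (i >= j) break; exch(a, i, j);"
renamed = "int x = left; int y = right + 1; if (x >= y) break; swap(bob, x, y);"

raw_similarity = ratio(tokens(original), tokens(renamed))
normalised_similarity = ratio(normalise(original), normalise(renamed))
print(f"raw tokens: {raw_similarity:.2f}, "
      f"identifiers normalised: {normalised_similarity:.2f}")
```

Renaming alone lowers the raw token similarity, while the identifier-normalised sequences are identical. This is one reason tools that understand source code structure may cope better with such changes than plain text measures.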
When source code is pervasively modified, which similarity detection techniques or tools get the most accurate results?
30 Similarity Analysers
Clone detectors: CCFinderX, iClones, Simian, NiCad, Deckard
Plagiarism detectors: JPlag, Plaggie, Sherlock, Sim
Compression: 7zncd, bzip2ncd, gzipncd, xz-ncd, icd, ncd
Others: diff, bsdiff, difflib, fuzzywuzzy, jellyfish, ngram, sklearn
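The compression-based analysers build on the normalised compression distance (NCD). A minimal sketch of the standard NCD formula using Python's `bz2` module follows. It is not the implementation of any tool listed above, and the two generated inputs are made up for illustration.

```python
import bz2

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length in bytes."""
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical inputs: two unrelated blocks of Java-like text.
code_a = "\n".join(
    f"int total{i} = total{i - 1} + {i * 7};" for i in range(1, 200)
).encode()
code_b = "\n".join(
    f'String label{i} = "{i:x}" + name.charAt({i % 13});' for i in range(1, 200)
).encode()

print(f"NCD(a, a) = {ncd(code_a, code_a):.3f}")  # small: nothing new to encode
print(f"NCD(a, b) = {ncd(code_a, code_b):.3f}")  # larger: little shared content
```

A lower distance means the two inputs share more information, so a file compared with itself scores lower than a file compared with unrelated code.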
Test Data Generation
Original files: InfixConverter.java, SqrtAlgorithm.java, Hanoi.java, Queens.java, MagicSquare.java
Pipeline: original -> source obfuscator (ARTIFICE) -> compiler (javac) -> bytecode obfuscator (ProGuard) -> decompilers (Krakatau, Procyon) -> pervasively modified code
The pervasively modified code is used in the detection phase.
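The variants produced by this pipeline can be enumerated with a short Python sketch. The variant names follow the row labels of the similarity report. Reading `no` as "no bytecode obfuscation" and `pg` as "ProGuard" is an assumption, as is the `variants` helper itself.

```python
from itertools import product

programs = ["InfixConverter", "SqrtAlgorithm", "Hanoi", "Queens", "MagicSquare"]
source_stage = ["orig", "artifice"]      # original or ARTIFICE-obfuscated source
bytecode_stage = ["no", "pg"]            # assumed: no obfuscation or ProGuard
decompilers = ["krakatau", "procyon"]

def variants(program):
    """All variants of one program: 2 source-level files plus
    2 x 2 x 2 compiled-and-decompiled files."""
    names = [f"{program}/{s}" for s in source_stage]
    names += [f"{program}/{s}_{b}_{d}"
              for s, b, d in product(source_stage, bytecode_stage, decompilers)]
    return names

all_variants = [v for p in programs for v in variants(p)]
print(len(variants("Hanoi")), len(all_variants))
print(variants("Hanoi")[:4])
```

Under these assumptions each program yields 10 files, so the five programs give 50 files to compare pairwise.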
Parameter Settings
Similarity Report
Pairwise similarity scores (0 to 100). Columns follow the same order as the rows.

InfConv/orig                   100  55  36  63  32  43  34  60  31  43  20  20   …  14  17
InfConv/artifice                55 100  35  54  33  39  37  56  32  39  19  30   …  14  17
InfConv/orig_no_krakatau        36  35 100  38  60  26  80  35  59  26  13  14   …  28  17
InfConv/orig_no_procyon         63  54  38 100  34  58  37  80  34  58  21  20   …  15  21
InfConv/orig_pg_krakatau        32  33  60  34 100  33  61  33  82  33  17  17   …  29  20
InfConv/orig_pg_procyon         43  39  26  58  33 100  26  59  33 100  19  20   …  14  21
InfConv/artifice_no_krakatau    34  37  80  37  61  26 100  36  59  26  14  14   …  28  17
InfConv/artifice_no_procyon     60  56  35  80  33  59  36 100  32  59  19  20   …  15  19
InfConv/artifice_pg_krakatau    31  32  59  34  82  33  59  32 100  33  15  16   …  28  17
InfConv/artifice_pg_procyon     43  39  26  58  33 100  26  59  33 100  19  20   …  14  21
Sqrt/orig                       20  19  13  21  17  19  14  19  15  19 100  32   …  14  16
Sqrt/artifice                   20  30  14  20  17  20  14  20  16  20  32 100   …  15  18
…                                …   …   …   …   …   …   …   …   …   …   …   …   …   …   …
Square/artifice_pg_krakatau     14  14  28  15  29  14  28  15  28  14  14  15   … 100  32
Square/artifice_pg_procyon      17  17  17  21  20  21  17  19  17  21  16  18   …  32 100
Similarity Threshold = 50
(The similarity report from the previous slide is shown again; each pair is classified as similar or dissimilar by comparing its score against the threshold of 50.)
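Applying a threshold turns the similarity report into a binary classification. The Python sketch below uses six scores copied from the report, for four of the files. Two choices are assumptions and are not stated on the slides: a pair counts as "similar" when its score is strictly greater than the threshold, and two files count as truly similar when they derive from the same original program.

```python
# Excerpt of the similarity report (unordered pairs, self-pairs omitted).
scores = {
    ("InfConv/orig", "InfConv/artifice"): 55,
    ("InfConv/orig", "Sqrt/orig"): 20,
    ("InfConv/orig", "Sqrt/artifice"): 20,
    ("InfConv/artifice", "Sqrt/orig"): 19,
    ("InfConv/artifice", "Sqrt/artifice"): 30,
    ("Sqrt/orig", "Sqrt/artifice"): 32,
}

def same_program(a, b):
    """Assumed ground truth: files sharing the prefix before '/' are similar."""
    return a.split("/")[0] == b.split("/")[0]

def confusion(scores, threshold):
    """Count true/false positives and negatives at a given threshold."""
    tp = fp = fn = tn = 0
    for (a, b), score in scores.items():
        predicted = score > threshold      # assumption: strict comparison
        actual = same_program(a, b)
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion(scores, threshold=50)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, precision, recall)
```

On this excerpt a threshold of 50 reports no false positives but misses the Sqrt pair (score 32), which shows why the threshold has to be tuned.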
Best Threshold
[Plot: F-measure (y-axis, 0.00 to 0.90) against threshold value T (x-axis, 0 to 100). The best threshold is T = 31, where F-measure = 0.8282.]
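The search for the best threshold can be sketched as a sweep over T that keeps the value with the highest F-measure. The scores are taken from the similarity report for four files (InfConv and Sqrt, original and ARTIFICE versions). The strict comparison, the ground truth (same program means similar), and the tie-breaking rule (smallest T wins) are assumptions.

```python
# Scores of pairs from the same program (assumed true positives) ...
positives = [55, 32]
# ... and of pairs from different programs (assumed true negatives).
negatives = [20, 20, 19, 30]

def f_measure(threshold):
    """F1 = 2TP / (2TP + FP + FN) when 'similar' means score > threshold."""
    tp = sum(score > threshold for score in positives)
    fp = sum(score > threshold for score in negatives)
    fn = len(positives) - tp
    return 2 * tp / (2 * tp + fp + fn)

# max() returns the first maximum, so ties go to the smallest threshold.
best_threshold = max(range(0, 101), key=f_measure)
print(best_threshold, f_measure(best_threshold))
```

The excerpt is far too small to be meaningful, and the value it yields is unrelated to the T = 31 reported on the slide. The sketch only illustrates the procedure.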
Optimal Configuration
Best Threshold
Best Parameter Settings
Results
Tool         Settings     T  Acc     Prec    Rec     AUC     Prec@n  F1
ccfx         b=20,t=1     4  0.9640  0.9145  0.9040  0.9468  0.9040  0.9095
simjava      r=22         5  0.9568  0.8769  0.9120  0.9490  0.8840  0.8941
jplag-text   t=8          2  0.9408  0.8235  0.8960  0.9453  0.8440  0.8582
py-difflib   noautojunk  35  0.9392  0.8901  0.7940  0.9147  0.8080  0.8393
7zncd-BZip2  mx=1        39  0.9368  0.8977  0.7720  0.9419  0.8180  0.8301
ncd-bzlib                31  0.9336  0.8584  0.8000  0.9482  0.8200  0.8282
jplag-java   t=3         43  0.9160  0.7526  0.8640  0.9667  0.7860  0.8045
py-sklearn               33  0.8488  0.5894  0.8040  0.9146  0.6200  0.6802
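The F1 column is the harmonic mean of precision and recall. As a quick check, this Python sketch recomputes F1 for three rows of the table. The values are copied from the table; the `f1` helper is illustrative. Because the inputs are rounded to four decimals, the recomputed value may differ slightly in the last digit.

```python
# (precision, recall, reported F1) copied from the results table.
rows = {
    "simjava": (0.8769, 0.9120, 0.8941),
    "ncd-bzlib": (0.8584, 0.8000, 0.8282),
    "py-sklearn": (0.5894, 0.8040, 0.6802),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for tool, (precision, recall, reported) in rows.items():
    print(f"{tool}: computed {f1(precision, recall):.4f}, reported {reported:.4f}")
```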
[Bar chart: F1 score (x-axis, 0.5 to 1) for each tool, grouped by category.
Clone det.: ccfx, deckard, iclones, nicad, simian
Plag det.: jplag-java, jplag-text, plaggie, sherlock, simjava, simtext
Comp.: 7zncd-BZip2, 7zncd-LZMA, 7zncd-LZMA2, 7zncd-Deflate, 7zncd-Deflate64, 7zncd-PPMd, bzip2ncd, gzipncd, icd, ncd-bzlib, ncd-zlib, xz-ncd
Others: bsdiff, diff, py-difflib, py-fuzzywuzzy, py-jellyfish, py-ngram, py-sklearn]
Highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures.
Normalisation by Decompilation
Normalisation: pervasively modified code -> compile (javac) -> decompile (Krakatau, Procyon) -> normalised code
Code Before Decompilation
Code After Decompilation
[Bar chart: F1 score (x-axis, 0.5 to 1) for each of the 30 tools, grouped by category (Clone det., Plag det., Comp., Others), comparing results on the original code (Orig.) with results on the decompiled code (Dec.).]
Compilation and decompilation can be used as an effective normalisation method that greatly improves similarity detection on Java source code.
Compilation and decompilation can be used as an effective normalisation method that greatly improves similarity detection on Java source code.
Highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures.
Similarity of Source Code in the Presence of Pervasive Modifications
Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark — CREST, UCL
More info: http://crest.cs.ucl.ac.uk/resources/cloplag/