Similarity of Source Code in the Presence of Pervasive Modifications [SCAM'16]

Similarity of Source Code in the Presence of Pervasive Modifications Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark Centre for Research on Evolution, Search and Testing (CREST) Dept. of Computer Science, UCL, London, UK

Upload: chaiyong-ragkhitwetsagul

Post on 13-Apr-2017


Similarity of Source Code in the Presence of Pervasive Modifications

Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark

Centre for Research on Evolution, Search and Testing (CREST) Dept. of Computer Science, UCL, London, UK

Similarity of Source Code in the Presence of Pervasive Modifications — C. Ragkhitwetsagul, J. Krinke, D. Clark — CREST, UCL, UK

Pervasive Modifications


/* ORIGINAL */
private static int partition (Comparable[] a, int lo, int hi) {
    int i = lo;
    int j = hi+1;
    Comparable v = a[lo];
    while (true) {
        while (less(a[++i], v)) { if (i == hi) break; }
        while (less(v, a[--j])) { if (j == lo) break; }
        if (i >= j) break;
        exch(a, i, j);
    }
    exch(a, lo, j);
    return j;
}

/* PERVASIVELY MODIFIED CODE */
private static int partition (int[] bob, int left, int right) {
    int x = left;
    int y = right+1;
    for (;;) {
        while (less(bob[left],bob[--y])) if (y == left) break;
        while (less(bob[++x],bob[left])) if (x == right) break;
        if (x >= y) break;
        swap(bob, y, x);
    }
    swap(bob, y, left);
    return y;
}

From: https://www.princeton.edu/pr/pub/integrity/pages/plagiarism/
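As a rough illustration of how a general textual measure reacts to such changes, the sketch below scores the bodies of the two listings above with Python's difflib, one of the libraries evaluated later in the talk. The two fragment strings, the helper name text_similarity, and the 0 to 100 scaling are assumptions made for this example; it is not the authors' experimental setup.

```python
from difflib import SequenceMatcher

# Method bodies of the two listings above, copied for illustration.
ORIGINAL = """int i = lo;
int j = hi+1;
Comparable v = a[lo];
while (true) {
    while (less(a[++i], v)) { if (i == hi) break; }
    while (less(v, a[--j])) { if (j == lo) break; }
    if (i >= j) break;
    exch(a, i, j);
}
exch(a, lo, j);
return j;"""

MODIFIED = """int x = left;
int y = right+1;
for (;;) {
    while (less(bob[left],bob[--y])) if (y == left) break;
    while (less(bob[++x],bob[left])) if (x == right) break;
    if (x >= y) break;
    swap(bob, y, x);
}
swap(bob, y, left);
return y;"""


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity on a 0-100 scale (the scaling is an assumption)."""
    # autojunk=False switches off difflib's heuristic that ignores very frequent items.
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


score = text_similarity(ORIGINAL, MODIFIED)
print(f"similarity: {score:.1f}")
```

Identical inputs score 100; renaming and reordering pull the score down even though the behaviour of the method is unchanged.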

Pervasive Modifications

Changes affecting many locations in the whole method, file, or project

Examples: layout changes, identifier renaming, API changes, refactoring

Code cloning, software plagiarism, software evolution

But do not include (strong) code obfuscation


When source code is pervasively modified, which similarity detection techniques or tools get the most accurate results?

30 Similarity Analysers

Clone detectors: CCFinderX, iClones, Simian, NiCad, Deckard

Plagiarism detectors: JPlag, Plaggie, Sherlock, Sim

Compression: 7zncd, bzip2ncd, gzipncd, xz-ncd, icd, ncd

Others: diff, bsdiff, difflib, fuzzywuzzy, jellyfish, ngram, sklearn
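The compression-based tools in the list build on the normalised compression distance (NCD). A minimal sketch of that measure, using Python's bz2 module as the compressor, might look as follows. The function names and the mapping from distance to a 0 to 100 similarity are assumptions for illustration, not the implementation of any tool named above.

```python
import bz2


def compressed_size(data: bytes) -> int:
    """Length in bytes of the bzip2-compressed input."""
    return len(bz2.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_similarity(x: bytes, y: bytes) -> float:
    """Map the distance to a 0-100 similarity (this mapping is an assumption)."""
    distance = min(1.0, max(0.0, ncd(x, y)))  # real compressors can overshoot [0, 1]
    return 100.0 * (1.0 - distance)


if __name__ == "__main__":
    a = b"int i = lo; int j = hi + 1; " * 20
    b = b"System.out.println(args.length); " * 20
    print(ncd_similarity(a, a), ncd_similarity(a, b))
```

The idea is that if two files share content, compressing them together costs little more than compressing the larger one alone.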

Test Data Generation

Original files: InfixConverter.java, SqrtAlgorithm.java, Hanoi.java, Queens.java, MagicSquare.java

Source obfuscator: ARTIFICE

Compiler: javac

Bytecode obfuscator: ProGuard

Decompilers: Krakatau, Procyon

Output: pervasively modified code, to be used in detection phase
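The combinations of these stages can be enumerated as in the sketch below. The label scheme follows the file names that appear in the similarity report (for example orig_no_krakatau); reading "no" as "no bytecode obfuscation" and "pg" as "ProGuard" is an interpretation of those names, and the function name variants is hypothetical.

```python
from itertools import product

ORIGINALS = ["InfixConverter", "SqrtAlgorithm", "Hanoi", "Queens", "MagicSquare"]
SOURCE_STAGES = ["orig", "artifice"]    # without / with source obfuscation
BYTECODE_STAGES = ["no", "pg"]          # assumed: without / with ProGuard
DECOMPILERS = ["krakatau", "procyon"]


def variants(name: str) -> list:
    """All labelled versions of one original file."""
    # Two versions that are never compiled: the original and the source-obfuscated one.
    labels = [f"{name}/{stage}" for stage in SOURCE_STAGES]
    # Eight versions that are compiled, optionally obfuscated, then decompiled.
    labels += [
        f"{name}/{stage}_{bytecode}_{decompiler}"
        for stage, bytecode, decompiler in product(
            SOURCE_STAGES, BYTECODE_STAGES, DECOMPILERS
        )
    ]
    return labels


all_files = [label for name in ORIGINALS for label in variants(name)]
print(len(variants("Hanoi")), len(all_files))
```

Under this reading each original yields ten versions, so the five originals give fifty files to compare.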

Parameter Settings

Similarity Report

Pairwise similarity scores (0 to 100). Rows and columns list the same files in the same order; column numbers refer to the numbered rows.

  #  File                              1    2    3    4    5    6    7    8    9   10   11   12    …  n-1    n
  1  InfConv/orig                    100   55   36   63   32   43   34   60   31   43   20   20    …   14   17
  2  InfConv/artifice                 55  100   35   54   33   39   37   56   32   39   19   30    …   14   17
  3  InfConv/orig_no_krakatau         36   35  100   38   60   26   80   35   59   26   13   14    …   28   17
  4  InfConv/orig_no_procyon          63   54   38  100   34   58   37   80   34   58   21   20    …   15   21
  5  InfConv/orig_pg_krakatau         32   33   60   34  100   33   61   33   82   33   17   17    …   29   20
  6  InfConv/orig_pg_procyon          43   39   26   58   33  100   26   59   33  100   19   20    …   14   21
  7  InfConv/artifice_no_krakatau     34   37   80   37   61   26  100   36   59   26   14   14    …   28   17
  8  InfConv/artifice_no_procyon      60   56   35   80   33   59   36  100   32   59   19   20    …   15   19
  9  InfConv/artifice_pg_krakatau     31   32   59   34   82   33   59   32  100   33   15   16    …   28   17
 10  InfConv/artifice_pg_procyon      43   39   26   58   33  100   26   59   33  100   19   20    …   14   21
 11  Sqrt/orig                        20   19   13   21   17   19   14   19   15   19  100   32    …   14   16
 12  Sqrt/artifice                    20   30   14   20   17   20   14   20   16   20   32  100    …   15   18
  …  …                                 …    …    …    …    …    …    …    …    …    …    …    …    …    …    …
n-1  Square/artifice_pg_krakatau      14   14   28   15   29   14   28   15   28   14   14   15    …  100   32
  n  Square/artifice_pg_procyon       17   17   17   21   20   21   17   19   17   21   16   18    …   32  100

Similarity Threshold = 50

(The similarity matrix from the Similarity Report slide is repeated here, read against a threshold of 50.)
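Applying a threshold turns the similarity matrix into a yes/no decision for each pair of files. The sketch below does this for four files taken from the report (InfConv/orig, InfConv/artifice, Sqrt/orig, Sqrt/artifice). It assumes that a pair counts as similar when its score is at or above the threshold, that two files are truly similar when they derive from the same original, and that each unordered pair is counted once; these are simplifying assumptions for illustration.

```python
from itertools import combinations

FILES = ["InfConv/orig", "InfConv/artifice", "Sqrt/orig", "Sqrt/artifice"]

# Scores copied from the corresponding rows and columns of the report.
SCORES = [
    [100, 55, 20, 20],
    [55, 100, 19, 30],
    [20, 19, 100, 32],
    [20, 30, 32, 100],
]


def origin(label: str) -> str:
    """Name of the original program a file was derived from."""
    return label.split("/")[0]


def confusion(threshold: int) -> tuple:
    """Return (TP, FP, FN, TN) over all unordered pairs of distinct files."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(FILES)), 2):
        predicted = SCORES[i][j] >= threshold          # assumed decision rule
        actual = origin(FILES[i]) == origin(FILES[j])  # assumed ground truth
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


print(confusion(50))  # (1, 0, 1, 4)
```

At a threshold of 50 the two InfConv files are matched, while the two Sqrt files (score 32) are missed.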

Best Threshold

[Plot: F-measure (y-axis, 0.00 to 0.90) against threshold value T (x-axis, 0 to 100). The best threshold is marked at T = 31, where F-measure = 0.8282.]
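The search for the best threshold can be sketched as a sweep over T = 0 to 100 that keeps the value with the highest F-measure. The example reuses four files from the similarity report and assumes the same decision rule (score at or above T) and ground truth (same original program) as a simplification; with only four files the result says nothing about the full data set.

```python
from itertools import combinations

LABELS = ["InfConv/orig", "InfConv/artifice", "Sqrt/orig", "Sqrt/artifice"]
SCORES = [
    [100, 55, 20, 20],
    [55, 100, 19, 30],
    [20, 19, 100, 32],
    [20, 30, 32, 100],
]


def f_measure(threshold: int) -> float:
    """F-measure of the thresholded matrix over unordered pairs of distinct files."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(LABELS)), 2):
        predicted = SCORES[i][j] >= threshold
        actual = LABELS[i].split("/")[0] == LABELS[j].split("/")[0]
        tp += predicted and actual
        fp += predicted and not actual
        fn += actual and not predicted
    # F = 2TP / (2TP + FP + FN); defined as 0 when nothing is correctly matched.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# max() returns the lowest threshold among those with the highest F-measure.
best_threshold = max(range(101), key=f_measure)
print(best_threshold, f_measure(best_threshold))
```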

Optimal Configuration

Best Threshold, Best Parameter Settings

Results

Tool         Settings     T   Acc     Prec    Rec     AUC     Prec@n  F1
ccfx         b=20,t=1     4   0.9640  0.9145  0.9040  0.9468  0.9040  0.9095
simjava      r=22         5   0.9568  0.8769  0.9120  0.9490  0.8840  0.8941
jplag-text   t=8          2   0.9408  0.8235  0.8960  0.9453  0.8440  0.8582
py-difflib   noautojunk  35   0.9392  0.8901  0.7940  0.9147  0.8080  0.8393
7zncd-BZip2  mx=1        39   0.9368  0.8977  0.7720  0.9419  0.8180  0.8301
ncd-bzlib    -           31   0.9336  0.8584  0.8000  0.9482  0.8200  0.8282
jplag-java   t=3         43   0.9160  0.7526  0.8640  0.9667  0.7860  0.8045
py-sklearn   -           33   0.8488  0.5894  0.8040  0.9146  0.6200  0.6802
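The F1 column combines the Prec and Rec columns. Assuming the usual definition of F1 as the harmonic mean of precision and recall, the relationship can be checked against a row of the table; the function name f1_score is chosen for this example only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ncd-bzlib row: Prec = 0.8584, Rec = 0.8000
print(round(f1_score(0.8584, 0.8000), 4))  # 0.8282
```

Because the table's inputs are themselves rounded to four decimals, recomputed values can differ from the printed F1 in the last digit.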

[Bar chart: F1 score (axis from 0.5 to 1) for each tool, grouped by category.]

Clone det.: ccfx, deckard, iclones, nicad, simian

Plag det.: jplag-java, jplag-text, plaggie, sherlock, simjava, simtext

Comp.: 7zncd-BZip2, 7zncd-LZMA, 7zncd-LZMA2, 7zncd-Deflate, 7zncd-Deflate64, 7zncd-PPMd, bzip2ncd, gzipncd, icd, ncd-bzlib, ncd-zlib, xz-ncd

Others: bsdiff, diff, py-difflib, py-fuzzywuzzy, py-jellyfish, py-ngram, py-sklearn


Highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures.

Normalisation by Decompilation

Normalisation: pervasively modified code -> Compile (javac) -> Decompile (Krakatau, Procyon) -> normalised code
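The compile-then-decompile step can be sketched as a small driver, shown below. Only the javac -d option is taken as given; how a particular decompiler is invoked is deliberately left to a caller-supplied function (decompile_command, a hypothetical parameter), because the exact command lines of Krakatau and Procyon are not stated in the slides. The directory names are also assumptions.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence


def compile_command(source: Path, class_dir: Path) -> List[str]:
    """javac command that writes class files into class_dir."""
    return ["javac", "-d", str(class_dir), str(source)]


def normalise(
    source: Path,
    work_dir: Path,
    decompile_command: Callable[[Path, Path], Sequence[str]],
    run=subprocess.run,
) -> Path:
    """Compile source, then decompile the class files into a fresh directory.

    decompile_command(class_dir, out_dir) must return the command line of the
    chosen decompiler; it is supplied by the caller (hypothetical hook).
    """
    class_dir = work_dir / "classes"
    out_dir = work_dir / "normalised"
    class_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    run(compile_command(source, class_dir), check=True)
    run(list(decompile_command(class_dir, out_dir)), check=True)
    return out_dir
```

Both versions of a file pair would be passed through the same driver before their similarity is measured, so that differences in layout and naming introduced before compilation are reduced.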

Code Before Decompilation

Code After Decompilation

[Bar chart: F1 score (axis from 0.5 to 1) for the same 30 tools, grouped as clone detectors, plagiarism detectors, compression-based tools, and others, comparing original code (Orig.) with decompiled code (Dec.).]


Compilation and decompilation can be used as an effective normalisation method that greatly improves similarity detection on Java source code.


Highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures.

Similarity of Source Code in the Presence of Pervasive Modifications

Chaiyong Ragkhitwetsagul, Jens Krinke, David Clark — CREST, UCL

More info: http://crest.cs.ucl.ac.uk/resources/cloplag/