Detecting Source Code Plagiarism with CodeMatch
Bob Zeidman
Zeidman Consulting
Agenda
Source code plagiarism
Previous tools
CodeMatch
– Statement Matching
– Comment Matching
– Identifier Matching
– Partial Identifier Matching
– Instruction Sequence Matching
Conclusion
Source Code Plagiarism
Entities
– Universities
– Corporations
Reasons
– Internet
– Search engines
– Open source movements
– Mobile employees
Plague: Algorithm
Geoff Whale, University of New South Wales
Three phases:
– Create a sequence of tokens and a list of metrics to describe each program.
– Compare the structure metrics of files to find similar code structures.
– Compare token sequences within similar source code structures.
Plague: Example
if (x == 5)
{
    // Loop on j here
    for (j = 0; j < Index; j++)
        printf("x = %i", j);
}
else
    while (i < 5) i++;

CONDITIONAL_BEGIN
LOOP_BEGIN
DISPLAY
LOOP_END
CONDITIONAL_END
CONDITIONAL_BEGIN
LOOP_BEGIN
ARITHMETIC
LOOP_END
CONDITIONAL_END
Plague: Problems
Hard to adapt to new programming languages.
The output needs interpretation.
Uses slow UNIX shell tools for processing.
Vulnerable to changing the order of code lines in the source code.
Throws out useful information when it discards comments, variable names, function names, and other identifiers.
YAP, YAP2, YAP3: Algorithm
Michael Wise, University of Sydney, Australia
Two phases:
– Remove whitespace, comments, and identifier names, replace language statements with tokens.
– Compare pairs of token files.
JPlag: Algorithm
Lutz Prechelt and Guido Malpohl, University Karlsruhe
Michael Philippsen, University of Erlangen-Nuremberg
Phases:
– Remove whitespace, comments, and identifier names, replace language statements with tokens.
– Compare tokens in different files.
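The tokenize-then-compare idea shared by YAP and JPlag can be sketched as follows. This is only an illustration, not either tool's implementation: the token names (`ID`, `NUM`), the tiny keyword subset, and the use of Python's `difflib.SequenceMatcher` as a stand-in for the tools' own comparison algorithms are all assumptions, and string literals are not handled.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately incomplete subset of C keywords.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.S)
LEXEME = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Drop comments and whitespace, then replace identifiers and numbers with generic tokens."""
    tokens = []
    for lexeme in LEXEME.findall(COMMENT.sub("", code)):
        if lexeme in KEYWORDS:
            tokens.append(lexeme.upper())
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")    # identifier names are thrown away
        elif lexeme[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)  # punctuation is kept one character at a time
    return tokens


def similarity(code_a, code_b):
    """Ratio from 0.0 to 1.0 of how much of the two token sequences lines up."""
    matcher = SequenceMatcher(None, tokenize(code_a), tokenize(code_b), autojunk=False)
    return matcher.ratio()


original = "for (j = 0; j < Index; j++)"
renamed = "for (k = 0; k < n; k++) // count up"
print(similarity(original, renamed))  # 1.0
```

Because identifiers and comments are discarded before comparison, renaming variables does not lower the score, but any evidence carried by those names and comments is lost as well.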
YAP, JPlag: Problems
To decrease the run time, uses hashing and only considers matches of strings of a minimal length.
Tokens are still dependent on knowledge of the programming language.
Although less so than Plague, still vulnerable to changing the order of code lines.
Throws out useful information when it discards comments, variable names, function names, and other identifiers.
MOSS: Algorithm
Alex Aiken, Stanford University
Phases:
– Remove all whitespace and punctuation, convert characters to lower case.
– Divide remaining characters into k-grams, which are contiguous substrings of length k, by sliding a window of size k through the file.
– Hash each k-gram and select a subset of all k-grams to be the fingerprints of the file.
– Compare file fingerprints to find similar files.
MOSS: Example
Some text:
She loves you yeah, yeah, yeah.

5-grams:
shelo helov elove loves ovesy vesyo esyou syouy youye ouyea uyeah yeahy eahye ahyea hyeah yeahy eahye ahyea hyeah

Hypothetical hash:
77 72 42 17 98 50 23 55 6 66 34 24 39 11 84 24 39 11 84

Fingerprint:
72 24 84 24 84
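The k-gram fingerprinting steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not MOSS itself: the function names are hypothetical, CRC-32 is an arbitrary choice of hash, and the selection rule (keep hashes divisible by 4) is assumed only because it happens to reproduce the fingerprint shown in the example.

```python
import re
import zlib


def kgrams(text, k=5):
    """Lower-case the text, keep only letters and digits, and slide a window of size k."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return [cleaned[i:i + k] for i in range(len(cleaned) - k + 1)]


def select_fingerprints(hashes, modulus=4):
    """Keep a subset of the hashes (assumed rule: divisible by `modulus`)."""
    return [h for h in hashes if h % modulus == 0]


def fingerprint(text, k=5, modulus=4):
    """Hash every k-gram with CRC-32 (an arbitrary choice) and select a subset."""
    hashes = [zlib.crc32(gram.encode("utf-8")) for gram in kgrams(text, k)]
    return select_fingerprints(hashes, modulus)


def similarity(text_a, text_b, k=5, modulus=4):
    """Jaccard overlap of the two fingerprint sets, from 0.0 to 1.0."""
    a = set(fingerprint(text_a, k, modulus))
    b = set(fingerprint(text_b, k, modulus))
    return len(a & b) / len(a | b) if a | b else 0.0


grams = kgrams("She loves you yeah, yeah, yeah.")
print(len(grams), grams[:4])  # 19 ['shelo', 'helov', 'elove', 'loves']

# The hypothetical hash values from the example, one per 5-gram.
SLIDE_HASHES = [77, 72, 42, 17, 98, 50, 23, 55, 6, 66, 34, 24, 39, 11, 84, 24, 39, 11, 84]
print(select_fingerprints(SLIDE_HASHES))  # [72, 24, 84, 24, 84]
```

Only 5 of the 19 hashes survive selection, which is where the speed comes from and also why some matches can be missed.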
MOSS: Problems
Structural information is lost (e.g., whitespace, punctuation, uppercase characters, non-alphanumeric symbols).
Larger k-grams decrease execution time, but decrease sensitivity.
Most k-grams are also thrown out for faster processing, reducing accuracy.
CodeMatch: Algorithms
Statement Matching
Comment Matching
Identifier Matching
Partial Identifier Matching
Instruction Sequence Matching
CodeMatch: Statement Matching
File 1                              File 2
 1 /* begin routine */               1 /* find the file extension */
 2 void fdiv(                        2 void file_divide(
 3 char *fname, // file name         3 char *fname,
 4 char *path) /* path */            4 char *path)
 5 {                                 5 {
 6 int Index1, j;                    6 int i, j; // begin routine
 7                                   7 while (1) // loop here
 8 while (1)                         8 j = strlen(fname);
 9 j = strlen(fname);                9
10 // find the file extension       10
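One plausible reading of statement matching, judging from the two files above, is: strip comments, normalize whitespace, and report the source lines that appear in both files regardless of position. The sketch below works under that assumption; the function names are hypothetical, the comment regex is naive (it ignores string literals), and skipping punctuation-only lines such as `{` is an illustrative choice rather than documented CodeMatch behavior.

```python
import re

COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.S)

FILE_1 = """\
/* begin routine */
void fdiv(
char *fname, // file name
char *path) /* path */
{
int Index1, j;

while (1)
j = strlen(fname);
// find the file extension
"""

FILE_2 = """\
/* find the file extension */
void file_divide(
char *fname,
char *path)
{
int i, j; // begin routine
while (1) // loop here
j = strlen(fname);
"""


def statements(source):
    """Return the set of comment-free, whitespace-normalized lines that contain a word."""
    code = COMMENT.sub("", source)
    lines = (" ".join(line.split()) for line in code.splitlines())
    return {line for line in lines if re.search(r"\w", line)}


def matching_statements(source_a, source_b):
    """Statements present in both files, independent of line order."""
    return statements(source_a) & statements(source_b)


for statement in sorted(matching_statements(FILE_1, FILE_2)):
    print(statement)
```

Because the comparison is set-based, shuffling lines within a file does not hide a match, which addresses the reordering weakness noted for the earlier tools.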
CodeMatch: Comment Matching
File 1                              File 2
 1 /* begin routine */               1 /* find the file extension */
 2 void fdiv(                        2 void file_divide(
 3 char *fname, // file name         3 char *fname,
 4 char *path) /* path */            4 char *path)
 5 {                                 5 {
 6 int Index1, j;                    6 int i, j; // begin routine
 7                                   7 while (1) // loop here
 8 while (1)                         8 j = strlen(fname);
 9 j = strlen(fname);                9
10 // find the file extension       10
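Comment matching can be sketched the same way, this time keeping only the comments. The assumption here is that a comment matches when its text is identical after whitespace normalization, whatever its position or delimiter style; the function names are hypothetical and the regex does not handle comment markers inside string literals.

```python
import re

# Group 1 captures /* block */ comments, group 2 captures // line comments.
COMMENT = re.compile(r"/\*(.*?)\*/|//([^\n]*)", re.S)

FILE_1 = """\
/* begin routine */
void fdiv(
char *fname, // file name
char *path) /* path */
{
int Index1, j;

while (1)
j = strlen(fname);
// find the file extension
"""

FILE_2 = """\
/* find the file extension */
void file_divide(
char *fname,
char *path)
{
int i, j; // begin routine
while (1) // loop here
j = strlen(fname);
"""


def comments(source):
    """Return the set of comment texts with delimiters removed and whitespace normalized."""
    found = set()
    for block, line in COMMENT.findall(source):
        text = " ".join((block or line).split())
        if text:
            found.add(text)
    return found


def matching_comments(source_a, source_b):
    """Comment texts that appear in both files."""
    return comments(source_a) & comments(source_b)


print(sorted(matching_comments(FILE_1, FILE_2)))
# ['begin routine', 'find the file extension']
```

In this pair the two shared comments sit on different lines and use different delimiters, yet both are still reported.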
CodeMatch: Identifier Matching
Counts the number of matching words that are not programming language keywords.
Requires a list of keywords to exclude.
Matching numerals are given less weight than matching alphabetic words.
Finds matching identifiers – routines, variables, constants, etc.
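A minimal sketch of identifier matching under these rules follows. The keyword list is a hypothetical, deliberately incomplete subset of C, and the weights (1.0 for alphabetic words, 0.5 for numerals) are assumptions chosen only to show numerals counting for less; the actual weights are not given here.

```python
import re

# Hypothetical, deliberately incomplete keyword list for C.
C_KEYWORDS = {"char", "else", "for", "if", "int", "return", "void", "while"}
WORD = re.compile(r"[A-Za-z_]\w*|\d+")


def identifiers(code, keywords=C_KEYWORDS):
    """All distinct words and numerals in the code that are not keywords."""
    return {word for word in WORD.findall(code) if word not in keywords}


def identifier_score(code_a, code_b, numeral_weight=0.5):
    """Weighted count of identifiers shared by both pieces of code."""
    common = identifiers(code_a) & identifiers(code_b)
    return sum(numeral_weight if word.isdigit() else 1.0 for word in common)


CODE_A = "int Index1, j; while (1) j = strlen(fname);"
CODE_B = "int i, j; while (1) j = strlen(fname);"

print(sorted(identifiers(CODE_A) & identifiers(CODE_B)))  # ['1', 'fname', 'j', 'strlen']
print(identifier_score(CODE_A, CODE_B))                   # 3.5
```

The keywords `int` and `while` appear in both snippets but contribute nothing, since they would match in almost any pair of C files.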
CodeMatch: Partial Identifier Matching
Counts the number of partially matching words that are not programming language keywords.
Requires a list of keywords to exclude.
Matching numerals are given less weight than matching alphabetic words.
Finds disguised identifiers – routines, variables, constants, etc.
For example, abc partially matches abc1 and xxxabc.
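The `abc` example suggests a containment test: two identifiers partially match when one is a proper substring of the other. The sketch below assumes exactly that; the `min_len` cutoff, which keeps one-letter names such as `j` from matching everything, is a hypothetical parameter and not something stated above.

```python
def partially_matches(a, b, min_len=2):
    """True when one identifier is contained in the other but they are not identical."""
    if a == b:
        return False  # an exact match is an identifier match, not a partial one
    shorter, longer = sorted((a, b), key=len)
    return len(shorter) >= min_len and shorter in longer


def partial_matches(ids_a, ids_b, min_len=2):
    """All (identifier from A, identifier from B) pairs that partially match."""
    return {(a, b) for a in ids_a for b in ids_b if partially_matches(a, b, min_len)}


print(partially_matches("abc", "abc1"))    # True
print(partially_matches("abc", "xxxabc"))  # True
print(partially_matches("abc", "abc"))     # False
```

A substring test catches identifiers disguised by adding a prefix or suffix; it would not catch edits inside the name, such as `abc` changed to `axbc`.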
CodeMatch: Instruction Sequence Matching
File 1                              File 2
 1 /* begin routine */               1 /* find the file extension */
 2 void fdiv(                        2 void file_divide(
 3 char *fname, // file name         3 char *fname,
 4 char *path) /* path */            4 char *path)
 5 {                                 5 {
 6 int Index1, j;                    6 int i, j; // begin routine
 7                                   7 while (1) // loop here
 8 while (1)                         8 j = strlen(fname);
 9 j = strlen(fname);                9
10 // find the file extension       10
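Instruction sequence matching is not spelled out here, so the following is a sketch under a stated assumption: the "instruction" of a line is taken to be its first word once comments are removed, lines with no word (blank, comment-only, or punctuation-only) are skipped, and the result is the longest run of consecutive instructions common to both files. `difflib.SequenceMatcher` is a stand-in for whatever CodeMatch does internally, and all function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.S)

FILE_1 = "\n".join([
    "/* begin routine */", "void fdiv(", "char *fname, // file name",
    "char *path) /* path */", "{", "int Index1, j;", "",
    "while (1)", "j = strlen(fname);", "// find the file extension",
])
FILE_2 = "\n".join([
    "/* find the file extension */", "void file_divide(", "char *fname,",
    "char *path)", "{", "int i, j; // begin routine",
    "while (1) // loop here", "j = strlen(fname);",
])


def instructions(source):
    """Return (line number, first word) for each line that still has a word without comments."""
    # Replace each comment with the newlines it spanned so line numbers stay correct.
    code = COMMENT.sub(lambda m: "\n" * m.group(0).count("\n"), source)
    result = []
    for number, line in enumerate(code.splitlines(), start=1):
        word = re.search(r"\w+", line)
        if word:
            result.append((number, word.group(0)))
    return result


def longest_sequence(source_a, source_b):
    """Return (file A line, file B line, length) of the longest common instruction run."""
    a, b = instructions(source_a), instructions(source_b)
    words_a = [word for _, word in a]
    words_b = [word for _, word in b]
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    match = matcher.find_longest_match(0, len(words_a), 0, len(words_b))
    if match.size == 0:
        return None
    return a[match.a][0], b[match.b][0], match.size


print(longest_sequence(FILE_1, FILE_2))  # (2, 2, 6)
```

Under this reading the two routines share a run of six instructions starting at line 2 of each file (`void`, `char`, `char`, `int`, `while`, `j`), even though the identifiers and comments differ.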
CodeMatch: Total Match Score
$t = k_w w + k_p p + k_s s + k_c c + k_q q$
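The total score is a weighted sum of the five individual results. In the sketch below, the mapping of symbols to algorithms is an assumption based on the report section names (w for matching words, p for partial words, s for source lines, c for comment lines, q for the longest matching sequence), and the default weights of 1.0 are placeholders, since the actual k values are not given.

```python
def total_match_score(w, p, s, c, q, k_w=1.0, k_p=1.0, k_s=1.0, k_c=1.0, k_q=1.0):
    """Weighted sum t = k_w*w + k_p*p + k_s*s + k_c*c + k_q*q.

    Assumed meaning of the inputs: w = matching words, p = matching partial
    words, s = matching source lines, c = matching comment lines,
    q = length of the longest matching sequence.
    """
    return k_w * w + k_p * p + k_s * s + k_c * c + k_q * q


# Equal weights: the score is just the sum of the five counts.
print(total_match_score(w=3, p=2, s=3, c=2, q=3))  # 13.0

# Hypothetical weighting that halves partial words and doubles source lines.
print(total_match_score(w=3, p=2, s=3, c=2, q=3, k_p=0.5, k_s=2.0))  # 15.0
```

Keeping the weights as parameters makes it easy to see how much each algorithm contributes to the ranking of a given file pair.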
CodeMatch Basic Report
Comparing files in folder D:\CodeMatch\Code Development\test\C test 2\files 1
To files in folder D:\CodeMatch\Code Development\test\C test 2\files 2

D:\CodeMatch\Code Development\test\C test 2\files 1\bpf_dump.c
Match Score  Compared To File
2910         D:\CodeMatch\Code Development\test\C test 2\files 2\bpf_dump.c
374          D:\CodeMatch\Code Development\test\C test 2\files 2\W32NReg.c
374          D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (variable names changed).c
374          D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (no comments).c

D:\CodeMatch\Code Development\test\C test 2\files 1\bpf_filter.c
Match Score  Compared To File
606          D:\CodeMatch\Code Development\test\C test 2\files 2\W32NReg.c
606          D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (no comments).c
572          D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (variable names changed).c
398          D:\CodeMatch\Code Development\test\C test 2\files 2\bpf_dump.c
CodeMatch Detailed Report
Comparing file 1: D:\CodeMatch\\test\C test 2\files 1\bpf_dump.c
To file 2: D:\CodeMatch\C test 2\files 2\test\W32Nreg.c

Matching source lines:
File 1 Line #  File 2 Line #  Source line
21             1              #include <windows.h>
22             3              #include <stdio.h>
24             7              #include "WiNDIS.h"

Matching comment lines:
File 1 Line #  File 2 Line #  Comment line
3              3              * The Regents of the University of California. All rights reserved.
10             5              * Redistribution and use in source and binary forms, with or without

Longest matching semantic sequence:
File 1 Line #  File 2 Line #  Number of matching lines
21             1              3

Matching words:
stdio WiNDIS windows

Matching partial words:
0x windows
Competitive Evaluation: Test
GNU C compiler GCC version 3.3.2
– Less than 100 lines
– Between 100 and 1000 lines
– Greater than 1000 lines
Modify files
– Remove all comments
– Rename all identifiers
– Rearrange routines within the file
– Rearrange lines of code within routines in the file
– Do all of the above
– Remove all the code but leave the comments
– Create one file that has exactly one routine from each of the other files in the same category
Competitive Evaluation: Accuracy
Program      Copied files found
CodeMatch    95% (200 of 210)
JPlag        80% (169 of 210)
MOSS         70% (146 of 210)
Conclusion
Previous tools miss important matches
CodeMatch
– Statement Matching
– Comment Matching
– Identifier Matching
– Partial Identifier Matching
– Instruction Sequence Matching
More accurate than other tools
Download CodeMatch For Free
www.ZeidmanConsulting.com