CMCD: Count Matrix based Code Clone Detection

Yang Yuan and Yao Guo
Key Laboratory of High-Confidence Software Technologies (Ministry of Education)
Peking University

Code Clones

• In software development, it is common to reuse some code fragments by copying with or without minor modifications.

• Such code fragments are called code clones. [Jurgens et al., ICSE 2009]

Scenario-based Evaluation

[Figures: "Original" vs. "Copy" code examples for Scenarios #1, #2, #3, and #4]

Importance of Code Clones

• Code clones cause problems:
  – Increase the complexity of the source code
  – Increase the maintenance cost of the software system
  – Increase the likelihood of bugs

• 7%-23% of the code in large software systems is cloned. [Roy et al., SCP 2009]

• Detecting code clones may help:
  – Analyze the programming habits of the programmers
  – Find the design patterns in the source code

Previous Work in Clone Detection

• Lower level:
  – Textual approach
    • SDD [Lee and Jeong, OOPSLA 2005]
    • NICAD [Roy and Cordy, ICPC 2008]
    • ...
  – Lexical approach
    • DUP [Baker, WCRE 1995]
    • CCFinder [Kamiya et al., TSE 2002]
    • CP-Miner [Li et al., OSDI 2004, TSE 2006]
    • ...

• Higher level:
  – Syntactic approach
    • CloneDR [Baxter et al., ICSM 1998]
    • Deckard [Jiang et al., ICSE 2007]
    • CloneDigger [Bulychev, SyRCoSE 2008]
    • ...
  – Semantic approach
    • Duplix [Krinke, WCRE 2001]
    • GPLAG [Liu et al., KDD 2006]
    • ...

Challenges

Low-level approaches
• Faster
• Usually focus on local characteristics
• No idea about global meaning

High-level approaches
• Slower
• Better understanding of the programs
• Difficult to scale

GAP between the low-level and high-level approaches

Our Idea

• A novel count matrix based clone detection approach.

• Benefits of counting
  – By ignoring the order of variables, it can identify clones with swapped statements, which is difficult for both lexical and syntactic approaches.
  – Easy to calculate and implement
    • Reduces space and time complexity
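
The order-insensitivity of counting can be illustrated with a minimal sketch. It assumes a deliberately simplified count (the number of occurrences of each variable) rather than the full count vectors used by CMCD; the class name `SwapInvariance` and the token lists are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: per-variable occurrence counts do not change
// when two statements are swapped.
final class SwapInvariance {
    // Counts how often each identifier-like token occurs.
    static Map<String, Integer> occurrenceCounts(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : tokens) {
            if (!t.isEmpty() && Character.isJavaIdentifierStart(t.charAt(0))) {
                counts.merge(t, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // x = a + b ; y = a * 2 ;
        List<String> original = Arrays.asList(
            "x", "=", "a", "+", "b", ";", "y", "=", "a", "*", "2", ";");
        // y = a * 2 ; x = a + b ;  (the two statements swapped)
        List<String> swapped = Arrays.asList(
            "y", "=", "a", "*", "2", ";", "x", "=", "a", "+", "b", ";");
        // prints: true
        System.out.println(occurrenceCounts(original).equals(occurrenceCounts(swapped)));
    }
}
```

A token-sequence comparison would see two different sequences here, while the counts are identical.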

Count Matrix Construction

Token Sequence → Count Vector → Count Matrix

Token sequence:
tot,=,n,+,Find,(,n,),for,i,=,1,to,n,-,1,if,a,[,i,],>,a,[,j,],,k,=,a,[,i,]….

Count vectors (one per variable), stacked as the rows of the count matrix:

tot  1 0 0 … 0
i    3 0 0 … 2
j    1 0 0 … 1
a    3 0 0 … 3
n    2 1 0 … 0
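
The token-sequence-to-matrix pipeline above can be sketched as follows. The slides do not list the counting conditions, so the three used here (total occurrences, occurrences directly followed by `=`, occurrences inside array subscripts) are illustrative assumptions, as are the class name `CountMatrixSketch` and the sample tokens.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of count-matrix construction; the three counting
// conditions below are illustrative assumptions, not CMCD's actual set.
final class CountMatrixSketch {
    static final int TOTAL = 0;        // every occurrence of the variable
    static final int ASSIGNED = 1;     // occurrences directly followed by "="
    static final int IN_SUBSCRIPT = 2; // occurrences between "[" and "]"

    // Returns one count vector per variable; together the rows form the matrix.
    static Map<String, int[]> build(List<String> tokens, Set<String> variables) {
        Map<String, int[]> matrix = new LinkedHashMap<>();
        int bracketDepth = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("[")) { bracketDepth++; continue; }
            if (t.equals("]")) { bracketDepth--; continue; }
            if (!variables.contains(t)) continue;
            int[] cv = matrix.computeIfAbsent(t, key -> new int[3]);
            cv[TOTAL]++;
            if (i + 1 < tokens.size() && tokens.get(i + 1).equals("=")) cv[ASSIGNED]++;
            if (bracketDepth > 0) cv[IN_SUBSCRIPT]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        // k = a [ i ] ; a [ j ] = k ;
        List<String> tokens = Arrays.asList(
            "k", "=", "a", "[", "i", "]", ";", "a", "[", "j", "]", "=", "k", ";");
        Set<String> variables = new HashSet<>(Arrays.asList("k", "a", "i", "j"));
        build(tokens, variables)
            .forEach((name, cv) -> System.out.println(name + " " + Arrays.toString(cv)));
    }
}
```

Each row is independent of where in the method the variable appears, which is what lets the later comparison ignore statement order.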

Comparison Algorithms

• Goal:
  – Find more scenario #4 clones with more transformations, such as statement swapping
  – Run fast

• General principles:
  – Compare individual variables instead of variable sequences
  – Ignore variable order in the count matrix

Bipartite Graph Matching

• Use bipartite graph matching to find code clones at different granularities:
  – Bottom-up approach
    • Can be used to compute the similarity between two projects, two classes, or two methods
  – Use two kinds of bipartite graph matching
    • KM algorithm (low level, slow, accurate)
    • Hungarian algorithm (high level, fast, less accurate)
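
As a rough illustration of the fast, unweighted variant, the sketch below pairs the count vectors of two methods with augmenting-path maximum bipartite matching. The edge rule (Euclidean distance at most `maxDistance`), the similarity score (matched pairs divided by the larger variable count), and the class name `CvMatching` are assumptions made for this example; the weighted KM variant is not shown.

```java
import java.util.Arrays;

// Hypothetical sketch: match the count vectors of two methods with
// augmenting-path maximum bipartite matching (unweighted variant).
// All count vectors are assumed to have the same length.
final class CvMatching {
    // Two count vectors are connected when their Euclidean distance is small enough.
    static boolean compatible(int[] a, int[] b, double maxDistance) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum) <= maxDistance;
    }

    // Tries to find a partner for left vertex u, re-routing earlier matches if needed.
    private static boolean augment(int u, boolean[][] edge, boolean[] seen, int[] matchOfRight) {
        for (int v = 0; v < matchOfRight.length; v++) {
            if (!edge[u][v] || seen[v]) continue;
            seen[v] = true;
            if (matchOfRight[v] < 0 || augment(matchOfRight[v], edge, seen, matchOfRight)) {
                matchOfRight[v] = u;
                return true;
            }
        }
        return false;
    }

    // Returns the fraction of variables that could be paired, in [0, 1].
    static double similarity(int[][] left, int[][] right, double maxDistance) {
        if (left.length == 0 || right.length == 0) return 0.0;
        boolean[][] edge = new boolean[left.length][right.length];
        for (int u = 0; u < left.length; u++) {
            for (int v = 0; v < right.length; v++) {
                edge[u][v] = compatible(left[u], right[v], maxDistance);
            }
        }
        int[] matchOfRight = new int[right.length];
        Arrays.fill(matchOfRight, -1);
        int matched = 0;
        for (int u = 0; u < left.length; u++) {
            if (augment(u, edge, new boolean[right.length], matchOfRight)) matched++;
        }
        return (double) matched / Math.max(left.length, right.length);
    }
}
```

Because the matching pairs rows regardless of their position, two methods whose variables are declared in a different order still score as similar. The same routine could be reused one level up, with methods as vertices and method similarity deciding the edges.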

Optimization

• Use the Euclidean metric to compute the similarity of CVs

• Use a quick rejection algorithm to improve speed

• Eliminate false positives:
  – Cut and check
  – Slice and match
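
The first two optimizations can be sketched as follows. The slides do not specify the quick rejection algorithm, so the norm-based test below is only one possible choice, justified by the reverse triangle inequality; the class name `CvDistance` and the `maxDistance` threshold are hypothetical.

```java
// Hypothetical sketch: Euclidean comparison of two count vectors (CVs)
// with a norm-based quick rejection. Not the authors' algorithm.
final class CvDistance {
    // Euclidean length of a count vector.
    static double norm(int[] v) {
        double sum = 0;
        for (int x : v) sum += (double) x * x;
        return Math.sqrt(sum);
    }

    // Euclidean distance between two count vectors of equal length.
    static double distance(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Reverse triangle inequality: | ||a|| - ||b|| | <= ||a - b||.
    // If the norms already differ by more than maxDistance, the full
    // distance cannot be within maxDistance, so the pair is rejected early.
    static boolean quickReject(double normA, double normB, double maxDistance) {
        return Math.abs(normA - normB) > maxDistance;
    }

    // True when the two vectors are within maxDistance of each other.
    static boolean similar(int[] a, int[] b, double maxDistance) {
        if (quickReject(norm(a), norm(b), maxDistance)) return false;
        return distance(a, b) <= maxDistance;
    }
}
```

In a real detector the norms would be computed once per vector and cached, so that most dissimilar pairs are discarded with a single subtraction.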

Implementation

• Use Soot to convert Java to Jimple [Vallee-Rai et al., CASCON 1999]
  – 3-address intermediate representation
  – Smaller language set
  – Breaks complex statements into basic ones
  – Does not change the meaning of the program

• A new version of CMCD without using Soot

Overview

[Figure: overview diagram of CMCD]

Performance Comparison to Deckard

[Figure: compare time in seconds (log scale, 0.1 to 10000) versus similarity, with x-axis ticks 1.0(1.0), 0.95(0.9999), 0.9(0.999), 0.85(0.99), 0.8(0.95). Series: Stage1, Stage2, Stage3, Stage2+key, Stage3+key, Deckard. Data labels as extracted: 833565 571 636 2274]

Scenario-based Evaluation

Based on the scenario classification in the paper by Roy et al., "Comparison and Evaluation of Code Clone Detection Techniques"

Detecting Plagiarism

• Student-submitted compiler lab projects
  – 29 submissions
  – 106-251 Java classes
  – 7,825-38,086 lines of code

• Experimental results
  – Running time: 123 minutes
  – 2 clusters of code clones, each with 3 copies
  – Confirmed
  – Now used by two courses at Peking University to detect plagiarism in students' homework

Analyzing JDK 1.6 Source Code

• JDK 1.6.0_18
  – 7,197 files
  – 2,079,166 LoC

• Experimental results
  – Running time: 163 minutes
  – Found: 786 methods in 174 clusters (small methods are omitted)

Code Comparison: Two Clones

Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)

    public static SyncFactory getSyncFactory() {
        if (syncFactory == null) {
            synchronized (SyncFactory.class) {
                if (syncFactory == null) {
                    syncFactory = new SyncFactory();
                } //end if
            } //end synchronized block
        } //end if
        return syncFactory;
    }

Method 2: (in javax.swing.JComponent)

    static Set<KeyStroke> getManagingFocusBackwardTraversalKeys() {
        synchronized (JComponent.class) {
            if (managingFocusBackwardTraversalKeys == null) {
                managingFocusBackwardTraversalKeys = new HashSet<KeyStroke>(1);
                managingFocusBackwardTraversalKeys.add(
                    KeyStroke.getKeyStroke(KeyEvent.VK_TAB,
                        InputEvent.SHIFT_MASK | InputEvent.CTRL_MASK));
            }
        }
        return managingFocusBackwardTraversalKeys;
    }

Detected a Bug

Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)

    public static SyncFactory getSyncFactory() {
        if (syncFactory == null) {
            synchronized (SyncFactory.class) {
                if (syncFactory == null) {
                    syncFactory = new SyncFactory();
                } //end if
            } //end synchronized block
        } //end if
        return syncFactory;
    }

Method 3: (in com.sun.corba.se.impl.ior.iiop.JavaSerializationComponent)

    public static JavaSerializationComponent singleton() {
        if (singleton == null) {
            synchronized (JavaSerializationComponent.class) {
                singleton = new JavaSerializationComponent(Message.JAVA_ENC_VERSION);
            }
        }
        return singleton;
    }

Method 3 lacks the second null check inside the synchronized block, so two threads that both pass the outer check can each create an instance.

Bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6999537
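
Method 3 assigns `singleton` inside the synchronized block without re-checking it for null. One way to repair that pattern is to restore the inner check and declare the field `volatile`, which double-checked locking needs under the Java memory model from Java 5 onward. The sketch below uses a hypothetical class `LazySingleton`; it is not the patch applied to the JDK.

```java
// Hypothetical sketch of correct double-checked locking (Java 5+).
final class LazySingleton {
    // volatile ensures a fully constructed instance is visible to other threads.
    private static volatile LazySingleton instance;

    private LazySingleton() { }

    static LazySingleton getInstance() {
        LazySingleton local = instance;      // one volatile read on the fast path
        if (local == null) {
            synchronized (LazySingleton.class) {
                local = instance;            // re-check: another thread may have won the race
                if (local == null) {
                    local = new LazySingleton();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

The local variable is an optional refinement that avoids a second volatile read once the instance exists; the inner null check is the part that Method 3 is missing.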

Conclusion

• We propose CMCD, a code clone detection approach:
  – Extracts count-based information
  – Language independent
  – Scales to large programs (> 1M LoC)

• Capabilities
  – Performs well in scenario-based evaluation
  – Detects code plagiarism in students' homework
  – Identifies a potential bug in JDK source code