cross language clone analysis team 2 october 27, 2010

Presentation 5: Cross Language Clone Analysis

Team 2
October 27, 2010

• Current Tasks
• GOLD Parsing System
• Grammar Update
• Clone Analysis
• Demonstration
• Team Collaboration
• Path Forward

Agenda

• Allen Tucker
• Patricia Bradford
• Greg Rodgers
• Brian Bentley
• Ashley Chafin

Our Team

Current Tasks: What we are tackling…

Current tasks created for the first user story “Source Code Load & Translate”:
◦ Load & parse C# source code.
◦ Load & parse Java source code.
◦ Load & parse C++ source code.
◦ Translate the parsed C# source code to CodeDOM.
◦ Translate the parsed Java source code to CodeDOM.
◦ Translate the parsed C++ source code to CodeDOM.
◦ Associate the CodeDOM to the original source

Current Tasks (Review)

UML Model – Load & Parse

UML Model – Translate

UML Model – Associate

GOLD Parsing System: GOLD Parsing Populating CodeDOM

Topics To Discuss
• What are we doing?
• Compiled Grammar Table
• Bookkeeping
• Testing

How It Works (Block Structure)

[Diagram: Grammar Builder → Compiled Grammar Table (*.cgt) → Engine; Source Code → Engine → Parsed]

How It Works (Process)

[Diagram: Grammar Builder → Compiled Grammar Table (*.cgt) → Engine; Source Code → Engine → Parsed]

Typical output from the engine: a long nested tree

Usage within CloneDigger

[Diagram: Compiled Grammar Table (*.cgt) → Engine; Source Code → Engine → Parsed → CodeDOM Conversion]

CodeDOM Conversion
• Need to write a routine to move data from the parsed tree to CodeDOM.
• Parsed data trees from the parser are stored in a consistent data structure, but are based on the rules defined within the grammars.
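The conversion routine described above can be sketched as a walk over the parse tree that dispatches on the production rule of each node. This is only an illustrative sketch in Python, not the team's routine: the `ParseNode` and `ModelNode` types, the rule names such as `<Class Decl>`, and the handler functions are all hypothetical stand-ins for the GOLD engine output and the CodeDOM types.

```python
from dataclasses import dataclass, field


@dataclass
class ParseNode:
    """Hypothetical parse-tree node: a rule (or terminal) name plus children."""
    rule: str
    children: list = field(default_factory=list)
    text: str = ""  # only meaningful for terminals


@dataclass
class ModelNode:
    """Hypothetical stand-in for a language-independent model element."""
    kind: str
    name: str = ""
    members: list = field(default_factory=list)


def convert(node, handlers):
    """Return the model nodes produced by this subtree."""
    handler = handlers.get(node.rule)
    if handler is None:
        # Rule not handled yet: look through it and keep whatever
        # the children produce.
        produced = []
        for child in node.children:
            produced.extend(convert(child, handlers))
        return produced
    return [handler(node, handlers)]


def first_identifier(node):
    return next(c.text for c in node.children if c.rule == "Identifier")


def class_decl(node, handlers):
    members = []
    for child in node.children:
        if child.rule != "Identifier":
            members.extend(convert(child, handlers))
    return ModelNode("type", first_identifier(node), members)


def method_decl(node, handlers):
    return ModelNode("method", first_identifier(node))


# One entry per production rule that has been implemented so far.
HANDLERS = {"<Class Decl>": class_decl, "<Method Decl>": method_decl}

tree = ParseNode("<Compilation Unit>", [
    ParseNode("<Class Decl>", [
        ParseNode("Identifier", text="Foo"),
        ParseNode("<Class Body>", [
            ParseNode("<Method Decl>", [ParseNode("Identifier", text="bar")]),
        ]),
    ]),
])
model = convert(tree, HANDLERS)
```

Because the handler table is keyed by rule name, the same walker can serve each grammar; only the table changes per language.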

For Java, there are…
◦ 359 production rules
◦ 249 distinct symbols (terminal & non-terminal)

For C#, there are…
◦ 415 production rules
◦ 279 distinct symbols (terminal & non-terminal)

Compiled Grammar Table

Production Rule Dependencies

Since there are so many production rules, we came up with the following bookkeeping:

A spreadsheet of the compiled grammar table (for each language) with each production rule indexed.
◦ This spreadsheet covers:
  ▪ various aspects of the language
  ▪ what we have/have not handled from the parser
  ▪ what we have/have not implemented into CodeDOM
  ▪ percentage complete
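As a rough illustration of the bookkeeping, each spreadsheet row can be modeled as a record with two flags, and the percentage complete computed from them. The field names below are assumptions, and counting a rule as complete only when both flags are set is one possible definition, not necessarily the one used in the spreadsheet.

```python
from dataclasses import dataclass


@dataclass
class RuleStatus:
    index: int          # index of the production rule in the grammar table
    rule: str           # rule text (hypothetical examples below)
    parsed: bool        # handled from the parser
    in_codedom: bool    # implemented into CodeDOM


def percent_complete(rows):
    """Share of rules that are both parsed and implemented, in percent."""
    if not rows:
        return 0.0
    done = sum(1 for r in rows if r.parsed and r.in_codedom)
    return 100.0 * done / len(rows)


rows = [
    RuleStatus(0, "<Class Decl>", True, True),
    RuleStatus(1, "<Method Decl>", True, False),
    RuleStatus(2, "<Field Decl>", False, False),
    RuleStatus(3, "<Statement>", True, False),
]
progress = percent_complete(rows)
```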

Our Grammar Bookkeeping

White Box Testing:
◦ Unit Testing

Black Box Testing:
◦ Production Rule Testing
  ▪ Allows us to test the robustness of our engine because we can force rule production errors.

Regression Testing: Automated

Testing

Unit Testing

Production Rule Test Input File Example

Task Understanding: Three Step Process
• Step 1: Code Translation
• Step 2: Clone Detection
• Step 3: Visualization

[Diagram: Source Files → Translator → Common Model → Inspector → Detected Clones → UI → Clone Visualization]

Grammar Updates: Java & C#

Grammar Updates
• Currently the grammars we have for the GOLD parser are outdated.

• Current GOLD grammars
◦ C# version 2.0
◦ Java version 1.4

• Current available software versions
◦ C# version 4.0
◦ Java version 6

Grammar Updates
• Available updated grammars
◦ ANTLR has grammars updated to more recent versions of both C# and Java.
◦ C# version 4.0 (latest version)
◦ Java version 1.5 (second to latest version)

• Currently we are attempting to transform the ANTLR grammars into GOLD Parser grammars.

Grammar Update Issues
• Grammars for C# and Java are very complex and require a lot of work to build.
• ANTLR and GOLD Parser grammars use completely different syntax.
• Positive note: Other development is not halted by the use of older grammars.

Clone Analysis: Overview and Dr. Kraft’s Students’ Tool

Software Clones: (Definitions from Wikipedia)

◦ Duplicate code: a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity.

◦ Clones: sequences of duplicate code.

“Clones are segments of code that are similar according to some definition of similarity.”

—Ira Baxter, 2002

Software Clones

How clones are created:◦ copy and paste programming

◦ similar functionality, similar code

◦ plagiarism

Software Clones (cont.)

3 Types of Clones:
◦ Type 1: an exact copy without modifications (except for whitespace and comments).
◦ Type 2: a syntactically identical copy; only variable, type, or function identifiers have been changed.
◦ Type 3: a copy with further modifications; statements have been changed, added, or removed.
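To make the three types concrete, here is a small sketch in Python that classifies a pair of token sequences. It assumes a hypothetical tokenizer that has already dropped whitespace and comments and that labels each token with a kind (`"id"` for identifiers). It does not check that renamings are consistent, and it makes no attempt to tell a Type 3 clone from unrelated code.

```python
def classify(a, b):
    """Classify two token sequences, each a list of (kind, text) pairs."""
    if a == b:
        # Identical once whitespace and comments are gone.
        return "type1"
    same_shape = len(a) == len(b) and all(
        ka == kb and (ka == "id" or ta == tb)
        for (ka, ta), (kb, tb) in zip(a, b)
    )
    if same_shape:
        # Only identifier texts differ.
        return "type2"
    # Statements changed, added, or removed, or simply unrelated code.
    return "type3-or-unrelated"


original = [("kw", "int"), ("id", "x"), ("op", "="), ("lit", "1")]
renamed = [("kw", "int"), ("id", "y"), ("op", "="), ("lit", "1")]
extended = renamed + [("op", ";")]
```

Here `classify(original, original)` gives `"type1"`, `classify(original, renamed)` gives `"type2"`, and `classify(original, extended)` falls through to the last case.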

Software Clones (cont.)

Per our task, in order to find clones across different programming languages, we will have to first convert the code from each language over to a language independent object model.

Some Language Independent Object Models:
◦ Dagstuhl Middle Metamodel (DMM)
◦ Microsoft CodeDOM

Both of these models provide a language independent object model for representing the structure of source code.

Introduction (cont.)

Detecting clones across multiple programming languages is on the cutting edge of research.

A preliminary version of this was done by Dr. Kraft and his students for C# and VB.
◦ They compared the Mono C# parser (written in C#) to the Mono VB parser (written in VB).
◦ Publication: Nicholas A. Kraft, Brandon W. Bonds, Randy K. Smith: Cross-language Clone Detection. SEKE 2008: 54-59

Related Research

• Token sequence of CodeDOM graphs with Levenshtein distance
◦ The Levenshtein distance between two sequences is defined as the minimum number of single-element edits (insertions, deletions, or substitutions) needed to transform one sequence into the other

• Performs comparisons of code files
• CodeDOM tree is tokenized
• Based on distances
◦ Percentage of matching tokens in a sequence
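The distance computation can be sketched with the standard dynamic-programming formulation, applied to token lists rather than characters. The token names and the `similarity` formula (one minus the distance divided by the longer length) are assumptions for illustration; the slides do not give the exact percentage calculation used by the tool.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed similarity score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical token streams for two tokenized CodeDOM trees.
file_a = ["class", "ID", "{", "method", "ID", "}"]
file_b = ["class", "ID", "{", "field", "ID", "method", "ID", "}"]
distance = levenshtein(file_a, file_b)
score = similarity(file_a, file_b)
```

Every pair of files needs its own distance table of size len(a) × len(b), which is one reason a brute-force file-to-file comparison scales poorly.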

Dr. Kraft Approach

Dr. Kraft Approach (cont)

Only does file-to-file comparisons
◦ Does not detect clones in the same source file

Can only detect Type 1 and some Type 2 clones

Not very efficient (brute force)

Limitations

• Split into parameter (identifiers and literals) and non-parameter tokens

• Non-parameter tokens summarized using a hash function

• Parameter tokens are encoded using a position index for their occurrence in the sequence
◦ Abstracts concrete names and values while maintaining order
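One way to read the encoding above is sketched below. The details are assumptions: each run of consecutive non-parameter tokens is summarized by a truncated SHA-1 digest, and each parameter token is replaced by the order in which its text first appeared in the sequence. Other position-index schemes exist (for example, distance to the previous occurrence), and the slides do not say which one is intended.

```python
import hashlib

PARAM_KINDS = {"id", "lit"}  # identifiers and literals


def encode(tokens):
    """Encode a list of (kind, text) tokens so that consistently
    renamed code yields the same encoded sequence."""
    first_seen = {}  # parameter text -> index of first appearance
    encoded = []
    run = []         # pending non-parameter token texts

    def flush():
        if run:
            digest = hashlib.sha1(" ".join(run).encode("utf-8")).hexdigest()
            encoded.append(("N", digest[:8]))
            run.clear()

    for kind, text in tokens:
        if kind in PARAM_KINDS:
            flush()
            index = first_seen.setdefault(text, len(first_seen))
            encoded.append(("P", index))
        else:
            run.append(text)
    flush()
    return encoded


def snippet(name_a, name_b, value):
    # Tokens for: int <name_a> = <value> ; <name_b> = <name_a> + <value> ;
    return [
        ("kw", "int"), ("id", name_a), ("op", "="), ("lit", value), ("op", ";"),
        ("id", name_b), ("op", "="), ("id", name_a), ("op", "+"),
        ("lit", value), ("op", ";"),
    ]


same_a = encode(snippet("x", "x", "1"))
same_b = encode(snippet("y", "y", "2"))   # consistent renaming of same_a
different = encode(snippet("x", "y", "1"))  # introduces a second name
```

With this scheme `same_a` and `same_b` are equal even though every name and value differs, while `different` is not, because the second identifier gets a new index.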

Enhancements

• Represent all suffixes of the sequence in a suffix tree

• Suffixes that share the same initial set of edges have a common prefix
◦ That prefix occurs more than once in the sequence (clone)
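The idea that a prefix shared by two suffixes marks a repeated run can be shown without building a full suffix tree. The sketch below swaps in a simpler technique: it sorts all suffixes and compares neighbors, which finds the same kind of shared prefixes but is much less efficient than a real suffix tree. The function names and the `min_len` parameter are assumptions, and overlapping occurrences are not filtered out.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_runs(seq, min_len=2):
    """Map each repeated run (as a tuple) to the start positions found.

    Only neighboring suffixes in sorted order are compared, so this
    reports the longest prefix shared by each neighboring pair.
    """
    seq = tuple(seq)
    order = sorted(range(len(seq)), key=lambda i: seq[i:])
    found = {}
    for i, j in zip(order, order[1:]):
        n = common_prefix_len(seq[i:], seq[j:])
        if n >= min_len:
            found.setdefault(seq[i:i + n], set()).update((i, j))
    return {run: sorted(starts) for run, starts in found.items()}


# Single-character "tokens" keep the example readable.
runs = repeated_runs(list("abcxabcy"), min_len=2)
```

In this example the run `a b c` is found at positions 0 and 4, so the sequence contains a clone of length three.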

Enhancements (cont)

What’s been done

Demonstration

Team Collaboration: Team 2 & Team 3

• Team 2
◦ We plan to start giving Team 3 periodic drops of our source code for Java and C# parsing.
◦ We are researching and working to update the Java and C# grammars.

• Team 3
◦ Team 3 is working on C++ parsing.
◦ Looking into another parser, ELSA.

Path Forward: Next Iteration & Schedule

• Finalize Iteration 1 (C++ to CodeDOM)
• Iteration 2 (Code Analysis)
• Iteration 3 (Begin GUI)

Path Forward

Schedule
