csmr10c.ppt

Recognizing Words from Source Code Identifiers using Speech Recognition Techniques

CSMR 2010, Madrid

Nioosha Madani, Latifa Guerrouj, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol

Content

• Problem Statement
• Aligning Strings and Words
• Meta-heuristic Inspired Approach
• Technologies
• Case Study – Research Questions
• Case Study – Results
• Conclusion and Future Work

The Challenge

• A few years after deployment, documentation may no longer exist.
• If it exists, it will almost surely be outdated.
• My customers want to change the system, add new functionality, or fix a defect.
• The only available source of information is the code:
  • Identifiers;
  • Comments.

Identifier Semantics

• Researchers agree that identifier semantics are important:
  • They help program comprehension;
  • They suggest clues.
• Composed identifiers:
  • Camel Case: MyLocalAccount, User_Address
  • Contraction based: pntrctr, usrAdrss, imagEdge
  • Good and possibly known to the developers: hmmm, ixoth, pqrstuvwxyz

Words, Terms, Soft, and Hard Words

• Term: any substring in a compound identifier.
• Word: an entry in a dictionary (e.g., the English dictionary).
• Hard words: terms composing an identifier that reflect domain concepts and are clearly demarcated:
  • baseAddress, user_file
• Soft words: terms different from the identifier and not clearly demarcated (e.g., abbreviation, contraction, etc.):
  • userarea, ptrcntr, userGid
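The difference between hard and soft words can be illustrated with a minimal sketch of a splitter that relies only on explicit demarcation (underscores and lower-to-upper case changes). The function name `split_hard_words` and the regular expression are assumptions made for this illustration; they are not taken from the presented tools.

```python
import re


def split_hard_words(identifier):
    """Split an identifier on underscores and lower-to-upper case changes."""
    terms = []
    for chunk in identifier.split("_"):
        # Insert a break wherever a lowercase letter or digit is followed
        # by an uppercase letter, then split on the inserted spaces.
        terms.extend(re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", chunk).split())
    return [term.lower() for term in terms]


print(split_hard_words("baseAddress"))  # ['base', 'address']
print(split_hard_words("user_file"))    # ['user', 'file']
print(split_hard_words("userarea"))     # ['userarea']
print(split_hard_words("userGid"))      # ['user', 'gid']
```

Hard words are recovered because their boundaries are marked, while a soft-word identifier such as userarea stays fused and a contraction such as Gid is left unexpanded.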

Current Practices

• Camel Case-based approaches plus greedy algorithms, e.g., Lawrie et al. 2006, 2007.
• Samurai by Enslen et al., 2009:
  • Lexicon plus a greedy algorithm;
  • If a contraction is used somewhere in the code, then it is likely used in the same context as the original term;
  • Frequency tables of contractions and terms to split composed identifiers.
• Limitations: abbreviations are not treated, and there is no quantification of how close the match is to the unknown string.

Our Approach in Essence

• Developers compose identifiers:
  • Using terms and words reflecting domain concepts, the developer’s experience, and knowledge.
• Developers generate contractions via a finite set of transformation rules:
  • Drop all vowels, drop prefix, drop suffix, etc.
• The approach mimics the developer’s identifier generation process:
  • Dictionaries capturing terms and words;
  • A search-based technique to split exactly any unknown string;
  • A distance using Dynamic Time Warping (DTW) for continuous speech recognition [H. Ney, 1984].

Modified H. Ney DTW

[Figure: DTW distance matrix. Horizontal axis: the identifier to split, pntrctrusr (p n t r c t r u s r). Vertical axis: a dictionary of 3 words (pntr, ctr, usr). Cells hold the accumulated distances.]
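The alignment of an identifier against a small dictionary can be sketched as a character-level, one-stage dynamic programming pass in the spirit of [H. Ney, 1984]. This is only an illustration: the 0/1 local cost, the three within-word transitions, and the name `one_stage_dtw` are assumptions, and the modifications used in the presented approach are not reproduced here.

```python
def one_stage_dtw(identifier, dictionary):
    """Align the identifier to a sequence of dictionary words.

    Returns (accumulated distance, tuple of words). Assumes a non-empty
    identifier and a non-empty list of non-empty words.
    """
    if not identifier or not dictionary:
        raise ValueError("identifier and dictionary must be non-empty")
    prev = None  # prev[word][j] = (cost, words) for the previous character
    for i, ch in enumerate(identifier):
        if i == 0:
            entry = (0, ())  # nothing matched yet
        else:
            # Cheapest path that finished a whole word on the previous character.
            entry = min((prev[w][-1] for w in dictionary), key=lambda c: c[0])
        curr = {}
        for w in dictionary:
            col = []
            for j, wc in enumerate(w):
                local = 0 if ch == wc else 1  # assumed local cost
                options = []
                if j == 0:
                    options.append((entry[0], entry[1] + (w,)))  # start a new word
                if prev is not None:
                    options.append(prev[w][j])          # advance in the identifier only
                    if j > 0:
                        options.append(prev[w][j - 1])  # advance in both
                if j > 0:
                    options.append(col[j - 1])          # advance in the word only
                cost, words = min(options, key=lambda c: c[0])
                col.append((cost + local, words))
            curr[w] = col
        prev = curr
    return min((prev[w][-1] for w in dictionary), key=lambda c: c[0])


print(one_stage_dtw("pntrctrusr", ["pntr", "ctr", "usr"]))
# (0, ('pntr', 'ctr', 'usr'))
```

A zero distance means the identifier is an exact concatenation of dictionary entries; with a dictionary holding only the full words (pointer, counter, user), the distance is greater than zero, which is where the word transformations come in.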

Word Transformation Rules

• Drop all vowels
• Drop a random vowel
• Drop a random character
• Drop suffix (ing, tion, ed, ment, able)
• Drop the last m characters

Constraint: the string must remain at least 3 characters long.

Examples:
  pointer → pntr
  user → usr
  pntr → ptr
  rectangle → rect
  available → avail
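The rules above can be written as small string functions. The following is a minimal sketch; the function names, the `transform` wrapper, and the choice to return the word unchanged when the length constraint is violated are assumptions for illustration. Words are assumed to be non-empty and lowercase.

```python
import random

VOWELS = set("aeiou")
SUFFIXES = ("ing", "tion", "ed", "ment", "able")
MIN_LENGTH = 3  # the result must keep at least 3 characters


def drop_all_vowels(word):
    return "".join(c for c in word if c not in VOWELS)


def drop_random_vowel(word, rng):
    positions = [i for i, c in enumerate(word) if c in VOWELS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + word[i + 1:]


def drop_random_character(word, rng):
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def drop_suffix(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def drop_last(word, m):
    return word[:-m] if 0 < m < len(word) else word


def transform(word, rule, *args):
    """Apply a rule, rejecting results shorter than MIN_LENGTH."""
    candidate = rule(word, *args)
    return candidate if len(candidate) >= MIN_LENGTH else word


rng = random.Random(0)
print(transform("pointer", drop_all_vowels))      # pntr
print(transform("available", drop_suffix))        # avail
print(transform("rectangle", drop_last, 5))       # rect
print(transform("user", drop_all_vowels))         # user ("sr" is too short, so it is rejected)
print(transform("user", drop_random_vowel, rng))  # usr or ser, depending on the draw
```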

Overall Splitting (Hill Climbing) Procedure

[Flowchart; steps as labeled in the figure:]

1. DTW match of the identifier against the current dictionary; take the best matching.
2. Zero distance? Yes: success!
3. No: randomly select a word with a minimal non-zero distance.
4. Apply a random transformation to the chosen word.
5. Add the transformed word to a temporary dictionary.
6. DTW match again; take the best matching.
7. Reduced distance? Yes: keep the transformed word and go back to the zero-distance check. No: if other transformations remain, apply another one; otherwise discard the word from the temporary dictionary.
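The loop described by the flowchart can be sketched as follows. This is a simplified illustration under stated assumptions, not the presented implementation: a Levenshtein distance stands in for the DTW distance, the rule set in `variants` is reduced, and the names `match`, `variants`, and `hill_climb` are hypothetical. The default iteration cap follows the "Max of 10000 iterations" figure given on the results slides.

```python
import random
from functools import lru_cache

VOWELS = "aeiou"


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Levenshtein distance, used here as a stand-in for the DTW distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def match(identifier, words):
    """Best matching: (total distance, [(entry, segment distance), ...])."""
    best = [(0, [])]  # best[i] covers identifier[:i]
    for i in range(1, len(identifier) + 1):
        candidates = []
        for j in range(i):
            for w in words:
                d = edit_distance(identifier[j:i], w)
                candidates.append((best[j][0] + d, best[j][1] + [(w, d)]))
        best.append(min(candidates, key=lambda c: c[0]))
    return best[-1]


def variants(word):
    """Contracted forms of a word under a few simplified rules (length >= 3)."""
    out = {"".join(c for c in word if c not in VOWELS)}            # drop all vowels
    out.update(word[:i] + word[i + 1:] for i in range(len(word)))  # drop one character
    out.update(word[:-m] for m in range(1, len(word)))             # drop the last m characters
    return sorted(v for v in out if len(v) >= 3 and v != word)


def hill_climb(identifier, dictionary, max_iterations=10000, seed=0):
    rng = random.Random(seed)
    origin = {w: w for w in dictionary}  # dictionary entry -> original word
    dist, matching = match(identifier, sorted(origin))
    for _ in range(max_iterations):
        if dist == 0:
            break  # zero distance: success
        # Randomly pick a matched entry whose segment distance is not zero.
        entry = rng.choice(sorted({w for w, d in matching if d > 0}))
        options = variants(entry)
        if not options:
            continue
        variant = rng.choice(options)  # apply a random transformation
        trial = dict(origin)           # temporary dictionary
        trial.setdefault(variant, origin[entry])
        new_dist, new_matching = match(identifier, sorted(trial))
        if new_dist < dist:            # reduced distance: keep the variant
            origin, dist, matching = trial, new_dist, new_matching
        # otherwise the temporary dictionary is simply discarded
    return dist, [origin[w] for w, _ in matching]


print(hill_climb("pntrctrusr", ["pointer", "counter", "user"], max_iterations=300))
```

Because the search is random and only accepts strict improvements, the final distance depends on the seed and on the iteration budget; a result with distance zero maps every segment of the identifier back to an original dictionary word.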

Case Study - Research Questions

• RQ1: What is the percentage of identifiers correctly split by the proposed approach?
• RQ2: How does the proposed approach perform compared with the Camel Case splitter?
• RQ3: What percentage of identifiers containing word abbreviations is the approach able to map to dictionary words?

Case Study - Results

• JHotDraw – Java
  • 16 KLOC
  • 155 files
  • 2,348 identifiers (longer than 2 chars)
  • 957 manually segmented identifiers
• Lynx – C
  • 174 KLOC
  • 247 files
  • 12,194 identifiers (longer than 2 chars)
  • 3,085 manually segmented identifiers

RQ1 - Percentage of Correct Classifications

Systems    Ids     Splits (single iteration)   Splits (multiple iterations)   Errors
JHotDraw   957     891 (93%)                   920 (95%)                      37
Lynx       3,085   2,169 (70%)                 2,901 (94%)                    271

Typical cases where the approach failed: afaik, ihmo, foobar, fsize …

RQ2 - Camel Case Split

Systems    Ids     Correct splits   Errors
JHotDraw   957     874 (91%)        83
Lynx       3,085   561 (18%)        2,524

Statistical comparison (Fisher’s exact test) with our approach:

Null hypothesis (H0): the proportions of correct splittings obtained by the two approaches are not significantly different.

• JHotDraw: Odds Ratio = 1.3, p-value = 0.1
• Lynx: Odds Ratio = 60, p-value < 0.001

RQ3 - Percentage of Correctly Split Identifiers

Systems    Ids     Correct splits   Errors
JHotDraw   957     920 (95%)        37
Lynx       3,085   2,901 (94%)      271

The novel identifier splitting approach performs better than the Camel Case splitter.

Multiple Possible Splits - Successes

Identifier       Split                     Alternative split
borddec          bord decimal              bord decision
anchorlen        anchor length             anchor lender
drawrect         draw rectangle
drawroundrect    draw round rectangle
fillrect         fill rectangle
javadrawapp      java draw apply           java draw append
netapp           net apply                 net append
newlen           new length                new lender
nothingapp       nothing apply             nothing application
addcolumninfo    add column information    add column inform
addlbl           add label
casecomp         case compare              case complete

Max of 10000 iterations
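Several expansions can reach distance zero because one contraction can be produced from more than one dictionary word. A minimal sketch, assuming only the "drop the last m characters" rule and a small hypothetical word list:

```python
def expansions(contraction, dictionary):
    """Words that yield `contraction` when their last characters are dropped."""
    return sorted(w for w in dictionary if w.startswith(contraction) and w != contraction)


# Hypothetical word list, chosen to mirror the examples in the table.
words = ["length", "lender", "apply", "append", "application", "rectangle"]

print(expansions("len", words))   # ['lender', 'length']
print(expansions("app", words))   # ['append', 'application', 'apply']
print(expansions("rect", words))  # ['rectangle']
```

With this word list, rect has a single candidate, while len and app have several candidates that a purely string-based distance cannot rank.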

Multiple Possible Splits - Failures

Identifier                     Split found
serialversionuid               serial version did
selectionzordered              selection ordered
removefrfigurerequestremove    remove figure request remove
jhotdraw                       hot draw
getvadjustable                 get bad just able
fimagewidth                    him age width
fimageheight                   him age height
writeref                       write red

DTW does not account for context, syntax, or semantics.

Max of 10000 iterations

Discussion - Challenges

• How can we expand fwrite or pdraw?
• How can we avoid expanding FileLen into File Lender rather than File Length?
• How can we recognize that ImagEdit has a correct split at distance 1 and not 0?
• How can we expand/split pqrstuvwxyz?

Threats to Validity

• External validity:
  • We analyzed only two systems;
  • However: different domains, different programming languages.
• Construct validity: errors may be present in the oracle!
  • We detected 1% error in the first oracle release;
  • We did our best to guess the programmers’ intention, but we cannot exclude errors.
• Reliability validity: replication package available.
• Internal validity: subjectivity and bias in building the oracle:
  • The same researcher built both oracles;
  • Oracles were validated by two other researchers;
  • The oracle is large enough that errors of a few percent would not change the conclusions.

Conclusion

• We presented a search-based approach to automatically segment source code identifiers.
• The novel approach is inspired by developer behavior when composing identifiers.
• The approach uses a dictionary, a distance computed via DTW, and a set of word transformations.
• Results on JHotDraw and Lynx show the superiority of the approach over a simple Camel Case splitter.

Future Work

We plan to:

• Expand the evaluation to other systems.
• Introduce enhanced heuristics for term selection and word transformations.
• Contextualize our search by coupling our algorithm with the approach of Enslen et al. [ELK, 2009] (restrict the search to the words used in the same method, class, or package).

Finally… Questions

Thank you for your attention

References

[ELK, 2009] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, “Mining source code to automatically split identifiers for software analysis,” Mining Software Repositories, International Workshop on, vol. 0, pp. 71 - 80, 2009.

[H. Ney, 1984] H. Ney, “The use of a one-stage dynamic programming algorithm for connected word recognition,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 32, no. 2, pp. 263 - 271, Apr 1984.

D. Lawrie, C. Morrell, H. Feild, and D. Binkley, “Effective identifier names for comprehension and memory,” Innovations in Systems and Software Engineering, vol. 3, no. 4, pp. 303 - 318, 2007.

D. Lawrie, C. Morrell, H. Feild, and D. Binkley, “What’s in a name? A study of identifiers,” in Proc. of the International Conference on Program Comprehension (ICPC), 2006, pp. 3 - 12.

Overall Splitting (Hill Climbing) Procedure

[Flowchart: the identifier and the dictionary feed a DTW match, which produces a ranked word list and a best matching. Zero distance? Yes: success! No: a word is transformed, added to a temporary dictionary, and matched again. Improved? Yes: save the word and create a new dictionary. No: discard the word and create a new dictionary.]
