Improving Automatic Abbreviation Expansion within Source Code to Aid in Program Search Tools
Zak Fry
Outline
Problem and Motivation
Automatically Identifying Abbreviation Expansions
A Scoped Approach
Analysis and Refinement: iScope
Evaluations
Conclusions
Maintenance Tasks
60-90% of software lifecycle
Problem: identifying where the relevant code is, i.e., where changes need to be made
Code to perform a certain task can be very scattered
Causes difficulty for current maintenance search tools
Challenges - Coding Practices
Identifier names important for code documentation and understanding
Problem: Programmers’ use of abbreviations in code
– Frequency of occurrence: character, integer, string
– Complex inheritance: long class names, e.g. SecureMessageServiceClientMessageImpl
Negates usefulness of identifier names and complicates program understanding
Abbreviations and Maintenance Tools
Problem: Search based maintenance tools rely on natural language
– Abbreviations change the natural language
Search Term: “distributed hash”
dht = (DHTPlugin)dht_pi.getPlugin();
Thread t = new AEThread( "DHTTrackerPlugin:init" ) {
    public void runSupport() {
        try{
            if ( dht.isEnabled()){
                log.log( "DDB Available" );
            }
        }
        catch( Throwable e ){
            log.log( "DDB Failed", e );
        }
        ...
    }
}
Automatically Identifying Abbreviation Expansions
First, how do we identify candidates for expansion?
– Non-dictionary words
Abbreviation
– Short form
Expansion
– Long form
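The candidate-identification step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `CandidateFinder`, the tiny `DICTIONARY` set, and the splitting rule (underscores, digits, camel-case boundaries) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class CandidateFinder {
    // Tiny stand-in dictionary (assumption); a real tool would load a full English word list.
    static final Set<String> DICTIONARY =
            Set.of("get", "plugin", "event", "is", "enabled", "handler");

    // Split an identifier on underscores, digits, and lower-to-upper camel-case boundaries.
    static List<String> split(String identifier) {
        List<String> tokens = new ArrayList<>();
        for (String part : identifier.split("[_0-9]+|(?<=[a-z])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Tokens that are not dictionary words are the candidates for expansion.
    static List<String> candidates(String identifier) {
        List<String> result = new ArrayList<>();
        for (String token : split(identifier)) {
            if (!DICTIONARY.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates("dht_pi"));
        System.out.println(candidates("evtHandler"));
        System.out.println(candidates("getPlugin"));
    }
}
```

With this stand-in dictionary, `dht` and `pi` are flagged as candidates, while `getPlugin` yields none because both of its tokens are dictionary words.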
Types of Non-Dictionary Words
Abbreviation Category              Type                   Short Form            Long Form
Single Word                        Prefix                 int                   integer
Single Word                        Dropped Letter         evt                   event
Multiple Word                      Acronym                FBI                   Federal Bureau of Investigation
Multiple Word                      Combination Multiword  recblk                receive block
Domain Keywords and Special Cases  ---                    parsetree, serialize  ---
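One way to test whether a long form fits a short form under three of these types is sketched below. The matching rules, the `AbbreviationMatcher` name, and the `STOP_WORDS` list are assumptions chosen for illustration; the slides do not give the actual patterns, and combination multiwords are not covered here.

```java
import java.util.Locale;
import java.util.Set;

final class AbbreviationMatcher {
    // Assumed list of words an acronym is allowed to skip (e.g. "of" in the FBI example).
    static final Set<String> STOP_WORDS = Set.of("of", "the", "and", "for");

    // Prefix: the short form is a proper leading substring of the long form (int -> integer).
    static boolean isPrefix(String shortForm, String longForm) {
        return longForm.length() > shortForm.length() && longForm.startsWith(shortForm);
    }

    // Dropped letter: same first letter, and the short form's letters
    // appear in order inside the long form (evt -> event).
    static boolean isDroppedLetter(String shortForm, String longForm) {
        if (shortForm.isEmpty() || longForm.length() <= shortForm.length()) return false;
        if (shortForm.charAt(0) != longForm.charAt(0)) return false;
        int i = 0;
        for (int j = 0; j < longForm.length() && i < shortForm.length(); j++) {
            if (shortForm.charAt(i) == longForm.charAt(j)) i++;
        }
        return i == shortForm.length();
    }

    // Acronym: each letter is the initial of the next word; stop words may be skipped.
    static boolean isAcronym(String shortForm, String phrase) {
        String letters = shortForm.toLowerCase(Locale.ROOT);
        int i = 0;
        for (String word : phrase.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            if (i < letters.length() && word.charAt(0) == letters.charAt(i)) {
                i++;
            } else if (!STOP_WORDS.contains(word)) {
                return false;
            }
        }
        return i > 1 && i == letters.length();
    }

    public static void main(String[] args) {
        System.out.println(isPrefix("int", "integer"));
        System.out.println(isDroppedLetter("evt", "event"));
        System.out.println(isAcronym("FBI", "Federal Bureau of Investigation"));
    }
}
```

Note that a prefix such as `int` also satisfies the dropped-letter rule, so the order in which the types are tried matters.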
State of the Art
Lawrie, Feild, and Binkley
– Abbreviation Expansion
– Problem:
Lack of precision
No support for choosing between multiple matches
Scoped Approach
How to choose between multiple possible long forms:
– By manual inspection we found that correct long forms are more likely to be found in certain locations
– Also, correctly identifying the long forms for certain types of abbreviations is easier than for others
Order of Types
Abbreviation Type
1: Acronym
2: Prefix
3: Dropped Letter
4: Combo Multiword
5: Most Common
Order of Program Context
Context
1: Javadoc
2: Type
3: Method Name
4: Statement
5: Method
6: Method Comments
7: Class Comments
General Algorithm
[Figure: for each abbreviation type in order (Acronym, Prefix, …), the contexts are searched in order (Javadoc, Type, Method Name, …)]
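The type-first search of the scoped approach can be sketched as a pair of nested loops. This is only an illustration under assumptions: the `ScopedSearch` class, the two simplified matchers, and the representation of each context as a list of words are hypothetical and not taken from the original tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;

final class ScopedSearch {
    // Type-first ordering: exhaust every context level for one
    // abbreviation type before moving on to the next type.
    static Optional<String> expand(String shortForm,
                                   List<BiPredicate<String, String>> typesInOrder,
                                   List<List<String>> contextsInOrder) {
        for (BiPredicate<String, String> type : typesInOrder) {
            for (List<String> contextWords : contextsInOrder) {
                for (String word : contextWords) {
                    if (type.test(shortForm, word)) return Optional.of(word);
                }
            }
        }
        return Optional.empty();
    }

    // Simplified matchers (assumptions for the example).
    static boolean isPrefix(String s, String w) {
        return w.length() > s.length() && w.startsWith(s);
    }

    static boolean isDroppedLetter(String s, String w) {
        if (s.isEmpty() || w.length() <= s.length() || s.charAt(0) != w.charAt(0)) return false;
        int i = 0;
        for (int j = 0; j < w.length() && i < s.length(); j++) {
            if (s.charAt(i) == w.charAt(j)) i++;
        }
        return i == s.length();
    }

    // Prefix is tried before dropped letter, following the order of types.
    static List<BiPredicate<String, String>> defaultTypes() {
        List<BiPredicate<String, String>> types = new ArrayList<>();
        types.add(ScopedSearch::isPrefix);
        types.add(ScopedSearch::isDroppedLetter);
        return types;
    }

    public static void main(String[] args) {
        List<List<String>> contexts = List.of(
                List.of("handles", "the", "click"),   // words from the Javadoc
                List.of("mouse", "event"));           // words from the type name
        System.out.println(expand("evt", defaultTypes(), contexts));
    }
}
```

In this example the prefix matcher finds nothing in either context, so the search falls through to the dropped-letter matcher, which finds `event` in the type name.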
Multiple matches
We assume one best candidate though multiple might be present at the same level of scope
If multiple matches:
1. Examine frequencies
2. Stem long forms and reexamine frequencies
3. Broaden Scope and reexamine frequencies
4. Most frequent expansion
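The four tie-breaking steps can be sketched as below. Everything here is a hypothetical illustration: `MatchResolver`, the way candidates are passed in as lists of occurrences, and especially `stem`, which is a crude placeholder that only strips a trailing "s" and stands in for a real stemmer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class MatchResolver {
    // Returns the single most frequent candidate, or empty if the top count is tied.
    static Optional<String> mostFrequent(List<String> occurrences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String o : occurrences) counts.merge(o, 1, Integer::sum);
        String best = null;
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int count = e.getValue();
            if (count > bestCount) {
                best = e.getKey();
                bestCount = count;
                tie = false;
            } else if (count == bestCount) {
                tie = true;
            }
        }
        return (best == null || tie) ? Optional.empty() : Optional.of(best);
    }

    // Placeholder stemmer (assumption): strips one trailing "s" only.
    static String stem(String word) {
        return (word.length() > 3 && word.endsWith("s"))
                ? word.substring(0, word.length() - 1) : word;
    }

    static List<String> stemAll(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String w : words) out.add(stem(w));
        return out;
    }

    static String resolve(List<String> current, List<String> broader, String mostFrequentExpansion) {
        Optional<String> winner = mostFrequent(current);                   // 1. frequencies
        if (!winner.isPresent()) winner = mostFrequent(stemAll(current));  // 2. stem, recount
        if (!winner.isPresent()) {                                         // 3. broaden scope
            List<String> widened = new ArrayList<>(current);
            widened.addAll(broader);
            winner = mostFrequent(stemAll(widened));
        }
        return winner.orElse(mostFrequentExpansion);                       // 4. last resort
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of("string", "strings", "stream"), List.of(), "structure"));
        System.out.println(resolve(List.of("string", "stream"), List.of("stream"), "structure"));
        System.out.println(resolve(List.of("string", "stream"), List.of(), "structure"));
    }
}
```

In the first call the raw counts tie, and stemming merges `strings` into `string` to break the tie. In the second, only the broader scope breaks it. In the third, nothing does, so the supplied most frequent expansion is returned.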
Most Frequent Expansion (MFE)
If still no ideal candidate is found:
– We mined long forms from a 1.5 million LOC Java 5 code base
– Return the most frequent long form as a last resort
Evaluation of Scoped Approach
250 abbreviations from 5 subject programs
Gold standard developed by a human developer inspecting the code manually
Implemented LFB (Lawrie, Feild, and Binkley) according to its description
– Except combination words, due to a missing database
[Figure: accuracy results]
Analysis and Refinement - iScope
Analyzed results and found 3 major sources of problems
Developed iScope by addressing these 3 major problem areas
Order of Scoping
•Problem:
–Scoped approach ordering: examine every context for an abbreviation type, then go to the next type
–Investigating broader contexts for one type before even the narrowest context for another type is likely to yield incorrect matches
•Insight: Context is more sensitive than type
•Solution: Check each type at each context level, then go to the next context level (switch order)
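The switched ordering can be sketched by making the context loop the outer one. As before, this is a hedged illustration: `ContextFirstSearch`, the simplified matchers, and the word-list contexts are assumptions for the example rather than the iScope code itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;

final class ContextFirstSearch {
    // Context-first ordering: try every abbreviation type at one
    // context level before widening to the next context level.
    static Optional<String> expand(String shortForm,
                                   List<BiPredicate<String, String>> typesInOrder,
                                   List<List<String>> contextsInOrder) {
        for (List<String> contextWords : contextsInOrder) {
            for (BiPredicate<String, String> type : typesInOrder) {
                for (String word : contextWords) {
                    if (type.test(shortForm, word)) return Optional.of(word);
                }
            }
        }
        return Optional.empty();
    }

    // Simplified matchers (assumptions for the example).
    static boolean isPrefix(String s, String w) {
        return w.length() > s.length() && w.startsWith(s);
    }

    static boolean isDroppedLetter(String s, String w) {
        if (s.isEmpty() || w.length() <= s.length() || s.charAt(0) != w.charAt(0)) return false;
        int i = 0;
        for (int j = 0; j < w.length() && i < s.length(); j++) {
            if (s.charAt(i) == w.charAt(j)) i++;
        }
        return i == s.length();
    }

    static List<BiPredicate<String, String>> defaultTypes() {
        List<BiPredicate<String, String>> types = new ArrayList<>();
        types.add(ContextFirstSearch::isPrefix);
        types.add(ContextFirstSearch::isDroppedLetter);
        return types;
    }

    public static void main(String[] args) {
        List<List<String>> contexts = List.of(
                List.of("cell"),                              // earlier context: declared type
                List.of("this", "class", "stores", "cells")); // later context: class comments
        System.out.println(expand("cl", defaultTypes(), contexts));
    }
}
```

For the hypothetical identifier `cl`, this ordering settles on `cell` from the first context via the dropped-letter rule. A type-first loop over the same input would instead run the prefix rule across every context first and pick `class` from the class comments, which is the kind of mismatch the slide describes.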
Single Letter Abbreviations
•Problem:
–Developers use single letter abbreviations differently than multiple letter abbreviations
–A large subset are actually semantically meaningless
–Single letters are very easily matched, especially because prefix matching is greedy
•Insight: Based on manual inspection, we found that meaningful single letter short forms were identifiers whose long forms were also their type name
–Reader r = new BufferedReader()
•Solution: Limit contextual scope to type only
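Restricting single-letter short forms to the declared type might look like the sketch below. The `SingleLetterExpander` class and its first-letter comparison are assumptions for illustration only.

```java
import java.util.Optional;

final class SingleLetterExpander {
    // A single-letter identifier is expanded only when its declared type
    // name starts with that letter; no other context is consulted.
    static Optional<String> expand(String shortForm, String declaredType) {
        if (shortForm.length() != 1 || declaredType.isEmpty()) return Optional.empty();
        char letter = Character.toLowerCase(shortForm.charAt(0));
        char initial = Character.toLowerCase(declaredType.charAt(0));
        return letter == initial ? Optional.of(declaredType) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(expand("r", "Reader"));     // as in: Reader r = new BufferedReader()
        System.out.println(expand("e", "Throwable"));  // as in: catch( Throwable e )
    }
}
```

Under this rule `r` expands to `Reader`, while the `e` in `catch( Throwable e )` is left unexpanded instead of being matched greedily against some unrelated word.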
Hyper-Common Abbreviation
•Problem: Some abbreviations are used so often in code that the long form rarely co-occurs, leading to incorrect expansions based on coincidence
•Solution: Mine a small set of extremely common abbreviations and use as a preprocessing step
Mined list of hyper-common abbreviations
Evaluations
Is our method accurate enough to be useful?
– Reevaluation of previous experiment
Does abbreviation expansion help maintenance tasks?
– Simple Search
– Concern Location Task
1. Reevaluation of Previous Test
Based on our previous experimental methodology and metrics, how much improvement was made from Scope to iScope?
Modified goldset based on new assumptions – single letter abbreviations
1. Reevaluation of Previous Test - Results
•Compare LFB with Scope and iScope using non combinational word (NCW) accuracy values
•Compare JavaMFE, ProgMFE, Scope, and iScope using the total accuracy values
2. Simple Search Evaluation
When abbreviations are expanded in software, how many more search results are returned than without expansion?
Focus: Recall
– Not missing important results: want as many potentially relevant results as possible
Metric: Percent increase in results
– P.I. = (Raw returned results with expansion / Raw returned results without expansion) × 100% - 100%
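As a worked check of the metric, the sketch below applies it to the result counts reported for this experiment (240,752 without expansion, 284,160 with Scope, 282,489 with iScope). The class and method names are illustrative.

```java
final class PercentIncrease {
    // P.I. = (results with expansion / results without expansion) * 100 - 100
    static double percentIncrease(long withExpansion, long withoutExpansion) {
        return 100.0 * withExpansion / withoutExpansion - 100.0;
    }

    public static void main(String[] args) {
        long noExpansion = 240_752L;
        System.out.println(percentIncrease(284_160L, noExpansion)); // Scope
        System.out.println(percentIncrease(282_489L, noExpansion)); // iScope
    }
}
```

Rounded to two decimals, these reproduce the reported increases of 18.03 and 17.34 percent.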
2. Simple Search Evaluation (cont)
Subjects: 215 concerns (Eaddy et al.) annotated by 3 people each, for a total of 645 queries
– Developed independently of the idea of abbreviation expansion, so many queries might not be affected by abbreviation expansion at all
“Match”: if any word in the query matches any word in a method, the method is considered a match and returned as a result
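The match rule amounts to testing whether two word sets overlap. In the sketch below, the `SimpleSearch` class and the example word sets are hypothetical, including the assumed expansion of `dht` to "distributed hash table".

```java
import java.util.Collections;
import java.util.Set;

final class SimpleSearch {
    // A method matches when it shares at least one word with the query.
    static boolean matches(Set<String> queryWords, Set<String> methodWords) {
        return !Collections.disjoint(queryWords, methodWords);
    }

    public static void main(String[] args) {
        Set<String> query = Set.of("distributed", "hash");
        // Words in a method before and after expanding "dht" (illustrative data).
        Set<String> unexpanded = Set.of("dht", "plugin");
        Set<String> expanded = Set.of("distributed", "hash", "table", "plugin");
        System.out.println(matches(query, unexpanded));
        System.out.println(matches(query, expanded));
    }
}
```

The same method is missed before expansion and returned after it, which is why the experiment counts returned results.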
2. Simple Search Evaluation - Results
Approach        Total Returned Results   Percent Increase
No Expansion    240,752                  ---
Scope           284,160                  18.03
iScope          282,489                  17.34
•Less increase with iScope, due to the decrease in single letter abbreviation false positives
•Ideally, this means quality is better
•See experiment 3
3. Evaluation with Concern Location
Concern location task: identification of methods that are deemed to be relevant for the given search term
How much increase in effectiveness can be gained from expanding abbreviations in source code when performing concern location tasks?
3. Evaluation Methodology
Tools: Latent Semantic Indexing (LSI) and Log Entropy-based concern location
– Goals: Attempt to calculate similarity values based on location and frequency of potential query matches
Subjects: same as previous experiment
3. Methodology (cont)
Metric: Mean Average Precision (MAP)
– Precision: # True positives / Total # of positives
– MAP: Collect precision values for every new true positive, going down the ranked returned results, then take the average of all results
– Attempts to reward highly ranked true positives
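The metric can be sketched as follows, assuming each query comes with a ranked result list and a set of relevant methods. The names are illustrative, and this version averages over the true positives actually retrieved, as the slide describes; a common variant divides by the total number of relevant items instead.

```java
import java.util.List;
import java.util.Set;

final class MeanAveragePrecision {
    // Walk down the ranking; at each true positive record the precision so far,
    // then average the recorded values.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return hits == 0 ? 0.0 : sum / hits;
    }

    // MAP is the mean of the per-query average precision values.
    static double meanAveragePrecision(List<List<String>> rankings, List<Set<String>> relevantSets) {
        if (rankings.isEmpty()) return 0.0;
        double total = 0.0;
        for (int q = 0; q < rankings.size(); q++) {
            total += averagePrecision(rankings.get(q), relevantSets.get(q));
        }
        return total / rankings.size();
    }

    public static void main(String[] args) {
        // Hypothetical method names: true positives at ranks 1 and 3.
        List<String> ranked = List.of("m1", "m2", "m3", "m4");
        Set<String> relevant = Set.of("m1", "m3");
        System.out.println(averagePrecision(ranked, relevant));
    }
}
```

In the example, the precision values 1/1 and 2/3 are recorded at ranks 1 and 3, giving an average precision of about 0.83. Moving the second true positive up to rank 2 would raise it to 1.0, which is how the metric rewards highly ranked true positives.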
3. Concern Location Tasks - Results
Conclusions
Abbreviation expansion is proven to be helpful in maintenance tools and processes
The iScope approach improves upon Scope and greatly upon the state of the art
Future Work
Further refinement of expansion process to achieve highest possible accuracy
Full integration into a maintenance tool
Extension into other programming languages
Acknowledgments
Emily Hill and Haley Boyd
Dr. Vijay K. Shanker and Dr. Lori Pollock
Questions?
Inherent Inaccuracy
•Problem: Additional errors in code not generalizable into solvable problems
•Insight: There will always be inherent error when developing automatic systems for non-standard input