Introducing Natural Language Program Analysis
Lori Pollock, K. Vijay-Shanker, David Shepherd, Emily Hill, Zachary P. Fry, Kishen Maloor
University of Delaware
NLPA Research Team Leaders
Lori Pollock, “Team Captain”
K. Vijay-Shanker, “The Umpire”
Problem
Modern software is large and complex
[Figure: object-oriented class hierarchy]
Software development tools are needed
Successes in Software Development Tools
[Figure: object-oriented class hierarchy]
Good with local tasks
Good with traditional structure
Issues in Software Development Tools
Scattered tasks are difficult
Programmers use more than traditional program structure
Observations in Software Development Tools
[Figure: object-oriented system, with labels activate tool, save drawing, update drawing, undo action]
public interface Storable{...
public void Circle.save()
//Store the fields in a file....
Key Insight: Programmers leave natural language clues that can benefit software development tools
Studies on choosing identifiers
Impact of human cognition on names [Liblit et al. PPIG 06]
• Metaphors, morphology, scope, part of speech hints
• Hints for understanding code
Analysis of function identifiers [Caprile and Tonella WCRE 99]
• Lexical, syntactic, semantic
• Use for software tools: metrics, traceability, program understanding
Carla, the compiler writer: “I don’t care about names.”
Pete, the programmer: “So, I could use x, y, z. But no one will understand my code.”
Our Research Path
• Motivated usefulness of exploiting natural language (NL) clues in tools [MACS 05, LATE 05]
• Developed extraction process and an NL-based program representation [AOSD 06]
• Created and evaluated a concern location tool and an aspect miner with NL-based analysis [ASE 05, AOSD 07, PASTE 07]
Name: David C Shepherd
Nickname: Leadoff Hitter
Current Position: PhD May 30, 2007
Future Position: Postdoc, Gail Murphy
Stats
Year   coffees/day   redmarks/paper draft
2002   0.1           500
2007   2.2           100
Applying NL Clues for Aspect Mining
Aspect-Oriented Programming
Aspect Mining Task: Locate refactoring candidates
Molly, the Maintainer: “How can I fix Paul’s atrocious code?”
Timna: An Aspect Mining Framework [ASE 05]
• Uses program analysis clues for mining
• Combines clues using machine learning
• Evaluated vs. Fan-in
• Precision (quality) and Recall (completeness)
          P    R
Fan-In    37   2
Timna     62   60
Integrating NL Clues into Timna
iTimna (Timna with NL)
• Integrates natural language clues
• Example: Opposite verbs (open and close)
          P    R
Fan-In    37   2
Timna     62   60
iTimna    81   73
Natural language information increases the effectiveness of Timna
[Come back Thurs 10:05am]
Applying NL Clues for Concern Location
Motivation
60-90% of software costs are spent on reading and navigating code for maintenance* (fixing bugs, adding features, etc.)
*[Erlikh] Leveraging Legacy System Dollars for E-Business
Key Challenge: Concern Location
Find, collect, and understand all source code related to a particular concept
Concerns are often crosscutting
State of the Art for Concern Location
Mining Dynamic Information [Wilde ICSM 00]
Program Structure Navigation [Robillard FSE 05, FEAT, Schaefer ICSM 05]
Search-Based Approaches
• RegEx [grep, Aspect Mining Tool 00]
• LSA-Based [Marcus 04]
• Word-Frequency Based [GES 06]
[Figure callouts: Reduced to similar problem; Slow; Fast; Fragile; Sensitive; No Semantics]
Limitations of Search Techniques
1. Return large result sets
2. Return irrelevant results
3. Return hard-to-interpret result sets
The Find-Concept Approach
[Figure: concept, concrete query, Find-Concept, recommendations, source code, NL-based code rep, result graph (Method a, Method b, Method c, Method d, Method e)]
Natural Language Information
1. More effective search
2. Improved search terms
3. Understandable results
Underlying Program Analysis
Action-Oriented Identifier Graph (AOIG) [AOSD 06]
• Provides access to NL information
• Provides interface between NL and traditional
Word Recommendation Algorithm
• NL-based
  • Stemmed/Rooted: complete, completing
  • Synonym: finish, complete
• Combining NL and Traditional
  • Co-location: completeWord()
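The three kinds of recommendation listed above (stemmed variants, synonyms, co-located words) can be illustrated with a small Java sketch. This is not the Find-Concept algorithm: the class name WordRecommender, the suffix-stripping stemmer, and the one-entry synonym table are all assumptions made only to show the idea on the slide's own example words.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class WordRecommender {
    // Hypothetical synonym table; a real tool would consult a thesaurus.
    private static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("complete", Arrays.asList("finish"));
    }

    private static final String CAMEL_BOUNDARY =
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    /** Naive suffix stripping (an assumption, not a real stemmer). */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "ed", "es", "s", "e"}) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    /** Suggests synonyms, stemmed variants, and words co-located with the query word. */
    static Set<String> recommend(String query, List<String> methodNames) {
        String q = query.toLowerCase(Locale.ROOT);
        String queryStem = stem(q);
        Set<String> suggestions = new LinkedHashSet<>();
        suggestions.addAll(SYNONYMS.getOrDefault(q, Collections.<String>emptyList()));
        for (String name : methodNames) {
            String[] words = name.split(CAMEL_BOUNDARY);
            boolean mentionsQuery = false;
            for (String w : words) {
                if (stem(w).equals(queryStem)) {
                    mentionsQuery = true;
                }
            }
            if (!mentionsQuery) {
                continue; // identifier is unrelated to the query word
            }
            for (String w : words) {
                String lower = w.toLowerCase(Locale.ROOT);
                if (!lower.equals(q)) {
                    suggestions.add(lower); // variant or co-located word
                }
            }
        }
        return suggestions;
    }

    public static void main(String[] args) {
        System.out.println(recommend("complete",
                Arrays.asList("completeWord", "completingLine", "openFile")));
    }
}
```

For the query "complete", the sketch suggests the synonym first, then the words that share an identifier with a stemmed variant of the query.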
Experimental Evaluation
Research Questions
• Which search tool is most effective at forming and executing a query for concern location?
• Which search tool requires the least human effort to form an effective query?
Methodology: 18 developers complete nine concern location tasks on medium-sized (>20KLOC) programs
Measures: Precision (quality), Recall (completeness), F-Measure (combination of both P & R)
Tools compared: Find Concept, GES, ELex
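The three measures can be computed from the set of methods a tool returns and the set a human judged relevant. The Java sketch below assumes the balanced F-measure (harmonic mean of P and R), which the slide does not specify; the class name RetrievalMeasures and the sample method names are illustrative only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class RetrievalMeasures {
    /** Precision (quality): fraction of returned items that are relevant. */
    static double precision(Set<String> returned, Set<String> relevant) {
        if (returned.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(returned, relevant) / returned.size();
    }

    /** Recall (completeness): fraction of relevant items that were returned. */
    static double recall(Set<String> returned, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(returned, relevant) / relevant.size();
    }

    /** Balanced F-measure (assumed here): harmonic mean of precision and recall. */
    static double fMeasure(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    private static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        // Hypothetical result set and gold set for one concern location task.
        Set<String> returned = new HashSet<>(Arrays.asList("save", "load", "print", "undo"));
        Set<String> relevant = new HashSet<>(Arrays.asList("save", "load", "store"));
        double p = precision(returned, relevant);
        double r = recall(returned, relevant);
        System.out.println("P=" + p + " R=" + r + " F=" + fMeasure(p, r));
    }
}
```

With two of four returned methods relevant and two of three relevant methods found, precision is 1/2, recall is 2/3, and the balanced F-measure is 4/7.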
Overall Results
Effectiveness
• FC > ELex with statistical significance
• FC >= GES on 7/9 tasks
• FC is more consistent than GES
Effort
• FC = ELex = GES
FC is more consistent and more effective in experimental study without requiring more effort
Across all tasks
Natural Language Extraction from Source Code
Key Challenges:
• Decode name usage
• Develop automatic extraction process
• Create NL-based program representation
Molly, the Maintainer: “What was Pete thinking when he wrote this code?”
Natural Language: Which Clues to Use?
Software Maintenance
• Typically focused on actions
• Objects are well-modularized
[Figure: Maintenance Requests]
Focus on actions
• Correspond to verbs
• Verbs need Direct Object (DO)
Extract verb-DO pairs
Extracting Verb-DO Pairs
Two types of extraction: extraction from comments and extraction from method signatures
class Player{
    /**
     * Play a specified file with specified time interval
     */
    public static boolean play(final File file,final float fPosition,final long length) {
        fCurrent = file;
        try {
            playerImpl = null;
            //make sure to stop non-fading players
            stop(false);
            //Choose the player
            Class cPlayer = file.getTrack().getType().getPlayerImpl();
            …
}
public UserList getUserListFromFile( String path ) throws IOException {
    try {
        File tmpFile = new File( path );
        return parseFile(tmpFile);
    } catch( java.io.IOException e ) {
        throw new IOException( "UserList format issue" + path + " file " + e );
    }
}
Extracting Clues from Signatures
1. POS Tag Method Name
2. Chunk Method Name
3. Identify Verb and Direct-Object (DO)
POS Tag: get<verb> User<adj> List<noun> From<prep> File<noun>
Chunk: get<verb phrase> User List<noun phrase> From File<prep phrase>
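Before a method name such as getUserListFromFile can be tagged and chunked, it has to be broken into its words. A minimal Java sketch of that splitting step follows; the class name IdentifierSplitter and the camel-case/underscore convention are assumptions, and the tagging and chunking themselves are not shown.

```java
import java.util.ArrayList;
import java.util.List;

final class IdentifierSplitter {
    // Split at a lower-to-upper boundary (getUser) and before the last
    // capital of an acronym run (XMLParser -> XML, Parser).
    private static final String CAMEL_BOUNDARY =
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    /** Splits an identifier into words on underscores and camel-case boundaries. */
    static List<String> split(String identifier) {
        List<String> words = new ArrayList<>();
        for (String part : identifier.split("_")) {
            if (part.isEmpty()) {
                continue; // skip empty pieces from leading or doubled underscores
            }
            for (String word : part.split(CAMEL_BOUNDARY)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("getUserListFromFile")); // [get, User, List, From, File]
        System.out.println(split("parseURL"));            // [parse, URL]
    }
}
```

The resulting word list is what a part-of-speech tagger would then label as verb, adjective, noun, and so on.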
Name: Zak Fry
Nickname: The Rookie
Current Position: Upcoming senior
Future Position: Graduate School
Stats
Year   diet cokes/day   lab days/week
2006   1                2
2007   6                8
Developing rules for extraction
For many methods:
• Identify relevant verb (V) and direct object (DO) in method signature
• Classify pattern of V and DO locations
• If new pattern, create new extraction rule
[Figure: example patterns of verb and DO locations]
Our Current Extraction Rules
4 general rules with subcategories:
Rule                Example                DO        Verb
Left Verb           URL parseURL()         URL       parse
Right Verb          void mouseDragged()    mouse     dragged
Generic Verb        void Host.onSaved()    host      saved
Unidentified Verb   void message()         message   -
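The four general rules can be mimicked with a toy classifier that reproduces the slide's four examples. This Java sketch is only one reading of those examples, not the authors' rule set: the verb lexicon, the generic prefixes ("on", "do", "handle"), the "-ed" test for a right verb, and the class name VerbDoExtractor are all assumptions, and the subcategories are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class VerbDoExtractor {
    // Hypothetical toy lexicons standing in for real part-of-speech tagging.
    private static final Set<String> KNOWN_VERBS =
            new HashSet<>(Arrays.asList("parse", "get", "set", "add", "remove", "play"));
    private static final Set<String> GENERIC_PREFIXES =
            new HashSet<>(Arrays.asList("on", "do", "handle"));
    private static final String CAMEL_BOUNDARY =
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    static final class Pair {
        final String rule;
        final String verb; // null when no verb was identified
        final String directObject;

        Pair(String rule, String verb, String directObject) {
            this.rule = rule;
            this.verb = verb;
            this.directObject = directObject;
        }

        @Override
        public String toString() {
            return rule + ": <" + (verb == null ? "-" : verb) + ", " + directObject + ">";
        }
    }

    /** Classifies a method name and returns a lower-cased verb-DO pair. */
    static Pair extract(String declaringClass, String methodName) {
        String[] words = methodName.split(CAMEL_BOUNDARY);
        for (int i = 0; i < words.length; i++) {
            words[i] = words[i].toLowerCase(Locale.ROOT);
        }
        int n = words.length;
        if (KNOWN_VERBS.contains(words[0])) {
            // Verb leads the name; the rest of the name is the direct object.
            return new Pair("Left Verb", words[0], join(words, 1, n));
        }
        if (n > 1 && GENERIC_PREFIXES.contains(words[0])) {
            // Generic prefix: take the DO from the declaring class instead.
            return new Pair("Generic Verb", join(words, 1, n),
                    declaringClass.toLowerCase(Locale.ROOT));
        }
        if (n > 1 && words[n - 1].endsWith("ed")) {
            // Verb trails the name; the preceding words are the direct object.
            return new Pair("Right Verb", words[n - 1], join(words, 0, n - 1));
        }
        return new Pair("Unidentified Verb", null, join(words, 0, n));
    }

    private static String join(String[] words, int from, int to) {
        return String.join(" ", Arrays.copyOfRange(words, from, to));
    }

    public static void main(String[] args) {
        System.out.println(extract("", "parseURL"));
        System.out.println(extract("", "mouseDragged"));
        System.out.println(extract("Host", "onSaved"));
        System.out.println(extract("", "message"));
    }
}
```

A left verb with no DO in the method name yields an empty direct object here; per the slides, that case is where the subcategories look at parameters, return type, and declaring class.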
Example: Sub-Categories for Left-Verb General Rule
Look beyond the method name: Parameters, Return type, Declaring class name, Type hierarchy
Subcategory
1) Standard left verb
2) No DO in method name; has parameters; non-object return type
3) No DO in method name; no parameters; no return type
4) Creational left verb; has return type
5) No DO in method name; has parameters; return type is more specific than parameters in type hierarchy
6) No DO in method name; parameters are more specific than return type in type hierarchy
Example for subcategory 2: Verb-DO pair <remove, UserID> (Left Verb)
Representing Verb-DO Pairs
Action-Oriented Identifier Graph (AOIG)
[Figure: verb nodes (verb1, verb2, verb3) and DO nodes (DO1, DO2, DO3) connect to verb-DO pair nodes (verb1, DO1), (verb1, DO2), (verb3, DO2), (verb2, DO3); "use" edges link the pairs to source code files]
[Figure: the same graph for an example program, with verbs play, add, remove and DOs file, playlist, listener; pairs (play, file), (play, playlist), (remove, playlist), (add, listener); "use" edges link the pairs to source code files]
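The graph in the figure can be approximated with three maps: verbs to their DOs, DOs to their verbs, and each verb-DO pair to the places in the source code that use it. The Java sketch below is a simplified stand-in for the AOIG, not the structure from [AOSD 06]; the class name, method names, and all use locations except Player.play (taken from the earlier Player example) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class ActionOrientedIdentifierGraph {
    private final Map<String, Set<String>> dosByVerb = new TreeMap<>();
    private final Map<String, Set<String>> verbsByDo = new TreeMap<>();
    private final Map<String, List<String>> usesByPair = new HashMap<>();

    /** Records that the pair (verb, directObject) is used at the given code location. */
    void addUse(String verb, String directObject, String location) {
        dosByVerb.computeIfAbsent(verb, k -> new TreeSet<>()).add(directObject);
        verbsByDo.computeIfAbsent(directObject, k -> new TreeSet<>()).add(verb);
        usesByPair.computeIfAbsent(key(verb, directObject), k -> new ArrayList<>())
                .add(location);
    }

    Set<String> directObjectsOf(String verb) {
        return dosByVerb.getOrDefault(verb, Collections.<String>emptySet());
    }

    Set<String> verbsOf(String directObject) {
        return verbsByDo.getOrDefault(directObject, Collections.<String>emptySet());
    }

    List<String> usesOf(String verb, String directObject) {
        return usesByPair.getOrDefault(key(verb, directObject),
                Collections.<String>emptyList());
    }

    private static String key(String verb, String directObject) {
        return verb + "," + directObject;
    }

    /** Builds the four pairs from the slide; locations other than Player.play are made up. */
    static ActionOrientedIdentifierGraph slideExample() {
        ActionOrientedIdentifierGraph graph = new ActionOrientedIdentifierGraph();
        graph.addUse("play", "file", "Player.play");
        graph.addUse("play", "playlist", "PlaylistView.playPlaylist");   // hypothetical
        graph.addUse("remove", "playlist", "Library.removePlaylist");    // hypothetical
        graph.addUse("add", "listener", "Player.addListener");           // hypothetical
        return graph;
    }

    public static void main(String[] args) {
        ActionOrientedIdentifierGraph graph = slideExample();
        System.out.println(graph.directObjectsOf("play")); // [file, playlist]
        System.out.println(graph.verbsOf("playlist"));     // [play, remove]
        System.out.println(graph.usesOf("play", "file"));  // [Player.play]
    }
}
```

Looking up by verb or by DO and then following a pair to its uses is what lets a search tool move from a natural language query to concrete code locations.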
Evaluation of Extraction Process
Compare automatic vs ideal (human) extraction
• 300 methods from 6 medium open source programs
• Annotated by 3 Java developers
Promising Results
• Precision: 57%
• Recall: 64%
Context of Results
• Did not analyze trivial methods
• On average, at least verb OR direct object obtained
Name: Emily Gibson Hill
Nickname: Batter on Deck
Current Position: 2nd year PhD Student
Future Position: PhD Candidate
Stats
Year   cokes/day   meetings/week
2003   0.2         1
2007   2           5
Program Exploration
Purpose: Expedite software maintenance and program comprehension
Key Insight: Automated tools can use program structure and identifier names to save the developer time and effort
Ongoing work: Dora the Program Explorer*
* Dora comes from exploradora, the Spanish word for a female explorer.
Dora
Natural Language Query
• Maintenance request
• Expert knowledge
• Query expansion
Program Structure
• Representation
  • Current: call graph
• Seed starting point
Relevant Neighborhood
• Subgraph relevant to query
State of the Art in Exploration
Structural (dependence, inheritance)
• Slicing
• Suade [Robillard 2005]
Lexical (identifier names, comments)
• Regular expressions: grep, Eclipse search
• Information Retrieval: FindConcept, Google Eclipse Search [Poshyvanyk 2006]
Motivating need for structural and lexical information
Example Scenario
Program: JBidWatcher, an eBay auction sniping program
Bug: User-triggered add auction event has no effect
Task: Locate code related to ‘add auction’ trigger
Seed: DoAction() method, from prior knowledge
Using only structural information
Looking for: ‘add auction’ trigger
DoAction() has 38 callees, only 2/38 are relevant
• Relevant methods: DoAdd(), DoPasteFromClipboard()
• Irrelevant methods: drawn as DoNada() in the call graph figure
And what if you wanted to explore more than one edge away?
Locates locally relevant items, but many irrelevant
Using only lexical information
Looking for: ‘add auction’ trigger
50/1812 methods contain matches to ‘add*auction’ regular expression query
Only 2/50 are relevant
Locates globally relevant items, but many irrelevant
Combining Structural & Lexical Information
Looking for: ‘add auction’ trigger
Structural: guides exploration from seed
Lexical: prunes irrelevant edges
Relevant Neighborhood: DoAction(), DoPasteFromClipboard(), DoAdd()
The Dora Approach
Prune irrelevant structural edges from seed
• Determine method relevance to query
  • Calculate lexical-based relevance score
• Low-scored methods pruned from neighborhood
• Recursively explore
Calculating Relevance Score: Term Frequency
Query: ‘add auction’
Score based on query term frequency of the method
[Figure: one method with 6 query term occurrences, another with only 2 occurrences]
Calculating Relevance Score: Location Weights
Query: ‘add auction’
Weigh term frequency based on location:
• Method name more important than body
• Method body statements normalized by length
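A location-weighted term frequency score of this kind might look like the following Java sketch. The weights 0.7 and 0.3, the tokenization, and the class name RelevanceScorer are assumptions chosen for illustration; the slides do not give Dora's actual formula.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class RelevanceScorer {
    // Hypothetical weights: the method name counts more than the body.
    static final double NAME_WEIGHT = 0.7;
    static final double BODY_WEIGHT = 0.3;

    private static final String CAMEL_BOUNDARY =
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    /** Breaks text into lower-cased words, splitting camel-case identifiers. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String chunk : text.split("[^A-Za-z]+")) {
            if (chunk.isEmpty()) {
                continue;
            }
            for (String word : chunk.split(CAMEL_BOUNDARY)) {
                out.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    /** Query term occurrences divided by token count, so long text is not favored. */
    static double termFrequency(List<String> tokens, Set<String> queryTerms) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String token : tokens) {
            if (queryTerms.contains(token)) {
                hits++;
            }
        }
        return (double) hits / tokens.size();
    }

    /** Weighted sum of name and body term frequencies; the result lies in [0, 1]. */
    static double score(String methodName, String body, Set<String> queryTerms) {
        return NAME_WEIGHT * termFrequency(tokens(methodName), queryTerms)
                + BODY_WEIGHT * termFrequency(tokens(body), queryTerms);
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("add", "auction"));
        // Method bodies below are invented for the example.
        System.out.println(score("DoAdd", "auction.add(entry);", query));
        System.out.println(score("DoSave", "file.write(data);", query));
    }
}
```

Dividing by the number of tokens is one simple way to normalize by length, so that a long body with a few incidental matches does not outrank a short, focused one.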
Dora explores ‘add auction’ trigger
From DoAction() seed:
Correctly identified at 0.5 threshold
• DoAdd() (0.93)
• DoPasteFromClipboard() (0.60)
With only one false positive
• DoSave() (0.52)
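The exploration itself, starting at the seed, following call edges, and pruning callees that score below the threshold, can be sketched in Java as below. This is an illustration under assumptions, not Dora's implementation: relevance scores are passed in as a precomputed map, and the methods DoHelp, AddEntry, and LogEvent with their scores are hypothetical additions to the three scored methods from the slide.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NeighborhoodExplorer {
    /** Returns the seed plus every method reachable through callees scoring at or above the threshold. */
    static Set<String> explore(String seed, Map<String, List<String>> callGraph,
                               Map<String, Double> relevance, double threshold) {
        Set<String> neighborhood = new LinkedHashSet<>();
        neighborhood.add(seed);
        visit(seed, callGraph, relevance, threshold, neighborhood);
        return neighborhood;
    }

    private static void visit(String method, Map<String, List<String>> callGraph,
                              Map<String, Double> relevance, double threshold,
                              Set<String> neighborhood) {
        for (String callee : callGraph.getOrDefault(method, Collections.<String>emptyList())) {
            double score = relevance.getOrDefault(callee, 0.0);
            if (score < threshold) {
                continue; // prune the low-scored edge and everything behind it
            }
            if (neighborhood.add(callee)) {
                visit(callee, callGraph, relevance, threshold, neighborhood);
            }
        }
    }

    /** Slide scenario with a few hypothetical extra methods to show pruning and recursion. */
    static Set<String> demo() {
        Map<String, List<String>> callGraph = new HashMap<>();
        callGraph.put("DoAction",
                Arrays.asList("DoAdd", "DoPasteFromClipboard", "DoSave", "DoHelp"));
        callGraph.put("DoAdd", Arrays.asList("AddEntry", "LogEvent"));

        Map<String, Double> relevance = new HashMap<>();
        relevance.put("DoAdd", 0.93);                // from the slide
        relevance.put("DoPasteFromClipboard", 0.60); // from the slide
        relevance.put("DoSave", 0.52);               // from the slide (false positive)
        relevance.put("DoHelp", 0.10);               // hypothetical
        relevance.put("AddEntry", 0.80);             // hypothetical
        relevance.put("LogEvent", 0.05);             // hypothetical

        return explore("DoAction", callGraph, relevance, 0.5);
    }

    public static void main(String[] args) {
        // [DoAction, DoAdd, AddEntry, DoPasteFromClipboard, DoSave]
        System.out.println(demo());
    }
}
```

Because a pruned callee is never visited, its own callees are skipped as well, which is how the lexical score keeps the structural exploration from fanning out across the whole call graph.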
Summary
NL technology used: Synonyms, collocations, morphology, word frequencies, part-of-speech tagging, AOIG
Evaluation indicates: Natural language information shows promise for improving software development tools
Key to success: Accurate extraction of NL clues
Our Current and Future Work
Basic NL-based tools for software
• Abbreviation expander
• Program synonyms
• Determining relative importance of words
Integrating information retrieval techniques
Posed Questions for Discussion
What open problems faced by software tool developers can be mitigated by NLPA?
Under what circumstances is NLPA not useful?