TRANSCRIPT
Research Heaven, West Virginia
See More, Learn More, Tell More
Tim Menzies, West Virginia University
Galaxy Global: Robert (Mike) Chapman
Justin Di Stefano
NASA IV&V: Kenneth McGill
Pat Callis
WVU: Kareem Ammar
JPL: Allen Nikora
DN American: John Davis
NASA OSMA SAS’03 [2] of 23
What’s unique about OSMA research?
See more!
Learn more!
Tell more!
Important: transition to the broader NASA software community
NASA OSMA SAS’03 [3] of 23
• Old dialogue: “v(G)>10 is a good thing”
• New dialogue:
  – A core method in my analysis was M1
  – I have compared M1 to M2
    • On NASA-related data
    • Using criteria C1
  – I argue for the merits of C1 as follows
    • possible via a discussion on C2, C3, …
  – I endorse/reject M1 because of that comparison
Show me the money! Show me the data!
NASA OSMA SAS’03 [4] of 23
The IV&V Holy Grail
• Learn “stuff” in the early lifecycle
  – that would lead to late-lifecycle errors
• Actions:
  – change that stuff now, OR
  – plan more IV&V on that stuff
[Figure: early-lifecycle artifacts (bubbles, e.g. UML and block diagrams; English; code; traces; issues), with callouts “Hey, that’s funny” and “Better watch out!”]
NASA OSMA SAS’03 [5] of 23
The IV&V metrics repository
• Galaxy Global (P.I. = Robert (Mike) Chapman)
  – NASA P.O.C. = Pat Callis
• Cost: $0
• Currently:
  – Mostly code metrics on a small number of projects
  – Defect fields and static code measures
  – Also, for C++ code, some class-based metrics
• Real soon:
  – Requirements mapped to code functions, plus defect logs
  – For only 1 project
• Some time in the near future:
  – As above, for more projects
http://mdp.ivv.nasa.gov/
NASA OSMA SAS’03 [6] of 23
Repositories or Sarcophagus? (use it or lose it!)
[Figure: a data sarcophagus vs. an active data repository]
NASA OSMA SAS’03 [7] of 23
Who’s using MDP data?
• Ammar, Kareem; Callis, Pat; Chapman, Mike; Cukic, Bojan; Davis, John; Di Stefano, Justin; Goa, Lan; McGill, Kenneth; Menzies, Tim
• Dekhtyar, Alex; Hayes, Jane; Merritt, Phillip; Nikora, Allen; Orrego, Andres; Wallace, Dolores; Wilson, Aaron
NASA OSMA SAS’03 [8] of 23
Project 2: Learn defect detectors from code
• Not perfect predictors
  – Hints that let us focus our effort
• Example:
  – V(g) = McCabe’s cyclomatic complexity = pathways through a function
  – if V(g) > 10 then predict defects
• Traffic light browser (a minimal sketch appears below):
  – Green: no faults known/predicted
  – Red: faults detected, i.e. a link to a fault database
  – Yellow: faults predicted, based on past experience, learned via data mining
Used MDP data
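A minimal sketch of how such a traffic-light rule could be coded, assuming each module record carries a v(g) measure and a link to any known faults; the field names and dictionary layout here are illustrative, not the MDP schema:

    # Hypothetical sketch of the traffic-light idea; "vg" and "known_faults"
    # are illustrative field names, not the actual MDP schema.
    def traffic_light(module, vg_threshold=10):
        if module.get("known_faults"):          # red: faults detected (fault-database link)
            return "red"
        if module.get("vg", 0) > vg_threshold:  # yellow: faults predicted from past data
            return "yellow"
        return "green"                          # green: no faults known or predicted

    print(traffic_light({"vg": 14}))                             # -> yellow
    print(traffic_light({"vg": 3, "known_faults": ["DR-101"]}))  # -> red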
NASA OSMA SAS’03 [9] of 23
Static code metrics for defect detection = a very bad idea?
• Better idea:
  – model-based methods to study the deep semantics of this code
  – e.g. Heimdahl, Menzies, Owen, et al.
  – e.g. Owen, Menzies
• High cost of model-based methods
• How about cheaper alternatives?
NASA OSMA SAS’03 [10] of 23
Static code metrics for defect detection = a very bad idea?
• Shepperd & Ince:
  – “… (cyclomatic complexity is) based upon poor theoretical foundations and an inadequate model of software development”
  – “… for a large class of software it is no more than a proxy for, and in many cases outperformed by, lines of code.”
• High utility of “mere” LOC also seen by:
  – Chapman and Solomon (2003)
• Folks wasting their time:
  – Porter and Selby (1990)
  – Tian and Zelkowitz (1995)
  – Khoshgoftaar and Allen (2001)
  – Lan Goa & Cukic (2003)
  – Menzies et al. (2003)
  – etc.
[Figure: learning decisions from all attributes vs. some attributes, with Magic, C4.5, and Naïve Bayes; Ammar, Menzies, Nikora, 2003]
Used MDP data
NASA OSMA SAS’03 [11] of 23
Shepperd & Ince: simple LOC out-performs others
(according to “correlation”)
• “Correlation” is not “decision”
  – Correlation:
    • defects = 0.0164 + 0.0114*LOC
  – Classification:
    • How often does the theory correctly classify correct/incorrect examples?
    • E.g. (0.0164 + 0.0114*LOC) >= 1
• Learning classifiers is different to learning correlations (see the sketch below)
  – The best classifiers found by [Ammar, Menzies, Nikora 2003] did NOT use LOC:
    • KC2: ev(g) >= 4.99
    • JM1: unique operands >= 60.48
Used MDP data
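To make the distinction concrete, here is a minimal sketch on illustrative data (the coefficient form follows the Detector1 equation quoted on the next slide): fit a line to LOC, then score that same line as a yes/no detector.

    import numpy as np

    # Illustrative module data: lines of code and observed defect counts.
    loc     = np.array([ 20,  45, 120, 300,  15, 500])
    defects = np.array([  0,   0,   1,   3,   0,   6])

    # "Correlation" view: a least-squares fit of the form defects = a + b*LOC.
    b, a = np.polyfit(loc, defects, 1)          # slope, intercept

    # "Classification" view: threshold the fitted line and count correct yes/no calls.
    predicted = (a + b * loc) >= 1              # predicted defective?
    actual    = defects >= 1                    # actually defective?
    print(f"defects = {a:.4f} + {b:.4f}*LOC; "
          f"classification accuracy = {np.mean(predicted == actual):.0%}")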
NASA OSMA SAS’03 [12] of 23
• Accuracy, like correlation, can miss vital features
  – Same accuracy/correlations
  – Different detection, false alarm rates
• [Ammar, Menzies, Nikora 2003]: astonishingly few metrics are required to generate accurate defect detectors
Detected \ Truth      no          yes
  no                  A, locA     B, locB
  yes                 C, locC     D, locD

Accuracy = (A+D)/(A+B+C+D)
False alarm = PF = C/(A+C)
Got it right = PD = D/(B+D)
%effort = (locC + locD) / (locA + locB + locC + locD)
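These definitions drop straight into code; a minimal sketch using the cell names from the table above:

    def detector_scores(A, B, C, D, locA, locB, locC, locD):
        """Score a defect detector from the confusion-matrix cells above."""
        accuracy = (A + D) / (A + B + C + D)
        pf = C / (A + C)                                       # false alarm rate
        pd = D / (B + D)                                       # probability of detection
        effort = (locC + locD) / (locA + locB + locC + locD)   # fraction of LOC to inspect
        return accuracy, pf, pd, effort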
[Chart: accuracy, correlation, %effort, pd, and pf of detector2 and detector3, shown as values relative to detector1 (scale 0.0 to 2.0)]
Detector1 (LSR: LOC): 0.0164 + 0.0114*LOC
  correlation(0.0164 + 0.0114*LOC) = 0.66%
  classification((0.0164 + 0.0114*LOC) > 1) = 80%
Detector2 (LSR: McCabe): 0.0216 + 0.0954*v(g) - 0.109*ev(g) + 0.0598*iv(g)
Detector3 (LSR: Halstead): 0.00892 - 0.00432*uniqOp + 0.0147*uniqOpnd - 0.01*totalOp + 0.0225*totalOpnd
Used MDP data
NASA OSMA SAS’03 [13] of 23
Don’t assess defect detectors on just accuracy
• [Menzies, Di Stefano, Ammar, McGill, Callis, Chapman, Davis, 2003]:
  – A study of 300+ detectors generated from MDP logs
• Sailing on the smooth sea of accuracy
  – With many unseen rocks below
  – “Rocks” = huge variations in false alarm / detection / effort probabilities
[Figure: the flat sea of accuracy, with %effort, PD, and PF rocks beneath it]
Used MDP data
NASA OSMA SAS’03 [14] of 23
Not “one ring to rule them all”
• Risk-averse projects: high PD, not very high PFs
• Cost-averse projects (low PF): don’t waste time chasing false alarms
• New-relationship projects (low PF): don’t tell the client anything dumb
• Time-constrained projects (low effort): limited time for inspections
• Writing coding standards: categorical statements
• Detectors tuned to the needs of your domain
v(g) ≥ 10
NASA OSMA SAS’03 [15] of 23
IQ = automatically exploring all detector trade-offs
• Setup:
  – decide your evaluation criteria
    • 0 <= criteria <= 1; “good” = 1, “bad” = 0
    • e.g. pf, pd, effort, cost, support, …
    • an interesting detector is optimal on some pair of criteria
  – quickly generate a lot of detectors
• Repeat:
  – add in three “artificial detectors”
  – draw a lasso around all the points; i.e. compute the convex hull
  – forget the un-interesting detectors (inside the hull)
  – generate combinations (conjunctions) of the detectors on the hull
• Until the hull stops growing towards the “sweet spot”
• IQ = iterative quick hull
• For N > 2 criteria:
  – Draw the hull for N!/(2!(N-2)!) = N(N-1)/2 pairs of criteria
  – A detector is interesting if it appears on the hull in any of these combinations
• Result: a few, very interesting, detectors (a minimal sketch of the hull step follows)
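A minimal sketch of the hull step, assuming every detector has already been scored on criteria rescaled so that 1 is “good” (e.g. pd, 1-pf, 1-effort); scipy’s ConvexHull stands in for the “lasso”, and all names here are illustrative rather than the original IQ implementation:

    from itertools import combinations
    import numpy as np
    from scipy.spatial import ConvexHull

    def interesting(scores):
        """scores: (n_detectors, n_criteria) array, each criterion in [0,1], 1 = good.
        Returns indices of detectors on the convex hull of at least one pair of criteria."""
        keep = set()
        for i, j in combinations(range(scores.shape[1]), 2):   # N*(N-1)/2 pairs
            hull = ConvexHull(scores[:, [i, j]])                # the "lasso" around the points
            keep.update(int(v) for v in hull.vertices)
        return sorted(keep)

IQ would then conjoin the surviving detectors, rescore them, and repeat until the hull stops growing towards the <1,1> sweet spot.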
[Figure: detectors plotted on PD vs. 1-Effort; the “sweet spot” is <1,1>]
NASA OSMA SAS’03 [16] of 23
IQ generates a wide range of interesting detectors
[Charts: pd, pf, accuracy, and %effort for a range of detectors learnt by decision trees and Naïve Bayes, e.g.:
  Branch_Count >= 3.30 and N >= 6.22 and L >= 0.044
  L >= 0.044 and Uniq_Opnd >= 25.80 and Total_Op >= 22.03 and Total_Opnd >= 37.16
  L >= 0.044 and N >= 6.2
The detectors span a wide range of trade-offs; %effort values shown include 20%, 40%, and 52%.]
Used MDP data
NASA OSMA SAS’03 [17] of 23
Future work
• Is it always domain-specific tuning?
  – Lan Goa & Cukic (2003)
  – Applying detectors learnt from project1 to project2
• The “DEFECT 1000”
  – Menzies, Massey, et al. (2004+)
  – For 1000 projects
  – Learn defect detectors
  – Look for repeated patterns
• But what if no repeated detectors?
  – Is tuning always domain-specific?
  – No general rules?
  – What advice do we give projects?
    1. Collect your defect logs
    2. Define your local assessment criteria
    3. Learn, until conclusions stabilize
    4. Check old conclusions against new
Uses MDP data
NASA OSMA SAS’03 [18] of 23
Conclusions (1)
• Thanks to NASA’s Metrics Data Program, we are
  – seeing more,
  – learning more,
  – telling more (this talk)
• Not “one ring to rule them all”; e.g. “v(g) >= 10”
• Prior pessimism about defect detectors was premature
  – correlation and classification accuracy: not enough
  – the flat sea of accuracy and the savage rocks below
See more!
Learn more!
Tell more!
v(g) ≥ 10
NASA OSMA SAS’03 [19] of 23
Conclusions (2)
• Defect detection is subtler than we’d thought
  – Effort, pf, pd, …
• Standard data miners are insensitive to these subtleties
• IQ: a new data miner
• A level playing field to compare different techniques
  – entropy vs IR vs DS vs McCabes vs IQ vs …
Show me the data!
NASA OSMA SAS’03 [20] of 23
Conclusions (3)
Use MDP data: http://mdp.ivv.nasa.gov/
NASA OSMA SAS’03 [22] of 23
Yet more
1. Accuracy
2. PD
3. PF
4. Effort
5. Distance to some user-goal
   • e.g. effort = 25%
6. Cost
   – e.g. a McCabe license = $50K/year
7. Precision
   – D/(C+D)
8. Support
   – (C+D)/(A+B+C+D) (precision and support are sketched after the table below)
9. External validity
   • Stability: N-way cross-val (note: support can predict for stability)
   • Generality: the same result in multiple projects
10. Lift
   – Change in the weighted sum of classes
11. Etc.
• Problem: can’t do it using standard data miners
Detected \ Truth      no          yes
  no                  A, locA     B, locB
  yes                 C, locC     D, locD

Accuracy = (A+D)/(A+B+C+D)
False alarm = PF = C/(A+C)
Got it right = PD = D/(B+D)
%effort = (locC + locD) / (locA + locB + locC + locD)
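Precision and support follow the same pattern as the earlier accuracy/PD/PF sketch; a minimal version using the cell names above:

    def precision(A, B, C, D):
        return D / (C + D)                 # of the modules flagged, how many are truly defective

    def support(A, B, C, D):
        return (C + D) / (A + B + C + D)   # how often the detector fires at all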
NASA OSMA SAS’03 [23] of 23
Standard data miners optimize for accuracy, miss interesting options
                     truth: no   truth: yes
detected no:             A           B
detected yes:            C           D
acc = (A+D)/(A+B+C+D), pd = D/(B+D), pf = C/(C+A)

C4.5 (minobs=128):   A=3064  B=901   C=136  D=192   acc 76%, pd 21%, pf 4%
C4.5 (minobs=2):     A=2895  B=794   C=305  D=299   acc 74%, pd 21%, pf 10%
Naïve Bayes:         A=3041  B=873   C=159  D=220   acc 76%, pd 21%, pf 5%
[10-way cross validation]
Used MDP data
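A minimal sketch of that kind of comparison, assuming a feature matrix X and binary defect labels y loaded from an MDP table; scikit-learn’s decision tree and Gaussian Naïve Bayes stand in for C4.5 and the original Naïve Bayes tooling, and every name here is illustrative:

    from sklearn.tree import DecisionTreeClassifier      # stand-in for C4.5
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    def pd_pf_acc(model, X, y):
        """10-way cross-validation; report pd and pf, not just accuracy."""
        pred = cross_val_predict(model, X, y, cv=10)
        tn, fp, fn, tp = confusion_matrix(y, pred).ravel()   # A, C, B, D in the slide's terms
        return tp / (tp + fn), fp / (fp + tn), (tp + tn) / len(y)

    # X, y = ...  # e.g. one row per module: static code metrics, label = defective?
    # for name, model in [("tree", DecisionTreeClassifier(min_samples_leaf=2)),
    #                     ("naive bayes", GaussianNB())]:
    #     print(name, pd_pf_acc(model, X, y))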
NASA OSMA SAS’03 [24] of 23
Project 1: Learning what is a “bad” requirement
• What features of requirements (in English) predict for defects (in “C”)?
• What features do we extract from English text?
  – ARM phrases (sketched below):
    • Weak; e.g. “maybe”
    • Strong; e.g. “must”
    • Continuations; e.g. “and”, “but”, …
  – Porter’s algorithm: stemming
  – LEG, total/unique: words, verbs, nouns, dull words (e.g. “a”, “be”, …), interesting words (i.e. from a data dictionary), etc.
  – WORDNET: how to find nouns, verbs, synonyms
Uses MDP data
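A minimal sketch of the phrase-counting part of that extraction; the word lists are abbreviated stand-ins for the ARM categories and the tokenizer is deliberately crude:

    import re

    WEAK          = {"maybe", "perhaps", "possibly"}   # abbreviated stand-in lists
    STRONG        = {"must", "shall", "will"}
    CONTINUATIONS = {"and", "but", "or"}

    def requirement_features(text):
        """Count weak/strong/continuation phrases plus total and unique words."""
        words = re.findall(r"[a-z']+", text.lower())
        return {
            "weak":          sum(w in WEAK for w in words),
            "strong":        sum(w in STRONG for w in words),
            "continuations": sum(w in CONTINUATIONS for w in words),
            "total_words":   len(words),
            "unique_words":  len(set(words)),
        }

    print(requirement_features("The system must log faults and maybe retry."))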
NASA OSMA SAS’03 [25] of 23
PACE: browser for detector trade-offs
Phillip Merritt, Aaron Wilson
NASA OSMA SAS’03 [26] of 23
Project 2: Learning “holes” and “poles” in models
data miner
Requires: assessment criteria of outputs
Requires: detailed knowledge of internals