TRANSCRIPT
Research Heaven, West Virginia
See More, Learn More, Tell More
Tim Menzies, West Virginia University
Galaxy Global: Robert (Mike) Chapman
Justin Di Stefano
NASA IV&V: Kenneth McGill
Pat Callis
WVU: Kareem Ammar
JPL: Allen Nikora
DN American: John Davis
NASA OSMA SAS’03 [2] of 23
What’s unique about OSMA research?
See more!
Learn more!
Tell more!
Important: transition to the broader NASA software community
NASA OSMA SAS’03 [3] of 23
• Old dialogue: “v(G)>10 is a good thing”
• New dialogue:
  – A core method in my analysis was M1
  – I have compared M1 to M2
    • On NASA-related data
    • Using criteria C1
  – I argue for the merits of C1 as follows
    • possible via a discussion on C2, C3, …
  – I endorse/reject M1 because of that comparison
Show me the money! Show me the data!
NASA OSMA SAS’03 [4] of 23
The IV&V Holy Grail
• Learn “stuff” in the early lifecycle
  – that would lead to late-lifecycle errors
• Actions:
  – change that stuff now, OR
  – plan more IV&V on that stuff
[Figure: early-lifecycle artifacts (bubbles, e.g. UML and block diagrams; English; code; traces; issues), with callouts “Hey, that’s funny” and “Better watch out!”]
NASA OSMA SAS’03 [5] of 23
The IV&V metrics repository
• Galaxy Global (P.I. = Robert (Mike) Chapman)
  – NASA P.O.C. = Pat Callis
• Cost: $0
• Currently:
  – Mostly code metrics on a small number of projects
  – Defect fields and static code measures
  – Also, for C++ code, some class-based metrics
• Real soon:
  – Requirements mapped to code functions, plus defect logs
  – For only 1 project
• Some time in the near future:
  – As above, for more projects
http://mdp.ivv.nasa.gov/
NASA OSMA SAS’03 [6] of 23
Repositories or Sarcophagus? (use it or lose it!)
[Figure: a data sarcophagus vs. an active data repository]
NASA OSMA SAS’03 [7] of 23
Who’s using MDP data?
• Ammar, Kareem; Callis, Pat; Chapman, Mike; Cukic, Bojan; Davis, John; Di Stefano, Justin; Goa, Lan; McGill, Kenneth; Menzies, Tim
• Dekhtyar, Alex; Hayes, Jane; Merritt, Phillip; Nikora, Allen; Orrego, Andres; Wallace, Dolores; Wilson, Aaron
NASA OSMA SAS’03 [8] of 23
Project 2: Learn defect detectors from code
• Not perfect predictors
  – Hints that let us focus our effort
• Example:
  – V(g) = McCabe’s cyclomatic complexity = pathways through a function
  – if V(g) > 10 then predict defects
• Traffic light browser (a minimal sketch appears below):
  – Green: no faults known/predicted
  – Red: faults detected, i.e. a link to a fault database
  – Yellow: faults predicted, based on past experience, learned via data mining
Used MDP data
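A minimal sketch of how such a traffic-light rule could be coded, assuming each module record carries a v(g) measure and a link to any known faults; the field names and dictionary layout here are illustrative, not the MDP schema:

    # Hypothetical sketch of the traffic-light idea; "vg" and "known_faults"
    # are illustrative field names, not the actual MDP schema.
    def traffic_light(module, vg_threshold=10):
        if module.get("known_faults"):          # red: faults detected (fault-database link)
            return "red"
        if module.get("vg", 0) > vg_threshold:  # yellow: faults predicted from past data
            return "yellow"
        return "green"                          # green: no faults known or predicted

    print(traffic_light({"vg": 14}))                             # -> yellow
    print(traffic_light({"vg": 3, "known_faults": ["DR-101"]}))  # -> red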
NASA OSMA SAS’03 [9] of 23
Static code metrics for defect detection = a very bad idea?
• Better idea:
  – model-based methods to study the deep semantics of this code
  – e.g. Heimdahl, Menzies, Owen, et al.
  – e.g. Owen, Menzies
• High cost of model-based methods
• How about cheaper alternatives?
NASA OSMA SAS’03 [10] of 23
Static code metrics for defect detection = a very bad idea?
• Shepperd & Ince:
  – “… (cyclomatic complexity is) based upon poor theoretical foundations and an inadequate model of software development”
  – “… for a large class of software it is no more than a proxy for, and in many cases outperformed by, lines of code.”
• High utility of “mere” LOC also seen by:
  – Chapman and Solomon (2003)
• Folks wasting their time:
  – Porter and Selby (1990)
  – Tian and Zelkowitz (1995)
  – Khoshgoftaar and Allen (2001)
  – Lan Goa & Cukic (2003)
  – Menzies et al. (2003)
  – etc.
[Figure: learning decisions from all attributes vs. some attributes, with Magic, C4.5, and Naïve Bayes; Ammar, Menzies, Nikora, 2003]
Used MDP data
NASA OSMA SAS’03 [11] of 23
Shepperd & Ince: simple LOC out-performs others
(according to “correlation”)
• “Correlation” is not “decision”
  – Correlation:
    • defects = 0.0164 + 0.0114*LOC
  – Classification:
    • How often does the theory correctly classify correct/incorrect examples?
    • E.g. (0.0164 + 0.0114*LOC) >= 1
• Learning classifiers is different to learning correlations (see the sketch below)
  – The best classifiers found by [Ammar, Menzies, Nikora 2003] did NOT use LOC:
    • KC2: ev(g) >= 4.99
    • JM1: unique operands >= 60.48
Used MDP data
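To make the distinction concrete, here is a minimal sketch on illustrative data (the coefficient form follows the Detector1 equation quoted on the next slide): fit a line to LOC, then score that same line as a yes/no detector.

    import numpy as np

    # Illustrative module data: lines of code and observed defect counts.
    loc     = np.array([ 20,  45, 120, 300,  15, 500])
    defects = np.array([  0,   0,   1,   3,   0,   6])

    # "Correlation" view: a least-squares fit of the form defects = a + b*LOC.
    b, a = np.polyfit(loc, defects, 1)          # slope, intercept

    # "Classification" view: threshold the fitted line and count correct yes/no calls.
    predicted = (a + b * loc) >= 1              # predicted defective?
    actual    = defects >= 1                    # actually defective?
    print(f"defects = {a:.4f} + {b:.4f}*LOC; "
          f"classification accuracy = {np.mean(predicted == actual):.0%}")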
NASA OSMA SAS’03 [12] of 23
• Accuracy, like correlation, can miss vital features
  – Same accuracy/correlations
  – Different detection, false alarm rates
• [Ammar, Menzies, Nikora 2003]: astonishingly few metrics are required to generate accurate defect detectors
Detected \ Truth      no          yes
  no                  A, locA     B, locB
  yes                 C, locC     D, locD

Accuracy = (A+D)/(A+B+C+D)
False alarm = PF = C/(A+C)
Got it right = PD = D/(B+D)
%effort = (locC + locD) / (locA + locB + locC + locD)
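These definitions drop straight into code; a minimal sketch using the cell names from the table above:

    def detector_scores(A, B, C, D, locA, locB, locC, locD):
        """Score a defect detector from the confusion-matrix cells above."""
        accuracy = (A + D) / (A + B + C + D)
        pf = C / (A + C)                                       # false alarm rate
        pd = D / (B + D)                                       # probability of detection
        effort = (locC + locD) / (locA + locB + locC + locD)   # fraction of LOC to inspect
        return accuracy, pf, pd, effort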
[Chart: accuracy, correlation, %effort, pd, and pf of detector2 and detector3, shown as values relative to detector1 (scale 0.0 to 2.0)]
Detector1 (LSR: LOC): 0.0164 + 0.0114*LOC
  correlation(0.0164 + 0.0114*LOC) = 0.66%
  classification((0.0164 + 0.0114*LOC) > 1) = 80%
Detector2 (LSR: McCabe): 0.0216 + 0.0954*v(g) - 0.109*ev(g) + 0.0598*iv(g)
Detector3 (LSR: Halstead): 0.00892 - 0.00432*uniqOp + 0.0147*uniqOpnd - 0.01*totalOp + 0.0225*totalOpnd
Used MDP data
NASA OSMA SAS’03 [13] of 23
Don’t assess defect detectors on just accuracy
• [Menzies, Di Stefano, Ammar, McGill, Callis, Chapman, Davis, 2003]:
  – A study of 300+ detectors generated from MDP logs
• Sailing on the smooth sea of accuracy
  – With many unseen rocks below
  – “Rocks” = huge variations in false alarm / detection / effort probabilities
[Figure: the flat sea of accuracy, with %effort, PD, and PF rocks beneath it]
Used MDP data
NASA OSMA SAS’03 [14] of 23
Not “one ring to rule them all”
• Risk-averse projects: high PD, not very high PFs
• Cost-averse projects (low PF): don’t waste time chasing false alarms
• New-relationship projects (low PF): don’t tell the client anything dumb
• Time-constrained projects (low effort): limited time for inspections
• Writing coding standards: categorical statements
• Detectors tuned to the needs of your domain
v(g) ≥ 10
NASA OSMA SAS’03 [15] of 23
IQ = automatically exploring all detector trade-offs
• Setup:
  – decide your evaluation criteria
    • 0 <= criteria <= 1; “good” = 1, “bad” = 0
    • e.g. pf, pd, effort, cost, support, …
    • an interesting detector is optimal on some pair of criteria
  – quickly generate a lot of detectors
• Repeat:
  – add in three “artificial detectors”
  – draw a lasso around all the points; i.e. compute the convex hull
  – forget the un-interesting detectors (inside the hull)
  – generate combinations (conjunctions) of the detectors on the hull
• Until the hull stops growing towards the “sweet spot”
• IQ = iterative quick hull
• For N > 2 criteria:
  – Draw the hull for N!/(2!(N-2)!) = N(N-1)/2 pairs of criteria
  – A detector is interesting if it appears on the hull in any of these combinations
• Result: a few, very interesting, detectors (a minimal sketch of the hull step follows)
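A minimal sketch of the hull step, assuming every detector has already been scored on criteria rescaled so that 1 is “good” (e.g. pd, 1-pf, 1-effort); scipy’s ConvexHull stands in for the “lasso”, and all names here are illustrative rather than the original IQ implementation:

    from itertools import combinations
    import numpy as np
    from scipy.spatial import ConvexHull

    def interesting(scores):
        """scores: (n_detectors, n_criteria) array, each criterion in [0,1], 1 = good.
        Returns indices of detectors on the convex hull of at least one pair of criteria."""
        keep = set()
        for i, j in combinations(range(scores.shape[1]), 2):   # N*(N-1)/2 pairs
            hull = ConvexHull(scores[:, [i, j]])                # the "lasso" around the points
            keep.update(int(v) for v in hull.vertices)
        return sorted(keep)

IQ would then conjoin the surviving detectors, rescore them, and repeat until the hull stops growing towards the <1,1> sweet spot.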
[Figure: detectors plotted on PD vs. 1-Effort; the “sweet spot” is <1,1>]
NASA OSMA SAS’03 [16] of 23
IQ generates a wide range of interesting detectors
[Charts: pd, pf, accuracy, and %effort for a range of detectors learnt by decision trees and Naïve Bayes, e.g.:
  Branch_Count >= 3.30 and N >= 6.22 and L >= 0.044
  L >= 0.044 and Uniq_Opnd >= 25.80 and Total_Op >= 22.03 and Total_Opnd >= 37.16
  L >= 0.044 and N >= 6.2
The detectors span a wide range of trade-offs; %effort values shown include 20%, 40%, and 52%.]
Used MDP data
NASA OSMA SAS’03 [17] of 23
Future work
• Is it always domain-specific tuning?
  – Lan Goa & Cukic (2003)
  – Applying detectors learnt from project1 to project2
• The “DEFECT 1000”
  – Menzies, Massey, et al. (2004+)
  – For 1000 projects
  – Learn defect detectors
  – Look for repeated patterns
• But what if no repeated detectors?
  – Is tuning always domain-specific?
  – No general rules?
  – What advice do we give projects?
    1. Collect your defect logs
    2. Define your local assessment criteria
    3. Learn, until conclusions stabilize
    4. Check old conclusions against new
Uses MDP data
NASA OSMA SAS’03 [18] of 23
Conclusions (1)
• Thanks to NASA’s Metrics Data Program, we are
  – seeing more,
  – learning more,
  – telling more (this talk)
• Not “one ring to rule them all”; e.g. “v(g) >= 10”
• Prior pessimism about defect detectors was premature
  – correlation and classification accuracy: not enough
  – the flat sea of accuracy and the savage rocks below
See more!
Learn more!
Tell more!
v(g) ≥ 10
NASA OSMA SAS’03 [19] of 23
Conclusions (2)
• Defect detection is subtler than we’d thought
  – Effort, pf, pd, …
• Standard data miners are insensitive to these subtleties
• IQ: a new data miner
• A level playing field to compare different techniques
  – entropy vs IR vs DS vs McCabes vs IQ vs …
Show me the data!
NASA OSMA SAS’03 [20] of 23
Conclusions (3)
Use MDP data: http://mdp.ivv.nasa.gov/
NASA OSMA SAS’03 [22] of 23
Yet more
1. Accuracy
2. PD
3. PF
4. Effort
5. Distance to some user-goal
   • e.g. effort = 25%
6. Cost
   – e.g. a McCabe license = $50K/year
7. Precision
   – D/(C+D)
8. Support
   – (C+D)/(A+B+C+D) (precision and support are sketched after the table below)
9. External validity
   • Stability: N-way cross-val (note: support can predict for stability)
   • Generality: the same result in multiple projects
10. Lift
   – Change in the weighted sum of classes
11. Etc.
• Problem: can’t do it using standard data miners
Detected \ Truth      no          yes
  no                  A, locA     B, locB
  yes                 C, locC     D, locD

Accuracy = (A+D)/(A+B+C+D)
False alarm = PF = C/(A+C)
Got it right = PD = D/(B+D)
%effort = (locC + locD) / (locA + locB + locC + locD)
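Precision and support follow the same pattern as the earlier accuracy/PD/PF sketch; a minimal version using the cell names above:

    def precision(A, B, C, D):
        return D / (C + D)                 # of the modules flagged, how many are truly defective

    def support(A, B, C, D):
        return (C + D) / (A + B + C + D)   # how often the detector fires at all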
NASA OSMA SAS’03 [23] of 23
Standard data miners optimize for accuracy, miss interesting options
                     truth: no   truth: yes
detected no:             A           B
detected yes:            C           D
acc = (A+D)/(A+B+C+D), pd = D/(B+D), pf = C/(C+A)

C4.5 (minobs=128):   A=3064  B=901   C=136  D=192   acc 76%, pd 21%, pf 4%
C4.5 (minobs=2):     A=2895  B=794   C=305  D=299   acc 74%, pd 21%, pf 10%
Naïve Bayes:         A=3041  B=873   C=159  D=220   acc 76%, pd 21%, pf 5%
[10-way cross validation]
Used MDP data
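A minimal sketch of that kind of comparison, assuming a feature matrix X and binary defect labels y loaded from an MDP table; scikit-learn’s decision tree and Gaussian Naïve Bayes stand in for C4.5 and the original Naïve Bayes tooling, and every name here is illustrative:

    from sklearn.tree import DecisionTreeClassifier      # stand-in for C4.5
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    def pd_pf_acc(model, X, y):
        """10-way cross-validation; report pd and pf, not just accuracy."""
        pred = cross_val_predict(model, X, y, cv=10)
        tn, fp, fn, tp = confusion_matrix(y, pred).ravel()   # A, C, B, D in the slide's terms
        return tp / (tp + fn), fp / (fp + tn), (tp + tn) / len(y)

    # X, y = ...  # e.g. one row per module: static code metrics, label = defective?
    # for name, model in [("tree", DecisionTreeClassifier(min_samples_leaf=2)),
    #                     ("naive bayes", GaussianNB())]:
    #     print(name, pd_pf_acc(model, X, y))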
NASA OSMA SAS’03 [24] of 23
Project 1: Learning what is a “bad” requirement
• What features of requirements (in English) predict for defects (in “C”)?
• What features do we extract from English text?
  – ARM phrases (sketched below):
    • Weak; e.g. “maybe”
    • Strong; e.g. “must”
    • Continuations; e.g. “and”, “but”, …
  – Porter’s algorithm: stemming
  – LEG, total/unique: words, verbs, nouns, dull words (e.g. “a”, “be”, …), interesting words (i.e. from a data dictionary), etc.
  – WORDNET: how to find nouns, verbs, synonyms
Uses MDP data
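A minimal sketch of the phrase-counting part of that extraction; the word lists are abbreviated stand-ins for the ARM categories and the tokenizer is deliberately crude:

    import re

    WEAK          = {"maybe", "perhaps", "possibly"}   # abbreviated stand-in lists
    STRONG        = {"must", "shall", "will"}
    CONTINUATIONS = {"and", "but", "or"}

    def requirement_features(text):
        """Count weak/strong/continuation phrases plus total and unique words."""
        words = re.findall(r"[a-z']+", text.lower())
        return {
            "weak":          sum(w in WEAK for w in words),
            "strong":        sum(w in STRONG for w in words),
            "continuations": sum(w in CONTINUATIONS for w in words),
            "total_words":   len(words),
            "unique_words":  len(set(words)),
        }

    print(requirement_features("The system must log faults and maybe retry."))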
NASA OSMA SAS’03 [25] of 23
PACE: browser for detector trade-offs
Phillip Merritt, Aaron Wilson
NASA OSMA SAS’03 [26] of 23
Project 2: Learning “holes” and “poles” in models
data miner
Requires: assessment criteria of outputs
Requires: detailed knowledge of internals