Digging for diamonds: Identifying valuable web automation programs in repositories

Jarrod Jackson 1, Chris Scaffidi 2, Katie Stolee 2

1 Oregon State University
2 University of Nebraska - Lincoln

Web scripts: Enabling users to enhance the browser

IBM CoScripter Web Macro

Problem Approach Evaluation Conclusion

Web scripts: Enabling users to enhance the browser

Yahoo Pipe

Web scripts: Enabling users to enhance the browser

GreaseMonkey UserScript

Repositories of end-user code: The good, the great, and the “other”

C. Bogart, et al. End-User Programming in the Wild: A Field Study of CoScripter Scripts. VL/HCC 2008.

Previous study:

Of 1445 web macros…
• ~10% had many runs
• ~10% had many users
• ~80% were “other”

This is the largest web macro repository: > 6000 users, > 3000 “public” scripts

What if our repositories could…

• … omit pieces of code from search results if they are unlikely to be reused, anyway?

• … provide a UI for administrators to review (and remove?) old code that’s unlikely to be used?

• … advise programmers, when they upload code, about how to improve the reusability of their code?

Needed: a model for predicting reuse

• Key questions for discovering such a model…
– What information about the code indicates reusability?
– How do we combine this information to predict reuse?

• Similar models have been successful on OO code
– Predicting reuse based on coupling & cohesion
– Predicting bugginess based on code complexity metrics, information about code authors, code churn, …

Web scripts are much simpler (don’t call each other, don’t have inheritance, etc.)… we need different information here.

Prior work found 35 traits (in 8 categories) statistically related to reuse

• Mass appeal – e.g., popular keywords
• Language – e.g., data values are in English
• Annotations – e.g., comments
• Flexibility – e.g., parameterization (variables)
• Length – e.g., small # distinct lines of code
• Author information – e.g., early adopter?
• Advanced syntax – e.g., “control-click” keyword
• No Preconditions – e.g., no cookies needed

All traits are computed automatically from one of four sources: executable code statements, URLs referenced, annotations, code history.

Model that we developed (in words & pictures)

• Given a binary measure of reuse, for each trait
– Find the threshold that optimally divides the reused scripts from the un-reused scripts

[Figure: axis label “Trait level”; marker “Threshold”]
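The per-trait threshold search could be sketched roughly as follows. This is only an illustration: the slides do not say which criterion defines "optimally divides", so the sketch assumes the best cut is the one with the largest gap between true positive rate and false positive rate. The function name `best_threshold` and the sample data are hypothetical.

```python
from typing import List, Tuple


def best_threshold(values: List[float], reused: List[bool]) -> Tuple[str, float, float]:
    """Pick the cut on one trait that best separates reused from un-reused scripts.

    Assumption: "best" means the largest (true positive rate - false positive
    rate); the slides do not say which criterion was actually optimized.
    Returns (direction, threshold, score), where direction is ">=" or "<=".
    """
    n_reused = sum(reused)
    n_unreused = len(reused) - n_reused
    if n_reused == 0 or n_unreused == 0:
        raise ValueError("need at least one reused and one un-reused script")

    best = (">=", values[0], float("-inf"))
    # Try every observed trait value as a candidate cut, in both directions.
    for threshold in sorted(set(values)):
        for direction in (">=", "<="):
            if direction == ">=":
                hits = [v >= threshold for v in values]
            else:
                hits = [v <= threshold for v in values]
            tp_rate = sum(h and r for h, r in zip(hits, reused)) / n_reused
            fp_rate = sum(h and not r for h, r in zip(hits, reused)) / n_unreused
            score = tp_rate - fp_rate
            if score > best[2]:
                best = (direction, threshold, score)
    return best


# Made-up data: number of comment lines per script and whether it was reused.
comments = [0, 1, 2, 3, 5, 6]
was_reused = [False, False, False, True, True, True]
print(best_threshold(comments, was_reused))  # ('>=', 3, 1.0)
```

Trying both directions lets the same routine produce "at least" cuts and "at most" cuts, depending on which separates the two groups better.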

Predicting if a macro will be reused

• Count how many predictors (trait threshold tests) are satisfied

• Predict that the macro will be reused if this count exceeds some minimum
– A tunable parameter
– A higher minimum implies a higher bar that a script must overcome to be predicted as reused
• Fewer false positives, more false negatives
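A minimal sketch of this counting rule, under stated assumptions: each predictor is a trait name, a comparison, and a threshold, and "exceeds some minimum" is read as "count is at least the minimum". The four predictors are borrowed from the hypothetical example given in this deck. The trait values of `script` and the helper names are made up for illustration.

```python
import operator
from typing import Callable, Dict, List, Tuple

# A predictor is a trait name, a comparison, and the threshold found for that trait.
Predictor = Tuple[str, Callable[[float, float], bool], float]

# Thresholds borrowed from the hypothetical example in this deck.
PREDICTORS: List[Predictor] = [
    ("comments", operator.ge, 3),
    ("lines_of_code", operator.ge, 40),
    ("prev_created", operator.ge, 10),
    ("literals", operator.le, 4),
]


def count_satisfied(traits: Dict[str, float], predictors: List[Predictor]) -> int:
    """Count how many predictors this script's trait values satisfy."""
    return sum(
        1 for name, compare, threshold in predictors
        if compare(traits[name], threshold)
    )


def predict_reused(traits: Dict[str, float], predictors: List[Predictor], minimum: int) -> bool:
    """Predict reuse when enough predictors are satisfied.

    Assumption: "exceeds some minimum" is read as count >= minimum.
    """
    return count_satisfied(traits, predictors) >= minimum


# Made-up trait values for one script: it misses only the lines_of_code cut.
script = {"comments": 4, "lines_of_code": 25, "prev_created": 12, "literals": 2}
print(count_satisfied(script, PREDICTORS))            # 3
print(predict_reused(script, PREDICTORS, minimum=3))  # True
print(predict_reused(script, PREDICTORS, minimum=4))  # False
```

Because the prediction is just a count against a bar, the reason for any prediction can be reported as the list of predictors a script did or did not satisfy.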

Example

• E.g.: Suppose that our measure of reuse is “script is reused more than 75% of other scripts”

• Suppose that based on this measure of reuse, the best thresholds for four predictors are…
– comments ≥ 3
– lines_of_code ≥ 40
– prev_created ≥ 10
– literals ≤ 4

• The model would predict that some other script satisfies the reuse measure if that script satisfies at least n of these predictors
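The binary reuse measure in this example can be read as a percentile cut on how often each script is reused. The sketch below is one possible reading, not the authors' definition: it assumes a single reuse count per script (for instance, number of runs) and labels a script as reused when its count is strictly higher than that of more than 75% of the other scripts. The function name `reuse_labels` and the counts are hypothetical.

```python
from typing import List


def reuse_labels(reuse_counts: List[int], fraction: float = 0.75) -> List[bool]:
    """Label each script True if it is reused more than `fraction` of the others.

    Assumption: "more than" compares against the other scripts only and
    requires a strictly higher reuse count.
    """
    labels = []
    n_others = len(reuse_counts) - 1
    for i, count in enumerate(reuse_counts):
        # Number of other scripts with a strictly lower reuse count.
        beaten = sum(
            1 for j, other in enumerate(reuse_counts)
            if j != i and other < count
        )
        labels.append(n_others > 0 and beaten / n_others > fraction)
    return labels


# Made-up reuse counts for five scripts.
counts = [0, 1, 2, 3, 50]
print(reuse_labels(counts))  # [False, False, False, False, True]
```

Labels produced this way are what the per-trait thresholds are fitted against.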

How well does this approach work…

• … for different kinds of web scripts?

• … for different reuse measures?

• … when predicting future reuse based on past data?

• … when only a subset of traits are available?

Scripts and measures for our evaluation

[Table not preserved in this transcript; surviving header text: “Measure cutoff”]

Accuracy varied little by measure or script type (e.g., TP ≥ 0.7 at FP = 0.4)
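Operating points such as "TP ≥ 0.7 at FP = 0.4" come from sweeping the tunable minimum and measuring true and false positive rates at each setting. A minimal sketch of such a sweep follows, using synthetic data that does not reproduce the reported results. The function name `rates_by_minimum` and all values are hypothetical.

```python
from typing import List, Tuple


def rates_by_minimum(
    satisfied_counts: List[int], reused: List[bool], max_minimum: int
) -> List[Tuple[int, float, float]]:
    """Return (minimum, TP rate, FP rate) for each minimum from 0 to max_minimum.

    A script is predicted as reused when its satisfied-predictor count is
    at least the minimum (an assumed reading of "exceeds some minimum").
    """
    n_reused = sum(reused)
    n_unreused = len(reused) - n_reused
    rows = []
    for minimum in range(max_minimum + 1):
        predicted = [count >= minimum for count in satisfied_counts]
        tp_rate = sum(p and r for p, r in zip(predicted, reused)) / n_reused
        fp_rate = sum(p and not r for p, r in zip(predicted, reused)) / n_unreused
        rows.append((minimum, tp_rate, fp_rate))
    return rows


# Synthetic data: predictors satisfied by eight scripts, and whether each was reused.
satisfied = [4, 3, 3, 1, 2, 0, 1, 3]
was_reused = [True, True, True, True, False, False, False, False]
for minimum, tp_rate, fp_rate in rates_by_minimum(satisfied, was_reused, max_minimum=4):
    print(minimum, tp_rate, fp_rate)
```

Raising the minimum moves along this curve toward fewer false positives at the cost of more missed reusable scripts.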

Yahoo Pipe accuracy slipped a bit when using past to predict future

Code-based traits gave nearly the full accuracy

(History, URL, Annotations, Code)

Conclusions

• Model is equally accurate for a range of uses
– And might only require code-based traits

Conclusions and future work

• Model is equally accurate for a range of uses
– And might only require code-based traits
– But can we improve accuracy by using information available after reuse is attempted?
– Can we also predict how happy people will be when reusing different pieces of code?

• And now to put the model to work…
– Improving search engines
– Providing UI for administrators to review macros
– Giving programmers advice automatically

Thank You

To ICISA for this opportunity to present this paper

So how do we separate the wheat from the chaff?

• Providing such features requires predicting whether code will ever be reused

– Without relying on information that’s available only after code is reused (“chicken and egg”)
• Ratings, reviews, etc…
• (For some features, of course, we can always add this information in later.)

– With a fairly simple model for making predictions
• So that predictions can be explained to users
• Especially when we’re advising users about how to improve reusability of their code!
