Code You Can Use: Searching for web automation scripts based on reusability

James Admire, Abbas Al Zawwad, Abdulwahab Almorebah, Sanchit Karve, Christopher Scaffidi

Oregon State University

Online repositories of reusable EUP code offer many ways to find relevant code

• Keyword-based search
  • Type keywords, receive a search result list of existing code available to reuse

• Browsing by category
  • E.g., based on thematic categories or tags

• “Related” code
  • E.g., by listing other code derived from a given piece of code

Finding high-quality EUP code to reuse is hard

• Download counters and similar auto-generated popularity counts
  • But hardly any code is ever downloaded more than a trivial number of times

• Explicit user-generated ratings of quality
  • But most code is never rated, certainly not by more than a few people

• Curated collections of “featured” code
  • But scalability and sustainability are perennial challenges for curators

CoScripter web macro repository as a microcosm

• Was one of the biggest repositories of web macros
  • Web macro = EUP script for automating browser interactions with web sites
  • > 6000 web macros when I last saw this repository

• Prior studies showed hardly any macros were reused much
  • 9% run by 3 or more people
  • 7% run at least 6 times per user
  • 5% customized by any other user
  • 4% copied by any other user

• Ultimately: Discontinued by IBM
  • Sustainability is a challenge!!!

Prior work has shown it possible to predict which web macros would be reused

• Suppose a repository could predict from the moment of a macro’s creation whether it would be reused, so the search engine could emphasize or downplay the macro accordingly

• Prior work
  • Collected 35 features of macros that seemed plausibly related to the understandability and modifiability of the macros, plus measures of reuse
  • Trained machine learning models to predict which macros would be reused (a separate model for each measure of reuse)
  • Result: True positives of up to 90% at false positive rates in the 10%-40% range
  • Similar results when replicated with two other repositories of EUPs’ code
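To make the reported rates concrete, here is a minimal sketch of how a true positive rate and a false positive rate are computed from binary reuse predictions. The labels below are made up for illustration and are not data from the prior studies; the function name `rates` is hypothetical.

```python
def rates(actual, predicted):
    """Return (true positive rate, false positive rate) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # reused, predicted reused
    fn = sum(1 for a, p in pairs if a and not p)      # reused, predicted not reused
    fp = sum(1 for a, p in pairs if not a and p)      # not reused, predicted reused
    tn = sum(1 for a, p in pairs if not a and not p)  # not reused, predicted not reused
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# Made-up labels for ten macros: 1 = reused, 0 = not reused.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = rates(actual, predicted)
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")
```

A model tuned to catch more reused macros (higher TPR) will usually also flag more macros that are never reused (higher FPR), which is why the result above is stated as a range.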

Key limitations of that prior work

• Predicted reuse, not reusability: Users might reuse enticing but low-quality code and then regret it. Sometimes, reuse != reusability.

• Predicted binary measures: We would need to estimate level of reusability for sorting, not merely whether it will or will not be reused.

• Relied only on data available at macro creation: Data such as user-generated ratings might help inform reusability estimates.

• Provided no search engine: A proof of concept implementation would help to clarify any remaining technical hurdles.

Goal: An approach for modeling reusability of EUPs’ code, for use in sorting search results

1. Start with an existing repository that EUPs have used for a while

2. Define and compute features for EUPs’ macros in the repository

3. Reduce the feature set with factor analysis

4. Construct a model of reusability by linear regression of an expert user’s estimate of macro reusability versus the computed features

5. Sort macros by estimated reusability (at least in part) in search engine

6. Evaluate reusability estimates with another panel of experts as they use the search engine, and iterate the model in the search engine

Step 1: Getting a repository of EUPs’ code

• CoScripter
  • Already had been in operation for approximately 5 years (since early 2008)
  • Already well-familiar with the repository due to our prior work
  • Already had a well-developed list of candidate features due to prior work
  • Already had permission to scrape macros and other data from the repository

Step 2: Defining features for macros

• Selected 8 features from the 35 investigated in prior work
  • Statistically associated with reuse in both prior studies
  • Could be computed directly and automatically from available data
  • E.g., # comments, # parameters, 1 or 0 indicating if macro has a title

• Created 21 features as refinements of the 35 from prior work
  • Macro age, and 20 different counts of code length

• Created 8 new features based on new data suggesting user interest
  • Not previously considered, as these data accrue after macro creation
  • E.g., # times run, # users who ran it, # revisions, # comments about it
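As a rough illustration of computing such features automatically, the sketch below derives a handful of them from one macro record. The record layout (`title`, `script`, `runs`), the `#` comment marker, and the script text itself are assumptions made for this example; they are not CoScripter's actual data format or syntax.

```python
def macro_features(macro, comment_prefix="#"):
    """Compute a few illustrative features from a hypothetical macro record."""
    lines = [ln.strip() for ln in macro["script"].splitlines() if ln.strip()]
    comments = [ln for ln in lines if ln.startswith(comment_prefix)]
    runs = macro.get("runs", [])  # assumed: one user id per recorded run
    return {
        "has_title": 1 if macro.get("title", "").strip() else 0,
        "num_comments": len(comments),
        "num_code_lines": len(lines) - len(comments),
        "times_run": len(runs),
        "users_who_ran": len(set(runs)),
    }


# Made-up macro; the script text is placeholder wording, not verified syntax.
example = {
    "title": "Search for houses",
    "script": """
# open the listing site
go to "example.com"
enter "Corvallis" into the "City" textbox
click the "Search" button
""",
    "runs": ["alice", "bob", "alice"],
}
features = macro_features(example)
print(features)
```

Features such as `times_run` and `users_who_ran` only become available after a macro has been in the repository for a while, which is the distinction the third bullet draws.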

Step 3: Reducing feature data with factor analysis

• Factor = linear combination of features that are mutually correlated

• Procedure
  • Randomly selected 100 macros
  • Computed our 37 features for each macro
  • Performed factor analysis
  • Discarded all but the most salient factors (optimal coordinates method)

• Result: 8 factors containing 17 features
  • Most of these retained features were related to code size, comments, and numbers of runs (e.g., total count or normalized by number of users)
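The idea of a factor as a linear combination of mutually correlated features can be sketched in a few lines. This is not a factor analysis and does not implement the optimal coordinates method: the loadings are hand-picked and the feature values are made up, purely to show how two highly correlated size features collapse into one factor score.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def standardize(xs):
    """Rescale values to zero mean and unit (population) standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def factor_scores(columns, loadings):
    """Score each macro as a weighted sum of its standardized features."""
    z = {name: standardize(columns[name]) for name in loadings}
    n = len(next(iter(z.values())))
    return [sum(loadings[name] * z[name][i] for name in loadings) for i in range(n)]


# Made-up feature values for five macros.
columns = {
    "num_code_lines": [3, 10, 25, 40, 8],
    "num_characters": [60, 210, 500, 790, 150],
    "times_run": [12, 1, 4, 2, 30],
}
r_size = pearson(columns["num_code_lines"], columns["num_characters"])

# Hand-picked (hypothetical) loadings for a "code size" factor.
loadings = {"num_code_lines": 0.5, "num_characters": 0.5}
size_factor = factor_scores(columns, loadings)
print(round(r_size, 3), [round(s, 2) for s in size_factor])
```

In a real factor analysis the loadings are estimated from the correlation structure of all features rather than chosen by hand.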

Step 4: Constructing a model of reusability

• Linear regression of reusability estimates versus factors
  • From Step 3, we could compute 8 factor scores as linear combinations of features
  • But just because factors exist doesn’t mean they are actually related to reusability!
  • So: Linear regression with dependent variable = reusability estimate, 8 independent variables = factor scores

• Procedure
  • One team member (who did not help with defining or computing features) gave a reusability estimate (range 1-4) to each of the 100 web macros

• Result: Linear model that estimates reusability based on the features
  • Linear regression was highly significant (P=0.003)
  • 7 out of 8 factors had non-zero coefficients
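A minimal sketch of this kind of regression, assuming ordinary least squares solved through the normal equations. It uses two factors instead of eight to stay short, and the factor scores and ratings are made up; the function names are hypothetical and this is not the authors' code.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear(factor_rows, ratings):
    """Least-squares fit: rating ~ intercept + sum(coef_k * factor_k)."""
    xs = [[1.0] + list(row) for row in factor_rows]  # prepend intercept column
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ratings)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, row):
    """Estimate reusability for one macro from its factor scores."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Made-up factor scores (two factors) and expert ratings on the 1-4 scale.
factor_rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (-1, 2)]
ratings = [2.5, 3.0, 2.25, 2.75, 3.25, 1.5]
coefs = fit_linear(factor_rows, ratings)
estimate = predict(coefs, (1, 2))
print([round(c, 3) for c in coefs], round(estimate, 3))
```

Once fitted, `predict` can be applied to any macro whose factor scores are known, including macros the expert never rated.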

Step 5: Searching for code based on reusability

• Code You Can Use (CYCU) (pronounced “cuckoo”)
  • Compute reusability estimates offline
  • When user enters query, forward query to CoScripter repository, get back a list of macros, look up reusability estimates, and sort by estimated reusability

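[Figure: CYCU architecture. Offline, the CYCU Web Spider requests macros from the CoScripter repository and stores reusability estimates in the CYCU Database. At query time, the CYCU Search Engine receives keywords, sends a keyword query to the CoScripter repository, gets macros back, reads reusability estimates from the CYCU Database, and returns search results.]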
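The query-time step (look up precomputed estimates, then sort) can be sketched as follows. The macro ids and estimates are made up, and treating a macro with no stored estimate as least reusable is an assumption of this sketch, not something the slides specify.

```python
def rank_by_reusability(macro_ids, estimates, default=0.0):
    """Sort repository results by precomputed reusability, highest first.

    Macros without a stored estimate fall back to `default` (an assumption).
    """
    return sorted(macro_ids, key=lambda m: estimates.get(m, default), reverse=True)


# Made-up estimates, as they might be stored by an offline job.
estimates = {"m1": 2.1, "m2": 3.6, "m3": 1.4}

# Made-up result list in the order the repository returned it.
results = ["m3", "m1", "m4", "m2"]
ranked = rank_by_reusability(results, estimates)
print(ranked)
```

Because the estimates are computed offline, the only per-query cost added on top of the repository's own keyword search is a dictionary lookup and a sort.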

Step 6: Evaluating reusability estimates with another panel of expert users

• Using a different set of users than the one who gave initial estimates
  • Needed users who were pretty good at programming but who could approach CoScripter as an EUP tool rather than as a professional programming tool
  • 4 CS students, only one of whom had any experience as a professional programmer (<2 years), but all of whom were seniors or master’s students

• Using a different set of macros than those used to create the model
  • Manually reviewed CoScripter repository to see what was popular lately
  • Identified two themes: searching for houses and checking for flight information
  • Each of the 4 participants rated 20 of the 40 test macros, i.e., 2 participants rated each test macro

We collected 2 user-assessed reusability measures and 1 user-assessed relevance measure

• Randomly ordered the macros and asked participants to rate (on a 4-point Likert scale)…
  • How helpful is this code in learning CoScripter?
  • How easy is it to understand the code?
  • How relevant is the code to the search term ‘search for houses’ [or ‘check airlines’]?

• We expected that our reusability estimates…
  • Would significantly correlate with learnability and understandability ratings
  • Would not significantly correlate with relevance ratings

Result: Significant correlations appeared on all three measures

Measure            P       F score  Adj. R²
Learnability       <0.001  54.2     0.59
Understandability  <0.001  41.0     0.52
Relevance          0.01    6.88     0.14

Regression of each measure for each macro (averaged over participants) against reusability estimate.

Note: Analysis utilized data for only 39 macros… one participant chose to skip a macro.
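For readers who want to see how the three reported statistics relate, here is a minimal sketch of a one-predictor regression that returns R², adjusted R², and the F score. The ratings and estimates are made up and do not reproduce the study's data; the function name is hypothetical.

```python
def simple_regression_stats(x, y):
    """Fit y ~ intercept + slope * x and return fit statistics."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_tot = sum((b - my) ** 2 for b in y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r2 = 1 - ss_res / ss_tot
    # One predictor, so the residual degrees of freedom are n - 2.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    f_score = r2 / ((1 - r2) / (n - 2))
    return {"slope": slope, "intercept": intercept,
            "r2": r2, "adj_r2": adj_r2, "f": f_score}


# Made-up data: model estimates vs. ratings averaged over participants.
estimates = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
avg_ratings = [1.5, 1.0, 2.5, 2.0, 3.5, 3.0]
stats = simple_regression_stats(estimates, avg_ratings)
print({k: round(v, 3) for k, v in stats.items()})
```

With a single predictor, the F score and adjusted R² are both determined by R² and the number of macros, so the columns in the table above are different views of the same fit.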

Further work could address threats to validity and limitations of this study

• Different kinds of macros require different models of reusability
  • Indeed, our prior work showed different kinds of scripts require somewhat different features.
  • But the overall approach (compute features, combine features, validate) should methodologically generalize at least across textual scripting languages.

• More sophisticated methods might be better for sorting search results based on integrating relevance with reusability estimates

• Tool-builders might find this approach more onerous than we did
  • We built on hundreds of hours of our own prior work
  • Crucial work remains on overcoming barriers to tech transfer of EUP research

Exciting opportunities now exist for moving quality-based code search toward practice

• Key contributions
  • New approach for modeling reusability of EUPs’ scripts
  • Demonstration of how such a model can be used in a search engine

• Next steps
  • Elucidating and countering risks of users “gaming” the system by artificially boosting the apparent reusability of their code
  • Integrating reusability models into other, more sophisticated browsing and search methods (e.g., collaborative filtering or other search tools)
  • Investigating the impacts of applying this approach on day-to-day practice with a larger repository (e.g., impacts on learning by Scratch users)
  • Working with industry partners to apply this approach in their own repositories

Thank you

• To you for your attention, interest, and ideas

• To the VL/HCC reviewers for your compliments and suggestions

• To IBM for permission to scrape the CoScripter repository

• To the National Science Foundation for funding

CYCU Screenshots