TRANSCRIPT
Exploring the Use of Crowdsourcing to Support Empirical Studies in Software Engineering
Kathryn T. Stolee & Sebastian Elbaum
September 16, 2010
Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 1 / 18
Introduction
Known Issue
Recruiting the right type and number of users for empirical studies in software engineering is hard.
Possible Solutions:
Use fewer participants of the right type
Limits generalizability to larger groups
Relax requirements for participation
Limits generalizability to target population
Crowdsource the study
Background
Crowdsourcing
Leveraging a global community of users with different talents and backgrounds to help perform a task that would not be feasible without a mass of people behind it.
Crowdsourcing Services (examples)
Companies with hard problems connect with people interested in solving them. 1,000+ problems, 200,000+ solvers
Photographers connect with people who need stock photography. 3,000,000+ members
Companies with scientific problems connect with retired scientists. 1,000+ companies, 5,000+ scientists
People with many small tasks connect with a scalable workforce. 100,000+ tasks, 100,000+ workers
Workflow in Mechanical Turk
Requesters: Create Tasks → Describe Tasks → Upload Tasks → Accept or Reject
Types of tasks:
Short duration (60 seconds or less)
Require human intelligence (handwriting analysis, image tagging)
Specialized (requires certain knowledge) or generic
Workers: Search for Tasks → Select Task → Complete Task → Submit Task
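The requester and worker flows above can be sketched as a minimal simulation. The class and function names here are illustrative only, not the Mechanical Turk API:

```python
# Minimal sketch of the Mechanical Turk workflow: requesters create,
# describe, and upload tasks; workers search, select, complete, and
# submit; requesters then accept or reject each submission.
from dataclasses import dataclass, field

@dataclass
class Task:
    description: str
    reward: float           # payment (in dollars) on acceptance
    answer: str = ""
    status: str = "open"    # open -> submitted -> accepted/rejected

@dataclass
class TaskPool:
    tasks: list = field(default_factory=list)

    def upload(self, description, reward):
        task = Task(description, reward)
        self.tasks.append(task)
        return task

    def search(self, keyword):
        # Workers search the pool for open tasks matching a keyword.
        return [t for t in self.tasks
                if t.status == "open" and keyword in t.description]

def worker_submit(task, answer):
    task.answer = answer
    task.status = "submitted"

def requester_review(task, accept):
    task.status = "accepted" if accept else "rejected"

# Requester side: create, describe, upload.
pool = TaskPool()
task = pool.upload("Compare two Yahoo! Pipes programs", reward=0.25)

# Worker side: search, select, complete, submit.
found = pool.search("Pipes")
worker_submit(found[0], "The second pipe is easier to follow.")

# Requester side: accept or reject the submission.
requester_review(task, accept=True)
```

The accept/reject step is what gives requesters quality control: payment is only released for accepted submissions.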
Goal of This Work
Conjecture
Crowdsourcing can be a good solution for recruiting the right type and quantity of participants for an empirical study in software engineering.
In this work, we crowdsource a software engineering experiment using Amazon’s Mechanical Turk service, and reflect on our experiences.
Study Definition
Experiment phases: Definition → Design → Selection → Instrumentation → Operation → Analysis
Purpose: Evaluate the impact of coding practices (e.g., code smells) on end users’ preferences and understanding of web mashups built in Yahoo! Pipes.
Experimental Task Example
Task Description: Given two pipes with the same behavior, one with a smell and one without, select the preferable one.
Experimental Design
Task  Subjects  Pretest  Object  Treatment   Posttest
1     R         O1, O2   Pipe1   Smell5      O3, O4
2     R         O1, O2   Pipe2   Smell4      O3, O4
3     R         O1, O2   Pipe3   Smell5      O3, O4
4     R         O1, O2   Pipe4   Smell8      O3, O4
5     R         O1, O2   Pipe5   Smell7      O3, O4
6     R         O1, O2   Pipe6   Smell1      O3, O4
7     R         O1, O2   Pipe7   Smell5,10   O3, O4
8     R         O1, O2   Pipe8   Smell2,9    O3, O4

O1 = Education, O2 = Pipes test score, O3 = Preference, O4 = Time to completion
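The design table can be captured as data: every task randomly assigns (R) participants, applies the same pretest and posttest instruments, and varies only the pipe and its smell treatment. A sketch, using the task/smell pairings from the table:

```python
# The experimental design as data. Each task shares the pretest
# measures O1-O2 and posttest measures O3-O4; only the object (pipe)
# and the treatment (which smells it carries) differ.
pretest = ["O1: Education", "O2: Pipes test score"]
posttest = ["O3: Preference", "O4: Time to completion"]

# (object, smells) per task, one row per task in the table
design = [
    ("Pipe1", [5]), ("Pipe2", [4]), ("Pipe3", [5]), ("Pipe4", [8]),
    ("Pipe5", [7]), ("Pipe6", [1]), ("Pipe7", [5, 10]), ("Pipe8", [2, 9]),
]

# Which distinct smells the eight tasks cover in total.
smells_covered = sorted({s for _, smells in design for s in smells})
```

Structuring the design this way makes it easy to check coverage, e.g., that each smell of interest appears in at least one task.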
Lessons Learned:
Experimental tasks must be modular and independent, but can be longer (ours took 3-4 minutes, on average)
Qualification tests can be used to capture pretest measures
Cannot control which tasks are completed by which participants
Self-selection of tasks may introduce bias that needs to be accounted for in the analysis
Selection and Recruitment
Desired Participant Characteristics:
Limited computer science education (end users)
Familiar with Yahoo! Pipes
Mechanical Turk:
Facilitates recruitment by hosting tasks
Allows qualification tests to be administered prior to participation (pretest measures)
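The qualification step can be sketched as two gates: a Pipes quiz that must be passed to participate at all, and an education question that classifies participants as end users. The thresholds below are illustrative assumptions, not the cutoffs used in the study:

```python
# Hedged sketch of the qualification/pretest gates. The 0.7 passing
# score and the 2-course "end user" cutoff are assumptions for
# illustration only.
def qualifies(pipes_quiz_score, passing_score=0.7):
    # O2: only workers demonstrating familiarity with Yahoo! Pipes
    # may accept the experimental tasks.
    return pipes_quiz_score >= passing_score

def is_end_user(cs_courses_taken, max_cs_courses=2):
    # O1: "end users" have limited computer science education.
    return cs_courses_taken <= max_cs_courses
```

Because both answers are collected before any task is accepted, they double as the pretest measures O1 and O2 in the design.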
Lessons Learned
50 qualification tests submitted in two weeks; 38 passed
22 participants in total; 14 were considered “end users”
More variation and unknowns among participants (e.g., age, gender, education, experimental context)
Experimental Task in Mechanical Turk
Instrumentation
Lessons Learned
Need to learn how to use a new tool and/or API
Need to adjust presentation of tasks to fit the Mechanical Turk interface
All tasks compete with other tasks for participants, so the task description must be enticing
Experiment Operation
Mechanical Turk:
Hosts tasks for a custom time period (2 weeks)
Administers qualification tests (50 requests)
Maintains user anonymity
Collects results and metrics (188 tasks submitted)
Lessons Learned:
Hand-grading qualification tests introduces delay, and may discourage further participation
Time to completion is reported, but the reported values are suspect
Analysis
Response Quality:
Qualitative responses were detailed and demonstrated understanding (average length was 31 words; only 10 were required)
Did not need to reject any responses
Lessons Learned:
We were able to validate our hypotheses (for only $42)
May need to throw away some data due to learning effects (we threw away 28 responses)
Too many responses from a small group of participants could skew results
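The skew concern above is easy to check mechanically: count responses per participant and flag anyone contributing more than a given share of the total. The 25% threshold here is an illustrative assumption:

```python
# Sketch of a skew check: flag participants who contribute more than
# max_share of all submitted responses. The 0.25 threshold is an
# assumption for illustration, not a value from the study.
from collections import Counter

def skewed_participants(responses, max_share=0.25):
    """responses: list of participant ids, one entry per submitted task."""
    counts = Counter(responses)
    total = len(responses)
    return sorted(p for p, n in counts.items() if n / total > max_share)

# Example: participant "a" submitted 5 of 8 responses.
flagged = skewed_participants(["a", "a", "a", "a", "a", "b", "c", "d"])
```

Flagged participants need not be discarded; their responses can instead be down-weighted or analyzed separately.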
Summary
Crowdsourcing allowed us to:
Obtain a sufficient number of participants with the desired characteristics
Evaluate our research questions using an empirical study for low cost
However...
Requires careful experimental design to work within the Mechanical Turk infrastructure
Due to the “unknowns” about the subjects and environment, crowdsourcing may not be appropriate for all studies