exploring the use of crowdsourcing to support empirical ... · empirical studies in software...

31
Introduction Mechanical Turk Study Summary Exploring the Use of Crowdsourcing to Support Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ESQuaReD ESQuaReD e e 2 2 e 2 2 Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 1 / 18

Upload: others

Post on 22-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

Exploring the Use of Crowdsourcing to SupportEmpirical Studies in Software Engineering

Kathryn T. Stolee & Sebastian Elbaum

September 16, 2010

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 1 / 18

Page 2: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Introduction

Known Issue

Recruiting the right type and number of users for empirical studies insoftware engineering is hard.

Possible Solutions:

Use fewer participants of the right type

Limits generalizability to larger groups

Relax requirements for participation

Limits generalizability to target population

Crowdsource the study

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 2 / 18

Page 3: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Introduction

Known Issue

Recruiting the right type and number of users for empirical studies insoftware engineering is hard.

Possible Solutions:

Use fewer participants of the right type

Limits generalizability to larger groups

Relax requirements for participation

Limits generalizability to target population

Crowdsource the study

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 2 / 18

Page 4: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Introduction

Known Issue

Recruiting the right type and number of users for empirical studies insoftware engineering is hard.

Possible Solutions:

Use fewer participants of the right type

Limits generalizability to larger groups

Relax requirements for participation

Limits generalizability to target population

Crowdsource the study

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 2 / 18

Page 5: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Introduction

Known Issue

Recruiting the right type and number of users for empirical studies insoftware engineering is hard.

Possible Solutions:

Use fewer participants of the right type

Limits generalizability to larger groups

Relax requirements for participation

Limits generalizability to target population

Crowdsource the study

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 2 / 18

Page 6: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Background

Crowdsourcing

Leveraging a global community of users with different talents andbackgrounds to help perform a task that would not be feasible without amass of people behind it.

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 3 / 18

Page 7: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Crowdsourcing Services (examples)

Companies with hard problems connectwith people interested in solving. 1,000+problems, 200,000+ solvers

Photographers collect with people whoneed stock photography. 3,000,000+members

Companies with scientific problemsconnect with retired scientists. 1,000+companies, 5,000+ scientists

People with many small tasks connect withscalable workforce. 100,000+ tasks,100,000+ workers

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 4 / 18

Page 8: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Crowdsourcing Services (examples)

Companies with hard problems connectwith people interested in solving. 1,000+problems, 200,000+ solvers

Photographers collect with people whoneed stock photography. 3,000,000+members

Companies with scientific problemsconnect with retired scientists. 1,000+companies, 5,000+ scientists

People with many small tasks connect withscalable workforce. 100,000+ tasks,100,000+ workers

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 4 / 18

Page 9: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Crowdsourcing Services (examples)

Companies with hard problems connectwith people interested in solving. 1,000+problems, 200,000+ solvers

Photographers collect with people whoneed stock photography. 3,000,000+members

Companies with scientific problemsconnect with retired scientists. 1,000+companies, 5,000+ scientists

People with many small tasks connect withscalable workforce. 100,000+ tasks,100,000+ workers

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 4 / 18

Page 10: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Crowdsourcing Services (examples)

Companies with hard problems connectwith people interested in solving. 1,000+problems, 200,000+ solvers

Photographers collect with people whoneed stock photography. 3,000,000+members

Companies with scientific problemsconnect with retired scientists. 1,000+companies, 5,000+ scientists

People with many small tasks connect withscalable workforce. 100,000+ tasks,100,000+ workers

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 4 / 18

Page 11: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Workflow in Mechanical Turk

Requestors:

CreateTasks

DescribeTasks

UploadTasks

Acceptor Reject

Types of tasks:

Short duration (60s. or less)

Require human intelligence (handwirting analysis, image tagging)

Specialized (requires certain knowledge) or generic

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 5 / 18

Page 12: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Workflow in Mechanical Turk

Requestors:

CreateTasks

DescribeTasks

UploadTasks

Acceptor Reject

Types of tasks:

Short duration (60s. or less)

Require human intelligence (handwirting analysis, image tagging)

Specialized (requires certain knowledge) or generic

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 5 / 18

Page 13: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Workflow in Mechanical Turk

Workers:

Searchfor Tasks

SelectTask

CompleteTask

SubmitTask

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 6 / 18

Page 14: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Workflow in Mechanical Turk

Workers:

Searchfor Tasks

SelectTask

CompleteTask

SubmitTask

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 6 / 18

Page 15: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Workflow in Mechanical Turk

CreateTasks

DescribeTasks

UploadTasks

Acceptor Reject

Tasks

Searchfor Tasks

SelectTask

CompleteTask

SubmitTask

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 7 / 18

Page 16: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Goal of This Work

Conjecture

Crowdsourcing can be a good solution for recruiting the right type andquantity of participants for an empirical study in software engineering.

In this work, we crowdsource a software engineering experiment usingAmazon’s Mechanical Turk service, and reflect on our experiences.

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 8 / 18

Page 17: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

MotivationBackgroundBackgroundObjective

Goal of This Work

Conjecture

Crowdsourcing can be a good solution for recruiting the right type andquantity of participants for an empirical study in software engineering.

In this work, we crowdsource a software engineering experiment usingAmazon’s Mechanical Turk service, and reflect on our experiences.

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 8 / 18

Page 18: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Study Definition

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Purpose: Evaluate the impact of coding practices (e.g.,code smells) on end user’s preferences and understanding ofweb mashups built in Yahoo! Pipes.

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 9 / 18

Page 19: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Experimental Task Example

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Task Description: Given two pipes with the same behavior,one with a smell and one without, select the preferable one.

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 10 / 18

Page 20: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Experimental Task Example

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Task Description: Given two pipes with the same behavior,one with a smell and one without, select the preferable one.

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 11 / 18

Page 21: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Experimental Design

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Task Subjects Pretest Object Treatment Posttest

1 R O1, O2 Pipe1 Smell5 O3, O4

2 R O1, O2 Pipe2 Smell4 O3, O4

3 R O1, O2 Pipe3 Smell5 O3, O4

4 R O1, O2 Pipe4 Smell8 O3, O4

5 R O1, O2 Pipe5 Smell7 O3, O4

6 R O1, O2 Pipe6 Smell1 O3, O4

7 R O1, O2 Pipe7 Smell5,10 O3, O4

8 R O1, O2 Pipe8 Smell2,9 O3, O4

O1 = EducationO2 = Pipes test scoreO3 = PreferenceO4 = Time to completion

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 12 / 18

Page 22: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Experimental Design

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Lessons Learned:

Experimental tasks must be modular and independent,but can be longer (ours took 3-4 minutes, on average)

Qualification tests can be used to capture pretestmeasures

Cannot control which tasks are completed by whichparticipants

Self-selection of tasks may introduce bias that needs tobe accounted for in the analysis

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 13 / 18

Page 23: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Selection and Recruitment

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Desired Participant Characteristics:

Limited computer science education (end users)

Familiar with Yahoo! Pipes

Mechanical Turk:

Facilitates recruitment by hosting tasks

Allows for qualification tests to be administered prior toparticipation (pretest measures)

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 14 / 18

Page 24: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Selection and Recruitment

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Lessons Learned

50 qualification tests submitted in two weeks, 38 passed

22 participants in total, 14 were considered “end users”

More variation and unknowns in participants (e.g., age,gender, education, experimental context)

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 15 / 18

Page 25: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Experimental Task in Mechanical Turk

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 16 / 18

Page 26: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Instrumentation

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Lessons Learned

Need to learn how to use a new tool and/or API

Need to adjust presentation of tasks to fit theMechanical Turk interface

All tasks are in competition with other tasks forparticipants, so the task description must be enticing.

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 17 / 18

Page 27: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Experiment Operation

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Mechanical Turk:

Hosts tasks for a custom time period (2 weeks)

Administers qualification tests (50 requests)

Maintains user anonymity

Collects results and metrics (188 tasks submitted)

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 18 / 18

Page 28: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Experiment Operation

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Lessons Learned:

Hand-grading qualification tests introduce delay, andmay discourage further participation

Time to completion is reported, but is suspicious

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 19 / 18

Page 29: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Analysis

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Response Quality:

Qualitative responses were detailed and demonstratedunderstanding (Average length was 31 words, only 10were required)

Did not need to reject any responses

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 20 / 18

Page 30: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

DefinitionPlanningOperationAnalysis

Analysis

ExperimentDefinition

Design

Selection

Instrumentation

Operation

Analysis

Lessons Learned:

We were able to validate our hypotheses (for only $42)

May need to throw away some data due to learning (wethrew away 28 responses)

Too many responses from a small group of participantscould skew results

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 21 / 18

Page 31: Exploring the Use of Crowdsourcing to Support Empirical ... · Empirical Studies in Software Engineering Kathryn T. Stolee & Sebastian Elbaum September 16, 2010 ... code smells) on

IntroductionMechanical Turk Study

Summary

Summary

Crowdsourcing allowed us to:

Obtain a sufficient number of participants with the desiredcharacteristics

Evaluate our research questions using an empirical study for low cost

However...

Requires careful experimental design to work within the MechanicalTurk infrastructure

Due to the “unknowns” about the subjects and environment,crowdsourcing may not be appropriate for all studies

ESQuaReDESQuaReDee22e22

Kathryn T. Stolee & Sebastian Elbaum Crowdsourcing Empirical Studies in Software Engineering 22 / 18