
Review Article
Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation Experimentation through Crowdsourcing: A Review

Parnia Samimi and Sri Devi Ravana

Department of Information Systems, Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia

Correspondence should be addressed to Sri Devi Ravana; sdevi@um.edu.my

Received 30 December 2013; Accepted 8 April 2014; Published 19 May 2014

Academic Editors: L. Li, L. Sanchez, and F. Yu

Copyright © 2014 P. Samimi and S. D. Ravana. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Scientific World Journal, Volume 2014, Article ID 135641, 13 pages. http://dx.doi.org/10.1155/2014/135641

Test collections are used to evaluate information retrieval systems in laboratory-based evaluation experimentation. In the classic setting, generating relevance judgments involves human assessors and is a costly and time consuming task. Researchers and practitioners are still challenged to perform reliable and low-cost evaluation of retrieval systems. Crowdsourcing, as a novel method of data acquisition, is broadly used in many research fields. It has been shown that crowdsourcing is an inexpensive and quick solution, as well as a reliable alternative, for creating relevance judgments. One of the applications of crowdsourcing in IR is judging the relevance of query-document pairs. In order to have a successful crowdsourcing experiment, the relevance judgment tasks should be designed precisely, with an emphasis on quality control. This paper explores the different factors that influence the accuracy of relevance judgments produced by workers and how to improve the reliability of judgments in crowdsourcing experiments.

1. Introduction

In order to have an effective Information Retrieval (IR) system and satisfied users, evaluation of system performance is crucial. IR evaluation measures whether the system addresses the information needs of its users. There are two approaches for evaluating the effectiveness of IR systems, as shown in Figure 1: (i) user-based evaluation and (ii) system-based evaluation [1]. In the system-based evaluation method, a number of assessors prepare a set of data which can be reused in later experiments. A common system-based evaluation approach is the test collection, also referred to as the Cranfield experiments, which are the origin of today's laboratory retrieval evaluation experiments [2]. The Text REtrieval Conference (TREC) was established in 1992 to support IR research by providing the infrastructure for large scale evaluation of retrieval methodologies.

A test collection consists of three components: (i) a document corpus, which is a large set of documents; (ii) topics, which are a collection of search queries; and (iii) relevance judgments, which involve human assessors appointed by TREC. The relevance judgments are partial, since not all documents in the corpus are judged by assessors; the judging task is not only costly but also time consuming. For instance, a complete judgment collection for the TREC 2010 Web track would require expert assessors to assess 1 billion documents. Assuming that an expert assessor can assess two documents per minute, about 347,000 days would be needed to judge 1 billion documents. Therefore, to assess more documents, a large number of human experts would need to be appointed, which is costly [3].
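
The estimate follows directly from the stated judging rate (assuming round-the-clock judging):

\[
\frac{10^{9}\ \text{documents}}{2\ \text{documents/minute}}
= 5\times 10^{8}\ \text{minutes}
\approx 8.3\times 10^{6}\ \text{hours}
\approx 3.47\times 10^{5}\ \text{days}.
\]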

In TREC, the pooling method has been used to identify a subset of documents for judging. Figure 2 shows the process of a typical IR evaluation through a test collection. The participating systems run their retrieval algorithms against the document corpus and topics in the test collection to generate sets of retrieved documents called runs. Systems 1, 2, ..., m are the contributing systems for pool creation, and a collection of top ranked documents retrieved by the contributing systems for each topic is selected for judgment.

Figure 1: Classification of IR evaluation methods. System-based evaluation relies on test collections, while user-based evaluation includes human-in-the-lab studies, side-by-side panels, A/B testing, clickthrough data, and crowdsourcing.

The documents in the pool are judged by human assessors to create the relevance judgment set, and all documents outside the pool are considered nonrelevant. Once the relevance judgments are ready, the whole set of runs retrieved by both contributing and noncontributing systems (1, 2, ..., n) is evaluated against the relevance judgments, or qrels, to measure the accuracy and effectiveness of the retrieval systems through evaluation metrics. Each system receives a score for each topic, and the scores are then aggregated into an overall performance score for the system. Finally, as the result of each IR experiment, a system ranking is generated over all of the systems. The major drawback of test collections is the huge cost of the relevance assessment conducted by human experts: it requires time, infrastructure, and money, and it does not scale up easily.

The user-based evaluation method quantifies the satisfaction of users by monitoring the users' interactions with the system. Table 1 describes the user-based evaluation methods, which are divided into five groups [4].

The drawbacks of human involvement in user-based evaluation experiments in the lab are the high cost of setting up the experiments and the difficulty of repeating them. In addition, this method is limited to a small set of information needs handled by a small number of human subjects. Side-by-side panels only allow the comparison of two systems and are not applicable to multiple systems. Using clickthrough data seems attractive because of the low cost of collecting the data, but it requires a precise setup since the data can be noisy. The term crowdsourcing was coined by Howe, building on Web 2.0 technology, in a Wired Magazine article [5]. Recently, the use of crowdsourcing for relevance judgment has been increasing as a way to overcome the high cost that current evaluation methods incur with expert judges. Running experiments at low cost and with fast turnaround makes this approach very attractive [4].

This paper discusses the issues related to using crowdsourcing for creating relevance judgments, gives advice on the implementation of successful crowdsourcing experiments, and presents a state-of-the-art review of available methods to enhance the quality of the judgments in crowdsourcing experiments. The organization of this paper is as follows: the introductory section explains the IR evaluation process and its different methods; Section 2 elaborates on crowdsourcing and its applications in different fields, especially IR; Section 3 explains how the accuracy of crowdsourcing results is measured; Section 4 highlights the importance of task design in crowdsourcing, while human features and monetary factors are explained in Section 5; recently proposed methods for quality control of crowdsourcing are reviewed in Section 6. The concluding section presents a look into the future use of crowdsourcing for IR evaluation and some suggestions for further study in related areas.

2. Introduction to Crowdsourcing

Crowdsourcing is an emerging research field in which Information Systems scholars can make noteworthy contributions. It is applied widely in various fields of computer science and other disciplines to test and evaluate studies, as shown and explained in Table 2 [6].

There are various platforms for crowdsourcing, such as Amazon Mechanical Turk (AMT) (https://www.mturk.com/mturk/welcome), Crowdflower [16], oDesk (https://www.odesk.com), and Elance (https://www.elance.com). Currently, AMT is the most popular and largest marketplace and crowdsourcing platform. It is a profit-oriented marketplace which allows requesters to submit tasks and workers to complete them. Human Intelligence Tasks (HITs), or microtasks, are the units of work to be accomplished. Figure 3 presents the crowdsourcing scheme, which includes multiple requesters who publish their tasks and workers who accomplish the tasks on a crowdsourcing platform.

Generally, the crowdsourcing process starts with a requester publishing a task on the crowdsourcing platform. Workers select a task of interest, work on it, and complete it. The requester then assesses the results of the tasks done by the workers. If the results are acceptable to the requester, the workers are paid; otherwise, the work might be rejected because the task was performed carelessly. The flow of submitting and completing tasks via crowdsourcing is shown in Figure 4.

Simplicity is the main feature of crowdsourcing. Crowdsourcing platforms enable requesters to have fast access to an on-demand, global, scalable workforce, and workers to choose from thousands of tasks to accomplish online. In AMT, requesters are able to create HITs via an application programming interface (API), which enables them to distribute tasks programmatically, or by using templates via the dashboard. Crowdsourcing platforms have also been suggested as a viable choice for data collection [17]. Mason and Suri [18] stated three advantages of crowdsourcing platforms: (i) a large number of workers can take part in experiments at low payment; (ii) workers come from diverse languages, cultures, backgrounds, ages, and countries; and (iii) research can be carried out at low cost. Precise design and quality control are required to optimize work quality and conduct a successful crowdsourcing experiment.
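
For illustration, the sketch below publishes a relevance judgment HIT programmatically using the present-day boto3 MTurk client; the client library, sandbox endpoint, reward, qualification threshold, and HTML form are illustrative assumptions rather than the setup used in the studies reviewed here.

```python
# Minimal sketch of creating a relevance-judgment HIT via the MTurk API (boto3).
# Values are illustrative; AWS credentials are assumed to be configured.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",  # sandbox
)

question_xml = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><head>
      <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
    </head><body>
      <crowd-form>
        <p><b>Query:</b> Alzheimer</p>
        <p><b>Document:</b> Dementia is a loss of brain function that occurs with certain diseases...</p>
        <p>How relevant is the document to the query?</p>
        <crowd-radio-group>
          <crowd-radio-button name="highly_relevant">Highly relevant</crowd-radio-button>
          <crowd-radio-button name="relevant">Relevant</crowd-radio-button>
          <crowd-radio-button name="not_relevant">Not relevant</crowd-radio-button>
        </crowd-radio-group>
        <p>Please justify your answer:</p>
        <crowd-text-area name="justification"></crowd-text-area>
      </crowd-form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>
"""

hit = mturk.create_hit(
    Title="Judge the relevance of a document to a query",
    Description="Read a short document and rate its relevance to the given query.",
    Keywords="relevance, judgment, information retrieval",
    Reward="0.02",                      # payment per assignment, in USD
    MaxAssignments=5,                   # labels collected per query-document pair
    LifetimeInSeconds=3 * 24 * 3600,
    AssignmentDurationInSeconds=600,
    Question=question_xml,
    QualificationRequirements=[{
        # Prefilter by approval rate (built-in PercentAssignmentsApproved
        # qualification), in the spirit of the settings discussed in Section 6.1.
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    }],
)
print("Created HIT:", hit["HIT"]["HITId"])
```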


Table 1: User-based evaluation methods.

Human in the lab: this method involves human experimentation in the lab to evaluate user-system interaction.

Side-by-side panels: the top ranked answers generated by two IR systems for the same search query are presented side by side to users; a simple judgment by a human assessor decides which side retrieves better results.

A/B testing: a number of preselected users of a website are exposed to a specific modification and their reactions are analysed to see whether the change is positive or negative.

Using clickthrough data: clickthrough data is used to observe how frequently users click on retrieved documents for a given query.

Crowdsourcing: outsourcing tasks, formerly accomplished inside a company or institution by assigned employees, to a large heterogeneous mass of potential workers in the form of an open call through the Internet.

Figure 2: Typical IR evaluation process. The test collection (document corpus, topics, and a pooled subset of documents judged by human assessors) yields the relevance judgment set (qrels); the runs of systems 1, 2, ..., n (both contributing and noncontributing to the pool) are evaluated against the qrels to calculate system scores and produce system rankings.


Table 2: Different applications of crowdsourcing.

Natural language processing: crowdsourcing technology was used to investigate linguistic theory and language processing [7].

Machine learning: automatic translation using active learning and crowdsourcing was suggested to reduce the cost of language experts [8, 9].

Software engineering: the use of crowdsourcing was investigated to solve the problem of recruiting the right type and number of subjects to evaluate a software engineering technique [10].

Network event monitoring: using crowdsourcing to detect, isolate, and report service-level network events was explored in an approach called CEM (crowdsourcing event monitoring) [11].

Sentiment classification: the issues in training a sentiment analysis system using data collected through crowdsourcing were analysed [12].

Cataloguing: the application of crowdsourcing for libraries and archives was assessed [13].

Transportation planning: the use of crowdsourcing to enable the citizen participation process in public planning projects was argued for [14].

Information retrieval: crowdsourcing was suggested as a feasible alternative for creating relevance judgments [15].

Figure 3: Crowdsourcing scheme. Requesters 1, 2, ..., n submit tasks to the crowdsourcing platform, and workers 1, 2, ..., m complete them.

In general, there are three objectives that most researchers pursue in this field, especially in IR evaluation: (i) to determine the percentage agreement between experts and crowdsourcing workers, (ii) to find out which factors affect the accuracy of the workers' results, and (iii) to find approaches that maximize the quality of the workers' results. A survey of crowdsourcing was conducted from the social, academic, and enterprise perspectives [19], assessing the different factors that drive the motivation and participation of volunteers, including the following:

(i) role-oriented crowdsourcing (leadership and ownership);

(ii) behaviour-oriented crowdsourcing (attribution, coordination, and conflict);

(iii) media-oriented crowdsourcing (mobile computing and ubiquitous computing).

Figure 4: Flow of submitting and completing tasks via crowdsourcing (submit task, pick task, complete task, accept or reject the result, payment).

This study, however, looks at the academic side of crowdsourcing, specifically focusing on the factors that contribute to a successful and efficient crowdsourcing experiment in the area of information retrieval evaluation.

3. Accuracy and Reliability of Relevance Judgments in Crowdsourcing

The use of crowdsourcing for creating relevance judgments in IR evaluation is generally validated by measuring the agreement between crowdsourcing workers and human assessors. This shows whether crowdsourcing is a reliable replacement for human assessors. In crowdsourcing experiments, interrater or interannotator agreement is used to measure performance and to analyse the agreement between crowdsourcing workers and human assessors. Interrater agreement is the degree of homogeneity in the ratings given by the judges. Various statistical methods are used to calculate it; Table 3 summarizes four common methods suggested for calculating the interrater agreement between crowdsourcing workers and human assessors for relevance judgment in IR evaluation [20].
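
As a concrete illustration of the first two measures in Table 3, the sketch below computes percentage agreement and Cohen's kappa between a crowd worker and a TREC assessor on a toy set of binary labels (the label values are invented for illustration).

```python
# Toy sketch: percentage agreement and Cohen's kappa between two assessors.
from collections import Counter

def percentage_agreement(a, b):
    """Fraction of items on which the two assessors gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percentage_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary relevance labels for ten query-document pairs.
trec_assessor = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
crowd_worker  = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
print(percentage_agreement(trec_assessor, crowd_worker))  # 0.8
print(cohens_kappa(trec_assessor, crowd_worker))          # ~0.58
```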

Alonso et al. [15], pioneers of crowdsourcing in IR evaluation, applied crowdsourcing through Amazon Mechanical Turk to TREC data. Their results showed that crowdsourcing is a low cost, reliable, and quick solution and could be considered an alternative to creating relevance judgments with experts, although it is not yet a replacement for current methods, since several gaps and questions remain to be addressed by future research. The scalability of the crowdsourcing approach in IR evaluation, for instance, has not been investigated yet [20].

4. Experimental Design in Crowdsourcing

Beyond the workers' attention, culture, preferences, and skills, the presentation and properties of HITs influence the quality of a crowdsourcing experiment. Careful and precise task design, rather than filtering out poor quality work after the task has been completed, is important, since complicated tasks deter cheaters [24]. Even qualified workers may produce erroneous work if the user interface, instructions, and design are of poor quality. The phases of implementing a crowdsourcing experiment are similar to those of software development. The details of each phase for creating a relevance judgment task are as follows.

Planning

(i) Choosing a crowdsourcing platform.
(ii) Defining a task.
(iii) Preparing a dataset.

Design and Development

(i) Selecting topics.
(ii) Choosing a few documents for each topic (relevant and nonrelevant).
(iii) Task definition, description, and preparation.

Verifying

(i) Debugging by using an internal expert team.
(ii) Taking note of shortcomings of the experiment.
(iii) Using external nonexperts to test.
(iv) Reviewing and clarifying the instructions.
(v) Comparing the results of the expert and nonexpert teams.

Publishing

(i) Monitoring work quality.

The impact of two methods of HIT design, full and simple design, on the accuracy of labels was assessed in [25]. The full design method prequalifies and restricts workers, while the simple design method includes fewer quality control settings and restrictions on the workers; however, the output of the full design method has higher quality. In general, crowdsourcing is a useful solution for relevance assessment if the tasks are designed carefully and the methods for aggregating labels are selected appropriately. Experiment design is discussed here under three categories: (i) task definition, description, and preparation, (ii) interface design, and (iii) granularity.

4.1. Task Definition, Description, and Preparation. The task definition is the information presented to the workers about the crowdsourcing task. An important part of implementing a crowdsourcing experiment is the task description, which is part of task preparation. Clear instructions, which are part of the task description, are crucial for obtaining quick results, usually in less than 24 hours. All workers should have a common understanding of the task, and the words used must be understandable by different workers [26]. The task description should consider the characteristics of the workers, such as language or expertise [27]. For example, plain English that avoids jargon is suggested, since the population is diverse [28]. As creating relevance judgments requires reading text, using plain English and simple words helps make the experiment successful. Including an "I do not know" option in the responses was also suggested, as it allows workers to convey that they do not have sufficient knowledge to answer the question [26]. Getting user feedback at the end of the task by asking an (optional) open-ended question is recommended to improve quality and discourage spammers; if the answer is useful, the requester can pay a bonus [28]. Since creating relevance judgments involves reading text, instructions and questions should be readable, clear, and presented with examples. An example of a relevance judgment task is shown below.

Relevance Judgment

Instruction: Evaluate the relevance of a document to the given query.

Task: Please evaluate the relevance of the following document about Alzheimer:

"Dementia is a loss of brain function that occurs with certain diseases. Alzheimer's disease (AD) is one form of dementia that gradually gets worse over time. It affects memory, thinking, and behavior."

Please rate the above document according to its relevance to Alzheimer as follows:

(i) Highly relevant

(ii) Relevant

(iii) Not relevant

Please justify your answer: ——————


Table 3: Statistics for calculating the interrater agreement.

Joint probability of agreement (percentage agreement) [20]: the simplest measure, based on dividing the number of times each rating (e.g., 1, 2, ..., 5) is assigned by each assessor by the total number of ratings.

Cohen's kappa [21]: a statistical measure of interrater agreement between two raters. It is more robust than percentage agreement since it accounts for the agreement expected by chance between the two assessors.

Fleiss' kappa [22]: an extension of Cohen's kappa that measures the agreement among any number of raters (not only two).

Krippendorff's alpha [23]: a measure based on the overall distribution of judgments, regardless of which assessors produced them.

4.2. Interface Design. Users or workers access and contribute to the tasks via a user interface. Three design recommendations were suggested for achieving good task completion while paying less: designing a user interface and instructions understandable by novice users, translating the instructions into the local language as well as English, and preparing a video tutorial [29]. The use of colours, highlights, bold, italics, and typefaces was also suggested to enhance comprehensibility [26]. Verifiable questions, such as common-knowledge questions placed within the task, are also useful for validating the workers; they reduce the number of worthless results, since users are aware that their answers are being analysed [30]. Another study showed that more workers may be attracted by a user friendly and simple interface, and the quality of the outcome may increase, while reliable workers may be put off by an unreasonably complex user interface, which leads to delay [27].

4.3. Granularity. Tasks can be divided into three broad groups: routine, complex, and creative tasks. Tasks which do not need specific expertise, such as creating relevance judgments, are called routine tasks. Complex tasks need some general skills, like rewriting a given text. Creative tasks, on the other hand, need specific skill and expertise, such as research and development [31]. Splitting long and complex tasks into small tasks was recommended, since small tasks tend to attract more workers [26]. However, some studies showed that novel tasks that need creativity tend to deter cheaters and attract more qualified and reliable workers [24, 32]. In a separate study, Difallah et al. [33] stated that designing complicated tasks is one way to filter out cheaters. Money is the main reason most workers accomplish tasks; however, workers who do tasks for the prospect of fame or for fun are the least truthful, while workers who are motivated by fulfillment and fortune are the most truthful. Eickhoff and de Vries [24] found that workers who do the job for entertainment, not for money, rarely cheat. Thus, designing and presenting HITs in an exciting and entertaining way leads to better results along with cost efficiency.

5. Human Features and Monetary Factors in Crowdsourcing

The effect of human features, like knowledge, and monetary factors, such as payment, on the accuracy and success of crowdsourcing experiments is discussed in this section. The discussion is based on three factors: (i) worker profile in crowdsourcing, (ii) payment in crowdsourcing, and (iii) compensation policy in crowdsourcing.

5.1. Worker Profile in Crowdsourcing. One of the factors that influence the quality of results is the worker profile, which consists of reputation and expertise. Requesters' feedback about workers' accomplishments creates reputation scores in the systems [34]. Requesters should also maintain a positive reputation, since workers may refuse to accept HITs from poorly regarded requesters [17]. Expertise can be recognized by credentials and experience: information such as language, location, and academic degree constitutes credentials, while experience refers to the knowledge a worker gains through the crowdsourcing system [27]. Location is another important factor with a strong effect on the accuracy of results. A study conducted by Kazai et al. [35] indicated that Asian workers produced poorer quality work than American or European workers; the Asian workers mostly preferred the simple HIT design, while American workers preferred the full design. In crowdsourcing platforms such as AMT and Crowdflower, it is possible to restrict workers to specific countries when designing tasks. However, creating relevance judgments is a routine task and does not need specific expertise or experience.

5.2. Payment in Crowdsourcing. Payment affects the accuracy of crowdsourcing results, since workers who are satisfied with the payment accomplish the tasks more accurately than workers who are not [32]. Monetary or nonmonetary reasons can motivate the workers of crowdsourcing platforms [36]. A study conducted by Ross et al. [37] on the motivation of workers in crowdsourcing showed that money is the main incentive of 13% of Indian and 5% of US workers. Another study, carried out by Ipeirotis [38], reported that AMT was the main income of 27% of Indian and 12% of US workers. Kazai [39] stated that higher payment leads to better quality of work, while Potthast et al. [40] reported that higher payment only affects completion time rather than the quality of results.

Kazai et al. [32] analysed the effect of paying workers more on the accuracy of the results. Although higher payment encourages more qualified workers, in contrast with lower payment, which attracts sloppy workers, unethical workers are sometimes also drawn by higher payment; in this case a strong quality control method, as well as filtering, can prevent poor results. Other studies reported that increasing payment increases quantity rather than quality, while some studies showed that greater payment may only get the task done faster, not better, as a higher payment incentive speeds up work [41, 42]. However, a reasonable pay rate is the more cautious solution, as highly paid tasks attract spammers as well as legitimate workers [43]. Indeed, the payment amount should be fair and based on the complexity of the task.

The payment method is another factor that affects the quality of the workers' output [41, 44]. For instance, for a task which requires finding 10 words in a puzzle, more puzzles are solved when payment is per puzzle rather than per word [45]. It is better to pay workers after the work quality has been confirmed by measuring accuracy, and to clarify that payment is based on the quality of the completed work; this helps in achieving better results from workers [26]. In an experiment conducted by Alonso and Mizzaro [20], the workers were paid $0.02 for a relevance evaluation that took about one minute and consisted of judging the relevance of one topic-document pair. In another experiment, the payment was $0.04 for judging one topic against ten documents [46].
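
Taken at face value, these two pay rates imply quite different per-label costs; for a hypothetical judging effort of 50 topics, 100 pooled documents per topic, and 3 labels per pair (numbers chosen purely for illustration), the totals diverge accordingly:

\[
\frac{\$0.02}{1\ \text{label}} = \$0.020\ \text{per label},\qquad
\frac{\$0.04}{10\ \text{labels}} = \$0.004\ \text{per label},
\]
\[
50 \times 100 \times 3 = 15{,}000\ \text{labels} \;\Rightarrow\; \$300\ \text{versus}\ \$60.
\]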

5.3. Compensation in Crowdsourcing. A proper compensation policy and inducement have an impact on the quality of the results [47, 48]. There are two types of incentives: extrinsic incentives, like a monetary bonus such as extra payment, and intrinsic incentives, such as personal enthusiasm [27]. Rewards have been categorized into psychological, monetary, and material [47]. Mason and Watts [41] found that using nonfinancial compensation to motivate workers, such as enjoyable tasks or social rewards, is a better choice than financial rewards and has more effect on quality. It was also reported that using social rewards is like harnessing intrinsic motivation to improve the quality of work [49]. The use of intrinsic along with extrinsic incentives was recommended to motivate more reliable workers to perform a task. Furthermore, considering the task requirements, properties, and social factors, such as the income and interests of workers according to the profiles of requesters and workers, is vital in selecting a proper compensation policy [27]. Offering a bonus for writing comments and justifications, or paying a bonus to qualified workers who perform the tasks accurately and precisely, was recommended for relevance evaluation tasks [26].

In addition to task design, human features, and monetary factors, which are all important for a successful crowdsourcing experiment, quality control is another vital part of a crowdsourcing experiment; it is elaborated on in the following section.

6. Quality Control in Crowdsourcing

Managing the workers' performance is a crucial part of crowdsourcing because of sloppy workers, who accomplish the tasks carelessly, and spammers, who cheat and complete the tasks inaccurately. Different worker types have been identified through behavioral observation using factors including label accuracy, HIT completion time, and the fraction of useful labels [50], as explained in Table 4. This can help in developing HIT design methods that attract the proper workers for the job. Table 5 shows different types of workers classified based on their average precision [51].

In 2012, to detect random spammers, Vuurens and de Vries [52] calculated for each worker $w$ the RandomSpam score as the average squared ordinal distance $ord^{2}_{ij}$ (which shows how different one worker's judgment is from another's) between their judgments $j \in J_{w}$ and the judgments by other workers on the same query-document pair, $i \in J_{jw}$, as shown below:

\[
\text{RandomSpam}_{w} = \frac{\sum_{j \in J_{w}} \sum_{i \in J_{jw}} ord^{2}_{ij}}{\sum_{j \in J_{w}} \left| J_{jw} \right|}. \tag{1}
\]

In addition, to detect uniform spammers, the UniformSpam score is calculated as shown below:

\[
\text{UniformSpam}_{w} = \frac{\sum_{s \in S} |S| \cdot \left( f_{s,J_{w}} - 1 \right) \cdot \left( \sum_{j \in J_{s,w}} \sum_{i \in J_{jw}} \text{disagree}_{ij} \right)^{2}}{\sum_{s \in S} \sum_{j \in J_{s,w}} \left| J_{jw} \right|}, \tag{2}
\]

where $\text{disagree}_{ij}$ is the number of disagreements between the judgments $J_{s,w}$, which occur within label sequence $s$, and the judgments $J_{jw}$; $S$ is the set of all possible label sequences $s$ with length $|s| = 2$ or $3$; and $f_{s,J_{w}}$ is the frequency at which label sequence $s$ occurs within worker $w$'s time-ordered judgments $J_{w}$.
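
A small sketch of the RandomSpam score in equation (1), under the assumption that relevance labels are integers so the ordinal distance reduces to a simple difference; the data layout (a dict mapping each query-document pair to its per-worker labels) is chosen for illustration only.

```python
# Toy sketch of the RandomSpam score of equation (1).
# judgments: {query_document_pair_id: {worker_id: integer relevance label}}
def random_spam(worker, judgments):
    num, den = 0, 0
    for pair, labels in judgments.items():
        if worker not in labels:
            continue
        own = labels[worker]
        others = [lab for w, lab in labels.items() if w != worker]
        num += sum((own - other) ** 2 for other in others)  # squared ordinal distances
        den += len(others)
    return num / den if den else 0.0

# Hypothetical labels on a 0-2 graded relevance scale.
judgments = {
    "q1-d1": {"w1": 2, "w2": 2, "w3": 0},
    "q1-d2": {"w1": 1, "w2": 1, "w3": 2},
    "q2-d7": {"w1": 0, "w2": 0, "w3": 2},
}
for w in ("w1", "w2", "w3"):
    print(w, round(random_spam(w, judgments), 2))  # w3, who disagrees most, scores highest
```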

Quality control in crowdsourcing is defined as the extent to which the provided outcome fulfills the requirements of the requester. Quality control approaches can be categorized into (i) design-time approaches, which are employed while the requester prepares the task before submission, and (ii) run-time approaches, which are used during or after task completion. Another classification of quality control methods is (i) filtering workers and (ii) aggregating labels, both discussed in this section [53]. In other words, filtering workers is a design-time approach, while aggregating labels can be considered a run-time approach [27].

6.1. Design-Time Methods. There are three approaches to selecting workers: (i) credential-based, (ii) reputation-based, and (iii) open-to-all.


Table 4: Worker types based on their behavioral observation.

Diligent: completed tasks precisely, with high accuracy and a longer time spent on tasks.
Competent: skilled workers with high accuracy and fast work.
Sloppy: completed tasks quickly without considering quality.
Incompetent: took time to complete tasks but with low accuracy.
Spammer: did not deliver useful work.

Table 5: Worker types based on their average precision.

Proper: completed tasks precisely.
Random spammer: gave worthless answers.
Semirandom spammer: answered incorrectly on most questions while answering correctly on a few, hoping to avoid detection as a spammer.
Uniform spammer: repeated the same answers.
Sloppy: not precise enough in their judgments.

It is also possible to use a combination of the three approaches. The credential-based approach is used in systems where users are well profiled; however, it is not practical in crowdsourcing systems, since users are usually unknown. The reputation-based approach selects workers based on their reputation, as AMT does with its approval rate parameter. The open-to-all approach, used for example by Wikipedia, allows any worker to contribute; it is easy to use and implement but increases the number of unreliable workers [27]. Requesters are also able to implement their own qualification methods [18] or combine different methods. In total, there are various methods of filtering workers, which first identify sloppy workers and then exclude them. We have grouped these methods into five categories, listed in Table 6.

When using a qualification test, it is still possible that workers perform tasks carelessly. Moreover, researchers encounter two issues with qualification tests: (i) tasks with a qualification test may take longer to complete, as most workers prefer tasks without one, and (ii) the cost of developing and continuously maintaining the test. Honey pots, or gold standard data, are used when the results are clear and predefined [54]. Honey pots are faster than a qualification test, as it is easy and quick to identify workers who answer the questions at random. Combining a qualification test with honey pots is also possible for quality control [26].
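
A minimal sketch of honey-pot filtering, assuming a handful of query-document pairs with known (gold) labels have been seeded into the HITs; the 70% accuracy threshold is an arbitrary illustrative choice.

```python
# Toy sketch: filter out workers whose accuracy on gold (honey pot) questions is low.
gold = {"q1-d1": 1, "q3-d4": 0, "q5-d2": 1}   # known answers seeded into the task

# worker -> {pair: label} for the gold pairs they answered (hypothetical data)
worker_answers = {
    "w1": {"q1-d1": 1, "q3-d4": 0, "q5-d2": 1},
    "w2": {"q1-d1": 0, "q3-d4": 1, "q5-d2": 1},
    "w3": {"q1-d1": 1, "q3-d4": 0},
}

def gold_accuracy(answers, gold):
    judged = [p for p in answers if p in gold]
    if not judged:
        return None  # worker saw no gold questions; cannot be assessed this way
    return sum(answers[p] == gold[p] for p in judged) / len(judged)

THRESHOLD = 0.7
trusted = {w for w, ans in worker_answers.items()
           if (acc := gold_accuracy(ans, gold)) is not None and acc >= THRESHOLD}
print(trusted)  # {'w1', 'w3'} (set order may vary)
```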

Qualification settings, such as filtering workers by origin, have an important impact on cheater rates; for instance, some assume that workers from developed countries have lower cheater rates [24]. Setting an approval rate is used by AMT: it provides a metric called the approval rate to prefilter workers, representing the percentage of assignments the worker has performed that were approved by the requesters over all tasks the worker has done; generally, it is the total rating of each worker in the system. It is also possible to limit tasks to master workers: in AMT, the group of people who accomplish HITs with a high degree of accuracy across a variety of requesters are called master workers. Master workers expect to receive higher payment, and limiting HITs to master workers yields better quality results [32]. If a high approval rate is required for quality control, the experiment takes longer to complete, since the eligible worker population shrinks [28]. In general, the settings should match the task type: if the task is complicated, the approval rate and other settings should be set strictly, or the task should be assigned only to master workers; if the task is routine, ordinary workers with simple settings can be effective. Trap questions are also used to detect careless workers who do not read the task instructions carefully. Kazai et al. [25] designed two trap questions for this purpose: "Please tick here if you did NOT read the instructions" and, on each page, "I did not pay attention." Trap questions may not detect all unreliable workers, but they expose strange behavior and can be effective both in discouraging and in identifying spammers. In crowdsourcing, there are programs or bots designed to accomplish HITs automatically [55], and such data are definitely of poor quality [18]. CAPTCHAs and reCAPTCHA are used in crowdsourcing experiments to detect superficial workers and random clicking on answers [29, 32]. These methods are easy to use and inexpensive.

It is also possible to use a combination of the above methods to filter workers. For example, a real-time strategy applying different filtering methods was proposed for recruiting workers and monitoring the quality of their work: first, a qualification test was used to filter out sloppy workers, and the completion time of HITs was calculated to reflect the truthfulness of the workers; then a set of gold questions was used to evaluate the workers' skills [56]. However, even with these filtering methods, there is still the possibility that lower quality workers are recruited to do the work. The quality control methods applied after filtering workers, at run time, are explained in the next section.

6.2. Run-Time Methods. Although design-time techniques can improve quality, low quality results are still possible because of misunderstandings that arise while doing the tasks.


Table 6: Design-time methods.

Qualification test: a set of questions which the workers must answer to qualify for doing the tasks. Platform: AMT. Example: IQ test [57].

Honey pots or gold standard data: predefined questions with known answers are seeded into the task [54]; workers who answer them correctly are marked as appropriate for the task. Platform: Crowdflower [16].

Qualification settings: settings configured when creating HITs. Platforms: AMT, Crowdflower. Example: using the approval rate.

Trap questions: HITs are designed along with a set of questions with known answers to find unreliable workers [58].

CAPTCHAs and reCAPTCHA: CAPTCHAs are an antispamming technique that separates computers from humans in order to filter automatic answers; a text from a scanned page is used to identify whether a human is inputting data or spam is trying to trick the web application, as only a human can pass the test [59]. reCAPTCHA is a development of CAPTCHAs [60].

Therefore, run-time techniques are vital for high quality results; indeed, the quality of the results is increased by applying both approaches [27]. In crowdsourcing, if one judgment is collected per example or task (the single labeling method), time and cost may be saved; however, the quality of the work then depends on a single individual's knowledge. To solve this issue of single labeling, integrating the labels from multiple workers was introduced [61, 62]. If labels are noisy, multiple labels can be preferable to single labeling, even when labels are not particularly low cost [63], and they lead to more accurate relevance judgments [64]. An important issue with multiple labels per example is how to aggregate the labels from various workers into a single consensus label accurately and efficiently. Run-time quality control methods are explained in Table 7.

MV is the better choice for routine tasks with lower payment, since it is easy to implement and achieves reasonable results, depending on the truthfulness of the workers [53]. One drawback of this method is that the consensus label is computed for a specific task without considering the accuracy of the workers on other tasks. Another drawback is that MV considers all workers to be equally good: for example, if there is only one expert, the others are novices, and the novices give the same inaccurate responses, MV takes the novices' answer as the correct one because they are the majority.
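
A minimal sketch of majority voting over multiple labels per query-document pair; ties are broken here by taking the smaller label value, which is an arbitrary illustrative choice.

```python
# Toy sketch: aggregate multiple worker labels per pair by majority vote.
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; break ties by the smaller label value."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(lab for lab, c in counts.items() if c == top)

votes = {
    "q1-d1": [1, 1, 0],
    "q1-d2": [0, 1],        # tie -> 0 under this tie-breaking rule
    "q2-d7": [2, 2, 2, 0],
}
qrels = {pair: majority_vote(labs) for pair, labs in votes.items()}
print(qrels)  # {'q1-d1': 1, 'q1-d2': 0, 'q2-d7': 2}
```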

In the EM algorithm, the outputs are a set of estimated correct answers for each task and a set of matrices listing the workers' errors (where the replies might not be the correct answers). The error rate of each worker can be read from this confusion matrix. However, the error rate alone is not adequate to measure the quality of a worker, since workers may have completed the task carefully but with bias; for example, when labeling websites, parents with young children are more conservative in classifying the websites. To account for this, a single scalar score can be assigned to each worker based on the completed labels; such scores separate the error rate from worker bias and treat the workers more fairly. Approximately five labels per example are needed for this algorithm to become accurate [65]. Recently, Carpenter [66] and Raykar et al. [67] proposed Bayesian versions of the EM algorithm using the confusion matrix. A probabilistic framework was proposed by Raykar et al. [67] for the case of multiple labels and no gold standard: the algorithm repeatedly creates a specific gold standard, measures the performance of the workers against it, and then refines the gold standard according to the performance measurements. Hosseini et al. [64] evaluated the performance of EM and MV for aggregating labels in relevance assessment; their experimental results showed that the EM method is a better solution to problems related to the reliability of IR system rankings and relevance judgments, especially with a small number of labels and noisy judgments. Another study compared MV and EM and stated that EM outperformed MV when 60% of the workers were spammers [51]. However, a combination of these two methods could produce more accurate relevance judgments.
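
A compact sketch of the EM aggregation idea (in the spirit of the confusion-matrix estimator the cited approaches build on): alternate between estimating a posterior over the true label of each pair and re-estimating each worker's confusion matrix. The data layout and smoothing constants are illustrative assumptions, not the exact formulation of the cited papers.

```python
# Toy sketch of EM label aggregation with per-worker confusion matrices.
import numpy as np

def em_aggregate(labels, n_classes, n_iter=20, smooth=0.01):
    """labels: list of (item, worker, label) with integer labels in [0, n_classes)."""
    items = sorted({i for i, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    i_idx = {i: k for k, i in enumerate(items)}
    w_idx = {w: k for k, w in enumerate(workers)}

    # Initialise item posteriors from per-item label frequencies (soft majority vote).
    post = np.full((len(items), n_classes), smooth)
    for i, w, l in labels:
        post[i_idx[i], l] += 1
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and worker confusion matrices P(observed | true class).
        prior = post.sum(axis=0) + smooth
        prior /= prior.sum()
        conf = np.full((len(workers), n_classes, n_classes), smooth)
        for i, w, l in labels:
            conf[w_idx[w], :, l] += post[i_idx[i]]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute item posteriors from priors and confusion matrices.
        log_post = np.tile(np.log(prior), (len(items), 1))
        for i, w, l in labels:
            log_post[i_idx[i]] += np.log(conf[w_idx[w], :, l])
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)

    return {i: int(post[i_idx[i]].argmax()) for i in items}

# Hypothetical labels: (query-document pair, worker, label).
data = [("d1", "w1", 1), ("d1", "w2", 1), ("d1", "w3", 0),
        ("d2", "w1", 0), ("d2", "w2", 0), ("d2", "w3", 1),
        ("d3", "w1", 1), ("d3", "w2", 1), ("d3", "w3", 1)]
print(em_aggregate(data, n_classes=2))  # {'d1': 1, 'd2': 0, 'd3': 1}
```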

Considering the time taken to complete the task is one example of observing the pattern of responses, used to detect the random answers of unreliable workers [30]. As tasks that are completed very fast are deemed to have poor quality, completion time is a robust signal for detecting sloppy workers. Soleymani and Larson [75] used crowdsourcing for affective annotation of video in order to improve the performance of multimedia retrieval systems; the completion time of each HIT was compared to the video length in order to evaluate quality. In another study, the time each worker spent on a judgment was assessed in order to control quality [58]; three types of patterns were found: (i) a normal pattern, whereby workers begin slowly and get faster as they learn about the task, (ii) a periodic pattern, a peculiar behavior in which some judgments are done fast and some slowly, and (iii) an interrupted pattern, which refers to interruptions in the middle of doing tasks. This method, in combination with other quality control methods, can be effective in crowdsourcing experiments, since on its own it may not detect all sloppy workers.
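
A minimal sketch of a completion-time check, assuming per-assignment work times in seconds are available from the platform; the cut-off (a fraction of the overall median time) is an arbitrary illustrative choice.

```python
# Toy sketch: flag workers whose typical completion time is suspiciously short.
from statistics import median

# worker -> list of HIT completion times in seconds (hypothetical values)
times = {
    "w1": [55, 61, 48, 70],
    "w2": [6, 5, 7, 6],      # suspiciously fast
    "w3": [40, 90, 35, 50],
}

overall_median = median(t for ts in times.values() for t in ts)
flagged = [w for w, ts in times.items() if median(ts) < 0.25 * overall_median]
print(flagged)  # ['w2']
```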


Table 7: Run-time methods.

Majority voting (MV): a straightforward and common method which eliminates wrong results by using the majority decision [31, 61, 68].

Expectation maximization (EM) algorithm: measures worker quality by estimating the correct answer for each task from the labels supplied by different workers, using maximum likelihood. The algorithm has two phases: (i) the correct answer is estimated for each task from the multiple labels submitted by different workers, accounting for the quality of each worker, and (ii) the assigned responses are compared to the inferred correct answer in order to estimate the quality of each worker [69].

Naive Bayes (NB): following EM, NB models the biases and reliability of individual workers and corrects for them in order to improve the quality of the workers' results. Based on gold standard data, a small amount of training data labeled by experts is used to correct the individual biases of workers; the idea is to recalibrate workers' answers to better match the experts'. An average of four nonexpert labels per example is needed to emulate expert-level label quality, which helps to improve annotation quality [68].

Observation of the pattern of responses: looking at the pattern of answers is another effective way of filtering unreliable responses, as some untrustworthy workers follow a regular pattern, for example, selecting the first choice of every question.

Probabilistic matrix factorization (PMF): PMF induces a latent feature vector for each worker and example to infer unobserved worker assessments for all examples [70]. PMF is a standard method in collaborative filtering, applied here by converting crowdsourcing data into collaborative filtering data to predict unlabeled labels from workers [71, 72].

Expert review: uses experts to evaluate workers [73].

Contributor evaluation: workers are evaluated according to quality factors such as their reputation, experience, or credentials; if the workers have sufficient quality factors, the requester accepts their tasks. For instance, Wikipedia accepts articles written by administrators without evaluation [27], and tasks submitted by workers with a higher approval rate, or by master workers, could be accepted without doubt.

Real-time support: requesters give workers feedback about the quality of their work in real time, while the workers are accomplishing the task. This helps workers amend their work, and results showed that self-assessment and external feedback improve the quality of the task [48]. Another real-time approach was proposed by Kulkarni et al. [74], where requesters can follow the workers' workflows while they solve the tasks; a tool called Turkomatic was presented, which employs workers to do tasks for requesters, and while workers are doing the task, requesters can monitor the process and view the status of the task in real time.

Another quality control method, expert review, is commonly applied in practice, but its drawbacks are the high cost of employing experts and the time it consumes. This approach is not applicable in IR evaluation, since crowdsourcing is used precisely to lower the cost of hiring experts for creating relevance judgments; if the expert review method were used to aggregate judgments, the cost would increase.

The drawbacks of using crowdsourcing to create relevance judgments are that (i) each worker judges only a small number of examples and (ii) to decrease cost, few judgments are collected per example. The judgments are therefore imbalanced and sparse. MV is vulnerable to this problem, while EM addresses it indirectly and PMF tackles it directly [70].

Different quality control methods applied in crowdsourcing experiments across different areas of study have been listed and discussed, covering both design-time and run-time approaches. Although both approaches are crucial to successful experiments, the important point is that the quality control method should be well suited to the crowdsourcing platform.

7. Conclusion and Future Work

Crowdsourcing is a novel phenomenon being applied in various fields of study. One of its applications is in IR evaluation experiments. Several recent studies show that crowdsourcing is a viable alternative for the creation of relevance judgments; however, it needs precise design and appropriate quality control methods to ensure trustworthy outcomes. This is particularly imperative since the workers of crowdsourcing platforms come from diverse cultural and linguistic backgrounds.

One of the important concerns in crowdsourcing experiments is the possible untrustworthiness of workers trying to earn money without paying due attention to their tasks. Confusion or misunderstanding in crowdsourcing experiments is another concern, which may arise from low quality experimental designs. Consequently, quality control in crowdsourcing experiments is a vital element not only in the design phase but also at run time; quality control and monitoring strategies should be considered and examined during all stages of the experiments. Moreover, the user interface and instructions should be comprehensible, motivating, and exciting to the users. The payment should be reasonable, but monetary or nonmonetary rewards can also be considered. Although there have been efforts to develop crowdsourcing experiments in the IR area in recent years, several limitations (some listed below) remain to be addressed:

(i) currently, a random topic-document pair is shown to the workers to judge; workers should be given the freedom to choose their topic of interest for relevance judgment;

(ii) scaling up the creation of relevance judgments to a larger set of topic-document pairs, to find out whether crowdsourcing can replace hired assessors for creating relevance judgments;

(iii) designing a more precise grading system for workers, beyond the approval rate, where requesters can see each worker's grade for each task type; for example, requesters would be able to know which workers have a good grade in creating relevance judgments;

(iv) giving requesters access to reliable personal information about workers, such as expertise, language, and background, to decide which type of workers are allowed to accomplish the tasks;

(v) trying different aggregation methods for creating relevance judgments and evaluating their influence on the correlation of the judgments with those of TREC experts.

Crowdsourcing is an exciting research area with several puzzles and challenges that require further research and investigation. This review has shed light on some aspects of using crowdsourcing in IR evaluation, with insight into issues related to crowdsourcing experiments in some of the main stages, such as design, implementation, and monitoring. Further research and development should be conducted to enhance the reliability and accuracy of relevance judgments in IR experimentation.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the High Impact Research Grant UM.C/625/1/HIR/MOHE/FCSIT/14 from the University of Malaya and the Ministry of Education, Malaysia.

References

[1] E M Voorhees ldquoThe philosophy of information retrieval eval-uationrdquo in Evaluation of Cross-Language Information RetrievalSystems pp 355ndash370 Springer 2002

[2] C Cleverdon ldquoThe Cranfield tests on index language devicesrdquoAslib Proceedings vol 19 no 6 pp 173ndash194 1967

[3] S I Moghadasi S D Ravana and S N Raman ldquoLow-cost evaluation techniques for information retrieval systems areviewrdquo Journal of Informetrics vol 7 no 2 pp 301ndash312 2013

[4] R Baeza-Yates and B Ribeiro-Neto Modern InformationRetrieval the Concepts and Technology Behind Search Addison-Wesley 2011

[5] J Howe ldquoThe rise of crowdsourcingrdquo Wired Magazine vol 14no 6 pp 1ndash4 2006

[6] Y Zhao and Q Zhu ldquoEvaluation on crowdsourcing researchcurrent status and future directionrdquo Information Systems Fron-tiers 2012

[7] R Munro S Bethard V Kuperman et al ldquoCrowdsourcing andlanguage studies the new generation of linguistic datardquo inProceedings of the Workshop on Creating Speech and LanguageData with Amazonrsquos Mechanical Turk pp 122ndash130 Associationfor Computational Linguistics 2010

[8] V Ambati S Vogel and J Carbonell ldquoActive learning andcrowd-sourcing formachine translationrdquo inLanguage Resourcesand Evaluation (LREC) vol 7 pp 2169ndash2174 2010

[9] C Callison-Burch ldquoFast cheap and creative evaluating trans-lation quality using amazonrsquos mechanical turkrdquo in Proceedingsof the Conference on Empirical Methods in Natural LanguageProcessing vol 1 pp 286ndash295 Association for ComputationalLinguistics August 2009

[10] K T Stolee and S Elbaum ldquoExploring the use of crowdsourc-ing to support empirical studies in software engineeringrdquo inProceedings of the 4th International Symposium on EmpiricalSoftware Engineering and Measurement (ESEM rsquo10) no 35ACM September 2010

[11] D R Choffnes F E Bustamante and Z Ge ldquoCrowdsourcingservice-level network event monitoringrdquo in Proceedings of the7th International Conference on Autonomic Computing (SIG-COMM rsquo10) pp 387ndash398 ACM September 2010

[12] A Brew D Greene and P Cunningham ldquoUsing crowdsourcingand active learning to track sentiment in online mediardquo in Pro-ceedings of the Conference on European Conference on ArtificialIntelligence (ECAI rsquo10) pp 145ndash150 2010

[13] R Holley ldquoCrowdsourcing and social engagement potentialpower and freedom for libraries and usersrdquo 2009

[14] D C Brabham ldquoCrowdsourcing the public participation pro-cess for planning projectsrdquo Planning Theory vol 8 no 3 pp242ndash262 2009

[15] O Alonso D E Rose and B Stewart ldquoCrowdsourcing forrelevance evaluationrdquo ACM SIGIR Forum vol 42 no 2 pp 9ndash15 2008

[16] Crowdflower httpshttpwwwcrowdflowercom[17] G Paolacci J Chandler and P G Ipeirotis ldquoRunning exper-

iments on amazon mechanical turkrdquo Judgment and DecisionMaking vol 5 no 5 pp 411ndash419 2010

[18] W Mason and S Suri ldquoConducting behavioral research onamazonrsquos mechanical turkrdquo Behavior Research Methods vol 44no 1 pp 1ndash23 2012

[19] Y Pan and E Blevis ldquoA survey of crowdsourcing as a meansof collaboration and the implications of crowdsourcing forinteraction designrdquo in Proceedings of the 12th InternationalConference on Collaboration Technologies and Systems (CTS rsquo11)pp 397ndash403 May 2011

[20] O Alonso and S Mizzaro ldquoUsing crowdsourcing for TRECrelevance assessmentrdquo Information Processing andManagementvol 48 no 6 pp 1053ndash1066 2012

12 The Scientific World Journal

[21] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[22] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.
[23] K. Krippendorff, "Estimating the reliability, systematic error and random error of interval data," Educational and Psychological Measurement, vol. 30, no. 1, pp. 61–70, 1970.
[24] C. Eickhoff and A. P. de Vries, "Increasing cheat robustness of crowdsourcing tasks," Information Retrieval, vol. 16, no. 2, pp. 121–137, 2013.
[25] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling, "Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 205–214, ACM, July 2011.
[26] O. Alonso, "Implementing crowdsourcing-based relevance experimentation: an industrial perspective," Information Retrieval, vol. 16, no. 2, pp. 101–120, 2013.
[27] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, "Quality control in crowdsourcing systems: issues and directions," IEEE Internet Computing, vol. 17, no. 2, pp. 76–81, 2013.
[28] O. Alonso and R. Baeza-Yates, "Design and implementation of relevance assessments using crowdsourcing," in Advances in Information Retrieval, pp. 153–164, Springer, 2011.
[29] S. Khanna, A. Ratan, J. Davis, and W. Thies, "Evaluating and improving the usability of Mechanical Turk for low-income workers in India," in Proceedings of the 1st ACM Symposium on Computing for Development (DEV '10), no. 12, ACM, December 2010.
[30] A. Kittur, E. H. Chi, and B. Suh, "Crowdsourcing user studies with Mechanical Turk," in Proceedings of the 26th Annual CHI Conference on Human Factors in Computing Systems (CHI '08), pp. 453–456, ACM, April 2008.
[31] M. Hirth, T. Hoßfeld, and P. Tran-Gia, "Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms," Mathematical and Computer Modelling, vol. 57, no. 11-12, pp. 2918–2932, 2013.
[32] G. Kazai, J. Kamps, and N. Milic-Frayling, "An analysis of human factors and label accuracy in crowdsourcing relevance judgments," Information Retrieval, vol. 16, no. 2, pp. 138–178, 2013.
[33] D. E. Difallah, G. Demartini, and P. Cudre-Mauroux, "Mechanical cheat: spamming schemes and adversarial techniques on crowdsourcing platforms," in Proceedings of the CrowdSearch Workshop, Lyon, France, 2012.
[34] L. De Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler, "Reputation systems for open collaboration," Communications of the ACM, vol. 54, no. 8, pp. 81–87, 2011.
[35] G. Kazai, J. Kamps, and N. Milic-Frayling, "The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2583–2586, ACM, 2012.
[36] L. Hammon and H. Hippner, "Crowdsourcing," Wirtschaftsinf, vol. 54, no. 3, pp. 165–168, 2012.
[37] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson, "Who are the crowdworkers? Shifting demographics in Mechanical Turk," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 2863–2872, ACM, April 2010.

[38] P. Ipeirotis, "Demographics of Mechanical Turk," 2010.
[39] G. Kazai, "In search of quality in crowdsourcing for search engine evaluation," in Advances in Information Retrieval, pp. 165–176, Springer, 2011.
[40] M. Potthast, B. Stein, A. Barron-Cedeno, and P. Rosso, "An evaluation framework for plagiarism detection," in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997–1005, Association for Computational Linguistics, August 2010.
[41] W. Mason and D. J. Watts, "Financial incentives and the performance of crowds," ACM SIGKDD Explorations Newsletter, vol. 11, no. 2, pp. 100–108, 2010.
[42] J. Heer and M. Bostock, "Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 203–212, ACM, April 2010.
[43] C. Grady and M. Lease, "Crowdsourcing document relevance assessment with Mechanical Turk," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 172–179, Association for Computational Linguistics, 2010.
[44] J. Chen, N. Menezes, J. Bradley, and T. A. North, "Opportunities for crowdsourcing research on Amazon Mechanical Turk," Human Factors, vol. 5, no. 3, 2011.
[45] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut, "CrowdForge: crowdsourcing complex work," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), pp. 43–52, ACM, October 2011.
[46] P. Clough, M. Sanderson, J. Tang, T. Gollins, and A. Warner, "Examining the limits of crowdsourcing for relevance assessment," IEEE Internet Computing, vol. 17, no. 4, pp. 32–38, 2012.
[47] O. Scekic, H. L. Truong, and S. Dustdar, "Modeling rewards and incentive mechanisms for social BPM," in Business Process Management, pp. 150–155, Springer, 2012.
[48] S. P. Dow, B. Bunge, T. Nguyen, S. R. Klemmer, A. Kulkarni, and B. Hartmann, "Shepherding the crowd: managing and providing feedback to crowd workers," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1669–1674, ACM, May 2011.
[49] T. W. Malone, R. Laubacher, and C. Dellarocas, "Harnessing crowds: mapping the genome of collective intelligence," MIT Sloan School Working Paper 4732-09, 2009.
[50] G. Kazai, J. Kamps, and N. Milic-Frayling, "Worker types and personality traits in crowdsourcing relevance labels," in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1941–1944, ACM, October 2011.
[51] J. Vuurens, A. P. de Vries, and C. Eickhoff, "How much spam can you take? An analysis of crowdsourcing results to increase accuracy," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), pp. 21–26, 2011.
[52] J. B. P. Vuurens and A. P. de Vries, "Obtaining high-quality relevance judgments using crowdsourcing," IEEE Internet Computing, vol. 16, no. 5, pp. 20–27, 2012.
[53] W. Tang and M. Lease, "Semi-supervised consensus labeling for crowdsourcing," in Proceedings of the SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), 2011.
[54] J. Le, A. Edmonds, V. Hester, and L. Biewald, "Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 21–26, 2010.


[55] R. M. McCreadie, C. Macdonald, and I. Ounis, "Crowdsourcing a news query classification dataset," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 31–38, 2010.
[56] T. Xia, C. Zhang, J. Xie, and T. Li, "Real-time quality control for crowdsourcing relevance evaluation," in Proceedings of the 3rd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC '12), pp. 535–539, 2012.
[57] G. Zuccon, T. Leelanupab, S. Whiting, E. Yilmaz, J. M. Jose, and L. Azzopardi, "Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems," Information Retrieval, vol. 16, no. 2, pp. 267–305, 2013.
[58] D. Zhu and B. Carterette, "An analysis of assessor behavior in crowdsourced preference judgments," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 17–20, 2010.
[59] L. Von Ahn, M. Blum, N. J. Hopper et al., "CAPTCHA: using hard AI problems for security," in Advances in Cryptology – EUROCRYPT 2003, pp. 294–311, Springer, 2003.
[60] L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, "reCAPTCHA: human-based character recognition via web security measures," Science, vol. 321, no. 5895, pp. 1465–1468, 2008.
[61] V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 614–622, ACM, August 2008.
[62] P. Welinder and P. Perona, "Online crowdsourcing: rating annotators and obtaining cost-effective labels," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '10), pp. 25–32, June 2010.
[63] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, "Repeated labeling using multiple noisy labelers," Data Mining and Knowledge Discovery, vol. 28, no. 2, pp. 402–441, 2014.
[64] M. Hosseini, I. J. Cox, N. Milic-Frayling, G. Kazai, and V. Vinay, "On aggregating labels from multiple crowd workers to infer relevance of documents," in Advances in Information Retrieval, pp. 182–194, Springer, 2012.
[65] P. G. Ipeirotis, F. Provost, and J. Wang, "Quality management on Amazon Mechanical Turk," in Proceedings of the Human Computation Workshop (HCOMP '10), pp. 64–67, July 2010.
[66] B. Carpenter, "Multilevel Bayesian models of categorical data annotation," 2008.
[67] V. C. Raykar, S. Yu, L. H. Zhao et al., "Learning from crowds," The Journal of Machine Learning Research, vol. 11, pp. 1297–1322, 2010.
[68] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, "Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254–263, Association for Computational Linguistics, October 2008.
[69] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, vol. 28, pp. 20–28, 1979.
[70] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," Advances in Neural Information Processing Systems, vol. 20, pp. 1257–1264, 2008.
[71] H. J. Jung and M. Lease, "Inferring missing relevance judgments from crowd workers via probabilistic matrix factorization," in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1095–1096, ACM, 2012.
[72] H. J. Jung and M. Lease, "Improving quality of crowdsourced labels via probabilistic matrix factorization," in Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.
[73] A. J. Quinn and B. B. Bederson, "Human computation: a survey and taxonomy of a growing field," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1403–1412, ACM, May 2011.
[74] A. Kulkarni, M. Can, and B. Hartmann, "Collaboratively crowdsourcing workflows with Turkomatic," in Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW '12), pp. 1003–1012, February 2012.
[75] M. Soleymani and M. Larson, "Crowdsourcing for affective annotation of video: development of a viewer-reported boredom corpus," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 4–8, 2010.




Figure 1: Classification of IR evaluation methods (system-based evaluation, via test collections; and user-based evaluation, via human-in-the-lab studies, side-by-side panels, A/B testing, clickthrough data, and crowdsourcing).

The top-ranked documents retrieved by the contributing systems for each topic are selected for judgment. The documents in the pool are judged by human assessors to create the relevance judgment set, and all other documents outside the pool are considered nonrelevant. Once the relevance judgments are ready, the whole set of runs retrieved by both contributing and noncontributing systems (1, 2, ..., n) is evaluated against the relevance judgments, or qrels, to measure the accuracy and effectiveness of the retrieval systems through evaluation metrics. Each system gets a score for each topic, and the scores are then aggregated into an overall performance score for the system. Finally, as a result of each IR experiment, a ranking is generated for all of the systems. The major drawback of test collections is the huge cost of relevance assessment, which is conducted by human experts: it requires time, infrastructure, and money, and it does not scale up easily.
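To make the scoring and ranking step concrete, the following minimal Python sketch scores each run against a qrels dictionary and ranks systems by their mean per-topic score; the data layout and the choice of precision at rank k as the metric are illustrative assumptions of this sketch, not something prescribed by the evaluation campaigns discussed here.

# Minimal sketch: score runs against qrels and rank systems.
# Assumed layout: qrels = {topic: {doc_id: relevance}}, runs = {system: {topic: [ranked doc_ids]}}.

def precision_at_k(ranked_docs, topic_qrels, k=10):
    """Fraction of the top-k retrieved documents judged relevant."""
    top = ranked_docs[:k]
    hits = sum(1 for d in top if topic_qrels.get(d, 0) > 0)
    return hits / k

def rank_systems(runs, qrels, k=10):
    """Average the per-topic scores and sort systems by the mean."""
    scores = {}
    for system, run in runs.items():
        per_topic = [precision_at_k(docs, qrels.get(topic, {}), k)
                     for topic, docs in run.items()]
        scores[system] = sum(per_topic) / len(per_topic) if per_topic else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    qrels = {"T1": {"d1": 1, "d3": 1}}
    runs = {"sysA": {"T1": ["d1", "d2", "d3"]}, "sysB": {"T1": ["d4", "d2", "d5"]}}
    print(rank_systems(runs, qrels, k=3))  # sysA ranks above sysB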

The user-based evaluation method quantifies the satisfaction of users by monitoring the user's interactions with the system. Table 1 describes the user-based evaluation methods, which are divided into five groups [4].

The drawback of human involvement in user-based evaluation experiments in the lab is the high cost of setting up the experiment and the difficulty of repeating it. In addition, this method is limited to a small set of information needs handled by a small number of human subjects. Side-by-side panels only allow the comparison of two systems and are not applicable to multiple systems. Using clickthrough data seems attractive because of the low cost of collecting the data, but it requires a precise setup since the data can be noisy. The term crowdsourcing was coined by Howe, building on Web 2.0 technology, in a Wired Magazine article [5]. Recently, the use of crowdsourcing for relevance judgment has been increasing as a way to overcome the high cost that current evaluation methods incur through expert judges. Running experiments at low cost and with fast turnaround makes this approach very appealing [4].

This paper discusses the issues related to using crowdsourcing for creating relevance judgments, gives advice on the implementation of successful crowdsourcing experiments, and presents a state-of-the-art review of the available methods for enhancing the quality of the judgments in crowdsourcing experiments. The organization of this paper is as follows: the introductory section explains the IR evaluation process and its different methods; Section 2 elaborates on crowdsourcing and its applications in different fields, especially IR; Section 3 explains how the accuracy of crowdsourcing results is measured; Section 4 highlights the importance of task design in crowdsourcing; human features and monetary factors are explained in Section 5; the recently proposed methods for quality control in crowdsourcing are reviewed in Section 6; and the concluding section presents a look into the future of crowdsourcing for IR evaluation together with some suggestions for further study in related areas.

2. Introduction to Crowdsourcing

Crowdsourcing is an emerging research field in which Information Systems scholars can make noteworthy contributions. It can be applied widely in various fields of computer science and other disciplines to test and evaluate studies, as shown and explained in Table 2 [6].

There are various platforms for crowdsourcing, such as Amazon Mechanical Turk (AMT) (https://www.mturk.com/mturk/welcome), Crowdflower [16], oDesk (https://www.odesk.com), and Elance (https://www.elance.com). Currently, AMT is the most popular and largest marketplace and crowdsourcing platform. It is a profit-oriented marketplace which allows requesters to submit tasks and workers to complete them. A Human Intelligence Task (HIT), or microtask, is the unit of work to be accomplished. Figure 3 presents the crowdsourcing scheme, which includes multiple requesters who publish their tasks and workers who accomplish the tasks on a crowdsourcing platform.

Generally, the crowdsourcing process starts with a requester publishing a task on the crowdsourcing platform. Workers select a specific task of their interest, work on it, and complete it. The requesters assess the results of the tasks done by the workers. If the results are acceptable, the requesters pay the workers; otherwise, the work may be rejected because the task was performed carelessly. The flow of submitting and completing tasks via crowdsourcing is shown in Figure 4.

Simplicity is the main feature of crowdsourcing. Crowdsourcing platforms enable requesters to have fast access to an on-demand, global, and scalable workforce, and workers to choose from thousands of tasks to accomplish online. In AMT, requesters are able to create HITs via an application programming interface (API), which enables them to distribute tasks programmatically, or to use a template via the dashboard. Crowdsourcing platforms have also been suggested as a viable choice for data collection [17]. Mason and Suri [18] stated three advantages of crowdsourcing platforms: (i) they allow a large number of workers to take part in experiments with low payment; (ii) the workers come from diverse languages, cultures, backgrounds, ages, and countries; and (iii) the research can be carried out at low cost. Precise design and quality control are required to optimize work quality and to conduct a successful crowdsourcing experiment.
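As an illustration of the API route, the sketch below uses the boto3 MTurk client to publish a relevance judgment HIT against the requester sandbox. The reward, limits, question HTML, and the use of the approval-rate system qualification are assumptions chosen for this example rather than settings recommended by the studies reviewed here; it is a starting point, not a definitive implementation.

# Illustrative sketch only: publish a relevance judgment HIT programmatically with boto3.
# Reward, limits, and the HTML body are assumed values for demonstration.
import boto3

QUESTION_XML = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <!DOCTYPE html><html><body>
      <!-- A real HIT form posts back to MTurk's externalSubmit endpoint with the assignmentId. -->
      <form method="post" action="https://www.mturk.com/mturk/externalSubmit">
        <input type="hidden" name="assignmentId" value="">
        <p>Please rate the relevance of the document shown below to the topic "Alzheimer".</p>
        <select name="relevance">
          <option>Highly relevant</option><option>Relevant</option><option>Not relevant</option>
        </select>
        <input type="submit">
      </form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

client = boto3.client(
    "mturk",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",  # sandbox, not live
)

hit = client.create_hit(
    Title="Judge the relevance of a document to a topic",
    Description="Read a short document and rate its relevance to the given topic.",
    Reward="0.02",                     # USD, passed as a string
    MaxAssignments=5,                  # collect five judgments per query-document pair
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=QUESTION_XML,
    QualificationRequirements=[{
        # System qualification commonly used for the worker approval rate (see Section 6.1).
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    }],
)
print(hit["HIT"]["HITId"])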


Table 1: User-based evaluation methods.

Human in the lab: involves human experimentation in the lab to evaluate the user-system interaction.
Side-by-side panels: the top-ranked answers generated by two IR systems for the same search query are collected and presented side by side to the users; a simple judgment by the human assessor is needed to see which side retrieves better results.
A/B testing: involves a number of preselected users of a website, analysing their reactions to a specific modification to see whether the change is positive or negative.
Using clickthrough data: clickthrough data is used to observe how frequently users click on retrieved documents for a given query.
Crowdsourcing: outsourcing tasks, which were formerly accomplished inside a company or institution by employees, to a huge heterogeneous mass of potential workers in the form of an open call through the Internet.

Figure 2: Typical IR evaluation process (runs over the document corpus and topics are produced by the contributing systems; a pooled subset of documents is judged by human assessors to form the relevance judgment set (qrels) of the test collection; the runs of both contributing and noncontributing systems are then evaluated against the qrels to calculate system scores and produce the system rankings).


Table 2: Different applications of crowdsourcing.

Natural language processing: crowdsourcing technology was used to investigate linguistic theory and language processing [7].
Machine learning: automatic translation using active learning and crowdsourcing was suggested to reduce the cost of language experts [8, 9].
Software engineering: the use of crowdsourcing was investigated to solve the problem of recruiting the right type and number of subjects to evaluate a software engineering technique [10].
Network event monitoring: using crowdsourcing to detect, isolate, and report service-level network events was explored, an approach called CEM (crowdsourcing event monitoring) [11].
Sentiment classification: the issues in training a sentiment analysis system using data collected through crowdsourcing were analysed [12].
Cataloguing: the application of crowdsourcing for libraries and archives was assessed [13].
Transportation planning: the use of crowdsourcing was argued to enable the citizen participation process in public planning projects [14].
Information retrieval: crowdsourcing was suggested as a feasible alternative for creating relevance judgments [15].

Figure 3: Crowdsourcing scheme (requesters 1 to n submit tasks to the crowdsourcing platform, and workers 1 to m complete them).

In general, there are three objectives that most researchers pay attention to in this field of research, especially in IR evaluation: (i) to determine the agreement percentage between experts and crowdsourcing workers; (ii) to find out the factors that have an effect on the accuracy of the workers' results; and (iii) to find approaches that maximize the quality of the workers' results. A survey of crowdsourcing was carried out from the social, academic, and enterprise perspectives [19], assessing different factors that drive the motivation and participation of volunteers, including the following:

(i) role-oriented crowdsourcing (leadership and ownership);

(ii) behaviour-oriented crowdsourcing (attribution, coordination, and conflict);

(iii) media-oriented crowdsourcing (mobile computing and ubiquitous computing).

Figure 4: Flow of submitting and completing tasks via crowdsourcing (1: the requester submits a task to the platform; 2: a worker picks the task; 3: the worker completes it; 4: the requester accepts or rejects the result; 5: payment is made).

However, this study looks at the academic aspect of crowdsourcing, specifically focusing on the factors that contribute to a successful and efficient crowdsourcing experiment in the area of information retrieval evaluation.

3. Accuracy and Reliability of Relevance Judgments in Crowdsourcing

The use of crowdsourcing for creating relevance judgments in IR evaluation is generally validated by measuring the agreement between crowdsourcing workers and human assessors. This shows whether crowdsourcing is a reliable replacement for human assessors. In crowdsourcing experiments, the interrater (or interannotator) agreement is used to measure the performance and to analyse the agreement between crowdsourcing workers and human assessors.


Interrater agreement is the degree of homogeneity in the ratings given by the judges. Various statistical methods are used to calculate it; Table 3 summarizes four common methods suggested for calculating the interrater agreement between crowdsourcing workers and human assessors for relevance judgment in IR evaluation [20].

Alonso et al. [15], pioneers of crowdsourcing in IR evaluation, applied crowdsourcing through Amazon Mechanical Turk to TREC data. Their results showed that crowdsourcing is a low-cost, reliable, and quick solution and could be considered an alternative for creating relevance judgments by experts, but it is not a replacement for current methods. This is because there are still several gaps and questions that should be addressed by future research. The scalability of the crowdsourcing approach in IR evaluation has not been investigated yet [20].

4. Experimental Design in Crowdsourcing

Beyond the workers' attention, culture, preferences, and skills, the presentation and properties of HITs have an influence on the quality of the crowdsourcing experiment. Careful and precise task design, rather than filtering out poor-quality work after the task has been accomplished, is important, since complicated tasks distract cheaters [24]. Even qualified workers may produce erroneous work if the user interface, instructions, and design are of poor quality. The phases in the implementation of a crowdsourcing experiment are similar to those of software development. The details of each phase for creating a relevance judgment task are identified as follows.

Planning

(i) Choosing a crowdsourcing platform.
(ii) Defining the task.
(iii) Preparing the dataset.

Design and Development

(i) Selecting topics.
(ii) Choosing a few documents for each topic (relevant and nonrelevant).
(iii) Task definition, description, and preparation.

Verifying

(i) Debugging with an internal expert team.
(ii) Taking note of shortcomings of the experiment.
(iii) Testing with external non-experts.
(iv) Reviewing and clarifying the instructions.
(v) Comparing the results of the expert and non-expert teams.

Publishing

(i) Monitoring work quality.

The impact of two HIT design methods, the full design and the simple design, on the accuracy of the labels was assessed in [25]. The full design method prequalifies and restricts workers, while the simple design method includes fewer quality control settings and restrictions on the workers; however, the output of the full design method has a higher quality. In general, crowdsourcing is a useful solution for relevance assessment if the tasks are designed carefully and the methods for aggregating labels are selected appropriately. Experiment design is discussed here based on three categories: (i) task definition, description, and preparation; (ii) interface design; and (iii) granularity.

4.1. Task Definition, Description, and Preparation. The information that is presented to the workers about a crowdsourcing task is the task definition. An important issue in implementing a crowdsourcing experiment is the task description, which is part of task preparation. Clear instructions, which are part of the task description, are crucial for obtaining a quick result, usually in less than 24 hours. All workers should have a common understanding of the task, and the words used must be understandable by different workers [26]. The task description should consider the characteristics of the workers, such as language or expertise [27]. For example, to avoid jargon, plain English is suggested since the population is diverse [28]. As creating relevance judgments requires reading text, using plain English and simple words helps make the experiment successful. Including the phrase "I do not know" among the responses was suggested, as it allows workers to convey that they do not have sufficient knowledge to answer the question [26]. Getting user feedback at the end of the task by asking an open-ended question is recommended to improve quality and avoid spammers. This kind of question can be optional; if the answer is useful, the requester can pay a bonus [28]. In creating relevance judgments one has to read text, hence instructions and questions should be readable, clear, and presented with examples. An example of a relevance judgment task is shown below.

Relevance Judgment

Instruction. Evaluate the relevance of a document to the given query.

Task. Please evaluate the relevance of the following document about Alzheimer.

"Dementia is a loss of brain function that occurs with certain diseases. Alzheimer's disease (AD) is one form of dementia that gradually gets worse over time. It affects memory, thinking, and behavior."

Please rate the above document according to its relevance to Alzheimer as follows.

(i) Highly relevant

(ii) Relevant

(iii) Not relevant

Please justify your answer: ______


Table 3: Statistics for calculating the interrater agreement.

Joint-probability of agreement (percentage agreement) [20]: the simplest and easiest measure, based on dividing the number of times each rating (e.g., 1, 2, ..., 5) is assigned by each assessor by the total number of ratings.
Cohen's kappa [21]: a statistical measure of interrater agreement between two raters; it is more robust than percentage agreement since it accounts for the agreement expected by chance between the two assessors.
Fleiss' kappa [22]: an extension of Cohen's kappa that measures agreement among any number of raters (not only two).
Krippendorff's alpha [23]: a measure based on the overall distribution of judgments, regardless of which assessors produced them.
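For illustration, the following minimal Python sketch computes the first two measures in Table 3, percentage agreement and Cohen's kappa, between one crowd worker's labels and a TREC assessor's labels on the same query-document pairs; the label values and data layout are assumptions of the example.

# Minimal sketch (not from the cited works): percentage agreement and Cohen's kappa
# between two label lists covering the same query-document pairs.
from collections import Counter

def percentage_agreement(labels_a, labels_b):
    """Share of items on which the two assessors give the same label."""
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    p_observed = percentage_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0

if __name__ == "__main__":
    crowd = [1, 0, 1, 1, 0, 0, 1, 0]
    trec  = [1, 0, 1, 0, 0, 0, 1, 1]
    print(percentage_agreement(crowd, trec))      # 0.75
    print(round(cohens_kappa(crowd, trec), 3))    # 0.5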

4.2. Interface Design. Users or workers access and contribute to the tasks via the user interface. Three design recommendations were suggested to achieve good task completion while paying less: designing a user interface and instructions understandable by novice users, translating the instructions into the local language as well as English, and preparing a video tutorial [29]. The use of colours, highlights, bold, italics, and typefaces was also suggested to enhance comprehensibility [26]. Verifiable questions, such as common-knowledge questions within the task, are also useful for validating the workers; they lead to a reduction in the number of worthless results since the users are aware that their answers are being analysed [30]. Another research work showed that more workers may be attracted by a user-friendly and simple interface and the quality of the outcome may increase, while reliable workers may be unenthusiastic about an unreasonably complex user interface, which leads to delay [27].

4.3. Granularity. Tasks can be divided into three broad groups: routine, complex, and creative tasks. Tasks which do not need specific expertise, such as creating relevance judgments, are called routine. Complex tasks need some general skills, like rewriting a given text. Creative tasks, on the other hand, need specific skill and expertise, such as research and development [31]. Submitting small tasks by splitting long and complex tasks was recommended, since small tasks tend to attract more workers [26]. However, some studies showed that novel tasks that need creativity tend to distract the cheaters and attract more qualified and reliable workers [24, 32]. In a separate study, Difallah et al. [33] stated that designing complicated tasks is a solution for filtering out cheaters. Money, on the other hand, is the main reason why most of the workers accomplish the tasks. However, the workers who do the task because of the prospect of becoming a celebrity or for fun are the least truthful, while the workers who are motivated by fulfillment and fortune are the most truthful. Eickhoff and de Vries [24] found that workers who do the job for entertainment, not for money, rarely cheat. Therefore, designing and presenting HITs in an exciting and entertaining way leads to better results along with cost efficiency.

5. Human Features and Monetary Factors in Crowdsourcing

The effect of human features, such as knowledge, and monetary factors, such as payment, on the accuracy and success of crowdsourcing experiments is discussed in this section. The discussion is based on three factors, namely, (i) worker profile in crowdsourcing, (ii) payment in crowdsourcing, and (iii) compensation policy in crowdsourcing.

5.1. Worker Profile in Crowdsourcing. One of the factors that have an influence on the quality of the results is the worker profile. A worker profile consists of reputation and expertise. Requesters' feedback about workers' accomplishments creates reputation scores in the systems [34]. Requesters should also maintain a positive reputation, since workers may refuse to accept HITs from poor requesters [17]. Expertise can be recognized through credentials and experience: information such as language, location, and academic degree constitutes credentials, while experience refers to the knowledge that a worker acquires through the crowdsourcing system [27]. Location is another important factor that has a strong effect on the accuracy of the results. A study conducted by Kazai et al. [35] indicated that Asian workers performed poorer quality work compared with American or European workers; the Asian workers mostly preferred the simple HIT design, while the American workers preferred the full design. In crowdsourcing platforms such as AMT and Crowdflower, it is possible to limit the workers to a specific country while designing tasks. However, creating relevance judgments is a routine task and does not need specific expertise or experience.

5.2. Payment in Crowdsourcing. Payment impacts the accuracy of the results in crowdsourcing, since workers who are satisfied with the payment accomplish the tasks more accurately than workers who are not contented [32]. Monetary or nonmonetary reasons can motivate the workers of crowdsourcing platforms [36]. A study conducted by Ross et al. [37] to find the motivation of workers in crowdsourcing showed that money is the main incentive of 13% of Indian and 5% of US workers. Another study, carried out by Ipeirotis [38], reported that AMT was the main income of 27% of Indian and 12% of US workers.


Kazai [39] stated that higher payment leads to better quality of work, while Potthast et al. [40] reported that higher payment only has an effect on completion time rather than on the quality of the results.

Kazai et al. [32] analysed the effect of paying workers more on the accuracy of the results. Although higher payment encourages more qualified workers, in contrast with lower payment, which attracts more sloppy workers, unethical workers are sometimes also attracted by higher payment. In this case, a strong quality control method, as well as filtering, can prevent poor results. Other studies reported that increasing payment leads to an increase in quantity rather than quality, while some studies showed that greater payment may only influence getting the task done faster, not better, as increasing the payment incentive speeds up work [41, 42]. However, reasonable pay is the more cautious solution, as highly paid tasks attract spammers as well as legitimate workers [43]. Indeed, the payment amount should be fair and based on the complexity of the tasks.

The payment method is another factor which impacts the quality of the workers' outputs [41, 44]. For instance, for a task which requires finding 10 words in a puzzle, more puzzles would be solved with payment per puzzle than with payment per word [45]. It is better to pay workers after the work quality has been confirmed by measuring the accuracy, and to clarify that payment is based on the quality of the completed work; this can help in achieving better results from the workers [26]. In an experiment conducted by Alonso and Mizzaro [20], the workers were paid $0.02 for a relevance evaluation that took about one minute and consisted of judging the relevance of one topic-document pair. In another experiment, the payment was $0.04 for judging one topic against ten documents [46].

5.3. Compensation in Crowdsourcing. A proper compensation policy and inducement have an impact on the quality of the results [47, 48]. There are two types of incentives: extrinsic incentives, like a monetary bonus such as extra payment, and intrinsic incentives, such as personal enthusiasm [27]. Rewards have been categorized into psychological, monetary, and material [47]. Mason and Watts [41] found that using nonfinancial compensation to motivate workers, such as enjoyable tasks or social rewards, is a better choice than financial rewards and has more effect on quality. It was also reported that using social rewards is like harnessing intrinsic motivation to improve the quality of work [49]. The use of intrinsic along with extrinsic incentives was recommended to motivate more reliable workers to perform a task. Furthermore, considering the task requirements and properties and social factors, such as the income and interests of workers, according to the profiles of requesters and workers is vital in selecting a proper compensation policy [27]. Offering a bonus for writing comments and justifications, or paying a bonus to the qualified and good workers who perform the tasks accurately and precisely, was recommended for relevance evaluation tasks [26].

In addition to task design, human features, and monetary factors, which are important for a successful crowdsourcing experiment, quality control is another vital part of a crowdsourcing experiment; it is elaborated on in the following section.

6. Quality Control in Crowdsourcing

Managing the workers' performance is a crucial part of crowdsourcing, because of sloppy workers who accomplish the tasks carelessly and spammers who cheat and complete the tasks inaccurately. Different worker types were determined through behavioral observation using different factors, including label accuracy, HIT completion time, and the fraction of useful labels [50], as explained in Table 4. This can help in developing methods for HIT design that attract the proper workers for the job. Table 5 shows different types of workers classified based on their average precision [51].

In 2012, to detect random spammers, Vuurens and de Vries [52] calculated, for each worker $w$, the RandomSpam score as the average squared ordinal distance $\mathrm{ord}^2_{ij}$ (which shows how different the judgment of one worker is from that of another worker) between their judgments $j \in J_w$ and the judgments by other workers on the same query-document pair, $i \in J_{j,w}$, as shown below:

$$\mathrm{RandomSpam}_w = \frac{\sum_{j \in J_w} \sum_{i \in J_{j,w}} \mathrm{ord}^2_{ij}}{\sum_{j \in J_w} |J_{j,w}|}. \quad (1)$$

In addition, to detect the uniform spammers, the UniformSpam score is calculated as shown below:

$$\mathrm{UniformSpam}_w = \frac{\sum_{s \in S} |S| \cdot (f_{s,J_w} - 1) \cdot \bigl(\sum_{j \in J_{s,w}} \sum_{i \in J_{j,w}} \mathrm{disagree}_{ij}\bigr)^2}{\sum_{s \in S} \sum_{j \in J_{s,w}} |J_{j,w}|}, \quad (2)$$

where $\mathrm{disagree}_{ij}$ is the number of disagreements between the judgments $J_{s,w}$ that occur within label sequence $s$ and the judgments $J_{j,w}$, $S$ is the set of all possible label sequences $s$ with length $|s| = 2$ or $3$, and $f_{s,J_w}$ is the frequency at which label sequence $s$ occurs within worker $w$'s time-ordered judgments $J_w$.

6.1. Design-Time Methods.


Table 4: Worker types based on their behavioral observation.

Diligent: completed tasks precisely, with high accuracy and a longer time spent on tasks.
Competent: skilled workers with high accuracy and fast work.
Sloppy: completed tasks quickly without considering quality.
Incompetent: completed tasks with low accuracy.
Spammer: did not deliver useful work.

Table 5: Worker types based on their average precision.

Proper: completed tasks precisely.
Random spammer: gave worthless answers.
Semirandom spammer: answered incorrectly on most questions while answering correctly on a few questions, hoping to avoid detection as a spammer.
Uniform spammer: repeated answers.
Sloppy: not precise enough in their judgments.

There are three approaches to selecting workers: (i) credential-based, (ii) reputation-based, and (iii) open-to-all. It is also possible to use a combination of the three approaches. The credential-based approach is used in systems where users are well profiled; however, this approach is not practical in crowdsourcing systems, since users are usually unknown. The reputation-based approach selects workers based on their reputation, as AMT does with the approval rate parameter. The open-to-all approach, such as in Wikipedia, allows any worker to contribute; it is easy to use and implement but increases the number of unreliable workers [27]. Requesters are also able to implement their own qualification methods [18] or to combine different methods. In total, there are various methods for filtering workers, which first identify the sloppy workers and then exclude them. We have categorized these methods into five categories; Table 6 presents a list of them.

When using a qualification test, it is still possible that workers perform tasks carelessly. Moreover, there are two issues that researchers encounter with qualification tests: (i) tasks with a qualification test may take longer to complete, as most workers prefer tasks without a qualification test, and (ii) the cost of developing and continuously maintaining the test. Honey pots, or gold standard data, are used when the results are clear and predefined [54]. Honey pots are faster than a qualification test, as it is easy and quick to identify workers who answer the questions at random. Combining a qualification test and honey pots is also possible for quality control [26].
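A minimal sketch of honey-pot screening along these lines is shown below; the 70% accuracy cutoff and the data layout are assumptions of the example, not values taken from the cited studies.

# Minimal honey-pot (gold standard) screening sketch; layout and threshold assumed.
# answers = {worker_id: {question_id: label}}, gold = {question_id: correct_label}

def gold_accuracy(worker_answers, gold):
    """Accuracy of a worker on the gold (honey-pot) questions they answered."""
    judged = [q for q in worker_answers if q in gold]
    if not judged:
        return None  # worker saw no gold questions; cannot be screened this way
    correct = sum(1 for q in judged if worker_answers[q] == gold[q])
    return correct / len(judged)

def filter_workers(answers, gold, min_accuracy=0.7):
    """Keep workers whose gold accuracy meets the (assumed) 70% cutoff."""
    kept = {}
    for worker, worker_answers in answers.items():
        acc = gold_accuracy(worker_answers, gold)
        if acc is None or acc >= min_accuracy:
            kept[worker] = worker_answers
    return kept

if __name__ == "__main__":
    gold = {"g1": 1, "g2": 0}
    answers = {"w1": {"g1": 1, "g2": 0, "t5": 1},   # passes the gold checks
               "w2": {"g1": 0, "g2": 1, "t5": 0}}   # fails the gold checks
    print(sorted(filter_workers(answers, gold)))     # ['w1']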

Qualification settings, such as filtering workers, especially based on origin, have an important impact on cheater rates; for instance, some assume that workers from developed countries have lower cheater rates [24]. Setting the approval rate is used by AMT, which provides a metric called the approval rate to prefilter the workers. It represents the percentage of assignments the worker has performed that were confirmed by the requester, over all tasks the worker has done; generally, it is the overall rating of each worker in the system. It is also possible to limit the tasks to master workers. In AMT, the group of people who accomplish HITs with a high degree of accuracy across a variety of requesters are called master workers. Master workers expect to receive higher payment, and limiting the HITs to master workers yields better quality results [32]. If a high approval rate is required for quality control, it takes longer to complete the experiment, since the eligible worker population shrinks [28]. In general, the settings should be chosen according to the task type: if the task is complicated, the approval rate and other settings should be set precisely, or the task should be assigned to master workers only; if the task is routine, ordinary workers with simple settings can be effective. Trap questions are also used to detect careless workers who do not read the instructions of the task carefully. Kazai et al. [25] designed two trap questions for this purpose: "Please tick here if you did NOT read the instructions" and, on each page, "I did not pay attention." Trap questions may not detect all unreliable workers, but they expose strange behavior and can be effective both in discouraging and in identifying spammers. In crowdsourcing there are also programs, or bots, designed to accomplish HITs automatically [55], and these kinds of data are definitely of poor quality [18]. CAPTCHAs and reCAPTCHA are used in crowdsourcing experiments to detect superficial workers and random clicking on answers [29, 32]; these methods are easy to use and inexpensive.

It is also possible to use a combination of the above methods to filter workers. For example, a real-time strategy applying different methods to filter workers was proposed for recruiting workers and monitoring the quality of their work: first, a qualification test was used to filter out sloppy workers, and the completion time of HITs was calculated to reflect the truthfulness of the workers; then a set of gold questions was used to evaluate the workers' skills [56]. However, when using these methods to filter workers, there is the possibility that fewer quality workers are recruited to do the work. The quality control methods used after filtering workers, at run time, are explained in the subsequent section.

6.2. Run-Time Methods.


Table 6: Design-time methods.

Qualification test: a set of questions which the workers must answer to qualify for doing the tasks (platform: AMT; example: IQ test [57]).
Honey pots or gold standard data: predefined questions with known answers [54]; if the workers answer these questions correctly, they are marked as appropriate for the task (platform: Crowdflower [16]).
Qualification settings: settings configured when creating HITs (platforms: AMT, Crowdflower; example: using the approval rate).
Trap questions: designing HITs along with a set of questions with known answers to find unreliable workers [58].
CAPTCHAs and reCAPTCHA: CAPTCHAs are an antispamming technique that separates computers from humans in order to filter automatic answers; text from a scanned page is used to identify whether a human is entering the data or spam is trying to trick the web application, as only a human can pass the test [59]. reCAPTCHA is a development of CAPTCHAs [60].

Although design-time techniques can improve quality, there is still the possibility of low-quality results because of misunderstandings that occur while doing the tasks. Therefore, run-time techniques are vital for high-quality results; indeed, the quality of the results is increased by applying both approaches [27]. In crowdsourcing, if only one judgment per example or task is collected, which is called the single labeling method, time and cost may be saved; however, the quality of the work then depends on a single individual's knowledge. To solve this issue of the single labeling method, integrating the labels from multiple workers was introduced [61, 62]. If labels are noisy, multiple labels can be preferable to single labeling, even when labels are not particularly low cost [63], and this leads to more accurate relevance judgments [64]. An important issue related to multiple labels per example is how to aggregate the labels from various workers into a single consensus label accurately and efficiently. Run-time quality control methods are explained in Table 7.

The MV is a better choice for routine tasks which are paid a lower payment, since it is easy to implement and achieves reasonable results, depending on the truthfulness of the workers [53]. One drawback of this method is that the consensus label is computed for a specific task without considering the accuracy of the workers on other tasks. Another drawback is that MV considers all workers to be equally good. For example, if there is only one expert and the others are novices, and the novices give the same inaccurate responses, MV takes the novices' answer as the correct answer because they are the majority.
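A minimal sketch of majority voting over the labels collected for each query-document pair is given below; the data layout is assumed, and ties are broken arbitrarily here, which a real experiment would need to handle explicitly.

# Minimal majority-voting sketch; labels = {pair_id: [label, label, ...]}.
from collections import Counter

def majority_vote(labels_per_pair):
    """Pick the most frequent label for each query-document pair."""
    return {pair: Counter(labels).most_common(1)[0][0]
            for pair, labels in labels_per_pair.items()}

if __name__ == "__main__":
    labels = {"q1-d1": [1, 1, 0], "q1-d2": [0, 0, 1, 1, 0]}
    print(majority_vote(labels))  # {'q1-d1': 1, 'q1-d2': 0}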

The outputs of the EM algorithm are a set of estimated accurate answers for each task and a set of matrices that capture the workers' errors (in that their replies might not be the correct answers). The error rate of each worker can be obtained from this confusion matrix. However, to measure the quality of a worker, the error rate alone is not adequate, since workers may have completed the task carefully but with bias; for example, in labeling websites, parents with young children are more conservative in classifying the websites. To handle this situation, a single scalar score can be assigned to each worker based on the completed labels. These scores separate the error rate from the worker bias and lead to a fairer treatment of the workers. Approximately five labels are needed for each example for this algorithm to become accurate [65]. Recently, Carpenter [66] and Raykar et al. [67] proposed Bayesian versions of the EM algorithm using the confusion matrix. A probabilistic framework was proposed by Raykar et al. [67] for the case of multiple labels with no gold standard: a specific gold standard is created repeatedly by the proposed algorithm, the performance of the workers is measured, and then the gold standard is refined according to the performance measurements. Hosseini et al. [64] evaluated the performance of EM and MV for aggregating labels in relevance assessment; their experimental results showed that the EM method is a better solution to problems related to the reliability of IR system rankings and relevance judgments, especially in the case of a small number of labels and noisy judgments. Another study compared the MV and EM methods and stated that EM outperformed MV when 60% of the workers were spammers [51]. However, a combination of these two methods could produce more accurate relevance judgments.
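The following simplified sketch illustrates the EM idea in the spirit of Dawid and Skene [69]: it alternates between estimating a posterior over the true label of each example and re-estimating each worker's confusion matrix. The initialization, smoothing constants, and data layout are assumptions of this toy version rather than the exact formulation in the cited work.

# Simplified Dawid–Skene-style EM sketch (illustrative only).
# labels = {(worker, item): observed_label}, with labels in range(n_classes).
import numpy as np

def em_aggregate(labels, n_classes, n_iter=50):
    workers = sorted({w for w, _ in labels})
    items = sorted({i for _, i in labels})
    w_idx = {w: k for k, w in enumerate(workers)}
    i_idx = {i: k for k, i in enumerate(items)}

    # Initialise per-item class posteriors from vote fractions.
    post = np.full((len(items), n_classes), 1e-9)
    for (w, i), lab in labels.items():
        post[i_idx[i], lab] += 1
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and each worker's confusion matrix (row: true, column: observed).
        prior = post.mean(axis=0)
        conf = np.full((len(workers), n_classes, n_classes), 1e-9)
        for (w, i), lab in labels.items():
            conf[w_idx[w], :, lab] += post[i_idx[i]]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute posteriors from the priors and confusion matrices.
        post = np.tile(np.log(prior), (len(items), 1))
        for (w, i), lab in labels.items():
            post[i_idx[i]] += np.log(conf[w_idx[w], :, lab])
        post = np.exp(post - post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)

    return {i: int(post[i_idx[i]].argmax()) for i in items}

if __name__ == "__main__":
    labels = {("w1", "d1"): 1, ("w2", "d1"): 1, ("w3", "d1"): 0,
              ("w1", "d2"): 0, ("w2", "d2"): 0, ("w3", "d2"): 1,
              ("w1", "d3"): 1, ("w2", "d3"): 1, ("w3", "d3"): 1}
    print(em_aggregate(labels, n_classes=2))  # {'d1': 1, 'd2': 0, 'd3': 1}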

Considering the time taken to complete the task is an example of observing the pattern of responses, which was used to determine the random answers of unreliable workers [30]. As tasks that are completed very fast are deemed to be of poor quality, completion time is a robust signal for detecting sloppy workers. Soleymani and Larson [75] used crowdsourcing for the affective annotation of video in order to improve the performance of multimedia retrieval systems; the completion time of each HIT was compared to the video length in order to evaluate quality. In another study, the time that each worker spent on a judgment was assessed in order to control quality [58]. Three types of patterns were found: (i) a normal pattern, whereby the workers begin slowly and get faster as they learn about the task; (ii) a periodic pattern, a peculiar behavior in which some of the judgments are done fast and some slowly; and (iii) an interrupted pattern, which refers to interruptions in the middle of doing tasks. This method, in combination with other quality control methods, can be effective in crowdsourcing experiments, since on its own it may not be able to detect all sloppy workers.
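A simple completion-time check in this spirit flags workers whose median time per HIT is far below that of the crowd; the median-based threshold below is an assumption of the sketch, not a rule taken from the cited studies.

# Sketch: flag workers whose HITs are completed suspiciously fast.
# times = {worker_id: [seconds_per_hit, ...]}; the threshold factor is assumed.
from statistics import median

def flag_fast_workers(times, factor=0.25):
    """Flag workers whose median completion time is below factor * overall median."""
    overall = median(t for ts in times.values() for t in ts)
    return [w for w, ts in times.items() if median(ts) < factor * overall]

if __name__ == "__main__":
    times = {"w1": [60, 75, 58], "w2": [64, 70, 66], "w3": [6, 5, 7]}
    print(flag_fast_workers(times))  # ['w3']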


Table 7: Run-time methods.

Majority voting (MV): a straightforward and common method which eliminates wrong results by using the majority decision [31, 61, 68].
Expectation maximization (EM) algorithm: the EM algorithm measures worker quality by estimating the accurate answer for each task from the labels completed by different workers, using maximum likelihood. The algorithm has two phases: (i) the correct answer is estimated for each task from the multiple labels submitted by different workers, accounting for the quality of each worker, and (ii) the assigned responses are compared to the concluded accurate answer in order to estimate the quality of each worker [69].
Naive Bayes (NB): following EM, NB is a method to model the biases and reliability of individual workers and correct them in order to improve the quality of the workers' results. Based on gold standard data, a small amount of training data labeled by an expert is used to correct the individual biases of the workers; the idea is to recalibrate the workers' answers to better match the experts'. An average of four non-expert labels per example is needed to emulate expert-level label quality. This idea helps to improve annotation quality [68].
Observation of the pattern of responses: looking at the pattern of answers is another effective way of filtering unreliable responses, as some untrustworthy workers have a regular pattern, for example, selecting the first choice for every question.
Probabilistic matrix factorization (PMF): PMF induces a latent feature vector for each worker and example to infer unobserved worker assessments for all examples [70]. PMF is a standard method in collaborative filtering; crowdsourcing data is converted to collaborative-filtering data to predict the unlabeled judgments of workers [71, 72].
Expert review: expert review uses experts to evaluate workers [73].
Contributor evaluation: the workers are evaluated according to quality factors such as their reputation, experience, or credentials. If the workers have enough quality factors, the requester accepts their tasks. For instance, Wikipedia accepts articles written by administrators without evaluation [27]; similarly, tasks submitted by workers who have a higher approval rate, or by master workers, would be accepted without doubt.
Real-time support: the requesters give workers feedback about the quality of their work in real time, while the workers are accomplishing the task. This helps workers to amend their work, and the results showed that self-assessment and external feedback improve the quality of the task [48]. Another real-time approach was proposed by Kulkarni et al. [74], where requesters can follow the workers' workflows while they solve the tasks; a tool called Turkomatic was presented, which employs workers to do tasks for requesters, and while the workers are doing the task the requesters are able to monitor the process and view its status in real time.

Another quality control method, expert review, is commonly applied in practice, but the drawbacks of this method are the high cost of employing experts and the time it consumes. This approach is not applicable in IR evaluation, since crowdsourcing is used precisely to lower the cost of hiring experts to create relevance judgments; if the expert review method were used to aggregate judgments, the cost would increase again.

The drawbacks of using crowdsourcing to create relevance judgments are that (i) each worker judges a small number of examples and (ii) to decrease cost, few judgments are collected per example. Therefore, the judgments are imbalanced and sparse, as each worker assesses only a small number of examples. MV is vulnerable to this problem, while EM addresses it indirectly and PMF tackles it directly [70].
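The following toy sketch illustrates how PMF can be applied here: the sparse worker-by-example label matrix is factorized into latent worker and example vectors, which are then used to predict the unobserved worker-example entries. The dimensionality, learning rate, and regularization values are assumptions of the example, not settings from [70–72].

# Toy PMF sketch: factorize the sparse worker x example label matrix to predict
# unobserved judgments. Hyperparameters below are illustrative assumptions.
import numpy as np

def pmf(observed, n_workers, n_examples, k=3, lr=0.05, reg=0.02, epochs=500, seed=0):
    """observed: dict {(worker, example): label}; returns the full predicted matrix."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((n_workers, k))   # latent worker features
    E = 0.1 * rng.standard_normal((n_examples, k))  # latent example features
    for _ in range(epochs):
        for (u, i), r in observed.items():
            err = r - W[u] @ E[i]
            W[u] += lr * (err * E[i] - reg * W[u])   # SGD step on the squared error
            E[i] += lr * (err * W[u] - reg * E[i])
    return W @ E.T

if __name__ == "__main__":
    # 3 workers x 4 examples, graded labels in {0, 1, 2}; some entries unobserved.
    obs = {(0, 0): 2, (0, 1): 0, (1, 0): 2, (1, 2): 1, (2, 1): 0, (2, 3): 2}
    pred = pmf(obs, n_workers=3, n_examples=4)
    print(np.round(pred, 1))  # includes estimates for the missing worker-example pairs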

Different quality control methods applied in crowdsourcing experiments in different areas of study have been listed and discussed, covering both design-time and run-time approaches. Although both approaches are crucial to successful experiments, the important point is that the quality control method should be well suited to the crowdsourcing platform.

7. Conclusion and Future Work

Crowdsourcing is a novel phenomenon being applied to various fields of study. One of its applications is in IR evaluation experiments. Several recent studies show that crowdsourcing is a viable alternative for the creation of relevance judgments; however, it needs precise design and appropriate quality control methods for the outcomes to be trusted. This is particularly imperative since the workers of crowdsourcing platforms come from diverse cultural and linguistic backgrounds.

One of the important concerns in crowdsourcing experiments is the possible untrustworthiness of workers trying to earn money without paying due attention to their tasks. Confusion or misunderstanding in crowdsourcing experiments is another concern, which may arise from low quality in the experimental design. Consequently, quality control in crowdsourcing experiments is a vital element not only in the design phase but also in the run-time phase, and quality control and monitoring strategies should be considered and examined during all stages of the experiments.


Moreover, the user interface and instructions should be comprehensible, motivating, and exciting to the users. The payment should be reasonable, but monetary or nonmonetary rewards can also be considered. Although there have been efforts to develop crowdsourcing experiments in the IR area in recent years, several limitations (some are listed below) remain to be addressed in refining the various approaches:

(i) currently, a random topic-document pair is shown to the workers to judge; workers should be given the freedom to choose their topic of interest for relevance judgment;

(ii) scaling up the creation of relevance judgments to a larger set of topic-document pairs, to find out whether crowdsourcing is a replacement for hired assessors in terms of creating relevance judgments;

(iii) designing a more precise grading system for workers, rather than the approval rate, in which requesters can find out each worker's grade for each task type; for example, requesters would be able to know which workers have a good grade in creating relevance judgments;

(iv) giving requesters access to reliable personal information about workers, such as expertise, language, and background, to decide which types of workers are allowed to accomplish the tasks;

(v) trying different aggregation methods in creating relevance judgments and evaluating their influence on the correlation of the judgments with those of TREC experts.

Crowdsourcing is an exciting research area with several puzzles and challenges that require further research and investigation. This review has shed light on some aspects of using crowdsourcing in IR evaluation, with insight into issues related to crowdsourcing experiments in some of the main stages, such as design, implementation, and monitoring. Hence, further research and development should be conducted to enhance the reliability and accuracy of relevance judgments in IR experimentation.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the High Impact Research Grant UM.C/625/1/HIR/MOHE/FCSIT/14 from the University of Malaya and the Ministry of Education, Malaysia.

References

[1] E. M. Voorhees, "The philosophy of information retrieval evaluation," in Evaluation of Cross-Language Information Retrieval Systems, pp. 355–370, Springer, 2002.
[2] C. Cleverdon, "The Cranfield tests on index language devices," Aslib Proceedings, vol. 19, no. 6, pp. 173–194, 1967.
[3] S. I. Moghadasi, S. D. Ravana, and S. N. Raman, "Low-cost evaluation techniques for information retrieval systems: a review," Journal of Informetrics, vol. 7, no. 2, pp. 301–312, 2013.
[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology Behind Search, Addison-Wesley, 2011.
[5] J. Howe, "The rise of crowdsourcing," Wired Magazine, vol. 14, no. 6, pp. 1–4, 2006.
[6] Y. Zhao and Q. Zhu, "Evaluation on crowdsourcing research: current status and future direction," Information Systems Frontiers, 2012.
[7] R. Munro, S. Bethard, V. Kuperman et al., "Crowdsourcing and language studies: the new generation of linguistic data," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 122–130, Association for Computational Linguistics, 2010.
[8] V. Ambati, S. Vogel, and J. Carbonell, "Active learning and crowd-sourcing for machine translation," in Language Resources and Evaluation (LREC), vol. 7, pp. 2169–2174, 2010.
[9] C. Callison-Burch, "Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 286–295, Association for Computational Linguistics, August 2009.
[10] K. T. Stolee and S. Elbaum, "Exploring the use of crowdsourcing to support empirical studies in software engineering," in Proceedings of the 4th International Symposium on Empirical Software Engineering and Measurement (ESEM '10), no. 35, ACM, September 2010.
[11] D. R. Choffnes, F. E. Bustamante, and Z. Ge, "Crowdsourcing service-level network event monitoring," in Proceedings of the 7th International Conference on Autonomic Computing (SIGCOMM '10), pp. 387–398, ACM, September 2010.
[12] A. Brew, D. Greene, and P. Cunningham, "Using crowdsourcing and active learning to track sentiment in online media," in Proceedings of the Conference on European Conference on Artificial Intelligence (ECAI '10), pp. 145–150, 2010.
[13] R. Holley, "Crowdsourcing and social engagement: potential, power and freedom for libraries and users," 2009.
[14] D. C. Brabham, "Crowdsourcing the public participation process for planning projects," Planning Theory, vol. 8, no. 3, pp. 242–262, 2009.
[15] O. Alonso, D. E. Rose, and B. Stewart, "Crowdsourcing for relevance evaluation," ACM SIGIR Forum, vol. 42, no. 2, pp. 9–15, 2008.
[16] Crowdflower, http://www.crowdflower.com/.
[17] G. Paolacci, J. Chandler, and P. G. Ipeirotis, "Running experiments on Amazon Mechanical Turk," Judgment and Decision Making, vol. 5, no. 5, pp. 411–419, 2010.
[18] W. Mason and S. Suri, "Conducting behavioral research on Amazon's Mechanical Turk," Behavior Research Methods, vol. 44, no. 1, pp. 1–23, 2012.
[19] Y. Pan and E. Blevis, "A survey of crowdsourcing as a means of collaboration and the implications of crowdsourcing for interaction design," in Proceedings of the 12th International Conference on Collaboration Technologies and Systems (CTS '11), pp. 397–403, May 2011.
[20] O. Alonso and S. Mizzaro, "Using crowdsourcing for TREC relevance assessment," Information Processing and Management, vol. 48, no. 6, pp. 1053–1066, 2012.

[21] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[22] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.
[23] K. Krippendorff, "Estimating the reliability, systematic error and random error of interval data," Educational and Psychological Measurement, vol. 30, no. 1, pp. 61–70, 1970.
[24] C. Eickhoff and A. P. de Vries, "Increasing cheat robustness of crowdsourcing tasks," Information Retrieval, vol. 16, no. 2, pp. 121–137, 2013.
[25] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling, "Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 205–214, ACM, July 2011.
[26] O. Alonso, "Implementing crowdsourcing-based relevance experimentation: an industrial perspective," Information Retrieval, vol. 16, no. 2, pp. 101–120, 2013.
[27] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, "Quality control in crowdsourcing systems: issues and directions," IEEE Internet Computing, vol. 17, no. 2, pp. 76–81, 2013.
[28] O. Alonso and R. Baeza-Yates, "Design and implementation of relevance assessments using crowdsourcing," in Advances in Information Retrieval, pp. 153–164, Springer, 2011.
[29] S. Khanna, A. Ratan, J. Davis, and W. Thies, "Evaluating and improving the usability of Mechanical Turk for low-income workers in India," in Proceedings of the 1st ACM Symposium on Computing for Development (DEV '10), no. 12, ACM, December 2010.
[30] A. Kittur, E. H. Chi, and B. Suh, "Crowdsourcing user studies with Mechanical Turk," in Proceedings of the 26th Annual CHI Conference on Human Factors in Computing Systems (CHI '08), pp. 453–456, ACM, April 2008.
[31] M. Hirth, T. Hoßfeld, and P. Tran-Gia, "Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms," Mathematical and Computer Modelling, vol. 57, no. 11-12, pp. 2918–2932, 2013.
[32] G. Kazai, J. Kamps, and N. Milic-Frayling, "An analysis of human factors and label accuracy in crowdsourcing relevance judgments," Information Retrieval, vol. 16, no. 2, pp. 138–178, 2013.
[33] D. E. Difallah, G. Demartini, and P. Cudre-Mauroux, "Mechanical cheat: spamming schemes and adversarial techniques on crowdsourcing platforms," in Proceedings of the CrowdSearch Workshop, Lyon, France, 2012.
[34] L. De Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler, "Reputation systems for open collaboration," Communications of the ACM, vol. 54, no. 8, pp. 81–87, 2011.
[35] G. Kazai, J. Kamps, and N. Milic-Frayling, "The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2583–2586, ACM, 2012.
[36] L. Hammon and H. Hippner, "Crowdsourcing," Wirtschaftsinf, vol. 54, no. 3, pp. 165–168, 2012.
[37] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson, "Who are the crowdworkers? Shifting demographics in Mechanical Turk," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 2863–2872, ACM, April 2010.
[38] P. Ipeirotis, "Demographics of Mechanical Turk," 2010.
[39] G. Kazai, "In search of quality in crowdsourcing for search engine evaluation," in Advances in Information Retrieval, pp. 165–176, Springer, 2011.
[40] M. Potthast, B. Stein, A. Barron-Cedeno, and P. Rosso, "An evaluation framework for plagiarism detection," in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997–1005, Association for Computational Linguistics, August 2010.
[41] W. Mason and D. J. Watts, "Financial incentives and the performance of crowds," ACM SIGKDD Explorations Newsletter, vol. 11, no. 2, pp. 100–108, 2010.
[42] J. Heer and M. Bostock, "Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 203–212, ACM, April 2010.
[43] C. Grady and M. Lease, "Crowdsourcing document relevance assessment with Mechanical Turk," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 172–179, Association for Computational Linguistics, 2010.
[44] J. Chen, N. Menezes, J. Bradley, and T. A. North, "Opportunities for crowdsourcing research on Amazon Mechanical Turk," Human Factors, vol. 5, no. 3, 2011.
[45] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut, "CrowdForge: crowdsourcing complex work," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), pp. 43–52, ACM, October 2011.
[46] P. Clough, M. Sanderson, J. Tang, T. Gollins, and A. Warner, "Examining the limits of crowdsourcing for relevance assessment," IEEE Internet Computing, vol. 17, no. 4, pp. 32–38, 2012.
[47] O. Scekic, H. L. Truong, and S. Dustdar, "Modeling rewards and incentive mechanisms for social BPM," in Business Process Management, pp. 150–155, Springer, 2012.
[48] S. P. Dow, B. Bunge, T. Nguyen, S. R. Klemmer, A. Kulkarni, and B. Hartmann, "Shepherding the crowd: managing and providing feedback to crowd workers," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1669–1674, ACM, May 2011.
[49] T. W. Malone, R. Laubacher, and C. Dellarocas, "Harnessing crowds: mapping the genome of collective intelligence," MIT Sloan School Working Paper 4732-09, 2009.
[50] G. Kazai, J. Kamps, and N. Milic-Frayling, "Worker types and personality traits in crowdsourcing relevance labels," in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1941–1944, ACM, October 2011.
[51] J. Vuurens, A. P. de Vries, and C. Eickhoff, "How much spam can you take? An analysis of crowdsourcing results to increase accuracy," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), pp. 21–26, 2011.
[52] J. B. P. Vuurens and A. P. de Vries, "Obtaining high-quality relevance judgments using crowdsourcing," IEEE Internet Computing, vol. 16, no. 5, pp. 20–27, 2012.
[53] W. Tang and M. Lease, "Semi-supervised consensus labeling for crowdsourcing," in Proceedings of the SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), 2011.
[54] J. Le, A. Edmonds, V. Hester, and L. Biewald, "Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 21–26, 2010.

[55] R. M. McCreadie, C. Macdonald, and I. Ounis, "Crowdsourcing a news query classification dataset," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 31–38, 2010.
[56] T. Xia, C. Zhang, J. Xie, and T. Li, "Real-time quality control for crowdsourcing relevance evaluation," in Proceedings of the 3rd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC '12), pp. 535–539, 2012.
[57] G. Zuccon, T. Leelanupab, S. Whiting, E. Yilmaz, J. M. Jose, and L. Azzopardi, "Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems," Information Retrieval, vol. 16, no. 2, pp. 267–305, 2013.
[58] D. Zhu and B. Carterette, "An analysis of assessor behavior in crowdsourced preference judgments," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 17–20, 2010.
[59] L. Von Ahn, M. Blum, N. J. Hopper et al., "CAPTCHA: using hard AI problems for security," in Advances in Cryptology – EUROCRYPT 2003, pp. 294–311, Springer, 2003.
[60] L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, "reCAPTCHA: human-based character recognition via web security measures," Science, vol. 321, no. 5895, pp. 1465–1468, 2008.
[61] V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 614–622, ACM, August 2008.
[62] P. Welinder and P. Perona, "Online crowdsourcing: rating annotators and obtaining cost-effective labels," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '10), pp. 25–32, June 2010.
[63] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, "Repeated labeling using multiple noisy labelers," Data Mining and Knowledge Discovery, vol. 28, no. 2, pp. 402–441, 2014.
[64] M. Hosseini, I. J. Cox, N. Milic-Frayling, G. Kazai, and V. Vinay, "On aggregating labels from multiple crowd workers to infer relevance of documents," in Advances in Information Retrieval, pp. 182–194, Springer, 2012.
[65] P. G. Ipeirotis, F. Provost, and J. Wang, "Quality management on Amazon Mechanical Turk," in Proceedings of the Human Computation Workshop (HCOMP '10), pp. 64–67, July 2010.
[66] B. Carpenter, "Multilevel Bayesian models of categorical data annotation," 2008.
[67] V. C. Raykar, S. Yu, L. H. Zhao et al., "Learning from crowds," The Journal of Machine Learning Research, vol. 11, pp. 1297–1322, 2010.
[68] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, "Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254–263, Association for Computational Linguistics, October 2008.
[69] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, vol. 28, pp. 20–28, 1979.
[70] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," Advances in Neural Information Processing Systems, vol. 20, pp. 1257–1264, 2008.
[71] H. J. Jung and M. Lease, "Inferring missing relevance judgments from crowd workers via probabilistic matrix factorization," in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1095–1096, ACM, 2012.
[72] H. J. Jung and M. Lease, "Improving quality of crowdsourced labels via probabilistic matrix factorization," in Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.
[73] A. J. Quinn and B. B. Bederson, "Human computation: a survey and taxonomy of a growing field," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1403–1412, ACM, May 2011.
[74] A. Kulkarni, M. Can, and B. Hartmann, "Collaboratively crowdsourcing workflows with Turkomatic," in Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW '12), pp. 1003–1012, February 2012.
[75] M. Soleymani and M. Larson, "Crowdsourcing for affective annotation of video: development of a viewer-reported boredom corpus," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 4–8, 2010.


Page 3: Review Article Creation of Reliable Relevance Judgments in … · 2019. 7. 31. · Review Article Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation

The Scientific World Journal 3

Table 1 User-based evaluation methods

User-based methods DescriptionHuman in the lab This method involves human experimentation in the lab to evaluate the user-system interaction

Side-by-side panelsThis method is defined as collecting the top ranked answers generated by two IR systems for the samesearch query and representing them side by side to the users To evaluate this method in the eyes ofhuman assessor a simple judgment is needed to see which side retrieves better results

AB testing AB testing involves numbers of preselected users of a website to analyse their reactions to the specificmodification to see whether the change is positive or negative

Using clickthrough data Clickthrough data is used to observe how frequently users click on retrieved documents for a given query

CrowdsourcingCrowdsourcing is defined as outsourcing tasks which was formerly accomplished inside a company orinstitution by employees assigned externally to huge heterogeneous mass of potential workers in the formof an open call through Internet

7 8 96C

x_

=

10

2 35 4

MR

System2

System n

System 1

System 2

System 1

Documents corpus

Runs

Topics

Human assessors

Pooling

Subset of documents

Test collection

System m

middot middot middot

Run of system 1 Run of system 2 Run of system n

Evaluation

Con

trib

utin

g sy

stem

ssy

stem

s

Relevance

judgments

Set (qrels)

Calculating system

scores

System rankings

Con

trib

utin

g an

d no

ncon

trib

utin

g

Figure 2 Typical IR evaluation process

4 The Scientific World Journal

Table 2 Different applications of crowdsourcing

Application DescriptionNatural languageprocessing Crowdsourcing technology was used to investigate linguistic theory and language processing [7]

Machine learning Automatic translation by using active learning and crowdsourcing was suggested to reduce the cost oflanguage experts [8 9]

Software engineering The use of crowdsourcing was investigated to solve the problem of recruiting the right type and number ofsubjects to evaluate a software engineering technique [10]

Network eventmonitoring

Using crowdsourcing to detect isolate and report service-level network events was explored which wascalled CEM (crowdsourcing event monitoring) [11]

Sentiment classification The issues in training a sentiment analysis system using data collected through crowdsourcing wereanalysed [12]

Cataloguing The application of crowdsourcing for libraries and archives was assessed [13]Transportation plan Use of crowdsourcing was argued to enable the citizen participation process in public planning projects [14]Information retrieval To create relevance judgments crowdsourcing was suggested as a feasible alternative [15]

Worker 1 Worker 2 Worker m

Requester 1 Requester 2 Requester n

Crowdsourcing platform

middot middot middot

middot middot middot

Submittask

Submittask

Submittask

Completetask

Completetask

Completetask

Figure 3 Crowdsourcing scheme

three objectives that most researchers draw attention to inthis field of research especially in IR evaluation which are(i) to determine the agreement percentage between expertsand crowdsourcing workers (ii) to find out factors thathave an effect on the accuracy of the workersrsquo results and(iii) to find the approaches to maximize the quality ofworkersrsquo results A survey of crowdsourcing was done interms of social academic and enterprise [19] while assessingdifferent factors which cause motivation and participation ofvolunteers including the following

(i) role-oriented crowdsourcing (leadership and owner-ship)

(ii) behaviour-oriented crowdsourcing (attribution co-ordination and conflict)

(iii) media-oriented crowdsourcing (mobile computingand ubiquitous computing)

Crowdsourcing platform

Requester

Worker

1

2 3

4

5

Submittask

Acceptrejectresult

Picktask

Completetask

Payment

Figure 4 Flow of submitting and completing tasks via crowdsourc-ing

However this study looks from the academic aspect ofcrowdsourcing specifically focusing on factors that contributeto a successful and efficient crowdsourcing experiment in thearea of information retrieval evaluation

3 Accuracy and Reliability of RelevanceJudgments in Crowdsourcing

The use of crowdsourcing in creating relevance judgmentfor an IR evaluation is generally validated through mea-suring the agreement between crowdsourcing workers andhuman assessors This is to see whether crowdsourcing is areliable replacement for human assessors In crowdsourcingexperiments the interrater or interannotator agreement isused to measure the performance and to analyse the agree-ment between crowdsourcing workers and human assessors

The Scientific World Journal 5

The score of homogeneity in the rating list given by judges isthe interrater agreement Various statistical methods are usedto calculate the interrater agreement Table 3 summarizes fourcommonmethods suggested to calculate the interrater agree-ment between crowdsourcing workers and human assessorsfor relevance judgment in IR evaluation [20]

Alonso et al [15] pioneers of crowdsourcing in IR eval-uation applied crowdsourcing through Amazon MechanicalTurk on TREC dataTheir results showed that crowdsourcingis a low cost reliable and quick solution and could beconsidered an alternative for creating relevance judgment byexperts but it is not a replacement for current methods Thisis due to the fact that there are still several gaps and questionsthat should be addressed by future researches The scalabilityof crowdsourcing approach in IR evaluation has not beeninvestigated yet [20]

4 Experimental Design in Crowdsourcing

Beyond the workersrsquo attention culture preferences andskills the presentation and properties of HITs have an influ-ence on the quality of the crowdsourcing experiment Thecareful and precise task design rather than filtering the poorquality work after accomplishing the task is an importanttask since the complicated tasks distract cheaters [24] It ispossible that qualified workers accomplish erroneous tasks ifthe user interface instructions and design have a poor qual-ity Different phases in the implementation of crowdsourcingexperiment are similar to software development The detailsof each phase are identified for creating relevance judgmenttask as follows

Planning

(i) Choosing a crowdsourcing platform(ii) Defining a task(iii) Preparing a dataset

Design and Development

(i) Selecting topics(ii) Choosing a few documents for each topic (relevant

and nonrelevant)(iii) Task definition description and preparation

Verifying

(i) Debugging by using internal expert team(ii) Take note of shortcoming of the experiment(iii) Using external inexpert to test(iv) Reviewing and clearing the instruction(v) Comparing the results of experts with inexpert teams

Publishing

(i) Monitoringwork quality

The impact of the twomethods ofHIT design the full andsimple design was assessed on the accuracy of labels [25]

The full design method prequalifies and restricts workerswhile the simple design method includes less quality controlsetting and restriction on the workers However the outputof the full design method has a higher quality In generalcrowdsourcing is a useful solution for relevance assessmentif the tasks are designed carefully and the methods for aggre-gating labels are selected appropriately Experiment designis discussed based on three categories (i) task definitiondescription and preparation (ii) interface design and (iii)granularity

41 Task Definition Description and Preparation The infor-mation that is presented to the workers about crowdsourcingtask is task definition An important issue of implementinga crowdsourcing experiment is task description that is a partof task preparation Clear instruction which is a part of taskdescription is crucial to have a quick result usually in less than24 hours All workers should have a common understandingabout the task and the words used must be understandableby different workers [26] Task description should considerthe characteristics of workers such as language or expertise[27] For example to avoid jargon plain English is suggestedsince the population is diverse [28] As creating relevancejudgments requires reading text using plain English andsimple words do help in a successful experiment havingthe phrase ldquoI do not knowrdquo was suggested in response asit allows workers to convey that they do not have sufficientknowledge to answer the question [26] Getting user feedbackat the end of the task by asking an open-ended question isrecommended to improve the quality and avoid spammersThis kind of question can be optional If the answer isuseful the requester can pay bonus [28] In creating relevancejudgments one has to read text hence instructions andquestions should be readable clear and presented withexamples An example of relevance judgment task is shownbelow

Relevance Judgment

Instruction Evaluate the relevance of a document to the givenquery

Task Please evaluate the relevance of the following documentabout Alzheimer

ldquoDementia is a loss of brain function that occurs withcertain diseases Alzheimerrsquos disease (AD) is one form ofdementia that gradually gets worse over time It affectsmemorythinking and behaviorrdquo

Please rate the above document according to its relevanceto Alzheimer as follows

(i) Highly relevant

(ii) Relevant

(iii) Not relevant

Please justify your answer mdashmdashmdashmdashmdashmdash

6 The Scientific World Journal

Table 3 Statistics for calculating the interrater agreement

Methods DescriptionJoint-probability ofagreement (percentageagreement) [20]

The simplest and easiest measure based on dividing number of times for each rating (eg1 2 5)assigned by each assessor by the total number of the ratings

Cohenrsquos kappa [21]A statistical measure to calculate interrater agreement among raters This measurement is more robustthan percentage agreement since this method considers the effects of random agreement between twoassessors

Fleissrsquo kappa [22] An extended version of Cohenrsquos kappa This measurement considers the agreement among any number ofraters (not only two)

Krippendorff rsquos alpha [23] The measurement is based on the overall distribution of assessors regardless of which assessors producedthe judgments

42 Interface Design Users or workers access and contributeto the tasks via user interface Three design recommenda-tions were suggested to have a doing well task completionalong with paying less they are designing a user interfaceand instructions understandable by novice users translatingthe instruction to the local language as well as Englishand preparing a video tutorial [29] It also suggested theuse of colours highlights bold italics and typefaces toenhance comprehensibility [26] Verifiable questions suchas common-knowledge questions within the task are alsouseful to validate the workers It leads to a reduction inthe number of worthless results since the users are awarethat their answers are being analysed [30] Another researchwork showed that more workers may be attracted to the userfriendly and simple interface and the quality of outcomemayincrease while reliable workers may be unenthusiastic aboutthe unreasonably complex user interface which leads to delay[27]

43 Granularity Tasks can be divided into three broadgroups routine complex and creative tasks The tasks whichdo not need specific expertise are called routine such ascreating relevance judgment Complex tasks need somegeneral skills like rewriting a given text On the other handcreative tasks need specific skill and expertise such as researchand development [31] Submitting small tasks by splitting thelong and complex tasks was recommended since small taskstend to attract more workers [26] However some studiesshowed that novel tasks that need creativity tend to distractthe cheaters and attract more qualified and reliable workers[24 32] In a separate study Difallah et al [33] stated thatdesigning complicated tasks is a solution to filter out thecheaters On the other hand money is the main reasonwhy most of the workers accomplish the tasks Howeverthe workers who do the task because of the prospect ofbecoming a celebrity and fun are the least truthful while theworkers who are provoked by fulfillment and fortune arethe most truthful workers Eickhoff and de Vries [24] foundthat the workers who do the job for entertainment not formoney rarely cheat So designing and representing HITs inan exciting and entertaining way lead to better results alongwith cost efficiency

5 Human Features and MonetaryFactors in Crowdsourcing

Effect of human features like knowledge andmonetary factorssuch as payment on accuracy and success of crowdsourcingexperiments is discussed in this section The discussionis based on three factors namely (i) worker profile incrowdsourcing (ii) payment in crowdsourcing and (iii)compensation policy in crowdsourcing

51 Worker Profile in Crowdsourcing One of the factorsthat have an influence on the quality of results is workerprofile Worker profile consists of reputation and expertiseRequestersrsquo feedback about workersrsquo accomplishments cre-ates reputation scores in the systems [34] Requesters alsoshould maintain a positive reputation since workers mayrefuse accepting HIT from poor requesters [17] Expertisecan be recognized by credentials and experience Informationsuch as language location and academic degree is credentialswhile experience refers to knowledge that worker achievesthrough crowdsourcing system [27] Location is anotherimportant factor that has a strong effect on the accuracy ofthe results A study conducted by Kazai et al [35] indicatedthat Asian workers performed poor quality work comparedwith American or European workers The Asians mostlypreferred simple design of HIT while American workerspreferred full design In crowdsourcing platforms such asAMT and Crowdflower it is possible to limit the workers tothe specific country while designing tasks However creatingrelevance judgment is a routine task and does not needspecific expertise or experience

52 Payment in Crowdsourcing Payment impacts on accu-racy of the results in crowdsourcing since the workers whoare satisfied with the payment accomplish the tasks moreaccurately than the workers who are not contented [32]Monetary or nonmonetary reasons can be the motivationfor the workers of crowdsourcing platforms [36] A studyconducted by Ross et al [37] to find the motivation of theworkers in crowdsourcing showed that money is the mainincentive of 13 of Indian and 5 of US workers Anotherstudy carried out by Ipeirotis [38] reported that AMTwas themain income of 27 of Indians and 12 of US workers Kazai[39] stated that higher payment leads to better quality of work

The Scientific World Journal 7

while Potthast et al [40] reported that higher payment onlyhas an effect on completion time rather than on the quality ofresults

Kazai et al [32] analysed the effect of higher paymentto the workers on accuracy of the results Although higherpayment encourages more qualified workers in contrast withlower payment which upsurges the sloppy workers some-times unethical workers are attracted by higher payment Inthis case strong quality control method can prevent poorresult as well as filtering Other studies reported that byincreasing payment it leads to increase in quantity ratherthan quality while some studies showed that consideringgreater payment may only influence getting the task donefaster but not better as increasing payment incentive speedsup work [41 42] However a reasonable pay is a bettercautious solution as high pay tasks attract spammers aswell aslegitimate workers [43] Indeed the payment amount shouldbe fair and based on the complexity of the tasks

Payment method is another factor which impacts on thequality of the outputs of the workers [41 44] for instancea task which requires finding 10 words in a puzzle morepuzzles would be solved with method of payment per puzzlerather than payment per word [45] It is better to pay workersafter the work quality has been confirmed by measuringthe accuracy and clarification that payment is based on thequality of the completed work This can help in achievingbetter results fromworkers [26] In an experiment conductedby Alonso and Mizzaro [20] the workers were paid $002 forrelevancy evaluation that took about oneminute consisting ofjudging the relevancy of one topic and document In anotherexperiment the payment was $004 for judging one topicagainst ten documents [46]

53 Compensation in Crowdsourcing A proper compensa-tion policy and inducement has an impact on the qualityof the results [47 48] There are two types of incentivesextrinsic incentives like monetary bonus such as extra pay-ment and intrinsic incentives such as personal enthusiasm[27] Rewards were categorized into psychological monetaryand material [47] Mason and Watts [41] found that usingnonfinancial compensation to motivate workers such asenjoyable tasks or social rewards is a better choice thanfinancial rewards and have more effect on quality It wasalso reported that using social rewards is like harnessingintrinsic motivation to improve the quality of work [49] Theuse of intrinsic along with extrinsic incentives was recom-mended to motivate more reliable workers to perform a taskFurthermore considering the task requirements propertiesand social factors such as income and interests of workersaccording to the profiles of requesters and workers is vitalin selecting proper compensation policy [27] Offering bonusfor writing comments and justification or paying bonus to thequalified and good workers who perform the tasks accuratelyand precisely was recommended for relevance evaluation task[26]

In addition to task design human features and monetaryfactors are important to have a successful crowdsourcing

experiment quality control is another vital part of crowd-sourcing experiment which will be elaborated on in thefollowing section

6 Quality Control in Crowdsourcing

Managing the workersrsquo act is a crucial part of crowdsourcingbecause of the sloppy workers who accomplish the taskscarelessly or spammers who trick and complete the tasksinaccurately Different worker types by using different factorsincluding label accuracy HIT completion time and fractionof useful labels were determined through behavioral obser-vation [50] as explained below (see Table 4) This can helpin developing methods for HIT design to attract the properworkers for the job Table 5 shows different types of workersclassified based on their average precision [51]

In 2012 to detect random spammers for each worker 119908Vuurens and de Vries [52] calculated the RandomSpam scoreas the average squared ordinal distance (it shows how thejudgment of one worker is different from the other worker)1199001199031198892

119894119895between their judgments 119895 isin 119869

119908 and judgments by other

workers on the same query-document pair 119894 isin 119869119895119908

as shownbelow

RandomSpam119908=

sum119895isin119869119908

sum119894isin119869119895119908

1199001199031198892

119894119895

sum119895isin119869119908

10038161003816100381610038161003816119869119895119908

10038161003816100381610038161003816

(1)

In addition to detect the uniform spammers the Uni-formSpam score is calculated as shown below

UniformSpam119908

=

sum119904isin119878|119878| sdot (119891119904119869

119908

minus 1) sdot (sum119895isin119869119904119908

sum119894isin119869119895119908

disagree119894119895)

2

sum119904isin119878sum119895isin119869119904119908

10038161003816100381610038161003816119869119895119908

10038161003816100381610038161003816

(2)

where disagree119894119895is the number of disagreements between

judgments 119869119904119908

which happen within label sequence 119904 andjudgments 119869

119895119908and 119878 is a set of all possible label sequences

119904 with length |119904| = 2 or 3 while 119891119904119869119908

is the frequencyat which label sequence 119904 occurs within worker 119908rsquos time-ordered judgments 119869

119908

Quality control in crowdsourcing area is defined as theextent to which the provided outcome fulfills the require-ments of the requester Quality control approaches can becategorized into (i) design-time approaches which wereemployed while the requester prepared the task before sub-mission and (ii) run-time approacheswhichwere used duringor after doing the tasks Another classification for qualitycontrol methods is (i) filtering workers and (ii) aggregatinglabels that are discussed in this section [53] In other wordsfiltering workers are design-time approaches and aggregatinglabels methods can be considered as run-time approaches[27]

61 Design-Time Methods There are three approaches toselect workers (i) credential-based (ii) reputation-based

8 The Scientific World Journal

Table 4 Worker types based on their behavioral observation

Workers DescriptionDiligent Completed tasks precisely with a high accuracy and longer time spent on tasksCompetent Skilled workers with high accuracy and fast workSloppy Completed tasks quickly without considering qualityIncompetent Completed tasks in a time with low accuracySpammer Did not deliver useful works

Table 5 Worker types based on their average precision

Workers DescriptionProper Completed tasks preciselyRandom spammer Gave a worthless answer

Semirandom spammer Answered incorrectly on most questions while answering correctly on few questions hoping to avoiddetection as a spammer

Uniform spammer Repeated answersSloppy Not precise enough in their judgments

and (iii) open-to-all It is also possible to use the combi-nation of the three approaches Credential-based approachis used in systems where users are well profiled Howeverthis approach is not practical in crowdsourcing systemssince users are usually unknown Reputation-based approachselects workers based on their reputation like AMT by usingapproval rate parameter Open-to-all approach such as inWikipedia allows any worker to contribute and it is easy touse and implement but increases the number of unreliableworkers [27] Requesters are also able to implement their ownqualification methods [18] or combine different methods Intotal there are various methods of filtering workers whichidentify the sloppy workers at first and then exclude themWe have categorized the methods into five categories Table 6presents a list of these methods

When using qualification test it is still possible that work-ers perform tasks carelessly Moreover there are two issuesthat researchers encounter with qualification test (i) taskswith qualification test may take longer to complete as mostworkers prefer taskswithout qualification test and (ii) the costof developing and maintaining the test continuously Honeypots or gold standard is used when the results are clear andpredefined [54] The honey pots are faster than qualificationtest as it is easy and quick to identify workers who answer thequestions at random Combining qualification test and honeypots is also possible for quality control [26]

Qualification settings such as filtering workers especiallybased on origin have an important impact on the cheaterrates For instance some assume thatworkers fromdevelopedcountries have lesser cheater rates [24] Setting approval ratehas been used by AMT AMT provides a metric called theapproval rate to prefilter the workers It represents percentageof assignments the worker has performed and confirmed bythe requester over all tasks the worker has done Generallyit is the total rating of each worker in the system It isalso possible to limit the tasks to the master workers InAMT the group of people who accomplish HITs with ahigh degree of accuracy across a variety of requesters arecalled master workers Master workers expect to receive

higher payment Limiting the HITs to the master workersgets better quality results [32] If the high approval rate isconsidered for quality control it would take a longer time tocomplete the experiment since the worker population maybe decreasing [28] In general the setting should be doneaccording to the task type If the task is complicated theapproval rate and other settings should be set precisely or thetask is assigned to master workers only If the task is routineordinary workers with simple setting can be effective Trapquestions are also used to detect careless workers who donot read the instructions of the task carefully Kazai et al[25] designed two trap questions to avoid this situationldquoPlease tick here if you did NOT read the instructionsrdquo andat each page ldquoI did not pay attentionrdquo All of the unreliableworkers may not be detected by trap questions but it showedstrange behavior and it can be effective both in discouragingand identifying spammers In crowdsourcing there are someprograms or bots designed to accomplish HITs automatically[55] and these kinds of data definitely are of poor quality [18]CAPTCHAs and reCAPTCHA are used in crowdsourcingexperiment to detect superficial workers and random clickingon answers [29 32] These methods are easy to use andinexpensive

It is also possible to use a combination of the abovemethods to filter workers For example a real time strategyapplying different methods to filter workers was proposedin recruiting workers and monitoring the quality of theirworks At first a qualification test was used to filter the sloppyworkers and the completion time of HITs was calculated toreflect on the truthfulness of the workers Then a set of goldquestions were used to evaluate the skills of workers [56]However by using these methods to filter workers there isthe possibility of lesser quality being recruited to do a workThe quality control methods used after filtering workers inrun time are explained in the subsequent section

62 Run-Time Methods Although design-time techniquescan intensify quality there is the possibility of low quality

The Scientific World Journal 9

Table 6 Design-time methods

Method Description Platform Example

Qualification test Qualification test is a set of questions which the workers must answerto qualify for doing the tasks AMT IQ test [57]

Honey pots or goldstandard data

Creating predefined questions with known answers is honey pots[54] If the workers answer these questions correctly they are markedas appropriate for this task

Crowdflower [16] mdash

Qualification settings Some qualification settings are set when creating HITs AMT Crowdflower Using approval rate

Trap questions This method is about designing HITs along with a set of questionswith known answers to find unreliable workers [58] mdash mdash

CAPTCHAs andreCAPTCHA

CAPTCHAs are an antispamming technique to separate computersand humans to filter automatic answers A text of the scanned page isused to identify whether human inputs data or spam is trying to trickthe web application as only a human can go through the test [59]reCAPTCHA is the development of CAPTCHAs [60]

mdash mdash

because of misunderstanding while doing the tasks There-fore run-time techniques are vital to high quality resultsIndeed quality of the results would be increased by applyingboth approaches [27] In crowdsourcing if we assume thatone judgment per each example or task is called single label-ing method the time and cost may be saved However thequality of work is dependent on an individualrsquos knowledgeand in order to solve this issue of single labeling methodsintegrating the labels from multiple workers was introduced[61 62] If labels are noisy multiple labels can be desirableto single labeling even in the former setting when labelsare not particularly low cost [63] It leads to having moreaccuracy for relevance judgments [64] An important issuerelated to multiple labels per example is how to aggregatelabels accurately and effieciently from various workers intoa single consensus label Run-time quality control methodsare explained in Table 7

TheMV is a better choice for routine tasks which are paida lower payment since it is easy to implement and achievereasonable results depending on truthfulness of the workers[53] The drawback of this method is that the consensuslabel is measured for a specific task without considering theaccuracy of the workers in other tasks Another drawbackof this method is that MV considers all workers are equallygood For example if there is only one expert and theothers are novices and the novices give the same inaccurateresponses the MV considers the novicesrsquo answer as thecorrect answer because they are the majority

In EM algorithm a set of estimated accurate answersfor each task and a set of matrixes that include the list ofworkers errors (in that the replies might not be the correctedanswers) are the outputs The error rate of each worker canbe accessed by this confusion matrix However to measurethe quality of a worker the error rate is not adequate sinceworkers may have completed the task carefully but withbias For example in labeling websites parents with youngchildren are more conservative in classifying the websites Toprevent this situation a single scalar score can be assignedto each worker corresponding to the completed labels Thescores lead to separation of error rate from worker bias andsatisfactory treatment of the workers Approximately five

labels are needed for each example for this algorithm for itto become accurate [65] Recently Carpenter [66] and Raykaret al [67] proposed a Bayesian version of the EM algorithmby using confusion matrix A probabilistic framework wasproposed by Raykar et al [67] in the case of no gold standardwith multiple labels A specific gold standard is createdrepeatedly by the proposed algorithm and the performancesof workers aremeasured and then the gold standard is refinedaccording to the performance measurements Hosseini et al[64] evaluated the performance of EM and MV for aggre-gating labels in relevance assessment His experiment resultsshowed that the EM method is a better solution to problemsrelated to the reliability of IR system ranking and relevancejudgment especially in the case of small number of labelsand noisy judgments Another study compared MV and EMmethods which stated that EM outperformedMV when 60of workers were spammers [51] However a combination ofthese two methods could produce more accurate relevancejudgments

Considering the time taken to complete the task is anexample of observation of the pattern of responses whichwas used to determine random answers of unreliable workers[30] As tasks that were completed fast deemed to havepoor quality completion time is a robust method of detect-ing the sloppy workers Soleymani and Larson [75] usedcrowdsourcing for affective annotation of video in order toimprove the performance of multimedia retrieval systemsThe completion time of each HIT was compared to videolength in order to evaluate quality In another study thetime that each worker spent on judgment was assessed inorder to control quality [58] Three types of patterns werefound (i) normal pattern whereby the workers begin slowlyand get faster when they learn about the task (ii) periodicpattern a peculiar behavior since some of the judgmentsare done fast and some slow and (iii) interrupted patternwhich refers to interruption in the middle of doing tasksThis method in combination with other methods of qualitycontrol can be effective in crowdsourcing experiments sincethis method alone may not be able to detect all sloppyworkers Another method of quality control expert reviewis commonly applied in practice but the drawbacks of this

10 The Scientific World Journal

Table 7 Run-time methods

Method Description

Majority voting (MV) MV is a straightforward and common method which eliminates the wrong results by using the majoritydecision [31 61 68]

Expectation maximization(EM) algorithm

EM algorithm measures the worker quality by estimating the accurate answer for each task through labelscompleted by different workers using maximum likelihood The algorithm has two phases (i) the correctanswer is estimated for each task through multiple labels submitted by different workers accounting forthe quality of each worker and (ii) comparing the assigned responses to the concluded accurate answer inorder to estimate quality of each worker [69]

Naive Bayes (NB)

Following EM NB is a method to model the biases and reliability of single workers and correct them inorder to intensify the quality of the workersrsquo results According to gold standard data a small amount oftraining data labeled by expert was used to correct the individual biases of workers The idea is torecalibrate answers of workers to be more matched with experts An average of four inexpert labels foreach example is needed to emulate expert level label quality This idea helps to improve annotation quality[68]

Observation of thepattern of responses

Looking at the pattern of answers is another effective way of filtering unreliable responses as someuntrustworthy workers have a regular pattern for example selecting the first choice of every question

Probabilistic matrixfactorization (PMF)

Using probabilistic matrix factorization (PMF) that induces a latent feature vector for each worker andexample to infer unobserved worker assessments for all examples [70] PMF is a standard method incollaborative filtering through converting crowdsourcing data to collaborative filtering data to predictunlabeled labels from workers [71 72]

Expert review Expert review uses experts to evaluate workers [73]

Contributor evaluation

The workers are evaluated according to quality factors such as their reputation experience or credentialsIf the workers have enough quality factors the requester would accept their tasks For instance Wikipediawould accept the article written by administrators without evaluation [27] For instance the tasks that aresubmitted by the workers who have a higher approval rate or master workers would be accepted withoutdoubt

Real-time support

The requesters give workers feedback about the quality of their work in real time while workers areaccomplishing the task This helps workers to amend their works and the results showed thatself-assessment and external feedback improve the quality of the task [48] Another real-time approachwas proposed by Kulkarni et al [74] where requesters can follow the workers workflows while solving thetasks A tool called Turkomatic was presented which employees workers to do tasks for requesters Whileworkers are doing the task requesters are able to monitor the process and view the status of the task in realtime

method are the high cost of employing experts and being timeconsuming This approach is not applicable in IR evaluationsince crowdsourcing is used to lower the cost of hiring expertsfor creating relevance judgment Therefore if the expertreviewmethod is used to aggregate judgments the cost wouldbe increased

The drawbacks of using crowdsourcing to create rele-vance judgment are (i) each worker judges a small numberof examples and (ii) to decrease cost few judgments arecollected per example Therefore the judgments are imbal-anced and sparse as each worker assesses a small number ofexamples The MV is vulnerable to this problem while EMindirectly addresses this problem and PMF tackles this issuedirectly [70]

Different quality control methods applied in crowd-sourcing experiment in different areas of the study werelisted and discussed both in design-time approaches andrun-time approaches Although successful experiments arecrucial to both approaches the important point is that qualitycontrol method should be well-suited to crowdsourcingplatform

7 Conclusion and Future Work

Crowdsourcing is a novel phenomenon being applied tovarious fields of studies One of the applications of thecrowdsourcing is in IR evaluation experiments Several recentstudies show that crowdsourcing is a viable alternative tothe creation of relevance judgment however it needs precisedesign and appropriate quality control methods to be certainabout the outcomes This is particularly imperative sincethe workers of crowdsourcing platforms come from diversecultural and linguistic backgrounds

One of the important concerns in crowdsourcing exper-iments is the possibility of untrustworthiness of workerstrying to earn money without paying due attention to theirtasks Confusion or misunderstanding in crowdsourcingexperiments is another concern that may happen due tolow quality in the experimental designs Conclusively thequality control in crowdsourcing experiments is not onlya vital element in the designing phase but also in run-time phase The quality control and monitoring strategiesshould be considered and examined during all stages of

The Scientific World Journal 11

the experimentsMoreover the user interface and instructionshould be comprehensible motivating and exciting to theusers The payment should be reasonable but monetary ornonmonetary rewards can be also considerable Althoughthere are efforts in developing crowdsourcing experiments inIR area in recent years there are several limitations (some arelisted below) pending to refine the various approaches

(i) currently a random topic-document is shown to theworkers to judgeworkers should be providedwith thefreedom to choose their topic of interest for relevancejudgment

(ii) to scale up creating relevance judgments for a largerset of topic-document to find out whether crowd-sourcing is a replacement for hired assessors in termsof creating relevance judgments

(iii) to design a more precise grading system for workersrather than approval rate where requesters can findout each workersrsquo grade in each task type For exam-ple the requesters are able to know which workershave a good grade in creating relevance judgments

(iv) to access reliable personal information of work-ers such as expertise language and backgroundby requesters to decide which type of workers areallowed to accomplish the tasks

(v) trying different aggregating methods in creating rel-evance judgments and evaluating the influence oncorrelation of judgments with TREC experts

Crowdsourcing is an exciting research area with severalpuzzles and challenges that require further researches andinvestigations This review shed a light on some aspectsof using crowdsourcing in IR evaluation with insight intoissues related to crowdsourcing experiments in some of themain stages such as design implementation andmonitoringHence further research and development should be con-ducted to enhance the reliability and accuracy of relevancejudgments in IR experimentation

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research was supported by High Impact ResearchGrant UMC6251HIRMOHEFCSIT14 from Universityof Malaya and Ministry of Education Malaysia

References

[1] E. M. Voorhees, "The philosophy of information retrieval evaluation," in Evaluation of Cross-Language Information Retrieval Systems, pp. 355–370, Springer, 2002.
[2] C. Cleverdon, "The Cranfield tests on index language devices," Aslib Proceedings, vol. 19, no. 6, pp. 173–194, 1967.
[3] S. I. Moghadasi, S. D. Ravana, and S. N. Raman, "Low-cost evaluation techniques for information retrieval systems: a review," Journal of Informetrics, vol. 7, no. 2, pp. 301–312, 2013.
[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology Behind Search, Addison-Wesley, 2011.
[5] J. Howe, "The rise of crowdsourcing," Wired Magazine, vol. 14, no. 6, pp. 1–4, 2006.
[6] Y. Zhao and Q. Zhu, "Evaluation on crowdsourcing research: current status and future direction," Information Systems Frontiers, 2012.
[7] R. Munro, S. Bethard, V. Kuperman et al., "Crowdsourcing and language studies: the new generation of linguistic data," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 122–130, Association for Computational Linguistics, 2010.
[8] V. Ambati, S. Vogel, and J. Carbonell, "Active learning and crowd-sourcing for machine translation," in Language Resources and Evaluation (LREC), vol. 7, pp. 2169–2174, 2010.
[9] C. Callison-Burch, "Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 286–295, Association for Computational Linguistics, August 2009.
[10] K. T. Stolee and S. Elbaum, "Exploring the use of crowdsourcing to support empirical studies in software engineering," in Proceedings of the 4th International Symposium on Empirical Software Engineering and Measurement (ESEM '10), no. 35, ACM, September 2010.
[11] D. R. Choffnes, F. E. Bustamante, and Z. Ge, "Crowdsourcing service-level network event monitoring," in Proceedings of the 7th International Conference on Autonomic Computing (SIGCOMM '10), pp. 387–398, ACM, September 2010.
[12] A. Brew, D. Greene, and P. Cunningham, "Using crowdsourcing and active learning to track sentiment in online media," in Proceedings of the Conference on European Conference on Artificial Intelligence (ECAI '10), pp. 145–150, 2010.
[13] R. Holley, "Crowdsourcing and social engagement: potential, power and freedom for libraries and users," 2009.
[14] D. C. Brabham, "Crowdsourcing the public participation process for planning projects," Planning Theory, vol. 8, no. 3, pp. 242–262, 2009.
[15] O. Alonso, D. E. Rose, and B. Stewart, "Crowdsourcing for relevance evaluation," ACM SIGIR Forum, vol. 42, no. 2, pp. 9–15, 2008.
[16] Crowdflower, http://www.crowdflower.com.
[17] G. Paolacci, J. Chandler, and P. G. Ipeirotis, "Running experiments on Amazon Mechanical Turk," Judgment and Decision Making, vol. 5, no. 5, pp. 411–419, 2010.
[18] W. Mason and S. Suri, "Conducting behavioral research on Amazon's Mechanical Turk," Behavior Research Methods, vol. 44, no. 1, pp. 1–23, 2012.
[19] Y. Pan and E. Blevis, "A survey of crowdsourcing as a means of collaboration and the implications of crowdsourcing for interaction design," in Proceedings of the 12th International Conference on Collaboration Technologies and Systems (CTS '11), pp. 397–403, May 2011.
[20] O. Alonso and S. Mizzaro, "Using crowdsourcing for TREC relevance assessment," Information Processing and Management, vol. 48, no. 6, pp. 1053–1066, 2012.
[21] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[22] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.
[23] K. Krippendorff, "Estimating the reliability, systematic error and random error of interval data," Educational and Psychological Measurement, vol. 30, no. 1, pp. 61–70, 1970.
[24] C. Eickhoff and A. P. de Vries, "Increasing cheat robustness of crowdsourcing tasks," Information Retrieval, vol. 16, no. 2, pp. 121–137, 2013.
[25] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling, "Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 205–214, ACM, July 2011.
[26] O. Alonso, "Implementing crowdsourcing-based relevance experimentation: an industrial perspective," Information Retrieval, vol. 16, no. 2, pp. 101–120, 2013.
[27] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, "Quality control in crowdsourcing systems: issues and directions," IEEE Internet Computing, vol. 17, no. 2, pp. 76–81, 2013.
[28] O. Alonso and R. Baeza-Yates, "Design and implementation of relevance assessments using crowdsourcing," in Advances in Information Retrieval, pp. 153–164, Springer, 2011.
[29] S. Khanna, A. Ratan, J. Davis, and W. Thies, "Evaluating and improving the usability of Mechanical Turk for low-income workers in India," in Proceedings of the 1st ACM Symposium on Computing for Development (DEV '10), no. 12, ACM, December 2010.
[30] A. Kittur, E. H. Chi, and B. Suh, "Crowdsourcing user studies with Mechanical Turk," in Proceedings of the 26th Annual CHI Conference on Human Factors in Computing Systems (CHI '08), pp. 453–456, ACM, April 2008.
[31] M. Hirth, T. Hoßfeld, and P. Tran-Gia, "Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms," Mathematical and Computer Modelling, vol. 57, no. 11-12, pp. 2918–2932, 2013.
[32] G. Kazai, J. Kamps, and N. Milic-Frayling, "An analysis of human factors and label accuracy in crowdsourcing relevance judgments," Information Retrieval, vol. 16, no. 2, pp. 138–178, 2013.
[33] D. E. Difallah, G. Demartini, and P. Cudré-Mauroux, "Mechanical cheat: spamming schemes and adversarial techniques on crowdsourcing platforms," in Proceedings of the CrowdSearch Workshop, Lyon, France, 2012.
[34] L. De Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler, "Reputation systems for open collaboration," Communications of the ACM, vol. 54, no. 8, pp. 81–87, 2011.
[35] G. Kazai, J. Kamps, and N. Milic-Frayling, "The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2583–2586, ACM, 2012.
[36] L. Hammon and H. Hippner, "Crowdsourcing," Wirtschaftsinformatik, vol. 54, no. 3, pp. 165–168, 2012.
[37] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson, "Who are the crowdworkers? Shifting demographics in Mechanical Turk," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 2863–2872, ACM, April 2010.
[38] P. Ipeirotis, "Demographics of Mechanical Turk," 2010.
[39] G. Kazai, "In search of quality in crowdsourcing for search engine evaluation," in Advances in Information Retrieval, pp. 165–176, Springer, 2011.
[40] M. Potthast, B. Stein, A. Barrón-Cedeño, and P. Rosso, "An evaluation framework for plagiarism detection," in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997–1005, Association for Computational Linguistics, August 2010.
[41] W. Mason and D. J. Watts, "Financial incentives and the performance of crowds," ACM SIGKDD Explorations Newsletter, vol. 11, no. 2, pp. 100–108, 2010.
[42] J. Heer and M. Bostock, "Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 203–212, ACM, April 2010.
[43] C. Grady and M. Lease, "Crowdsourcing document relevance assessment with Mechanical Turk," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 172–179, Association for Computational Linguistics, 2010.
[44] J. Chen, N. Menezes, J. Bradley, and T. A. North, "Opportunities for crowdsourcing research on Amazon Mechanical Turk," Human Factors, vol. 5, no. 3, 2011.
[45] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut, "CrowdForge: crowdsourcing complex work," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), pp. 43–52, ACM, October 2011.
[46] P. Clough, M. Sanderson, J. Tang, T. Gollins, and A. Warner, "Examining the limits of crowdsourcing for relevance assessment," IEEE Internet Computing, vol. 17, no. 4, pp. 32–38, 2012.
[47] O. Scekic, H. L. Truong, and S. Dustdar, "Modeling rewards and incentive mechanisms for social BPM," in Business Process Management, pp. 150–155, Springer, 2012.
[48] S. P. Dow, B. Bunge, T. Nguyen, S. R. Klemmer, A. Kulkarni, and B. Hartmann, "Shepherding the crowd: managing and providing feedback to crowd workers," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1669–1674, ACM, May 2011.
[49] T. W. Malone, R. Laubacher, and C. Dellarocas, "Harnessing crowds: mapping the genome of collective intelligence," MIT Sloan School Working Paper 4732-09, 2009.
[50] G. Kazai, J. Kamps, and N. Milic-Frayling, "Worker types and personality traits in crowdsourcing relevance labels," in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1941–1944, ACM, October 2011.
[51] J. Vuurens, A. P. de Vries, and C. Eickhoff, "How much spam can you take? An analysis of crowdsourcing results to increase accuracy," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), pp. 21–26, 2011.
[52] J. B. P. Vuurens and A. P. de Vries, "Obtaining high-quality relevance judgments using crowdsourcing," IEEE Internet Computing, vol. 16, no. 5, pp. 20–27, 2012.
[53] W. Tang and M. Lease, "Semi-supervised consensus labeling for crowdsourcing," in Proceedings of the SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), 2011.
[54] J. Le, A. Edmonds, V. Hester, and L. Biewald, "Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 21–26, 2010.
[55] R. M. McCreadie, C. Macdonald, and I. Ounis, "Crowdsourcing a news query classification dataset," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 31–38, 2010.
[56] T. Xia, C. Zhang, J. Xie, and T. Li, "Real-time quality control for crowdsourcing relevance evaluation," in Proceedings of the 3rd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC '12), pp. 535–539, 2012.
[57] G. Zuccon, T. Leelanupab, S. Whiting, E. Yilmaz, J. M. Jose, and L. Azzopardi, "Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems," Information Retrieval, vol. 16, no. 2, pp. 267–305, 2013.
[58] D. Zhu and B. Carterette, "An analysis of assessor behavior in crowdsourced preference judgments," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 17–20, 2010.
[59] L. Von Ahn, M. Blum, N. J. Hopper et al., "CAPTCHA: using hard AI problems for security," in Advances in Cryptology – EUROCRYPT 2003, pp. 294–311, Springer, 2003.
[60] L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, "reCAPTCHA: human-based character recognition via web security measures," Science, vol. 321, no. 5895, pp. 1465–1468, 2008.
[61] V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 614–622, ACM, August 2008.
[62] P. Welinder and P. Perona, "Online crowdsourcing: rating annotators and obtaining cost-effective labels," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '10), pp. 25–32, June 2010.
[63] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, "Repeated labeling using multiple noisy labelers," Data Mining and Knowledge Discovery, vol. 28, no. 2, pp. 402–441, 2014.
[64] M. Hosseini, I. J. Cox, N. Milic-Frayling, G. Kazai, and V. Vinay, "On aggregating labels from multiple crowd workers to infer relevance of documents," in Advances in Information Retrieval, pp. 182–194, Springer, 2012.
[65] P. G. Ipeirotis, F. Provost, and J. Wang, "Quality management on Amazon Mechanical Turk," in Proceedings of the Human Computation Workshop (HCOMP '10), pp. 64–67, July 2010.
[66] B. Carpenter, "Multilevel Bayesian models of categorical data annotation," 2008.
[67] V. C. Raykar, S. Yu, L. H. Zhao et al., "Learning from crowds," The Journal of Machine Learning Research, vol. 11, pp. 1297–1322, 2010.
[68] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, "Cheap and fast, but is it good? Evaluating non-expert annotations for natural language tasks," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254–263, Association for Computational Linguistics, October 2008.
[69] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, vol. 28, pp. 20–28, 1979.
[70] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," Advances in Neural Information Processing Systems, vol. 20, pp. 1257–1264, 2008.
[71] H. J. Jung and M. Lease, "Inferring missing relevance judgments from crowd workers via probabilistic matrix factorization," in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1095–1096, ACM, 2012.
[72] H. J. Jung and M. Lease, "Improving quality of crowdsourced labels via probabilistic matrix factorization," in Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.
[73] A. J. Quinn and B. B. Bederson, "Human computation: a survey and taxonomy of a growing field," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1403–1412, ACM, May 2011.
[74] A. Kulkarni, M. Can, and B. Hartmann, "Collaboratively crowdsourcing workflows with Turkomatic," in Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW '12), pp. 1003–1012, February 2012.
[75] M. Soleymani and M. Larson, "Crowdsourcing for affective annotation of video: development of a viewer-reported boredom corpus," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 4–8, 2010.



[40] M Potthast B Stein A Barron-Cedeno and P Rosso ldquoAnevaluation framework for plagiarism detectionrdquo in Proceedingsof the 23rd International Conference on Computational Linguis-tics pp 997ndash1005 Association for Computational LinguisticsAugust 2010

[41] W Mason and D J Watts ldquoFinancial incentives and the perfor-mance of crowdsrdquo ACM SIGKDD Explorations Newsletter vol11 no 2 pp 100ndash108 2010

[42] J Heer and M Bostock ldquoCrowdsourcing graphical perceptionusing mechanical turk to assess visualization designrdquo in Pro-ceedings of the 28th Annual CHI Conference on Human Factorsin Computing Systems (CHI rsquo10) pp 203ndash212 ACM April 2010

[43] C Grady and M Lease ldquoCrowdsourcing document relevanceassessment with mechanical turkrdquo in Proceedings of the Work-shop on Creating Speech and Language Data with AmazonrsquosMechanical Turk pp 172ndash179 Association for ComputationalLinguistics 2010

[44] J Chen N Menezes J Bradley and T A North ldquoOpportuni-ties for crowdsourcing research on amazon mechanical turkrdquoHuman Factors vol 5 no 3 2011

[45] A Kittur B Smus S Khamkar and R E Kraut ldquoCrowd-Forge crowdsourcing complex workrdquo in Proceedings of the24th Annual ACM Symposium on User Interface Software andTechnology (UIST rsquo11) pp 43ndash52 ACM October 2011

[46] P Clough M Sanderson J Tang T Gollins and A WarnerldquoExamining the limits of crowdsourcing for relevance assess-mentrdquo IEEE Internet Computing vol 17 no 4 pp 32ndash338 2012

[47] O Scekic H L Truong and S Dustdar ldquoModeling rewardsand incentive mechanisms for social BPMrdquo in Business ProcessManagement pp 150ndash155 Springer 2012

[48] S P Dow B Bunge T Nguyen S R Klemmer A Kulkarniand B Hartmann ldquoShepherding the crowd managing andproviding feedback to crowd workersrdquo in Proceedings of the29th Annual CHI Conference on Human Factors in ComputingSystems (CHI rsquo11) pp 1669ndash1674 ACM May 2011

[49] T W Malone R Laubacher and C Dellarocas ldquoHarnessingcrowds mapping the genome of collective intelligencerdquo MITSloan School Working Paper 4732-09 2009

[50] G Kazai J Kamps and N Milic-Frayling ldquoWorker types andpersonality traits in crowdsourcing relevance labelsrdquo inProceed-ings of the 20th ACMConference on Information and KnowledgeManagement (CIKM rsquo11) pp 1941ndash1944 ACM October 2011

[51] J Vuurens A P de Vries and C Eickhoff ldquoHow much spamcan you take an analysis of crowdsourcing results to increaseaccuracyrdquo in Proceedings of the ACM SIGIR Workshop onCrowdsourcing for Information Retrieval (CIR rsquo11) pp 21ndash262011

[52] J B P Vuurens and A P de Vries ldquoObtaining high-qualityrelevance judgments using crowdsourcingrdquo IEEE Internet Com-puting vol 16 no 5 pp 20ndash27 2012

[53] W Tang and M Lease ldquoSemi-supervised consensus labelingfor crowdsourcingrdquo in Proceedings of the SIGIR Workshop onCrowdsourcing for Information Retrieval (CIR rsquo11) 2011

[54] J Le A Edmonds V Hester and L Biewald ldquoEnsuring qualityin crowdsourced search relevance evaluation the effects oftraining question distributionrdquo in Proceedings of the SIGIRWorkshop on Crowdsourcing for Search Evaluation pp 21ndash262010

The Scientific World Journal 13

[55] RMMcCreadie CMacdonald and I Ounis ldquoCrowdsourcinga news query classification datasetrdquo in Proceedings of the SIGIRWorkshop on Crowdsourcing for Search Evaluation (CSE rsquo10) pp31ndash38 2010

[56] T Xia C Zhang J Xie and T Li ldquoReal-time quality control forcrowdsourcing relevance evaluationrdquo in Proceedings of the 3rdIEEE International Conference on Network Infrastructure andDigital Content (IC-NIDC rsquo12) pp 535ndash539 2012

[57] G Zuccon T Leelanupab S Whiting E Yilmaz J M Jose andL Azzopardi ldquoCrowdsourcing interactions using crowdsourc-ing for evaluating interactive information retrieval systemsrdquoInformation Retrieval vol 16 no 2 pp 267ndash305 2013

[58] D Zhu and B Carterette ldquoAn analysis of assessor behaviorin crowdsourced preference judgmentsrdquo in Proceedings of theSIGIRWorkshop onCrowdsourcing for Search Evaluation pp 17ndash20 2010

[59] L Von Ahn M Blum N J Hopper et al ldquoCAPTCHA usinghard AI problems for securityrdquo in Advances in CryptologymdashEUROCRYPT 2003 pp 294ndash311 Springer 2003

[60] L Von Ahn B Maurer C McMillen D Abraham and MBlum ldquoreCAPTCHA human-based character recognition viaweb security measuresrdquo Science vol 321 no 5895 pp 1465ndash1468 2008

[61] V S Sheng F Provost and P G Ipeirotis ldquoGet another labelImproving data quality and data mining using multiple noisylabelersrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo08) pp 614ndash622 ACM August 2008

[62] P Welinder and P Perona ldquoOnline crowdsourcing ratingannotators and obtaining cost-effective labelsrdquo in Proceedings ofthe IEEE Computer Society Conference on Computer Vision andPattern RecognitionmdashWorkshops (CVPRW rsquo10) pp 25ndash32 June2010

[63] P G Ipeirotis F Provost V S Sheng and J Wang ldquoRepeatedlabeling usingmultiple noisy labelersrdquoDataMining and Knowl-edge Discovery vol 28 no 2 pp 402ndash441 2014

[64] M Hosseini I J Cox NMilic-Frayling G Kazai and V VinayldquoOn aggregating labels from multiple crowd workers to inferrelevance of documentsrdquo in Advances in Information Retrievalpp 182ndash194 Springer 2012

[65] P G Ipeirotis F Provost and J Wang ldquoQuality managementon amazon mechanical turkrdquo in Proceedings of the HumanComputation Workshop (HCOMP rsquo10) pp 64ndash67 July 2010

[66] B Carpenter ldquoMultilevel bayesian models of categorical dataannotationrdquo 2008

[67] V C Raykar S Yu L H Zhao et al ldquoLearning from crowdsrdquoThe Journal of Machine Learning Research vol 11 pp 1297ndash13222010

[68] R Snow B OrsquoConnor D Jurafsky and A Y Ng ldquoCheap andfastmdashbut is it good Evaluating non-expert annotations fornatural language tasksrdquo in Proceedings of the Conference onEmpiricalMethods in Natural Language Processing pp 254ndash263Association for Computational Linguistics October 2008

[69] A P Dawid and A M Skene ldquoMaximum likelihood estimationof observer error-rates using the EM algorithmrdquo Applied Statis-tics vol 28 pp 20ndash28 1979

[70] R Salakhutdinov and A Mnih ldquoProbabilistic matrix factoriza-tionrdquo Advances in Neural Information Processing Systems vol20 pp 1257ndash1264 2008

[71] H J Jung andM Lease ldquoInferringmissing relevance judgmentsfrom crowd workers via probabilistic matrix factorizationrdquo in

Proceedings of the 35th International ACM SIGIR Conference onResearch and Development in Information Retrieval pp 1095ndash1096 ACM 2012

[72] H J Jung and M Lease ldquoImproving quality of crowdsourcedlabels via probabilistic matrix factorizationrdquo in Proceedings ofthe 26th AAAI Conference on Artificial Intelligence 2012

[73] A J Quinn and B B Bederson ldquoHuman computation asurvey and taxonomy of a growing fieldrdquo in Proceedings of the29th Annual CHI Conference on Human Factors in ComputingSystems (CHI rsquo11) pp 1403ndash1412 ACM May 2011

[74] A KulkarniM Can and BHartmann ldquoCollaboratively crowd-sourcingworkflowswith turkomaticrdquo inProceedings of the ACMConference on Computer Supported Cooperative Work (CSCWrsquo12) pp 1003ndash1012 February 2012

[75] M Soleymani and M Larson ldquoCrowdsourcing for affectiveannotation of video development of a viewer-reported bore-dom corpusrdquo in Proceedings of the ACM SIGIR Workshop onCrowdsourcing for Search Evaluation (CSE rsquo10) pp 4ndash8 2010


The degree of homogeneity among the ratings given by judges is the interrater agreement. Various statistical methods are used to calculate interrater agreement. Table 3 summarizes four common methods suggested for calculating the interrater agreement between crowdsourcing workers and human assessors for relevance judgments in IR evaluation [20].
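
To illustrate the first two measures in Table 3, the minimal sketch below computes percentage agreement and Cohen's kappa for two assessors who labeled the same documents; the judgment data and function names are hypothetical.

```python
from collections import Counter

def percentage_agreement(labels_a, labels_b):
    """Fraction of items on which the two assessors give the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percentage_agreement(labels_a, labels_b)
    # Chance agreement is estimated from each assessor's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical binary judgments: 1 = relevant, 0 = not relevant.
crowd_labels  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
expert_labels = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(percentage_agreement(crowd_labels, expert_labels))  # 0.8
print(cohens_kappa(crowd_labels, expert_labels))          # ~0.58
```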

Alonso et al. [15], pioneers of crowdsourcing in IR evaluation, applied crowdsourcing through Amazon Mechanical Turk to TREC data. Their results showed that crowdsourcing is a low-cost, reliable, and quick solution and could be considered an alternative to relevance judgments created by experts, but not a replacement for current methods, since several gaps and open questions remain for future research. For instance, the scalability of the crowdsourcing approach in IR evaluation has not been investigated yet [20].

4. Experimental Design in Crowdsourcing

Beyond the workers' attention, culture, preferences, and skills, the presentation and properties of HITs (Human Intelligence Tasks) influence the quality of a crowdsourcing experiment. Careful and precise task design, rather than filtering out poor-quality work after the task is completed, is important, since well-designed tasks deter cheaters [24]. Even qualified workers may produce erroneous results if the user interface, instructions, and design are of poor quality. The phases of implementing a crowdsourcing experiment resemble those of software development; the details of each phase for creating a relevance judgment task are as follows.

Planning

(i) Choosing a crowdsourcing platform.
(ii) Defining a task.
(iii) Preparing a dataset.

Design and Development

(i) Selecting topics.
(ii) Choosing a few documents for each topic (relevant and nonrelevant).
(iii) Task definition, description, and preparation.

Verifying

(i) Debugging with an internal expert team.
(ii) Taking note of shortcomings of the experiment.
(iii) Testing with external nonexperts.
(iv) Reviewing and clarifying the instructions.
(v) Comparing the results of the expert and nonexpert teams.

Publishing

(i) Monitoring work quality.

The impact of two HIT design methods, full design and simple design, on the accuracy of labels was assessed in [25]. The full design method prequalifies and restricts workers, while the simple design method applies fewer quality control settings and restrictions; however, the output of the full design method has higher quality. In general, crowdsourcing is a useful solution for relevance assessment if the tasks are designed carefully and the methods for aggregating labels are selected appropriately. Experiment design is discussed below under three categories: (i) task definition, description, and preparation; (ii) interface design; and (iii) granularity.

4.1. Task Definition, Description, and Preparation. The task definition is the information presented to workers about the crowdsourcing task. An important part of implementing a crowdsourcing experiment is the task description, which belongs to task preparation. Clear instructions, which form part of the task description, are crucial for obtaining quick results, usually in less than 24 hours. All workers should have a common understanding of the task, and the words used must be understandable by different workers [26]. The task description should consider worker characteristics such as language or expertise [27]; for example, plain English that avoids jargon is suggested, since the population is diverse [28]. As creating relevance judgments requires reading text, using plain English and simple words helps make the experiment successful. Offering an "I do not know" response option was also suggested, as it allows workers to convey that they lack sufficient knowledge to answer the question [26]. Getting user feedback at the end of the task through an optional open-ended question is recommended to improve quality and discourage spammers; if the answer is useful, the requester can pay a bonus [28]. Since relevance judgment requires reading text, instructions and questions should be readable, clear, and presented with examples. An example of a relevance judgment task is shown below, followed by a sketch of how such a task could be represented programmatically.

Relevance Judgment

Instruction. Evaluate the relevance of a document to the given query.

Task. Please evaluate the relevance of the following document about Alzheimer:

"Dementia is a loss of brain function that occurs with certain diseases. Alzheimer's disease (AD) is one form of dementia that gradually gets worse over time. It affects memory, thinking, and behavior."

Please rate the above document according to its relevance to Alzheimer as follows:

(i) Highly relevant

(ii) Relevant

(iii) Not relevant

Please justify your answer: __________
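
Before publishing, such a task is typically captured as a structured definition. The sketch below is a minimal, platform-neutral illustration; the field names, values, and the validate_task helper are hypothetical and not tied to any specific crowdsourcing API.

```python
# A minimal, platform-neutral sketch of a relevance judgment HIT definition.
# All field names and the validation helper are hypothetical.
relevance_hit = {
    "title": "Relevance Judgment",
    "instruction": "Evaluate the relevance of a document to the given query.",
    "query": "Alzheimer",
    "document": (
        "Dementia is a loss of brain function that occurs with certain diseases. "
        "Alzheimer's disease (AD) is one form of dementia that gradually gets worse "
        "over time. It affects memory, thinking, and behavior."
    ),
    "options": ["Highly relevant", "Relevant", "Not relevant", "I do not know"],
    "justification_required": True,   # optional open-ended feedback, as recommended in [26, 28]
    "payment_usd": 0.02,              # in line with the per-judgment rate reported in [20]
    "assignments_per_item": 5,        # collect multiple labels per query-document pair
}

def validate_task(task):
    """Basic sanity checks before publishing the task (hypothetical helper)."""
    assert task["options"], "A task must offer at least one answer option."
    assert task["payment_usd"] > 0, "Payment must be positive."
    assert task["assignments_per_item"] >= 1
    return True

validate_task(relevance_hit)
```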


Table 3: Statistics for calculating the interrater agreement.

Joint probability of agreement (percentage agreement) [20]: the simplest measure, based on dividing the number of times each rating (e.g., 1, 2, ..., 5) is assigned identically by the assessors by the total number of ratings.

Cohen's kappa [21]: a statistical measure of interrater agreement between two raters; it is more robust than percentage agreement since it accounts for agreement occurring by chance.

Fleiss' kappa [22]: an extension of Cohen's kappa that measures agreement among any number of raters (not only two).

Krippendorff's alpha [23]: a measure based on the overall distribution of judgments, regardless of which assessors produced them.

4.2. Interface Design. Users or workers access and contribute to tasks via the user interface. Three design recommendations were suggested for achieving good task completion at lower cost: designing a user interface and instructions understandable by novice users, translating the instructions into the local language as well as English, and preparing a video tutorial [29]. The use of colours, highlights, bold, italics, and typefaces was also suggested to enhance comprehensibility [26]. Verifiable questions, such as common-knowledge questions embedded within the task, are also useful for validating workers; they reduce the number of worthless results, since users are aware that their answers are being analysed [30]. Another study showed that a user-friendly and simple interface may attract more workers and increase the quality of the outcome, while reliable workers may be put off by an unreasonably complex user interface, which leads to delay [27].

4.3. Granularity. Tasks can be divided into three broad groups: routine, complex, and creative. Routine tasks, such as creating relevance judgments, do not need specific expertise. Complex tasks need some general skills, like rewriting a given text, while creative tasks need specific skill and expertise, such as research and development [31]. Splitting long and complex tasks into small ones was recommended, since small tasks tend to attract more workers [26]. However, some studies showed that novel tasks requiring creativity tend to deter cheaters and attract more qualified and reliable workers [24, 32]; in a separate study, Difallah et al. [33] stated that designing complicated tasks is a way to filter out cheaters. Money is the main reason most workers accomplish tasks. However, workers motivated by the prospect of fame or by fun are the least truthful, while workers motivated by fulfillment and fortune are the most truthful. Eickhoff and de Vries [24] found that workers who do the job for entertainment rather than money rarely cheat, so designing and presenting HITs in an exciting and entertaining way leads to better results along with cost efficiency.

5. Human Features and Monetary Factors in Crowdsourcing

The effect of human features, such as knowledge, and monetary factors, such as payment, on the accuracy and success of crowdsourcing experiments is discussed in this section. The discussion covers three factors: (i) worker profile, (ii) payment, and (iii) compensation policy in crowdsourcing.

5.1. Worker Profile in Crowdsourcing. One factor that influences the quality of results is the worker profile, which consists of reputation and expertise. Requesters' feedback about workers' accomplishments creates reputation scores in the system [34]. Requesters should also maintain a positive reputation, since workers may refuse to accept HITs from poorly rated requesters [17]. Expertise can be recognized through credentials and experience: information such as language, location, and academic degree constitutes credentials, while experience refers to knowledge that a worker gains through the crowdsourcing system [27]. Location is another important factor with a strong effect on the accuracy of results. A study by Kazai et al. [35] indicated that Asian workers produced poorer quality work than American or European workers, and that Asian workers mostly preferred the simple HIT design while American workers preferred the full design. In crowdsourcing platforms such as AMT and Crowdflower, it is possible to restrict workers to specific countries when designing tasks. However, creating relevance judgments is a routine task and does not need specific expertise or experience.

5.2. Payment in Crowdsourcing. Payment affects the accuracy of results in crowdsourcing, since workers who are satisfied with the payment accomplish tasks more accurately than workers who are not [32]. Monetary or nonmonetary reasons can motivate workers on crowdsourcing platforms [36]. A study conducted by Ross et al. [37] on worker motivation showed that money is the main incentive for 13% of Indian and 5% of US workers, while Ipeirotis [38] reported that AMT was the main source of income for 27% of Indian and 12% of US workers. Kazai [39] stated that higher payment leads to better quality of work, while Potthast et al. [40] reported that higher payment only affects completion time rather than the quality of results.

Kazai et al. [32] analysed the effect of higher payment on the accuracy of results. Although higher payment attracts more qualified workers, whereas lower payment increases the share of sloppy workers, unethical workers are sometimes attracted by higher payment as well; in this case, a strong quality control method, together with filtering, can prevent poor results. Other studies reported that increasing payment increases quantity rather than quality, while some showed that a greater payment incentive only gets the task done faster, not better, since it speeds up work [41, 42]. A reasonable pay rate is therefore the more cautious solution, as highly paid tasks attract spammers as well as legitimate workers [43]. Indeed, the payment amount should be fair and based on the complexity of the task.

The payment method is another factor that affects the quality of workers' output [41, 44]; for instance, for a task that requires finding 10 words in a puzzle, more puzzles would be solved with payment per puzzle than with payment per word [45]. It is better to pay workers after the quality of their work has been confirmed by measuring accuracy, and to clarify that payment depends on the quality of the completed work; this helps in achieving better results from workers [26]. In an experiment conducted by Alonso and Mizzaro [20], workers were paid $0.02 for a relevance evaluation that took about one minute and consisted of judging the relevance of one topic-document pair. In another experiment, the payment was $0.04 for judging one topic against ten documents [46].

5.3. Compensation in Crowdsourcing. A proper compensation policy and inducement have an impact on the quality of results [47, 48]. There are two types of incentives: extrinsic incentives, like monetary bonuses such as extra payment, and intrinsic incentives, such as personal enthusiasm [27]. Rewards have been categorized as psychological, monetary, and material [47]. Mason and Watts [41] found that nonfinancial compensation, such as enjoyable tasks or social rewards, motivates workers better than financial rewards and has more effect on quality; it was also reported that using social rewards harnesses intrinsic motivation to improve the quality of work [49]. Using intrinsic along with extrinsic incentives was recommended to motivate more reliable workers to perform a task. Furthermore, considering the task requirements and properties as well as social factors, such as the income and interests of workers according to the profiles of requesters and workers, is vital in selecting a proper compensation policy [27]. For relevance evaluation tasks, offering a bonus for writing comments and justifications, or paying a bonus to qualified workers who perform tasks accurately and precisely, was recommended [26].

In addition to task design, human features, and monetary factors, which are all important for a successful crowdsourcing experiment, quality control is another vital part of a crowdsourcing experiment; it is elaborated on in the following section.

6. Quality Control in Crowdsourcing

Managing worker behavior is a crucial part of crowdsourcing, because of sloppy workers who accomplish tasks carelessly and spammers who cheat and complete tasks inaccurately. Different worker types were identified through behavioral observation using factors including label accuracy, HIT completion time, and fraction of useful labels [50], as shown in Table 4. This can help in designing HITs that attract the proper workers for the job. Table 5 shows different types of workers classified by their average precision [51].

In 2012, to detect random spammers, Vuurens and de Vries [52] calculated for each worker $w$ the RandomSpam score as the average squared ordinal distance $ord^2_{ij}$ (which expresses how different one worker's judgment is from another's) between the worker's judgments $j \in J_w$ and the judgments $i \in J_{j,w}$ given by other workers on the same query-document pair:

$$\text{RandomSpam}_w = \frac{\sum_{j \in J_w} \sum_{i \in J_{j,w}} ord^2_{ij}}{\sum_{j \in J_w} |J_{j,w}|}. \qquad (1)$$

In addition, to detect uniform spammers, the UniformSpam score is calculated as

$$\text{UniformSpam}_w = \frac{\sum_{s \in S} |s| \cdot (f_{s,J_w} - 1) \cdot \Bigl(\sum_{j \in J_{s,w}} \sum_{i \in J_{j,w}} \text{disagree}_{ij}\Bigr)^2}{\sum_{s \in S} \sum_{j \in J_{s,w}} |J_{j,w}|}, \qquad (2)$$

where $\text{disagree}_{ij}$ is the number of disagreements between the judgments $J_{s,w}$ that occur within label sequence $s$ and the judgments $J_{j,w}$, $S$ is the set of all possible label sequences $s$ of length $|s| = 2$ or $3$, and $f_{s,J_w}$ is the frequency with which label sequence $s$ occurs within worker $w$'s time-ordered judgments $J_w$.
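
To make the RandomSpam score concrete, the sketch below computes it for graded relevance labels, taking the ordinal distance $ord_{ij}$ simply as the difference between labels; the data and function names are hypothetical.

```python
from collections import defaultdict

def random_spam_scores(judgments):
    """judgments: list of (worker, item, label) tuples with ordinal labels.

    Returns worker -> RandomSpam score: the average squared ordinal distance
    between the worker's label and the labels other workers gave to the same
    item (equation (1), with ord taken as the label difference).
    """
    by_item = defaultdict(list)
    for worker, item, label in judgments:
        by_item[item].append((worker, label))

    sq_dist = defaultdict(float)   # per-worker numerator
    count = defaultdict(int)       # per-worker denominator
    for labels in by_item.values():
        for worker, label in labels:
            for other_worker, other_label in labels:
                if other_worker != worker:
                    sq_dist[worker] += (label - other_label) ** 2
                    count[worker] += 1
    return {w: sq_dist[w] / count[w] for w in sq_dist if count[w] > 0}

# Hypothetical labels on a 0-2 scale: worker "w3" answers erratically.
data = [
    ("w1", "d1", 2), ("w2", "d1", 2), ("w3", "d1", 0),
    ("w1", "d2", 0), ("w2", "d2", 1), ("w3", "d2", 2),
    ("w1", "d3", 1), ("w2", "d3", 1), ("w3", "d3", 1),
]
print(random_spam_scores(data))  # "w3" receives the highest score
```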

Quality control in crowdsourcing is defined as the extent to which the provided outcome fulfills the requirements of the requester. Quality control approaches can be categorized into (i) design-time approaches, employed while the requester prepares the task before submission, and (ii) run-time approaches, used during or after task completion. Another classification of quality control methods is (i) filtering workers and (ii) aggregating labels, both discussed in this section [53]. In other words, filtering workers is a design-time approach, while label aggregation methods can be considered run-time approaches [27].

Table 4: Worker types based on behavioral observation.

Diligent: completed tasks precisely, with high accuracy and a longer time spent on tasks.
Competent: skilled workers with high accuracy and fast work.
Sloppy: completed tasks quickly without considering quality.
Incompetent: completed tasks slowly and with low accuracy.
Spammer: did not deliver useful work.

Table 5: Worker types based on their average precision.

Proper: completed tasks precisely.
Random spammer: gave worthless answers.
Semirandom spammer: answered incorrectly on most questions while answering correctly on a few, hoping to avoid detection as a spammer.
Uniform spammer: repeated the same answers.
Sloppy: not precise enough in their judgments.

6.1. Design-Time Methods. There are three approaches to selecting workers: (i) credential-based, (ii) reputation-based, and (iii) open-to-all; a combination of the three is also possible. The credential-based approach is used in systems where users are well profiled, but it is not practical in crowdsourcing systems, since users are usually unknown. The reputation-based approach selects workers based on their reputation, as AMT does with the approval rate parameter. The open-to-all approach, as used in Wikipedia, allows any worker to contribute; it is easy to use and implement but increases the number of unreliable workers [27]. Requesters are also able to implement their own qualification methods [18] or combine different methods. In total, there are various methods for filtering workers, which first identify sloppy workers and then exclude them. We have grouped these methods into five categories; Table 6 presents the list.

When using a qualification test, it is still possible that workers perform tasks carelessly. Moreover, researchers encounter two issues with qualification tests: (i) tasks with a qualification test may take longer to complete, as most workers prefer tasks without one, and (ii) the cost of developing and continuously maintaining the test. Honey pots, or gold standard questions, are used when the answers are clear and predefined [54]. Honey pots are faster than a qualification test, since it is easy and quick to identify workers who answer the questions at random. Combining a qualification test with honey pots is also possible for quality control [26].
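
A minimal sketch of honey-pot filtering is shown below: workers whose accuracy on gold standard items falls below a threshold are excluded. The data, threshold, and helper name are hypothetical.

```python
def filter_by_gold(answers, gold, threshold=0.7):
    """answers: dict worker -> {item: label}; gold: dict item -> correct label.

    Keeps only workers whose accuracy on the gold (honey pot) items reaches
    the threshold; workers who saw no gold items are kept by default.
    """
    trusted = []
    for worker, labels in answers.items():
        judged_gold = [item for item in labels if item in gold]
        if not judged_gold:
            trusted.append(worker)
            continue
        correct = sum(labels[item] == gold[item] for item in judged_gold)
        if correct / len(judged_gold) >= threshold:
            trusted.append(worker)
    return trusted

# Hypothetical judgments: "d1" and "d2" are gold items with known labels.
gold = {"d1": "relevant", "d2": "not relevant"}
answers = {
    "w1": {"d1": "relevant", "d2": "not relevant", "d3": "relevant"},
    "w2": {"d1": "not relevant", "d2": "relevant", "d3": "relevant"},
}
print(filter_by_gold(answers, gold))  # ['w1']
```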

Qualification settings, such as filtering workers by origin, have an important impact on cheater rates; for instance, some assume that workers from developed countries cheat less [24]. Setting an approval rate threshold is used by AMT, which provides the approval rate metric to prefilter workers: it represents the percentage of assignments the worker has completed and the requester has approved over all tasks the worker has done, and it is effectively the worker's overall rating in the system. It is also possible to limit tasks to master workers; in AMT, the group of people who accomplish HITs with a high degree of accuracy across a variety of requesters are called master workers, and they expect to receive higher payment. Limiting HITs to master workers yields better quality results [32]. If a high approval rate is required for quality control, the experiment takes longer to complete, since the eligible worker population shrinks [28]. In general, the settings should match the task type: if the task is complicated, the approval rate and other settings should be set strictly, or the task should be assigned to master workers only; if the task is routine, ordinary workers with simple settings can be effective. Trap questions are also used to detect careless workers who do not read the task instructions carefully. Kazai et al. [25] designed two trap questions for this purpose: "Please tick here if you did NOT read the instructions" and, on each page, "I did not pay attention." Trap questions may not detect all unreliable workers, but they expose odd behavior and can be effective both in discouraging and in identifying spammers. In crowdsourcing there are also programs, or bots, designed to accomplish HITs automatically [55], and such data are definitely of poor quality [18]. CAPTCHAs and reCAPTCHA are used in crowdsourcing experiments to detect superficial workers and random clicking on answers [29, 32]; these methods are easy to use and inexpensive.

It is also possible to combine the above methods to filter workers. For example, a real-time strategy applying different filtering methods was proposed for recruiting workers and monitoring the quality of their work: first, a qualification test was used to filter out sloppy workers and the completion time of HITs was calculated to reflect the truthfulness of the workers, and then a set of gold questions was used to evaluate the workers' skills [56]. However, even with these filtering methods there is still the possibility of recruiting lower-quality workers. The quality control methods applied after filtering workers, at run time, are explained in the subsequent section.

Table 6: Design-time methods.

Qualification test: a set of questions that workers must answer to qualify for doing the tasks (platform: AMT; example: an IQ test [57]).

Honey pots or gold standard data: predefined questions with known answers [54]; workers who answer them correctly are marked as suitable for the task (platform: Crowdflower [16]).

Qualification settings: settings configured when creating HITs (platforms: AMT, Crowdflower; example: using the approval rate).

Trap questions: HITs are designed together with a set of questions with known answers in order to find unreliable workers [58].

CAPTCHAs and reCAPTCHA: an antispamming technique that separates humans from computers in order to filter out automatic answers; only a human can pass the test, so it reveals whether a human or a bot is entering the data [59]. reCAPTCHA, a development of CAPTCHA, uses text from scanned pages for this purpose [60].

6.2. Run-Time Methods. Although design-time techniques can improve quality, low-quality results may still occur because of misunderstandings while doing the tasks. Therefore, run-time techniques are vital for high-quality results; indeed, the quality of the results increases when both approaches are applied [27]. In crowdsourcing, collecting one judgment per example or task, called single labeling, saves time and cost, but the quality of the work then depends on a single individual's knowledge. To address this issue, integrating the labels of multiple workers was introduced [61, 62]. If labels are noisy, multiple labels can be preferable to single labeling even when labels are not particularly low cost [63], and this leads to more accurate relevance judgments [64]. An important issue with multiple labels per example is how to aggregate the labels from various workers into a single consensus label accurately and efficiently. Run-time quality control methods are summarized in Table 7.

MV is a better choice for routine, low-paid tasks, since it is easy to implement and achieves reasonable results, depending on the truthfulness of the workers [53]. One drawback of this method is that the consensus label is computed for a specific task without considering the accuracy of the workers on other tasks. Another drawback is that MV treats all workers as equally good: for example, if there is only one expert and the others are novices, and the novices give the same inaccurate response, MV takes the novices' answer as the correct one because they are the majority.
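
A minimal majority voting sketch for aggregating crowd labels per query-document pair is shown below; ties are broken arbitrarily, and the data are hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(judgments):
    """judgments: list of (worker, item, label). Returns item -> consensus label."""
    votes = defaultdict(Counter)
    for _, item, label in judgments:
        votes[item][label] += 1
    # most_common(1) breaks ties arbitrarily; a real experiment may prefer
    # collecting an odd number of labels or deferring ties to an expert.
    return {item: counter.most_common(1)[0][0] for item, counter in votes.items()}

data = [
    ("w1", "d1", "relevant"), ("w2", "d1", "relevant"), ("w3", "d1", "not relevant"),
    ("w1", "d2", "not relevant"), ("w2", "d2", "not relevant"), ("w3", "d2", "relevant"),
]
print(majority_vote(data))  # {'d1': 'relevant', 'd2': 'not relevant'}
```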

The outputs of the EM algorithm are a set of estimated correct answers for each task and a set of confusion matrices that record each worker's errors (responses that may not match the correct answers). The error rate of each worker can be read from this confusion matrix. However, the error rate alone is not adequate for measuring worker quality, since workers may have completed the task carefully but with bias; for example, when labeling websites, parents with young children are more conservative in classifying the websites. To handle this, a single scalar score can be assigned to each worker based on the completed labels; such scores separate error rate from worker bias and allow fairer treatment of workers. Approximately five labels per example are needed for this algorithm to become accurate [65]. Recently, Carpenter [66] and Raykar et al. [67] proposed Bayesian versions of the EM algorithm using the confusion matrix. Raykar et al. [67] proposed a probabilistic framework for the case of multiple labels with no gold standard: the algorithm repeatedly creates a working gold standard, measures the performance of workers against it, and then refines the gold standard according to the performance measurements. Hosseini et al. [64] evaluated the performance of EM and MV for aggregating labels in relevance assessment; their results showed that EM is the better solution for the reliability of IR system rankings and relevance judgments, especially with small numbers of labels and noisy judgments. Another study comparing MV and EM reported that EM outperformed MV when 60% of the workers were spammers [51]. However, a combination of these two methods could produce more accurate relevance judgments.
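
The sketch below illustrates the EM idea described above in the spirit of the Dawid and Skene approach [69]: item label posteriors and per-worker confusion matrices are estimated alternately. It is a simplified illustration (fixed number of iterations, simple smoothing), not the exact algorithm of any of the cited papers.

```python
from collections import defaultdict

def dawid_skene(judgments, classes, iterations=20, smoothing=0.01):
    """judgments: list of (worker, item, label). Returns item -> most probable label."""
    workers = sorted({w for w, _, _ in judgments})
    items = sorted({i for _, i, _ in judgments})
    by_item = defaultdict(list)
    for w, i, label in judgments:
        by_item[i].append((w, label))

    # Initialization: item posteriors from raw label proportions (soft majority vote).
    post = {}
    for i in items:
        counts = {c: smoothing for c in classes}
        for _, label in by_item[i]:
            counts[label] += 1
        total = sum(counts.values())
        post[i] = {c: counts[c] / total for c in classes}

    for _ in range(iterations):
        # M-step: class priors and per-worker confusion matrices from current posteriors.
        prior = {c: smoothing for c in classes}
        for i in items:
            for c in classes:
                prior[c] += post[i][c]
        z = sum(prior.values())
        prior = {c: prior[c] / z for c in classes}

        confusion = {w: {c: {l: smoothing for l in classes} for c in classes} for w in workers}
        for i in items:
            for w, label in by_item[i]:
                for c in classes:
                    confusion[w][c][label] += post[i][c]
        for w in workers:
            for c in classes:
                row_sum = sum(confusion[w][c].values())
                confusion[w][c] = {l: v / row_sum for l, v in confusion[w][c].items()}

        # E-step: recompute item posteriors given the estimated confusion matrices.
        for i in items:
            scores = {}
            for c in classes:
                p = prior[c]
                for w, label in by_item[i]:
                    p *= confusion[w][c][label]
                scores[c] = p
            z = sum(scores.values())
            post[i] = {c: scores[c] / z for c in classes}

    return {i: max(post[i], key=post[i].get) for i in items}

data = [("w1", "d1", "relevant"), ("w2", "d1", "relevant"), ("w3", "d1", "not relevant"),
        ("w1", "d2", "not relevant"), ("w2", "d2", "not relevant"), ("w3", "d2", "relevant")]
print(dawid_skene(data, classes=["relevant", "not relevant"]))
# expected: d1 -> relevant, d2 -> not relevant
```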

Considering the time taken to complete the task is one example of observing the pattern of responses, used to detect the random answers of unreliable workers [30]. Since tasks completed very quickly are deemed to have poor quality, completion time is a robust signal for detecting sloppy workers. Soleymani and Larson [75] used crowdsourcing for affective annotation of video in order to improve the performance of multimedia retrieval systems; the completion time of each HIT was compared with the video length in order to evaluate quality. In another study, the time each worker spent on judgments was assessed in order to control quality [58]. Three types of patterns were found: (i) a normal pattern, whereby workers begin slowly and get faster as they learn about the task; (ii) a periodic pattern, a peculiar behavior in which some judgments are done fast and some slowly; and (iii) an interrupted pattern, which refers to interruption in the middle of doing tasks. This method may not detect all sloppy workers on its own, so it is most effective in combination with other quality control methods.
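
One simple realization of the completion-time check is sketched below: workers whose median time per judgment is far below the overall median are flagged for review. The threshold and data are hypothetical.

```python
from statistics import median

def flag_fast_workers(times, ratio=0.25):
    """times: dict worker -> list of seconds spent per judgment.

    Flags workers whose median time per judgment is below `ratio` times the
    median over all judgments, a simple proxy for sloppy behaviour.
    """
    all_times = [t for ts in times.values() for t in ts]
    overall = median(all_times)
    return [w for w, ts in times.items() if median(ts) < ratio * overall]

times = {
    "w1": [55, 60, 48, 70],   # careful worker
    "w2": [5, 4, 6, 5],       # suspiciously fast
    "w3": [40, 35, 52, 38],
}
print(flag_fast_workers(times))  # ['w2']
```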

Table 7: Run-time methods.

Majority voting (MV): a straightforward and common method that eliminates wrong results by using the majority decision [31, 61, 68].

Expectation maximization (EM) algorithm: measures worker quality by estimating the accurate answer for each task from the labels completed by different workers using maximum likelihood. The algorithm has two phases: (i) the correct answer is estimated for each task from the multiple labels submitted by different workers, accounting for the quality of each worker, and (ii) the assigned responses are compared with the concluded accurate answer in order to estimate the quality of each worker [69].

Naive Bayes (NB): following EM, NB models the biases and reliability of individual workers and corrects them in order to improve the quality of the workers' results. Based on gold standard data, a small amount of training data labeled by experts is used to correct the individual biases of workers; the idea is to recalibrate workers' answers to match the experts more closely. An average of four nonexpert labels per example is needed to emulate expert-level label quality, which helps to improve annotation quality [68].

Observation of the pattern of responses: looking at the pattern of answers is another effective way of filtering unreliable responses, as some untrustworthy workers follow a regular pattern, for example, selecting the first choice of every question.

Probabilistic matrix factorization (PMF): induces a latent feature vector for each worker and example to infer unobserved worker assessments for all examples [70]. PMF is a standard method in collaborative filtering; crowdsourcing data are converted into collaborative filtering data to predict the labels workers did not provide [71, 72].

Expert review: uses experts to evaluate workers [73].

Contributor evaluation: workers are evaluated according to quality factors such as their reputation, experience, or credentials; if the workers have sufficient quality factors, the requester accepts their tasks. For instance, Wikipedia accepts articles written by administrators without evaluation [27], and tasks submitted by workers with a high approval rate or by master workers can be accepted without doubt.

Real-time support: the requester gives workers feedback about the quality of their work in real time, while they are accomplishing the task; this helps workers to amend their work, and results showed that self-assessment and external feedback improve task quality [48]. Another real-time approach was proposed by Kulkarni et al. [74], where requesters can follow workers' workflows while they solve tasks; a tool called Turkomatic was presented, which employs workers to do tasks for requesters while the requesters monitor the process and view the status of the task in real time.

Another quality control method, expert review, is commonly applied in practice, but its drawbacks are the high cost of employing experts and the time required. This approach is not suitable for IR evaluation, since crowdsourcing is used precisely to lower the cost of hiring experts for creating relevance judgments; if expert review were used to aggregate judgments, the cost would increase.

Two drawbacks of using crowdsourcing to create relevance judgments are that (i) each worker judges only a small number of examples and (ii) to decrease cost, few judgments are collected per example. The judgments are therefore imbalanced and sparse. MV is vulnerable to this problem, EM addresses it indirectly, and PMF tackles it directly [70].

Different quality control methods applied in crowdsourcing experiments across different areas of study were listed and discussed, covering both design-time and run-time approaches. Although both approaches are crucial to successful experiments, the important point is that the quality control method should be well suited to the crowdsourcing platform.

7. Conclusion and Future Work

Crowdsourcing is a novel phenomenon being applied in various fields of study, and one of its applications is in IR evaluation experiments. Several recent studies show that crowdsourcing is a viable alternative for creating relevance judgments; however, it needs precise design and appropriate quality control methods to ensure trustworthy outcomes. This is particularly imperative since the workers of crowdsourcing platforms come from diverse cultural and linguistic backgrounds.

One important concern in crowdsourcing experiments is the possibility of untrustworthy workers trying to earn money without paying due attention to their tasks. Confusion or misunderstanding is another concern, which may arise from low-quality experimental design. In conclusion, quality control in crowdsourcing experiments is a vital element not only in the design phase but also in the run-time phase, and quality control and monitoring strategies should be considered and examined during all stages of the experiment. Moreover, the user interface and instructions should be comprehensible, motivating, and exciting to the users, and the payment should be reasonable, with monetary or nonmonetary rewards also worth considering. Although there have been efforts to develop crowdsourcing experiments in the IR area in recent years, several limitations remain (some are listed below) and the various approaches still need refinement:

(i) currently, a random topic-document pair is shown to the workers to judge; workers should be given the freedom to choose topics of interest for relevance judgment;

(ii) scaling up the creation of relevance judgments to a larger set of topic-document pairs, to find out whether crowdsourcing can replace hired assessors in creating relevance judgments;

(iii) designing a more precise grading system for workers than the approval rate, so that requesters can see each worker's grade for each task type; for example, requesters would know which workers have a good grade in creating relevance judgments;

(iv) giving requesters access to reliable personal information about workers, such as expertise, language, and background, to decide which types of workers are allowed to accomplish the tasks;

(v) trying different aggregation methods for creating relevance judgments and evaluating their influence on the correlation of the judgments with those of TREC experts.

Crowdsourcing is an exciting research area with several puzzles and challenges that require further research and investigation. This review sheds light on some aspects of using crowdsourcing in IR evaluation, with insight into issues related to crowdsourcing experiments at the main stages, such as design, implementation, and monitoring. Further research and development should be conducted to enhance the reliability and accuracy of relevance judgments in IR experimentation.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by High Impact Research Grant UMC6251HIRMOHEFCSIT14 from the University of Malaya and the Ministry of Education, Malaysia.

References

[1] E. M. Voorhees, "The philosophy of information retrieval evaluation," in Evaluation of Cross-Language Information Retrieval Systems, pp. 355–370, Springer, 2002.
[2] C. Cleverdon, "The Cranfield tests on index language devices," Aslib Proceedings, vol. 19, no. 6, pp. 173–194, 1967.
[3] S. I. Moghadasi, S. D. Ravana, and S. N. Raman, "Low-cost evaluation techniques for information retrieval systems: a review," Journal of Informetrics, vol. 7, no. 2, pp. 301–312, 2013.
[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology Behind Search, Addison-Wesley, 2011.
[5] J. Howe, "The rise of crowdsourcing," Wired Magazine, vol. 14, no. 6, pp. 1–4, 2006.
[6] Y. Zhao and Q. Zhu, "Evaluation on crowdsourcing research: current status and future direction," Information Systems Frontiers, 2012.
[7] R. Munro, S. Bethard, V. Kuperman et al., "Crowdsourcing and language studies: the new generation of linguistic data," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 122–130, Association for Computational Linguistics, 2010.
[8] V. Ambati, S. Vogel, and J. Carbonell, "Active learning and crowd-sourcing for machine translation," in Language Resources and Evaluation (LREC), vol. 7, pp. 2169–2174, 2010.
[9] C. Callison-Burch, "Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 286–295, Association for Computational Linguistics, August 2009.
[10] K. T. Stolee and S. Elbaum, "Exploring the use of crowdsourcing to support empirical studies in software engineering," in Proceedings of the 4th International Symposium on Empirical Software Engineering and Measurement (ESEM '10), no. 35, ACM, September 2010.
[11] D. R. Choffnes, F. E. Bustamante, and Z. Ge, "Crowdsourcing service-level network event monitoring," in Proceedings of the 7th International Conference on Autonomic Computing (SIGCOMM '10), pp. 387–398, ACM, September 2010.
[12] A. Brew, D. Greene, and P. Cunningham, "Using crowdsourcing and active learning to track sentiment in online media," in Proceedings of the Conference on European Conference on Artificial Intelligence (ECAI '10), pp. 145–150, 2010.
[13] R. Holley, "Crowdsourcing and social engagement: potential, power and freedom for libraries and users," 2009.
[14] D. C. Brabham, "Crowdsourcing the public participation process for planning projects," Planning Theory, vol. 8, no. 3, pp. 242–262, 2009.
[15] O. Alonso, D. E. Rose, and B. Stewart, "Crowdsourcing for relevance evaluation," ACM SIGIR Forum, vol. 42, no. 2, pp. 9–15, 2008.
[16] Crowdflower, http://www.crowdflower.com.
[17] G. Paolacci, J. Chandler, and P. G. Ipeirotis, "Running experiments on Amazon Mechanical Turk," Judgment and Decision Making, vol. 5, no. 5, pp. 411–419, 2010.
[18] W. Mason and S. Suri, "Conducting behavioral research on Amazon's Mechanical Turk," Behavior Research Methods, vol. 44, no. 1, pp. 1–23, 2012.
[19] Y. Pan and E. Blevis, "A survey of crowdsourcing as a means of collaboration and the implications of crowdsourcing for interaction design," in Proceedings of the 12th International Conference on Collaboration Technologies and Systems (CTS '11), pp. 397–403, May 2011.
[20] O. Alonso and S. Mizzaro, "Using crowdsourcing for TREC relevance assessment," Information Processing and Management, vol. 48, no. 6, pp. 1053–1066, 2012.


[21] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[22] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.
[23] K. Krippendorff, "Estimating the reliability, systematic error and random error of interval data," Educational and Psychological Measurement, vol. 30, no. 1, pp. 61–70, 1970.
[24] C. Eickhoff and A. P. de Vries, "Increasing cheat robustness of crowdsourcing tasks," Information Retrieval, vol. 16, no. 2, pp. 121–137, 2013.
[25] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling, "Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 205–214, ACM, July 2011.
[26] O. Alonso, "Implementing crowdsourcing-based relevance experimentation: an industrial perspective," Information Retrieval, vol. 16, no. 2, pp. 101–120, 2013.
[27] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, "Quality control in crowdsourcing systems: issues and directions," IEEE Internet Computing, vol. 17, no. 2, pp. 76–81, 2013.
[28] O. Alonso and R. Baeza-Yates, "Design and implementation of relevance assessments using crowdsourcing," in Advances in Information Retrieval, pp. 153–164, Springer, 2011.
[29] S. Khanna, A. Ratan, J. Davis, and W. Thies, "Evaluating and improving the usability of Mechanical Turk for low-income workers in India," in Proceedings of the 1st ACM Symposium on Computing for Development (DEV '10), no. 12, ACM, December 2010.
[30] A. Kittur, E. H. Chi, and B. Suh, "Crowdsourcing user studies with Mechanical Turk," in Proceedings of the 26th Annual CHI Conference on Human Factors in Computing Systems (CHI '08), pp. 453–456, ACM, April 2008.
[31] M. Hirth, T. Hoßfeld, and P. Tran-Gia, "Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms," Mathematical and Computer Modelling, vol. 57, no. 11-12, pp. 2918–2932, 2013.
[32] G. Kazai, J. Kamps, and N. Milic-Frayling, "An analysis of human factors and label accuracy in crowdsourcing relevance judgments," Information Retrieval, vol. 16, no. 2, pp. 138–178, 2013.
[33] D. E. Difallah, G. Demartini, and P. Cudre-Mauroux, "Mechanical cheat: spamming schemes and adversarial techniques on crowdsourcing platforms," in Proceedings of the CrowdSearch Workshop, Lyon, France, 2012.
[34] L. De Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler, "Reputation systems for open collaboration," Communications of the ACM, vol. 54, no. 8, pp. 81–87, 2011.
[35] G. Kazai, J. Kamps, and N. Milic-Frayling, "The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2583–2586, ACM, 2012.
[36] L. Hammon and H. Hippner, "Crowdsourcing," Wirtschaftsinformatik, vol. 54, no. 3, pp. 165–168, 2012.
[37] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson, "Who are the crowdworkers? Shifting demographics in Mechanical Turk," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 2863–2872, ACM, April 2010.
[38] P. Ipeirotis, "Demographics of Mechanical Turk," 2010.
[39] G. Kazai, "In search of quality in crowdsourcing for search engine evaluation," in Advances in Information Retrieval, pp. 165–176, Springer, 2011.
[40] M. Potthast, B. Stein, A. Barron-Cedeno, and P. Rosso, "An evaluation framework for plagiarism detection," in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997–1005, Association for Computational Linguistics, August 2010.
[41] W. Mason and D. J. Watts, "Financial incentives and the performance of crowds," ACM SIGKDD Explorations Newsletter, vol. 11, no. 2, pp. 100–108, 2010.
[42] J. Heer and M. Bostock, "Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 203–212, ACM, April 2010.
[43] C. Grady and M. Lease, "Crowdsourcing document relevance assessment with Mechanical Turk," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 172–179, Association for Computational Linguistics, 2010.
[44] J. Chen, N. Menezes, J. Bradley, and T. A. North, "Opportunities for crowdsourcing research on Amazon Mechanical Turk," Human Factors, vol. 5, no. 3, 2011.
[45] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut, "CrowdForge: crowdsourcing complex work," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), pp. 43–52, ACM, October 2011.
[46] P. Clough, M. Sanderson, J. Tang, T. Gollins, and A. Warner, "Examining the limits of crowdsourcing for relevance assessment," IEEE Internet Computing, vol. 17, no. 4, pp. 32–38, 2012.
[47] O. Scekic, H. L. Truong, and S. Dustdar, "Modeling rewards and incentive mechanisms for social BPM," in Business Process Management, pp. 150–155, Springer, 2012.
[48] S. P. Dow, B. Bunge, T. Nguyen, S. R. Klemmer, A. Kulkarni, and B. Hartmann, "Shepherding the crowd: managing and providing feedback to crowd workers," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1669–1674, ACM, May 2011.
[49] T. W. Malone, R. Laubacher, and C. Dellarocas, "Harnessing crowds: mapping the genome of collective intelligence," MIT Sloan School Working Paper 4732-09, 2009.
[50] G. Kazai, J. Kamps, and N. Milic-Frayling, "Worker types and personality traits in crowdsourcing relevance labels," in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1941–1944, ACM, October 2011.
[51] J. Vuurens, A. P. de Vries, and C. Eickhoff, "How much spam can you take? An analysis of crowdsourcing results to increase accuracy," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), pp. 21–26, 2011.
[52] J. B. P. Vuurens and A. P. de Vries, "Obtaining high-quality relevance judgments using crowdsourcing," IEEE Internet Computing, vol. 16, no. 5, pp. 20–27, 2012.
[53] W. Tang and M. Lease, "Semi-supervised consensus labeling for crowdsourcing," in Proceedings of the SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), 2011.
[54] J. Le, A. Edmonds, V. Hester, and L. Biewald, "Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 21–26, 2010.

The Scientific World Journal 13

[55] RMMcCreadie CMacdonald and I Ounis ldquoCrowdsourcinga news query classification datasetrdquo in Proceedings of the SIGIRWorkshop on Crowdsourcing for Search Evaluation (CSE rsquo10) pp31ndash38 2010

[56] T Xia C Zhang J Xie and T Li ldquoReal-time quality control forcrowdsourcing relevance evaluationrdquo in Proceedings of the 3rdIEEE International Conference on Network Infrastructure andDigital Content (IC-NIDC rsquo12) pp 535ndash539 2012

[57] G Zuccon T Leelanupab S Whiting E Yilmaz J M Jose andL Azzopardi ldquoCrowdsourcing interactions using crowdsourc-ing for evaluating interactive information retrieval systemsrdquoInformation Retrieval vol 16 no 2 pp 267ndash305 2013

[58] D Zhu and B Carterette ldquoAn analysis of assessor behaviorin crowdsourced preference judgmentsrdquo in Proceedings of theSIGIRWorkshop onCrowdsourcing for Search Evaluation pp 17ndash20 2010

[59] L Von Ahn M Blum N J Hopper et al ldquoCAPTCHA usinghard AI problems for securityrdquo in Advances in CryptologymdashEUROCRYPT 2003 pp 294ndash311 Springer 2003

[60] L Von Ahn B Maurer C McMillen D Abraham and MBlum ldquoreCAPTCHA human-based character recognition viaweb security measuresrdquo Science vol 321 no 5895 pp 1465ndash1468 2008

[61] V S Sheng F Provost and P G Ipeirotis ldquoGet another labelImproving data quality and data mining using multiple noisylabelersrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo08) pp 614ndash622 ACM August 2008

[62] P Welinder and P Perona ldquoOnline crowdsourcing ratingannotators and obtaining cost-effective labelsrdquo in Proceedings ofthe IEEE Computer Society Conference on Computer Vision andPattern RecognitionmdashWorkshops (CVPRW rsquo10) pp 25ndash32 June2010

[63] P G Ipeirotis F Provost V S Sheng and J Wang ldquoRepeatedlabeling usingmultiple noisy labelersrdquoDataMining and Knowl-edge Discovery vol 28 no 2 pp 402ndash441 2014

[64] M Hosseini I J Cox NMilic-Frayling G Kazai and V VinayldquoOn aggregating labels from multiple crowd workers to inferrelevance of documentsrdquo in Advances in Information Retrievalpp 182ndash194 Springer 2012

[65] P G Ipeirotis F Provost and J Wang ldquoQuality managementon amazon mechanical turkrdquo in Proceedings of the HumanComputation Workshop (HCOMP rsquo10) pp 64ndash67 July 2010

[66] B Carpenter ldquoMultilevel bayesian models of categorical dataannotationrdquo 2008

[67] V C Raykar S Yu L H Zhao et al ldquoLearning from crowdsrdquoThe Journal of Machine Learning Research vol 11 pp 1297ndash13222010

[68] R Snow B OrsquoConnor D Jurafsky and A Y Ng ldquoCheap andfastmdashbut is it good Evaluating non-expert annotations fornatural language tasksrdquo in Proceedings of the Conference onEmpiricalMethods in Natural Language Processing pp 254ndash263Association for Computational Linguistics October 2008

[69] A P Dawid and A M Skene ldquoMaximum likelihood estimationof observer error-rates using the EM algorithmrdquo Applied Statis-tics vol 28 pp 20ndash28 1979

[70] R Salakhutdinov and A Mnih ldquoProbabilistic matrix factoriza-tionrdquo Advances in Neural Information Processing Systems vol20 pp 1257ndash1264 2008

[71] H J Jung andM Lease ldquoInferringmissing relevance judgmentsfrom crowd workers via probabilistic matrix factorizationrdquo in

Proceedings of the 35th International ACM SIGIR Conference onResearch and Development in Information Retrieval pp 1095ndash1096 ACM 2012

[72] H J Jung and M Lease ldquoImproving quality of crowdsourcedlabels via probabilistic matrix factorizationrdquo in Proceedings ofthe 26th AAAI Conference on Artificial Intelligence 2012

[73] A J Quinn and B B Bederson ldquoHuman computation asurvey and taxonomy of a growing fieldrdquo in Proceedings of the29th Annual CHI Conference on Human Factors in ComputingSystems (CHI rsquo11) pp 1403ndash1412 ACM May 2011

[74] A KulkarniM Can and BHartmann ldquoCollaboratively crowd-sourcingworkflowswith turkomaticrdquo inProceedings of the ACMConference on Computer Supported Cooperative Work (CSCWrsquo12) pp 1003ndash1012 February 2012

[75] M Soleymani and M Larson ldquoCrowdsourcing for affectiveannotation of video development of a viewer-reported bore-dom corpusrdquo in Proceedings of the ACM SIGIR Workshop onCrowdsourcing for Search Evaluation (CSE rsquo10) pp 4ndash8 2010



Table 3: Statistics for calculating interrater agreement.

Joint-probability of agreement (percentage agreement) [20]: The simplest and easiest measure; the number of times each rating (e.g., 1, 2, 5) is assigned by each assessor is divided by the total number of ratings.

Cohen's kappa [21]: A statistical measure of interrater agreement between two raters. It is more robust than percentage agreement because it accounts for the agreement expected by chance between the two assessors.

Fleiss' kappa [22]: An extension of Cohen's kappa that measures agreement among any number of raters (not only two).

Krippendorff's alpha [23]: A measure based on the overall distribution of judgments, regardless of which assessors produced them.
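To make the agreement measures in Table 3 concrete, the sketch below computes percentage agreement and Cohen's kappa for two assessors. It is a minimal Python illustration; the variable names and sample relevance labels are invented for the example and are not taken from any TREC data.

```python
from collections import Counter

def percentage_agreement(a, b):
    # Fraction of items on which the two assessors give the same label.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    # Observed agreement corrected for the agreement expected by chance.
    n = len(a)
    p_o = percentage_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical relevance labels (1 = relevant, 0 = non-relevant) from two workers.
worker_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
worker_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(percentage_agreement(worker_1, worker_2))  # 0.8
print(cohens_kappa(worker_1, worker_2))          # approximately 0.58
```

The gap between the two numbers shows why kappa is preferred: part of the raw 80% agreement would be expected even if the two workers labeled at random.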

4.2. Interface Design. Users or workers access and contribute to the tasks via a user interface. Three design recommendations have been suggested to achieve good task completion at a lower cost: designing a user interface and instructions understandable by novice users, translating the instructions into the local language in addition to English, and preparing a video tutorial [29]. The use of colours, highlights, bold, italics, and typefaces was also suggested to enhance comprehensibility [26]. Verifiable questions, such as common-knowledge questions within the task, are also useful for validating the workers; they reduce the number of worthless results because users are aware that their answers are being analysed [30]. Another research work showed that more workers may be attracted to a user-friendly and simple interface, and the quality of the outcome may increase, whereas reliable workers may be unenthusiastic about an unreasonably complex user interface, which leads to delay [27].

4.3. Granularity. Tasks can be divided into three broad groups: routine, complex, and creative tasks. Tasks which do not need specific expertise, such as creating relevance judgments, are called routine. Complex tasks need some general skills, like rewriting a given text. Creative tasks, on the other hand, need specific skill and expertise, such as research and development [31]. Submitting small tasks by splitting long and complex tasks is recommended, since small tasks tend to attract more workers [26]. However, some studies showed that novel tasks that need creativity tend to deter cheaters and attract more qualified and reliable workers [24, 32]. In a separate study, Difallah et al. [33] stated that designing complicated tasks is a way to filter out cheaters. On the other hand, money is the main reason why most workers accomplish the tasks. However, workers who do the tasks for the prospect of fame and fun are the least truthful, while workers who are motivated by fulfillment and fortune are the most truthful. Eickhoff and de Vries [24] found that workers who do the job for entertainment rather than money rarely cheat. Therefore, designing and presenting HITs in an exciting and entertaining way leads to better results along with cost efficiency.

5. Human Features and Monetary Factors in Crowdsourcing

The effect of human features, such as knowledge, and monetary factors, such as payment, on the accuracy and success of crowdsourcing experiments is discussed in this section. The discussion is based on three factors, namely, (i) worker profile in crowdsourcing, (ii) payment in crowdsourcing, and (iii) compensation policy in crowdsourcing.

5.1. Worker Profile in Crowdsourcing. One of the factors that influence the quality of results is the worker profile, which consists of reputation and expertise. Requesters' feedback about workers' accomplishments creates reputation scores in the system [34]. Requesters should also maintain a positive reputation, since workers may refuse to accept HITs from poorly rated requesters [17]. Expertise can be recognized by credentials and experience: information such as language, location, and academic degree constitutes credentials, while experience refers to the knowledge a worker gains through the crowdsourcing system [27]. Location is another important factor that has a strong effect on the accuracy of the results. A study conducted by Kazai et al. [35] indicated that Asian workers produced poorer quality work compared with American or European workers; the Asian workers mostly preferred a simple HIT design, while the American workers preferred the full design. In crowdsourcing platforms such as AMT and CrowdFlower, it is possible to restrict the workers to specific countries while designing tasks. However, creating relevance judgments is a routine task and does not need specific expertise or experience.

5.2. Payment in Crowdsourcing. Payment impacts the accuracy of the results in crowdsourcing, since workers who are satisfied with the payment accomplish the tasks more accurately than workers who are not contented [32]. Monetary or nonmonetary reasons can motivate the workers of crowdsourcing platforms [36]. A study conducted by Ross et al. [37] on the motivation of workers in crowdsourcing showed that money is the main incentive of 13% of Indian and 5% of US workers. Another study, carried out by Ipeirotis [38], reported that AMT was the main income of 27% of Indian and 12% of US workers. Kazai [39] stated that higher payment leads to better quality of work,


while Potthast et al. [40] reported that higher payment only affects completion time rather than the quality of the results.

Kazai et al. [32] analysed the effect of higher payment on the accuracy of the results. Although higher payment attracts more qualified workers, whereas lower payment increases the number of sloppy workers, unethical workers are sometimes attracted by higher payment as well; in this case a strong quality control method, as well as filtering, can prevent poor results. Other studies reported that increasing payment leads to an increase in quantity rather than quality, while some showed that greater payment may only get the task done faster rather than better, as increasing the payment incentive speeds up work [41, 42]. A reasonable rate of pay is therefore the more cautious solution, as highly paid tasks attract spammers as well as legitimate workers [43]. Indeed, the payment amount should be fair and based on the complexity of the task.

The payment method is another factor which impacts the quality of the workers' output [41, 44]. For instance, for a task which requires finding 10 words in a puzzle, more puzzles would be solved with payment per puzzle than with payment per word [45]. It is better to pay workers after the work quality has been confirmed by measuring accuracy, and to make clear that payment is based on the quality of the completed work; this can help in achieving better results from workers [26]. In an experiment conducted by Alonso and Mizzaro [20], the workers were paid $0.02 for a relevance evaluation that took about one minute and consisted of judging the relevancy of one topic-document pair. In another experiment, the payment was $0.04 for judging one topic against ten documents [46].

5.3. Compensation in Crowdsourcing. A proper compensation policy and inducement have an impact on the quality of the results [47, 48]. There are two types of incentives: extrinsic incentives, like a monetary bonus such as extra payment, and intrinsic incentives, such as personal enthusiasm [27]. Rewards have been categorized as psychological, monetary, and material [47]. Mason and Watts [41] found that using nonfinancial compensation to motivate workers, such as enjoyable tasks or social rewards, is a better choice than financial rewards and has more effect on quality. It was also reported that using social rewards is a way of harnessing intrinsic motivation to improve the quality of work [49]. The use of intrinsic along with extrinsic incentives is recommended to motivate more reliable workers to perform a task. Furthermore, considering the task requirements and properties, as well as social factors such as the income and interests of workers according to the profiles of requesters and workers, is vital in selecting a proper compensation policy [27]. Offering a bonus for writing comments and justifications, or paying a bonus to the qualified workers who perform the tasks accurately and precisely, is recommended for relevance evaluation tasks [26].

In addition to task design, human features, and monetary factors, which are all important for a successful crowdsourcing experiment, quality control is another vital part of a crowdsourcing experiment and is elaborated on in the following section.

6. Quality Control in Crowdsourcing

Managing the workers' activity is a crucial part of crowdsourcing because of sloppy workers, who accomplish the tasks carelessly, and spammers, who trick the system and complete the tasks inaccurately. Different worker types have been identified through behavioral observation using factors such as label accuracy, HIT completion time, and fraction of useful labels [50], as explained below (see Table 4). This can help in developing HIT designs that attract the proper workers for the job. Table 5 shows different types of workers classified based on their average precision [51].

In 2012, to detect random spammers, Vuurens and de Vries [52] calculated for each worker $w$ the RandomSpam score as the average squared ordinal distance $\mathrm{ord}_{ij}^{2}$ (which captures how much the judgment of one worker differs from another's) between the worker's judgments $j \in J_{w}$ and the judgments by other workers on the same query-document pair $i \in J_{j,w}$, as shown below:

\[
\mathrm{RandomSpam}_{w} = \frac{\sum_{j \in J_{w}} \sum_{i \in J_{j,w}} \mathrm{ord}_{ij}^{2}}{\sum_{j \in J_{w}} \left| J_{j,w} \right|}. \tag{1}
\]

In addition, to detect uniform spammers, the UniformSpam score is calculated as shown below:

\[
\mathrm{UniformSpam}_{w} = \frac{\sum_{s \in S} |S| \cdot \left( f_{s,J_{w}} - 1 \right) \cdot \left( \sum_{j \in J_{s,w}} \sum_{i \in J_{j,w}} \mathrm{disagree}_{ij} \right)^{2}}{\sum_{s \in S} \sum_{j \in J_{s,w}} \left| J_{j,w} \right|}, \tag{2}
\]

where $\mathrm{disagree}_{ij}$ is the number of disagreements between the judgments $J_{s,w}$ that occur within label sequence $s$ and the judgments $J_{j,w}$, $S$ is the set of all possible label sequences $s$ of length $|s| = 2$ or $3$, and $f_{s,J_{w}}$ is the frequency at which label sequence $s$ occurs within worker $w$'s time-ordered judgments $J_{w}$.
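As a minimal sketch of how a score like the RandomSpam measure in equation (1) could be computed, the Python fragment below compares each worker's graded labels with those of the other workers on the same query-document pairs. The data structure, worker names, and labels are hypothetical, and the squared label difference is used as a stand-in for the ordinal distance.

```python
from collections import defaultdict

def random_spam_scores(judgments):
    """Average squared ordinal distance of each worker's labels from other
    workers' labels on the same query-document pair (cf. equation (1)).
    `judgments` maps (query, doc) -> {worker: graded label}."""
    num = defaultdict(float)   # sum of squared ordinal distances per worker
    den = defaultdict(int)     # number of pairwise comparisons per worker
    for pair_labels in judgments.values():
        for w, label in pair_labels.items():
            for other, other_label in pair_labels.items():
                if other == w:
                    continue
                num[w] += (label - other_label) ** 2
                den[w] += 1
    return {w: num[w] / den[w] for w in num if den[w] > 0}

# Hypothetical graded labels (0 = non-relevant, 1 = relevant, 2 = highly relevant).
judgments = {
    ("q1", "d1"): {"w1": 2, "w2": 2, "spammer": 0},
    ("q1", "d2"): {"w1": 0, "w2": 1, "spammer": 2},
    ("q2", "d3"): {"w1": 1, "w2": 1, "spammer": 0},
}
print(random_spam_scores(judgments))  # the spammer gets the highest score
```

Workers whose score stays far above the rest of the crowd are candidates for removal before the labels are aggregated.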

Quality control in crowdsourcing is defined as the extent to which the provided outcome fulfills the requirements of the requester. Quality control approaches can be categorized into (i) design-time approaches, which are employed while the requester prepares the task before submission, and (ii) run-time approaches, which are used during or after completion of the tasks. Another classification of quality control methods, discussed in this section, is (i) filtering workers and (ii) aggregating labels [53]. In other words, filtering workers corresponds to design-time approaches, while aggregating-labels methods can be considered run-time approaches [27].

6.1. Design-Time Methods. There are three approaches to selecting workers: (i) credential-based, (ii) reputation-based, and (iii) open-to-all.


Table 4: Worker types based on behavioral observation.

Diligent: Completed tasks precisely, with high accuracy and a longer time spent on tasks.
Competent: Skilled workers with high accuracy and fast work.
Sloppy: Completed tasks quickly without considering quality.
Incompetent: Completed tasks in time but with low accuracy.
Spammer: Did not deliver useful work.

Table 5: Worker types based on their average precision.

Proper: Completed tasks precisely.
Random spammer: Gave worthless answers.
Semirandom spammer: Answered incorrectly on most questions while answering a few correctly, hoping to avoid detection as a spammer.
Uniform spammer: Repeated the same answers.
Sloppy: Not precise enough in their judgments.

It is also possible to use a combination of the three approaches. The credential-based approach is used in systems where users are well profiled; however, it is not practical in crowdsourcing systems, since users are usually unknown. The reputation-based approach selects workers based on their reputation, as AMT does with the approval rate parameter. The open-to-all approach, used for example by Wikipedia, allows any worker to contribute; it is easy to use and implement but increases the number of unreliable workers [27]. Requesters are also able to implement their own qualification methods [18] or to combine different methods. In total, there are various methods of filtering workers, which first identify the sloppy workers and then exclude them. We have categorized these methods into five categories; Table 6 presents a list of them.

When using a qualification test, it is still possible that workers perform tasks carelessly. Moreover, researchers encounter two issues with qualification tests: (i) tasks with a qualification test may take longer to complete, as most workers prefer tasks without one, and (ii) the cost of developing and continuously maintaining the test. Honey pots, or gold standard data, are used when the results are clear and predefined [54]. Honey pots are faster than a qualification test, as it is easy and quick to identify workers who answer the questions at random. Combining a qualification test and honey pots is also possible for quality control [26].
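A minimal sketch of honey-pot filtering is shown below: a handful of gold query-document pairs with known answers are mixed into the HITs, and workers whose accuracy on those pairs falls below a threshold are excluded. The gold pairs, the threshold, and the worker answers are all invented for illustration.

```python
# Hypothetical gold-standard (honey pot) pairs with known relevance labels.
GOLD = {("q1", "d7"): 1, ("q2", "d3"): 0, ("q5", "d9"): 1}
ACCURACY_THRESHOLD = 0.7  # assumed cutoff; tune per experiment

def gold_accuracy(worker_answers):
    """worker_answers maps (query, doc) -> label given by one worker."""
    hits = [worker_answers[pair] == label
            for pair, label in GOLD.items() if pair in worker_answers]
    return sum(hits) / len(hits) if hits else 0.0

def filter_workers(all_answers):
    """Keep only workers whose accuracy on the gold pairs meets the threshold."""
    return {w: answers for w, answers in all_answers.items()
            if gold_accuracy(answers) >= ACCURACY_THRESHOLD}

all_answers = {
    "w1": {("q1", "d7"): 1, ("q2", "d3"): 0, ("q5", "d9"): 1},  # 3/3 correct
    "w2": {("q1", "d7"): 0, ("q2", "d3"): 1, ("q5", "d9"): 0},  # 0/3 correct
}
print(list(filter_workers(all_answers)))  # only 'w1' survives
```

In practice the gold pairs are interleaved with the real judging tasks so that workers cannot tell which answers are being checked.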

Qualification settings, such as filtering workers by origin, have an important impact on cheater rates; for instance, some assume that workers from developed countries have lower cheater rates [24]. Setting an approval rate is used by AMT, which provides this metric to prefilter workers: it represents the percentage of assignments the worker has performed that were confirmed by the requesters over all tasks the worker has done, and it effectively acts as the worker's overall rating in the system. It is also possible to limit the tasks to master workers. In AMT, the group of people who accomplish HITs with a high degree of accuracy across a variety of requesters are called master workers; they expect to receive higher payment, and limiting HITs to master workers yields better quality results [32]. If a high approval rate is required for quality control, the experiment takes longer to complete, since the eligible worker population shrinks [28]. In general, the settings should match the task type: if the task is complicated, the approval rate and other settings should be set precisely, or the task should be assigned to master workers only; if the task is routine, ordinary workers with simple settings can be effective. Trap questions are also used to detect careless workers who do not read the task instructions carefully. Kazai et al. [25] designed two trap questions for this purpose: "Please tick here if you did NOT read the instructions" and, at each page, "I did not pay attention". Trap questions may not detect all unreliable workers, but they expose odd behavior and can be effective both in discouraging and in identifying spammers. In crowdsourcing there are also programs or bots designed to accomplish HITs automatically [55], and such data are certainly of poor quality [18]. CAPTCHAs and reCAPTCHA are used in crowdsourcing experiments to detect superficial workers and random clicking on answers [29, 32]; these methods are easy to use and inexpensive.

It is also possible to use a combination of the above methods to filter workers. For example, a real-time strategy applying different filtering methods was proposed for recruiting workers and monitoring the quality of their work: first, a qualification test was used to filter out sloppy workers and the completion time of HITs was calculated to reflect the truthfulness of the workers, and then a set of gold questions was used to evaluate the workers' skills [56]. However, even when these filtering methods are used, there is still the possibility that lower-quality workers are recruited to do the work. The quality control methods applied after filtering workers, at run time, are explained in the subsequent section.

6.2. Run-Time Methods. Although design-time techniques can improve quality, there is still the possibility of low quality


Table 6: Design-time methods.

Qualification test: A set of questions which the workers must answer to qualify for doing the tasks. Platform: AMT. Example: IQ test [57].

Honey pots or gold standard data: Predefined questions with known answers [54]; workers who answer these questions correctly are marked as appropriate for the task. Platform: CrowdFlower [16].

Qualification settings: Settings specified when creating HITs, for example the approval rate. Platforms: AMT, CrowdFlower.

Trap questions: Designing HITs along with a set of questions with known answers in order to find unreliable workers [58].

CAPTCHAs and reCAPTCHA: CAPTCHAs are an antispamming technique that separates computers from humans in order to filter automatic answers, as only a human can pass the test [59]. reCAPTCHA is a development of CAPTCHAs in which the text of a scanned page is used to identify whether a human is entering data or spam is trying to trick the web application [60].

because of misunderstandings while doing the tasks. Therefore, run-time techniques are vital for high-quality results; indeed, the quality of the results increases when both approaches are applied [27]. In crowdsourcing, collecting one judgment per example or task, called single labeling, saves time and cost, but the quality of the work then depends on a single individual's knowledge. To address this issue of single labeling, integrating the labels from multiple workers was introduced [61, 62]. If labels are noisy, multiple labels can be preferable to single labeling, even when labels are not particularly low cost [63], and they lead to more accurate relevance judgments [64]. An important issue with multiple labels per example is how to aggregate the labels from various workers into a single consensus label accurately and efficiently. Run-time quality control methods are explained in Table 7.

MV is a better choice for routine tasks with lower payment, since it is easy to implement and achieves reasonable results, depending on the truthfulness of the workers [53]. One drawback of this method is that the consensus label is computed for a specific task without considering the workers' accuracy on other tasks. Another drawback is that MV treats all workers as equally good: for example, if there is only one expert and the others are novices, and the novices give the same inaccurate response, MV takes the novices' answer as the correct one because they are the majority.
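The following few lines sketch the majority voting aggregation just described; the judgment data are hypothetical, and ties are broken arbitrarily in this simplified version.

```python
from collections import Counter

def majority_vote(labels_per_item):
    """labels_per_item maps an item id -> list of labels from different workers.
    Returns one consensus label per item (ties broken arbitrarily)."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_per_item.items()}

# Hypothetical relevance labels collected for three query-document pairs.
labels_per_item = {
    ("q1", "d1"): [1, 1, 0],
    ("q1", "d2"): [0, 0, 1],
    ("q2", "d5"): [1, 0, 1],
}
print(majority_vote(labels_per_item))
# {('q1', 'd1'): 1, ('q1', 'd2'): 0, ('q2', 'd5'): 1}
```

The sketch makes the weakness visible as well: every vote counts the same, so a coordinated group of sloppy workers can outvote a single expert.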

The outputs of the EM algorithm are a set of estimated correct answers for each task and a set of confusion matrices that capture each worker's errors (the replies that may not be the correct answers). The error rate of each worker can be read from this confusion matrix. However, the error rate alone is not adequate for measuring the quality of a worker, since workers may have completed the task carefully but with bias; for example, when labeling websites, parents with young children are more conservative in classifying the websites. To handle this, a single scalar score can be assigned to each worker based on the completed labels; such scores separate the error rate from worker bias and allow a fairer treatment of the workers. Approximately five labels per example are needed for this algorithm to become accurate [65]. More recently, Carpenter [66] and Raykar et al. [67] proposed Bayesian versions of the EM algorithm using the confusion matrix. Raykar et al. [67] proposed a probabilistic framework for the case of multiple labels with no gold standard: the algorithm repeatedly creates a specific gold standard, measures the performance of the workers, and then refines the gold standard according to the performance measurements. Hosseini et al. [64] evaluated the performance of EM and MV for aggregating labels in relevance assessment; their experimental results showed that EM is a better solution to problems related to the reliability of IR system rankings and relevance judgments, especially with a small number of labels and noisy judgments. Another study comparing MV and EM reported that EM outperformed MV when 60% of the workers were spammers [51]. However, a combination of these two methods could produce more accurate relevance judgments.
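To illustrate the iteration that EM-style aggregation performs, here is a deliberately simplified sketch in the spirit of Dawid and Skene [69]: instead of a full confusion matrix it keeps a single accuracy estimate per worker for binary labels, alternating between re-estimating the items' label probabilities and the workers' accuracies. All data and parameter choices are hypothetical, and real implementations add priors and convergence checks.

```python
def em_aggregate(votes, iterations=10):
    """votes: list of (item, worker, label) with binary labels {0, 1}.
    Returns (consensus labels per item, estimated accuracy per worker)."""
    items = {i for i, _, _ in votes}
    workers = {w for _, w, _ in votes}
    accuracy = {w: 0.8 for w in workers}          # initial guess for every worker
    prob = {}                                     # P(true label = 1) per item

    for _ in range(iterations):
        # E-step: estimate each item's label from the current worker accuracies.
        for i in items:
            like1 = like0 = 1.0
            for item, w, label in votes:
                if item != i:
                    continue
                a = accuracy[w]
                like1 *= a if label == 1 else (1 - a)
                like0 *= a if label == 0 else (1 - a)
            prob[i] = like1 / (like1 + like0)
        # M-step: re-estimate each worker's accuracy against the soft labels.
        for w in workers:
            num = den = 0.0
            for item, worker, label in votes:
                if worker != w:
                    continue
                num += prob[item] if label == 1 else (1 - prob[item])
                den += 1
            accuracy[w] = num / den

    labels = {i: int(prob[i] >= 0.5) for i in items}
    return labels, accuracy

votes = [("d1", "w1", 1), ("d1", "w2", 1), ("d1", "spam", 0),
         ("d2", "w1", 0), ("d2", "w2", 0), ("d2", "spam", 1)]
labels, accuracy = em_aggregate(votes)
print(labels)    # consensus labels per document
print(accuracy)  # the contrarian worker ends up with a low estimated accuracy
```

Unlike majority voting, the votes of a worker whose estimated accuracy drops carry progressively less weight in later iterations.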

Considering the time taken to complete the task is an example of observing the pattern of responses, and it has been used to detect the random answers of unreliable workers [30]. As tasks that are completed very quickly tend to be of poor quality, completion time is a robust way of detecting sloppy workers. Soleymani and Larson [75] used crowdsourcing for affective annotation of video in order to improve the performance of multimedia retrieval systems; the completion time of each HIT was compared to the video length in order to evaluate quality. In another study, the time that each worker spent on judgments was assessed in order to control quality [58], and three types of patterns were found: (i) a normal pattern, whereby workers begin slowly and get faster as they learn about the task; (ii) a periodic pattern, a peculiar behavior in which some judgments are done quickly and some slowly; and (iii) an interrupted pattern, which refers to interruptions in the middle of doing tasks. This method, in combination with other quality control methods, can be effective in crowdsourcing experiments, since on its own it may not detect all sloppy workers.
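As a small illustration of the completion-time check, the sketch below flags workers whose median HIT time is far below that of the crowd; the timing data and the cutoff factor are invented for the example and would need tuning per task.

```python
from statistics import median

def flag_fast_workers(times_per_worker, factor=0.3):
    """times_per_worker maps worker -> list of HIT completion times in seconds.
    Workers whose median time is below `factor` times the crowd median are flagged."""
    crowd_median = median(t for times in times_per_worker.values() for t in times)
    return [w for w, times in times_per_worker.items()
            if median(times) < factor * crowd_median]

times_per_worker = {
    "w1": [55, 62, 48, 70],   # plausible judgment times
    "w2": [40, 58, 65, 52],
    "clicker": [4, 3, 5, 4],  # suspiciously fast
}
print(flag_fast_workers(times_per_worker))  # ['clicker']
```

Flagged workers are then inspected or cross-checked against other signals, rather than excluded automatically, since fast workers can also simply be experienced.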


Table 7: Run-time methods.

Majority voting (MV): A straightforward and common method which eliminates wrong results by using the majority decision [31, 61, 68].

Expectation maximization (EM) algorithm: EM measures worker quality by estimating the correct answer for each task from the labels given by different workers, using maximum likelihood. The algorithm has two phases: (i) the correct answer for each task is estimated from the multiple labels submitted by different workers, accounting for the quality of each worker, and (ii) the submitted responses are compared with the inferred correct answer in order to estimate the quality of each worker [69].

Naive Bayes (NB): Following EM, NB is a method for modeling the biases and reliability of individual workers and correcting them in order to improve the quality of the workers' results. A small amount of gold standard training data labeled by experts is used to correct the individual biases of workers; the idea is to recalibrate the workers' answers so that they match the experts more closely. An average of four non-expert labels per example is needed to emulate expert-level label quality, which helps to improve annotation quality [68].

Observation of the pattern of responses: Looking at the pattern of answers is another effective way of filtering unreliable responses, as some untrustworthy workers follow a regular pattern, for example selecting the first choice of every question.

Probabilistic matrix factorization (PMF): PMF induces a latent feature vector for each worker and example in order to infer unobserved worker assessments for all examples [70]. PMF is a standard method in collaborative filtering; crowdsourcing data are converted to collaborative filtering data in order to predict the missing labels of workers [71, 72].

Expert review: Experts are used to evaluate workers [73].

Contributor evaluation: Workers are evaluated according to quality factors such as their reputation, experience, or credentials; if a worker has sufficient quality factors, the requester accepts their tasks. For instance, Wikipedia accepts articles written by administrators without evaluation [27]; similarly, tasks submitted by workers with a high approval rate, or by master workers, would be accepted without doubt.

Real-time support: Requesters give workers feedback about the quality of their work in real time, while the workers are accomplishing the task. This helps workers to amend their work, and the results showed that self-assessment and external feedback improve task quality [48]. Another real-time approach was proposed by Kulkarni et al. [74], where requesters can follow the workers' workflows while they solve the tasks; a tool called Turkomatic was presented, which employs workers to do tasks for requesters, and while the workers are doing the task, requesters are able to monitor the process and view the status of the task in real time.

Another quality control method, expert review, is commonly applied in practice, but its drawbacks are the high cost of employing experts and the time required. This approach is not applicable in IR evaluation, since crowdsourcing is used precisely to lower the cost of hiring experts for creating relevance judgments; if the expert review method were used to aggregate judgments, the cost would increase again.

The drawbacks of using crowdsourcing to create relevance judgments are that (i) each worker judges a small number of examples and (ii) to decrease cost, few judgments are collected per example. The judgments are therefore imbalanced and sparse. MV is vulnerable to this problem, while EM addresses it indirectly and PMF tackles it directly [70].
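To give a flavor of how the PMF approach described above can fill in the sparse worker-item label matrix, the sketch below factorizes a small matrix of observed relevance labels with plain gradient descent. The matrix, latent dimensionality, and learning parameters are illustrative choices rather than settings from the cited studies.

```python
import numpy as np

def pmf(R, mask, k=2, lr=0.05, reg=0.1, epochs=500, seed=0):
    """Factorize the worker-by-item label matrix R (observed where mask == 1)
    into worker vectors U and item vectors V, then predict the missing cells."""
    rng = np.random.default_rng(seed)
    n_workers, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_workers, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        E = mask * (R - U @ V.T)          # error on observed entries only
        U += lr * (E @ V - reg * U)       # gradient steps with L2 regularization
        V += lr * (E.T @ U - reg * V)
    return U @ V.T                        # predicted labels for every cell

# Hypothetical 4 workers x 5 documents, graded 0/1; -1 marks a missing judgment.
R = np.array([[1, 0, 1, -1, 1],
              [1, 0, -1, 0, 1],
              [-1, 0, 1, 0, 1],
              [0, 1, 0, 1, -1]], dtype=float)
mask = (R >= 0).astype(float)
R = np.where(R >= 0, R, 0.0)
predictions = pmf(R, mask)
print(np.round(predictions, 2))  # inferred labels, including the missing ones
```

The predicted values for the unobserved cells can then be thresholded into labels and fed into an aggregation step, which is the role PMF plays in the label-inference studies cited above.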

Different quality control methods applied in crowdsourcing experiments across different areas of study have been listed and discussed above, covering both design-time and run-time approaches. Although both are crucial to successful experiments, the important point is that the quality control method should be well suited to the crowdsourcing platform.

7. Conclusion and Future Work

Crowdsourcing is a novel phenomenon being applied to various fields of study, and one of its applications is in IR evaluation experiments. Several recent studies show that crowdsourcing is a viable alternative for the creation of relevance judgments; however, it needs precise design and appropriate quality control methods for the outcomes to be trusted. This is particularly imperative since the workers of crowdsourcing platforms come from diverse cultural and linguistic backgrounds.

One of the important concerns in crowdsourcing experiments is the possible untrustworthiness of workers trying to earn money without paying due attention to their tasks. Confusion or misunderstanding is another concern, which may arise from low quality in the experimental design. Consequently, quality control in crowdsourcing experiments is a vital element not only in the design phase but also in the run-time phase, and quality control and monitoring strategies should be considered and examined during all stages of


the experiments. Moreover, the user interface and instructions should be comprehensible, motivating, and exciting to the users. The payment should be reasonable, and monetary or nonmonetary rewards can also be considered. Although there have been efforts to develop crowdsourcing experiments in the IR area in recent years, several limitations (some are listed below) remain to be addressed by refining the various approaches:

(i) currently, a random topic-document pair is shown to the workers to judge; workers should be given the freedom to choose their topic of interest for relevance judgment;

(ii) the creation of relevance judgments should be scaled up to a larger set of topic-document pairs to find out whether crowdsourcing can replace hired assessors for creating relevance judgments;

(iii) a more precise grading system for workers than the approval rate should be designed, in which requesters can see each worker's grade for each task type; for example, requesters would be able to know which workers have a good grade in creating relevance judgments;

(iv) requesters should have access to reliable personal information about workers, such as expertise, language, and background, to decide which types of workers are allowed to accomplish the tasks;

(v) different aggregation methods for creating relevance judgments should be tried, and their influence on the correlation of the judgments with those of TREC experts evaluated.

Crowdsourcing is an exciting research area with several puzzles and challenges that require further research and investigation. This review has shed light on some aspects of using crowdsourcing in IR evaluation, with insights into issues arising in the main stages of crowdsourcing experiments, such as design, implementation, and monitoring. Hence, further research and development should be conducted to enhance the reliability and accuracy of relevance judgments in IR experimentation.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by High Impact Research Grant UMC6251HIRMOHEFCSIT14 from the University of Malaya and the Ministry of Education, Malaysia.

References

[1] E. M. Voorhees, "The philosophy of information retrieval evaluation," in Evaluation of Cross-Language Information Retrieval Systems, pp. 355-370, Springer, 2002.

[2] C. Cleverdon, "The Cranfield tests on index language devices," Aslib Proceedings, vol. 19, no. 6, pp. 173-194, 1967.

[3] S. I. Moghadasi, S. D. Ravana, and S. N. Raman, "Low-cost evaluation techniques for information retrieval systems: a review," Journal of Informetrics, vol. 7, no. 2, pp. 301-312, 2013.

[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology Behind Search, Addison-Wesley, 2011.

[5] J. Howe, "The rise of crowdsourcing," Wired Magazine, vol. 14, no. 6, pp. 1-4, 2006.

[6] Y. Zhao and Q. Zhu, "Evaluation on crowdsourcing research: current status and future direction," Information Systems Frontiers, 2012.

[7] R. Munro, S. Bethard, V. Kuperman et al., "Crowdsourcing and language studies: the new generation of linguistic data," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 122-130, Association for Computational Linguistics, 2010.

[8] V. Ambati, S. Vogel, and J. Carbonell, "Active learning and crowd-sourcing for machine translation," in Language Resources and Evaluation (LREC), vol. 7, pp. 2169-2174, 2010.

[9] C. Callison-Burch, "Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 286-295, Association for Computational Linguistics, August 2009.

[10] K. T. Stolee and S. Elbaum, "Exploring the use of crowdsourcing to support empirical studies in software engineering," in Proceedings of the 4th International Symposium on Empirical Software Engineering and Measurement (ESEM '10), no. 35, ACM, September 2010.

[11] D. R. Choffnes, F. E. Bustamante, and Z. Ge, "Crowdsourcing service-level network event monitoring," in Proceedings of the 7th International Conference on Autonomic Computing (SIGCOMM '10), pp. 387-398, ACM, September 2010.

[12] A. Brew, D. Greene, and P. Cunningham, "Using crowdsourcing and active learning to track sentiment in online media," in Proceedings of the Conference on European Conference on Artificial Intelligence (ECAI '10), pp. 145-150, 2010.

[13] R. Holley, "Crowdsourcing and social engagement: potential, power and freedom for libraries and users," 2009.

[14] D. C. Brabham, "Crowdsourcing the public participation process for planning projects," Planning Theory, vol. 8, no. 3, pp. 242-262, 2009.

[15] O. Alonso, D. E. Rose, and B. Stewart, "Crowdsourcing for relevance evaluation," ACM SIGIR Forum, vol. 42, no. 2, pp. 9-15, 2008.

[16] CrowdFlower, http://www.crowdflower.com.

[17] G. Paolacci, J. Chandler, and P. G. Ipeirotis, "Running experiments on Amazon Mechanical Turk," Judgment and Decision Making, vol. 5, no. 5, pp. 411-419, 2010.

[18] W. Mason and S. Suri, "Conducting behavioral research on Amazon's Mechanical Turk," Behavior Research Methods, vol. 44, no. 1, pp. 1-23, 2012.

[19] Y. Pan and E. Blevis, "A survey of crowdsourcing as a means of collaboration and the implications of crowdsourcing for interaction design," in Proceedings of the 12th International Conference on Collaboration Technologies and Systems (CTS '11), pp. 397-403, May 2011.

[20] O. Alonso and S. Mizzaro, "Using crowdsourcing for TREC relevance assessment," Information Processing and Management, vol. 48, no. 6, pp. 1053-1066, 2012.


[21] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37-46, 1960.

[22] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological Bulletin, vol. 76, no. 5, pp. 378-382, 1971.

[23] K. Krippendorff, "Estimating the reliability, systematic error and random error of interval data," Educational and Psychological Measurement, vol. 30, no. 1, pp. 61-70, 1970.

[24] C. Eickhoff and A. P. de Vries, "Increasing cheat robustness of crowdsourcing tasks," Information Retrieval, vol. 16, no. 2, pp. 121-137, 2013.

[25] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling, "Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 205-214, ACM, July 2011.

[26] O. Alonso, "Implementing crowdsourcing-based relevance experimentation: an industrial perspective," Information Retrieval, vol. 16, no. 2, pp. 101-120, 2013.

[27] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, "Quality control in crowdsourcing systems: issues and directions," IEEE Internet Computing, vol. 17, no. 2, pp. 76-81, 2013.

[28] O. Alonso and R. Baeza-Yates, "Design and implementation of relevance assessments using crowdsourcing," in Advances in Information Retrieval, pp. 153-164, Springer, 2011.

[29] S. Khanna, A. Ratan, J. Davis, and W. Thies, "Evaluating and improving the usability of Mechanical Turk for low-income workers in India," in Proceedings of the 1st ACM Symposium on Computing for Development (DEV '10), no. 12, ACM, December 2010.

[30] A. Kittur, E. H. Chi, and B. Suh, "Crowdsourcing user studies with Mechanical Turk," in Proceedings of the 26th Annual CHI Conference on Human Factors in Computing Systems (CHI '08), pp. 453-456, ACM, April 2008.

[31] M. Hirth, T. Hoßfeld, and P. Tran-Gia, "Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms," Mathematical and Computer Modelling, vol. 57, no. 11-12, pp. 2918-2932, 2013.

[32] G. Kazai, J. Kamps, and N. Milic-Frayling, "An analysis of human factors and label accuracy in crowdsourcing relevance judgments," Information Retrieval, vol. 16, no. 2, pp. 138-178, 2013.

[33] D. E. Difallah, G. Demartini, and P. Cudre-Mauroux, "Mechanical cheat: spamming schemes and adversarial techniques on crowdsourcing platforms," in Proceedings of the CrowdSearch Workshop, Lyon, France, 2012.

[34] L. De Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler, "Reputation systems for open collaboration," Communications of the ACM, vol. 54, no. 8, pp. 81-87, 2011.

[35] G. Kazai, J. Kamps, and N. Milic-Frayling, "The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2583-2586, ACM, 2012.

[36] L. Hammon and H. Hippner, "Crowdsourcing," Wirtschaftsinf, vol. 54, no. 3, pp. 165-168, 2012.

[37] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson, "Who are the crowdworkers? Shifting demographics in Mechanical Turk," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 2863-2872, ACM, April 2010.

[38] P. Ipeirotis, "Demographics of Mechanical Turk," 2010.

[39] G. Kazai, "In search of quality in crowdsourcing for search engine evaluation," in Advances in Information Retrieval, pp. 165-176, Springer, 2011.

[40] M. Potthast, B. Stein, A. Barron-Cedeno, and P. Rosso, "An evaluation framework for plagiarism detection," in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997-1005, Association for Computational Linguistics, August 2010.

[41] W. Mason and D. J. Watts, "Financial incentives and the performance of crowds," ACM SIGKDD Explorations Newsletter, vol. 11, no. 2, pp. 100-108, 2010.

[42] J. Heer and M. Bostock, "Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 203-212, ACM, April 2010.

[43] C. Grady and M. Lease, "Crowdsourcing document relevance assessment with Mechanical Turk," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 172-179, Association for Computational Linguistics, 2010.

[44] J. Chen, N. Menezes, J. Bradley, and T. A. North, "Opportunities for crowdsourcing research on Amazon Mechanical Turk," Human Factors, vol. 5, no. 3, 2011.

[45] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut, "CrowdForge: crowdsourcing complex work," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), pp. 43-52, ACM, October 2011.

[46] P. Clough, M. Sanderson, J. Tang, T. Gollins, and A. Warner, "Examining the limits of crowdsourcing for relevance assessment," IEEE Internet Computing, vol. 17, no. 4, pp. 32-38, 2012.

[47] O. Scekic, H. L. Truong, and S. Dustdar, "Modeling rewards and incentive mechanisms for social BPM," in Business Process Management, pp. 150-155, Springer, 2012.

[48] S. P. Dow, B. Bunge, T. Nguyen, S. R. Klemmer, A. Kulkarni, and B. Hartmann, "Shepherding the crowd: managing and providing feedback to crowd workers," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1669-1674, ACM, May 2011.

[49] T. W. Malone, R. Laubacher, and C. Dellarocas, "Harnessing crowds: mapping the genome of collective intelligence," MIT Sloan School Working Paper 4732-09, 2009.

[50] G. Kazai, J. Kamps, and N. Milic-Frayling, "Worker types and personality traits in crowdsourcing relevance labels," in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1941-1944, ACM, October 2011.

[51] J. Vuurens, A. P. de Vries, and C. Eickhoff, "How much spam can you take? An analysis of crowdsourcing results to increase accuracy," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), pp. 21-26, 2011.

[52] J. B. P. Vuurens and A. P. de Vries, "Obtaining high-quality relevance judgments using crowdsourcing," IEEE Internet Computing, vol. 16, no. 5, pp. 20-27, 2012.

[53] W. Tang and M. Lease, "Semi-supervised consensus labeling for crowdsourcing," in Proceedings of the SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), 2011.

[54] J. Le, A. Edmonds, V. Hester, and L. Biewald, "Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 21-26, 2010.


[55] R. M. McCreadie, C. Macdonald, and I. Ounis, "Crowdsourcing a news query classification dataset," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 31-38, 2010.

[56] T. Xia, C. Zhang, J. Xie, and T. Li, "Real-time quality control for crowdsourcing relevance evaluation," in Proceedings of the 3rd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC '12), pp. 535-539, 2012.

[57] G. Zuccon, T. Leelanupab, S. Whiting, E. Yilmaz, J. M. Jose, and L. Azzopardi, "Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems," Information Retrieval, vol. 16, no. 2, pp. 267-305, 2013.

[58] D. Zhu and B. Carterette, "An analysis of assessor behavior in crowdsourced preference judgments," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 17-20, 2010.

[59] L. Von Ahn, M. Blum, N. J. Hopper et al., "CAPTCHA: using hard AI problems for security," in Advances in Cryptology - EUROCRYPT 2003, pp. 294-311, Springer, 2003.

[60] L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, "reCAPTCHA: human-based character recognition via web security measures," Science, vol. 321, no. 5895, pp. 1465-1468, 2008.

[61] V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 614-622, ACM, August 2008.

[62] P. Welinder and P. Perona, "Online crowdsourcing: rating annotators and obtaining cost-effective labels," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops (CVPRW '10), pp. 25-32, June 2010.

[63] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, "Repeated labeling using multiple noisy labelers," Data Mining and Knowledge Discovery, vol. 28, no. 2, pp. 402-441, 2014.

[64] M. Hosseini, I. J. Cox, N. Milic-Frayling, G. Kazai, and V. Vinay, "On aggregating labels from multiple crowd workers to infer relevance of documents," in Advances in Information Retrieval, pp. 182-194, Springer, 2012.

[65] P. G. Ipeirotis, F. Provost, and J. Wang, "Quality management on Amazon Mechanical Turk," in Proceedings of the Human Computation Workshop (HCOMP '10), pp. 64-67, July 2010.

[66] B. Carpenter, "Multilevel Bayesian models of categorical data annotation," 2008.

[67] V. C. Raykar, S. Yu, L. H. Zhao et al., "Learning from crowds," The Journal of Machine Learning Research, vol. 11, pp. 1297-1322, 2010.

[68] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, "Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254-263, Association for Computational Linguistics, October 2008.

[69] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, vol. 28, pp. 20-28, 1979.

[70] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," Advances in Neural Information Processing Systems, vol. 20, pp. 1257-1264, 2008.

[71] H. J. Jung and M. Lease, "Inferring missing relevance judgments from crowd workers via probabilistic matrix factorization," in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1095-1096, ACM, 2012.

[72] H. J. Jung and M. Lease, "Improving quality of crowdsourced labels via probabilistic matrix factorization," in Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.

[73] A. J. Quinn and B. B. Bederson, "Human computation: a survey and taxonomy of a growing field," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1403-1412, ACM, May 2011.

[74] A. Kulkarni, M. Can, and B. Hartmann, "Collaboratively crowdsourcing workflows with Turkomatic," in Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW '12), pp. 1003-1012, February 2012.

[75] M. Soleymani and M. Larson, "Crowdsourcing for affective annotation of video: development of a viewer-reported boredom corpus," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 4-8, 2010.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 7: Review Article Creation of Reliable Relevance Judgments in … · 2019. 7. 31. · Review Article Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation

The Scientific World Journal 7

while Potthast et al [40] reported that higher payment onlyhas an effect on completion time rather than on the quality ofresults

Kazai et al [32] analysed the effect of higher paymentto the workers on accuracy of the results Although higherpayment encourages more qualified workers in contrast withlower payment which upsurges the sloppy workers some-times unethical workers are attracted by higher payment Inthis case strong quality control method can prevent poorresult as well as filtering Other studies reported that byincreasing payment it leads to increase in quantity ratherthan quality while some studies showed that consideringgreater payment may only influence getting the task donefaster but not better as increasing payment incentive speedsup work [41 42] However a reasonable pay is a bettercautious solution as high pay tasks attract spammers aswell aslegitimate workers [43] Indeed the payment amount shouldbe fair and based on the complexity of the tasks

Payment method is another factor which impacts on thequality of the outputs of the workers [41 44] for instancea task which requires finding 10 words in a puzzle morepuzzles would be solved with method of payment per puzzlerather than payment per word [45] It is better to pay workersafter the work quality has been confirmed by measuringthe accuracy and clarification that payment is based on thequality of the completed work This can help in achievingbetter results fromworkers [26] In an experiment conductedby Alonso and Mizzaro [20] the workers were paid $002 forrelevancy evaluation that took about oneminute consisting ofjudging the relevancy of one topic and document In anotherexperiment the payment was $004 for judging one topicagainst ten documents [46]

53 Compensation in Crowdsourcing A proper compensa-tion policy and inducement has an impact on the qualityof the results [47 48] There are two types of incentivesextrinsic incentives like monetary bonus such as extra pay-ment and intrinsic incentives such as personal enthusiasm[27] Rewards were categorized into psychological monetaryand material [47] Mason and Watts [41] found that usingnonfinancial compensation to motivate workers such asenjoyable tasks or social rewards is a better choice thanfinancial rewards and have more effect on quality It wasalso reported that using social rewards is like harnessingintrinsic motivation to improve the quality of work [49] Theuse of intrinsic along with extrinsic incentives was recom-mended to motivate more reliable workers to perform a taskFurthermore considering the task requirements propertiesand social factors such as income and interests of workersaccording to the profiles of requesters and workers is vitalin selecting proper compensation policy [27] Offering bonusfor writing comments and justification or paying bonus to thequalified and good workers who perform the tasks accuratelyand precisely was recommended for relevance evaluation task[26]

In addition to task design, human features, and monetary factors, which are all important for a successful crowdsourcing experiment, quality control is another vital part of a crowdsourcing experiment and is elaborated on in the following section.

6. Quality Control in Crowdsourcing

Managing the workers' behaviour is a crucial part of crowdsourcing because of sloppy workers who accomplish the tasks carelessly and spammers who trick the system and complete the tasks inaccurately. Different worker types have been determined through behavioural observation, using factors including label accuracy, HIT completion time, and the fraction of useful labels [50], as explained below (see Table 4). This can help in developing HIT designs that attract the proper workers for the job. Table 5 shows different types of workers classified based on their average precision [51].
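As a rough illustration of how the behavioural signals mentioned above (label accuracy, completion time, fraction of useful labels) could be mapped onto the worker types of Table 4, the sketch below classifies a worker from those three signals; the thresholds are hypothetical and would need tuning against real data.

```python
# Illustrative classification of workers into the behavioural types of Table 4.
# The signals follow the factors mentioned in [50]; the thresholds are hypothetical.

def classify_worker(accuracy, mean_seconds_per_hit, useful_fraction):
    if useful_fraction < 0.2:
        return "spammer"          # delivers almost no useful labels
    if accuracy >= 0.8:
        # accurate workers are split by how quickly they work
        return "competent" if mean_seconds_per_hit < 30 else "diligent"
    if mean_seconds_per_hit < 10:
        return "sloppy"           # fast and inaccurate
    return "incompetent"          # spends time but still inaccurate

print(classify_worker(accuracy=0.9, mean_seconds_per_hit=60, useful_fraction=0.95))  # diligent
print(classify_worker(accuracy=0.4, mean_seconds_per_hit=5, useful_fraction=0.6))    # sloppy
```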

In 2012, to detect random spammers, Vuurens and de Vries [52] calculated for each worker $w$ the RandomSpam score as the average squared ordinal distance $\mathit{ord}^2_{ij}$ (which expresses how much the judgment of one worker differs from that of another) between the worker's judgments $j \in J_w$ and the judgments $i \in J_{j,w}$ made by other workers on the same query-document pair, as shown below:

\[
\mathrm{RandomSpam}_{w} = \frac{\sum_{j \in J_{w}} \sum_{i \in J_{j,w}} \mathit{ord}^{2}_{ij}}{\sum_{j \in J_{w}} \left| J_{j,w} \right|} \tag{1}
\]

In addition, to detect uniform spammers, the UniformSpam score is calculated as shown below:

\[
\mathrm{UniformSpam}_{w} = \frac{\sum_{s \in S} \left| S \right| \cdot \left( f_{s,J_{w}} - 1 \right) \cdot \left( \sum_{j \in J_{s,w}} \sum_{i \in J_{j,w}} \mathrm{disagree}_{ij} \right)^{2}}{\sum_{s \in S} \sum_{j \in J_{s,w}} \left| J_{j,w} \right|} \tag{2}
\]

where $\mathrm{disagree}_{ij}$ is the number of disagreements between the judgments $J_{s,w}$ that occur within label sequence $s$ and the judgments $J_{j,w}$, $S$ is the set of all possible label sequences $s$ with length $|s| = 2$ or $3$, and $f_{s,J_w}$ is the frequency at which label sequence $s$ occurs within worker $w$'s time-ordered judgments $J_w$.
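The following sketch shows how the RandomSpam score of equation (1) could be computed, assuming the judgments are stored as (worker, query-document pair, ordinal label) triples; the data layout and variable names are ours, not from [52], and the UniformSpam score of equation (2) would be computed analogously over consecutive label sequences.

```python
# A minimal sketch of the RandomSpam score of equation (1), assuming judgments are
# held as (worker, query_doc_pair, ordinal_label) triples.
from collections import defaultdict

def random_spam_scores(judgments):
    """judgments: list of (worker, pair_id, label) with ordinal integer labels."""
    by_pair = defaultdict(list)                     # pair_id -> [(worker, label), ...]
    for worker, pair, label in judgments:
        by_pair[pair].append((worker, label))

    num = defaultdict(float)                        # sum of squared ordinal distances
    den = defaultdict(int)                          # number of peer judgments compared
    for pair, labelled in by_pair.items():
        for worker, label in labelled:
            for other_worker, other_label in labelled:
                if other_worker != worker:
                    num[worker] += (label - other_label) ** 2
                    den[worker] += 1
    return {w: num[w] / den[w] for w in num if den[w] > 0}

data = [("w1", "q1-d1", 2), ("w2", "q1-d1", 2), ("w3", "q1-d1", 0),
        ("w1", "q1-d2", 1), ("w2", "q1-d2", 1), ("w3", "q1-d2", 3)]
print(random_spam_scores(data))   # w3 disagrees most with the others, so it gets the highest score
```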

Quality control in crowdsourcing is defined as the extent to which the provided outcome fulfills the requirements of the requester. Quality control approaches can be categorized into (i) design-time approaches, which are employed while the requester prepares the task before submission, and (ii) run-time approaches, which are used during or after completion of the tasks. Another classification of quality control methods is into (i) filtering workers and (ii) aggregating labels, both of which are discussed in this section [53]. In other words, filtering workers corresponds to design-time approaches, while label aggregation methods can be considered run-time approaches [27].

Table 4: Worker types based on their behavioural observation.

Diligent: completed tasks precisely, with high accuracy and a longer time spent on tasks.
Competent: skilled workers with high accuracy and fast work.
Sloppy: completed tasks quickly without considering quality.
Incompetent: completed tasks in time but with low accuracy.
Spammer: did not deliver useful work.

Table 5: Worker types based on their average precision.

Proper: completed tasks precisely.
Random spammer: gave worthless answers.
Semirandom spammer: answered incorrectly on most questions while answering correctly on a few, hoping to avoid detection as a spammer.
Uniform spammer: repeated answers.
Sloppy: not precise enough in their judgments.

6.1. Design-Time Methods. There are three approaches to selecting workers: (i) credential-based, (ii) reputation-based, and (iii) open-to-all. It is also possible to use a combination of the three approaches. The credential-based approach is used in systems where users are well profiled; however, this approach is not practical in crowdsourcing systems, since users are usually unknown. The reputation-based approach selects workers based on their reputation, as AMT does with the approval rate parameter. The open-to-all approach, such as in Wikipedia, allows any worker to contribute; it is easy to use and implement but increases the number of unreliable workers [27]. Requesters are also able to implement their own qualification methods [18] or combine different methods. In total, there are various methods of filtering workers, which identify the sloppy workers first and then exclude them. We have categorized these methods into five categories; Table 6 presents a list of them.

When using a qualification test, it is still possible that workers perform tasks carelessly. Moreover, researchers encounter two issues with qualification tests: (i) tasks with a qualification test may take longer to complete, as most workers prefer tasks without one, and (ii) the test must be developed and maintained continuously, which is costly. Honey pots, or gold standard data, are used when the correct answers are clear and predefined [54]. Honey pots are faster than a qualification test, since they make it easy and quick to identify workers who answer the questions at random. Combining a qualification test with honey pots is also possible for quality control [26].
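A minimal sketch of honey-pot filtering is given below: workers whose accuracy on the gold questions falls below a threshold are excluded. The 0.7 threshold and the data layout are illustrative assumptions rather than values prescribed by the cited studies.

```python
# Honey-pot (gold standard) filtering: keep only workers who answer enough
# of the gold questions correctly. The 0.7 threshold is an illustrative choice.

def filter_by_gold(answers, gold, min_accuracy=0.7):
    """answers: {worker: {question_id: label}}; gold: {question_id: correct_label}."""
    kept = set()
    for worker, labels in answers.items():
        gold_seen = [q for q in labels if q in gold]
        if not gold_seen:
            continue                                  # no gold questions answered: undecided
        accuracy = sum(labels[q] == gold[q] for q in gold_seen) / len(gold_seen)
        if accuracy >= min_accuracy:
            kept.add(worker)
    return kept

gold = {"g1": "relevant", "g2": "not_relevant"}
answers = {"w1": {"g1": "relevant", "g2": "not_relevant", "q5": "relevant"},
           "w2": {"g1": "not_relevant", "g2": "not_relevant"}}
print(filter_by_gold(answers, gold))   # {'w1'}: w2 missed half of the gold questions
```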

Qualification settings, such as filtering workers, especially based on origin, have an important impact on cheater rates; for instance, some assume that workers from developed countries have lower cheater rates [24]. Setting an approval rate is used by AMT: AMT provides a metric called the approval rate to prefilter workers, which represents the percentage of assignments the worker has performed and that were confirmed by the requesters over all tasks the worker has done. Generally, it is the total rating of each worker in the system. It is also possible to limit the tasks to master workers. In AMT, the group of people who accomplish HITs with a high degree of accuracy across a variety of requesters are called master workers. Master workers expect to receive higher payment, and limiting HITs to master workers yields better quality results [32]. If a high approval rate is required for quality control, the experiment takes longer to complete, since the eligible worker population is smaller [28]. In general, the settings should be chosen according to the task type: if the task is complicated, the approval rate and other settings should be set precisely, or the task should be assigned to master workers only; if the task is routine, ordinary workers with simple settings can be effective. Trap questions are also used to detect careless workers who do not read the instructions of the task carefully. Kazai et al. [25] designed two trap questions for this purpose: "Please tick here if you did NOT read the instructions" and, on each page, "I did not pay attention". Trap questions may not detect all unreliable workers, but they reveal strange behaviour and can be effective both in discouraging and in identifying spammers. In crowdsourcing there are also programs or bots designed to accomplish HITs automatically [55], and such data are certainly of poor quality [18]. CAPTCHAs and reCAPTCHA are used in crowdsourcing experiments to detect superficial workers and random clicking on answers [29, 32]; these methods are easy to use and inexpensive.

It is also possible to use a combination of the above methods to filter workers. For example, a real-time strategy applying different filtering methods was proposed for recruiting workers and monitoring the quality of their work: first, a qualification test was used to filter out sloppy workers, and the completion time of HITs was calculated to reflect the truthfulness of the workers; then, a set of gold questions was used to evaluate the workers' skills [56]. However, even with these filtering methods there is still the possibility of lower-quality workers being recruited to do the work. The quality control methods applied after filtering workers, at run time, are explained in the subsequent section.

Table 6: Design-time methods.

Qualification test: a set of questions which the workers must answer to qualify for doing the tasks. Platform: AMT. Example: IQ test [57].
Honey pots or gold standard data: predefined questions with known answers [54]; workers who answer these questions correctly are marked as appropriate for the task. Platform: Crowdflower [16].
Qualification settings: settings configured when creating HITs, such as the approval rate. Platforms: AMT, Crowdflower.
Trap questions: HITs are designed along with a set of questions with known answers to find unreliable workers [58].
CAPTCHAs and reCAPTCHA: CAPTCHAs are an antispamming technique that separates computers from humans in order to filter automatic answers; a text from a scanned page is used to identify whether a human is entering data or spam is trying to trick the web application, as only a human can pass the test [59]. reCAPTCHA is a development of CAPTCHAs [60].

6.2. Run-Time Methods. Although design-time techniques can improve quality, low-quality results may still occur because of misunderstandings while doing the tasks. Therefore, run-time techniques are vital for high-quality results; indeed, the quality of the results is increased by applying both approaches [27]. In crowdsourcing, collecting one judgment per example or task, known as single labeling, may save time and cost; however, the quality of the work then depends on a single individual's knowledge. To address this issue of single labeling, integrating the labels from multiple workers was introduced [61, 62]. When labels are noisy, multiple labels can be preferable to single labeling even when labels are not particularly cheap [63], and this leads to more accurate relevance judgments [64]. An important issue with multiple labels per example is how to aggregate the labels from various workers into a single consensus label accurately and efficiently. Run-time quality control methods are explained in Table 7.

MV is a better choice for routine, lower-paid tasks, since it is easy to implement and achieves reasonable results depending on the truthfulness of the workers [53]. One drawback of this method is that the consensus label is computed for a specific task without considering the accuracy of the workers on other tasks. Another drawback is that MV assumes all workers are equally good: for example, if there is only one expert and the others are novices, and the novices give the same inaccurate responses, MV takes the novices' answer as the correct one because they are the majority.
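A minimal majority-voting aggregator might look as follows; tie-breaking is left arbitrary here, whereas a real experiment would handle ties explicitly, for example by collecting an extra judgment.

```python
# Minimal majority-vote aggregation over multiple judgments per topic-document pair.
from collections import Counter

def majority_vote(labels_per_pair):
    """labels_per_pair: {pair_id: [label, label, ...]} -> {pair_id: consensus label}."""
    return {pair: Counter(labels).most_common(1)[0][0]
            for pair, labels in labels_per_pair.items()}

print(majority_vote({"q1-d1": ["rel", "rel", "nonrel"],
                     "q1-d2": ["nonrel", "nonrel", "rel"]}))
# {'q1-d1': 'rel', 'q1-d2': 'nonrel'}
```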

The outputs of the EM algorithm are a set of estimated correct answers for each task and a set of confusion matrices listing the workers' errors (since the replies might not be the correct answers). The error rate of each worker can be obtained from this confusion matrix. However, the error rate alone is not adequate to measure the quality of a worker, since workers may have completed the task carefully but with bias; for example, when labeling websites, parents with young children are more conservative in classifying the websites. To account for this, a single scalar score can be assigned to each worker based on the completed labels. These scores separate the error rate from worker bias and allow a fairer treatment of the workers. Approximately five labels per example are needed for this algorithm to become accurate [65]. More recently, Carpenter [66] and Raykar et al. [67] proposed Bayesian versions of the EM algorithm using the confusion matrix. Raykar et al. [67] proposed a probabilistic framework for the case of multiple labels with no gold standard: the algorithm repeatedly creates a specific gold standard, measures the performance of the workers, and then refines the gold standard according to the performance measurements. Hosseini et al. [64] evaluated the performance of EM and MV for aggregating labels in relevance assessment; their experimental results showed that EM is a better solution to problems related to the reliability of IR system rankings and relevance judgments, especially with a small number of labels and noisy judgments. Another study comparing MV and EM reported that EM outperformed MV when 60% of the workers were spammers [51]. A combination of these two methods could, however, produce more accurate relevance judgments.
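The sketch below illustrates the EM idea in the spirit of Dawid and Skene [69], alternating between estimating the probability that each pair is relevant and estimating each worker's quality; for brevity it keeps a single accuracy value per worker rather than the full confusion matrix described above, so it illustrates the principle rather than the exact algorithm.

```python
# Simplified EM-style aggregation with one accuracy value per worker (an illustration
# of the Dawid-Skene idea [69], not the exact algorithm with full confusion matrices).

def em_aggregate(judgments, iterations=20):
    """judgments: list of (worker, pair_id, label) with binary labels 0 or 1."""
    workers = {w for w, _, _ in judgments}
    pairs = {p for _, p, _ in judgments}
    accuracy = {w: 0.8 for w in workers}          # initial guess for worker quality
    belief = {p: 0.5 for p in pairs}              # current P(pair is relevant)

    for _ in range(iterations):
        # E-step: update the relevance belief of each pair from accuracy-weighted votes
        for p in pairs:
            pro, con = 1.0, 1.0
            for w, q, label in judgments:
                if q == p:
                    a = accuracy[w]
                    pro *= a if label == 1 else 1 - a
                    con *= 1 - a if label == 1 else a
            belief[p] = pro / (pro + con)
        # M-step: re-estimate each worker's accuracy against the current beliefs
        for w in workers:
            own = [(p, label) for ww, p, label in judgments if ww == w]
            est = sum(belief[p] if label == 1 else 1 - belief[p] for p, label in own) / len(own)
            accuracy[w] = min(0.99, max(0.01, est))   # keep away from 0/1 for stability

    return belief, accuracy

data = [("w1", "d1", 1), ("w2", "d1", 1), ("w3", "d1", 0),
        ("w1", "d2", 0), ("w2", "d2", 0), ("w3", "d2", 1)]
belief, accuracy = em_aggregate(data)
print({p: round(b, 2) for p, b in belief.items()})    # d1 judged relevant, d2 not
print({w: round(a, 2) for w, a in accuracy.items()})  # w3 ends up with a low accuracy estimate
```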

Considering the time taken to complete the task is an example of observing the pattern of responses, which has been used to detect the random answers of unreliable workers [30]. Since tasks that are completed very quickly are deemed to be of poor quality, completion time is a robust signal for detecting sloppy workers. Soleymani and Larson [75] used crowdsourcing for affective annotation of video in order to improve the performance of multimedia retrieval systems; the completion time of each HIT was compared to the video length in order to evaluate quality. In another study, the time each worker spent on a judgment was assessed in order to control quality [58]. Three types of patterns were found: (i) a normal pattern, whereby the workers begin slowly and get faster as they learn about the task; (ii) a periodic pattern, a peculiar behaviour in which some judgments are done fast and some slowly; and (iii) an interrupted pattern, which refers to interruption in the middle of doing tasks. This method, in combination with other quality control methods, can be effective in crowdsourcing experiments, since on its own it may not be able to detect all sloppy workers.
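A simple completion-time check along these lines could flag workers whose median time per HIT is far below that of the crowd; the cut-off factor below is a hypothetical choice, and such a flag should be combined with other signals, as noted above.

```python
# Flag workers whose median completion time is much lower than the crowd's median.
# The factor of 0.25 is an illustrative threshold, not a value from the cited studies.
from statistics import median

def flag_fast_workers(times_per_worker, fraction=0.25):
    """times_per_worker: {worker: [seconds spent on each HIT]}."""
    overall = median(t for times in times_per_worker.values() for t in times)
    return {w for w, times in times_per_worker.items()
            if median(times) < fraction * overall}

times = {"w1": [40, 55, 62], "w2": [45, 50, 48], "w3": [4, 5, 3]}
print(flag_fast_workers(times))   # {'w3'}: suspiciously fast compared to the crowd
```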


Table 7: Run-time methods.

Majority voting (MV): a straightforward and common method which eliminates wrong results by using the majority decision [31, 61, 68].
Expectation maximization (EM) algorithm: measures worker quality by estimating the correct answer for each task from the labels given by different workers, using maximum likelihood. The algorithm has two phases: (i) the correct answer is estimated for each task from the multiple labels submitted by different workers, accounting for the quality of each worker, and (ii) the submitted responses are compared to the inferred correct answer in order to estimate the quality of each worker [69].
Naive Bayes (NB): following EM, NB models the biases and reliability of individual workers and corrects them in order to improve the quality of the workers' results. Based on gold standard data, a small amount of training data labeled by experts is used to correct the individual biases of workers; the idea is to recalibrate workers' answers so that they better match the experts'. An average of four non-expert labels per example is needed to emulate expert-level label quality, which helps to improve annotation quality [68].
Observation of the pattern of responses: looking at the pattern of answers is another effective way of filtering unreliable responses, as some untrustworthy workers follow a regular pattern, for example selecting the first choice for every question.
Probabilistic matrix factorization (PMF): PMF induces a latent feature vector for each worker and example to infer unobserved worker assessments for all examples [70]. PMF is a standard method in collaborative filtering; crowdsourcing data are converted to collaborative filtering data to predict unlabeled labels from workers [71, 72].
Expert review: uses experts to evaluate workers [73].
Contributor evaluation: the workers are evaluated according to quality factors such as their reputation, experience, or credentials; if the workers have sufficient quality factors, the requester accepts their tasks. For instance, Wikipedia accepts articles written by administrators without evaluation [27], and tasks submitted by workers with a higher approval rate, or by master workers, would be accepted without doubt.
Real-time support: the requesters give workers feedback about the quality of their work in real time while the workers are completing the task; this helps workers to amend their work, and results showed that self-assessment and external feedback improve the quality of the task [48]. Another real-time approach was proposed by Kulkarni et al. [74], in which requesters can follow the workers' workflows while they solve the tasks; a tool called Turkomatic was presented which employs workers to do tasks for requesters, and while workers are doing the task, requesters are able to monitor the process and view the status of the task in real time.

Another quality control method, expert review, is commonly applied in practice, but its drawbacks are the high cost of employing experts and the time this consumes. This approach is not suitable for IR evaluation, since crowdsourcing is used precisely to lower the cost of hiring experts for creating relevance judgments; if the expert review method were used to aggregate judgments, the cost would increase again.

The drawbacks of using crowdsourcing to create relevance judgments are that (i) each worker judges only a small number of examples and (ii) to decrease cost, few judgments are collected per example. The judgments are therefore imbalanced and sparse. MV is vulnerable to this problem, while EM addresses it indirectly and PMF tackles it directly [70].
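The sketch below gives a small PMF-style aggregator in the spirit of [70, 71]: each worker and each topic-document pair receives a latent vector fitted to the observed judgments by gradient descent, and missing worker judgments are then predicted from the dot product. The latent dimensionality, learning rate, and regularization weight are illustrative choices, not parameters from the cited work.

```python
# PMF-style sketch: latent vectors for workers and topic-document pairs are fitted to
# the observed judgments; unobserved worker judgments are predicted from the dot product.
import random

def pmf(observed, workers, items, k=3, epochs=200, lr=0.05, reg=0.02, seed=0):
    """observed: list of (worker, item, rating); returns a predict(worker, item) function."""
    rng = random.Random(seed)
    W = {w: [rng.gauss(0, 0.1) for _ in range(k)] for w in workers}   # worker factors
    V = {i: [rng.gauss(0, 0.1) for _ in range(k)] for i in items}     # item factors
    for _ in range(epochs):
        for w, i, r in observed:
            pred = sum(a * b for a, b in zip(W[w], V[i]))
            err = r - pred
            for f in range(k):                      # gradient step with L2 regularization
                wf, vf = W[w][f], V[i][f]
                W[w][f] += lr * (err * vf - reg * wf)
                V[i][f] += lr * (err * wf - reg * vf)
    return lambda w, i: sum(a * b for a, b in zip(W[w], V[i]))

workers, items = ["w1", "w2", "w3"], ["d1", "d2", "d3"]
observed = [("w1", "d1", 1), ("w2", "d1", 1), ("w1", "d2", 0),
            ("w3", "d2", 0), ("w2", "d3", 1), ("w3", "d3", 1)]
predict = pmf(observed, workers, items)
print(round(predict("w3", "d1"), 2))   # inferred judgment for an unobserved worker-pair entry
```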

Different quality control methods applied in crowdsourcing experiments across different areas of study have been listed and discussed, covering both design-time and run-time approaches. Although both approaches are crucial to successful experiments, the important point is that the quality control method should be well suited to the crowdsourcing platform.

7. Conclusion and Future Work

Crowdsourcing is a novel phenomenon being applied in various fields of study. One of its applications is in IR evaluation experiments. Several recent studies show that crowdsourcing is a viable alternative for the creation of relevance judgments; however, it needs precise design and appropriate quality control methods to ensure trustworthy outcomes. This is particularly important since the workers on crowdsourcing platforms come from diverse cultural and linguistic backgrounds.

One important concern in crowdsourcing experiments is the possibility of untrustworthy workers trying to earn money without paying due attention to their tasks. Confusion or misunderstanding is another concern, which may arise from low-quality experimental designs. Consequently, quality control in crowdsourcing experiments is a vital element not only in the design phase but also in the run-time phase, and quality control and monitoring strategies should be considered and examined during all stages of the experiment. Moreover, the user interface and instructions should be comprehensible, motivating, and engaging to the users. The payment should be reasonable, and both monetary and nonmonetary rewards can be considered. Although there have been efforts in developing crowdsourcing experiments in the IR area in recent years, several limitations (some listed below) remain to be addressed in refining the various approaches:

(i) currently, a random topic-document pair is shown to the workers to judge; workers should be provided with the freedom to choose their topic of interest for relevance judgment;

(ii) creating relevance judgments should be scaled up to a larger set of topic-document pairs to find out whether crowdsourcing can replace hired assessors in creating relevance judgments;

(iii) a more precise grading system for workers than the approval rate should be designed, in which requesters can find out each worker's grade for each task type; for example, requesters would be able to know which workers have a good grade in creating relevance judgments;

(iv) requesters should be able to access reliable personal information about workers, such as expertise, language, and background, to decide which type of workers is allowed to accomplish the tasks;

(v) different aggregation methods for creating relevance judgments should be tried, and their influence on the correlation of the judgments with those of TREC experts evaluated.

Crowdsourcing is an exciting research area with several puzzles and challenges that require further research and investigation. This review has shed light on some aspects of using crowdsourcing in IR evaluation, with insight into issues related to crowdsourcing experiments at the main stages, such as design, implementation, and monitoring. Further research and development should be conducted to enhance the reliability and accuracy of relevance judgments in IR experimentation.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by High Impact Research Grant UM.C/625/1/HIR/MOHE/FCSIT/14 from the University of Malaya and the Ministry of Education, Malaysia.

References

[1] E. M. Voorhees, "The philosophy of information retrieval evaluation," in Evaluation of Cross-Language Information Retrieval Systems, pp. 355–370, Springer, 2002.

[2] C. Cleverdon, "The Cranfield tests on index language devices," Aslib Proceedings, vol. 19, no. 6, pp. 173–194, 1967.

[3] S. I. Moghadasi, S. D. Ravana, and S. N. Raman, "Low-cost evaluation techniques for information retrieval systems: a review," Journal of Informetrics, vol. 7, no. 2, pp. 301–312, 2013.

[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology Behind Search, Addison-Wesley, 2011.

[5] J. Howe, "The rise of crowdsourcing," Wired Magazine, vol. 14, no. 6, pp. 1–4, 2006.

[6] Y. Zhao and Q. Zhu, "Evaluation on crowdsourcing research: current status and future direction," Information Systems Frontiers, 2012.

[7] R. Munro, S. Bethard, V. Kuperman, et al., "Crowdsourcing and language studies: the new generation of linguistic data," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 122–130, Association for Computational Linguistics, 2010.

[8] V. Ambati, S. Vogel, and J. Carbonell, "Active learning and crowd-sourcing for machine translation," in Language Resources and Evaluation (LREC), vol. 7, pp. 2169–2174, 2010.

[9] C. Callison-Burch, "Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 286–295, Association for Computational Linguistics, August 2009.

[10] K. T. Stolee and S. Elbaum, "Exploring the use of crowdsourcing to support empirical studies in software engineering," in Proceedings of the 4th International Symposium on Empirical Software Engineering and Measurement (ESEM '10), no. 35, ACM, September 2010.

[11] D. R. Choffnes, F. E. Bustamante, and Z. Ge, "Crowdsourcing service-level network event monitoring," in Proceedings of the 7th International Conference on Autonomic Computing (SIGCOMM '10), pp. 387–398, ACM, September 2010.

[12] A. Brew, D. Greene, and P. Cunningham, "Using crowdsourcing and active learning to track sentiment in online media," in Proceedings of the Conference on European Conference on Artificial Intelligence (ECAI '10), pp. 145–150, 2010.

[13] R. Holley, "Crowdsourcing and social engagement: potential, power and freedom for libraries and users," 2009.

[14] D. C. Brabham, "Crowdsourcing the public participation process for planning projects," Planning Theory, vol. 8, no. 3, pp. 242–262, 2009.

[15] O. Alonso, D. E. Rose, and B. Stewart, "Crowdsourcing for relevance evaluation," ACM SIGIR Forum, vol. 42, no. 2, pp. 9–15, 2008.

[16] Crowdflower, http://www.crowdflower.com.

[17] G. Paolacci, J. Chandler, and P. G. Ipeirotis, "Running experiments on Amazon Mechanical Turk," Judgment and Decision Making, vol. 5, no. 5, pp. 411–419, 2010.

[18] W. Mason and S. Suri, "Conducting behavioral research on Amazon's Mechanical Turk," Behavior Research Methods, vol. 44, no. 1, pp. 1–23, 2012.

[19] Y. Pan and E. Blevis, "A survey of crowdsourcing as a means of collaboration and the implications of crowdsourcing for interaction design," in Proceedings of the 12th International Conference on Collaboration Technologies and Systems (CTS '11), pp. 397–403, May 2011.

[20] O. Alonso and S. Mizzaro, "Using crowdsourcing for TREC relevance assessment," Information Processing and Management, vol. 48, no. 6, pp. 1053–1066, 2012.


[21] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.

[22] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.

[23] K. Krippendorff, "Estimating the reliability, systematic error and random error of interval data," Educational and Psychological Measurement, vol. 30, no. 1, pp. 61–70, 1970.

[24] C. Eickhoff and A. P. de Vries, "Increasing cheat robustness of crowdsourcing tasks," Information Retrieval, vol. 16, no. 2, pp. 121–137, 2013.

[25] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling, "Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 205–214, ACM, July 2011.

[26] O. Alonso, "Implementing crowdsourcing-based relevance experimentation: an industrial perspective," Information Retrieval, vol. 16, no. 2, pp. 101–120, 2013.

[27] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, "Quality control in crowdsourcing systems: issues and directions," IEEE Internet Computing, vol. 17, no. 2, pp. 76–81, 2013.

[28] O. Alonso and R. Baeza-Yates, "Design and implementation of relevance assessments using crowdsourcing," in Advances in Information Retrieval, pp. 153–164, Springer, 2011.

[29] S. Khanna, A. Ratan, J. Davis, and W. Thies, "Evaluating and improving the usability of Mechanical Turk for low-income workers in India," in Proceedings of the 1st ACM Symposium on Computing for Development (DEV '10), no. 12, ACM, December 2010.

[30] A. Kittur, E. H. Chi, and B. Suh, "Crowdsourcing user studies with Mechanical Turk," in Proceedings of the 26th Annual CHI Conference on Human Factors in Computing Systems (CHI '08), pp. 453–456, ACM, April 2008.

[31] M. Hirth, T. Hoßfeld, and P. Tran-Gia, "Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms," Mathematical and Computer Modelling, vol. 57, no. 11-12, pp. 2918–2932, 2013.

[32] G. Kazai, J. Kamps, and N. Milic-Frayling, "An analysis of human factors and label accuracy in crowdsourcing relevance judgments," Information Retrieval, vol. 16, no. 2, pp. 138–178, 2013.

[33] D. E. Difallah, G. Demartini, and P. Cudre-Mauroux, "Mechanical cheat: spamming schemes and adversarial techniques on crowdsourcing platforms," in Proceedings of the CrowdSearch Workshop, Lyon, France, 2012.

[34] L. De Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler, "Reputation systems for open collaboration," Communications of the ACM, vol. 54, no. 8, pp. 81–87, 2011.

[35] G. Kazai, J. Kamps, and N. Milic-Frayling, "The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2583–2586, ACM, 2012.

[36] L. Hammon and H. Hippner, "Crowdsourcing," Wirtschaftsinf, vol. 54, no. 3, pp. 165–168, 2012.

[37] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson, "Who are the crowdworkers? Shifting demographics in Mechanical Turk," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 2863–2872, ACM, April 2010.

[38] P. Ipeirotis, "Demographics of Mechanical Turk," 2010.

[39] G. Kazai, "In search of quality in crowdsourcing for search engine evaluation," in Advances in Information Retrieval, pp. 165–176, Springer, 2011.

[40] M. Potthast, B. Stein, A. Barron-Cedeno, and P. Rosso, "An evaluation framework for plagiarism detection," in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997–1005, Association for Computational Linguistics, August 2010.

[41] W. Mason and D. J. Watts, "Financial incentives and the performance of crowds," ACM SIGKDD Explorations Newsletter, vol. 11, no. 2, pp. 100–108, 2010.

[42] J. Heer and M. Bostock, "Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 203–212, ACM, April 2010.

[43] C. Grady and M. Lease, "Crowdsourcing document relevance assessment with Mechanical Turk," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 172–179, Association for Computational Linguistics, 2010.

[44] J. Chen, N. Menezes, J. Bradley, and T. A. North, "Opportunities for crowdsourcing research on Amazon Mechanical Turk," Human Factors, vol. 5, no. 3, 2011.

[45] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut, "CrowdForge: crowdsourcing complex work," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), pp. 43–52, ACM, October 2011.

[46] P. Clough, M. Sanderson, J. Tang, T. Gollins, and A. Warner, "Examining the limits of crowdsourcing for relevance assessment," IEEE Internet Computing, vol. 17, no. 4, pp. 32–38, 2012.

[47] O. Scekic, H. L. Truong, and S. Dustdar, "Modeling rewards and incentive mechanisms for social BPM," in Business Process Management, pp. 150–155, Springer, 2012.

[48] S. P. Dow, B. Bunge, T. Nguyen, S. R. Klemmer, A. Kulkarni, and B. Hartmann, "Shepherding the crowd: managing and providing feedback to crowd workers," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1669–1674, ACM, May 2011.

[49] T. W. Malone, R. Laubacher, and C. Dellarocas, "Harnessing crowds: mapping the genome of collective intelligence," MIT Sloan School Working Paper 4732-09, 2009.

[50] G. Kazai, J. Kamps, and N. Milic-Frayling, "Worker types and personality traits in crowdsourcing relevance labels," in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1941–1944, ACM, October 2011.

[51] J. Vuurens, A. P. de Vries, and C. Eickhoff, "How much spam can you take? An analysis of crowdsourcing results to increase accuracy," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), pp. 21–26, 2011.

[52] J. B. P. Vuurens and A. P. de Vries, "Obtaining high-quality relevance judgments using crowdsourcing," IEEE Internet Computing, vol. 16, no. 5, pp. 20–27, 2012.

[53] W. Tang and M. Lease, "Semi-supervised consensus labeling for crowdsourcing," in Proceedings of the SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), 2011.

[54] J. Le, A. Edmonds, V. Hester, and L. Biewald, "Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 21–26, 2010.


[55] R. M. McCreadie, C. Macdonald, and I. Ounis, "Crowdsourcing a news query classification dataset," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 31–38, 2010.

[56] T. Xia, C. Zhang, J. Xie, and T. Li, "Real-time quality control for crowdsourcing relevance evaluation," in Proceedings of the 3rd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC '12), pp. 535–539, 2012.

[57] G. Zuccon, T. Leelanupab, S. Whiting, E. Yilmaz, J. M. Jose, and L. Azzopardi, "Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems," Information Retrieval, vol. 16, no. 2, pp. 267–305, 2013.

[58] D. Zhu and B. Carterette, "An analysis of assessor behavior in crowdsourced preference judgments," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 17–20, 2010.

[59] L. Von Ahn, M. Blum, N. J. Hopper, et al., "CAPTCHA: using hard AI problems for security," in Advances in Cryptology—EUROCRYPT 2003, pp. 294–311, Springer, 2003.

[60] L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, "reCAPTCHA: human-based character recognition via web security measures," Science, vol. 321, no. 5895, pp. 1465–1468, 2008.

[61] V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 614–622, ACM, August 2008.

[62] P. Welinder and P. Perona, "Online crowdsourcing: rating annotators and obtaining cost-effective labels," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Workshops (CVPRW '10), pp. 25–32, June 2010.

[63] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, "Repeated labeling using multiple noisy labelers," Data Mining and Knowledge Discovery, vol. 28, no. 2, pp. 402–441, 2014.

[64] M. Hosseini, I. J. Cox, N. Milic-Frayling, G. Kazai, and V. Vinay, "On aggregating labels from multiple crowd workers to infer relevance of documents," in Advances in Information Retrieval, pp. 182–194, Springer, 2012.

[65] P. G. Ipeirotis, F. Provost, and J. Wang, "Quality management on Amazon Mechanical Turk," in Proceedings of the Human Computation Workshop (HCOMP '10), pp. 64–67, July 2010.

[66] B. Carpenter, "Multilevel Bayesian models of categorical data annotation," 2008.

[67] V. C. Raykar, S. Yu, L. H. Zhao, et al., "Learning from crowds," The Journal of Machine Learning Research, vol. 11, pp. 1297–1322, 2010.

[68] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, "Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254–263, Association for Computational Linguistics, October 2008.

[69] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, vol. 28, pp. 20–28, 1979.

[70] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," Advances in Neural Information Processing Systems, vol. 20, pp. 1257–1264, 2008.

[71] H. J. Jung and M. Lease, "Inferring missing relevance judgments from crowd workers via probabilistic matrix factorization," in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1095–1096, ACM, 2012.

[72] H. J. Jung and M. Lease, "Improving quality of crowdsourced labels via probabilistic matrix factorization," in Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.

[73] A. J. Quinn and B. B. Bederson, "Human computation: a survey and taxonomy of a growing field," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1403–1412, ACM, May 2011.

[74] A. Kulkarni, M. Can, and B. Hartmann, "Collaboratively crowdsourcing workflows with Turkomatic," in Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW '12), pp. 1003–1012, February 2012.

[75] M. Soleymani and M. Larson, "Crowdsourcing for affective annotation of video: development of a viewer-reported boredom corpus," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 4–8, 2010.


8 The Scientific World Journal

Table 4 Worker types based on their behavioral observation

Workers DescriptionDiligent Completed tasks precisely with a high accuracy and longer time spent on tasksCompetent Skilled workers with high accuracy and fast workSloppy Completed tasks quickly without considering qualityIncompetent Completed tasks in a time with low accuracySpammer Did not deliver useful works

Table 5 Worker types based on their average precision

Workers DescriptionProper Completed tasks preciselyRandom spammer Gave a worthless answer

Semirandom spammer Answered incorrectly on most questions while answering correctly on few questions hoping to avoiddetection as a spammer

Uniform spammer Repeated answersSloppy Not precise enough in their judgments

and (iii) open-to-all It is also possible to use the combi-nation of the three approaches Credential-based approachis used in systems where users are well profiled Howeverthis approach is not practical in crowdsourcing systemssince users are usually unknown Reputation-based approachselects workers based on their reputation like AMT by usingapproval rate parameter Open-to-all approach such as inWikipedia allows any worker to contribute and it is easy touse and implement but increases the number of unreliableworkers [27] Requesters are also able to implement their ownqualification methods [18] or combine different methods Intotal there are various methods of filtering workers whichidentify the sloppy workers at first and then exclude themWe have categorized the methods into five categories Table 6presents a list of these methods

When using qualification test it is still possible that work-ers perform tasks carelessly Moreover there are two issuesthat researchers encounter with qualification test (i) taskswith qualification test may take longer to complete as mostworkers prefer taskswithout qualification test and (ii) the costof developing and maintaining the test continuously Honeypots or gold standard is used when the results are clear andpredefined [54] The honey pots are faster than qualificationtest as it is easy and quick to identify workers who answer thequestions at random Combining qualification test and honeypots is also possible for quality control [26]

Qualification settings such as filtering workers especiallybased on origin have an important impact on the cheaterrates For instance some assume thatworkers fromdevelopedcountries have lesser cheater rates [24] Setting approval ratehas been used by AMT AMT provides a metric called theapproval rate to prefilter the workers It represents percentageof assignments the worker has performed and confirmed bythe requester over all tasks the worker has done Generallyit is the total rating of each worker in the system It isalso possible to limit the tasks to the master workers InAMT the group of people who accomplish HITs with ahigh degree of accuracy across a variety of requesters arecalled master workers Master workers expect to receive

higher payment Limiting the HITs to the master workersgets better quality results [32] If the high approval rate isconsidered for quality control it would take a longer time tocomplete the experiment since the worker population maybe decreasing [28] In general the setting should be doneaccording to the task type If the task is complicated theapproval rate and other settings should be set precisely or thetask is assigned to master workers only If the task is routineordinary workers with simple setting can be effective Trapquestions are also used to detect careless workers who donot read the instructions of the task carefully Kazai et al[25] designed two trap questions to avoid this situationldquoPlease tick here if you did NOT read the instructionsrdquo andat each page ldquoI did not pay attentionrdquo All of the unreliableworkers may not be detected by trap questions but it showedstrange behavior and it can be effective both in discouragingand identifying spammers In crowdsourcing there are someprograms or bots designed to accomplish HITs automatically[55] and these kinds of data definitely are of poor quality [18]CAPTCHAs and reCAPTCHA are used in crowdsourcingexperiment to detect superficial workers and random clickingon answers [29 32] These methods are easy to use andinexpensive

It is also possible to use a combination of the abovemethods to filter workers For example a real time strategyapplying different methods to filter workers was proposedin recruiting workers and monitoring the quality of theirworks At first a qualification test was used to filter the sloppyworkers and the completion time of HITs was calculated toreflect on the truthfulness of the workers Then a set of goldquestions were used to evaluate the skills of workers [56]However by using these methods to filter workers there isthe possibility of lesser quality being recruited to do a workThe quality control methods used after filtering workers inrun time are explained in the subsequent section

62 Run-Time Methods Although design-time techniquescan intensify quality there is the possibility of low quality

The Scientific World Journal 9

Table 6 Design-time methods

Method Description Platform Example

Qualification test Qualification test is a set of questions which the workers must answerto qualify for doing the tasks AMT IQ test [57]

Honey pots or goldstandard data

Creating predefined questions with known answers is honey pots[54] If the workers answer these questions correctly they are markedas appropriate for this task

Crowdflower [16] mdash

Qualification settings Some qualification settings are set when creating HITs AMT Crowdflower Using approval rate

Trap questions This method is about designing HITs along with a set of questionswith known answers to find unreliable workers [58] mdash mdash

CAPTCHAs andreCAPTCHA

CAPTCHAs are an antispamming technique to separate computersand humans to filter automatic answers A text of the scanned page isused to identify whether human inputs data or spam is trying to trickthe web application as only a human can go through the test [59]reCAPTCHA is the development of CAPTCHAs [60]

mdash mdash

because of misunderstanding while doing the tasks There-fore run-time techniques are vital to high quality resultsIndeed quality of the results would be increased by applyingboth approaches [27] In crowdsourcing if we assume thatone judgment per each example or task is called single label-ing method the time and cost may be saved However thequality of work is dependent on an individualrsquos knowledgeand in order to solve this issue of single labeling methodsintegrating the labels from multiple workers was introduced[61 62] If labels are noisy multiple labels can be desirableto single labeling even in the former setting when labelsare not particularly low cost [63] It leads to having moreaccuracy for relevance judgments [64] An important issuerelated to multiple labels per example is how to aggregatelabels accurately and effieciently from various workers intoa single consensus label Run-time quality control methodsare explained in Table 7

TheMV is a better choice for routine tasks which are paida lower payment since it is easy to implement and achievereasonable results depending on truthfulness of the workers[53] The drawback of this method is that the consensuslabel is measured for a specific task without considering theaccuracy of the workers in other tasks Another drawbackof this method is that MV considers all workers are equallygood For example if there is only one expert and theothers are novices and the novices give the same inaccurateresponses the MV considers the novicesrsquo answer as thecorrect answer because they are the majority

In EM algorithm a set of estimated accurate answersfor each task and a set of matrixes that include the list ofworkers errors (in that the replies might not be the correctedanswers) are the outputs The error rate of each worker canbe accessed by this confusion matrix However to measurethe quality of a worker the error rate is not adequate sinceworkers may have completed the task carefully but withbias For example in labeling websites parents with youngchildren are more conservative in classifying the websites Toprevent this situation a single scalar score can be assignedto each worker corresponding to the completed labels Thescores lead to separation of error rate from worker bias andsatisfactory treatment of the workers Approximately five

labels are needed for each example for this algorithm for itto become accurate [65] Recently Carpenter [66] and Raykaret al [67] proposed a Bayesian version of the EM algorithmby using confusion matrix A probabilistic framework wasproposed by Raykar et al [67] in the case of no gold standardwith multiple labels A specific gold standard is createdrepeatedly by the proposed algorithm and the performancesof workers aremeasured and then the gold standard is refinedaccording to the performance measurements Hosseini et al[64] evaluated the performance of EM and MV for aggre-gating labels in relevance assessment His experiment resultsshowed that the EM method is a better solution to problemsrelated to the reliability of IR system ranking and relevancejudgment especially in the case of small number of labelsand noisy judgments Another study compared MV and EMmethods which stated that EM outperformedMV when 60of workers were spammers [51] However a combination ofthese two methods could produce more accurate relevancejudgments

Considering the time taken to complete the task is anexample of observation of the pattern of responses whichwas used to determine random answers of unreliable workers[30] As tasks that were completed fast deemed to havepoor quality completion time is a robust method of detect-ing the sloppy workers Soleymani and Larson [75] usedcrowdsourcing for affective annotation of video in order toimprove the performance of multimedia retrieval systemsThe completion time of each HIT was compared to videolength in order to evaluate quality In another study thetime that each worker spent on judgment was assessed inorder to control quality [58] Three types of patterns werefound (i) normal pattern whereby the workers begin slowlyand get faster when they learn about the task (ii) periodicpattern a peculiar behavior since some of the judgmentsare done fast and some slow and (iii) interrupted patternwhich refers to interruption in the middle of doing tasksThis method in combination with other methods of qualitycontrol can be effective in crowdsourcing experiments sincethis method alone may not be able to detect all sloppyworkers Another method of quality control expert reviewis commonly applied in practice but the drawbacks of this

10 The Scientific World Journal

Table 7 Run-time methods

Method Description

Majority voting (MV) MV is a straightforward and common method which eliminates the wrong results by using the majoritydecision [31 61 68]

Expectation maximization(EM) algorithm

EM algorithm measures the worker quality by estimating the accurate answer for each task through labelscompleted by different workers using maximum likelihood The algorithm has two phases (i) the correctanswer is estimated for each task through multiple labels submitted by different workers accounting forthe quality of each worker and (ii) comparing the assigned responses to the concluded accurate answer inorder to estimate quality of each worker [69]

Naive Bayes (NB)

Following EM NB is a method to model the biases and reliability of single workers and correct them inorder to intensify the quality of the workersrsquo results According to gold standard data a small amount oftraining data labeled by expert was used to correct the individual biases of workers The idea is torecalibrate answers of workers to be more matched with experts An average of four inexpert labels foreach example is needed to emulate expert level label quality This idea helps to improve annotation quality[68]

Observation of thepattern of responses

Looking at the pattern of answers is another effective way of filtering unreliable responses as someuntrustworthy workers have a regular pattern for example selecting the first choice of every question

Probabilistic matrixfactorization (PMF)

Using probabilistic matrix factorization (PMF) that induces a latent feature vector for each worker andexample to infer unobserved worker assessments for all examples [70] PMF is a standard method incollaborative filtering through converting crowdsourcing data to collaborative filtering data to predictunlabeled labels from workers [71 72]

Expert review Expert review uses experts to evaluate workers [73]

Contributor evaluation

The workers are evaluated according to quality factors such as their reputation experience or credentialsIf the workers have enough quality factors the requester would accept their tasks For instance Wikipediawould accept the article written by administrators without evaluation [27] For instance the tasks that aresubmitted by the workers who have a higher approval rate or master workers would be accepted withoutdoubt

Real-time support

The requesters give workers feedback about the quality of their work in real time while workers areaccomplishing the task This helps workers to amend their works and the results showed thatself-assessment and external feedback improve the quality of the task [48] Another real-time approachwas proposed by Kulkarni et al [74] where requesters can follow the workers workflows while solving thetasks A tool called Turkomatic was presented which employees workers to do tasks for requesters Whileworkers are doing the task requesters are able to monitor the process and view the status of the task in realtime

method are the high cost of employing experts and being timeconsuming This approach is not applicable in IR evaluationsince crowdsourcing is used to lower the cost of hiring expertsfor creating relevance judgment Therefore if the expertreviewmethod is used to aggregate judgments the cost wouldbe increased

The drawbacks of using crowdsourcing to create rele-vance judgment are (i) each worker judges a small numberof examples and (ii) to decrease cost few judgments arecollected per example Therefore the judgments are imbal-anced and sparse as each worker assesses a small number ofexamples The MV is vulnerable to this problem while EMindirectly addresses this problem and PMF tackles this issuedirectly [70]

Different quality control methods applied in crowd-sourcing experiment in different areas of the study werelisted and discussed both in design-time approaches andrun-time approaches Although successful experiments arecrucial to both approaches the important point is that qualitycontrol method should be well-suited to crowdsourcingplatform

7 Conclusion and Future Work

Crowdsourcing is a novel phenomenon being applied tovarious fields of studies One of the applications of thecrowdsourcing is in IR evaluation experiments Several recentstudies show that crowdsourcing is a viable alternative tothe creation of relevance judgment however it needs precisedesign and appropriate quality control methods to be certainabout the outcomes This is particularly imperative sincethe workers of crowdsourcing platforms come from diversecultural and linguistic backgrounds

One of the important concerns in crowdsourcing exper-iments is the possibility of untrustworthiness of workerstrying to earn money without paying due attention to theirtasks Confusion or misunderstanding in crowdsourcingexperiments is another concern that may happen due tolow quality in the experimental designs Conclusively thequality control in crowdsourcing experiments is not onlya vital element in the designing phase but also in run-time phase The quality control and monitoring strategiesshould be considered and examined during all stages of

The Scientific World Journal 11

the experimentsMoreover the user interface and instructionshould be comprehensible motivating and exciting to theusers The payment should be reasonable but monetary ornonmonetary rewards can be also considerable Althoughthere are efforts in developing crowdsourcing experiments inIR area in recent years there are several limitations (some arelisted below) pending to refine the various approaches

(i) currently a random topic-document is shown to theworkers to judgeworkers should be providedwith thefreedom to choose their topic of interest for relevancejudgment

(ii) to scale up creating relevance judgments for a largerset of topic-document to find out whether crowd-sourcing is a replacement for hired assessors in termsof creating relevance judgments

(iii) to design a more precise grading system for workersrather than approval rate where requesters can findout each workersrsquo grade in each task type For exam-ple the requesters are able to know which workershave a good grade in creating relevance judgments

(iv) to access reliable personal information of work-ers such as expertise language and backgroundby requesters to decide which type of workers areallowed to accomplish the tasks

(v) trying different aggregating methods in creating rel-evance judgments and evaluating the influence oncorrelation of judgments with TREC experts

Crowdsourcing is an exciting research area with severalpuzzles and challenges that require further researches andinvestigations This review shed a light on some aspectsof using crowdsourcing in IR evaluation with insight intoissues related to crowdsourcing experiments in some of themain stages such as design implementation andmonitoringHence further research and development should be con-ducted to enhance the reliability and accuracy of relevancejudgments in IR experimentation

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research was supported by High Impact ResearchGrant UMC6251HIRMOHEFCSIT14 from Universityof Malaya and Ministry of Education Malaysia

References

[1] E. M. Voorhees, "The philosophy of information retrieval evaluation," in Evaluation of Cross-Language Information Retrieval Systems, pp. 355–370, Springer, 2002.
[2] C. Cleverdon, "The Cranfield tests on index language devices," Aslib Proceedings, vol. 19, no. 6, pp. 173–194, 1967.
[3] S. I. Moghadasi, S. D. Ravana, and S. N. Raman, "Low-cost evaluation techniques for information retrieval systems: a review," Journal of Informetrics, vol. 7, no. 2, pp. 301–312, 2013.
[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology Behind Search, Addison-Wesley, 2011.
[5] J. Howe, "The rise of crowdsourcing," Wired Magazine, vol. 14, no. 6, pp. 1–4, 2006.
[6] Y. Zhao and Q. Zhu, "Evaluation on crowdsourcing research: current status and future direction," Information Systems Frontiers, 2012.
[7] R. Munro, S. Bethard, V. Kuperman et al., "Crowdsourcing and language studies: the new generation of linguistic data," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 122–130, Association for Computational Linguistics, 2010.
[8] V. Ambati, S. Vogel, and J. Carbonell, "Active learning and crowd-sourcing for machine translation," in Language Resources and Evaluation (LREC), vol. 7, pp. 2169–2174, 2010.
[9] C. Callison-Burch, "Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 286–295, Association for Computational Linguistics, August 2009.
[10] K. T. Stolee and S. Elbaum, "Exploring the use of crowdsourcing to support empirical studies in software engineering," in Proceedings of the 4th International Symposium on Empirical Software Engineering and Measurement (ESEM '10), no. 35, ACM, September 2010.
[11] D. R. Choffnes, F. E. Bustamante, and Z. Ge, "Crowdsourcing service-level network event monitoring," in Proceedings of the 7th International Conference on Autonomic Computing (SIGCOMM '10), pp. 387–398, ACM, September 2010.
[12] A. Brew, D. Greene, and P. Cunningham, "Using crowdsourcing and active learning to track sentiment in online media," in Proceedings of the European Conference on Artificial Intelligence (ECAI '10), pp. 145–150, 2010.
[13] R. Holley, "Crowdsourcing and social engagement: potential, power and freedom for libraries and users," 2009.
[14] D. C. Brabham, "Crowdsourcing the public participation process for planning projects," Planning Theory, vol. 8, no. 3, pp. 242–262, 2009.
[15] O. Alonso, D. E. Rose, and B. Stewart, "Crowdsourcing for relevance evaluation," ACM SIGIR Forum, vol. 42, no. 2, pp. 9–15, 2008.
[16] Crowdflower, http://www.crowdflower.com.
[17] G. Paolacci, J. Chandler, and P. G. Ipeirotis, "Running experiments on Amazon Mechanical Turk," Judgment and Decision Making, vol. 5, no. 5, pp. 411–419, 2010.
[18] W. Mason and S. Suri, "Conducting behavioral research on Amazon's Mechanical Turk," Behavior Research Methods, vol. 44, no. 1, pp. 1–23, 2012.
[19] Y. Pan and E. Blevis, "A survey of crowdsourcing as a means of collaboration and the implications of crowdsourcing for interaction design," in Proceedings of the 12th International Conference on Collaboration Technologies and Systems (CTS '11), pp. 397–403, May 2011.
[20] O. Alonso and S. Mizzaro, "Using crowdsourcing for TREC relevance assessment," Information Processing and Management, vol. 48, no. 6, pp. 1053–1066, 2012.
[21] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[22] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.
[23] K. Krippendorff, "Estimating the reliability, systematic error and random error of interval data," Educational and Psychological Measurement, vol. 30, no. 1, pp. 61–70, 1970.
[24] C. Eickhoff and A. P. de Vries, "Increasing cheat robustness of crowdsourcing tasks," Information Retrieval, vol. 16, no. 2, pp. 121–137, 2013.
[25] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling, "Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 205–214, ACM, July 2011.
[26] O. Alonso, "Implementing crowdsourcing-based relevance experimentation: an industrial perspective," Information Retrieval, vol. 16, no. 2, pp. 101–120, 2013.
[27] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, "Quality control in crowdsourcing systems: issues and directions," IEEE Internet Computing, vol. 17, no. 2, pp. 76–81, 2013.
[28] O. Alonso and R. Baeza-Yates, "Design and implementation of relevance assessments using crowdsourcing," in Advances in Information Retrieval, pp. 153–164, Springer, 2011.
[29] S. Khanna, A. Ratan, J. Davis, and W. Thies, "Evaluating and improving the usability of Mechanical Turk for low-income workers in India," in Proceedings of the 1st ACM Symposium on Computing for Development (DEV '10), no. 12, ACM, December 2010.
[30] A. Kittur, E. H. Chi, and B. Suh, "Crowdsourcing user studies with Mechanical Turk," in Proceedings of the 26th Annual CHI Conference on Human Factors in Computing Systems (CHI '08), pp. 453–456, ACM, April 2008.
[31] M. Hirth, T. Hoßfeld, and P. Tran-Gia, "Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms," Mathematical and Computer Modelling, vol. 57, no. 11-12, pp. 2918–2932, 2013.
[32] G. Kazai, J. Kamps, and N. Milic-Frayling, "An analysis of human factors and label accuracy in crowdsourcing relevance judgments," Information Retrieval, vol. 16, no. 2, pp. 138–178, 2013.
[33] D. E. Difallah, G. Demartini, and P. Cudre-Mauroux, "Mechanical cheat: spamming schemes and adversarial techniques on crowdsourcing platforms," in Proceedings of the CrowdSearch Workshop, Lyon, France, 2012.
[34] L. De Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler, "Reputation systems for open collaboration," Communications of the ACM, vol. 54, no. 8, pp. 81–87, 2011.
[35] G. Kazai, J. Kamps, and N. Milic-Frayling, "The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2583–2586, ACM, 2012.
[36] L. Hammon and H. Hippner, "Crowdsourcing," Wirtschaftsinformatik, vol. 54, no. 3, pp. 165–168, 2012.
[37] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson, "Who are the crowdworkers? Shifting demographics in Mechanical Turk," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 2863–2872, ACM, April 2010.
[38] P. Ipeirotis, "Demographics of Mechanical Turk," 2010.
[39] G. Kazai, "In search of quality in crowdsourcing for search engine evaluation," in Advances in Information Retrieval, pp. 165–176, Springer, 2011.
[40] M. Potthast, B. Stein, A. Barron-Cedeno, and P. Rosso, "An evaluation framework for plagiarism detection," in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997–1005, Association for Computational Linguistics, August 2010.
[41] W. Mason and D. J. Watts, "Financial incentives and the performance of crowds," ACM SIGKDD Explorations Newsletter, vol. 11, no. 2, pp. 100–108, 2010.
[42] J. Heer and M. Bostock, "Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 203–212, ACM, April 2010.
[43] C. Grady and M. Lease, "Crowdsourcing document relevance assessment with Mechanical Turk," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 172–179, Association for Computational Linguistics, 2010.
[44] J. Chen, N. Menezes, J. Bradley, and T. A. North, "Opportunities for crowdsourcing research on Amazon Mechanical Turk," Human Factors, vol. 5, no. 3, 2011.
[45] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut, "CrowdForge: crowdsourcing complex work," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), pp. 43–52, ACM, October 2011.
[46] P. Clough, M. Sanderson, J. Tang, T. Gollins, and A. Warner, "Examining the limits of crowdsourcing for relevance assessment," IEEE Internet Computing, vol. 17, no. 4, pp. 32–38, 2012.
[47] O. Scekic, H. L. Truong, and S. Dustdar, "Modeling rewards and incentive mechanisms for social BPM," in Business Process Management, pp. 150–155, Springer, 2012.
[48] S. P. Dow, B. Bunge, T. Nguyen, S. R. Klemmer, A. Kulkarni, and B. Hartmann, "Shepherding the crowd: managing and providing feedback to crowd workers," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1669–1674, ACM, May 2011.
[49] T. W. Malone, R. Laubacher, and C. Dellarocas, "Harnessing crowds: mapping the genome of collective intelligence," MIT Sloan School Working Paper 4732-09, 2009.
[50] G. Kazai, J. Kamps, and N. Milic-Frayling, "Worker types and personality traits in crowdsourcing relevance labels," in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1941–1944, ACM, October 2011.
[51] J. Vuurens, A. P. de Vries, and C. Eickhoff, "How much spam can you take? An analysis of crowdsourcing results to increase accuracy," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), pp. 21–26, 2011.
[52] J. B. P. Vuurens and A. P. de Vries, "Obtaining high-quality relevance judgments using crowdsourcing," IEEE Internet Computing, vol. 16, no. 5, pp. 20–27, 2012.
[53] W. Tang and M. Lease, "Semi-supervised consensus labeling for crowdsourcing," in Proceedings of the SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), 2011.
[54] J. Le, A. Edmonds, V. Hester, and L. Biewald, "Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 21–26, 2010.
[55] R. M. McCreadie, C. Macdonald, and I. Ounis, "Crowdsourcing a news query classification dataset," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 31–38, 2010.
[56] T. Xia, C. Zhang, J. Xie, and T. Li, "Real-time quality control for crowdsourcing relevance evaluation," in Proceedings of the 3rd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC '12), pp. 535–539, 2012.
[57] G. Zuccon, T. Leelanupab, S. Whiting, E. Yilmaz, J. M. Jose, and L. Azzopardi, "Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems," Information Retrieval, vol. 16, no. 2, pp. 267–305, 2013.
[58] D. Zhu and B. Carterette, "An analysis of assessor behavior in crowdsourced preference judgments," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 17–20, 2010.
[59] L. Von Ahn, M. Blum, N. J. Hopper et al., "CAPTCHA: using hard AI problems for security," in Advances in Cryptology – EUROCRYPT 2003, pp. 294–311, Springer, 2003.
[60] L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, "reCAPTCHA: human-based character recognition via web security measures," Science, vol. 321, no. 5895, pp. 1465–1468, 2008.
[61] V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 614–622, ACM, August 2008.
[62] P. Welinder and P. Perona, "Online crowdsourcing: rating annotators and obtaining cost-effective labels," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '10), pp. 25–32, June 2010.
[63] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, "Repeated labeling using multiple noisy labelers," Data Mining and Knowledge Discovery, vol. 28, no. 2, pp. 402–441, 2014.
[64] M. Hosseini, I. J. Cox, N. Milic-Frayling, G. Kazai, and V. Vinay, "On aggregating labels from multiple crowd workers to infer relevance of documents," in Advances in Information Retrieval, pp. 182–194, Springer, 2012.
[65] P. G. Ipeirotis, F. Provost, and J. Wang, "Quality management on Amazon Mechanical Turk," in Proceedings of the Human Computation Workshop (HCOMP '10), pp. 64–67, July 2010.
[66] B. Carpenter, "Multilevel Bayesian models of categorical data annotation," 2008.
[67] V. C. Raykar, S. Yu, L. H. Zhao et al., "Learning from crowds," The Journal of Machine Learning Research, vol. 11, pp. 1297–1322, 2010.
[68] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, "Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254–263, Association for Computational Linguistics, October 2008.
[69] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, vol. 28, pp. 20–28, 1979.
[70] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," Advances in Neural Information Processing Systems, vol. 20, pp. 1257–1264, 2008.
[71] H. J. Jung and M. Lease, "Inferring missing relevance judgments from crowd workers via probabilistic matrix factorization," in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1095–1096, ACM, 2012.
[72] H. J. Jung and M. Lease, "Improving quality of crowdsourced labels via probabilistic matrix factorization," in Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.
[73] A. J. Quinn and B. B. Bederson, "Human computation: a survey and taxonomy of a growing field," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1403–1412, ACM, May 2011.
[74] A. Kulkarni, M. Can, and B. Hartmann, "Collaboratively crowdsourcing workflows with Turkomatic," in Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW '12), pp. 1003–1012, February 2012.
[75] M. Soleymani and M. Larson, "Crowdsourcing for affective annotation of video: development of a viewer-reported boredom corpus," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 4–8, 2010.



One of the important concerns in crowdsourcing exper-iments is the possibility of untrustworthiness of workerstrying to earn money without paying due attention to theirtasks Confusion or misunderstanding in crowdsourcingexperiments is another concern that may happen due tolow quality in the experimental designs Conclusively thequality control in crowdsourcing experiments is not onlya vital element in the designing phase but also in run-time phase The quality control and monitoring strategiesshould be considered and examined during all stages of

The Scientific World Journal 11

the experimentsMoreover the user interface and instructionshould be comprehensible motivating and exciting to theusers The payment should be reasonable but monetary ornonmonetary rewards can be also considerable Althoughthere are efforts in developing crowdsourcing experiments inIR area in recent years there are several limitations (some arelisted below) pending to refine the various approaches

(i) currently a random topic-document is shown to theworkers to judgeworkers should be providedwith thefreedom to choose their topic of interest for relevancejudgment

(ii) to scale up creating relevance judgments for a largerset of topic-document to find out whether crowd-sourcing is a replacement for hired assessors in termsof creating relevance judgments

(iii) to design a more precise grading system for workersrather than approval rate where requesters can findout each workersrsquo grade in each task type For exam-ple the requesters are able to know which workershave a good grade in creating relevance judgments

(iv) to access reliable personal information of work-ers such as expertise language and backgroundby requesters to decide which type of workers areallowed to accomplish the tasks

(v) trying different aggregating methods in creating rel-evance judgments and evaluating the influence oncorrelation of judgments with TREC experts

Crowdsourcing is an exciting research area with severalpuzzles and challenges that require further researches andinvestigations This review shed a light on some aspectsof using crowdsourcing in IR evaluation with insight intoissues related to crowdsourcing experiments in some of themain stages such as design implementation andmonitoringHence further research and development should be con-ducted to enhance the reliability and accuracy of relevancejudgments in IR experimentation

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research was supported by High Impact ResearchGrant UMC6251HIRMOHEFCSIT14 from Universityof Malaya and Ministry of Education Malaysia

References

[1] E M Voorhees ldquoThe philosophy of information retrieval eval-uationrdquo in Evaluation of Cross-Language Information RetrievalSystems pp 355ndash370 Springer 2002

[2] C Cleverdon ldquoThe Cranfield tests on index language devicesrdquoAslib Proceedings vol 19 no 6 pp 173ndash194 1967

[3] S I Moghadasi S D Ravana and S N Raman ldquoLow-cost evaluation techniques for information retrieval systems areviewrdquo Journal of Informetrics vol 7 no 2 pp 301ndash312 2013

[4] R Baeza-Yates and B Ribeiro-Neto Modern InformationRetrieval the Concepts and Technology Behind Search Addison-Wesley 2011

[5] J Howe ldquoThe rise of crowdsourcingrdquo Wired Magazine vol 14no 6 pp 1ndash4 2006

[6] Y Zhao and Q Zhu ldquoEvaluation on crowdsourcing researchcurrent status and future directionrdquo Information Systems Fron-tiers 2012

[7] R Munro S Bethard V Kuperman et al ldquoCrowdsourcing andlanguage studies the new generation of linguistic datardquo inProceedings of the Workshop on Creating Speech and LanguageData with Amazonrsquos Mechanical Turk pp 122ndash130 Associationfor Computational Linguistics 2010

[8] V Ambati S Vogel and J Carbonell ldquoActive learning andcrowd-sourcing formachine translationrdquo inLanguage Resourcesand Evaluation (LREC) vol 7 pp 2169ndash2174 2010

[9] C Callison-Burch ldquoFast cheap and creative evaluating trans-lation quality using amazonrsquos mechanical turkrdquo in Proceedingsof the Conference on Empirical Methods in Natural LanguageProcessing vol 1 pp 286ndash295 Association for ComputationalLinguistics August 2009

[10] K T Stolee and S Elbaum ldquoExploring the use of crowdsourc-ing to support empirical studies in software engineeringrdquo inProceedings of the 4th International Symposium on EmpiricalSoftware Engineering and Measurement (ESEM rsquo10) no 35ACM September 2010

[11] D R Choffnes F E Bustamante and Z Ge ldquoCrowdsourcingservice-level network event monitoringrdquo in Proceedings of the7th International Conference on Autonomic Computing (SIG-COMM rsquo10) pp 387ndash398 ACM September 2010

[12] A Brew D Greene and P Cunningham ldquoUsing crowdsourcingand active learning to track sentiment in online mediardquo in Pro-ceedings of the Conference on European Conference on ArtificialIntelligence (ECAI rsquo10) pp 145ndash150 2010

[13] R Holley ldquoCrowdsourcing and social engagement potentialpower and freedom for libraries and usersrdquo 2009

[14] D C Brabham ldquoCrowdsourcing the public participation pro-cess for planning projectsrdquo Planning Theory vol 8 no 3 pp242ndash262 2009

[15] O Alonso D E Rose and B Stewart ldquoCrowdsourcing forrelevance evaluationrdquo ACM SIGIR Forum vol 42 no 2 pp 9ndash15 2008

[16] Crowdflower httpshttpwwwcrowdflowercom[17] G Paolacci J Chandler and P G Ipeirotis ldquoRunning exper-

iments on amazon mechanical turkrdquo Judgment and DecisionMaking vol 5 no 5 pp 411ndash419 2010

[18] W Mason and S Suri ldquoConducting behavioral research onamazonrsquos mechanical turkrdquo Behavior Research Methods vol 44no 1 pp 1ndash23 2012

[19] Y Pan and E Blevis ldquoA survey of crowdsourcing as a meansof collaboration and the implications of crowdsourcing forinteraction designrdquo in Proceedings of the 12th InternationalConference on Collaboration Technologies and Systems (CTS rsquo11)pp 397ndash403 May 2011

[20] O Alonso and S Mizzaro ldquoUsing crowdsourcing for TRECrelevance assessmentrdquo Information Processing andManagementvol 48 no 6 pp 1053ndash1066 2012

12 The Scientific World Journal

[21] J Cohen ldquoA coefficient of agreement for nominal scalesrdquoEducational and Psychological Measurement vol 20 no 1 pp37ndash46 1960

[22] J L Fleiss ldquoMeasuring nominal scale agreement among manyratersrdquo Psychological Bulletin vol 76 no 5 pp 378ndash382 1971

[23] K Krippendorff ldquoEstimating the reliability systematic errorand random error of interval datardquo Educational and Psycholog-ical Measurement vol 30 no 1 pp 61ndash70 1970

[24] C Eickhoff and A P de Vries ldquoIncreasing cheat robustness ofcrowdsourcing tasksrdquo Information Retrieval vol 16 no 2 pp121ndash137 2013

[25] G Kazai J Kamps M Koolen and N Milic-Frayling ldquoCrowd-sourcing for book search evaluation Impact of HIT design oncomparative system rankingrdquo in Proceedings of the 34th Interna-tional ACM SIGIR Conference on Research and Development inInformation Retrieval (SIGIR rsquo11) pp 205ndash214 ACM July 2011

[26] O Alonso ldquoImplementing crowdsourcing-based relevanceexperimentation an industrial perspectiverdquo InformationRetrieval vol 16 no 2 pp 101ndash120 2013

[27] M Allahbakhsh B Benatallah A Ignjatovic H R Motahari-Nezhad E Bertino and S Dustdar ldquoQuality control in crowd-sourcing systems issues and directionsrdquo IEEE Internet Comput-ing vol 17 no 2 pp 76ndash81 2013

[28] O Alonso and R Baeza-Yates ldquoDesign and implementationof relevance assessments using crowdsourcingrdquo in Advances inInformation Retrieval pp 153ndash164 Springer 2011

[29] S Khanna A Ratan J Davis and W Thies ldquoEvaluating andimproving the usability of mechanical turk for low-incomeworkers in Indiardquo in Proceedings of the 1st ACM Symposium onComputing for Development (DEV rsquo10) no 12 ACM December2010

[30] A Kittur E H Chi and B Suh ldquoCrowdsourcing user studieswith mechanical turkrdquo in Proceedings of the 26th Annual CHIConference on Human Factors in Computing Systems (CHI rsquo08)pp 453ndash456 ACM April 2008

[31] M Hirth T Hoszligfeld and P Tran-Gia ldquoAnalyzing costs andaccuracy of validation mechanisms for crowdsourcing plat-formsrdquoMathematical and ComputerModelling vol 57 no 11-12pp 2918ndash2932 2013

[32] G Kazai J Kamps and N Milic-Frayling ldquoAn analysis ofhuman factors and label accuracy in crowdsourcing relevancejudgmentsrdquo Information Retrieval vol 16 no 2 pp 138ndash1782013

[33] D E Difallah G Demartini and P Cudre-Mauroux ldquoMechan-ical cheat spamming schemes and adversarial techniques oncrowdsourcing platformsrdquo in Proceedings of the CrowdSearchWorkshop Lyon France 2012

[34] L De Alfaro A Kulshreshtha I Pye and B T Adler ldquoRepu-tation systems for open collaborationrdquo Communications of theACM vol 54 no 8 pp 81ndash87 2011

[35] G Kazai J Kamps andNMilic-Frayling ldquoThe face of quality incrowdsourcing relevance labels demographics personality andlabeling accuracyrdquo in Proceedings of the 21st ACM InternationalConference on Information and Knowledge Management pp2583ndash2586 ACM 2012

[36] L Hammon and H Hippner ldquoCrowdsourcingrdquo Wirtschaftsinfvol 54 no 3 pp 165ndash168 2012

[37] J Ross L Irani M S Silberman A Zaldivar and B Tom-linson ldquoWho are the crowdworkers Shifting demographicsin mechanical turkrdquo in Proceedings of the 28th Annual CHIConference on Human Factors in Computing Systems (CHI rsquo10)pp 2863ndash2872 ACM April 2010

[38] P Ipeirotis ldquoDemographics of mechanical turkrdquo 2010[39] G Kazai ldquoIn search of quality in crowdsourcing for search

engine evaluationrdquo in Advances in Information Retrieval pp165ndash176 Springer 2011

[40] M Potthast B Stein A Barron-Cedeno and P Rosso ldquoAnevaluation framework for plagiarism detectionrdquo in Proceedingsof the 23rd International Conference on Computational Linguis-tics pp 997ndash1005 Association for Computational LinguisticsAugust 2010

[41] W Mason and D J Watts ldquoFinancial incentives and the perfor-mance of crowdsrdquo ACM SIGKDD Explorations Newsletter vol11 no 2 pp 100ndash108 2010

[42] J Heer and M Bostock ldquoCrowdsourcing graphical perceptionusing mechanical turk to assess visualization designrdquo in Pro-ceedings of the 28th Annual CHI Conference on Human Factorsin Computing Systems (CHI rsquo10) pp 203ndash212 ACM April 2010

[43] C Grady and M Lease ldquoCrowdsourcing document relevanceassessment with mechanical turkrdquo in Proceedings of the Work-shop on Creating Speech and Language Data with AmazonrsquosMechanical Turk pp 172ndash179 Association for ComputationalLinguistics 2010

[44] J Chen N Menezes J Bradley and T A North ldquoOpportuni-ties for crowdsourcing research on amazon mechanical turkrdquoHuman Factors vol 5 no 3 2011

[45] A Kittur B Smus S Khamkar and R E Kraut ldquoCrowd-Forge crowdsourcing complex workrdquo in Proceedings of the24th Annual ACM Symposium on User Interface Software andTechnology (UIST rsquo11) pp 43ndash52 ACM October 2011

[46] P Clough M Sanderson J Tang T Gollins and A WarnerldquoExamining the limits of crowdsourcing for relevance assess-mentrdquo IEEE Internet Computing vol 17 no 4 pp 32ndash338 2012

[47] O Scekic H L Truong and S Dustdar ldquoModeling rewardsand incentive mechanisms for social BPMrdquo in Business ProcessManagement pp 150ndash155 Springer 2012

[48] S P Dow B Bunge T Nguyen S R Klemmer A Kulkarniand B Hartmann ldquoShepherding the crowd managing andproviding feedback to crowd workersrdquo in Proceedings of the29th Annual CHI Conference on Human Factors in ComputingSystems (CHI rsquo11) pp 1669ndash1674 ACM May 2011

[49] T W Malone R Laubacher and C Dellarocas ldquoHarnessingcrowds mapping the genome of collective intelligencerdquo MITSloan School Working Paper 4732-09 2009

[50] G Kazai J Kamps and N Milic-Frayling ldquoWorker types andpersonality traits in crowdsourcing relevance labelsrdquo inProceed-ings of the 20th ACMConference on Information and KnowledgeManagement (CIKM rsquo11) pp 1941ndash1944 ACM October 2011

[51] J Vuurens A P de Vries and C Eickhoff ldquoHow much spamcan you take an analysis of crowdsourcing results to increaseaccuracyrdquo in Proceedings of the ACM SIGIR Workshop onCrowdsourcing for Information Retrieval (CIR rsquo11) pp 21ndash262011

[52] J B P Vuurens and A P de Vries ldquoObtaining high-qualityrelevance judgments using crowdsourcingrdquo IEEE Internet Com-puting vol 16 no 5 pp 20ndash27 2012

[53] W Tang and M Lease ldquoSemi-supervised consensus labelingfor crowdsourcingrdquo in Proceedings of the SIGIR Workshop onCrowdsourcing for Information Retrieval (CIR rsquo11) 2011

[54] J Le A Edmonds V Hester and L Biewald ldquoEnsuring qualityin crowdsourced search relevance evaluation the effects oftraining question distributionrdquo in Proceedings of the SIGIRWorkshop on Crowdsourcing for Search Evaluation pp 21ndash262010

The Scientific World Journal 13

[55] RMMcCreadie CMacdonald and I Ounis ldquoCrowdsourcinga news query classification datasetrdquo in Proceedings of the SIGIRWorkshop on Crowdsourcing for Search Evaluation (CSE rsquo10) pp31ndash38 2010

[56] T Xia C Zhang J Xie and T Li ldquoReal-time quality control forcrowdsourcing relevance evaluationrdquo in Proceedings of the 3rdIEEE International Conference on Network Infrastructure andDigital Content (IC-NIDC rsquo12) pp 535ndash539 2012

[57] G Zuccon T Leelanupab S Whiting E Yilmaz J M Jose andL Azzopardi ldquoCrowdsourcing interactions using crowdsourc-ing for evaluating interactive information retrieval systemsrdquoInformation Retrieval vol 16 no 2 pp 267ndash305 2013

[58] D Zhu and B Carterette ldquoAn analysis of assessor behaviorin crowdsourced preference judgmentsrdquo in Proceedings of theSIGIRWorkshop onCrowdsourcing for Search Evaluation pp 17ndash20 2010

[59] L Von Ahn M Blum N J Hopper et al ldquoCAPTCHA usinghard AI problems for securityrdquo in Advances in CryptologymdashEUROCRYPT 2003 pp 294ndash311 Springer 2003

[60] L Von Ahn B Maurer C McMillen D Abraham and MBlum ldquoreCAPTCHA human-based character recognition viaweb security measuresrdquo Science vol 321 no 5895 pp 1465ndash1468 2008

[61] V S Sheng F Provost and P G Ipeirotis ldquoGet another labelImproving data quality and data mining using multiple noisylabelersrdquo in Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining (KDDrsquo08) pp 614ndash622 ACM August 2008

[62] P Welinder and P Perona ldquoOnline crowdsourcing ratingannotators and obtaining cost-effective labelsrdquo in Proceedings ofthe IEEE Computer Society Conference on Computer Vision andPattern RecognitionmdashWorkshops (CVPRW rsquo10) pp 25ndash32 June2010

[63] P G Ipeirotis F Provost V S Sheng and J Wang ldquoRepeatedlabeling usingmultiple noisy labelersrdquoDataMining and Knowl-edge Discovery vol 28 no 2 pp 402ndash441 2014

[64] M Hosseini I J Cox NMilic-Frayling G Kazai and V VinayldquoOn aggregating labels from multiple crowd workers to inferrelevance of documentsrdquo in Advances in Information Retrievalpp 182ndash194 Springer 2012

[65] P G Ipeirotis F Provost and J Wang ldquoQuality managementon amazon mechanical turkrdquo in Proceedings of the HumanComputation Workshop (HCOMP rsquo10) pp 64ndash67 July 2010

[66] B Carpenter ldquoMultilevel bayesian models of categorical dataannotationrdquo 2008

[67] V C Raykar S Yu L H Zhao et al ldquoLearning from crowdsrdquoThe Journal of Machine Learning Research vol 11 pp 1297ndash13222010

[68] R Snow B OrsquoConnor D Jurafsky and A Y Ng ldquoCheap andfastmdashbut is it good Evaluating non-expert annotations fornatural language tasksrdquo in Proceedings of the Conference onEmpiricalMethods in Natural Language Processing pp 254ndash263Association for Computational Linguistics October 2008

[69] A P Dawid and A M Skene ldquoMaximum likelihood estimationof observer error-rates using the EM algorithmrdquo Applied Statis-tics vol 28 pp 20ndash28 1979

[70] R Salakhutdinov and A Mnih ldquoProbabilistic matrix factoriza-tionrdquo Advances in Neural Information Processing Systems vol20 pp 1257ndash1264 2008

[71] H J Jung andM Lease ldquoInferringmissing relevance judgmentsfrom crowd workers via probabilistic matrix factorizationrdquo in

Proceedings of the 35th International ACM SIGIR Conference onResearch and Development in Information Retrieval pp 1095ndash1096 ACM 2012

[72] H J Jung and M Lease ldquoImproving quality of crowdsourcedlabels via probabilistic matrix factorizationrdquo in Proceedings ofthe 26th AAAI Conference on Artificial Intelligence 2012

[73] A J Quinn and B B Bederson ldquoHuman computation asurvey and taxonomy of a growing fieldrdquo in Proceedings of the29th Annual CHI Conference on Human Factors in ComputingSystems (CHI rsquo11) pp 1403ndash1412 ACM May 2011

[74] A KulkarniM Can and BHartmann ldquoCollaboratively crowd-sourcingworkflowswith turkomaticrdquo inProceedings of the ACMConference on Computer Supported Cooperative Work (CSCWrsquo12) pp 1003ndash1012 February 2012

[75] M Soleymani and M Larson ldquoCrowdsourcing for affectiveannotation of video development of a viewer-reported bore-dom corpusrdquo in Proceedings of the ACM SIGIR Workshop onCrowdsourcing for Search Evaluation (CSE rsquo10) pp 4ndash8 2010

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 10: Review Article Creation of Reliable Relevance Judgments in … · 2019. 7. 31. · Review Article Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation

10 The Scientific World Journal

Table 7 Run-time methods

Method Description

Majority voting (MV) MV is a straightforward and common method which eliminates the wrong results by using the majoritydecision [31 61 68]

Expectation maximization(EM) algorithm

EM algorithm measures the worker quality by estimating the accurate answer for each task through labelscompleted by different workers using maximum likelihood The algorithm has two phases (i) the correctanswer is estimated for each task through multiple labels submitted by different workers accounting forthe quality of each worker and (ii) comparing the assigned responses to the concluded accurate answer inorder to estimate quality of each worker [69]

Naive Bayes (NB)

Following EM NB is a method to model the biases and reliability of single workers and correct them inorder to intensify the quality of the workersrsquo results According to gold standard data a small amount oftraining data labeled by expert was used to correct the individual biases of workers The idea is torecalibrate answers of workers to be more matched with experts An average of four inexpert labels foreach example is needed to emulate expert level label quality This idea helps to improve annotation quality[68]

Observation of thepattern of responses

Looking at the pattern of answers is another effective way of filtering unreliable responses as someuntrustworthy workers have a regular pattern for example selecting the first choice of every question

Probabilistic matrixfactorization (PMF)

Using probabilistic matrix factorization (PMF) that induces a latent feature vector for each worker andexample to infer unobserved worker assessments for all examples [70] PMF is a standard method incollaborative filtering through converting crowdsourcing data to collaborative filtering data to predictunlabeled labels from workers [71 72]

Expert review Expert review uses experts to evaluate workers [73]

Contributor evaluation

The workers are evaluated according to quality factors such as their reputation experience or credentialsIf the workers have enough quality factors the requester would accept their tasks For instance Wikipediawould accept the article written by administrators without evaluation [27] For instance the tasks that aresubmitted by the workers who have a higher approval rate or master workers would be accepted withoutdoubt

Real-time support

The requesters give workers feedback about the quality of their work in real time while workers areaccomplishing the task This helps workers to amend their works and the results showed thatself-assessment and external feedback improve the quality of the task [48] Another real-time approachwas proposed by Kulkarni et al [74] where requesters can follow the workers workflows while solving thetasks A tool called Turkomatic was presented which employees workers to do tasks for requesters Whileworkers are doing the task requesters are able to monitor the process and view the status of the task in realtime

method are the high cost of employing experts and being timeconsuming This approach is not applicable in IR evaluationsince crowdsourcing is used to lower the cost of hiring expertsfor creating relevance judgment Therefore if the expertreviewmethod is used to aggregate judgments the cost wouldbe increased

The drawbacks of using crowdsourcing to create rele-vance judgment are (i) each worker judges a small numberof examples and (ii) to decrease cost few judgments arecollected per example Therefore the judgments are imbal-anced and sparse as each worker assesses a small number ofexamples The MV is vulnerable to this problem while EMindirectly addresses this problem and PMF tackles this issuedirectly [70]

Different quality control methods applied in crowd-sourcing experiment in different areas of the study werelisted and discussed both in design-time approaches andrun-time approaches Although successful experiments arecrucial to both approaches the important point is that qualitycontrol method should be well-suited to crowdsourcingplatform

7 Conclusion and Future Work

Crowdsourcing is a novel phenomenon being applied tovarious fields of studies One of the applications of thecrowdsourcing is in IR evaluation experiments Several recentstudies show that crowdsourcing is a viable alternative tothe creation of relevance judgment however it needs precisedesign and appropriate quality control methods to be certainabout the outcomes This is particularly imperative sincethe workers of crowdsourcing platforms come from diversecultural and linguistic backgrounds

One of the important concerns in crowdsourcing exper-iments is the possibility of untrustworthiness of workerstrying to earn money without paying due attention to theirtasks Confusion or misunderstanding in crowdsourcingexperiments is another concern that may happen due tolow quality in the experimental designs Conclusively thequality control in crowdsourcing experiments is not onlya vital element in the designing phase but also in run-time phase The quality control and monitoring strategiesshould be considered and examined during all stages of

The Scientific World Journal 11

the experimentsMoreover the user interface and instructionshould be comprehensible motivating and exciting to theusers The payment should be reasonable but monetary ornonmonetary rewards can be also considerable Althoughthere are efforts in developing crowdsourcing experiments inIR area in recent years there are several limitations (some arelisted below) pending to refine the various approaches

(i) currently a random topic-document is shown to theworkers to judgeworkers should be providedwith thefreedom to choose their topic of interest for relevancejudgment

(ii) to scale up creating relevance judgments for a largerset of topic-document to find out whether crowd-sourcing is a replacement for hired assessors in termsof creating relevance judgments

(iii) to design a more precise grading system for workersrather than approval rate where requesters can findout each workersrsquo grade in each task type For exam-ple the requesters are able to know which workershave a good grade in creating relevance judgments

(iv) to access reliable personal information of work-ers such as expertise language and backgroundby requesters to decide which type of workers areallowed to accomplish the tasks

(v) trying different aggregating methods in creating rel-evance judgments and evaluating the influence oncorrelation of judgments with TREC experts

Crowdsourcing is an exciting research area with severalpuzzles and challenges that require further researches andinvestigations This review shed a light on some aspectsof using crowdsourcing in IR evaluation with insight intoissues related to crowdsourcing experiments in some of themain stages such as design implementation andmonitoringHence further research and development should be con-ducted to enhance the reliability and accuracy of relevancejudgments in IR experimentation

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This research was supported by High Impact ResearchGrant UMC6251HIRMOHEFCSIT14 from Universityof Malaya and Ministry of Education Malaysia

References

[1] E M Voorhees ldquoThe philosophy of information retrieval eval-uationrdquo in Evaluation of Cross-Language Information RetrievalSystems pp 355ndash370 Springer 2002

[2] C Cleverdon ldquoThe Cranfield tests on index language devicesrdquoAslib Proceedings vol 19 no 6 pp 173ndash194 1967

[3] S I Moghadasi S D Ravana and S N Raman ldquoLow-cost evaluation techniques for information retrieval systems areviewrdquo Journal of Informetrics vol 7 no 2 pp 301ndash312 2013

[4] R Baeza-Yates and B Ribeiro-Neto Modern InformationRetrieval the Concepts and Technology Behind Search Addison-Wesley 2011

[5] J Howe ldquoThe rise of crowdsourcingrdquo Wired Magazine vol 14no 6 pp 1ndash4 2006

[6] Y Zhao and Q Zhu ldquoEvaluation on crowdsourcing researchcurrent status and future directionrdquo Information Systems Fron-tiers 2012

[7] R Munro S Bethard V Kuperman et al ldquoCrowdsourcing andlanguage studies the new generation of linguistic datardquo inProceedings of the Workshop on Creating Speech and LanguageData with Amazonrsquos Mechanical Turk pp 122ndash130 Associationfor Computational Linguistics 2010

[8] V Ambati S Vogel and J Carbonell ldquoActive learning andcrowd-sourcing formachine translationrdquo inLanguage Resourcesand Evaluation (LREC) vol 7 pp 2169ndash2174 2010

[9] C Callison-Burch ldquoFast cheap and creative evaluating trans-lation quality using amazonrsquos mechanical turkrdquo in Proceedingsof the Conference on Empirical Methods in Natural LanguageProcessing vol 1 pp 286ndash295 Association for ComputationalLinguistics August 2009

[10] K T Stolee and S Elbaum ldquoExploring the use of crowdsourc-ing to support empirical studies in software engineeringrdquo inProceedings of the 4th International Symposium on EmpiricalSoftware Engineering and Measurement (ESEM rsquo10) no 35ACM September 2010

[11] D R Choffnes F E Bustamante and Z Ge ldquoCrowdsourcingservice-level network event monitoringrdquo in Proceedings of the7th International Conference on Autonomic Computing (SIG-COMM rsquo10) pp 387ndash398 ACM September 2010

[12] A Brew D Greene and P Cunningham ldquoUsing crowdsourcingand active learning to track sentiment in online mediardquo in Pro-ceedings of the Conference on European Conference on ArtificialIntelligence (ECAI rsquo10) pp 145ndash150 2010

[13] R Holley ldquoCrowdsourcing and social engagement potentialpower and freedom for libraries and usersrdquo 2009

[14] D C Brabham ldquoCrowdsourcing the public participation pro-cess for planning projectsrdquo Planning Theory vol 8 no 3 pp242ndash262 2009

[15] O Alonso D E Rose and B Stewart ldquoCrowdsourcing forrelevance evaluationrdquo ACM SIGIR Forum vol 42 no 2 pp 9ndash15 2008

[16] Crowdflower httpshttpwwwcrowdflowercom[17] G Paolacci J Chandler and P G Ipeirotis ldquoRunning exper-

iments on amazon mechanical turkrdquo Judgment and DecisionMaking vol 5 no 5 pp 411ndash419 2010

[18] W Mason and S Suri ldquoConducting behavioral research onamazonrsquos mechanical turkrdquo Behavior Research Methods vol 44no 1 pp 1ndash23 2012

[19] Y Pan and E Blevis ldquoA survey of crowdsourcing as a meansof collaboration and the implications of crowdsourcing forinteraction designrdquo in Proceedings of the 12th InternationalConference on Collaboration Technologies and Systems (CTS rsquo11)pp 397ndash403 May 2011

[20] O Alonso and S Mizzaro ldquoUsing crowdsourcing for TRECrelevance assessmentrdquo Information Processing andManagementvol 48 no 6 pp 1053ndash1066 2012

12 The Scientific World Journal

[21] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.

[22] J. L. Fleiss, "Measuring nominal scale agreement among many raters," Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.

[23] K. Krippendorff, "Estimating the reliability, systematic error and random error of interval data," Educational and Psychological Measurement, vol. 30, no. 1, pp. 61–70, 1970.

[24] C. Eickhoff and A. P. de Vries, "Increasing cheat robustness of crowdsourcing tasks," Information Retrieval, vol. 16, no. 2, pp. 121–137, 2013.

[25] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling, "Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 205–214, ACM, July 2011.

[26] O. Alonso, "Implementing crowdsourcing-based relevance experimentation: an industrial perspective," Information Retrieval, vol. 16, no. 2, pp. 101–120, 2013.

[27] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, "Quality control in crowdsourcing systems: issues and directions," IEEE Internet Computing, vol. 17, no. 2, pp. 76–81, 2013.

[28] O. Alonso and R. Baeza-Yates, "Design and implementation of relevance assessments using crowdsourcing," in Advances in Information Retrieval, pp. 153–164, Springer, 2011.

[29] S. Khanna, A. Ratan, J. Davis, and W. Thies, "Evaluating and improving the usability of Mechanical Turk for low-income workers in India," in Proceedings of the 1st ACM Symposium on Computing for Development (DEV '10), no. 12, ACM, December 2010.

[30] A. Kittur, E. H. Chi, and B. Suh, "Crowdsourcing user studies with Mechanical Turk," in Proceedings of the 26th Annual CHI Conference on Human Factors in Computing Systems (CHI '08), pp. 453–456, ACM, April 2008.

[31] M. Hirth, T. Hoßfeld, and P. Tran-Gia, "Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms," Mathematical and Computer Modelling, vol. 57, no. 11-12, pp. 2918–2932, 2013.

[32] G. Kazai, J. Kamps, and N. Milic-Frayling, "An analysis of human factors and label accuracy in crowdsourcing relevance judgments," Information Retrieval, vol. 16, no. 2, pp. 138–178, 2013.

[33] D. E. Difallah, G. Demartini, and P. Cudre-Mauroux, "Mechanical cheat: spamming schemes and adversarial techniques on crowdsourcing platforms," in Proceedings of the CrowdSearch Workshop, Lyon, France, 2012.

[34] L. De Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler, "Reputation systems for open collaboration," Communications of the ACM, vol. 54, no. 8, pp. 81–87, 2011.

[35] G. Kazai, J. Kamps, and N. Milic-Frayling, "The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2583–2586, ACM, 2012.

[36] L. Hammon and H. Hippner, "Crowdsourcing," Wirtschaftsinf., vol. 54, no. 3, pp. 165–168, 2012.

[37] J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson, "Who are the crowdworkers? Shifting demographics in Mechanical Turk," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 2863–2872, ACM, April 2010.

[38] P. Ipeirotis, "Demographics of Mechanical Turk," 2010.

[39] G. Kazai, "In search of quality in crowdsourcing for search engine evaluation," in Advances in Information Retrieval, pp. 165–176, Springer, 2011.

[40] M. Potthast, B. Stein, A. Barron-Cedeno, and P. Rosso, "An evaluation framework for plagiarism detection," in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997–1005, Association for Computational Linguistics, August 2010.

[41] W. Mason and D. J. Watts, "Financial incentives and the performance of crowds," ACM SIGKDD Explorations Newsletter, vol. 11, no. 2, pp. 100–108, 2010.

[42] J. Heer and M. Bostock, "Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design," in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 203–212, ACM, April 2010.

[43] C. Grady and M. Lease, "Crowdsourcing document relevance assessment with Mechanical Turk," in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 172–179, Association for Computational Linguistics, 2010.

[44] J. Chen, N. Menezes, J. Bradley, and T. A. North, "Opportunities for crowdsourcing research on Amazon Mechanical Turk," Human Factors, vol. 5, no. 3, 2011.

[45] A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut, "CrowdForge: crowdsourcing complex work," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), pp. 43–52, ACM, October 2011.

[46] P. Clough, M. Sanderson, J. Tang, T. Gollins, and A. Warner, "Examining the limits of crowdsourcing for relevance assessment," IEEE Internet Computing, vol. 17, no. 4, pp. 32–38, 2012.

[47] O. Scekic, H. L. Truong, and S. Dustdar, "Modeling rewards and incentive mechanisms for social BPM," in Business Process Management, pp. 150–155, Springer, 2012.

[48] S. P. Dow, B. Bunge, T. Nguyen, S. R. Klemmer, A. Kulkarni, and B. Hartmann, "Shepherding the crowd: managing and providing feedback to crowd workers," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1669–1674, ACM, May 2011.

[49] T. W. Malone, R. Laubacher, and C. Dellarocas, "Harnessing crowds: mapping the genome of collective intelligence," MIT Sloan School Working Paper 4732-09, 2009.

[50] G. Kazai, J. Kamps, and N. Milic-Frayling, "Worker types and personality traits in crowdsourcing relevance labels," in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1941–1944, ACM, October 2011.

[51] J. Vuurens, A. P. de Vries, and C. Eickhoff, "How much spam can you take? An analysis of crowdsourcing results to increase accuracy," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), pp. 21–26, 2011.

[52] J. B. P. Vuurens and A. P. de Vries, "Obtaining high-quality relevance judgments using crowdsourcing," IEEE Internet Computing, vol. 16, no. 5, pp. 20–27, 2012.

[53] W. Tang and M. Lease, "Semi-supervised consensus labeling for crowdsourcing," in Proceedings of the SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), 2011.

[54] J. Le, A. Edmonds, V. Hester, and L. Biewald, "Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 21–26, 2010.

[55] R. M. McCreadie, C. Macdonald, and I. Ounis, "Crowdsourcing a news query classification dataset," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 31–38, 2010.

[56] T. Xia, C. Zhang, J. Xie, and T. Li, "Real-time quality control for crowdsourcing relevance evaluation," in Proceedings of the 3rd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC '12), pp. 535–539, 2012.

[57] G. Zuccon, T. Leelanupab, S. Whiting, E. Yilmaz, J. M. Jose, and L. Azzopardi, "Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems," Information Retrieval, vol. 16, no. 2, pp. 267–305, 2013.

[58] D. Zhu and B. Carterette, "An analysis of assessor behavior in crowdsourced preference judgments," in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 17–20, 2010.

[59] L. Von Ahn, M. Blum, N. J. Hopper et al., "CAPTCHA: using hard AI problems for security," in Advances in Cryptology—EUROCRYPT 2003, pp. 294–311, Springer, 2003.

[60] L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, "reCAPTCHA: human-based character recognition via web security measures," Science, vol. 321, no. 5895, pp. 1465–1468, 2008.

[61] V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 614–622, ACM, August 2008.

[62] P. Welinder and P. Perona, "Online crowdsourcing: rating annotators and obtaining cost-effective labels," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Workshops (CVPRW '10), pp. 25–32, June 2010.

[63] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, "Repeated labeling using multiple noisy labelers," Data Mining and Knowledge Discovery, vol. 28, no. 2, pp. 402–441, 2014.

[64] M. Hosseini, I. J. Cox, N. Milic-Frayling, G. Kazai, and V. Vinay, "On aggregating labels from multiple crowd workers to infer relevance of documents," in Advances in Information Retrieval, pp. 182–194, Springer, 2012.

[65] P. G. Ipeirotis, F. Provost, and J. Wang, "Quality management on Amazon Mechanical Turk," in Proceedings of the Human Computation Workshop (HCOMP '10), pp. 64–67, July 2010.

[66] B. Carpenter, "Multilevel Bayesian models of categorical data annotation," 2008.

[67] V. C. Raykar, S. Yu, L. H. Zhao et al., "Learning from crowds," The Journal of Machine Learning Research, vol. 11, pp. 1297–1322, 2010.

[68] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, "Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254–263, Association for Computational Linguistics, October 2008.

[69] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, vol. 28, pp. 20–28, 1979.

[70] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," Advances in Neural Information Processing Systems, vol. 20, pp. 1257–1264, 2008.

[71] H. J. Jung and M. Lease, "Inferring missing relevance judgments from crowd workers via probabilistic matrix factorization," in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1095–1096, ACM, 2012.

[72] H. J. Jung and M. Lease, "Improving quality of crowdsourced labels via probabilistic matrix factorization," in Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.

[73] A. J. Quinn and B. B. Bederson, "Human computation: a survey and taxonomy of a growing field," in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1403–1412, ACM, May 2011.

[74] A. Kulkarni, M. Can, and B. Hartmann, "Collaboratively crowdsourcing workflows with Turkomatic," in Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW '12), pp. 1003–1012, February 2012.

[75] M. Soleymani and M. Larson, "Crowdsourcing for affective annotation of video: development of a viewer-reported boredom corpus," in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 4–8, 2010.
