

Using microtasks to crowdsource DBpedia entity classification: A study in workflow design

Qiong Bu, Elena Simperl, Sergej Zerr and Yunjia Li
School of Electronics and Computer Science, University of Southampton
E-mail: {qb1g13, e.simperl, s.zerr, yl2}@soton.ac.uk

Abstract. DBpedia is at the core of the Linked Open Data Cloud and widely used in research and applications. However, it is far from being perfect. Its content suffers from many flaws, as a result of factual errors inherited from Wikipedia or incomplete mappings from Wikipedia infoboxes to the DBpedia ontology. In this work we focus on one class of such problems, un-typed entities. We propose a hierarchical tree-based approach to categorize DBpedia entities according to the DBpedia ontology using human computation and paid microtasks. We analyse the main dimensions of the crowdsourcing exercise in depth in order to come up with suggestions for workflow design, and study three different workflows with automatic and hybrid prediction mechanisms to select possible candidates for the most specific category from the DBpedia ontology. To test our approach, we run experiments on CrowdFlower using a gold standard dataset of 120 previously unclassified entities. In our studies, human-computation driven approaches generally achieved higher precision at lower cost when compared to workflows with automatic predictors. However, each of the tested workflows has its merits, and none of them seems to perform exceptionally well on the entities that the DBpedia Extraction Framework fails to classify. We discuss these findings and their potential implications for the design of effective crowdsourced entity classification in DBpedia and beyond.

Keywords: task design, workflow design, entity classification, DBpedia, microtask crowdsourcing, CrowdFlower

1. Introduction

DBpedia is a community project, in which structured information from Wikipedia is published as Linked Open Data (LOD) [1]. The resulting dataset consists of an ontological schema and a large number of entities covering, by virtue of its origins, a wide range of topics and domains. With links to many other sources in the LOD Cloud, it acts as a central hub for the development of many algorithms and applications in academia and industry. Still, no matter how popular, DBpedia is far from being perfect. Its many flaws have been subject to extensive studies and inspired researchers to design tools and methodologies to systematically assess and improve its quality [2,3,4].

In this paper, we focus on a particular class of errors, the un-typed entities. According to the statistics1, the English DBpedia (version 3.9) contains 4.58 million entities, but only 4.22 million among them are classified according to the DBpedia ontology. A large number of entities remain unclassified in the more recent releases of DBpedia (4.3M out of 5.9M are classified in version 2015-04, and 5M out of 6.2M are consistently classified in version 2015-10)2. This gap is due to several factors, including incomplete or incorrect mappings from Wikipedia infoboxes to the DBpedia ontology3 and entity ambiguity. Any attempt to fix the problem will most likely require human intervention [5].

1 http://wiki.dbpedia.org/news/dbpedia-version-2014-released
2 http://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets
3 http://mappings.dbpedia.org/server/statistics/*/


DBpedia relies on a community of volunteers and a wiki to define and maintain a collection of mappings in different languages4. However, this is a process that is primarily designed for domain experts; it requires a relatively high degree of familiarity with knowledge engineering and the intricacies of both Wikipedia and DBpedia. The participation of the volunteers is primarily intrinsically motivated. While this self-organizing community approach works well in terms of classification precision, it is also acknowledged to be rather slow and challenging to manage; as a consequence, a substantial share of mappings is still missing [6]. Other forms of human input collection could improve on these aspects, the most important in this context being paid microtasks and crowdsourcing.

Paid microtask crowdsourcing employs services such as Amazon's Mechanical Turk5 or CrowdFlower6 to undertake work as a series of small tasks that, combined together, comprise a large unified project to which many people contribute. It is typically applied to repetitive activities that lend themselves to parallelization and do not require specific expertise beyond common human knowledge and cognitive skills. A paid microtask project is broken down into many units that are self-contained and executed independently by different people for a fee in the range of several cents to a dollar. This approach has already been successfully employed in different areas related to Linked Data and semantic technologies [7,8,9,10]. In particular, [2,4,8] have shown that it can achieve reasonable accuracy in quality repair and entity typing tasks compared to expert crowds at a faster turnaround; however, to the best of our knowledge, none of these works made use of the structural dependencies between classes, as is the case in ontologies.

In this work, we consider the problem of systematically selecting the most specific category from a tree of hierarchically organized labels by employing microtasks. We contribute an in-depth analysis of the main dimensions of the crowdsourcing exercise in order to come up with suggestions for workflow design, which are grounded in existing literature in cognitive psychology, and elaborate on their implications in terms of precision and cost. To this end, we propose a workflow model consisting of a predictor, able to suggest category candidates for a given entity, as well as crowd-based error detection and correction algorithms. Using three alternative workflows, we compare the performance of an automatic predictor and two crowd-based approaches: a naive predictor, where the whole ontology tree is traversed top-down by the crowd, and a microtask-based free text predictor, which makes a decision based on human text input. To test our model, we run experiments on CrowdFlower using a dataset of 120 entities previously unclassified by the DBpedia community and compare the answers from the crowd with a gold standard created by experts from our research group. Our experiments show that the crowd-based approaches achieved higher precision at lower cost compared to the workflow with the automatic predictor. However, each workflow has its merits, and we provide an in-depth evaluation of their properties to support experts in selection decisions.

4 http://mappings.dbpedia.org/index.php
5 https://www.mturk.com/mturk/welcome
6 http://www.crowdflower.com/

The rest of this paper is structured as follows: we start with an analysis of the state of the art in the areas of entity classification and crowdsourcing task design in Section 2. In Section 3 we present our entity classification model based on machine and human input. Section 4 is dedicated to the design of our experiments, where we measure the precision and costs of our model in three alternative workflows. Section 5 discusses the evaluation results and their implications for applying human computation and microtasks to similar scenarios. Finally, we conclude with a summary of our findings and plans for future work in Section 6.

2. Related work

Our work on using the crowd to effectively classify DBpedia entities is closely related to entity classification techniques in general and to crowdsourcing-based tasks similar to the problem discussed in this paper. We take a closer look at research in these areas and review how existing methods differ from our approach or inspire it in some way.

2.1. Entity annotation and classification

Generally, entity classification can be carried out in different ways, ranging from manual annotation by experts from the corresponding domain, over hybrid approaches where human input and machine algorithms are combined, to tools for purely automatic classification. As expert classification is a costly and time-consuming task, automatic entity classification has traditionally been part of NER (Named Entity Recognition) research, where the entity is first automatically identified and then classified [11,12,13,14] according to a number of categories. Unfortunately, such automatic algorithms require predefined concept-based seeds for training and manually defined rules, whose complexity depends on the number of involved classes. As a consequence, the classification ability is typically limited to a relatively small number of classes such as "Person", "Location", "Time" and "Organisation". These approaches are often combined with machine learning algorithms or ensembles trained on large pre-annotated datasets. However, all discussed techniques work with a certain error margin and hence their output needs to be manually re-examined and further classified. There are a number of existing tools and APIs, such as DBpedia Spotlight,7 Dandelion,8 Alchemy API,9 Open Calais,10 GATE,11 and NERD12, that can be used to classify entities based on a predefined ontology, but they very often fail to produce a type for some of the entities. On the other hand, entity classification combining human and machine input is a promising alternative, as discussed in [15,16,17,18,19]. Some studies show that it is possible to combine automatic prediction methods (Bayesian/generative probabilistic models) with additional input from the crowd to improve output accuracy [16,20,21,22]. Wang et al. [19] have also shown the performance advantage of the hybrid approach when comparing with fully automatic methods. There are basically two types of hybrid approaches: the first is based on collecting a large amount of annotations from the crowd and using the collected labels to train machine learning algorithms in order to achieve better classification quality. An example of such an approach is the ESP game [18] for gamification-based image labelling. The other approach is to use a machine to narrow down the possible options and then employ the crowd to validate or choose the best matching one. As an example, the work presented in [23] employs a machine-based algorithm to classify entities along with calculating a confidence score. The authors suggest that label crowdsourcing is required only for entities with low confidence scores produced by the classifier.

In contrast to the works mentioned above, our approach can efficiently deal with a large number of classes and keep the error margin narrow at the same time. Based on the discussed literature, in our workflows we utilise automatic tools and algorithms as predictors for producing a small set of candidate suggestions from a large set (over 700) of DBpedia classes, hence significantly reducing the search space.

7 https://dbpedia-spotlight.github.io/demo/
8 https://dandelion.eu/semantic-text/entity-extraction-demo/
9 http://www.alchemyapi.com/products/demo/alchemylanguage
10 http://www.opencalais.com/opencalais-demo/
11 https://gate.ac.uk/sale/tao/splitch6.html#x9-1360006.7
12 http://nerd.eurecom.fr/

2.2. Crowdsourcing task design

In recent years, researchers have successfully applied human computation and crowdsourcing to a variety of scenarios that use Linked Data and semantic technologies, where classification is one of the most popular tasks. The idea is to decompose tasks with higher complexity into smaller sub-tasks [24] such that each of those tasks can be solved by non-expert crowdworkers. Due to the decentralized and diverse nature of possible participants in crowdsourcing work, data validation and quality control have been an essential topic explored by many previous researchers, resulting in challenges in workflow and task design.

One challenge is to design an effective workflow and a mechanism to aggregate the sub-task outputs into a high-quality final result. For example, to solve a complex authoring task using a crowdsourcing platform, Bernstein et al. [7] introduce the fix-verify pattern, where a task is split into multiple generation and review stages. As a popular ontology, DBpedia has been the subject of a series of microtask experiments. The authors in [2] introduce a methodology to assess Linked Data quality through the combination of a contest targeting Linked Data experts [4] and CrowdFlower microtasks. Similar to the works above, in this paper we combine a prediction stage (human, automatic, or hybrid) with a subsequent verification stage carried out by CrowdFlower workers. To improve the crowdsourcing workflow and to ensure high-quality answers, various machine learning mechanisms have recently been introduced [25,26,27,28]. Closest to our task, ZenCrowd [8] explores the combination of probabilistic reasoning and crowdsourcing to improve the quality of entity linking. Large-scale crowdsourcing can also be applied in citizen science projects and classification tasks, as shown in [29,30]. Machine learning algorithms are often used to support the decision making of adaptive task allocation. In this work, we implement automatic algorithms, similar to those in [19], for a freetext predictor and a naive predictor to facilitate the crowd in locating promising areas within the ontology where the most specific class label for a given entity is likely to be located.


Another challenge lies in mitigating noisy answers to achieve high-quality annotations. Various aspects of microtask design have been studied in the past years. Kittur et al. [31] investigate crowdsourcing micro-task design on Mechanical Turk and suggest that it is crucial to have verifiable questions as part of the task and to design the task in a way that answering the question properly and answering it randomly require comparable effort. The authors in [17] elaborate on the quality of crowd-generated labels in natural language processing tasks, compared to expert-driven training and gold standard data. Their work suggests that the quality of the results produced by four non-experts can be comparable to the output of a single expert. Some successful citizen science applications, such as Galaxy Zoo [32], do not require a high level of expertise from the volunteers, and the fine granularity of the workflow makes it possible for a task to be carried out by a person without a relevant scientific background. In our work, we leveraged test questions to eliminate potential spam or unqualified users, and designed the task in a way that requires only little user expertise. The FreeAssociation and Categodzilla/Categorilla tools [33] show clear empirical evidence that variations, such as less restrictive criteria for user input, can have a positive impact on data quality. In this work we employ a hybrid free-text based predictor where user input for the most specific category is completely unrestricted. The InPhO (Indiana Philosophy Ontology) project [9] uses a hybrid approach to dynamically build a concept hierarchy by collecting user feedback on concept relationships from Mechanical Turk and automatically incorporating the feedback into the ontology. The authors in [10] introduce and evaluate the CrowdMap model to convert the ontology alignment task into microtasks. This research inspires us to reduce the amount of work for the crowd by using a "predictor" step, providing a shortlist of candidates instead of requiring the worker to explore the full ontology.

This literature is used as a foundational reference for the design of the microtask workflows introduced in this paper, including aspects such as instructions, user interfaces, task configuration, and validation. In contrast to the solutions mentioned in the literature, the main novelty of our work lies in the systematic study of a set of alternative workflows for ontology-centric entity classification. We believe that such an analysis is beneficial for a much wider array of Semantic Web and Linked Data scenarios in order to truly understand the complexity of the overall design space and to define more differentiated best practices for the use of crowdsourcing in other real-world projects under different budgetary constraints.

3. Approach

The entity classification problem considered in this paper is the problem of selecting the most specific type from a given class hierarchy for a particular entity. As a class hierarchy can contain thousands of classes, this task cannot easily be solved manually by the few experts maintaining the dataset, especially for large entity batches such as the around 1.2M un-typed entities in DBpedia13. Automatic and semi-automatic classification is a well-studied area with a variety of probabilistic and discriminative models that can assist in this context. However, whilst there exist a number of possible machine-based classification approaches, they all come with error margins of a few dozen percent, and as a result manual correction of their output by an expert remains a heavy overhead. In this section we propose a human computation based model for reducing the amount of corrections required to be done by an expert.

The problem we are tackling can be formalised as follows: let O be the given DBpedia ontology and e be a particular entity which has not been classified in DBpedia. Our target is to find the most specific type NE for entity e within O. For an un-typed entity, we consider this as a tree traversal problem starting from a predicted candidate node in the ontology and continuing until the most specific type is found. To this end, as depicted in Figure 1, we break our workflow into (a) a prediction step, where a list of candidate nodes is identified first, (b) an error detection step, where the output is manually checked, and (c) an error correction step, where the error (if detected) is manually corrected. We vary the implementation of each step and measure the performance of our model by two of the most important factors that play into the success of a microtask approach to DBpedia entity typing: precision and costs. Precision refers to the ability of crowd contributors to submit precise answers, and our objective is to minimise the costs, denoted by the amount of manual work required in a workflow. More specifically, cost in our framework is the sum of the prediction, error detection and error correction effort, measured in terms of the number of steps needed by crowdworkers to finish the corresponding tasks.

13 http://wiki.dbpedia.org/dbpedia-dataset-version-2015-10

3.1. The Model

In this section, we formally define our human computation driven model for semi-automatic entity annotation using a hierarchically structured set of labels. We start with the definition of the terms used in our model, then cover the main tasks where human participation is involved - the predictor and the subsequent error detection and error correction process - and our cost model.

Definition 1 (DBpedia Ontology) The DBpedia ontology O is a tree structure, where each node corresponds to an entity category and can contain a list of references to child nodes, each of which corresponds to a more specific sub-category of the parent entity type. The root node ROOT ("Thing") is the topmost node in the ontology.

Definition 2 (Candidate Nodes) Given an entity e and ontology O, a particular node N is considered the exact solution node NE iff N corresponds to the category of e and none of the child nodes of N does, i.e. N represents the most specific category for e in O. Any ancestor of NE is thereby considered a candidate node NC.

Definition 3 (Predictor and Predicted Nodes) Let a predictor P be a (semi-)automatic approach, able to select for a given e a ranked set Spredicted = {Np1, ..., Npn} of nodes from O as predicted candidates for NE.

Our assumption is that, with high probability, the location of an NP within O is close to NE. In case no prediction can be made by P for a particular e, we consider Spredicted = {ROOT}, as ROOT is the node closest to the optimal solution given the hierarchical structure of Definition 1.
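To make these definitions concrete, the following minimal Python sketch shows one possible representation of the ontology tree and of the predictor interface assumed by our workflows; the class and method names are illustrative and not part of the implementation used in the paper.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """A class in the DBpedia ontology tree (Definition 1)."""
    name: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add_child(self, name: str) -> "Node":
        child = Node(name, parent=self)
        self.children.append(child)
        return child

ROOT = Node("owl:Thing")  # topmost node of the ontology

class Predictor:
    """Interface shared by Pauto, Pnaive and Pfree (Definition 3)."""
    def predict(self, entity: str) -> List[Node]:
        # Return a ranked list of candidate nodes for the entity;
        # if no prediction can be made, fall back to [ROOT] (Section 3.1).
        raise NotImplementedError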

3.1.1. Human computation driven error detection and correction model

The error detection and correction process in this context is to traverse O, starting from a set S consisting of top-ranked predicted nodes NP, until NE is found, and to break the process if no correct solution can be found. We model each traversal step as a microtask for a human assessor, where we ask whether NC exists in a list of options. The answer can be true (with an indication of the corresponding node) or false, in case no node from the list can be selected. Generally, we assume that if a node is proven to be false, all of its descendants are proven to be false as well.

Error detection algorithm: We employ the traversal Algorithm 1, starting with a set S of the top-scored nodes. In case any NP from this set can be identified as NC, a check is done for each of its child nodes according to Definition 2.

Algorithm 1 Error detection
1: procedure CheckSpecificType(e, S = {Np1, ..., Npn})
2:   if Nc ∈ S and humanChoice() = Nc then
3:     if ∀ Ni ∈ children(Nc): humanChoice() ≠ Ni then
4:       return TRUE
5:   return FAIL
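For illustration, a minimal Python sketch of Algorithm 1 is given below, assuming the Node structure from the earlier sketch and a human_choice stub that stands in for one crowdworker selection; it is not the code used in our experiments. Each call to human_choice corresponds to one question posed to the crowd, i.e. one step counted towards Cost(detection).

from typing import Callable, List, Optional

# human_choice(options) models one microtask: the worker either picks one
# of the presented nodes or returns None (the NoneOfAbove option).
HumanChoice = Callable[[List["Node"]], Optional["Node"]]

def check_specific_type(entity: str, candidates: List["Node"],
                        human_choice: HumanChoice) -> bool:
    """Algorithm 1: verify that a predicted node is the most specific type.

    The entity itself is shown to the worker as part of the task interface."""
    chosen = human_choice(candidates)
    if chosen is None or chosen not in candidates:
        return False  # FAIL: no predicted node was confirmed as N_C
    # The chosen node is N_E only if none of its children fits better.
    refined = human_choice(chosen.children) if chosen.children else None
    return refined is None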

Error correction algorithm: After an error is detected, it is possible to correct it using human assessors. Similarly to error detection, the microtask-based correction Algorithm 2 starts from the set S of candidate nodes. In case an NP from S is identified as NC, its child nodes are traversed in a breadth-first manner, until the node with the most specific type corresponding to e is found. Otherwise the algorithm continues with the parent nodes of the candidates in S. Every node on the way is touched only once. The algorithm stops when an NC is found without children corresponding to the type of e. Note that, in case no specific node can be found in O, the algorithm returns ROOT.

The overall cost for entity annotation in our framework is the sum of the prediction, error detection and error correction costs; more formally:

Cost(annotation) = Cost(prediction) + Cost(detection) + Cost(correction)

Here, the costs of a particular algorithm are defined as the number of steps necessary to complete the algorithm run. Cost(detection) and Cost(correction) are the costs to detect and correct an error, which correspond to the number of steps it takes to run Algorithms 1 and 2 above, respectively. The details of the prediction step are described in the next section.

3.1.2. Predictors
A predictor, in our case, is a module able to produce

a list of candidate classes for a given entity. Currently,


Fig. 1. Representation of three workflows with human participation steps highlighted in blue.

Algorithm 2 Error correction
1: procedure FindSpecificType(e, S = {Np1, ..., Npn})
2:   if Nc ∈ S and humanChoice() = Nc then
3:     C ← anyChild(e, Nc)
4:     if C = FAIL then
5:       return Nc
6:     else
7:       return C
8:   else
9:     Sparents ← {}
10:    for Ni ∈ S do
11:      Sparents ← Sparents ∪ {parent(Ni)}
12:    return FindSpecificType(e, Sparents)
13: procedure anyChild(e, N)
14:   for Ni ∈ children(N) do
15:     if Ni = humanChoice() then
16:       Nchild ← anyChild(e, Ni)
17:       if Nchild = FAIL then
18:         return Ni
19:       else
20:         return Nchild
21:   return FAIL
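A companion Python sketch of Algorithm 2 follows, under the same illustrative assumptions as the earlier sketches (the Node structure, ROOT and the human_choice stub); again, each call to human_choice stands for one crowd question and therefore one step counted towards Cost(correction).

def find_specific_type(entity: str, candidates: List["Node"],
                       human_choice: HumanChoice) -> "Node":
    """Algorithm 2: walk the ontology from the predicted candidates until
    the most specific confirmed type is found."""
    chosen = human_choice(candidates)
    if chosen is not None and chosen in candidates:
        refined = any_child(entity, chosen, human_choice)
        return chosen if refined is None else refined
    # None of the candidates fits: move one level up and retry.
    parents = {c.parent for c in candidates if c.parent is not None}
    if not parents:
        return ROOT  # no more specific node can be found in O
    return find_specific_type(entity, list(parents), human_choice)

def any_child(entity: str, node: "Node",
              human_choice: HumanChoice) -> Optional["Node"]:
    """Descend as long as the worker keeps selecting a child node."""
    chosen = human_choice(node.children) if node.children else None
    if chosen is None:
        return None  # FAIL in the pseudocode: 'node' itself is most specific
    deeper = any_child(entity, chosen, human_choice)
    return chosen if deeper is None else deeper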

all the predictors described in the literature are automatic predictors, where, given an entity, a list of candidates along with the confidence values of the predictions is produced. In this work we employ three approaches, namely an automatic predictor Pauto and two manual predictors, Pnaive and Pfree, as follows:

As an example of Pauto, we used the Dandelion API14, which has an add-on for Google Spreadsheets allowing entities uploaded in a spreadsheet to be analysed and typed with DBpedia types. The entity to be classified is the text to be analysed, and the output we get is the types along with their confidence levels. The types given by the Dandelion API always include the parent types if a specific type is identified. For instance, if an entity's specific category is "Building", Dandelion would output Building, ArchitecturalStructure and Place. In this case, we use the open source tool, which is free, and the prediction cost can be considered as 0.
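For illustration, the sketch below shows how such an automatic predictor could be queried over HTTP instead of through the spreadsheet add-on. The endpoint, parameter names (text, include, token) and response fields are assumptions based on the public Dandelion entity extraction documentation and should be checked against the current API before use.

import requests

# Assumed endpoint of the Dandelion entity extraction service (check the docs).
DANDELION_NEX = "https://api.dandelion.eu/datatxt/nex/v1"

def dandelion_types(entity_label: str, token: str) -> list:
    """Return DBpedia types with confidence scores for an entity label."""
    response = requests.get(DANDELION_NEX, params={
        "text": entity_label,   # the entity name is analysed as plain text
        "include": "types",     # request DBpedia types for each annotation
        "token": token,         # API key (placeholder)
    })
    response.raise_for_status()
    return [
        {"types": a.get("types", []), "confidence": a.get("confidence", 0.0)}
        for a in response.json().get("annotations", [])
    ]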

Additionally, we propose two human computation based prediction approaches. The first, Pnaive, is a naive predictor that starts from the root of the ontology and traverses the tree by sequentially expanding the children. This approach is very costly, but can also be applied to entities for which the other (semi-)automatic approaches fail. Our second human computation based prediction approach, Pfree, allows for unconstrained input from the crowd, balancing microtask efficiency and annotation accuracy. The main advantages of this approach are its simplicity and freedom of choice. Classification is one of the most common human cognitive skills, and the crowd is not constrained by any unnecessary biases. In addition, restrictions on category input can sometimes increase the difficulty of the task [33] and hence impact the overall accuracy. The outcome depends on how many times the questions are asked and how the answers are aggregated. The more answers are collected, the more reliable is the classification [34]. The top aggregated answers can be voted on by the crowd in a second step to identify the most suitable candidate [35], or used to detect errors and correct the answers, as is done in our case.

14 https://dandelion.eu/docs/api/

Considering the diversity of the vocabulary of crowd users, direct aggregation of their answers is not effective. To solve this issue, we leverage the freetext input to automatically calculate the closest DBpedia types based on the textual similarity between entity titles and ontology class names. We used the difflib15 SequenceMatcher to compute string match similarity scores between every input collected from the crowd and the names of all existing DBpedia classes. Each DBpedia class is assigned an aggregated score as an indication of how close it is to the user-proposed type. We finally retrieve the top-scored DBpedia types to form the list of output candidates to be used in the error detection and error correction steps.
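A minimal sketch of this aggregation step is given below; difflib.SequenceMatcher is the ingredient named above, while the exact scoring and the cut-off of six candidates (plus NoneOfAbove, cf. Section 3.4) are illustrative choices.

from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, List

def rank_classes(freetext_answers: List[str], dbpedia_classes: List[str],
                 top_k: int = 6) -> List[str]:
    """Aggregate free-text crowd answers into a shortlist of DBpedia classes."""
    scores: Dict[str, float] = defaultdict(float)
    for answer in freetext_answers:
        for cls in dbpedia_classes:
            # Accumulate the string similarity of every answer for every class.
            scores[cls] += SequenceMatcher(None, answer.lower(), cls.lower()).ratio()
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical example for an entity described as a castle:
# rank_classes(["castle", "fortress", "old building"],
#              ["Castle", "Building", "Museum", "Mountain"])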

3.2. Workflows

For the purpose of this study, we identify three types of workflows, following the preliminary considerations presented so far. Each workflow incorporates a predictor, as well as the error detection and correction steps.

– (Wnaive) Naive: ask the crowd to choose the DBpedia category by traversing from the root top-down until a specific category is selected. This is the workflow that employs the Pnaive predictor.

– (Wfree) Freetext: accepts unconstrained input from the crowd to label an entity. This workflow incorporates the Pfree predictor, where collected freetext annotations are processed to identify the top candidates among the DBpedia types, which can then be explored and corrected by the crowd in a follow-up verification task.

– (Wauto) Automatic: uses an entity typing tool to generate the candidate list, then asks the crowd to detect and correct errors. This workflow is based on the Pauto predictor.

Figure 1 shows an overview of these three workflows and highlights the steps requiring human participation with a blue background. In the remainder of this section we explain the workflows and their translation into microtasks in more detail.

For the (Wnaive) Naive workflow, the idea is to traverse the DBpedia class hierarchy top-down. The particular worker's choice from a list depends on her level of expertise and on the given situation, as theory teaches. Rosch et al. proved in experiments that experts and newbies make very different classification decisions - people with little insight into a domain tend to feel comfortable with categories that are neither too abstract nor too specific, whereas experts are much more nuanced [36]. The same effect was observed by [37] and in games with a purpose [38]. As we are working with microtask platforms, we have to assume that the behaviour of the crowd contributors will resemble the newbies in the Rosch et al. experiments and the casual gamers who interacted with GWAPs. Thus, we cannot always expect Pnaive to identify the most specific class in the DBpedia ontology which matches the input entity. This is also why each list of options we presented to the crowd needs to include a NoneOfAbove option. In this Naive workflow, the crowd is first presented with a list of candidate types (the first level of DBpedia classes under the ROOT, along with NoneOfAbove). As there are 49 first-level classes in DBpedia, using the naive approach, the classes at each level are randomly put into groups and one such group is presented at a time. The details of how many options are put in each group are given in the task design Section 3.4. If NoneOfAbove is chosen, another group of options is presented to the worker. If a given level-1 class is chosen and the class has children, a group of candidate types from the level-2 classes is displayed to the worker instead. The process keeps going until a class which has no children is chosen, or a chosen class does not contain children that are suitable specific types for the given entity. The result of this process is considered final.

The (Wfree) Freetext workflow starts with a task where an entity and its description are presented to the worker and free text annotations from this worker are collected. Once we have collected annotations for all entities, for each entity we consider all the annotations, calculate their textual similarity (many APIs are available16,17,18) with the titles of all existing DBpedia classes (vocabularies) and add up a similarity score for each DBpedia class. The higher the similarity score of a DBpedia class, the higher the chance that this class is the candidate type of the corresponding entity. Then we pick the top classes for each entity and add a "NoneOfAbove" option to form a shortlist as the predicted result from Pfree. Then we start the error

15 https://docs.python.org/2/library/difflib.html
16 https://pypi.python.org/pypi/Distance/
17 https://docs.python.org/2.7/library/difflib.html
18 http://www.nltk.org/howto/wordnet.html


detection and correction process. Similar to (Wnaive), we traverse the DBpedia class hierarchy until a class which has no children is chosen, or a class has been chosen but none of its children are suitable specific types for the given entity. In the case when none of the predictor's predicted classes is chosen as a suitable type for the given entity, we traverse up to the parent level. Depending on whether a suitable class is chosen from the parent level or not, further child or parent classes are presented, respectively. The process keeps going until it reaches ROOT, or a class is chosen and that class either has no children or none of its children are suitable types for the given entity.

For the (Wauto) Automatic workflow, the idea is to use one of the existing APIs19,20,21 to produce a list of candidate types for the given entities; only the top types with higher confidence levels are picked to be presented, along with the NoneOfAbove option, to the crowd. Based on the worker's selection, a similar traversal process is followed until it reaches ROOT, or a class is chosen and that class either has no children or none of its children is chosen.

3.3. Microtask Design

In general, we distinguish between two types of microtasks based on their output: T1, where the workers produce free text output; and T2, where the worker can choose from a list of classes. In both cases, we generate descriptions of the input entities in the form of labels or text summaries. Either way, the effort to generate the first type of tasks (T1) is comparatively lower, as there is no need to compute a list of potential candidates. However, this is compensated by the overhead of making sense of the output and of aggregating free text inputs into meaningful class suggestions, while in T2 the answer domain is well-defined. The crowdsourcing literature recommends using iterative tasks to deal with T1 scenarios: in a first step the crowd generates suggestions, while in the second it is asked to vote on the most promising ones [39,40,35]. Accordingly, T2 can be used as the voting step for T1 output to get more useful suggestions, provided the T1 output has been post-processed to produce a ranked list, for example using the aggregation we proposed in 3.1.2 for Pfree.

The T2 variant requires a strategy to generate the list of candidates. This can be achieved through automatic tools addressing a similar task (as listed in Section 2.1); however, they impose clear bounds on the accuracy of the crowd experiments, as not all input can be processed by the automatic tools (recall) and their output can only be as precise as the task input allows (precision). Additionally, the choice of an appropriate threshold to accept or reject produced results is also a subject of debate in the related literature [41,42]. Another option is to provide the crowd with all possible choices in a finite domain - in our case all classes of the DBpedia ontology. The challenge then is to find a meaningful way for the crowd to explore these choices. While classification is indeed one of our most common human skills, cognitive psychology has established that people have trouble when too many choices are possible. This means both too many classes to consider [43,42] and too many relevant criteria to decide whether an item belongs to a class or not [44,45]. Miller's research suggests that, in terms of the capacity limit for people to process information, 7 (plus or minus 2) options are a suitable benchmark [46]. We hence would ideally use between 5 and 10 classes to choose from. This situation is given by the predictor Pauto, as the confidence of automatic entity typing algorithms decreases rapidly and only the top 10 candidates are likely to be representative. However, in the Pnaive condition, we start from the DBpedia ontology, which has over 700 classes to choose from22. The problem persists even when browsing the ontology level by level, as some classes have tens of subclasses (e.g., there are 21 different forms of organization, 50 child categories of person and 43 specific types of athlete). In our experiments, therefore, we split the list into subsets of 7 items or less to display per microtask.

19 https://dandelion.eu/docs/api/
20 http://www.alchemyapi.com/api/entity/types
21 http://www.opencalais.com/opencalais-api/
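The grouping rule described above can be sketched as follows, assuming a flat list of sibling class names; the group size of six candidates plus a NoneOfAbove option matches the limit of seven choices discussed in Section 3.4, while the random shuffling mentioned in Section 3.2 is omitted for brevity.

from typing import Iterator, List

NONE_OF_ABOVE = "NoneOfAbove"

def option_groups(candidates: List[str], group_size: int = 6) -> Iterator[List[str]]:
    """Split a long list of sibling classes into microtask-sized groups.

    Each group holds at most group_size candidate classes plus NoneOfAbove,
    so a worker never sees more than seven choices at once."""
    for start in range(0, len(candidates), group_size):
        yield candidates[start:start + group_size] + [NONE_OF_ABOVE]

# Example: the 49 first-level DBpedia classes would be shown as nine
# consecutive groups (eight groups of six and one group of one).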

3.4. Implementation

As shown in Section 3.1, all three workflows take the same input. As a result, for each workflow we first queried the DBpedia endpoint via SPARQL to obtain the name, description, and a link to Wikipedia for each entity. The Wnaive and Wauto workflows only involve the T2 task type; Wfree includes both T1 and T2 in sequence. Figures 2 and 3 depict the user interfaces for the T1 and T2 types of tasks. For T2 we always limited the number of options shown to a user to a maximum of 7 (6 class candidates and a NoneOfAbove option). In case there are more candidates, we split the list as discussed earlier. We created the gold standard data (see Section 4), set the required CrowdFlower parameters, launched the jobs, and monitored their progress.

22 http://mappings.dbpedia.org/server/ontology/classes/ (accessed on Jan 14, 2016)
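The exact query is not reproduced here; the sketch below shows a plausible version for the public DBpedia endpoint, using rdfs:label, dbo:abstract and foaf:isPrimaryTopicOf to retrieve the name, description and Wikipedia link of an entity (these property choices and helper names are assumptions, not the script used in our experiments).

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?label ?abstract ?wiki WHERE {
  <%s> rdfs:label ?label ;
       dbo:abstract ?abstract ;
       foaf:isPrimaryTopicOf ?wiki .
  FILTER (lang(?label) = "en" && lang(?abstract) = "en")
}
"""

def entity_description(entity_uri: str) -> dict:
    """Fetch name, description and Wikipedia link for one DBpedia entity."""
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(QUERY % entity_uri)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    row = bindings[0]  # assumes the entity exists and has an English abstract
    return {
        "name": row["label"]["value"],
        "description": row["abstract"]["value"],
        "wikipedia": row["wiki"]["value"],
    }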

3.4.1. CrowdFlower Parameters
Task: Following the advice from the literature [2], we used 5 units/rows per task/page for each of the three workflows. The worker completes a task by categorising five entities.
Judgment: Snow et al. [17] claim that answers from an average of four non-experts can achieve a level of accuracy parallel to NLP experts on Mechanical Turk. We hence asked for 5 judgments per experimental data item throughout our experiments. In Wfree, we asked for 11 free text suggestions. This choice is in line with experiments from the literature [2,10]. We also ran an empirical simulation with our previous experiments on entities that have already been classified to find out the effective number of judgments to use. Figure 4 shows the number of times an entity has been annotated and the corresponding matched answers (based on 120 entities).

Fig. 4. Freetext: Judgement Number vs. Matched Answers

Payment: We paid 6 cents for each task consisting of 5 units. This setting took into account existing surveys [47], as well as related experiments of a similar complexity level. Tasks like reading an article and then asking the crowd to rate it based on given criteria, as well as to provide suggestions for areas to be improved, were paid 5 cents [31]. Tasks of smaller granularity, such as validating that a given linked page displays an image relevant to the subject, were paid 4 cents per 5 units [2]. The complexity of our classification task lies somewhere in between, considering the time and knowledge it requires to complete.

Quality control: We created test questions for all jobs and required contributors to pass a minimum accuracy rate (we use 50%) before working on the units.
Contributors: CrowdFlower distinguishes between three levels of contributors based on their previous performance. The higher the level of contributors required, the longer it takes to finish the task, but the quality might be higher. In our experiment, we chose the default Level 1, which allows all levels of contributors to participate in the classification task. We used this level for all three workflows.
Aggregation: For the T2 tasks we used the default option (aggregation='agg')23, as the task is to choose from a set of pre-defined options. For T1, we looked at the first three answers (aggregation='agg_3') based on 11 judgments. We also used aggregation='all' to collect the additional categories when the crowd felt the best match was not listed among the suggestions.

3.5. Using the Crowdsourced Data

The validated and aggregated results may be leveraged in several ways in the context of DBpedia. Freetext suggestions signal potential extensions of the DBpedia ontology (concepts and labels) and of the DBpedia Extraction Framework (mappings). Applying the freetext workflow gives insights into the limitations of entity typing technology, while the way people interact with the naive workflow is interesting not only because it provides hints about the quality imbalances within the DBpedia ontology, but also for research on Semantic Web user interfaces and the possible definition of new mapping rules for entity typing.

4. Evaluation

In this section, we evaluate the performance of the proposed workflows as depicted in Figure 1. Overall, we obtained three different workflows, namely Wnaive, Wauto and Wfree, based respectively on the predictors Pnaive, Pauto and Pfree. We evaluate the accuracy of the predictors and compare the overall precision and costs of the different workflows.

23 https://success.crowdflower.com/hc/en-us/articles/203527635-CML-Attribute-Aggregation


Fig. 2. The user interface for T1 tasks based on free text input

Fig. 3. The user interface for T2 tasks based on a list of candidate options

4.1. Data

For our experiments we chose, uniformly at random, 120 entities which had not been classified so far in DBpedia. The authors of the paper annotated these entities manually to obtain a gold standard. The annotators worked independently and achieved an agreement of 0.72 measured using Cohen's kappa. According to one of the most commonly used interpretations, by Landis and Koch (1977), a kappa value in the range of 0.6-0.8 corresponds to substantial agreement. Note that 106 out of 120 entity typings were agreed between the two annotators. To achieve consensus, the annotators then collaboratively defined a set of rules to categorize the entities whose classes did not match and involved a third annotator for majority voting. For example, an entity such as "List of Buffy the Vampire Slayer novels" was eventually classified as List instead of Novel, while "Haute Aboujagane, New Brunswick", which describes a specific community, was defined as Community instead of GovernmentAdministrativeRegion.
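For reference, agreement of this kind can be computed with scikit-learn along the lines of the short sketch below; the labels shown are hypothetical, not our annotation data.

from sklearn.metrics import cohen_kappa_score

# Hypothetical example: the class chosen by each annotator for four entities.
annotator_1 = ["Place", "List", "owl:Thing", "Organisation"]
annotator_2 = ["Place", "Novel", "owl:Thing", "Organisation"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.2f}")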

Table 1 provides an overview of the composition of the resulting gold standard corpus with respect to general categories. "Place" and "Organisation" are the most frequent categories, with respectively 20 and 16 entities in the dataset corresponding to one of their child categories. The 18 entities for which no appropriate category


Table 1. Overview of the E1 Corpus

#Entities  Category
20         Place
18         owl:Thing
16         Organisation
10         Work
9          Agent
7          TopicalConcept
5          Group
4          List
2          EthnicGroup
1 each     Company, Person, Function, Device, UnitOfWork, Food, Species, ChemicalSubstance, Name, Activity, MeanOfTransportation, Medicine, EducationalInstitution, Event, Language

in the ontology could be found by the experts were labelled as "owl:Thing".

4.2. Experiments

For each workflow, we assess its properties in our experiments with respect to quality, effectiveness and costs. The accuracy of the predictor with respect to our gold standard provides information about the predictor's effectiveness for real applications. The accuracy of the human-based error correction with respect to our gold standard provides insights into what quality can be expected from the crowd in general.

4.2.1. Prediction Quality
In our quality-related experiments, we employ precision, which measures the portion of correctly classified entities among all entities classified by the predictor. Our quality measures are the precision-recall curves as well as the precision-recall break-even points for these curves. The precision-recall graph provides insights into how much precision can be gained in case only a certain amount of recall is required. Typically, the head of the distribution achieves better precision, due to the fact that these values correspond to the higher confidence of the predictor. The break-even point (BEP) is the precision/recall value at the point where precision equals recall, which in that case is equal to the F1 measure, the harmonic mean of precision and recall. The results of the experiments for the predictors described in Section 3.1.2 are shown in Figure 5. The main observations are:

The precision of the human computation based methods was generally higher when compared to the automatic method. Moreover, the automatic method did not produce any results for around 80% of the entities, whilst the human computation based predictors provided suggestions for all entities with a certain quality. Pnaive showed the best precision over the test set, with a BEP of

Table 2. Cost overview for different workflows

                        Wauto   Wnaive   Wfree
Prediction costs        0       4.1      1
Error detection costs   2.9     0        1.6
Error correction costs  3.2     0        1.83
Sum                     6.1     4.1      4.43

0.49, followed by Pfree with a BEP of 0.47. Surprisingly, the precision of Pnaive was lower than expected for a completely human computation based solution, which may indicate that the classification task requires high expertise and cannot rely on the crowd alone. For about 10% recall all methods provide very good results; in particular, the precision for Pauto and Pfree was very high, possibly making further error detection and correction steps unnecessary.
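The evaluation described above can be reproduced along the lines of the following sketch, assuming that each prediction is available as a (confidence, correct) pair judged against the gold standard; it mirrors our methodology but is not the evaluation script itself.

from typing import List, Tuple

def precision_recall_points(predictions: List[Tuple[float, bool]],
                            total_relevant: int) -> List[Tuple[float, float]]:
    """Compute (recall, precision) points from (confidence, correct) pairs.

    total_relevant is the size of the gold standard (120 entities here), so
    predictors that return nothing for some entities are penalised in recall."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    points, correct = [], 0
    for i, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        points.append((correct / total_relevant, correct / i))
    return points

def break_even_point(points: List[Tuple[float, float]]) -> float:
    """Approximate the value where precision equals recall (the BEP)."""
    recall, precision = min(points, key=lambda p: abs(p[0] - p[1]))
    return (recall + precision) / 2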

We standardized the confidence level of each predictor to the range [0.0-1.0] and plotted the output quality at different confidence levels in terms of precision in Figure 6. As expected from the previous results, high confidence levels above 0.9 for Pauto and Pfree indicate high-quality results. For confidence levels below that, the output cannot be trusted and needs error correction. To improve the prediction quality, research needs to be done to develop better predictors, or to improve the quality of existing ones.

4.2.2. Prediction costs
In comparison to Pauto, the predictors Pfree and Pnaive required human input and therefore added additional costs. Whilst in Pfree the user had to execute exactly one free text task per entity, to obtain the results using Pnaive the crowdworker had to complete 4.1 tasks on average.

4.2.3. Error Detection Quality and Costs
Having received the output of a predictor, we tested whether the crowd was able to identify prediction


Fig. 5. Precision-recall curves for the output of the different predictors with respect to our gold standard: (a) Pauto, BEP 0.37; (b) Pnaive, BEP 0.49; (c) Pfree, BEP 0.47.

Fig. 6. Precision at confidence level curves for the output of the different predictors with respect to our gold standard: (a) Pauto; (b) Pnaive; (c) Pfree.

errors, and empirically determined the costs for this detection in terms of the number of questions needed to be answered on average. The graphs provide insights into the quality of the predictor outputs with respect to crowd decisions and can provide decision support for including or excluding the error detection step in real applications. We do not test error detection for Pnaive, as its output is already based on the crowd and cannot be expected to improve with an error detection or correction step. As defined in our crowdsourcing model, at each task we show 7 candidate options to the user. On average, error detection took 2.9 detection steps with Wauto and 1.6 steps with Wfree, as depicted in Table 2.

4.2.4. Error Correction Quality and Costs
In the last step, we apply our algorithm to correct the errors produced by the predictors. Similar to the previous section, we measure the costs for the correction as the number of questions needed to be answered on average. Table 2 provides an overview of the costs. Wauto required on average 3.2 steps to correct the prediction, due to the fact that the predictor did not produce any results for most entities and the whole tree had to be traversed to find the answer in such cases. 1.83 steps were required on average for Wfree, indicating that in general the predictor pointed the user to the right area within the ontology. In summary, as depicted in Table 2, Wauto appeared the most costly, with 6.1 steps on average, while the other two workflows showed comparable results.

4.2.5. The Quality of Human Output
Finally, we measure the quality of the workflows as a whole to estimate the effort to be invested by experts in post-processing. Figure 7 shows the precision-recall curves of the human-based result correction for Wauto and Wfree. We observe that in both workflows the result improved when compared to the prediction step alone. Our proposed workflow Wfree reached a BEP of 0.53 - the highest result among all experiments.

4.2.5.1 Crowdsourcing Tasks
We decided to limit the number of top options shown to the user to 7, as recommended in the literature. Longer lists may contain the correct result with higher probability, but would also require more interaction and effort from a crowdworker. To show the possible influence of the number of options on the prediction quality, we plot the correspondence in Figure 8, where we vary the list size on a logarithmic scale from the top 1 to the maximum of 740 possible categories and measure the predictors' BEPs on the basis of the gold standard. As we can observe, the precision improves only slightly with growing list size, indicating no meaningful advantage for lists with more than 7-10 options.


Fig. 7. The overall output quality of the workflows: (a) Wauto, BEP 0.42; (b) Wfree, BEP 0.53.

Fig. 8. Precision at different option list sizes with the Wfree predictor (list size shown on a logarithmic scale).

5. Discussion and Lessons Learnt

5.1. Are unclassified entities unclassifiable?

As noted earlier, there were significant differences in the performance scores achieved in the experiments using different workflows. Especially notable is that the existing NLP tool could only classify 45% of the randomly selected entities (54 out of 120), with a precision of only 0.37. The human-based approaches achieve a relatively higher precision of 0.49 to 0.47, but a large portion of the entities are still not classified to the correct specific type. To some extent, this indicates that un-typed entities have certain characteristics which make them difficult to classify.

– Firstly, we observed that their types are quite diverse and not within the most popular classes24. For instance, "Standard", "SystemOfLaw", or "TopicalConcept" are not categories a non-expert could easily distinguish.

– Secondly, the imbalanced structure of the DBpedia ontology also complicates the classification of untyped entities for which the boundaries between subtle categories are not well defined. For example, "Hochosterwitz Castle" is a "Castle", which would be the most specific category for this entity; however, Castle is a child category of Building, which is a child type of ArchitecturalStructure, which has many child types such as Arena, Venue and Pyramid, leading users to choose none of the children of "ArchitecturalStructure" as they did not see any that fit. Similarly, among all 18 entities that are instances of "Place", only 5 of them are correctly classified to the most specific category because of the unclear and over-defined sub-categories. "Place" and "Area" are both immediate first-level types under "owl:Thing", which creates a lot of confusion in the first place, as we observed from the crowd-contributed classifications. Also, categories such as "Region", "Locality" and "Settlement" are difficult to differentiate.

– Lastly, ambiguous entities unsurprisingly caused disagreement [48]. This was the case with "List" and specific types such as "1993 in Film",25 which is a List (not a film), and the "1976 NBA Finals",26 which is rather a Tournament (a child of "Event", "SocietalEvent" and "SportsEvent", but not a "List"). In general, entities like these contain a context which sometimes makes the entity itself ambiguous. In a similar way, "Provinces of the Dominican Republic"27 is a list (not a place), while "Luxembourg at the Olympics" is a sports team. In other cases, an entity with context is just difficult to fit into any existing DBpedia type, for instance "Higher education in Hong Kong" and "Petroleum industry in Nigeria".

5.2. The outputs are only as good as the inputs

Taking the naive workflow, where we present a maximum of 7 types (including a "NoneOfAbove" option) in one step, as an example (sketched below): the aggregated outcome shows that 33 entities were categorised as "other" after traversing the DBpedia class tree top-down from "owl:Thing", with none of the DBpedia categories being chosen. This also contributes to the ongoing debate in the crowdsourcing community regarding the use of miscellaneous categories [48,49]. In our case, using this option elicited a fair amount of information, even if it was used just to identify a problematic case. [49] discuss the use of instructions as a means to help people complete an entity typing task for microblogs. In our case, however, we believe that performance enhancements would be best achieved by studying the nature of unclassified entities in more depth and looking for alternative workflows that do not involve automatic tools in cases where we assume they will not be able to solve the task. A low-hanging fruit is the case of containers such as lists, which can be identified easily. For the other cases, one possible way forward would be to compile a shortlist of entity types in the DBpedia ontology which are notoriously difficult and ask the crowd to comment upon that shortlist instead of the one more or less 'guessed' by a computer program. Another option would be to look at workflows that involve different types of crowds. However, it is worth mentioning that of the 120 randomly chosen untyped entities from DBpedia, 18 do not fit into any DBpedia type based on our gold standard, which indicates a need to enhance the ontology itself.

24 http://wiki.dbpedia.org/services-resources/datasets/data-set-39/data-set-statistics
25 https://en.wikipedia.org/wiki/1993_in_film
26 https://en.wikipedia.org/wiki/1976_NBA_Finals
27 https://en.wikipedia.org/wiki/Provinces_of_the_Dominican_Republic
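The following is a minimal sketch of the top-down step generation described above; the children mapping, the ask_crowd callback and the option cap mirror the description but are illustrative stand-ins, not the actual implementation used on CrowdFlower.

# Minimal sketch of the naive top-down workflow: starting from owl:Thing, the crowd
# picks one of at most 7 options (direct subclasses plus "NoneOfAbove") per step and
# we descend until "NoneOfAbove" or a leaf class is reached.

MAX_OPTIONS = 7  # at most 7 options per step, including the "NoneOfAbove" escape

def classify_top_down(entity, children, ask_crowd, root="owl:Thing"):
    """Descend the class tree one crowd decision at a time."""
    current = root
    while True:
        options = children.get(current, [])[:MAX_OPTIONS - 1] + ["NoneOfAbove"]
        choice = ask_crowd(entity, options)      # aggregated crowd answer for one step
        if choice == "NoneOfAbove":
            return "other" if current == root else current
        if not children.get(choice):             # no subclasses left: most specific type
            return choice
        current = choice

# Toy usage with a hand-crafted fragment of the class tree and a scripted "crowd".
children = {
    "owl:Thing": ["Place", "Agent", "Work"],
    "Place": ["ArchitecturalStructure", "Region"],
    "ArchitecturalStructure": ["Building", "Arena", "Venue", "Pyramid"],
    "Building": ["Castle"],
}
script = iter(["Place", "ArchitecturalStructure", "Building", "Castle"])
print(classify_top_down("Hochosterwitz Castle", children, lambda e, opts: next(script)))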

5.3. Popular classes are not enough

As noted earlier, entities which do not lend themselves easily to any form of automatic classification seem to be difficult to handle by humans as well. This is worrying, especially if we recall that this is precisely what people would expect crowd computing to excel at: enhancing the results of technology. However, we should also consider that a substantial share of microtask crowdsourcing applications addresses slightly different scenarios: (i) the crowd is either asked to perform the task on their own, in the absence of any algorithmically generated suggestions; or (ii) it is asked to create training data or to validate the results of an algorithm, under the assumption that those results will be meaningful to a large extent. The situation we are dealing with here is fundamentally new, because the machine part of the process does not work very well and distorts the wisdom of the crowd. These effects did not occur when we used free annotations. An entity such as "Brunswick County North Carolina" 28 is obviously a "County" and hence a child type of "Place". The freetext approach actually proposes this category, although it does not yet exist in DBpedia. This result is consistent with [33].

28 https://en.wikipedia.org/wiki/Brunswick_County,_North_Carolina

It became evident that when the predicted categories are labelled in domain-specific or expert terminology, people tend not to select them. While workers are comfortable using finely differentiated categories under the unbound condition, the vocabulary has to remain accessible. For example, Animal (a sub-category of Eukaryote) was used more often than Eukaryote. In all three workflows, if the more general category was not listed, participants were inclined towards the more specialized option rather than higher-level types such as Person, Place, and Location. This could be observed best in the freetext workflow. Such aspects could inform recent discussions in the DBpedia community towards a revision of the core ontology.
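One way to exploit such free-text answers is to screen them against the existing ontology vocabulary and flag answers with no close match, such as "County" above, as candidates for new classes. The sketch below illustrates this with the Python standard library; the label list and similarity cut-off are illustrative assumptions, not part of our workflows.

# Minimal sketch: map a free-text type proposed by the crowd onto the closest
# existing DBpedia ontology label, or report that no sufficiently close label exists.
import difflib

ONTOLOGY_LABELS = ["Settlement", "Region", "AdministrativeRegion", "Place", "Locality"]

def match_free_text(answer, labels=ONTOLOGY_LABELS, cutoff=0.8):
    hits = difflib.get_close_matches(answer, labels, n=1, cutoff=cutoff)
    return hits[0] if hits else None   # None: no existing class is close enough

print(match_free_text("Settlements"))  # -> "Settlement"
print(match_free_text("County"))       # -> None, a candidate for a new ontology class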

5.4. Spam prevention

It has been observed that crowdsourced microtasks sometimes generate noisy data, either submitted deliberately by lazy workers or contributed by workers whose knowledge of the task area is not sufficient to meet certain accuracy criteria. Test questions are a good way to minimise the problems caused by both cases, so that only honest workers with a basic understanding are involved in the tasks. In our experiments, we did not specifically use control questions to prevent spam; instead, we used test questions to recruit qualified workers. Although the test question approach requires about 10% additional judgements to be collected, it does yield good inputs in which definite spam is rare and negligible.
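As a rough illustration of this screening step and its cost, the sketch below admits a worker only if their accuracy on the test questions clears a threshold, and adds about 10% to the judgment budget for those questions. The threshold, judgment counts and data layout are illustrative assumptions rather than our exact CrowdFlower settings.

# Minimal sketch of quiz-based worker screening and the associated judgment overhead.

def passes_quiz(worker_answers, gold_answers, threshold=0.7):
    """worker_answers / gold_answers: question id -> selected / expected type."""
    correct = sum(1 for q, expected in gold_answers.items()
                  if worker_answers.get(q) == expected)
    return correct / len(gold_answers) >= threshold

def judgment_budget(n_entities, judgments_per_entity, quiz_overhead=0.10):
    """Total judgments to pay for, including the ~10% quiz overhead."""
    return round(n_entities * judgments_per_entity * (1 + quiz_overhead))

print(passes_quiz({"q1": "Castle", "q2": "Person"}, {"q1": "Castle", "q2": "Place"}))
print(judgment_budget(120, 5))  # e.g. 120 entities, 5 judgments each, plus quiz cost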

6. Conclusion and Future Work

In this paper, we are the first to propose crowdsourcing-based error detection and correction workflows for (semi-)automatic entity typing in DBpedia with selection of the most specific category. Though our framework employs the DBpedia ontology, in principle it can also be applied to other classification problems where the labels are structured as a tree. As our second contribution, we propose a new microtask-based free text approach for the prediction of entity types. This predictor provides good results even for entities where the automatic machine-learning-based approach fails and naive full ontology scans are necessary. We empirically evaluated the quality of the proposed predictors, as well as the crowd performance and costs for each of the workflows, using 120 DBpedia entities chosen uniformly at random from the set of entities not yet annotated by the DBpedia community.

While the upper levels of the DBpedia ontology seem to match well the basic level of abstraction coined by Rosch and colleagues more than 30 years ago [36], contributors used more specific categories as well, especially when not constrained in their choices to the (largely unbalanced) DBpedia ontology. The experiments also call for more research into what is different about the entities which make up the hard cases, and in our discussion we gave some suggestions for improvement.

In our future work we plan to investigate how the quality of semi-automatic predictors can be further improved, in order to reduce the cost of correction to a minimum. More work can also be done in the design of the microtasks, especially in terms of the option choices displayed to the user. Finally, there is a lot to be done in terms of support for microtask crowdsourcing projects. The effort invested in our experiments could be greatly reduced if existing platforms offered more flexible and richer services for quality assurance and aggregation (e.g., using different similarity functions).
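As an example of the kind of aggregation service we have in mind, the sketch below groups free-text answers by string similarity before taking a majority vote; the similarity measure, threshold and example answers are illustrative, not a platform feature that exists today.

# Minimal sketch: aggregate free-text crowd answers by grouping near-duplicates with
# a string-similarity function before taking a majority vote.
from collections import Counter
from difflib import SequenceMatcher

def normalise(answer, seen, threshold=0.85):
    """Map an answer onto an earlier, sufficiently similar answer if one exists."""
    for canonical in seen:
        if SequenceMatcher(None, answer.lower(), canonical.lower()).ratio() >= threshold:
            return canonical
    seen.append(answer)
    return answer

def aggregate(answers):
    seen, counts = [], Counter()
    for a in answers:
        counts[normalise(a, seen)] += 1
    return counts.most_common(1)[0][0]

print(aggregate(["County", "county", "Countys", "Region"]))  # -> "County"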

References

[1] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, et al. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2014.

[2] Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Sören Auer, and Jens Lehmann. Crowdsourcing linked data quality assessment. The Semantic Web–ISWC 2013, pages 260–276, 2013.

[3] P Kreis. Design of a quality assessment framework for the DBpedia knowledge base. Master's thesis, Freie Universität Berlin, 2011.

[4] Amrapali Zaveri, Dimitris Kontokostas, Mohamed A Sherif, Lorenz Bühmann, Mohamed Morsey, Sören Auer, and Jens Lehmann. User-driven quality evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104. ACM, 2013.

[5] Katharina Siorpaes and Elena Simperl. Human intelligence in the process of semantic content creation. World Wide Web, 13(1-2):33–59, 2010.

[6] Dimitris Kontokostas. Making sense out of the Wikipedia categories (GSoC2013). wiki.dbpedia.org, 2013. URL http://wiki.dbpedia.org/blog/making-sense-out-wikipedia-categories-gsoc2013. [Accessed: 15 May 2016].

[7] Michael S Bernstein, Greg Little, Robert C Miller, Björn Hartmann, Mark S Ackerman, David R Karger, David Crowell, and Katrina Panovich. Soylent: a word processor with a crowd inside. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, pages 313–322. ACM, 2010.

[8] Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proceedings of the 21st international conference on World Wide Web, pages 469–478. ACM, 2012.

[9] Kai Eckert, Mathias Niepert, Christof Niemann, Cameron Buckner, Colin Allen, and Heiner Stuckenschmidt. Crowdsourcing the assembly of concept hierarchies. In Proceedings of the 10th annual joint conference on Digital libraries, pages 139–148. ACM, 2010.

[10] Cristina Sarasua, Elena Simperl, and Natalya F Noy. CrowdMap: Crowdsourcing ontology alignment with microtasks. In The Semantic Web–ISWC 2012, pages 525–541. Springer, 2012.

[11] Cheng Niu, Wei Li, Jihong Ding, and Rohini K Srihari. A bootstrapping approach to named entity classification using successive learners. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics – Volume 1, pages 335–342. Association for Computational Linguistics, 2003.

[12] Yu-Chieh Wu, Teng-Kai Fan, Yue-Shi Lee, and Show-Jane Yen. Extracting named entities using support vector machines. In Knowledge Discovery in Life Science Literature, pages 91–103. Springer, 2006. ISBN 3540328092.

[13] Michael Collins and Yoram Singer. Unsupervised models for named entity classification. In Proceedings of the joint SIGDAT conference on empirical methods in natural language processing and very large corpora, pages 100–110. Citeseer, 1999.

[14] Jae-Ho Kim, In-Ho Kang, and Key-Sun Choi. Unsupervised named entity classification models and their ensembles. In Proceedings of the 19th international conference on Computational linguistics – Volume 1, pages 1–7. Association for Computational Linguistics, 2002.

[15] Joana Costa, Catarina Silva, Mário Antunes, and Bernardete Ribeiro. On using crowdsourcing and active learning to improve classification performance. International Conference on Intelligent Systems Design and Applications, ISDA, pages 469–474, 2011. ISSN 21647143.

[16] Edwin Simpson, Stephen Roberts, Ioannis Psorakis, and Arfon Smith. Dynamic bayesian combination of multiple imperfect classifiers. Studies in Computational Intelligence, 474:1–35, 2013. ISSN 1860949X.

[17] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y Ng. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pages 254–263. Association for Computational Linguistics, 2008.

[18] Luis Von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006. ISSN 0018-9162.


[19] Jiannan Wang, Tim Kraska, Michael J Franklin, and Jianhua Feng. CrowdER: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11):1483–1494, 2012. ISSN 2150-8097.

[20] Babak Loni, Jonathon Hare, Mihai Georgescu, Michael Riegler, Xiaofei Zhu, Mohamed Morchid, Richard Dufour, and Martha Larson. Getting by with a little help from the crowd: Practical approaches to social image labeling. Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia, pages 69–74, 2014.

[21] Jonathon S Hare, Maribel Acosta, Anna Weston, Elena Simperl, Sina Samangooei, David Dupplaw, and Paul H Lewis. An Investigation of Techniques that Aim to Improve the Quality of Labels provided by the Crowd. In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, October 18-19, 2013, volume 1043 of CEUR Workshop Proceedings. CEUR-WS.org, 2013.

[22] Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '08), pages 614–622, Las Vegas, Nevada, USA, 2008.

[23] Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. Large-scale linked data integration using probabilistic reasoning and crowdsourcing. The VLDB Journal, 22(5):665–687, 2013. ISSN 1066-8888.

[24] Dafna Shahaf and Eric Horvitz. Generalized Task Markets for Human and Machine Computation. In AAAI, 2010.

[25] Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality management on Amazon Mechanical Turk. Proceedings of the ACM SIGKDD Workshop on Human Computation – HCOMP '10, page 64, 2010.

[26] Djellel Eddine Difallah, Michele Catasta, Gianluca Demartini, Panagiotis G. Ipeirotis, and Philippe Cudré-Mauroux. The Dynamics of Micro-Task Crowdsourcing: The Case of Amazon MTurk. pages 238–247, May 2015.

[27] Stephanie L. Rosenthal and Anind K. Dey. Towards maximizing the accuracy of human-labeled sensor data. Proceedings of the 15th international conference on Intelligent user interfaces – IUI '10, page 259, 2010.

[28] Jacob Whitehill, Paul Ruvolo, Tingfan Wu, Jacob Bergsma, and Javier Movellan. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. Advances in Neural Information Processing Systems, 22(1):1–9, 2009.

[29] Ece Kamar, Severin Hacker, and Eric Horvitz. Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems – Volume 1, pages 467–474. International Foundation for Autonomous Agents and Multiagent Systems, 2012.

[30] Chong Sun, Narasimhan Rampalli, Frank Yang, and AnHai Doan. Chimera: Large-scale classification using machine learning, rules, and crowdsourcing. Proceedings of the VLDB Endowment, 7(13):1529–1540, 2014. ISSN 2150-8097.

[31] Aniket Kittur, Ed H Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 453–456. ACM, 2008.

[32] Galaxy Zoo. Galaxy zoo. http://zoo1.galaxyzoo.org/, 2007.

[33] David Vickrey, Aaron Bronzan, William Choi, Aman Kumar, Jason Turner-Maier, Arthur Wang, and Daphne Koller. Online word games for semantic data collection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 533–542. Association for Computational Linguistics, 2008.

[34] Chris J Lintott, Kevin Schawinski, Anže Slosar, Kate Land, Steven Bamford, Daniel Thomas, M Jordan Raddick, Robert C Nichol, Alex Szalay, Dan Andreescu, et al. Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society, 389(3):1179–1189, 2008.

[35] Greg Little, Lydia B Chilton, Max Goldman, and Robert C Miller. TurKit: human computation algorithms on Mechanical Turk. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, pages 57–66. ACM, 2010.

[36] Eleanor Rosch, Carolyn B Mervis, Wayne D Gray, David M Johnson, and Penny Boyes-Braem. Basic objects in natural categories. Cognitive psychology, 8(3):382–439, 1976.

[37] James W Tanaka and Marjorie Taylor. Object categories and expertise: Is the basic level in the eye of the beholder? Cognitive psychology, 23(3):457–482, 1991.

[38] Luis Von Ahn and Laura Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 319–326. ACM, 2004.

[39] Andrew W Brown and David B Allison. Using crowdsourcing to evaluate published scientific literature: methods and example. PloS one, 9(7):e100647, 2014.

[40] Lydia B Chilton, Greg Little, Darren Edge, Daniel S Weld, and James A Landay. Cascade: Crowdsourcing taxonomy creation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1999–2008. ACM, 2013.

[41] Benjamin Scheibehenne, Rainer Greifeneder, and Peter M Todd. What moderates the too-much-choice effect? Psychology & Marketing, 26(3):229–253, 2009.

[42] Barry Schwartz. The paradox of choice. Ecco, 2004.

[43] Sheena S Iyengar and Mark R Lepper. When choice is demotivating: Can one desire too much of a good thing? Journal of personality and social psychology, 79(6):995, 2000.

[44] Leola A Alfonso-Reese, F Gregory Ashby, and David H Brainard. What makes a categorization task difficult? Perception & Psychophysics, 64(4):570–583, 2002.

[45] Jianping Hua, Zixiang Xiong, James Lowey, Edward Suh, and Edward R Dougherty. Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21(8):1509–1515, 2005.

[46] George A Miller. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological review, 63(2):81, 1956.

[47] Panagiotis G Ipeirotis. Analyzing the Amazon Mechanical Turk marketplace. XRDS: Crossroads, The ACM Magazine for Students, 17(2):16–21, 2010.

[48] Lora Aroyo and Chris Welty. Crowd truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard. Web Science'13, 2013.

[49] O Feyisetan, E Simperl, M Luczak-Roesch, R Tinati, and N. Shadbolt. Towards hybrid NER: a study of content and crowdsourcing-related performance factors. In Proceedings of the ESWC 2015 (to appear), 2015.