
Comparing Bayesian Models of Annotation

Silviu Paun1 Bob Carpenter2 Jon Chamberlain3

Dirk Hovy4 Udo Kruschwitz3 Massimo Poesio1

1 School of Electronic Engineering and Computer Science, Queen Mary University of London
2 Department of Statistics, Columbia University
3 School of Computer Science and Electronic Engineering, University of Essex
4 Department of Marketing, Bocconi University

Abstract

The analysis of crowdsourced annotations in NLP is concerned with identifying 1) gold standard labels, 2) annotator accuracies and biases, and 3) item difficulties and error patterns. Traditionally, majority voting was used for 1), and coefficients of agreement for 2) and 3). Lately, model-based analysis of corpus annotations has proven better at all three tasks. But there has been relatively little work comparing such models on the same datasets. This paper aims to fill this gap by analyzing six models of annotation, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items. We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random (spammy) annotators. We conclude with guidelines for model selection, application, and implementation.

1 Introduction

The standard methodology for analyzing crowdsourced data in NLP is based on majority voting (selecting the label chosen by the majority of coders) and inter-annotator coefficients of agreement, such as Cohen's κ (Artstein and Poesio, 2008). However, aggregation by majority vote implicitly assumes equal expertise among the annotators. This assumption, though, has been repeatedly shown to be false in annotation practice (Poesio and Artstein, 2005; Passonneau and Carpenter, 2014; Plank et al., 2014b). Chance-adjusted coefficients of agreement also have many shortcomings: e.g., agreement on mistaken labels, overly large chance agreement in datasets with skewed classes, and no correction for annotator bias (Feinstein and Cicchetti, 1990; Passonneau and Carpenter, 2014).

Research suggests that models of annotation can solve these problems of standard practice when applied to crowdsourcing (Dawid and Skene, 1979; Smyth et al., 1995; Raykar et al., 2010; Hovy et al., 2013; Passonneau and Carpenter, 2014). Such probabilistic approaches allow us to characterize the accuracy of the annotators and correct for their bias, as well as to account for item-level effects. They have been shown to perform better than non-probabilistic alternatives based on heuristic analysis or adjudication (Quoc Viet Hung et al., 2013). But even though a large number of such models have been proposed (Carpenter, 2008; Whitehill et al., 2009; Raykar et al., 2010; Hovy et al., 2013; Simpson et al., 2013; Passonneau and Carpenter, 2014; Felt et al., 2015a; Kamar et al., 2015; Moreno et al., 2015, inter alia), it is not immediately obvious to potential users how these models differ, or in fact, how they should be applied at all. To our knowledge, the literature comparing models of annotation is limited, focusing exclusively on synthetic data (Quoc Viet Hung et al., 2013) or using publicly available implementations which constrain the experiments almost exclusively to binary annotations (Sheshadri and Lease, 2013).

Contributions

• Our selection of six widely used models (Dawid and Skene, 1979; Hovy et al., 2013; Carpenter, 2008) covers models with varying degrees of complexity: pooled models, which assume all annotators share the same ability; unpooled models, which model individual annotator parameters; and partially pooled models, which employ a hierarchical structure to let the level of pooling be dictated by the data.

• We carry out the evaluation on four datasets with varying degrees of sparsity and annotator accuracy, in both gold-standard dependent and independent settings.

• We use fully Bayesian posterior inference to quantify the uncertainty in parameter estimates.

• We provide guidelines for both model selection and implementation.

Our findings indicate that models which include annotator structure generally outperform other models, though unpooled models can overfit. Several open-source implementations of each model type are available to users.

2 Bayesian Annotation Models

All Bayesian models of annotation we describe are generative: they provide a mechanism to generate parameters θ characterizing the process (annotator accuracies and biases, prevalence, etc.) from the prior p(θ), then generate the observed labels y from the parameters according to the sampling distribution p(y|θ). Bayesian inference allows us to condition on some observed data y to draw inferences about the parameters θ; this is done through the posterior, p(θ|y). The uncertainty in such inferences may then be used in applications such as jointly training classifiers (Smyth et al., 1995; Raykar et al., 2010), comparing crowdsourcing systems (Lease and Kazai, 2011), or characterizing corpus accuracy (Passonneau and Carpenter, 2014).

This section describes the six models we evaluate. These models are drawn from the literature, but some had to be generalized from binary to multiclass annotations. The generalization naturally comes with parameterization changes, although these do not alter the fundamentals of the models. (One aspect tied to the model parameterization is the choice of priors. The guideline we followed was to avoid injecting any class preferences a priori and to let the data uncover this information; see more in Section 3.)

Figure 2: Plate diagram of the Dawid and Skene model.

2.1 A Pooled Model

Figure 1: Plate diagram for the multinomial model. The hyperparameters are left out.

Multinomial (MULTINOM) The simplest Bayesian model of annotation is the binomial model proposed by Albert and Dodd (2004) and discussed by Carpenter (2008). This model pools all annotators (i.e., assumes they have the same ability; see Figure 1).1 The generative process is as follows (a short simulation sketch is given after the list):

• For every class k ∈ 1, 2, ...,K:
  – Draw class-level abilities ζk ∼ Dirichlet(1K)2

• Draw class prevalence π ∼ Dirichlet(1K)

• For every item i ∈ 1, 2, ..., I:
  – Draw true class ci ∼ Categorical(π)
  – For every position n ∈ 1, 2, ..., Ni:
    ∗ Draw annotation yi,n ∼ Categorical(ζci)

1 Carpenter (2008) parameterizes ability in terms of specificity and sensitivity. For multiclass annotations, we generalize to a full response matrix (Passonneau and Carpenter, 2014).
2 Notation: 1K is a K-dimensional vector of 1 values.
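To make the pooled structure concrete, here is a minimal NumPy sketch of the generative process above (forward simulation only, not inference). The sizes I, K, and the number of annotations per item are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
I, K = 50, 4                      # illustrative numbers of items and classes
N_i = rng.integers(3, 8, size=I)  # annotations per item (assumed here)

pi = rng.dirichlet(np.ones(K))            # class prevalence
zeta = rng.dirichlet(np.ones(K), size=K)  # one shared response row per true class

true_classes, annotations = [], []
for i in range(I):
    c_i = rng.choice(K, p=pi)                      # true class of item i
    y_i = rng.choice(K, size=N_i[i], p=zeta[c_i])  # pooled: every coder uses the same zeta
    true_classes.append(c_i)
    annotations.append(y_i)
```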

2.2 Unpooled Models

Dawid and Skene (D&S) The model proposed by Dawid and Skene (1979) is, to our knowledge, the first model-based approach to annotation proposed in the literature.3 It has found wide application (e.g., Kim and Ghahramani, 2012; Simpson et al., 2013; Passonneau and Carpenter, 2014). It is an unpooled model, i.e., each annotator has their own response parameters (see Figure 2), which are given fixed priors. Its generative process is given below, followed by a short simulation sketch:

3 Dawid and Skene fit maximum likelihood estimates using expectation maximization (EM), but the model is easily extended to include fixed prior information for regularization, or hierarchical priors for fitting the prior jointly with the ability parameters and automatically performing partial pooling.

• For every annotator j ∈ 1, 2, ..., J:

  – For every class k ∈ 1, 2, ...,K:
    ∗ Draw class annotator abilities βj,k ∼ Dirichlet(1K)

• Draw class prevalence π ∼ Dirichlet(1K)

• For every item i ∈ 1, 2, ..., I:

– Draw true class ci ∼ Categorical(π)

  – For every position n ∈ 1, 2, ..., Ni:
    ∗ Draw annotation yi,n ∼ Categorical(βjj[i,n],ci)4

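A small NumPy sketch of the D&S generative process under illustrative sizes; the key difference from the pooled model is that each annotator j has their own response matrix beta[j]. The choice of five coders per item is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K = 50, 10, 4                     # illustrative sizes
pi = rng.dirichlet(np.ones(K))          # class prevalence
# beta[j, k] is annotator j's response distribution when the true class is k
beta = rng.dirichlet(np.ones(K), size=(J, K))

annotations = []                        # (item, annotator, label) triples
for i in range(I):
    c_i = rng.choice(K, p=pi)                       # true class
    coders = rng.choice(J, size=5, replace=False)   # jj[i, n]: who annotates item i
    for j in coders:
        y = rng.choice(K, p=beta[j, c_i])
        annotations.append((i, j, y))
```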

Figure 3: Plate diagram for the MACE model.

Multi-Annotator Competence Estimation (MACE) This model, introduced by Hovy et al. (2013), takes into account the credibility of the annotators and their spamming preference and strategy5 (see Figure 3). This is another example of an unpooled model, and possibly the model most widely applied to linguistic data (e.g., Plank et al., 2014a; Sabou et al., 2014; Habernal and Gurevych, 2016, inter alia). Its generative process is given below (a short simulation sketch follows the list and its footnotes):

• For every annotator j ∈ 1, 2, ..., J:

  – Draw spamming behavior εj ∼ Dirichlet(10K)

– Draw credibility θj ∼ Beta(0.5, 0.5)

• For every item i ∈ 1, 2, ..., I:

  – Draw true class ci ∼ Uniform
  – For every position n ∈ 1, 2, ..., Ni:
    ∗ Draw a spamming indicator si,n ∼ Bernoulli(1 − θjj[i,n])
    ∗ If si,n = 0 then:
      · yi,n = ci
    ∗ Else:
      · yi,n ∼ Categorical(εjj[i,n])

4 Notation: jj[i,n] gives the index of the annotator who produced the n-th annotation on item i.

5 I.e., the propensity to produce labels with malicious intent.
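A NumPy sketch of the MACE generative process, again with illustrative sizes; the spamming indicator decides whether the annotator copies the true class or draws from their own spamming distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 50, 10, 4
epsilon = rng.dirichlet(10 * np.ones(K), size=J)   # spamming behavior per annotator
theta = rng.beta(0.5, 0.5, size=J)                 # credibility per annotator

annotations = []
for i in range(I):
    c_i = rng.integers(K)                           # true class ~ Uniform
    for j in rng.choice(J, size=5, replace=False):  # jj[i, n]
        spamming = rng.random() < 1 - theta[j]      # s_{i,n} ~ Bernoulli(1 - theta_j)
        y = rng.choice(K, p=epsilon[j]) if spamming else c_i
        annotations.append((i, j, y))
```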

Figure 4: Plate diagram for the hierarchical Dawid and Skene model.

2.3 Partially-Pooled Models

Hierarchical Dawid and Skene (HIERD&S) In this model, the fixed priors of Dawid and Skene are replaced with hierarchical priors representing the overall population of annotators (see Figure 4). This structure provides partial pooling, using information about the population to improve estimates of individuals by regularizing toward the population mean. This is particularly helpful with low count data, as found in many crowdsourcing tasks (Gelman et al., 2013). The full generative process, followed by a short simulation sketch, is as follows:6

• For every class k ∈ 1, 2, ...,K:

  – Draw class ability means ζk,k′ ∼ Normal(0, 1), ∀k′ ∈ 1, ...,K
  – Draw class s.d.'s Ωk,k′ ∼ HalfNormal(0, 1), ∀k′

• For every annotator j ∈ 1, 2, ..., J:

  – For every class k ∈ 1, 2, ...,K:
    ∗ Draw class annotator abilities βj,k,k′ ∼ Normal(ζk,k′, Ωk,k′), ∀k′

• Draw class prevalence π ∼ Dirichlet(1K)

• For every item i ∈ 1, 2, ..., I:

– Draw true class ci ∼ Categorical(π)

  – For every position n ∈ 1, 2, ..., Ni:
    ∗ Draw annotation yi,n ∼ Categorical(softmax(βjj[i,n],ci))7

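A NumPy/SciPy sketch of the hierarchical generative process; annotator response rows are drawn around population means and turned into probabilities with a softmax. Sizes are illustrative assumptions.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(3)
I, J, K = 50, 10, 4
zeta = rng.normal(0, 1, size=(K, K))            # population ability means
omega = np.abs(rng.normal(0, 1, size=(K, K)))   # population s.d.'s (half-normal)
beta = rng.normal(zeta, omega, size=(J, K, K))  # per-annotator abilities, partially pooled
pi = rng.dirichlet(np.ones(K))

annotations = []
for i in range(I):
    c_i = rng.choice(K, p=pi)
    for j in rng.choice(J, size=5, replace=False):
        y = rng.choice(K, p=softmax(beta[j, c_i]))  # row turned into probabilities
        annotations.append((i, j, y))
```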

6 A two-class version of this model can be found in (Carpenter, 2008) under the name "Beta-Binomial by Annotator".
7 The argument of the softmax is a K-dimensional vector of annotator abilities given the true class, i.e., βjj[i,n],ci = (βjj[i,n],ci,1, ..., βjj[i,n],ci,K).

Figure 5: Plate diagram for the item difficulty model.

Item Difficulty (ITEMDIFF) We also test an extension of the "Beta-Binomial by Item" model from Carpenter (2008), which does not assume any annotator structure; instead, the annotations of an item are made to depend on its intrinsic difficulty. The model further assumes that item difficulties are instances of class-level hierarchical difficulties (see Figure 5). This is another example of a partially pooled model. Its generative process is presented below, again followed by a short simulation sketch:

• For every class k ∈ 1, 2, ...,K:

  – Draw class difficulty means ηk,k′ ∼ Normal(0, 1), ∀k′ ∈ 1, ...,K
  – Draw class s.d.'s Xk,k′ ∼ HalfNormal(0, 1), ∀k′

• Draw class prevalence π ∼ Dirichlet(1K)

• For every item i ∈ 1, 2, ..., I:

– Draw true class ci ∼ Categorical(π)

  – Draw item difficulty θi,k ∼ Normal(ηci,k, Xci,k), ∀k

  – For every position n ∈ 1, 2, ..., Ni:
    ∗ Draw annotation yi,n ∼ Categorical(softmax(θi))
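A corresponding sketch for ITEMDIFF: there is no annotator identity, and each annotation depends only on the item's difficulty vector. Sizes and the number of annotations per item are assumptions for illustration.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(4)
I, K = 50, 4
eta = rng.normal(0, 1, size=(K, K))           # class-level difficulty means
chi = np.abs(rng.normal(0, 1, size=(K, K)))   # class-level s.d.'s (half-normal)
pi = rng.dirichlet(np.ones(K))

annotations = []
for i in range(I):
    c_i = rng.choice(K, p=pi)
    theta_i = rng.normal(eta[c_i], chi[c_i])  # item-level difficulty vector (length K)
    for n in range(5):                        # no annotator identity in this model
        y = rng.choice(K, p=softmax(theta_i))
        annotations.append((i, y))
```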

Figure 6: Plate diagram for the logistic random effects model.

Logistic Random Effects (LOGRNDEFF) The last model is the Logistic Random Effects model (Carpenter, 2008), which assumes the annotations depend on both annotator abilities and item difficulties (see Figure 6). Both annotator and item parameters are drawn from hierarchical priors for partial pooling. Its generative process is given below, followed by a simulation sketch:

• For every class k ∈ 1, 2, ...,K:

  – Draw class ability means ζk,k′ ∼ Normal(0, 1), ∀k′ ∈ 1, ...,K
  – Draw class ability s.d.'s Ωk,k′ ∼ HalfNormal(0, 1), ∀k′
  – Draw class difficulty s.d.'s Xk,k′ ∼ HalfNormal(0, 1), ∀k′

• For every annotator j ∈ 1, 2, ..., J:

  – For every class k ∈ 1, 2, ...,K:
    ∗ Draw class annotator abilities βj,k,k′ ∼ Normal(ζk,k′, Ωk,k′), ∀k′


• Draw class prevalence π ∼ Dirichlet(1K)

• For every item i ∈ 1, 2, ..., I:

– Draw true class ci ∼ Categorical(π)

  – Draw item difficulty θi,k ∼ Normal(0, Xci,k), ∀k

  – For every position n ∈ 1, 2, ..., Ni:
    ∗ Draw annotation yi,n ∼ Categorical(softmax(βjj[i,n],ci − θi))
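And a sketch for LOGRNDEFF, combining the annotator abilities of HIERD&S with zero-centered item difficulties; note the subtractive relationship inside the softmax. As before, the sizes are illustrative.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(5)
I, J, K = 50, 10, 4
zeta = rng.normal(0, 1, size=(K, K))            # ability means
omega = np.abs(rng.normal(0, 1, size=(K, K)))   # ability s.d.'s
chi = np.abs(rng.normal(0, 1, size=(K, K)))     # difficulty s.d.'s
beta = rng.normal(zeta, omega, size=(J, K, K))  # annotator abilities
pi = rng.dirichlet(np.ones(K))

annotations = []
for i in range(I):
    c_i = rng.choice(K, p=pi)
    theta_i = rng.normal(0, chi[c_i])           # item difficulty, centered at zero
    for j in rng.choice(J, size=5, replace=False):
        y = rng.choice(K, p=softmax(beta[j, c_i] - theta_i))  # ability minus difficulty
        annotations.append((i, j, y))
```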

3 Implementation of the Models

We implemented all the models in this paper in Stan (Carpenter et al., 2017), a tool for Bayesian inference based on Hamiltonian Monte Carlo. Although the non-hierarchical models we present can be fit with (penalized) maximum likelihood (Dawid and Skene, 1979; Passonneau and Carpenter, 2014),8 there are several advantages to a Bayesian approach. First and foremost, it provides a means of measuring predictive calibration for forecasting future results. For a well-specified model that matches the generative process, Bayesian inference provides optimally calibrated inferences (Bernardo and Smith, 2001); for only roughly accurate models, calibration may be measured for model comparison (Gneiting et al., 2007). Calibrated inference is critical for making optimal decisions, as well as for forecasting (Berger, 2013). A second major benefit of Bayesian inference is its flexibility in combining submodels in a computationally tractable manner.

8 Hierarchical models are challenging to fit with classical methods; the standard approach, maximum marginal likelihood, requires marginalizing the hierarchical parameters, fitting those with an optimizer, then plugging the hierarchical parameter estimates in and repeating the process on the coefficients (Efron, 2012). This marginalization requires a custom approximation per model, in terms of either quadrature or MCMC, to compute the nested integral required for the marginal distribution that must be optimized first (Martins et al., 2013).


For example, predictors or features might be available to allow the simple categorical prevalence model to be replaced with a multi-logistic regression (Raykar et al., 2010), features of the annotators may be used to convert that to a regression model, or semi-supervised training might be carried out by adding known gold-standard labels (Van Pelt and Sorokin, 2012). Each model can be implemented straightforwardly and fit exactly (up to some degree of arithmetic precision) using Markov chain Monte Carlo (MCMC) methods, allowing a wide range of models to be evaluated. This is largely because posteriors are much better behaved than point estimates for hierarchical models, which require custom solutions on a per-model basis for fitting with classical approaches (Rabe-Hesketh and Skrondal, 2008). Both of these benefits make Bayesian inference much simpler and more useful than classical point estimates and standard errors.

Convergence is assessed in a standard fashion using the approach proposed by Gelman and Rubin (1992): for each model we run four chains with diffuse initializations and verify that they converge to the same means and variances, using the criterion R̂ < 1.1 (a small sketch of this statistic is given below).
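The following is a small sketch of the (non-split) Gelman-Rubin statistic for a single scalar parameter, assuming the post-warmup draws of the four chains have already been collected into an array; production tools typically use the refined split-R̂ variant.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Basic potential scale reduction factor R-hat for one scalar parameter.

    `chains` has shape (m, n): m chains, n post-warmup draws each. This is the
    classic Gelman and Rubin (1992) formula, without the split-chain refinement.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    chain_vars = chains.var(axis=1, ddof=1)
    W = chain_vars.mean()              # within-chain variance
    B = n * chain_means.var(ddof=1)    # between-chain variance
    var_hat = (n - 1) / n * W + B / n  # pooled posterior variance estimate
    return np.sqrt(var_hat / W)

# e.g., check every parameter and require R-hat < 1.1 before trusting the fit:
# rhats = {name: gelman_rubin_rhat(draws[name]) for name in draws}
```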

Hierarchical priors, when jointly fit with the rest of the parameters, will be as strong, and thus support as much pooling, as evidenced by the data. For fixed priors on simplexes (probability parameters that must be non-negative and sum to 1.0), we use uniform distributions (i.e., Dirichlet(1K)). For location and scale parameters, we use weakly informative normal and half-normal priors that inform the scale of the results, but are not otherwise sensitive. As with all priors, they trade some bias for variance and stabilize inferences when there is not much data. The exception is MACE, for which we used the originally recommended priors, to conform with the authors' motivation.

All model implementations are available to readers online at http://dali.eecs.qmul.ac.uk/papers/supplementary_material.zip.

4 Evaluation

The models of annotation discussed in this paper find their application in multiple tasks: to label items, to characterize the annotators, or to flag especially difficult items. This section lays out the metrics used in the evaluation of each of these tasks.

Dataset    I     N      J    K   J/I (Min/Q1/Med/Mean/Q3/Max)   I/J (Min/Q1/Med/Mean/Q3/Max)
WSD        177   1770   34   3   10 / 10 / 10 / 10 / 10 / 10    17 / 20 / 20 / 52 / 77 / 177
RTE        800   8000   164  2   10 / 10 / 10 / 10 / 10 / 10    20 / 20 / 20 / 49 / 20 / 800
TEMP       462   4620   76   2   10 / 10 / 10 / 10 / 10 / 10    10 / 10 / 16 / 61 / 50 / 462
PD         5892  43161  294  4   1 / 5 / 7 / 7 / 9 / 57         1 / 4 / 13 / 147 / 51 / 3395

Table 1: General statistics (I items, N observations, J annotators, K classes), together with summary statistics (Min, 1st Quartile, Median, Mean, 3rd Quartile, Max) for the number of annotators per item (J/I) and the number of items per annotator (I/J).

4.1 Datasets

We evaluate on a collection of datasets reflecting a variety of use cases and conditions: binary vs. multi-class classification; small vs. large numbers of annotators; sparse vs. abundant numbers of items per annotator and annotators per item; and varying degrees of annotator quality (statistics presented in Table 1). Three of the datasets (WSD, RTE, and TEMP, created by Snow et al. (2008)) are widely used in the literature on annotation models (Hovy et al., 2013; Carpenter, 2008). In addition, we include the Phrase Detectives 1.0 (PD) corpus (Chamberlain et al., 2016), which differs in a number of key ways from the Snow et al. (2008) datasets: it has a much larger number of items and annotations, greater sparsity, and a much greater likelihood of spamming due to its collection via a Game-With-A-Purpose. This dataset is also less artificial than the datasets in Snow et al. (2008), which were created with the express purpose of testing crowdsourcing. The data consist of anaphoric annotations, which we reduce to four general classes (DN/DO: discourse new/old, PR: property, and NR: non-referring). To ensure similarity with the Snow et al. (2008) datasets, we also limit the coders to one annotation per item (the discarded data was mostly redundant annotations). Furthermore, this corpus allows us to evaluate on meta-data not usually available in traditional crowdsourcing platforms, namely information about confessed spammers and good, established players.

4.2 Comparison Against a Gold Standard

The first aspect we assess is how accurately the models identify the correct ("true") label of the items. The simplest way to do this is by comparing the inferred labels against a gold standard, using standard metrics such as Precision / Recall / F-measure, as done, e.g., for the evaluation of MACE in Hovy et al. (2013). We check whether the reported differences are statistically significant using bootstrapping (the shift method), a non-parametric two-sided test (Smucker et al., 2007; Wilbur, 1994). We use a significance threshold of 0.05 and further report whether the significance still holds after applying the Bonferroni correction for Type-1 errors.

This type of evaluation, however, presupposes that a gold standard can be obtained. This assumption has been questioned by studies showing the extent of disagreement on annotation even among experts (Poesio and Artstein, 2005; Passonneau and Carpenter, 2014; Plank et al., 2014b). This motivates exploring complementary evaluation methods.

4.3 Predictive Accuracy

In the statistical analysis literature, posterior predictions are a standard assessment method for Bayesian models (Gelman et al., 2013). We measure the predictive performance of each model using the log predictive density (lpd), i.e., log p(ỹ|y), in a Bayesian K-fold cross-validation setting (Piironen and Vehtari, 2017; Vehtari et al., 2017). The set-up is straightforward: we partition the data into K subsets, each subset formed by splitting the annotations of each annotator into K random folds (we choose K = 5). The splitting strategy ensures that models that cannot handle predictions for new annotators (i.e., unpooled models like D&S and MACE) are nevertheless included in the comparison. Concretely, we compute

lpd = ∑_{k=1}^{K} log p(yk | y(−k))
    = ∑_{k=1}^{K} log ∫ p(yk, θ | y(−k)) dθ
    ≈ ∑_{k=1}^{K} log [ (1/M) ∑_{m=1}^{M} p(yk | θ(k,m)) ]          (1)

In (1), y(−k) and yk represent the items from the train and the test data for iteration k of the cross-validation, while θ(k,m) is one draw from the posterior. A sketch of this Monte Carlo approximation is given below.
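A sketch of the Monte Carlo approximation in Eq. (1). It assumes that, for each fold k, the pointwise log-likelihoods of the held-out annotations under each posterior draw have already been computed and stored in an array; the array names and shapes are assumptions for illustration.

```python
import numpy as np
from scipy.special import logsumexp

def fold_lpd(log_lik):
    """Monte Carlo estimate of log p(y_k | y_(-k)) as in Eq. (1).

    `log_lik` has shape (M, N_k): for each of M posterior draws theta^(k,m)
    (fit on the training folds y_(-k)), the log-likelihood of each of the
    N_k held-out annotations in fold k.
    """
    joint = log_lik.sum(axis=1)                # log p(y_k | theta^(k,m)) per draw
    return logsumexp(joint) - np.log(log_lik.shape[0])  # log of the average over draws

# Summing over folds and normalizing per item, as reported in Table 5:
# lpd = sum(fold_lpd(log_lik_k) for log_lik_k in per_fold_log_liks)
# lpd_per_item = lpd / I
```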

4.4 Annotators' Characterization

A key property of most of these models is that they provide a characterization of coder ability. In the D&S model, for instance, each annotator is modeled with a confusion matrix; Passonneau and Carpenter (2014) showed how different types of annotators (biased, spamming, adversarial) can be identified by examining this matrix. The same information is available in HIERD&S and LOGRNDEFF, whereas MACE characterizes coders by their level of credibility and spamming preference. We discuss these parameters with the help of the meta-data provided by the PD corpus.

Some of the models (e.g., MULTINOM or ITEMDIFF) do not explicitly model annotators. However, an estimate of annotator accuracy can be derived post-inference for all the models. Concretely, we define the accuracy of an annotator as the proportion of their annotations that match the inferred item classes. This follows the calculation of gold-annotator accuracy (Hovy et al., 2013), computed with respect to the gold standard. Similar to Hovy et al. (2013), we report the correlation between the estimated and the gold accuracy of the annotators. A sketch of this post-inference computation is given below.
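A sketch of this post-inference accuracy computation; the data layout (item, annotator, label triples) and the use of a Pearson correlation are assumptions for illustration.

```python
from scipy.stats import pearsonr

def annotator_accuracy(annotations, item_class):
    """Share of an annotator's labels that match `item_class`, which maps each
    item to a reference label (inferred posterior mode, or gold for comparison).
    `annotations` is a list of (item, annotator, label) triples."""
    hits, totals = {}, {}
    for i, j, y in annotations:
        totals[j] = totals.get(j, 0) + 1
        hits[j] = hits.get(j, 0) + int(y == item_class[i])
    return {j: hits[j] / totals[j] for j in totals}

# Correlation between estimated and gold annotator accuracy:
# est = annotator_accuracy(annotations, inferred_class)
# gold = annotator_accuracy(annotations, gold_class)
# r, _ = pearsonr([est[j] for j in sorted(est)], [gold[j] for j in sorted(gold)])
```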

4.5 Item Difficulty

Finally, the LOGRNDEFF model also provides an estimate which can be used to assess item difficulty. This parameter has an effect on the correctness of the annotators, i.e., there is a subtractive relationship between the ability of an annotator and the item-difficulty parameter. The 'difficulty' name is thus appropriate, although an examination of this parameter alone does not explicitly mark an item as difficult or easy. The ITEMDIFF model does not model annotators and only uses the difficulty parameter, but there the name is slightly misleading, since its probabilistic role changes in the absence of the other parameter (i.e., it now shows the most likely annotation classes for an item). These observations motivate an independent measure of item difficulty, but there is no agreement on what such a measure could be.

One approach is to relate the difficulty of an item to the confidence a model has in assigning it a label. This way, the difficulty of the items is judged under the subjectivity of the models, which, in turn, is influenced by their set of assumptions and data fitness. As in Hovy et al. (2013), we measure the models' confidence via entropy, filter out the items the models are least confident in (i.e., the more difficult ones), and report accuracy trends (a sketch of this filtering step is given below).
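A sketch of the entropy-based filtering; class_probs is an assumed (I × K) array of posterior class probabilities per item.

```python
import numpy as np

def entropy_filter(class_probs, keep_fraction):
    """Rank items by the entropy of the posterior over their true class and keep
    the `keep_fraction` the model is most confident about.

    `class_probs` has shape (I, K). Returns indices of retained items, lowest
    entropy (highest confidence) first."""
    p = np.clip(class_probs, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    order = np.argsort(entropy)               # most confident items first
    keep = int(round(keep_fraction * len(order)))
    return order[:keep]

# kept = entropy_filter(class_probs, 0.8)
# Accuracy on the kept items can then be plotted against keep_fraction, as in Figures 7-9.
```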


5 Results

This section assesses the six models along different dimensions. The results are compared with those obtained with a simple majority vote (MAJVOTE) baseline. We do not compare the results with non-probabilistic baselines, as it has already been shown (see, e.g., Quoc Viet Hung et al., 2013) that they underperform compared to models of annotation.

We follow the evaluation tasks and metrics discussed in Section 4 and briefly summarized next. A core task for which models of annotation are employed is to infer the correct interpretations from a crowdsourced dataset of annotations. This evaluation is conducted first and consists of a comparison against a gold standard. A problem with this assessment is caused by ambiguity, previous studies indicating disagreement even among experts. Since obtaining a true gold standard is questionable, we further explore a complementary evaluation, assessing the predictive performance of the models, a standard evaluation approach from the literature on Bayesian models. Another core task models of annotation are used for is to characterize the accuracy of the annotators and their error patterns. This is the third objective of this evaluation. Finally, we conclude this section by assessing the ability of the models to correctly diagnose the items for which potentially incorrect labels have been inferred.

The PD data are too sparse to fit the models with item-level difficulties (i.e., ITEMDIFF and LOGRNDEFF). These models are therefore not present in the evaluations conducted on the PD corpus.

5.1 Comparison Against a Gold Standard

A core task models of annotation are used for is to infer the correct interpretations from crowd-annotated datasets. This section compares the inferred interpretations with a gold standard.

Tables 2, 3, and 4 present the results.9 On the WSD and TEMP datasets (see Table 4), characterized by a small number of items and annotators (statistics in Table 1), the different model complexities result in no gains, all the models performing equivalently.

9 The results we report for MAJVOTE, HIERD&S, and LOGRNDEFF match or slightly outperform those reported by Carpenter (2008) on the RTE dataset. Similarly for MACE, across the WSD, RTE, and TEMP datasets (Hovy et al., 2013).

Model       Result  Statistical Significance
MULTINOM    0.89    D&S* HIERD&S* LOGRNDEFF* MACE*
D&S         0.92    ITEMDIFF* MAJVOTE MULTINOM*
HIERD&S     0.93    ITEMDIFF* MAJVOTE* MULTINOM*
ITEMDIFF    0.89    LOGRNDEFF* MACE* D&S* HIERD&S*
LOGRNDEFF   0.93    MAJVOTE* MULTINOM* ITEMDIFF*
MACE        0.93    MAJVOTE* MULTINOM* ITEMDIFF*
MAJVOTE     0.90    D&S HIERD&S* LOGRNDEFF* MACE*

Table 2: RTE dataset: results against the gold standard. Both micro (accuracy) and macro (P, R, F) scores are the same. * indicates that significance (0.05 threshold) holds after applying the Bonferroni correction.

Statistically significant differences (0.05 threshold, plus Bonferroni correction for Type-1 errors; see Section 4.2 for details) are, however, very much present in Tables 2 (RTE dataset) and 3 (PD dataset). Here the results are dominated by the unpooled models (D&S and MACE) and the partially pooled models (LOGRNDEFF and HIERD&S, except for PD, as discussed later in Section 6.1), which assume some form of annotator structure. Furthermore, modeling the full annotator response matrix leads in general to better results (e.g., D&S vs. MACE on the PD dataset). Completely ignoring any annotator structure is rarely appropriate, such models failing to capture the different levels of expertise the coders have; see the poor performance of the pooled MULTINOM model and of the partially pooled ITEMDIFF model. Similarly, the MAJVOTE baseline implicitly assumes equal expertise among coders, leading to poor results.

5.2 Predictive Accuracy

Ambiguity causes disagreement even among experts, affecting the reliability of existing gold standards. This section therefore presents a complementary evaluation, i.e., predictive accuracy. In a similar spirit to the results obtained in the comparison against the gold standard, modeling the ability of the annotators was also found essential for good predictive performance (results presented in Table 5).


Accuracy (micro)
Model       Result  Statistical Significance
MULTINOM    0.87    D&S* HIERD&S* MACE* MAJVOTE
D&S         0.94    HIERD&S* MACE* MAJVOTE* MULTINOM*
HIERD&S     0.89    MACE* MAJVOTE* MULTINOM* D&S*
MACE        0.93    MAJVOTE* MULTINOM* D&S* HIERD&S*
MAJVOTE     0.88    MULTINOM D&S* HIERD&S* MACE*

F-measure (macro)
Model       Result  Statistical Significance
MULTINOM    0.79    D&S* HIERD&S* MACE* MAJVOTE*
D&S         0.87    HIERD&S* MACE* MAJVOTE* MULTINOM*
HIERD&S     0.82    MAJVOTE* MULTINOM* D&S*
MACE        0.83    MAJVOTE* MULTINOM* D&S*
MAJVOTE     0.73    MULTINOM* D&S* HIERD&S* MACE*

Precision (macro)
Model       Result  Statistical Significance
MULTINOM    0.73    D&S* HIERD&S* MACE* MAJVOTE*
D&S         0.88    HIERD&S* MACE* MULTINOM*
HIERD&S     0.76    MACE* MAJVOTE* MULTINOM* D&S*
MACE        0.83    MAJVOTE MULTINOM* D&S* HIERD&S*
MAJVOTE     0.87    MULTINOM* HIERD&S* MACE

Recall (macro)
Model       Result  Statistical Significance
MULTINOM    0.85    HIERD&S* MAJVOTE*
D&S         0.87    HIERD&S MACE MAJVOTE*
HIERD&S     0.89    MACE* MAJVOTE* MULTINOM* D&S
MACE        0.84    MAJVOTE* D&S HIERD&S*
MAJVOTE     0.63    MULTINOM* D&S* HIERD&S* MACE*

Table 3: PD dataset: results against the gold standard. * indicates that significance holds after Bonferroni correction.

Dataset  Model                  Acc (micro)  P (macro)  R (macro)  F (macro)
WSD      ITEMDIFF, LOGRNDEFF    0.99         0.83       0.99       0.91
         Others                 0.99         0.89       1.00       0.94
TEMP     MAJVOTE                0.94         0.93       0.94       0.94
         Others                 0.94         0.94       0.94       0.94

Table 4: Results against the gold standard (micro accuracy; macro P, R, F).

However, in this type of evaluation the unpooled models can overfit, affecting their performance, e.g., a model of higher complexity like D&S on a small dataset like WSD. The partially pooled models avoid overfitting through their hierarchical structure, obtaining the best predictive accuracy. Ignoring the annotator structure (ITEMDIFF and MULTINOM) leads to poor performance on all datasets except for WSD, where this assumption is roughly appropriate since all the annotators have a very high proficiency (above 95%).

5.3 Annotators' Characterization

Another core task models of annotation are employed for is to characterize the accuracy and bias of the annotators.

We first assess the correlation between the estimated and the gold accuracy of the annotators. The results, presented in Table 6, follow the same pattern as those obtained in Section 5.1: a better performance of the unpooled models (D&S and MACE10) and of the partially pooled models (LOGRNDEFF and HIERD&S, except for PD, as discussed later in Section 6.1).

10 The results of our reimplementation match the published ones (Hovy et al., 2013).

Model       WSD    RTE    TEMP   PD*
MULTINOM    -0.75  -5.93  -5.84  -4.67
D&S         -1.19  -4.98  -2.61  -2.99
HIERD&S     -0.63  -4.71  -2.62  -3.02
ITEMDIFF    -0.75  -5.97  -5.84  -
LOGRNDEFF   -0.59  -4.79  -2.63  -
MACE        -0.70  -4.86  -2.65  -3.52

Table 5: The log predictive density results, normalized to a per-item rate (i.e., lpd/I). Larger values indicate better predictive performance. PD* is a subset of PD such that each annotator has at least as many annotations as the number of folds.

The results are intuitive: a model that is accurate with respect to the gold standard should also obtain a high correlation at the annotator level.

The PD corpus also comes with a list of self-confessed spammers and one of good, established players (see Table 7 for a few details). Continuing with the correlation analysis, an inspection of the second-to-last column of Table 6 shows largely accurate results for the list of spammers. However, on the second category, i.e., the non-spammers (the last column), we see large differences between models, following the same pattern as the previous correlation results. An inspection of the spammers' annotations shows an almost exclusive use of the DN (discourse new) class, which is highly prevalent in PD and easy for the models to infer; the non-spammers, on the other hand, make use of all the classes, making it more difficult to capture their behavior.11

11 In a typical coreference corpus, over 60% of mentions are DN; thus always choosing DN results in a good accuracy level. The one-class preference is a common spamming behavior (Hovy et al., 2013; Passonneau and Carpenter, 2014).


Model       WSD   RTE   TEMP  PD    S     NS
MAJVOTE     0.90  0.78  0.91  0.77  0.98  0.65
MULTINOM    0.90  0.84  0.93  0.75  0.97  0.84
D&S         0.90  0.89  0.92  0.88  1.00  0.99
HIERD&S     0.90  0.90  0.92  0.76  1.00  0.91
ITEMDIFF    0.80  0.84  0.93  -     -     -
LOGRNDEFF   0.80  0.89  0.92  -     -     -
MACE        0.90  0.90  0.92  0.86  1.00  0.98

Table 6: Correlation between gold and estimated accuracy of annotators. The last two columns refer to the lists of known spammers (S) and non-spammers (NS) in PD.

Type           Size  Gold accuracy quantiles
Spammers       7     0.42  0.55  0.74
Non-spammers   19    0.59  0.89  0.94

Table 7: Statistics on player types. Reported quantiles are 2.5%, 50%, and 97.5%.

We further examine some useful parameter estimates for each player type. We chose one spammer and one non-spammer and discuss the confusion matrix inferred by D&S, together with the credibility and spamming preference given by MACE. The two annotators were chosen to be representative of their type. The selection of the models was guided by their two different approaches to capturing the behavior of the annotators.

Table 8 presents the estimates for the annotator selected from the list of spammers. Again, inspection of the confusion matrix shows that, irrespective of the true class, the spammer almost always produces the DN label. The MACE estimates are similar, allocating 0 credibility to this annotator, and full spamming preference for the DN class.

In Table 9 we show the estimates for the annotator chosen from the non-spammers list. Their response matrix indicates an overall good performance (see the matrix diagonal), albeit with a confusion of PR (property) for DN (discourse new), which is not surprising given that indefinite NPs (e.g., a policeman) are the most common type of mention in both classes. MACE allocates large credibility to this annotator and shows a similar spamming preference for the DN class.

The discussion above, as well as the quantiles presented in Table 7, show that poor accuracy is not by itself a good indicator of spamming. A spammer like the one discussed in this section can get good performance by always choosing a class with high frequency in the gold standard.

D&S confusion matrix βj (rows: true class; columns: produced label)
      NR    DN    PR    DO
NR    0.03  0.92  0.03  0.03
DN    0.00  1.00  0.00  0.00
PR    0.01  0.98  0.01  0.01
DO    0.00  1.00  0.00  0.00

MACE spamming preference εj
      NR    DN    PR    DO
      0.00  0.99  0.00  0.00
MACE credibility θj: 0.00

Table 8: Spammer analysis example: D&S provides a confusion matrix; MACE shows the spamming preference and the credibility.

D&S confusion matrix βj (rows: true class; columns: produced label)
      NR    DN    PR    DO
NR    0.79  0.07  0.07  0.07
DN    0.00  0.96  0.01  0.02
PR    0.03  0.21  0.72  0.04
DO    0.00  0.06  0.00  0.94

MACE spamming preference εj
      NR    DN    PR    DO
      0.09  0.52  0.17  0.22
MACE credibility θj: 0.92

Table 9: A non-spammer analysis example: D&S provides a confusion matrix; MACE shows the spamming preference and the credibility.

At the same time, a non-spammer may fail to recognize some true classes correctly, but be very good on others. Bayesian models of annotation allow capturing and exploiting these observations. For a model like D&S, such a spammer presents no harm, as their contribution towards any potential true class of the item is the same and therefore cancels out.12

5.4 Filtering using Model Confidence

This section assesses the ability of the models to correctly diagnose the items for which potentially incorrect labels have been inferred. Concretely, we identify the items the models are least confident in (measured using the entropy of the posterior distribution over the true class) and present the accuracy trends as we vary the proportion of filtered-out items.

12 This point was also made by Passonneau and Carpenter (2014).

Figure 7: Effect of filtering on RTE: accuracy (y-axis) vs. proportion of data with lowest entropy (x-axis).

Figure 8: TEMP dataset: accuracy (y-axis) vs. proportion of data with lowest entropy (x-axis).

Overall, the trends (Figures 7, 8, and 9) indicate that filtering out the items with low confidence improves the accuracy of all the models, across all datasets.13

6 Discussion

We found significant differences across a number of dimensions, both between the annotation models themselves and between the models and MAJVOTE.

6.1 Observations and Guidelines

The completely pooled model (MULTINOM) underperforms in almost all types of evaluation and on all datasets. Its weakness derives from its core assumption: it is rarely appropriate in crowdsourcing to assume that all annotators have the same ability.

The unpooled models (D&S and MACE) assume each annotator has their own response parameters. These models can capture the accuracy and bias of annotators, and perform well in all evaluations against the gold standard. Lower performance is, however, obtained on posterior predictions: the higher complexity of unpooled models results in overfitting, which affects their predictive performance.

13 The trends for MACE match the published ones. Also, we left out the analysis on the WSD dataset, as the models already obtain 99% accuracy there without any filtering (see Section 5.1).

Figure 9: PD dataset: accuracy (y-axis) vs. proportion of data with lowest entropy (x-axis).

The partially pooled models (ITEMDIFF, HIERD&S, and LOGRNDEFF) assume both individual and hierarchical structure (capturing population behaviour). These models achieve the best of both worlds, letting the data determine the level of pooling that is required: they asymptote to the unpooled models if there is a lot of variance among the individuals in the population, or to the fully pooled models when the variance is very low. This flexibility ensures good performance both in the evaluations against the gold standard and in terms of predictive performance.

Across the different types of pooling, the models which assume some form of annotator structure (D&S, MACE, LOGRNDEFF, and HIERD&S) came out on top in all evaluations. The unpooled models (D&S and MACE) register on-par performance with the partially pooled ones (LOGRNDEFF and HIERD&S, except for the PD dataset, as discussed later in this section) in the evaluations against the gold standard but, as previously mentioned, can overfit, affecting their predictive performance. Ignoring any annotator structure (the pooled MULTINOM model, the partially pooled ITEMDIFF model, or the MAJVOTE baseline) generally leads to poor results.

The approach we took in this paper is domain independent, i.e., we did not assess and compare models that use features extracted from the data, even though it is known that when such features are available, they are likely to help (Raykar et al., 2010; Felt et al., 2015a; Kamar et al., 2015).


This is because a proper assessment of such models would also require a careful selection of the features and of how to include them in a model of annotation. A bad (i.e., misspecified in the statistical sense) domain model is going to hurt more than help, as it will bias the other estimates. Providing guidelines for this feature-based analysis would have excessively expanded the scope of this paper. But feature-based models of annotation are extensions of the standard annotation-only models; thus, this paper can serve as a foundation for the development of such models. A few examples of feature-based extensions of standard models of annotation are given in the Related Work section to guide readers who may want to try them out for their specific task or domain.

The domain-independent approach we took in this paper further implies that there are no differences between applying these models to corpus annotation or to other crowdsourcing tasks. This paper is focused on resource creation and does not propose to investigate the performance of the models in downstream tasks. However, previous work has already employed such models of annotation for NLP (Plank et al., 2014a; Sabou et al., 2014; Habernal and Gurevych, 2016), image labeling (Smyth et al., 1995; Kamar et al., 2015), or medical tasks (Albert and Dodd, 2004; Raykar et al., 2010).

While HIERD&S normally achieves the best performance in all evaluations on the Snow et al. (2008) datasets, on the PD data it is outperformed by the unpooled models (MACE and D&S). To understand this discrepancy, it should be noted that the datasets from Snow et al. (2008) were produced using Amazon Mechanical Turk, mainly by highly skilled annotators, whereas the PD dataset was produced in a game-with-a-purpose setting, where most of the annotations were made by only a handful of coders of high quality, the rest being produced by a large number of annotators with much lower abilities. These observations point to a single population of annotators in the former datasets, and to two groups in the latter case. The reason why the unpooled models (MACE and D&S) outperform the partially pooled HIERD&S model on the PD data is that this class of models assumes no population structure (hence there is no hierarchical influence); a multi-modal hierarchical prior in HIERD&S might be better suited for the PD data. This further suggests that results depend to some extent on the specifics of the dataset, although this does not alter the general guidelines made in this study.

6.2 Technical Notes

Posterior curvature. In hierarchical models, a complicated posterior curvature increases the difficulty of the sampling process, affecting convergence. This may happen when the data is sparse or when there are large inter-group variances. One way to overcome this problem is to use a non-centered parameterization (Betancourt and Girolami, 2015). This approach separates the local parameters from their parents, easing the sampling process. This often improves the effective sample size and, ultimately, the convergence (i.e., lower R̂). The non-centered parameterization offers an alternative but equivalent implementation of a model. We found this essential to ensure a robust implementation of the partially pooled models.

Label switching. The label switching problem that occurs in mixture models is due to the likelihood's invariance under permutation of the labels. This makes the models nonidentifiable, and convergence cannot be directly assessed, since the chains will no longer overlap. We use a general solution to this problem from Gelman et al. (2013): re-label the parameters, post-inference, based on a permutation that minimizes some loss function. For this survey, we used a small random sample of the gold data (e.g., five items per class) to find the permutation which maximizes model accuracy for every chain fit. We then relabeled the parameters of each chain according to the chain-specific permutation before combining them for convergence assessment. This ensures model identifiability and gold alignment (a sketch of this relabeling step is given below).
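A sketch of the chain-wise relabeling step described above; the function and variable names are illustrative, and the brute-force search over permutations is only feasible because K is small in these datasets.

```python
import numpy as np
from itertools import permutations

def best_permutation(chain_classes, gold_sample, K):
    """Resolve label switching for one chain: find the permutation of the K
    class labels that maximizes accuracy on a small gold sample (e.g., five
    items per class). `chain_classes[i]` is the chain's inferred class for
    item i; `gold_sample` maps a few item indices to their gold classes."""
    best_perm, best_acc = None, -1.0
    for perm in permutations(range(K)):   # brute force; fine for small K
        acc = np.mean([perm[chain_classes[i]] == c for i, c in gold_sample.items()])
        if acc > best_acc:
            best_perm, best_acc = perm, acc
    return best_perm  # then reorder the chain's class-indexed parameters with it
```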

7 Related Work

Bayesian models of annotation share many characteristics with so-called item-response and ideal-point models. A popular application of these models is to analyze data associated with individuals and test items. A classic example is the Rasch model (Rasch, 1993), which assumes that the probability of a person being correct on a test item is based on a subtractive relationship between their ability and the difficulty of the item. The model takes a supervised approach to jointly estimating the ability of the individuals and the difficulty of the test items, based on the correctness of their responses. The models of annotation we discussed in this paper are completely unsupervised and infer, in addition to annotator ability and/or item difficulty, the correct labels. More details on item-response models are given in (Skrondal and Rabe-Hesketh, 2004; Gelman and Hill, 2007). Item-response theory has also recently been applied to NLP tasks (Lalor et al., 2016; Martínez-Plumed et al., 2016; Lalor et al., 2017).

The models considered so far take into account only the annotations. There is work, however, which further exploits the features that can accompany the items. A popular example is the model introduced by Raykar et al. (2010), where the true class of an item is made to depend both on the annotations and on a logistic regression model, which are jointly fit; essentially, the logistic regression replaces the simple categorical model of prevalence. Felt et al. (2014, 2015b) introduced similar models which also model the predictors (features) and compared them to other approaches (Felt et al., 2015a). Kamar et al. (2015) account for task-specific feature effects on the annotations.

In Section 6.2, we discussed the label switching problem (Stephens, 2000) that many models of annotation suffer from. Other solutions proposed in the literature include utilizing class-informative priors, imposing ordering constraints (obvious for univariate parameters; less so in multivariate cases) (Gelman et al., 2013), or applying different post-inference relabeling techniques (Felt et al., 2014).

8 Conclusions

This study aims to promote the use of Bayesian models of annotation by the NLP community. These models offer substantial advantages over both agreement statistics (used to judge coding standards) and majority-voting aggregation to generate gold standards (even when used with heuristic censoring or adjudication). To provide assistance in this direction, we compare six existing models of annotation with distinct prior and likelihood structures (e.g., pooled, unpooled, and partially pooled) and a diverse set of effects (annotator ability, item difficulty, or a subtractive relationship between the two). We use various evaluation settings on four datasets, with different levels of sparsity and annotator accuracy, and report significant differences both among the models and between the models and majority voting. As importantly, we provide guidelines to aid users in the selection of the models and to raise awareness of the technical aspects essential to their implementation. We release all the models evaluated here as Stan implementations at http://dali.eecs.qmul.ac.uk/papers/supplementary_material.zip.

Acknowledgments

Paun, Chamberlain, and Poesio are supported by the DALI project, funded by ERC. Carpenter is partly supported by the U.S. National Science Foundation and the U.S. Office of Naval Research.

References

Paul S. Albert and Lori E. Dodd. 2004. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60(2):427–435.

Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.

James O. Berger. 2013. Statistical Decision Theory and Bayesian Analysis. Springer.

José M. Bernardo and Adrian F. M. Smith. 2001. Bayesian Theory. IOP Publishing.

Michael Betancourt and Mark Girolami. 2015. Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications, 79:30.

Bob Carpenter. 2008. Multilevel Bayesian models of categorical data annotation. Unpublished manuscript.

Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A. Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32.

Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2016. Phrase Detectives corpus 1.0: Crowdsourced anaphoric coreference. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2016), Portoroz, Slovenia.

Alexander Philip Dawid and Allan M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28.

Bradley Efron. 2012. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, volume 1. Cambridge University Press.

Alvan R. Feinstein and Domenic V. Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543–549.

Paul Felt, Kevin Black, Eric Ringger, Kevin Seppi, and Robbie Haertel. 2015a. Early gains matter: A case for preferring generative over discriminative crowdsourcing models. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Paul Felt, Robbie Haertel, Eric K. Ringger, and Kevin D. Seppi. 2014. MOMRESP: A Bayesian model for multi-annotator document labeling. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik.

Paul Felt, Eric K. Ringger, Jordan Boyd-Graber, and Kevin Seppi. 2015b. Making the most of crowdsourced document annotations: Confused supervised LDA. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 194–203.

Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis.

Andrew Gelman and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge University Press.

Andrew Gelman and Donald B. Rubin. 1992. Inference from iterative simulation using multiple sequences. Statistical Science, 7:457–472.

Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E. Raftery. 2007. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268.

Ivan Habernal and Iryna Gurevych. 2016. What makes a convincing argument? Empirical analysis and detecting attributes of convincingness in Web argumentation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1214–1223.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1120–1130.

Ece Kamar, Ashish Kapoor, and Eric Horvitz. 2015. Identifying and accounting for task-dependent bias in crowdsourcing. In Third AAAI Conference on Human Computation and Crowdsourcing.

Hyun-Chul Kim and Zoubin Ghahramani. 2012. Bayesian classifier combination. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, pages 619–627, La Palma, Canary Islands.

John Lalor, Hao Wu, and Hong Yu. 2016. Building an evaluation scale using item response theory. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 648–657. Association for Computational Linguistics.

John P. Lalor, Hao Wu, and Hong Yu. 2017. Improving machine learning ability with fine-tuning. CoRR, abs/1702.08563. Version 1.

Matthew Lease and Gabriella Kazai. 2011. Overview of the TREC 2011 crowdsourcing track. In Proceedings of the Text Retrieval Conference (TREC).

Fernando Martínez-Plumed, Ricardo B. C. Prudêncio, Adolfo Martínez-Usó, and José Hernández-Orallo. 2016. Making sense of item response theory in machine learning. In Proceedings of the 22nd European Conference on Artificial Intelligence (ECAI), Frontiers in Artificial Intelligence and Applications, volume 285, pages 1140–1148.

Thiago G. Martins, Daniel Simpson, Finn Lindgren, and Håvard Rue. 2013. Bayesian computing with INLA: New features. Computational Statistics & Data Analysis, 67:68–83.

Pablo G. Moreno, Antonio Artés-Rodríguez, Yee Whye Teh, and Fernando Perez-Cruz. 2015. Bayesian nonparametric crowdsourcing. Journal of Machine Learning Research.

Rebecca J. Passonneau and Bob Carpenter. 2014. The benefits of a model of annotation. Transactions of the Association for Computational Linguistics, 2:311–326.

Juho Piironen and Aki Vehtari. 2017. Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711–735.

Barbara Plank, Dirk Hovy, Ryan McDonald, and Anders Søgaard. 2014a. Adapting taggers to Twitter with not-so-distant supervision. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1783–1792.

Barbara Plank, Dirk Hovy, and Anders Søgaard. 2014b. Linguistically debatable or just plain wrong? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Massimo Poesio and Ron Artstein. 2005. The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation, pages 76–83.

Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Lam Ngoc Tran, and Karl Aberer. 2013. An evaluation of aggregation techniques in crowdsourcing. In Web Information Systems Engineering – WISE 2013, pages 1–15, Berlin, Heidelberg. Springer Berlin Heidelberg.

Sophia Rabe-Hesketh and Anders Skrondal. 2008. Generalized linear mixed-effects models. Longitudinal Data Analysis, pages 79–106.

Georg Rasch. 1993. Probabilistic Models for Some Intelligence and Attainment Tests. ERIC.

Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. 2010. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322.

Marta Sabou, Kalina Bontcheva, Leon Derczynski, and Arno Scharl. 2014. Corpus annotation through crowdsourcing: Towards best practice guidelines. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 859–866.

Aashish Sheshadri and Matthew Lease. 2013. SQUARE: A benchmark for research on computing crowd consensus. In Proceedings of the 1st AAAI Conference on Human Computation (HCOMP), pages 156–164.

Edwin Simpson, Stephen Roberts, Ioannis Psorakis, and Arfon Smith. 2013. Dynamic Bayesian Combination of Multiple Imperfect Classifiers. Springer Berlin Heidelberg, Berlin, Heidelberg.

Anders Skrondal and Sophia Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Chapman & Hall/CRC Interdisciplinary Statistics. Taylor & Francis.

Mark D. Smucker, James Allan, and Ben Carterette. 2007. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '07, pages 623–632, New York, NY, USA. ACM.

Padhraic Smyth, Usama M. Fayyad, Michael C. Burl, Pietro Perona, and Pierre Baldi. 1995. Inferring ground truth from subjective labelling of Venus images. In Advances in Neural Information Processing Systems, pages 1085–1092.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263.

Matthew Stephens. 2000. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):795–809.

Chris Van Pelt and Alex Sorokin. 2012. Designing a scalable crowdsourcing platform. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 765–766. ACM.

Aki Vehtari, Andrew Gelman, and Jonah Gabry. 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413–1432.

Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R. Movellan, and Paul L. Ruvolo. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043. Curran Associates, Inc.

W. John Wilbur. 1994. Non-parametric significance tests of retrieval performance comparisons. Journal of Information Science, 20(4):270–284.