

From Rankings to Ratings: Rank Scoring via Active Learning*

Jack O’Neill1, Sarah Jane Delany2, and Brian Mac Namee3

1,2 Dublin Institute of Technology, Ireland [email protected], [email protected]

3 University College Dublin, Ireland [email protected]

Abstract. In this paper we present RaScAL, an active learning approach to predicting real-valued scores for items given access to an oracle and knowledge of the overall item-ranking. In an experiment on six different datasets, we find that RaScAL consistently outperforms the state of the art. The RaScAL algorithm represents one step within a proposed overall system of preference elicitation of scores via pairwise comparisons.

1 Introduction

Supervised machine learning for regression problems, in which models are trained to learn the relationship between descriptive features and some continuous-valued response, is an important sub-field of machine learning. Data rating is the process of asking human subjects (raters) to provide real-valued labels (ratings or scores) for input data (artifacts); these labels are essential to training predictive models. Machine learning models for regression problems are typically trained on datasets using labels elicited via data rating. This scenario is particularly common in the domain of recommender systems, where the artifacts being rated are, for example, items from an online store, or films; and the labels provided are scores on a continuous scale, often [1 .. 10] or [1 .. 5]. Data rating is not confined to the domain of recommender systems, however, and has also been used to train models to detect valence and activation of emotions in speech [10], and to make medical diagnoses [5], among other applications.

When compiling a labelled dataset for training a machine learning model for a regression problem, researchers typically face two major difficulties: acquiring sufficient labels for the task at hand (data scarcity), and ensuring the quality of labels supplied (avoidance of noisy data). The former is particularly problematic in the area of recommender systems, where models are usually employed to evaluate very large product sets, which in turn require a large number of labels to train an accurate model [17]. The latter is problematic in any scenario in which the reliability of raters is not guaranteed, as incorrect labels have the capacity to reduce the accuracy of any predictive model they are used to train.

* This work will be published as part of the book “Emerging Topics in Semantic Technologies. ISWC 2018 Satellite Events”. E. Demidova, A.J. Zaveri, E. Simperl (Eds.), ISBN: 978-3-89838-736-1, 2018, AKA Verlag Berlin.



Data rating is traditionally optimised by addressing the problem from either of two angles. Active learning for data rating overcomes issues of data scarcity by identifying items whose labels are likely to contribute most to improving the performance of the model. By issuing queries only for these most important items, it can make the most of a labelling budget and train accurate models using fewer labels. Elahi et al. [7] have compiled a comprehensive survey of techniques falling into this category. The problem of noisy data is typically addressed using rater reliability estimation [19]. Rater reliability estimation, as its name suggests, seeks to identify reliable raters whose scores are more likely to be accurate. By directing queries only to reliable raters, it minimises the error in its labels, which in turn improves the overall accuracy of any model trained on this data.

While the techniques described above improve data rating by optimising either who is asked for labels or the choice of items whose labels are requested, we aim to deliver further performance gains by improving the way we ask the question. This study forms part of a wider investigation into the viability of a data rating system which, instead of asking labellers to provide scores for individual items in isolation, requests pair-wise comparisons between items. These comparisons can then be used to build an overall ranking among items. By employing active learning techniques, we can learn to map these rankings to item scores. Figure 1 depicts a high-level overview of the process. This paper focuses on Step 3 in Figure 1: taking an overall ranking among items, and using active learning techniques to efficiently query for these items' scores.

In a previous study [13], we showed that items which are ranked comparatively show a higher inter-rater reliability than items which are rated individually. In order to complete the rating process, however, we need to infer concrete scores from the overall item-ranking, which remains a non-trivial problem. In this paper we present Rank Scoring via Active Learning (RaScAL), a system which combines isotonic regression modelling with active learning techniques to infer a set of scores given access to an oracle and an overall item-ranking.

The rest of this paper is structured as follows. Section 2 discusses related work in the field of active learning for data rating. Section 3 describes the RaScAL algorithm in detail. Section 4 outlines the datasets and describes the methods and evaluation metrics used in our experiment. Section 5 reports the results of the experiment, while Section 6 discusses conclusions and considers possible directions of future research which will build on these findings.

2 Related Work

Active learning techniques can be divided broadly into two sub-fields based on the type of labels sought: active learning for classification, in which labels take the form of a class identifier, and active learning for regression, which deals with questions having real-valued (numeric) labels. There has been a wide range of studies dealing with the former (see Settles [15] for a comprehensive treatment of the recent state of the art), but research on the latter is less common.


Fig. 1: Research Context. We begin by issuing queries for pairwise comparisons, and use this information to build up an overall item-ranking, which is then converted into a set of scores. This paper deals with Step 3 of the workflow: using active learning to predict scores given an overall ranking among items.

Active learning for regression has its roots in the statistical field of Optimal Experimental Design [8]; in recent years, however, there has been increasing interest from researchers in the area of machine learning [12,18,4].

The idea that subjective judgements are prone to systematic rater biases first gained widespread acceptance through the work of the psychologists Tversky and Kahneman [20]. This idea has had consequences for any field of research in which judgement-based data is collected, and has been applied to sound quality evaluation [22], crowdsourcing [6], and collaborative filtering [1], among others.

The study described in this paper explores the possibility of using active learning techniques to efficiently infer a set of scores given an overall ranking among items and access to an oracle which can provide a score for any item on request. To the best of our knowledge, this particular problem has not previously been addressed in any great detail in the literature. However, the broader scenario of preference elicitation via rankings, rather than scores, is not new. Raykar et al. [14] were motivated, among other reasons, by the realisation that “in many scenarios, it is more natural to obtain training data for pair-wise preference relations rather than the actual labels for individual examples” to discard raw scores in datasets originally used for regression, and instead train a model to learn the ranking function over items.

The transformation of rankings to ratings has been successfully employed in an industry setting. Bockhorst et al. [3], reporting on their experience implementing a model to predict customer satisfaction scores, realised that self-reported scores showed high variability.


They recognised that a more accurate model could be trained by collecting rankings from customers, rather than scores, and then transforming those rankings to real-valued scores using an isotonic regression. The work described in this paper extends the work of Bockhorst et al. by adding active learning to the rank transformation process, which, we hypothesize, can significantly increase the learning rate.

3 RaScAL

The RaScAL algorithm is an active learning approach to predicting real-valued scores for items given an overall item-ranking. Previously, we have shown that data collected via pairwise comparisons is more reliable than that collected through queries for absolute item scores. Pairwise comparisons allow us to build an overall ranking among items but do not allow us to infer scores. RaScAL enables us to bridge this gap. By issuing a small number of queries for absolute scores for selected items, we can make predictions for the remaining items, assigning scores in such a way that the rank ordering is preserved.

For example, consider three films, F1, F2 and F3, with corresponding scores Sc1, Sc2 and Sc3, where the rank order of the scores is known, i.e. Sc1 ≤ Sc2 ≤ Sc3. After issuing queries for the scores of F1 and F3, imagine we get Sc1 = 3 and Sc3 = 5. We then know that the score for F2 will be between 3 and 5, inclusive. Technically, we achieve this by using the queried points to fit an isotonic regression [2], which we then use to predict the scores of the remaining items.
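To make this step concrete, the following is a minimal sketch in Python using scikit-learn's IsotonicRegression; the choice of library is ours, as the paper does not prescribe an implementation. Rank positions serve as the inputs and the two queried scores as the targets.

    from sklearn.isotonic import IsotonicRegression

    # Queried items: F1 (rank 1) scored 3, F3 (rank 3) scored 5.
    model = IsotonicRegression()  # monotonically increasing by default
    model.fit([1, 3], [3.0, 5.0])

    # Predict the unqueried film F2 (rank 2). Linear interpolation between
    # the queried points gives 4.0, which respects Sc1 <= Sc2 <= Sc3.
    print(model.predict([2]))  # [4.]

With only two labelled points the fitted function is a straight line between them; as more points are queried, the fit becomes piecewise linear while remaining monotonic.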

RaScAL differs from previous research in how it selects items to be queried. When faced with the problem of transforming rankings to a set of scores, Bockhorst et al. fit an isotonic regression model using training examples (scores) sampled uniformly from the set of labels [3]. For example, given 101 items and a labelling budget of 11 queries, the uniform sampling approach would query the first item and every tenth item thereafter. This approach works well when the distances between scores are relatively uniform; however, if these scores are not uniformly distributed, this method is prone to underfit the data. RaScAL improves the robustness of this approach by selecting queries so as to minimise the expected error of the predicted scores. This is best illustrated with an example.
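For reference, the uniform baseline's query positions in that example can be computed with a one-liner (a sketch; the use of numpy is our choice):

    import numpy as np

    # 101 items at rank positions 0..100, with a budget of 11 queries:
    # the first item and every tenth item thereafter.
    query_indices = np.linspace(0, 100, num=11, dtype=int)
    # -> array([  0,  10,  20,  30,  40,  50,  60,  70,  80,  90, 100])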

Figure 2 (a) describes an artificial dataset which we will use to illustrate the RaScAL query selection strategy. This data is visualised in Figure 2 (b). We assume that the ranks of each item in the dataset are known before the labelling process begins, and the aim of the exercise is to accurately predict the scores of each item (which would be unknown) using as few queries as possible. We refer to each item in the dataset as Ix, where x is the rank position of the item in question.

The first 3 queries in the RaScAL algorithm are always the same. We begin by establishing the upper and lower limits of the scores by querying for the lowest and highest ranked items, and then query for the item in the middle of the ranking^4.


Rank  Score        Rank  Score
1     1.20         10    4.90
2     1.70         11    5.50
3     1.80         12    7.10
4     1.80         13    7.80
5     2.10         14    8.50
6     2.20         15    9.10
7     2.40         16    9.70
8     2.60         17    10.00
9     2.80

Fig. 2: RaScAL illustrative example. (a) Example data values. (b) Example data visualised as a scatter plot. (c) Calculating the next query for the RaScAL algorithm; the grey boxes represent the maximum possible error. (d) Step-by-step illustration of the RaScAL query selection strategy.

In this case we issue a query for {I1, I9, I17}. This query splits the data into two sequences of items. The first sequence, S1, consists of the items I2 ... I8. Assuming that the oracle returns perfect scores, the potential scores for items in this sequence are bounded by the values of I1 and I9, meaning all scores fall within the range [1.2 .. 2.8]. The second sequence, S2, consists of the items I9 ... I17. The potential scores for items in this sequence are bounded by the values of I9 and I17, meaning all scores for this sequence fall within the range [2.8 .. 10]. By performing a simple linear interpolation between the labelled points, we fit an isotonic regression model which can then be used to predict the scores of the remaining items. The results of the first iteration of the RaScAL algorithm for this example are shown in Figure 2 (c). The red dots represent queried items, while the red line joining these dots depicts the fitted isotonic regression function which allows us to make predictions for each of the remaining items.

^4 When there is an even number of items in the set, it is not possible to split it into two equally sized subsets, and the 'middle' rank must be rounded either up or down. Given that we have no prior knowledge of the distribution of scores, there is no theoretical reason for preferring one over the other. In this study, however, we chose to round down when confronted with this problem.


The grey boxes show the bounds within which all remaining labels must fall. These bounds can be used to select the next items for which to query the oracle, as they also bound the error of scores inferred using the isotonic regression.

If we predicted the value of 1.2 for each of I2 to I8, and each of the items I2 to I8 in fact had a score of 2.8, we could expect a maximum error of (2.8 − 1.2) for each of the unrated items. This expected error is approximated by the area of the grey box in Figure 2 (c). However, the isotonic regression diagonally bisects this box, reducing the maximum error by half. In the worst-case scenario, the total error for S1 (assuming all items have a score of 1.2, or all items have a score of 2.8) is 1/2 × (2.8 − 1.2) × (9 − 1 − 1) = 5.6. In the worst-case scenario, the total error for S2 (assuming all items have a score of 2.8, or all items have a score of 10) is 1/2 × (10 − 2.8) × (17 − 9 − 1) = 25.2. As S2 has a greater potential error than S1, we next query the oracle for the score of the item in the middle of S2.

The calculation of the maximum expected error can be formalised as:

E = ((Rank_j − Rank_i − 1) × (Y_j − Y_i)) / 2    (1)

where E is the maximum expected error for an unlabelled segment, Rank_i and Rank_j are the rank positions of the items bounding the segment, and Y_i and Y_j are the labelled scores for these items. The numerator represents the area of the bounding box between labelled items i and j, corresponding to the shaded grey areas in Figure 2 (c). The isotonic regression bisects this rectangle, effectively halving the maximum expected error for the segment.
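As a sanity check, Equation 1 can be expressed as a small helper that reproduces the worked values for S1 and S2 above (a sketch; the function name is ours):

    def max_error(rank_i, rank_j, y_i, y_j):
        # Equation 1: maximum expected error for the unlabelled segment
        # bounded by items i (lower) and j (upper).
        return (rank_j - rank_i - 1) * (y_j - y_i) / 2

    print(max_error(1, 9, 1.2, 2.8))    # S1 -> 5.6
    print(max_error(9, 17, 2.8, 10.0))  # S2 -> 25.2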

After querying the oracle for the score of the middle item in S2 and bisecting this segment, we are left with three segments. We repeat the process, finding the segment with the maximum possible error and querying the item which bisects that segment, until no labelling budget remains. Algorithm 1 formalises the description of the RaScAL process. Figure 2 (d) shows the sequence of queries which would be made on the example data described above with a labelling budget of 8 queries. Figure 3 compares the isotonic function fitted by uniform sampling with that fitted by RaScAL, given a labelling budget of 4 queries. It is evident from these graphs that the RaScAL approach fits the data more closely.

4 Experimental Framework

In order to investigate the performance of RaScAL in accurately mapping rankings to scores, we performed an experiment using both synthetically generated and real-world datasets. In Section 4.1 we describe the datasets used for this experiment, while Section 4.2 discusses the experimental evaluation process.

4.1 Datasets

The MovieLens 100k dataset [11] consists of 100,000 scores (on a scale of [1 .. 5]) for over 1,500 films. We aggregated all scores on a per-item basis, using the mean over all raters as the item's final score^5. Individual scores were provided in integer format; however, after averaging scores, many items ended up with a real-valued label.


Fig. 3: Comparison of the uniform sampling and RaScAL approaches to query selection.

Dataset          # Items  Scale        Mean    IQR     Min     Max
MovieLens        1,682    [1 .. 5]     3.076   0.782   1       5
Jester           100      [−10 .. 10]  0.824   2.239   −3.834  3.665
Boredom Videos   125      [1 .. 10]    6.484   0.694   4.903   7.750
Book Crossing    675      [1 .. 10]    7.401   3.000   1       10
Bi-Modal         100      [0 .. 100]   49.670  48.821  0.077   96.850
Multi-Modal      100      [0 .. 100]   42.149  68.974  1.927   99.164

Table 1: High-level distributional features of the datasets used in this evaluation.


The Jester dataset, originally provided by Ken Goldberg [9], is a dataset of scores for jokes provided by users on a [−10 .. 10] scale. Unlike many collaborative filtering datasets, in which users provide scores on an integer scale, the Jester dataset collected scores using a slider, allowing users to provide real-valued scores. Overall item scores were calculated as the mean score across all users.

The Boredom Videos corpus was gathered by Soleymani et al. [16] using the crowdsourcing platform Amazon Mechanical Turk^6. Respondents were asked to rate videos on a scale of [1 .. 10] based on how boring they found them to be. As with the Jester dataset, overall item scores were calculated as the mean score across all users.

The Book Crossing dataset was collected by Ziegler et al. [21] from the Book Crossing online community. It contains a mixture of implicit and explicit scores. An implicit score indicates that a book has been read, while explicit scores are provided as a value on an integer scale of [1 .. 10]. For this experiment, we only used explicit scores.

^5 A superior algorithm which takes a probabilistic approach to aggregating scores has been proposed by Raykar et al. [14]. Although this approach has been shown to more accurately approximate the true rating, this aspect of rating elicitation is outside the scope of the current work, as it would add unnecessary complexity to the experiment.

^6 https://www.mturk.com


Algorithm 1 The RaScAL algorithm

struct Segment:
    property first
    property last
    property max_error

    Segment(first, last, scores):
        max_error ← max_error(first, last, scores[first], scores[last])   ▷ see Equation 1

Input:
    Items   ▷ Rank-ordered list of items
    N       ▷ The number of items in Items
Output:
    Fitted isotonic regression   ▷ A model which can predict scores for all items
Require:
    isotonic, a function which fits an isotonic regression given a set of items and
        corresponding scores
    sort, a function which sorts the segment vector by maximum expected error
    query, a function which requests a score for a given item from an oracle

 1: Scores ← [ ]
 2: Scores[0] ← query(Items[0])
 3: Scores[N − 1] ← query(Items[N − 1])
 4: new_seg ← Segment(0, N − 1, Scores)
 5: segments ← [new_seg]
 6: repeat
 7:     segments.sort()   ▷ Sort segments on max_error, descending
 8:     s ← segments.pop()
 9:     mid_point ← s.first + (s.last − s.first)/2
10:     score ← query(Items[mid_point])
11:     Scores[mid_point] ← score
12:     seg_low ← Segment(s.first, mid_point, Scores)
13:     seg_high ← Segment(mid_point, s.last, Scores)
14:     segments.append([seg_low, seg_high])
15: until label budget exhausted OR all items scored
16: return isotonic(Items, Scores)


Fig. 4: Visual representation of the label values from the real-world datasets. (a) MovieLens. (b) Jester. (c) Boredom Videos. (d) Book Crossing.

We selected only the first 675 books, giving us 684 explicit scores in total. This dataset was unique in our experiment in that most items were rated by only one user. As scores were provided on an integer scale, this resulted in a large number of ties among items, as can be seen in Figure 4.

In addition to the real-world datasets described above, we created two artificial datasets using random sampling from known distributions. The Bi-Modal dataset consists of 100 scores in total: 50 scores were drawn from a normal distribution with mean 25 and standard deviation 10, with the remaining scores drawn from a normal distribution with mean 75 and standard deviation 10. The Multi-Modal dataset also consists of 100 scores, though these scores are drawn from three uniform distributions: 50 scores from a uniform distribution with range [1 .. 20], 15 scores from a uniform distribution with range [21 .. 70], and the remaining 35 scores drawn from a uniform distribution with range [71 .. 100].
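Both artificial datasets can be regenerated, up to the choice of random seed (which the paper does not report), with a sketch such as:

    import numpy as np

    rng = np.random.default_rng(42)  # seed is our assumption

    # Bi-Modal: two normal components of 50 scores each.
    bi_modal = np.concatenate([rng.normal(25, 10, 50),
                               rng.normal(75, 10, 50)])

    # Multi-Modal: three uniform components of 50, 15 and 35 scores.
    multi_modal = np.concatenate([rng.uniform(1, 20, 50),
                                  rng.uniform(21, 70, 15),
                                  rng.uniform(71, 100, 35)])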

In their original formats, the target variable of each dataset is a numeric score. For each dataset we convert these scores to ranks, and these ranks are then used as the input data for the RaScAL algorithm. Where ties were encountered, distinct ranks were assigned based on the order in which the tied items occurred in the dataset. This means that an item with a rank value of 2 and an item with a rank value of 3 may have the same actual score.
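This tie-breaking rule corresponds to a stable sort over the scores, as in the following sketch (the function name is ours):

    import numpy as np

    def scores_to_ranks(scores):
        # Distinct 1-based ranks; ties broken by order of occurrence
        # in the dataset (stable sort).
        order = np.argsort(scores, kind="stable")
        ranks = np.empty(len(scores), dtype=int)
        ranks[order] = np.arange(1, len(scores) + 1)
        return ranks

    scores_to_ranks([3.0, 1.0, 2.0, 2.0])  # -> array([4, 1, 2, 3])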

Table 1 summarises the high-level distributional features of each of the datasets used, after the pre-processing described above was carried out. The distributions of scores for the Bi-Modal and Multi-Modal datasets are visualised in Figure 5. The distributions of labels for each of the real-world datasets are visualised in Figure 4.


Fig. 5: Distribution of scores for the Bi-Modal and Multi-Modal artificial datasets.

4.2 Evaluation Metrics

We compare the RaScAL algorithm, described in Section 3, to the uniform sampling approach employed by Bockhorst et al. [3]^7. We begin the evaluation by allowing three queries (the minimum required for the RaScAL algorithm). The evaluation then proceeds in single-instance batches, with each batch affording one additional query to the query budget of each algorithm. After each batch is complete, the returned scores are used to fit an isotonic regression, which in turn is used to predict the scores of the remaining unlabelled items. We calculate the Root Mean Squared Error (RMSE) after each stage is complete and use the results to construct a learning curve plotting the RMSE of each algorithm's predictions against the number of labels requested. We use the trapezoidal rule^8 to approximate the Area Under the Learning Curve (AULC), which serves as an overall indicator of the learning rate, and hence the accuracy, of each approach. A lower AULC indicates a faster learning rate, and hence a more efficient algorithm.
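The paper computes the AULC with the pracma R package (footnote 8); an equivalent sketch in Python, using hypothetical learning-curve values, would be:

    import numpy as np

    # RMSE after each additional label (hypothetical values).
    num_labels = np.array([3, 4, 5, 6, 7, 8])
    rmse = np.array([1.9, 1.4, 1.0, 0.7, 0.5, 0.4])

    aulc = np.trapz(rmse, x=num_labels)  # trapezoidal rule; lower is better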

5 Results

Table 2 shows the AULC for RaScAL and uniform sampling on each of the datasets under examination. The RaScAL algorithm consistently outperformed the uniform sampling baseline on all datasets.

^7 Source code available at https://github.com/joneill87/RaScAL
^8 pracma: https://cran.r-project.org/web/packages/pracma/pracma.pdf


The improvement is particularly pronounced on the Book Crossing and MovieLens datasets, where the scarcity of raters and the integer-valued scores led to a significant number of ties.

Dataset          RaScAL   Uniform Sampling
MovieLens         5.00        13.63
Jester            4.87         7.96
Boredom Videos    2.10         3.88
Book Crossing    10.05        55.40
Bi-Modal         69.15        98.23
Multi-Modal      56.76        91.58

Table 2: AULC for the RaScAL and Uniform Sampling algorithms on all datasets.

Figure 6 depicts the learning curves for the RaScAL and baseline algorithms on each of the datasets under investigation. Figure 4 (d) shows that the Book Crossing dataset consists of densely packed scores, with far fewer distinct scores than items. It is clear from Figure 6 (d) that the RaScAL algorithm quickly identifies the key points in the ranking where the item scores increase, and reduces the error to 0 using only 10% of the labels (64 labels in total).

The RaScAL algorithm also outperforms the uniform sampling baseline in cases where there are few ties but the distribution of scores is not uniform. Figure 6 (f) shows the learning curve decreasing rapidly and consistently for the RaScAL algorithm on the Multi-Modal dataset. This is due to its ability to issue more queries in areas of greater uncertainty.

The less uniform the distribution, the greater the improvement in learning rate. In the case of a perfectly uniform distribution of scores, the behaviour of RaScAL will exactly mimic the behaviour of the uniform sampling algorithm, so this approach should be preferred to uniform sampling in all cases.

6 Conclusions and Future Work

In this paper we presented RaScAL, an active learning approach to transforming rankings into scores. By simulating a data rating exercise, we demonstrated that RaScAL performs at least as well as, and often better than, a non-active-learning baseline.

The task of recovering latent scores given a total rank ordering is one step within an overall system of preference elicitation of scores via pairwise comparisons. In the current study we have assumed knowledge of the overall ranking among items. In the final system, this ranking will need to be constructed by efficiently selecting pairs of items for comparison and using rank aggregation to combine multiple two-item rankings into an overall ranking. We anticipate that this will be achieved using a variant of the Bradley-Terry model.


Fig. 6: Learning curves for each dataset used in this experiment. (a) MovieLens. (b) Jester. (c) Boredom Videos. (d) Book Crossing. (e) Bi-Modal. (f) Multi-Modal.

Once this system has undergone end-to-end validation, we aim to verify our findings using actual labellers labelling real data in a crowd-sourced environment. We hypothesize that if we elicit labels using pairwise comparisons as opposed to direct scores, the increased reliability of the resulting data will allow us to train more effective models using fewer labels.

References

1. Adomavicius, G., Bockstedt, J.C., Curley, S.P., Zhang, J.: Do recommender systems manipulate consumer preferences? A study of anchoring effects. Information Systems Research 24(4), 956–975 (2013)

2. Barlow, R.E., Brunk, H.D.: The isotonic regression problem and its dual. Journal of the American Statistical Association 67(337), 140–147 (1972)

3. Bockhorst, J., Yu, S., Polania, L., Fung, G.: Predicting self-reported customer satisfaction of interactions with a corporate call center. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10536 LNAI, 179–190 (2017)

4. Cai, W., Zhang, M., Zhang, Y.: Batch mode active learning for regression with expected model change. IEEE Transactions on Neural Networks and Learning Systems 28(7), 1668–1681 (2017)

5. Cholleti, S.R., Don, S., Goldman, S.A., Politte, D.G.: Veritas: Combining expert opinions without labeled data. International Journal on Artificial Intelligence Tools 18(05), 633–651 (2009)

6. Eickhoff, C.: Cognitive biases in crowdsourcing. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. pp. 162–170. ACM (2018)

7. Elahi, M., Ricci, F., Rubens, N.: A survey of active learning in collaborative filtering recommender systems. Computer Science Review 20, 29–50 (2016)

8. Fedorov, V.V.: Theory of Optimal Experiments. Elsevier (1972)

9. Goldberg, K., Roeder, T., Gupta, D., Perkins, C.: Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval 4(2), 133–151 (2001)

10. Grimm, M., Kroschel, K., Narayanan, S.: The Vera am Mittag German audio-visual emotional speech database. In: 2008 IEEE International Conference on Multimedia and Expo, ICME 2008 - Proceedings. pp. 865–868 (2008)

11. Harper, F.M., Konstan, J.A.: The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5(4), 19 (2016)

12. Nie, R., Wiens, D.P., Zhai, Z.: Minimax robust active learning for approximately specified regression models. Canadian Journal of Statistics 46(1), 104–122 (2018)

13. O'Neill, J., Delany, S.J., Mac Namee, B.: Rating by ranking: An improved scale for judgement-based labels. In: 4th Joint Workshop on Interfaces and Human Decision Making for Recommender Systems (IntRS) 2017. p. 24 (2017)

14. Raykar, V.C., Duraiswami, R., Krishnapuram, B.: A fast algorithm for learning a ranking function from large scale data sets. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(7), 1158–1170 (2008)

15. Settles, B.: Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6(1), 1–114 (2012)

16. Soleymani, M., Larson, M.: Crowdsourcing for affective annotation of video: Development of a viewer-reported boredom corpus. In: Proceedings of the ACM SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010). pp. 4–8 (2010)

17. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Advances in Artificial Intelligence 2009, 1–19 (2009)

18. Sugiyama, M., Nakajima, S.: Pool-based active learning in approximate linear regression. Machine Learning 75(3), 249–274 (2009)

19. Tarasov, A.: Dynamic Estimation of Rater Reliability using Multi-Armed Bandits. Doctoral thesis, Dublin Institute of Technology (2014)

20. Tversky, A., Kahneman, D.: Judgment under uncertainty: Heuristics and biases. Science 185(4157), 1124–1131 (1974)

21. Ziegler, C.N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: Proceedings of the 14th International Conference on World Wide Web (WWW '05). pp. 22–32 (2005)

22. Zielinski, S., Rumsey, F., Bech, S.: On some biases encountered in modern audio quality listening tests - a review. Journal of the Audio Engineering Society 56(6), 427–451 (2008)