
Supervised and Active Learning for Recommender Systems

by

Laurent Charlin

A thesis submitted in conformity with the requirements

for the degree of Doctor of Philosophy

Graduate Department of Computer Science

University of Toronto

© Copyright 2014 by Laurent Charlin


Abstract

Supervised and Active Learning for Recommender Systems

Laurent Charlin

Doctor of Philosophy

Graduate Department of Computer Science

University of Toronto

2014

Traditional approaches to recommender systems have often focused on the collaborative filtering problem:

using users’ past preferences in order to predict their future preferences. Although essential, rating

prediction is only one of the components of a successful recommender system. One important problem

is how to translate predicted ratings into actual recommendations. Furthermore, considering additional

information either about users or items may offer substantial gains in performance while allowing the

system to provide good recommendations to new users.

We develop machine learning methods in response to some of the limitations of current recommender-systems research. Specifically, we propose a three-stage framework to model recommender systems.

We first propose an elicitation step which serves as a way to collect user information beneficial to the recommendation task. In this thesis we frame the elicitation process as one of active learning. We develop several active elicitation methods which, unlike previous approaches that focus exclusively on improving the learning model, directly aim at improving the recommendation objective.

The second stage of our framework uses the elicited user information to inform models that predict user-item preferences. We focus on user-preference prediction for a document recommendation problem, for which we introduce a novel graphical model over the space of user side-information, item (document) contents, and user-item preferences. Our model is able to smoothly trade off its use of side information against its use of user-item preferences to make good document recommendations in both cold-start and non-cold-start data regimes.

The final step of our framework consists of the recommendation procedure. In particular, we focus on

a matching instantiation and explore different natural matching objectives and constraints for the paper-to-reviewer matching problem. Further, we explore and analyze the synergy between the recommendation

objective and the learning objective.

In all stages of our work we experimentally validate our models on a variety of datasets from different domains. Of particular interest are several datasets containing reviewer preferences about papers

submitted to conferences. These datasets were collected using the Toronto Paper Matching System, a

system we built to help conference organizers in the task of matching reviewers to submitted papers.


Acknowledgements

I am most indebted to my supervisors, Richard Zemel and Craig Boutilier. Without their advice, their continued encouragement, and their help and support, this work would not have been possible. I am glad we made this co-supervision work.

I am especially grateful to Rich, who, as the NIPS'10 program chair, provided the initial motivation and momentum behind this thesis. Furthermore, Rich's ideas, presence, and experience were decisive in our joint creation of the Toronto paper matching system. Throughout these projects I found a great mentor, and I have been privileged to work closely with Rich. Rich has taught me a lot about how to pick and approach research problems as well as about how to model them.

I am also very grateful to have been able to work with Craig. I have learned a lot from Craig. His

vision, his curiosity and his scientific rigour are qualities that I strive for. Craig’s ideas were also the

ones that initially helped foster this research and his insights and ideas throughout have provided great

balance to my work. Our interactions through COGS have further widened my research interests.

I would also like to thank my first mentor, Pascal Poupart, who showed me how exciting research

could be and gave me some of the tools to succeed at it.

My thanks also go to the members of my thesis committee, Sheila McIlraith and Geoffrey Hinton, for

their precise comments and questions throughout my PhD. Geoff’s enthusiasm and presence in the lab

were also very motivating to me. I would also like to thank my external advisor, Andrew McCallum, for his apt comments regarding my work and for pointing out important next steps of great benefit. Finally, I am thankful to Ruslan Salakhutdinov and Anna Goldenberg for

reading and commenting on the final copy of my thesis.

The constant support and love of Anne were also essential in undertaking and successfully finishing this PhD. Her reassuring words have helped me on many occasions. I am especially thankful for her ideas and her outlook on life, which she selflessly shares with me and from which I have learned so much.

Further, I want to dedicate this thesis to Viviane, the next big project in our lives.

Although their involvement was more indirect, I learned a lot from the postdocs who spent time in Toronto; specifically, I want to thank Iain, Ryan, Marc'Aurelio and, of course, Hugo, who has become a good friend and collaborator.

Finally, the machine learning group at Toronto was an extremely stimulating and pleasant place to

work at thanks to collaborators, close colleagues and friends: Kevin R., Jasper, Danny, Ilya, Kevin S.,

Jen, Fernando, Eric, Maks, Darius, Bowen, Charlie, Tijmen, Graham, John, Vlad, Deep, Nitish, George,

Andriy, Tyler, Justin, Chris, Niail, Phil, and Genevieve. Special thanks to Kevin R., Jasper, Danny and

Ilya for many interesting discussions about everything throughout our graduate years.


Contents

1 Introduction
1.1 Recommender Systems
1.1.1 Constrained Recommender Systems
1.2 Contributions
1.3 Outline

2 Background
2.1 Preliminaries and Conventions
2.1.1 Learning
2.2 Preference Modelling and Predictions
2.2.1 Collaborative Filtering
2.2.2 CF for Recommendations
2.3 Active Preference Collection
2.3.1 Uncertainty Sampling
2.3.2 Query by Committee
2.3.3 Expected Model Change
2.3.4 Expected Error Reduction
2.3.5 Batch Queries
2.3.6 Stopping Criteria
2.4 Matching

3 Paper-to-Reviewer Matching
3.1 Paper Matching System
3.1.1 Overview of the System Framework
3.1.2 Active Expertise Elicitation
3.1.3 Software Architecture
3.2 Learning and Testing the Model
3.2.1 Initial Score Models
3.2.2 Supervised Score-Prediction Models
3.2.3 Evaluation
3.3 Related Work
3.3.1 Expertise Retrieval and Modelling
3.4 Other Possible Applications
3.5 Conclusion and Future Opportunities

4 Collaborative Filtering with Textual Side-Information
4.1 Side Information in Collaborative Filtering
4.2 Problem Definition
4.3 Background
4.3.1 Variational Inference in Topic Models
4.4 Collaborative Score Topic Model (CSTM)
4.4.1 The Relationship Between CSTM and Standard Models
4.4.2 Learning and Inference
4.5 Related Work
4.6 Experiments
4.6.1 Datasets
4.6.2 Competing Models
4.6.3 Results
4.7 Conclusion and Future Opportunities

5 Learning and Matching in the Constrained Recommendation Framework
5.1 Learning and Recommendations
5.2 Matching Instantiation
5.2.1 Matching Objectives
5.3 Related Work on Matching Expert Users to Items
5.4 Empirical Results
5.4.1 Data
5.4.2 Suitability Prediction Experimental Methodology
5.4.3 Match Quality
5.4.4 Transformed Matching and Learning
5.5 Conclusion and Future Opportunities

6 Task-Directed Active Learning
6.1 Related Work
6.2 Active Learning for Match-Constrained Recommendation Problems
6.2.1 Probabilistic Matching
6.2.2 Matching as Inference in an Undirected Graphical Model
6.3 Active Querying for Matching
6.4 Experiments
6.4.1 Data Sets
6.4.2 Experimental Procedures
6.4.3 Results
6.5 Conclusion and Future Opportunities

7 Conclusion
7.1 Summary
7.2 Future Research Directions

Bibliography


List of Tables

3.1 Comparing the top ranks of word-LM and topic-LM
4.1 Modelling capabilities of the different score prediction models
4.2 Comparisons between CSTM and competitors for cold-start users
4.3 Test performance of CSTM and competitors on the unmodified ICML-12 dataset
4.4 Comparisons between CSTM and two variations
5.1 Overview of the matching/evaluation process
5.2 Comparison of the matching objective versus within-reviewer variance


List of Figures

1.1 Constrained recommendation framework
2.1 Graphical model representation of PMF and BPMF
2.2 Graphical model representation of a mixture model for collaborative filtering
3.1 A conference's typical workflow
3.2 High-level software architecture of the system
3.3 Histograms of score values for NIPS-10 and ICML-12
3.4 Comparison of top word-LM scores with top topic-LM scores
4.1 Preference prediction in the constrained recommendation framework
4.2 Graphical model representation of collaborative filtering with side information models
4.3 Graphical model representations of LDA and CTM
4.4 Graphical model representation of CSTM
4.5 Graphical model representation of CTR
4.6 Score histograms for NIPS-10, ICML-12, and Kobo
4.7 Test performance comparing CSTM and competitors across datasets
4.8 Learned parameters for NIPS-10 and ICML-12
4.9 Test performance on unmodified ICML-12 dataset
4.10 Test performance comparing CSTM and CTR on new users
5.1 Constrained recommendation framework
5.2 Match-constrained recommendation framework
5.3 Score histograms for NIPS-10 and NIPS-09
5.4 Performance on the matching task on the NIPS-10 dataset
5.5 Histogram of assignments by score value for the NIPS-10 dataset
5.6 Comparison of score assignment distributions
5.7 Histograms of number of papers matched per reviewer under soft constraints
5.8 Performance on the transformed matching objective on NIPS-10
6.1 Elicitation in the match-constrained recommendation framework
6.2 Comparison of matching methods on a “toy” example
6.3 Histograms of score values for Jokes and Dating datasets
6.4 Matching performance comparing different active learning strategies
6.5 Usage frequency of the fall-back query strategy of our active learning methods
6.6 Additional matching experiments comparing different active learning strategies


Chapter 1

Introduction

Easy access to large digital storage has enabled us to record data of interest, from documents of great

scientific importance to videos of cats and hippos. The ease with which digital content can now be

shared renders this content nearly-instantly accessible to interested (online) users worldwide. Without

some form of organization, for example through search capabilities, these data would rapidly lose the

interest of users overwhelmed by this deluge of data. Hence, the tasks of organizing and analyzing

immense quantities of data are of critical importance. On-line search engines, by enabling their users

to search on-line records for relevant items, have been the original tools of choice for finding relevant

data. Although a good search engine can distinguish relevant from irrelevant items given a specific query

(for example, what movies are playing this weekend?), search engines lack the ability to determine the

level of user interest within the set of relevant documents. In the last few years, recommender systems

have become indispensable. Recommender systems enable their users to filter items based on individual

preferences (for example, which weekend movies would I enjoy?). Recommender systems add so much

value to search engines that major search engines now include a recommendation engine to personalize

their results. For similar reasons, recommender systems have become an important topic of academic

and corporate research.

1.1 Recommender Systems

Recommender systems are software systems that can recommend items, for example, scientific papers

or movies, of interest to their users, for example, scientists or movie lovers. Although specifics will differ

across real-world applications, recommender systems aim to help users in decision-making tasks [88].

Hence, recommender systems are useful in domains where it is infeasible or impractical for users to

experience or provide input on every available item. For example, a user may require a recommender

system to help in deciding which movie to attend at a film festival.

To fulfil its duties, a recommender system must correctly model information about its users and

items. A good recommender system should represent user preferences, but also should use other user

information which may affect that user's immediate preferences, such as a user's current state of mind

or his or her decision mechanism. In this respect, a recommender system behaves similarly to a user’s

friend who suggests an item of interest. Furthermore, a good recommender system should also analyze

information about items which will be useful in determining user interests. A good recommender system


for a user should therefore combine the insights of that user’s friends with the knowledge of a domain

expert [87].

Goldberg et al. [40] introduced the first practical recommender system at the beginning of the
1990s. Incidentally, the same authors also coined the term collaborative filtering to describe a system

which uses the collective knowledge of all its users to more accurately infer the preferences of individual

users. From that point, interest in both academia and the corporate world grew rapidly. Resnick

and Varian [87] formalized some of the design parameters in recommender systems. These parameters

included the type of user feedback (user evaluations of items), the various sources of preferences, and

the aggregation of user feedback, as well as how to communicate recommendations to the users. The

authors noted that “one of the richest areas for exploration is how to aggregate evaluations”. The future

proved them right because a plethora of work has since been carried out on this topic [88]. Furthermore,

this central question also constitutes the core of this thesis.

Creating a recommender system is a task which spans several research areas. As pointed out by Ricci

et al. [88] in their recommender systems handbook: “Development of recommender systems is a multidisciplinary effort which involves experts from various fields such as artificial intelligence, human-computer interaction, information technology, data mining, statistics, adaptive user interfaces, decision-support systems, marketing, and consumer behaviour”. This thesis considers the determination of user-item preferences, using information about users and items, as the central task faced by recommender systems.

Machine learning and specifically prediction models and techniques offer a natural way of representing

available information for predicting user-item preferences. Hence, machine learning models constitute

the central component of recommender systems. Other aspects of recommender systems, for example,

the interface between the system and its users, are auxiliary components which are built around the

prediction mechanisms.

1.1.1 Constrained Recommender Systems

User preferences represent the main thrust behind a recommender system’s suggestions. Therefore,

optimizing the prediction of preferences is the natural goal of a recommender system. However the

system’s designer may have additional objectives that he wishes to optimize, as well as other constraints

that he may be required to satisfy. For example, an on-line movie store may want to consider its

stock of each movie to ensure that only available movies are recommended. Depending on the exact

constraints, such recommendations may be similar to or very different from the initial, constraint-free

recommendations. The same on-line retailer may also wish to maximize some other function such as

user happiness, or perhaps more realistically, his or her own long-term profit. Such objectives lead to

a trade-off between an individual’s preferences, the preferences of other users, and the objectives of

the designers. Another common objective is that a recommender system may be inclined to take into

account the diversity of its suggestions [78] (for example, as a way to hedge its bets against non-optimal

recommendations). A recommender system may also use its recommendations as a means to further

refine its user model, for example, by selecting a subset of recommended items using an “exploration

strategy”. In this context, it may be useful for the system to recommend items that will be most useful

in refining its user model, somewhat independently of a system’s (ultimate) objective to provide the best

possible recommendations. The study of constrained recommender systems, and specifically the design

of learning methods which are tailored to the interaction between the preference prediction objective

and the final objective, is a primary focus of this thesis.


1.2 Contributions

The aim of this thesis is to develop machine learning methods tailored to recommender systems. Furthermore, we are interested in how such machine learning methods may lead to better user and item representations and, therefore, ultimately lead to improved recommendations. We propose to model the recommendation problem using a three-stage process which is elaborated in this thesis:

Preference Collection: When initially engaging with a recommender system, a new user must provide

information related to his preferences. This information will be used by the recommender system

to build a user model which paves the way to personalized recommendations. The preference

collection phase represents an opportunity for the system to elicit user information actively. We

frame the problem of selecting what to ask a user as an active learning problem. Such elicitation,

by asking informative queries which quickly lead to a good user model, has the potential to mitigate

the cold-start problem, which refers to the difficulty a system has in learning good models of users with few known preferences, such as new users. We propose several methods for performing

active learning of user preferences over items for (match-constrained) recommender systems. Our

main contribution, which leads to the empirical success we demonstrate, is our novel active learning methods: they are sensitive to the matching objective, which is the ultimate objective of match-constrained recommender systems.

Predict missing preferences: The central task of a recommender system is to use elicited user infor-

mation to predict user-item preferences. This means that the system must use available information

to learn a predictive model of user-item preferences. Here, we propose several models to predict

these missing user preferences. We focus on developing models which, in addition to user-item

preferences, also model and leverage side-information, that is features of users or items aside from

user preferences. We demonstrate empirically that this side-information can be beneficial in both

cold-start and non-cold-start data regimes.

Preferences to recommendations: Using predicted preferences to suggest items to users is the goal

of the recommender system. Going from preferences to suggestions can be as simple as selecting

a subset of the most preferred items or providing users with a preference-ordered list of items.

However, it can also involve solving a more complex optimization problem such as that imposed by a

constrained recommender system. This stage therefore involves using user-item preferences, either

user-elicited or predicted, as inputs to a recommendation procedure, for example a combinatorial

optimization problem, which considers the final objective and constraints and determines the

recommendations.

In this thesis, we introduce and instantiate match-constrained recommender systems which involve

matching users to items under constraints which globally restrict the set of possible matches. We

explore several matching objectives and constraints and show, for certain objectives, a synergy

between the final matching objective and the preference-prediction objective (the learning loss).

A flow chart depicting the three stages described above is shown in Figure 1.1. Although the first two

stages directly benefit from machine-learning techniques, the third, which arises from practical considerations of recommender systems, provides an ultimate objective to be used to guide the development of

appropriate machine-learning techniques. In other words, there is a synergy between the first two stages

and the last stage. Specifically, the first stage's objective is to collect useful information from users.

Figure 1.1: Flow chart depicting the different components of our research. The first part represents elicitation, followed by missing-preference prediction and finally the recommendation procedure, F(). The recommendation procedure represents the ultimate objective of the system (for example, ranking, matching, per-user diversity, social-welfare maximization).

The usefulness of the information will be determined by the recommendation objective. The active learning strategies can therefore be guided by the recommendation objective. Furthermore, when learning models for missing-preference prediction, a model sensitive to the recommendation objective may focus on

correctly predicting preferences that are more likely to be part of the recommender system’s suggestions.

The exact form of inter-stage interactions will be detailed in the appropriate chapters.

1.3 Outline

We will begin in Chapter 2 by defining the problem more formally, explaining some of the foundational

principles behind our work and reviewing relevant previous research in the areas of preference modelling

and prediction, active learning, and matching.

Chapter 3 introduces the paper-to-reviewer matching problem as a practical problem that will be

used to motivate and illustrate the various contributions throughout this thesis. This chapter also

introduces and describes in detail the on-line software system that we have implemented and released to

help conference organizers assign submitted papers to reviewers. Furthermore, this platform serves as a

testbed for the developments in other chapters.

Chapter 4 describes our work on the problem of preference prediction with textual side-information,

which is the first research contribution of this thesis. We introduce a novel graphical model for personal

recommendations of textual documents. The model’s chief novelty lies in its learned model of individual

libraries, or sets of documents, associated with each user. Overall, our model is a joint directed probabilistic model of user-item scores (ratings), and the textual side information in the user libraries and

the items. Creating a generative description of scores and the text allows our model to perform well

in a wide variety of data regimes, smoothly combining the side information with observed ratings as

the number of ratings available for a given user ranges from none to many. We compare the model’s

performance on preference prediction with a variety of other models, including two methods used in our

paper-to-reviewer matching system. Overall, our method compares favourably to the competing models, especially in cold-start data settings where user side-information is essential. We further show the benefits of modelling user side-information in an application for personal recommendations of posters to view at a conference.


In Chapter 5, we formally introduce our framework for optimizing constrained recommender systems

and instantiate a match-constrained recommender system from it. We frame the matching or assignment

problem as an integer program and propose several variations tailored to the paper-to-reviewer matching

domain. Experiments on two data sets of recent conferences examine the performance of several learning

methods as well as the effectiveness of the matching formulations. We show that we can obtain high-quality matches using our proposed framework. Finally, we explore how preference prediction and

matching interact. Experimentally we show that matching can benefit from interacting with the learning

objective when matching utility is a non-linear function of user preferences.

Active learning methods to optimize match-constrained recommender systems are proposed in Chapter 6. Specifically, we develop several new active learning strategies that are sensitive to the specific matching objective. Further, we introduce a novel method for determining probabilistic matchings that accounts for the uncertainty of predicted preferences. This is important because active learning strategies are often guided by, or make use of, model uncertainty (for example, a common strategy is to pick queries that will most reduce the model's uncertainty). Experiments with real-world data sets spanning diverse domains compare our proposed methods to standard techniques based on the quality of the resulting matches as a function of the number of elicited user preferences. We demonstrate

that match-sensitive active learning leads to higher-quality matches more quickly compared to standard

active learning techniques.

Finally, we provide concluding remarks and discuss opportunities for future work in Chapter 7. We

also provide a brief discussion of the main future issues to be addressed by the field of recommender

systems as a whole.


Chapter 2

Background

In this chapter, we review the literature pertaining to the three stages of recommendation systems

studied in this thesis. We begin by introducing the notation that will be used throughout this thesis and

reviewing some of the foundational concepts of machine learning. Then we review some of the previous

research in preference prediction, focusing especially on side-information based and collaborative filtering

methods. We also introduce the concept of active learning as a way to collect user preferences and review

some of the relevant literature. Finally, we introduce and discuss the matching literature as an interesting

example of a recommendation objective.

Note that throughout this chapter, we survey some of the work that relates to this thesis as a whole.

Work specifically related to individual chapters will be surveyed independently as part of the relevant

chapter.

2.1 Preliminaries and Conventions

In our study of recommendation systems, we take user-item preferences to be the atomic and quintessential pieces of information used to represent a user's interest in a particular item. Our convention is to use the term preferences to denote a user's interest in an item, but we will also use it to denote a user's expertise with respect to an item. A user's expertise denotes his competence, or qualifications, with respect to a particular item and as such may differ from his preferences (for example, conference organizers

may wish to assign submitted papers to the most expert reviewers regardless of the reviewer’s actual

interest in the paper). This distinction will usually be clear from context, although we will typically

use the more specialized term score to denote expertise and rating to denote interest. Note that all

models and methods developed in this thesis can readily deal with ratings or scores. We only consider

cases where users explicitly express their preferences for items individually. For example, we do not

consider preferences that could be expressed by providing a ranking of a group of items. We assume that

user-item preferences are expressed using numeric (typically integer) values. Furthermore, we equate

higher ratings (or scores) with a stronger preference (that is, users would prefer an item rated 5 over

an item rated 1). Interestingly, the semantic meaning of the preferences is usually not specified in data

sets, and therefore the value of a preference is usually taken to represent its utility. As a result, most of

the work we survey considers the rating-prediction problem as a metric regression problem rather than

an ordinal regression problem, and we have followed their lead.


We denote an individual user as $r$, the set of all users as $\mathcal{R}$, and the number of users as $R$ (that is, $R \equiv |\mathcal{R}|$). Similarly, individual items are denoted as $p$, the set of all items as $\mathcal{P}$, and the number of items as $P$. In our work, we will use the term users to designate the set of entities to which recommendations will be provided, regardless of whether they represent a person, a group of people, or other entities of interest. Similarly, items are entities to be recommended, regardless of their actual representation. Typical items in this thesis will be documents, jokes, and other humans. We denote the preference of user $r$ toward item $p$ as $s_{rp}$ (a score). It will also be useful to think of a preference as an $(r, p, s)$-triplet. In general, we denote matrices using uppercase letters (for example, $U$, $V$), vectors using bold-font lower-case letters (for example, $\mathbf{a}$, $\boldsymbol{\gamma}$), and scalars using lower-case letters (for example, $a$, $b$). Unless this introduces ambiguity, we also denote the size of sets using uppercase letters (for example, $N$ and $R$).

2.1.1 Learning

The central machine-learning task in this thesis is to model user interest in items to predict user-item

preferences. We will assume that a subset of user-item preferences, possibly with side information, is

always available to the system. Ways of obtaining such information will be discussed in Section 2.3. Our

goal is therefore to predict the preferences of users for items that they have not yet rated. In other words,

our goal is to use observed user-item preferences to learn a model that can predict missing user-item

preferences. This prediction problem is framed as a supervised machine-learning problem. Accordingly,

our aim is to minimize, for a suitable error or loss function, the expected loss of our preference prediction

model. The expectation is taken with respect to a fixed but unknown data-generating distribution,

where the generating distribution is a joint distribution over preferences and user-item pairs. Because

in practice the true expected loss cannot be evaluated, it is customary to evaluate and report instead

the empirical loss [124]. To evaluate the performance of a model, we will then use user-item preferences

collected from users, otherwise known as ground-truth preferences. The empirical loss is the result of evaluating the loss function on the predicted preferences against the ground-truth preferences.

For the purposes of empirical comparison, the available ground-truth preferences are divided into

two disjoint sets: the training set, which the machine learning model is allowed to leverage, and the

test set, which the method is not given access to and, therefore, which can be used to evaluate model

performance. In addition, it can be useful to reserve additional data from the training set to constitute

a validation set. The validation set can be used to evaluate model performance during the training

stage. In particular, validation sets are often used to determine the value of hyper-parameters, which are

parameters given as input to a learning model, and to prevent overfitting. A model is said to overfit if

its training loss is much smaller than its test loss.
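To make this protocol concrete, the following is a minimal sketch (with hypothetical synthetic triplets and a trivial baseline predictor, not the models developed in this thesis) of splitting observed preferences into disjoint training, validation, and test sets and reporting the empirical loss, here root-mean-squared error, on held-out ground truth:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (user, item, score) triplets; real data would come from observed ratings.
triplets = np.array([(r, p, rng.integers(1, 6))
                     for r in range(50)
                     for p in rng.choice(100, size=5, replace=False)])

# Disjoint train / validation / test split of the observed preferences.
rng.shuffle(triplets)
n_train, n_valid = int(0.7 * len(triplets)), int(0.85 * len(triplets))
train, valid, test = triplets[:n_train], triplets[n_train:n_valid], triplets[n_valid:]

def empirical_rmse(predict, triplets):
    """Empirical loss: root-mean-squared error between predicted and ground-truth scores."""
    return np.sqrt(np.mean([(predict(r, p) - s) ** 2 for r, p, s in triplets]))

# A trivial baseline predictor: the global mean of the training scores.
global_mean = train[:, 2].mean()
print(empirical_rmse(lambda r, p: global_mean, test))
```

The validation set would be used in the same way to choose hyper-parameters before the test set is touched.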

2.2 Preference Modelling and Predictions

In the context of recommendation systems, two broad classes of preference prediction systems have been

explored in the machine-learning literature: content-based systems and collaborative filtering systems.

Both types of systems make use of user preferences for items. Content-based systems assume access to

the content of the items and model content features, features derived from the content. As an example,

in a book domain, the content of items includes the words of the books, and the corresponding word counts

would be content features. On the other hand, collaborative filtering leverages only the preference

similarities between users and/or items [40]. Instead of content-based we prefer the more general term


side information, which denotes all user, item, or user-item features except the user-item preferences

themselves.

2.2.1 Collaborative Filtering

A simple example of collaborative filtering (CF) can be described as follows: imagine two users that

have similar ratings for certain items. CF makes the assumption that these similarities extend across

items rated by only one of the two users. Therefore, to learn a user's model, CF methods try to glean

preference information from similar users (that is, users whose preferences for items rated in common

are similar). Using this natural idea, researchers have built a wealth of models that have performed very

well on several real-life data sets [72, 62]. These methods usually exploit CF by enabling the learning

models to share parameters either across users or across items, or both.

We view CF as having certain advantages over side-information based methods:

1. CF leverages the very natural and powerful idea that a user’s preferences can be predicted using

the preferences of similar users. Using other users’ preferences is especially attractive in today’s

connected world where it has become easier to collect preferences for large numbers of users.

2. Because CF uses user preferences only, it is largely domain-independent and can also be used to

model user preferences across different item domains. On the contrary, side-information-based

systems leverage the similarities contained in the features of users or items. Selecting features that

are discriminative of user preferences may not be straightforward in certain domains. For example,

in the movie-domain, meta-data, such as a movie’s genre, has not typically led to significant

performance gains (see for example [66]).

CF also has certain limitations:

1. Because it leverages only the information contained in user preferences, CF is problematic when

the preferences of a specific user or for a specific item are not available, a problem commonly

known as the cold-start problem because it generally applies when users or items first enter the

system. Generally, CF is effective in domains where collections of preferences from users for items

are accessible, in particular when each user can rate multiple items and each item is rated by

multiple users. For example, typical domains in which CF has been successful relate to everyday

decisions, such as movie or restaurant recommendations, which entail easy access to large numbers

of users and their preferences. By design, CF would typically perform poorly in domains where

obtaining many preferences from each user or for each item is difficult. Example domains include

recommending houses, cars, or universities to which a high-school student should apply. In such

domains, CF may still be useful once side information is available [19].

2. The fact that CF does not use domain-specific information also means that it does not make use

of potentially useful domain information when it is available.

We find that the domain-independence of CF methods, as well as the apparent difficulty of leveraging

side information in systems where many preferences are available, and their performance in practice [7]

tip the balance in favour of using CF methods for recommendation systems.

Hybrid systems that combine the advantages of both CF and side-information techniques offer interesting opportunities. Overall, the field of side-information-CF models has only begun to be explored by


researchers. Because Chapter 4 will present several novel hybrid approaches, we will defer our discussion

of previous research in this field until then.

Original Collaborative Filtering Models

The term collaborative filtering was first coined by researchers who used this technique to help users

filter their emails by leveraging their colleagues’ preferences [40]. This was soon followed by similar work

using a neighbourhood, or memory-based, method aimed at filtering articles from netnews [86].

In model-free, or neighbourhood, approaches to CF, a prediction for a user-item pair is the weighted

combination of the ratings given to that item by the user’s neighbours. Alternatively, one could use the

weighted combination of the ratings given by that user to neighbouring items. The weight of each neighbour is typically taken to be some similarity measure between the two users. Breese et al. [20] compare

different similarity measures such as cosine similarity and the Pearson correlation coefficient, as well as

several variations of these. In their experiments, they show that modifications to the latter yield the

best results.
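The following is a minimal sketch of such a neighbourhood method, using the common mean-centred variant in which a prediction is the user's mean rating plus the similarity-weighted deviations of the neighbours who rated the item; the toy matrix, the choice of Pearson correlation, and the fall-back rule are illustrative assumptions rather than a prescribed implementation:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two users, computed over co-rated items only."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return 0.0
    x, y = a[both] - a[both].mean(), b[both] - b[both].mean()
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0

def predict(S, r, p):
    """User r's mean rating plus the similarity-weighted deviations of the
    neighbours who rated item p (mean-centred neighbourhood prediction)."""
    weights, deviations = [], []
    for q in range(S.shape[0]):
        if q != r and not np.isnan(S[q, p]):
            w = pearson(S[r], S[q])
            weights.append(w)
            deviations.append(w * (S[q, p] - np.nanmean(S[q])))
    norm = np.abs(weights).sum()
    if norm == 0:
        return np.nanmean(S[r])   # no informative neighbours: fall back to the user's mean
    return np.nanmean(S[r]) + np.sum(deviations) / norm

# Toy score matrix: rows are users, columns are items, np.nan marks missing scores.
S = np.array([[5, 4, np.nan, 1],
              [4, 5, 2, np.nan],
              [1, np.nan, 5, 4]], dtype=float)
print(predict(S, 0, 2))          # predicted score of user 0 for item 2
```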

The other main class of CF models is called model-based CF [20]. In this approach, parameters of

a (probabilistic) model are learned using known ratings, and the model is then used to predict missing

ratings.

In terms of ratings prediction performance, the neighbourhood approaches are typically not as good

as the best model-based approaches on real-world CF data sets. However, the fact that they are very

computationally efficient renders them attractive for on-line recommendation applications (for example,

those that need to update their recommendations according to newly acquired ratings in real time).

Model-based approaches, in addition to providing stronger performance, are generally more principled.

Among model-based approaches, there has been a push toward using probabilistic models (see the next

two sections). Probabilistic models generally offer a wealth of advantages, such as uncertainty modelling,

higher robustness to noise in the data, and a more natural way to encode additional information such

as prior information available to the system or other features of items and users. By using probabilistic

models, it is also possible to benefit from recent developments in these models, including advances in

learning and inference techniques.

In this document, we will focus mostly on model-based approaches, which have been favoured in

most recent machine-learning work.

Matrix Factorization Models

User-item-score triplets of the form $(r, p, s)$ can be seen as the entries of a matrix, where the pair $(r, p)$ represents an index of the matrix and $s$ corresponds to the value at that index. We will denote this score matrix by $S$ and the triplet $(r, p, s)$ by $s_{rp} = s$. The dimension of the matrix is equal to the number of users $R$ by the number of items $P$ ($S \in \mathbb{R}^{R \times P}$). The unobserved triplets are encoded as missing entries in this matrix. The resulting matrix containing observed and unobserved triplets is therefore a sparse matrix. The goal of collaborative filtering is then to fill the matrix by estimating the missing entries of the matrix ($S^m$) given its observed entries ($S^o$). One natural way of doing so is to find a factorization of $S$, $S \approx U^\top V$, with $U \in \mathbb{R}^{k \times R}$ and $V \in \mathbb{R}^{k \times P}$, with $k$ effectively determining the rank of $U^\top V$ [111]. Then a missing value, $(r, p)$, can be recovered by multiplying the $r$-th column of $U$ with the $p$-th column of $V$: $u_r^\top v_p$. When $S$ encodes user-item preferences, then $U$ could be seen as encoding user attitudes toward the item features encoded by $V$. Unless this introduces ambiguity, we simplify the notation by


denoting user r’s factors, that is, r’s column of U , by ur, and similarly item p’s factors, p’s column of

V , by vp. For matrices without missing entries, a factorization can be recovered by finding the singular

value decomposition (SVD) of R, which is a convex optimization problem. However, SVD cannot be

used for matrices with missing entries [112]. Furthermore, CF applications typically involve predicting

preferences for a large number of users and items given a small number of observed ratings (i.e., S is

sparse), |So| << RP . For that reason, estimating (R2 +RP + P 2) parameters would not be wise.

A variety of methods have attempted to deal with these difficulties for CF. Srebro and Jaakkola

[112] proposed to regularize the factorization by finding a low-rank approximation: $S \approx \hat{S}$, where $\mathrm{rank}(\hat{S}) = k \ll \min(R, P)$. Specifically, Srebro and Jaakkola [112] searched for $U$ and $V$ that would minimize the squared loss between the reconstruction and the observed ratings:

$$\sum_{i} \sum_{j} I_{ij}\left(s_{ij} - u_i^\top v_j\right)^2. \qquad (2.1)$$

where $I_{ij} = 1$ if score $s_{ij}$ is observed and 0 otherwise. They also considered more elaborate cases where

I could be a real number corresponding to a weight, “for example in response to some estimate of the

noise variance” [112]. They proposed both an iterative procedure, based on the fact that the objective

becomes convex when conditioning on either U or V , and an expectation-maximization (EM) procedure.

Salakhutdinov and Mnih [99] further regularized the objective using the Frobenius norm of U and V :

$$\sum_{i} \sum_{j} I_{ij}\left(s_{ij} - u_i^\top v_j\right)^2 + \lambda_1 \|U\|_{\mathrm{Fro}} + \lambda_2 \|V\|_{\mathrm{Fro}}. \qquad (2.2)$$

$\|\cdot\|_{\mathrm{Fro}}$ denotes the Frobenius norm of a matrix and corresponds to the square root of the sum of squares of all elements of the matrix (it is the equivalent, for matrices, of the Euclidean norm for vectors). The minimum

of this objective corresponds to the MAP solution of a probabilistic model with a Gaussian likelihood:

$$\Pr(S \mid U, V, \sigma^2) = \prod_{i} \prod_{j} \mathcal{N}\left(s_{ij} \mid u_i^\top v_j, \sigma^2\right) \qquad (2.3)$$

and isotropic Gaussian priors over U and V . This model is called probabilistic matrix factorization

(PMF), and its graphical representation, by way of a Bayesian network, is shown in Figure 2.1(a). The

authors show that although the problem is non-convex, performing gradient descent jointly on U and

V yields very good performance on a large real-life data set such as the Netflix challenge data set (100

million ratings, over 480,000 users, and 17,000 items). Instead of approximating the posterior with the

independent mode of U and V, Lim and Teh [68] proposed to use a mean-field variational approximation which assumes an independent posterior over U and V ($\Pr(U, V \mid S^o) = \Pr(U \mid S^o)\Pr(V \mid S^o)$). Salakhutdinov and Mnih [98] further proposed Bayesian PMF, a fuller Bayesian extension of PMF, which involves

setting hyper-priors over the parameters of the priors of U and V (see Figure 2.1(b)). The posterior can

be approximated using a sampling approach. Specifically, they developed a Gibbs sampling approach

that scales well enough to be applied to large data sets and which outperforms ordinary PMF.
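As an illustration, the following is a minimal sketch of MAP estimation in PMF: full-batch gradient descent on the squared reconstruction error of the observed entries plus (squared) Frobenius-norm penalties on U and V, in the spirit of Equation 2.2. The hyper-parameter values, step size, and toy data are placeholders, and real implementations typically use stochastic updates.

```python
import numpy as np

def fit_pmf(S, I, k=5, lam=0.1, lr=0.01, iters=2000, seed=0):
    """Gradient descent on the squared reconstruction error of the observed
    entries (I[i, j] = 1) plus squared Frobenius-norm penalties on U and V."""
    rng = np.random.default_rng(seed)
    R, P = S.shape
    U = 0.1 * rng.standard_normal((k, R))
    V = 0.1 * rng.standard_normal((k, P))
    for _ in range(iters):
        E = I * (S - U.T @ V)                 # residuals on observed entries only
        U -= lr * (-2 * V @ E.T + 2 * lam * U)
        V -= lr * (-2 * U @ E + 2 * lam * V)
    return U, V

# Toy score matrix with a 0/1 observation mask I (0 marks a missing score).
S = np.array([[5., 4., 0., 1.],
              [4., 5., 2., 0.],
              [1., 0., 5., 4.]])
I = (S > 0).astype(float)
U, V = fit_pmf(S, I, k=2)
print((U.T @ V).round(2))                     # predictions for observed and missing entries
```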

Instead of performing regularization by constraining the rank of U and V, which would render the problem non-convex, Srebro et al. [113] proposed regularizing (only) the norm of the factors, $\|U\|_{\mathrm{Fro}}$ and $\|V\|_{\mathrm{Fro}}$, or equivalently the trace norm of $U^\top V$ (that is, the sum of its singular values, denoted by $\|U^\top V\|_\Sigma$). With certain loss functions (for example, the hinge loss), finding the optimal $U^\top V$ with the trace-norm constraints is a max-margin learning problem, and hence the name of this model is max-margin matrix factorization (MMMF).


Figure 2.1: Graphical model representation of PMF (a) and BPMF (b). These figures show that Bayesian PMF is a PMF model with additional priors over U and V. Both figures are borrowed from [98].

To illustrate this, suppose, for ease of exposition, that $S$ is binary. Then maximizing $S(U^\top V)$ with fixed $\|U^\top V\|_\Sigma$ will find the solution, $(U^\top V)$, that yields the largest distance between the hyperplane defined by the normal vector $(u_i^\top v_j)$ and the ratings $\{s_{ij}\}$. Although this approach leads to a convex optimization problem, Srebro et al. [113] framed the optimization as a semi-definite program (SDP), and given current solvers, the optimization problem cannot be solved for more than a few thousand variables [129]. In their formulation, the number of variables is equal to the number of observed entries, meaning that this solution technique severely limits the scale of the applicable data sets. To combat this problem, Rennie and Srebro [85] proposed a faster variant of MMMF, in which they use a second-order gradient descent method and bound the ranks of $U$ and $V$ (effectively solving the non-convex problem). Another way to understand this work is as follows. Start from Equation 2.2; then a max-margin approach reveals itself if the squared error term is replaced by a loss function that maximizes the margin between the hyperplanes defined by $u_i^\top v_j$ and $\{s_{ij}\}$ for all $i, j$. In fact, with binary ratings, PMF can also be seen as a least-squares SVM [116].
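For intuition, the sketch below implements a simplified bounded-rank MMMF for binary ratings: a hinge loss on the observed entries plus Frobenius-norm regularization, minimized by plain gradient descent. Rennie and Srebro use a smooth hinge and a second-order method; the plain hinge, step size, and toy data here are simplifying assumptions.

```python
import numpy as np

def fit_mmmf(S, I, k=2, lam=0.05, lr=0.05, iters=1000, seed=0):
    """Bounded-rank MMMF for ratings in {-1, +1}: hinge loss on observed
    entries plus Frobenius-norm regularization, minimized by gradient descent."""
    rng = np.random.default_rng(seed)
    R, P = S.shape
    U = 0.1 * rng.standard_normal((k, R))
    V = 0.1 * rng.standard_normal((k, P))
    for _ in range(iters):
        margin = S * (U.T @ V)          # s_ij * u_i^T v_j
        active = I * (margin < 1)       # observed entries inside the margin
        G = -active * S                 # derivative of the hinge w.r.t. u_i^T v_j
        U -= lr * (V @ G.T + lam * U)
        V -= lr * (U @ G + lam * V)
    return U, V

# Toy binary preferences; 0 marks a missing entry, I is the observation mask.
S = np.array([[ 1.,  1., -1.,  0.],
              [ 1.,  0., -1., -1.],
              [-1., -1.,  1.,  1.]])
I = (S != 0).astype(float)
U, V = fit_mmmf(S, I)
print(np.sign(U.T @ V))                 # predicted binary preferences
```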

As previously noted in matrix factorization techniques, the factors U and V can be thought of as user

features and item features respectively. Of note is the fact that factorization models are symmetric with

respect to users and items: transposing the observed score matrix does not alter the optimal solution.

Other Probabilistic Models

There has also been significant work on probabilistic models that are not based on factorizing the score

matrix. Such models typically learn different sets of parameters for users and for items. The principle

underlying most of these probabilistic models is to cluster users into a set of global user profiles. Profiles

can be seen as user attitudes toward items. Once a user’s profile has been determined, that profile alone

interacts, in some specific way, with an item’s representation to produce a rating. Some of this work

originates in applying mixture models to the problem of CF [49]. The generative model is as follows:

$$\Pr(s_{ij} \mid u_i, v_j) = \sum_{z \in Z} \Pr(z_i = z \mid u_i)\, \Pr(S_{zj} = s;\, \mu_{zj}, \sigma^2_{zj}), \qquad (2.4)$$


Figure 2.2: Graphical representation of a mixture model proposed by Hofmann [49] to predict ratings.

where Z is a latent random variable representing the different user profiles. \Pr(S_{zj} = s; \mu_{zj}, \sigma^2_{zj}) is a Gaussian distribution with mean \mu_{zj} and variance \sigma^2_{zj}. This mixture model's Bayesian-network repre-

sentation is presented in Figure 2.2. Informally, the corresponding generative model first associates user

ui with a profile z. Then that profile and the item of interest vj combine to set the parameters of a

Gaussian, which in turn determines the value of the score for (ui, vj). The parameters to learn are the

mixture weights for each user (that is, the distribution over the profiles) in addition to the mean and

variance of a Gaussian distribution for every item and every profile. Several authors have proposed to

use priors over possible user profiles [71, 48]. In addition to representing users as a distribution over a

small set of profiles, Ross and Zemel [93] proposed to also further cluster items (MCVQ).
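As a small illustration of Equation 2.4 (ours; the profile distribution and the per-profile Gaussian parameters below are made-up toy values rather than a fitted model), the predictive density of a rating can be computed as follows.

import numpy as np

def rating_density(s, profile_probs, mu, sigma2, j):
    """Pr(s_ij | u_i, v_j) as in Equation 2.4: a mixture, over user profiles z,
    of per-profile Gaussians for item j."""
    gauss = np.exp(-(s - mu[:, j]) ** 2 / (2 * sigma2[:, j])) / np.sqrt(2 * np.pi * sigma2[:, j])
    return float(profile_probs @ gauss)

# toy example: two profiles, three items; this user leans towards profile 0
density = rating_density(
    s=4.0,
    profile_probs=np.array([0.7, 0.3]),               # Pr(z_i = z | u_i)
    mu=np.array([[4.2, 1.0, 3.0], [2.0, 4.5, 3.5]]),  # mu_{zj}
    sigma2=np.ones((2, 3)),                           # sigma^2_{zj}
    j=0)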

Marlin and Zemel [73] model users as a set of binary factors (binary vectors), and factors are viewed

as user attitudes. Each factor is associated with a multinomial distribution over ratings for each item.

The main distinguishing feature of this approach is that the influences of the different factors, expressed

through their multinomials over ratings, are combined multiplicatively to predict a rating for a particular

item. This combination implies that factors can express no opinion about certain items (by having a

uniform distribution over the item’s ratings). Furthermore, the multiplicative combination also means

that the distribution over a rating can be sharper than any of the factor distributions, something that is

impossible in mixture models where distributions are averaged.

Lawrence and Urtasun [66] proposed the use of a Gaussian process recovered by marginalizing out U

from the PMF formulation, Equation 2.3, assuming an isotropic Gaussian prior over U . The likelihood

over ratings is then a zero-mean Gaussian with a special covariance structure:

P(S \mid V, \sigma^2, \alpha_w) = \prod_{j=1}^{N} \mathcal{N}(s_j \mid 0,\; \alpha_w^{-1} V^\top V + \sigma^2 I).   (2.5)

One then optimizes for the parameters V and the scalar parameters α_w and σ using (stochastic) gradient

descent. Here the parameters, V , form a covariance matrix that encodes similarities between items. To

obtain a prediction for a user given the above model, one simply has to condition on the observed ratings

of the user. Given the Gaussian nature of this model, a user’s unobserved ratings will then be predicted

by a weighted combination of the user's observed ratings. The weights of the combination are a function of the covariances between the observed and unobserved items.
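The following sketch (ours) illustrates how such predictions can be obtained by standard Gaussian conditioning under the covariance structure of Equation 2.5; the function and variable names, and the toy factors, are ours and only assume that the item factors V have already been learned.

import numpy as np

def predict_user(V, s_obs, obs_idx, unobs_idx, alpha_w=1.0, sigma2=0.1):
    """Posterior mean of a user's unobserved ratings under the covariance of Equation 2.5."""
    K = V.T @ V / alpha_w + sigma2 * np.eye(V.shape[1])   # item-item covariance matrix
    K_oo = K[np.ix_(obs_idx, obs_idx)]
    K_uo = K[np.ix_(unobs_idx, obs_idx)]
    weights = K_uo @ np.linalg.inv(K_oo)                  # Gaussian conditioning weights
    return weights @ s_obs                                # weighted combination of observed ratings

# toy usage with random item factors for six items
rng = np.random.default_rng(0)
V = rng.standard_normal((3, 6))
pred = predict_user(V, s_obs=np.array([4.0, 2.0]), obs_idx=[0, 3], unobs_idx=[1, 2, 4, 5])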


Salakhutdinov et al. [100] proposed to apply restricted Boltzmann machines (RBMs) to the collab-

orative filtering task. RBMs are a class of undirected graphical models, general enough to be applied,

with few modifications, to problems in many different domains. For CF, Salakhutdinov et al. [100]

showed that using RBMs enables efficient learning and inference that can scale to large problems (such

as the Netflix data set) and can yield near state-of-the-art performance. Furthermore, RBMs' distributed representations combine active features multiplicatively, in a similar way and with the same advantages over mixture models as the model of Marlin and Zemel [73]. Recently, Georgiev and Nakov [38] have proposed a

novel RBM-based model for CF.

Without direct comparative experiments, it is difficult to differentiate among the methods described

above in terms of performance. Matrix factorization techniques, which are typically very fast to train

(and support effective inference), have been the most widely used in recent years. One lesson drawn from the large competition run by the on-line movie rental company Netflix is that ensemble methods, which

are combinations of several different methods, are particularly well-suited for collaborative filtering

problems. With this in mind, combinations of linear models such as matrix factorization techniques

have been shown to be very effective [62, 7, 64].

Classical Evaluation Metrics

We now turn our attention to the performance metrics typically used to evaluate CF models. Two

metrics are commonly used to compare the performance of CF methods. Both are decomposable over

each single predicted rating. The first is the mean absolute error (MAE):

MAE := \frac{\sum_{(r,p) \in S^o} |s_{rp} - \hat{s}_{rp}|}{|S^o|},   (2.6)

where \hat{s}_{rp} is a learning method's estimate of s_{rp} and S^o is the set of user-item index pairs of interest (e.g., the pairs contained in a test set). Researchers have also used a normalized version of this metric (NMAE), where MAE is normalized so that random guessing would yield an error value of one. Nor-

malization enables comparison of errors across data sets that have different rating scales. The second

metric is the mean squared error (MSE):

MSE := \frac{\sum_{(r,p) \in S^o} (s_{rp} - \hat{s}_{rp})^2}{|S^o|},   (2.7)
and its square-root version, RMSE := \sqrt{MSE}. The probabilistic methods introduced in the previous

section do not directly optimize for these measures. Instead, they learn a maximum likelihood (ML)

estimate of the parameters, a maximum a posteriori (MAP) estimate, or a full distribution over their

parameters in the case of Bayesian methods such as Bayesian PMF. As outlined above, finding a MAP

estimate is, under certain assumptions, equivalent to minimizing the squared loss of individual predictions

(and hence the MSE). Once learning is done, probabilistic models can be used to infer a distribution over

ratings. Given that distribution, different evaluation metrics call for different prediction procedures: the

MSE is minimized by taking the expectation over possible rating values, while the MAE is optimized by

predicting the median of the ratings distribution. One exception is in the case of MMMF, which does

not have a direct probabilistic interpretation, but Srebro et al. [113] proposed to minimize a hinge loss

which generalizes to a multi-class objective like MAE.
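For concreteness, a minimal Python sketch of these metrics follows (ours; it assumes dense arrays of ground-truth and predicted ratings over the test pairs, and the NMAE normalization constant shown is only illustrative, since the proper constant depends on the rating scale and the random-guessing scheme).

import numpy as np

def cf_metrics(s_true, s_pred, rating_range=4.0):
    """MAE and MSE as in Equations 2.6 and 2.7, plus RMSE and an illustrative NMAE."""
    err = np.asarray(s_true, dtype=float) - np.asarray(s_pred, dtype=float)
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    return {"MAE": mae,
            "MSE": mse,
            "RMSE": np.sqrt(mse),
            # the proper NMAE constant depends on the rating scale and guessing scheme;
            # rating_range / 2 is used here purely for illustration
            "NMAE": mae / (rating_range / 2.0)}

metrics = cf_metrics([5, 3, 1, 4], [4.5, 2.0, 2.0, 4.0])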


The Missing At-Random Assumption

There is a general issue with the evaluation framework used in most CF research which may be problem-

atic when deploying these models in the real world. Because the empirical loss is taken as a surrogate

for the expected loss, one assumes that the data-generating distribution of both is the same; however,

in practice, it usually is not. In other words, users generally do not rate items at random, and in fact,

it has been shown empirically that users tend to bias their ratings toward items in which they are inter-

ested [75]. Positively biased ratings are likely due to users wanting to experience items that they believe

they will enjoy a priori. Hence, learning models will be affected unless this bias is explicitly taken into

account. Technically speaking, most models assume that data are missing at random, which is often

incorrect. Marlin et al. [75] and Marlin and Zemel [74] proposed a series of models which explicitly

account for data not missing at random. Using such models is preferred if data about items considered

but not explicitly rated are available. In practice, such data are seldom available (although on-line mer-

chants may have access to them). Other solutions try to reconcile the (possibly biased) data with the

need to minimize the (unbiased) expected loss, but they also have certain drawbacks [126]. Although

of great interest, such questions are somewhat tangential to our goals and therefore will not be directly

addressed here.

2.2.2 CF for Recommendations

The astute reader will have noticed that although we claim that CF techniques belong to the realm

of recommendation systems, neither the outlined CF methods nor their evaluation metrics offer a clear

way of making item recommendations to users. In fact, many authors (see, for example, [72]) see the

prediction of preferences as the first of two steps leading to personalized recommendations. The second

step consists of sorting the (predicted) ratings. The top recommendation for a user is then the item

with the highest predicted score. This two-step view of a recommendation system using collaborative

filtering has two important shortcomings:

1. When the goal is simply to recommend the top items to users, a solution to the ratings-estimation problem is also a solution to the ranking problem, but it solves an unnecessarily hard problem: Weimer et al. [128] pointed out that score-estimation models must also learn to calibrate scores, which makes score estimation actually harder than estimating rankings.

2. As we have argued in Chapter 1, defining item recommendations by suggesting to users their

predicted favourite items is only one example of a possible recommendation task. In practice other

conditions and constraints may need to be satisfied:

(a) Certain domains may warrant the consideration of factors such as confidence in predictions.

For example, in a mobile application where there are costs associated with recommendations,

a system might prefer to use a more conservative strategy by, say, recommending an item for

which it has a higher certainty of success.

(b) The availability of side information may introduce additional constraints. For example, if

several items are to be recommended at once, criteria such as diversity might have to be

taken into account (see, e.g., [78]).

(c) In addition to constraints affecting single users or items, there are possibly global constraints

on users, items, or both. For example, certain items may be available only in limited quantities, and therefore these items may not be recommended to more than a certain number of

users. More complex relationships may also exist between users and items, such as those that

arise in the domain of paper-to-reviewer matching for scientific conferences (see Chapter 3 for

a complete discussion).

In the cases described above, modifying the second step according to the recommendation task

may not be optimal because a learning method trained for ratings prediction may not capture the

constraints and aims of the recommendation system.

Instead of considering CF for recommendations as the product of two separate steps, where the first

is unaware of the objectives of the second, we argue that the tasks should be more integrated in that the

objective of collaborative filtering should be sensitive to the final objective. In fact, this line of reasoning

will be present throughout this thesis. We review some existing methods on the topic below and defer

a full discussion of our approaches to Chapter 5.

Collaborative Filtering in the Service of Another Task

In previous sections, we have reviewed various collaborative filtering techniques that all optimize rating

predictions. We have also established that to offer recommendations, a system should be aware of the

final recommendation task from the beginning. We mean by this that the learning procedure, at training

time, should consider a loss reflective of the ultimate objective. One may object to this by pointing out

that regardless of the final task, the learning method will be accurate if it is able to estimate perfectly

the unobserved ratings. However, this is almost never the case in practice. Instead, integrating the final

objective into the learning objective can be seen as indicating to the learning algorithm where to focus

its attention (capacity) to maximize performance on the final task.

One may wonder why this seemingly simple principle has not been more widely applied. First, as

will be made clear in this section, the ultimate task objective is often harder to optimize than the

ratings estimation task. Both RMSE and MAE conveniently decompose over single predictions and are

continuous. Furthermore, although possibly sub-optimal, it may be more appealing to design a general

CF method that can easily be applied to multiple different tasks rather than one with a specific task in

mind.

In this section, we outline some methods that have looked at integrating the preference estimation

step and the recommendation step into one. We also present a few tasks for which collaborative filtering

has been used and which may benefit from an integrated approach.

Most of the existing work on CF in the service of another task has proposed solutions for specific

recommendation tasks as opposed to providing frameworks that can easily be adapted to different tasks.

Jambor and Wang [56] stand out because they do not focus on a specific application, but rather propose

a general framework that can be tailored to the requirements of various applications. Although the

solution proposed in this paper does not constitute a definitive solution, it touches upon some of the

ideas that we have put forward in the introduction to this section and that have been developed by others

looking at specific applications. The main technical idea developed by Jambor and Wang [56] is that

contrary to what RMSE and MAE do, not all errors are “created equal”, and different types of errors

should be weighted differently based on the objectives of the recommendation system. Accordingly, the


authors use a loss function weighted by a function of the ground-truth rating and the prediction:

\sum_{(r,p) \in S^o} w(s_{rp}, \hat{s}_{rp})\, (s_{rp} - \hat{s}_{rp})^2   (2.8)

These weights can either be fixed a priori or learned to minimize the system’s loss function (for example,

a ranking function). Let us look at two simple examples to demonstrate how the w’s may be set for

different objectives. For simplicity, we will assume that the ratings are binary. In the first example,

imagine that the objective is to maximize the system’s precision performance, a reasonable objective

when the goal is to recommend top items to users. In that case, predicting a zero for a rating with

a ground-truth (GT) value of one will hurt the system less than predicting a one with a GT of zero.

Then it is sensible to set w(0, 1) > w(1, 0) in the above formula. In general, given that our objective is

precision, underestimating a high rating is less costly than overestimating a low rating. Now imagine

that instead of maximizing precision, we would like to maximize recall. Then it would be sensible for the

learning method to incur a greater cost when incorrectly classifying a one versus incorrectly classifying

a zero, and therefore w(1, 0) > w(0, 1).
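A minimal sketch of the weighted loss of Equation 2.8 for this binary example follows (ours; the particular weight values are arbitrary and would in practice be fixed a priori or learned, as discussed above).

def weighted_squared_loss(s_true, s_pred, w):
    """Equation 2.8 for binary ratings: w is indexed by (ground truth, rounded prediction)."""
    total = 0.0
    for s, s_hat in zip(s_true, s_pred):
        total += w[(s, int(round(s_hat)))] * (s - s_hat) ** 2
    return total

# precision-oriented weighting: over-estimating a 0 costs more than under-estimating a 1
w_precision = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 1.0, (1, 1): 1.0}
loss = weighted_squared_loss([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1], w_precision)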

Collaborative Ranking

The most natural recommendation task is to provide every user with a list of items ranked in order of

(estimated) preferences or ratings. This list may not contain all items, but is rather a truncated list with

only the top items for a given user. Naturally, then, optimizing a collaborative filtering system with a

ranking objective has received the most attention from the community.

Although many different ranking objectives exist, researchers in the learning-to-rank literature have recently focused mostly on the normalized discounted cumulative gain (NDCG) [57]:

NDCG@T(S, \pi) := \frac{1}{N} \sum_{i=1}^{T} \frac{2^{S_{\pi_i}} - 1}{\log(i + 2)}   (2.9)

where S are the ground-truth scores and π denotes a permutation, that is, S_{\pi_i} is the value of the rating at position π_i. The permutation π reflects the ordering of the predictions of the underlying learning model. The

truncation parameter T corresponds to the number of items to be recommended to each user, and N is

a normalization factor such that the possible values of NDCG lie between zero and one. Note that the

denominator has the effect of weighting the higher-ranked items more strongly than the lower-ranked

ones. Having T and the denominator in the NDCG objective represents two major differences compared

to traditional collaborative filtering metrics and highlights the fact that, for top-T recommendations,

learning with NDCG@T might be beneficial compared to learning with a traditional loss (e.g., RMSE).

Moreover, NDCG as defined above considers the ranking of only one user, whereas in a multi-user setting,

it is customary to report the performance of a method using the average NDCG across all users.
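The following sketch (ours) computes NDCG@T for a single user exactly as in Equation 2.9, taking the normalization factor N to be the DCG of the ideal ordering so that values lie in [0, 1]; averaging over users is left to the caller.

import numpy as np

def dcg_at_t(scores_in_rank_order, T):
    scores = np.asarray(scores_in_rank_order, dtype=float)[:T]
    positions = np.arange(1, len(scores) + 1)
    return np.sum((2.0 ** scores - 1.0) / np.log(positions + 2))

def ndcg_at_t(s_true, s_pred, T):
    """NDCG@T for one user (Equation 2.9): rank by predicted scores, gain from true scores."""
    order = np.argsort(-np.asarray(s_pred, dtype=float))   # permutation pi induced by the model
    ideal = np.sort(np.asarray(s_true, dtype=float))[::-1]
    norm = dcg_at_t(ideal, T)                               # normalization factor N
    return dcg_at_t(np.asarray(s_true, dtype=float)[order], T) / norm if norm > 0 else 0.0

ndcg = ndcg_at_t(s_true=[5, 3, 4, 1], s_pred=[2.4, 3.1, 2.9, 0.5], T=3)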

A major challenge in this line of work is that NDCG is not a continuous function, but in fact is

piecewise constant [128] because all ratings with the same label can be ranked in a number of different

consistent orders without affecting the value of NDCG. For this reason, it is challenging to perform

gradient-based optimization of NDCG. Weimer et al. [128] derived a convex lower bound on NDCG

and maximized that instead. The underlying model predicting the ratings has the same form as PMF.

Consequently, their approach can be understood as training PMF using the NDCG loss and is called


CoFiRank. The resulting optimization is still non-trivial because the evaluation of the lower bound

is expensive, but using a bundle method [109] with an LP in its inner loop, Weimer et al. [128] were

able to report results on large data sets such as the Netflix data set. In experiments on standard CF

data sets, they compared their approach to MMMF as well as their approach using an ordinal loss and

a simple regression loss (RMSE). They showed that for most settings, training using a ranking loss

(NDCG or ordinal regression) results in a significant performance gain versus MMMF. In a subsequent

paper, Weimer et al. [129] showed less convincing results; on a CF ranking task, their method trained

using an ordinal loss is outperformed by the same method trained using a regression loss. The data

sets used in both papers differed, which may explain these differences. Balakrishnan and Chopra [4]

proposed a two-stage procedure resembling the optimization of NDCG. First, much as in regular CF, a

PMF model is learned. In the second stage, they use the latent variables inferred in PMF as features

in a regression model using a hand-crafted loss which is meant to capture the main characteristics of

NDCG. They show that jointly learning these two steps easily outperforms CoFiRankwhile providing

a simpler optimization approach. Volkovs and Zemel [125] used a similar two-stage approach. They

first extract user-item features using neighbour preferences. They then use these features inside a linear

model trained using LambdaRank, a standard learning-to-rank approach [24]. This model delivers state-of-the-art performance and requires only 17 parameters, making learning and inference very fast.

Shi et al. [107] define a simple user-specific probability model over item rankings. The probability

distribution encodes the probability that a specific item, p, is ranked first for user r [25]:

P(s_{rp}) = \frac{\exp(s_{rp})}{\sum_{p'} \exp(s_{rp'})}.   (2.10)

A cross-entropy loss function is used to learn to match the ground-truth distribution, the distribution

yielded by the above equation using the ground-truth ratings, with the model (PMF) distribution:

Loss := -\sum_{p \in S^o(r)} \frac{\exp(s_{rp})}{\sum_{p' \in S^o(r)} \exp(s_{rp'})} \log \frac{\exp(g(u_r^\top v_p))}{\sum_{p' \in S^o(r)} \exp(g(u_r^\top v_{p'}))}   (2.11)

where g(\cdot) is a logistic function and S^o(r) denotes the items rated by user r. They optimize the above

function using gradient descent, alternately fixing U and optimizing V and vice versa. In experiments,

this model’s performance was generally significantly higher than CoFiRank’s performance.

Finally, Liu and Yang [69] bypassed the difficulty of training a ranking loss by using a model-free

CF approach. Here the similarity measure between users is the Kendall rank correlation coefficient,

which measures similarity between the ranks of items rated by both users. They model ratings using

pairwise preferences. User r's preference function over two items, p and p', Ψ_r(p, p'), is positive if the user prefers p to p' (s_rp > s_rp'), negative if the contrary, and zero if the user does not prefer one to the other (s_rp = s_rp'). Formally,

\Psi_r(p, p') = \frac{\sum_{r' \in N_r^{p,p'}} \mathrm{sim}(r, r')\, (s_{r'p} - s_{r'p'})}{\sum_{r' \in N_r^{p,p'}} \mathrm{sim}(r, r')}   (2.12)

where N_r^{p,p'} is the set of neighbours of user r that have rated both p and p', and sim(r, r') is the similarity between users r and r'. Then the optimal ranking for each user is the one that maximizes the sum of the


properly ordered pairwise Ψ:

\max_{\pi} \sum_{p, p' : \pi(p) > \pi(p')} \Psi_r(p, p').   (2.13)

Although the decision variant of this optimization problem is NP-complete [28], Liu and Yang

[69] proposed two heuristics that are less computationally costly and that perform well in practice.
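A minimal sketch of Equations 2.12 and 2.13 follows (ours; it ranks items by the sum of their pairwise preferences, a trivial greedy stand-in rather than either of the heuristics actually proposed by Liu and Yang [69]).

import numpy as np

def psi(S, sim, r, p, q):
    """Equation 2.12: neighbour-weighted preference of user r for item p over item q.
    S: users x items rating matrix with np.nan where unrated; sim: user-user similarities."""
    num = den = 0.0
    for r2 in range(S.shape[0]):
        if r2 != r and not np.isnan(S[r2, p]) and not np.isnan(S[r2, q]):
            num += sim[r, r2] * (S[r2, p] - S[r2, q])
            den += sim[r, r2]
    return num / den if den > 0 else 0.0

def greedy_rank(S, sim, r, items):
    """Order items by their total pairwise preference, a crude stand-in for Equation 2.13."""
    totals = {p: sum(psi(S, sim, r, p, q) for q in items if q != p) for p in items}
    return sorted(items, key=lambda p: -totals[p])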

Various authors have also looked at how side information about items may affect recommendations.

Jambor and Wang [55] studied ways to incorporate item-resource constraints. For example, a company might prefer not to aggressively recommend products for which its stock is low, or alternatively, it might

want to recommend products with higher profit margins. They proposed to deal with such additional

constraints once the ratings had been estimated by allowing the system to re-rank the output of the

original CF method based on the additional constraints. They formulate this re-ranking as a convex optimization problem, in which the corresponding constraints or an additional term in the objective capture the requirements described above. They did not detail the exact technique used to

solve this optimization problem, but depending on requirements, their problem can be cast either as a

linear program or as a quadratic program and both can be solved with off-the-shelf solvers (for example,

CPLEX).

Overall, relatively little work has been done on using CF in more complex recommendation applica-

tions. The handful of papers that have considered CF with a true recommendation perspective will be

discussed in Chapter 5.

2.3 Active Preference Collection

In the previous section, we have assumed a typical supervised learning setting, where a set of labelled instances (ratings) is provided and we are looking for the set of parameters within a class of models

that will minimize a certain loss function. Hence, we are assuming that the set of ratings is fixed, which

is unnatural in many recommendation system domains. For example: a) any new system will first have

to gather item preferences from users; b) typically, new users continually enter the system; and c) if the

system is on-line, it is probable that existing users will be providing new ratings for previously unrated

items. These events provide an opportunity for the system to gather extra preference data, or extra

information indicative of user preferences, and thus to improve its performance. Moreover, if the system

were able to target specifically valuable user preferences over items, these could help the system achieve

greater increases in performance more quickly. The process of selecting which labels or ratings to query

falls into the category of active learning. This is a learning paradigm which is often motivated in a

context where the cost of acquiring labelled data is high [11]. The aim of active learning is to select the

queries that will improve the performance of the learning model most quickly. Active learning methods are often evaluated by comparing their performance to that of passive methods, which select the queries to be elicited at random; the comparison is made after each new query or after a set of queries.

Considering that relatively little work has been done on active learning directly aimed at recommen-

dation systems, this section will give a brief overview of active learning methods in general. Whenever

work related to recommendation systems does exist, we will highlight it, and we will also emphasize

the merits of the various techniques with respect to their applicability and possible effectiveness for

recommendation applications.

We will focus on four main query strategy frameworks: 1) uncertainty sampling, 2) query-by-committee, 3) expected model change, and 4) expected error reduction [102]. Apart from papers related

to recommendation systems, most of the material presented in this section originates from the excellent

survey of Settles [102] and references therein. In what follows, we will denote the set of available ratings

as S^o and the set of unobserved, or missing, ratings as S^u. The goal will then be to query ratings from S^u using a particular query strategy. Furthermore, because our focus is on the recommendation domain,

we will assume that the set of possible queries, which consists of unobserved user-item pairs, is known a

priori. At each time step, the goal is then to pick the best possible query, according to a specific query

strategy, among all possible queries. Once a user has responded to a query, his response can be used to

re-train the learning model. After that we can proceed to query selection for the next time step.

2.3.1 Uncertainty Sampling

In uncertainty sampling, the goal is to pick the query that most reduces the uncertainty of the model’s

predictions [67]. The querying method can be seen as sampling examples, hence its name. It is therefore

usually assumed that the model's posterior probability over the elements of S^u can be evaluated. Given

a posterior distribution over ratings, one can use the uncertainty of the model in several different ways.

For example, one may pick the query about which the model is least confident, where, for example,

confidence is the probability of the mode of the posterior distribution:

\max_{(r,p) \in S^u} \left(1 - \Pr(\hat{s}^u_{rp} \mid S^o, \theta)\right)   (2.14)
where \hat{s}^u_{rp} denotes the mode of that posterior distribution (i.e., \hat{s}^u_{rp} = \arg\max_s \Pr(s^u_{rp} = s \mid S^o, \theta)).

Perhaps a more pleasing criterion would be to look at the whole distribution through its entropy rather

than its mode:

\max_{(r,p) \in S^u} \; -\sum_{s} \Pr(s^u_{rp} = s \mid S^o, \theta) \log \Pr(s^u_{rp} = s \mid S^o, \theta)   (2.15)

The intuition is that entropy-based uncertainty sampling should lead to a better model of the

full posterior distribution over ratings, whereas the previous approach (or variants that consider the

margin between the top predictions) might be better at discriminating between the most likely class

and the others [102]. In practice, the performance of the different methods seems to be application-

dependent [101, 104]. For regression problems, the same intuitions as above can be applied, where the

uncertainty of the predictive distribution is simply the variance over its predictions [102].
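A sketch of entropy-based query selection in the spirit of Equation 2.15 follows (ours; it assumes the learning model exposes, for every candidate user-item pair, a posterior distribution over the discrete rating values).

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def select_query(candidates, posterior):
    """Pick the (user, item) pair whose predictive distribution has maximum entropy."""
    return max(candidates, key=lambda rp: entropy(posterior(*rp)))

# toy posterior over three rating values: pair (0, 7) is much less certain than (0, 3)
post = {(0, 3): [0.9, 0.05, 0.05], (0, 7): [0.4, 0.3, 0.3]}
query = select_query(list(post), lambda r, p: post[(r, p)])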

Rubens and Sugiyama [96] proposed a related approach where, instead of querying the example with

the most uncertainty, they query the example that most reduces the expected uncertainty of all items.

They applied this idea to CF and show that on a particular subset of a CF data set (one of the MovieLens

data sets), this approach outperforms both a variance-based and an entropy-based strategy, with both of the latter performing no better than (and sometimes worse than) a random strategy. The authors compared the performance of the

different methods when very little information is known from each user (between 1 and 10 items).

Other authors have also applied these ideas to CF. Jin and Si [58] used the relative entropy between

the posterior distribution over parameters and the updated posterior given a possible query and its

response to guide their query selection. Using the same model, Harpale and Yang [45] considered that

in many domains, it is unreasonable to ask a user to rate an arbitrary item because it is unlikely that a user

can easily access all items (for example, users might not be willing to watch any movie selected by the

system). The authors proposed that the system’s decision should be based on how likely an item is to


be rated, approximated by a quantity similar to the item's popularity among users whose profiles are similar to that of the user under elicitation.

Authors have also used uncertainty minimization strategies using non-probabilistic models. For

example, in CF, Rish and Tesauro [90] looked at using MMMF and defined uncertainty in terms of the distance to the decision boundary, in other words, the distance in feature space between the

point and the hyperplane in a max-margin model. The next query is the one that is closest to the

separating hyperplane. Experimentally, the authors showed that this approach outperformed other

heuristics, namely random and max. uncertainty query selection. One theoretical problem with this

approach is that it suffers from sampling bias [30]. In other words, because the queries are chosen with

respect to the uncertainty of the current classifier, it is possible that some regions of the input space will

be completely ignored, although all regions have some weight under the true data distribution, leading to

poor generalization performance. Sampling bias is a problem that affects most active learning methods

unless special precautions are taken [30].

In general, uncertainty sampling approaches are computationally efficient even though they require

the evaluation of all possible O(RP ) queries.

2.3.2 Query by Committee

The idea of the query-by-committee framework is to keep a committee [105], that is, a set of hypotheses from one or multiple models, each trained on and consistent with S^o. In classification tasks, such a set of hypotheses

is called a version space. The instance to be queried is then the one on which these various hypotheses

disagree the most. Several definitions of disagreement are possible; for example, one may use the average

Kullback-Leibler divergence between the distribution over ratings of each hypothesis. One obvious

limitation of these approaches is that one must be able to generate multiple hypotheses that are all

consistent with the training set.

There have been a number of recent developments in this field that try to circumvent this limitation. In fact, these developments have led their proponents to declare “theoretical victory” over active learning.1

Query-by-committee can also be extended to regression problems, which are usually the domain of

CF approaches, although the notion of a version space no longer applies [102]. Instead, one can

measure the variance between the predictions of the various committee members.

Compared to uncertainty sampling, query-by-committee attempts to choose queries that reduce un-

certainty over both model predictions and model parameters. The computational efficiency of using

this strategy will depend on the procedure for obtaining the set of hypotheses forming the committee.

Although we are not aware of any work that has applied the query-by-committee framework in either

CF or recommendation systems, one could imagine training Bayesian PMF models with different initializations (or different hyperparameters), or alternatively a set of PMF models with uncertainty defined by how close a predicted rating is to the threshold between different rating values. Overall, as noted in Section 2.2.1, ensemble methods (combinations of different methods, where each method ends up learning a specific aspect of users and items) have been shown to do extremely well in collaborative filtering

problems [7]. It remains to be seen whether this methodology can also be useful for active learning.

1. http://hunch.net/?p=1800


2.3.3 Expected Model Change

The goal in expected-model-change approaches is to find the query that will maximize the (expected)

model change. A simple criterion for monitoring model change is to evaluate the norm of the difference

of the parameter vectors (e.g., ||θ^{s_{rp}} − θ||), where θ is the current model's parameter vector and θ^{s_{rp}} is the updated model's parameter vector after querying s_{rp} = s. Settles et al. [103] proposed to exploit this idea using the expected gradient of the loss with respect to the model parameters (θ), where the expectation is taken over the possible rating values:

\sum_{s} P(s^u_{rp} = s \mid S^o, \theta)\, \big\|\nabla L(S^o \cup \{s_{rp} = s\})\big\|   (2.16)

where ∇L(S^o ∪ {s_{rp} = s}) is the gradient of the loss function L with respect to the model parameters, calculated using the observed scores plus the proposed query s_{rp} = s. Because the learning method is typically re-trained to (near) convergence after each query, the gradient over S^o alone is approximately zero, and the overall gradient can be approximated by ∇L({s_{rp} = s}). The chosen query is then the one that maximizes Equation 2.16. This query strategy is called

expected gradient length (EGL). It is important to note that in contrast to other approaches, EGL does

not directly reduce either model uncertainty or generalization error, but rather picks the query with the

“greatest impact on the [model’s] parameters” [102]. This strategy can be applied only to models that

can be trained using a gradient method (which is the case for PMF, but not for all other probabilistic

methods). Compared to other approaches, the efficiency of this method will be largely dependent on

the cost of evaluating the model gradients. Furthermore, because outliers will greatly affect the model

parameters, this strategy would be particularly prone to failure when using an uninformative model of

uncertainty.
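A sketch of the expected-gradient-length criterion of Equation 2.16 follows (ours; the functions posterior and grad_on_example are hypothetical helpers assumed to be supplied by the underlying model, and the per-candidate gradient is approximated, as discussed above, by the gradient on the hypothetical new example alone).

import numpy as np

def egl_query(candidates, rating_values, posterior, grad_on_example):
    """Expected gradient length (Equation 2.16): pick the query (r, p) maximizing
    sum_s P(s_rp = s | S^o, theta) * ||grad L({s_rp = s})||."""
    def egl(rp):
        return sum(prob * np.linalg.norm(grad_on_example(rp, s))
                   for prob, s in zip(posterior(*rp), rating_values))
    return max(candidates, key=egl)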

2.3.4 Expected Error Reduction

Perhaps the most natural strategy to use is to choose the query that will most reduce, in expectation,

the error of the final objective. Recall that the usual method for evaluating the effectiveness of all

querying strategies is to compare the model learned using active learning with one learned without (by

so-called passive learning). Therefore, it makes sense to optimize for the true objective. Settles [102] describes an approach that minimizes the 0/1 loss:

\sum_{s} \Pr_{\theta}(s^u_{rp} = s \mid S^o) \sum_{(r',p') \in S^u \setminus \{(r,p)\}} \left(1 - \max_{s'} \Pr_{\theta^{s_{rp}}}(s^u_{r'p'} = s' \mid S^o \cup \{s_{rp} = s\})\right).   (2.17)

where \Pr_{\theta^{s_{rp}}}(\cdot) denotes the distribution over unlabelled ratings when the model with parameter vector θ receives an additional training example s_{rp} = s. What is surprising about the above formulation is that

it is not actually calculating the error, but rather the change in confidence over the unlabelled data. In

a sense, it is using the unlabelled data as a validation set. Instead, it would seem to make sense to look

at the expected decrease in error on a labelled validation set (that is, a labelled set disjoint from the training set, with S^o = S^{train} ∪ S^{validation}). The obvious problem with using a validation set is that

in an active learning setting, labelled data are often scarce (at least at the beginning of the elicitation),

and therefore there might not be enough data to create a meaningful validation set.

Importantly, this framework can be adapted to use any objective (loss) function. The strategy can

therefore be leveraged both in collaborative filtering with the usual loss functions or when using CF


for another task. One limitation is that because the framework requires retraining the model for each

possible value of each possible query, it can be very computationally expensive.
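A sketch of the criterion of Equation 2.17 follows (ours; retrained_posterior is a hypothetical helper that re-trains the model with one additional hypothesized response and returns its posterior, which is precisely the expensive step noted above).

def expected_error_reduction_query(candidates, rating_values, posterior, retrained_posterior):
    """Pick the query minimizing Equation 2.17: the expected residual confidence gap
    (1 - max posterior) summed over the remaining unlabelled pairs."""
    def expected_uncertainty(rp):
        total = 0.0
        for prob, s in zip(posterior(*rp), rating_values):
            post = retrained_posterior(rp, s)      # model re-trained with s_rp = s added
            total += prob * sum(1.0 - max(post(*other))
                                for other in candidates if other != rp)
        return total
    return min(candidates, key=expected_uncertainty)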

Some work has been done using this strategy for ranking applications. Arens [3] used a ranking

SVM and addressed the problem of ranking relevant papers for users. The queries consist of asking a

user whether the queried document is “definitely relevant”, “possibly relevant”, or “not relevant”. They

experimented with two querying strategies. The first queries the highest (predicted) ranked document

that is unlabelled. The second strategy queries the most uncertain documents, defined as those that lie in the middle of the ranking (in this view, this approach

is similar to uncertainty sampling). The authors argue that the learner will rank the worst and best

documents with confidence and that therefore the learner is more uncertain about documents in the

middle of the ranking. Their results indicate that the (first) strategy of querying the highest-ranked

documents outperforms the second strategy (and random querying).

EVOI

The decision-theoretically optimal way of evaluating queries is to use expected value of information

(EVOI) [50]. At the heart of EVOI is a utility function, a problem-dependent objective which assigns a

number, a utility, to possible outcomes (for example to recommendations). The optimal query according

to EVOI is the one which maximizes expected utility. The expectation is taken with respect to the

distribution over query responses (for example, possible score values). In practice, it is common to use

the learning model’s distribution over possible responses (for example, the CF model).

We can better understand EVOI by looking at an example. Boutilier et al. [18] studied a setting where

the goal is to recommend one item at a time to a user. The utility of making such a recommendation

is defined to be the rating value of the recommended item (or its expectation according to the model

of unobserved ratings). To select a query, one evaluates each possible query and selects the one with

maximum myopic EVOI, that is, the query that yields the maximum difference in (expected) value over

all items:

EVOI(s^u_{rp}, \theta^{S^o}) = \sum_{s} \Pr(s^u_{rp} = s \mid S^o, \theta^{S^o})\, V(S^o \cup \{s_{rp} = s\}, \theta^{S^o \cup s_{rp}}) - V(S^o, \theta^{S^o}).

The value of the belief state associated with (So, θ) is defined as:

V(S^o, \theta^{S^o}) = \max_{(r,p) \in S^u} \sum_{s} \Pr(s^u_{rp} = s \mid S^o, \theta^{S^o})\, s,   (2.18)

and similarly,

V(S^o \cup \{s_{r'p'} = s'\}, \theta^{S^o \cup s_{r'p'}}) = \max_{(r,p) \in S^u \setminus \{(r',p')\}} \sum_{s} \Pr(s^u_{rp} = s \mid S^o \cup \{s_{r'p'} = s'\}, \theta^{S^o \cup s_{r'p'}})\, s,   (2.19)
where \Pr(\cdot \mid S^o, \theta^{S^o}) is the model's posterior over scores for the user-item pair (r, p) when S^o is observed. It is

important to note that EVOI can be used with any model that can infer a distribution over unobserved

ratings. In Boutilier et al. [18], the MCVQ model is used (see Section 2.2.1).
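For concreteness, the following sketch (ours) computes myopic EVOI as in the equations above for a discrete rating scale; posterior(r, p, extra) is a hypothetical helper that is assumed to return the model's distribution over rating values for the pair (r, p) given the observed scores plus the hypothetical extra response (with the model notionally re-trained inside that call).

def belief_value(queries, rating_values, posterior, extra=None):
    """Equations 2.18/2.19: the value of a belief state is the maximum expected rating."""
    def expected_rating(rp):
        return sum(prob * s for prob, s in zip(posterior(*rp, extra), rating_values))
    return max(expected_rating(rp) for rp in queries)

def myopic_evoi(rp, queries, rating_values, posterior):
    """Expected gain in belief value from asking query rp (the EVOI expression above)."""
    remaining = [q for q in queries if q != rp]
    base = belief_value(queries, rating_values, posterior)
    value = 0.0
    for prob, s in zip(posterior(*rp, None), rating_values):
        value += prob * belief_value(remaining, rating_values, posterior, extra=(rp, s))
    return value - base

# the next query is then: max(queries, key=lambda q: myopic_evoi(q, queries, rating_values, posterior))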

Calculating myopic EVOI involves two sources of costly computations. First, given a user, the

computation of the posteriors over every possible rating value must be performed for all queries (that

is, the set of unobserved items). Equations 2.19 and 2.18 say that the value of a belief is equal to the


maximum expected unobserved rating given that belief. Boutilier et al. [18] note that given the current

item with the highest (mean) score, p∗, and the previous expected scores over all other items (Equation 2.18),

one can calculate which subset of items might have their mean score increased enough to become the

item with maximal value according to Equation 2.19. Accordingly, they bound the impact that a query,

about a particular item, can have on the mean predicted score of all other items. This restricts the

number of item posteriors which must be re-computed at each step. Furthermore, these bounds are

calculated in a user-independent way and off-line. This same procedure can be applied to any learning

model, although the exact form of the bound is specific to MCVQ.

Second, the potentially large number of possible queries is also computationally problematic. Boutilier

et al. [18] suggest the use of prototype queries: a small set of queries such that any query is within some ε, in terms of EVOI value, of a prototype query. Experimentally, the authors showed that even when

using less than 40% of all queries, the system outperforms passive learning and is close to the optimal

(myopic-)EVOI performance.

Calculations of non-myopic EVOI, which involve computing the optimal querying sequence, are

even more costly because the effect of each query with respect to all possible future queries must be

evaluated.

Other work of note has proposed a similar framework for a memory-based collaborative filtering

model [132]. Notably, they propose that the active learning procedure only query items that the user is

likely to have already consumed (for example, movies that the user is likely to have already watched)

based on what similar users have consumed. The rationale is that such items will be easier to score.

This method has the additional benefit of pruning the item search space.

2.3.5 Batch Queries

Until now, we have assumed that we are picking queries one at a time in a greedy manner, considering

only the immediate myopic effect of the query on the model. This might not be optimal because it is

possible that a query’s value will reveal itself only after several more rounds of elicitation.

A related idea is that of batch queries. Instead of selecting and obtaining results of one query at

a time, we might consider finding optimal sets of queries. Batch querying may be necessary when a

system simply does not have the computational resources to find optimal queries on-line and therefore

a set of queries must be calculated off-line [102]. Alternatively, this approach can be motivated in

recommendation systems as a more natural mode of interaction for users. For example, in a domain where rating a product involves performing an action off-line, it may be easier for users to rate

multiple products at a time. Another motivation for batch active learning is when parallel labellers are

available [44].

A simple strategy for selecting a set of N queries would simply be to pick the top-N queries greedily

according to one of the active-learning strategies already discussed. Such a strategy will typically not

be optimal because the selection process will not consider information gained by previous queries in the

batch. Using EVOI, an optimal strategy for batch querying would be to consider the value of all possible

sets of queries (of a given length). The combinatorial nature of such calculations makes this approach

impractical for most real-world problems. Several authors have looked at constructing optimal sets of queries according to less general criteria; for example, diversity has been proposed as a reasonable criterion for batch-query selection [21, 131]. Recently, Guo [43] proposed the selection of the

instances which maximize the mutual information between labelled and unlabelled sets. The intuition


underlying this is that, for a learning method to generalize, the labelled set should be representative of

the unlabelled set.

We are unaware of previously published work in CF that has looked at asking batches of queries. We

develop a greedy approach to this problem in Chapter 6.
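The following sketch (ours) illustrates one greedy way of assembling a batch; the function score stands for any of the single-query criteria of the previous sections, and the diversity term is only an illustrative way of discouraging near-duplicate queries, not the method of the cited papers nor the approach of Chapter 6.

def greedy_batch(candidates, score, diversity, batch_size, trade_off=0.5):
    """Assemble a batch by repeatedly adding the candidate with the best combined value."""
    batch, remaining = [], list(candidates)
    while remaining and len(batch) < batch_size:
        def value(q):
            # distance to the closest query already in the batch (1.0 for an empty batch)
            div = min((diversity(q, b) for b in batch), default=1.0)
            return score(q) + trade_off * div
        best = max(remaining, key=value)
        batch.append(best)
        remaining.remove(best)
    return batch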

2.3.6 Stopping Criteria

A critical aspect of any active querying method is the criterion used to stop eliciting new preferences from

users. This is a trade-off between the cost of reducing the model’s error by acquiring extra information

and the cost incurred by making recommendations given the current model. For example, imagine that

each time a user is queried, there is a probability that he will become annoyed and leave the system.

In addition, if recommendations are not satisfactory, the same user may also leave the system. This

is a good example of the trade-off between further elicitation versus exploitation of the user’s current

ratings. Adopting a decision-theoretic approach, querying should stop once the utility of all queries is

negative. This assumes a utility function which accounts for all user benefits and costs; such a function

may be difficult to model accurately. In practice, Bloodgood and Vijay-Shanker [17] proposed that if

one could define a separate validation set, active learning could simply be stopped once the error on the

validation set stabilized. However, as previously noted, setting aside a labelled set is impractical in the

typical active-learning setting where few labelled instances exist. Several authors [17, 81] have agreed

that one must determine whether or not the model has stabilized, meaning that its predictions are not

likely to change given more data. Having said this, we are not aware of a formal, yet practical, way

to decide on a stopping criterion that is applicable to a wide array of models and data sets and is not

overly conservative [17]. In fact, in his survey, Settles [102] implicitly defended the decision-theoretic

framework: “the real stopping criterion [. . . ] is based on economic or other external factors, which likely

come well before an intrinsic learner-decided threshold.” In other words, the cost of reducing the model’s

error dominates the cost of the model’s error before the model has reached its best possible performance.

2.4 Matching

It is often the case that designers and users of a recommendation system have requirements, in addition to

the user’s intrinsic item preferences, which must be considered before making recommendations. In this

work, we will be specifically interested in recommendations which require an assignment, or a matching,

between users and items. An example of a matching recommendation, which we will discuss at length, is

matching reviewers to conference submissions. We will refer to recommendation systems which provide

this type of matching as match-constrained recommendations. Match-constrained recommendations are

also prevalent in other domains involving the selection of experts, such as grant reviewing, assignment

marking, and hospital-to-resident matching [94], as well as in completely different areas such as on-line

dating [46], where users must be matched to other users according to their romantic interest, or house

matching [52], or even in recommending taxis to customers.

The field of matching has a long and distinguished history dating back to the classical work on the

stable marriage problem [36]. The importance of this work was recognized when one of its authors

received the 2012 Bank of Sweden prize in economic science (often referred to as the Nobel Prize in Economics). The stable marriage problem consists

of finding a stable match between single women and single men. Stability implies that in the resulting

match, there is no woman-man pair in which both prefer being matched to one another over their partners in the

matching. For evident reasons, stability is a property often sought by practitioners. Gale and Shapley

[36] proposed a simple iterative algorithm which is guaranteed to output a stable matching. Several other

domains have caught the interest of economics and theoretical computer science researchers, among them

college admission matching [36] and resident matching (of residency candidates to hospitals, also similar

to the roommate-matching problem) [94], which generalize the stable marriage problem to many-to-one

matches. In many-to-one matches, members of one set of entities need to be matched not to one, but to multiple entities of the other set (a form of matching polygamy, in a sense). Such problems can be solved

with a generalization of the stable marriage algorithm. Further constraints on these problems, such as

couples wanting to be in the same hospital, have also been considered, but are typically harder to solve

(“couples matching” is NP-complete) [92].
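For concreteness, here is a compact Python rendering of the Gale-Shapley deferred-acceptance algorithm for the one-to-one stable marriage problem (a textbook sketch of ours, not code from [36]); proposers propose in order of preference and acceptors provisionally keep the best proposal received so far.

def gale_shapley(proposer_prefs, acceptor_prefs):
    """Deferred acceptance: proposers propose in preference order and acceptors keep the
    best proposal seen so far. Returns a stable matching as {proposer: acceptor}."""
    rank = {a: {p: i for i, p in enumerate(prefs)} for a, prefs in acceptor_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}   # index of the next acceptor to propose to
    engaged_to = {}                                # acceptor -> current proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        a = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if a not in engaged_to:
            engaged_to[a] = p
        elif rank[a][p] < rank[a][engaged_to[a]]:  # acceptor a prefers the new proposer
            free.append(engaged_to[a])
            engaged_to[a] = p
        else:
            free.append(p)                         # proposal rejected; p remains free
    return {p: a for a, p in engaged_to.items()}

match = gale_shapley({"x": ["A", "B"], "y": ["A", "B"]},
                     {"A": ["y", "x"], "B": ["x", "y"]})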

An important dichotomy is that between two-sided and single-sided matching domains. In single-

sided matching, entities to be recommended do not express preferences; an example is matching in housing

markets [52]. Note that in single-sided domains, the notion of stability does not apply, although similar

properties are captured by the notion of the matching core.

A central aspect of this line of work has been the truthful elicitation of user preferences. Indeed, to ensure the integrity of a matching system, it is often of great importance that users

have incentives to be truthful. This research, under the name mechanism design, seeks protocols, in-

cluding matching protocols, which ensure that a user cannot obtain a preferred match by not reporting

his preferences truthfully [95].

In our work, we focus exclusively on one-sided matching problems. Furthermore, we do not con-

sider strategic issues, such as the ones studied by mechanism design. We initiate our discussion of

match-constrained recommendations by introducing a running example in Chapter 3. We further study

matching formulations and constraints in Chapter 5.


Chapter 3

Paper-to-Reviewer Matching

Before detailing the research contributions of this thesis, we will discuss a real-life example of a match-

constrained recommendation system. This example will be used as an illustration throughout our work.

Furthermore, this particular example has provided some of the initial motivation behind the work that

has led to this thesis.

We introduce the paper-to-reviewer matching problem. This problem must routinely be solved by

conference organizers to determine a conference program. The way typical conferences operate is that

once the paper-submission deadline has passed, conference organizers initiate the reviewing process for

the submissions. Concretely, organizers have to assign each submission to a set of reviewers. Reviewers

are typically chosen from a preselected pool. Once assigned their papers, reviewers will have a fixed

amount of time to provide paper reviews, which are informed opinions from domain experts describing

the merits of each submission. Because reviewers’ time is limited, each reviewer has only enough resources

to review at most a fixed number of papers.

The paper-to-reviewer-assignment process aims to find the most expert reviewers for each submission.

Obtaining high-quality reviews is of great importance to the quality and reputation of a conference and, in a certain sense, helps shape the direction of a field. This assignment process can be seen as a recommendation

system with reviewer expertise substituting for the (more typical) factor of user preference. Furthermore,

constraints on the number of submissions that each reviewer may process and on the number of reviewers

per submission give rise to a match-constrained recommendation problem.

We can frame the paper-to-reviewer matching problem in terms of our three-stage framework pre-

sented in Figure 1.1. First, expertise about submissions, with possibly other expertise-related informa-

tion, is elicited from reviewers. Second, a learning model can be used to predict the missing paper-

reviewer expertise data. Finally, given the specific constraints of each conference, the optimization

procedure consists of a matching procedure which assigns papers to reviewers according to their stated

and predicted expertise. Many research questions related to how learning models can be optimized for

this framework emerge when thinking of the paper-matching problem in this context. The rest of this

thesis, and specifically Chapter 5, discusses some of these questions.

We have built a software system called the Toronto Paper Matching System (TPMS), which provides

automated assistance to conference organizers in the process of assigning their submissions to their

reviewers. In this chapter, we discuss the intricacies of the practical paper-to-reviewer assignment

problem through a description of TPMS.



3.1 Paper Matching System

Assigning papers to reviewers is not an easy task. Conference organizers typically need to assign reviewers

within a couple of days of the conference submission deadline. Furthermore, conferences in many fields

now routinely receive more than one thousand papers, which have to be assigned to reviewers from a

pool which often consists of hundreds of reviewers. The assignment of each paper to a set of suitable

reviewers requires knowledge about both the topics studied in the paper and reviewers’ expertise. For

a typical conference, it will therefore be beyond the ability of a single person, for example, the program

chair, to assign all submissions to reviewers. Decentralized mechanisms are also problematic because

global constraints, such as reviewer load, absence of conflicts of interest, and the need for every paper

to be reviewed by a certain number of reviewers, must be satisfied. The main motivation for automating

the reviewer-assignment process is to reduce the time that would be required to assign submitted papers to reviewers manually.

A second motivation for an automated reviewer-assignment system concerns the ability to find suit-

able reviewers for papers, to expand the reviewer pool, and to overcome research cliques. Particularly in

rapidly expanding fields such as machine learning, it is of increasing importance to include new reviewers

in the review process. Automated systems offer the ability to learn about new reviewers as well as the

latest research topics.

In practice, conferences often adopt a hybrid approach in which a reviewer’s interest with respect to a

paper is first independently assessed, either by allowing reviewers to bid on submissions, or, for example,

by letting members of the senior program committee provide their assessments of reviewer expertise.

Using either of these assessments, the problem of assigning reviewers to submissions can then be framed

and solved as an optimization problem. Such a solution still has important limitations. Reviewer bidding

requires reviewers to assess their preferences over the list of all papers. Failing to do so, for example if

reviewers only examine papers that contain specific terms matching their interest, is likely to decrease

the quality of the final assignments. On the other hand, asking the senior program committee to select

reviewers still imposes a major time burden.

Faced with these limitations, when Richard Zemel was the co-program chair of NIPS 2010, he decided

to build a more automated way of assigning reviewers to submissions. The resulting system that we

have developed aims to provide a proper evaluation of reviewer expertise to yield good reviewer assign-

ments while minimizing the time burden on conference program committees (reviewers, area chairs, and

program chairs). Since then, the system has gained adoption for both machine learning and computer

vision conferences and has now been used (repeatedly) by NIPS, ICML, UAI, AISTATS, CVPR, ICCV,

ECCV, ECML/PKDD, ACML, and ICVGIP.

3.1.1 Overview of the System Framework

In this section, we first describe the functional architecture of the system, including how several confer-

ences have used it. We then briefly describe the system’s software architecture.

Our aim is to determine reviewers’ expertise. Specifically, we are interested in evaluating the expertise

of every reviewer with respect to each submission. Given these assessments, it is then straightforward

to compute optimal assignments (see Chapter 5 for a detailed discussion of matching procedures). We emphasize that it is reviewer expertise, rather than reviewer interest, that we aim to evaluate. This is in contrast with the more typical approaches that assess reviewer interest, for example through bidding.


Figure 3.1: A conference's typical workflow. Reviewer papers and submitted papers produce initial scores, which guide elicitation; initial and elicited scores yield final scores; scores are then used for matching (final assignments) or sorting (per-paper ranked lists).

The workflow of the system works in synergy with the conference submission procedures. Specifically,

for conference organizers, the busiest time is typically right after the paper submission deadline, because

at this time, the organizers are responsible for all submissions, and several different tasks, including the

assignment to reviewers, must be completed within tight time constraints. For TPMS to be maximally

helpful, assessments of reviewer expertise could be computed ahead of the submission deadline. With

this in mind, we note that an academic’s expertise is naturally reflected through his or her work and is

most easily assessed by examining his or her published papers. Hence, we use a set of published papers

for each reviewer participating in a conference. Throughout our work, we have used the raw text of these

papers. It stands to reason that other features of a paper could be modelled: for example, one could use

citation or co-authorship graphs built from each paper’s bibliography and co-authors respectively.

Reviewers’ published papers have proven to be very useful in assessing expertise. However, we have

found that we can further boost performance using another source of data: each reviewer’s self-assessed

expertise about the submissions. We will refer to such assessments as scores. We differentiate scores from

more traditional bids: scores represent expertise rather than interest. We use assessed scores to predict

missing scores and then use the full reviewer-paper score matrix to determine assignments. Hence, a

reviewer may be assigned to a paper for which he did not provide a score.

To summarize, although each conference has its own specific workflow, it usually involves the sequence

of steps shown in Figure 3.1. First, we collect reviewers’ previous publications (note that this can be

done before the conference’s paper submission deadline). Using these publications, we build reviewer

profiles which can be used to estimate each reviewer’s expertise. These initial scores can then be used to

produce paper-reviewer assignments or to refine our assessment of expertise by guiding a score-elicitation

procedure (e.g., using active learning to query scores from reviewers). Elicited scores, in combination

with our initial unsupervised expertise assessments, are then used to predict the final scores. Final scores

can then be used in various ways by the conference organizers (for example, to create per-paper reviewer

rankings that will be vetted by the senior program committee, or directly in the matching procedure).

Below, we describe the high-level workflow of several conferences that have used TPMS.

NIPS 2010: For this conference, the focus was mostly on modelling the expertise of the area chairs,

the members of the senior program committee. We were able to evaluate the area chairs’ expertise

initially using their previously published papers. There were 32 papers per area chair on average. We

then used these initial scores to perform elicitation. The exact process by which we picked which reviewer-

paper pairs to elicit is described in the next section. We performed the elicitation in two rounds. In the

first round, we kept about two-thirds of the papers selected as those about which our system was most

confident (estimated as the inverse entropy of the distribution across area chairs per paper). Using these

elicited scores, we were then able to run a supervised learning model and proceed to elicit information


about the remaining one-third of the papers. We then re-trained a supervised learning method using

all elicited scores. The results were used to assign a set of papers to each area chair. For the reviewers,

we also calculated initial scores from their previously published papers and used those initial scores to

perform elicitation. Each reviewer was shown a list of approximately eight papers on which they could

express their expertise. The initial and elicited scores were then used to evaluate the suitabilities of

reviewers for papers. Each area chair was then provided a ranked list of (suggested) reviewers for each

of his assigned papers.

ICCV-2013: ICCV used author suggestions, by which each author could suggest up to five area

chairs that could review a paper, to restrict area-chair score elicitation. The elicited scores were used to

assign area chairs. Area chairs then suggested reviewers for each of their papers. TPMS initial scores,

calculated from reviewers’ previously published papers, were used to present a ranked list of candidate

reviewers to each area chair.

ICML 2012: Both area chairs and reviewers could assess their expertise for all papers. To help in

this task, TPMS initial scores, calculated again from reviewers' and area chairs' previous publications,

were used to generate a personalized ranked list of candidate papers which area chairs and reviewers

could use to quickly identify relevant papers. TPMS then used recorded scores for both reviewers and

area chairs in a supervised learning model. Predicted scores were then used to assign area chairs and

one reviewer per paper (area chairs were able to assign the other two reviewers).1

3.1.2 Active Expertise Elicitation

As mentioned in the previous section, initial scores can be used to guide active elicitation of reviewer

expertise. The direction that we have taken is to run the matching program using the initial scores. In

other words, we use the initial scores to find an (optimal) assignment of papers to reviewers. Then a

reviewer’s expertise for all papers assigned to him are queried. Intuitively, these queries are informative

because according to our current scores, reviewers are queried about papers that they would have to

review (a strong negative assessment of a paper is therefore very informative). By adapting the matching

constraints, conference organizers can tailor the number of scores elicited per user (in practice, it can be

useful to query reviewers about more papers than is warranted by the final assignment). We formally

explore these ideas in Chapter 5. Note that our elicited scores will necessarily be strongly biased by

the matching constraints used in the elicitation procedure. To relate to our discussion in Section 2.2.1:

scores are not missing at random. In practice, this does not appear to be a problem for this application

(that is, assigning papers to a small number of expert reviewers). There are also empirical procedures,

such as pooled relevance judgements, which combine the scores of different models while accounting for the variance between models, that have been used to reduce the elicitation bias [79]. It is possible that

such a method could be adapted to be effective with our matching procedure.
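To make the procedure concrete, the following is a minimal Python sketch of matching-guided elicitation. It is not the TPMS implementation (which solves an optimal matching, as detailed in Chapter 5); here a greedy assignment respecting assumed per-reviewer and per-paper load constraints stands in for the matching step, and the papers tentatively assigned to each reviewer become that reviewer's queries. All variable names and constraint values are illustrative.

# A minimal sketch (not TPMS code) of matching-guided elicitation:
# greedily assign papers to reviewers using initial scores under load
# constraints, then query each reviewer about the assigned papers.
import numpy as np

def greedy_assignment(scores, paper_load=3, reviewer_load=8):
    """scores: (R, P) matrix of initial reviewer-paper scores."""
    R, P = scores.shape
    reviewer_slots = np.full(R, reviewer_load)
    paper_slots = np.full(P, paper_load)
    assignment = []
    # Visit reviewer-paper pairs from highest to lowest initial score.
    for idx in np.argsort(scores, axis=None)[::-1]:
        r, p = np.unravel_index(idx, scores.shape)
        if reviewer_slots[r] > 0 and paper_slots[p] > 0:
            assignment.append((r, p))
            reviewer_slots[r] -= 1
            paper_slots[p] -= 1
    return assignment

def elicitation_queries(scores, **kwargs):
    # Each reviewer is queried about the papers tentatively assigned to them.
    queries = {}
    for r, p in greedy_assignment(scores, **kwargs):
        queries.setdefault(int(r), []).append(int(p))
    return queries

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    initial_scores = rng.random((5, 12))      # 5 reviewers, 12 papers
    print(elicitation_queries(initial_scores, paper_load=2, reviewer_load=5))

In practice, the reviewer load used for this elicitation matching can be set higher than the load used for the final assignment, so that more scores are collected than the final assignment strictly requires.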

3.1.3 Software Architecture

For the NIPS-10 conference, the system was initially made up of a set of MATLAB routines that would

operate on conference data. The data were exported (and re-imported) from the conference Web site

hosted on Microsoft’s Conference Management Toolkit (CMT).2 This solution had limitations because

1. The full ICML 2012 process has been detailed by the conference program chairs: http://hunch.net/?p=2407
2. http://cmt.research.microsoft.com/cmt/


it imposed a high cost on conference organizers that wanted to use it. Since then, and encouraged by the

ICML 2012 organizers, we have developed an on-line version of the system which interfaces with CMT

and can be used by conference organizers through CMT (see Figure 3.2).

The system has two primary software features. One is to act as an archive by storing reviewers’

previously published papers. We refer to these papers as a reviewer’s archive or library. To populate

their archives, reviewers can register with and log in to the system through a Web interface. Reviewers

can then provide URLs pointing to their publications. The system automatically crawls the URLs to

find reviewers’ publications in PDF format. There is also a functionality which enables reviewers to

upload papers from their local computers. Conference organizers can also populate reviewers’ archives

on their behalf. Another option enables our system to crawl a reviewer’s Google Scholar profile.3 The

ubiquity of the PDF format has made it the system’s accepted format. On the programming side, the

interface is entirely built using the Python-based Django web framework4 (except for the crawler, which

is written in PHP and relies heavily on the wget utility5).

The second main software feature is one that permits communication with Microsoft’s CMT. Its

main purpose is to enable our system to access some of the CMT data as well as to enable organizers to

call our system’s functions through CMT. The basic workflow proceeds as follows: organizers, through

CMT, send TPMS the conference submissions; then they can send us a score request which queries

TPMS for reviewer-paper scores for a specified set of reviewers and papers. This request contains the

paper and reviewer identification for all the scores that should be returned. In addition, the request can

contain elicited scores (bids in CMT terminology). After receiving these requests, our system processes

the data, which may include PDF submissions and reviewer publications, and computes scores according

to a particular model. TPMS scores can then be retrieved through CMT by the conference organizers.

Technically speaking, our system can be seen as a paper repository where submissions and meta-

data, both originating from CMT, can be deposited. Accordingly, the communication protocol used is

SWORD based on the Atom Publishing Protocol (APP) version 1.0.6 SWORD defines a format to be

used on top of HTTP. The exact messages enable CMT to: a) deposit documents (submissions) to the

system; b) send information to the system about reviewers, such as their names and publication URLs;

c) send reviewers’ CMT bids to the system. On our side, the SWORD API was developed in Python and

is based on a simple SWORD server implementation.7 The SWORD API interfaces with a computations

module written in a mixture of Python and MATLAB (we also use Vowpal Wabbit8 for training some

of the learning models).

Note that although we interface with CMT, TPMS runs completely independently (and communi-

cates with CMT through the network); therefore, other conference management frameworks could easily

interact with TPMS. Furthermore, CMT has its own matching system which can be used to determine

reviewer assignments from scores. CMT’s matching program can be used to combine several pieces of

information such as TPMS scores, reviewer suggestions, and subject-area scores. Hence, we typically return scores to CMT, and conference organizers then run CMT's matching system to obtain a set of (final) assignments.

3. http://scholar.google.com/
4. https://www.djangoproject.com/
5. http://www.gnu.org/software/wget/
6. The same protocol with similar messages is used by http://arxiv.org to enable users to make programmatic submissions of papers.
7. https://github.com/swordapp/Simple-Sword-Server
8. http://hunch.net/~vw/


Figure 3.2: High-level software architecture of the system. TPMS (comprising the paper-collection web interface and the score models) communicates with CMT; reviewers and conference organizers interact with both systems.

3.2 Learning and Testing the Model

As mentioned in Section 3.1.1, at different stages in a conference workflow, we may have access to

different types of data. We use models tailored to the specifics of each situation. We first describe

models that can be used to evaluate reviewer expertise using the reviewers’ archive and the submitted

papers. Then we describe supervised models that have access to ground-truth expertise scores.

We remind the reader about some of our notation, which now takes on specific meaning for the

problem of paper-to-reviewer matching. An individual submission is denoted as p while P is the set

of all submitted papers. Similarly, single reviewers are denoted as r and the set of all reviewers is

denoted as R. We introduce a reviewer’s archive (a reviewer’s previously published papers) encoded

using a bag-of-words representation and denoted by the vector w^a_r. Note that we will assume that a reviewer's papers are concatenated into a single document to create that reviewer's archive. Similarly, a submission's content is denoted as w^d_p. Finally, f(·) and g(·) represent functions which map papers, submitted or archived respectively, to a set of features. Features can be word counts associated with a bag-of-words representation, in which case f(w^d_p) and g(w^a_r) are the identity function, or possibly higher-level features such as those learned from a topic model [16].

3.2.1 Initial Score Models

Two different models were used to predict initial scores.

Language Model (LM): This model predicts a reviewer’s score as the dot product between a reviewer’s

archive representation and a submission:

s_{rp} = g(w^a_r)^T f(w^d_p) \qquad (3.1)

There are various possible incarnations of this model. The one that we have routinely used is due

to Mimno and McCallum [79] and consists of using the word-count representation of the submissions

(that is, each submission is encoded as a vector in which the value of an entry corresponds to the number

of times that the word associated with that entry appears in the submission). For the archive, we use

the normalized word count for each word appearing in the reviewer’s published work. By assuming

conditional independence between words given a reviewer and working in the log domain, the above is

equivalent to:

s_{rp} = \sum_{i \in p} \log f(w^a_{ri}) \qquad (3.2)

In practice, we Dirichlet smooth [133] the reviewer’s normalized word counts to better deal with rare


words:

f(w^a_{ri}) = \left(\frac{N_{w^a_r}}{N_{w^a_r} + \mu}\right)\frac{w^a_{ri}}{N_{w^a_r}} + \left(\frac{\mu}{N_{w^a_r} + \mu}\right)\frac{N_{w_i}}{N} \qquad (3.3)

where N_{w^a_r} is the total number of words in reviewer r's archive, N is the total number of words in the corpus, w^a_{ri} and N_{w_i} are the number of occurrences of word i in r's archive and in the corpus respectively, and \mu is a smoothing parameter.
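As a concrete illustration, here is a minimal numpy sketch (with assumed variable names, not the TPMS code) of the word-LM score of Equations 3.2 and 3.3: the log-probability of a submission's words under a reviewer's Dirichlet-smoothed unigram language model.

# A minimal sketch of the word-LM score (Equations 3.2-3.3); variable names are assumptions.
import numpy as np

def word_lm_score(submission_counts, archive_counts, corpus_counts, mu=1000.0):
    """All arguments are length-V word-count vectors; mu is the smoothing parameter."""
    N_archive = archive_counts.sum()          # words in the reviewer's archive
    N_corpus = corpus_counts.sum()            # words in the whole corpus
    lam = N_archive / (N_archive + mu)
    # Equation 3.3: mixture of the archive word distribution and the corpus-wide distribution.
    p_word = lam * archive_counts / max(N_archive, 1) \
             + (1.0 - lam) * corpus_counts / N_corpus
    # Equation 3.2: sum of log-probabilities over the submission's word occurrences.
    return float(np.dot(submission_counts, np.log(p_word + 1e-12)))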

Because papers have different lengths, scores will be uncalibrated. This means that shorter papers

will receive higher scores than longer papers. Depending on how scores are used, this may not be

problematic. For example, this will not matter if one wishes to obtain ranked lists of reviewers for each

paper. We have obtained good matching results with such a model. However, normalizing each score

by the length of its paper has also turned out to be an acceptable solution. Finally, in the language

model described above, the dot product of the archive and submission representation is used to measure

similarity; other metrics could also be used, such as KL-divergence.

Latent Dirichlet Allocation (LDA) [16]: LDA is an unsupervised probabilistic method used to model documents. Specifically, we can use the topic proportions found by LDA to represent documents. Equation 3.1 can then be used naturally to calculate expertise scores from the LDA representations of archives and submissions. When using the LM with LDA representations we refer to it as topic-LM, in contrast with word-LM when using the bag-of-words representation.
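The following sketch illustrates the topic-LM idea using scikit-learn's LDA implementation rather than the toolkit used in the thesis; it also simplifies the archive handling by treating each reviewer's archive as a single concatenated document instead of averaging per-paper topic proportions. Matrix names and the topic count are assumptions.

# A minimal sketch of topic-LM (not the thesis code): represent archives and
# submissions by LDA topic proportions and score with a dot product (Eq. 3.1).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def topic_lm_scores(archive_term_matrix, submission_term_matrix, n_topics=30, seed=0):
    """Both inputs are dense (num_docs, vocab_size) count matrices over a shared vocabulary."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    lda.fit(np.vstack([archive_term_matrix, submission_term_matrix]))
    archive_topics = lda.transform(archive_term_matrix)        # reviewer topic proportions
    submission_topics = lda.transform(submission_term_matrix)  # submission topic proportions
    return archive_topics @ submission_topics.T                # (reviewers, papers) score matrix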

3.2.2 Supervised Score-Prediction Models

Once elicited scores are available, supervised regression methods can be used. The problem can be seen

as one of collaborative filtering. Furthermore, because both reviewer (user) and paper (item) content

exist, the problem can also be modelled using a hybrid approach.

Probabilistic Matrix Factorization: PMF was discussed in Section 2.2.1. For easier comparison with

other models in this section, we can express the score generative model as s_{rp} = \gamma_r^T \omega_p (where \gamma_r and \omega_p correspond to u_r and v_p, respectively, in our earlier notation).

Because PMF does not use any information about either papers or reviewers, its performance suffers

in the cold-start regime. Nonetheless, it remains an interesting baseline for comparison.
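For reference, below is a minimal numpy sketch of PMF trained by stochastic gradient descent on observed (reviewer, paper, score) triples, i.e., s_rp ≈ γ_r^T ω_p. Hyper-parameter values are illustrative and this is not the implementation used in the thesis.

# A minimal PMF-by-SGD sketch (illustrative hyper-parameters, not the thesis implementation).
import numpy as np

def fit_pmf(triples, num_reviewers, num_papers, dim=10,
            lr=0.01, reg=0.05, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    gamma = 0.1 * rng.standard_normal((num_reviewers, dim))  # reviewer factors
    omega = 0.1 * rng.standard_normal((num_papers, dim))     # paper factors
    for _ in range(epochs):
        for r, p, s in triples:
            g_r = gamma[r].copy()                             # keep old value for both updates
            err = s - g_r @ omega[p]
            gamma[r] += lr * (err * omega[p] - reg * g_r)
            omega[p] += lr * (err * g_r - reg * omega[p])
    return gamma, omega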

Linear Regression (LR): On the other end of the spectrum are models which use document content,

but do not share any parameters across users and items. The simplest regression model learns a separate

model for each reviewer using submissions as features:

s_{rp} = \gamma_r^T f(w^d_p) \qquad (3.4)

where \gamma_r denotes user-specific parameters. This method has been shown to work well in practice,

particularly if many scores have been elicited from each reviewer.
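Below is a minimal sketch of Equation 3.4 using one independent ridge regression per reviewer; the data layout is an assumption and the thesis does not prescribe a particular solver.

# A minimal per-reviewer regression sketch for Equation 3.4 (assumed data layout).
import numpy as np
from sklearn.linear_model import Ridge

def fit_per_reviewer_regressors(scores_by_reviewer, submission_features, alpha=1.0):
    """scores_by_reviewer: dict reviewer id -> list of (paper_id, score);
    submission_features: (num_papers, num_features) matrix of f(w^d_p)."""
    models = {}
    for r, pairs in scores_by_reviewer.items():
        paper_ids = [p for p, _ in pairs]
        targets = np.array([s for _, s in pairs])
        models[r] = Ridge(alpha=alpha).fit(submission_features[paper_ids], targets)
    return models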

A hybrid model can be of particular value in this domain. One issue with the conference domain

is that it is typical for some reviewers to have very few or even no observed scores. It may then be

beneficial to enable parameter sharing between users and papers. Furthermore, re-using information

from each reviewer’s archive may also be beneficial. One method of sharing parameters in a regression

model was proposed by John Langford, co-program chair of the ICML 2012 conference:

s_{rp} = b + b_r + (\gamma + \gamma_r)^T f(w^d_p) + (\omega + \omega_r)^T g(w^a_r) \qquad (3.5)


where b is a global bias and γ and ω are parameters shared across reviewers and papers, which encode

weights over features of submissions and archives respectively. The ω shared parameter enables the

model to make predictions even for users that have no observed scores. In other words, the shared ω

enables the model to calibrate a reviewer’s archive using information from the other reviewers’ scores.

b_r, \gamma_r and \omega_r are parameters specific to each reviewer. For ICML-12, f(w^d_p) was paper p's word counts, while g(w^a_r) was the normalized word count of reviewer r's archive (similarly to LM). In practice, for each reviewer-paper instance g(w^a_r) is multiplied by the paper's word occurrence vector.

For that conference and afterwards, that model was trained in an on-line fashion using Vowpal Wabbit

with a squared loss and L2 regularization. In practice, because certain reviewers have few or no observed

scores, one has to be careful to weight the regularizers of the various parameters properly so that the

shared parameters are learned at the expense of the individual parameters. To determine a good setting

of the hyper-parameters we typically use a validation set (and search in hyper-parameter space using, for

example, grid search). To ensure the good performance of the model across all reviewers, each reviewer

contributes equally to the validation set.
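The following numpy sketch illustrates the shared-plus-individual structure of Equation 3.5 trained by SGD with L2 regularization. It is not the Vowpal Wabbit setup used for ICML-12, and it omits the word-level interaction between g(w^a_r) and the submission's word-occurrence vector mentioned above; the regularization strengths are illustrative, with per-reviewer parameters penalized more strongly than shared ones, as discussed.

# A minimal sketch of the hybrid regression of Equation 3.5 (not the ICML-12 VW setup).
import numpy as np

def fit_hybrid(triples, F_sub, G_arch, lr=0.01, reg_shared=1e-4, reg_ind=1e-2, epochs=20):
    """triples: iterable of (r, p, score); F_sub[p] = f(w^d_p); G_arch[r] = g(w^a_r)."""
    R, D = G_arch.shape[0], F_sub.shape[1]
    A = G_arch.shape[1]
    b = 0.0
    b_r = np.zeros(R)
    gamma, gamma_r = np.zeros(D), np.zeros((R, D))   # shared / per-reviewer submission weights
    omega, omega_r = np.zeros(A), np.zeros((R, A))   # shared / per-reviewer archive weights
    for _ in range(epochs):
        for r, p, s in triples:
            pred = (b + b_r[r] + (gamma + gamma_r[r]) @ F_sub[p]
                    + (omega + omega_r[r]) @ G_arch[r])
            err = s - pred
            b += lr * err
            b_r[r] += lr * (err - reg_ind * b_r[r])
            gamma += lr * (err * F_sub[p] - reg_shared * gamma)
            gamma_r[r] += lr * (err * F_sub[p] - reg_ind * gamma_r[r])
            omega += lr * (err * G_arch[r] - reg_shared * omega)
            omega_r[r] += lr * (err * G_arch[r] - reg_ind * omega_r[r])
    return b, b_r, gamma, gamma_r, omega, omega_r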

3.2.3 Evaluation

A machine-learning model is typically assessed by evaluating its performance on the task at hand. For

example, we can evaluate how well a model performs in the score-prediction task by comparing the

predicted scores to the ground-truth scores. The task of most interest here is that of finding good paper-

reviewer assignments. Ideally, we would be able to compare the quality of our assignments to some

gold-standard assignment. Such an assignment could then be used to test both different score-prediction

models and different matching formulations. Unfortunately, ground-truth assignments are unavailable.

Moreover, even humans would have difficulty finding optimal assignments, and hence we cannot count

on their future availability.

We must then explore different metrics to test the performance of the overall system, including

methods for comparing score-prediction models as well as matching performance. In practice, in-vivo

experiments can provide a good way to measure the quality of TPMS scores and ultimately the usefulness

of the system. ICML 2012’s program chairs experimented with different initial scoring methods using

a special interface which showed one of three groups of ranked candidate papers to reviewers.9 The

experiment had some biases: the papers of the three groups were ranked using TPMS scores. The poll,

which asked after the fact whether reviewers had found the ranked-list interface useful, showed that

reviewers who had used the list based on word-LM were slightly more likely to have preferred the list to the regular CMT interface (the differences were likely not statistically significant).

We will defer comprehensive comparisons of the different score-prediction methods to Chapter 4,

where we will introduce a novel model and make comparisons with it. Likewise, in Chapter 5, we

will introduce different matching objectives and constraints and discuss relevant matching experiments.

Below, we offer some qualitative comparisons of initial score models.

Datasets

Through its operation, the system has gathered interesting datasets containing reviewers' preferences over submissions as well as reviewer and submitted papers. We have assembled a few datasets from these collected data.

9. http://hunch.net/?p=2407


Figure 3.3: Histograms of score values for (a) NIPS-10 and (b) ICML-12.

Below, we introduce certain datasets that are used throughout this thesis for empirical

evaluations. The datasets bear the name of the conference from which their data were assembled.

NIPS-10: This dataset consists of 1251 papers submitted to the NIPS 2010 conference. The set of

reviewers consists of the conference’s 48 area chairs. The submission and archive vocabulary consists

of 22,535 words. User-item preferences were integer scores in the 0 to 3 range. The histogram of score

values is given in Figure 3.3. Suitabilities on a subset of papers were elicited from reviewers using a rather

involved two-stage process. This process utilized the language model (LM) to estimate the suitability

of each reviewer for each paper, and then queried each reviewer on the papers on which his estimated

suitability was maximal. The output of the first round was fine-tuned using a combination of a hybrid

discriminative/generative RBM [65] with replicated softmax input units [97] trained on the initial scores,

and LM, which then determined the second round of queries. In total, each reviewer provided scores on

an average of 143 queried papers (excluding one extreme outlier), and each paper received an average

of 3.3 suitability assessments (with a std. dev. of 1.3). The mean suitability score was 1.1376 (std. dev.

1.1).

With regard to our earlier discussion on missing at random (Section 2.2.1), we note that since the

querying process was biased towards asking about pairs with high predicted suitability, the unobserved

scores are not missing at random, but rather tended toward pairs with low suitability. We do not

distinguish the data acquired in the two phases of elicitation; both took place within a short time frame,

so we assume suitabilities for any one reviewer are stable.

ICML-12: This dataset consists of 857 papers and 431 reviewers from the ICML 2012 conference.

The submission and archive vocabulary consists of 21,409 words. User-item preferences were integer

scores in the 0 to 3 range. The histogram of score values is given in Figure 3.3. The elicitation process

was less constrained than for NIPS-10. Specifically, reviewers could assess their expertise for any paper

although they were also shown a selection of suggested papers produced either by the LM or based on

subject-area similarity.

The original vocabulary of these datasets was very slightly pre-processed. Specifically, we removed

stop words from a predefined list. Moreover, we only kept words that appeared in at least five documents,

including two different submissions. We performed further experiments using word stems, which did not make a significant difference. We therefore hypothesize that words that are useful in determining a user's expertise are typically technical words that do not vary across different stems.


           NIPS-10   ICML-12
NDCG@5     0.926     0.867
NDCG@10    0.936     0.884

Table 3.1: Evaluating the similarity of the top-ranked reviewers for word-LM versus topic-LM on the NIPS-10 and ICML-12 datasets.

Initial score quality

We examined the quality of the initial scores, those estimated solely by comparing the archive and the

submissions, without access to elicited scores. We will compare the performance of a model which uses

the archive and submission representations in word space to one which uses these representations in topic

space. The method that operates in word space is the language model as described by Equation 3.2.

For purposes of comparison, we further normalized these scores using the length of each submission. We

refer to this method as word-LM. To learn topics, we used LDA to learn 30 topics using the content of

both the archives and submissions. For the archive, we inferred topic proportions for each of a reviewer's papers and then averaged them in topic space. This version of the language model is denoted as topic-LM.

We first compared the two methods with each other by comparing the top-ranked reviewers for each

paper according to word-LM and topic-LM. Table 3.1 reports the average similarity of the top-5 and top-10 reviewers using NDCG, where word-LM's rankings are evaluated as predictions of topic-LM's rankings. The high

NDCG values indicate that both methods generally agree about the most expert reviewers.
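The following sketch shows one way to compute the quantity reported in Table 3.1; the score matrices and the NDCG variant (raw scores as gains) are assumptions, not the original evaluation code. Topic-LM's scores serve as the reference relevance values and word-LM's ranking is evaluated against them.

# A minimal NDCG@k sketch for comparing two rankings of reviewers per paper.
import numpy as np

def ndcg_at_k(reference_scores, predicted_scores, k=5):
    """Both arguments are length-R score vectors for one paper."""
    order = np.argsort(predicted_scores)[::-1][:k]      # reviewers ranked by the predicted scores
    ideal = np.sort(reference_scores)[::-1][:k]         # best possible ordering of reference scores
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum(reference_scores[order] * discounts)
    idcg = np.sum(ideal * discounts)
    return dcg / idcg if idcg > 0 else 0.0

def mean_ndcg(reference_matrix, predicted_matrix, k=5):
    """Matrices are (reviewers, papers); average NDCG@k over papers."""
    return np.mean([ndcg_at_k(reference_matrix[:, p], predicted_matrix[:, p], k)
                    for p in range(reference_matrix.shape[1])])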

We can obtain a better appreciation of the scores of each model by plotting the model’s (top) scores

for each paper. Each data point on Figure 3.4 shows the score of one of the top-40 reviewers, on the

y-axis, for a particular paper, on the x-axis. For visualization purposes, points corresponding to the same

reviewer ranking across papers are connected.10 One can see that word-LM scores all fall within a small

range. In other words, given a paper, word-LM evaluates all reviewers to have very similar expertise.

We have usually not found this to be a problem when using word-LM scores to assign reviewers to

papers. In one exceptional case, we did get a poor assignment when the program committee of

a conference was very small (9 members) and the breadth of the submissions was large. From Figure 3.4,

we see that topic-LM does not have this problem as it better separates the reviewers, and often finds a

few top reviewers to have a certain (expertise) margin over the others, which seems sensible. One possible

explanation for these discrepancies is that working in topic space removes some of the noise present in

word space. It has been suggested that using a topic model “fuse[s] weak cues from each individual

document word into strong cues for the document as a whole”.11 Specifically, elements like the specific

words, and to a certain extent the writing style, used by individual authors may be abstracted away by

moving to topic space. Topic-LM may therefore provide a better evaluation of reviewer expertise than

can word-LM.

We would also like to compare word-LM and topic-LM on matching results. However, such results

would be biased toward word-LM because, in our datasets, this method was used to produce initial scores

which guided the elicitation of scores from reviewers (we have validated experimentally that word-

LM slightly outperforms topic-LM using this experimental procedure). Using the same experimental

procedure, word-LM also outperforms matching based on CMT subject areas.

10. This methodology was suggested and first experimented with by Bill Triggs, the program chair for the ICGVIP 2012 conference.
11. From personal communication with Bill Triggs.


Figure 3.4: Score of the top-40 reviewers (y-axis) for 20 randomly selected submitted papers (x-axis). Panels: (a) NIPS-10 using topic-LM; (b) ICML-12 using topic-LM; (c) NIPS-10 using word-LM; (d) ICML-12 using word-LM.

3.3 Related Work

We are aware that other conferences, such as SIGGRAPH, KDD, and EMNLP, have previously used

a certain level of automation for the task of assigning papers to reviewers. The only conference man-

agement system that we are aware of that has explored machine learning techniques for paper-reviewer

assignments is MyReview,12 and some of their efforts are detailed in [89].

The reviewer-to-paper matching problem has also received attention in the academic literature. Conry

et al. [29] propose combining several sources of information including reviewer co-authorship informa-

tion, reviewer and submission self-selected subject areas as well as submission representations using the

submission’s abstract. Their proposed collaborative filtering model linearly combines regressors on these

different sources of information in a similar way as we do in Equation 3.5. Using elicited and predicted

scores, matching is then performed, as a second step, using a standard matching formulation [122]. We

note that in this work no differentiation is made between reviewer expertise and reviewer interest.

In TPMS we build representations of submissions using the full text of submissions. Other authors

12. http://myreview.lri.fr/


have exploited the information contained in the submissions’ references instead. Specifically, Rodriguez

and Bollen [91] use the identity of the authors of the referenced papers and a co-authorship graph, built

offline, to identify potential reviewers. Although building a useful co-authorship graph and identifying

the authors of referenced papers in submissions presents a technical challenge, this method has the

advantage of not requiring any input from reviewers.

Benferhat and Lang [8], Goldsmith and Sloan [42], and Garg et al. [37] assume that reviewer-to-paper suitability scores are available and focus on the matching problem and on various desirable constraints.

We provide a more complete review of these studies in Section 5.3.

3.3.1 Expertise Retrieval and Modelling

The reviewer-matching problem can be seen as an instantiation of the more general expertise retrieval

problem [6]. The key problem of expertise retrieval, a sub-field of information retrieval, is to find the

right expert given a query. A basic query is a set of terms indicating the topics for which one wishes to

find an expert. In our case, queries are the papers submitted to conferences, and expertise is modelled

using previously authored papers. Because reviewer expertise and queries live in the same space, we

never have to model expertise explicitly in terms of high-level, or human-designed, topics. Overall,

the field of expertise retrieval has studied models similar to ours, including probabilistic and specifically

language models [6].

3.4 Other Possible Applications

An automated paper matching system has been particularly useful in a context, such as a conference,

where many items have to be assigned to many users under short time constraints. Nonetheless, the

system could be used in other similar contexts where an individual’s level of expertise can be learned

from that individual’s (textual) work. One example is the grant-reviewing matching process, where

grant applications must be matched to competent referees. Academic journals also often have to deal with the problem of finding qualified reviewers for their papers. As more evaluation processes migrate

to on-line frameworks, it is also likely that other problems fitting within this system’s capabilities will

emerge. Other applications of more general match-constrained recommendation systems are discussed

in Chapter 5.

3.5 Conclusion and Future Opportunities

There are a variety of system enhancements, functionality improvements, and research directions that we are currently developing, and others that we would like to explore in the near future. On the software side, we

intend to automate the system further to reduce the per-conference cost (both to us and to conference

organizers) of using the system. This implies providing automated debugging and explanatory tools for

conference organizers.

We have also identified a few more directions that will require our attention in the future:

1. Re-using reviewer scores from conference to conference: Currently reviewers’ archives are the only

piece of information that is re-used between conferences. It is possible that elicited scores could also

be used as part of a particular reviewer’s profile that can be shared across conferences.


2. Score elicitation before the submission deadline: Conferences often have to adhere to strict and

short deadlines when assigning papers to reviewers after the submission deadline. Hence, collecting

additional information about reviewers before the deadline may save additional time. One possibility would be to elicit scores from reviewers about a set of representative papers from the

conference (for example, a set of papers published in the conference’s previous edition).

3. Releasing the data: The data that we gathered through TPMS have opened various research

opportunities. We are hoping that some of these data can be properly anonymized and released

for use by other researchers.

4. Better integration with conference management software: Running outside of CMT (or other con-

ference organization packages) has provided advantages, but the relatively weak coupling between

the two systems also has disadvantages for conference organizers. As more conferences use our

system, we will be in a better position to develop further links between TPMS and CMT.

5. Leveraging other sources of side information: Other researchers have been able to leverage other

types of information apart from a bag-of-words representation of the submissions' main text (see

Section 3.3). The model presented in Chapter 4 may be a gateway to modelling some of this other

information.

6. As mentioned in Section 3.3.1 many models have been proposed for expertise retrieval. It would

be worthwhile to compare the performance of our current models with the ones developed in that

community.

Finally, we are actively exploring ways to evaluate the accuracy, usefulness, and impact of TPMS more

effectively. We can currently evaluate how good our assignments are in terms of how qualified reviewers

are for the papers to which they are assigned (Chapter 5 details assignment procedures and evaluations).

Further, anecdotal evidence from program organizers, the number of conferences that have

expressed an interest in using the system, and the several experiments that we have run, all suggest that

our system provides value in proposing good reviewer assignments and in saving conference organizers

and senior program committee members time and cognitive effort. However, we do not have data to

evaluate the intrinsic impact that the system may have on conferences and, more generally, on the field

as a whole. We can divide the search for answers to such questions into two stages. First, how much

does reviewer expertise affect reviewing quality? For example, is it the case that finding good expert

reviewers leads to better reviews? Further, are there synergies between reviewers (for example, what

is the benefit of having both a senior researcher and a graduate student reviewer assigned to the same

paper)? Second, how much impact do reviews have on the quality of a conference and ultimately of

the field? For example, do good reviews lead to: a) more accurate accept and reject decisions; and

b) higher-quality papers? One way to answer these questions would be to run controlled experiments

with conference assignments, for example, by using two different assignment procedures. Subsequently

we could evaluate the performance of the procedures by surveying reviewers, authors, and conference

attendees. While we may obtain conclusive answers about some of these questions, evaluating the quality

of the field as a result of reviews seems like a major challenge. We have only begun exploring such issues

but already, in practice, we have found it a challenge to incentivize conference organizers to carry out

evaluation procedures which could help in assessing these impacts.


Chapter 4

Collaborative Filtering with Textual Side-Information

Good user-item preference predictions are essential to the goal of providing users with good recommen-

dations. The task of preference prediction, which is the focus of this chapter, is therefore at the heart

of recommender systems. Preference prediction corresponds to the second stage of our recommendation

framework, depicted for the purposes of this chapter in Figure 4.1. We discuss this stage first because

of the importance of preference prediction in the framework as well as the fact that the other stages of

the framework will assume access to a preference prediction method.

There are various flavours of preference prediction problems, which are differentiated by which type of

input information is available to the preference prediction model. The simplest of situations is when only

a subset of user-item preferences are observed. The system must then learn, using a collaborative filtering

model, similarities in-between users that will enable it to accurately predict unobserved preferences. We

are interested in studying situations where side information, features of users, items or both, along with

user-item preferences is available. Specifically, we focus on hybrid systems, systems that use collaborative

filtering models while also modelling the side information.

We first provide a high-level description of the current state-of-the-art in the field of hybrid recom-

mender systems. We then focus on the specific problem of leveraging user and item textual side infor-

mation for user-preference predictions. We propose the novel collaborative score topic model (CSTM)

for this problem and present supporting empirical evidence of its good performance across data regimes.

Before introducing our model we also provide a brief review of topic models and variational inference

methods used for learning the parameters of those topic models.

4.1 Side Information in Collaborative Filtering

We define side information as any information about users or items distinct from user-item preferences.

The immediate goal of modelling side information is to glean information that will be helpful in learning

better models of users and items and (consequently) provide more accurate preference predictions.

Overall, side information has been particularly useful to combat the cold-start problem which plagues

collaborative-filtering methods. Figure 4.2 sketches the major classes of models used in the literature to combine side information with collaborative filtering. There are two predominant classes.


Figure 4.1: Flow chart depicting the framework developed in this thesis, with a particular focus on the second stage of the framework, the preference prediction stage.

The first class, depicted in Figure 4.2(b), is to use the side information to build a prior over user representations, for example by enforcing that users (and items) with similar side information (f_u and f_v) have similar latent representations (u and v). Such models are therefore of particular use in the cold-start data regime. The second class, which has received a lot of attention, is to use the side information as covariates (or features) in a regression model (see Figure 4.2(c)). The parameters over such features are often shared across users and items. Finally, a third methodology is to model the side information together with the scores, as shown in Figure 4.2(d). The resulting user and item latent factors of this generative model should capture elements of both, which can be useful for preference prediction and for side-information prediction.

Side information has been used, with varying levels of success, in many different domains. To better

understand the domains in which side information has been useful, we discuss user side information and

item side information separately.

User side information has been shown to be useful in various forms. Examples of user side information include frequently available user demographics, such as user gender, age group, or a user's geographical location, which have often been used and have been shown to produce at least mild performance gains [1, 2, 26]. Concretely, it is likely that such side information is weakly indicative of preferences (for example, two average teens may have similar movie tastes). More descriptive features such as a user's social network have shown promise both as a way to select a user's neighbourhood in model-free approaches [53, 76, 70] (neighbourhood models were introduced in Section 2.2.1), as well as in collaborative filtering approaches as a way to regularize user representations by assuming that neighbours in a social network have similar preferences [26, 54]. Features that describe user behaviours, such as user browsing patterns, user search queries, or user purchases, have also shown their utility, again as a way to regularize user representations [61, 99].

On the item side, gains have mostly come from domains where the content of items can easily be

analyzed using machine learning techniques, for example, when the content is text as it is for scientific

papers [126] or newspaper articles [27]. For books, Agarwal and Chen [2] have proposed using book

reviews as representative of a book’s content (in Section 4.6 we show that using the content of books is

challenging). They use a probabilistic model of content and user preferences which combines regressors

over side information in a way conceptually similar to Conry et al. [29], which we reviewed in Section 3.3.


Figure 4.2: Graphical model representations of the three common classes of models for using side information within collaborative filtering models. (a) A collaborative filtering model such as probabilistic matrix factorization [99]. (b) The user side information f_u and the item side information f_v regularize the user and item representations respectively; the graphical representation illustrates that each user's and item's latent factors (u and v) can be regularized by all other users' side information or all other items' side information (this could be the case for users when using the structure of a social network as side information). Regularization of user and item representations is used by many researchers [54, 39, 126, 66, 99]. (c) Side information is used as features (covariates) to learn a regressor [1, 2, 26, 61, 27, 29, 130]. (d) A generative model of side information in which user and item latent factors must predict both the scores and the side information; such a model can be seen as another way of regularizing the user and item latent factors. A few researchers have used such models [106, 70]. Note that we depict the pure collaborative filtering model as one which includes, but is not restricted to, the popular probabilistic matrix factorization. We emphasize that these figures are sketches meant as a guide to the different model classes (using a similar graphical representation); they are not intended to reproduce the exact parameters of the proposed models.


For other tasks, such as predicting interactions between drugs and proteins, side information consisting of drug similarities and protein similarities has also shown promise as a way to regularize the representations of a matrix factorization model [39].

However, in other popular domains such as movie preference prediction, where content features present modelling difficulties, using item side information, for example a movie's genre, its actors, or its cast, has

been less useful (see, for example, [66, 106]). In the movie domain, modest gains have been achieved

when using related-content information such as user-proposed movie tags [2, 1].

Music recommendation offers an interesting testbed, falling between the simpler representations of

text and the more complex representations of movies, as we now have tools that allow the accurate

analysis, including determination of genre, of music tracks (see, for example, the MIREX competition1).

However, results in the music domain mostly point to the effectiveness of pure collaborative filtering. In a recent contest [77], content-based methods still did not rival pure collaborative filtering systems [Section 6.2.3, 9]. In a similar task, Weston et al. [130] show that using MFCC audio features as regression features does not provide a gain over pure collaborative filtering methods. These negative

results have made some researchers question our ability to extract useful content features from such

complex domains as music, movies and images [108].

We have provided a fairly high-level overview of the work that combines side information with collaborative filtering models. In this chapter we present a novel model for doing so, one which is applicable to textual side information (or, more generally, to cases where the side information can be modelled with a topic model), and we compare it experimentally to competing models in this domain, in particular to [126].

4.2 Problem Definition

We do not aim to solve the general problem of hybrid recommender systems in this chapter. Instead

we focus on pushing the state-of-the-art by studying the task of document recommendation using both

user-item scores and side information. The primary novelty of our work lies in leveraging a particular

form of side information: the content of documents associated with users, which we call user libraries.

A typical scenario that can be modelled in this way is scientific-paper recommendation for researchers;

for example, Google Scholar recommends papers based on an individual’s profile. A second scenario is

paper-reviewer assignment, where each reviewer’s previously published papers can be used to assess the

match between their expertise and each submitted paper. Another relevant application domain is book

recommendation, as online book merchants typically enable users to collect items in a virtual container

akin to a personal library.2 In each case a user’s library, or side information, consists of documents which

are not necessarily explicitly rated but nonetheless may contain information about a user’s preferences.

To model user-item scores as well as user and item content we introduce a novel directed graphical

model. This model uses twin topic models, with shared topics, to model the side information. User

and item topic proportions are then used as features to predict user-item scores with a collaborative

filtering model. The collaborative filtering component enables the model to effectively make use of the

side information with varying numbers of observed scores. We demonstrate empirically that the model outperforms several other methods on three datasets in both cold-start and warm-start data regimes.

1. http://www.music-ir.org/mirex/
2. For example, Amazon's Kindle and Kobo's tablets have an option for users to populate their libraries, while Barnes and Noble's Nook gives users an active shelf.


Figure 4.3: The graphical models of two topic models: (a) LDA and (b) CTM.

We further show that the model automatically learns to gradually trade off the use of side information in favour of information learned from user-item scores as the amount of user preference data increases.

4.3 Background

In Chapter 3 we used a topic model as a tool to learn representations for user libraries and paper

submissions. In this chapter we propose a graphical model which jointly models user and item topics

and user-item scores. To better understand this model and our proposed inference procedures it will be

useful to further introduce LDA as well as introduce a second topic model called the correlated topic

model (CTM).

Topic models are a class of directed graphical models. They were initially proposed as a way to

model collections of textual documents [16]. Specifically, each document in the collection is modelled

as a mixture over topics. A topic is a distribution over words. With that in mind the aim of topic

models is then to learn a (word-level) representation of a set of documents. This representation involves

two key components: a) distributions over words or topics; b) for each document in the collection a

distribution over topics. Over the years topic models have been extended to model many different

practical situations, including: modelling how documents change over time [12], modelling hierarchies

of topics [15], and modelling document contents and labels [14, 84, 63].

For our needs we will specifically focus on LDA and its CTM variant. LDA’s graphical model is

depicted in Figure 4.3. The generative model of LDA over a set of D documents, an N -size word

vocabulary, and K topics is:

• For each document, d = 1 . . . D:
  - Draw d's topic proportions: η_d ∼ Dirichlet(α)
  - For each word in the document, n = 1 . . . N:
    - Draw a topic: z_{dn} ∼ Multinomial(η_d)
    - Draw a word: w_{dn} ∼ Multinomial(β_{z_{dn}})

η_d represents the topic proportions, or a mixture over topics, of document d. For each word in a document LDA first samples a single topic, z_{dn}. A word is then sampled from the corresponding topic distribution. It is useful to encode β as a matrix of size K × N, where K is the number of topics. Then the z_{dn}'th row of that matrix, β_{z_{dn}}, is the distribution (over words) corresponding to topic z_{dn}. Formally, for each document, LDA learns to represent a distribution over words given model parameters and latent variables: Pr(w_d | z_d, φ_d, β, α). Across all documents LDA then represents the space of joint distributions over document words.
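The generative process above can be simulated directly; the following numpy sketch samples a toy corpus from LDA (all dimensions and hyper-parameter values are arbitrary).

# A minimal simulation of LDA's generative process (toy dimensions).
import numpy as np

def generate_lda_corpus(D=5, N=50, K=3, V=20, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, 0.1), size=K)       # K topics: distributions over V words
    docs, proportions = [], []
    for _ in range(D):
        eta = rng.dirichlet(np.full(K, alpha))           # document's topic proportions
        words = []
        for _ in range(N):
            z = rng.choice(K, p=eta)                     # draw a topic for this word
            words.append(rng.choice(V, p=beta[z]))       # draw the word from that topic
        docs.append(words)
        proportions.append(eta)
    return docs, proportions, beta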

The correlated topic model is very similar to LDA except that mixtures over topics are modelled using a logistic normal distribution instead of a Dirichlet. Accordingly, instead of sampling a document's distribution over topics from a Dirichlet prior, it is sampled from a normal distribution with mean parameter µ and covariance parameter Σ. CTM's graphical model is shown in Figure 4.3 and its generative model is given by:

• For each document, d = 1 . . . D:
  - Draw d's topic proportions: η_d ∼ N(µ, Σ)
  - For each word in the document, n = 1 . . . N:
    - Draw a topic: z_{dn} ∼ Multinomial(softmax(η_d))
    - Draw a word: w_{dn} ∼ Multinomial(β_{z_{dn}})

where softmax(v)_k = \exp(v_k) / \sum_{k'} \exp(v_{k'}). In CTM the main advantage of using a normal distribution is that correlations between topics (e.g., topic 1 often co-occurs with topic 5) can be represented. This has been shown experimentally to provide better models of document collections (and, as a side benefit, provides potentially interesting visualizations of the discovered topic correlations) [13]. The main disadvantage of using a normal distribution is that the normal is not conjugate to the multinomial (whereas the Dirichlet is), and thus this change complicates the resulting inference procedures [13, 127].
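The following sketch illustrates the logistic-normal construction: topic proportions are the softmax of a draw from a multivariate normal, so the covariance Σ can encode correlations between topics. The particular µ and Σ below are arbitrary toy values.

# A minimal sketch of CTM's logistic-normal topic proportions (toy parameters).
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def sample_ctm_proportions(mu, Sigma, num_docs=4, seed=0):
    rng = np.random.default_rng(seed)
    eta = rng.multivariate_normal(mu, Sigma, size=num_docs)   # one eta per document
    return np.array([softmax(e) for e in eta])                # each row sums to one

mu = np.zeros(3)
Sigma = np.array([[1.0, 0.8, -0.5],
                  [0.8, 1.0, -0.5],
                  [-0.5, -0.5, 1.0]])   # topics 1 and 2 tend to co-occur, topic 3 does not
print(sample_ctm_proportions(mu, Sigma))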

4.3.1 Variational Inference in Topic Models

The key inference problem of graphical models is to evaluate the posterior distributions over latent

variables ηd and zd given the observed variables wd and the model parameters α and β. Posteriors

can then be used to make predictions using standard Bayesian inference. In the case of LDA the per-

document posterior, following Bayes’ rule, is given by:

\Pr(z_d, \eta_d \mid w_d, \beta, \alpha) = \frac{\Pr(w_d \mid z_d, \beta)\,\Pr(z_d \mid \eta_d)\,\Pr(\eta_d \mid \alpha)}{\sum_{z} \int \Pr(w_d \mid z_d, \beta)\,\Pr(z_d \mid \eta_d)\,\Pr(\eta_d \mid \alpha)\, \partial\eta} \qquad (4.1)

However the denominator is intractable due to the coupling of η and β [16]. Hence we must resort to

an approximate inference approach. Specifically we will use variational inference.

In variational inference the idea is to replace the intractable posterior by a tractable distribution over the latent variables: Q({z}, {η}). The choice of Q is model dependent; however, for computational simplicity, it is often given a parametric form. For simplicity let us group all latent variables, across


documents, into Z = {{zd}, {ηd}}d∈D and the visible variables into X = {wd}d∈D.3 Then the approach

is to optimize this tractable distribution such that it is as close as possible to the true posterior. The

measure of closeness between the two distributions is taken to be the KL-divergence:

KL(Q||P) = ∫ Q(Z) ln( Q(Z) / P(Z|X, α, β) ) ∂Z   (4.2)

= ∫ [ Q(Z) ln Q(Z) − Q(Z) ln P(Z|X, α, β) ] ∂Z   (4.3)

= −H(Q(Z)) − E_Q[ln P(Z|X, α, β)]   (4.4)

where H(·) denotes the entropy of its argument.

Minimizing the KL-Divergence is still intractable because it requires the evaluation of the posterior

P (Z|X,α, β) (Equation 4.4). However, by subtracting lnP (X) the log-marginal probability of the data,

which is constant with respect to Z, we obtain:

KL(Q||P) − ln P(X) = ∫ [ Q(Z) ln Q(Z) − Q(Z) ln P(Z|X) ] ∂Z − ∫ Q(Z) ln P(X) ∂Z   (4.5)

= ∫ Q(Z) ln( Q(Z) / [ P(Z|X, α, β) P(X|α, β) ] ) ∂Z   (4.6)

= ∫ Q(Z) ln( Q(Z) / P(Z, X|α, β) ) ∂Z   (4.7)

= ∫ [ Q(Z) ln Q(Z) − Q(Z) ln P(Z, X|α, β) ] ∂Z   (4.8)

= −H(Q(Z)) − E_Q[ln P(Z, X|α, β)]   (4.9)

The resulting expression, which consists of an entropy term and an expectation over the complete-data

likelihood Pr(Z,X|α, β), can now be evaluated.

The variational expectation-maximization (EM) [32, 80] algorithm (approximately) minimizes the

above, also known as the free energy, by alternating the following two optimization steps. In the E-

step the algorithm minimizes the free energy with respect to the variational posterior parameters, the

parameters of Q. Then, in the M-step the free energy is minimized with respect to the model parameters

(that is, α, β in LDA and µ,Σ, β in CTM). The M-step is also often described as maximizing the complete-

data log-likelihood. The EM algorithm is a classic algorithm in the machine learning literature. The

particular derivation that we have shown is similar to the one in Frey and Jojic [35].
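As an illustration of this alternating scheme, here is a minimal, model-agnostic sketch of a variational EM loop in Python. The functions free_energy, e_step and m_step are hypothetical placeholders standing in for the model-specific free-energy minimizations described above; this is only a sketch of the general control flow.

```python
def variational_em(data, model_params, var_params, free_energy,
                   e_step, m_step, max_iters=100, tol=1e-4):
    """Generic variational EM: alternate E- and M-step free-energy minimizations.

    free_energy(data, var_params, model_params) -> float (the bound being minimized)
    e_step(data, var_params, model_params)      -> updated variational parameters
    m_step(data, var_params, model_params)      -> updated model parameters
    """
    prev = float("inf")
    for _ in range(max_iters):
        var_params = e_step(data, var_params, model_params)    # minimize F w.r.t. Q's parameters
        model_params = m_step(data, var_params, model_params)  # minimize F w.r.t. model parameters
        f = free_energy(data, var_params, model_params)
        if prev - f < tol:   # stop once the free energy no longer decreases appreciably
            break
        prev = f
    return model_params, var_params
```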

There are several common choices of variational distributions. For example, we recover the maximum a posteriori (MAP) configuration of the latent variables by choosing the variational distribution to be a Dirac delta function whose mode is the variational parameter ĥ: Q(h) = δ_ĥ(h), where δ_ĥ(h) evaluates to one when h = ĥ and zero otherwise. Another common technique is to use a mean-field approach where the intractable posterior distribution is approximated using decoupled distributions. For example, in the case of LDA, Q(ηd, {zd}|γd, {φd}) = Q(ηd|γd)Q({zd}|{φd}) where γd and {φd} are variational parameters.

3Equivalently we will write {wd} when the domain of the index is clear from context.


4.4 Collaborative Score Topic Model (CSTM)

Our approach to document recommendation relies on having: a) a set of observed user-item preferences

({srp}); b) the contents of the items ({wd_p}); and c) the contents of the user libraries ({wa_r}). The model's aim

is to utilize the content in its user-item score predictions (which can then be used to recommend items

to users).

To do so we combine a topic model over side information and a collaborative filtering model of user-

item preferences. Specifically, we propose a generative model over the joint space of user preferences,

user side information (user libraries) and item side information (item contents). This model can be cast

in the light of the previous work on incorporating side information along with collaborative filtering

which we described earlier in this chapter (Section 4.1). Specifically, we combine a generative model

of user and item side information as well as of user-item preferences similarly as in Figure 4.2(d). The

side information, which consists of textual documents is modelled using a topic model. The user and

item topic proportions are then used within a linear regression model over user-item preferences. In that

respect our model also shares commonalities with the class of models depicted in Figure 4.2(c). The

resulting model is called the collaborative score topic model (CSTM).

We now describe in more detail the components of our model. Our content-based model is mediated

by topics: we learn a shared topic model from the words of the documents and user libraries. We

represent topic proportions with a normal distribution and realize topics za and zd, in the same way as CTM, using the logistic normal [13]. User and item topic proportions offer a compact representation of

user and item side information. We favor CTM over LDA because of its continuous representations of

topic proportions which can be useful for the regression model. We use these representations as covariates

in a regression model to predict user-item preferences. The regression has two sets of parameters. The

first are user-specific parameters on the item topics covariates. The second are compatibility parameters,

which are shared across users and items, and are based on the compatibility between the item topics

and the topics of the user library.

We now introduce the complete graphical model of the CSTM. A graphical representation of the

model is given in Figure 4.4. The associated generative model is:

• Draw compatibility parameters: θ ∼ N(0, λθ²I)

• Draw shared-user parameters: γ0 ∼ N(0, λγ0²I)

• For each user r = 1 . . . R:

- Draw individual-user parameters: γr ∼ N(0, λγ²I)

- Draw user-topic proportions: ar ∼ N(0, λa²I)

• For each document p = 1 . . . P:

- Draw document-topic proportions: dp ∼ N(0, λp²I)

• For all of user r's user-library words, n = 1 . . . N:4

- Draw za_rn ∼ Multinomial(softmax(ar))

- Draw wa_rn ∼ Multinomial(β_{za_rn})

• Repeat the above for all of document p’s M words

4For simplicity, we will assume in the notation that all user libraries contain N words and all item documents contain M words.


Figure 4.4: Graphical model representation for CSTM.

• For each user-document pair (r, p), draw scores:

srp ∼ N( (ar ⊗ dp)ᵀθ + dpᵀ(γ0 + γr), σs² )   (4.10)

where N(µ, σ²) represents a normal distribution with mean µ and variance σ², ⊗ stands for the Hadamard (element-wise) product, softmax(v)_k = exp(v_k)/∑_{k′} exp(v_{k′}), and I is the identity matrix.
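To make the preference regression of Equation 4.10 concrete, the following is a minimal NumPy sketch of the mean score prediction for a single user-item pair. All arrays are randomly initialized placeholders; K = 30 matches the number of topics used in our experiments, but the values themselves are purely illustrative.

```python
import numpy as np

K = 30                        # number of topics
rng = np.random.default_rng(0)

a_r = rng.normal(size=K)      # user r's topic representation
d_p = rng.normal(size=K)      # document p's topic representation
theta = rng.normal(size=K)    # shared compatibility parameters
gamma_0 = rng.normal(size=K)  # shared user parameters
gamma_r = rng.normal(size=K)  # user r's individual parameters

# Mean of Equation 4.10: (a_r ⊗ d_p)^T θ  +  d_p^T (γ_0 + γ_r)
s_rp_mean = (a_r * d_p) @ theta + d_p @ (gamma_0 + gamma_r)
```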

The specific parametrization of the preference regression shown in Equation 4.10 is important. Our

model is designed to perform well in both cold-start and warm-start data regimes. In cold-start settings

the model needs the user’s side information to predict user-item preferences. When the number of

observed preferences increases, the model can gradually leverage that information, smoothly combining

it with information gleaned from the side information to refine its model of missing preferences. To

accomplish this, the regression model (Equation 4.10) is separated in two: one component that exploits the user side information ((ar ⊗ dp)ᵀθ) and another that does not include the user side information but does include user-specific parameters (dpᵀ(γ0 + γr)).

Item side information is incorporated by modulating the user information through an element-wise

product. The weights θ then serve several purposes: 1) they can act to amplify or reduce the effect

of certain topics (for example diminish the influence of topics bearing little preference information); 2)

they enable the model to more easily calibrate its output to the range of observed preference values; and

3) changing the magnitude of θ enables the model to control how much it uses the side information for

preference prediction.

When user-item preferences are more abundant, the model can use them to learn a user-specific

model, γr, over item features. Note that these user-specific parameters are combined with a shared


set of parameters, γ0, which allows for some transfer across users. An individual’s γr can be used to

increase that user’s reliance on user-item preferences at the possible expense of item side information,

as the joint magnitude of the γ’s defines the weights associated with this part of the model.

Our model learns a single set of topics to model user and item content. Sharing topics ensures that

the user and item representations (ar and dp, ∀r, ∀p) are aligned and render their element-wise product

meaningful.

4.4.1 The Relationship Between CSTM and Standard Models

Simplifying the proposed CSTM model in various ways produces other models that have been used

for similar tasks. First, setting γ0 and γr, for every user r, to zero and θ to a vector of ones, we obtain the language model introduced in Section 3.2.1. To be precise, because we are using user and item topic

representations, this degeneracy corresponds to topic-LM (see Section 3.2.3). Mimno and McCallum

[79] used the related word-LM in a preference prediction task and found its performance particularly

strong in low-data regimes.

Further, setting θ and γ0 to zero we obtain the individual user regression model LR introduced in

Section 3.2.2. As we will demonstrate LR typically outperforms purely collaborative filtering models in

our task.

By modelling preferences as a combination of user features and item features, our model can also be

seen as an instance of collaborative filtering [99, 98, 7].

Finally, we have opted to represent topic proportions using a logistic normal distribution as in

CTM [13]. In our case we utilize the logistic normal due to its representational form, and not as a

means of learning topic correlations.5 Compared to a multinomial, the normal distribution adds a level

of flexibility that may be useful to better calibrate CSTM’s preference predictions; the drawback is

additional complexity in model inference.

Our model also shares several similarities with the model of Equation 3.5. First, both models are

composed of two separate components which, respectively, learn to regress over user and item side-

information. While the regression over item content is the same, CSTM's other component combines the user and item side-information representations in order to maximize performance on cold-start users. Early experiments with CSTM showed that having both global and user-specific compatibility

parameters (which would be closer to the parameters ω + ωr of Equation 3.5) did not yield empirical

improvements. One important difference between the two models is that the model of Equation 3.5 uses

representations learned offline whereas CSTM learns these representations jointly with the regression

parameters.

4.4.2 Learning and Inference

For learning we use a version of the EM algorithm where we alternate between updates of the user-item

specific variables (H = {{γr}, {ar}, {dp}, {za}, {zd}}) in the E-step and updates of the parameters or

shared variables (Θ = {γ0,θ,β}) in the M-step. The inference and learning procedures are similar to

those proposed for nonconjugate LDA models in Wang and Blei [126]. The general EM algorithm is

shown in Algorithm 1.

5Because learning topic correlations has been found to improve on standard LDA, it is possible that learning the topic correlations could also improve our model.


E-Step

Inference in this model is intractable, so we must rely on approximations when manipulating the posterior

over the user-item specific variables. The log-posterior over user-item variables, given the fixed model

parameters and the data, is

L := − (1/(2λa)) ∑_{r=1}^R arᵀar − (1/(2λd)) ∑_{p=1}^P dpᵀdp − (1/(2λγ)) ∑_{r=1}^R γrᵀγr

− (1/(2σs²)) ∑_{(r,p)∈So} ( srp − ((ar ⊗ dp)ᵀθ + (γ0 + γr)ᵀdp) )²

+ ∑_{r,n} log [ exp(a_{r,za_rn}) / ∑_j exp(a_rj) ] + ∑_{p,m} log [ exp(d_{p,zd_pm}) / ∑_j exp(d_pj) ]

+ ∑_{r,n} log β_{za_rn, wa_rn} + ∑_{p,m} log β_{zd_pm, wd_pm} − log Z(Θ)   (4.11)

where So stands for the set of observed preferences and Z(Θ) is the normalizing term of the posterior, which is intractable in part because ar and dp cannot be analytically integrated out, as they are not conjugate to the distribution over topic assignments [13].

We address this computational issue by employing variational approximate inference. For each of the

topic-proportion and regression variables {ar}, {dp}, {γr}, we use a Dirac delta posterior parameterized

by its mode {ar}, {dp}, {γr}. For the topic-assignment variables {za}, {zd}, we instead utilize a mean-

field posterior. The full approximate posterior is therefore:

q({ar}, {dp}, {γr}, {za_r}, {zd_p} | {ar}, {γr}, {dp}, {φa_r}, {φd_p}) =
( ∏_{r=1}^R δ_{γr}(γr) ) ( ∏_{r=1}^R δ_{ar}(ar) ∏_{n=1}^N φa_{rn, za_rn} ) ( ∏_{p=1}^P δ_{dp}(dp) ∏_{m=1}^M φd_{pm, zd_pm} )

where δ_µ(x) is the delta function with mode µ and {φa_r}, {φd_p} are the mean-field parameters (e.g., φa_r is a matrix whose entries φa_rnj are the probabilities that the nth word in user r's library belongs to topic j).

Approximate inference entails finding the variational parameters {ar}, {dp}, {γr}, {φa_r}, {φd_p} that


minimize the KL-divergence with the true posterior

KL := −Eq[L] − H(q)

= (1/(2λa)) ∑_{r=1}^R arᵀar + (1/(2λd)) ∑_{p=1}^P dpᵀdp + (1/(2λγ)) ∑_{r=1}^R γrᵀγr

+ (1/(2σs²)) ∑_{(r,p)∈So} ( srp − ((ar ⊗ dp)ᵀθ + (γ0 + γr)ᵀdp) )²

− ∑_{r,n,k} φa_rnk ( log [ exp(a_rk) / ∑_j exp(a_rj) ] + log β_{k, wa_rn} − log φa_rnk )

− ∑_{p,m,k} φd_pmk ( log [ exp(d_pk) / ∑_j exp(d_pj) ] + log β_{k, wd_pm} − log φd_pmk )

+ constant   (4.12)

Our strategy is to perform one pass of coordinate descent, optimizing each set of variational parameters

given the others.6 For γr, we obtain a closed-form update by differentiating the above equation and

setting the result to 0:

γr = (1/σs²) ( ∑_{p∈So(r)} (srp − (dp ⊗ ar)ᵀθ − dpᵀγ0) dpᵀ ) ( ∑_{p∈So(r)} dp dpᵀ / σs² + (1/λγ) I )⁻¹

where So(r) is the set of indices for documents that user r has rated. The {ar}, {dp} parameters do not

have closed-form solutions, hence we use conjugate gradient descent for optimization. The derivatives

with respect to the posterior KL are:

∂KL/∂ar = ar/λa − (1/σs²) ∑_{p∈So(r)} (srp − ŝrp)(dp ⊗ θ) + N · exp(ar)/∑_j exp(a_rj) − ∑_n φa_rn

∂KL/∂dp = dp/λd − (1/σs²) ∑_{r∈So(p)} (srp − ŝrp)(ar ⊗ θ + γ0 + γr) + M · exp(dp)/∑_j exp(d_pj) − ∑_m φd_pm

where ŝrp = (ar ⊗ dp)ᵀθ + dpᵀ(γ0 + γr) denotes the model's predicted score.

In practice, when optimizing for ar and dp, we use the approach of Blei and Lafferty [13] and introduce additional variational parameters, {ζa, ζd}, which are used to bound the respective softmax denominators. For each user, ζa_r is updated using ζa_r = ∑_{k=1}^K exp(a_rk), and similarly for the ζd_p's.

For the mean-field parameters {φa_r}, {φd_p}, minimizing the KL while enforcing normalization leads to the following solution:

φa_rnk = β_{k, wa_rn} exp(a_rk) / ∑_j β_{j, wa_rn} exp(a_rj)

φd_pmk = β_{k, wd_pm} exp(d_pk) / ∑_j β_{j, wd_pm} exp(d_pj)

6While we could cycle through all variational parameters until convergence before beginning the M-step, we have found a single pass of updates per E-step to work well in practice.
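As an illustration of two of the E-step updates above, the following NumPy sketch computes the closed-form γr update and the mean-field φ update for a single user. All array names (S_r, D_r, beta, w_r, and the hyper-parameters) are illustrative assumptions rather than code from the thesis.

```python
import numpy as np

def update_gamma_r(S_r, D_r, a_r, theta, gamma_0, sigma2_s, lam_gamma):
    """Closed-form update for user r's regression parameters gamma_r.

    S_r : (n_r,) observed scores of user r; D_r : (n_r, K) topic representations
    of the documents user r has rated (one row per rated document).
    """
    K = D_r.shape[1]
    # residual s_rp - (d_p ⊗ a_r)^T θ - d_p^T γ_0 for each rated document
    resid = S_r - D_r @ (a_r * theta) - D_r @ gamma_0
    A = D_r.T @ D_r / sigma2_s + np.eye(K) / lam_gamma
    b = (resid @ D_r) / sigma2_s
    return np.linalg.solve(A, b)

def update_phi_a_r(a_r, beta, w_r):
    """Mean-field update for the topic assignments of user r's library words.

    beta : (K, V) topic-word distributions; w_r : (N,) word indices of the library.
    Returns an (N, K) matrix of topic-assignment probabilities.
    """
    phi = beta[:, w_r].T * np.exp(a_r)[None, :]   # β_{k,w} · exp(a_rk)
    return phi / phi.sum(axis=1, keepdims=True)   # normalize over topics
```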


Algorithm 1 EM for the CSTM

Input: {wa_r}, {wd_p}, {srp} ∈ So.

while Convergence criteria not met do
  # E-Step
  for all p ∈ P do
    Update dp, φd_p
  end for
  for all r = 1 . . . R do
    Update ar, γr, φa_r
  end for
  # M-Step
  Update θ, γ0, β
end while

We update the variational parameters of all users and subsequently of all documents.

M-Step

The M-step aims to maximize the expectation of the complete likelihood under the variational posterior

(taking into account the prior over the parameters γ0,θ):

Eq[L] + log p(γ0) + log p(θ) = − (1/(2σs²)) ∑_{(r,p)∈So} ( srp − ((ar ⊗ dp)ᵀθ + (γ0 + γr)ᵀdp) )²

+ ∑_{r,n,k} φa_rnk log β_{k, wa_rn} + ∑_{p,m,k} φd_pmk log β_{k, wd_pm}

− (1/(2λγ0)) γ0ᵀγ0 − (1/(2λθ)) θᵀθ + constant.

Setting the derivatives to zero (and satisfying the βjw parameters’ normalization constraints), we obtain

the following updates:

θ = (1/σs²) ( ∑_{(r,p)∈So} (srp − dpᵀ(γ0 + γr)) (dp ⊗ ar)ᵀ ) ( ∑_{(r,p)∈So} (dp ⊗ ar)(dp ⊗ ar)ᵀ / σs² + (1/λθ) I )⁻¹

γ0 = (1/σs²) ( ∑_{(r,p)∈So} (srp − (dp ⊗ ar)ᵀθ − γrᵀdp) dp ) ( ∑_{(r,p)∈So} dp dpᵀ / σs² + (1/λγ0) I )⁻¹

β_jk = ( ∑_{r,n} φa_rnj 1[wa_rn = k] + ∑_{p,m} φd_pmj 1[wd_pm = k] ) / ( ∑_{k′} ( ∑_{r,n} φa_rnj 1[wa_rn = k′] + ∑_{p,m} φd_pmj 1[wd_pm = k′] ) ).
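As an illustration, the β update above amounts to accumulating expected topic-word counts from both user libraries and documents and normalizing each topic row. A minimal NumPy sketch follows; the input names are illustrative assumptions, not code from the thesis.

```python
import numpy as np

def update_beta(phi_a, words_a, phi_d, words_d, K, V):
    """M-step update for the topic-word matrix beta (K x V).

    phi_a / phi_d : lists of (N_r, K) / (M_p, K) mean-field matrices, one per user / document.
    words_a / words_d : lists of the corresponding word-index arrays.
    """
    counts = np.zeros((K, V))
    for phi, w in list(zip(phi_a, words_a)) + list(zip(phi_d, words_d)):
        # add each word's expected topic assignment to its vocabulary column
        np.add.at(counts.T, w, phi)           # counts[:, w[n]] += phi[n, :] for every n
    return counts / counts.sum(axis=1, keepdims=True)
```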

At test time, missing preferences are predicted using ŝrp, which is readily available.


Figure 4.5: A graphical representation of the CTR model (adapted from [126]).

4.5 Related Work

Previous work on hybrid collaborative filtering includes a few models that have combined item-only

topic and regression models for user-item preference prediction. We are not aware of any earlier work

that develops a text-based model of a user, nor one that combines user and item side information as in

CSTM.

Agarwal and Chen [2] model several sources of side information including item textual side informa-

tion using LDA. The topic assignment proportions of documents (∑_m zd_dm / M for all d ∈ D documents)

are used as item features and combined multiplicatively with user-specific features. The results are

linearly combined with user demographic information to generate preferences.

Wang and Blei [126] also combine LDA with a regression model for the task of recommending sci-

entific articles. Here the item topic proportions are used as a prior mean on normally-distributed item

(regression) latent variables. User latent variables are also normally distributed from a zero-mean prior.

A specific user-item score is then generated as the inner product of item and user latent variables:

srp = arᵀ(dp + εp), where εp is drawn from a zero-mean normal distribution. The preference prediction

model is the same as the one used in probabilistic matrix factorization [99]. Wang and Blei also report

that a modified version of their model analogous to the model of Agarwal and Chen [2] performed worse

on their data. Shan and Banerjee [106] proposed a similar model without the bias term εp but used

CTM [13].

The fact that we model an additional type of information (user textual side information) makes

it difficult to directly compare our model to the ones above. In addition, the parametrization we

use to predict preferences is very different from previous models. We initially experimented with a

parametrization similar to Wang and Blei [126], albeit modified to also model user side information, and


found it did not perform as well as CSTM (see the next section for experimental comparisons).

Finally, Agarwal and Chen [1] propose a collaborative filtering model with side information. Although

the form of the side information is not amenable to using topic models, the authors utilize a combination

of linear models to obtain good performance in both cold and warm-start data regimes. They reported

improved results in all the regimes they tried compared to pure collaborative filtering methods.

CSTM is also useful for predicting reviewer-paper affinities. Other researchers have looked at modelling

reviewer expertise using collaborative filtering or topic models. We note the work of Conry et al. [29]

which uses a collaborative filtering method along with side information about both papers and reviewers

to predict reviewer paper scores. Mimno and McCallum [79] developed a novel topic model to help

predict reviewer expertise. Finally, Balog et al. [5] utilize a language model to evaluate the suitability

of experts for various tasks.

4.6 Experiments

We first describe the three datasets used for our experiments. We then introduce the set of methods

against which we perform empirical comparisons, ranging from pure CF methods to pure side information

methods. We report three separate sets of experiments. In the first we focus on the cold-start problem

for new users and examine the effect of including user libraries. In the second we study how the

methods perform on users with varying amounts of observed scores. Finally, we design a synthetic paper

recommendation experiment and simulate the arrival of new cold-start users in order to test the

value of using both the user library and the user-provided item scores.

4.6.1 Datasets

We evaluate the models using these three datasets:

NIPS-10 and ICML-12 were both introduced in Section 3.2.3. We note that for NIPS-10, each user's

library consists of his own previously published papers. Users have an average of 31 documents (std.

20). After some basic preprocessing (we removed words that either appear in fewer than 3 submissions

or words which appear in more than 90% of all submissions), the length of the joint vocabulary is now

slightly over 18,000 words. For ICML-12, users have an average of 25 documents (std. 29) each and the

length of the joint vocabulary is 16,201 words (identical pre-processing as with NIPS-10 above).

Kobo: The third dataset is from Kobo, a large North American-based ebook retailer.7 The dataset

contains 316 users and 2601 documents (books). Users average 81 documents (std. 100). We removed

very-infrequent and very-frequent words (those appearing in less than 1% or more than 95% of all

documents). The resulting vocabulary contains 6,440 words. Users have a minimum of 15 expressed

scores (mean 22, std. 6).

The respective score distributions of each of the three datasets are shown in Figure 4.6.

4.6.2 Competing Models

We use empirical comparisons against other models to evaluate the performance of CSTM. For the

competing models we re-use some of the general models previously introduced and introduce a few other

7http://www.kobo.com


[Score histograms for (a) NIPS-10, (b) ICML-12, and (c) Kobo; scores range over 1-4 for NIPS-10 and ICML-12, and 1-5 for Kobo.]

Figure 4.6: Score histograms. Compared to Figure 3.3 the histograms for NIPS-10 and ICML-12 are altered according to the user categorization described in Section 4.6.3.

Model     User side information   Document side information   Shared parameters
SLM-I     X                       X
SLM-II    X                       X
LR                                X
PMF                                                           X
CTR                               X                           X
CSTM      X                       X                           X

Table 4.1: A comparison of the modelling capabilities of each model. "Shared parameters" stands for models that share information between users and/or items (in other words, those which use some form of CF).

ad-hoc models for the task. Specifically, each competing model has particular characteristics (Table 4.1)

which will help in understanding CSTM’s performance.

Note that we use topic representations of documents for competing models that use side information.

Such representations were learned offline using a correlated topic model [13]. We re-use some of our

previous notation to describe these models. Namely au and dd are K-length vectors which designate a

user’s and a document’s (topic) representation respectively.

Constant: The constant model predicts the average observed scores for all missing preferences.

Comparison to this baseline is useful to evaluate the value of learning.

Supervised language model I (SLM-I): This model is a supervised version of the LM: srp := (auᵀθA)(ddᵀθD)ᵀ, where the parameters θA, θD are K × F matrices. F is a hyper-parameter determined using a validation set (it ranges from 5 to 30 in our experiments).

Supervised language model II (SLM-II): This model uses isotonic regression (see, for example, [10])

to calibrate the LM. The idea is to learn a regression model that satisfies the implicit ranking established

by the LM:

minimize_{ŝr, ∀r}   ∑_{(r,p)∈So} (ŝrp − srp)²

subject to   ŝrp ≤ ŝr(p+1), ∀p,

where the constraints enforce a user-specific document ordering specified by the output of the LM. Once learned, the set of regression parameters {ŝ} are used as the model's predictions. To obtain predictions


for an unobserved document we have found that taking the average of the (predicted) scores of the two

(observed) documents ranked, according to the LM, directly above and below the new document works

well. The regression is user-specific and therefore cannot be used for users with no observed preferences.

For such users we simply re-use the learned parameters of their closest user. We leave further research into a more principled approach, for example a collaborative one, for future work.
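A minimal sketch of this calibration step, using scikit-learn's isotonic regression as one possible implementation, is shown below. The arrays lm_rank (positions of a user's observed documents in the LM ranking, best last) and observed_scores are illustrative placeholders, and this is not necessarily the exact procedure used for SLM-II in our experiments.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# For one user: LM ranking positions of the observed documents and their observed scores.
lm_rank = np.array([1, 2, 3, 4, 5])
observed_scores = np.array([1.0, 3.0, 2.0, 3.0, 4.0])

# Fit scores that are as close as possible to the observed ones while
# respecting the ordering implied by the LM (non-decreasing in lm_rank).
iso = IsotonicRegression(increasing=True)
calibrated = iso.fit_transform(lm_rank, observed_scores)

# For a new document, interpolate between the calibrated scores of its
# LM-rank neighbours (here, a document the LM ranks between positions 3 and 4).
prediction = iso.predict([3.5])
```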

LR: The linear regression model was introduced in Section 3.2.2. To recapitulate, it is a user-specific

regression model where predictions are given by: srp = γuᵀdd.

PMF [99]: PMF was introduced in Section 2.2.1. We remind the reader that it is a state-of-the-art

collaborative filtering approach and that it does not model the side information. The size of the latent

space is determined using a validation set (range from 1 to 30).

Collaborative topic regression (CTR): CTR is the matrix factorization model with a document-content model introduced in Wang and Blei [126]. CTR was briefly reviewed in Section 4.5. We use a slightly different version than the one introduced by its authors. Namely, we have replaced LDA by CTM. Also, in our application, since all user-item scores are given, we use a single variance value over scores (σs).

For SLM-I, LR, and PMF learning is performed using a variational approximation with a Gaussian

likelihood model and zero-mean Gaussian priors over the model’s parameters. The prior variances are

determined using a validation set.

Finally, we investigated a few other models. Of note: instead of modelling user libraries as side

information we used the documents of user libraries as observed highly-scored items. We experimented

with various scoring schemes but none led to consistent improvements over the baselines described above. We also experimented with replacing directed topic models with a supervised extension of an undirected topic model [97]. However these methods did not perform well and are not discussed further.

4.6.3 Results

To run CSTM on the above datasets we first concatenated user libraries (for example a reviewer’s

previously published papers) into a single document. The content of the resulting document can then

be used as that user's side information (wr) in CSTM. To get user and item topic proportions we learned a CTM topic model [13] using the content of the items and then projected user documents into that space to obtain user topic proportions. We directly used these topics in SLM-I, SLM-II and LR.

We also used these topics as initialization in those models which jointly learn topics and scores (CSTM

and CTR). In all experiments we use 30 topics.

For training we create 5 folds from the available scores. Each fold is split into 80 percent observed

and 20 percent test data. We used the first fold to determine the hyper-parameters of the model. We

report the average results over the five folds as well as the variance of this estimator.

We want to evaluate the performance of CSTM in settings where some users have no observed scores.

The cold-start setting is of particular practical importance and one that should enable a good model

to leverage the user’s side information. Accordingly, in our datasets we randomly selected one fourth

of all users and removed all of their observed scores for training but kept their test scores (and their

side information remains available at training). Further, for NIPS-10 and Kobo, whose users have a

more uniform number of ratings, we binned the remaining users (three quarters) uniformly into three

categories. For NIPS-10, users in each category had 15, 30 and 55 observed scores respectively. In each of

the three categories 5 ratings per user were kept for validation. For Kobo, users in the first two categories

had 8 and 10 scores while the scores of users in the last category were left untouched (5 scores per user


           NIPS-10          ICML-12          Kobo
Constant   0.4378±2×10−3    0.6386±4×10−5    0.6882±5×10−4
SLM-I      0.4684±2×10−3    0.7903±4×10−5    0.6873±1×10−3
SLM-II     0.4696±3×10−4    0.7752±1×10−4    0.6926±6×10−4
CSTM       0.4846±1×10−3    0.8096±1×10−4    0.7243±2×10−4

Table 4.2: Comparisons between CSTM and competitors for cold-start users using NDCG@5. We report the mean NDCG@5 value and the variance over the five training folds.

were kept for validation). For ICML-12 since users are already naturally distributed into categories, we

split the observed data into 25 percent validation and 75 percent train.

For the next two experiments, for each dataset, we train each model on all of the data but we divide

our discussion into two parts. First we discuss cold-start users, and then we examine the (other) user

categories.

Cold-Start Data Regime

We first report the results for the completely cold-start data regime. As a reminder, this simulates

new users entering the system with their side-information. That is, user and item side-information

is available but scores are not. For the cold-start users, it is difficult to calibrate the output of the

model to the correct score range since only the users' side information is available. The expectation is that the models can use the side information to get a better understanding of users' preferences and discriminate between items of interest. Accordingly we report results using Normalized DCG (NDCG),

a well-established ranking measure (see Equation 2.9), where a value of 1 indicates a perfect ranking

and 0 a reverse-ordered perfect ranking [57]. NDCG@T considers exclusively the top T items. Table 4.2

reports results for the three datasets using NDCG@5 (note that other values of NDCG gave similar

results). We can only report results for the methods that have the ability to predict scores for cold-start

users: PMF, LR, and CTR do not use any user side information and hence do not have that ability.
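For reference, a small NumPy sketch of how NDCG@T can be computed from predicted and true scores is shown below. It uses a standard exponential-gain DCG, which may differ in its exact gain, discount, or normalization from Equation 2.9; the example inputs are purely illustrative.

```python
import numpy as np

def ndcg_at_t(true_scores, predicted_scores, T=5):
    """NDCG@T: rank items by predicted score, compute DCG of the true scores of the
    top T items, and normalize by the DCG of the ideal (true-score) ordering."""
    true_scores = np.asarray(true_scores, dtype=float)
    order = np.argsort(-np.asarray(predicted_scores))[:T]   # top-T predicted items
    ideal = np.sort(true_scores)[::-1][:T]                  # top-T true scores
    discounts = 1.0 / np.log2(np.arange(2, T + 2))
    dcg = np.sum((2.0 ** true_scores[order] - 1) * discounts[:len(order)])
    idcg = np.sum((2.0 ** ideal - 1) * discounts[:len(ideal)])
    return dcg / idcg if idcg > 0 else 0.0

# Example: one user's true scores and a model's predictions over six documents.
print(ndcg_at_t([3, 1, 2, 3, 1, 2], [2.5, 0.3, 1.9, 2.8, 0.1, 1.0], T=5))
```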

In this challenging setting CSTM significantly outperforms the other methods. Further we see that

methods using side information typically outperform the constant baseline. This demonstrates that useful information about user preferences can be extracted from the user libraries. Further, the good

performance of CSTM in this setting shows that the model is able to leverage that information.

Warm-start data regimes

The goal of CSTM is to perform well across different data regimes. In the previous section we examined

the performance of several methods on cold start users; we now focus on users with observed scores. For

each dataset we report the performance of the various methods for each user category. For ICML-12 we

separated users into roughly equal sized bins according to their number of observed scores. Results for

the three datasets are provided in Figure 4.7. First we note that as the number of observed scores is

increased the performance of the different methods also increases. CSTM outperforms all other methods

on lower data-regimes. On users with more observed scores CSTM is competitive with both CTR and

LR.

We note that overall in this task, and even when many observed preferences are available, PMF is

not competitive with most of the methods that have access to the side information. This highlights


[Grouped bar plots of test RMSE per user category; bar order: Constant, LM-I, LM-II, PMF, LR, CTR, CSTM. Panels: (a) NIPS-10 (categories =15, =30, =55 observed scores), (b) ICML-12 (>0,<=20; >20,<=30; Others), (c) Kobo (>0,<=8; >8,<=10; Others).]

Figure 4.7: Test RMSE of the different methods across the different datasets. For each dataset, we report results for the three subsets of users with different numbers of observed scores (categories). Figures are better seen in color (however the ordering in the legend corresponds to the ordering of the bars in each group).


[Plots of the ratio |θ|/|γ0 + γ| against the user categories 1-4.]

Figure 4.8: Averaged norm of parameters for users with varying numbers of scores (left: NIPS-10, right: ICML-12).

the value of content side information on both user and item sides. This is further made clear by the

relatively strong performance of both SLM-I and SLM-II.

Overall user libraries do not seem to help as much on the Kobo dataset. There are several explanations

for this. First, in Kobo the distribution over scores is very skewed toward high scores. Therefore a

constant baseline does quite well. Further, bag-of-words representations are particularly well suited for

academic papers, where the presence (or absence) of specific words is a very good indication of a document's

field and hence its targeted audience. However, in (non-technical) books user preferences also rely on

other aspects such as the document’s prose which is harder to capture in a bag-of-words topic model.

Tradeoff of side information with observed scores

In Section 4.4 we motivated the specific parametrization of CSTM by its ability to trade off the influence

of the user library side information versus that of the user-item scores. Here we show that learning in

our model performs as expected. Figure 4.8 reports the relative norm of the compatibility parameters

θ versus the (shared and individual) user parameters (γ0 + γ) as a function of the number of observed

scores: |θ|/|γ0 + γ|. As hypothesized, as the number of observed scores increases the relative weight of

the user library side information decreases.

Results on original ICML-12 dataset

For completeness we also report results on the original, unmodified, version of ICML-12 (that is a version

without the artificial 25 percent cold-start users, see Section 4.6.3). Accordingly, users are now binned

differently. In Figure 4.9 and Table 4.3 we provide comparisons of the different methods on the original

version of ICML-12. We highlight that these results further display the good performance of CSTM in

all data regimes. Similarly as in the previous experiments with ICML-12, CSTM is only outperformed

by CTR for reviewers with many training scores.


Figure 4.9: RMSE results on the unmodified ICML-12 dataset (user categories: <=16; >16,<=27; >27,<=40; Others observed scores).

           ICML-12
Constant   0.8950±5×10−4
SLM-I      0.9301±2×10−4
SLM-II     0.9308±4×10−4
CSTM       0.9409±3×10−4

Table 4.3: For the unmodified ICML-12 dataset, comparisons between CSTM and competitors for cold-start users using NDCG@5.

Variations of CSTM

We also experimented with variations of CSTM to better understand the roles played by the different

aspects of the model and its training.

CSTM fixed topics (CSTM-FT): This model uses the exact preference regression model used by

CSTM but it uses fixed user topic and document topic representations; that is, it predicts preferences

with srp = (ar ⊗ dp)ᵀθ + dpᵀ(γ0 + γr) (Equation 4.10), where ar and dp are learned offline and held fixed.

CSTM no user side information (CSTM-NUSI): To evaluate the gain of using user side information

we experimented with a version of our model that does not model user side information (i.e., as if a user

did not have any documents). Specifically, in this model ar ≡ 0 for all users.

We provide some results comparing CSTM with its variations in Table 4.4. We notice that it is the

synergy of the user side information and the joint training of the model that explains CSTM's superior performance.

            NIPS-10          ICML-12          ICML-12 (original)   Kobo
CSTM-NUSI   0.4941±4×10−4    0.7765±9×10−5    0.8048±9×10−4        0.7997±8×10−5
CSTM-FT     0.4984±2×10−4    0.8036±5×10−5    0.8066±8×10−5        0.8026±8×10−5
CSTM        0.5016±2×10−4    0.8217±2×10−5    0.8322±2×10−4        0.8037±2×10−5

Table 4.4: Comparisons between CSTM and two variations. Results report NDCG@5 over all users.


Poster Recommendations

We explore a different scenario which is meant to simulate what would happen when a model is deployed

in a complete recommender system, for example, to guide users to posters of interest in an academic

conference. Specifically, we evaluate the performance of CTR and CSTM as new users arrive into the

system and gradually provide information about themselves. We postulate that users first provide the

system with their library. Then users gradually express their preferences for certain (user-chosen) items.

We trained CTR and CSTM on all but 50 randomly-chosen ICML-12 users, restricting our attention

to users with at least 15 observed scores. We then simulated these users entering the system and evaluated

their individual impact. In this experiment we want to recommend a few top-ranked items to each user.

In terms of our recommendation framework in Figure 4.1, the recommendation of the personalized-best

items is our ultimate objective F (·). Therefore we evaluate the system’s performance using NDCG (of

held out data). Figure 4.10 presents the performance of CSTM and CTR as a function of the amount

of data available in the system. When a user first enters the system no data is available about him

(indicated by “0” in the figure). The methods revert to using a constant predictor which predicts the

mean of the previously observed scores across all users. Once a user provides a library (Lib.) we see that

CSTM’s performance increases very significantly. CTR cannot leverage that side information. Then once

users provide scores, the performance of both methods increases and the performance of CTR eventually

reaches the performance of CSTM.

Figure 4.10 demonstrates the advantage of having access to user side information, namely, the system

can quickly give good recommendations to new users. Further, in absolute terms the system performs

relatively well without having access to any scores. It is also interesting to note, in this experiment, as

far as NDCG goes, that the performance of CSTM only modestly improves as the number of observed scores in-

creases. This may be a consequence of our fairly primitive online learning procedure. As far as modelling

goes this experiment is also a demonstration that our model of user libraries is effective at extracting

features (ar for all users) indicative of preferences and that the regression model (Equation 4.10) then

successfully combines the user and item side information.

4.7 Conclusion and Future Opportunities

We have introduced a novel graphical model to leverage user libraries for preference prediction tasks.

We showed experimentally that CSTM overall outperforms competing methods and can leverage the

information of other users and of user libraries to perform particularly well in cold-start regimes. We also

explored a paper recommendation task and demonstrated the positive impact on the recommendation

quality of having access to user libraries.

Overall, we showed that using the content of items outperforms state-of-the-art (content-less) collab-

orative filtering methods in both cold and warm start regimes. Furthermore, we have shown that using

user-item side information is also a win in many cases. Finally, user-item side information is essential to

quickly provide good recommendations to new users.

Future work offers both immediate and longer-term possibilities. In the near term, we could refine

the inference procedure used in training our model. For example by using a fully variational approach

and by leveraging the latest inference procedures for non-conjugate models such as CTM [127].

An important question is what is missing from CSTM before it can be used in real life. Of practical

importance is a model’s scalability properties. Currently, with respect to the datasets used in this


Figure 4.10: Comparison of CSTM and CTR's NDCG@10 performance on new users as a function of the amount of data provided by users. The x-axis denotes what user data is available (Lib. stands for user libraries while integer values denote the number of available user scores). Without any user data (labelled 0 on the x-axis) both methods revert to a constant predictor.

chapter, CSTM could be trained on larger datasets but is still far from being able to learn from datasets

that may be common in industry. To give an indication of training time, the current implementation of

our model, written in Matlab, can be fitted to our datasets within a few hours on a modern computer.

Currently, our inference procedure, at each iteration, must iterate through all users and all items.

One interesting avenue to improve scaling (in terms of number of users and items) is to use stochastic

variational inference [47, 23]. The basic intuition behind the stochastic optimization methods used in

machine learning, for example stochastic gradient descent, is to allow multiple parameter updates per

training pass. This stands in contrast to Algorithm 1 where we update the model parameters after

having examined all observed preferences. Stochastic optimization techniques are very commonly used

to train (collaborative-filtering) models on large datasets (for example, [99]). For our purposes, we could

imagine sampling n users, along with the documents that these users have rated, per iteration and use

the sufficient statistics from these n users to update the global parameters. Therefore, the training time

per iteration would be reduced by at least a factor of R/n (where R is the total number of users). It

is still difficult to evaluate the resulting size of datasets that CSTM could then be fitted to since the

number of iterations can only be determined empirically.
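A sketch of what such a stochastic variant of Algorithm 1 might look like is given below. The per-user E-step and the sufficient-statistic-based M-step (e_step_user, m_step_from_stats) are hypothetical placeholders, and the decaying step-size schedule follows the usual form used in stochastic variational inference; this is a speculative illustration, not an implemented procedure.

```python
import numpy as np

def stochastic_em(users, n, e_step_user, m_step_from_stats, global_params,
                  n_iters=1000, tau=1.0, kappa=0.7):
    """Hypothetical mini-batch variant of Algorithm 1: sample n users per iteration,
    run their E-step, and take a weighted step on the global parameters (a dict)."""
    rng = np.random.default_rng(0)
    for t in range(1, n_iters + 1):
        batch = rng.choice(len(users), size=n, replace=False)
        stats = [e_step_user(users[i], global_params) for i in batch]   # local updates
        rho = (t + tau) ** (-kappa)                                     # decaying step size
        noisy = m_step_from_stats(stats, scale=len(users) / n)          # rescaled batch statistics
        global_params = {k: (1 - rho) * global_params[k] + rho * noisy[k]
                         for k in global_params}
    return global_params
```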

Related to the above is the question of using CSTM as the main model inside of TPMS (for example

in place of the LM or of the model depicted in Equation 3.5). CSTM is designed for this exact domain

and, based on the experiments of this chapter it could improve the performance of the system. Before

changing the model used in TPMS we will need to evaluate CSTM against the model of Equation 3.5.

Such comparisons, since CSTM is novel, do not yet exist. Furthermore, there are practical aspects which


do not favour the inclusion of CSTM. First and foremost, many conferences only rely on user and item

content and never elicit expertise from reviewers. Furthermore, conference organizers are often working

under stringent time constraints which may favour the use of simpler models. For example, the LM is

fast to fit and we have found good settings of its hyper-parameters that work well across all conferences.

On the other hand both CSTM and the model of Equation 3.5 require tuning of the hyper-parameters

which can be time-consuming. One further aspect to consider is how much the empirical gains of CSTM

translate to perceivable differences in reviewer scores and ultimately in reviewer assignments. As an

indication, the next chapter shows that, on our datasets, there is a strong correlation between prediction

performance and matching performance. Thus given our current experiments, when time permits, CSTM

seems like a very good candidate for inclusion into TPMS.

A second aspect of practical importance is that once we move to online recommendation, models

must also be able to adapt to new data, including novel items and users, updates to user libraries, and

new user-item scores. In the poster recommendations experiment we have seen that a simple conditional

inference method works relatively well for novel users. However, one would also like to use the information

from novel users to learn better representations of all users. In other words, we would need a mechanism

which updates model parameters once a sufficient amount of new data is available. Furthermore, we

could refine such a method inter alia to allow the system to adapt to the evolving preferences of users

over time. For example, Agarwal and Chen [1] propose a decaying mechanism to emphasize more recent

scores over older ones. A similar mechanism could be use to weight the different documents in a user’s

library (for example based on date of publication for research papers or purchase date for books).

There is also the question of other potential applications for which CSTM could be useful. In addition

to modelling text, topic models have also been shown to model images [33]. CSTM could then be used

as an image recommendation tool (for example to photographers). In that case, much like for the

books of the Kobo dataset, it remains to be seen whether topic models can capture features of images

which are indicative of preferences. Another application for CSTM is the one of modelling legislators’

interests whom, similarly to academic reviewers, write and express their preferences about proposed

laws. Furthermore, there are novels aspects in this domain such as the effects of party lines which,

for certain bills, may create voting correlations which may have to be taken into account to correctly

determine the legislators’ true preferences.

Finally, the approach behind CSTM, which allows it to smoothly interpolate between side information

and preferences, could be refined. In its current form, if some user documents are not useful to predict

preferences the model must either: a) adjust the global compatibility parameters θ, thereby changing

the predictions of all users; or b) increase user individual parameters γr while keeping the calibration

of predicted preferences to observed ones; or c) change the representation of user libraries therefore

changing the topic model; or d) use a combination of the above. A more pleasing solution would allow

the model to independently adjust the importance of user libraries, and perhaps even of individual

documents within a library. Overall, interpolating from side information based recommendations to a

preference one requires additional attention.


Chapter 5

Learning and Matching in the

Constrained Recommendation

Framework

In the majority of recommender systems the predicted preferences enable the recommendation stage.

The recommendation step, the ultimate step in our constrained-recommendation framework, can assume

various forms. In particular it may include several different constraints and objectives in addition to

the initial preference prediction objective (the loss function associated with the learning method). This

chapter is devoted to studying this last step and the interactions between the learning and recommen-

dation stages. Motivated by the reviewer-paper assignment problem, we also develop and explore an

approach to constrained recommendation consisting of a matching between reviewers and papers.

We begin by detailing the design choices of our proposed recommendation framework. We then look

at an instance of the framework for matching reviewers to papers. We especially focus on the development

of the matching stage and present several experimental results on matching problems. The experiments

aim to compare different learning models on the matching task, and also explore different matching

formulations motivated by real-life requirements. Finally, we demonstrate synergies between matching

and learning.

5.1 Learning and Recommendations

For readability purposes we reproduce a constrained-recommendation framework in Figure 5.1. We

remind the reader of the three stages of the framework: 1) elicitation of user information; 2) prefer-

ence prediction using the previously elicited preferences and side information; and 3) determination of

recommendations using domain-specific considerations (objective and constraints).

Separating the preference prediction stage and the recommendation stage implies the use of two

separate objective functions. Ideally we would rather use a single objective which encompasses both the

ideals of the learning objective (e.g., accurate preference predictions) and the ideals of the recommen-

dation objective (e.g., learn a model which will give good recommendations). In practice this is usually

impossible for two reasons: 1) The optimization stage can involve solving a complex, for example a


[Flow chart: users and entities provide stated preferences and side information; stages: (1) elicitation, (2) prediction of missing preferences (stated & predicted preferences), and (3) final objectives and constraints F, producing recommendations.]

Figure 5.1: Flow chart depicting a constrained recommender system.

non-continuous, optimization problem with constraints which we cannot express as a learning objective;

2) Learning the parameters of a predictive model is likely the most expensive step of the three stages;

once learned, it is useful if its predictions can be re-used in different optimization problems. The

reviewer to paper matching system of Chapter 3 is a good example of the latter as conference organizers

will often use TPMS scores multiple times while exploring their preferences over the various matching

constraints.

Further, in machine learning it is sometimes empirically advantageous to learn a model using a

simpler loss function than the one required by the task. A good example, which we reviewed in

Section 2.2.2, occurs in the field of learning to rank for recommender systems, a recommendation task

requiring a straightforward sorting procedure as its last stage. Certain state-of-the-art methods often

split the learning into two different steps (akin to the two stages of our framework): in the first step they

learn a standard preference prediction model, the output of which is then used to train a model using a

(domain specific) ranking loss [4].

The separation of the prediction and optimization stages does not imply that the two stages cannot

work in synergy when possible. Specifically, defining a loss function that is sensitive to the final objective

may provide performance gains. In fact we explore some of these synergies for the matching problem in

Section 5.4.4.1

5.2 Matching Instantiation

Motivated by the reviewer to paper matching problem we explore different formulations of an assignment

or matching problem to optimally assign papers to reviewers given some constraints. In other words,

as shown in Figure 5.2, the third stage of our constrained recommendation framework is a matching

problem.

Concretely, we frame the assignment problem as an integer program [122], and explore several varia-

tions that reflect different desiderata, and how these interact with various learning methods. We test our

framework on two data sets collected from a large AI conference, measuring predictive accuracy with

respect to both reviewer suitability score and matching performance, and exploring several different

matching objectives and how they can be traded off against one another.

1Synergies between the elicitation and the optimization steps, which we show to be even more important experimentally, will be explored in Chapter 6.


[Flow chart, as in Figure 5.1, with the final stage instantiated as a matching objective and constraints producing assignments.]

Figure 5.2: Match-constrained recommendation framework.

Although we focus on reviewer matching, our methods are applicable to any constrained matching

domain where: (a) preferences can be used to improve matching quality; (b) it is infeasible or undesirable

for users to express preferences over all items; and (c) capacity or other constraints limit the min/max

number of users-per-item (or vice versa). Examples include facility location, school/college admissions,

certain forms of scheduling and time-tabling, and many others.

We reuse the notation established in previous chapters. As a reminder, we denote the matrix of

observed scores, or reviewer-paper suitabilities, by So. Further we denote the observed scores for a

particular reviewer r and paper p by So_r and So_p, respectively. Su, Su_r, Su_p are the analogous collections of unobserved scores.

Given this information, our goal is to find a “good” matching of papers to reviewers in the presence

of incomplete information about reviewer suitabilities, possibly exploiting the side information available.

5.2.1 Matching Objectives

We articulate several different criteria that may influence the definition of a “good” matching and explore

different formulations of the optimization problem that can be used to accommodate these criteria. We

also discuss how these criteria may interact with our learning methods.

Naturally, one would like to assign submitted papers to their most suitable reviewers; of course, this

is almost never possible since some reviewers will be well suited to far more papers than other reviewers.

In general, load balancing is enforced by placing an upper limit or maximum on the number of papers

per reviewer. Similarly, we may impose a minimum to ensure reasonable load equity or load fairness

across reviewers. However, limiting the paper load increases the probability that certain papers will be

assigned to very unsuitable reviewers. This suggests only making assignments involving pairs with score

srp above some minimum score threshold. This ensures that every paper is reviewed by a minimally

suitable reviewer, but may sacrifice load equity (indeed, it may sacrifice feasibility). One may also desire

suitability fairness across reviewers; that is, reviewers should have similar score distributions over their

assigned papers (so on average no reviewer is assigned to significantly more papers for which he is poorly

suited than any other reviewer). Finally, when multiple reviewers are assigned to papers, it may be

desirable to assign complementary reviewers to a paper so as to cover the range of topics spanned by a


submission. Related is the desire to ensure each paper is reviewed by at least one “well-suited” reviewer.

The intricacies of different conferences prevent us from establishing an exhaustive list of matching

desiderata (see [8, 42, 37] for further discussion). We now explore matching mechanisms that will account

for several of these criteria: we frame the matching procedure as an optimization problem and show how

several properties can be formulated as constraints or modifications of the objective function.

We formulate the basic matching problem as an integer program (IP), where each paper is assigned

to its best-suited reviewers given the constraints [122]:

\[
\begin{aligned}
\text{maximize}\quad & J_{\text{basic}}(Y, S) = \sum_{r}\sum_{p} s_{rp}\, y_{rp} && (5.1)\\
\text{subject to}\quad & y_{rp} \in \{0, 1\}, \quad \forall r, p && (5.2)\\
& \sum_{r} y_{rp} = R_{\text{target}}, \quad \forall p && (5.3)
\end{aligned}
\]

The binary variable yrp encodes the matching of item p to user r; a match is an instantiation of these variables. Y is the match matrix, Y = {yrp}_{r∈R, p∈P}, and similarly S is the score matrix, S = {srp}_{r∈R, p∈P}. Jbasic(Y, S) denotes the value of the objective of the IP with match matrix Y and score matrix S. Rtarget is the desired number of reviewers per paper. Minimum and maximum reviewer loads, Pmin and Pmax respectively, can be incorporated as constraints [122]:

\[
\sum_{p} y_{rp} \;\ge\; P_{\min}, \qquad \sum_{p} y_{rp} \;\le\; P_{\max}, \quad \forall r. \tag{5.4}
\]

This IP, including constraints (5.4), is our basic formulation (Basic IP). Its solution, the optimal match, maximizes total reviewer suitability given the constraints. Although IPs can be computationally difficult, our constraint matrix is totally unimodular, so the linear program (LP) relaxation (allowing yrp ∈ [0, 1]) does not affect the integrality of the optimal solution; hence the problem can be solved as an LP. This can be understood as follows. The constraints define a feasible set Ay ≤ b. Each constraint is linear, thus the feasible set is a polyhedron. Since the constraint matrix A is totally unimodular [122] and b is an integer vector, the vertices of that polyhedron have integer coordinates. An LP's objective function is also linear and therefore its optimum is attained at one or more of the polyhedron's vertices.
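To make this concrete, the sketch below solves the Basic IP of Equations 5.1–5.4 through its LP relaxation with an off-the-shelf solver; by total unimodularity the relaxed optimum is integral. This is a minimal sketch assuming SciPy's linprog is available; the function and variable names are illustrative and not part of the thesis.

```python
# A minimal sketch of the Basic IP solved via its LP relaxation (Section 5.2.1).
# Because the constraint matrix is totally unimodular, the LP optimum is integral.
# Assumes a dense score matrix `S` (reviewers x papers); names are illustrative.
import numpy as np
from scipy.optimize import linprog

def basic_match(S, r_target, p_min, p_max):
    R, P = S.shape
    c = -S.ravel()  # linprog minimizes, so negate suitabilities

    # Equality constraints: each paper gets exactly r_target reviewers.
    A_eq = np.zeros((P, R * P))
    for p in range(P):
        A_eq[p, p::P] = 1.0
    b_eq = np.full(P, r_target)

    # Inequality constraints: P_min <= papers per reviewer <= P_max.
    A_load = np.zeros((R, R * P))
    for r in range(R):
        A_load[r, r * P:(r + 1) * P] = 1.0
    A_ub = np.vstack([A_load, -A_load])
    b_ub = np.concatenate([np.full(R, p_max), np.full(R, -p_min)])

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0.0, 1.0), method="highs")
    Y = np.round(res.x).reshape(R, P)  # integral at the optimum
    return Y, -res.fun                 # match matrix and J_basic value
```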

Although not mentioned above, it is essential for the matching to prevent assignments of reviewers to submitted papers for which they have conflicts of interest (COI). The above formulation can easily enforce known COIs by directly constraining the conflicting assignments' yrp variables to be 0; alternatively, we can set the relevant scores srp to −∞.

To capture additional matching desiderata, we can modify the objective or the constraints of this IP.

Load balancing can be controlled by manipulating Pmin and Pmax: a small range ensures each reviewer

is assigned to roughly the same number of papers at the expense of match quality, while a larger range

does the converse. We can instead enforce load equity by making the tradeoff explicit in the objective

with “soft constraints” on load:

\[
J_{\text{balance}}(Y, S) = \sum_{r}\sum_{p} s_{rp}\, y_{rp} \;-\; \sum_{r} \lambda\, f\!\left(\Big(\sum_{p} y_{rp}\Big) - \bar{y}\right) \tag{5.5}
\]

where ȳ is the average number of papers per reviewer (M/N) and f is a penalty function (for example, f(x) = |x| or f(x) = x^2). The parameter λ controls the tradeoff between load equity and match quality. The Jbalance objective (Equation 5.5), along with the constraints expressed in Equation 5.2, comprises our Balance IP. Note that if f(x) is nonlinear, then the Balance IP becomes a nonlinear optimization problem.


The Jbasic objective (Equation 5.1) maximizes the overall suitability of the assignments, equating

“utility” with suitability. However, the utility of a specific match yrp may not be linear in suitability srp.

For example, utility may be more “binary”: as long as a paper is assigned to a reviewer whose suitability

is above a certain threshold, then the assignment is good, otherwise it is not. This can be realized

by applying some non-linear transformation g to the scores in the matching objective (for example, a

matched pair with score srp ∈ {2, 3} may be greatly preferred to srp ∈ {0, 1}):

\[
J_{\text{transformed}}(Y, S) = \sum_{r}\sum_{p} g(s_{rp})\, y_{rp}. \tag{5.6}
\]

In this transformed objective Jtransformed, if g is a logistic function then scores are softly "binarized." Note that g(srp) can be evaluated offline and therefore Jtransformed can still be used as the objective of an IP.

Finally, we note that some of these matching objectives can also be incorporated into the suitability prediction model. For example, the nonlinear transformation g can be used directly in the training objective of the learning method (for example, instead of the vanilla RMSE):

\[
C_{\text{LR-TFM}}(S^o) = \frac{1}{|S^o|} \sum_{s_{rp} \in S^o} \left(\hat{s}_{rp} - g(s_{rp})\right)^2, \tag{5.7}
\]
where ŝ_rp denotes the model's prediction of the score s_rp.

5.3 Related Work on Matching Expert Users to Items

We briefly reviewed some of the connections between our work and expertise retrieval in Section 3.3.1.

Here we broaden our horizon and discuss research that deals explicitly with matching expert users with

suitable items, including research from the expertise retrieval field.

Stern et al. [115] propose to use a hybrid recommender system to assign experts to specific tasks. The

tasks are (hard) combinatorial problems and the set of experts comprises different algorithms that may

be able to solve these tasks. The novelty of this approach is that new unseen tasks appear all the time

and so side information—features of the tasks represented by certain properties of the combinatorial

problems—must be used to relate a new task to other tasks and their experts. A Bayesian model is used

to combine collaborative filtering and content-based recommendations [114].

Matching reviewers to papers in scientific conferences has also received some attention [8, 42, 37].

Recently, Conry et al. [29] showed how this domain could benefit from using a collaborative filtering

approach. This approach is conceptually similar to the one used by TPMS in Chapter 3. As a reminder,

imagine that reviewers have assessed their expertise (or preferences) for a subset of papers. CF can be

used to fill in the missing scores (or ratings). Once all scores are known the objective is then to assign

the best reviewers to each paper. Doing so, without further constraints, could create large imbalances between reviewers with broader expertise, say more senior reviewers, and those with narrower expertise. One solution is to place constraints on the maximum number of papers a reviewer can be assigned. Likewise,

papers must be reviewed by a minimum number of reviewers. The final objective function that must be

optimized has constraints across users (reviewers) and items (papers). Ideally, the collaborative filtering

would concentrate on accurately predicting scores corresponding to reviewer-paper pairs that will end

up being matched. This is not easy as the matching objective is typically not continuous. Conry et al.

[29] have studied this problem without integrating both steps.

Rodriguez and Bollen [91] have built co-authorship graphs using the references within submissions in order to suggest initial reviewers. In Karimzadehgan et al. [60], the authors argue that reviewers assigned to


a submission should cover all the different aspects of the submission. They introduce several methods, including ones based on modelling papers and reviewers using topic models, to attain good (topic)

coverage. Our CSTM model (Chapter 4) is similar in the sense that it models reviewers and papers

using a topic model. However, in CSTM reviewer preferences are determined according to the global

suitability of the reviewer, and not explicitly according to his coverage with respect to a paper’s topics.

In follow-up work Karimzadehgan and Zhai [59] and Tang et al. [118] explore similar coverage ideas and

show how they can be incorporated as part of a matching optimization problem (akin to the one we

present in Section 5.2.1).

A second body of work focuses exclusively on the matching problem itself. Benferhat and Lang [8],

Goldsmith and Sloan [42], and Garg et al. [37] discuss various optimization criteria, and some of the

practices used by program chairs and existing conference management software. Taylor [122] shows how

these criteria can be formulated as an IP. Tang et al. [117] propose several extensions to the IP. This

work assumes reviewer suitability for each paper is known, and deals exclusively with specific matching

criteria.

5.4 Empirical Results

We start by describing the data sets used in our experiments. The rest of the section is divided into

three parts. The first considers score predictions with the different learning models. The second turns

to matching quality and explores the soft constraints on the number of papers matched per reviewer.

Finally, the third part evaluates a transformation of the matching objective and shows how using a

transformed learning objective can enhance performance on the transformed matching problem.

5.4.1 Data

Experiments are run using the NIPS-10 data, described in previous chapters, and the NIPS-09 dataset,

from the 2009 edition of the NIPS conference.2 As before, side information for each reviewer comprises a

self-selected set of papers representative of his or her areas of expertise; these were summarized as word

count vectors w^a_r. Side information about submitted papers consisted of document word counts w^d_p for each p. The total vocabulary used by submissions (across both sets) contained over 21,000 words; here we used only the top 1000 words for our experiments, as ranked using TF-IDF (|w_p| = |w_r| = 1000).

Reviewer suitability scores ranged from 0 to 3: 0 means "paper lies outside my expertise;" 1 means "can review if necessary;" 2 means "qualified to review;" and 3 means "very qualified to review." As

discussed above, these scores are intended to reflect reviewer expertise, not desire. We focus on the area

chair (or meta-reviewer) assignment problem, where the matching task is to assign a single area chair

to each paper. We use the term reviewer below to refer to such area chairs.

NIPS-09 comprises 1079 submitted papers and 30 area chairs. Contrary to the procedure followed by NIPS-10 (and ICML-12), reviewer scores were not elicited, but instead provided by the conference

program chairs for every reviewer-submission pair. The mean suitability score was 0.19 (std. dev. 0.57).

A histogram of the scores for each dataset is shown in Figure 5.3.

2See http://nips.cc


Figure 5.3: Observed scores for the two datasets: (a) NIPS-10 and (b) NIPS-09. The figure for NIPS-10 is reproduced here (from Section 3.3) in order to highlight differences in score distribution with NIPS-09. (Each panel is a histogram of the number of scores for each score value 0–3.)

5.4.2 Suitability Prediction Experimental Methodology

We first describe the methodology used to train and test the score prediction models. We do not report

suitability prediction results since Chapter 4 has already highlighted this stage of the framework.

Suitability methods: We reuse some of the models compared in Chapter 4. Namely, we use a method

which only uses side information, (word-)LM (see LM in Section 3.2.1 for details); a pure collaborative

filtering method, BPMF, a Bayesian extension of PMF already reviewed in Section 2.2.1; and LR, a

reviewer-specific linear regression model (see Section 3.2.2). The goal here is not to compare the merits

of the different approaches on a score prediction task but rather to compare their matching performance.

For learning, we are given a set of training instances, Str ≡ So. We split this set into a training and

validation set. The trained model predicts all unobserved scores Su. Since we do not have true suitability

values for all unobserved scores, we distinguish Su as being the union of test instances Ste (for which

we have scores in the data set), and missing instances Sm. LR is trained using a regularized squared

loss (or, equivalently, by assuming a Gaussian likelihood model and Gaussian priors over parameters).

We denote a model’s estimates of the test instances as Ste.

We use 5 different splits of the data in all experiments. In each split, the data is divided into training,

validation and test sets in 60/20/20 proportions. There is no overlap in the test sets across the 5 splits.
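As an illustration of this protocol, the following sketch partitions the observed reviewer-paper pairs into 5 disjoint test folds and divides the remainder of each fold into training and validation portions, giving 60/20/20 proportions overall. This is a minimal sketch with illustrative names, not the exact code used in the experiments.

```python
# A minimal sketch of the 60/20/20 splits with disjoint test sets across 5 folds.
# `pairs` is a list of observed (reviewer, paper) index pairs; names illustrative.
import numpy as np

def make_splits(pairs, n_splits=5, seed=0):
    rng = np.random.default_rng(seed)
    pairs = np.array(pairs)
    order = rng.permutation(len(pairs))
    test_folds = np.array_split(order, n_splits)   # disjoint ~20% test sets
    splits = []
    for test_idx in test_folds:
        rest = np.setdiff1d(order, test_idx)
        rng.shuffle(rest)
        n_val = len(test_idx)                      # another ~20% for validation
        splits.append({"train": pairs[rest[n_val:]],
                       "valid": pairs[rest[:n_val]],
                       "test": pairs[test_idx]})
    return splits
```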

Training LR is naturally slightly faster than training BPMF3, for which we used 330 MCMC samples

including 30 burn-in samples, but both methods can be trained in a few minutes on both of our data

sets.

5.4.3 Match Quality

We now turn our attention to the matching framework. We first elaborate on how we perform the

matching. We then evaluate the performance of the different learning methods on the matching objec-

tive. Finally we introduce soft constraints into the matching objective and analyze the trade-offs they

3We use an implementation of BPMF provided by its authors and currently available at http://www.cs.toronto.edu/~rsalakhu/BPMF.html


             train/validation   test   missing
Matching     Str                Ŝte    Sm = τ
Evaluation   Str                Ste    Sm = τ

Table 5.1: Overview of the matching/evaluation process.

introduce.

Matching: Experimental Procedures

The matching IPs discussed above assume access to fully known (or predicted) suitability scores. Since

we learn estimates of the unknown scores, we denote a model's estimates of the test instances as Ŝte,

and impute a value for all suitability values that are missing, using a constant imputation of τ ∈ R.

Since missing scores are likely to reflect, on average, lower suitability than their observed counterparts,

we use τ = 1 in all experiments (NIPS-10’s mean is 1.1376 and NIPS-09 has no missing scores).

Given the estimate Ŝte computed by one of our learning methods, we perform a matching with S = Str ∪ Ŝte ∪ (Sm = τ). Note that this permits missing values to be matched, which is important in

the regime where few suitability scores are known. Table 5.1 summarizes this procedure. For data set

NIPS-10 we set Pmin and Pmax to 20 and 30, respectively, while the range is 30–40 for data set NIPS-09.4

Baseline: We adopt a baseline method that provides an absolute comparison across methods. The

baseline has access to Str and imputes τ for any element of Ste. To allow meaningful comparison to

other methods, it employs the same imputation for missing scores, Sm = τ .

A note on LM: Although the output of LM can be directly used for matching, it does not exploit

observed suitabilities in its usual formulation. However, LM can make use of some of the training data Str by incorporating submitted papers assessed as "suitable" by some reviewer r into his or her word vector w^a_r. Specifically, we include all papers in w^a_r for which r offered a score of 3 (only if this score is in Str).

For all methods, once an optimal match Y ∗ is found, we evaluate it using all observed and unobserved

scores, with the same constant imputation for the missing scores, where match quality is measured using

Jbasic (see Equation 5.1):
\[
J_{\text{basic}}\big(Y^{*},\; S^{tr} \cup S^{te} \cup (S^{m} = \tau)\big) \;=\; \sum_{r}\sum_{p} y^{*}_{rp}\, s_{rp}. \tag{5.8}
\]
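A minimal sketch of this evaluation, under the assumption that missing entries are marked with NaN in the score matrix, is shown below; the array names are illustrative.

```python
# A minimal sketch of the evaluation in Equation 5.8 (names are illustrative).
# `S_true` holds observed scores (train + test) with NaN for missing entries;
# `Y_star` is the 0/1 match matrix returned by the matching IP; tau = 1.
import numpy as np

def evaluate_match(Y_star, S_true, tau=1.0):
    S_eval = np.where(np.isnan(S_true), tau, S_true)  # impute missing scores
    return float((Y_star * S_eval).sum())             # J_basic of the match
```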

Matching Performance using Basic IP

We now report on the quality of the matchings that result from using the predictions of the different

methods. Similarly to the preference prediction experiments in Chapter 4, we consider dynamic matching performance as the amount of training data per user increases. Note that the optimal match value is 3053 for NIPS-10 and 2172 for NIPS-09, which occurs when Ŝte = Ste.

Figure 5.4 shows how matching quality varies as the amount of training data per user increases in

NIPS-10. Since training scores are also observed at matching time (Equation 5.8), all methods benefit

from having a larger training set. Figure 5.4 leads to the following three observations. Firstly, when no

observed data is available (i.e., when using only the archive) LM does very well, with a matching score

of 2247 ± 32, nearly identical to the quality of LR and BPMF with 10 suitabilities per user, and much

4These represent typical ranges for members of the senior program committee.


Figure 5.4: Performance on the matching task on the NIPS-10 dataset. (Plot: matching objective as a function of training set size per user, for LR, BPMF, LM, and Baseline.)

better than the match quality of 1262 obtained using constant scores ((Ste ∪ Sm) = τ). Secondly, when

very few scores are available, LR and LM perform best (and do equally well). As mentioned above, LM

is able to exploit observed suitabilities by adding relevant papers to the user corpus, but this attenuates

the impact of elicited scores: we see LM is outperformed by all other methods when sufficient data is

available. Thirdly, LR outperforms all other methods as data is added. We also see that as the number

of observed scores increases, unsurprisingly, the gain in matching performance (value of information)

from additional scores decreases.

It is also interesting to note that a total matching score of over 2500 implies that, on average, each

reviewer is assigned papers on which her average preference is greater than 2 (out of 3). LR reaches this

level of performance with less than 30 observed scores per user, while other methods need 30% more

data per user to reach the same level of performance.

Further insight into matching quality on NIPS-10 induced by the different learning methods can be

gained by examining the distribution of scores associated with matched papers (Figure 5.5) or under

different sizes of the training set (Figure 5.6). Figure 5.5 displays the number of scores of each value (0–3)

that get assigned with a training set size of 40. Not surprisingly, LR and BPMF assign significantly more

2s and 3s combined than all other methods. LM is very good at picking the top scores which reinforces

the fact that word-level features, from reviewer and submitted papers, contain useful information for

matching reviewers. Similar results were obtained on NIPS-09 and thus LM’s performance is not simply

a consequence of the data collection method used for NIPS-10. In addition, Baseline assigns few zeros,

since all missing and test scores are imputed to be τ = 1.

Figure 5.6 provides another perspective on assignment quality. Here we plot results for the best

performing method, LR, on both NIPS-10 and NIPS-09, for 3 different training set sizes. We first note

that the extreme imbalance in the distribution over scores of NIPS-09 leads LR to assign many zeros


Figure 5.5: Assignments for NIPS-10 by score value when using 40 training examples per user. (Bar plot: average number of assignments for each score value 0–3, grouped by method: Baseline, LM, BPMF, LR.)

even with 80 training scores per user. Overall, both data sets show that as the number of training scores

increases, more 2s and 3s, and fewer 0s and 1s, are assigned.

Our remaining results deal exclusively with NIPS-10 since experimental results with NIPS-09 were

similar.

Load Balancing using the Balance IP

The experiments above all constrain the number of papers per reviewer to be within a specific range

(Pmin, Pmax). There is no good indication of how to set these two limits. Instead, we now use the

Balance IP, both for matching and evaluation (see Equation 5.8), setting f to be the absolute value

function. The resulting problem cannot be expressed directly as an LP. However, we can use a standard

procedure that involves adding auxiliary variables. Specifically, the Balance IP with f the absolute value function can be solved as the following optimization problem (for clarity we omit the constraints on yrp):

\[
\begin{aligned}
\text{maximize}\quad & J_{\text{balance}}(Y, S) = \sum_{r}\sum_{p} s_{rp}\, y_{rp} - \sum_{r} \lambda\, t_r &&\\
\text{subject to}\quad & \Big(\sum_{p} y_{rp}\Big) - \bar{y} \;\le\; t_r, \quad \forall r && (5.9)\\
& -\Big(\Big(\sum_{p} y_{rp}\Big) - \bar{y}\Big) \;\le\; t_r, \quad \forall r && (5.10)
\end{aligned}
\]

where the auxiliary variables are denoted as tr. We have added two sets of constraints (Equations 5.9 and 5.10), only one of which will be active at a time for each reviewer r, that ensure that tr is lower bounded by the absolute value of (∑p yrp) − ȳ.


Figure 5.6: Comparison of the distribution of assignments by score value given different numbers of observed scores. (Panels (a)–(c): NIPS-10 with 10, 40, and 86 training examples per user; panels (d)–(f): NIPS-09 with 10, 40, and 80 examples per user. Each panel shows the average number of assignments per score value 0–3.)

λ          0      0.1    0.25   0.5    0.75   1
Jbasic     2625   2615   2600   2573   2569   2569
Variance   4.62   3.28   2.61   0.89   0.37   0.33

Table 5.2: Comparison of the matching objective versus within-reviewer variance of the number of assigned papers as a function of λ.

Since tr can only decrease the value of the objective, it will then be exactly equal to either the left-hand side of Equation 5.9 or the left-hand side of Equation 5.10.
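For illustration, the sketch below constructs this linearized Balance IP and hands it to a generic mixed-integer solver; it keeps the per-paper reviewer constraint of Equation 5.3 so that the match remains well defined. It assumes SciPy's milp (SciPy 1.9 or later) is available and uses the same score-matrix layout as the earlier Basic IP sketch; all names are illustrative.

```python
# A minimal sketch of the Balance IP (Equations 5.9-5.10): the absolute-value
# load penalty is linearized with one auxiliary variable t_r per reviewer.
# `lam` is the trade-off parameter lambda; names are illustrative assumptions.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def balance_match(S, r_target, lam):
    R, P = S.shape
    n_y = R * P
    y_bar = r_target * P / R                     # average papers per reviewer
    # Variables: [y (flattened R x P), t (one per reviewer)]; milp minimizes.
    c = np.concatenate([-S.ravel(), lam * np.ones(R)])

    # Each paper receives exactly r_target reviewers (Equation 5.3).
    A_paper = np.zeros((P, n_y + R))
    for p in range(P):
        A_paper[p, p:n_y:P] = 1.0
    cons = [LinearConstraint(A_paper, r_target, r_target)]

    # Linearized absolute-value penalty (Equations 5.9 and 5.10):
    #   sum_p y_rp - t_r <= y_bar   and   -sum_p y_rp - t_r <= -y_bar.
    A_load = np.zeros((R, n_y))
    for r in range(R):
        A_load[r, r * P:(r + 1) * P] = 1.0
    cons.append(LinearConstraint(np.hstack([A_load, -np.eye(R)]), -np.inf, y_bar))
    cons.append(LinearConstraint(np.hstack([-A_load, -np.eye(R)]), -np.inf, -y_bar))

    integrality = np.concatenate([np.ones(n_y), np.zeros(R)])  # y binary, t continuous
    bounds = Bounds(np.zeros(n_y + R),
                    np.concatenate([np.ones(n_y), np.full(R, np.inf)]))
    res = milp(c, constraints=cons, integrality=integrality, bounds=bounds)
    Y = np.round(res.x[:n_y]).reshape(R, P)
    return Y, -res.fun                           # match matrix and J_balance value
```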

Figure 5.7 shows the histogram of assigned papers per reviewer given by the optimal solution to

the IP for different λ ∈ {0, 0.1, 1}. Experimentally, when λ = 0 load equity is ignored, and almost all reviewers are assigned either the minimum (Pmin) or the maximum (Pmax) number of papers (this is due to certain reviewers having high expertise for more papers than others); the variance of the reviewer loads, ∑r((∑p yrp) − ȳ)²/N, is extremely high. When a "soft constraint" on load equity is introduced, assignments

become more balanced as λ increases (i.e., the balance constraint becomes "harder"). Table 5.2

reports the matching objective versus the variance, averaged across users, for different values of λ with

a training set size of 40 (other training sizes yielded similar results). Not surprisingly, larger penalties

λ for deviating from the mean reviewer load give rise to greater load balance (lower load variance) and

worse matching performance: Table 5.2 shows that the best matching, in this experiment, had a value

of 2625 and a load variance of 4.62, while the worst matching had a value of 2569 and a load variance

of 0.37. Generally, the conference organizers will choose an appropriate λ that nicely trades off performance versus load balance across reviewers (here, perhaps around λ = 0.5). In practice it is possible that the organizers will have to further examine the assignments (that is, beyond looking only at the matching objective) before selecting an appropriate value for λ.


Figure 5.7: Histograms of the number of papers matched per reviewer with different values of λ (λ = 0, 0.1, 1). The leftmost plot shows results with only hard constraints on reviewer loads (λ = 0); the others also include a soft constraint minimizing load variance. The corresponding matching values for different λ values are reported in Table 5.2.

5.4.4 Transformed Matching and Learning

We now consider a non-linear transformation of the scores, reflecting the view that it is much better

to assign reviewer-paper pairs with suitabilities of 2 and 3, than pairs with 0 and 1; as discussed above

this can be accomplished by allowing the "utility" of a match yrp to be non-linear in the suitability score srp. We adopt the following sigmoid function to effect this non-linear transformation: σ(s) = 1/(1 + exp(−(s − 1.5)β));

here 1.5 is the middle of the scores’ range. We set β = 4.5, which gives: σ(0) = 0.001; σ(1) = 0.095;

σ(2) = 0.90; σ(3) = 1.0. We first show how this transformation impacts matching performance without

learning; then we discuss how one can incorporate the transformation into the learning objective itself.
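The sketch below reproduces this transformation and the corresponding transformed training loss of Equation 5.7; the predicted-score array stands in for the output of a learning method such as LR, and the names are illustrative.

```python
# A minimal sketch of the sigmoidal score transformation (beta = 4.5, centred at
# 1.5) and the transformed squared loss of Equation 5.7. Names are illustrative.
import numpy as np

def g(s, beta=4.5, centre=1.5):
    return 1.0 / (1.0 + np.exp(-(s - centre) * beta))

def transformed_loss(s_pred, s_obs):
    # Mean squared error between predictions and sigmoid-transformed labels.
    return float(np.mean((s_pred - g(s_obs)) ** 2))

# g(0), g(1), g(2), g(3) are approximately 0.001, 0.095, 0.905, 0.999,
# softly "binarizing" the 0-3 suitability scale.
```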

We first test how matching using the transformed objectives affects results without using learning

to infer missing scores (consequently, Su = τ), by examining differences in matching performance when varying the percentage of observed scores. Figure 5.8(a) shows the difference when matching with the transformed objective (Jtransformed) versus the basic objective (Jbasic). In both cases the resulting matches are evaluated using Jtransformed. Although a minor gain is observed when most of the known data

is observed, there is, overall, very little difference in performance when matching with either objective.

Recall that the mean number of scores per paper is less than 4. Hence, when matching using a small

fraction of the data, the matching procedure has very little flexibility to assign high scoring pairs unless

learning is used to predict unobserved scores.

We can modify the learning objective to take into account the nonlinearity introduced in the matching

objective. We do this by transforming all labels using the same sigmoidal transformation as in the

matching objective (Equation 5.7). This allows learning to better predict the transformed scores by

explicitly training on them.

Figure 5.8(b) shows the transformed matching performance of both LR on the non-transformed

data, and LR-TFM, a linear regression model trained using the transformed learning objective. Not

surprisingly, LR-TFM outperforms LR across all training set sizes, since it is trained for the modified

objective J transformed . The difference is especially pronounced with smaller training sets—when enough

data is available, both methods will naturally assign many 2s and 3s. (We also verified that LR-TFM

outperforms BPMF trained on the transformed objective).


Figure 5.8: Performance on the transformed matching objective on NIPS-10. (a) Comparing performance of the original (Jbasic) and transformed (Jtransformed) matching objectives, without learning, as a function of the fraction of known scores. (b) Comparing performance of original (LR) and transformed (LR-TFM) learning using the transformed matching objective, as a function of training set size per user.

5.5 Conclusion and Future Opportunities

We have instantiated the recommendation stage of our framework as an assignment procedure between

papers and reviewers. We showed that, when only a small subset of reviewer scores is elicited, inferring the unobserved scores using one of several learning methods allows us to determine high-quality matchings. We explored the trade-off between matching quality and paper load balancing, which helps one avoid the need to manually set limits on the reviewer load. Finally, we showed that, under the realistic assumption that utility is non-linear in the suitability score, we discover better matches by using the same nonlinear transformation in the learning objective.

Given how matching benefits from an interaction with learning, a next step would be to develop ways

to strengthen this interaction by making the learning methods sensitive to the final matching objective.

We discuss such interactions for active learning—where the system chooses which reviewer scores to

query—in Chapter 6. Another possible path for future research is to explore optimization models for

different recommendation problems and to more formally define the circumstances in which adapting

the learning loss to the final objective provides an advantage.

Finally, as we outlined in Section 5.3, several researchers have looked at possible matching constraints, which could be useful for the reviewer-to-paper matching task, using matching formulations similar to ours. Based on our experience with TPMS, we are also interested in exploring more expressive matching constraints that would give conference organizers more flexibility in expressing their preferences over assignments, but which may not fit within our proposed LP framework. For example, in practice, assigning complementary reviewers (one reviewer from pool A and one reviewer from pool B) to a submission is often sought by conference organizers. For such cases, mapping the matching optimization to one of performing inference in an undirected graphical model (see Section 6.2.2) and making use of high-order potentials (see, for example, [119]) could offer interesting opportunities.

In our work we have used the value of the matching objective, which is based on scores elicited prior

to assignment, to evaluate assignment quality. A possibly stronger method of evaluation would be to


re-evaluate the expertise of reviewers after they have reviewed their assigned papers. We could elicit

reviewer expertise from the reviewers themselves (for example by using the confidence of their review).

Alternatively, senior program committee members could be asked to evaluate the expertise of reviewers

based on their reviews. Another advantage of performing post-hoc evaluation is that it would enable

comparison between our system and other ways of assigning papers to reviewers (either manually or

using other more automated procedures). Exploring different evaluations also relates to our discussion

about evaluation of TPMS in Section 3.5.


Chapter 6

Task-Directed Active Learning

Chapter 5 demonstrated the importance of effective learning of user preferences in matching problems.

Equally important is the question of query selection, which has the potential to further reduce the amount

of preference information users must provide and hence limit the impact of the cold-start problem.

Eliciting preferences (e.g., in the form of item ratings) imposes significant time and cognitive costs

on users. In domains such as paper matching, product recommendation, or online dating, users will have

limited patience for specifying preferences. While learning techniques can be used to limit the amount

of required information in match-constrained recommendation, the intelligent selection of preference

queries will be critical in reducing user burden. It is this problem we address in this chapter. We frame

the problem as one of active learning: our aim is to determine those preference queries with the greatest

potential to improve the quality of the matching. This is a departure from most work in active learning,

and, specifically, approaches tailored to recommender systems (as we discuss below), where queries are

selected to improve the overall quality of ratings prediction. We develop techniques that focus on queries

whose responses will impact—possibly indirectly by changing predictions—the matching quality itself.

We also propose a new probabilistic matching technique that accounts for uncertainty in predicted

preferences when constructing a matching. Finally, we test our methods on several real-life data sets

comprised of preferences for online dating, conference reviewing, and jokes. Our results show that active

learning methods that are tuned to the matching task significantly outperform a standard active learning

method. Furthermore, we show that our probabilistic methods can be successfully leveraged in active

learning.

6.1 Related work

Active learning is a rich field which we surveyed in the context of recommender systems in Section 2.3.

With respect to previous research into active learning directed at other tasks, we are aware of one

relevant study which explored active learning in matching domains. Rigaux [89] considers an iterative

elicitation method for paper matching using neighbourhood CF, but requires an initial partitioning of

reviewers, and elicits scores for the same papers from all reviewers in a partition (with the aim only

of improving score prediction quality). In our work, we need not partition users, and we focus on

optimizing matching quality rather than prediction accuracy. Our approach is thus conceptually similar

to CF methods trained for specific recommendation tasks (Section 2.2.2 surveyed this work).


Figure 6.1: Elicitation in the match-constrained recommendation framework. As shown by the arrow from the matching to the elicitation, the elicitation can be sensitive to the matching objective. (Diagram: elicitation of user preferences produces the stated preferences, which are completed with predicted preferences and passed to the matching that produces the assignments.)

We also note that Bayesian optimization has been used for active learning recently [22]; however,

these methods assume a continuous query space and some similarity metric over item space, hence are not

readily adaptable to our match-constrained problems. As such, we will not explore the use of Bayesian

optimization in this thesis.

6.2 Active Learning for Match-Constrained Recommendation Problems

Our framework is depicted, specifically for the purposes of this chapter, in Figure 6.1. Preference elicitation corresponds to the first stage of our framework. Designing an elicitation process that exploits feedback from the matching stage is the focus of this chapter. The active learning methods that we

develop are in line with some of the previous work on active learning (see Section 2.3) and especially the

work which is guided by prediction uncertainty (see Section 2.3.1).

Once elicited, preferences can be used to predict missing preferences using one of the models described

in the previous chapters. We use Bayesian PMF (described in Section 2.2.1). The reason for choosing

this model will become clear in the next section.

For matching, we reuse the IP presented in the previous chapter, with the Jbasic objective and

constraints on the number of papers per reviewer and on the number of reviewers per paper:

\[
\begin{aligned}
\text{maximize}\quad & J(Y, S) = \sum_{r}\sum_{p} s_{rp}\, y_{rp} && (6.1)\\
\text{subject to}\quad & y_{rp} \in \{0, 1\}, \quad \forall r, p && (6.2)\\
& \sum_{r} y_{rp} \ge R_{\min}, \quad \sum_{r} y_{rp} \le R_{\max}, \quad \forall p && (6.3)\\
& \sum_{p} y_{rp} \ge P_{\min}, \quad \sum_{p} y_{rp} \le P_{\max}, \quad \forall r. && (6.4)
\end{aligned}
\]

We also reuse some of our previously-defined notation: we denote the set of observed suitabilities by So, and denote the observed scores for a particular user r and item p by So_r and So_p, respectively. Su, Su_r, and Su_p are the analogous collections of unobserved scores.


6.2.1 Probabilistic Matching

Model uncertainty is at the heart of several active learning techniques (see Section 2.3). For example,

a successful querying strategy such as uncertainty sampling (Section 2.3.1) directly aims at reducing

model uncertainty by querying scores over which the model has the most uncertainty. Our aim is

to extend such strategies to matching problems. In this section we will introduce a novel method

for determining probabilistic matchings which can use score uncertainty as a way to determine the

probability of a user-item pair being matched.

While the IP optimization is straightforward, and provides optimal solutions when all scores are

observed, it has potential drawbacks when used with predicted scores, and specifically, when used in

conjunction with active learning. First, the IP does not consider potentially useful information contained

in the uncertainty of the (predicted) suitabilities. Namely, because of the uncertainty of the (predicted)

suitabilities, we are uncertain of the quality of the IP’s assignment. Evaluating how score uncertainty

affects the IP’s assignment may allow us to reduce the quality uncertainty. Second, the IP does not

express the range of possible matches that might optimize total suitability (given the constraints).

While optimal matching given true scores can be viewed as a deterministic process, score prediction

is inherently uncertain; and we can exploit this if our prediction model outputs a distribution over

unobserved scores Su rather than a point estimate. Different score values supported by the uncertainty

model may lead to different assignments. The number of assignments in which a particular user-item pair appears is indicative of matching uncertainty (resulting from the score uncertainty). Given inputs

consisting of observed scores So and possibly additional side information X, we can express uncertainty

over a match Y ′ as:

\[
\Pr(Y = Y' \mid S^o, X, \theta) = \int \delta_{Y'}\!\big(Y^{*}(S^u, S^o)\big)\, \Pr(S^u \mid S^o, X, \theta)\, dS^u,
\]

where Pr(Su|So, X, θ) is our score prediction model (assuming model parameters θ), Y∗(·) (see Equation 6.1) is the optimal match matrix given a fixed set of scores, and δY′ is the Dirac delta function with mode Y′. Using a similar idea, we can formulate a probabilistic model over individual marginals:

\[
\Pr(y_{rp} = 1 \mid S^o, X, \theta) = \int \delta_{1}\!\big(y^{*}_{rp}(S^u, S^o)\big)\, \Pr(S^u \mid S^o, X, \theta)\, dS^u, \tag{6.5}
\]
where y∗_rp is the user-item (r, p) entry of Y∗.

With this in hand, we overcome the limitations of pure IP-based optimization by developing a sam-

pling method for determining “soft” or probabilistic matchings that reflect the range of optimal matchings

given uncertainty in predicted suitabilities. While Equation 6.5 expresses the induced distribution over

marginals, the integral is intractable as it requires solving a large number of matching problems (in this

case solving IPs). Instead we take a sampling approach: we independently sample each score from the

posterior Pr(Su|So, X, θ) to build a complete score matrix, then solve the matching optimization (IP)

using this sampled matrix. Repeating this process T times provides an estimated distribution over op-

timal matchings. We can then average the resulting match matrices, obtaining Ȳ = (1/T) ∑_{t=1}^{T} Y^(t), where Y^(t) is the t'th matching. Each entry Ȳ_rp is the (estimated) marginal probability that user-item pair rp is matched; and the probability of this match depends, as desired, on the distribution Pr(srp|So, X, θ).
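A minimal sketch of this sampling procedure is shown below, reusing the basic_match function from the Basic IP sketch in Section 5.2.1. It assumes, for illustration only, that the predictive posterior over each unobserved score is summarized by a mean and a standard deviation and can be sampled as a Gaussian; all names are illustrative.

```python
# A minimal sketch of the sampling approach to probabilistic matching
# (Section 6.2.1), reusing the basic_match LP sketch from Section 5.2.1.
# `S_mean` and `S_std` are predictive means and standard deviations of the
# scores (e.g., from BPMF); observed entries can be given near-zero std.
import numpy as np

def probabilistic_match(S_mean, S_std, r_target, p_min, p_max, T=50, seed=0):
    rng = np.random.default_rng(seed)
    Y_bar = np.zeros_like(S_mean)
    for _ in range(T):
        S_sample = rng.normal(S_mean, S_std)        # one complete score matrix
        Y_t, _ = basic_match(S_sample, r_target, p_min, p_max)
        Y_bar += Y_t
    return Y_bar / T   # entry (r, p) estimates Pr(y_rp = 1)
```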

Fig. 6.2 illustrates Ȳ, comparing it to the IP solution, on a randomly-generated "toy" problem with

3 reviewers and 6 papers (the match is constrained to exactly 1 reviewer per paper and 2 papers per


Figure 6.2: A "toy" example with a synthetic score matrix S with 3 reviewers and 6 papers. Each reviewer must be matched to exactly two papers while papers must each be matched to a single reviewer. Matching results using the IP (bottom-left), the Ȳ approximation with two different variance matrices, and the loopy belief propagation matching formulation (Z, bottom-right).

reviewer). Assuming a fixed predicted score matrix S, two versions of Ȳ are shown, one when all estimated variances are low (Ȳ_low), the other when they are higher (Ȳ_high).1 Note that the Ȳ matrices respect the matching constraints by design (for visualization purposes we round matching probabilities). Ȳ_low agrees with the IP, but for Ȳ_high, we observe the inherent uncertainty in the optimal matching;

e.g., column one shows all three match probabilities to be reasonably high. In addition, the last column

shows that even though the second and third users have scores that differ by 2 on the sixth paper, the

high variance in their scores gives both users a reasonable probability of being matched to that paper.

6.2.2 Matching as Inference in an Undirected Graphical Model

We have introduced a method for propagating the score uncertainty into matching uncertainty. There

is however another source of uncertainty which we have not yet discussed: the inherent uncertainty over

the range of possible matchings. Given a probabilistic model over possible matchings, the matching IP

returns the single most likely (binary) solution (that is, the optimal solution). There are, independent

of score uncertainty, possibly other good matches which, while being dominated by the optimal, may

provide useful information for active learning.

Tarlow et al. [121] model the matching problem using an undirected graph where (binary) nodes

correspond to assignment variables yrp (in other words, there is a one-to-one mapping between nodes and

paper-reviewer pairs). Nodes have singleton potentials to denote their corresponding score (srp) and high-

order cardinality potentials enforce the reviewer and paper constraints (Equations 6.3 and 6.4). Tarlow

et al. [121] present an efficient approximate-inference algorithm (based on loopy belief propagation [83])

for computing marginal probabilities (Pr(yrp = 1)) for such models. The probability of a marginal

represents the weighted (approximate) number of times that the particular entry is part of a match over

all valid matches. Each match is weighted by the match quality (the objective of the IP, Equation 6.1).

In the rest of this chapter we refer to the matching marginal obtained using this method as the loopy-BP

1Variances are sampled uniformly at random; in a real problem they would be given by the prediction model.


matching and denote the resulting matrix of match marginals as Z (compared to Y for matching variables

from the IP). Figure 6.2 presents the results of applying loopy-BP matching to a “toy” problem. We note

that on this small problem the solution exploiting matching uncertainty Z is similar to Ȳ_high, the solution

exploiting score uncertainty. This result is reasonable since different sampled score instantiations may

end up exploring a range of matches that have high weight under loopy-BP matching. In other words,

small perturbations around the mean of the predicted scores may result in IP assignments that are close to one another and coincide with the assignments that are given high probability by loopy-BP.

6.3 Active Querying for Matching

Little work has considered strategies for actively querying the “most informative” preferences from users.

In combination with supervised learning, active querying can further reduce the elicitation burden on

users. Random selection of user-item pairs for assessment will generally be sub-optimal, since query

selection is uninformed by the learned model, the objective function, or any previous data. By contrast,

an active approach, in which queries are tailored to both the current preference model and the current

best matching, will typically give rise to better matchings with fewer queries.2

In this section we describe several distinct strategies for query selection: we review a standard active

learning technique and introduce several novel methods that are sensitive to the matching objective.

Our methods can be broadly categorized based on two properties (which we use to label the different

methods): whether they select queries by evaluating their impact in score space S or in matching space

Y or Z; and whether they select queries with the maximal value M, or maximal entropy E.

S-Entropy (SE)

Uncertainty sampling is a common approach in active learning, which greedily selects queries involving

(unobserved) user-item pairs for which the model is most uncertain [102]. In our context, this corresponds

to selecting the user-item pair with maximum score entropy w.r.t. the score distribution produced by

the learned model. The rationale is clear: uncertainty in score predictions may lead to poor estimates

of match quality. Of course, this approach fails to explicitly account for the matching objective (the

term Y∗(Su, So) in Equation 6.5), instead focusing (myopically) on entropy reduction in the predictive

model (the term Pr(Su|So, X, θ)). Queries that reduce prediction entropy may have no influence on the

resulting matching. For example, if su_rp has high entropy, but a much lower mean than some "competing" su_r′p, then user r′ may remain matched to p with high probability regardless of the response to query rp.

S-Max (SM)

An alternate, yet still simple, strategy is to select queries involving user-item pairs with the highest predicted scores, i.e., the unobserved pairs that the model deems most likely to take the maximal score value:
\[
\arg\max_{(r,p) \in S^u} \Pr\big(s_{rp} = s_{\max} \mid S^o, X, \theta\big),
\]

with smax being the highest possible score value (for example, in a particular data set). This may be

especially advantageous for matching problems where, all else being equal, high scores are more likely

2In our settings one can elicit a rating or suitability score from a user for any item (e.g., paper, date, joke); so the full set Su_r serves as potential queries for user r.


to be assigned (see Equation 6.1).

SM's insensitivity to the matching objective means that it shares the same obvious shortcoming as SE.

One remedy is to use expected value of information (EVOI) to measure the improvement in matching

quality given the response to a query (taking expectation over predicted responses). This approach,

which we reviewed in Section 2.3.4, has been used effectively in (non-constrained) CF [18]; but EVOI is

notoriously hard to evaluate. In our context, we would (in principle) have to consider each possible query

rp, estimate the impact of each possible response so_rp on the learned model (the term Pr(Su|So, X, θ) in Equation 6.5), and re-solve the estimated matching (the term Y∗(Su, So) in Equation 6.5). Instead, we

consider several more tractable strategies that embody some of the same intuitions.

Y-Max (YM)

A simple way to select queries in a match-sensitive fashion is to consider the solution returned by the

IP w.r.t. the observed scores, So, and the MAP solution of the unobserved scores, Su. We query the

unknown pair rp that contributes the most to the value of the objective:

\[
\arg\max_{(r,p) \in S^u} y_{rp}\, s_{rp},
\]

where yrp ∈ Y(So, Ŝu) is the binary match value for user r and item p, and srp is the corresponding

MAP score value. In other words, we query the unobserved pair among those actually matched with

the highest predicted score. We refer to this strategy as Y-Max (YM). It reflects the intuition that

we should either confirm or refute scores for matched pairs, i.e., those pairs that, under the current

model, directly determine the value of the matching objective. However notice that YM is insensitive

to score uncertainty. So, for example, it may query unobserved scores whose predictions have very high

confidence, despite the fact that such queries are highly unlikely to provide valuable information.

Ȳ-Max (ȲM)

As a remedy to YM's problem, ȲM exploits our probabilistic matching model to select queries. As with YM, ȲM queries the unobserved pair rp that contributes the most to the objective value:
\[
\arg\max_{(r,p) \in S^u} \bar{Y}_{rp}\, s_{rp}.
\]

The difference is that we use the probabilistic match, exploiting prediction uncertainty in query selection.

Ȳ-Entropy (ȲE)

This method exploits the probabilistic match Ȳ as well, but unlike ȲM, ȲE queries unknown pairs whose entropy in the match distribution is greatest. Specifically, we view each Yrp as a Bernoulli random variable with (estimated) success probability Ȳ_rp. We then query the pair with maximum match entropy:
\[
\arg\max_{(r,p) \in S^u} \Big[ -\bar{Y}_{rp} \log \bar{Y}_{rp} - \big(1 - \bar{Y}_{rp}\big) \log\big(1 - \bar{Y}_{rp}\big) \Big].
\]


Z-Max (ZM)

In this method we exploit the matching marginals of Z with maximal-value query selection:

\[
\arg\max_{(r,p) \in S^u} z_{rp}\, s_{rp},
\]

where zrp ∈ Z(So, Su) is the probabilistic match value for user r and item p according to loopy-BP

matching introduced in Section 6.2.2. We also experimented with a strategy that considered both score uncertainty and matching uncertainty (Z̄M), but initial results were not better than those of ZM. We

hypothesize that solving the IP using different sets of sampled scores is an alternate method of exploring

the range of good matches.

One important point to note is that the match-sensitive strategies YM, ȲM, and ȲE all attempt to query unobserved pairs that occur (possibly stochastically) in the optimal match. When the IP does not match on any unobserved pairs, a fall-back strategy is needed. All three strategies resort to random querying as a fall-back, selecting a random unobserved item score for any specific user as the query. For ȲM and ȲE, we further consider all queries that correspond to a user-item pair with less than a 1% chance of being matched to be "random" queries.
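To make the strategies concrete, the sketch below implements SE, ȲM, and ȲE on top of the match marginals Ȳ from Section 6.2.1. For a Gaussian predictive distribution the score entropy is monotone in the predictive variance, so SE is implemented here by picking the pair with the largest predictive standard deviation; this reduction, and all array names, are illustrative assumptions.

```python
# A minimal sketch of three query-selection strategies from Section 6.3.
# `S_mean`, `S_std`: predictive means / std. devs. of unobserved scores;
# `Y_bar`: probabilistic match marginals (e.g., from probabilistic_match);
# `unobserved`: boolean mask of candidate (user, item) pairs. Names illustrative.
import numpy as np

def _best(score, mask):
    score = np.where(mask, score, -np.inf)          # restrict to candidate pairs
    return np.unravel_index(np.argmax(score), score.shape)

def query_SE(S_std, unobserved):
    # Uncertainty sampling: for Gaussian predictions, entropy is monotone in the
    # variance, so pick the pair with the largest predictive std. dev.
    return _best(S_std, unobserved)

def query_YbarM(S_mean, Y_bar, unobserved):
    # Ȳ-Max: unobserved pair contributing most to the (probabilistic) objective.
    return _best(Y_bar * S_mean, unobserved)

def query_YbarE(Y_bar, unobserved, eps=1e-12):
    # Ȳ-Entropy: pair whose Bernoulli match marginal has maximum entropy.
    p = np.clip(Y_bar, eps, 1 - eps)
    ent = -p * np.log(p) - (1 - p) * np.log(1 - p)
    return _best(ent, unobserved)
```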

6.4 Experiments

We test the active learning approaches described above on three data sets, each with very different

characteristics. We begin with a brief description of the data sets and matching tasks, then describe our

experimental setup, before proceeding to a discussion of our results.

6.4.1 Data Sets

We first describe our three data sets and define the corresponding matching tasks.

Jokes data set: The Jester data set [41] is a standard CF data set in which over 73,000 users have

each rated a subset of 100 jokes on a scale of -10 to 10. It has a dense subset in which all users rate ten

common jokes. Our experiments use a data set consisting of these ten jokes and 300 randomly selected

users.3 We convert this to a matching problem by requiring the assignment of a single joke to each

user (for example, to be told at a convention or conference), and requiring that each joke be matched

to between 25 and 35 users (to ensure "jocular diversity" at the convention). Fig. 6.3(a) provides a histogram of the suitabilities for the Jester sub-data set.

Conference data set: This is the data set derived from the NIPS 2010 conference that we have been

using across the different experimental sections of this thesis. As previously reported, the suitabilities

for a subset of papers were elicited in two rounds. In the first round scores were elicited for about 80

papers per reviewer, with queries selected using the YM procedure described above (where the initial

scores were estimated using a word-LM, see Section 3.2.1, using reviewers’ published papers).

Dating data set: The third data set comes from an online dating website.4 It contains over 17 million

ratings from roughly 135,000 users of 168,000 items (other users). We use a denser subset of 32,000

ratings from 250 users (each with at least 59 ratings) over 250 items (other users); see Figure 6.3(b) for

3This data set is derived from the 18,000 subset of Dataset 1 presented in Goldberg et al. [41]. Documentation for the data sets is available online, currently at http://eigentaste.berkeley.edu/dataset/

4See http://www.occamslab.com/petricek/data/


Figure 6.3: Histograms of known suitabilities for two of our data sets: (a) the Jokes data set (scores from −10 to 10) and (b) the Dating data set (ratings from 1 to 10).

the score histogram. Since items are users with preferences over their matches, dating is generally treated

as a two-sided problem. While two-sided matching can fit within our general framework, the focus of

our current work is on one-sided matching. As such, we only consider user preferences for “items” and

not vice versa. Each user is assigned 25–35 items (and vice versa since “items” are users).

6.4.2 Experimental Procedures

Our experiments simulate the typical interaction of a recommendation or matching engine with its users.

All experiments start with a few observed preferences for each user, for example, preferences provided

by users upon first entering the system, and then go through several rounds of querying. At each round,

a querying strategy selects queries to ask one or more users. Note that in practice we restrict the

strategies to only query (unobserved) scores that are actually available in our datasets. That is, since

we simulate the process of actively querying scores we can only simulate the responses that are available

in the underlying dataset. Once all users have responded, the system re-trains the learning model with

newly and previously observed preferences, then proceeds to select the next batch of queries. This is a

somewhat simplified model that assumes semi-synchronous user communication and thus our strategies

cannot take advantage of users’ most recent elicited preferences until the end of each round. We also

assume for simplicity that the same fixed number of queries per user is asked in each round. The initial

goal is simply to assess the relative performance of each method; we do relax some of these assumptions

in Section 6.4.3.

There are a variety of reasonable interaction modes for eliciting user preferences. For example, in

paper-reviewer matching, posing a single query per round is undesirable, since a reviewer, after assessing

a single paper, must wait for other reviewer responses—and the system to re-train—before being asked

a subsequent query. Reviewers generally prefer to assess their expertise off-line w.r.t. a collection of

papers. Consequently batch interaction is most appropriate: users are asked to assess K items. While

batch frameworks for active learning have received recent attention (e.g., [44]), here we are interested in

comparing different query strategies. Hence we use a very simple greedy batch approach where we elicit

the “top” K preferences from a user, where the “top” queries are ranked by the specific active strategy

under consideration. Appropriate choice of K is application dependent: smaller values of K may lead to


better recommendations with fewer queries, but require more frequent user interaction and user delay.

We test different values of K below.
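As a concrete illustration of this procedure, the following Python sketch simulates the semi-synchronous rounds with greedy top-K batch selection. The helpers train_model (e.g., a BPMF fit) and score_queries (the active strategy's ranking of unobserved pairs) are assumed to exist and are illustrative; responses are simulated by revealing entries of the held-out score matrix, with unavailable responses marked as NaN.

    import numpy as np

    def simulate_elicitation(true_scores, observed, score_queries, train_model,
                             n_rounds=10, K=10):
        """Semi-synchronous simulation: each round, every user is asked their
        top-K queries (ranked by the active strategy), then the model is
        re-trained on all newly and previously observed scores."""
        observed = observed.copy()
        for _ in range(n_rounds):
            model = train_model(true_scores, observed)
            priority = score_queries(model, observed)
            # only query scores that are actually available in the dataset
            askable = ~observed & ~np.isnan(true_scores)
            priority = np.where(askable, priority, -np.inf)
            for u in range(true_scores.shape[0]):
                top_k = np.argsort(priority[u])[::-1][:K]
                top_k = top_k[np.isfinite(priority[u, top_k])]
                observed[u, top_k] = True   # the user "responds" with true scores
        return observed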

We use BPMF to generate our predictions and to provide an uncertainty model over unobserved scores. We chose BPMF because it maintains a full distribution over unobserved scores, Pr(S^u | S^o, θ), given model parameters θ, which is useful for our active learning strategies. A procedure for setting some of the hyper-parameters of BPMF is outlined in [98]; we set the others using a validation set, giving (using notation from the original paper):

• Jokes: D = 1, α = 0.1, β0u = 0.1, β0v = 10

• Conference: D = 15, α = 2, β0u = β0v = 0.1

• Dating: D = 2, α = 2, β0u = β0v = 0.1

Each observed score is assigned a fixed small uncertainty value of 1e−3. The exact value is unimportant as long as it emphasizes near-certainty compared to the model’s uncertainty over predicted scores (we verified that this was the case in our experiments). For the Ȳ-based methods, which require sampling, we use 50 samples in all experiments. The ZM method, unlike the IP, is sensitive to the scale of the scores (the temperature of the system). A large scale leads to more deterministic matches, while a low scale has the opposite effect. Visual inspection of the results showed that, for each dataset, scaling the scores such that the maximum score has a value of 10 leads to an acceptable level of uncertainty. The elaboration of a more formal validation procedure is left for future work.
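For concreteness, here is a small sketch of the two mechanics just described: drawing sampled score matrices, assuming (as a simplification) an independent Gaussian posterior with per-pair means and variances taken from the learned model, and rescaling scores before the probabilistic (ZM-style) matching. The function names are illustrative.

    import numpy as np

    def sample_score_matrices(post_mean, post_var, n_samples=50, rng=None):
        """Draw score-matrix samples for the sampling-based strategies, assuming
        an independent Gaussian posterior over each user-item score."""
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.standard_normal((n_samples,) + post_mean.shape)
        return post_mean + np.sqrt(post_var) * noise

    def rescale_scores(scores, target_max=10.0):
        """Rescale scores so the maximum has value target_max (assumes a
        positive maximum); the scale acts like an inverse temperature for
        the probabilistic matching."""
        return scores * (target_max / np.max(scores))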

We compare query selection methods w.r.t. their matching performance—i.e., the matching objective value of Equation 6.1—using the match matrix produced by the IP from the estimated and known scores S^o, evaluated on the full set of available scores. We use a random querying strategy, which selects

unobserved items uniformly at random for each user, as a baseline. All figures show the number of

queries per user on the x-axis. The y-axis indicates the difference in the matching objective value

between a specific querying strategy and the baseline. Positive differences indicate better performance

relative to the baseline. The magnitude of this difference can be best understood relative to the number

of users in the data set. For example, a difference of 300 in objective value for the 300 users in the Jokes

data set means that users are matched to jokes that are better by one “score unit” on average (as a

reminder the scores of the jokes range from -10 to 10). Note that as we increase the number of queries,

even random queries will eventually find good matches—in the limit, where all scores are observed,

matching performance of all methods will be identical (hence the bell-shaped curves and asymptotic

convergence in our results).
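In code, this evaluation reduces to scoring a 0/1 match matrix against the full set of available scores and comparing each strategy to the random baseline; a minimal sketch (with illustrative names) follows.

    import numpy as np

    def matching_value(match, full_scores):
        """Matching objective evaluated on the full set of available scores:
        the sum of true scores over matched pairs (match is a 0/1 matrix;
        unavailable scores are NaN and are skipped)."""
        return np.nansum(match * full_scores)

    def gain_over_random(match_strategy, match_random, full_scores):
        """Difference plotted on the y-axis of our figures: positive values
        mean the strategy finds better matches than random querying."""
        return (matching_value(match_strategy, full_scores)
                - matching_value(match_random, full_scores))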

We don’t focus on running time in our experiments since query determination can often be done

off-line (depending on batch sizes). Having said that, even the most intense querying techniques are fast

and can support online interaction: (a) in all 3 data sets, solving the IP takes a fraction of a second; (b)

BPMF can be trained in a matter of a few minutes at most, but can be run asynchronously with query

selection. For example, we could choose to re-run BPMF as soon as we have received a new (pre-defined)

number of scores. The query selection would then use the most up-to-date learned model available; and

(c) sampling scores is very fast as the posterior distribution is Gaussian. Furthermore, given the above,

our methods should scale to larger datasets although the training time of BPMF may preclude fully

online interaction.


[Figure 6.4: Matching performance for active learning results (queries per batch per user: qbu). Standard error is also shown; the triangle plot (right vertical axis) shows the absolute matching objective of the random strategy. (a) Jokes data set (10 qbu); (b) Conference (20 qbu); (c) Dating (20 qbu). Horizontal axis: queries; left vertical axis: difference in matching objective.]


[Figure 6.5: Usage frequency of the fall-back query strategy for YM, ȲM and ȲE. (a) Jokes data set; (b) Conference data set; (c) Dating data set. Vertical axis: number of fall-back queries.]

6.4.3 Results

We first investigate the performance of the different querying strategies on our three data sets using

default batch sizes—these K values were deemed to be natural given the domains (different K values

are discussed below). Figure 6.4(a) shows results for Jokes using batches of 10 queries per user per

round (K = 10). Figures 6.4(b) and 6.4(c) show Conference and Dating results, respectively, both with

a batch size of 20. All users start with 20 observed scores: 15 are used for training and 5 for validation.

We also experimented with a more realistic setting where some users have few observed scores (e.g., new

users)—results are qualitatively very similar but are not discussed here.

The relative performance of each of the active methods exhibits a fairly consistent pattern across all

three domains, which permits us to draw some reasonably strong conclusions.5 First, we see that all

methods except for SE outperform the baseline in all domains. Recall that SE is essentially uncertainty

sampling, a classic (match-insensitive) active learning model often used as a general baseline method

for active learning. It outperforms the random baseline only occasionally, most significantly after the

first round of elicitation in Dating. Second, all of our proposed match-sensitive techniques outperform

SE consistently on all data sets. Third, the match-sensitive approaches that leverage uncertainty over

scores, namely, ȲM and ȲE, typically outperform YM, especially after the initial rounds of elicitation.

This difference in performance behaviour is most pronounced in the Conference domain.

We gain further insight into these results by examining the inner workings of these strategies.6

Figure 6.5 shows the number of random (or fall-back) queries used (on average) by each of YM, ȲM and ȲE. On all data sets, YM resorts to the fall-back strategy significantly earlier than the others, explaining

YM’s fall-off in performance and indicating that the diversity of potential matches identified by our

probabilistic matching technique plays a vital role in match-sensitive active learning.

Sequential Querying

We employed a semi-synchronous querying procedure above, where all users are queried in parallel at each round. We now consider a different mode of interaction where, at each round, users are queried sequentially in round-robin fashion. This allows the responses of earlier users within a round to influence the queries asked to later users—potentially reducing the total number of queries at the expense of increased synchronization (and delay) among users.

5We do not report the performance of SM; it is consistently outperformed by the baseline in all experiments. We have observed that SM typically selects all queries from among only a few items, namely, those with high predicted average score; hence it acquires no information about the vast majority of items.

6We did not perform further experiments with ZM since it appears that considering inherent matching uncertainty does not lead to a win compared to the Y- and Ȳ-based methods.


[Figure 6.6: Matching performance for active learning results using non-default parameters. (a) Jokes data set (10 queries per batch per user, sequential); (b) Conference data set (10 queries per batch, parallel); (c) Conference data set (40 queries per batch, parallel); (d) Conference data set with larger user and item constraints. Horizontal axis: queries; vertical axis: difference in matching objective.]

Fig. 6.6(a) shows that our methods are robust to this modification of the querying procedure. Specifically, SE is quickly outperformed by all methods that are sensitive to the matching. Furthermore, both ȲM and ȲE, which use matching uncertainty, outperform YM overall.
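A sketch of this round-robin variant (re-using the illustrative helpers from the earlier simulation sketch) makes the difference explicit: the model, and hence the query priorities, can be refreshed between users within a round.

    import numpy as np

    def simulate_round_robin(true_scores, observed, score_queries, train_model,
                             n_rounds=10, K=10):
        """Sequential elicitation: users are queried one at a time, so responses
        from earlier users can influence the queries posed to later users."""
        observed = observed.copy()
        for _ in range(n_rounds):
            for u in range(true_scores.shape[0]):
                model = train_model(true_scores, observed)  # uses all responses so far
                priority = score_queries(model, observed)[u]
                askable = ~observed[u] & ~np.isnan(true_scores[u])
                priority = np.where(askable, priority, -np.inf)
                top_k = np.argsort(priority)[::-1][:K]
                top_k = top_k[np.isfinite(priority[top_k])]
                observed[u, top_k] = True
        return observed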

Batch Sizes

The choice of the number of queries K per batch affects both the frequency with which the user interacts

with the system as well as the overall match performance. For example, high values of K reduce the

number of user “interactions” needed for a specific level of performance, at the expense of query efficiency

(improvement in matching objective per query). The “optimal” value for K depends on the actual

recommendation application. Figs. 6.6(b) and (c) show results with different values of K on Conference,

using 10 and 40 queries per round, respectively. The relative performance of the active methods remains

almost identical. As expected, absolute performance w.r.t. query efficiency is better with smaller values


of K. The matching-sensitive strategies clearly outperform the score-based techniques. Results are

similar across all data sets.

Matching Constraints

Our results are also robust to the use of different matching constraints, specifically, bounds on the

numbers of items per user and vice versa (i.e., Rmin, Rmax, Pmin, Pmax). Using the Conference data set,

we increase to two (from one) the number of reviewers assigned to each paper. Fig. 6.6(d) shows that

the behavior of the methods changes little, with both Ȳ-methods still outperforming all other methods.

The other domains (not shown) exhibit similar results.
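For reference, the match-constrained objective with such bounds can be written as an integer program of the following form (a sketch consistent with the constraints named above, following the reviewer/paper reading of R and P; the precise notation is defined with Equation 6.1), where s_ui denotes the suitability of item i for user u and y_ui indicates whether the pair is matched:

    \max_{y \in \{0,1\}^{U \times I}} \;\; \sum_{u,i} s_{ui}\, y_{ui}
    \quad \text{subject to} \quad
    P_{\min} \le \sum_{i} y_{ui} \le P_{\max} \;\; \forall u,
    \qquad
    R_{\min} \le \sum_{u} y_{ui} \le R_{\max} \;\; \forall i.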

6.5 Conclusion and Future Opportunities

We investigated the problem of active learning for match-constrained recommender systems. We ex-

plored several different approaches to generating queries that are guided by the matching objective, and

introduced a novel method for probabilistic matching that accounts for uncertainty in predicted scores.

Experiments demonstrate the effectiveness of our methods in determining high-quality matches with

significantly less elicitation of user preferences than that required by uncertainty sampling, a standard

active learning method. Our results highlight the importance of choosing queries in a manner that is

sensitive to the matching objective and uncertainty over predicted scores.

One effect of choosing which user preferences to label, based on some objective, is that the model

will learn from (possibly) biased data. That is, the sampled data may provide the model with a biased

view of the true underlying data distribution. This is a general problem of active learning known as

sampling bias (for a formal description see, for example, [31]). By performing elicitation based on the

matching objective we add a further source of bias. Informally, the system is more likely to query scores

that are expected to be higher since they will more likely be matched. In our datasets we have not found

it to be a problem. However, in general in order to reduce this added bias, one may want to elicit scores

based on both the matching and learning objectives. Another possible practical avenue is to query all

users about a (small) common fixed set of items. This would ensure a certain level of score diversity, in

particular, it would ensure that the system has access to low scores for all users.

On a practical note it has been very challenging to obtain informative uncertainty models using the

preference datasets that we experimented with. We have shown that we are still able to leverage the

model uncertainty to obtain performance gains. However, in general, uninformative uncertainty models

may fundamentally limit the general usefulness of the active learning methods that (even indirectly) rely

on model uncertainty.

There are many promising avenues of future research in match-constrained recommendation. We

could explore different matching objectives, for example two-sided matching with stability constraints

(e.g., as would be appropriate in online dating as well as paper-reviewer matching, where papers require

“sufficient” expertise). We could also explore methods for eliciting side information from users in a way

that is guided by the recommendation objective. Furthermore, higher-level, abstract queries (such as

preferences over item categories or features) may significantly boost “gain per query” performance. In

fact, one could use CSTM (Chapter 4) to model higher-level features such as item genres (or the subject areas of submissions and reviewers). Initial experiments with such data showed some promise, which could pave the way to using CSTM as the underlying active learning model. Another possibility would


be to allow the active learning to choose between different types of queries (for example queries about the

subject areas of a reviewer or queries about user-item preferences). EVOI could be used as a principled

way of selecting queries of these different types. Modelling this extra level of user preferences will also

be useful for cold-start items. For example, in the reviewer-to-paper matching domain, imagine that

reviewer profiles are kept across conferences. Then user preferences over paper subject areas could be

used to get better score estimates of the submitted papers which in turn will help the active learning.

Without using side information the initial active learning queries would be no better than the ones of a

random baseline.

Finally, in this chapter after new data was obtained we always re-trained the system using all available

data. This approach is unlikely to scale, and online learning techniques, which could learn using only

the newly-available data, would be of interest.

Overall, explicit preference elicitation requires the voluntary participation of users. Conference re-

viewers may oblige since they will reap immediate benefits. However, in other applications, it may be

harder to convince users of the advantages of engaging in this elicitation mechanism. In such cases,

recommender systems may either have to be more subtle about the elicitation adopted, for example, by

including an exploration policy within their recommendation objective, or even by deducing preferences

from user behaviour, for example, by examining the list of pages they browsed, the search queries they

issued, or the information they communicated through their participation in online social networking

sites.


Chapter 7

Conclusion

Throughout this thesis we have shown how we can both tailor existing machine learning models and

methods, as well as develop new ones to increase the performance and expand the capabilities of rec-

ommender systems. We now provide a summary of our work and highlight opportunities for future

work.

7.1 Summary

Firstly, we established preference prediction as the core problem of interest in recommender systems.

Current supervised learning methods are well suited to this problem. Specifically, supervised methods

tailored to the specifics of recommender systems, such as collaborative filtering methods, have been shown to be excellent at predicting missing preferences in typical recommendation domains when user preference

data is plentiful. However, in cold-start scenarios we must resort to leveraging side information, possibly

including content information, about user and items. We showed how we can model textual user and item

side information, using topic models, in a document-prediction domain and obtain superior performance compared to state-of-the-art methods that use only users’ preference scores.

Secondly, we introduced a simple framework which decomposes the recommendation problem into

three interacting stages: a) preference elicitation; b) preference prediction; and c) determination of rec-

ommendations. We showed how we can cast match-constrained problems, such as the paper-to-reviewer

matching problem, into this framework. For match-constrained recommendations we experimentally

demonstrated, using two conference datasets, the strong correlation between the methods’ preference-

prediction performance and their matching performance. Further, we exploited the synergy between

learning and matching objectives when using a non-linear mapping between utility and suitability.

Finally, using active learning, we explored the interaction between preference elicitation and matching

in a match-constrained recommender system. The active querying methods we developed focussed on

improving the recommendation objective instead of the learning objective. To reach our goals we also

developed a probabilistic matching procedure to account for the uncertainty in predicted preferences.

Our methods, including those that use probabilistic matching, proved useful in querying user preferences

for the matching problem. Overall, using our conference datasets, a dating dataset and a jokes dataset,

we showed that together with preference prediction methods, active learning methods greatly reduce the

elicitation burden on users and thus help alleviate the cold-start problem.


Interestingly, the non-research contribution of this thesis, the Toronto Paper Matching System, is

the component of this thesis which has seemingly had the most immediate impact on the community.

Further, it has both revealed research opportunities, for example by allowing us to collect interesting

data sets which were essential in the development of our work, and it has also proven to be a good test

bed for our research ideas.

7.2 Future Research Directions

As recommender systems become ubiquitous several research opportunities will naturally present them-

selves. We have outlined some of these directions in the preceding chapters. We now outline a few

additional directions which are the closest to the work presented in this thesis.

1. A wide variety of sources of side information may be indicative of user preferences (for example,

the different aspects of user online behaviour, such as their search terms, the sites they visited

and the frequency and length of the visits, their purchased items, and others). Learning simul-

taneously from all such sources has the potential to refine recommendations and, more generally,

personalization models. Therefore, it is worth developing and analyzing models which can learn

from these potentially heterogeneous sources of (side) information. Learning from combinations of

heterogeneous data is a general challenge of machine learning and one of particular interest to

recommender systems.

Further, these additional sources of side information will often contain more expressive forms of

user preferences. For example, deriving preferences from text (e.g., of reviews), where extracting

user preferences cannot be done with a bag-of-words model, is still a challenge in machine learning

(although the field has been progressing, especially if allowed to learn from large collections of

labelled data [110]). One possible avenue for accomplishing this task is—similar to our approach

using CSTM—to learn general higher-level representations of this side information and then use

these representations as features in a user preference model. In general, this direction of research

may require the exploration of interesting combinations of existing machine learning content models

(for example, models of text or images) with preference prediction models.

2. In many recommendation domains it may be unreasonable to require the explicit elicitation of

many preferences from users. Therefore, as we discussed in the last section of Chapter 6, it

will be essential to effectively learn from weaker sources of user preferences such as users’ online

behaviour. As a first step, there has already been some work using implicit user feedback [51]

for recommendations. While extending this work is promising, methods that can aggregate weak

forms of preference data into definitive user preferences will be of importance.

3. Current recommender systems typically work for single item-domains. For example, a system rec-

ommends either books or restaurants but not both. This is partly due to the mechanisms that

are involved in creating user-item preference data sets. However, there are necessarily correlations

between users’ preferences across different domains which can be exploited using a multi-domain

recommender system. Further, a multi-domain recommender system has the potential of learning

much finer-level representations of user preferences, leading to better user personalization. There-

fore, methods which work across multiple domains, and which eventually lead to more general

models of user preferences, will be beneficial even beyond recommender systems.


To be useful across many domains a recommender system will often be required to make recommen-

dations in domains for which it has very little user preference information. For example, an online

system will constantly need to adapt to both novel items and users as well as to users’ interests

as they evolve over time. Again, the use of side information, including content information, will

be a necessity to quickly identify exploitable correlations between the different recommendation

domains.

Ideally, recommender systems will not only recommend single items, but will also be able to rec-

ommend structured lists of items. For example, a system could recommend full trip itineraries,

including places to stay, sightseeing activities and tickets to shows, or recommend a reading cur-

riculum composed of a list of scientific or news articles to (gradually) learn about specific topics, or

sets of ingredients to create harmonious recipes and meals. Such problems could be modelled using

our current framework, by first having a preference learning objective followed by a combinatorial

optimization step. In cases where structured label data exists we could then turn our attention to

the work on structured output learning (for example, [123]) or, more appropriately, to works that

can deal explicitly with the ultimate combinatorial optimization such as Perturb-and-Map [82] and

others [120].

4. Finally, there are also other particularities of practical recommender systems which have often

been ignored in the academic literature. Examples include:

• the common practice of treating missing preferences, in available datasets, as missing at random

• the tendency to ignore the temporal and spatial contexts around recommendations

Such considerations will likely become of crucial importance for real-life recommender systems and

will become more accessible to academics once datasets containing information relevant to these

considerations become available.

We have presented recommender systems as a major beneficiary from advances in machine learning

research, and specifically from supervised and active learning methods. It is likely that recommender

systems, especially at their intersection with human behaviour modelling, will become an even more important application area for machine learning techniques. This will especially be the case as larger user preference data sets,

especially if they contain extra features such as more expressive information indicative of preferences,

become available.


Bibliography

[1] Deepak Agarwal and Bee-Chung Chen. Regression-based latent factor models. In Proceedings

of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,

KDD ’09, pages 19–28, New York, NY, USA, 2009. ACM. 40, 41, 42, 53, 62

[2] Deepak Agarwal and Bee-Chung Chen. fLDA: Matrix factorization through latent Dirichlet alloca-

tion. In Proceedings of the third ACM International Conference on Web Search and Data Mining,

WSDM ’10, pages 91–100, New York, NY, USA, 2010. ACM. 40, 41, 42, 52

[3] Robert Arens. Learning SVM ranking functions from user feedback using document metadata and

active learning in the biomedical domain. In Johannes Furnkranz and Eyke Hullermeier, editors,

Preference Learning, pages 363–383. Springer-Verlag, 2010. 22

[4] Suhrid Balakrishnan and Sumit Chopra. Collaborative ranking. In Proceedings of the Fifth ACM

International Conference on Web Search and Data Mining, WSDM ’12, pages 143–152, New York,

NY, USA, 2012. ACM. ISBN 978-1-4503-0747-5. 17, 64

[5] Krisztian Balog, Leif Azzopardi, and Maarten de Rijke. Formal models for expert finding in enter-

prise corpora. In Efthimis N. Efthimiadis, Susan T. Dumais, David Hawking, and Kalervo Jarvelin,

editors, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and

Development in Information Retrieval (SIGIR-06), pages 43–50, Seattle, Washington, USA, 2006.

ACM. ISBN 1-59593-369-7. 53

[6] Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, and Luo Si. Expertise retrieval.

Foundations and Trends in Information Retrieval, 6(2):127–256, February 2012. ISSN 1554-0669.

37

[7] Robert M. Bell and Yehuda Koren. Lessons from the Netflix prize challenge. SIGKDD Exploration

Newsletter, 9:75–79, December 2007. ISSN 1931-0145. 8, 13, 20, 48

[8] Salem Benferhat and Jerome Lang. Conference paper assignment. International Journal of Intel-

ligent Systems, 16(10):1183–1192, 2001. 37, 66, 67, 68

[9] T. Bertin-Mahieux. Large-Scale Pattern Discovery in Music. PhD thesis, Columbia University,

February 2013. 42

[10] Michael J. Best and Nilotpal Chakravarti. Active set algorithms for isotonic regression; a unifying

framework. Math. Program., 47:425–439, 1990. 54


[11] Alina Beygelzimer, Daniel Hsu, John Langford, and Zhang Tong. Agnostic active learning without

constraints. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors,

Advances in Neural Information Processing Systems 23, pages 199–207. 2010. 18

[12] David M. Blei and John D. Lafferty. Dynamic topic models. In Proceedings of the Twenty-third

International Conference of Machine Learning (ICML), 2006. 43

[13] David M. Blei and John D. Lafferty. A correlated topic model of science. AAS, 1(1):17–35, 2007.

44, 46, 48, 49, 50, 52, 54, 55

[14] David M. Blei and Jon D. McAuliffe. Supervised topic models. In Advances in Neural Information

Processing Systems (NIPS), 2007. 43

[15] David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. Hierarchi-

cal topic models and the nested chinese restaurant process. In Advances in Neural Information

Processing Systems (NIPS), 2003. 43

[16] David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John Lafferty. Latent Dirichlet allocation.

Journal of Machine Learning Research, 3:993–1022, 2003. 31, 32, 43, 44

[17] Michael Bloodgood and K. Vijay-Shanker. A method for stopping active learning based on sta-

bilizing predictions and the need for user-adjustable stopping. In Proceedings of the Thirteenth

Conference on Computational Natural Language Learning (CoNLL-2009), 2009. 24

[18] Craig Boutilier, Richard S. Zemel, and Benjamin Marlin. Active collaborative filtering. In UAI,

pages 98–106, Acapulco, 2003. 22, 23, 82

[19] Darius Braziunas and Craig Boutilier. Assessing regret-based preference elicitation with the UT-

PREF recommendation system. In Proceedings of the Eleventh ACM Conference on Electronic

Commerce (EC-10), pages 219–228, Cambridge, MA, 2010. 8

[20] John S. Breese, David Heckerman, and Carl Myers Kadie. Empirical analysis of predictive algo-

rithms for collaborative filtering. In G. F. Cooper and S. Moral, editors, Proceedings of the 14th

Conference on Uncertainty in Artificial Intelligence, pages 43–52, 1998. 9

[21] Klaus Brinker. Incorporating diversity in active learning with support vector machines. In Fawcett

and Mishra [34], pages 59–66. ISBN 1-57735-189-4. 23

[22] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive

cost functions, with application to active user modeling and hierarchical reinforcement learning.

Technical Report TR-2009-23, Department of Computer Science, University of British Columbia,

November 2009. 78

[23] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan.

Streaming variational bayes, 2013. arXiv:1307.6769. 61

[24] Christopher J.C. Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth

cost functions. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information

Processing Systems 19, pages 193–200. MIT Press, Cambridge, MA, 2007. 17


[25] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: From pairwise

approach to listwise approach. Tech Report MSR-TR-2007-40, Microsoft Research, April 2007. 17

[26] Tianqi Chen, Hang Li, Qiang Yang, and Yong Yu. General functional matrix factorization using

gradient boosting. In Sanjoy Dasgupta and David Mcallester, editors, Proceedings of the 30th

International Conference on Machine Learning (ICML-13), volume 28, pages 436–444. JMLR

Workshop and Conference Proceedings, 2013. 40, 41

[27] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining content-

based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR ’99

Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, California, 1999. ACM.

40, 41

[28] William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. Journal of

Artificial Intelligence Research (JAIR), 10:243–270, May 1999. ISSN 1076-9757. 18

[29] Don Conry, Yehuda Koren, and Naren Ramakrishnan. Recommender systems for the conference

paper assignment problem. In Proceedings of the Third ACM Conference on Recommender Systems,

RecSys ’09, pages 357–360, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-435-5. 36, 40,

41, 53, 67

[30] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the

Twenty-Fifth International Conference (ICML 2008), pages 208–215, 2008. 20

[31] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the

Twenty-fifth International Conference on Machine learning (ICML), pages 208–215, 2008. 89

[32] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via

the em algorithm. Journal of The Royal Statistical Society, Series B, 39(1):1–38, 1977. 45

[33] E. Bart, M. Welling, and P. Perona. Unsupervised organization of image collections: Taxonomies and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011. 62

[34] Tom Fawcett and Nina Mishra, editors. Machine Learning, Proceedings of the Twentieth Interna-

tional Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, 2003. AAAI Press.

ISBN 1-57735-189-4. 95, 102

[35] Brendan J. Frey and Nebojsa Jojic. A comparison of algorithms for inference and learning in prob-

abilistic graphical models. IEEE Trans. Pattern Anal. Mach. Intell., 27(9):1392–1416, September

2005. ISSN 0162-8828. 45

[36] David Gale and Lloyd S. Shapley. College admissions and the stability of marriage. American

Mathematical Monthly, 69(1):9–15, 1962. ISSN 0002-9890. 24, 25

[37] Naveen Garg, Telikepalli Kavitha, Amit Kumar, Kurt Mehlhorn, and Julian Mestre. Assigning

papers to referees. Algorithmica, 58(1):119–136, 2010. 37, 66, 67, 68

[38] Kostadin Georgiev and Preslav Nakov. A non-iid framework for collaborative filtering with re-

stricted boltzmann machines. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of

the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1148–1156.

JMLR Workshop and Conference Proceedings, May 2013. 13


[39] Mehmet Gönen, Suleiman Khan, and Samuel Kaski. Kernelized Bayesian matrix factorization. In

Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference

on Machine Learning (ICML-13), volume 28, pages 864–872. JMLR Workshop and Conference

Proceedings, May 2013. 41, 42

[40] David Goldberg, David Nichols, Brian M. Oki, and Douglas Terry. Using collaborative filtering

to weave an information tapestry. Communications of the ACM, 35:61–70, December 1992. ISSN

0001-0782. 2, 7, 9

[41] Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time

collaborative filtering algorithm. Information Retrieval, 4(2):133–151, July 2001. ISSN 1386-4564.

83

[42] Judy Goldsmith and Robert H. Sloan. The AI conference paper assignment problem. In AAAI-07

Workshop on Preference Handling in AI, pages 53–57, Vancouver, 2005. 37, 66, 67, 68

[43] Yuhong Guo. Active instance sampling via matrix partition. In J. Lafferty, C. K. I. Williams,

J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing

Systems 23, pages 802–810. 2010. 23

[44] Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In J.C. Platt,

D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems

20, pages 593–600. MIT Press, Cambridge, MA, 2008. 23, 84

[45] Abhay Harpale and Yiming Yang. Personalized active learning for collaborative filtering. In SIGIR,

pages 91–98, 2008. 19

[46] Guenter Hitsch and Ali Hortacsu. What makes you click? an empirical analysis of online dating.

2005 Meeting Papers 207, Society for Economic Dynamics, 2005. 24

[47] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational

inference. J. Mach. Learn. Res., 14(1):1303–1347, May 2013. ISSN 1532-4435. 61

[48] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual

International ACM SIGIR Conference on Research and Development in Information Retrieval,

SIGIR ’99, pages 50–57, New York, NY, USA, 1999. ACM. ISBN 1-58113-096-1. 12

[49] Thomas Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22

(1):89–115, 2004. 11, 12

[50] R.A. Howard. Information value theory. Systems Science and Cybernetics, IEEE Transactions on,

2(1):22–26, 1966. ISSN 0536-1567. doi: 10.1109/TSSC.1966.300074. 22

[51] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets.

In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, pages

263–272, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3502-9. 92

[52] Aanund Hylland and Richard J. Zeckhauser. The efficient allocation of individuals to positions.

Journal of Political Economy, 87(2):293–314, 1979. ISSN 0022-3808. 24, 25


[53] Mohsen Jamali and Martin Ester. Trustwalker: a random walk model for combining trust-based

and item-based recommendation. In Proceedings of the 15th ACM SIGKDD international confer-

ence on Knowledge discovery and data mining, KDD ’09, pages 397–406, New York, NY, USA,

2009. ACM. ISBN 978-1-60558-495-9. 40

[54] Mohsen Jamali and Martin Ester. A matrix factorization technique with trust propagation for

recommendation in social networks. In Proceedings of the fourth ACM conference on Recommender

systems, RecSys ’10, pages 135–142, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0.

40, 41

[55] Tamas Jambor and Jun Wang. Optimizing multiple objectives in collaborative filtering. In ACM

Recommender Systems, 2010. 18

[56] Tamas Jambor and Jun Wang. Goal-driven collaborative filtering: A directional error based

approach. In Proc. of European Conference on Information Retrieval (ECIR), 2010. 15

[57] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant doc-

uments. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and

Development in Information Retrieval, SIGIR ’00, pages 41–48, New York, NY, USA, 2000. ACM.

16, 56

[58] Rong Jin and Luo Si. A bayesian approach toward active learning for collaborative filtering.

In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI-04), pages

278–285. AUAI Press, 1 2004. 19

[59] Maryam Karimzadehgan and ChengXiang Zhai. Integer linear programming for constrained multi-

aspect committee review assignment. Inf. Process. Manage., 48(4):725–740, July 2012. ISSN 0306-

4573. doi: 10.1016/j.ipm.2011.09.004. URL http://dx.doi.org/10.1016/j.ipm.2011.09.004.

68

[60] Maryam Karimzadehgan, ChengXiang Zhai, and Geneva Belford. Multi-aspect expertise matching

for review assignment. In Proceedings of the 17th ACM conference on Information and knowledge

management, CIKM ’08, pages 1113–1122, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-

991-3. 67

[61] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model.

In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and

data mining, KDD ’08, pages 426–434, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-193-4.

40, 41

[62] Yehuda Koren, Robert M. Bell, and Chris Volinsky. Matrix factorization techniques for recom-

mender systems. IEEE Computer, 42(8):30–37, 2009. 8, 13

[63] Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. Disclda: Discriminative learning for dimen-

sionality reduction and classification. In NIPS, 2008. 43

[64] Helge Langseth and Thomas Dyhre Nielsen. A latent model for collaborative filtering. Int. J.

Approx. Reasoning, 53(4):447–466, June 2012. ISSN 0888-613X. 13


[65] Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted boltzmann

machines. In Proceedings of the Twenty-fifth International Conference on Machine Learning, ICML

’08, pages 536–543, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. 34

[66] Neil D. Lawrence and Raquel Urtasun. Non-linear matrix factorization with gaussian processes. In

Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages

601–608, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. 8, 12, 41, 42

[67] David D. Lewis. A sequential algorithm for training text classifiers: corrigendum and additional

data. SIGIR Forum, 29:13–19, September 1995. ISSN 0163-5840. 19

[68] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings

of KDD Cup and Workshop, 2007. 10

[69] Nathan N. Liu and Qiang Yang. Eigenrank: a ranking-oriented approach to collaborative filtering.

In Proceedings of the 31st annual international ACM SIGIR conference on Research and develop-

ment in information retrieval, SIGIR ’08, pages 83–90, New York, NY, USA, 2008. ACM. ISBN

978-1-60558-164-4. 17, 18

[70] Hao Ma, Haixuan Yang, Michael R. Lyu, and Irwin King. Sorec: social recommendation using

probabilistic matrix factorization. In Proceedings of the 17th ACM conference on Information

and knowledge management, CIKM ’08, pages 931–940, New York, NY, USA, 2008. ACM. ISBN

978-1-59593-991-3. 40, 41

[71] Benjamin Marlin. Modeling user rating profiles for collaborative filtering. In Advances in Neural

Information Processing Systems 16 (NIPS), 2003. 12

[72] Benjamin Marlin. Collaborative filtering: A machine learning perspective. Technical report,

University of Toronto, 2004. 8, 14

[73] Benjamin M. Marlin and Richard S. Zemel. The multiple multiplicative factor model for collabo-

rative filtering. In Carla E. Brodley, editor, ICML, volume 69 of ACM International Conference

Proceeding Series. ACM, 2004. 12, 13

[74] Benjamin M. Marlin and Richard S. Zemel. Collaborative prediction and ranking with non-random

missing data. In Proceedings of the third ACM conference on Recommender systems, RecSys ’09,

pages 5–12, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-435-5. 14

[75] Benjamin M. Marlin, Richard S. Zemel, Sam T. Roweis, and Malcolm Slaney. Collaborative

filtering and the missing at random assumption. In Ronald Parr and Linda C. van der Gaag,

editors, UAI, pages 267–275. AUAI Press, 2007. ISBN 0-9749039-3-0. 14

[76] Paolo Massa and Paolo Avesani. Trust-aware recommender systems. In Proceedings of the 2007

ACM conference on Recommender systems, RecSys ’07, pages 17–24, New York, NY, USA, 2007.

ACM. ISBN 978-1-59593-730–8. 40

[77] B. McFee, T. Bertin-Mahieux, D. Ellis, and G. Lanckriet. The million song dataset challenge. In

Proc. of the 4th International Workshop on Advances in Music Information Research (AdMIRe

’12), April 2012. 42


[78] Lorraine McGinty and Barry Smyth. On the role of diversity in conversational recommender

systems. In Proceedings of the 5th international conference on Case-based reasoning: Research

and Development, ICCBR’03, pages 276–290, Berlin, Heidelberg, 2003. Springer-Verlag. ISBN

3-540-40433-3. 2, 14

[79] David M. Mimno and Andrew McCallum. Expertise modeling for matching papers with reviewers.

In Pavel Berkhin, Rich Caruana, and Xindong Wu, editors, Proceedings of the 13th ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining (KDD), pages 500–509, San

Jose, California, 2007. ACM. ISBN 978-1-59593-609-7. 29, 31, 48, 53

[80] Radford M. Neal and Geoffrey E. Hinton. Learning in graphical models. chapter A view of the

EM algorithm that justifies incremental, sparse, and other variants, pages 355–368. MIT Press,

Cambridge, MA, USA, 1999. ISBN 0-262-60032-3. 45

[81] Fredrik Olsson and Katrin Tomanek. An intrinsic stopping criterion for committee-based active

learning. In Proceedings of the Thirteenth Conference on Computational Natural Language Learn-

ing, CoNLL ’09, pages 138–146, Stroudsburg, PA, USA, 2009. Association for Computational

Linguistics. ISBN 978-1-932432-29-9. 24

[82] G. Papandreou and A. Yuille. Perturb-and-map random fields: Using discrete optimization to

learn and sample from energy models. In Proceedings of the IEEE International Conference on

Computer Vision (ICCV), pages 193–200, Barcelona, Spain, November 2011. 93

[83] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan

Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. ISBN 0-934613-73-7. 80

[84] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. Labeled lda: A

supervised topic model for credit attribution in multi-labeled corpora. In EMNLP, 2009. 43

[85] Jason D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collab-

orative prediction. In Luc De Raedt and Stefan Wrobel, editors, ICML, volume 119 of ACM

International Conference Proceeding Series, pages 713–719. ACM, 2005. ISBN 1-59593-180-5. 11

[86] P. Resnick, N. Iacovou, M. Sushak, P. Bergstrom, and J. Riedl. Grouplens: An open architecture

for collaborative filtering of netnews. In 1994 ACM Conference on Computer Supported Collabora-

tive Work Conference, pages 175–186, Chapel Hill, NC, 10/1994 1994. Association of Computing

Machinery, Association of Computing Machinery. 9

[87] Paul Resnick and Hal R. Varian. Recommender systems. Communications of the ACM, 40(3):

56–58, March 1997. ISSN 0001-0782. 2

[88] Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor, editors. Recommender Systems

Handbook. Springer, 2011. ISBN 978-0-387-85819-7. 1, 2

[89] Philippe Rigaux. An iterative rating method: application to web-based conference management.

In SAC, pages 1682–1687, 2004. 36, 77

[90] Irina Rish and Gerald Tesauro. Active collaborative prediction with maximum margin matrix

factorization. In ISAIM 2008, 2008. 20


[91] Marko A. Rodriguez and Johan Bollen. An algorithm to determine peer-reviewers. In Proceeding

of the 17th ACM Conference on Information and Knowledge Management (CIKM-08), pages 319–

328, Napa Valley, California, USA, 2008. ACM. ISBN 978-1-59593-991-3. 37, 67

[92] Eytan Ronn. Np-complete stable matching problems. J. Algorithms, 11(2):285–304, May 1990.

ISSN 0196-6774. 25

[93] David A. Ross and Richard S. Zemel. Multiple cause vector quantization. In Advances in Neural

Information Processing Systems 15 (NIPS), pages 1017–1024, 2002. 12

[94] Alvin E. Roth. The evolution of the labor market for medical interns and residents: A case study

in game theory. Journal of Political Economy, 92(6):991–1016, 1984. 24, 25

[95] Alvin E. Roth and Elliott Peranson. The redesign of the matching market for american physicians:

Some engineering aspects of economic design. Working Paper 6963, National Bureau of Economic

Research, February 1999. 25

[96] Neil Rubens and Masashi Sugiyama. Influence-based collaborative active learning. In Proceedings

of the 2007 ACM conference on Recommender systems, RecSys ’07, pages 145–148, New York,

NY, USA, 2007. ACM. ISBN 978-1-59593-730–8. 19

[97] Ruslan Salakhutdinov and Geoffrey Hinton. Replicated softmax: an undirected topic model. In

Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in

Neural Information Processing Systems 22 (NIPS), pages 1607–1614. 2009. 34, 55

[98] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov

chain Monte Carlo. In Proceedings of the International Conference on Machine Learning, vol-

ume 25, 2008. 10, 11, 48, 85

[99] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural

Information Processing Systems (NIPS), volume 20, 2008. 10, 40, 41, 48, 52, 55, 61

[100] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for

collaborative filtering. In Proceedings of the International Conference on Machine Learning, vol-

ume 24, pages 791–798, 2007. 13

[101] Andrew Ian Schein. Active learning for logistic regression. PhD thesis, University of Pennsylvania,

Philadelphia, PA, USA, 2005. AAI3197737. 19

[102] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University

of Wisconsin–Madison, 2009. 19, 20, 21, 23, 24, 81

[103] B. Settles, M. Craven, and S. Ray. Multiple-instance active learning. In Advances in Neural

Information Processing Systems (NIPS), volume 20, pages 1289–1296. MIT Press, 2008. 21

[104] Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks.

In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP

’08, pages 1070–1079, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. 19

[105] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee, 1992. 20


[106] Hanhuai Shan and Arindam Banerjee. Generalized probabilistic matrix factorizations for collabo-

rative filtering. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM

’10, pages 1025–1030, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-

4256-0. 41, 42, 52

[107] Yue Shi, Martha Larson, and Alan Hanjalic. List-wise learning to rank with matrix factorization

for collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems,

RecSys ’10, pages 269–272, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0. 17

[108] Malcolm Slaney. Web-scale multimedia analysis: Does content matter? IEEE MultiMedia, 18(2):

12–15, April 2011. ISSN 1070-986X. 42

[109] Alex J. Smola, S. V. N. Vishwanathan, and Quoc V. Le. Bundle methods for machine learning.

In Advances in Neural Information Processing Systems 20 (NIPS), 2007. 17

[110] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y.

Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment

treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing,

Stroudsburg, PA, October 2013. Association for Computational Linguistics. 92

[111] Nathan Srebro. Learning with matrix factorizations. PhD thesis, Massachusetts Institute of Tech-

nology, Cambridge, MA, USA, 2004. AAI0807530. 9

[112] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Fawcett and Mishra

[34], pages 720–727. ISBN 1-57735-189-4. 10

[113] Nathan Srebro, Jason D. M. Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization.

In Advances in Neural Information Processing Systems (NIPS), 2004. 10, 11, 13

[114] David H. Stern, Ralf Herbrich, and Thore Graepel. Matchbox: large scale online bayesian rec-

ommendations. In Juan Quemada, Gonzalo Leon, Yoelle S. Maarek, and Wolfgang Nejdl, editors,

WWW, pages 111–120. ACM, 2009. ISBN 978-1-60558-487-4. 67

[115] David H. Stern, Horst Samulowitz, Ralf Herbrich, Thore Graepel, Luca Pulina, and Armando

Tacchella. Collaborative expert portfolio management. In Maria Fox and David Poole, editors,

AAAI. AAAI Press, 2010. 67

[116] Johan A. K. Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural

Processing Letters, 9(3):293–300, 1999. 11

[117] Wenbin Tang, Jie Tang, and Chenhao Tan. Expertise matching via constraint-based optimization.

In IEEE/WIC/ACM International Conference on Web Intelligence (WI-10) and Intelligent Agent

Technology (IAT-10), volume 1, pages 34–41, Toronto, Canada, 2010. IEEE Computer Society.

ISBN 978-0-7695-4191-4. 68

[118] Wenbin Tang, Jie Tang, Tao Lei, Chenhao Tan, Bo Gao, and Tian Li. On optimization of expertise

matching with various constraints. Neurocomputing, 76(1):71–83, January 2012. ISSN 0925-2312.

68


[119] Daniel Tarlow. Efficient Machine Learning with High Order and Combinatorial Structures. PhD

thesis, University of Toronto, February 2013. 75

[120] Daniel Tarlow, Ryan Prescott Adams, and Richard S Zemel. Randomized optimum models for

structured prediction. In Proceedings of the 15th Conference on Artificial Intelligence and Statis-

tics, pages 21–23, 2012. 93

[121] Daniel Tarlow, Kevin Swersky, Richard S Zemel, Ryan P Adams, and Brendan J Frey. Fast exact

inference for recursive cardinality models. In Proceedings of the 28th Conference on Uncertainty

in Artificial Intelligence, 2012. 80

[122] Camillo J. Taylor. On the optimal assignment of conference papers to reviewers. Technical Report

MS-CIS-08-30, University of Pennsylvania, 2008. 36, 64, 66, 68

[123] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector

machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-

First International Conference on Machine Learning, ICML’04, pages 104–, New York, NY, USA,

2004. ACM. ISBN 1-58113-838-5. 93

[124] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc.,

New York, NY, USA, 1995. ISBN 0-387-94559-8. 7

[125] Maksims Volkovs and Rich Zemel. Collaborative ranking with 17 parameters. In P. Bartlett, F.C.N.

Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information

Processing Systems 25, pages 2303–2311. 2012. 17

[126] Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles.

In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and

data mining, KDD ’11, pages 448–456, New York, NY, USA, 2011. ACM. 14, 40, 41, 42, 48, 52,

55

[127] Chong Wang and David M. Blei. Variational inference in nonconjugate models. Journal of Machine

Learning Research, 14(1):1005–1031, April 2013. 44, 60

[128] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alex J. Smola. Cofi rank - maximum

margin matrix factorization for collaborative ranking. In NIPS, 2007. 14, 16, 17

[129] Markus Weimer, Alexandros Karatzoglou, and Alex Smola. Adaptive collaborative filtering. In

Proceedings of the 2008 ACM conference on Recommender systems, RecSys ’08, pages 275–282,

New York, NY, USA, 2008. ACM. ISBN 978-1-60558-093-7. 11, 17

[130] Jason Weston, Chong Wang, Ron Weiss, and Adam Berenzweig. Latent collaborative retrieval. In

ICML. icml.cc / Omnipress, 2012. 41, 42

[131] Zuobing Xu, Ram Akella, and Yi Zhang 0001. Incorporating diversity and density in active learning

for relevance feedback. In Giambattista Amati, Claudio Carpineto, and Giovanni Romano, editors,

ECIR, volume 4425 of Lecture Notes in Computer Science, pages 246–257. Springer, 2007. ISBN

978-3-540-71494-1. 23


[132] Kai Yu, Anton Schwaighofer, Volker Tresp, Xiaowei Xu, and Hans-Peter Kriegel. Proba-

bilistic memory-based collaborative filtering. IEEE Trans. on Knowl. and Data Eng., 16

(1):56–69, January 2004. ISSN 1041-4347. doi: 10.1109/TKDE.2004.1264822. URL

http://dx.doi.org/10.1109/TKDE.2004.1264822. 23

[133] Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied

to information retrieval. ACM Trans. Inf. Syst., 22(2):179–214, April 2004. ISSN 1046-8188. 31