Supervised and Active Learning for Recommender Systems
by
Laurent Charlin
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
© Copyright 2014 by Laurent Charlin
Abstract
Supervised and Active Learning for Recommender Systems
Laurent Charlin
Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
2014
Traditional approaches to recommender systems have often focused on the collaborative filtering problem:
using users’ past preferences in order to predict their future preferences. Although essential, rating
prediction is only one of the components of a successful recommender system. One important problem
is how to translate predicted ratings into actual recommendations. Furthermore, considering additional
information either about users or items may offer substantial gains in performance while allowing the
system to provide good recommendations to new users.
We develop machine learning methods in response to some of the limitations of current recommender-
systems’ research. Specifically, we propose a three-stage framework to model recommender systems.
We first propose an elicitation step which serves as a way to collect user information beneficial to
the recommendation task. In this thesis we frame the elicitation process as one of active learning. We
develop several active elicitation methods which, unlike previous approaches that focus exclusively on
improving the learning model, directly aim at improving the recommendation objective.
The second stage of our framework uses the elicited user information to inform models that predict
user-item preferences. We focus on user-preference prediction for a document recommendation problem for
which we introduce a novel graphical model over the space of user side-information, item (document)
contents, and user-item preferences. Our model is able to smoothly trade off its use of side information
and of user-item preferences to make good document recommendations in both cold-start and non-cold-
start data regimes.
The final step of our framework consists of the recommendation procedure. In particular, we focus on
a matching instantiation and explore different natural matching objectives and constraints for the paper-
to-reviewer matching problem. Further, we explore and analyze the synergy between the recommendation
objective and the learning objective.
In all stages of our work we experimentally validate our models on a variety of datasets from differ-
ent domains. Of particular interest are several datasets containing reviewer preferences about papers
submitted to conferences. These datasets were collected using the Toronto Paper Matching System, a
system we built to help conference organizers in the task of matching reviewers to submitted papers.
Acknowledgements
I am most indebted to my supervisors Richard Zemel and Craig Boutilier. Without their advice, their
continued encouragement and their help and support this work would not have been possible. I am
glad we made this co-supervision work.
I am especially grateful to Rich, who, as the NIPS'10 program chair, provided the initial motivation
and momentum behind this thesis. Furthermore, Rich's ideas, presence and experience were instrumental
in our joint creation of the Toronto paper matching system. Throughout these projects I found a great
mentor and I have been privileged to work closely with Rich. Rich has taught me a lot about how to pick
and approach research problems as well as about how to model them.
I am also very grateful to have been able to work with Craig. I have learned a lot from Craig. His
vision, his curiosity and his scientific rigour are qualities that I strive for. Craig’s ideas were also the
ones that initially helped foster this research and his insights and ideas throughout have provided great
balance to my work. Our interactions through COGS have further widened my research interests.
I would also like to thank my first mentor, Pascal Poupart, who showed me how exciting research
could be and gave me some of the tools to succeed at it.
My thanks also go to the members of my thesis committee, Sheila McIlraith and Geoffrey Hinton, for
their precise comments and questions throughout my PhD. Geoff’s enthusiasm and presence in the lab
were also very motivating to me. I would also like to thank my external advisor, Andrew McCallum,
for his apt comments regarding my work and also for pointing out important and beneficial
future steps. Finally, I am thankful to Ruslan Salakhutdinov and Anna Goldenberg for
reading and commenting on the final copy of my thesis.
The constant support and love of Anne were also essential in undertaking and successfully finishing
this PhD. Her reassuring words have helped me on many occasions. I am especially thankful for her
ideas and her outlook on life which she selflessly shares with me and which I have learned so much from.
Further, I want to dedicate this thesis to Viviane, the next big project in our lives.
Although their involvement was more indirect, I learned a lot from the postdocs who spent time in
Toronto; specifically, I want to thank Iain, Ryan, Marc'Aurelio and, of course, Hugo, who has become a
good friend and collaborator.
Finally, the machine learning group at Toronto was an extremely stimulating and pleasant place to
work at thanks to collaborators, close colleagues and friends: Kevin R., Jasper, Danny, Ilya, Kevin S.,
Jen, Fernando, Eric, Maks, Darius, Bowen, Charlie, Tijmen, Graham, John, Vlad, Deep, Nitish, George,
Andriy, Tyler, Justin, Chris, Niall, Phil, and Genevieve. Special thanks to Kevin R., Jasper, Danny and
Ilya for many interesting discussions about everything throughout our graduate years.
Contents
1 Introduction
1.1 Recommender Systems
1.1.1 Constrained Recommender Systems
1.2 Contributions
1.3 Outline
2 Background
2.1 Preliminaries and Conventions
2.1.1 Learning
2.2 Preference Modelling and Predictions
2.2.1 Collaborative Filtering
2.2.2 CF for Recommendations
2.3 Active Preference Collection
2.3.1 Uncertainty Sampling
2.3.2 Query by Committee
2.3.3 Expected Model Change
2.3.4 Expected Error Reduction
2.3.5 Batch Queries
2.3.6 Stopping Criteria
2.4 Matching
3 Paper-to-Reviewer Matching
3.1 Paper Matching System
3.1.1 Overview of the System Framework
3.1.2 Active Expertise Elicitation
3.1.3 Software Architecture
3.2 Learning and Testing the Model
3.2.1 Initial Score Models
3.2.2 Supervised Score-Prediction Models
3.2.3 Evaluation
3.3 Related Work
3.3.1 Expertise Retrieval and Modelling
3.4 Other Possible Applications
3.5 Conclusion and Future Opportunities
4 Collaborative Filtering with Textual Side-Information
4.1 Side Information in Collaborative Filtering
4.2 Problem Definition
4.3 Background
4.3.1 Variational Inference in Topic Models
4.4 Collaborative Score Topic Model (CSTM)
4.4.1 The Relationship Between CSTM and Standard Models
4.4.2 Learning and Inference
4.5 Related Work
4.6 Experiments
4.6.1 Datasets
4.6.2 Competing Models
4.6.3 Results
4.7 Conclusion and Future Opportunities
5 Learning and Matching in the Constrained Recommendation Framework
5.1 Learning and Recommendations
5.2 Matching Instantiation
5.2.1 Matching Objectives
5.3 Related Work on Matching Expert Users to Items
5.4 Empirical Results
5.4.1 Data
5.4.2 Suitability Prediction Experimental Methodology
5.4.3 Match Quality
5.4.4 Transformed Matching and Learning
5.5 Conclusion and Future Opportunities
6 Task-Directed Active Learning
6.1 Related Work
6.2 Active Learning for Match-Constrained Recommendation Problems
6.2.1 Probabilistic Matching
6.2.2 Matching as Inference in an Undirected Graphical Model
6.3 Active Querying for Matching
6.4 Experiments
6.4.1 Data Sets
6.4.2 Experimental Procedures
6.4.3 Results
6.5 Conclusion and Future Opportunities
7 Conclusion
7.1 Summary
7.2 Future Research Directions
Bibliography
List of Tables
3.1 Comparing the top ranks of word-LM and topic-LM
4.1 Modelling capabilities of the different score prediction models
4.2 Comparisons between CSTM and competitors for cold-start users
4.3 Test performance of CSTM and competitors on the unmodified ICML-12 dataset
4.4 Comparisons between CSTM and two variations
5.1 Overview of the matching/evaluation process
5.2 Comparison of the matching objective versus within-reviewer variance
List of Figures
1.1 Constrained recommendation framework
2.1 Graphical model representation of PMF and BPMF
2.2 Graphical model representation of a mixture model for collaborative filtering
3.1 A conference's typical workflow
3.2 High-level software architecture of the system
3.3 Histograms of score values for NIPS-10 and ICML-12
3.4 Comparison of top word-LM scores with top topic-LM scores
4.1 Preference prediction in the constrained recommendation framework
4.2 Graphical model representation of collaborative filtering with side-information models
4.3 Graphical model representations of LDA and CTM
4.4 Graphical model representation of CSTM
4.5 Graphical model representation of CTR
4.6 Score histograms for NIPS-10, ICML-12, and Kobo
4.7 Test performance comparing CSTM and competitors across datasets
4.8 Learned parameters for NIPS-10 and ICML-12
4.9 Test performance on unmodified ICML-12 dataset
4.10 Test performance comparing CSTM and CTR on new users
5.1 Constrained recommendation framework
5.2 Match-constrained recommendation framework
5.3 Score histograms for NIPS-10 and NIPS-09
5.4 Performance on the matching task on the NIPS-10 dataset
5.5 Histogram of assignments by score value for the NIPS-10 dataset
5.6 Comparison of score assignment distributions
5.7 Histograms of number of papers matched per reviewer under soft constraints
5.8 Performance on the transformed matching objective on NIPS-10
6.1 Elicitation in the match-constrained recommendation framework
6.2 Comparison of matching methods on a "toy" example
6.3 Histograms of score values for Jokes and Dating datasets
6.4 Matching performance comparing different active learning strategies
6.5 Usage frequency of the fall-back query strategy of our active learning methods
6.6 Additional matching experiments comparing different active learning strategies
Chapter 1
Introduction
Easy access to large digital storage has enabled us to record data of interest, from documents of great
scientific importance to videos of cats and hippos. The ease with which digital content can now be
shared renders this content nearly-instantly accessible to interested (online) users worldwide. Without
some form of organization, for example through search capabilities, these data would rapidly lose the
interest of users overwhelmed by this deluge of data. Hence, the tasks of organizing and analyzing
immense quantities of data are of critical importance. On-line search engines, by enabling their users
to search on-line records for relevant items, have been the original tools of choice for finding relevant
data. Although a good search engine can distinguish relevant from irrelevant items given a specific query
(for example, what movies are playing this weekend?), search engines lack the ability to determine the
level of user interest within the set of relevant documents. In the last few years, recommender systems
have become indispensable. Recommender systems enable their users to filter items based on individual
preferences (for example, which weekend movies would I enjoy?). Recommender systems add so much
value to search engines that major search engines now include a recommendation engine to personalize
their results. For similar reasons, recommender systems have become an important topic of academic
and corporate research.
1.1 Recommender Systems
Recommender systems are software systems that can recommend items, for example, scientific papers
or movies, of interest to their users, for example, scientists or movie lovers. Although specifics will differ
across real-world applications, recommender systems aim to help users in decision-making tasks [88].
Hence, recommender systems are useful in domains where it is infeasible or impractical for users to
experience or provide input on every available item. For example, a user may require a recommender
system to help in deciding which movie to attend at a film festival.
To fulfil its duties, a recommender system must correctly model information about its users and
items. A good recommender system should represent user preferences, but also should use other user
information which may affect that user's immediate preferences, such as a user's current state of mind
or his or her decision mechanism. In this respect, a recommender system behaves similarly to a user’s
friend who suggests an item of interest. Furthermore, a good recommender system should also analyze
information about items which will be useful in determining user interests. A good recommender system
for a user should therefore combine the insights of that user’s friends with the knowledge of a domain
expert [87].
Goldberg et al. [40] introduced the first practical recommender system at the end of the
1990s. Incidentally, the same authors also coined the term collaborative filtering to describe a system
which uses the collective knowledge of all its users to more accurately infer the preferences of individual
users. From that point, interest in both academia and the corporate world grew rapidly. Resnick
and Varian [87] formalized some of the design parameters in recommender systems. These parameters
included the type of user feedback (user evaluations of items), the various sources of preferences, and
the aggregation of user feedback, as well as how to communicate recommendations to the users. The
authors noted that “one of the richest areas for exploration is how to aggregate evaluations”. The future
proved them right because a plethora of work has since been carried out on this topic [88]. Furthermore,
this central question also constitutes the core of this thesis.
Creating a recommender system is a task which spans several research areas. As pointed out by Ricci
et al. [88] in their recommender systems handbook: “Development of recommender systems is a multidis-
ciplinary effort which involves experts from various fields such as artificial intelligence, human-computer
interaction, information technology, data mining, statistics, adaptive user interfaces, decision-support
systems, marketing, and consumer behaviour”. This thesis considers the determination of user-item pref-
erences, using information about users and items, as the central task faced by recommender systems.
Machine learning and specifically prediction models and techniques offer a natural way of representing
available information for predicting user-item preferences. Hence, machine learning models constitute
the central component of recommender systems. Other aspects of recommender systems, for example,
the interface between the system and its users, are auxiliary components which are built around the
prediction mechanisms.
1.1.1 Constrained Recommender Systems
User preferences represent the main thrust behind a recommender system’s suggestions. Therefore,
optimizing the prediction of preferences is the natural goal of a recommender system. However, the
system's designer may have additional objectives that he wishes to optimize, as well as other constraints
that he may be required to satisfy. For example, an on-line movie store may want to consider its
stock of each movie to ensure that only available movies are recommended. Depending on the exact
constraints, such recommendations may be similar to or very different from the initial, constraint-free
recommendations. The same on-line retailer may also wish to maximize some other function such as
user happiness, or perhaps more realistically, his or her own long-term profit. Such objectives lead to
a trade-off between an individual’s preferences, the preferences of other users, and the objectives of
the designers. Another common objective is for a recommender system to take into
account the diversity of its suggestions [78] (for example, as a way to hedge its bets against non-optimal
recommendations). A recommender system may also use its recommendations as a means to further
refine its user model, for example, by selecting a subset of recommended items using an “exploration
strategy”. In this context, it may be useful for the system to recommend items that will be most useful
in refining its user model, somewhat independently of a system’s (ultimate) objective to provide the best
possible recommendations. The study of constrained recommender systems, and specifically the design
of learning methods which are tailored to the interaction between the preference prediction objective
and the final objective, is a primary focus of this thesis.
1.2 Contributions
The aim of this thesis is to develop machine learning methods tailored to recommender systems.
Furthermore, we are interested in how such machine learning methods may lead to better user and item
representations and, therefore, ultimately lead to improved recommenders. We propose to model the
recommendation problem using a three-stage process which is elaborated in this thesis:
Preference Collection: When initially engaging with a recommender system, a new user must provide
information related to his preferences. This information will be used by the recommender system
to build a user model which paves the way to personalized recommendations. The preference
collection phase represents an opportunity for the system to elicit user information actively. We
frame the problem of selecting what to ask a user as an active learning problem. Such elicitation,
by asking informative queries which quickly lead to a good user model, has the potential to mitigate
the cold-start problem: the difficulty for a system to learn good models of users, such as new
users, with few known preferences. We propose several methods for performing
active learning of user preferences over items for (match-constrained) recommender systems. Our
main contribution, which leads to the empirical success we demonstrate, is a set of novel active learning
methods that are sensitive to the matching objective, the ultimate objective of match-constrained
recommender systems.
Predict missing preferences: The central task of a recommender system is to use elicited user infor-
mation to predict user-item preferences. This means that the system must use available information
to learn a predictive model of user-item preferences. Here, we propose several models to predict
these missing user preferences. We focus on developing models which, in addition to user-item
preferences, also model and leverage side-information, that is, features of users or items aside from
user preferences. We demonstrate empirically that this side-information can be beneficial in both
cold-start and non-cold-start data regimes.
Preferences to recommendations: Using predicted preferences to suggest items to users is the goal
of the recommender system. Going from preferences to suggestions can be as simple as selecting
a subset of the most preferred items or providing users with a preference-ordered list of items.
However, it can also involve solving a more complex optimization problem such as that imposed by a
constrained recommender system. This stage therefore involves using user-item preferences, either
user-elicited or predicted, as inputs to a recommendation procedure, for example a combinatorial
optimization problem, which considers the final objective and constraints and determines the
recommendations.
In this thesis, we introduce and instantiate match-constrained recommender systems which involve
matching users to items under constraints which globally restrict the set of possible matches. We
explore several matching objectives and constraints and show, for certain objectives, a synergy
between the final matching objective and the preference-prediction objective (the learning loss).
A flow chart depicting the three stages described above is shown in Figure 1.1. Although the first two
stages directly benefit from machine-learning techniques, the third, which arises from practical
considerations of recommender systems, provides an ultimate objective to be used to guide the development
of appropriate machine-learning techniques. In other words, there is a synergy between the first two
stages and the last stage. Specifically, the first stage's objective is to collect useful information
from users. The usefulness of the information will be determined by the recommendation objective. The
active learning strategies can therefore be guided by the recommendation objective. Furthermore, when
learning models for missing-preference prediction, a model sensitive to the recommendation objective
may focus on correctly predicting preferences that are more likely to be part of the recommender
system's suggestions. The exact form of inter-stage interactions will be detailed in the appropriate
chapters.

Figure 1.1: Flow chart depicting the different components of our research. The first part represents elicitation, followed by missing-preference prediction and finally the recommendation procedure, F(). The recommendation procedure represents the ultimate objective of the system (for example, ranking, matching, per-user diversity, social-welfare maximization).
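How the three stages plug together can be summarized in code. Everything below is a deliberately naive placeholder (first-come elicitation, a global-mean predictor, top-1 recommendation) meant only to show the flow of the pipeline, not the actual methods developed in this thesis:

```python
def elicit(candidate_queries, budget):
    """Stage 1 (elicitation): choose which (user, item) preferences to ask
    for. Placeholder: take the first `budget` queries; the thesis instead
    scores queries by their expected effect on the final objective."""
    return candidate_queries[:budget]

def fit_predictor(observed):
    """Stage 2 (missing-preference prediction): learn a model from observed
    preferences. Placeholder: predict the global mean for unseen pairs."""
    mean = sum(observed.values()) / len(observed)
    return lambda user, item: observed.get((user, item), mean)

def recommend(predictor, users, items):
    """Stage 3 (recommendation): apply the final objective and constraints
    to predicted preferences. Placeholder: each user's single top item."""
    return {u: max(items, key=lambda i: predictor(u, i)) for u in users}

observed = {("u1", "a"): 4.0, ("u1", "b"): 2.0, ("u2", "b"): 5.0}
predictor = fit_predictor(observed)
print(recommend(predictor, ["u1", "u2"], ["a", "b"]))  # {'u1': 'a', 'u2': 'b'}
```

The synergy discussed above corresponds to replacing each placeholder with a version that is aware of the others: an `elicit` guided by the recommendation objective, and a predictor that concentrates accuracy on the pairs `recommend` is likely to select.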
1.3 Outline
We will begin in Chapter 2 by defining the problem more formally, explaining some of the foundational
principles behind our work and reviewing relevant previous research in the areas of preference modelling
and prediction, active learning, and matching.
Chapter 3 introduces the paper-to-reviewer matching problem as a practical problem that will be
used to motivate and illustrate the various contributions throughout this thesis. This chapter also
introduces and describes in detail the on-line software system that we have implemented and released to
help conference organizers assign submitted papers to reviewers. Furthermore, this platform serves as a
testbed for the developments in other chapters.
Chapter 4 describes our work on the problem of preference prediction with textual side-information,
which is the first research contribution of this thesis. We introduce a novel graphical model for personal
recommendations of textual documents. The model’s chief novelty lies in its learned model of individual
libraries, or sets of documents, associated with each user. Overall, our model is a joint directed prob-
abilistic model of user-item scores (ratings), and the textual side information in the user libraries and
the items. Creating a generative description of scores and the text allows our model to perform well
in a wide variety of data regimes, smoothly combining the side information with observed ratings as
the number of ratings available for a given user ranges from none to many. We compare the model’s
performance on preference prediction with a variety of other models, including two methods used in our
paper-to-reviewer matching system. Overall, our method compares favourably to the competing mod-
els especially in cold-start data settings where user side-information is essential. We further show the
benefits of modelling user side-information in an application for personal recommendations of posters to
view at a conference.
In Chapter 5, we formally introduce our framework for optimizing constrained recommender systems
and instantiate a match-constrained recommender system from it. We frame the matching or assignment
problem as an integer program and propose several variations tailored to the paper-to-reviewer matching
domain. Experiments on two data sets of recent conferences examine the performance of several learning
methods as well as the effectiveness of the matching formulations. We show that we can obtain high-
quality matches using our proposed framework. Finally, we explore how preference prediction and
matching interact. Experimentally we show that matching can benefit from interacting with the learning
objective when matching utility is a non-linear function of user preferences.
Active learning methods to optimize match-constrained recommender systems are proposed in Chap-
ter 6. Specifically, we develop several new active learning strategies that are sensitive to the specific
matching objective. Further, we introduce a novel method for determining probabilistic matchings
that accounts for the uncertainty of predicted preferences. This is important since active learning
strategies are often guided by, or make use of, model uncertainty (for example, a common strategy is
to pick queries that will most reduce the model’s uncertainty). Experiments with real-world data sets
spanning diverse domains compare our proposed methods to standard techniques based on the qual-
ity of the resulting matches as a function of the number of elicited user preferences. We demonstrate
that match-sensitive active learning leads to higher-quality matches more quickly than standard
active learning techniques.
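For contrast, the standard baseline alluded to here, uncertainty sampling (surveyed in Section 2.3.1), fits in a few lines for binary preferences. The probabilities below are invented for illustration; the match-sensitive strategies of Chapter 6 replace the entropy criterion with one tied to the matching objective:

```python
import math

def entropy(p):
    """Entropy of a Bernoulli prediction p = P(user likes item)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical predicted like-probabilities for unobserved (user, item) pairs.
predictions = {("u1", "a"): 0.95, ("u1", "b"): 0.50, ("u2", "a"): 0.70}

# Uncertainty sampling: query the pair the model is least sure about,
# regardless of whether that pair matters for the final matching.
query = max(predictions, key=lambda pair: entropy(predictions[pair]))
print(query)  # ('u1', 'b'): p = 0.5 is maximally uncertain
```

A match-sensitive strategy would instead score each candidate query by how much resolving it is expected to improve the resulting matching, which is why pairs unlikely to appear in any good match can be safely ignored even when the model is uncertain about them.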
Finally, we provide concluding remarks and discuss opportunities for future work in Chapter 7. We
also provide a brief discussion of the main future issues to be addressed by the field of recommender
systems as a whole.
Chapter 2
Background
In this chapter, we review the literature pertaining to the three stages of recommendation systems
studied in this thesis. We begin by introducing the notation that will be used throughout this thesis and
reviewing some of the foundational concepts of machine learning. Then we review some of the previous
research in preference prediction, focusing especially on side-information based and collaborative filtering
methods. We also introduce the concept of active learning as a way to collect user preferences and review
some of the relevant literature. Finally, we introduce and discuss the matching literature as an interesting
example of a recommendation objective.
Note that throughout this chapter, we survey some of the work that relates to this thesis as a whole.
Work specifically related to individual chapters will be surveyed independently as part of the relevant
chapter.
2.1 Preliminaries and Conventions
In our study of recommendation systems, we take user-item preferences to be the atomic and quintessen-
tial pieces of information used to represent a user’s interest in a particular item. Our convention is to
use the term preferences to denote a user’s interest in an item, but we will also use it to denote a user’s
expertise with respect to an item. A user’s expertise denotes his competence, or qualifications, with re-
spect to a particular item and as such may differ from his preferences (for example, conference organizers
may wish to assign submitted papers to the most expert reviewers regardless of the reviewer’s actual
interest in the paper). This distinction will usually be clear from context, although we will typically
use the more specialized term score to denote expertise and rating to denote interest. Note that all
models and methods developed in this thesis can readily deal with ratings or scores. We only consider
cases where users explicitly express their preferences for items individually. For example, we do not
consider preferences that could be expressed by providing a ranking of a group of items. We assume that
user-item preferences are expressed using numeric (typically integer) values. Furthermore, we equate
higher ratings (or scores) with a stronger preference (that is, users would prefer an item rated 5 over
an item rated 1). Interestingly, the semantic meaning of the preferences is usually not specified in data
sets, and therefore the value of a preference is usually taken to represent its utility. As a result, most of
the work we survey considers the rating-prediction problem as a metric regression problem rather than
an ordinal regression problem, and we have followed their lead.
We denote an individual user as r, the set of all users as R, and the number of users as R (that is,
R ≡ |R|). Similarly, individual items are denoted as p, the set of all items as P, and the number of items
as P (that is, P ≡ |P|). In our work, we will use the term users to designate the set of entities to which recommendations
will be provided, regardless of whether they represent a person, a group of people, or other entities
of interest. Similarly, items are entities to be recommended, regardless of their actual representation.
Typical items in this thesis will be documents, jokes, and other humans. We denote the preference of
user r toward item p as srp (a score). It will also be useful to think of a preference as a (r, p, s)-triplet.
In general, we denote matrices using uppercase letters (for example, U, V), vectors using bold-font
lower-case letters (for example, a, γ), and scalars using lower-case letters (for example, a, b). Unless this
introduces ambiguity, we also denote the size of sets using uppercase letters (for example, N and R).
2.1.1 Learning
The central machine-learning task in this thesis is to model user interest in items to predict user-item
preferences. We will assume that a subset of user-item preferences, possibly with side information, is
always available to the system. Ways of obtaining such information will be discussed in Section 2.3. Our
goal is therefore to predict the preferences of users for items that they have not yet rated. In other words,
our goal is to use observed user-item preferences to learn a model that can predict missing user-item
preferences. This prediction problem is framed as a supervised machine-learning problem. Accordingly,
our aim is to minimize, for a suitable error or loss function, the expected loss of our preference prediction
model. The expectation is taken with respect to a fixed but unknown data-generating distribution,
where the generating distribution is a joint distribution over preferences and user-item pairs. Because
in practice the true expected loss cannot be evaluated, it is customary to evaluate and report instead
the empirical loss [124]. To evaluate the performance of a model, we will then use user-item preferences
collected from users, otherwise known as ground-truth preferences. The empirical loss is the result of
evaluating the loss function on the ground-truth preferences and the corresponding predicted preferences.
For the purposes of empirical comparison, the available ground-truth preferences are divided into
two disjoint sets: the training set, which the machine learning model is allowed to leverage, and the
test set, which the method is not given access to and, therefore, which can be used to evaluate model
performance. In addition, it can be useful to reserve additional data from the training set to constitute
a validation set. The validation set can be used to evaluate model performance during the training
stage. In particular, validation sets are often used to determine the value of hyper-parameters, which are
parameters given as input to a learning model, and to prevent overfitting. A model is said to overfit if
its training loss is much smaller than its test loss.
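The split described above can be sketched as follows; the function name and split fractions are illustrative assumptions, not from the thesis:

```python
import random

def split_triplets(triplets, val_frac=0.1, test_frac=0.2, seed=0):
    """Split observed (user, item, score) triplets into train/validation/test sets.

    The test set is held out for final evaluation; the validation set is
    carved from the training data to tune hyper-parameters and detect
    overfitting.
    """
    triplets = list(triplets)
    random.Random(seed).shuffle(triplets)
    n = len(triplets)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = triplets[:n_test]
    val = triplets[n_test:n_test + n_val]
    train = triplets[n_test + n_val:]
    return train, val, test

# A small synthetic set of (user, item, score) triplets for illustration.
triplets = [(r, p, (r + p) % 5 + 1) for r in range(10) for p in range(8)]
train, val, test = split_triplets(triplets)
```

Shuffling before splitting avoids accidentally correlating the test set with, say, the order in which ratings were collected.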
2.2 Preference Modelling and Predictions
In the context of recommendation systems, two broad classes of preference prediction systems have been
explored in the machine-learning literature: content-based systems and collaborative filtering systems.
Both types of systems make use of user preferences for items. Content-based systems assume access to
the content of the items and model content features, that is, features derived from the content. For example,
in a book domain, the content of the items includes the words of the books, and the corresponding word counts
would be content features. On the other hand, collaborative filtering leverages only the preference
similarities between users and/or items [40]. Instead of content-based, we prefer the more general term
side information, which denotes all user, item, or user-item features except the user-item preferences
themselves.
2.2.1 Collaborative Filtering
A simple example of collaborative filtering (CF) can be described as follows: imagine two users that
have similar ratings for certain items. CF makes the assumption that these similarities extend across
items rated by only one of the two users. Therefore, to learn a user's model, CF methods try to glean
preference information from similar users (that is, users whose preferences for items rated in common
are similar). Using this natural idea, researchers have built a wealth of models that have performed very
well on several real-life data sets [72, 62]. These methods usually exploit CF by enabling the learning
models to share parameters either across users or across items, or both.
We view CF as having certain advantages over side-information based methods:
1. CF leverages the very natural and powerful idea that a user’s preferences can be predicted using
the preferences of similar users. Using other users’ preferences is especially attractive in today’s
connected world where it has become easier to collect preferences for large numbers of users.
2. Because CF uses user preferences only, it is largely domain-independent and can also be used to
model user preferences across different item domains. On the contrary, side-information-based
systems leverage the similarities contained in the features of users or items. Selecting features that
are discriminative of user preferences may not be straightforward in certain domains. For example,
in the movie-domain, meta-data, such as a movie’s genre, has not typically led to significant
performance gains (see for example [66]).
CF also has certain limitations:
1. Because it leverages only the information contained in user preferences, CF is problematic when
the preferences of a specific user or for a specific item are not available, a problem commonly
known as the cold-start problem because it generally applies when users or items first enter the
system. Generally, CF is effective in domains where collections of preferences from users for items
are accessible, in particular when each user can rate multiple items and each item is rated by
multiple users. For example, typical domains in which CF has been successful relate to everyday
decisions, such as movie or restaurant recommendations, which entail easy access to large numbers
of users and their preferences. By design, CF would typically perform worse in domains where
obtaining many preferences from each user or for each item is difficult. Example domains include
recommending houses, cars, or universities to which a high-school student should apply. In such
domains, CF may still be useful once side information is available [19].
2. The fact that CF does not use domain-specific information also means that it does not make use
of potentially useful domain information when it is available.
We find that the domain-independence of CF methods, the apparent difficulty of leveraging
side information in systems where many preferences are available, and CF's performance in practice [7]
together tip the balance in favour of using CF methods for recommendation systems.
Hybrid systems that combine the advantages of both CF and side-information techniques offer inter-
esting opportunities. Overall, the field of side-information-CF models has only begun to be explored by
researchers. Because Chapter 4 will present several novel hybrid approaches, we will defer our discussion
of previous research in this field until then.
Original Collaborative Filtering Models
The term collaborative filtering was first coined by researchers who used this technique to help users
filter their emails by leveraging their colleagues’ preferences [40]. This was soon followed by similar work
using a neighbourhood, or memory-based, method aimed at filtering articles from netnews [86].
In model-free, or neighbourhood, approaches to CF, a prediction for a user-item pair is the weighted
combination of the ratings given to that item by the user’s neighbours. Alternatively, one could use the
weighted combination of the ratings given by that user to neighbouring items. The weight of each neigh-
bour is typically taken to be some similarity measure between the two users. Breese et al. [20] compare
different similarity measures such as cosine similarity and the Pearson correlation coefficient, as well as
several variations of these. In their experiments, they show that modifications to the latter yield the
best results.
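A minimal sketch of such a neighbourhood method, using the Pearson correlation coefficient as the similarity measure (all names and the toy ratings are illustrative):

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items the two users rated in common."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[p] for p in common) / len(common)
    mean_b = sum(ratings_b[p] for p in common) / len(common)
    num = sum((ratings_a[p] - mean_a) * (ratings_b[p] - mean_b) for p in common)
    den = math.sqrt(sum((ratings_a[p] - mean_a) ** 2 for p in common) *
                    sum((ratings_b[p] - mean_b) ** 2 for p in common))
    return num / den if den else 0.0

def predict(target, neighbours, item):
    """Similarity-weighted combination of the neighbours' ratings for `item`."""
    num = den = 0.0
    for other in neighbours:
        if item in other:
            w = pearson(target, other)
            num += w * other[item]
            den += abs(w)
    return num / den if den else None

alice = {1: 5, 2: 3}        # item -> rating
bob = {1: 5, 2: 3, 3: 4}    # agrees with alice on all common items
```

Here Bob's ratings agree exactly with Alice's on their common items, so his rating of item 3 is passed through unchanged as the prediction for Alice.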
The other main class of CF models is called model-based CF [20]. In this approach, parameters of
a (probabilistic) model are learned using known ratings, and the model is then used to predict missing
ratings.
In terms of ratings prediction performance, the neighbourhood approaches are typically not as good
as the best model-based approaches on real-world CF data sets. However, the fact that they are very
computationally efficient renders them attractive for on-line recommendation applications (for example,
those that need to update their recommendations according to newly acquired ratings in real time).
Model-based approaches, in addition to providing stronger performance, are generally more principled.
Among model-based approaches, there has been a push toward using probabilistic models (see the next
two sections). Probabilistic models generally offer a wealth of advantages, such as uncertainty modelling,
higher robustness to noise in the data, and a more natural way to encode additional information such
as prior information available to the system or other features of items and users. By using probabilistic
models, it is also possible to benefit from recent developments in these models, including advances in
learning and inference techniques.
In this document, we will focus mostly on model-based approaches, which have been favoured in
most recent machine-learning work.
Matrix Factorization Models
User-item-score triplets of the form (r, p, s) can be seen as the entries of a matrix, where the pair (r, p)
represents an index of the matrix and s corresponds to the value at that index. We will denote this score
matrix by S and the triplet (r, p, s) by s_rp = s. The dimension of the matrix is equal to the number of
users R by the number of items P (S ∈ ℝ^{R×P}). The unobserved triplets are encoded as missing entries
in this matrix. The resulting matrix containing observed and unobserved triplets is therefore a sparse
matrix. The goal of collaborative filtering is then to fill the matrix by estimating its missing entries
(S^m) given its observed entries (S^o). One natural way of doing so is to find a factorization
of S, S ≈ U^T V, with U ∈ ℝ^{k×R} and V ∈ ℝ^{k×P}, with k effectively determining the rank of U^T V [111].
Then a missing value, (r, p), can be recovered by multiplying the r-th column of U with the p-th column
of V: u_r^T v_p. When S encodes user-item preferences, then U could be seen as encoding user attitudes
toward the item features encoded by V . Unless this introduces ambiguity, we simplify the notation by
denoting user r’s factors, that is, r’s column of U , by ur, and similarly item p’s factors, p’s column of
V, by v_p. For matrices without missing entries, a factorization can be recovered by finding the singular
value decomposition (SVD) of S, which is a convex optimization problem. However, SVD cannot be
used for matrices with missing entries [112]. Furthermore, CF applications typically involve predicting
preferences for a large number of users and items given a small number of observed ratings (i.e., S is
sparse), |S^o| ≪ RP. For that reason, estimating (R² + RP + P²) parameters would not be wise.
A variety of methods have attempted to deal with these difficulties for CF. Srebro and Jaakkola
[112] proposed to regularize the factorization by finding a low-rank approximation S ≈ Ŝ, where
rank(Ŝ) = k ≪ min(R, P). Specifically, Srebro and Jaakkola [112] searched for U and V that would
minimize the squared loss between the reconstruction and the observed ratings:
∑_i ∑_j I_ij (s_ij − u_i^T v_j)² ,    (2.1)

where I_ij = 1 if score s_ij is observed and 0 otherwise. They also considered more elaborate cases where
I could be a real number corresponding to a weight, “for example in response to some estimate of the
noise variance” [112]. They proposed both an iterative procedure, based on the fact that the objective
becomes convex when conditioning on either U or V , and an expectation-maximization (EM) procedure.
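The iterative procedure can be sketched with alternating least squares: with V fixed, solving for each user's factor is a convex least-squares problem over the items that user rated, and symmetrically for V. A minimal sketch, not the authors' implementation; the small ridge regularizer is an assumption added for numerical stability:

```python
import numpy as np

def als(S, I, k=1, n_iters=20, reg=0.01, seed=0):
    """Alternating least squares for sum_ij I_ij (s_ij - u_i^T v_j)^2.

    With V fixed, each user's factor u_i solves an independent (ridge-
    regularized) least-squares problem over the items that user rated,
    and symmetrically for each item's factor v_j.
    """
    rng = np.random.default_rng(seed)
    R, P = S.shape
    U = 0.1 * rng.standard_normal((k, R))
    V = 0.1 * rng.standard_normal((k, P))
    for _ in range(n_iters):
        for i in range(R):
            obs = I[i].astype(bool)
            Vi = V[:, obs]
            U[:, i] = np.linalg.solve(Vi @ Vi.T + reg * np.eye(k), Vi @ S[i, obs])
        for j in range(P):
            obs = I[:, j].astype(bool)
            Uj = U[:, obs]
            V[:, j] = np.linalg.solve(Uj @ Uj.T + reg * np.eye(k), Uj @ S[obs, j])
    return U, V

# Recover a rank-1 score matrix with one entry held out as unobserved.
S = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 1.0])
I = np.ones_like(S)
I[0, 2] = 0  # pretend s_02 (true value 1.0) is unobserved
U, V = als(S, I, k=1)
pred = U.T @ V  # the held-out entry should be reconstructed accurately
```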
Salakhutdinov and Mnih [99] further regularized the objective using the Frobenius norm of U and V :
∑_i ∑_j I_ij (s_ij − u_i^T v_j)² + λ₁ ||U||_Fro + λ₂ ||V||_Fro .    (2.2)
|| · ||_Fro denotes the Frobenius norm of a matrix and corresponds to the square root of the sum of squares
of all elements of the matrix (it is the equivalent, for matrices, of the vector Euclidean norm). The minimum
of this objective corresponds to the MAP solution of a probabilistic model with a Gaussian likelihood:
Pr(S | U, V, σ²) = ∏_i ∏_j N(s_ij | u_i^T v_j, σ²)    (2.3)
and isotropic Gaussian priors over U and V. This model is called probabilistic matrix factorization
(PMF), and its graphical representation, by way of a Bayesian network, is shown in Figure 2.1(a). The
authors show that although the problem is non-convex, performing gradient descent jointly on U and
V yields very good performance on a large real-life data set such as the Netflix challenge data set (100
million ratings, over 480,000 users, and 17,000 items). Instead of approximating the posterior with the
independent mode of U and V , Lim and Teh [68] proposed to use a mean-field variational approximation
which assumes an independent posterior over U and V (Pr(U, V |So) = Pr(U |So) Pr(V |So)). Salakhutdi-
nov and Mnih [98] further proposed Bayesian PMF, a fuller Bayesian extension of PMF, which involves
setting hyper-priors over the parameters of the priors of U and V (see Figure 2.1(b)). The posterior can
be approximated using a sampling approach. Specifically, they developed a Gibbs sampling approach
that scales well enough to be applied to large data sets and which outperforms ordinary PMF.
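A minimal sketch of batch gradient descent on this kind of objective, using the squared Frobenius norms as in standard PMF; the learning rate, initialization, and toy data are illustrative assumptions:

```python
import numpy as np

def pmf_objective(S, I, U, V, lam):
    """Squared error on observed entries plus squared-Frobenius regularizers."""
    E = I * (S - U.T @ V)
    return float(np.sum(E ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2)))

def pmf_gradient_step(S, I, U, V, lr, lam):
    """One batch gradient step jointly on U and V."""
    E = I * (S - U.T @ V)  # residuals, zeroed out on unobserved entries
    gU = -2 * (V @ E.T) + 2 * lam * U
    gV = -2 * (U @ E) + 2 * lam * V
    return U - lr * gU, V - lr * gV

rng = np.random.default_rng(0)
S = np.outer([1.0, 2.0], [1.0, 0.5, 2.0])  # tiny rank-1 score matrix
I = np.ones_like(S)                        # all entries observed here
U = 0.1 * rng.standard_normal((1, 2))
V = 0.1 * rng.standard_normal((1, 3))
losses = [pmf_objective(S, I, U, V, 0.01)]
for _ in range(500):
    U, V = pmf_gradient_step(S, I, U, V, lr=0.01, lam=0.01)
    losses.append(pmf_objective(S, I, U, V, 0.01))
```

Although the joint problem is non-convex, on a tiny rank-1 example the loss drops steadily toward the regularized optimum.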
Instead of performing regularization by constraining the rank of U and V , which would render the
problem non-convex, Srebro et al. [113] proposed regularizing (only) the norm of the factors, ||U ||Fro
and ||V ||Fro, or equivalently the trace norm of UTV (that is, the sum of its singular values, denoted
by ||UTV ||Σ). With certain loss functions (for example, the hinge loss), finding the optimal UTV with
the trace-norm constraints is a max-margin learning problem, and hence the name of this model is max
Figure 2.1: Graphical model representation of PMF (a) and BPMF (b). These figures show that Bayesian PMF is a PMF model with additional priors over U and V. Both figures are borrowed from [98].
margin matrix factorization (MMMF). To illustrate this, suppose, for ease of exposition, that S is binary.
Then maximizing the margins s_ij (u_i^T v_j) with fixed ||U^T V||_Σ will find the solution U^T V that yields the largest
distance between the hyperplane defined by the normal vector u_i^T v_j and the ratings {s_ij}. Although
this approach leads to a convex optimization problem, Srebro et al. [113] framed the optimization as
a semi-definite program (SDP), and given current solvers, the optimization problem cannot be solved
for more than a few thousand variables [129]. In their formulation, the number of variables is equal
to the number of observed entries, meaning that this solution technique severely limits the scale of
the applicable data sets. To combat this problem, Rennie and Srebro [85] proposed a faster variant of
MMMF, in which they use a second-order gradient descent method, and bound the ranks of U and V
(effectively solving the non-convex problem). Another way to understand this work is as follows: start
from Equation 2.2; a max-margin approach then reveals itself if the squared-error term is replaced by a
loss function that maximizes the margin between the hyperplanes defined by u_i^T v_j and the ratings {s_ij}
for all i, j. In fact, with binary ratings, PMF can also be seen as a least-squares SVM [116].
As previously noted, in matrix factorization techniques the factors U and V can be thought of as user
features and item features, respectively. Of note is the fact that factorization models are symmetric with
respect to users and items: transposing the observed score matrix does not alter the optimal solution.
Other Probabilistic Models
There has also been significant work on probabilistic models that are not based on factorizing the score
matrix. Such models typically learn different sets of parameters for users and for items. The principle
underlying most of these probabilistic models is to cluster users into a set of global user profiles. Profiles
can be seen as user attitudes toward items. Once a user’s profile has been determined, that profile alone
interacts, in some specific way, with an item’s representation to produce a rating. Some of this work
originates in applying mixture models to the problem of CF [49]. The generative model is as follows:
Pr(s_ij | u_i, v_j) = ∑_{z∈Z} Pr(z_i = z | u_i) Pr(S_zj = s; μ_zj, σ²_zj) ,    (2.4)
Figure 2.2: Graphical representation of a mixture model proposed by Hofmann [49] to predict ratings.
where Z is a latent random variable representing the different user profiles. Pr(S_zj = s; μ_zj, σ²_zj) is a
Gaussian distribution with mean μ_zj and variance σ²_zj. This mixture model's Bayesian-network
representation is presented in Figure 2.2. Informally, the corresponding generative model first associates user
ui with a profile z. Then that profile and the item of interest vj combine to set the parameters of a
Gaussian, which in turn determines the value of the score for (ui, vj). The parameters to learn are the
mixture weights for each user (that is, the distribution over the profiles) in addition to the mean and
variance of a Gaussian distribution for every item and every profile. Several authors have proposed to
use priors over possible user profiles [71, 48]. In addition to representing users as a distribution over a
small set of profiles, Ross and Zemel [93] proposed to also further cluster items (MCVQ).
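Equation 2.4 can be evaluated directly; a minimal sketch, with all parameter names assumed for illustration rather than taken from Hofmann's implementation:

```python
import math

def mixture_rating_density(s, profile_probs, mu, sigma2, j):
    """Pr(s_ij | u_i, v_j) = sum_z Pr(z_i = z | u_i) N(s; mu_zj, sigma2_zj).

    profile_probs: user i's mixture weights over the Z profiles.
    mu[z][j], sigma2[z][j]: the Gaussian mean and variance for profile z
    and item j.
    """
    density = 0.0
    for z, pz in enumerate(profile_probs):
        var = sigma2[z][j]
        density += pz * math.exp(-(s - mu[z][j]) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return density

# Two equally likely profiles that disagree about item 0: the predictive
# density at s = 4 is the average of the two per-profile Gaussians.
d = mixture_rating_density(4.0, [0.5, 0.5], [[3.0], [5.0]], [[1.0], [1.0]], 0)
```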
Marlin and Zemel [73] model users as a set of binary factors (binary vectors), and factors are viewed
as user attitudes. Each factor is associated with a multinomial distribution over ratings for each item.
The main distinguishing feature of this approach is that the influences of the different factors, expressed
through their multinomials over ratings, are combined multiplicatively to predict a rating for a particular
item. This combination implies that factors can express no opinion about certain items (by having a
uniform distribution over the item’s ratings). Furthermore, the multiplicative combination also means
that the distribution over a rating can be sharper than any of the factor distributions, something that is
impossible in mixture models where distributions are averaged.
Lawrence and Urtasun [66] proposed the use of a Gaussian process recovered by marginalizing out U
from the PMF formulation, Equation 2.3, assuming an isotropic Gaussian prior over U . The likelihood
over ratings is then a zero-mean Gaussian with a special covariance structure:
P(S | V, σ², α_w) = ∏_{j=1}^{N} N(s_j | 0, α_w⁻¹ V^T V + σ² I) .    (2.5)
One then optimizes for the parameters V and the scalar parameters α_w and σ using (stochastic) gradient
descent. Here the parameters V form a covariance matrix that encodes similarities between items. To
obtain a prediction for a user given the above model, one simply has to condition on the observed ratings
of the user. Given the Gaussian nature of this model, a user’s unobserved ratings will then be predicted
by a weighted combination of the user's observed ratings. The weights of the combination are then a
function of the covariance between the observed and unobserved items.
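This conditioning step can be sketched as follows; the shapes and parameter values are illustrative assumptions:

```python
import numpy as np

def gp_predict(V, sigma2, alpha_w, observed_idx, observed_s, query_idx):
    """Predict unobserved ratings for one user by Gaussian conditioning.

    Under the marginalized model, a user's ratings are jointly Gaussian with
    covariance K = alpha_w^{-1} V^T V + sigma2 I (Eq. 2.5); the conditional
    mean of the query entries is a weighted combination of the observed ones.
    Assumed shape for illustration: V is k x P.
    """
    K = (V.T @ V) / alpha_w + sigma2 * np.eye(V.shape[1])
    K_oo = K[np.ix_(observed_idx, observed_idx)]
    K_qo = K[np.ix_(query_idx, observed_idx)]
    weights = K_qo @ np.linalg.inv(K_oo)  # combination weights over observed ratings
    return weights @ observed_s

# Items 0 and 2 share the same latent representation, so the user's rating
# for item 2 should be predicted close to the observed rating for item 0.
V = np.array([[1.0, 0.5, 1.0]])
pred = gp_predict(V, 0.01, 1.0, [0, 1], np.array([2.0, 1.0]), [2])
```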
Salakhutdinov et al. [100] proposed to apply restricted Boltzmann machines (RBMs) to the collab-
orative filtering task. RBMs are a class of undirected graphical models, general enough to be applied,
with few modifications, to problems in many different domains. For CF, Salakhutdinov et al. [100]
showed that using RBMs enables efficient learning and inference that can scale to large problems (such
as the Netflix data set) and can yield near state-of-the-art performance. Furthermore, RBM distributed
representations combine active features multiplicatively, in a similar way to, and with the same advantages
over mixture models as, the model of Marlin and Zemel [73]. Recently, Georgiev and Nakov [38] proposed a
novel RBM-based model for CF.
Without direct comparative experiments, it is difficult to differentiate among the methods described
above in terms of performance. Matrix factorization techniques, which are typically very fast to train
(and support effective inference), have been the most widely used in recent years. One lesson drawn from
the large competition run by the on-line movie-rental company Netflix is that ensemble methods, which
are combinations of several different methods, are particularly well-suited for collaborative filtering
problems. With this in mind, combinations of linear models such as matrix factorization techniques
have been shown to be very effective [62, 7, 64].
Classical Evaluation Metrics
We now turn our attention to the performance metrics typically used to evaluate CF models. Two
metrics are commonly used to compare the performance of CF methods. Both are decomposable over
each single predicted rating. The first is the mean absolute error (MAE):
MAE := ( ∑_{(r,p)∈S^o} |ŝ_rp − s_rp| ) / |S^o| ,    (2.6)

where ŝ_rp is a learning method's estimate of s_rp and S^o is the set of user-item index pairs of interest
(e.g., the pairs contained in a test set). Researchers have also used a normalized version of this metric
(NMAE), where the MAE is normalized so that random guessing would yield an error value of one.
Normalization enables comparison of errors across data sets that have different rating scales. The second
metric is the mean squared error (MSE):
MSE := ( ∑_{(r,p)∈S^o} (ŝ_rp − s_rp)² ) / |S^o| ,    (2.7)

and its square-root version, RMSE := √MSE. The probabilistic methods introduced in the previous
section do not directly optimize for these measures. Instead, they learn a maximum likelihood (ML)
estimate of the parameters, a maximum a posteriori (MAP) estimate, or a full distribution over their
parameters in the case of Bayesian methods such as Bayesian PMF. As outlined above, finding a MAP
estimate is, under certain assumptions, equivalent to minimizing the squared loss of individual predictions
(and hence the MSE). Once learning is done, probabilistic models can be used to infer a distribution over
ratings. Given that distribution, different evaluation metrics call for different prediction procedures: the
MSE is minimized by taking the expectation over possible rating values, while the MAE is optimized by
predicting the median of the ratings distribution. One exception is MMMF, which does not have a
direct probabilistic interpretation; instead, Srebro et al. [113] proposed to minimize a hinge loss
generalized to a multi-class objective resembling the MAE.
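These metrics, and the corresponding optimal point predictions from a rating distribution, can be sketched as follows (a plain sketch, with list-based inputs assumed for simplicity):

```python
def mae(preds, truths):
    """Mean absolute error (Eq. 2.6) over aligned prediction/ground-truth lists."""
    return sum(abs(p - t) for p, t in zip(preds, truths)) / len(truths)

def mse(preds, truths):
    """Mean squared error (Eq. 2.7)."""
    return sum((p - t) ** 2 for p, t in zip(preds, truths)) / len(truths)

def rmse(preds, truths):
    """Root mean squared error."""
    return mse(preds, truths) ** 0.5

def expected_rating(dist):
    """MSE-optimal point prediction: the expectation of the rating distribution."""
    return sum(r * p for r, p in dist.items())

def median_rating(dist):
    """MAE-optimal point prediction: the median of the rating distribution."""
    cum = 0.0
    for r in sorted(dist):
        cum += dist[r]
        if cum >= 0.5:
            return r
```

For a skewed distribution such as {1: 0.2, 5: 0.8}, the two prediction rules disagree: the expectation is 4.2 while the median is 5, which is why the choice of metric should dictate the prediction procedure.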
The Missing At-Random Assumption
There is a general issue with the evaluation framework used in most CF research which may be problem-
atic when deploying these models in the real world. Because the empirical loss is taken as a surrogate
for the expected loss, one assumes that the data-generating distribution of both is the same; however,
in practice, it usually is not. In other words, users generally do not rate items at random, and in fact,
it has been shown empirically that users tend to bias their ratings toward items in which they are inter-
ested [75]. Positively biased ratings are likely due to users wanting to experience items that they believe
they will enjoy a priori. Hence, learning models will be affected unless this bias is explicitly taken into
account. Technically speaking, most models assume that data are missing at random, which is often
incorrect. Marlin et al. [75] and Marlin and Zemel [74] proposed a series of models which explicitly
account for data not missing at random. Using such models is preferred if data about items considered
but not explicitly rated are available. In practice, such data are seldom available (although on-line mer-
chants may have access to them). Other solutions try to reconcile the (possibly biased) data with the
need to minimize the (unbiased) expected loss, but they also have certain drawbacks [126]. Although
of great interest, such questions are somewhat tangential to our goals and therefore will not be directly
addressed here.
2.2.2 CF for Recommendations
The astute reader will have noticed that although we claim that CF techniques belong to the realm
of recommendation systems, neither the outlined CF methods nor their evaluation metrics offer a clear
way of making item recommendations to users. In fact, many authors (see, for example, [72]) see the
prediction of preferences as the first of two steps leading to personalized recommendations. The second
step consists of sorting the (predicted) ratings. The top recommendation for a user is then the item
with the highest predicted score. This two-step view of a recommendation system using collaborative
filtering has two important shortcomings:
1. When the goal is simply to recommend the top items to users, a solution to the ratings-estimation
problem is also a solution to the ranking problem. However, Weimer et al. [128] pointed out that
score-estimation models must also learn to calibrate scores, thus making the score-estimation problem
actually harder than estimating rankings.
2. As we have argued in Chapter 1, defining item recommendations by suggesting to users their
predicted favourite items is only one example of a possible recommendation task. In practice other
conditions and constraints may need to be satisfied:
(a) Certain domains may warrant the consideration of factors such as confidence in predictions.
For example, in a mobile application where there are costs associated with recommendations,
a system might prefer to use a more conservative strategy by, say, recommending an item for
which it has a higher certainty of success.
(b) The availability of side information may introduce additional constraints. For example, if
several items are to be recommended at once, criteria such as diversity might have to be
taken into account (see, e.g., [78]).
(c) In addition to constraints affecting single users or items, there are possibly global constraints
on users, items, or both. For example, certain items may be available only in limited quan-
tities, and therefore these items may not be recommended to more than a certain number of
users. More complex relationships may also exist between users and items, such as those that
arise in the domain of paper-to-reviewer matching for scientific conferences (see Chapter 3 for
a complete discussion).
In the cases described above, modifying the second step according to the recommendation task
may not be optimal because a learning method trained for ratings prediction may not capture the
constraints and aims of the recommendation system.
Instead of considering CF for recommendations as the product of two separate steps, where the first
is unaware of the objectives of the second, we argue that the tasks should be more integrated in that the
objective of collaborative filtering should be sensitive to the final objective. In fact, this line of reasoning
will be present throughout this thesis. We review some existing methods on the topic below and defer
a full discussion of our approaches to Chapter 5.
Collaborative Filtering in the Service of Another Task
In previous sections, we have reviewed various collaborative filtering techniques that all optimize rating
predictions. We have also established that to offer recommendations, a system should be aware of the
final recommendation task from the beginning. We mean by this that the learning procedure, at training
time, should consider a loss reflective of the ultimate objective. One may object to this by pointing out
that regardless of the final task, the learning method will be accurate if it is able to estimate perfectly
the unobserved ratings. However, this is almost never the case in practice. Instead, integrating the final
objective into the learning objective can be seen as indicating to the learning algorithm where to focus
its attention (capacity) to maximize performance on the final task.
One may wonder why this seemingly simple principle has not been more widely applied. First, as
will be made clear in this section, the ultimate task objective is often harder to optimize than the
ratings estimation task. Both RMSE and MAE conveniently decompose over single predictions and are
continuous. Second, although possibly sub-optimal, it may be more appealing to design a general
CF method that can easily be applied to multiple different tasks rather than one built with a specific task in
mind.
In this section, we outline some methods that have looked at integrating the preference estimation
step and the recommendation step into one. We also present a few tasks for which collaborative filtering
has been used and which may benefit from an integrated approach.
Most of the existing work on CF in the service of another task has proposed solutions for specific
recommendation tasks as opposed to providing frameworks that can easily be adapted to different tasks.
Jambor and Wang [56] stand out because they do not focus on a specific application, but rather propose
a general framework that can be tailored to the requirements of various applications. Although their
proposal is not definitive, it touches upon some of the
ideas that we have put forward in the introduction to this section and that have been developed by others
looking at specific applications. The main technical idea developed by Jambor and Wang [56] is that
contrary to what RMSE and MAE do, not all errors are “created equal”, and different types of errors
should be weighted differently based on the objectives of the recommendation system.
Chapter 2. Background 16
Accordingly, the authors use a loss function weighted by a function of the ground-truth rating and the prediction:
\sum_{(r,p) \in S^o} w(s_{rp}, \hat{s}_{rp}) \, (s_{rp} - \hat{s}_{rp})^2 \qquad (2.8)
These weights can either be fixed a priori or learned to minimize the system’s loss function (for example,
a ranking function). Let us look at two simple examples to demonstrate how the w’s may be set for
different objectives. For simplicity, we will assume that the ratings are binary. In the first example,
imagine that the objective is to maximize the system’s precision performance, a reasonable objective
when the goal is to recommend top items to users. In that case, predicting a zero for a rating with
a ground-truth (GT) value of one will hurt the system less than predicting a one with a GT of zero.
Then it is sensible to set w(0, 1) > w(1, 0) in the above formula. In general, given that our objective is
precision, underestimating a high rating is less costly than overestimating a low rating. Now imagine
that instead of maximizing precision, we would like to maximize recall. Then it would be sensible for the
learning method to incur a greater cost when incorrectly classifying a one versus incorrectly classifying
a zero, and therefore w(1, 0) > w(0, 1).
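For concreteness, this weighted loss can be sketched as follows; the weight tables and predictions below are invented for illustration and are not values from Jambor and Wang [56]:

```python
# Hedged sketch of the weighted squared loss of Equation 2.8 for binary
# ratings; w maps (ground truth, rounded prediction) to a weight.

def weighted_loss(truth, pred, w):
    """Sum of w(s, round(s_hat)) * (s - s_hat)^2 over observed ratings."""
    return sum(w[(s, round(s_hat))] * (s - s_hat) ** 2
               for s, s_hat in zip(truth, pred))

truth = [1, 0, 1, 0]
pred = [0.2, 0.9, 0.8, 0.2]  # one underestimate (GT 1), one overestimate (GT 0)

# Precision-oriented weights: overestimating a low rating costs more, w(0,1) > w(1,0).
w_precision = {(0, 0): 1.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 1.0}
# Recall-oriented weights: the opposite, w(1,0) > w(0,1).
w_recall = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 5.0, (1, 1): 1.0}

loss_p = weighted_loss(truth, pred, w_precision)
loss_r = weighted_loss(truth, pred, w_recall)
```

Here the predictor's largest error is an overestimate of a zero-rated item, so the precision-oriented weights penalize it more heavily than the recall-oriented ones.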
Collaborative Ranking
The most natural recommendation task is to provide every user with a list of items ranked in order of
(estimated) preferences or ratings. This list may not contain all items, but is rather a truncated list with
only the top items for a given user. Naturally, then, optimizing a collaborative filtering system with a
ranking objective has received the most attention from the community.
Although many different ranking objectives exist, researchers in the learning-to-rank literature have
recently focused mostly on the normalized discounted cumulative gain (NDCG) [57]:
\mathrm{NDCG}@T(S, \pi) := \frac{1}{N} \sum_{i=1}^{T} \frac{2^{S_{\pi_i}} - 1}{\log(i + 2)} \qquad (2.9)
where S are the ground-truth scores and \pi denotes a permutation; that is, S_{\pi_i} is the ground-truth
rating of the item placed at rank i by \pi, which reflects the ordering of the predictions of the underlying
learning model. The truncation parameter T corresponds to the number of items to be recommended to each user, and N is
a normalization factor such that the possible values of NDCG lie between zero and one. Note that the
denominator has the effect of weighting the higher-ranked items more strongly than the lower-ranked
ones. Having T and the denominator in the NDCG objective represents two major differences compared
to traditional collaborative filtering metrics and highlights the fact that, for top-T recommendations,
learning with NDCG@T might be beneficial compared to learning with a traditional loss (e.g., RMSE).
Moreover, NDCG as defined above considers the ranking of only one user, whereas in a multi-user setting,
it is customary to report the performance of a method using the average NDCG across all users.
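The definition above can be transcribed directly; in this sketch the normalization N is taken to be the DCG of the ideal (ground-truth) ordering, and the small rating vectors are invented for illustration:

```python
import math

# Hedged sketch of NDCG@T (Equation 2.9). The permutation pi is obtained
# by sorting items by decreasing predicted score; gains of the ground-truth
# ratings at those positions are discounted logarithmically, and N is the
# DCG of the ideal ordering so that values lie in [0, 1].

def dcg_at_t(scores_in_rank_order, t):
    return sum((2 ** s - 1) / math.log(i + 2)
               for i, s in enumerate(scores_in_rank_order[:t]))

def ndcg_at_t(ground_truth, predicted, t):
    # pi: item indices ordered by decreasing predicted score
    pi = sorted(range(len(predicted)), key=lambda i: -predicted[i])
    ranked_truth = [ground_truth[i] for i in pi]
    ideal = sorted(ground_truth, reverse=True)
    n = dcg_at_t(ideal, t)  # normalization: DCG of the ideal ordering
    return dcg_at_t(ranked_truth, t) / n if n > 0 else 0.0

truth = [5, 3, 4, 1]
perfect = ndcg_at_t(truth, [5, 3, 4, 1], t=3)  # predictions agree with truth
swapped = ndcg_at_t(truth, [1, 3, 4, 5], t=3)  # worst item ranked first
```

As expected, a predicted ordering that agrees with the ground truth attains the maximal value, while placing the worst item first is penalized most heavily because of the logarithmic discount.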
A major challenge in this line of work is that NDCG is not a continuous function, but is in fact
piecewise constant [128]: items with the same rating can be ranked in a number of different
consistent orders without affecting the value of NDCG. For this reason, it is challenging to perform
gradient-based optimization of NDCG. Weimer et al. [128] derived a convex lower bound on NDCG
and maximized that instead. The underlying model predicting the ratings has the same form as PMF.
Consequently, their approach can be understood as training PMF using the NDCG loss and is called
CoFiRank. The resulting optimization is still non-trivial because the evaluation of the lower bound
is expensive, but using a bundle method [109] with an LP in its inner loop, Weimer et al. [128] were
able to report results on large data sets such as the Netflix data set. In experiments on standard CF
data sets, they compared their approach to MMMF as well as their approach using an ordinal loss and
a simple regression loss (RMSE). They showed that for most settings, training using a ranking loss
(NDCG or ordinal regression) results in a significant performance gain versus MMMF. In a subsequent
paper, Weimer et al. [129] showed less convincing results; on a CF ranking task, their method trained
using an ordinal loss is outperformed by the same method trained using a regression loss. The data
sets used in both papers differed, which may explain these differences. Balakrishnan and Chopra [4]
proposed a two-stage procedure resembling the optimization of NDCG. First, much as in regular CF, a
PMF model is learned. In the second stage, they use the latent variables inferred in PMF as features
in a regression model using a hand-crafted loss which is meant to capture the main characteristics of
NDCG. They show that jointly learning these two steps easily outperforms CoFiRank while providing
a simpler optimization approach. Volkovs and Zemel [125] used a similar two-stage approach. They
first extract user-item features using neighbour preferences. They then use these features inside a linear
model trained using LambdaRank, a standard learning-to-rank approach [24]. This model delivered
state-of-the-art performance and requires only 17 parameters, making learning and inference very fast.
Shi et al. [107] define a simple user-specific probability model over item rankings. The probability
distribution encodes the probability that a specific item p is ranked first for user r [25]:

P(s_{rp}) = \frac{\exp(s_{rp})}{\sum_{p'} \exp(s_{rp'})}. \qquad (2.10)
A cross-entropy loss function is used to match the ground-truth distribution (the distribution
yielded by the above equation using the ground-truth ratings) with the model (PMF) distribution:

\mathrm{Loss} := -\sum_{p \in S^o(r)} \frac{\exp(s_{rp})}{\sum_{p' \in S^o(r)} \exp(s_{rp'})} \log \frac{\exp(g(u_r^T v_p))}{\sum_{p' \in S^o(r)} \exp(g(u_r^T v_{p'}))} \qquad (2.11)
where g(\cdot) is the logistic function and S^o(r) is the set of items rated by user r. They optimize the above
function using gradient descent, alternately fixing U and optimizing V and vice versa. In experiments,
this model’s performance was generally significantly higher than CoFiRank’s performance.
Finally, Liu and Yang [69] bypassed the difficulty of training a ranking loss by using a model-free
CF approach. Here the similarity measure between users is the Kendall rank correlation coefficient,
which measures similarity between the ranks of items rated by both users. They model ratings using
pairwise preferences. A user's preference function over two items p and p', \Psi_r(p, p'), is positive if the
user prefers p to p' (s_{rp} > s_{rp'}), negative in the contrary case, and zero if the user does not prefer one to the
other (s_{rp} = s_{rp'}). Formally,
\Psi_r(p, p') = \frac{\sum_{r' \in N^r_{p,p'}} \mathrm{sim}(r, r') \, (s_{r'p} - s_{r'p'})}{\sum_{r' \in N^r_{p,p'}} \mathrm{sim}(r, r')} \qquad (2.12)
where N^r_{p,p'} is the set of neighbours of r that have rated both p and p', and \mathrm{sim}(r, r') is the similarity
between users r and r'. Then the optimal ranking for each user is the one which maximizes the sum of the
properly ordered pairwise Ψ:
\max_{\pi} \sum_{p,p' : \pi(p) > \pi(p')} \Psi_r(p, p'). \qquad (2.13)
Although the decision variant of this optimization problem is NP-complete [28], Liu and Yang
[69] proposed two heuristics that are less computationally costly and that perform well in practice.
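The quantities in Equations 2.12 and 2.13 can be sketched as follows; the sorting-by-total-preference heuristic at the end is a simple stand-in for (not a reproduction of) the heuristics of Liu and Yang [69], and the ratings and similarities are invented:

```python
# Hedged sketch of the neighbourhood-based pairwise preference of
# Equation 2.12, followed by a greedy surrogate for Equation 2.13 that
# ranks items by their summed pairwise preference.

def psi(user, p, q, ratings, sim):
    """Similarity-weighted average of neighbours' rating gaps s_{r'p} - s_{r'q}."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or p not in r or q not in r:
            continue
        w = sim[(user, other)]
        num += w * (r[p] - r[q])
        den += w
    return num / den if den else 0.0

ratings = {
    "u1": {"a": 5, "b": 1, "c": 3},
    "u2": {"a": 4, "b": 2, "c": 5},
    "u3": {"b": 1, "a": 3},          # the user whose ranking we want
}
sim = {("u3", "u1"): 0.9, ("u3", "u2"): 0.5}  # invented similarities

items = ["a", "b", "c"]
# Greedy surrogate for Equation 2.13: rank by total pairwise preference.
score = {p: sum(psi("u3", p, q, ratings, sim) for q in items if q != p)
         for p in items}
ranking = sorted(items, key=lambda p: -score[p])
```

Since both neighbours strongly prefer a to b, \Psi_{u3}(a, b) is large and positive, and the resulting ranking places a first and b last.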
Various authors have also looked at how side information about items may affect recommendations.
Jambor and Wang [55] studied ways to incorporate item-resource constraints. For example, a company
might prefer not to aggressively recommend products for which its stock is low, or alternatively, it might
want to recommend products with higher profit margins. They proposed to deal with such additional
constraints once the ratings had been estimated by allowing the system to re-rank the output of the
original CF method based on the additional constraints. They formulated this re-ranking as a convex
optimization problem, where corresponding constraints or an additional term in the objective were
used to capture the requirements described above. They did not detail the exact technique used to
solve this optimization problem, but depending on requirements, their problem can be cast either as a
linear program or as a quadratic program and both can be solved with off-the-shelf solvers (for example,
CPLEX).
Overall, relatively little work has been done on using CF in more complex recommendation applica-
tions. The handful of papers that have considered CF with a true recommendation perspective will be
discussed in Chapter 5.
2.3 Active Preference Collection
In the previous section, we have assumed a typical supervised learning setting, where a set of labelled
instances (ratings) are provided and we are looking for the set of parameters within a class of models
that will minimize a certain loss function. Hence, we are assuming that the set of ratings is fixed, which
is unnatural in many recommendation system domains. For example: a) any new system will first have
to gather item preferences from users; b) typically, new users continually enter the system; and c) if the
system is on-line, it is probable that existing users will be providing new ratings for previously unrated
items. These events provide an opportunity for the system to gather extra preference data, or extra
information indicative of user preferences, and thus to improve its performance. Moreover, if the system
were able to specifically target valuable user preferences over items, these could help the system achieve
greater increases in performance more quickly. The process of selecting which labels or ratings to query
falls into the category of active learning. This is a learning paradigm which is often motivated in a
context where the cost of acquiring labelled data is high [11]. The aim of active learning is to select the
queries that will improve the performance of the learning model most quickly. Active learning methods
are often evaluated by comparing their performance, after each new query or after a set of queries, to
that of passive methods, which select the queries to be elicited at random.
Considering that relatively little work has been done on active learning directly aimed at recommen-
dation systems, this section will give a brief overview of active learning methods in general. Whenever
work related to recommendation systems does exist, we will highlight it, and we will also emphasize
the merits of the various techniques with respect to their applicability and possible effectiveness for
recommendation applications.
We will focus on four main query strategy frameworks: 1) uncertainty sampling, 2) query-by-
committee, 3) expected model change, and 4) expected error reduction [102]. Apart from papers related
to recommendation systems, most of the material presented in this section originates from the excellent
survey of Settles [102] and references therein. In what follows, we will denote the set of available ratings
as So and the set of unobserved, or missing, ratings as Su. The goal will then be to query ratings from
Su using a particular query strategy. Furthermore, because our focus is on the recommendation domain,
we will assume that the set of possible queries, which consist of unobserved user-item pairs, is known a
priori. At each time step, the goal is then to pick the best possible query, according to a specific query
strategy, among all possible queries. Once a user has responded to a query, the response can be used to
re-train the learning model. After that we can proceed to query selection for the next time step.
2.3.1 Uncertainty Sampling
In uncertainty sampling, the goal is to pick the query that most reduces the uncertainty of the model’s
predictions [67]. The querying method can be seen as sampling examples, hence its name. It is therefore
usually assumed that the model’s posterior probability over the elements of Su can be evaluated. Given
a posterior distribution over ratings, one can use the uncertainty of the model in several different ways.
For example, one may pick the query about which the model is least confident, where, for example,
confidence is the probability of the mode of the posterior distribution:
\max_{(r,p) \in S^u} \left(1 - \Pr(\hat{s}^u_{rp} \mid S^o, \theta)\right) \qquad (2.14)

where \hat{s}^u_{rp} denotes the mode of that posterior distribution (i.e., \hat{s}^u_{rp} = \arg\max_s \Pr(s^u_{rp} = s \mid S^o, \theta)).
Perhaps a more pleasing criterion would be to look at the whole distribution through its entropy rather
than its mode:
\max_{(r,p) \in S^u} \; -\sum_s \Pr(s^u_{rp} = s \mid S^o, \theta) \log \Pr(s^u_{rp} = s \mid S^o, \theta) \qquad (2.15)
The intuition is that entropy-based uncertainty sampling should lead to a better model of the
full posterior distribution over ratings, whereas the previous approach (or variants that consider the
margin between the top predictions) might be better at discriminating between the most likely class
and the others [102]. In practice, the performance of the different methods seems to be application-
dependant [101, 104]. For regression problems, the same intuitions as above can be applied, where the
uncertainty of the predictive distribution is simply the variance over its predictions [102].
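The two criteria of Equations 2.14 and 2.15 can be sketched as follows; the posteriors are invented numbers rather than the output of a real model:

```python
import math

# Hedged sketch of uncertainty sampling. Each candidate query (a user-item
# pair) maps to a posterior distribution over rating values; the numbers
# below are hypothetical, standing in for a probabilistic CF model.

def least_confident(posteriors):
    """Equation 2.14: pick the query whose posterior mode is least probable."""
    return max(posteriors, key=lambda q: 1.0 - max(posteriors[q]))

def max_entropy(posteriors):
    """Equation 2.15: pick the query whose posterior has the highest entropy."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)
    return max(posteriors, key=lambda q: entropy(posteriors[q]))

posteriors = {
    ("u1", "item7"): [0.90, 0.05, 0.05],  # confident prediction
    ("u1", "item9"): [0.40, 0.35, 0.25],  # close to uniform: uncertain
    ("u2", "item3"): [0.60, 0.30, 0.10],
}
q_lc = least_confident(posteriors)
q_h = max_entropy(posteriors)
```

On this toy example the two criteria agree and both select the near-uniform posterior; in general they can disagree, which is the distinction drawn in the text.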
Rubens and Sugiyama [96] proposed a related approach where, instead of querying the example with
the most uncertainty, they query the example that most reduces the expected uncertainty of all items.
They applied this idea to CF and showed that, on a particular subset of a CF data set (one of the MovieLens
data sets), this approach outperforms both a variance-based and an entropy-based strategy, both of
which perform equal to or worse than a random strategy. The authors compared the performance of the
different methods when very little information is known about each user (between 1 and 10 rated items).
Other authors have also applied these ideas to CF. Jin and Si [58] used the relative entropy between
the posterior distribution over parameters and the updated parameters given a possible query and its
response to guide their query selection. Using the same model, Harpale and Yang [45] considered that
in many domains, it is unreasonable to ask a user to rate any item because it is unlikely that a user
can easily access all items (for example, users might not be willing to watch any movie selected by the
system). The authors proposed that the system’s decision should be based on how likely an item is to
be rated, approximated by a quantity similar to the item’s popularity among users with similar profiles
as the user under elicitation.
Authors have also used uncertainty-minimization strategies with non-probabilistic models. For
example, in CF, Rish and Tesauro [90] used MMMF and defined uncertainty to be proportional
to the distance to the decision boundary, or in other words, the distance in feature space between the
point and the hyperplane in a max-margin model. The next query is the one that is closest to the
separating hyperplane. Experimentally, the authors showed that this approach outperformed other
heuristics, namely random and maximum-uncertainty query selection. One theoretical problem with this
approach is that it suffers from sampling bias [30]. In other words, because the queries are chosen with
respect to the uncertainty of the current classifier, it is possible that some regions of the input space will
be completely ignored, although all regions have some weight under the true data distribution, leading to
poor generalization performance. Sampling bias is a problem that affects most active learning methods
unless special precautions are taken [30].
In general, uncertainty sampling approaches are computationally efficient even though they require
the evaluation of all possible O(RP ) queries.
2.3.2 Query by Committee
The idea of the query-by-committee framework is to keep a committee [105], that is, a set of hypotheses
from one or multiple models, trained and consistent on So. In classification tasks, such a set of hypotheses
is called a version space. The instance to be queried is then the one on which these various hypotheses
disagree the most. Several definitions of disagreement are possible; for example, one may use the average
Kullback-Leibler divergence between the distribution over ratings of each hypothesis. One obvious
limitation of these approaches is that one must be able to generate multiple hypotheses that are all
consistent with the training set.
There have been a number of recent developments in this field that try to circumvent this limitation. In
fact, these developments have led their proponents to declare “theoretical victory” for active learning. 1
Query-by-committee can also be extended to regression problems, which are usually the domain of
CF approaches, although the notion of version space does not apply any more [102]. Instead, one can
measure the variance between the predictions of the various committee members.
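As a sketch of this regression variant, disagreement can be measured as the variance of the committee's predicted ratings; the three “models” below are hypothetical prediction tables standing in for, e.g., differently-initialized PMF models:

```python
# Hedged sketch of query-by-committee for a regression/CF setting:
# disagreement on a candidate query is the variance of the committee
# members' predicted ratings for that user-item pair.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def qbc_select(committee, candidates):
    """Query the user-item pair the committee disagrees on the most."""
    return max(candidates,
               key=lambda q: variance([model[q] for model in committee]))

# Hypothetical prediction tables (user-item pair -> predicted rating).
model_a = {("u1", "p1"): 4.0, ("u1", "p2"): 2.0}
model_b = {("u1", "p1"): 4.1, ("u1", "p2"): 3.5}
model_c = {("u1", "p1"): 3.9, ("u1", "p2"): 1.0}
query = qbc_select([model_a, model_b, model_c],
                   [("u1", "p1"), ("u1", "p2")])
```

The committee agrees closely on p1 but spreads widely on p2, so p2 is queried.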
Compared to uncertainty sampling, query-by-committee attempts to choose queries that reduce un-
certainty over both model predictions and model parameters. The computational efficiency of using
this strategy will depend on the procedure for obtaining the set of hypotheses forming the committee.
Although we are not aware of any work that has applied the query-by-committee framework in either
CF or recommendation systems, one could imagine training differently-initialized Bayesian PMF models
(or models with different hyper-parameters), or alternatively a set of PMF models with uncertainty defined
by how close a predicted rating is to the threshold between different rating values. Overall, as noted in
Section 2.2.1, ensemble methods (combinations of different methods, each of which ends up learning
a specific aspect of users and items) have been shown to do extremely well in collaborative filtering
problems [7]. It remains to be seen whether this methodology can also be useful for active learning.
1http://hunch.net/?p=1800
2.3.3 Expected Model Change
The goal in expected-model-change approaches is to find the query that will maximize the (expected)
model change. A simple criterion for monitoring model change is to evaluate the norm of the difference
of the parameter vectors (e.g., ||\theta^{s_{rp}} - \theta||), where \theta is the current model's parameter vector and \theta^{s_{rp}} is
the updated model's parameter vector after querying s_{rp} = s. Settles et al. [103] proposed to exploit
this idea using the expected gradients of the ratings with respect to the model parameters (θ):
\sum_s P(s^u_{rp} = s \mid S^o, \theta) \, ||\nabla L(S^o \cup \{s^u_{rp} = s\})|| \qquad (2.16)
where \nabla L(S^o \cup \{s^u_{rp} = s\}) is the gradient of the loss function L with respect to the model parameters,
where the loss is calculated using the observed scores and the proposed query response s_{rp} = s. Because the
learning method is typically re-trained after each query, the gradient can be approximated by ∇L(surp =
s). The chosen query is then the one that maximizes Equation 2.16. This query strategy is called
expected gradient length (EGL). It is important to note that in contrast to other approaches, EGL does
not directly reduce either model uncertainty or generalization error, but rather picks the query with the
“greatest impact on the [model’s] parameters” [102]. This strategy can be applied only to models that
can be trained using a gradient method (which is the case for PMF, but not for all other probabilistic
methods). Compared to other approaches, the efficiency of this method will be largely dependent on
the cost of evaluating the model gradients. Furthermore, because outliers will greatly affect the model
parameters, this strategy would be particularly prone to failure when using an uninformative model of
uncertainty.
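The EGL criterion of Equation 2.16 can be sketched for a one-parameter squared-loss model; the posteriors and predictions are invented for illustration:

```python
# Hedged sketch of expected gradient length (Equation 2.16) with a single
# scalar prediction per query and squared loss L = (s - prediction)^2, so
# the gradient norm reduces to |-2 (s - prediction)|.

def egl(candidates):
    """Pick the query maximizing the expected gradient norm E_s ||grad L||."""
    def expected_norm(posterior, values, prediction):
        return sum(p * abs(-2.0 * (s - prediction))
                   for p, s in zip(posterior, values))
    return max(candidates, key=lambda q: expected_norm(*candidates[q]))

values = [1, 2, 3, 4, 5]
candidates = {
    # query: (posterior over rating values, rating values, current prediction)
    ("u1", "p1"): ([0.1, 0.1, 0.6, 0.1, 0.1], values, 3.0),  # mass near prediction
    ("u1", "p2"): ([0.5, 0.0, 0.0, 0.0, 0.5], values, 3.0),  # mass far either way
}
query = egl(candidates)
```

The bimodal query p2 is selected: whatever the response, it would move the parameters much further than p1, illustrating both the appeal of EGL and its noted sensitivity to outliers.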
2.3.4 Expected Error Reduction
Perhaps the most natural strategy to use is to choose the query that will most reduce, in expectation,
the error of the final objective. Recall that the usual method for evaluating the effectiveness of all
querying strategies is to compare the model learned using active learning with one learned without (by
so-called passive learning). Therefore, it makes sense to optimize for the true objective. Settles [102]
first proposed an approach to minimize the 0/1 loss:
\sum_s \Pr_{\theta}(s^u_{rp} = s \mid S^o) \sum_{(r',p') \in S^u \setminus \{(r,p)\}} \left(1 - \max_{s'} \Pr_{\theta^{s_{rp}}}(s^u_{r'p'} = s' \mid S^o \cup \{s^u_{rp} = s\})\right). \qquad (2.17)
where \Pr_{\theta^{s_{rp}}}(\cdot) denotes the distribution over unlabelled ratings once the model with parameter vector \theta
receives the additional training example s_{rp} = s. What is surprising about the above formulation is that
it is not actually calculating the error, but rather the change in confidence over the unlabelled data. In
a sense, it is using the unlabelled data as a validation set. Instead, it would seem to make sense to look
at the expected decrease in error on a labelled validation set (that is, a labelled set disjoint from the
training set, with S^o = S^{train} \cup S^{validation}). The obvious problem with using a validation set is that
in an active learning setting, labelled data are often scarce (at least at the beginning of the elicitation),
and therefore there might not be enough data to create a meaningful validation set.
Importantly, this framework can be adapted to use any objective (loss) function. The strategy can
therefore be leveraged both in collaborative filtering with the usual loss functions and when using CF
for another task. One limitation is that because the framework requires retraining the model for each
possible value of each possible query, it can be very computationally expensive.
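The control flow of Equation 2.17 can be sketched as follows; the retrain function is a hypothetical toy returning fixed posteriors, since the point here is the double loop over queries and possible responses:

```python
# Hedged sketch of expected 0/1-error reduction (Equation 2.17). For every
# candidate query and every possible answer, the model is "retrained" and
# the remaining confidence gap is summed over the still-unlabelled pool.
# `retrain` is an invented stand-in for refitting a real CF model.

def expected_error(query, values, prior, retrain, pool):
    total = 0.0
    for p_s, s in zip(prior(query), values):
        posterior = retrain(query, s)  # model after observing s_rp = s
        total += p_s * sum(1.0 - max(posterior(q)) for q in pool if q != query)
    return total

values = [0, 1]
pool = [("u1", "p1"), ("u1", "p2"), ("u1", "p3")]

def prior(q):
    return [0.5, 0.5]  # hypothetical current posterior over responses

def retrain(q, s):
    # Hypothetical effect: observing ("u1", "p1") makes the model confident
    # everywhere; observing anything else leaves it uncertain.
    conf = 0.95 if q == ("u1", "p1") else 0.6
    return lambda other: [conf, 1.0 - conf]

best = min(pool, key=lambda q: expected_error(q, values, prior, retrain, pool))
```

The query whose answer makes the model most confident about the rest of the pool is chosen; the nested retraining loop also makes the cost noted above explicit.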
Some work has been done using this strategy for ranking applications. Arens [3] used a ranking
SVM and addressed the problem of ranking relevant papers for users. The queries consist of asking a
user whether the queried document is “definitely relevant”, “possibly relevant”, or “not relevant”. They
experimented with two querying strategies. The first queries the highest (predicted) ranked document
that is unlabelled. The second strategy queries the most uncertain documents, defined as those in the
middle of the ranking (in this view, the approach is similar to uncertainty sampling). The authors argue
that the learner will rank the worst and best documents with confidence and that it is therefore more
uncertain about documents in the middle of the ranking. Their results indicate that the first strategy,
querying the highest-ranked unlabelled documents, outperforms the second strategy (and random
querying).
EVOI
The decision-theoretically optimal way of evaluating queries is to use expected value of information
(EVOI) [50]. At the heart of EVOI is a utility function, a problem-dependent objective which assigns a
number, a utility, to possible outcomes (for example, to recommendations). The optimal query according
to EVOI is the one which maximizes expected utility. The expectation is taken with respect to the
distribution over query responses (for example, possible score values). In practice, it is common to use
the learning model’s distribution over possible responses (for example, the CF model).
We can better understand EVOI by looking at an example. Boutilier et al. [18] studied a setting where
the goal is to recommend one item at a time to a user. The utility of making such a recommendation
is defined to be the rating value of the recommended item (or its expectation according to the model
of unobserved ratings). To select a query, one evaluates each possible query and selects the one with
maximum myopic EVOI, that is, the query that yields the maximum difference in (expected) value over
all items:
\mathrm{EVOI}(s^u_{rp}, \theta^{S^o}) = \sum_s \Pr(s^u_{rp} = s \mid S^o, \theta^{S^o}) \, V(S^o \cup \{s^u_{rp} = s\}, \theta^{S^o \cup s_{rp}}) - V(S^o, \theta^{S^o}).
The value of the belief state associated with (S^o, \theta^{S^o}) is defined as:

V(S^o, \theta^{S^o}) = \max_{(r,p) \in S^u} \sum_s \Pr(s^u_{rp} = s \mid S^o, \theta^{S^o}) \, s, \qquad (2.18)
and similarly,
V(S^o \cup \{s^u_{r'p'} = s'\}, \theta^{S^o \cup s_{r'p'}}) = \max_{(r,p) \in S^u \setminus \{(r',p')\}} \sum_s \Pr(s^u_{rp} = s \mid S^o \cup \{s^u_{r'p'} = s'\}, \theta^{S^o \cup s_{r'p'}}) \, s, \qquad (2.19)
where \Pr(\cdot \mid S^o, \theta^{S^o}) is the model's posterior over scores for user-item pair (r, p) when S^o is observed. It is
important to note that EVOI can be used with any model that can infer a distribution over unobserved
ratings. In Boutilier et al. [18], the MCVQ model is used (see Section 2.2.1).
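Myopic EVOI and the belief values of Equations 2.18 and 2.19 can be sketched for a single user as follows; the posterior-update rule, including the assumed correlation between items p2 and p3, is an invented toy rather than a real CF model:

```python
# Hedged sketch of myopic EVOI for one user, where recommending an item
# has utility equal to its expected rating. Posteriors and the update rule
# are invented; per Equation 2.19, the queried item is excluded from the
# post-query maximization.

def one_hot(s, values):
    return [1.0 if v == s else 0.0 for v in values]

def expected_rating(dist, values):
    return sum(p * v for p, v in zip(dist, values))

def belief_value(posteriors, values, exclude=None):
    """Equations 2.18/2.19: maximum expected rating over candidate items."""
    return max(expected_rating(d, values)
               for q, d in posteriors.items() if q != exclude)

def myopic_evoi(query, posteriors, values, update):
    base = belief_value(posteriors, values)
    gain = 0.0
    for p_s, s in zip(posteriors[query], values):
        updated = update(posteriors, query, s)  # posterior after response s
        gain += p_s * belief_value(updated, values, exclude=query)
    return gain - base

values = [1, 2, 3, 4, 5]
posteriors = {
    "p1": [0.0, 0.0, 1.0, 0.0, 0.0],  # certain: rating 3
    "p2": [0.5, 0.0, 0.0, 0.0, 0.5],  # could be terrible or great
    "p3": [0.0, 0.5, 0.5, 0.0, 0.0],  # middling
}

def update(post, query, s):
    new = dict(post)
    new[query] = one_hot(s, values)
    if query == "p2":                 # invented correlation: p2 informs p3
        new["p3"] = one_hot(s, values)
    return new

evoi_p2 = myopic_evoi("p2", posteriors, values, update)
evoi_p1 = myopic_evoi("p1", posteriors, values, update)
```

Querying p2 has positive EVOI because its answer also resolves the correlated item p3, whereas querying the already-certain p1 changes nothing.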
Calculating myopic EVOI involves two sources of costly computations. First, given a user, the
computation of the posteriors over every possible rating value must be performed for all queries (that
is, the set of unobserved items). Equations 2.19 and 2.18 say that the value of a belief is equal to the
maximum expected unobserved rating given that belief. Boutilier et al. [18] note that, given the current
item with highest (mean) score p* and the previously computed expected scores of all other items (Equation 2.18),
one can calculate which subset of items might have their mean score increased enough to become the
item with maximal value according to Equation 2.19. Accordingly, they bound the impact that a query,
about a particular item, can have on the mean predicted score of all other items. This restricts the
number of item posteriors which must be re-computed at each step. Furthermore, these bounds are
calculated in a user-independent way and off-line. This same procedure can be applied to any learning
model, although the exact form of the bound is specific to MCVQ.
Second, the potentially large number of possible queries is also computationally problematic. Boutilier
et al. [18] suggest the use of prototype queries: a small set of queries such that any query is within some
\epsilon, in terms of EVOI value, of a prototype query. Experimentally, the authors showed that even when
using less than 40% of all queries, the system outperforms passive learning and is close to the optimal
(myopic-)EVOI performance.
Calculations of non-myopic EVOI, which involves calculating the optimal querying sequence, are
even more costly because the effect of each query with respect to all possible future queries must be
evaluated.
Other work of note has proposed a similar framework for a memory-based collaborative filtering
model [132]. Notably, the authors propose that the active learning procedure only query items that the user is
likely to have already consumed (for example, movies that the user is likely to have already watched)
based on what similar users have consumed. The rationale is that such items will be easier to score.
This method has the additional benefit of pruning the item search-space.
2.3.5 Batch Queries
Until now, we have assumed that we are picking queries one at a time in a greedy manner, considering
only the immediate myopic effect of the query on the model. This might not be optimal because it is
possible that a query’s value will reveal itself only after several more rounds of elicitation.
A related idea is that of batch queries. Instead of selecting and obtaining results of one query at
a time, we might consider finding optimal sets of queries. Batch querying may be necessary when a
system simply does not have the computational resources to find optimal queries on-line and therefore
a set of queries must be calculated off-line [102]. Alternatively, this approach can be motivated in
recommendation systems as a more natural mode of interaction for users. For example, in a domain
where rating a product involves performing an action off-line, it may be easier for users to rate
multiple products at a time. Another motivation for batch active learning is when parallel labellers are
available [44].
A simple strategy for selecting a set of N queries would simply be to pick the top-N queries greedily
according to one of the active-learning strategies already discussed. Such a strategy will typically not
be optimal because the selection process will not consider information gained by previous queries in the
batch. Using EVOI, an optimal strategy for batch querying would be to consider the value of all possible
sets of queries (of a given length). The combinatorial nature of such calculations makes this approach
impractical for most real-world problems. Several authors have instead looked at constructing optimal sets of
queries according to less general criteria; for instance, diversity has been proposed as a
reasonable criterion for batch-query selection [21, 131]. Recently, Guo [43] proposed the selection of the
instances which maximize the mutual information between labelled and unlabelled sets. The intuition
underlying this is that, for a learning method to generalize, the labelled set should be representative of
the unlabelled set.
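The naive top-N strategy described above can be sketched as follows, using posterior entropy as the single-query criterion; the posteriors are invented:

```python
import math

# Hedged sketch of the naive batch strategy: take the top-N candidates
# under a single-query criterion (here posterior entropy), ignoring what
# earlier queries in the batch would reveal -- exactly the sub-optimality
# discussed in the text.

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

def top_n_batch(posteriors, n):
    ranked = sorted(posteriors, key=lambda q: -entropy(posteriors[q]))
    return ranked[:n]

posteriors = {
    "q1": [0.34, 0.33, 0.33],  # nearly uniform: most uncertain
    "q2": [0.98, 0.01, 0.01],  # confident
    "q3": [0.50, 0.30, 0.20],
}
batch = top_n_batch(posteriors, 2)
```

A batch-aware method would, by contrast, update the model (or an approximation of it) after tentatively adding each query to the batch before selecting the next one.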
We are unaware of previously published work in CF that has looked at asking batches of queries. We
develop a greedy approach to this problem in Chapter 6.
2.3.6 Stopping Criteria
A critical aspect of any active querying method is the criterion used to stop eliciting new preferences from
users. This is a trade-off between the cost of reducing the model’s error by acquiring extra information
and the cost incurred by making recommendations given the current model. For example, imagine that
each time a user is queried, there is a probability that he will become annoyed and leave the system.
In addition, if recommendations are not satisfactory, the same user may also leave the system. This
is a good example of the trade-off between further elicitation versus exploitation of the user’s current
ratings. Adopting a decision-theoretic approach, querying should stop once the utility of all queries is
negative. This assumes a utility function which accounts for all user benefits and costs; such a function
may be difficult to model accurately. In practice, Bloodgood and Vijay-Shanker [17] proposed that if
one could define a separate validation set, active learning could simply be stopped once the error on the
validation set stabilized. However, as previously noted, setting aside a labelled set is impractical in the
typical active-learning setting where few labelled instances exist. Several authors [17, 81] have agreed
that one must determine whether or not the model has stabilized, meaning that its predictions are not
likely to change given more data. Having said this, we are not aware of a formal, yet practical, way
to decide on a stopping criterion that is applicable to a wide array of models and data sets and is not
overly conservative [17]. In fact, in his survey, Settles [102] implicitly defended the decision-theoretic
framework: “the real stopping criterion [. . . ] is based on economic or other external factors, which likely
come well before an intrinsic learner-decided threshold.” In other words, the cost of reducing the model’s
error dominates the cost of the model’s error before the model has reached its best possible performance.
2.4 Matching
It is often the case that designers and users of a recommendation system have requirements, in addition to
the user’s intrinsic item preferences, which must be considered before making recommendations. In this
work, we will be specifically interested in recommendations which require an assignment, or a matching,
between users and items. An example of a matching recommendation, which we will discuss at length, is
matching reviewers to conference submissions. We will refer to recommendation systems which provide
this type of matching as match-constrained recommendations. Match-constrained recommendations are
also prevalent in other domains involving the selection of experts, such as grant reviewing, assignment
marking, and hospital-to-resident matching [94], as well as in completely different areas such as on-line
dating [46], where users must be matched to other users according to their romantic interests, house
matching [52], and even recommending taxis to customers.
The field of matching has a long and distinguished history dating back to the classical work on the
stable marriage problem [36]. The importance of this work was recognized when one of its authors
received the 2012 Bank of Sweden prize in economic science.2 The stable marriage problem consists
2 It is often referred to as the Nobel Prize in Economics.
of finding a stable match between single women and single men. Stability implies that in the resulting
match, no woman-man pair exists in which both members prefer being matched to one another over their
partners in the matching. For obvious reasons, stability is a property often sought by practitioners. Gale and Shapley
[36] proposed a simple iterative algorithm which is guaranteed to output a stable matching. Several other
domains have caught the interest of economics and theoretical computer science researchers, among them
college admission matching [36] and resident matching (of residency candidates to hospitals, also similar
to the roommate-matching problem) [94], which generalize the stable marriage problem to many-to-one
matches. In many-to-one matches, members of one set of entities need to be matched not to one, but to
multiple entities of the other set (this is matching polygamy, in a sense). Such problems can be solved
with a generalization of the stable marriage algorithm. Further constraints on these problems, such as
couples wanting to be in the same hospital, have also been considered, but are typically harder to solve
(“couples matching” is NP-complete) [92].
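The deferred-acceptance procedure of Gale and Shapley can be sketched in a few lines of Python (a minimal sketch of the man-proposing variant; the function and variable names are ours, not from the original paper):

```python
def stable_marriage(men_prefs, women_prefs):
    """Gale-Shapley deferred acceptance. Preferences are dicts mapping each
    person to a list of the other side's members, most preferred first.
    Returns a dict {woman: man} describing a stable matching."""
    # rank[w][m] = position of m in w's preference list (lower = preferred)
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    free = list(men_prefs)                    # men not yet engaged
    next_choice = {m: 0 for m in men_prefs}   # index of the next woman m proposes to
    engaged = {}                              # woman -> man
    while free:
        m = free.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if w not in engaged:
            engaged[w] = m                    # w was single: tentatively accept
        elif rank[w][m] < rank[w][engaged[w]]:
            free.append(engaged[w])           # w prefers m: her old partner is free again
            engaged[w] = m
        else:
            free.append(m)                    # w rejects m; he proposes to his next choice
    return engaged
```

The man-proposing variant always terminates with a stable matching that is optimal for the proposing side, which is why which side proposes matters in applied settings such as resident matching.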
An important dichotomy is that between two-sided and single-sided matching domains. In single-
sided matching, entities to be recommended do not express preferences, for example matching in housing
markets [52]. Note that in single-sided domains, the notion of stability does not apply, although similar
properties are captured by the notion of the matching core.
A central concern in this line of work has been truthful elicitation of user preferences. Indeed,
to ensure the integrity of a matching system, it is important that users
have incentives to be truthful. This research, under the name mechanism design, seeks protocols, in-
cluding matching protocols, which ensure that a user cannot obtain a preferred match by not reporting
his preferences truthfully [95].
In our work, we focus exclusively on one-sided matching problems. Furthermore, we do not con-
sider strategic issues, such as the ones studied by mechanism design. We initiate our discussion of
match-constrained recommendations by introducing a running example in Chapter 3. We further study
matching formulations and constraints in Chapter 5.
Chapter 3
Paper-to-Reviewer Matching
Before detailing the research contributions of this thesis, we will discuss a real-life example of a match-
constrained recommendation system. This example will be used as an illustration throughout our work.
Furthermore, this particular example has provided some of the initial motivation behind the work that
has led to this thesis.
We introduce the paper-to-reviewer matching problem. This problem must routinely be solved by
conference organizers to determine a conference program. The way typical conferences operate is that
once the paper-submission deadline has passed, conference organizers initiate the reviewing process for
the submissions. Concretely, organizers have to assign each submission to a set of reviewers. Reviewers
are typically chosen from a preselected pool. Once assigned their papers, reviewers will have a fixed
amount of time to provide paper reviews, which are informed opinions from domain experts describing
the merits of each submission. Because reviewers’ time is limited, each reviewer has only enough resources
to review at most a fixed number of papers.
The paper-to-reviewer-assignment process aims to find the most expert reviewers for each submission.
Obtaining high-quality reviews is of great importance to the quality and reputation of a conference, and in
a certain sense, to shape the direction of a field. This assignment process can be seen as a recommendation
system with reviewer expertise substituting for the (more typical) factor of user preference. Furthermore,
constraints on the number of submissions that each reviewer may process and on the number of reviewers
per submission give rise to a match-constrained recommendation problem.
We can frame the paper-to-reviewer matching problem in terms of our three-stage framework pre-
sented in Figure 1.1. First, expertise about submissions, with possibly other expertise-related informa-
tion, is elicited from reviewers. Second, a learning model can be used to predict the missing paper-
reviewer expertise data. Finally, given the specific constraints of each conference, the optimization
procedure consists of a matching procedure which assigns papers to reviewers according to their stated
and predicted expertise. Many research questions related to how learning models can be optimized for
this framework emerge when thinking of the paper-matching problem in this context. The rest of this
thesis, and specifically Chapter 5, discusses some of these questions.
We have built a software system called the Toronto Paper Matching System (TPMS), which provides
automated assistance to conference organizers in the process of assigning their submissions to their
reviewers. In this chapter, we discuss the intricacies of the practical paper-to-reviewer assignment
problem through a description of TPMS.
3.1 Paper Matching System
Assigning papers to reviewers is not an easy task. Conference organizers typically need to assign reviewers
within a couple of days of the conference submission deadline. Furthermore, conferences in many fields
now routinely receive more than one thousand papers, which have to be assigned to reviewers from a
pool that often consists of hundreds of reviewers. The assignment of each paper to a set of suitable
reviewers requires knowledge about both the topics studied in the paper and reviewers’ expertise. For
a typical conference, it will therefore be beyond the ability of a single person, for example, the program
chair, to assign all submissions to reviewers. Decentralized mechanisms are also problematic because
global constraints, such as reviewer load, absence of conflicts of interest, and the need for every paper
to be reviewed by a certain number of reviewers must be satisfied. The main motivation for automating
the reviewer-assignment process is to reduce the time required to manually assign submitted papers
to reviewers.
A second motivation for an automated reviewer-assignment system concerns the ability to find suit-
able reviewers for papers, to expand the reviewer pool, and to overcome research cliques. Particularly in
rapidly expanding fields such as machine learning, it is of increasing importance to include new reviewers
in the review process. Automated systems offer the ability to learn about new reviewers as well as the
latest research topics.
In practice, conferences often adopt a hybrid approach in which a reviewer’s interest with respect to a
paper is first independently assessed, either by allowing reviewers to bid on submissions, or, for example,
by letting members of the senior program committee provide their assessments of reviewer expertise.
Using either of these assessments, the problem of assigning reviewers to submissions can then be framed
and solved as an optimization problem. Such a solution still has important limitations. Reviewer bidding
requires reviewers to assess their preferences over the list of all papers. Failing to do so, for example if
reviewers only examine papers that contain specific terms matching their interest, is likely to decrease
the quality of the final assignments. On the other hand, asking the senior program committee to select
reviewers still imposes a major time burden.
Faced with these limitations, when Richard Zemel was the co-program chair of NIPS 2010, he decided
to build a more automated way of assigning reviewers to submissions. The resulting system that we
have developed aims to provide a proper evaluation of reviewer expertise to yield good reviewer assign-
ments while minimizing the time burden on conference program committees (reviewers, area chairs, and
program chairs). Since then, the system has gained adoption for both machine learning and computer
vision conferences and has now been used (repeatedly) by NIPS, ICML, UAI, AISTATS, CVPR, ICCV,
ECCV, ECML/PKDD, ACML, and ICVGIP.
3.1.1 Overview of the System Framework
In this section, we first describe the functional architecture of the system, including how several confer-
ences have used it. We then briefly describe the system’s software architecture.
Our aim is to determine reviewers’ expertise. Specifically, we are interested in evaluating the expertise
of every reviewer with respect to each submission. Given these assessments, it is then straightforward
to compute optimal assignments (see Chapter 5 for a detailed discussion of matching procedures). We
emphasize that reviewer expertise, rather than reviewer interest, is what we aim to evaluate. This is in contrast
with the more typical approaches that assess reviewer interest, for example through bidding.
[Figure omitted: a flow diagram in which reviewer papers and submitted papers produce initial scores; initial scores guide the elicitation of scores; initial and elicited scores yield final scores, which feed either a matching step (final assignments) or a sorting step (ranked lists).]
Figure 3.1: A conference’s typical workflow.
The system’s workflow operates in synergy with the conference submission procedures. Specifically,
for conference organizers, the busiest time is typically right after the paper submission deadline, because
at this time, the organizers are responsible for all submissions, and several different tasks, including the
assignment to reviewers, must be completed within tight time constraints. For TPMS to be maximally
helpful, assessments of reviewer expertise could be computed ahead of the submission deadline. With
this in mind, we note that an academic’s expertise is naturally reflected through his or her work and is
most easily assessed by examining his or her published papers. Hence, we use a set of published papers
for each reviewer participating in a conference. Throughout our work, we have used the raw text of these
papers. It stands to reason that other features of a paper could be modelled: for example, one could use
citation or co-authorship graphs built from each paper’s bibliography and co-authors respectively.
Reviewers’ published papers have proven to be very useful in assessing expertise. However, we have
found that we can further boost performance using another source of data: each reviewer’s self-assessed
expertise about the submissions. We will refer to such assessments as scores. We differentiate scores from
more traditional bids: scores represent expertise rather than interest. We use assessed scores to predict
missing scores and then use the full reviewer-paper score matrix to determine assignments. Hence, a
reviewer may be assigned to a paper for which he did not provide a score.
To summarize, although each conference has its own specific workflow, it usually involves the sequence
of steps shown in Figure 3.1. First, we collect reviewers’ previous publications (note that this can be
done before the conference’s paper submission deadline). Using these publications, we build reviewer
profiles which can be used to estimate each reviewer’s expertise. These initial scores can then be used to
produce paper-reviewer assignments or to refine our assessment of expertise by guiding a score-elicitation
procedure (e.g., using active learning to query scores from reviewers). Elicited scores, in combination
with our initial unsupervised expertise assessments, are then used to predict the final scores. Final scores
can then be used in various ways by the conference organizers (for example, to create per-paper reviewer
rankings that will be vetted by the senior program committee, or directly in the matching procedure).
Below, we describe the high-level workflow of several conferences that have used TPMS.
NIPS 2010: For this conference, the focus was mostly on modelling the expertise of the area chairs,
the members of the senior program committee. We were able to evaluate the area chairs’ expertise
initially using their previously published papers (32 papers per area chair on average). We
then used these initial scores to perform elicitation. The exact process by which we picked which reviewer-
paper pairs to elicit is described in the next section. We performed the elicitation in two rounds. In the
first round, we selected about two-thirds of the papers, those about which our system was most
confident (confidence was estimated as the inverse entropy of the per-paper score distribution across area chairs). Using these
elicited scores, we were then able to run a supervised learning model and proceed to elicit information
about the remaining one-third of the papers. We then re-trained a supervised learning method using
all elicited scores. The results were used to assign a set of papers to each area chair. For the reviewers,
we also calculated initial scores from their previously published papers and used those initial scores to
perform elicitation. Each reviewer was shown a list of approximately eight papers on which they could
express their expertise. The initial and elicited scores were then used to evaluate the suitabilities of
reviewers for papers. Each area chair was then provided a ranked list of (suggested) reviewers for each
of his assigned papers.
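The confidence heuristic used in the first round can be sketched as follows (a sketch under our own assumptions: we take "inverse entropy" to mean the negated entropy of the normalized score distribution, with scores assumed non-negative and not all zero; the function names are ours):

```python
import math

def confidence(scores, eps=1e-12):
    """Confidence of the system about one paper: the negated entropy of the
    normalized score distribution across area chairs. A distribution peaked
    on a few area chairs has low entropy, hence high confidence."""
    total = sum(scores)
    probs = [s / total for s in scores]
    entropy = -sum(p * math.log(p + eps) for p in probs)
    return -entropy  # higher = more confident

def most_confident(papers_scores, fraction=2/3):
    """Keep the given fraction of papers the system is most confident about.
    papers_scores maps each paper to its list of per-area-chair scores."""
    ranked = sorted(papers_scores,
                    key=lambda p: confidence(papers_scores[p]), reverse=True)
    return ranked[:int(len(ranked) * fraction)]
```

For a two-round scheme such as the one above, the remaining (low-confidence) papers are simply those not returned by `most_confident`.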
ICCV-2013: ICCV used author suggestions, by which each author could suggest up to five area
chairs that could review a paper, to restrict area-chair score elicitation. The elicited scores were used to
assign area chairs. Area chairs then suggested reviewers for each of their papers. TPMS initial scores,
calculated from reviewers’ previously published papers, were used to present a ranked list of candidate
reviewers to each area chair.
ICML 2012: Both area chairs and reviewers could assess their expertise for all papers. To help in
this task, TPMS initial scores, again calculated from reviewers’ and area chairs’ previous publications,
were used to generate a personalized ranked list of candidate papers which area chairs and reviewers
could use to quickly identify relevant papers. TPMS then used recorded scores for both reviewers and
area chairs in a supervised learning model. Predicted scores were then used to assign area chairs and
one reviewer per paper (area chairs were able to assign the other two reviewers).1
3.1.2 Active Expertise Elicitation
As mentioned in the previous section, initial scores can be used to guide active elicitation of reviewer
expertise. The direction that we have taken is to run the matching program using the initial scores. In
other words, we use the initial scores to find an (optimal) assignment of papers to reviewers. Each
reviewer is then queried about his expertise for all papers assigned to him. Intuitively, these queries are informative
because according to our current scores, reviewers are queried about papers that they would have to
review (a strong negative assessment of a paper is therefore very informative). By adapting the matching
constraints, conference organizers can tailor the number of scores elicited per user (in practice, it can be
useful to query reviewers about more papers than is warranted by the final assignment). We formally
explore these ideas in Chapter 5. Note that our elicited scores will necessarily be strongly biased by
the matching constraints used in the elicitation procedure. To relate to our discussion in Section 2.2.1:
scores are not missing at random. In practice, this does not appear to be a problem for this application
(that is, assigning papers to a small number of expert reviewers). There are also empirical procedures,
such as pooled relevance judgements, which combine the scores of different models while accounting for the
variance between models, and which have been used to reduce the elicitation bias [79]. It is possible that
such a method could be adapted to be effective with our matching procedure.
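The elicit-by-matching idea can be illustrated with a deliberately simplified sketch. The deployed system solves a proper matching optimization (Chapter 5); here a greedy assignment stands in for it, and the function and parameter names are ours:

```python
def elicitation_queries(scores, reviewer_load, reviews_per_paper):
    """Pick, for each reviewer, the papers he would be assigned under the
    current (initial) scores, and query him about exactly those.

    scores: dict mapping (reviewer, paper) -> initial score.
    Greedy stand-in for the matching optimization: repeatedly take the
    highest remaining score while respecting each reviewer's load and each
    paper's required number of reviews."""
    queries = {}    # reviewer -> papers to query him about
    coverage = {}   # paper -> reviewers tentatively assigned to it
    for (r, p), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if (len(queries.get(r, [])) < reviewer_load
                and len(coverage.get(p, [])) < reviews_per_paper):
            queries.setdefault(r, []).append(p)
            coverage.setdefault(p, []).append(r)
    return queries
```

Raising `reviewer_load` above the true per-reviewer quota reproduces the practice, mentioned above, of querying reviewers about more papers than the final assignment warrants.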
3.1.3 Software Architecture
For the NIPS-10 conference, the system was initially made up of a set of MATLAB routines that would
operate on conference data. The data were exported (and re-imported) from the conference Web site
hosted on Microsoft’s Conference Management Toolkit (CMT).2 This solution had limitations because
1 The full ICML 2012 process has been detailed by the conference program chairs: http://hunch.net/?p=2407
2 http://cmt.research.microsoft.com/cmt/
it imposed a high cost on conference organizers that wanted to use it. Since then, and encouraged by the
ICML 2012 organizers, we have developed an on-line version of the system which interfaces with CMT
and can be used by conference organizers through CMT (see Figure 3.2).
The system has two primary software features. One is to act as an archive by storing reviewers’
previously published papers. We refer to these papers as a reviewer’s archive or library. To populate
their archives, reviewers can register with and log in to the system through a Web interface. Reviewers
can then provide URLs pointing to their publications. The system automatically crawls the URLs to
find reviewers’ publications in PDF format. There is also functionality that enables reviewers to
upload papers from their local computers. Conference organizers can also populate reviewers’ archives
on their behalf. Another option enables our system to crawl a reviewer’s Google Scholar profile.3 The
ubiquity of the PDF format has made it the system’s accepted format. On the programming side, the
interface is entirely built using the Python-based Django web framework4 (except for the crawler, which
is written in PHP and relies heavily on the wget utility5).
The second main software feature is one that permits communication with Microsoft’s CMT. Its
main purpose is to enable our system to access some of the CMT data as well as to enable organizers to
call our system’s functions through CMT. The basic workflow proceeds as follows: organizers, through
CMT, send TPMS the conference submissions; then they can send us a score request which queries
TPMS for reviewer-paper scores for a specified set of reviewers and papers. This request contains the
paper and reviewer identification for all the scores that should be returned. In addition, the request can
contain elicited scores (bids in CMT terminology). After receiving these requests, our system processes
the data, which may include PDF submissions and reviewer publications, and computes scores according
to a particular model. TPMS scores can then be retrieved through CMT by the conference organizers.
Technically speaking, our system can be seen as a paper repository where submissions and meta-
data, both originating from CMT, can be deposited. Accordingly, the communication protocol used is
SWORD, which is based on the Atom Publishing Protocol (APP) version 1.0.6 SWORD defines a format to be
used on top of HTTP. The exact messages enable CMT to: a) deposit documents (submissions) to the
system; b) send information to the system about reviewers, such as their names and publication URLs;
c) send reviewers’ CMT bids to the system. On our side, the SWORD API was developed in Python and
is based on a simple SWORD server implementation.7 The SWORD API interfaces with a computations
module written in a mixture of Python and MATLAB (we also use Vowpal Wabbit8 for training some
of the learning models).
Note that although we interface with CMT, TPMS runs completely independently (and communi-
cates with CMT through the network); therefore, other conference management frameworks could easily
interact with TPMS. Furthermore, CMT has its own matching system which can be used to determine
reviewer assignments from scores. CMT’s matching program can be used to combine several pieces of
information such as TPMS scores, reviewer suggestions, and subject-area scores. Hence, we typically
return scores to CMT, and conference organizers then run CMT’s matching system to obtain a set of
(final) assignments.
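The Atom-based deposits described above can be illustrated with a minimal sketch of an entry for one submission. The element names follow the Atom specification, but the particular fields CMT and TPMS exchange are not reproduced here, so treat the payload (and the helper's name) as illustrative only:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def deposit_entry(title, author, pdf_url):
    """Build an Atom entry describing one submission deposit: a title, an
    author name, and a pointer to the PDF content (hypothetical fields)."""
    ET.register_namespace("", ATOM)  # serialize with Atom as the default namespace
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    author_el = ET.SubElement(entry, "{%s}author" % ATOM)
    ET.SubElement(author_el, "{%s}name" % ATOM).text = author
    ET.SubElement(entry, "{%s}content" % ATOM,
                  {"src": pdf_url, "type": "application/pdf"})
    return ET.tostring(entry, encoding="unicode")
```

In a SWORD exchange, an entry of this kind would be POSTed over HTTP to the repository's deposit endpoint; the endpoint URL and authentication details are omitted here.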
3 http://scholar.google.com/
4 https://www.djangoproject.com/
5 http://www.gnu.org/software/wget/
6 The same protocol with similar messages is used by http://arxiv.org to enable users to make programmatic submissions of papers.
7 https://github.com/swordapp/Simple-Sword-Server
8 http://hunch.net/~vw/
[Figure omitted: TPMS, fed by a paper-collection web interface used by reviewers and by its score models, communicates with CMT, which is used by conference organizers.]
Figure 3.2: High-level software architecture of the system.
3.2 Learning and Testing the Model
As mentioned in Section 3.1.1, at different stages in a conference workflow, we may have access to
different types of data. We use models tailored to the specifics of each situation. We first describe
models that can be used to evaluate reviewer expertise using the reviewers’ archive and the submitted
papers. Then we describe supervised models that have access to ground-truth expertise scores.
We remind the reader about some of our notation, which now takes on specific meaning for the
problem of paper-to-reviewer matching. An individual submission is denoted as p while P is the set
of all submitted papers. Similarly, single reviewers are denoted as r and the set of all reviewers is
denoted as R. We introduce a reviewer’s archive (a reviewer’s previously published papers) encoded
using a bag-of-words representation and denoted by the vector w^a_r. Note that we will assume that a
reviewer’s papers are concatenated into a single document to create that reviewer’s archive. Similarly,
a submission’s content is denoted as w^d_p. Finally, f(·) and g(·) represent functions which map papers,
submitted or archived, respectively, to a set of features. Features can be word counts associated with a
bag-of-words representation, in which case f(w^d_p) and g(w^a_r) are identity functions, or possibly
higher-level features such as those learned from a topic model [16].
3.2.1 Initial Score Models
Two different models were used to predict initial scores.
Language Model (LM): This model predicts a reviewer’s score as the dot product between a reviewer’s
archive representation and a submission:
s_rp = g(w^a_r)^T f(w^d_p)    (3.1)
There are various possible incarnations of this model. The one that we have routinely used is due
to Mimno and McCallum [79] and consists of using the word-count representation of the submissions
(that is, each submission is encoded as a vector in which the value of an entry corresponds to the number
of times that the word associated with that entry appears in the submission). For the archive, we use
the normalized word count for each word appearing in the reviewer’s published work. By assuming
conditional independence between words given a reviewer and working in the log domain, the above is
equivalent to:
s_rp = Σ_{i∈p} log f(w^a_{ri})    (3.2)
In practice, we Dirichlet smooth [133] the reviewer’s normalized word counts to better deal with rare
words:

f(w^a_{ri}) = ( N_{w^a_r} / (N_{w^a_r} + µ) ) · ( w^a_{ri} / N_{w^a_r} ) + ( µ / (N_{w^a_r} + µ) ) · ( N_{w_i} / N )    (3.3)

where N_{w^a_r} is the total number of words in reviewer r’s archive, N is the total number of words in the
corpus, w^a_{ri} and N_{w_i} are the number of occurrences of word i in r’s archive and in the corpus respectively,
and µ is a smoothing parameter.
Because papers have different lengths, scores will be uncalibrated. This means that shorter papers
will receive higher scores than longer papers. Depending on how scores are used, this may not be
problematic. For example, this will not matter if one wishes to obtain ranked lists of reviewers for each
paper. We have obtained good matching results with such a model. However, normalizing each score
by the length of its paper has also turned out to be an acceptable solution. Finally, in the language
model described above, the dot product of the archive and submission representation is used to measure
similarity; other metrics could also be used, such as KL-divergence.
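The word-LM of Equations 3.2 and 3.3 can be sketched as follows (a sketch, not the production code; the default value of µ and the exact pre-processing are our own assumptions):

```python
import math
from collections import Counter

def lm_score(submission_words, archive_words, corpus_counts, corpus_size, mu=1000.0):
    """Word-LM score of a reviewer for a submission: the sum over word
    occurrences in the submission of the log Dirichlet-smoothed probability
    of that word under the reviewer's archive (Equations 3.2 and 3.3)."""
    archive = Counter(archive_words)
    n_archive = sum(archive.values())
    score = 0.0
    for w in submission_words:  # one term per word *occurrence* in the submission
        lam = n_archive / (n_archive + mu)       # N_{w^a_r} / (N_{w^a_r} + mu)
        p_archive = archive[w] / n_archive if n_archive else 0.0
        p_corpus = corpus_counts.get(w, 0) / corpus_size
        score += math.log(lam * p_archive + (1 - lam) * p_corpus)
    return score
```

Because each term is a (negative) log probability, longer submissions accumulate lower scores, which is the calibration issue discussed above; dividing the score by the submission length is the normalization mentioned there.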
Latent Dirichlet Allocation (LDA) [16]: LDA is an unsupervised probabilistic method used to model
documents. Specifically, we can use the topic proportions found by LDA to represent documents. Equa-
tion 3.1 can then be used naturally to calculate expertise scores from the LDA representations of archives
and submissions. When using the LM with LDA representations, we refer to it as topic-LM, in contrast
with word-LM, which uses the bag-of-words representation.
3.2.2 Supervised Score-Prediction Models
Once elicited scores are available, supervised regression methods can be used. The problem can be seen
as one of collaborative filtering. Furthermore, because both reviewer (user) and paper (item) content
exist, the problem can also be modelled using a hybrid approach.
Probabilistic Matrix Factorization: PMF was discussed in Section 2.2.1. For easier comparison with
other models in this section, we can express the score generative model as s_rp = γ_r^T ω_p (where γ_r and
ω_p correspond to u_r and v_p, respectively, in our earlier notation).
Because PMF does not use any information about either papers or reviewers, its performance suffers
in the cold-start regime. Nonetheless, it remains an interesting baseline for comparison.
Linear Regression (LR): On the other end of the spectrum are models which use document content,
but do not share any parameters across users and items. The simplest regression model learns a separate
model for each reviewer using submissions as features:
s_rp = γ_r^T f(w^d_p)    (3.4)

where γ_r denotes user-specific parameters. This method has been shown to work well in practice,
particularly if many scores have been elicited from each reviewer.
A hybrid model can be of particular value in this domain. One issue with the conference domain
is that it is typical for some reviewers to have very few or even no observed scores. It may then be
beneficial to enable parameter sharing between users and papers. Furthermore, re-using information
from each reviewer’s archive may also be beneficial. One method of sharing parameters in a regression
model was proposed by John Langford, co-program chair of the ICML 2012 conference:
s_rp = b + b_r + (γ + γ_r)^T f(w^d_p) + (ω + ω_r)^T g(w^a_r)    (3.5)
where b is a global bias and γ and ω are parameters shared across reviewers and papers, which encode
weights over features of submissions and archives respectively. The ω shared parameter enables the
model to make predictions even for users that have no observed scores. In other words, the shared ω
enables the model to calibrate a reviewer’s archive using information from the other reviewers’ scores.
b_r, γ_r, and ω_r are parameters specific to each reviewer. For ICML-12, f(w^d_p) was paper p’s
word counts, while g(w^a_r) was the normalized word count of reviewer r’s archive (similarly to LM). In
practice, for each reviewer-paper instance, g(w^a_r) is multiplied by the paper’s word-occurrence vector.
For that conference and afterwards, that model was trained in an on-line fashion using Vowpal Wabbit
with a squared loss and L2 regularization. In practice, because certain reviewers have few or no observed
scores, one has to be careful to weight the regularizers of the various parameters properly so that the
shared parameters are learned at the expense of the individual parameters. To determine a good setting
of the hyper-parameters we typically use a validation set (and search in hyper-parameter space using, for
example, grid search). To ensure the good performance of the model across all reviewers, each reviewer
contributes equally to the validation set.
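The prediction of Equation 3.5 can be sketched directly (a minimal sketch of the scoring function only; in the deployed system the parameters are trained with Vowpal Wabbit rather than with hand-written code, and the names below are ours):

```python
def hybrid_score(b, b_r, gamma, gamma_r, omega, omega_r, f_paper, g_archive):
    """Predicted score of Equation 3.5: a global bias b, a reviewer bias b_r,
    a shared + reviewer-specific weighting (gamma + gamma_r) of the
    submission's features, and a shared + reviewer-specific weighting
    (omega + omega_r) of the reviewer's archive features."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    combined = lambda shared, ind: [s + i for s, i in zip(shared, ind)]
    return (b + b_r
            + dot(combined(gamma, gamma_r), f_paper)
            + dot(combined(omega, omega_r), g_archive))
```

For a reviewer with no observed scores, b_r, γ_r, and ω_r can be left at zero and the prediction falls back on the shared parameters, which is the cold-start behaviour described above.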
3.2.3 Evaluation
A machine-learning model is typically assessed by evaluating its performance on the task at hand. For
example, we can evaluate how well a model performs in the score-prediction task by comparing the
predicted scores to the ground-truth scores. The task of most interest here is that of finding good paper-
reviewer assignments. Ideally, we would be able to compare the quality of our assignments to some
gold-standard assignment. Such an assignment could then be used to test both different score-prediction
models and different matching formulations. Unfortunately, ground-truth assignments are unavailable.
Moreover, even humans would have difficulty finding optimal assignments, and hence we cannot count
on their future availability.
We must then explore different metrics to test the performance of the overall system, including
methods for comparing score-prediction models as well as matching performance. In practice, in-vivo
experiments can provide a good way to measure the quality of TPMS scores and ultimately the usefulness
of the system. ICML 2012’s program chairs experimented with different initial scoring methods using
a special interface which showed one of three groups of ranked candidate papers to reviewers.9 The
experiment had some biases: the papers of the three groups were ranked using TPMS scores. The poll,
which asked after the fact whether reviewers had found the ranked-list interface useful, showed that
reviewers who had used the list based on word-LM were slightly more likely to have preferred the list
over the regular CMT interface (the differences were likely not statistically significant).
We will defer comprehensive comparisons of the different score-prediction methods to Chapter 4,
where we will introduce a novel model and make comparisons with it. Likewise, in Chapter 5, we
will introduce different matching objectives and constraints and discuss relevant matching experiments.
Below, we offer some qualitative comparisons of initial score models.
Datasets
Through its operation, the system has gathered interesting datasets containing reviewers’ preferences
over submissions as well as reviewers’ published papers and the submitted papers. We have assembled a few datasets from these
9 http://hunch.net/?p=2407
[Figure omitted: histograms of the number of scores at each score value, for (a) NIPS-10 and (b) ICML-12.]
Figure 3.3: Histograms of score values.
collected data. Below, we introduce certain datasets that are used throughout this thesis for empirical
evaluations. The datasets bear the name of the conference from which their data were assembled.
NIPS-10: This dataset consists of 1251 papers submitted to the NIPS 2010 conference. The set of
reviewers consists of the conference’s 48 area chairs. The submission and archive vocabulary consists
of 22,535 words. User-item preferences were integer scores in the 0 to 3 range. The histogram of score
values is given in Figure 3.3. Suitabilities on a subset of papers were elicited from reviewers using a rather
involved two-stage process. This process utilized the language model (LM) to estimate the suitability
of each reviewer for each paper, and then queried each reviewer on the papers on which his estimated
suitability was maximal. The output of the first round was fine-tuned using a combination of a hybrid
discriminative/generative RBM [65] with replicated softmax input units [97] trained on the initial scores,
and LM, which then determined the second round of queries. In total, each reviewer provided scores on
an average of 143 queried papers (excluding one extreme outlier), and each paper received an average
of 3.3 suitability assessments (with a std. dev. of 1.3). The mean suitability score was 1.1376 (std. dev.
1.1).
With regard to our earlier discussion of missing-at-random data (Section 2.2.1), we note that since the
querying process was biased towards asking about pairs with high predicted suitability, the unobserved
scores are not missing at random, but rather tended toward pairs with low suitability. We do not
distinguish the data acquired in the two phases of elicitation; both took place within a short time frame,
so we assume suitabilities for any one reviewer are stable.
ICML-12: This dataset consists of 857 papers and 431 reviewers from the ICML 2012 conference.
The submission and archive vocabulary consists of 21,409 words. User-item preferences were integer
scores in the 0 to 3 range. The histogram of score values is given in Figure 3.3. The elicitation process
was less constrained than for NIPS-10. Specifically, reviewers could assess their expertise for any paper
although they were also shown a selection of suggested papers produced either by the LM or based on
subject-area similarity.
The original vocabulary of these datasets was very slightly pre-processed. Specifically, we removed
stop words from a predefined list. Moreover, we only kept words that appeared in at least five documents,
including at least two different submissions. We performed further experiments using word stems, which
did not make a significant difference. We therefore hypothesize that words that are useful in determining
a user's expertise are typically technical words which are not inflected into many different forms.
          NIPS-10   ICML-12
NDCG@5     0.926     0.867
NDCG@10    0.936     0.884

Table 3.1: Evaluating the similarity of the top-ranked reviewers for word-LM versus topic-LM on the NIPS-10 and ICML-12 datasets.
Initial score quality
We examined the quality of the initial scores, those estimated solely by comparing the archive and the
submissions, without access to elicited scores. We will compare the performance of a model which uses
the archive and submission representations in word space to one which uses these representations in topic
space. The method that operates in word space is the language model as described by Equation 3.2.
For purposes of comparison, we further normalized these scores using the length of each submission. We
refer to this method as word-LM. To learn topics, we used LDA to learn 30 topics using the content of
both the archives and submissions. For the archive, we learned topics for each reviewer’s paper and then
averaged a reviewer’s papers in topic space. This version of the language model is denoted as topic-LM.
We first compared the two methods with each other by comparing the top-ranked reviewers for each
paper according to word-LM and topic-LM. Table 3.1 reports the average similarity of the top-5 and
top-10 reviewers using NDCG, where word-LM's scores were used as relevance values for topic-LM's
predicted ranking. The high NDCG values indicate that both methods generally agree about the most
expert reviewers.
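As an illustration of the comparison metric, here is a minimal NDCG@k sketch in which one model's scores (e.g., word-LM's) serve as relevance values for the other model's ranking; the function and data layout are assumptions, not the thesis's exact evaluation code:

```python
import math

def ndcg_at_k(reference_scores, predicted_ranking, k):
    """NDCG@k of predicted_ranking, using reference_scores as relevance.

    reference_scores: reviewer id -> relevance (e.g., a word-LM score).
    predicted_ranking: reviewer ids ordered by the model under evaluation
    (e.g., topic-LM), best first.
    """
    # Discounted cumulative gain of the predicted ordering.
    dcg = sum(reference_scores.get(r, 0.0) / math.log2(i + 2)
              for i, r in enumerate(predicted_ranking[:k]))
    # Ideal DCG: the best k relevance values in the best order.
    ideal = sorted(reference_scores.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

When the two models agree on the top-k reviewers, the value is 1; disagreement pushes it toward 0, which is why the high values in Table 3.1 indicate agreement.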
We can obtain a better appreciation of the scores of each model by plotting the model’s (top) scores
for each paper. Each data point on Figure 3.4 shows the score of one of the top-40 reviewers, on the
y-axis, for a particular paper, on the x-axis. For visualization purposes, points corresponding to the same
reviewer ranking across papers are connected.10 One can see that word-LM scores all fall within a small
range. In other words, given a paper, word-LM evaluates all reviewers to have very similar expertise.
We have usually not found this to be a problem when using word-LM scores to assign reviewers to
papers. In one exceptional case we did get a poor assignment: the program committee of the conference
was very small (9 members) and the breadth of the submissions was large. From Figure 3.4,
we see that topic-LM does not have this problem as it better separates the reviewers, and often finds a
few top reviewers to have a certain (expertise) margin over the others, which seems sensible. One possible
explanation for these discrepancies is that working in topic space removes some of the noise present in
word space. It has been suggested that using a topic model “fuse[s] weak cues from each individual
document word into strong cues for the document as a whole”.11 Specifically, elements like the specific
words, and to a certain extent the writing style, used by individual authors may be abstracted away by
moving to topic space. Topic-LM may therefore provide a better evaluation of reviewer expertise than
can word-LM.
We would also like to compare word-LM and topic-LM on matching results. However, such results
would be biased toward word-LM because, in our datasets, this method was used to produce initial scores
which guided the elicitation of scores from reviewers (we have validated experimentally that word-
LM slightly outperforms topic-LM using this experimental procedure). Using the same experimental
procedure, word-LM also outperforms matching based on CMT subject areas.
10This methodology was suggested and first experimented with by Bill Triggs, the program chair for the ICGVIP 2012 conference.
11From personal communication with Bill Triggs.
[Four score plots: (a) NIPS-10 using topic-LM; (b) ICML-12 using topic-LM; (c) NIPS-10 using word-LM; (d) ICML-12 using word-LM.]
Figure 3.4: Score of the top-40 reviewers (y-axis) for 20 randomly selected submitted papers (x-axis).
3.3 Related Work
We are aware that other conferences, such as SIGGRAPH, KDD, and EMNLP, have previously used
a certain level of automation for the task of assigning papers to reviewers. The only conference
management system that we are aware of that has explored machine learning techniques for paper-reviewer
assignments is MyReview,12 and some of their efforts are detailed in [89].
The reviewer-to-paper matching problem has also received attention in the academic literature. Conry
et al. [29] propose combining several sources of information including reviewer co-authorship informa-
tion, reviewer and submission self-selected subject areas as well as submission representations using the
submission’s abstract. Their proposed collaborative filtering model linearly combines regressors on these
different sources of information in a similar way as we do in Equation 3.5. Using elicited and predicted
scores, matching is then performed, as a second step, using a standard matching formulation [122]. We
note that in this work no differentiation is made between reviewer expertise and reviewer interest.
In TPMS we build representations of submissions using the full text of submissions. Other authors
12http://myreview.lri.fr/
have exploited the information contained in the submissions’ references instead. Specifically, Rodriguez
and Bollen [91] use the identity of the authors of the referenced papers and a co-authorship graph, built
offline, to identify potential reviewers. Although building a useful co-authorship graph and identifying
the authors of referenced papers in submissions presents a technical challenge, this method has the
advantage of not requiring any input from reviewers.
Benferhat and Lang [8], Goldsmith and Sloan [42], Garg et al. [37] assume that reviewer-to-paper
suitability scores are available and focus on the matching problem, and various desirable constraints.
We provide a more complete review of these studies in Section 5.3.
3.3.1 Expertise Retrieval and Modelling
The reviewer-matching problem can be seen as an instantiation of the more general expertise retrieval
problem [6]. The key problem of expertise retrieval, a sub-field of information retrieval, is to find the
right expert given a query. A basic query is a set of terms indicating the topics for which one wishes to
find an expert. In our case, queries are the papers submitted to conferences, and expertise is modelled
using previously authored papers. Because reviewer expertise and queries live in the same space, we
never have to model expertise explicitly in terms of high-level, or human-designed, topics. Overall,
the field of expertise retrieval has studied models similar to ours, including probabilistic and specifically
language models [6].
3.4 Other Possible Applications
An automated paper matching system has been particularly useful in a context, such as a conference,
where many items have to be assigned to many users under short time constraints. Nonetheless, the
system could be used in other similar contexts where an individual’s level of expertise can be learned
from that individual’s (textual) work. One example is the grant-reviewing matching process, where
grant applications must be matched to competent referees. Academic journals also often have to deal
with the problem of finding qualified reviewers for their papers. As more evaluation processes migrate
to on-line frameworks, it is also likely that other problems fitting within this system’s capabilities will
emerge. Other applications of more general match-constrained recommendations systems are discussed
in Chapter 5.
3.5 Conclusion and Future Opportunities
There are a variety of system enhancements, improved functionality and research directions we are
currently developing, and others we would like to explore in the near future. On the software side, we
intend to automate the system further to reduce the per-conference cost (both to us and to conference
organizers) of using the system. This implies providing automated debugging and explanatory tools for
conference organizers.
We have also identified a few more directions that will require our attention in the future:
1. Re-using reviewer scores from conference to conference: Currently, reviewers' archives are the only
piece of information that is re-used between conferences. It is possible that elicited scores could also
be used as part of a particular reviewer's profile that can be shared across conferences.
2. Score elicitation before the submission deadline: Conferences often have to adhere to strict and
short deadlines when assigning papers to reviewers after the submission deadline. Hence, collecting
additional information about reviewers before the deadline could save time. One possibility would be
to elicit scores from reviewers about a set of representative papers from the
conference (for example, a set of papers published in the conference’s previous edition).
3. Releasing the data: The data that we gathered through TPMS have opened various research
opportunities. We are hoping that some of these data can be properly anonymized and released
for use by other researchers.
4. Better integration with conference management software: Running outside of CMT (or other con-
ference organization packages) has provided advantages, but the relatively weak coupling between
the two systems also has disadvantages for conference organizers. As more conferences use our
system, we will be in a better position to develop further links between TPMS and CMT.
5. Leveraging other sources of side information: Other researchers have been able to leverage other
type of information apart from a bag-of-words representation of the submissions’ main text (see
Section 3.3). The model presented in Chapter 4 may be a gateway to modelling some of this other
information.
6. As mentioned in Section 3.3.1 many models have been proposed for expertise retrieval. It would
be worthwhile to compare the performance of our current models with the ones developed in that
community.
Finally, we are actively exploring ways to evaluate the accuracy, usefulness, and impact of TPMS more
effectively. We can currently evaluate how good our assignments are in terms of how qualified reviewers
are for the papers to which they are assigned (Chapter 5 details assignment procedures and evaluations).
Further, anecdotal evidence from program organizers, the number of conferences that have
expressed an interest in using the system, and the several experiments that we have run, all suggest that
our system provides value in proposing good reviewer assignments and in saving conference organizers
and senior program committee members time and cognitive effort. However, we do not have data to
evaluate the intrinsic impact that the system may have on conferences and, more generally, on the field
as a whole. We can divide the search for answers to such questions into two stages. First, how much
does reviewer expertise affect reviewing quality? For example, is it the case that finding good expert
reviewers leads to better reviews? Further, are there synergies between reviewers (for example, what
is the benefit of having both a senior researcher and a graduate student reviewer assigned to the same
paper)? Second, how much impact do reviews have on the quality of a conference and, ultimately, of
the field? For example, do good reviews lead to: a) more accurate accept and reject decisions; and
b) higher-quality papers? One way to answer these questions would be to run controlled experiments
with conference assignments, for example, by using two different assignment procedures. Subsequently
we could evaluate the performance of the procedures by surveying reviewers, authors, and conference
attendees. While we may obtain conclusive answers about some of these questions, evaluating the quality
of the field as a result of reviews seems like a major challenge. We have only begun exploring such issues
but already, in practice, we have found it a challenge to incentivize conference organizers to carry out
evaluation procedures which could help in assessing these impacts.
Chapter 4
Collaborative Filtering with Textual
Side-Information
Good user-item preference predictions are essential to the goal of providing users with good recommen-
dations. The task of preference prediction, which is the focus of this chapter, is therefore at the heart
of recommender systems. Preference prediction corresponds to the second stage of our recommendation
framework, depicted for the purposes of this chapter in Figure 4.1. We discuss this stage first because
of the importance of preference prediction in the framework as well as the fact that the other stages of
the framework will assume access to a preference prediction method.
There are various flavours of preference prediction problems, which are differentiated by which type of
input information is available to the preference prediction model. The simplest of situations is when only
a subset of user-item preferences are observed. The system must then learn, using a collaborative filtering
model, similarities between users that will enable it to accurately predict unobserved preferences. We
are interested in studying situations where side information, features of users, items or both, along with
user-item preferences is available. Specifically, we focus on hybrid systems, systems that use collaborative
filtering models while also modelling the side information.
We first provide a high-level description of the current state-of-the-art in the field of hybrid recom-
mender systems. We then focus on the specific problem of leveraging user and item textual side infor-
mation for user-preference predictions. We propose the novel collaborative score topic model (CSTM)
for this problem and present supporting empirical evidence of its good performance across data regimes.
Before introducing our model we also provide a brief review of topic models and variational inference
methods used for learning the parameters of those topic models.
4.1 Side Information in Collaborative Filtering
We define side information as any information about users or items distinct from user-item preferences.
The immediate goal of modelling side information is to glean information that will be helpful in learning
better models of users and items and (consequently) provide more accurate preference predictions.
Overall, side information has been particularly useful to combat the cold-start problem which plagues
collaborative-filtering methods. Figure 4.2 provides a sketch of the representations of the major class of
models (used in the literature) to combine side information with collaborative filtering. There are two
[Flow chart: users and entities, their stated preferences, and side information from users and items feed the prediction of missing preferences; the stated and predicted preferences then yield recommendations.]
Figure 4.1: Flow chart depicting the framework developed in this thesis with a particular focus on the second stage of the framework, the preference prediction stage.
predominant classes. The first, depicted in Figure 4.2(b), is to use the side information to build a prior
over user representations, for example by enforcing that users (and items) with similar side information
(fu and fv) have similar latent representations (u and v). Such models are therefore of particular use in
the cold-start data regime. The second methodology, which has received a lot of attention, is to use the
side information as covariates (or features) in a regression model (see Figure 4.2(c)). The parameters
over such features are often shared across users and items. Finally, a third methodology is to model
the side information together with the scores as shown in Figure 4.2(d). The resulting user and item
latent factors of this generative model should capture elements of both which can be useful for preference
prediction and for side information prediction.
Side information has been used, with varying levels of success, in many different domains. To better
understand the domains in which side information has been useful, we discuss user side information and
item side information separately.
User side information has been shown to be useful in various forms. Examples of user side information
include frequently available user demographics, such as a user's gender, age group, or geographical
location, which have often been used and have been shown to produce at least mild performance gains [1, 2,
26]. Concretely, it is likely that such side information is weakly indicative of preferences (for example,
two average teens may have similar movie tastes). More descriptive features such as a user's social
network have shown promise both as a way to select a user's neighbourhood in model-free approaches
[53, 76, 70] (neighbourhood models were introduced in Section 2.2.1), as well as in collaborative
filtering approaches as a way to regularize user representations by assuming that neighbours in a social
network have similar preferences [26, 54]. Features that could be described as relating to user behaviours,
such as user browsing patterns, user search queries, or user purchases, have also shown their utility,
again as a way to regularize user representations [61, 99].
On the item side, gains have mostly come from domains where the content of items can easily be
analyzed using machine learning techniques, for example, when the content is text as it is for scientific
papers [126] or newspaper articles [27]. For books, Agarwal and Chen [2] have proposed using book
reviews as representative of a book’s content (in Section 4.6 we show that using the content of books is
challenging). They use a probabilistic model of content and user preferences which combines regressors
over side information in a conceptually similar way as is done in Conry et al. [29] which we reviewed in
Section 3.3. For other tasks, such as the one of predicting interactions between drugs and proteins, side
(a) A collaborative filtering model such as probabilistic matrix factorization [99].
(b) The user side information fu and the item side information fv regularize the user and item representations respectively. Here, the graphical representation illustrates that each user and item latent factor (u and v) can be regularized by all other users' side information or all other items' side information (this could be the case for users when using the structure of a social network as side information). Regularization of user and item representations is used by many researchers [54, 39, 126, 66, 99].
(c) Side information is used as features (covariates) to learn a regressor [1, 2, 26, 61, 27, 29, 130].
(d) A generative model of side information in which the user and item latent factors must predict both the scores and the side information. Such a model can be seen as another way of regularizing the user and item latent factors. A few researchers have used such models [106, 70].
Figure 4.2: Graphical model representations of the three common classes of models for using side information within collaborative filtering models. Note that we depict the pure collaborative filtering model as one which includes, but is not restricted to, the popular probabilistic matrix factorization. We emphasize that the above figures are sketches meant as a guide to the different model classes (using a similar graphical representation); they are not intended to reproduce the exact parameters of the proposed models.
information consisting of drug similarities and protein similarities has also shown promise as a way to
regularize the representations of a matrix factorization model [39].
However, in other popular domains, such as movie preference prediction, where content features present
modelling difficulties, using item side information, for example a movie’s genre, its actors or its cast, has
been less useful (see, for example, [66, 106]). In the movie domain, modest gains have been achieved
when using related-content information such as user-proposed movie tags [2, 1].
Music recommendation offers an interesting testbed, falling between the simpler representations of
text and the more complex representations of movies, as we now have tools that allow the accurate
analysis, including determination of genre, of music tracks (see, for example, the MIREX competition1).
However, results in the music domain mostly point to the effectiveness of pure collaborative filtering.
In a recent contest [77], content-based methods still did not rival pure collaborative filtering
systems [Section 6.2.3, 9]. In a similar task, Weston et al. [130] show that using MFCC audio features as
regression features does not provide a gain over pure collaborative filtering methods. These negative
results have made some researchers question our ability to extract useful content features from such
complex domains as music, movies and images [108].
We have provided a fairly high-level overview of the work that has combined side information with
collaborative filtering models. In this chapter we will present a novel model for doing so, applicable
to textual side information (or, more generally, to the case when side information can be modelled with
a topic model), and we will compare it, experimentally, to competing models in this domain and in
particular to [126].
4.2 Problem Definition
We do not aim to solve the general problem of hybrid recommender systems in this chapter. Instead
we focus on pushing the state-of-the-art by studying the task of document recommendation using both
user-item scores and side information. The primary novelty of our work lies in leveraging a particular
form of side information: the content of documents associated with users, which we call user libraries.
A typical scenario that can be modelled in this way is scientific-paper recommendation for researchers;
for example, Google Scholar recommends papers based on an individual’s profile. A second scenario is
paper-reviewer assignment, where each reviewer’s previously published papers can be used to assess the
match between their expertise and each submitted paper. Another relevant application domain is book
recommendation, as online book merchants typically enable users to collect items in a virtual container
akin to a personal library.2 In each case a user’s library, or side information, consists of documents which
are not necessarily explicitly rated but nonetheless may contain information about a user’s preferences.
To model user-item scores as well as user and item content we introduce a novel directed graphical
model. This model uses twin topic models, with shared topics, to model the side information. User
and item topic proportions are then used as features to predict user-item scores with a collaborative
filtering model. The collaborative filtering component enables the model to effectively make use of the
side information with varying numbers of observed scores. We demonstrate empirically that the model
outperforms several other methods on three datasets in both cold and warm-start data regimes. We
1http://www.music-ir.org/mirex/
2For example, Amazon's Kindle and Kobo's tablets have an option for users to populate their libraries, while Barnes and Noble's Nook gives users an active shelf.
[Two plate diagrams with plates over the N words per document, the D documents, and the K topics.]
Figure 4.3: The graphical models of two topic models (a) LDA and (b) CTM.
further show that the model automatically learns to gradually trade off the use of side information in
favour of information learned from user-item scores as the amount of user preference-data increases.
4.3 Background
In Chapter 3 we used a topic model as a tool to learn representations for user libraries and paper
submissions. In this chapter we propose a graphical model which jointly models user and item topics
and user-item scores. To better understand this model and our proposed inference procedures, it will be
useful to further introduce LDA, as well as a second topic model called the correlated topic
model (CTM).
Topic models are a class of directed graphical models. They were initially proposed as a way to
model collections of textual documents [16]. Specifically, each document in the collection is modelled
as a mixture over topics. A topic is a distribution over words. With that in mind the aim of topic
models is then to learn a (word-level) representation of a set of documents. This representation involves
two key components: a) distributions over words or topics; b) for each document in the collection a
distribution over topics. Over the years topic models have been extended to model many different
practical situations, including: modelling how documents change over time [12], modelling hierarchies
of topics [15], and modelling document contents and labels [14, 84, 63].
For our needs we will specifically focus on LDA and its CTM variant. LDA’s graphical model is
depicted in Figure 4.3. The generative model of LDA over a set of D documents, an N -size word
vocabulary, and K topics is:
• For each document, d = 1 . . . D
- Draw d’s topic proportions: ηd ∼ Dirichlet(α)
- For each word in the document, n = 1 . . . N :
- Draw a topic: zdn ∼ Multinomial(ηd)
- Draw a word: wdn ∼ Multinomial(βzdn)
ηd represents the topic proportions, or a mixture over topics, of document d. For each word in a
document LDA first samples a single topic, zdn. A word is then sampled from the corresponding topic
distribution. It is useful to encode β as a matrix of size K × N , where K is the number of topics.
Then the zdn’th row of that matrix βzdn is the distribution (over words) corresponding to topic zdn.
Formally, for each document, LDA learns to represent a distribution over words given model parameters
and latent variables: Pr(wd|zd,ηd,β, α). Across all documents LDA then represents the space of joint
distributions over document words.
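The generative process above can be sketched with NumPy; this is an illustrative sampler (fixed document length and a symmetric Dirichlet prior are simplifying assumptions, not part of the thesis's model):

```python
import numpy as np

def sample_lda_corpus(n_docs, doc_len, beta, alpha, seed=0):
    """Sample documents from LDA's generative process.

    beta: K x N matrix; row k is topic k's distribution over the N words.
    alpha: scalar concentration of a symmetric Dirichlet over the K topics.
    """
    rng = np.random.default_rng(seed)
    K, N = beta.shape
    docs = []
    for _ in range(n_docs):
        eta = rng.dirichlet(np.full(K, alpha))           # topic proportions eta_d
        words = []
        for _ in range(doc_len):
            z = rng.choice(K, p=eta)                     # per-word topic z_dn
            words.append(int(rng.choice(N, p=beta[z])))  # word w_dn ~ beta_{z_dn}
        docs.append(words)
    return docs
```

A small `alpha` concentrates each document on a few topics, which matches the intuition that a paper is about a handful of subjects.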
The correlated topic model is very similar to LDA except that mixtures over topics are modelled using
a logistic normal distribution instead of a Dirichlet. Accordingly, instead of sampling a document's
distribution over topics from a Dirichlet prior, it is sampled from a normal distribution with mean
parameter µ and covariance parameter Σ. CTM's graphical model is shown in Figure 4.3 and its
generative model is given by:
• For each document, d = 1 . . . D
- Draw d’s topic proportions: ηd ∼ N (µ,Σ)
- For each word in a document, n = 1 . . . N
- Draw a topic: zdn ∼ Multinomial(softmax(ηd))
- Draw a word: wdn ∼ Multinomial(βzdn)
where $\mathrm{softmax}(v)_k = \frac{\exp(v_k)}{\sum_{k'} \exp(v_{k'})}$. In CTM the main advantage of using a normal distribution is that
correlations between topics (e.g., topic 1 often co-occurs with topic 5) can be represented. This has been
shown experimentally to provide better models of document collections (and, as a side benefit, provides
potentially interesting visualizations of the discovered topic correlations) [13]. The main disadvantage
brought on by using a normal distribution is that the normal is not conjugate to the multinomial (whereas
the Dirichlet is), and thus this change complicates the resulting inference procedures [13, 127].
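The analogous sketch for CTM replaces the Dirichlet draw with a softmaxed Gaussian draw; again a hypothetical illustration rather than the thesis's code:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / e.sum()

def sample_ctm_document(doc_len, mu, Sigma, beta, seed=0):
    """Sample one document from CTM's generative process.

    mu, Sigma: mean and covariance of the logistic-normal prior over topics.
    beta: K x N topic-word matrix, one distribution over words per row.
    """
    rng = np.random.default_rng(seed)
    eta = rng.multivariate_normal(mu, Sigma)   # eta_d ~ N(mu, Sigma)
    theta = softmax(eta)                       # topic proportions
    K, N = beta.shape
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)             # per-word topic
        words.append(int(rng.choice(N, p=beta[z])))
    return words
```

Off-diagonal entries of `Sigma` are what let CTM capture topic co-occurrence, which a Dirichlet prior cannot express.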
4.3.1 Variational Inference in Topic Models
The key inference problem of graphical models is to evaluate the posterior distributions over latent
variables ηd and zd given the observed variables wd and the model parameters α and β. Posteriors
can then be used to make predictions using standard Bayesian inference. In the case of LDA the per-document posterior, following Bayes' rule, is given by:
\[
\Pr(z_d, \eta_d \mid w_d, \beta, \alpha) = \frac{\Pr(w_d \mid z_d, \beta)\,\Pr(z_d \mid \eta_d)\,\Pr(\eta_d \mid \alpha)}{\sum_{z_d} \int \Pr(w_d \mid z_d, \beta)\,\Pr(z_d \mid \eta_d)\,\Pr(\eta_d \mid \alpha)\, d\eta_d} \tag{4.1}
\]
However the denominator is intractable due to the coupling of η and β [16]. Hence we must resort to
an approximate inference approach. Specifically we will use variational inference.
In variational inference the idea is to replace the intractable posterior by a tractable distribution
over the latent variables: Q({z}, {η}). The choice of Q is model dependent; however, for computational
simplicity, it is often given a parametric form. Let us group all latent variables, across
documents, into Z = {{zd}, {ηd}}d∈D and the visible variables into X = {wd}d∈D.3 Then the approach
is to optimize this tractable distribution such that it is as close as possible to the true posterior. The
measure of closeness between the two distributions is taken to be the KL-divergence:
KL(Q||P ) =
∫Q(Z) ln
(Q(Z)
P (Z|X,α, β)
)∂Z (4.2)
=
∫Q(Z) ln(Q(Z))−Q(Z) ln(P (Z|X,α, β) ∂Z (4.3)
= −H(Q(Z))− EQ[P (Z|X,α, β)] (4.4)
where H(·) denotes the entropy of its argument.
Minimizing the KL-divergence is still intractable because it requires the evaluation of the posterior
P(Z|X,α, β) (Equation 4.4). However, by subtracting lnP(X), the log-marginal probability of the data,
which is constant with respect to Z, we obtain:
KL(Q||P )− lnP (X) =
∫Q(Z) ln(Q(Z))−Q(Z) lnP (Z|X) ∂Z −
∫Q(Z) lnP (X)∂Z (4.5)
=
∫Q(Z) ln
((Q(Z))
P (Z|X,α, β)P (X|α, β)
)∂Z (4.6)
=
∫Q(Z) ln
((Q(Z))
P (Z,X|α, β)
)∂Z (4.7)
=
∫Q(Z) lnQ(Z)−Q(Z) lnP (Z,X|α, β) (4.8)
= −H(Q(Z))− EQ[P (Z,X|α, β)] (4.9)
The resulting expression, which consists of an entropy term and an expectation of the complete-data log-likelihood \ln \Pr(Z, X \mid \alpha, \beta), can now be evaluated.
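The identity just derived is easy to check numerically. The following Python sketch (a toy discrete model with made-up probabilities, not taken from this chapter) verifies that KL(Q||P) − ln P(X) equals the free energy −H(Q) − E_Q[ln P(Z, X)]:

```python
import math

# Toy discrete model: one latent variable z in {0, 1}, one fixed observation x.
# The joint probabilities P(z, x) below are illustrative assumptions.
p_joint = {0: 0.12, 1: 0.28}                        # P(z, x) for each z
p_x = sum(p_joint.values())                         # marginal P(x)
p_post = {z: p / p_x for z, p in p_joint.items()}   # posterior P(z | x)

# An arbitrary tractable distribution Q(z).
q = {0: 0.6, 1: 0.4}

# Left-hand side: KL(Q || P(z|x)) - ln P(x)
kl = sum(q[z] * math.log(q[z] / p_post[z]) for z in q)
lhs = kl - math.log(p_x)

# Right-hand side: -H(Q) - E_Q[ln P(z, x)]  (the free energy)
neg_entropy = sum(q[z] * math.log(q[z]) for z in q)
rhs = neg_entropy - sum(q[z] * math.log(p_joint[z]) for z in q)

assert abs(lhs - rhs) < 1e-12  # the two sides agree up to rounding
```

Because ln P(X) does not depend on Q, minimizing the free energy over Q is equivalent to minimizing the KL-divergence to the true posterior.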
The variational expectation-maximization (EM) [32, 80] algorithm (approximately) minimizes the
above, also known as the free energy, by alternating the following two optimization steps. In the E-
step the algorithm minimizes the free energy with respect to the variational posterior parameters, the
parameters of Q. Then, in the M-step the free energy is minimized with respect to the model parameters
(that is, α, β in LDA and µ,Σ, β in CTM). The M-step is also often described as maximizing the complete-
data log-likelihood. The EM algorithm is a classic algorithm in the machine learning literature. The
particular derivation that we have shown is similar to the one in Frey and Jojic [35].
There are several common choices of variational distribution. For example, we recover the maximum a posteriori (MAP) configuration of the latent variables by choosing the variational distribution to be a Dirac delta function with its mode equal to the variational parameter \hat{h}: Q(h) = \delta_{\hat{h}}(h), where \delta_{\hat{h}}(h) evaluates to one when h = \hat{h} and zero otherwise. Another common technique is to use a mean-field approach where the intractable posterior distribution is approximated using decoupled distributions. For example, in the case of LDA, Q(\eta_d, \{z_d\} \mid \gamma_d, \{\phi_d\}) = Q(\eta_d \mid \gamma_d)\, Q(\{z_d\} \mid \{\phi_d\}), where \gamma_d and \{\phi_d\} are variational parameters.
^3 Equivalently, we will write \{w_d\} when the domain of the index is clear from context.
Chapter 4. Collaborative Filtering with Textual Side-Information 46
4.4 Collaborative Score Topic Model (CSTM)
Our approach to document recommendation relies on having: a) a set of observed user-item preferences (\{s_{rp}\}); b) the contents of the items (\{w^d_p\}); and c) the contents of user libraries (\{w^a_r\}). The model's aim is to utilize this content in its user-item score predictions (which can then be used to recommend items to users).
To do so we combine a topic model over side information and a collaborative filtering model of user-
item preferences. Specifically, we propose a generative model over the joint space of user preferences,
user side information (user libraries) and item side information (item contents). This model can be cast
in the light of the previous work on incorporating side information along with collaborative filtering
which we described earlier in this chapter (Section 4.1). Specifically, we combine a generative model
of user and item side information as well as of user-item preferences similarly as in Figure 4.2(d). The
side information, which consists of textual documents is modelled using a topic model. The user and
item topic proportions are then used within a linear regression model over user-item preferences. In that
respect our model also shares commonalities with the class of models depicted in Figure 4.2(c). The
resulting model is called the collaborative score topic model (CSTM).
We now describe the components of our model in more detail. Our content-based model is mediated by topics: we learn a shared topic model from the words of the documents and the user libraries. We represent topic proportions with a normal distribution and realized topics z^a and z^d, in the same way as CTM, using the logistic normal [13]. User and item topic proportions offer a compact representation of user and item side information. We favor CTM over LDA because its continuous representation of topic proportions can be useful for the regression model. We use these representations as covariates in a regression model to predict user-item preferences. The regression has two sets of parameters. The first are user-specific parameters on the item-topic covariates. The second are compatibility parameters, which are shared across users and items, and are based on the compatibility between the item topics and the topics of the user library.
We now introduce the complete graphical model of the CSTM. A graphical representation of the
model is given in Figure 4.4. The associated generative model is:
• Draw compatibility parameters: \theta \sim N(0, \lambda_\theta^2 I)
• Draw shared-user parameters: \gamma_0 \sim N(0, \lambda_{\gamma_0}^2 I)
• For each user r = 1 \ldots R:
- Draw individual-user parameters: \gamma_r \sim N(0, \lambda_\gamma^2 I)
- Draw user-topic proportions: a_r \sim N(0, \lambda_a^2 I)
• For each document p = 1 \ldots P:
- Draw document-topic proportions: d_p \sim N(0, \lambda_d^2 I)
• For each of user r's user-library words, n = 1 \ldots N:^4
- Draw z^a_{rn} \sim \mathrm{Multinomial}(\mathrm{softmax}(a_r))
- Draw w^a_{rn} \sim \mathrm{Multinomial}(\beta_{z^a_{rn}})
• Repeat the above for each of document p's M words
^4 For simplicity, we assume in the notation that all user libraries contain N words and all item documents contain M words.
Figure 4.4: Graphical model representation for CSTM.
• For each user-document pair (r, p), draw scores:

s_{rp} \sim N\left((a_r \otimes d_p)^T \theta + d_p^T(\gamma_0 + \gamma_r),\; \sigma_s^2\right)    (4.10)

where N(\mu, \sigma^2) represents a normal distribution with mean \mu and variance \sigma^2, \otimes stands for the element-wise (Hadamard) product, \mathrm{softmax}(v)_k = \exp(v_k) / \sum_{k'} \exp(v_{k'}), and I is the identity matrix.
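As an illustration, the generative process above can be simulated directly. The following Python sketch samples users, documents, library words, and scores with toy dimensions; all sizes and prior scales are illustrative assumptions, not settings used in the thesis:

```python
import math
import random

random.seed(0)

K, V = 3, 8    # topics, vocabulary size (toy values)
R, P = 2, 4    # users, documents
N = 5          # words per user library (assumed equal for all users)
lam = 1.0      # every prior standard deviation set to 1 for simplicity
sigma_s = 0.1  # score noise

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def gauss_vec(n, std):
    return [random.gauss(0.0, std) for _ in range(n)]

def categorical(probs):
    u, c = random.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if u <= c:
            return i
    return len(probs) - 1

# Topic-word distributions; in CSTM beta is a learned parameter, here random.
beta = [softmax(gauss_vec(V, 1.0)) for _ in range(K)]

theta = gauss_vec(K, lam)                      # compatibility parameters
gamma0 = gauss_vec(K, lam)                     # shared user parameters
gamma = [gauss_vec(K, lam) for _ in range(R)]  # individual-user parameters
a = [gauss_vec(K, lam) for _ in range(R)]      # user topic proportions
d = [gauss_vec(K, lam) for _ in range(P)]      # document topic proportions

# User-library words: draw a topic from softmax(a_r), then a word from beta.
libraries = [[categorical(beta[categorical(softmax(a[r]))]) for _ in range(N)]
             for r in range(R)]

# Scores: s_rp ~ N((a_r (x) d_p)^T theta + d_p^T (gamma0 + gamma_r), sigma_s^2)
def score_mean(r, p):
    had = sum(a[r][k] * d[p][k] * theta[k] for k in range(K))
    lin = sum(d[p][k] * (gamma0[k] + gamma[r][k]) for k in range(K))
    return had + lin

scores = {(r, p): random.gauss(score_mean(r, p), sigma_s)
          for r in range(R) for p in range(P)}
```

Running inference on such synthetic draws is a standard sanity check: a correct implementation should recover score predictions close to score_mean(r, p).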
The specific parametrization of the preference regression shown in Equation 4.10 is important. Our model is designed to perform well in both cold-start and warm-start data regimes. In cold-start settings the model needs the user's side information to predict user-item preferences. As the number of observed preferences increases, the model can gradually leverage that information, smoothly combining it with information gleaned from the side information to refine its model of missing preferences. To accomplish this, the regression model (Equation 4.10) is separated in two: one component that exploits user side information ((a_r \otimes d_p)^T \theta) and another that does not include the user side information but does include user-specific parameters (d_p^T(\gamma_0 + \gamma_r)).
Item side information is incorporated by modulating the user information through an element-wise
product. The weights θ then serve several purposes: 1) they can act to amplify or reduce the effect
of certain topics (for example diminish the influence of topics bearing little preference information); 2)
they enable the model to more easily calibrate its output to the range of observed preference values; and
3) changing the magnitude of θ enables the model to control how much it uses the side information for
preference prediction.
When user-item preferences are more abundant, the model can use them to learn a user-specific
model, \gamma_r, over item features. Note that these user-specific parameters are combined with a shared set of parameters, \gamma_0, which allows for some transfer across users. An individual's \gamma_r can be used to
increase that user’s reliance on user-item preferences at the possible expense of item side information,
as the joint magnitude of the γ’s defines the weights associated with this part of the model.
Our model learns a single set of topics to model user and item content. Sharing topics ensures that the user and item representations (a_r and d_p, \forall r, \forall p) are aligned and renders their element-wise product meaningful.
4.4.1 The Relationship Between CSTM and Standard Models
Simplifying the proposed CSTM model in various ways produces other models that have been used for similar tasks. First, setting \gamma_0 and \gamma_r, for every user r, to zero and \theta to a vector of ones, we obtain the language model introduced in Section 3.2.1. To be precise, because we are using user and item topic representations, this degenerate case corresponds to topic-LM (see Section 3.2.3). Mimno and McCallum
[79] used the related word-LM in a preference prediction task and found its performance particularly
strong in low-data regimes.
Further, setting \theta and \gamma_0 to zero, we obtain the individual user regression model LR introduced in Section 3.2.2. As we will demonstrate, LR typically outperforms purely collaborative filtering models in our task.
By modelling preferences as a combination of user features and item features, our model can also be
seen as an instance of collaborative filtering [99, 98, 7].
Finally, we have opted to represent topic proportions using a logistic normal distribution as in
CTM [13]. In our case we utilize the logistic normal due to its representational form, and not as a
means of learning topic correlations.5 Compared to a multinomial, the normal distribution adds a level
of flexibility that may be useful to better calibrate CSTM’s preference predictions; the drawback is
additional complexity in model inference.
Our model also shares several similarities with the model of Equation 3.5. First, both models are composed of two separate components which, respectively, learn to regress over user and item side information. While the regression over item content is the same, CSTM combines the user and item side-information representations in its other component in order to maximize its performance on cold-start users. Early experiments with CSTM showed that having both global and user-specific compatibility parameters (which would be closer to the parameters \omega + \omega_r of Equation 3.5) did not yield empirical improvements. One important difference between the two models is that the model of Equation 3.5 uses representations learned offline whereas CSTM learns these representations jointly with the regression parameters.
4.4.2 Learning and Inference
For learning we use a version of the EM algorithm where we alternate between updates of the user-item specific variables (H = \{\{\gamma_r\}, \{a_r\}, \{d_p\}, \{z^a\}, \{z^d\}\}) in the E-step and updates of the parameters, or shared variables (\Theta = \{\gamma_0, \theta, \beta\}), in the M-step. The inference and learning procedures are similar to
those proposed for nonconjugate LDA models in Wang and Blei [126]. The general EM algorithm is
shown in Algorithm 1.
^5 Because learning topic correlations has been found to improve on standard LDA, it is possible that learning the topic correlations could also improve our model.
E-Step
Inference in this model is intractable, so we must rely on approximations when manipulating the posterior
over the user-item specific variables. The log-posterior over user-item variables, given the fixed model
parameters and the data, is
L := -\frac{1}{2\lambda_a} \sum_{r}^{R} a_r^T a_r - \frac{1}{2\lambda_d} \sum_{p}^{P} d_p^T d_p - \frac{1}{2\lambda_\gamma} \sum_{r}^{R} \gamma_r^T \gamma_r
 - \frac{1}{2\sigma_s^2} \sum_{(r,p) \in S_o} \left( s_{rp} - \left( (a_r \otimes d_p)^T \theta + (\gamma_0 + \gamma_r)^T d_p \right) \right)^2
 + \sum_{r,n}^{R,N} \log \frac{\exp(a_{r, z^a_{rn}})}{\sum_j \exp(a_{rj})} + \sum_{p,m}^{P,M} \log \frac{\exp(d_{p, z^d_{pm}})}{\sum_j \exp(d_{pj})}
 + \sum_{r,n}^{R,N} \log \beta_{z^a_{rn}, w^a_{rn}} + \sum_{p,m}^{P,M} \log \beta_{z^d_{pm}, w^d_{pm}} - \log Z(\Theta)    (4.11)

where S_o stands for the set of observed preferences and Z(\Theta) is the normalizing term of the posterior, which is intractable in part because a_r and d_p cannot be analytically integrated out: they are not conjugate to the distribution over topic assignments [13].
We address this computational issue by employing variational approximate inference. For each of the topic-proportion and regression variables \{a_r\}, \{d_p\}, \{\gamma_r\}, we use a Dirac delta posterior parameterized by its mode \{\hat{a}_r\}, \{\hat{d}_p\}, \{\hat{\gamma}_r\}. For the topic-assignment variables \{z^a\}, \{z^d\}, we instead utilize a mean-field posterior. The full approximate posterior is therefore:

q(\{a_r\}, \{d_p\}, \{\gamma_r\}, \{z^a_r\}, \{z^d_p\} \mid \{\hat{a}_r\}, \{\hat{\gamma}_r\}, \{\hat{d}_p\}, \{\phi^a_r\}, \{\phi^d_p\}) = \left(\prod_{r}^{R} \delta_{\hat{\gamma}_r}(\gamma_r)\right) \left(\prod_{r}^{R} \delta_{\hat{a}_r}(a_r) \prod_{n}^{N} \phi^a_{rn z^a_{rn}}\right) \left(\prod_{p}^{P} \delta_{\hat{d}_p}(d_p) \prod_{m}^{M} \phi^d_{pm z^d_{pm}}\right)

where \delta_\mu(x) is the delta function with mode \mu and \{\phi^a_r\}, \{\phi^d_p\} are the mean-field parameters (e.g., \phi^a_r is a matrix whose entries \phi^a_{rnj} are the probabilities that the nth word in user r's library belongs to topic j).

Approximate inference entails finding the variational parameters \{\hat{a}_r\}, \{\hat{d}_p\}, \{\hat{\gamma}_r\}, \{\phi^a_r\}, \{\phi^d_p\} that minimize the KL-divergence with the true posterior:
KL := -E_q[L] - H(q)
 = \frac{1}{2\lambda_a} \sum_{r}^{R} \hat{a}_r^T \hat{a}_r + \frac{1}{2\lambda_d} \sum_{p}^{P} \hat{d}_p^T \hat{d}_p + \frac{1}{2\lambda_\gamma} \sum_{r}^{R} \hat{\gamma}_r^T \hat{\gamma}_r
 + \frac{1}{2\sigma_s^2} \sum_{(r,p) \in S_o} \left( s_{rp} - \left( (\hat{a}_r \otimes \hat{d}_p)^T \theta + (\gamma_0 + \hat{\gamma}_r)^T \hat{d}_p \right) \right)^2
 - \sum_{r,n,k}^{R,N,K} \phi^a_{rnk} \left( \log \frac{\exp(\hat{a}_{rk})}{\sum_j \exp(\hat{a}_{rj})} + \log \beta_{k, w^a_{rn}} - \log \phi^a_{rnk} \right)
 - \sum_{p,m,k}^{P,M,K} \phi^d_{pmk} \left( \log \frac{\exp(\hat{d}_{pk})}{\sum_j \exp(\hat{d}_{pj})} + \log \beta_{k, w^d_{pm}} - \log \phi^d_{pmk} \right)
 + \text{constant}    (4.12)
Our strategy is to perform one pass of coordinate descent, optimizing each set of variational parameters given the others.^6 For \hat{\gamma}_r, we obtain a closed-form update by differentiating the above equation and setting the result to zero:

\hat{\gamma}_r = \frac{1}{\sigma_s^2} \left( \sum_{p \in S_o(r)} \left( s_{rp} - (\hat{d}_p \otimes \hat{a}_r)^T \theta - \hat{d}_p^T \gamma_0 \right) \hat{d}_p^T \right) \left( \sum_{p \in S_o(r)} \frac{\hat{d}_p \hat{d}_p^T}{\sigma_s^2} + \frac{1}{\lambda_\gamma} I \right)^{-1}

where S_o(r) is the set of indices of the documents that user r has rated. The \{\hat{a}_r\}, \{\hat{d}_p\} parameters do not have closed-form solutions, hence we use conjugate gradient descent for optimization. The derivatives of the KL (Equation 4.12) with respect to these parameters are:
\frac{\partial KL}{\partial \hat{a}_r} = \frac{\hat{a}_r}{\lambda_a} - \frac{1}{\sigma_s^2} \sum_{p \in S_o(r)} (s_{rp} - \hat{s}_{rp})(\hat{d}_p \otimes \theta) + N \frac{\exp(\hat{a}_r)}{\sum_j \exp(\hat{a}_{rj})} - \sum_n \phi^a_{rn}

\frac{\partial KL}{\partial \hat{d}_p} = \frac{\hat{d}_p}{\lambda_d} - \frac{1}{\sigma_s^2} \sum_{r \in S_o(p)} (s_{rp} - \hat{s}_{rp})(\hat{a}_r \otimes \theta + \gamma_0 + \hat{\gamma}_r) + M \frac{\exp(\hat{d}_p)}{\sum_j \exp(\hat{d}_{pj})} - \sum_m \phi^d_{pm}

where \hat{s}_{rp} = (\hat{a}_r \otimes \hat{d}_p)^T \theta + \hat{d}_p^T(\gamma_0 + \hat{\gamma}_r) and S_o(p) denotes the set of users who have rated document p.
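The closed-form update for \hat{\gamma}_r has the familiar ridge-regression shape: residuals (computed with the other parameters held fixed) regressed on the rated documents' representations, with a prior-induced regularizer. A minimal Python sketch using plain-list linear algebra; the function names and toy inputs are hypothetical, not from the thesis:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def update_gamma_r(docs, scores, a_r, theta, gamma0, sigma_s2, lam_gamma):
    """Closed-form update sketch for one user's gamma_r.

    docs:   list of d_p vectors for the documents user r has rated
    scores: the matching observed s_rp values
    """
    K = len(a_r)
    # Regularized Gram matrix: sum_p d_p d_p^T / sigma_s^2 + I / lam_gamma
    A = [[(1.0 / lam_gamma if i == j else 0.0) for j in range(K)] for i in range(K)]
    b = [0.0] * K
    for d_p, s_rp in zip(docs, scores):
        # Residual with the compatibility and shared terms held fixed.
        resid = s_rp - sum(d_p[k] * a_r[k] * theta[k] for k in range(K)) \
                     - sum(d_p[k] * gamma0[k] for k in range(K))
        for i in range(K):
            b[i] += resid * d_p[i] / sigma_s2
            for j in range(K):
                A[i][j] += d_p[i] * d_p[j] / sigma_s2
    return solve(A, b)

# Example (hypothetical values): with theta = gamma0 = 0 and a very weak prior,
# the update reduces to least squares on the user's observed scores.
g = update_gamma_r([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0],
                   [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], 1.0, 1e9)
```

This makes explicit why the update is cheap: it only requires a K × K solve per user, where K is the number of topics.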
In practice, when optimizing for \hat{a}_r and \hat{d}_p, we use the approach of Blei and Lafferty [13] and introduce additional variational parameters, \{\zeta^a, \zeta^d\}, which are used to bound the respective softmax denominators. For each user, \zeta^a_r is updated using \zeta^a_r = \sum_{k}^{K} \exp(\hat{a}_{rk}), and similarly for the \zeta^d_p's.
For the mean-field parameters \{\phi^a_r\}, \{\phi^d_p\}, minimizing the KL while enforcing normalization leads to the following solution:

\phi^a_{rnk} = \frac{\beta_{k, w^a_{rn}} \exp(\hat{a}_{rk})}{\sum_j \beta_{j, w^a_{rn}} \exp(\hat{a}_{rj})} \qquad \phi^d_{pmk} = \frac{\beta_{k, w^d_{pm}} \exp(\hat{d}_{pk})}{\sum_j \beta_{j, w^d_{pm}} \exp(\hat{d}_{pj})}
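Each mean-field update above is just a per-word posterior over topics: the word's likelihood under each topic, reweighted by the (unnormalized) topic proportions, then normalized. A small Python sketch with made-up values:

```python
import math

def update_phi(word, a_hat, beta):
    """phi_k proportional to beta[k][word] * exp(a_hat[k]), normalized over k."""
    K = len(a_hat)
    unnorm = [beta[k][word] * math.exp(a_hat[k]) for k in range(K)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

# Toy values (illustrative, not from the thesis): 2 topics, 2-word vocabulary.
beta = [[0.7, 0.3], [0.2, 0.8]]   # topic-word distributions
a_hat = [0.0, 0.0]                # uniform (unnormalized) topic proportions
phi = update_phi(0, a_hat, beta)  # posterior over topics for word 0
assert abs(sum(phi) - 1.0) < 1e-12
```

With uniform proportions, the update simply normalizes each topic's probability of the word; skewing a_hat toward a topic pulls the word posterior toward that topic.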
^6 While we could cycle through all variational parameters until convergence before beginning the M-step, we have found a single pass of updates per E-step to work well in practice.
Algorithm 1 EM for the CSTM

Input: \{w^a_r\}, \{w^d_p\}, \{s_{rp}\} \in S_o.
while convergence criteria not met do
    # E-Step
    for all p = 1 \ldots P do
        Update \hat{d}_p, \phi^d_p
    end for
    for all r = 1 \ldots R do
        Update \hat{a}_r, \hat{\gamma}_r, \phi^a_r
    end for
    # M-Step
    Update \theta, \gamma_0, \beta
end while
We update the variational parameters of all users and subsequently of all documents.
M-Step
The M-step aims to maximize the expectation of the complete likelihood under the variational posterior (taking into account the priors over the parameters \gamma_0, \theta):

E_q[L] + \log p(\gamma_0) + \log p(\theta) = -\frac{1}{2\sigma_s^2} \sum_{(r,p) \in S_o} \left( s_{rp} - \left( (\hat{a}_r \otimes \hat{d}_p)^T \theta + (\gamma_0 + \hat{\gamma}_r)^T \hat{d}_p \right) \right)^2
 + \sum_{r,n,k}^{R,N,K} \phi^a_{rnk} \log \beta_{k, w^a_{rn}} + \sum_{p,m,k}^{P,M,K} \phi^d_{pmk} \log \beta_{k, w^d_{pm}}
 - \frac{1}{2\lambda_{\gamma_0}} \gamma_0^T \gamma_0 - \frac{1}{2\lambda_\theta} \theta^T \theta + \text{constant}.
Setting the derivatives to zero (and satisfying the normalization constraints on the \beta_{jw} parameters), we obtain the following updates:

\theta = \frac{1}{\sigma_s^2} \left( \sum_{(r,p) \in S_o} \left( s_{rp} - \hat{d}_p^T(\gamma_0 + \hat{\gamma}_r) \right) (\hat{d}_p \otimes \hat{a}_r)^T \right) \left( \sum_{(r,p) \in S_o} \frac{(\hat{d}_p \otimes \hat{a}_r)(\hat{d}_p \otimes \hat{a}_r)^T}{\sigma_s^2} + \frac{1}{\lambda_\theta} I \right)^{-1}

\gamma_0 = \frac{1}{\sigma_s^2} \left( \sum_{(r,p) \in S_o} \left( s_{rp} - (\hat{d}_p \otimes \hat{a}_r)^T \theta - \hat{\gamma}_r^T \hat{d}_p \right) \hat{d}_p \right) \left( \sum_{(r,p) \in S_o} \frac{\hat{d}_p \hat{d}_p^T}{\sigma_s^2} + \frac{1}{\lambda_{\gamma_0}} I \right)^{-1}

\beta_{jk} = \frac{\sum_{r,n} \phi^a_{rnj} 1\{w^a_{rn} = k\} + \sum_{p,m} \phi^d_{pmj} 1\{w^d_{pm} = k\}}{\sum_{k'} \left( \sum_{r,n} \phi^a_{rnj} 1\{w^a_{rn} = k'\} + \sum_{p,m} \phi^d_{pmj} 1\{w^d_{pm} = k'\} \right)}.
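The \beta update above simply pools the expected topic-word counts from both the user libraries and the documents, then normalizes each topic's row over the vocabulary. A Python sketch with a hypothetical input structure (each word occurrence paired with its mean-field posterior \phi):

```python
def update_beta(user_phis, doc_phis, K, V):
    """
    user_phis / doc_phis: lists of (word_id, phi) pairs, where phi is a
    length-K posterior over topics for that word occurrence.
    Returns beta as K rows of length V, each normalized over the vocabulary.
    """
    # Expected counts: how much of each word occurrence each topic "claims".
    counts = [[0.0] * V for _ in range(K)]
    for word, phi in list(user_phis) + list(doc_phis):
        for j in range(K):
            counts[j][word] += phi[j]
    # Normalize each topic's row into a distribution over words.
    beta = []
    for j in range(K):
        total = sum(counts[j])
        beta.append([c / total for c in counts[j]])
    return beta

# Illustrative check: two word occurrences, two topics, vocabulary of 3.
phis = [(0, [1.0, 0.0]), (2, [0.5, 0.5])]
beta = update_beta(phis, [], K=2, V=3)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in beta)
```

Pooling the two sources of counts is what makes the topics shared between users and items, as discussed in Section 4.4.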
At test time, prediction of missing preferences is made using \hat{s}_{rp}, which is readily available.
Figure 4.5: A graphical representation of the CTR model (adapted from [126]).
4.5 Related Work
Previous work on hybrid collaborative filtering includes a few models that have combined item-only
topic and regression models for user-item preference prediction. We are not aware of any earlier work
that develops a text-based model of a user, nor one that combines user and item side information as in
CSTM.
Agarwal and Chen [2] model several sources of side information, including item textual side information, using LDA. The topic-assignment proportions of documents (\sum_m z^d_{dm}/M for each document d \in D) are used as item features and combined multiplicatively with user-specific features. The results are linearly combined with user demographic information to generate preferences.
Wang and Blei [126] also combine LDA with a regression model for the task of recommending scientific articles. Here the item topic proportions are used as a prior mean on normally-distributed item (regression) latent variables. User latent variables are also normally distributed from a zero-mean prior. A specific user-item score is then generated as the inner product of item and user latent variables: s_{rp} = a_r^T(d_p + \epsilon_p), where \epsilon_p is drawn from a zero-mean normal distribution. The preference prediction model is the same as the one used in probabilistic matrix factorization [99]. Wang and Blei also report that a modified version of their model, analogous to the model of Agarwal and Chen [2], performed worse on their data. Shan and Banerjee [106] proposed a similar model, without the offset term \epsilon_p, but using CTM [13].
The fact that we model an additional type of information (user textual side information) makes
it difficult to directly compare our model to the ones above. In addition, the parametrization we
use to predict preferences is very different from previous models. We initially experimented with a
parametrization similar to Wang and Blei [126], albeit modified to also model user side information, and
found it did not perform as well as CSTM (see the next section for experimental comparisons).
Finally, Agarwal and Chen [1] propose a collaborative filtering model with side information. Although
the form of the side information is not amenable to using topic models, the authors utilize a combination
of linear models to obtain good performance in both cold and warm-start data regimes. They reported
improved results in all the regimes they tried compared to pure collaborative filtering methods.
CSTM is also useful for predicting reviewer-paper affinities. Other researchers have looked at modelling reviewer expertise using collaborative filtering or topic models. We note the work of Conry et al. [29], which uses a collaborative filtering method along with side information about both papers and reviewers to predict reviewer-paper scores. Mimno and McCallum [79] developed a novel topic model to help predict reviewer expertise. Finally, Balog et al. [5] utilize a language model to evaluate the suitability of experts for various tasks.
4.6 Experiments
We first describe the three datasets used for our experiments. We then introduce the set of methods
against which we perform empirical comparisons, ranging from pure CF methods to pure side information
methods. We report three separate sets of experiments. In the first we focus on the cold-start problem
for new users and examine the effect of including user libraries. In the second we study how the
methods perform on users with varying amounts of observed scores. Finally, we design a synthetic paper-recommendation experiment and simulate the arrival of new cold-start users in order to test the value of using both the user library and the user-provided item scores.
4.6.1 Datasets
We evaluate the models using these three datasets:
NIPS-10 and ICML-12 were both introduced in Section 3.2.3. We note that for NIPS-10, each user's library consists of his own previously published papers. Users have an average of 31 documents (std. 20). After some basic preprocessing (we removed words that appear in fewer than 3 submissions or in more than 90% of all submissions), the joint vocabulary contains slightly over 18,000 words. For ICML-12, users have an average of 25 documents (std. 29) each, and the joint vocabulary contains 16,201 words (identical preprocessing as for NIPS-10 above).
Kobo: The third dataset is from Kobo, a large North American-based ebook retailer.7 The dataset
contains 316 users and 2601 documents (books). Users average 81 documents (std. 100). We removed
very-infrequent and very-frequent words (those appearing in less than 1% or more than 95% of all
documents). The resulting vocabulary contains 6,440 words. Users have a minimum of 15 expressed
scores (mean 22, std. 6).
The score distributions of each of the three datasets are shown in Figure 4.6.
4.6.2 Competing Models
We evaluate the performance of CSTM through empirical comparisons against other models. For the competing models we re-use some of the general models previously introduced, and introduce a few other
^7 http://www.kobo.com
Figure 4.6: Score histograms for (a) NIPS-10, (b) ICML-12, and (c) Kobo. Compared to Figure 3.3, the histograms for NIPS-10 and ICML-12 are altered according to the user categorization described in Section 4.6.3.
          User side information   Document side information   Shared parameters
SLM-I     X                       X
SLM-II    X                       X
LR                                X
PMF                                                           X
CTR                               X                           X
CSTM      X                       X                           X

Table 4.1: A comparison of the modelling capabilities of each model. "Shared parameters" stands for models that share information between users and/or items (in other words, those which use some form of CF).
ad-hoc models for the task. Specifically, each competing model has particular characteristics (Table 4.1)
which will help in understanding CSTM’s performance.
Note that we use topic representations of documents for competing models that use side information.
Such representations were learned offline using a correlated topic model [13]. We re-use some of our
previous notation to describe these models. Namely au and dd are K-length vectors which designate a
user’s and a document’s (topic) representation respectively.
Constant: The constant model predicts the average observed scores for all missing preferences.
Comparison to this baseline is useful to evaluate the value of learning.
Supervised language model I (SLM-I): This model is a supervised version of the LM: s_{rp} := (a_u^T \theta_A)(d_d^T \theta_D)^T, where the parameters \theta_A, \theta_D are K \times F matrices. F is a hyper-parameter determined using a validation set (it ranges from 5 to 30 in our experiments).
Supervised language model II (SLM-II): This model uses isotonic regression (see, for example, [10])
to calibrate the LM. The idea is to learn a regression model that satisfies the implicit ranking established
by the LM:
\text{minimize}_{\hat{s}_r, \forall r} \; \sum_{(r,p) \in S_o} (s_{rp} - \hat{s}_{rp})^2
\text{subject to } \hat{s}_{rp} \le \hat{s}_{r(p+1)}, \; \forall p,
where the constraints enforce a user-specific document ordering specified by the output of the LM. Once learned, the set of regression parameters \{\hat{s}\} are used as the model's predictions. To obtain predictions for an unobserved document, we have found that taking the average of the (predicted) scores of the two (observed) documents ranked, according to the LM, directly above and below the new document works well. The regression is user-specific and therefore cannot be used for users with no observed preferences. For such users we simply re-use the learned parameters of their closest user. We leave further research into more principled approaches, for example a collaborative one, to future work.
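The isotonic regression underlying SLM-II can be solved exactly with the pool-adjacent-violators algorithm (PAVA). The sketch below is a generic PAVA, not the thesis's implementation: it fits the least-squares nondecreasing sequence to scores placed in the LM's ranking order:

```python
def isotonic_fit(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit, in index order."""
    # Each block holds [mean, weight]; violating adjacent blocks are merged.
    blocks = []
    for v in y:
        blocks.append([v, 1.0])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    out = []
    for m, w in blocks:
        out.extend([m] * int(w))
    return out

# Scores ordered by the LM's ranking; the fit enforces the monotone constraint.
fit = isotonic_fit([1.0, 3.0, 2.0, 4.0])
assert fit == [1.0, 2.5, 2.5, 4.0]
```

The violating pair (3.0, 2.0) is pooled to its mean, yielding the closest sequence (in squared error) that respects the LM's ordering.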
LR: The linear regression model was introduced in Section 3.2.2. To recapitulate, it is a user-specific regression model where predictions are given by s_{rp} = \gamma_u^T d_d.
PMF [99]: PMF was introduced in Section 2.2.1. We remind the reader that it is a state-of-the-art
collaborative filtering approach and that it does not model the side information. The size of the latent
space is determined using a validation set (range from 1 to 30).
Collaborative topic regression (CTR): CTR is matrix factorization with the document-content model introduced in Wang and Blei [126]. CTR was briefly reviewed in Section 4.5. We use a slightly different version than the one introduced by its authors: namely, we have replaced LDA by CTM. Also, since in our application all user-item scores are given, we use a single variance value over scores (\sigma_s).
For SLM-I, LR, and PMF learning is performed using a variational approximation with a Gaussian
likelihood model and zero-mean Gaussian priors over the model’s parameters. The prior variances are
determined using a validation set.
Finally, we investigated a few other models. Of note: instead of modelling user libraries as side
information we used the documents of user libraries as observed highly-scored items. We experimented
with various scoring schemes but none led to consistent improvements over the baselines described above. We also experimented with replacing directed topic models with a supervised extension of an undirected topic model [97]. However, these methods did not perform well and are not discussed further.
4.6.3 Results
To run CSTM on the above datasets we first concatenated each user's library (for example a reviewer's previously published papers) into a single document. The content of the resulting document can then be used as that user's side information (w^a_r) in CSTM. To get user and item topic proportions we learned a CTM topic model [13] using the content of the items and then projected user documents into that space to obtain user topic proportions. We directly used these topics in SLM-I, SLM-II, and LR. We also used these topics as initialization in the models which jointly learn topics and scores (CSTM and CTR). In all experiments we use 30 topics.
For training we create 5 folds from the available scores. Each fold is split into 80 percent observed
and 20 percent test data. We used the first fold to determine the hyper-parameters of the model. We
report the average results over the five folds as well as the variance of this estimator.
We want to evaluate the performance of CSTM in settings where some users have no observed scores.
The cold-start setting is of particular practical importance and one that should enable a good model
to leverage the user’s side information. Accordingly, in our datasets we randomly selected one fourth
of all users and removed all of their observed scores for training but kept their test scores (and their
side information remains available at training). Further, for NIPS-10 and Kobo, whose users have a
more uniform number of ratings, we binned the remaining users (three quarters) uniformly into three
categories. For NIPS-10, users in each category had 15, 30 and 55 observed scores respectively. In each of
the three categories 5 ratings per user were kept for validation. For Kobo users in the first two categories
had 8 and 10 scores while the scores of users in the last category were left untouched (5 scores per user
            NIPS-10            ICML-12            Kobo
Constant    0.4378 ± 2×10^-3   0.6386 ± 4×10^-5   0.6882 ± 5×10^-4
SLM-I       0.4684 ± 2×10^-3   0.7903 ± 4×10^-5   0.6873 ± 1×10^-3
SLM-II      0.4696 ± 3×10^-4   0.7752 ± 1×10^-4   0.6926 ± 6×10^-4
CSTM        0.4846 ± 1×10^-3   0.8096 ± 1×10^-4   0.7243 ± 2×10^-4

Table 4.2: Comparisons between CSTM and competitors for cold-start users using NDCG@5. We report the mean NDCG@5 value and the variance over the five training folds.
were kept for validation). For ICML-12 since users are already naturally distributed into categories, we
split the observed data into 25 percent validation and 75 percent train.
For the next two experiments, for each dataset, we train each model on all of the data but we divide
our discussion into two parts. First we discuss cold-start users, and then we examine the (other) user
categories.
Cold-Start Data Regime
We first report results for the completely cold-start data regime. As a reminder, this simulates new users entering the system with their side information: user and item side information is available but scores are not. For cold-start users it is difficult to calibrate the output of the model to the correct score range since only the users' side information is available. The expectation is rather that the models can use the side information to get a better understanding of users' preferences and discriminate between items of interest. Accordingly we report results using Normalized DCG (NDCG), a well-established ranking measure (see Equation 2.9), where a value of 1 indicates a perfect ranking and 0 a reverse-ordered perfect ranking [57]. NDCG@T considers exclusively the top T items. Table 4.2
reports results for the three datasets using NDCG@5 (note that other values of NDCG gave similar
results). We can only report results for the methods that have the ability to predict scores for cold-start
users: PMF, LR, and CTR do not use any user side information and hence do not have that ability.
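For reference, NDCG@T can be computed as below. This sketch uses raw scores as gains with a logarithmic position discount, which is one common variant (the thesis's Equation 2.9 may differ in its exact gain function):

```python
import math

def dcg_at_t(gains, t):
    """Discounted cumulative gain over the top-t positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:t]))

def ndcg_at_t(true_scores, predicted_scores, t=5):
    """NDCG@T: rank items by predicted score, use the true score as gain."""
    order = sorted(range(len(true_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_scores[i] for i in order]
    ideal = sorted(true_scores, reverse=True)   # best achievable ordering
    return dcg_at_t(ranked, t) / dcg_at_t(ideal, t)

# A prediction that ranks the items perfectly scores 1.0:
assert ndcg_at_t([3, 1, 2], [0.9, 0.1, 0.5], t=3) == 1.0
```

Because the measure only depends on the induced ordering, it suits cold-start evaluation, where score calibration is unreliable but the ranking of items can still be good.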
In this challenging setting CSTM significantly outperforms the other methods. Further, we see that methods using side information typically outperform the constant baseline. This demonstrates that useful information about user preferences can be extracted from the user libraries; the good performance of CSTM in this setting shows that the model is able to leverage that information.
Warm-start data regimes
The goal of CSTM is to perform well across different data regimes. In the previous section we examined
the performance of several methods on cold start users; we now focus on users with observed scores. For
each dataset we report the performance of the various methods for each user category. For ICML-12 we
separated users into roughly equal-sized bins according to their number of observed scores. Results for the three datasets are provided in Figure 4.7. First we note that as the number of observed scores increases, the performance of the different methods also increases. CSTM outperforms all other methods in lower data regimes. On users with more observed scores CSTM is competitive with both CTR and LR.
We note that overall in this task, and even when many observed preferences are available, PMF is
not competitive with most of the methods that have access to the side information. This highlights
Figure 4.7: Test RMSE of the different methods on (a) NIPS-10, (b) ICML-12, and (c) Kobo. For each dataset, we report results for the three subsets of users with different numbers of observed scores (categories). Figures are better seen in color; the ordering in the legend (Constant, SLM-I, SLM-II, PMF, LR, CTR, CSTM) corresponds to the ordering of the bars in each group.
Figure 4.8: Average relative norm |\theta| / |\gamma_0 + \gamma| for users with varying numbers of observed scores (left: NIPS-10; right: ICML-12).
the value of content side information on both user and item sides. This is further made clear by the
relatively strong performance of both SLM-I and SLM-II.
Overall, user libraries do not seem to help as much on the Kobo dataset. There are several explanations for this. First, in Kobo the distribution over scores is heavily skewed toward high scores, so a constant baseline does quite well. Further, bag-of-words representations are particularly well suited to academic papers, where the presence (or absence) of specific words is a very good indication of a document's field and hence its targeted audience. In (non-technical) books, however, user preferences also depend on other aspects, such as the document's prose, which are harder to capture in a bag-of-words topic model.
Tradeoff of side information with observed scores
In Section 4.4 we motivated the specific parametrization of CSTM by its ability to trade off the influence
of the user library side information versus that of the user-item scores. Here we show that learning in
our model performs as expected. Figure 4.8 reports the relative norm of the compatibility parameters
\theta versus the (shared and individual) user parameters (\gamma_0 + \gamma) as a function of the number of observed scores: |\theta| / |\gamma_0 + \gamma|. As hypothesized, as the number of observed scores increases, the relative weight of the user-library side information decreases.
Results on original ICML-12 dataset
For completeness we also report results on the original, unmodified version of ICML-12 (that is, a version without the artificial 25 percent cold-start users; see Section 4.6.3). Accordingly, users are now binned differently. In Figure 4.9 and Table 4.3 we provide comparisons of the different methods on the original version of ICML-12. We highlight that these results further display the good performance of CSTM in all data regimes. As in the previous experiments with ICML-12, CSTM is only outperformed by CTR for reviewers with many training scores.
Figure 4.9: RMSE results on the unmodified ICML-12 dataset.
         ICML-12
Constant 0.8950 ± 5×10⁻⁴
SLM-I    0.9301 ± 2×10⁻⁴
SLM-II   0.9308 ± 4×10⁻⁴
CSTM     0.9409 ± 3×10⁻⁴

Table 4.3: For the unmodified ICML-12 dataset, comparisons between CSTM and competitors for cold-start users using NDCG@5.
Variations of CSTM
We also experimented with variations of CSTM to better understand the roles played by the different
aspects of the model and its training.
CSTM fixed topics (CSTM-FT): This model uses the exact preference regression model used by CSTM, but with fixed user-topic and document-topic representations; that is, it predicts preferences with r_ud = (a_u ⊗ d_d)θ^T + d_d(γ_0 + γ)^T, where a_u and d_d are previously learned offline.
CSTM no user side information (CSTM-NUSI): To evaluate the gain from using user side information we experimented with a version of our model that does not model user side information (i.e., as if a user did not have any documents). Specifically, in this model a_r ≡ 0 for all users.
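As an illustrative sketch of the CSTM-FT predictor, the prediction r_ud = (a_u ⊗ d_d)θ^T + d_d(γ_0 + γ)^T can be computed directly with NumPy. All parameter values below are randomly generated stand-ins, not the learned values from our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                                 # number of topics (illustrative)
a_u = rng.dirichlet(np.ones(K))       # fixed user topic proportions (learned offline)
d_d = rng.dirichlet(np.ones(K))       # fixed document topic proportions
theta = rng.normal(size=K * K)        # global compatibility parameters θ
gamma0 = rng.normal(size=K)           # shared user parameters γ_0
gamma_u = rng.normal(size=K)          # individual user parameters γ

# r_ud = (a_u ⊗ d_d) θ^T + d_d (γ_0 + γ)^T
compat = np.outer(a_u, d_d).ravel() @ theta   # topic-compatibility term
bias = d_d @ (gamma0 + gamma_u)               # user-specific document term
r_ud = compat + bias
print(r_ud)
```

The prediction is thus a bilinear form in the fixed topic representations plus a per-user linear term over the document's topics.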
We provide some results comparing CSTM with its variations in Table 4.4. We notice that it is the synergy of the user side information and the joint training of the model that explains CSTM's performance.
           NIPS-10          ICML-12          ICML-12 (original)  Kobo
CSTM-NUSI  0.4941 ± 4×10⁻⁴  0.7765 ± 9×10⁻⁵  0.8048 ± 9×10⁻⁴     0.7997 ± 8×10⁻⁵
CSTM-FT    0.4984 ± 2×10⁻⁴  0.8036 ± 5×10⁻⁵  0.8066 ± 8×10⁻⁵     0.8026 ± 8×10⁻⁵
CSTM       0.5016 ± 2×10⁻⁴  0.8217 ± 2×10⁻⁵  0.8322 ± 2×10⁻⁴     0.8037 ± 2×10⁻⁵

Table 4.4: Comparisons between CSTM and two variations. Results report NDCG@5 over all users.
Poster Recommendations
We explore a different scenario which is meant to simulate what would happen when a model is deployed
in a complete recommender system, for example, to guide users to posters of interest in an academic
conference. Specifically, we evaluate the performance of CTR and CSTM as new users arrive into the
system and gradually provide information about themselves. We postulate that users first provide the
system with their library. Then users gradually express their preferences for certain (user-chosen) items.
We trained CTR and CSTM on all but 50 randomly-chosen ICML-12 users, restricting our attention
to users with at least 15 observed scores. We then simulated these users entering the system and evaluated
their individual impact. In this experiment we want to recommend a few top-ranked items to each user.
In terms of our recommendation framework in Figure 4.1, the recommendation of the personalized-best
items is our ultimate objective F (·). Therefore we evaluate the system’s performance using NDCG (of
held out data). Figure 4.10 presents the performance of CSTM and CTR as a function of the amount
of data available in the system. When a user first enters the system no data is available about him
(indicated by “0” in the figure). The methods revert to using a constant predictor which predicts the
mean of the previously observed scores across all users. Once a user provides a library (Lib.) we see that
CSTM’s performance increases very significantly. CTR cannot leverage that side information. Then once
users provide scores, the performance of both methods increases and the performance of CTR eventually
reaches the performance of CSTM.
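Since NDCG is the evaluation metric here, the following is a minimal sketch of NDCG@k with the standard linear-gain, logarithmic-discount formulation (the exact gain and discount choices used in our experiments may differ):

```python
import numpy as np

def ndcg_at_k(true_scores, predicted_scores, k=10):
    """Rank items by predicted score; discount the true gains by log2 of rank
    position; normalize by the DCG of the ideal (true-score) ranking."""
    order = np.argsort(predicted_scores)[::-1][:k]
    gains = np.asarray(true_scores, dtype=float)[order]
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = np.sum(gains * discounts)
    ideal = np.sort(np.asarray(true_scores, dtype=float))[::-1][:k]
    idcg = np.sum(ideal * discounts)
    return dcg / idcg if idcg > 0 else 0.0

# A perfect ranking achieves NDCG of 1; a poor one scores lower.
print(ndcg_at_k([3, 2, 0, 1], [3, 2, 0, 1], k=4))   # 1.0
print(ndcg_at_k([3, 0, 0, 0], [0, 1, 2, 3], k=4))   # < 1
```

The normalization makes scores comparable across users with different numbers of held-out items.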
Figure 4.10 demonstrates the advantage of having access to user side information, namely, the system
can quickly give good recommendations to new users. Further, in absolute terms the system performs
relatively well without having access to any scores. It is also interesting to note that, in this experiment, as far as NDCG goes, the performance of CSTM improves only modestly as the number of observed scores increases. This may be a consequence of our fairly primitive online learning procedure. As far as modelling
goes this experiment is also a demonstration that our model of user libraries is effective at extracting
features (ar for all users) indicative of preferences and that the regression model (Equation 4.10) then
successfully combines the user and item side information.
4.7 Conclusion and Future Opportunities
We have introduced a novel graphical model to leverage user libraries for preference prediction tasks.
We showed experimentally that CSTM overall outperforms competing methods and can leverage the
information of other users and of user libraries to perform particularly well in cold-start regimes. We also
explored a paper recommendation task and demonstrated the positive impact on the recommendation
quality of having access to user libraries.
Overall, we showed that using the content of items outperforms state-of-the-art (content-less) collaborative filtering methods in both cold- and warm-start regimes. Furthermore, we have shown that using user-item side information is also a win in many cases. Finally, user-item side information is essential to quickly provide good recommendations to new users.
Future work offers both immediate and longer-term possibilities. In the near term, we could refine the inference procedure used to train our model, for example by using a fully variational approach and by leveraging the latest inference procedures for non-conjugate models such as CTM [127].
An important question is what is missing from CSTM before it can be used in real life. Of practical
importance is a model’s scalability properties. Currently, with respect to the datasets used in this
Figure 4.10: Comparison of CSTM's and CTR's NDCG@10 performance on new users as a function of the amount of data provided by users. The x-axis denotes what user data is available (Lib. stands for user libraries while integer values denote the number of available user scores). Without any user data (labelled 0 on the x-axis) both methods revert to a constant predictor.
chapter, CSTM could be trained on larger datasets but is still far from being able to learn from datasets
that may be common in industry. To give an indication of training time, the current implementation of
our model, written in Matlab, can be fitted to our datasets within a few hours on a modern computer.
Currently, our inference procedure, at each iteration, must iterate through all users and all items.
One interesting avenue to improve scaling (in terms of number of users and items) is to use stochastic
variational inference [47, 23]. The basic intuition behind the stochastic optimization methods used in
machine learning, for example stochastic gradient descent, is to allow multiple parameter updates per
training pass. This stands in contrast to Algorithm 1 where we update the model parameters after
having examined all observed preferences. Stochastic optimization techniques are very commonly used
to train (collaborative-filtering) models on large datasets (for example, [99]). For our purposes, we could
imagine sampling n users, along with the documents that these users have rated, per iteration and using the sufficient statistics from these n users to update the global parameters. The training time per iteration would therefore be reduced by at least a factor of R/n (where R is the total number of users). It is still difficult to evaluate the size of the datasets that CSTM could then be fitted to, since the number of iterations can only be determined empirically.
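The subsampling idea can be sketched on a simplified stand-in model (plain matrix factorization with synthetic scores rather than CSTM itself; sizes, learning rate, and minibatch size below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
R, D, K = 100, 50, 5                     # users, documents, latent dims (illustrative)
U = rng.normal(scale=0.1, size=(R, K))   # per-user parameters
V = rng.normal(scale=0.1, size=(D, K))   # shared ("global") document parameters
S = rng.integers(0, 4, size=(R, D)).astype(float)   # synthetic 0-3 scores

def rmse():
    return np.sqrt(np.mean((S - U @ V.T) ** 2))

rmse_before = rmse()
n, lr = 10, 0.1                          # minibatch of n users per iteration
for it in range(300):
    users = rng.choice(R, size=n, replace=False)    # sample n of the R users
    for r in users:
        err = S[r] - U[r] @ V.T                     # residuals on this user's items
        grad_U = -err @ V / D                       # gradient wrt the user's factors
        grad_V = -err[:, None] * U[r][None, :] / D  # gradient wrt global factors
        U[r] -= lr * grad_U                         # local (per-user) update
        V -= lr * grad_V                            # global update from the minibatch
print(rmse_before, rmse())
```

Each pass touches only n of the R users, yet still moves the shared parameters, which is the source of the per-iteration speedup discussed above.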
Related to the above is the question of using CSTM as the main model inside of TPMS (for example
in place of the LM or of the model depicted in Equation 3.5). CSTM is designed for this exact domain and, based on the experiments of this chapter, it could improve the performance of the system. Before
changing the model used in TPMS we will need to evaluate CSTM against the model of Equation 3.5.
Such comparisons, since CSTM is novel, do not yet exist. Furthermore, there are practical aspects which
do not favour the inclusion of CSTM. First and foremost, many conferences only rely on user and item
content and never elicit expertise from reviewers. Furthermore, conference organizers are often working
under stringent time constraints which may favour the use of simpler models. For example, the LM is
fast to fit and we have found good settings of its hyper-parameters that work well across all conferences.
On the other hand both CSTM and the model of Equation 3.5 require tuning of the hyper-parameters
which can be time-consuming. One further aspect to consider is how much the empirical gains of CSTM
translate to perceivable differences in reviewer scores and ultimately in reviewer assignments. As an
indication, the next chapter shows that, on our datasets, there is a strong correlation between prediction
performance and matching performance. Thus given our current experiments, when time permits, CSTM
seems like a very good candidate for inclusion into TPMS.
A second aspect of practical importance is that once we move to online recommendation, models
must also be able to adapt to new data, including novel items and users, updates to user libraries, and
new user-item scores. In the poster recommendations experiment we have seen that a simple conditional
inference method works relatively well for novel users. However, one would also like to use the information
from novel users to learn better representations of all users. In other words, we would need a mechanism
which updates model parameters once a sufficient amount of new data is available. Furthermore, we
could refine such a method inter alia to allow the system to adapt to the evolving preferences of users
over time. For example, Agarwal and Chen [1] propose a decaying mechanism to emphasize more recent
scores over older ones. A similar mechanism could be used to weight the different documents in a user's library (for example based on date of publication for research papers or purchase date for books).
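One simple instantiation of such a weighting is an exponential half-life decay (an illustrative choice, not the specific scheme of Agarwal and Chen [1]):

```python
import numpy as np

def recency_weights(ages, half_life):
    """Exponential-decay weights: an item `half_life` time units old receives
    half the weight of a brand-new one."""
    return 0.5 ** (np.asarray(ages, dtype=float) / half_life)

# Weight a user's library documents by years since publication.
print(recency_weights([0, 2, 4, 8], half_life=2))   # [1.0, 0.5, 0.25, 0.0625]
```

The half-life is the single knob: shorter half-lives make the model track recent interests more aggressively.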
There is also the question of other potential applications for which CSTM could be useful. In addition
to modelling text, topic models have also been shown to model images [33]. CSTM could then be used
as an image recommendation tool (for example to photographers). In that case, much like for the
books of the Kobo dataset, it remains to be seen whether topic models can capture features of images
which are indicative of preferences. Another application for CSTM is modelling legislators' interests: legislators, much like academic reviewers, write documents and express their preferences about proposed laws. Furthermore, there are novel aspects in this domain, such as the effect of party lines, which for certain bills may create voting correlations that have to be taken into account to correctly determine the legislators' true preferences.
Finally, the approach behind CSTM, which allows it to smoothly interpolate between side information and preferences, could be refined. In its current form, if some user documents are not useful for predicting preferences the model must either: a) adjust the global compatibility parameters θ, thereby changing the predictions for all users; b) increase the user's individual parameters γ_r while keeping predicted preferences calibrated to observed ones; c) change the representation of user libraries, thereby changing the topic model; or d) use a combination of the above. A more pleasing solution would allow the model to independently adjust the importance of user libraries, and perhaps even of individual documents within a library. Overall, interpolating from side-information-based recommendations to preference-based ones requires additional attention.
Chapter 5
Learning and Matching in the
Constrained Recommendation
Framework
In the majority of recommender systems the predicted preferences enable the recommendation stage.
The recommendation step, the ultimate step in our constrained-recommendation framework, can assume
various forms. In particular it may include several different constraints and objectives in addition to
the initial preference prediction objective (the loss function associated with the learning method). This
chapter is devoted to studying this last step and the interactions between the learning and recommen-
dation stages. Motivated by the reviewer-paper assignment problem, we also develop and explore an
approach to constrained recommendation consisting of a matching between reviewers and papers.
We begin by detailing the design choices of our proposed recommendation framework. We then look
at an instance of the framework for matching reviewers to papers. We especially focus on the development of the matching stage and present several experimental results on matching problems. The experiments
aim to compare different learning models on the matching task, and also explore different matching
formulations motivated by real-life requirements. Finally, we demonstrate synergies between matching
and learning.
5.1 Learning and Recommendations
For readability purposes we reproduce a constrained-recommendation framework in Figure 5.1. We
remind the reader of the three stages of the framework: 1) elicitation of user information; 2) prefer-
ence prediction using the previously elicited preferences and side information; and 3) determination of
recommendations using domain-specific considerations (objective and constraints).
Separating the preference prediction stage from the recommendation stage implies the use of two separate objective functions. Ideally we would use a single objective which encompasses both the ideals of the learning objective (e.g., accurate preference predictions) and the ideals of the recommendation objective (e.g., learn a model which will give good recommendations). In practice this is usually impossible for two reasons: 1) The optimization stage can involve solving a complex, for example a
Chapter 5. Learning and Matching in the Constrained Recommendation Framework 64
Figure 5.1: Flow chart depicting a constrained recommender system.
non-continuous, optimization problem with constraints which we cannot express as a learning objective; 2) learning the parameters of a predictive model is likely the most expensive of the three stages; once learned, the model is all the more useful if its predictions can be reused in different optimization problems. The reviewer-to-paper matching system of Chapter 3 is a good example of the latter, as conference organizers will often use TPMS scores multiple times while exploring their preferences over the various matching constraints.
Further, in machine learning it is sometimes empirically advantageous to learn a model using a simpler loss function than the one required by the task. A good example, which we reviewed in
Section 2.2.2, occurs in the field of learning to rank for recommender systems, a recommendation task
requiring a straightforward sorting procedure as its last stage. Certain state-of-the-art methods often
split the learning into two different steps (akin to the two stages of our framework): in the first step they
learn a standard preference prediction model the output of which is then used to train a model using a
(domain specific) ranking loss [4].
The separation of the prediction and optimization stages does not imply that the two stages cannot
work in synergy when possible. Specifically, defining a loss function that is sensitive to the final objective
may provide performance gains. In fact we explore some of these synergies for the matching problem in
Section 5.4.4.1
5.2 Matching Instantiation
Motivated by the reviewer to paper matching problem we explore different formulations of an assignment
or matching problem to optimally assign papers to reviewers given some constraints. In other words,
as shown in Figure 5.2, the third stage of our constrained recommendation framework is a matching
problem.
Concretely, we frame the assignment problem as an integer program [122], and explore several varia-
tions that reflect different desiderata, and how these interact with various learning methods. We test our
framework on two data sets collected from a large AI conference, measuring predictive accuracy with
respect to both reviewer suitability score and matching performance, and exploring several different
matching objectives and how they can be traded off against one another.
1Synergies between the elicitation and the optimization steps, which we show to be even more important experimentally, will be explored in Chapter 6.
Figure 5.2: Match-constrained recommendation framework.
Although we focus on reviewer matching, our methods are applicable to any constrained matching
domain where: (a) preferences can be used to improve matching quality; (b) it is infeasible or undesirable
for users to express preferences over all items; and (c) capacity or other constraints limit the min/max
number of users-per-item (or vice versa). Examples include facility location, school/college admissions,
certain forms of scheduling and time-tabling, and many others.
We reuse the notation established in previous chapters. As a reminder, we denote the matrix of observed scores, or reviewer-paper suitabilities, by S^o. Further, we denote the observed scores for a particular reviewer r and paper p by S^o_r and S^o_p, respectively. S^u, S^u_r, and S^u_p are the analogous collections of unobserved scores.
Given this information, our goal is to find a “good” matching of papers to reviewers in the presence
of incomplete information about reviewer suitabilities, possibly exploiting the side information available.
5.2.1 Matching Objectives
We articulate several different criteria that may influence the definition of a “good” matching and explore
different formulations of the optimization problem that can be used to accommodate these criteria. We
also discuss how these criteria may interact with our learning methods.
Naturally, one would like to assign submitted papers to their most suitable reviewers; of course, this
is almost never possible since some reviewers will be well suited to far more papers than other reviewers.
In general, load balancing is enforced by placing an upper limit or maximum on the number of papers
per reviewer. Similarly, we may impose a minimum to ensure reasonable load equity or load fairness
across reviewers. However, limiting the paper load increases the probability that certain papers will be
assigned to very unsuitable reviewers. This suggests only making assignments involving pairs with score s_rp above some minimum score threshold. This ensures that every paper is reviewed by a minimally
suitable reviewer, but may sacrifice load equity (indeed, it may sacrifice feasibility). One may also desire
suitability fairness across reviewers; that is, reviewers should have similar score distributions over their
assigned papers (so on average no reviewer is assigned to significantly more papers for which he is poorly
suited than any other reviewer). Finally, when multiple reviewers are assigned to papers, it may be
desirable to assign complementary reviewers to a paper so as to cover the range of topics spanned by a
submission. Related is the desire to ensure each paper is reviewed by at least one “well-suited” reviewer.
The intricacies of different conferences prevent us from establishing an exhaustive list of matching
desiderata (see [8, 42, 37] for further discussion). We now explore matching mechanisms that will account
for several of these criteria: we frame the matching procedure as an optimization problem and show how
several properties can be formulated as constraints or modifications of the objective function.
We formulate the basic matching problem as an integer program (IP), where each paper is assigned to its best-suited reviewers given the constraints [122]:

maximize   J_basic(Y, S) = ∑_r ∑_p s_rp y_rp     (5.1)
subject to y_rp ∈ {0, 1},  ∀r, p                 (5.2)
           ∑_r y_rp = R_target,  ∀p              (5.3)
The binary variable y_rp encodes the matching of item p to user r; a match is an instantiation of these variables. Y is the match matrix Y = {y_rp}_{r∈R, p∈P} and, similarly, S is the score matrix S = {s_rp}_{r∈R, p∈P}. J_basic(Y, S) denotes the value of the objective of the IP with match matrix Y and score matrix S. R_target is the desired number of reviewers per paper. Minimum and maximum reviewer loads, P_min and P_max respectively, can be incorporated as constraints [122]:
∑_p y_rp ≥ P_min,   ∑_p y_rp ≤ P_max,   ∀r.     (5.4)
This IP, including constraints (5.4), is our basic formulation (Basic IP). Its solution, the optimal match, maximizes total reviewer suitability given the constraints. Although IPs can be computationally difficult, our constraint matrix is totally unimodular, so the linear program (LP) relaxation (allowing y_rp ∈ [0, 1]) does not affect the integrality of the optimal solution; hence the problem can be solved as an LP. This can be understood as follows. The constraints define a feasible set: Ay ≤ b. Each constraint is linear, thus the feasible set is a polyhedron. Since the constraint matrix A is totally unimodular [122] and b is an integer vector, the vertices of that polyhedron have integer coordinates. An LP's objective function is also linear and therefore its optimum corresponds to one, or more, of the polyhedron's vertices.
Although not mentioned above, it is essential for the matching to prevent assignments of reviewers to submitted papers for which they have conflicts of interest (COI). The above formulation can easily enforce known COIs by directly constraining the conflicting assignments' y_rp variables to be 0; alternatively, we can set the relevant scores s_rp to −∞.
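The Basic IP, its load constraints, and score-based COI handling can be sketched via the LP relaxation using SciPy (the instance is a toy one, and −10⁶ stands in for −∞):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
R, P = 3, 4                          # reviewers, papers (toy instance)
Rtarget, Pmin, Pmax = 1, 1, 2
s = rng.uniform(0, 3, size=(R, P))   # suitability scores s_rp
s[0, 0] = -1e6                       # conflict of interest: forbid reviewer 0 / paper 0

c = -s.ravel()                       # linprog minimizes, so negate suitabilities
# Equality constraints: each paper gets exactly Rtarget reviewers (∑_r y_rp = Rtarget).
A_eq = np.zeros((P, R * P))
for p in range(P):
    A_eq[p, p::P] = 1.0              # variables flattened row-major: y_rp at index r*P + p
b_eq = np.full(P, float(Rtarget))
# Load constraints: Pmin <= ∑_p y_rp <= Pmax for every reviewer.
A_load = np.zeros((R, R * P))
for r in range(R):
    A_load[r, r * P:(r + 1) * P] = 1.0
A_ub = np.vstack([A_load, -A_load])
b_ub = np.concatenate([np.full(R, float(Pmax)), np.full(R, -float(Pmin))])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, 1), method="highs")
Y = res.x.reshape(R, P)
print(np.round(Y))
```

Because the constraint matrix is totally unimodular, the vertex solution returned by the LP solver is integral up to solver tolerance; no branch-and-bound is needed.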
To capture additional matching desiderata, we can modify the objective or the constraints of this IP.
Load balancing can be controlled by manipulating Pmin and Pmax: a small range ensures each reviewer
is assigned to roughly the same number of papers at the expense of match quality, while a larger range
does the converse. We can instead enforce load equity by making the tradeoff explicit in the objective
with “soft constraints” on load:
J_balance(Y, S) = ∑_r ∑_p s_rp y_rp − ∑_r λ f((∑_p y_rp) − ȳ)     (5.5)

where ȳ is the average number of papers per reviewer (M/N) and f is a penalty function (for example, f(x) = |x| or f(x) = x²). The parameter λ controls the tradeoff between load equity and match quality.
The J_balance objective (Equation 5.5), along with the constraints expressed in Equation 5.2, comprises our Balance IP. Note that if f(x) is nonlinear, then Balance IP becomes a nonlinear optimization problem.
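A small sketch evaluating the soft-constrained objective of Equation 5.5 for candidate matches (toy matrices; the choices of f and λ are illustrative):

```python
import numpy as np

def j_balance(Y, S, lam, f=np.square):
    """Total suitability minus a per-reviewer penalty f on the deviation
    from the average load (Equation 5.5; with one reviewer per paper the
    average load is M/N)."""
    loads = Y.sum(axis=1)               # papers per reviewer
    avg_load = Y.sum() / Y.shape[0]     # average number of papers per reviewer
    return (S * Y).sum() - lam * f(loads - avg_load).sum()

S = np.array([[3.0, 2.0], [1.0, 0.0]])  # suitability scores (2 reviewers x 2 papers)
even = np.array([[1, 0], [0, 1]])       # one paper per reviewer
skew = np.array([[1, 1], [0, 0]])       # both papers to reviewer 0
print(j_balance(even, S, lam=2.0), j_balance(skew, S, lam=2.0))  # 3.0 1.0
```

With λ = 2 the balanced match is preferred even though the skewed match has higher raw suitability (5 versus 3), which is exactly the tradeoff the soft constraint encodes.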
The J_basic objective (Equation 5.1) maximizes the overall suitability of the assignments, equating "utility" with suitability. However, the utility of a specific match y_rp may not be linear in the suitability s_rp. For example, utility may be more "binary": as long as a paper is assigned to a reviewer whose suitability is above a certain threshold, the assignment is good; otherwise it is not. This can be realized by applying some non-linear transformation g to the scores in the matching objective (for example, a matched pair with score s_rp ∈ {2, 3} may be greatly preferred to one with s_rp ∈ {0, 1}):
J_transformed(Y, S) = ∑_r ∑_p g(s_rp) y_rp.     (5.6)
In this transformed objective J_transformed, if g is a logistic function then scores are softly "binarized." Note that g(s_rp) can be evaluated offline and therefore J_transformed can still be used as the objective of an IP.
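A sketch of one such g, a logistic squashing of the 0-3 scores (the threshold and steepness values are illustrative choices):

```python
import numpy as np

def g_logistic(s, threshold=1.5, steepness=4.0):
    """Softly binarize 0-3 scores: pairs above `threshold` map near 1,
    pairs below map near 0 (threshold/steepness are illustrative)."""
    return 1.0 / (1.0 + np.exp(-steepness * (np.asarray(s, dtype=float) - threshold)))

print(g_logistic([0, 1, 2, 3]))  # roughly [0.002, 0.12, 0.88, 0.998]
```

Scores of 2 and 3 become nearly interchangeable, as do 0 and 1, realizing the "binary utility" intuition above.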
Finally, we note that some of these matching objectives can also be incorporated into the suitability
prediction model. For example, the nonlinear transformation g can be directly used in the learning objective (for example, instead of vanilla RMSE):

C_LR-TFM(S^o) = (1/|S^o|) ∑_{s_rp ∈ S^o} (ŝ_rp − g(s_rp))²,     (5.7)

where ŝ_rp denotes the model's predicted score.
5.3 Related Work on Matching Expert Users to Items
We briefly reviewed some of the connections between our work and expertise retrieval in Section 3.3.1.
Here we broaden our horizon and discuss research that deals explicitly with matching expert users with
suitable items, including research from the expertise retrieval field.
Stern et al. [115] propose to use a hybrid recommender system to assign experts to specific tasks. The
tasks are (hard) combinatorial problems and the set of experts comprises different algorithms that may
be able to solve these tasks. The novelty of this approach is that new unseen tasks appear all the time
and so side information—features of the tasks represented by certain properties of the combinatorial
problems—must be used to relate a new task to other tasks and their experts. A Bayesian model is used
to combine collaborative filtering and content-based recommendations [114].
Matching reviewers to papers in scientific conferences has also received some attention [8, 42, 37].
Recently Conry et al. [29], showed how this domain could benefit from using a collaborative filtering
approach. This approach is conceptually similar to the one used by TPMS in Chapter 3. As a reminder,
imagine that reviewers have assessed their expertise (or preferences) for a subset of papers. CF can be
used to fill in the missing scores (or ratings). Once all scores are known the objective is then to assign
the best reviewers to each paper. Doing so without further constraints could create large imbalances between reviewers with greater expertise, say more senior reviewers, and those with less expertise. One solution is to place constraints on the maximum number of papers reviewers can be assigned to. Likewise,
papers must be reviewed by a minimum number of reviewers. The final objective function that must be
optimized has constraints across users (reviewers) and items (papers). Ideally, the collaborative filtering
would concentrate on accurately predicting scores corresponding to reviewer-paper pairs that will end
up being matched. This is not easy as the matching objective is typically not continuous. Conry et al.
[29] have studied this problem without integrating both steps.
Rodriguez and Bollen [91] have built co-authorship graphs using the references within submissions in
order to suggest initial reviewers. Karimzadehgan et al. [60] argue that reviewers assigned to a submission should cover all its different aspects. They introduce several methods, including ones based on modelling papers and reviewers using topic models, to attain good (topic) coverage. Our CSTM model (Chapter 4) is similar in the sense that it models reviewers and papers
coverage. Our CSTM model (Chapter 4) is similar in the sense that it models reviewers and papers
using a topic model. However, in CSTM reviewer preferences are determined according to the global
suitability of the reviewer, and not explicitly according to his coverage with respect to a paper’s topics.
In follow-up work Karimzadehgan and Zhai [59] and Tang et al. [118] explore similar coverage ideas and
show how they can be incorporated as part of a matching optimization problem (akin to the one we
present in Section 5.2.1).
A second body of work focuses exclusively on the matching problem itself. Benferhat and Lang [8],
Goldsmith and Sloan [42], and Garg et al. [37] discuss various optimization criteria, and some of the
practices used by program chairs and existing conference management software. Taylor [122] shows how
these criteria can be formulated as an IP. Tang et al. [117] propose several extensions to the IP. This
work assumes reviewer suitability for each paper is known, and deals exclusively with specific matching
criteria.
5.4 Empirical Results
We start by describing the data sets used in our experiments. The rest of the section is divided into
three parts. The first considers score predictions with the different learning models. The second turns
to matching quality and explores the soft constraints on the number of papers matched per reviewer.
Finally, the third part evaluates a transformation of the matching objective and shows how using a
transformed learning objective can enhance performance on the transformed matching problem.
5.4.1 Data
Experiments are run using the NIPS-10 data, described in previous chapters, and the NIPS-09 dataset,
from the 2009 edition of the NIPS conference.2 As before, side information for each reviewer comprises a self-selected set of papers representative of his or her areas of expertise; these were summarized as word-count vectors w^a_r. Side information about submitted papers consisted of document word counts w^d_p for each p. The total vocabulary used by submissions (across both sets) contained over 21,000 words; here we used only the top 1000 words, as ranked using TF-IDF, for our experiments (|w_p| = |w_r| = 1000).
Reviewer suitability scores ranged from 0 to 3: 0 means "paper lies outside my expertise"; 1 means "can review if necessary"; 2 means "qualified to review"; and 3 means "very qualified to review." As discussed above, these scores are intended to reflect reviewer expertise, not desire. We focus on the area
chair (or meta-reviewer) assignment problem, where the matching task is to assign a single area chair
to each paper. We use the term reviewer below to refer to such area chairs.
NIPS-09 comprises 1079 submitted papers and 30 area chairs. Contrary to the procedure followed for NIPS-10 (and ICML-12), reviewer scores were not elicited, but instead provided by the conference program chairs for every reviewer-submission pair. The mean suitability score was 0.19 (std. dev. 0.57).
A histogram of the scores for each dataset is shown in Figure 5.3.
2See http://nips.cc
Figure 5.3: Observed scores for the two datasets ((a) NIPS-10; (b) NIPS-09). The figure for NIPS-10 is reproduced here (from Section 3.3) in order to highlight differences in score distribution with NIPS-09.
5.4.2 Suitability Prediction Experimental Methodology
We first describe the methodology used to train and test the score prediction models. We do not report
suitability prediction results since Chapter 4 has already highlighted this stage of the framework.
Suitability methods: We reuse some of the models compared in Chapter 4. Namely, we use a method
which only uses side information, (word-)LM (see LM in Section 3.2.1 for details); a pure collaborative
filtering method, BPMF, a Bayesian extension of PMF already reviewed in Section 2.2.1; and LR, a
reviewer-specific linear regression model (see Section 3.2.2). The goal here is not to compare the merits
of the different approaches on a score prediction task but rather to compare their matching performance.
For learning, we are given a set of training instances, S^tr ≡ S^o. We split this set into a training and a validation set. The trained model predicts all unobserved scores S^u. Since we do not have true suitability values for all unobserved scores, we distinguish S^u as being the union of the test instances S^te (for which we have scores in the data set) and the missing instances S^m. LR is trained using a regularized squared loss (or, equivalently, by assuming a Gaussian likelihood model and Gaussian priors over the parameters). We denote a model's estimates of the test instances by Ŝ^te.
We use 5 different splits of the data in all experiments. In each split, the data is divided into training,
validation and test sets in 60/20/20 proportions. There is no overlap in the test sets across the 5 splits.
Training LR is naturally slightly faster than training BPMF3, for which we used 330 MCMC samples
including 30 burn-in samples, but both methods can be trained in a few minutes on both of our data
sets.
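The splitting procedure can be sketched as follows (a minimal NumPy illustration; the helper name and the instance count in the usage below are ours, not from the thesis):

```python
import numpy as np

def make_splits(n_instances, n_splits=5, seed=0):
    """Partition instance indices into n_splits folds; fold i is the test
    set of split i (so test sets never overlap across splits), and the
    remaining indices are divided 3:1 into training and validation,
    giving 60/20/20 proportions overall."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_instances), n_splits)
    splits = []
    for i in range(n_splits):
        rest = np.concatenate([f for j, f in enumerate(folds) if j != i])
        n_train = int(0.75 * len(rest))  # 75% of the remaining 80% = 60%
        splits.append((rest[:n_train], rest[n_train:], folds[i]))
    return splits
```

With 1000 instances, each split contains 600/200/200 training/validation/test instances, and the five test sets partition the data.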
5.4.3 Match Quality
We now turn our attention to the matching framework. We first elaborate on how we perform the matching. We then evaluate the performance of the different learning methods on the matching objective. Finally, we introduce soft constraints into the matching objective and analyze the trade-offs they introduce.

3 We use an implementation of BPMF provided by its authors, currently available at http://www.cs.toronto.edu/~rsalakhu/BPMF.html

             train/validation   test     missing
Matching         S^tr           Ŝ^te     S^m = τ
Evaluation       S^tr           S^te     S^m = τ

Table 5.1: Overview of the matching/evaluation process.
Matching: Experimental Procedures
The matching IPs discussed above assume access to fully known (or predicted) suitability scores. Since we learn estimates of the unknown scores, we denote a model's estimates of the test instances as Ŝ^te, and impute a value for all suitability values that are missing, using a constant imputation of τ ∈ ℝ. Since missing scores are likely to reflect, on average, lower suitability than their observed counterparts, we use τ = 1 in all experiments (NIPS-10's mean is 1.1376 and NIPS-09 has no missing scores).
Given the estimate Ŝ^te computed by one of our learning methods, we perform a matching with S = S^tr ∪ Ŝ^te ∪ (S^m = τ). Note that this permits missing values to be matched, which is important in the regime where few suitability scores are known. Table 5.1 summarizes this procedure. For data set NIPS-10 we set P_min and P_max to 20 and 30, respectively, while the range is 30–40 for data set NIPS-09.4
Baseline: We adopt a baseline method that provides an absolute comparison across methods. The baseline has access to S^tr and imputes τ for any element of S^te. To allow meaningful comparison to other methods, it employs the same imputation for missing scores, S^m = τ.
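As a rough sketch, the full score matrix used for matching can be assembled from observed, predicted, and imputed entries as follows (the function and mask names are illustrative, not from the thesis):

```python
import numpy as np

def assemble_scores(S, train_mask, test_mask, S_hat, tau=1.0):
    """Build the matrix used for matching: observed training scores,
    model predictions on test entries, and a constant tau for all
    remaining (missing) entries."""
    full = np.full(S.shape, tau, dtype=float)  # missing entries -> tau
    full[train_mask] = S[train_mask]           # observed training scores
    full[test_mask] = S_hat[test_mask]         # predicted test scores
    return full
```

Any entry covered by neither mask receives the constant τ, so the assembled matrix is always complete and ready for the matching IP.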
A note on LM: Although the output of LM can be directly used for matching, it does not exploit observed suitabilities in its usual formulation. However, LM can make use of some of the training data S^tr by incorporating submitted papers assessed as "suitable" by some reviewer r into his or her word vector w^a_r. Specifically, we include in w^a_r all papers for which r offered a score of 3 (only if this score is in S^tr).
For all methods, once an optimal match Y* is found, we evaluate it using all observed and unobserved scores, with the same constant imputation for the missing scores, where match quality is measured using J_basic (see Equation 5.1):

    ∑_r ∑_p y*_rp s_rp,   s_rp ∈ S^tr ∪ S^te ∪ (S^m = 1)    (5.8)
Matching Performance using Basic IP
We now report on the quality of the matchings that result from using the predictions of the different methods. Similarly to the preference prediction experiments in Chapter 4, we consider dynamic matching performance as the amount of training data per user increases. Note that the optimal match value is 3053 for NIPS-10 and 2172 for NIPS-09, which occurs when Ŝ^te = S^te.
Figure 5.4 shows how matching quality varies as the amount of training data per user increases in
NIPS-10. Since training scores are also observed at matching time (Equation 5.8), all methods benefit
from having a larger training set. Figure 5.4 leads to the following three observations. Firstly, when no
observed data is available (i.e., when using only the archive) LM does very well, with a matching score
of 2247 ± 32, nearly identical to the quality of LR and BPMF with 10 suitabilities per user, and much
4 These represent typical ranges for senior program committee members.
[Figure: matching objective (y-axis, 2100–2900) versus training set size per user (x-axis, 0–90) for LR, BPMF, LM, and Baseline.]
Figure 5.4: Performance on the matching task on the NIPS-10 dataset.
better than the match quality of 1262 obtained using constant scores ((S^te ∪ S^m) = τ). Secondly, when very few scores are available, LR and LM perform best (and do equally well). As mentioned above, LM is able to exploit observed suitabilities by adding relevant papers to the user corpus, but this attenuates the impact of elicited scores: we see LM is outperformed by all other methods when sufficient data is available. Thirdly, LR outperforms all other methods as data is added. We also see that, unsurprisingly, as the number of observed scores increases, the gain in matching performance (value of information) from additional scores decreases.
It is also interesting to note that a total matching score of over 2500 implies that, on average, each reviewer is assigned papers on which her average preference is greater than 2 (out of 3). LR reaches this level of performance with fewer than 30 observed scores per user, while other methods need 30% more data per user to reach the same level of performance.
Further insight into matching quality on NIPS-10 induced by the different learning methods can be
gained by examining the distribution of scores associated with matched papers (Figure 5.5) or under
different sizes of the training set (Figure 5.6). Figure 5.5 displays the number of scores of each value (0–3)
that get assigned with a training set size of 40. Not surprisingly, LR and BPMF assign significantly more
2s and 3s combined than all other methods. LM is very good at picking the top scores, which reinforces the fact that word-level features, from reviewer and submitted papers, contain useful information for matching reviewers. Similar results were obtained on NIPS-09, and thus LM's performance is not simply
a consequence of the data collection method used for NIPS-10. In addition, Baseline assigns few zeros,
since all missing and test scores are imputed to be τ = 1.
Figure 5.6 provides another perspective on assignment quality. Here we plot results for the best
performing method, LR, on both NIPS-10 and NIPS-09, for 3 different training set sizes. We first note
that the extreme imbalance in the distribution over scores of NIPS-09 leads LR to assign many zeros
[Figure: average number of assignments (0–500) at each score value (0–3) for Baseline, LM, BPMF, and LR.]
Figure 5.5: Assignments for NIPS-10 by score value when using 40 training examples per user.
even with 80 training scores per user. Overall, both data sets show that as the number of training scores
increases, more 2s and 3s, and fewer 0s and 1s, are assigned.
Our remaining results deal exclusively with NIPS-10 since experimental results with NIPS-09 were
similar.
Load Balancing: Balance IP
The experiments above all constrain the number of papers per reviewer to be within a specific range (P_min, P_max). There is no good indication as to how to set these two limits. Instead, we now use the Balance IP, both for matching and evaluation (see Equation 5.8), setting f to be the absolute-value function. The resulting problem cannot be expressed directly as an LP. However, we can use a standard procedure which involves adding auxiliary variables. Specifically, the Balance IP, with f the absolute-value function, can be solved as the following optimization problem (for clarity we omit the constraints on y_rp):

    maximize J_balance(Y, S) = ∑_r ∑_p s_rp y_rp − λ ∑_r t_r
    subject to   (∑_p y_rp) − ȳ ≤ t_r,  ∀r     (5.9)
                −((∑_p y_rp) − ȳ) ≤ t_r,  ∀r    (5.10)
where the auxiliary variables are denoted t_r. We have added two sets of constraints (Equations 5.9 and 5.10), only one of which will be active at a time for each reviewer r, which ensure that t_r is lower bounded by the absolute value of (∑_p y_rp) − ȳ. Since t_r can only decrease the value of the objective, t_r will
[Figure: average number of assignments (0–600) at each score value (0–3) for LR; NIPS-10 panels (a)–(c) use 10, 40, and 86 examples per user; NIPS-09 panels (d)–(f) use 10, 40, and 80 examples per user.]
Figure 5.6: Comparison of the distribution of assignments by score value given different numbers of observed scores.
λ          0      0.1    0.25   0.5    0.75   1
J_basic    2625   2615   2600   2573   2569   2569
Variance   4.62   3.28   2.61   0.89   0.37   0.33

Table 5.2: Comparison of the matching objective versus within-reviewer variance of the number of assigned papers as a function of λ.
then be exactly equal to either the left-hand side of Equation 5.9 or the left-hand side of Equation 5.10.
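A minimal sketch of this linearization, using SciPy's mixed-integer solver as a stand-in (the solver choice, the fixed per-paper load, and the helper name are our simplifying assumptions, not the thesis's implementation):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def balance_match(S, r_per_paper, lam):
    """Balance IP sketch: maximize sum_rp s_rp*y_rp - lam*sum_r t_r, where
    the continuous auxiliary t_r >= |load_r - mean load| linearizes the
    absolute value. Variables: R*P binary y's followed by R continuous t's."""
    R, P = S.shape
    n = R * P
    y_bar = P * r_per_paper / R  # mean reviewer load
    c = np.concatenate([-S.ravel(), lam * np.ones(R)])  # milp minimizes
    A_paper = np.zeros((P, n + R))   # each paper gets r_per_paper reviewers
    A_pos = np.zeros((R, n + R))     # load_r - t_r <= y_bar
    A_neg = np.zeros((R, n + R))     # -load_r - t_r <= -y_bar
    for p in range(P):
        A_paper[p, p:n:P] = 1.0
    for r in range(R):
        A_pos[r, r * P:(r + 1) * P] = 1.0
        A_neg[r, r * P:(r + 1) * P] = -1.0
        A_pos[r, n + r] = -1.0
        A_neg[r, n + r] = -1.0
    res = milp(
        c=c,
        constraints=[LinearConstraint(A_paper, r_per_paper, r_per_paper),
                     LinearConstraint(A_pos, -np.inf, y_bar),
                     LinearConstraint(A_neg, -np.inf, -y_bar)],
        integrality=np.concatenate([np.ones(n), np.zeros(R)]),
        bounds=Bounds(np.zeros(n + R),
                      np.concatenate([np.ones(n), np.full(R, np.inf)])),
    )
    return res.x[:n].reshape(R, P).round().astype(int)
```

With λ = 0 the solver concentrates papers on the high-scoring reviewer; with a large λ the loads are pushed toward the mean, mirroring the trade-off in Table 5.2.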
Figure 5.7 shows the histogram of assigned papers per reviewer given by the optimal solution to the IP for different λ ∈ {0, 0.1, 1}. Experimentally, when λ = 0 load equity is ignored, and almost all reviewers are assigned either the minimum (P_min) or the maximum (P_max) number of papers (this is due to certain reviewers having high expertise for more papers than others); the variance of reviewer loads, ∑_r((∑_p y_rp) − ȳ)²/M with M reviewers, is extremely high. When a "soft constraint" on load equity is introduced, assignments become more balanced as λ increases (i.e., the balance constraint becomes "harder"). Table 5.2 reports the matching objective versus this variance, averaged across users, for different values of λ with a training set size of 40 (other training sizes yielded similar results). Not surprisingly, larger penalties λ for deviating from the mean reviewer load give rise to greater load balance (lower load variance) and worse matching performance: Table 5.2 shows that the best matching in this experiment had a value of 2625 and a load variance of 4.62, while the worst matching had a value of 2569 and a load variance of 0.37. Generally, an appropriate λ, one that nicely trades off performance against load balance across reviewers (here, perhaps around λ = 0.5), would be chosen by the conference organizers. In practice, the organizers may have to examine the assignments further (beyond looking only at the matching objective) before selecting an appropriate value for λ.
[Figure: histograms of the number of papers (20–30) assigned per reviewer for λ = 0, λ = 0.1, and λ = 1.]
Figure 5.7: Histograms of the number of papers matched per reviewer with different values of λ. The leftmost plot shows results with only hard constraints on reviewer loads (λ = 0); the others also include a soft constraint minimizing load variance. The corresponding matching values for different λ values are reported in Table 5.2.
5.4.4 Transformed Matching and Learning
We now consider a non-linear transformation of the scores, reflecting the view that it is much better to assign reviewer-paper pairs with suitabilities of 2 and 3 than pairs with 0 and 1; as discussed above, this can be accomplished by allowing the "utility" of y_rp to be non-linear in the suitability score s_rp. We adopt the following sigmoid function to effect this non-linear transformation: σ(s) = 1/(1 + exp(−β(s − 1.5))), where 1.5 is the middle of the scores' range. We set β = 4.5, which gives σ(0) = 0.001, σ(1) = 0.095, σ(2) = 0.90, and σ(3) = 1.0. We first show how this transformation impacts matching performance without learning; then we discuss how one can incorporate the transformation into the learning objective itself.
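The transformation and the quoted values are easy to check directly:

```python
import math

def sigma(s, beta=4.5, mid=1.5):
    """Sigmoidal transformation of a suitability score s in [0, 3];
    mid is the middle of the score range."""
    return 1.0 / (1.0 + math.exp(-beta * (s - mid)))

# Rounded values match those quoted in the text:
# sigma(0) ~ 0.001, sigma(1) ~ 0.095, sigma(2) ~ 0.90, sigma(3) ~ 1.0
```

The steep slope around s = 1.5 is what makes the transformed objective strongly prefer 2s and 3s over 0s and 1s.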
We first test how matching using the transformed objective affects results when learning is not used to infer missing scores (consequently, S^u = τ), by examining the difference in matching performance as the percentage of observed scores varies. Figure 5.8(a) shows the difference when matching with the transformed objective (J_transformed) versus the basic objective (J_basic). In both cases the resulting matches are evaluated using J_transformed. Although a minor gain is observed when most of the known scores are available, there is, overall, very little difference in performance when matching with either objective. Recall that the mean number of scores per paper is less than 4. Hence, when matching using a small fraction of the data, the matching procedure has very little flexibility to assign high-scoring pairs unless learning is used to predict unobserved scores.
We can modify the learning objective to take into account the nonlinearity introduced in the matching
objective. We do this by transforming all labels using the same sigmoidal transformation as in the
matching objective (Equation 5.7). This allows learning to better predict the transformed scores by
explicitly training on them.
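A rough sketch of this label transformation for LR-TFM (the closed-form ridge solver and the synthetic feature matrix in the test are our simplifications; the thesis's LR is a reviewer-specific regression on word features):

```python
import numpy as np

def sigma(s, beta=4.5, mid=1.5):
    """Sigmoidal label transformation used before fitting LR-TFM."""
    return 1.0 / (1.0 + np.exp(-beta * (s - mid)))

def fit_lr(X, s, transform=None, reg=1e-6):
    """Regularized least-squares regression. LR-TFM first maps the
    training scores through the sigmoid, then fits as usual; plain LR
    fits the raw scores."""
    y = transform(s) if transform is not None else s
    d = X.shape[1]
    # Closed-form ridge solution: (X'X + reg*I) w = X'y
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
```

Training on σ(s) rather than s concentrates the model's capacity on distinguishing {2, 3} from {0, 1}, which is exactly what the transformed matching objective rewards.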
Figure 5.8(b) shows the transformed matching performance of both LR on the non-transformed
data, and LR-TFM, a linear regression model trained using the transformed learning objective. Not
surprisingly, LR-TFM outperforms LR across all training set sizes, since it is trained for the modified
objective J transformed . The difference is especially pronounced with smaller training sets—when enough
data is available, both methods will naturally assign many 2s and 3s. (We also verified that LR-TFM
outperforms BPMF trained on the transformed objective).
[Figure panels: (a) J_transformed versus the fraction of known scores (0.4–1.0), comparing the original (J_basic) and transformed (J_transformed) matching objectives without learning; (b) J_transformed versus training set size per user (10–90), comparing LR and LR-TFM under the transformed matching objective.]
Figure 5.8: Performance on the transformed matching objective on NIPS-10.
5.5 Conclusion and Future Opportunities
We have instantiated the recommendation stage of our framework as an assignment procedure between papers and reviewers. We showed that, by eliciting only a small subset of reviewer scores and inferring the unobserved scores using one of several learning methods, we are able to determine high-quality matchings. We explored the trade-off between matching quality and paper-load balancing, which helps one avoid the need to manually set limits on the reviewer load. Finally, we showed that, under the realistic assumption that utility is non-linear in suitability score, using the same non-linear transformation in the learning objective lets us discover better matches.
Given how matching benefits from an interaction with learning, a next step would be to develop ways
to strengthen this interaction by making the learning methods sensitive to the final matching objective.
We discuss such interactions for active learning—where the system chooses which reviewer scores to
query—in Chapter 6. Another possible path for future research is to explore optimization models for
different recommendation problems and to more formally define the circumstances in which adapting
the learning loss to the final objective provides an advantage.
Finally, as we outlined in Section 5.3, several researchers have looked at possible matching constraints which could be useful for the reviewer-to-paper matching task, using matching formulations similar to ours. Based on our experience with TPMS, we are also interested in exploring more expressive matching constraints, which would give conference organizers more flexibility in expressing their preferences over assignments but which may not fit within our proposed LP framework. For example, in practice, conference organizers often seek to assign complementary reviewers (one reviewer from pool A and one reviewer from pool B) to a submission. For such cases, mapping the matching optimization to one of performing inference in an undirected graphical model (see Section 6.2.2) and making use of high-order potentials (see, for example, [119]) could offer interesting opportunities.
In our work we have used the value of the matching objective, which is based on scores elicited prior
to assignment, to evaluate assignment quality. A possibly stronger method of evaluation would be to
re-evaluate the expertise of reviewers after they have reviewed their assigned papers. We could elicit
reviewer expertise from the reviewers themselves (for example by using the confidence of their review).
Alternatively, senior program committee members could be asked to evaluate the expertise of reviewers
based on their reviews. Another advantage of performing post-hoc evaluation is that it would enable
comparison between our system and other ways of assigning papers to reviewers (either manually or
using other more automated procedures). Exploring different evaluations also relates to our discussion
about evaluation of TPMS in Section 3.5.
Chapter 6
Task-Directed Active Learning
Chapter 5 demonstrated the importance of effective learning of user preferences in matching problems.
Equally important is the question of query selection, which has the potential to further reduce the amount
of preference information users must provide and hence limit the impact of the cold-start problem.
Eliciting preferences (e.g., in the form of item ratings) imposes significant time and cognitive costs
on users. In domains such as paper matching, product recommendation, or online dating, users will have
limited patience for specifying preferences. While learning techniques can be used to limit the amount
of required information in match-constrained recommendation, the intelligent selection of preference
queries will be critical in reducing user burden. It is this problem we address in this chapter. We frame
the problem as one of active learning: our aim is to determine those preference queries with the greatest potential to improve the quality of the matching. This is a departure from most work in active learning, and specifically from approaches tailored to recommender systems (as we discuss below), where queries are selected to improve the overall quality of rating prediction. We develop techniques that focus on queries whose responses will impact the matching quality itself, possibly indirectly by changing predictions.
We also propose a new probabilistic matching technique that accounts for uncertainty in predicted
preferences when constructing a matching. Finally, we test our methods on several real-life data sets
comprised of preferences for online dating, conference reviewing, and jokes. Our results show that active
learning methods that are tuned to the matching task significantly outperform a standard active learning
method. Furthermore, we show that our probabilistic methods can be successfully leveraged in active
learning.
6.1 Related work
Active learning is a rich field which we surveyed in the context of recommender systems in Section 2.3.
With respect to previous research on active learning directed at other tasks, we are aware of one relevant study that explored active learning in matching domains. Rigaux [89] considers an iterative elicitation method for paper matching using neighbourhood CF, but it requires an initial partitioning of reviewers and elicits scores for the same papers from all reviewers in a partition (with the aim only of improving score-prediction quality). In our work, we need not partition users, and we focus on
optimizing matching quality rather than prediction accuracy. Our approach is thus conceptually similar
to CF methods trained for specific recommendation tasks (Section 2.2.2 surveyed this work).
[Figure: schematic showing elicitation of user preferences, prediction of stated and missing preferences, and the resulting assignments between users and entities.]
Figure 6.1: Elicitation in the match-constrained recommendation framework. As shown by the arrow from the matching to the elicitation, the elicitation can be sensitive to the matching objective.
We also note that Bayesian optimization has been used for active learning recently [22]; however,
these methods assume a continuous query space and some similarity metric over item space, hence are not
readily adaptable to our match-constrained problems. As such, we will not explore the use of Bayesian
optimization in this thesis.
6.2 Active learning for Match-Constrained Recommendation
Problems
Our framework is depicted, specifically for the purposes of this chapter, in Figure 6.1. Preference elicitation corresponds to the first stage of our framework. Designing an elicitation process that exploits feedback from the matching stage is the focus of this chapter. The active learning methods that we develop are in line with some of the previous work on active learning (see Section 2.3), especially the work guided by prediction uncertainty (see Section 2.3.1).
Once elicited, preferences can be used to predict missing preferences using one of the models described
in the previous chapters. We use Bayesian PMF (described in Section 2.2.1). The reason for choosing
this model will become clear in the next section.
For matching, we reuse the IP presented in the previous chapter, with the J_basic objective and constraints on the number of papers per reviewer and on the number of reviewers per paper:

    maximize J(Y, S) = ∑_r ∑_p s_rp y_rp                        (6.1)
    subject to  y_rp ∈ {0, 1},  ∀r, p                            (6.2)
                ∑_r y_rp ≥ R_min,  ∑_r y_rp ≤ R_max,  ∀p         (6.3)
                ∑_p y_rp ≥ P_min,  ∑_p y_rp ≤ P_max,  ∀r.        (6.4)
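As an illustration, this IP can be handed to an off-the-shelf mixed-integer solver; the sketch below uses SciPy's milp (our choice of solver, not necessarily the one used in the thesis, and the helper name is ours):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def match(S, r_min, r_max, p_min, p_max):
    """Solve the basic matching IP: maximize sum_rp s_rp * y_rp subject to
    per-paper reviewer-count bounds and per-reviewer paper-load bounds.
    Variables y are flattened row-major: index r*P + p."""
    R, P = S.shape
    A_paper = np.zeros((P, R * P))  # row p sums y_rp over reviewers r
    A_rev = np.zeros((R, R * P))    # row r sums y_rp over papers p
    for p in range(P):
        A_paper[p, p::P] = 1.0
    for r in range(R):
        A_rev[r, r * P:(r + 1) * P] = 1.0
    res = milp(
        c=-S.ravel(),  # milp minimizes, so negate the suitabilities
        constraints=[LinearConstraint(A_paper, r_min, r_max),
                     LinearConstraint(A_rev, p_min, p_max)],
        integrality=np.ones(R * P),
        bounds=Bounds(0, 1),
    )
    return res.x.reshape(R, P).round().astype(int)
```

On the toy setting used later in this chapter (3 reviewers, 6 papers, exactly 1 reviewer per paper and 2 papers per reviewer), the solver recovers the obvious block assignment.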
We also reuse some of our previously defined notation: we denote the set of observed suitabilities by S^o, and the observed scores for a particular user r and item p by S^o_r and S^o_p, respectively. S^u, S^u_r, and S^u_p are the analogous collections of unobserved scores.
6.2.1 Probabilistic Matching
Model uncertainty is at the heart of several active learning techniques (see Section 2.3). For example, a successful querying strategy such as uncertainty sampling (Section 2.3.1) directly aims at reducing model uncertainty by querying the scores over which the model is most uncertain. Our aim is to extend such strategies to matching problems. In this section we introduce a novel method for determining probabilistic matchings, which uses score uncertainty to determine the probability of a user-item pair being matched.
While the IP optimization is straightforward and provides optimal solutions when all scores are observed, it has potential drawbacks when used with predicted scores, and specifically when used in conjunction with active learning. First, the IP does not consider potentially useful information contained in the uncertainty of the predicted suitabilities: because of this uncertainty, we are uncertain about the quality of the IP's assignment, and evaluating how score uncertainty affects the assignment may allow us to reduce this quality uncertainty. Second, the IP does not express the range of possible matches that might optimize total suitability (given the constraints).
While optimal matching given true scores can be viewed as a deterministic process, score prediction is inherently uncertain; we can exploit this if our prediction model outputs a distribution over unobserved scores S^u rather than a point estimate. Different score values supported by the uncertainty model may lead to different assignments, and the number of assignments in which a particular user-item pair appears is indicative of matching uncertainty (resulting from the score uncertainty). Given inputs consisting of observed scores S^o and possibly additional side information X, we can express uncertainty over a match Y′ as:
    Pr(Y = Y′ | S^o, X, θ) = ∫ δ_{Y′}(Y*(S^u, S^o)) Pr(S^u | S^o, X, θ) dS^u,

where Pr(S^u | S^o, X, θ) is our score prediction model (assuming model parameters θ), Y*(·) (see Equation 6.1) is the optimal match matrix given a fixed set of scores, and δ_{Y′} is the Dirac delta function with mode Y′. Using a similar idea, we can formulate a probabilistic model over individual marginals:

    Pr(y_rp = 1 | S^o, X, θ) = ∫ δ_1(y*_rp(S^u, S^o)) Pr(S^u | S^o, X, θ) dS^u,    (6.5)

where y*_rp is the user-item rp entry of Y*.
With this in hand, we overcome the limitations of pure IP-based optimization by developing a sampling method for determining "soft" or probabilistic matchings that reflect the range of optimal matchings given uncertainty in predicted suitabilities. While Equation 6.5 expresses the induced distribution over marginals, the integral is intractable, as it requires solving a large number of matching problems (in this case, solving IPs). Instead we take a sampling approach: we independently sample each score from the posterior Pr(S^u | S^o, X, θ) to build a complete score matrix, then solve the matching optimization (IP) using this sampled matrix. Repeating this process T times provides an estimated distribution over optimal matchings. We can then average the resulting match matrices, obtaining Ȳ = (1/T) ∑_{t=1}^{T} Y^(t), where Y^(t) is the t-th matching. Each entry Ȳ_rp is the (estimated) marginal probability that user-item pair rp is matched; and the probability of this match depends, as desired, on the distribution Pr(s_rp | S^o, X, θ).
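A sketch of this sampling procedure on the toy setting of Figure 6.2 (each reviewer takes exactly two papers, each paper exactly one reviewer); we assume independent Gaussian score posteriors and, as a simplification, solve each sampled matching with the Hungarian algorithm by duplicating reviewer rows, rather than with a general IP solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def soft_match(mean, std, T=200, seed=0):
    """Estimate match marginals Y-bar by sampling score matrices from
    independent Gaussians and solving an optimal matching per sample.
    Each reviewer takes exactly 2 papers (handled by giving each reviewer
    two assignment 'slots'); each paper gets exactly 1 reviewer."""
    rng = np.random.default_rng(seed)
    R, P = mean.shape
    Y_bar = np.zeros((R, P))
    for _ in range(T):
        S = rng.normal(mean, std)                 # one sampled score matrix
        big = np.repeat(S, 2, axis=0)             # 2 slots per reviewer
        rows, cols = linear_sum_assignment(-big)  # maximize total score
        for slot, p in zip(rows, cols):
            Y_bar[slot // 2, p] += 1.0            # slot -> original reviewer
    return Y_bar / T
```

With low variances the marginals collapse to the single IP solution; with high variances probability mass spreads over competing assignments, as in the Ȳ_low versus Ȳ_high comparison discussed below.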
Figure 6.2: A "toy" example with a synthetic score matrix S with 3 reviewers and 6 papers. Each reviewer must be matched to exactly two papers while each paper must be matched to a single reviewer. Shown: matching results using the IP (bottom left), the Ȳ approximation with two different variance matrices, and the loopy belief propagation matching formulation (Z, bottom right).

Figure 6.2 illustrates Ȳ, comparing it to the IP solution, on a randomly generated "toy" problem with 3 reviewers and 6 papers (the match is constrained to exactly 1 reviewer per paper and 2 papers per reviewer). Assuming a fixed predicted score matrix S, two versions of Ȳ are shown, one when all
estimated variances are low (Ȳ_low), the other when they are higher (Ȳ_high).1 Note that the Ȳ matrices respect the matching constraints by design (for visualization purposes we round matching probabilities). Ȳ_low agrees with the IP but, for Ȳ_high, we observe the inherent uncertainty in the optimal matching; e.g., column one shows all three match probabilities to be reasonably high. In addition, the last column shows that even though the second and third users have scores that differ by 2 on the sixth paper, the high variance in their scores gives both users a reasonable probability of being matched to that paper.
6.2.2 Matching as Inference in an Undirected Graphical Model
We have introduced a method for propagating score uncertainty into matching uncertainty. There is, however, another source of uncertainty which we have not yet discussed: the inherent uncertainty over the range of possible matchings. Given a probabilistic model over possible matchings, the matching IP returns the single most likely (binary) solution (that is, the optimal solution). Independent of score uncertainty, there are possibly other good matches which, while dominated by the optimal one, may provide useful information for active learning.
Tarlow et al. [121] model the matching problem using an undirected graph where (binary) nodes correspond to assignment variables y_rp (in other words, there is a one-to-one mapping between nodes and reviewer-paper pairs). Nodes have singleton potentials encoding their corresponding score s_rp, and high-order cardinality potentials enforce the reviewer and paper constraints (Equations 6.3 and 6.4). Tarlow et al. [121] present an efficient approximate-inference algorithm (based on loopy belief propagation [83]) for computing marginal probabilities Pr(y_rp = 1) in such models. Each marginal represents the weighted (approximate) number of times the corresponding entry is part of a match, over all valid matches, where each match is weighted by its quality (the objective of the IP, Equation 6.1). In the rest of this chapter we refer to the matching marginals obtained using this method as the loopy-BP
1 Variances are sampled uniformly at random; in a real problem they would be given by the prediction model.
matching, and denote the resulting matrix of match marginals by Z (compared to Ȳ for the matching variables from the IP). Figure 6.2 presents the results of applying loopy-BP matching to the "toy" problem. We note that on this small problem the solution exploiting matching uncertainty, Z, is similar to Ȳ_high, the solution exploiting score uncertainty. This result is reasonable, since different sampled score instantiations may end up exploring a range of matches that have high weight under loopy-BP matching. In other words, small perturbations around the mean of the predicted scores may result in IP assignments that are close to one another and coincide with the assignments that are given high probability by loopy-BP.
6.3 Active Querying for Matching
Little work has considered strategies for actively querying the “most informative” preferences from users.
In combination with supervised learning, active querying can further reduce the elicitation burden on
users. Random selection of user-item pairs for assessment will generally be sub-optimal, since query
selection is uninformed by the learned model, the objective function, or any previous data. By contrast,
an active approach, in which queries are tailored to both the current preference model and the current
best matching, will typically give rise to better matchings with fewer queries.2
In this section we describe several distinct strategies for query selection: we review a standard active
learning technique and introduce several novel methods that are sensitive to the matching objective.
Our methods can be broadly categorized based on two properties (which we use to label the different methods): whether they select queries by evaluating their impact in score space (S) or in matching space (Y or Z); and whether they select queries with maximal value (M) or maximal entropy (E).
S-Entropy (SE)
Uncertainty sampling is a common approach in active learning that greedily selects queries involving (unobserved) user-item pairs for which the model is most uncertain [102]. In our context, this corresponds to selecting the user-item pair with maximum score entropy w.r.t. the score distribution produced by the learned model. The rationale is clear: uncertainty in score predictions may lead to poor estimates of match quality. Of course, this approach fails to explicitly account for the matching objective (the term Y*(S^u, S^o) in Equation 6.5), focusing instead (myopically) on entropy reduction in the predictive model (the term Pr(S^u | S^o, X, θ)). Queries that reduce prediction entropy may have no influence on the resulting matching. For example, if s^u_rp has high entropy but a much lower mean than some "competing" s^u_r′p, user r′ may remain matched to p with high probability regardless of the response to query rp.
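Under a Gaussian predictive distribution (as produced, for example, by BPMF's posterior samples), the entropy 0.5·log(2πe·σ²) is monotone in the variance, so S-Entropy reduces to picking the unobserved entry with the largest predictive variance. A minimal sketch (the function name is ours):

```python
import numpy as np

def s_entropy_query(var, observed_mask):
    """S-Entropy query selection: return the (user, item) index of the
    unobserved pair with maximal predictive entropy. For Gaussians,
    entropy is monotone in the variance, so we simply mask the observed
    entries and take the argmax of the predictive variances."""
    v = np.where(observed_mask, -np.inf, var)  # exclude observed pairs
    return np.unravel_index(np.argmax(v), var.shape)
```

The same masking pattern applies to the other strategies below; only the scoring of candidate queries changes.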
S-Max (SM)
An alternate, yet still simple, strategy is to select queries involving user-item pairs with highest predicted
score w.r.t. MAP score estimates given our predictions of unobserved scores:
arg max_{(rp)∈Su} Pr(s_rp = s_max | So, X, θ),
with s_max being the highest possible score value (for example, in a particular data set). This may be
especially advantageous for matching problems where, all else being equal, high scores are more likely
2In our settings one can elicit a rating or suitability score from a user for any item (e.g., paper, date, joke); so the full set Su_r serves as potential queries for user r.
Chapter 6. Task-Directed Active Learning 82
to be assigned (see Equation 6.1).
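A minimal sketch of this selection rule follows; proxying Pr(s_rp = s_max) by the Gaussian predictive density at s_max is an assumption for illustration, not the thesis's exact computation:

```python
import math

def select_sm_query(pred_mean, pred_std, s_max, unobserved):
    """S-Max (SM), sketched: query the unobserved pair judged most likely to
    carry the highest possible score s_max.  Pr(s_rp = s_max) is proxied here
    by the Gaussian predictive density at s_max; pred_mean and pred_std map
    (r, p) -> predictive mean and standard deviation."""
    def density_at_max(rp):
        mu, sd = pred_mean[rp], pred_std[rp]
        z = (s_max - mu) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return max(unobserved, key=density_at_max)
```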
SM’s insensitivity to the matching objective means that it shares the same obvious shortcoming as SE.
One remedy is to use expected value of information (EVOI) to measure the improvement in matching
quality given the response to a query (taking expectation over predicted responses). This approach,
which we reviewed in Section 2.3.4, has been used effectively in (non-constrained) CF [18]; but EVOI is
notoriously hard to evaluate. In our context, we would (in principle) have to consider each possible query
rp, estimate the impact of each possible response s_rp on the learned model (the term Pr(Su|So, X, θ) in
Equation 6.5), and re-solve the estimated matching (the term Y (Su, So) in Equation 6.5). Instead, we
consider several more tractable strategies that embody some of the same intuitions.
Y -Max (YM)
A simple way to select queries in a match-sensitive fashion is to consider the solution returned by the
IP w.r.t. the observed scores, So, and the MAP solution of the unobserved scores, Su. We query the
unknown pair rp that contributes the most to the value of the objective:
arg max_{(rp)∈Su} y_rp s_rp,
where y_rp ∈ Y(So, Su) is the binary match value for user r and item p, and s_rp the corresponding
MAP score value. In other words, we query the unobserved pair among those actually matched with
the highest predicted score. We refer to this strategy as Y-Max (YM). It reflects the intuition that
we should either confirm or refute scores for matched pairs, i.e., those pairs that, under the current
model, directly determine the value of the matching objective. However, notice that YM is insensitive
to score uncertainty. So, for example, it may query unobserved scores whose predictions have very high
confidence, despite the fact that such queries are highly unlikely to provide valuable information.
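The YM rule, together with its random fall-back for rounds where no unobserved pair is matched, can be sketched as follows (the dictionary-based representation of the IP solution and scores is an assumption):

```python
import random

def select_ym_query(match, map_score, unobserved, rng=None):
    """Y-Max (YM), sketched: among unobserved pairs that the current IP
    solution actually matches (match[rp] == 1), query the one with the
    highest MAP score estimate, i.e., arg max y_rp * s_rp.  When no
    unobserved pair is matched, fall back to a random unobserved pair."""
    rng = rng or random.Random(0)
    matched = [rp for rp in unobserved if match.get(rp, 0) == 1]
    if not matched:
        return rng.choice(unobserved)  # fall-back: random unobserved query
    return max(matched, key=lambda rp: map_score[rp])
```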
Ȳ-Max (ȲM)
As a remedy to YM’s problem, ȲM exploits our probabilistic matching model to select queries. As with
YM, ȲM queries the unobserved pair rp that contributes the most to the objective value:
arg max_{(rp)∈Su} Ȳ_rp s_rp.
The difference is that we use the probabilistic match, exploiting prediction uncertainty in query selection.
Ȳ-Entropy (ȲE)
This method exploits the probabilistic match Ȳ as well, but unlike ȲM, ȲE queries unknown pairs
whose entropy in the match distribution is greatest. Specifically, we view each Y_rp as a Bernoulli
random variable with (estimated) success probability Ȳ_rp. We then query that pair with maximum
match entropy:
arg max_{(rp)∈Su} [−Ȳ_rp log Ȳ_rp − (1 − Ȳ_rp) log(1 − Ȳ_rp)].
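Both Ȳ-based rules rest on match marginals estimated by re-solving the matching over sampled scores. A minimal sketch of the marginal estimate and the ȲE selection (the `solve_matching` interface, standing in for the IP solver of Section 6.2, and the dictionary formats are assumptions):

```python
import math

def match_marginals(sampled_scores, solve_matching):
    """Estimate the probabilistic match marginals Ȳ_rp by solving the
    matching once per sampled score configuration and averaging the binary
    assignments.  solve_matching is assumed to return {(r, p): 0 or 1}."""
    counts = {}
    for scores in sampled_scores:
        for rp, y in solve_matching(scores).items():
            counts[rp] = counts.get(rp, 0) + y
    n = len(sampled_scores)
    return {rp: c / n for rp, c in counts.items()}

def select_ye_query(y_bar, unobserved):
    """Ȳ-Entropy (ȲE), sketched: query the unobserved pair whose Bernoulli
    match probability Ȳ_rp has maximum entropy (closest to 1/2)."""
    def bernoulli_entropy(q):
        if q <= 0.0 or q >= 1.0:
            return 0.0
        return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)
    return max(unobserved, key=lambda rp: bernoulli_entropy(y_bar.get(rp, 0.0)))
```

ȲM would instead take arg max over `y_bar[rp] * map_score[rp]` on the same marginals.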
Z-Max (ZM)
In this method we exploit the matching marginals of Z with maximal-value query selection:
arg max_{(rp)∈Su} z_rp s_rp,
where z_rp ∈ Z(So, Su) is the probabilistic match value for user r and item p according to the loopy-BP
matching introduced in Section 6.2.2. We also experimented with a strategy that considered both score
uncertainty and matching uncertainty (Z̄M) but initial results were not better than those of ZM. We
hypothesize that solving the IP using different sets of sampled scores is an alternate method of exploring
the range of good matches.
One important point to note is that the match-sensitive strategies, YM, ȲM, and ȲE, all attempt to
query unobserved pairs that occur (possibly stochastically) in the optimal match. When the IP does
not match on any unobserved pairs, a fall-back strategy is needed. All three strategies resort to random
querying as a fall-back, selecting a random unobserved item score for any specific user as its query. For
ȲM and ȲE, we further consider all queries that correspond to a user-item pair with less than a 1%
chance of being matched to be “random” queries.
6.4 Experiments
We test the active learning approaches described above on three data sets, each with very different
characteristics. We begin with a brief description of the data sets and matching tasks, then describe our
experimental setup, before proceeding to a discussion of our results.
6.4.1 Data Sets
We first describe our three data sets and define the corresponding matching tasks.
Jokes data set: The Jester data set [41] is a standard CF data set in which over 73,000 users have
each rated a subset of 100 jokes on a scale of -10 to 10. It has a dense subset in which all users rate ten
common jokes. Our experiments use a data set consisting of these ten jokes and 300 randomly selected
users.3 We convert this to a matching problem by requiring the assignment of a single joke to each
user (for example, to be told at a convention or conference), and requiring that each joke be matched
to between 25 and 35 users (to ensure “jocular diversity” at the convention). Figure 6.3(a) provides a
histogram of the suitabilities for the Jester sub-data set.
Conference data set: This is the data set derived from the NIPS 2010 conference that we have been
using across the different experimental sections of this thesis. As previously reported the suitabilities
for a subset of papers were elicited in two rounds. In the first round scores were elicited for about 80
papers per reviewer, with queries selected using the YM procedure described above (where the initial
scores were estimated using a word-LM, see Section 3.2.1, using reviewers’ published papers).
Dating data set: The third data set comes from an online dating website.4 It contains over 17 million
ratings from roughly 135,000 users of 168,000 items (other users). We use a denser subset of 32,000
ratings from 250 users (each with at least 59 ratings) over 250 items (other users); see Figure 6.3(b) for
3This data set is derived from the 18,000 subset of Dataset 1 presented in Goldberg et al. [41]. Documentation for the data sets is available online, currently at http://eigentaste.berkeley.edu/dataset/
4See http://www.occamslab.com/petricek/data/
Figure 6.3: Histograms of known suitabilities for two of our data sets: (a) Jokes data set; (b) Dating data set.
the score histogram. Since items are users with preferences over their matches, dating is generally treated
as a two-sided problem. While two-sided matching can fit within our general framework, the focus of
our current work is on one-sided matching. As such, we only consider user preferences for “items” and
not vice versa. Each user is assigned 25–35 items (and vice versa since “items” are users).
6.4.2 Experimental Procedures
Our experiments simulate the typical interaction of a recommendation or matching engine with its users.
All experiments start with a few observed preferences for each user, for example, preferences provided
by users upon first entering the system, and then go through several rounds of querying. At each round,
a querying strategy selects queries to ask one or more users. Note that in practice we restrict the
strategies to only query (unobserved) scores that are actually available in our datasets. That is, since
we simulate the process of actively querying scores we can only simulate the responses that are available
in the underlying dataset. Once all users have responded, the system re-trains the learning model with
newly and previously observed preferences, then proceeds to select the next batch of queries. This is a
somewhat simplified model that assumes semi-synchronous user communication and thus our strategies
cannot take advantage of users’ most recent elicited preferences until the end of each round. We also
assume for simplicity that the same fixed number of queries per user is asked in each round. The initial
goal is simply to assess the relative performance of each method; we do relax some of these assumptions
in Section 6.4.3.
There are a variety of reasonable interaction modes for eliciting user preferences. For example, in
paper-reviewer matching, posing a single query per round is undesirable, since a reviewer, after assessing
a single paper, must wait for other reviewer responses—and the system to re-train—before being asked
a subsequent query. Reviewers generally prefer to assess their expertise off-line w.r.t. a collection of
papers. Consequently batch interaction is most appropriate: users are asked to assess K items. While
batch frameworks for active learning have received recent attention (e.g., [44]), here we are interested in
comparing different query strategies. Hence we use a very simple greedy batch approach where we elicit
the “top” K preferences from a user, where the “top” queries are ranked by the specific active strategy
under consideration. Appropriate choice of K is application dependent: smaller values of K may lead to
better recommendations with fewer queries, but require more frequent user interaction and user delay.
We test different values of K below.
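The greedy batch scheme described above amounts to ranking each user's unobserved pairs by the active strategy and taking the top K; a minimal sketch (the per-user dictionary format and scoring callback are assumed representations):

```python
def greedy_batch(strategy_score, unobserved_by_user, k):
    """Greedy batch elicitation, sketched: for each user, ask the K unobserved
    pairs ranked highest by the active strategy's scoring function
    strategy_score (higher = asked first)."""
    return {user: sorted(pairs, key=strategy_score, reverse=True)[:k]
            for user, pairs in unobserved_by_user.items()}
```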
We use BPMF to generate our predictions and its uncertainty model for unobserved scores. We chose
BPMF because it keeps a full distribution over unobserved scores Pr(Su|So,θ), given model parameters
θ, which is useful for our active learning strategies. A procedure for setting some of the hyper-parameters
of BPMF is outlined in [98]. We use a validation set for the remaining hyper-parameters, giving (using
notation from the original paper):
• Jokes: D = 1, α = 0.1, β_{0u} = 0.1, β_{0v} = 10
• Conference: D = 15, α = 2, β_{0u} = β_{0v} = 0.1
• Dating: D = 2, α = 2, β_{0u} = β_{0v} = 0.1
Each observed score is assigned a fixed small uncertainty value of 1e−3. The exact value is unimportant
as long as it emphasizes near-certainty compared to the model’s uncertainty over predicted scores (we
verified that this was the case in our experiments). For Ȳ-based methods, which require sampling, we
use 50 samples in all experiments. The ZM method, unlike the IP, is sensitive to the scale of the scores
(the temperature of the system). A large scale will lead to more deterministic matches while a low scale
has the opposite effect. Visual inspection of the results showed that, for each dataset, scaling the score
such that the max score has a value of 10 leads to an acceptable level of uncertainty. The elaboration of a
more formal validation procedure is left for future work.
We compare query selection methods w.r.t. their matching performance—i.e., the matching objective
value of Equation 6.1—using the match matrix given by the IP using estimated scores and known scores
So, evaluated on the full set of available scores. We use a random querying strategy, which selects
unobserved items uniformly at random for each user, as a baseline. All figures show the number of
queries per user on the x-axis. The y-axis indicates the difference in the matching objective value
between a specific querying strategy and the baseline. Positive differences indicate better performance
relative to the baseline. The magnitude of this difference can be best understood relative to the number
of users in the data set. For example, a difference of 300 in objective value for the 300 users in the Jokes
data set means that users are matched to jokes that are better by one “score unit” on average (as a
reminder, the scores of the jokes range from −10 to 10). Note that as we increase the number of queries,
even random queries will eventually find good matches—in the limit, where all scores are observed,
matching performance of all methods will be identical (hence the bell-shaped curves and asymptotic
convergence in our results).
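The evaluation quantity can be sketched directly (the dictionary-based match and score representations are illustrative assumptions):

```python
def matching_objective(match, true_scores):
    """Matching objective on the full set of available scores: the sum of
    true scores over matched pairs (the quantity behind Equation 6.1)."""
    return sum(true_scores[rp] for rp, y in match.items() if y == 1)

def objective_difference(strategy_match, baseline_match, true_scores):
    """The quantity plotted on the y-axis: strategy minus random baseline.
    Positive values mean the strategy produced a better matching."""
    return (matching_objective(strategy_match, true_scores)
            - matching_objective(baseline_match, true_scores))
```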
We don’t focus on running time in our experiments since query determination can often be done
off-line (depending on batch sizes). Having said that, even the most computationally intensive querying techniques are fast
and can support online interaction: (a) in all 3 data sets, solving the IP takes a fraction of a second; (b)
BPMF can be trained in a matter of a few minutes at most, but can be run asynchronously with query
selection. For example, we could choose to re-run BPMF as soon as we have received a new (pre-defined)
number of scores. The query selection would then use the most up-to-date learned model available; and
(c) sampling scores is very fast as the posterior distribution is Gaussian. Furthermore, given the above,
our methods should scale to larger datasets although the training time of BPMF may preclude fully
online interaction.
Figure 6.4: Matching performance for active learning results (queries per batch per user: qbu): (a) Jokes data set (10 qbu); (b) Conference (20 qbu); (c) Dating (20 qbu). Standard error is also shown. The triangle plot (using the right vertical axis) shows the absolute matching value of the random strategy.
Figure 6.5: Usage frequency of the fall-back query strategy for YM, ȲM and ȲE: (a) Jokes data set; (b) Conference data set; (c) Dating data set.
6.4.3 Results
We first investigate the performance of the different querying strategies on our three data sets using
default batch sizes—these K values were deemed to be natural given the domains (different K values
are discussed below). Figure 6.4(a) shows results for Jokes using batches of 10 queries per user per
round (K = 10). Figures 6.4(b) and 6.4(c) show Conference and Dating results, respectively, both with
a batch size of 20. All users start with 20 observed scores: 15 are used for training and 5 for validation.
We also experimented with a more realistic setting where some users have few observed scores (e.g., new
users)—results are qualitatively very similar but are not discussed here.
The relative performance of each of the active methods exhibits a fairly consistent pattern across all
three domains, which permits us to draw some reasonably strong conclusions.5 First, we see that all
methods except for SE outperform the baseline in all domains. Recall that SE is essentially uncertainty
sampling, a classic (match-insensitive) active learning model often used as a general baseline method
for active learning. It outperforms the random baseline only occasionally, most significantly after the
first round of elicitation in Dating. Second, all of our proposed match-sensitive techniques outperform
SE consistently on all data sets. Third, the match-sensitive approaches that leverage uncertainty over
scores, namely, ȲM and ȲE, typically outperform YM, especially after the initial rounds of elicitation.
This difference in performance behaviour is most pronounced in the Conference domain.
We gain further insight into these results by examining the inner workings of these strategies.6
Figure 6.5 shows the number of random (or fall-back) queries used (on average) by each of YM, ȲM and
ȲE. On all data sets YM resorts to the fall-back strategy significantly earlier than the others, explaining
YM’s fall-off in performance and indicating that the diversity of potential matches identified by our
probabilistic matching technique plays a vital role in match-sensitive active learning.
Sequential Querying
We employed a semi-synchronous querying procedure above, where all users are queried in parallel at
each round. We now consider a different mode of interaction where, at each round, users are queried
sequentially in round robin fashion. This allows the responses of earlier users within a round to influence
the queries asked to later users—potentially reducing the total number of queries at the expense of
5We do not report the performance of SM—it is consistently outperformed by the baseline in all experiments. We have observed that SM typically selects all queries from among only a few items, namely, those with high predicted average score; hence it acquires no information about the vast majority of items.
6We did not perform further experiments with Z̄M since it appears as if considering inherent matching uncertainty does not lead to a win compared to the Y and Ȳ methods.
Figure 6.6: Matching performance for active learning results using non-default parameters: (a) Jokes data set (10 queries per batch per user, sequential); (b) Conference data set (10 queries per batch, parallel); (c) Conference data set (40 queries per batch, parallel); (d) Conference data set with larger user and item constraints.
increased synchronization (and delay) among users. Fig. 6.6(a) shows that our methods are robust to
this modification in the querying procedure. Specifically, SE is quickly outperformed by all methods
that are sensitive to the matching. Furthermore, both ȲM and ȲE, which use matching uncertainty,
outperform YM overall.
Batch Sizes
The choice of the number of queries K per batch affects both the frequency with which the user interacts
with the system as well as the overall match performance. For example, high values of K reduce the
number of user “interactions” needed for a specific level of performance, at the expense of query efficiency
(improvement in matching objective per query). The “optimal” value for K depends on the actual
recommendation application. Figures 6.6(b) and (c) show results with different values of K on Conference,
using 10 and 40 queries per round, respectively. The relative performance of the active methods remains
almost identical. As expected, absolute performance w.r.t. query efficiency is better with smaller values
of K. The matching-sensitive strategies clearly outperform the score-based techniques. Results are
similar across all data sets.
Matching Constraints
Our results are also robust to the use of different matching constraints, specifically, bounds on the
numbers of items per user and vice versa (i.e., R_min, R_max, P_min, P_max). Using the Conference data set,
we increase to two (from one) the number of reviewers assigned to each paper. Fig. 6.6(d) shows that
the behavior of the methods changes little, with both Ȳ-methods still outperforming all other methods.
The other domains (not shown) exhibit similar results.
6.5 Conclusion and Future Opportunities
We investigated the problem of active learning for match-constrained recommender systems. We ex-
plored several different approaches to generating queries that are guided by the matching objective, and
introduced a novel method for probabilistic matching that accounts for uncertainty in predicted scores.
Experiments demonstrate the effectiveness of our methods in determining high-quality matches with
significantly less elicitation of user preferences than that required by uncertainty sampling, a standard
active learning method. Our results highlight the importance of choosing queries in a manner that is
sensitive to the matching objective and uncertainty over predicted scores.
One effect of choosing which user preferences to label, based on some objective, is that the model
will learn from (possibly) biased data. That is, the sampled data may provide the model with a biased
view of the true underlying data distribution. This is a general problem of active learning known as
sampling bias (for a formal description see, for example, [31]). By performing elicitation based on the
matching objective we add a further source of bias. Informally, the system is more likely to query scores
that are expected to be higher since they will more likely be matched. In our datasets we have not found
it to be a problem. However, in general in order to reduce this added bias, one may want to elicit scores
based on both the matching and learning objectives. Another possible practical avenue is to query all
users about a (small) common fixed set of items. This would ensure a certain level of score diversity; in
particular, it would ensure that the system has access to low scores for all users.
On a practical note, it has been very challenging to obtain informative uncertainty models using the
preference datasets that we experimented with. We have shown that we are still able to leverage the
model uncertainty to obtain performance gains. However, in general, uninformative uncertainty models
may fundamentally limit the general usefulness of the active learning methods that (even indirectly) rely
on model uncertainty.
There are many promising avenues of future research in match-constrained recommendation. We
could explore different matching objectives, for example two-sided matching with stability constraints
(e.g., as would be appropriate in online dating as well as paper-reviewer matching, where papers require
“sufficient” expertise). We could also explore methods for eliciting side information from users in a way
that is guided by the recommendation objective. Furthermore, higher-level, abstract queries (such as
preferences over item categories or features) may significantly boost “gain per query” performance. In
fact, one could use CSTM (Chapter 4) to model higher-level features such as item genres (or the subject
areas of submissions and reviewers). Initial experiments with such data showed some promise, which
could pave the way to using CSTM as the underlying active learning model. Another possibility would
be to allow the active learning to choose between different types of queries (for example queries about the
subject areas of a reviewer or queries about user-item preferences). EVOI could be used as a principled
way of selecting queries of these different types. Modelling this extra level of user preferences will also
be useful for cold-start items. For example, in the reviewer-to-paper matching domain, imagine that
reviewer profiles are kept across conferences. Then user preferences over paper subject areas could be
used to get better score estimates of the submitted papers which in turn will help the active learning.
Without using side information, the initial active learning queries would be no better than those of a
random baseline.
Finally, in this chapter after new data was obtained we always re-trained the system using all available
data. This approach is unlikely to scale, and online learning techniques, which could learn using only
the newly-available data, would be of interest.
Overall, explicit preference elicitation requires the voluntary participation of users. Conference re-
viewers may oblige since they will reap immediate benefits. However, in other applications, it may be
harder to convince users of the advantages of engaging in this elicitation mechanism. In such cases,
recommender systems may either have to be more subtle about the elicitation adopted, for example, by
including an exploration policy within their recommendation objective, or even by deducing preferences
from user behaviour, for example, by examining the list of pages they browsed, the search queries they
issued, or the information they communicated through their participation in online social networking
sites.
Chapter 7
Conclusion
Throughout this thesis we have shown how we can both tailor existing machine learning models and
methods, as well as develop new ones to increase the performance and expand the capabilities of rec-
ommender systems. We now provide a summary of our work and highlight opportunities for future
work.
7.1 Summary
Firstly, we established preference prediction as the core problem of interest in recommender systems.
Current supervised learning methods are well suited to this problem. Specifically, supervised methods
tailored to the specifics of recommender systems, such as collaborative filtering methods, have been shown
to be excellent at predicting missing preferences in typical recommendation domains when user preference
data is plentiful. However, in cold-start scenarios we must resort to leveraging side information, possibly
including content information, about users and items. We showed how we can model textual user and item
side information, using topic models, in a document-prediction domain and obtain superior performance
compared to state-of-the-art methods that use only user preference scores.
Secondly, we introduced a simple framework which decomposes the recommendation problem into
three interacting stages: a) preference elicitation; b) preference prediction; and c) determination of
recommendations. We showed how we can cast match-constrained problems, such as the paper-to-reviewer
matching problem, into this framework. For match-constrained recommendations we experimentally
demonstrated, using two conference datasets, the strong correlation between the methods’ preference-
prediction performance and their matching performance. Further, we exploited the synergy between
learning and matching objectives when using a non-linear mapping between utility and suitability.
Finally, using active learning, we explored the interaction between preference elicitation and matching
in a match-constrained recommender system. The active querying methods we developed focussed on
improving the recommendation objective instead of the learning objective. To reach our goals we also
developed a probabilistic matching procedure to account for the uncertainty in predicted preferences.
Our methods, including those that use probabilistic matching, proved useful in querying user preferences
for the matching problem. Overall, using our conference datasets, a dating dataset and a jokes dataset,
we showed that together with preference prediction methods, active learning methods greatly reduce the
elicitation burden on users and thus help alleviate the cold-start problem.
Chapter 7. Conclusion 92
Interestingly, the non-research contribution of this thesis, the Toronto Paper Matching System, is
the component of this thesis which has seemingly had the most immediate impact on the community.
Further, it has both revealed research opportunities, for example by allowing us to collect interesting
data sets which were essential in the development of our work, and it has also proven to be a good test
bed for our research ideas.
7.2 Future Research Directions
As recommender systems become ubiquitous several research opportunities will naturally present them-
selves. We have outlined some of these directions in the preceding chapters. We now outline a few
additional directions which are the closest to the work presented in this thesis.
1. A wide variety of sources of side information may be indicative of user preferences (for example,
the different aspects of user online behaviour, such as their search terms, the sites they visited
and the frequency and length of the visits, their purchased items, and others). Learning simul-
taneously from all such sources has the potential to refine recommendations and, more generally,
personalization models. Therefore, it is worth developing and analyzing models which can learn
from these potentially heterogeneous sources of (side) information. Learning from combinations of
heterogeneous data is a general challenge of machine learning and one of particular interest to
recommender systems.
Further, these additional sources of side information will often contain more expressive forms of
user preferences. For example, deriving preferences from text (e.g., of reviews), where extracting
user preferences cannot be done with a bag-of-words model, is still a challenge in machine learning
(although the field has been progressing, especially if allowed to learn from large collections of
labelled data [110]). One possible avenue for accomplishing this task is—similar to our approach
using CSTM—to learn general higher-level representations of this side information and then use
these representations as features in a user preference model. In general, this direction of research
may require the exploration of interesting combinations of existing machine learning content models
(for example, models of text or images) with preference prediction models.
2. In many recommendation domains it may be unreasonable to require the explicit elicitation of
many preferences from users. Therefore, as we discussed in the last section of Chapter 6, it
will be essential to effectively learn from weaker sources of user preferences such as users’ online
behaviour. As a first step, there has already been some work using implicit user feedback [51]
for recommendations. While extending this work is promising, methods that can aggregate weak
forms of preference data into definitive user preferences will be of importance.
3. Current recommender systems typically work for single item-domains. For example, a system rec-
ommends either books or restaurants but not both. This is partly due to the mechanisms that
are involved in creating user-item preference data sets. However, there are necessarily correlations
between users’ preferences across different domains which can be exploited using a multi-domain
recommender system. Further, a multi-domain recommender system has the potential of learning
much finer-level representations of user preferences, leading to better user personalization. There-
fore, methods which work across multiple domains, and which eventually lead to more general
models of user preferences, will be beneficial even beyond recommender systems.
To be useful across many domains a recommender system will often be required to make recommen-
dations in domains for which it has very little user preference information. For example, an online
system will constantly need to adapt to both novel items and users as well as to users’ interests
as they evolve over time. Again, the use of side information, including content information, will
be a necessity to quickly identify exploitable correlations between the different recommendation
domains.
Ideally, recommender systems will not only recommend single items, but will also be able to recommend
structured lists of items. For example, a system could recommend full trip itineraries,
including places to stay, sightseeing activities and tickets to shows, or recommend a reading cur-
riculum composed of a list of scientific or news articles to (gradually) learn about specific topics, or
sets of ingredients to create harmonious recipes and meals. Such problems could be modelled using
our current framework, by first having a preference learning objective followed by a combinatorial
optimization step. In cases where structured label data exists we could then turn our attention to
the work on structured output learning (for example, [123]) or, more appropriately, to works that
can deal explicitly with the ultimate combinatorial optimization such as Perturb-and-Map [82] and
others [120].
4. Finally, there are also other particularities of practical recommender systems which have often
been ignored in the academic literature. Examples of such particularities include:
• research that treats missing preferences in commonly available datasets as missing at random
• research that elides the temporal and spatial contexts around recommendations
Such considerations will likely become of crucial importance for real-life recommender systems and
will become more accessible to academics once datasets containing information relevant to these
considerations become available.
We have presented recommender systems as a major beneficiary of advances in machine learning
research, and specifically of supervised and active learning methods. It is likely that recommender
systems, especially at their intersection with human-behaviour modelling, will become an even more
important application area for machine learning techniques. This will especially be the case as larger
user-preference data sets, particularly ones containing extra features such as more expressive information
indicative of preferences, become available.
Bibliography
[1] Deepak Agarwal and Bee-Chung Chen. Regression-based latent factor models. In Proceedings
of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’09, pages 19–28, New York, NY, USA, 2009. ACM. 40, 41, 42, 53, 62
[2] Deepak Agarwal and Bee-Chung Chen. fLDA: Matrix factorization through latent Dirichlet alloca-
tion. In Proceedings of the third ACM International Conference on Web Search and Data Mining,
WSDM ’10, pages 91–100, New York, NY, USA, 2010. ACM. 40, 41, 42, 52
[3] Robert Arens. Learning SVM ranking functions from user feedback using document metadata and
active learning in the biomedical domain. In Johannes Fürnkranz and Eyke Hüllermeier, editors,
Preference Learning, pages 363–383. Springer-Verlag, 2010. 22
[4] Suhrid Balakrishnan and Sumit Chopra. Collaborative ranking. In Proceedings of the Fifth ACM
International Conference on Web Search and Data Mining, WSDM ’12, pages 143–152, New York,
NY, USA, 2012. ACM. ISBN 978-1-4503-0747-5. 17, 64
[5] Krisztian Balog, Leif Azzopardi, and Maarten de Rijke. Formal models for expert finding in enter-
prise corpora. In Efthimis N. Efthimiadis, Susan T. Dumais, David Hawking, and Kalervo Järvelin,
editors, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR-06), pages 43–50, Seattle, Washington, USA, 2006.
ACM. ISBN 1-59593-369-7. 53
[6] Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, and Luo Si. Expertise retrieval.
Foundations and Trends in Information Retrieval, 6(2):127–256, February 2012. ISSN 1554-0669.
37
[7] Robert M. Bell and Yehuda Koren. Lessons from the Netflix prize challenge. SIGKDD Exploration
Newsletter, 9:75–79, December 2007. ISSN 1931-0145. 8, 13, 20, 48
[8] Salem Benferhat and Jérôme Lang. Conference paper assignment. International Journal of Intel-
ligent Systems, 16(10):1183–1192, 2001. 37, 66, 67, 68
[9] T. Bertin-Mahieux. Large-Scale Pattern Discovery in Music. PhD thesis, Columbia University,
February 2013. 42
[10] Michael J. Best and Nilotpal Chakravarti. Active set algorithms for isotonic regression; a unifying
framework. Math. Program., 47:425–439, 1990. 54
[11] Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without
constraints. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors,
Advances in Neural Information Processing Systems 23, pages 199–207. 2010. 18
[12] David M. Blei and John D. Lafferty. Dynamic topic models. In Proceedings of the Twenty-third
International Conference on Machine Learning (ICML), 2006. 43
[13] David M. Blei and John D. Lafferty. A correlated topic model of science. AAS, 1(1):17–35, 2007.
44, 46, 48, 49, 50, 52, 54, 55
[14] David M. Blei and Jon D. McAuliffe. Supervised topic models. In Advances in Neural Information
Processing Systems (NIPS), 2007. 43
[15] David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. Hierarchi-
cal topic models and the nested Chinese restaurant process. In Advances in Neural Information
Processing Systems (NIPS), 2003. 43
[16] David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John Lafferty. Latent Dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022, 2003. 31, 32, 43, 44
[17] Michael Bloodgood and K. Vijay-Shanker. A method for stopping active learning based on sta-
bilizing predictions and the need for user-adjustable stopping. In Proceedings of the Thirteenth
Conference on Computational Natural Language Learning (CoNLL-2009), 2009. 24
[18] Craig Boutilier, Richard S. Zemel, and Benjamin Marlin. Active collaborative filtering. In UAI,
pages 98–106, Acapulco, 2003. 22, 23, 82
[19] Darius Braziunas and Craig Boutilier. Assessing regret-based preference elicitation with the UT-
PREF recommendation system. In Proceedings of the Eleventh ACM Conference on Electronic
Commerce (EC-10), pages 219–228, Cambridge, MA, 2010. 8
[20] John S. Breese, David Heckerman, and Carl Myers Kadie. Empirical analysis of predictive algo-
rithms for collaborative filtering. In G. F. Cooper and S. Moral, editors, Proceedings of the 14th
Conference on Uncertainty in Artificial Intelligence, pages 43–52, 1998. 9
[21] Klaus Brinker. Incorporating diversity in active learning with support vector machines. In Fawcett
and Mishra [34], pages 59–66. ISBN 1-57735-189-4. 23
[22] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive
cost functions, with application to active user modeling and hierarchical reinforcement learning.
Technical Report TR-2009-23, Department of Computer Science, University of British Columbia,
November 2009. 78
[23] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan.
Streaming variational Bayes. arXiv:1307.6769, 2013. 61
[24] Christopher J.C. Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth
cost functions. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information
Processing Systems 19, pages 193–200. MIT Press, Cambridge, MA, 2007. 17
[25] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: From pairwise
approach to listwise approach. Tech Report MSR-TR-2007-40, Microsoft Research, April 2007. 17
[26] Tianqi Chen, Hang Li, Qiang Yang, and Yong Yu. General functional matrix factorization using
gradient boosting. In Sanjoy Dasgupta and David Mcallester, editors, Proceedings of the 30th
International Conference on Machine Learning (ICML-13), volume 28, pages 436–444. JMLR
Workshop and Conference Proceedings, 2013. 40, 41
[27] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining content-
based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR ’99
Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, California, 1999. ACM.
40, 41
[28] William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. Journal of
Artificial Intelligence Research (JAIR), 10:243–270, May 1999. ISSN 1076-9757. 18
[29] Don Conry, Yehuda Koren, and Naren Ramakrishnan. Recommender systems for the conference
paper assignment problem. In Proceedings of the Third ACM Conference on Recommender Systems,
RecSys ’09, pages 357–360, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-435-5. 36, 40,
41, 53, 67
[30] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the
Twenty-Fifth International Conference (ICML 2008), pages 208–215, 2008. 20
[31] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the
Twenty-fifth International Conference on Machine learning (ICML), pages 208–215, 2008. 89
[32] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977. 45
[33] E. Bart, M. Welling, and P. Perona. Unsupervised organization of image collections: Taxonomies
and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011. 62
[34] Tom Fawcett and Nina Mishra, editors. Machine Learning, Proceedings of the Twentieth Interna-
tional Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, 2003. AAAI Press.
ISBN 1-57735-189-4. 95, 102
[35] Brendan J. Frey and Nebojsa Jojic. A comparison of algorithms for inference and learning in prob-
abilistic graphical models. IEEE Trans. Pattern Anal. Mach. Intell., 27(9):1392–1416, September
2005. ISSN 0162-8828. 45
[36] David Gale and Lloyd S. Shapley. College admissions and the stability of marriage. American
Mathematical Monthly, 69(1):9–15, 1962. ISSN 0002-9890. 24, 25
[37] Naveen Garg, Telikepalli Kavitha, Amit Kumar, Kurt Mehlhorn, and Julian Mestre. Assigning
papers to referees. Algorithmica, 58(1):119–136, 2010. 37, 66, 67, 68
[38] Kostadin Georgiev and Preslav Nakov. A non-iid framework for collaborative filtering with re-
stricted boltzmann machines. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of
the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1148–1156.
JMLR Workshop and Conference Proceedings, May 2013. 13
[39] Mehmet Gönen, Suleiman Khan, and Samuel Kaski. Kernelized Bayesian matrix factorization. In
Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference
on Machine Learning (ICML-13), volume 28, pages 864–872. JMLR Workshop and Conference
Proceedings, May 2013. 41, 42
[40] David Goldberg, David Nichols, Brian M. Oki, and Douglas Terry. Using collaborative filtering
to weave an information tapestry. Communications of the ACM, 35:61–70, December 1992. ISSN
0001-0782. 2, 7, 9
[41] Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time
collaborative filtering algorithm. Information Retrieval, 4(2):133–151, July 2001. ISSN 1386-4564.
83
[42] Judy Goldsmith and Robert H. Sloan. The AI conference paper assignment problem. In AAAI-07
Workshop on Preference Handling in AI, pages 53–57, Vancouver, 2005. 37, 66, 67, 68
[43] Yuhong Guo. Active instance sampling via matrix partition. In J. Lafferty, C. K. I. Williams,
J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing
Systems 23, pages 802–810. 2010. 23
[44] Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In J.C. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems
20, pages 593–600. MIT Press, Cambridge, MA, 2008. 23, 84
[45] Abhay Harpale and Yiming Yang. Personalized active learning for collaborative filtering. In SIGIR,
pages 91–98, 2008. 19
[46] Günter Hitsch and Ali Hortaçsu. What makes you click? An empirical analysis of online dating.
2005 Meeting Papers 207, Society for Economic Dynamics, 2005. 24
[47] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational
inference. J. Mach. Learn. Res., 14(1):1303–1347, May 2013. ISSN 1532-4435. 61
[48] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR ’99, pages 50–57, New York, NY, USA, 1999. ACM. ISBN 1-58113-096-1. 12
[49] Thomas Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22
(1):89–115, 2004. 11, 12
[50] R.A. Howard. Information value theory. Systems Science and Cybernetics, IEEE Transactions on,
2(1):22–26, 1966. ISSN 0536-1567. doi: 10.1109/TSSC.1966.300074. 22
[51] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets.
In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, pages
263–272, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3502-9. 92
[52] Aanund Hylland and Richard J. Zeckhauser. The efficient allocation of individuals to positions.
Journal of Political Economy, 87(2):293–314, 1979. ISSN 0022-3808. 24, 25
[53] Mohsen Jamali and Martin Ester. Trustwalker: a random walk model for combining trust-based
and item-based recommendation. In Proceedings of the 15th ACM SIGKDD international confer-
ence on Knowledge discovery and data mining, KDD ’09, pages 397–406, New York, NY, USA,
2009. ACM. ISBN 978-1-60558-495-9. 40
[54] Mohsen Jamali and Martin Ester. A matrix factorization technique with trust propagation for
recommendation in social networks. In Proceedings of the fourth ACM conference on Recommender
systems, RecSys ’10, pages 135–142, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0.
40, 41
[55] Tamas Jambor and Jun Wang. Optimizing multiple objectives in collaborative filtering. In ACM
Recommender Systems, 2010. 18
[56] Tamas Jambor and Jun Wang. Goal-driven collaborative filtering: A directional error based
approach. In Proc. of European Conference on Information Retrieval (ECIR), 2010. 15
[57] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant doc-
uments. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and
Development in Information Retrieval, SIGIR ’00, pages 41–48, New York, NY, USA, 2000. ACM.
16, 56
[58] Rong Jin and Luo Si. A Bayesian approach toward active learning for collaborative filtering.
In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI-04), pages
278–285. AUAI Press, 2004. 19
[59] Maryam Karimzadehgan and ChengXiang Zhai. Integer linear programming for constrained multi-
aspect committee review assignment. Inf. Process. Manage., 48(4):725–740, July 2012. ISSN 0306-
4573. doi: 10.1016/j.ipm.2011.09.004. URL http://dx.doi.org/10.1016/j.ipm.2011.09.004.
68
[60] Maryam Karimzadehgan, ChengXiang Zhai, and Geneva Belford. Multi-aspect expertise matching
for review assignment. In Proceedings of the 17th ACM conference on Information and knowledge
management, CIKM ’08, pages 1113–1122, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-
991-3. 67
[61] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model.
In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and
data mining, KDD ’08, pages 426–434, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-193-4.
40, 41
[62] Yehuda Koren, Robert M. Bell, and Chris Volinsky. Matrix factorization techniques for recom-
mender systems. IEEE Computer, 42(8):30–37, 2009. 8, 13
[63] Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. DiscLDA: Discriminative learning for dimen-
sionality reduction and classification. In NIPS, 2008. 43
[64] Helge Langseth and Thomas Dyhre Nielsen. A latent model for collaborative filtering. Int. J.
Approx. Reasoning, 53(4):447–466, June 2012. ISSN 0888-613X. 13
[65] Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann
machines. In Proceedings of the Twenty-fifth International Conference on Machine Learning, ICML
’08, pages 536–543, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. 34
[66] Neil D. Lawrence and Raquel Urtasun. Non-linear matrix factorization with Gaussian processes. In
Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages
601–608, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. 8, 12, 41, 42
[67] David D. Lewis. A sequential algorithm for training text classifiers: corrigendum and additional
data. SIGIR Forum, 29:13–19, September 1995. ISSN 0163-5840. 19
[68] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings
of KDD Cup and Workshop, 2007. 10
[69] Nathan N. Liu and Qiang Yang. EigenRank: A ranking-oriented approach to collaborative filtering.
In Proceedings of the 31st annual international ACM SIGIR conference on Research and develop-
ment in information retrieval, SIGIR ’08, pages 83–90, New York, NY, USA, 2008. ACM. ISBN
978-1-60558-164-4. 17, 18
[70] Hao Ma, Haixuan Yang, Michael R. Lyu, and Irwin King. SoRec: Social recommendation using
probabilistic matrix factorization. In Proceedings of the 17th ACM conference on Information
and knowledge management, CIKM ’08, pages 931–940, New York, NY, USA, 2008. ACM. ISBN
978-1-59593-991-3. 40, 41
[71] Benjamin Marlin. Modeling user rating profiles for collaborative filtering. In Advances in Neural
Information Processing Systems 16 (NIPS), 2003. 12
[72] Benjamin Marlin. Collaborative filtering: A machine learning perspective. Technical report,
University of Toronto, 2004. 8, 14
[73] Benjamin M. Marlin and Richard S. Zemel. The multiple multiplicative factor model for collabo-
rative filtering. In Carla E. Brodley, editor, ICML, volume 69 of ACM International Conference
Proceeding Series. ACM, 2004. 12, 13
[74] Benjamin M. Marlin and Richard S. Zemel. Collaborative prediction and ranking with non-random
missing data. In Proceedings of the third ACM conference on Recommender systems, RecSys ’09,
pages 5–12, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-435-5. 14
[75] Benjamin M. Marlin, Richard S. Zemel, Sam T. Roweis, and Malcolm Slaney. Collaborative
filtering and the missing at random assumption. In Ronald Parr and Linda C. van der Gaag,
editors, UAI, pages 267–275. AUAI Press, 2007. ISBN 0-9749039-3-0. 14
[76] Paolo Massa and Paolo Avesani. Trust-aware recommender systems. In Proceedings of the 2007
ACM conference on Recommender systems, RecSys ’07, pages 17–24, New York, NY, USA, 2007.
ACM. ISBN 978-1-59593-730-8. 40
[77] B. McFee, T. Bertin-Mahieux, D. Ellis, and G. Lanckriet. The million song dataset challenge. In
Proc. of the 4th International Workshop on Advances in Music Information Research (AdMIRe
’12), April 2012. 42
[78] Lorraine McGinty and Barry Smyth. On the role of diversity in conversational recommender
systems. In Proceedings of the 5th international conference on Case-based reasoning: Research
and Development, ICCBR’03, pages 276–290, Berlin, Heidelberg, 2003. Springer-Verlag. ISBN
3-540-40433-3. 2, 14
[79] David M. Mimno and Andrew McCallum. Expertise modeling for matching papers with reviewers.
In Pavel Berkhin, Rich Caruana, and Xindong Wu, editors, Proceedings of the 13th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (KDD), pages 500–509, San
Jose, California, 2007. ACM. ISBN 978-1-59593-609-7. 29, 31, 48, 53
[80] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental,
sparse, and other variants. In Learning in Graphical Models, pages 355–368. MIT Press,
Cambridge, MA, USA, 1999. ISBN 0-262-60032-3. 45
[81] Fredrik Olsson and Katrin Tomanek. An intrinsic stopping criterion for committee-based active
learning. In Proceedings of the Thirteenth Conference on Computational Natural Language Learn-
ing, CoNLL ’09, pages 138–146, Stroudsburg, PA, USA, 2009. Association for Computational
Linguistics. ISBN 978-1-932432-29-9. 24
[82] G. Papandreou and A. Yuille. Perturb-and-map random fields: Using discrete optimization to
learn and sample from energy models. In Proceedings of the IEEE International Conference on
Computer Vision (ICCV), pages 193–200, Barcelona, Spain, November 2011. 93
[83] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. ISBN 0-934613-73-7. 80
[84] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. Labeled LDA: A
supervised topic model for credit attribution in multi-labeled corpora. In EMNLP, 2009. 43
[85] Jason D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collab-
orative prediction. In Luc De Raedt and Stefan Wrobel, editors, ICML, volume 119 of ACM
International Conference Proceeding Series, pages 713–719. ACM, 2005. ISBN 1-59593-180-5. 11
[86] P. Resnick, N. Iacovou, M. Sushak, P. Bergstrom, and J. Riedl. Grouplens: An open architecture
for collaborative filtering of netnews. In 1994 ACM Conference on Computer Supported Collabora-
tive Work Conference, pages 175–186, Chapel Hill, NC, 10/1994 1994. Association of Computing
Machinery, Association of Computing Machinery. 9
[87] Paul Resnick and Hal R. Varian. Recommender systems. Communications of the ACM, 40(3):
56–58, March 1997. ISSN 0001-0782. 2
[88] Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor, editors. Recommender Systems
Handbook. Springer, 2011. ISBN 978-0-387-85819-7. 1, 2
[89] Philippe Rigaux. An iterative rating method: application to web-based conference management.
In SAC, pages 1682–1687, 2004. 36, 77
[90] Irina Rish and Gerald Tesauro. Active collaborative prediction with maximum margin matrix
factorization. In ISAIM 2008, 2008. 20
[91] Marko A. Rodriguez and Johan Bollen. An algorithm to determine peer-reviewers. In Proceeding
of the 17th ACM Conference on Information and Knowledge Management (CIKM-08), pages 319–
328, Napa Valley, California, USA, 2008. ACM. ISBN 978-1-59593-991-3. 37, 67
[92] Eytan Ronn. NP-complete stable matching problems. J. Algorithms, 11(2):285–304, May 1990.
ISSN 0196-6774. 25
[93] David A. Ross and Richard S. Zemel. Multiple cause vector quantization. In Advances in Neural
Information Processing Systems 15 (NIPS), pages 1017–1024, 2002. 12
[94] Alvin E. Roth. The evolution of the labor market for medical interns and residents: A case study
in game theory. Journal of Political Economy, 92(6):991–1016, 1984. 24, 25
[95] Alvin E. Roth and Elliott Peranson. The redesign of the matching market for american physicians:
Some engineering aspects of economic design. Working Paper 6963, National Bureau of Economic
Research, February 1999. 25
[96] Neil Rubens and Masashi Sugiyama. Influence-based collaborative active learning. In Proceedings
of the 2007 ACM conference on Recommender systems, RecSys ’07, pages 145–148, New York,
NY, USA, 2007. ACM. ISBN 978-1-59593-730-8. 19
[97] Ruslan Salakhutdinov and Geoffrey Hinton. Replicated softmax: an undirected topic model. In
Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in
Neural Information Processing Systems 22 (NIPS), pages 1607–1614. 2009. 34, 55
[98] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov
chain Monte Carlo. In Proceedings of the International Conference on Machine Learning, vol-
ume 25, 2008. 10, 11, 48, 85
[99] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural
Information Processing Systems (NIPS), volume 20, 2008. 10, 40, 41, 48, 52, 55, 61
[100] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for
collaborative filtering. In Proceedings of the International Conference on Machine Learning, vol-
ume 24, pages 791–798, 2007. 13
[101] Andrew Ian Schein. Active learning for logistic regression. PhD thesis, University of Pennsylvania,
Philadelphia, PA, USA, 2005. AAI3197737. 19
[102] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University
of Wisconsin–Madison, 2009. 19, 20, 21, 23, 24, 81
[103] B. Settles, M. Craven, and S. Ray. Multiple-instance active learning. In Advances in Neural
Information Processing Systems (NIPS), volume 20, pages 1289–1296. MIT Press, 2008. 21
[104] Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP
’08, pages 1070–1079, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. 19
[105] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth
Annual Workshop on Computational Learning Theory (COLT), pages 287–294, 1992. 20
[106] Hanhuai Shan and Arindam Banerjee. Generalized probabilistic matrix factorizations for collabo-
rative filtering. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM
’10, pages 1025–1030, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-
4256-0. 41, 42, 52
[107] Yue Shi, Martha Larson, and Alan Hanjalic. List-wise learning to rank with matrix factorization
for collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems,
RecSys ’10, pages 269–272, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-906-0. 17
[108] Malcolm Slaney. Web-scale multimedia analysis: Does content matter? IEEE MultiMedia, 18(2):
12–15, April 2011. ISSN 1070-986X. 42
[109] Alex J. Smola, S. V. N. Vishwanathan, and Quoc V. Le. Bundle methods for machine learning.
In Advances in Neural Information Processing Systems 20 (NIPS), 2007. 17
[110] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y.
Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment
treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
Stroudsburg, PA, October 2013. Association for Computational Linguistics. 92
[111] Nathan Srebro. Learning with matrix factorizations. PhD thesis, Massachusetts Institute of Tech-
nology, Cambridge, MA, USA, 2004. AAI0807530. 9
[112] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Fawcett and Mishra
[34], pages 720–727. ISBN 1-57735-189-4. 10
[113] Nathan Srebro, Jason D. M. Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization.
In Advances in Neural Information Processing Systems (NIPS), 2004. 10, 11, 13
[114] David H. Stern, Ralf Herbrich, and Thore Graepel. Matchbox: large scale online bayesian rec-
ommendations. In Juan Quemada, Gonzalo Leon, Yoelle S. Maarek, and Wolfgang Nejdl, editors,
WWW, pages 111–120. ACM, 2009. ISBN 978-1-60558-487-4. 67
[115] David H. Stern, Horst Samulowitz, Ralf Herbrich, Thore Graepel, Luca Pulina, and Armando
Tacchella. Collaborative expert portfolio management. In Maria Fox and David Poole, editors,
AAAI. AAAI Press, 2010. 67
[116] Johan A. K. Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural
Processing Letters, 9(3):293–300, 1999. 11
[117] Wenbin Tang, Jie Tang, and Chenhao Tan. Expertise matching via constraint-based optimization.
In IEEE/WIC/ACM International Conference on Web Intelligence (WI-10) and Intelligent Agent
Technology (IAT-10), volume 1, pages 34–41, Toronto, Canada, 2010. IEEE Computer Society.
ISBN 978-0-7695-4191-4. 68
[118] Wenbin Tang, Jie Tang, Tao Lei, Chenhao Tan, Bo Gao, and Tian Li. On optimization of expertise
matching with various constraints. Neurocomputing, 76(1):71–83, January 2012. ISSN 0925-2312.
68
[119] Daniel Tarlow. Efficient Machine Learning with High Order and Combinatorial Structures. PhD
thesis, University of Toronto, February 2013. 75
[120] Daniel Tarlow, Ryan Prescott Adams, and Richard S Zemel. Randomized optimum models for
structured prediction. In Proceedings of the 15th Conference on Artificial Intelligence and Statis-
tics, pages 21–23, 2012. 93
[121] Daniel Tarlow, Kevin Swersky, Richard S Zemel, Ryan P Adams, and Brendan J Frey. Fast exact
inference for recursive cardinality models. In Proceedings of the 28th Conference on Uncertainty
in Artificial Intelligence, 2012. 80
[122] Camillo J. Taylor. On the optimal assignment of conference papers to reviewers. Technical Report
MS-CIS-08-30, University of Pennsylvania, 2008. 36, 64, 66, 68
[123] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector
machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-
First International Conference on Machine Learning, ICML’04, pages 104–, New York, NY, USA,
2004. ACM. ISBN 1-58113-838-5. 93
[124] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc.,
New York, NY, USA, 1995. ISBN 0-387-94559-8. 7
[125] Maksims Volkovs and Rich Zemel. Collaborative ranking with 17 parameters. In P. Bartlett, F.C.N.
Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information
Processing Systems 25, pages 2303–2311. 2012. 17
[126] Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles.
In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and
data mining, KDD ’11, pages 448–456, New York, NY, USA, 2011. ACM. 14, 40, 41, 42, 48, 52,
55
[127] Chong Wang and David M. Blei. Variational inference in nonconjugate models. Journal of Machine
Learning Research, 14(1):1005–1031, April 2013. 44, 60
[128] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alex J. Smola. Cofi rank - maximum
margin matrix factorization for collaborative ranking. In NIPS, 2007. 14, 16, 17
[129] Markus Weimer, Alexandros Karatzoglou, and Alex Smola. Adaptive collaborative filtering. In
Proceedings of the 2008 ACM conference on Recommender systems, RecSys ’08, pages 275–282,
New York, NY, USA, 2008. ACM. ISBN 978-1-60558-093-7. 11, 17
[130] Jason Weston, Chong Wang, Ron Weiss, and Adam Berenzweig. Latent collaborative retrieval. In
ICML. icml.cc / Omnipress, 2012. 41, 42
[131] Zuobing Xu, Ram Akella, and Yi Zhang 0001. Incorporating diversity and density in active learning
for relevance feedback. In Giambattista Amati, Claudio Carpineto, and Giovanni Romano, editors,
ECIR, volume 4425 of Lecture Notes in Computer Science, pages 246–257. Springer, 2007. ISBN
978-3-540-71494-1. 23
[132] Kai Yu, Anton Schwaighofer, Volker Tresp, Xiaowei Xu, and Hans-Peter Kriegel. Proba-
bilistic memory-based collaborative filtering. IEEE Trans. on Knowl. and Data Eng., 16
(1):56–69, January 2004. ISSN 1041-4347. doi: 10.1109/TKDE.2004.1264822. URL
http://dx.doi.org/10.1109/TKDE.2004.1264822. 23
[133] ChengXiang Zhai and John Lafferty. A study of smoothing methods for language models applied
to information retrieval. ACM Trans. Inf. Syst., 22(2):179–214, April 2004. ISSN 1046-8188. 31