


Online Learning in Large-Scale Contextual Recommender Systems

Linqi Song, Cem Tekin, and Mihaela van der Schaar, Fellow, IEEE

Abstract—In this paper, we propose a novel large-scale, context-aware recommender system that provides accurate recommendations, scalability to a large number of diverse users and items, differential services, and does not suffer from "cold start" problems. Our proposed recommendation system relies on a novel algorithm which learns online the item preferences of users based on their click behavior, and constructs online item-cluster trees. The recommendations are then made by choosing an item-cluster level and then selecting an item within that cluster as a recommendation for the user. This approach is able to significantly improve the learning speed when the number of users and items is large, while still providing high recommendation accuracy. Each time a user arrives at the website, the system makes a recommendation based on the estimations of item payoffs by exploiting past context arrivals in a neighborhood of the current user's context. It exploits the similarity of contexts to learn how to make better recommendations even when the number and diversity of users and items is large. This also addresses the cold start problem by using the information gained from similar users and items to make recommendations for new users and items. We theoretically prove that the proposed algorithm for item recommendations converges to the optimal item recommendations in the long run. We also bound the probability of making a suboptimal item recommendation for each user arriving to the system while the system is learning. Experimental results show that our approach outperforms the state-of-the-art algorithms by over 20 percent in terms of click-through rates.

Index Terms—Recommender systems, online learning, clustering algorithms, multi-armed bandit


1 INTRODUCTION

ONLINE websites currently provide a vast number of products or services to their users. Recommender systems have emerged as an essential method to assist users in finding desirable products or services from large sets of available products or services [1], [2]: movies at Netflix, products at Amazon, news articles at Yahoo!, advertisements at Google, reviews in Yelp, etc. Moreover, Yahoo! Developer Network, Amazon's Amazon Web Services, Google's Google Developers and numerous other major companies also offer Web services that operate on their data, and recommendation ability is often provided to developers as an essential service functionality.

The products or services provided by the recommender systems are referred to generically as items [1], [2]. The performance, or payoff, of a recommendation is based on the response of the user to the recommendation. For example, in news recommendations, payoffs are measured by users' click behaviors (e.g., 1 for a click and 0 for no click), and the average payoff when a webpage is recommended is known as the Click-Through Rate (CTR) [6]. The goal of recommendations is to maximize the payoffs over all users that come to the website.
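As a concrete illustration of this payoff model, the empirical CTR of an item is simply the running average of its binary click feedback. A minimal Python sketch (the class and item names here are our own illustration, not from the paper):

```python
from collections import defaultdict

class CTREstimator:
    """Running click-through-rate estimate per item (payoff: 1 = click, 0 = no click)."""
    def __init__(self):
        self.clicks = defaultdict(int)   # total clicks observed per item
        self.shows = defaultdict(int)    # total times each item was recommended

    def update(self, item, clicked):
        self.shows[item] += 1
        self.clicks[item] += int(clicked)

    def ctr(self, item):
        # Empirical CTR; 0.0 before any observation.
        return self.clicks[item] / self.shows[item] if self.shows[item] else 0.0

est = CTREstimator()
for clicked in [1, 0, 1, 1]:
    est.update("article-7", clicked)
print(est.ctr("article-7"))  # 0.75
```

Such per-item averages are the basic statistic a non-contextual recommender maintains; the algorithm proposed in this paper instead aggregates payoff estimates per item cluster and per context neighborhood.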

Although recommender systems have been deployed in many application areas during the past decade, the tremendous increase in both the number and diversity of users as well as items poses several key new challenges, including dealing with heterogeneous user characteristics and item sets, the existence of different user types such as registered and non-registered users, the large number of items and users, and the cold start problem. We address all these challenges by designing online learning algorithms with three properties: contextualization, differential services and scalability.

(a) Contextualization has the potential to better predict or anticipate the preferences of users. First-generation recommender systems consider the expected payoff of each recommendation to only be a function of the item [4], [6], [8], [9]. However, such recommender systems model the preferences of users with limited accuracy and thus have limited performance. Several recent works [5], [18], [23] show that incorporating the context in which a recommendation is made can greatly improve the accuracy of predicting how users respond to various recommendations. Moreover, contextualization allows the recommender to learn together for groups of users having similar contexts, which significantly speeds up the learning process.

The context is defined generally as an aggregate of various categories that describe the situations or circumstances of the user when a recommendation is made [18], [19], such as time, location, search query, purchase history, etc. When contexts are incorporated into the recommender systems, the expected payoff of a recommendation can be modeled as a function of the item and the context. If the number of contexts, X, is finite and small, we can run X recommenders, one for each context, to tackle this issue. However, when there is a large or infinite number of contexts, this method fails since the number of recommenders increases significantly. Hence, designing and building effective contextual

• The authors are with the Department of Electrical Engineering, University of California-Los Angeles, Los Angeles, CA 90095. E-mail: [email protected], [email protected], [email protected].

Manuscript received 7 May 2013; revised 20 Sept. 2014; accepted 25 Oct. 2014. Date of publication 29 Oct. 2014; date of current version 15 June 2016. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TSC.2014.2365795

IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 9, NO. 3, MAY/JUNE 2016 433

1939-1374 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


recommender systems with large or infinite numbers of contexts remains a key challenge.

In this paper we propose methods for contextualization which help the recommender to learn from those of its past experiences that are relevant to the current user characteristics, in order to recommend an item to the current user.

(b) Differential services are important for online services, including recommender systems. Existing works evaluate recommender systems based on their average performance [22], [23], [29]. However, it may also be desirable for service providers to build customer loyalty by providing registered users with a higher quality of service (in this case better recommendations) than the quality provided to anonymous users using the system. Hence, designing recommender systems which have the ability to provide differential services to various users represents a key challenge.

(c) Scalability is a key challenge faced by a recommender system with a large number of items and/or contexts. If not designed properly, the learning speed and the implementation complexity of such systems can suffer significantly [3], [7], [10]. Hence, the recommender system needs to easily scale to accommodate a large number of items and contexts.

Another important challenge is cold start, which appears in practice when the system has a very diverse set of items and very limited or no a priori information about the items that the users prefer [3], [7]. In this case, recommendation algorithms [2], [6], [9] fail to recommend items that are most liked by users due to the limited information about the user's preference for specific items. This problem is also often referred to as the "coverage" problem [3], [7]. New items are unlikely to be recommended and new users are unlikely to be given good recommendations due to the limited interaction histories of these new items/users. Therefore, dealing with cold start remains a key challenge for large-scale recommender systems. Two important features of our proposed algorithm address the issue of cold start: contextualization and continuous exploration. By contextualization, our algorithm clusters the users according to their contexts and items according to their features, such that past information gained from users with similar contexts and items with similar features can be used to solve the cold start problem when a new user comes or a new item is introduced. Moreover, our algorithm explores continuously, i.e., it does not explore only initially based on limited data and then exploit for the rest of the users. It explores and exploits in an alternating fashion, and keeps learning as new users come. This continuous exploration addresses the cold start problem since it helps the recommender to learn, over time, the payoffs of new items and the preferences of new users that are not similar to the previous ones.

In this paper, we propose a novel, large-scale contextual recommender system which is able to address all of the above mentioned challenges. Our system is based on a novel contextual Multi-Armed Bandit (MAB) approach. When a user with a specific context arrives, the recommender utilizes information from past arrivals within a neighborhood of the current context (i.e., "similar" to the current context). Based on the payoff estimations through these past arrivals, the recommendation is made at an item-cluster level, instead of an item level. Our approach controls the size of the context neighborhood and the set of clusters to obtain sufficiently accurate and efficient payoff estimations. To learn about new items and contexts, our algorithms use a differential method for exploration: the system explores more often when the users have a lower priority, at the risk of providing lower quality recommendations, and explores less in order to provide consistently high quality recommendations when the users have a higher priority. After each recommendation, the realized payoff is received by the recommender system. Each item cluster corresponds to an action (referred to as the arm in a bandit framework) for our learning algorithm; hence learning is performed at the cluster level, and this solves the scalability problem.

The goal of learning is to minimize the regret, which is the difference between the optimal expected total payoff up to time T, which can be achieved if all information (i.e., expected payoffs of each item/cluster in a certain context) is known, and the expected total payoff gained up to time T through our learning algorithm, in which the information is incomplete.
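Under this definition, the cumulative regret up to time T is the sum of per-round gaps between the best achievable expected payoff and the expected payoff of the item actually recommended. A small illustrative sketch (the function name and numbers are hypothetical):

```python
def regret(optimal_means, chosen_means):
    """Cumulative regret up to time T: sum over rounds t of (best expected
    payoff at round t) minus (expected payoff of the recommended item)."""
    return sum(opt - chosen for opt, chosen in zip(optimal_means, chosen_means))

# Three rounds: the best item always has expected payoff 1.0;
# the learner's picks had expected payoffs 1.0, 0.5, 0.75.
print(regret([1.0, 1.0, 1.0], [1.0, 0.5, 0.75]))  # 0.75
```

A good learning algorithm makes this quantity grow sublinearly in T, so the per-round regret vanishes in the long run.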

The remainder of this paper is organized as follows. In Section 2, we discuss the related works and highlight the differences from our work. Section 3 describes the system model of the recommender system. Section 4 introduces the adaptive clustering based contextual MAB recommendation algorithm. Section 5 provides simulation results. Section 6 concludes the paper.

2 RELATED WORKS

A plethora of prior works exists on recommendation algorithms. We compare our work with existing works in Table 1. We categorize these algorithms based on the following characteristics: their consideration of contexts; their refinement of the context space (if any); their ability to address the issues of "scalability", "cold start", and "differential services"; and their ability to operate without a period of training. Two main categories are filtering-based and reinforcement learning methods [3]. Filtering-based approaches, such as collaborative filtering [7], [8], content-based filtering [2], [9] and hybrid approaches [10], [11], employ the historical data of users' feedback to calculate the future payoffs of users based on some prediction functions. Collaborative filtering approaches predict users' preferences based on the preferences of their "peer" users, who have similar interests on other items observed through historical information [7], [8]. Content-based filtering methods help certain users to find the items that are similar to the past selected items [2], [9]. Hybrid approaches combine the collaborative filtering and content-based filtering approaches to gather more historical information in the user-item space when a new user or a new item appears in the system [10], [11]. However, filtering-based approaches are unable to solely tackle the issues of cold start and scalability. Although the hybrid approaches can alleviate the cold start problem to some extent due to the consideration of both similar items and similar users when a new item or user appears, they cannot solve the cold start problem when no similar items/users of a new item/user exist in the system.

TABLE 1
Comparison with Existing Algorithms

Algorithms                    | Contexts | Refinement in the context space | Solving "scalability" | Solving "cold start" | Requiring training data | Differential services
[2] [4] [6] [8] [9] [11] [16] | No       | -                               | No                    | No                   | Yes                     | No
[5] [21]                      | Yes      | Partition                       | No                    | No                   | Yes                     | No
[23] [24]                     | Yes      | Linear payoff assumption        | No                    | Yes                  | No                      | No
[25]                          | Yes      | Partition                       | No                    | Yes                  | Yes                     | No
[27] [28]                     | No       | -                               | Yes                   | Yes                  | No                      | No
[29]                          | Yes      | Partition                       | Yes                   | Yes                  | Yes                     | No
Our work                      | Yes      | An adaptive neighborhood        | Yes                   | Yes                  | No                      | Yes

Reinforcement learning methods, such as MAB [23], [24], [25] and Markov Decision Processes (MDPs) [26], are widely used to derive algorithms for recommending items. MDP-based learning approaches model the last k selections of a user as the state and the available items as the action set, aiming at maximizing the long-term payoff [26]. However, the state set will grow fast as the number of items increases, thereby resulting in very slow convergence rates. In [32], "state aggregation" methods are used to solve this MDP problem with a large state set, but the incurred regret grows linearly in time. MAB-based approaches [12], [13], [15], [17], such as ε-greedy [17] and UCB1 [15], provide convergence to the optimum, and not only asymptotic convergence, but also a bound on the rate of convergence for any time step. They do this by balancing exploration and exploitation, where exploration involves learning about new items' payoffs for a particular user by recommending new items, while exploitation means recommending the best items based on the observations made so far.
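For reference, the UCB1 index mentioned above selects the arm maximizing the empirical mean plus an exploration bonus of sqrt(2 ln t / n_i). The following is a self-contained sketch of generic UCB1 (the classic algorithm of [15], not this paper's cluster-level algorithm; the Bernoulli simulation setup is our own illustration):

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1: play each arm once, then repeatedly pick the arm maximizing
    (empirical mean) + sqrt(2 * ln t / n_i). Returns per-arm pull counts."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: try every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

random.seed(0)
means = [0.3, 0.5, 0.8]  # hypothetical Bernoulli CTRs of three items
counts = ucb1(lambda a: float(random.random() < means[a]), 3, 2000)
print(counts)  # the best arm tends to accumulate most of the pulls
```

The exploration bonus shrinks as an arm is pulled more, which is exactly the exploration/exploitation balance the text describes; the contextual, cluster-level algorithm of this paper builds on the same principle.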

To further improve the effectiveness of recommendations, contexts are considered in recent recommender systems [18], [21], [22]. For instance, it is important to consider the search queries and the purchase histories (which can be considered as the contexts) of users when making news article recommendations. Moreover, in many applications, such as news recommendations, the recommender system only observes the features of anonymous users, without knowing the exact users. In this case, the users' features (cookies) can also be considered as contexts. The context-aware collaborative-filtering (CACF) and the contextual MAB methods have been studied to utilize such information [21], [23], [25], [29]. The CACF algorithm extends the collaborative filtering method to a contextual setting by considering different weights for different contexts [21]. In [23], [24], the payoff of a recommendation is modeled as a linear function of the context with an unknown weight vector, and the LinUCB algorithm is proposed to solve news article recommendation problems. However, the linearity may not hold in practice for some recommender systems [22], [25], [29]. In a more general setting [25], the context space is modeled as a bounded metric space in which the payoff function is assumed to satisfy a Lipschitz condition, instead of being a linear function of contexts. In this context space, partition based algorithms are proposed to solve the contextual recommendation problems [25]. These works [18], [22], [25], however, do not take into account the large item space, which can cause scalability problems in practice.

A different strand of related works studies recommender systems that have a large number of items to recommend [3], [7], [27], [28]. In [28], the item space is partitioned into subspaces, and online learning is performed in each subspace. Alternatively, in [27], the problem is solved by considering a static clustering of items. These works [27], [28] can solve the scalability problem by reducing the number of choices of each recommendation. However, these works [3], [7], [27], [28] do not incorporate contexts in the recommendations. In contrast, we consider the contexts when making recommendations.

Building large-scale contextual recommender systems is more challenging. Few related works tackle both problems in an integrated manner. The most relevant theoretical works related to our proposed solution are [25], [29]; however, there are significant differences between our work and these existing works. First, the works in [25], [29] are theoretical and do not consider the key characteristics and requirements of recommender systems. For example, the algorithm proposed in [29] requires the knowledge of Lipschitz constants, which makes the algorithm data-dependent (different datasets may have different Lipschitz constants based on the characteristics of the data) and hence difficult to implement in practice; the algorithm in [25] requires knowing the covering constants of context and item spaces. In contrast, our learning algorithm does not need the Lipschitz constant or covering constants to operate; these constants are only used to analyze the performance of the algorithm (i.e., to show the regret). Second, the models in [25], [29] consider only two spaces: the context space and the item space; while in our work, we consider the user space, the context space, and the item space. Hence, we can provide differential services for different types of users by choosing different trade-offs between exploitation and exploration. Third, in [29], combining the items and contexts greatly increases the space dimension, resulting in a highly-complex online algorithm. In our work, we greatly reduce the dimension of the spaces by separately considering the context and the item space. We build the item-cluster tree offline, and only perform online calculations in the context space, resulting in lower-complexity online learning compared with that in [29].

In existing works [25], [27], [28], [29], the algorithms operate by fixing a partition of the context space. In contrast, our approach selects a dynamic subspace in the context space each time to ensure a sufficiently accurate estimation of the expected payoffs with respect to the current context, with a precise control of the size of the context

SONG ET AL.: ONLINE LEARNING IN LARGE-SCALE CONTEXTUAL RECOMMENDER SYSTEMS 435


subspace. Moreover, our approach can provide differential services (different instantaneous performance) to ensure the satisfaction of higher-type users, which were not considered before in existing works [4], [5], [6], [8], [9], [21], [23], [27], [28], [29]. A detailed comparison with the existing works is presented in Table 1.

3 RECOMMENDER SYSTEM MODEL

We consider the contextual recommender system shown in Fig. 1, which operates in discrete time slots t = 1, 2, …. At each time slot, the system operates as follows: (1) a user with a specific context arrives; (2) the recommender system observes the context, and recommends to the user an item from the set of items, based on the current context as well as the historical information about users, contexts, items, and payoffs of former recommendations; and (3) the payoff of this recommendation, whose expectation depends on the item and context, is received by the recommender system according to the user's response.

Users: We formally model the set of users as 𝒰, which can be either a finite set (i.e., 𝒰 = {1, 2, …, U}) or an infinite set (i.e., 𝒰 = {1, 2, …}, e.g., a news recommender system frequented by an infinite number of users). When a user u ∈ 𝒰 arrives, the system can observe the user's type: important or ordinary users, registered or anonymous users, etc. We divide the users into S types, and denote the set of type s ∈ {1, 2, …, S} users by 𝒰_s. Based on the users' types, the recommender system can provide differential services for users.

Contexts: We model the context set 𝒳 as a d_C-dimensional continuous space, and the context x ∈ 𝒳 is a d_C-dimensional vector. For simplicity of exposition, we assume that the context space is 𝒳 = [0, 1]^{d_C}, but our results will hold for any bounded d_C-dimensional context space. In the context space, the similarity distance s_C is defined as a metric s_C : 𝒳 × 𝒳 → [0, ∞). A smaller s_C(x, x′) indicates that the two contexts x and x′ exhibit higher similarity. In the Euclidean space 𝒳, the metric s_C can be the Euclidean norm or any other norm. Note that we will add the subscript t to user u and context x when referring to the user and context associated with time period t.
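For instance, with the Euclidean metric on 𝒳 = [0, 1]^{d_C}, the past arrivals "in a neighborhood of the current context" are simply those within some radius of it. A minimal sketch (the fixed radius and helper names are our own illustration; the actual algorithm controls the neighborhood size adaptively):

```python
import math

def s_C(x, x2):
    """Euclidean similarity distance between two contexts in [0,1]^d_C."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x2)))

def neighborhood(past_contexts, x, radius):
    """Indices of past context arrivals within `radius` of the current context x."""
    return [t for t, xt in enumerate(past_contexts) if s_C(xt, x) <= radius]

# Two past arrivals are close to the current context; one is far away.
past = [(0.10, 0.20), (0.12, 0.22), (0.90, 0.90)]
print(neighborhood(past, (0.11, 0.21), 0.05))  # [0, 1]
```

Only the payoff observations indexed by this neighborhood would then be used to estimate the expected payoffs for the current context.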

Items: We denote the set of items by ℐ = {1, 2, …, I}. In the item space, the similarity distance is defined as a metric s_I : ℐ × ℐ → ℝ, which is based on the features of the items and known to the recommender system, as in [10], [22], [29]. A smaller s_I(i, i′) indicates that the two items i and i′ exhibit higher similarity. For example, in a news recommender system, an item is a news item, and the similarity between two news items depends on their features: whether they are in the same category (e.g., Local News), whether they have the same keywords (e.g., Fire), etc.

Since the number of items in the system is large, we cluster them based on the metric s_I in the item space in order to reduce the possible options for recommendation. We use an item-cluster tree to represent the hierarchical relationship among item clusters, and an example is shown in Fig. 2. In an item-cluster tree, each leaf represents an item and each node represents an item cluster. The depth of a node is defined as the length of the path to the root node, and the depth of the tree is defined as the maximum depth of all nodes in the tree. We denote the depth of the tree by L. We define the collection of nodes at depth 0 ≤ l ≤ L of the tree as layer l, and a node¹ at layer l is denoted by C_{l,k}, where k ∈ {1, 2, …, K_l} and K_l is the number of nodes at depth l. There is only one node C_{0,1} at layer 0, i.e., the root node. The I nodes at layer L are the leaves of the tree, each of which is a single-element cluster that only contains one item. The hierarchical structure of the cluster tree implies that for a child node C_{l+1,k′} of node C_{l,k}, we have item i ∈ C_{l,k} for all i ∈ C_{l+1,k′}.

The cluster size s(C_{l,k}) is defined as the maximum distance of two items in that cluster, namely, s(C_{l,k}) = max_{i,i′∈C_{l,k}} s_I(i, i′). In an item-cluster-tree structure, a general tree metric can be defined as a mapping from the set of depths of the nodes/clusters in the tree to the maximum size of clusters, i.e., s_T : ℕ → ℝ. Thus, s_T(l) denotes the maximum cluster size at depth l, namely, s_T(l) ≥ max_{i,i′∈C_{l,k}, 1≤k≤K_l} s_I(i, i′) for any cluster at depth l. For a tree metric, the maximum cluster size at depth l is not smaller than that at depth l+1, i.e., s_T(l) ≥ s_T(l+1). We can see from Fig. 2 that item 6 and item 7 are in the same cluster C_{2,3}; item 6 and item 9 are in the same cluster C_{1,2}; and the metric between item 6 and item 7 is smaller than that between item 6 and item 9, namely, s_I(6, 7) < s_I(6, 9). In Section 4, we will provide an algorithm to construct such item-cluster trees. Using the constructed item-cluster tree, we can control the size and number of the clusters when the recommendations are made.
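The cluster-size definition above can be sketched directly. The following hypothetical Python representation (our own, not the paper's data structure) stores each node C_{l,k} with its depth and covered items, and computes s(C_{l,k}) = max_{i,i′∈C_{l,k}} s_I(i, i′) by brute force over item pairs; a toy 1-D item metric is assumed:

```python
class ClusterNode:
    """A node C_{l,k} of the item-cluster tree; leaves hold a single item."""
    def __init__(self, depth, items, children=None):
        self.depth = depth          # l: path length to the root
        self.items = items          # all items covered by this cluster
        self.children = children or []

def cluster_size(node, s_I):
    """s(C_{l,k}) = max over item pairs in the cluster of s_I(i, i')."""
    return max((s_I(i, j) for i in node.items for j in node.items), default=0.0)

# Toy 1-D item features: the item metric is the absolute difference.
s_I = lambda i, j: abs(i - j)
leaf6, leaf7 = ClusterNode(2, [6]), ClusterNode(2, [7])
c23 = ClusterNode(1, [6, 7], [leaf6, leaf7])
print(cluster_size(c23, s_I))  # 1
```

Leaves have cluster size 0 by definition, and merging clusters can only increase the size, which is consistent with s_T(l) ≥ s_T(l+1).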

Fig. 1. Contextual recommender system model.
Fig. 2. Construction of item-cluster tree.

1. Note that each node of the tree corresponds to an item cluster, so we will use the terms "node" and "cluster" and their notations interchangeably.



The expected payoff of a recommendation depends on the context and the item. For a user with context x ∈ 𝒳, the payoff of recommending item i is denoted by a random variable r_{i,x} ∈ [0, 1], whose distribution is fixed but unknown to the recommender system. The expectation of r_{i,x} is denoted by μ_{i,x}. Given the number of items n_{l,k} in cluster C_{l,k}, the expected payoff of cluster C_{l,k} for context x is denoted by μ_{l,k,x} = Σ_{i∈C_{l,k}} μ_{i,x} / n_{l,k}. Only the payoffs of the recommended items can be observed by the recommender system and can be used for further recommendations.
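The cluster-level expected payoff is thus just the average of the item-level expected payoffs over the cluster, as a quick sketch shows (the payoff values μ_{i,x} are hypothetical):

```python
def cluster_payoff(item_means):
    """mu_{l,k,x} = (sum of mu_{i,x} over items i in C_{l,k}) / n_{l,k},
    for a fixed context x."""
    return sum(item_means) / len(item_means)

# Hypothetical expected payoffs mu_{i,x} of the three items in one cluster,
# all evaluated at the same context x.
print(cluster_payoff([0.25, 0.5, 0.75]))  # 0.5
```

Treating each cluster as one arm with this averaged payoff is what lets the learner operate over K_l arms at layer l instead of I individual items.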

Context-aware collaborative-filtering algorithms consider that an item has similar expected payoffs in similar contexts [18], [19], [21], [22], while contextual bandit algorithms use a Lipschitz condition to model the similar expected payoffs of an item in two similar contexts [25], [29]. As in [25], [29], we formalize this in terms of a Lipschitz condition as follows.

Assumption 1 (Lipschitz condition for contexts). Given two contexts x, x′ ∈ 𝒳, we have |μ_{i,x} − μ_{i,x′}| ≤ L_C s_C(x, x′) for any item i, where L_C is the Lipschitz constant in the context space.

Similarly to [28], [29], we model the similar payoffs for similar items in the same context using a Lipschitz condition, described in Assumption 2.

Assumption 2 (Lipschitz condition for items). Given two items $i, i' \in \mathcal{I}$, we have the following assumption: $|\mu_{i,x} - \mu_{i',x}| \leq L_I \sigma_I(i, i')$ for any context $x \in \mathcal{X}$, where $L_I$ is the Lipschitz constant in the item space.

Based on Assumption 2, in an item-cluster tree with tree metric $\sigma_T(\cdot)$, we have $|\mu_{i,x} - \mu_{i',x}| \leq L_I \sigma_T(l)$ for any $x \in \mathcal{X}$ and $i, i' \in C_{l,k}$. Note that the Lipschitz constants $L_C$ and $L_I$ are not required to be known by our recommendation algorithm. They will only be used in quantifying its performance.

4 RECOMMENDATION ALGORITHM

In this section, we first propose the item-cluster tree construction method to cluster the items, then propose the Adaptive Clustering Recommendation (ACR) algorithm and show its performance bound in terms of regret.

4.1 Construction of Item-Cluster Tree (CON-TREE)

We first cluster the items using an item-cluster tree, such that the recommendations are made at a cluster level instead of an item level, which reduces the possible recommendation options (arms) to solve the scalability problem. To explicitly characterize the hierarchy of the item-cluster tree, we denote the tree by a tuple $\langle L, \{\mathcal{K}_l\}_{0 \leq l \leq L}, \{\mathrm{Chd}(C_{l,k})\}_{0 \leq l \leq L, 1 \leq k \leq K_l}, \{\mathrm{Prt}(C_{l,k})\}_{0 \leq l \leq L, 1 \leq k \leq K_l} \rangle$, where $L$ is the depth of the tree; $\mathcal{K}_l = \{C_{l,k} : 1 \leq k \leq K_l\}$ is the cluster set at layer $l$, and $K_l = |\mathcal{K}_l|$ is the cardinality of $\mathcal{K}_l$; $\mathrm{Chd}(C_{l,k})$ is the set of child clusters of cluster $C_{l,k}$ (a subset of $\mathcal{K}_{l+1}$ for $0 \leq l < L$, and $\emptyset$ for $l = L$); and $\mathrm{Prt}(C_{l,k})$ is the set of parent clusters of cluster $C_{l,k}$ (a one-element subset of $\mathcal{K}_{l-1}$ for $0 < l \leq L$, and $\emptyset$ for $l = 0$).

In this paper, we adopt an exponential tree metric: $\sigma_T(l) = b^l$, where $b \in (0.5, 1)$ is the constant base. Our goal is to construct an item-cluster tree that fulfils this tree metric, based on an arbitrary given metric among items.2 The constructed item-cluster tree will be used in the recommendation algorithm. The Construction of Item-Cluster Tree algorithm is described in Algorithm 1, and an example is shown in Fig. 2.

Algorithm 1. Construction of Item-Cluster Tree

Input: item set $\mathcal{I} = \{1, 2, \ldots, I\}$, base $b$, depth $L$ of the tree.
Initialization: Set $L := \min\{L, L_b\}$, where $L_b$ is the maximum $l$ such that $b^{l-1} \geq \min_{i,i' \in \mathcal{I}, \sigma_I(i,i') \neq 0} \sigma_I(i,i')$.
1: set the node set at layer $L$ as $\mathcal{K}_L := \{C_{L,1}, C_{L,2}, \ldots, C_{L,K_L}\}$, where $C_{L,k} := \{k\}$, $\forall 1 \leq k \leq K_L$, and $K_L = I$.
2: set the representative point set at layer $L$ as $RP_L := \{i_{L,1}, i_{L,2}, \ldots, i_{L,K_L}\}$, where $i_{L,k} := k$, $\forall 1 \leq k \leq K_L$.
3: set the child sets of nodes at layer $L$ as empty sets: $\mathrm{Chd}(C_{L,k}) := \emptyset$, $\forall 1 \leq k \leq K_L$.
4: for $l = (L-1):0$ do
5:   set $k := 0$. //counter for the number of clusters at layer $l$
6:   while $RP_{l+1} \neq \emptyset$ do
7:     set $k := k + 1$.
8:     arbitrarily select an element $i_{l+1,\hat{k}} \in RP_{l+1}$.
9:     find the subset $SK$ of $\mathcal{K}_{l+1}$ such that $SK := \{C_{l+1,k'} : 1 \leq k' \leq K_{l+1}, i_{l+1,k'} \in RP_{l+1}, \sigma_I(i_{l+1,\hat{k}}, i_{l+1,k'}) \leq \frac{(1-b)b^l}{1+b}\}$.
10:    set $C_{l,k} := \bigcup_{C_{l+1,k'} \in SK} C_{l+1,k'}$.
11:    set $\mathrm{Chd}(C_{l,k}) := SK$.
12:    set $\mathrm{Prt}(C_{l+1,k'}) := C_{l,k}$, $\forall C_{l+1,k'} \in SK$.
13:    set $i_{l,k} := i_{l+1,\hat{k}}$.
14:    remove the representative points of the selected clusters from $RP_{l+1}$: $RP_{l+1} := RP_{l+1} \setminus \{i_{l+1,k'} : C_{l+1,k'} \in SK\}$.
15:  end while
16:  set $K_l := k$.
17: end for
18: set $\mathrm{Prt}(C_{0,1}) := \emptyset$.

The algorithm first sets each leaf at depth $L$ to be a single-item cluster, and then operates from leaves to root by combining clusters that are "similar". The similarity between two clusters is evaluated through the distance between the representative points in the clusters. If the distance between two representative points is below a layer-varying threshold, the two clusters are considered similar and are assigned the same parent cluster.
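To make the grouping step concrete, the following Python sketch implements one layer of this bottom-up construction, under the simplifying assumption that items are points on a line and the item metric is the absolute difference; it illustrates the thresholding rule of Algorithm 1, not the authors' implementation:

```python
# Sketch of the CON-TREE bottom-up grouping (one layer). Representative
# points within (1-b)*b^l / (1+b) of an arbitrarily chosen representative
# are merged into a single parent cluster; the chosen representative
# becomes the parent's representative. Toy 1-D items, for illustration.

def build_layer(clusters, reps, dist, b, l):
    """Group layer-(l+1) clusters into layer-l clusters.

    clusters: list of sets of items; reps: representative item per cluster.
    Returns (parent_clusters, parent_reps).
    """
    threshold = (1 - b) * b**l / (1 + b)
    remaining = list(range(len(clusters)))
    parents, parent_reps = [], []
    while remaining:
        pivot = remaining[0]                      # arbitrary representative
        merged = [j for j in remaining
                  if dist(reps[pivot], reps[j]) <= threshold]
        parents.append(set().union(*(clusters[j] for j in merged)))
        parent_reps.append(reps[pivot])           # pivot represents the parent
        remaining = [j for j in remaining if j not in merged]
    return parents, parent_reps

# Toy example: items on a line, distances scaled to [0, 1].
items = [0.0, 0.05, 0.5, 0.55, 1.0]
dist = lambda x, y: abs(x - y)
leaves = [{i} for i in items]
parents, reps = build_layer(leaves, items, dist, b=0.75, l=1)
# threshold ≈ 0.107, so {0.0, 0.05}, {0.5, 0.55}, and {1.0} are merged.
```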

A key property of the CON-TREE algorithm is that it not only constructs an item-cluster tree that fulfils the exponential tree metric, but also has a bound on the maximum number of item clusters at each layer. This is important because we can balance the speed of learning and the accuracy by adjusting the number of clusters and the cluster size. We formalize this property in the following theorem.

Theorem 1. The item-cluster tree constructed via the CON-TREE algorithm has the following properties: (1) the constructed tree with a depth at most $L_b$ fulfils the exponential tree metric, where $L_b$ is the maximum $l$ such that

2. Item-cluster trees that fulfill a certain tree metric are not unique; however, they will provide the same performance bound for the recommendations, as shown in our theorems.

SONG ET AL.: ONLINE LEARNING IN LARGE-SCALE CONTEXTUAL RECOMMENDER SYSTEMS 437


$b^{l-1} \geq \min_{i,i' \in \mathcal{I}, \sigma_I(i,i') \neq 0} \sigma_I(i,i')$, and (2) the maximum number of clusters at layer $l$ of this tree is bounded by $c_I \left(\frac{1+b}{1-b}\right)^{d_I} \left(\frac{1}{b}\right)^{l d_I}$, where $c_I$ and $d_I$ are the covering constant and covering dimension of the item space.3

Proof. See Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TSC.2014.2365795. ∎

We can see that according to the definition of the expo-nential tree metric, the proposed CON-TREE algorithmensures that the cluster size at layer l is bounded by bl, andfrom Theorem 1, the number of clusters at layer l is

bounded by Oðb�ldI Þ. If l is larger, then the cluster size atlayer l is smaller, but the number of clusters at layer l islarger, and vice versa. This implies that choosing clusterswith larger depth l will make more specific recommenda-tions to the users, but it will require learning more aboutspecific characteristics of items since the number of clustersis larger. We also note that the item-cluster tree can be parti-tioned by a set of clusters Kl at layer l, which contain all theitems and no two clusters contain the same item. Thus anyitem is included in one of the clusters at the same layer.

Note that since the CON-TREE algorithm needs up to $K_l^2$ calculations at layer $l$, the computational complexity of the CON-TREE algorithm is bounded by $O(\sum_{l=0}^{L_b} b^{-2 l d_I})$. Since it is implemented offline, the CON-TREE algorithm does not affect the computational complexity of the online recommendation.

4.2 Adaptive Clustering Recommendation

We have previously proposed a one-layer clustering recommendation (OLCR) algorithm in [30] to solve the contextual recommendation problem with a large number of items. However, the OLCR algorithm is based on a fixed clustering method, and a disadvantage of OLCR is that it is difficult to decide the optimal size of the clusters. In this work, we propose an adaptive clustering based ACR algorithm that can adaptively select the cluster size in order to balance between making specific recommendations and reducing the number of clusters. We choose the "optimal item" as the benchmark, which is defined as $i^*(t) = \arg\max_{i \in \mathcal{I}} \mu_{i,x_t}$ with respect to the context $x_t$ arriving at time $t$. The regret of learning up to time $T$ of an algorithm $\pi$, which selects the cluster $\pi_K(t)$ at layer $\pi_L(t)$ (i.e., cluster $C_{\pi_L(t),\pi_K(t)}$) given the context at time $t$, is defined as the difference of expected payoffs from choosing item $i^*(t)$, namely,

$$R_\pi(T) = \sum_{t=1}^{T} \left[ \mu_{i^*(t),x_t} - \mu_{\pi_L(t),\pi_K(t),x_t} \right]. \quad (1)$$
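When the expected payoffs are known (as in a simulation), the regret in (1) is a direct computation. A small sketch with made-up payoff values:

```python
# Sketch: regret of Eq. (1) -- cumulative gap between the best item's
# expected payoff and the expected payoff of the selected cluster.
# All payoff values here are hypothetical.

def regret(mu_items, mu_chosen_cluster):
    """mu_items[t]: dict item -> expected payoff at context x_t;
    mu_chosen_cluster[t]: expected payoff of the cluster chosen at x_t."""
    return sum(max(mu_t.values()) - mu_c
               for mu_t, mu_c in zip(mu_items, mu_chosen_cluster))

mu_items = [{1: 0.5, 2: 0.3}, {1: 0.2, 2: 0.6}]   # per-period item payoffs
mu_chosen = [0.4, 0.6]                             # chosen clusters' payoffs
# Regret = (0.5 - 0.4) + (0.6 - 0.6) = 0.1
print(regret(mu_items, mu_chosen))
```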

Since the recommender system may provide differential services for different types of users in order to maintain the higher-type users' satisfaction, we assume that type $s$ users arrive with a certain fixed probability $p_s$ at each time. For higher-type users, the recommender system explores less, and for lower-type users, the recommender system explores more. In this way, the average per-period regrets of the higher-type users are expected to be smaller than those of the lower-type users. Formally, we define the instantaneous regret of type $s$ users as $IR^s_\pi(t) = \mathbb{E}[\mu_{i^*(t),x_t} - \mu_{\pi_L(t),\pi_K(t),x_t} \mid u_t \in \mathcal{U}_s]$ for an algorithm $\pi$, where the expectation is taken with respect to the context arrival given the current user type.

We first present the ACR algorithm in Fig. 3, Fig. 4, and Algorithm 2. Initially, the item-cluster tree that fulfils the exponential tree metric is built offline through the CON-TREE algorithm. We denote the epochs by $l = 0, 1, 2, \ldots$, and epoch $l$ contains $2^l$ time slots. In epoch $l$, the ACR algorithm chooses the cluster set $\mathcal{K}_{\min\{l, L\}}$ that contains all the clusters at layer $\min\{l, L\}$, as shown in Fig. 4. Note that we will denote by $l(t)$ the layer selection corresponding to time slot $t$. As the epochs go on, the algorithm selects an item cluster from a cluster set that contains a larger number of clusters, but these clusters have smaller cluster sizes. Hence, the algorithm can make more specific recommendations to users over time.

Algorithm 2. Adaptive Clustering Recommendation Algorithm

Input: Item-cluster tree, periods $T$.
Initialization: initial partition $\mathcal{K} := \mathcal{K}_0 = \{C_{0,1}\}$, epoch $E := \min\{L, \lfloor \log_2 T \rfloor\}$.
1: for $l = 0 : E-1$ do
2:   Set the partition $\mathcal{K} := \mathcal{K}_l = \{C_{l,k} : 1 \leq k \leq K_l\}$.
3:   for $t = 2^l : 2^{l+1}-1$ do
4:     Observe the user's type $s$ and the context $x_t$. Find a ball $B(x_t, r_t)$ with radius $r_t = t^{(\alpha-1)/d_C}$.
5:     Select the cluster with the maximum index $k^* = \arg\max_{k' \in \mathcal{K}} I^s_{l,k'}(t)$, with ties broken arbitrarily.
6:     Randomly select a child node $C_{l+1,\hat{k}}$ of $C_{l,k^*}$.
7:     Randomly recommend an item in cluster $C_{l+1,\hat{k}}$.
8:     Receive the reward: $r_t$.
9:     Record $\{t, x_t, C_{l,k^*}, C_{l+1,\hat{k}}, r_t\}$.
10:  end for
11: end for
12: Set $l := E$, and set the partition $\mathcal{K} := \mathcal{K}_E = \{C_{E,k} : 1 \leq k \leq K_E\}$.
13: for $t = 2^l : T$ do
14:   Observe the context $x_t$. Find a ball $B(x_t, r_t)$ with radius $r_t = t^{(\alpha-1)/d_C}$.
15:   Select the cluster with the maximum index $k^* = \arg\max_{k' \in \mathcal{K}} I^s_{l,k'}(t)$, with ties broken arbitrarily.
16:   Randomly recommend an item in cluster $C_{l,k^*}$.
17:   Receive the reward: $r_t$.
18:   Record $\{t, x_t, C_{l,k^*}, r_t\}$.
19: end for
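The epoch structure of Algorithm 2 (epoch $l$ covers slots $2^l, \ldots, 2^{l+1}-1$ and uses layer $\min\{l, L\}$) can be sketched as a simple mapping from a time slot to its epoch and layer:

```python
import math

# Sketch: epoch and layer schedule of the ACR algorithm. Epoch l covers
# time slots t = 2^l, ..., 2^(l+1)-1, so the epoch of slot t is
# floor(log2(t)), and the layer used is capped at the tree depth L.

def epoch_of(t):
    return int(math.floor(math.log2(t)))

def layer_of(t, L):
    return min(epoch_of(t), L)

# Epochs double in length: slots 1 | 2-3 | 4-7 | 8-15 | ...
assert [epoch_of(t) for t in (1, 2, 3, 4, 7, 8)] == [0, 1, 1, 2, 2, 3]
# Once the epoch index exceeds the tree depth, the layer stays at L.
assert layer_of(100, L=3) == 3   # epoch_of(100) = 6, capped at depth 3
```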

Each time, a user with context $x_t$ arrives, and the algorithm chooses a relevant ball $B(x_t, r_t)$ in the context space,

Fig. 3. Context arrivals and refinement in the context space.

3. See Appendix A, available in the online supplemental material, for the definitions of covering dimension and covering constant.



whose center is $x_t$ and radius is $r_t = t^{(\alpha-1)/d_C}$, as shown in Fig. 3. It then selects a cluster in $\mathcal{K}_{l(t)}$, by calculating and comparing the index of each cluster at the current layer $l(t)$ based on the past history and the current context. Subsequently, it randomly selects a child cluster at layer $l(t)+1$ (if available) and randomly recommends an item in the selected child cluster. Note that in the algorithm, to extract information from a former epoch, we record two cluster selections each time4: one is the cluster $C_{l,k}$ at layer $l$, and the other is the cluster $C_{l+1,\hat{k}}$ at layer $l+1$, which is a randomly selected child cluster of $C_{l,k}$.

The index represents a combination of exploration and exploitation, and its calculation is the main part of this algorithm. The set of past time periods in which the contexts arrived in the relevant ball $B(x_t, r_t)$ is denoted by $\mathcal{T}(B(x_t, r_t)) = \{\tau : \tau < t, x_\tau \in B(x_t, r_t)\}$. The number of times that cluster $C_{l,k}$ has been selected in that ball is denoted by $N_{l,k,t} = \sum_{\tau \in \mathcal{T}(B(x_t, r_t))} \mathbb{I}\{\pi_L(\tau) = l, \pi_K(\tau) = k\}$. Therefore, the index for cluster $C_{l,k}$ can be written as $I^s_{l,k}(t) = \bar{r}_{l,k,t} + \sqrt{A_s \ln t / N_{l,k,t}}$, where $A_s$ is the exploration-exploitation trade-off factor for type $s$ users, and $\bar{r}_{l,k,t} = (\sum_{\tau \in \mathcal{T}(B(x_t, r_t))} r_\tau \mathbb{I}\{\pi_L(\tau) = l, \pi_K(\tau) = k\}) / N_{l,k,t}$ is the sample-average payoff in the relevant ball. Without loss of generality, we assume that the exploration-exploitation trade-off factor $A_s$ decreases as the user type increases, namely, $A_1 \geq A_2 \geq \cdots \geq A_S$.
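As an illustration of this index computation, the following sketch recomputes $I^s_{l,k}(t)$ from a raw history of past selections; the list-of-tuples history and the scalar distance function are hypothetical stand-ins for the algorithm's records:

```python
import math

# Sketch: the UCB-style index I^s_{l,k}(t) = rbar_{l,k,t} + sqrt(A_s ln t / N_{l,k,t}),
# computed from the history of past selections whose contexts fell in the
# relevant ball B(x_t, r_t). Data structures are illustrative stand-ins.

def index(history, x_t, r_t, t, l, k, A_s, dist):
    """history: list of (x_tau, layer, cluster, reward) records for tau < t."""
    in_ball = [(lay, clu, rew) for (x, lay, clu, rew) in history
               if dist(x, x_t) <= r_t]
    rewards = [rew for (lay, clu, rew) in in_ball if lay == l and clu == k]
    N = len(rewards)
    if N == 0:
        return float('inf')            # unexplored clusters get priority
    return sum(rewards) / N + math.sqrt(A_s * math.log(t) / N)

dist = lambda x, y: abs(x - y)
history = [(0.10, 1, 0, 1.0), (0.12, 1, 0, 0.0), (0.90, 1, 0, 1.0)]
# Only the first two records fall in B(0.11, 0.05): sample mean = 0.5.
val = index(history, x_t=0.11, r_t=0.05, t=4, l=1, k=0, A_s=1.0, dist=dist)
```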

We define the suboptimal cluster set at time $t$ as $\mathcal{S}(t) = \{(l(t), k) : k \in \mathcal{K}_{l(t)}, \mu_{l(t),k^*_t,t} - \mu_{l(t),k,t} > 4 L_C r_t\}$, where $k^*_t = \arg\max_k \mu_{l(t),k,t}$. We also define $\mathcal{W}^s_{k,t} = \{N_{l(t),k,t} < \frac{36 A_s \ln t}{(\mu_{l(t),k^*_t,t} - \mu_{l(t),k,t} + 2 L_C r_t)^2}\}$. The following theorems prove bounds on the instantaneous and long-term regrets of the ACR algorithm.

Theorem 2. If $1/(1+d_C) < \alpha < 1$ and the full exploration assumption holds: $\Pr\{\pi(t) = (l(t), k) \mid u_t \in \mathcal{U}_s, \mathcal{W}^s_{k,t}, (l(t), k) \in \mathcal{S}(t)\} \geq 1 - t^{-\eta}$ for any time $t$, where $\eta > 0$ is a parameter, then for independent and identically distributed (i.i.d.) user and context arrivals, the instantaneous regret $IR^s_{ACR}(t)$ of the type $s$ user in period $t$ is bounded by

$$IR^s_{ACR}(t) \leq \begin{cases} 2K_{l(t)} t^{-2A_1} + \dfrac{24 A_1 K_{l(t)} c_C \ln t}{L_C}\, t^{\frac{1-\alpha}{d_C}-\alpha} + 4 L_C t^{-\frac{1-\alpha}{d_C}} + 2 L_I t^{\ln b/\ln 2}, & s = 1 \\[2ex] 2K_{l(t)} t^{-2A_s} + 4 L_C t^{-\frac{1-\alpha}{d_C}} + 2 L_I t^{\ln b/\ln 2} + K_{l(t)} \prod_{s'=1}^{s-1} \left[ 1 - \sum_{s''=1}^{s'} p_{s''}\, p_{B(x_t,r_t)} (1 - t^{-\eta}) \right]^{t_{s'} - t_{s'+1}}, & s \neq 1, \end{cases}$$

where $t_{s'} = \min\{\tau : \frac{A_{s'} \ln(t-\tau)}{A_s \ln t} < (1 + \frac{r_{t-\tau}}{3 r_t})^2\}$ for $s' = 1, 2, \ldots, s-1$, and $p_{B(x_t,r_t)} = \int_{x \in B(x_t,r_t)} f(x)\, dx$.

Proof. See Appendix C, available in the online supplemental material. ∎

We can see that type 1 users have a higher regret than the other users, because they need to explore more than the others. For type $s$ users, we can see from Theorem 2 that as the percentage of lower-type users ($s' < s$) increases, the instantaneous regret $IR^s_{ACR}(t)$ decreases, because these lower-type users will explore more, to the benefit of the type $s$ user. We can also see that as the ratio of exploration-exploitation trade-off factors ($A_{s'}/A_s$, $s' < s$) increases, $t_{s'}$ increases, and hence the regret $IR^s_{ACR}(t)$ decreases. This implies that a bigger ratio $A_{s'}/A_s$ represents a higher differential service level.

From Theorem 2, we can observe that the instantaneous regret $IR^s_{ACR}(t)$ decreases sublinearly with $t$, namely, $IR^s_{ACR}(t) = O(t^{-\gamma})$, where $0 < \gamma < 1$. The instantaneous regret measures the performance loss in terms of the average payoff due to recommending a suboptimal item for the user. CTR is a common performance measure in applications such as news and advertisement recommendations [6], [23]. The regret bound on the CTR is $O(t^{-\gamma})$. Another interesting performance metric is the root-mean-square error (RMSE), which is commonly used in rating systems to measure the difference of the rating prediction from the user's actual rating [19], [20]. The regret can be converted to a regret bound on the RMSE. To do this, it is sufficient to consider a transformation in which the reward is considered as the negative mean-square of the suboptimality in CTR resulting from suboptimal item recommendations. From Theorem 2, it can be shown that the RMSE is $O(t^{-\gamma/2})$.

Theorem 3. For $1/(1+d_C) < \alpha < 1$ and i.i.d. user arrivals, the regret $R_{ACR}(T)$ up to time $T$ of the ACR algorithm is bounded by

Fig. 4. Recommender system based on ACR algorithm.

4. Note that this can be generalized to an arbitrary number of levels ahead, e.g., children of children, but the computational complexity and memory requirements of each epoch grow.



$$\begin{aligned} R_{ACR}(T) \leq\; & 2 K_E \sum_{s=1}^{S} p_s \sum_{t=1}^{\infty} t^{-2A_s} + \frac{2 L_I T^{1+\frac{\ln b}{\ln 2}}}{1 + \ln b/\ln 2} \\ & + \frac{24 A_1 K_E c_C (1.386 + \ln T)\, T^{\frac{(1-\alpha)(1+d_C)}{d_C}}}{L_C \left(2^{(1-\alpha)(1+d_C)/d_C} - 1\right)} \\ & + \frac{48 A_1 K_E c_C T^{\frac{(1-\alpha)(1+d_C)}{d_C}} \ln 2}{L_C \left(2^{(1-\alpha)(1+d_C)/d_C} - 1\right)^2} + \frac{4 L_C d_C T^{1-\frac{1-\alpha}{d_C}}}{d_C - 1 + \alpha}, \end{aligned} \quad (2)$$

where $E = \min\{L, \lfloor \log_2 T \rfloor\}$ and $c_C$ is the covering constant of the context space $\mathcal{X}$.

Proof. See Appendix C, available in the online supplemental material. ∎

From Theorem 3, we observe that as long as the minimum exploration-exploitation trade-off factor satisfies $A_S \geq \frac{d_I \ln(1/b)}{2 \ln 2}$, the regret5 $R_{ACR}(T)$ is sublinear in $T$, since from Theorem 1, $K_{l(t)}$ can be bounded through $K_{l(t)} \leq c_I (\frac{1+b}{1-b})^{d_I} (\frac{1}{b})^{d_I \log_2 t} = c_I (\frac{1+b}{1-b})^{d_I} t^{-d_I \ln b/\ln 2}$. If the system does not consider different user types, then all $A_s = 1$, and the regret can be simplified to $O(T^{\frac{d_C+d_I+1}{d_C+d_I+2}} \ln T)$, by setting $\alpha = (d_I+2)/(d_I+d_C+2)$ and $b = 2^{-1/(d_I+d_C+2)}$.
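The parameter choices behind this simplified bound can be computed directly. A small sketch, with arbitrarily chosen dimensions (the item dimension below is hypothetical):

```python
# Sketch: the parameter settings alpha = (d_I + 2)/(d_I + d_C + 2) and
# b = 2^(-1/(d_I + d_C + 2)) used to obtain the simplified regret bound
# O(T^((dC+dI+1)/(dC+dI+2)) ln T), with sanity checks on their ranges.

def acr_parameters(d_C, d_I):
    denom = d_I + d_C + 2
    alpha = (d_I + 2) / denom
    b = 2 ** (-1 / denom)
    assert 1 / (1 + d_C) < alpha < 1      # condition of Theorems 2 and 3
    assert 0.5 < b < 1                    # exponential tree metric base
    return alpha, b

# Example: context dimension 5 (as in the Yahoo! dataset), item dimension
# 10 (a hypothetical choice).
alpha, b = acr_parameters(d_C=5, d_I=10)
regret_exponent = (5 + 10 + 1) / (5 + 10 + 2)   # ≈ 0.941: sublinear in T
```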

4.3 Theoretical Comparison

In this section, we compare the computational complexity (time complexity), the memory required (space complexity), and the performance bound with the benchmark in [29], as shown in Table 2. For the ACR algorithm, the computational complexity is $O(t + K_{l(t)})$ at each time $t$. In this case, the computational complexity up to time $T$ is $O(T^2 + K_E T)$, which is polynomial in $T$. The space complexity of the ACR algorithm is $O(\sum_{l=0}^{E} K_l + T)$. The ACR algorithm achieves a regret $O(T^{\frac{d_C+d_I+1}{d_C+d_I+2}} \ln T)$. Actually, the regret of the ACR algorithm is potentially better than $O(T^{\frac{d_C+d_I+1}{d_C+d_I+2}} \ln T)$, because $K_E$ is often much smaller than $c_I (\frac{1+b}{1-b})^{d_I} T^{-d_I \ln b/\ln 2}$.

We also note that when the dimensions of the context and item spaces are small, the algorithm in [29] has a lower time complexity than our algorithm ($O(T^{1+2\epsilon} + IT) < O(T^2 \ln T + K_E T)$ when $\epsilon < 0.5$).6 But when the dimensions of the context and item spaces are large, our algorithm achieves a lower time complexity ($O(T^2 \ln T + K_E T) < O(T^{1+2\epsilon} + IT)$ when $\epsilon > 0.5$).

5 EXPERIMENTAL RESULTS

In this section, we evaluate the performance of our algorithms through experiments, using the Yahoo! Today Module dataset.

5.1 Description of the Dataset

The Yahoo! Today Module dataset7 is based on the webpage recommendations of Yahoo! Front Page, which is a prominent Internet news website of Yahoo! This dataset contains around 40 million instances collected from Yahoo! Front Page during May 1 to 10, 2009. Each instance of the dataset includes (1) recommended item IDs and features, (2) contexts, and (3) click behavior of users. For each user, the five-dimensional context vector describes the user's features, and is obtained from a higher dimensional raw feature vector of over 1,000 categorical components by applying dimensionality reduction techniques [23]. The raw feature vector includes: 1) demographic information: gender and age; 2) geographic features; and 3) behavioral categories: about 1,000 binary categories that summarize the user's consumption history. The high dimensional raw feature vector is then converted to a five-dimensional context feature vector [23]. The number of items in this dataset is 271, all of which are used in our experiments.

The special data collection method [23] allows evaluating online algorithms on this dataset without introducing bias. Essentially, each instance in the dataset contains only the click behavior of a user for one item that was recommended to the user at the time of data collection. When collecting the dataset, the item recommendations were made completely independently of the context of the user and the previous item recommendations. As explained in [23], this removes any bias in data collection that might result from the algorithm used to make item recommendations when collecting the data.

Similarly to [23], we use this offline dataset to evaluate our online learning algorithms. When simulating our algorithms, we randomly pick users from this dataset. Since this dataset is very large (around 40 million instances), for a user that has a context $x$, it is possible to find 271 instances, each of which contains a different item recommendation, a context that is within $\varepsilon$ of $x$ in terms of the Euclidean distance (in our experiment, we use $\varepsilon = 0.01$), and the user's click behavior for that item recommendation. In this way, the online algorithm is evaluated by using the users and their click behaviors in this dataset. Note that a very similar simulation method is also used in [23] to run the online LinUCB algorithm on this offline dataset. In our numerical results we consider $T = 70000$ user arrivals.
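The replay method above can be sketched as follows; the tuple layout of the log is a simplification of the actual dataset format:

```python
import math

# Sketch of the offline replay evaluation: the recommended item's payoff is
# read from a logged instance whose context lies within epsilon of the
# current context (Euclidean distance). The log format is a simplification.

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def replay_feedback(log, x, item, eps=0.01):
    """log: list of (context, item, click). Returns the logged click for
    `item` at a context within eps of x, or None if no match exists."""
    for ctx, logged_item, click in log:
        if logged_item == item and euclidean(ctx, x) <= eps:
            return click
    return None

log = [((0.1, 0.2), 7, 1), ((0.1, 0.2), 8, 0), ((0.9, 0.9), 7, 0)]
assert replay_feedback(log, (0.105, 0.2), 7) == 1   # matched within eps
assert replay_feedback(log, (0.5, 0.5), 7) is None  # no nearby instance
```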

5.2 Comparison with Existing Context-Free Recommendation Algorithms

To show the importance of contexts on the payoffs, we compare the performances of our proposed ACR algorithm and our previously proposed OLCR algorithm [30] with those of context-free algorithms. The performances of the algorithms

TABLE 2
Comparison with Existing Solution

        Regret                                        Space complexity               Time complexity
[29]    $O(T^{\frac{d_C+d_I+1}{d_C+d_I+2}} \ln T)$    $O(I + T + T^\epsilon)$        $O(T^{1+2\epsilon} + IT)$
ACR     $O(T^{\frac{d_C+d_I+1}{d_C+d_I+2}} \ln T)$    $O(\sum_{l=0}^{E} K_l + T)$    $O(T^2 + K_E T)$

5. At the beginning of the online learning periods, the algorithm needs to explore more, hence incurring a higher regret. To tackle this, we can use prior information about the payoffs of each item in different contexts.

6. In [29], $T^\epsilon$ is the number of balls to be selected, $\epsilon \in (0,1)$, and $\epsilon$ increases as the dimensions of the context and item spaces increase.
7. http://labs.yahoo.com/Academic_Relations



are evaluated through the CTR up to time $t$. We compare against the following three context-free algorithms.

• Random: The algorithm randomly selects an item each time. This can be seen as the benchmark for the other algorithms.

• UCB1 [15]: This is a classical MAB algorithm, which performs well in the case of learning the best item without exploiting the context information.

• ε-greedy [15], [17]: This is another widely-used algorithm to perform online recommendations. In time period $t$, the item cluster with the highest payoff is selected with probability $1 - \epsilon_t$, and other item clusters are randomly selected with probability $\epsilon_t$, where $\epsilon_t$ is of order $1/t$, such that the exploration rate is inversely proportional to the number of users.
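For concreteness, a minimal sketch of such an ε-greedy baseline with $\epsilon_t$ on the order of $1/t$; the constant $c$ is a tuning choice, not taken from the paper:

```python
import random

# Sketch of the epsilon-greedy baseline: exploit the empirically best item
# cluster with probability 1 - eps_t, explore uniformly otherwise, with
# eps_t decaying as c/t. The constant c is a hypothetical tuning choice.

def choose(avg_payoff, t, c=1.0, rng=random):
    eps_t = min(1.0, c / t)
    if rng.random() < eps_t:                       # explore
        return rng.choice(list(avg_payoff))
    return max(avg_payoff, key=avg_payoff.get)     # exploit

avg = {"cluster_a": 0.03, "cluster_b": 0.05}
# For large t, eps_t is tiny, so the best cluster is almost always chosen.
picks = [choose(avg, t=10_000) for _ in range(100)]
```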

In simulations, depending on the number of clusters, we consider two scenarios. In the first scenario, we choose $K = 50$ item clusters, and in the second scenario, we choose $K = 150$ item clusters, for all the algorithms except the ACR algorithm, which can adaptively choose the number of item clusters over time. For the ACR algorithm, we do not distinguish users (i.e., $A_s = 1$) and choose the parameter $b = 0.75$.

The comparison results are given in Table 3. We can see that both the OLCR algorithm and the ACR algorithm significantly outperform the context-free algorithms, as expected. The simulation results show that the OLCR algorithm achieves a 10-31 percent performance gain, in terms of CTR up to time $T$, over the UCB1 and ε-greedy algorithms. The CTR of the ACR algorithm is 58-69 percent higher than those of the UCB1 and ε-greedy algorithms. We also notice that the UCB1 and ε-greedy algorithms learn faster than the OLCR and ACR algorithms. This is because the context-free algorithms estimate the CTR of an item by averaging the clicks over all users, while the contextual algorithms estimate the CTR of an item based on specific user contexts. However, the contextual algorithms can make more accurate estimations of the expected CTRs of items by employing the contextual information when the number of user arrivals is sufficiently large. Hence, the context-free algorithms converge faster (in less than 20 percent of user arrivals) than the contextual algorithms, and the contextual algorithms can achieve a higher CTR than the context-free algorithms in the long run.

5.3 Comparison with Existing Contextual Recommendation Algorithms

In this section, we compare the CTRs of the OLCR and ACR algorithms with those of existing contextual recommendation algorithms. We compare against the following algorithms.

• CACF [21]: The algorithm exploits the information of historical selections of item clusters to predict the current payoff given the current context arrival. Note that this algorithm only performs exploitation, without any exploration. Thus, it has a "cold start" problem, and we use the first 5 percent of data instances to train the algorithm.

• Hybrid-ε [22]: This algorithm combines contexts and ε-greedy algorithms, by extracting information within a small region of the context space and by running ε-greedy algorithms in that small region.

• LinUCB [23]: This algorithm assumes that the payoffs of items are linear in contexts. It calculates the index of each arm by adding an upper bound term to a linear combination of contexts observed in previous periods, and then recommends the arm with the maximum index.

• Contextual-zooming [29]: This algorithm combines the context space and the item space together and selects a ball in this joint space each time, based on the past context arrivals and item selections.

In this section we consider two fixed numbers of item clusters for the algorithms that do not have the item clustering property (i.e., all algorithms except contextual-zooming and ACR): $K = 50$ and $K = 150$. For the ACR algorithm, we do not distinguish users (i.e., $A_s = 1$) and choose the parameter $b = 0.75$. We can categorize the contextual algorithms into two classes depending on whether an exploitation-exploration trade-off is considered. The CACF algorithm does not perform exploration, while the Hybrid-ε, LinUCB, contextual-zooming, OLCR, and ACR algorithms consider both exploitation and exploration.

The simulation results are given in Table 4. We can see that the ACR algorithm significantly outperforms the existing contextual algorithms. The simulation results show that the CTRs up to time $T$ obtained by the OLCR algorithm are

TABLE 3
Comparison with Context-Free Algorithms

                             Percentage of user arrivals (K = 50)      Percentage of user arrivals (K = 150)
                             20%    40%    60%    80%    100%          20%    40%    60%    80%    100%
CTR   Random                 0.026  0.028  0.027  0.026  0.026         0.026  0.028  0.027  0.026  0.026
      UCB1                   0.028  0.032  0.030  0.029  0.029         0.029  0.030  0.029  0.028  0.029
      ε-greedy               0.030  0.029  0.029  0.029  0.029         0.032  0.031  0.031  0.031  0.031
      OLCR                   0.028  0.033  0.037  0.037  0.038         0.025  0.028  0.031  0.034  0.034
      ACR                    0.039  0.044  0.046  0.049  0.049         0.039  0.044  0.046  0.049  0.049
Gain  OLCR over Random       8%     18%    37%    42%    46%           -4%    0%     15%    31%    31%
      OLCR over UCB1         0%     3%     23%    28%    31%           -14%   -7%    7%     21%    17%
      OLCR over ε-greedy     -7%    14%    28%    28%    31%           -22%   -10%   0%     10%    10%
      ACR over Random        50%    57%    70%    88%    88%           50%    57%    70%    88%    88%
      ACR over UCB1          39%    38%    53%    69%    69%           34%    47%    59%    75%    69%
      ACR over ε-greedy      30%    52%    59%    69%    69%           22%    42%    48%    58%    58%



31 and 10 percent higher than those obtained by the CACF algorithm; 15 percent higher and 3 percent lower than those obtained by the Hybrid-ε algorithm; 7 and 18 percent lower than those obtained by the LinUCB algorithm; and the same and 16 percent lower than those obtained by the contextual-zooming algorithm, in the scenarios of $K = 50$ and $K = 150$, respectively. The CTRs up to time $T$ obtained by the ACR algorithm are 69 and 69 percent higher than those obtained by the CACF algorithm, 48 and 48 percent higher than those obtained by the Hybrid-ε algorithm, 20 and 26 percent higher than those obtained by the LinUCB algorithm, and 29 and 29 percent higher than those obtained by the contextual-zooming algorithm. In fact, our ACR algorithm aggregates information in an adaptive ball of the context space each time, while the Hybrid-ε algorithm considers a fixed small region, which leads to less efficient learning. LinUCB assumes a linear payoff function, which can result in inaccurate estimation of the CTR, while our proposed algorithms do not make such an assumption. The performance of the contextual-zooming algorithm is affected by the Lipschitz constant included in the algorithm, which may not fit the real data very well. The CACF algorithm learns faster (converges in less than 20 percent of user arrivals) than our proposed ACR and OLCR algorithms, but its learning accuracy is low due to the fact that the CACF algorithm only performs exploitation, while our algorithms perform explorations that cause a lower speed of learning in the short run but a higher accuracy of recommendation in the long run. Note that the proposed ACR algorithm outperforms the OLCR algorithm, because the ACR algorithm exploits the item-cluster tree to adaptively cluster items, while the OLCR algorithm does not use this tree structure and has fixed clusters.

5.4 Differential Services

In this section, we show how the recommender system provides differential services according to users' types. We assume that the users in the system are divided into two types: registered users and anonymous users. For the registered users, we aim to provide higher quality recommendations. In the simulation, we choose the ACR algorithm

with parameter $b = 0.75$, and compare the average per-period CTR of the two user types.

We show the average per-period CTR (normalized with respect to the anonymous users) of the two user types in Fig. 5. We can see that the average per-period CTR of registered users is around 4 to 11 percent higher than that of the anonymous users when the ratio of exploration-exploitation trade-off factors $A_2/A_1$ (registered user/anonymous user) varies from 0.2 to 0.8 in the scenarios of 10 and 30 percent registered users. We can also see that as the ratio of exploration-exploitation trade-off factors increases, the CTR gain of the registered users over anonymous users decreases by 5-6 percent. In fact, as the ratio of exploration-exploitation trade-off factors increases, the recommendation strategies for the two types of users become similar to each other.

5.5 Scalability of Algorithms

In practice, the number of items is large, which requires that the algorithms have the ability to scale up, i.e., the learning speed and total CTR should not degrade as the number of items increases. In order to show this, we increase the number of items in the system by duplicating items to form a large item set. Since our proposed algorithms and the contextual-zooming algorithm make recommendations at the cluster level, the increase in items in the system only affects the number of items in a cluster, and does not affect the performances of the algorithms. For the CACF, Hybrid-ε, and LinUCB algorithms, when they do not cluster the items and

TABLE 4
Comparison with Contextual Algorithms

                                      Percentage of user arrivals (K = 50)      Percentage of user arrivals (K = 150)
                                      20%    40%    60%    80%    100%          20%    40%    60%    80%    100%
CTR   CACF                            0.028  0.029  0.027  0.029  0.029         0.028  0.029  0.028  0.029  0.029
      Hybrid-ε                        0.034  0.035  0.034  0.033  0.033         0.033  0.034  0.032  0.033  0.033
      LinUCB                          0.035  0.039  0.040  0.041  0.041         0.032  0.037  0.039  0.039  0.039
      Contextual-zooming              0.037  0.039  0.038  0.038  0.038         0.037  0.039  0.038  0.038  0.038
      OLCR                            0.028  0.033  0.037  0.037  0.038         0.025  0.028  0.031  0.032  0.032
      ACR                             0.039  0.044  0.046  0.049  0.049         0.039  0.044  0.046  0.049  0.049
Gain  OLCR over CACF                  0%     14%    37%    28%    31%           -11%   -3%    11%    10%    10%
      OLCR over Hybrid-ε              -18%   -6%    9%     12%    15%           -24%   -18%   -3%    -3%    -3%
      OLCR over LinUCB                -20%   -15%   -8%    -10%   -7%           -22%   -24%   -21%   -18%   -18%
      OLCR over contextual-zooming    -24%   -15%   -3%    -3%    0%            -32%   -28%   -18%   -16%   -16%
      ACR over CACF                   39%    52%    70%    69%    69%           39%    52%    64%    69%    69%
      ACR over Hybrid-ε               15%    26%    35%    48%    48%           18%    29%    44%    48%    48%
      ACR over LinUCB                 11%    13%    15%    20%    20%           22%    19%    18%    26%    26%
      ACR over contextual-zooming     5%     13%    21%    29%    29%           5%     13%    21%    29%    29%

Fig. 5. Normalized per-period CTR comparison between registered users and anonymous users, with percentage p of registered users.



learn the CTR separately for each item, there are some performance losses when the number of items grows.

The simulation results are shown in Fig. 6. We use a scale factor $s$ to denote the scalability of the system (i.e., the number of items is $sK$, where $K$ is the number of items without being scaled up). We calculate the learning gain of the evaluated algorithm over the random algorithm (i.e., $(CTR_{\pi,sK} - CTR_{random})/CTR_{random}$, where $CTR_{\pi,sK}$ denotes the CTR of the algorithm $\pi$ with a scale factor $s$), and normalize it with respect to that of the ACR algorithm as a performance measure to evaluate the scalability of the algorithm. We can see that when the number of items becomes 100 times the current number, the performance loss in terms of the normalized learning gain against the random algorithm is 63, 61, and 67 percent for the CACF, Hybrid-ε, and LinUCB algorithms, respectively. The long-term CTRs of the contextual-zooming algorithm and our proposed OLCR and ACR algorithms do not change, as expected, implying that our algorithms are scalable.
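The normalized learning gain used in this comparison can be computed as in the sketch below; the CTR values are illustrative, matching the orders of magnitude in Tables 3 and 4:

```python
# Sketch: learning gain of an algorithm over the random baseline,
# (CTR_pi - CTR_random) / CTR_random, normalized by the ACR gain so that
# ACR itself scores 1.0. The CTR values below are illustrative.

def learning_gain(ctr_pi, ctr_random):
    return (ctr_pi - ctr_random) / ctr_random

def normalized_gain(ctr_pi, ctr_random, ctr_acr):
    return learning_gain(ctr_pi, ctr_random) / learning_gain(ctr_acr, ctr_random)

ctr_random, ctr_acr = 0.026, 0.049
gain_linucb = normalized_gain(0.041, ctr_random, ctr_acr)   # < 1: below ACR
```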

5.6 Robustness of Algorithms

In this section, we evaluate the robustness of the algorithms when the input parameters vary. Note that, to construct the item-cluster tree in our proposed approach, the theoretical optimal value of b (b ∈ (0.5, 1)) given in Section 4.3 (i.e., b = 2^(-1/(d_I + d_C + 2)) = 0.94) may be conservatively large, and so optimizing this parameter may result in higher total payoffs. In particular, we compare the CTR of the ACR algorithm as the base b changes. Among the existing works, the LinUCB algorithm [23] needs an input parameter a > 0 (used to control the size of the confidence interval) to run, and the contextual-zooming algorithm [29] needs the Lipschitz constant to run. So we also show the effect of these parameters on the performances of the LinUCB and contextual-zooming algorithms.
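As a quick sanity check, the theoretical base b = 2^(-1/(d_I + d_C + 2)) from Section 4.3 can be evaluated directly. The dimensions d_I = 6 and d_C = 3 below are assumed values, chosen only so that d_I + d_C + 2 = 11 reproduces the value b = 0.94 quoted in the text; the paper's actual dimensions may differ.

```python
# Sketch: the theoretically optimal item-cluster-tree base from Section 4.3,
#   b = 2^(-1/(d_I + d_C + 2)),
# where d_I and d_C are the item- and context-space dimensions.
# d_I = 6, d_C = 3 are assumed here (so the exponent denominator is 11),
# chosen to match the reported b = 0.94; they are not taken from the paper.

def optimal_base(d_I, d_C):
    """Theoretical base of the item-cluster tree."""
    return 2 ** (-1.0 / (d_I + d_C + 2))

b = optimal_base(d_I=6, d_C=3)
print(f"b = {b:.2f}")  # close to the 0.94 reported in the paper
```

Since b must lie in (0.5, 1), larger dimension sums push b toward 1, which is why the theoretical value can be conservatively large.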

Simulation results are shown in Figs. 7, 8, and 9. In Fig. 7, we can see that for the ACR algorithm, the CTR difference is less than 5 percent when the base b changes from 0.50 to 0.95. This is because the base b controls the trade-off between the number of clusters (learning speed) and the cluster size (accuracy). When b is large, learning is fast but less accurate; when b is small, learning is slow but more accurate. Hence, an optimal b exists that balances the speed of learning and the accuracy. For the LinUCB algorithm and the contextual-zooming algorithm, since the input parameters range from 0 to infinity, we test several possible values to evaluate the robustness of the algorithms. We can see from Fig. 8 that the LinUCB algorithm suffers more than 17 percent performance loss, in terms of CTR, compared to its maximum, which corresponds to the optimal value of a. We can also see from Fig. 9 that the contextual-zooming algorithm suffers more than 18 percent performance loss, in terms of CTR, when the Lipschitz constant is not estimated correctly. Moreover, since the input parameter a and the Lipschitz constant range from 0 to infinity, it is difficult to find their optimal values without training data. Hence, the LinUCB algorithm and the contextual-zooming algorithm are less robust to their input parameters than the ACR algorithm.

5.7 User Arrival Dependent Performance Variation

In this section, we evaluate the robustness of our ACR algorithm to different types of user arrivals. We randomly generate 10 sequences of user arrivals, and evaluate the variation in the CTR of the ACR algorithm over these sequences. To evaluate the variations of CTR in different

Fig. 6. Comparison of scalable performances of algorithms.

Fig. 7. The impact of b on the ACR algorithm.

Fig. 8. The impact of a on the LinUCB algorithm.

Fig. 9. The impact of the Lipschitz constant on the contextual-zooming algorithm.

SONG ET AL.: ONLINE LEARNING IN LARGE-SCALE CONTEXTUAL RECOMMENDER SYSTEMS 443


cases, we first calculate the average CTR up to time T over the 10 cases and normalize it to 1. We then compare the CTR variation of each case against this average CTR. The experimental results are shown in Table 5. For a specific sequence of user arrivals, a negative variation means the CTR is lower than the average CTR, while a positive variation means the CTR is higher than the average CTR. We can see from Table 5 that the variation of the CTR up to time T obtained by our proposed algorithm is less than 5.5 percent.
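The variation measure used in Table 5 can be sketched as follows: normalize the average CTR of the sequences to 1 and report each sequence's relative deviation. The per-sequence CTR values below are hypothetical, not the paper's data.

```python
# Sketch of the CTR-variation measure from Table 5: normalize the average
# CTR over the user-arrival sequences to 1, then report each sequence's
# relative deviation.  The CTRs below are hypothetical placeholders.

def ctr_variations(ctrs):
    """Per-sequence relative deviation from the average CTR."""
    avg = sum(ctrs) / len(ctrs)
    return [ctr / avg - 1.0 for ctr in ctrs]

ctrs = [0.047, 0.049, 0.050, 0.048, 0.051]  # hypothetical per-sequence CTRs
variations = ctr_variations(ctrs)
print([f"{v:+.1%}" for v in variations])
print(f"worst-case deviation: {max(abs(v) for v in variations):.1%}")
```

By construction the deviations sum to zero, so a small worst-case deviation (under 5.5 percent in the paper's experiments) indicates the algorithm's CTR is insensitive to the particular user-arrival sequence.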

6 CONCLUSIONS AND FUTURE WORK

In this paper, we propose a contextual MAB based clustering approach to design and deploy recommender systems for a large number of users and items, while taking into consideration the context in which the recommendation is made. Our proposed ACR algorithm makes use of an adaptive item clustering method to improve the learning speed. More importantly, our algorithm can address the cold start problem and provide differential services to users of different types. Theoretical results show that the algorithm can achieve a sublinear regret, while our simulations show that our algorithm outperforms the state-of-the-art algorithms by 20 percent in terms of average CTRs.

Understanding the utility and impact of the proposed approach in the current practice of Web services and recommendations requires further study. Nevertheless, our online learning algorithm for recommender systems is general and can potentially be used in numerous other application domains. However, each of these domains has its own unique set of contextual information, user characteristics, recommendation payoffs, and associated challenges. For example, users' demographics, environmental variables (e.g., time, location), and devices used (e.g., cellphones, laptops, etc.) can be potential contexts, while the recommendation payoff can be the rating(s) given by users to the recommendations they receive. Hence, this will require adapting the proposed algorithm to the specific domain in which the recommender system needs to operate.

While general and applicable to many domains, our approach also has several limitations. We mention a few here. First, we do not consider the specific challenges associated with implementing the proposed system in a distributed manner, as several Web services and applications may require. However, this is essential, and future work will need to consider such distributed solutions for implementing recommender systems. Second, recommender systems are increasingly co-evolving together with social networks: users often recommend items they liked or purchased to others in their social network. Hence, understanding how such recommender systems and social networks co-evolve and interact with each other forms another interesting, yet challenging, direction for future work.

REFERENCES

[1] P. Resnick and H. R. Varian, "Recommender systems," Commun. ACM, vol. 40, no. 3, pp. 56–58, 1997.

[2] M. Balabanović and Y. Shoham, "Fab: Content-based, collaborative recommendation," Commun. ACM, vol. 40, no. 3, pp. 66–72, 1997.

[3] G. Adomavicius and A. Tuzhilin, "Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions," IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 734–749, Jun. 2005.

[4] X. Chen, Z. Zheng, X. Liu, Z. Huang, and H. Sun, "Personalized QoS-aware web service recommendation and visualization," IEEE Trans. Services Comput., vol. 6, no. 1, pp. 35–47, First Quarter 2013.

[5] Z. Zheng, H. Ma, M. R. Lyu, and I. King, "QoS-aware Web service recommendation by collaborative filtering," IEEE Trans. Services Comput., vol. 4, no. 2, pp. 140–152, Apr.–Jun. 2011.

[6] J. Liu, P. Dolan, and E. R. Pedersen, "Personalized news recommendation based on click behavior," in Proc. 15th Int. Conf. Intell. User Interfaces, 2010, pp. 31–40.

[7] X. Su and T. M. Khoshgoftaar, "A survey of collaborative filtering techniques," Adv. Artif. Intell., vol. 2009, art. no. 421425, 2009.

[8] H. Kim, J. K. Kim, and Y. U. Ryu, "Personalized recommendation over a customer network for ubiquitous shopping," IEEE Trans. Services Comput., vol. 2, no. 2, pp. 140–151, Apr.–Jun. 2009.

[9] M. J. Pazzani and D. Billsus, "Content-based recommendation systems," in The Adaptive Web, vol. 4321. Berlin, Germany: Springer, 2007, pp. 325–341.

[10] R. Burke, "Hybrid recommender systems: Survey and experiments," User Model. User-Adapted Interaction, vol. 12, no. 4, pp. 331–370, 2002.

[11] K. Yoshii, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno, "An efficient hybrid music recommender system using an incrementally trainable probabilistic generative model," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 2, pp. 435–447, Feb. 2008.

[12] H. Liu, K. Liu, and Q. Zhao, "Logarithmic weak regret of non-Bayesian restless multi-armed bandit," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2011, pp. 1968–1971.

[13] W. Dai, Y. Gai, B. Krishnamachari, and Q. Zhao, "The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2011, pp. 2940–2943.

[14] K. Wang and L. Chen, "On optimality of myopic policy for restless multi-armed bandit problem: An axiomatic approach," IEEE Trans. Signal Process., vol. 60, no. 1, pp. 300–309, Jan. 2012.

[15] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multi-armed bandit problem," Mach. Learn., vol. 47, pp. 235–256, 2002.

[16] Y. Deshpande and A. Montanari, "Linear bandits in high dimension and recommendation systems," in Proc. 50th Annu. Allerton Conf. Commun., Control, Comput., 2012, pp. 1750–1754.

[17] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge, U.K.: Cambridge Univ. Press, 2006.

[18] G. Adomavicius and A. Tuzhilin, "Context-aware recommender systems," in Recommender Systems Handbook. New York, NY, USA: Springer, 2011, pp. 217–253.

[19] G. Adomavicius, R. Sankaranarayanan, S. Sen, and A. Tuzhilin, "Incorporating contextual information in recommender systems using a multidimensional approach," ACM Trans. Inf. Syst., vol. 23, no. 1, pp. 103–145, 2005.

[20] A. Said, S. Berkovsky, E. W. De Luca, and J. Hermanns, "Challenge on context-aware movie recommendation: CAMRa2011," in Proc. 5th ACM Conf. Recommender Syst., 2011, pp. 385–386.

[21] A. Chen, "Context-aware collaborative filtering system: Predicting the user's preference in the ubiquitous computing environment," in Location- and Context-Awareness. Berlin, Germany: Springer, 2005, pp. 244–253.

TABLE 5
The Variation of CTRs in Different User Arrival Processes

Case:                     1      2      3      4      5      6      7      8      9      10
CTR variation compared
to average CTR:         4.4%  -3.6%  -4.0%   5.1%  -5.5%  -0.5%   5.1%  -5.2%   4.8%  -0.5%



[22] D. Bouneffouf, A. Bouzeghoub, and A. L. Gancarski, "Hybrid-ε-greedy for mobile context-aware recommender system," in Proc. 16th Adv. Knowl. Discovery Data Mining, 2012, pp. 468–479.

[23] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proc. 19th Int. Conf. World Wide Web, 2010, pp. 661–670.

[24] W. Chu, L. Li, L. Reyzin, and R. E. Schapire, "Contextual bandits with linear payoff functions," in Proc. Int. Conf. Artif. Intell. Statist., 2011, pp. 208–214.

[25] T. Lu, D. Pal, and M. Pal, "Contextual multi-armed bandits," in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 485–492.

[26] G. Shani, D. Heckerman, and R. I. Brafman, "An MDP-based recommender system," in Proc. 18th Conf. Uncertainty Artif. Intell., 2005, vol. 6, pp. 1265–1295.

[27] S. Pandey, D. Chakrabarti, and D. Agarwal, "Multi-armed bandit problems with dependent arms," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 721–728.

[28] R. Kleinberg, A. Slivkins, and E. Upfal, "Multi-armed bandits in metric spaces," in Proc. 40th Annu. ACM Symp. Theory Comput., 2008, pp. 681–690.

[29] A. Slivkins, "Contextual bandits with similarity information," in Proc. 24th Annu. Conf. Learn. Theory, 2011, pp. 679–701.

[30] L. Song, C. Tekin, and M. van der Schaar, "Clustering based online learning in recommender systems: A bandit approach," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, pp. 4528–4532.

[31] N. Ailon and M. Charikar, "Fitting tree metrics: Hierarchical clustering and phylogeny," in Proc. 46th Annu. IEEE Symp. Found. Comput. Sci., 2005, pp. 73–82.

[32] Z. Ren and B. H. Krogh, "State aggregation in Markov decision processes," in Proc. 41st IEEE Conf. Decision Control, 2002, pp. 3819–3824.

[33] E. Chlebus, "An approximate formula for a partial sum of the divergent p-series," Appl. Math. Lett., vol. 22, no. 5, pp. 732–737, 2009.

Linqi Song received the BE and ME degrees in electrical engineering from Tsinghua University, Beijing, China, in 2006 and 2009, respectively. He is currently working toward the PhD degree in the Electrical Engineering Department at the University of California, Los Angeles. His research interests include machine learning, optimization, and communication networks.

Cem Tekin received the BSc degree in electrical and electronics engineering from the Middle East Technical University, Ankara, Turkey, in 2008, and the MSE degree in electrical engineering: systems, the MS degree in mathematics, and the PhD degree in electrical engineering: systems from the University of Michigan, Ann Arbor, in 2010, 2011, and 2013, respectively. He is a postdoctoral scholar at the University of California, Los Angeles. His research interests include machine learning, multi-armed bandit problems, data mining, cognitive radio, and networks. He received the University of Michigan Electrical Engineering Departmental Fellowship in 2008, and the Fred W. Ellersick Award for the Best Paper in MILCOM 2009.

Mihaela van der Schaar (F'10) is a Chancellor's Professor of electrical engineering at the University of California, Los Angeles. She is a Distinguished Lecturer of the Communications Society for 2011-2012 and the editor-in-chief of IEEE Transactions on Multimedia. She received an NSF CAREER Award (2004), the Best Paper Award from the IEEE Transactions on Circuits and Systems for Video Technology (2005), the Okawa Foundation Award (2006), the IBM Faculty Award (2005, 2007, 2008), the Most Cited Paper Award from EURASIP: Image Communications Journal (2006), the Gamenets Conference Best Paper Award (2011), and the 2011 IEEE Circuits and Systems Society Darlington Best Paper Award. She holds 33 granted US patents. She is a fellow of the IEEE. For more information about her research visit: http://medianetlab.ee.ucla.edu/.
