SIGIR 2016 COFIBA - Collaborative Filtering Bandits, the 39th ACM SIGIR
TRANSCRIPT
Collaborative Filtering Bandits
Shuai Li
University of Insubria
The 39th SIGIR
Jul 20, 2016
Joint with:
Alexandros Karatzoglou and Claudio Gentile
Overview
• Contextual Bandits have been used in Recommender Systems
• Traditional Bandits do not take collaborative filtering effects into account
• COFIBA (“Coffee Bar”): the main idea is to generalize Clustering of Bandits with co-clustering for collaborative effects
Motivation
• Classical CF methods, dynamic environments (Video-Music-Ads)
• Clustering Bandits are successful for large-scale recommendation [ICML’14]
• Latent clustering is more efficient, cheaper and scalable [ICML’16]
• Netflix: 2/3 of the movies watched are recommended
• Google News: recommendations generate 38% more click-throughs
• According to the Google 2011 annual report, “Advertising revenues made up 97% of our revenues in 2009 and 96% of our revenues in 2010 and 2011”
• Google gains billions in market value as YouTube drives ad growth (16.3% at 699.62 billion USD ⇐ 65 billion USD increase driven by YouTube ad clicks) – Jul 17, 2015
Continuous Cold-Start Problem in Dynamic Recommendation Settings
[Figure: a user must pick one item among ever-changing options (X-Box, Google Glass, PSP, iPad); each interaction yields one data record; some items share a category, others belong to different categories]
Challenge and Design Principle
• How can we adapt to a highly dynamic environment?
• Does it have a technically sound theoretical guarantee?
• Does it scale to big-data scenarios?
• Is it simple to deploy in industrial systems?
• Can it be applied to many other application domains?
Multi-Armed Bandit
• Multi-Armed Bandits and Regret
• A statistical problem for slot-machine players:
– One slot machine pays more than the others
– How can we detect it by playing?
• The tradeoff faced by the user between “exploitation” of the machine with the highest expected payoff and “exploration” to get more information about the expected payoffs of the other machines
• In recommendation systems → different items have different utilities; how can we find the one with the highest utility?
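The detection question above is what index policies such as UCB1 answer. As a hedged illustration (UCB1 is the classical algorithm of Auer et al., not the method proposed in this talk), a minimal simulation against Bernoulli slot machines:

```python
import math
import random

def ucb1(payoff_means, horizon, seed=0):
    """Play `horizon` rounds of UCB1 against Bernoulli slot machines.

    Each arm's index is its empirical mean plus an exploration bonus
    that shrinks as the arm is pulled more often.
    """
    rng = random.Random(seed)
    n_arms = len(payoff_means)
    counts = [0] * n_arms          # pulls per arm
    sums = [0.0] * n_arms          # cumulative payoff per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:            # pull every arm once first
            arm = t - 1
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < payoff_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total

counts, total = ucb1([0.2, 0.25, 0.6], horizon=5000)
# The machine paying 0.6 should attract the vast majority of pulls.
```

The exploration bonus √(2 ln t / n_a) shrinks as an arm accumulates pulls, so play concentrates on the machine with the highest empirical payoff while the others keep being probed at a logarithmic rate.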
Traditional Bandits vs. Clustering Bandits
• Graph with n users
• Unknown users’ profiles ui, i = 1 . . . n
[Figure: a graph over users u1, ..., u8 with arbitrary available connections]
Available connections need not reflect similar interests among users ⇒ we need to infer the connections in the highly dynamic environment
Forming the Implicit Graph for Users
• Set of n users
• Unknown users’ profiles ui, i = 1 . . . n
[Figure: a graph over users u1, ..., u8, with edges drawn between users showing similar observed behavior]
Drawing edges based on observed behavior of users (clustering algorithms)
• Content universe changing rapidly over time
• Many users: scaling properties are major concerns
Forming the Graph and Clustering Bandit Model
(group recommendation to subcommunities):
• m << n clusters
• Each cluster has a single profile uj
• Need to learn both the user profile vectors and the cluster profile vectors
• User profiles are used to compute similarity for the graph; cluster profiles are used for recommendations
• zj is an aggregation of the proxies wi
[Figure: users grouped into estimated clusters; node proxies w1, ..., w8 and cluster proxies z1, z2]
Pruning the edges of the graph
• Start off from the full n-node graph (or a sparsified version thereof) and a single estimated cluster
• Use node proxies wi to delete edges: if ||wi − wj|| > θ(i, j) ⇒ delete edge (i, j)
• Estimated clusters are the current connected components
[Figure: the user graph after pruning; node proxies w1, ..., w8 and the cluster proxies z1, z2 of the resulting connected components]
• When serving user i in estimated cluster j, update both the node proxy wi and the cluster proxy zj
• Recompute clusters whenever the graph changes, i.e., whenever edges are deleted
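The pruning rule above can be sketched in a few lines; here θ is a fixed placeholder threshold rather than the confidence-based θ(i, j) of the actual algorithm, and the proxies are plain vectors:

```python
import itertools
import math

def prune_and_cluster(proxies, theta):
    """Delete edge (i, j) of the complete graph whenever
    ||w_i - w_j|| > theta, then return the connected components
    (the current estimated clusters)."""
    n = len(proxies)
    adj = {i: set() for i in range(n)}
    for i, j in itertools.combinations(range(n), 2):
        if math.dist(proxies[i], proxies[j]) <= theta:
            adj[i].add(j)          # keep edge only if proxies are close
            adj[j].add(i)
    # connected components via depth-first search
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v] - seen)
        clusters.append(sorted(comp))
    return clusters

# Two well-separated groups of user proxies:
proxies = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(prune_and_cluster(proxies, theta=1.0))
# → [[0, 1, 2], [3, 4]]
```

In the actual algorithm the threshold θ(i, j) shrinks with the number of observations for users i and j, so edges survive only while the data cannot yet distinguish the two users' profiles.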
Co-Clustering for Collaborative Filtering
Cofiba with Co-clustering of User-Item Graph
Challenges
• Statistical: tight theoretical convergence guarantee
• Computational: running time cost and memory cost
• Performance: online prediction over user-item graphs
Tricks
1. Start off from random (Erdos-Renyi) graph
2. Clustering by connected components: each current cluster is a union of underlying clusters
Tricks
1. Start off from random (Erdos-Renyi) graph
Known fact:
• Random (Erdos-Renyi) graphs lead to one initial cluster with a pre-specified probability
• Initial n-clique graph G
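A sparsified alternative to the initial n-clique is to draw a random G(n, p) graph with p just above the (log n)/n connectivity threshold, which is connected (i.e., yields one initial cluster) with high probability. A sketch, where the constant 3 is an illustrative choice, not the paper's:

```python
import math
import random

def erdos_renyi_init(n, seed=0):
    """Draw a sparsified initial user graph G(n, p) with
    p = min(1, 3 log(n) / n); above the (log n)/n connectivity
    threshold the graph is connected with high probability."""
    rng = random.Random(seed)
    p = min(1.0, 3.0 * math.log(n) / n)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if rng.random() < p
    ]

edges = erdos_renyi_init(200)
# Expected edge count ≈ p · n(n−1)/2 ≈ 3 · 199 · log(200) / 2,
# far fewer than the 19900 edges of the full 200-clique.
```

This keeps the number of edges near O(n log n) instead of O(n²), which is what makes the pruning step affordable for many users.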
Tricks
2. During the learning process, clusters are unions of underlying preference clusters
[Figure: estimated clusters as unions of underlying preference clusters; node proxies w1, ..., w8 and cluster proxies z1, z2]
• Within-cluster edges (w.r.t. the underlying clustering) are never deleted
• Between-cluster edges (w.r.t. the underlying clustering) will eventually be deleted, assuming a gap between cluster profile vectors and enough observed payoff values
Algorithmic Idea
• Group users based on items, and group items based on the clustering induced over the users
• Item set: I = {x1, . . . , x|I|}
• Maintain multiple clusterings over the set of users U and a single clustering over the set of items I
• The neighborhood sets Nit,t(xt,k) w.r.t. the items in Cit are stored in the clusters at the user side pointed to by those items
• Update the clusterings at the user side and the unique clustering at the item side
[Figure: co-clustering dynamics over (a) initialization, (b) time t, (c) time t + 1; several user graphs over U, one per estimated item cluster Î1,t, Î2,t, ..., and a single item graph over I]
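A structural sketch of the round just described (our own simplification, not the paper's exact pseudocode: the ridge-regression statistics are kept diagonal for brevity, and the maintenance of the clusterings themselves is omitted):

```python
import math

class CofibaSketch:
    """Structural sketch of a COFIBA-style round: per-item-cluster
    user clusterings drive a LinUCB-style score computed from the
    serving cluster's aggregated statistics (diagonal approximation)."""

    def __init__(self, n_users, d, alpha=1.0):
        self.d = d
        self.alpha = alpha
        # per-user ridge statistics: M = I + sum x x^T (diagonal only
        # here), b = sum a_t x_t
        self.M = [[1.0] * d for _ in range(n_users)]
        self.b = [[0.0] * d for _ in range(n_users)]
        # one user clustering per item cluster; start with everyone together
        self.user_clusters = {0: [list(range(n_users))]}
        self.item_cluster_of = {}          # item id -> item cluster id

    def _cluster_of(self, user, item):
        h = self.item_cluster_of.get(item, 0)
        for c in self.user_clusters.get(h, [[]]):
            if user in c:
                return c
        return [user]

    def score(self, user, item, x):
        """UCB score of context x for `user`, using statistics aggregated
        over the user's cluster w.r.t. the item's cluster."""
        cluster = self._cluster_of(user, item)
        # aggregate the diagonal M's, keeping a single identity term
        M = [sum(self.M[u][k] for u in cluster) - (len(cluster) - 1)
             for k in range(self.d)]
        b = [sum(self.b[u][k] for u in cluster) for k in range(self.d)]
        mean = sum(b[k] / M[k] * x[k] for k in range(self.d))
        width = self.alpha * math.sqrt(
            sum(x[k] * x[k] / M[k] for k in range(self.d)))
        return mean + width

    def update(self, user, x, payoff):
        for k in range(self.d):
            self.M[user][k] += x[k] * x[k]
            self.b[user][k] += payoff * x[k]
```

At serving time one would pick the candidate item maximizing `score` and then feed the observed payoff back through `update`; the real algorithm additionally updates both graphs and recomputes their connected components.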
Advancements
• Explore the collaborative effects that arise due to the ever-changing interaction of both customers and products, e.g., in IR systems such as Google
• Design a truly online collaborative filtering solution augmented by an exploration-exploitation strategy
• Dynamically group users based on the items under consideration and, at the same time, group items based on the similarity of the clusterings induced over the users
• A principled recipe to alleviate the cold-start problem, in terms of both theory and practice
Theoretical Guarantee
• Let the served user it be generated uniformly at random from U; the j-th induced partition P(ηj) over U is made up of mj clusters of cardinality vj,1, vj,2, . . . , vj,mj
• The sequence of items in Cit is generated i.i.d. according to a given but unknown distribution over I, with at ∈ [−1, 1] and E[at] = uit⊤xt
• Let parameters α and α2 be suitable functions of log(1/δ). If ct ≤ c for all t, then, as T grows large, with probability at least 1 − δ the cumulative regret satisfies

∑t=1..T rt = O( ( Ej[S] + 1 + √((2c − 1) VARj(S)) ) √(dT/n) ),

where S = S(j) = ∑k=1..mj √vj,k, and Ej[·] and VARj(·) denote, respectively, the expectation and the variance w.r.t. the distribution of ηj over I
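A quick sanity check of the bound (our own instantiation, not from the slides): take m underlying clusters of equal size n/m, so S = ∑k=1..m √(n/m) = √(mn) and VARj(S) = 0. The bound then reads

∑t=1..T rt = O( (√(mn) + 1) √(dT/n) ) = O( √(m d T) ),

i.e., √m times the regret of a single linear bandit learner shared by all users; at the other extreme, m = n (no collaborative structure) gives S = n and regret O( √(n d T) ), the rate of n independent learners.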
Data Sets
• Yahoo: ICML 2012 Exploration & Exploitation Challenge, news article recommendation algorithms on the “Today Module”, 3M records
• Telefonica: clicks on ads displayed to users on one of the websites that Telefonica operates, 15M records
• Avazu: The data was provided for the challenge to predict the click-through rate of impressions on mobile devices, 40M records
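For context, CTR on logged bandit data of this kind is commonly measured with the offline replay method of Li et al.: a logged event counts for the evaluated policy only when the policy picks the same item the logging system displayed. A minimal sketch (the event fields are made up for illustration):

```python
def replay_ctr(policy, log):
    """Unbiased offline evaluation on uniformly-logged bandit data:
    keep an event only if the policy's choice matches the logged one,
    and average the clicks over the kept events."""
    matched, clicks = 0, 0
    history = []
    for user, candidates, shown, click in log:
        choice = policy(user, candidates, history)
        if choice == shown:                # event is usable for this policy
            matched += 1
            clicks += click
            history.append((user, candidates, shown, click))
    return clicks / matched if matched else 0.0

# toy log: a policy that always picks item "a" only consumes matching rows
log = [
    (1, ["a", "b"], "a", 1),
    (2, ["a", "b"], "b", 0),
    (1, ["a", "b"], "a", 0),
]
ctr = replay_ctr(lambda user, candidates, history: "a", log)
# → 0.5 (two matched events, one click)
```

A learning policy would consult `history` before choosing, which is why matched events are appended to it.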
Experimental Evidence
• In general, the longer the lifecycle of an item and the fewer the items, the higher the chance that users with similar preferences will consume it, and hence the bigger the collaborative effects contained in the data
• It is therefore reasonable to expect that our algorithm will be more effective on datasets where the collaborative effects are indeed strong
Experimental Result: Benchmark News Data
[Figure: click-through rate (CTR) vs. number of rounds on the Yahoo dataset for LINUCB-ONE, DYNUCB, LINUCB-IND, CLUB, LINUCB-V, and COFIBA]
Experimental Result: Benchmark News Data
• The users here span a wide range of demographic characteristics; this dataset is derived from the consumption of news items that are often interesting to large portions of these users and, as such, do not create strong polarization into subcommunities
• This implies that, more often than not, there are a few specific hot news items that all users might express interest in, and it is natural to expect that these pieces of news are intended to reach a wide audience of consumers
• Even in this non-trivial case, COFIBA still achieves significantly increased prediction accuracy compared to its competitors, suggesting that simultaneous clustering at both the user and the item (news) sides might be an even more effective strategy to earn clicks in news recommendation systems
Experimental Result: Benchmark Advertising Data
[Figure: click-through rate (CTR) vs. number of rounds on the Avazu dataset for LINUCB-ONE, DYNUCB, LINUCB-IND, CLUB, LINUCB-V, and COFIBA]
Experimental Result: Benchmark Advertising Data
• The Avazu data are furnished from its professional digital advertising platform, where customers click on ad impressions via iOS/Android mobile apps or through websites, serving either the publisher or the advertiser, which leads to a high volume of daily internet traffic
• On this dataset, COFIBA works extremely well during the cold start, and comparatively best in all later stages
Experimental Result: Production Advertising Data
[Figure: click-through rate (CTR) vs. number of rounds on the Telefonica dataset for LINUCB-ONE, DYNUCB, LINUCB-IND, CLUB, LINUCB-V, and COFIBA]
Experimental Result: Production Advertising Data
• Most of the users in the Telefonica data are from a diverse sample of people in Spain, and it is easy to imagine that this dataset spans a large number of communities across its population
• Thus we can assume that collaborative effects are much more evident here, and indeed COFIBA is able to leverage these effects efficiently
Experimental Result: Typical Distribution of Users
[Figure: five bar plots showing a typical distribution of cluster sizes over users for the Yahoo dataset, one plot per item cluster]
Experimental Result: Typical Distribution of Users
• A typical distribution of cluster sizes over users for the Yahoo dataset: each bar plot corresponds to a cluster at the item side, and each bar represents the fraction of users contained in the corresponding user cluster
• The emerging pattern is always the same: we have a few clusters over the items, with very unbalanced sizes, and, corresponding to each item cluster, a few clusters over the users, again with very unbalanced sizes
• This recurring pattern confirms our theoretical findings, and is a property of the data that the COFIBA algorithm can provably take advantage of
Experimental Result: Summary
• Despite the differences among the datasets, the experimental evidence we collected on them is quite consistent: in all cases COFIBA significantly outperforms all the other competing methods we tested
• This is especially noticeable during the cold-start period, but the same relative behavior essentially shows up during the whole time window of our experiments
• COFIBA is far more effective in exploiting the collaborative effects embedded in the data, while still being amenable to running on large datasets
Conclusion and Future Work
• Introduced collaborative filtering bandits that make the best use of the data seen so far
• Provided a sharp theoretical guarantee
• Significantly outperformed the state of the art
• Some directions:
– Extend the analysis to asynchronous networks [ICML’16]
– Develop a more generic context-aware clustering method
– Extend our techniques to the quantification domain [SIGKDD’16]
Questions?
• Papers etc. → https://sites.google.com/site/shuailidotsli
• Thanks :-)