SIGIR 2016 COFIBA - Collaborative Filtering Bandits, the 39th ACM SIGIR
TRANSCRIPT
Collaborative Filtering Bandits
Shuai Li
University of Insubria
The 39th SIGIR
Jul 20, 2016
Joint with:
Alexandros Karatzoglou and Claudio Gentile
Overview
• Contextual Bandits have been used in Recommender Systems
• Traditional Bandits do not take collaborative filtering effects into account
• COFIBA (“Coffee Bar”): the main idea is to generalize Clustering of Bandits with co-clustering for collaborative effects
Motivation
• Classical CF methods, dynamic environments (Video-Music-Ads)
• Clustering Bandits are successful for large-scale recommendation [ICML’14]
• Latent clustering is more efficient, cheaper and scalable [ICML’16]
• Netflix: 2/3 of the movies watched are recommended
• Google News: recommendations generate 38% more click-throughs
• According to the Google 2011 annual report, “Advertising revenues made up 97% of our revenues in 2009 and 96% of our revenues in 2010 and 2011”
• Google gains billions in market value as YouTube drives ad growth (16.3% at 699.62 billion USD ⇐ 65 billion USD increase driven by YouTube ad clicks) – Jul 17, 2015
Continuous Cold-Start Problem in Dynamic Recommendation Settings
[Figure: a user must pick one item among ever-changing options (X-Box, Google Glass, PSP, iPad); each interaction yields one data record; some items share a category, others belong to different categories]
Challenge and Design Principle
• How can we adapt to a highly dynamic environment?
• Does it have a technically sound theoretical guarantee?
• Does it scale to big-data scenarios?
• Is it simple to deploy in industrial systems?
• Can it be applied to many other application domains?
Multi-Armed Bandit
• Multi-Armed Bandits and Regret
• A statistical problem for slot-machine players:
– One slot machine pays more than the others
– How can we detect it by playing?
• The tradeoff faced by the user between “exploitation” of the machine with the highest expected payoff and “exploration” to get more information about the expected payoffs of the other machines
• In recommendation systems → different items have different utilities; how can we find the one with the highest utility?
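The detection question above is what index policies such as UCB1 answer. As a hedged illustration (UCB1 is the classical algorithm of Auer et al., not the method proposed in this talk), a minimal simulation against Bernoulli slot machines:

```python
import math
import random

def ucb1(payoff_means, horizon, seed=0):
    """Play `horizon` rounds of UCB1 against Bernoulli slot machines.

    Each arm's index is its empirical mean plus an exploration bonus
    that shrinks as the arm is pulled more often.
    """
    rng = random.Random(seed)
    n_arms = len(payoff_means)
    counts = [0] * n_arms          # pulls per arm
    sums = [0.0] * n_arms          # cumulative payoff per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:            # pull every arm once first
            arm = t - 1
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < payoff_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total

counts, total = ucb1([0.2, 0.25, 0.6], horizon=5000)
# The machine paying 0.6 should attract the vast majority of pulls.
```

The exploration bonus √(2 ln t / n_a) shrinks as an arm accumulates pulls, so play concentrates on the machine with the highest empirical payoff while the others keep being probed at a logarithmic rate.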
Traditional Bandits vs. Clustering Bandits
• Graph with n users
• Unknown users’ profiles ui, i = 1 . . . n
[Figure: a graph over users u1, ..., u8 with arbitrary available connections]
Available connections need not reflect similar interests among users ⇒ we need to infer the connections in the highly dynamic environment
Forming the Implicit Graph for Users
• Set of n users
• Unknown users’ profiles ui, i = 1 . . . n
[Figure: a graph over users u1, ..., u8, with edges drawn between users showing similar observed behavior]
Drawing edges based on observed behavior of users (clustering algorithms)
• Content universe changing rapidly over time
• Many users: scaling properties are major concerns
Forming the Graph and Clustering Bandit Model
(group recommendation to subcommunities):
• m << n clusters
• Each cluster has a single profile uj
• Need to learn both the user profile vectors and the cluster profile vectors
• User profiles are used to compute similarity for the graph; cluster profiles are used for recommendations
• zj is an aggregation of the proxies wi
[Figure: users grouped into estimated clusters; node proxies w1, ..., w8 and cluster proxies z1, z2]
Pruning the edges of the graph
• Start off from the full n-node graph (or a sparsified version thereof) and a single estimated cluster
• Use node proxies wi to delete edges: if ||wi − wj|| > θ(i, j) ⇒ delete edge (i, j)
• Estimated clusters are the current connected components
[Figure: the user graph after pruning; node proxies w1, ..., w8 and the cluster proxies z1, z2 of the resulting connected components]
• When serving user i in estimated cluster j, update both the node proxy wi and the cluster proxy zj
• Recompute clusters whenever the graph changes, i.e., whenever edges are deleted
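The pruning rule above can be sketched in a few lines; here θ is a fixed placeholder threshold rather than the confidence-based θ(i, j) of the actual algorithm, and the proxies are plain vectors:

```python
import itertools
import math

def prune_and_cluster(proxies, theta):
    """Delete edge (i, j) of the complete graph whenever
    ||w_i - w_j|| > theta, then return the connected components
    (the current estimated clusters)."""
    n = len(proxies)
    adj = {i: set() for i in range(n)}
    for i, j in itertools.combinations(range(n), 2):
        if math.dist(proxies[i], proxies[j]) <= theta:
            adj[i].add(j)          # keep edge only if proxies are close
            adj[j].add(i)
    # connected components via depth-first search
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v] - seen)
        clusters.append(sorted(comp))
    return clusters

# Two well-separated groups of user proxies:
proxies = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(prune_and_cluster(proxies, theta=1.0))
# → [[0, 1, 2], [3, 4]]
```

In the actual algorithm the threshold θ(i, j) shrinks with the number of observations for users i and j, so edges survive only while the data cannot yet distinguish the two users' profiles.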
Co-Clustering for Collaborative Filtering
Cofiba with Co-clustering of User-Item Graph
Challenges
• Statistical: tight theoretical convergence guarantee
• Computational: running time cost and memory cost
• Performance: online prediction over user-item graphs
Tricks
1. Start off from random (Erdos-Renyi) graph
2. Clustering by connected components: each current cluster is a union of underlying clusters
Tricks
1. Start off from random (Erdos-Renyi) graph
Known fact:
• Random (Erdos-Renyi) graphs lead to one initial cluster with a pre-specified probability
• Initial n-clique graph G
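A sparsified alternative to the initial n-clique is to draw a random G(n, p) graph with p just above the (log n)/n connectivity threshold, which is connected (i.e., yields one initial cluster) with high probability. A sketch, where the constant 3 is an illustrative choice, not the paper's:

```python
import math
import random

def erdos_renyi_init(n, seed=0):
    """Draw a sparsified initial user graph G(n, p) with
    p = min(1, 3 log(n) / n); above the (log n)/n connectivity
    threshold the graph is connected with high probability."""
    rng = random.Random(seed)
    p = min(1.0, 3.0 * math.log(n) / n)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if rng.random() < p
    ]

edges = erdos_renyi_init(200)
# Expected edge count ≈ p · n(n−1)/2 ≈ 3 · 199 · log(200) / 2,
# far fewer than the 19900 edges of the full 200-clique.
```

This keeps the number of edges near O(n log n) instead of O(n²), which is what makes the pruning step affordable for many users.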
Tricks
2. During the learning process, clusters are unions of underlying preference clusters
[Figure: estimated clusters as unions of underlying preference clusters; node proxies w1, ..., w8 and cluster proxies z1, z2]
• Within-cluster edges (w.r.t. the underlying clustering) are never deleted
• Between-cluster edges (w.r.t. the underlying clustering) will eventually be deleted, assuming a gap between cluster profile vectors and enough observed payoff values
Algorithmic Idea
• Group users based on items, and group items based on the clustering induced over the users
• Item set: I = {x1, . . . , x|I|}
• Maintain multiple clusterings over the set of users U and a single clustering over the set of items I
• The neighborhood sets Nit,t(xt,k) w.r.t. the items in Cit are stored in the clusters at the user side pointed to by those items
• Update the clusterings at the user side and the unique clustering at the item side
[Figure: co-clustering dynamics over (a) initialization, (b) time t, (c) time t + 1; several user graphs over U, one per estimated item cluster Î1,t, Î2,t, ..., and a single item graph over I]
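A structural sketch of the round just described (our own simplification, not the paper's exact pseudocode: the ridge-regression statistics are kept diagonal for brevity, and the maintenance of the clusterings themselves is omitted):

```python
import math

class CofibaSketch:
    """Structural sketch of a COFIBA-style round: per-item-cluster
    user clusterings drive a LinUCB-style score computed from the
    serving cluster's aggregated statistics (diagonal approximation)."""

    def __init__(self, n_users, d, alpha=1.0):
        self.d = d
        self.alpha = alpha
        # per-user ridge statistics: M = I + sum x x^T (diagonal only
        # here), b = sum a_t x_t
        self.M = [[1.0] * d for _ in range(n_users)]
        self.b = [[0.0] * d for _ in range(n_users)]
        # one user clustering per item cluster; start with everyone together
        self.user_clusters = {0: [list(range(n_users))]}
        self.item_cluster_of = {}          # item id -> item cluster id

    def _cluster_of(self, user, item):
        h = self.item_cluster_of.get(item, 0)
        for c in self.user_clusters.get(h, [[]]):
            if user in c:
                return c
        return [user]

    def score(self, user, item, x):
        """UCB score of context x for `user`, using statistics aggregated
        over the user's cluster w.r.t. the item's cluster."""
        cluster = self._cluster_of(user, item)
        # aggregate the diagonal M's, keeping a single identity term
        M = [sum(self.M[u][k] for u in cluster) - (len(cluster) - 1)
             for k in range(self.d)]
        b = [sum(self.b[u][k] for u in cluster) for k in range(self.d)]
        mean = sum(b[k] / M[k] * x[k] for k in range(self.d))
        width = self.alpha * math.sqrt(
            sum(x[k] * x[k] / M[k] for k in range(self.d)))
        return mean + width

    def update(self, user, x, payoff):
        for k in range(self.d):
            self.M[user][k] += x[k] * x[k]
            self.b[user][k] += payoff * x[k]
```

At serving time one would pick the candidate item maximizing `score` and then feed the observed payoff back through `update`; the real algorithm additionally updates both graphs and recomputes their connected components.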
Advancements
• Explore the collaborative effects that arise due to the ever-changing interaction of both customers and products, e.g., in IR systems such as Google
• Design a truly online collaborative filtering solution augmented by an exploration-exploitation strategy
• Dynamically group users based on the items under consideration and, at the same time, group items based on the similarity of the clusterings induced over the users
• A principled recipe to alleviate the cold-start problem, in terms of both theory and practice
Theoretical Guarantee
• Let the served user it be generated uniformly at random from U; the j-th induced partition P(ηj) over U is made up of mj clusters of cardinality vj,1, vj,2, . . . , vj,mj
• The sequence of items in Cit is generated i.i.d. according to a given but unknown distribution over I, with at ∈ [−1, 1] and E[at] = uit⊤xt
• Let parameters α and α2 be suitable functions of log(1/δ). If ct ≤ c for all t, then, as T grows large, with probability at least 1 − δ the cumulative regret satisfies

∑t=1..T rt = O( ( Ej[S] + 1 + √((2c − 1) VARj(S)) ) √(dT/n) ),

where S = S(j) = ∑k=1..mj √vj,k, and Ej[·] and VARj(·) denote, respectively, the expectation and the variance w.r.t. the distribution of ηj over I
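A quick sanity check of the bound (our own instantiation, not from the slides): take m underlying clusters of equal size n/m, so S = ∑k=1..m √(n/m) = √(mn) and VARj(S) = 0. The bound then reads

∑t=1..T rt = O( (√(mn) + 1) √(dT/n) ) = O( √(m d T) ),

i.e., √m times the regret of a single linear bandit learner shared by all users; at the other extreme, m = n (no collaborative structure) gives S = n and regret O( √(n d T) ), the rate of n independent learners.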
Data Sets
• Yahoo: ICML 2012 Exploration & Exploitation Challenge, news article recommendation algorithms on the “Today Module”, 3M records
• Telefonica: clicks on ads displayed to users on one of the websites that Telefonica operates, 15M records
• Avazu: The data was provided for the challenge to predict the click-through rate of impressions on mobile devices, 40M records
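For context, CTR on logged bandit data of this kind is commonly measured with the offline replay method of Li et al.: a logged event counts for the evaluated policy only when the policy picks the same item the logging system displayed. A minimal sketch (the event fields are made up for illustration):

```python
def replay_ctr(policy, log):
    """Unbiased offline evaluation on uniformly-logged bandit data:
    keep an event only if the policy's choice matches the logged one,
    and average the clicks over the kept events."""
    matched, clicks = 0, 0
    history = []
    for user, candidates, shown, click in log:
        choice = policy(user, candidates, history)
        if choice == shown:                # event is usable for this policy
            matched += 1
            clicks += click
            history.append((user, candidates, shown, click))
    return clicks / matched if matched else 0.0

# toy log: a policy that always picks item "a" only consumes matching rows
log = [
    (1, ["a", "b"], "a", 1),
    (2, ["a", "b"], "b", 0),
    (1, ["a", "b"], "a", 0),
]
ctr = replay_ctr(lambda user, candidates, history: "a", log)
# → 0.5 (two matched events, one click)
```

A learning policy would consult `history` before choosing, which is why matched events are appended to it.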
Experimental Evidence
• In general, the longer the lifecycle of an item and the fewer the items, the higher the chance that users with similar preferences will consume it, and hence the bigger the collaborative effects contained in the data
• It is therefore reasonable to expect that our algorithm will be more effective on datasets where the collaborative effects are indeed strong
Experimental Result: Benchmark News Data
[Figure: click-through rate (CTR) vs. number of rounds on the Yahoo dataset for LINUCB-ONE, DYNUCB, LINUCB-IND, CLUB, LINUCB-V, and COFIBA]
Experimental Result: Benchmark News Data
• The users here span a wide range of demographic characteristics; this dataset is derived from the consumption of news items that are often interesting to large portions of these users and, as such, do not create strong polarization into subcommunities
• This implies that, more often than not, there are a few specific hot news items that all users might express interest in, and it is natural to expect that these pieces of news are intended to reach a wide audience of consumers
• Even in this non-trivial case, COFIBA still achieves significantly increased prediction accuracy compared to its competitors, suggesting that simultaneous clustering at both the user and the item (news) sides might be an even more effective strategy to earn clicks in news recommendation systems
Experimental Result: Benchmark Advertising Data
[Figure: click-through rate (CTR) vs. number of rounds on the Avazu dataset for LINUCB-ONE, DYNUCB, LINUCB-IND, CLUB, LINUCB-V, and COFIBA]
Experimental Result: Benchmark Advertising Data
• The Avazu data are furnished from its professional digital advertising platform, where customers click on ad impressions via iOS/Android mobile apps or through websites, serving either the publisher or the advertiser, which leads to a high volume of daily internet traffic
• On this dataset, COFIBA works extremely well during the cold start, and comparatively best in all later stages
Experimental Result: Production Advertising Data
[Figure: click-through rate (CTR) vs. number of rounds on the Telefonica dataset for LINUCB-ONE, DYNUCB, LINUCB-IND, CLUB, LINUCB-V, and COFIBA]
Experimental Result: Production Advertising Data
• Most of the users in the Telefonica data are from a diverse sample of people in Spain, and it is easy to imagine that this dataset spans a large number of communities across its population
• Thus we can assume that collaborative effects are much more evident here, and indeed COFIBA is able to leverage these effects efficiently
Experimental Result: Typical Distribution of Users
[Figure: five bar plots showing a typical distribution of cluster sizes over users for the Yahoo dataset, one plot per item cluster]
Experimental Result: Typical Distribution of Users
• A typical distribution of cluster sizes over users for the Yahoo dataset: each bar plot corresponds to a cluster at the item side, and each bar represents the fraction of users contained in the corresponding user cluster
• The emerging pattern is always the same: we have a few clusters over the items, with very unbalanced sizes, and, corresponding to each item cluster, a few clusters over the users, again with very unbalanced sizes
• This recurring pattern confirms our theoretical findings, and is a property of the data that the COFIBA algorithm can provably take advantage of
Experimental Result: Summary
• Despite the differences among the datasets, the experimental evidence we collected on them is quite consistent: in all cases COFIBA significantly outperforms all the other competing methods we tested
• This is especially noticeable during the cold-start period, but the same relative behavior essentially shows up during the whole time window of our experiments
• COFIBA is far more effective in exploiting the collaborative effects embedded in the data, while still being amenable to running on large datasets
Conclusion and Future Work
• Introduced collaborative filtering bandits that make the best use of the data seen so far
• Provided a sharp theoretical guarantee
• Significantly outperformed the state of the art
• Some directions:
– Extend the analysis to asynchronous networks [ICML’16]
– Develop a more generic context-aware clustering method
– Extend our techniques to the quantification domain [SIGKDD’16]
Questions?
• Papers etc. → https://sites.google.com/site/shuailidotsli
• Thanks :-)