


Streaming Recommender Systems

Shiyu Chang^1, Yang Zhang^1, Jiliang Tang^2, Dawei Yin^3, Yi Chang^4, Mark A. Hasegawa-Johnson^1, Thomas S. Huang^1

^1 Beckman Institute, University of Illinois at Urbana-Champaign, IL 61801.
^2 Computer Science and Engineering, Michigan State University, East Lansing, MI 48824.
^3 Data Science Lab, JD.com, Beijing, 100101, China.
^4 Search Technology Lab, Huawei Research America, Santa Clara, CA 95050

{chang87, yzhan143, jhasegaw, t-huang1}@illinois.edu, [email protected], {yindawei, yichang}@acm.org

ABSTRACT
The increasing popularity of real-world recommender systems produces data continuously and rapidly, and it becomes more realistic to study recommender systems under streaming scenarios. Data streams present distinct properties such as temporal order, continuity and high velocity, which pose tremendous challenges to traditional recommender systems. In this paper, we investigate the problem of recommendation with stream inputs. In particular, we provide a principled framework termed sRec, which provides explicit continuous-time random process models of the creation of users and topics, and of the evolution of their interests. A variational Bayesian approach called recursive meanfield approximation is proposed, which permits computationally efficient instantaneous on-line inference. Experimental results on several real-world datasets demonstrate the advantages of sRec over other state-of-the-art methods.

Keywords
Streaming, recommender system, online learning, continuous time, data stream.

1. INTRODUCTION
Recommender systems help to overcome information overload by providing personalized suggestions from a plethora of choices based on historical data. The pervasive use of real-world recommender systems such as eBay and Netflix generates massive data at an unprecedented rate. For example, more than 10 million transactions are made per day on eBay^1, and Netflix gained more than three million subscribers from mid-March 2013 to April 2013^2. Such data is temporally ordered, continuous, high-velocity and time-varying, which determines the streaming nature of data in

^1 http://www.webretailer.com/articles/ebay-statistics.asp

^2 https://en.wikipedia.org/wiki/Netflix

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License.
WWW 2017, April 3–7, 2017, Perth, Australia.
ACM 978-1-4503-4913-0/17/04.
http://dx.doi.org/10.1145/3038912.3052627


Figure 1: An illustration of streaming recommender systems. Online update and prediction are made based on three types of events: user feedback activities, new users, and new items.

recommender systems. Hence it is more realistic to study recommender systems using a streaming data framework.

In recent years, there is a large literature exploiting temporal information [9, 14, 28, 29], and it is evident that explicitly modeling temporal dynamics greatly improves recommendation performance [5, 14, 27]. However, the vast majority of those systems do not consider the input data as streams. Recommendation under streaming settings needs to tackle the following challenges simultaneously:

Real-time updating: One inherent characteristic of data streams is their high velocity; hence the recommender system needs to update and respond instantaneously in order to catch users' instant intentions and demands.

Unknown size: New users or freshly posted items arrive continuously in data streams. For example, there were more than 21 million new products offered on the main Amazon USA websites from December 2013 to August 2014^3. Hence the number of users and the size of recommendation lists are unknown in advance. Nevertheless, many existing algorithms [13] assume the availability of such information.

Concept shift: Data stream evolution leads to concept shifts, e.g., a new product launch reduces the popularity of previous versions. Likewise, user preferences drift over time. The recommender system should have the ability to capture such signals and timely adapt its recommendations accordingly.

^3 http://export-x.com/2014/08/14/many-products-amazon-sell-2/

This paper defines a streaming recommender system with real-time updates for a shifting concept pool of unknown size. An illustrative example of the proposed streaming recommender system is shown in figure 1. The input streams are modeled as three types of events: user feedback activities, new users, and new items. The system continuously updates its model to capture dynamics and pursue real-time recommendations. In particular, we tackle all three challenges simultaneously with a novel streaming recommender system framework (sRec). The major contributions are threefold:

• We provide a principled way to handle data as streams for effective recommendations.

• We propose a novel recommender system, sRec, which captures temporal dynamics under streaming settings and makes real-time recommendations.

• We conduct extensive experiments on various real-world datasets to understand the mechanism of the proposed framework.

2. MODEL FORMULATION
We model the streaming setting of recommender systems by assuming a set of continuously time-varying, partially known user-item relationships $\{R_{ij}^t\}$, where each $R_{ij}^t$ represents the rating from user i to item j at a given time t, which is modeled as a random variable. We assume that time is continuous, $t \in \mathbb{R}_+$. In addition, $m_t \in \mathbb{Z}_+$ and $n_t \in \mathbb{Z}_+$ denote the numbers of users and items at time t. The numbers of users and items are dynamically changing over time.

Moreover, these time-dependent ratings are partially observed. We denote $r_{ij}^t$ as the deterministic observed value of the random variable $R_{ij}^t$; $\mathcal{R}$ as a set that contains these observed rating values $r_{ij}^t$; and $\mathcal{R}^{0:t}$ as the subset of the observed ratings no later than t. Here, we want to emphasize that the user-item relationships at different times are considered different: $r_{ij}^t \in \mathcal{R}$ does not imply $r_{ij}^{t'} \in \mathcal{R}$, $\forall t' > t$, since we assume that the user-item relationships are inherently changing. One advantage is that we no longer assume that an item can be rated or adopted only once by a user. In contrast, users have the flexibility to edit or even delete ratings previously recorded.

2.1 Modeling User-Item Relationships
Ratings, discrete and ordered, are modeled by an ordered probit model [10]: rating $R_{ij}^t$ is a discretization of some hidden variable $X_{ij}^t$ that depicts the "likeness" of user i to item j. Formally, we have $R_{ij}^t = k$ if $X_{ij}^t \in (\pi_k, \pi_{k+1}]$, where $k \in \{1, 2, \dots, K\}$, and K is the total number of discretized levels. $\{\pi_k\}_{k=1}^{K+1}$ is a set of partition thresholds, which divide the real line $\mathbb{R}$ into segments where consecutive discrete scores are assigned. The boundaries are given by $\pi_1 = -\infty$ and $\pi_{K+1} = \infty$. The intuition behind such a model is that it can easily handle both explicit and implicit feedback. For instance, Netflix uses scaled ratings ranging from 1 to 5, while Last.fm recommends music to a target user based on their listening behaviors. In the first case, the rating type is explicit, and we can directly set $K = 5$. On the contrary, the probit model handles implicit feedback by treating each event (e.g., listen, download, click) at time t as a binary prediction. Then, a single threshold is sufficient.
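The threshold-based discretization above can be sketched in a few lines. This is our own illustration, not the authors' code; the interior threshold values are made-up examples, with only the boundary conventions $\pi_1 = -\infty$, $\pi_{K+1} = \infty$ taken from the paper.

```python
import bisect
import math

# Hypothetical thresholds pi_1 .. pi_{K+1} for a K = 5 rating scale;
# only the +-inf boundaries come from the paper.
thresholds = [-math.inf, 1.5, 2.5, 3.5, 4.5, math.inf]

def discretize(x):
    """Return the rating k such that x lies in (pi_k, pi_{k+1}]."""
    # bisect_left returns the index of the first threshold >= x,
    # which is exactly k for the half-open interval (pi_k, pi_{k+1}].
    return bisect.bisect_left(thresholds, x)

# A hidden score of 3.7 falls in (3.5, 4.5], so it maps to rating 4.
assert discretize(3.7) == 4
```

For implicit feedback, the same function with a single interior threshold (K = 2) turns the hidden score into a binary adopt/not-adopt prediction.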

Figure 2: Graphical illustration of the model. Each tube structure denotes the continuous Markov process of each time-varying user/item topic, with different intersections denoting the user/item topics at different times. The time axis goes up in parallel to these tubes, with the top intersections corresponding to the current time. Different users/items are born at different times, so the tubes vary in length and position of their bottom intersections. A rating that user i assigns to item j at time $t'$ (in the figure $t' = t - \tau$ as an example), $R_{ij}^{t'}$, depends on the hidden likeness $X_{ij}^{t'}$, which in turn depends on the hidden topics of user i and item j at time $t'$.

The hidden user-item "likeness" at t, $X_{ij}^t$, is modeled as

$X_{ij}^t = (U_i^t)^T V_j^t + E_{ij}^t, \quad (1)$

where $U_i^t$ is a d-dimensional hidden topic vector for user i at t. Likewise, $V_j^t$ is a d-dimensional hidden topic vector for item j at t. $E_{ij}^t$ is a Gaussian perturbation with mean 0 and variance $\sigma_E^2$.

2.2 Temporal Dynamics
Temporal dynamics of $U_i^t$ and $V_j^t$ incorporate both hidden topic evolution and new user/item introduction. Specifically, the hidden topic vectors of existing users follow Brownian motion:

$U_i^t \,|\, U_i^{t-\tau} \sim \mathcal{N}\left(U_i^{t-\tau}, \, \sigma_U^2 \tau I\right), \quad \tau > 0. \quad (2)$
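A minimal sketch of what the Brownian-motion dynamics of equation (2) mean for the stored posterior moments: with no intervening observation, the mean is carried forward unchanged while each coordinate's variance inflates by $\sigma_U^2 \tau$. Function and variable names are our own, and diagonal covariances are assumed purely for brevity.

```python
def propagate(mean, cov_diag, sigma_u2, tau):
    """Propagate a topic's posterior moments forward by a gap tau > 0
    under Brownian motion with per-unit-time variance sigma_u2."""
    assert tau > 0
    new_mean = list(mean)                              # mean is unchanged
    new_cov = [c + sigma_u2 * tau for c in cov_diag]   # variance grows linearly in tau
    return new_mean, new_cov

m, c = propagate([0.2, -0.1], [0.05, 0.05], sigma_u2=0.01, tau=2.0)
# each variance becomes 0.05 + 0.01 * 2.0 = 0.07
```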

For new users, let $t_U(i)$ be the birth time of user i. We model the initial distribution (at the birth time), namely the distribution of $U_i^{t_U(i)}$, as the average of the posterior expectations of all existing user factors at $t_U(i)$, with Gaussian perturbations. We essentially assume that the preference prior of a new user follows the general interests of others:

$U_i^{t_U(i)} \,\big|\, \{R_{ij}^t : t < t_U(i)\} \sim \mathcal{N}\left(\operatorname{mean}_{k:\, t_U(k) < t_U(i)} \mathbb{E}\left[U_k^{t_U(i)} \,\big|\, \{R_{ij}^t : t < t_U(i)\}\right], \; \sigma_{U0}^2 I\right), \quad (3)$

where $\operatorname{mean}_{k:\, t_U(k) < t_U(i)}$ denotes averaging over all existing users. Furthermore, equation (3) models the birth of all new users but for the very first users at $t = 0$, for whom we assume a standard Gaussian prior:

$U_i^0 \sim \mathcal{N}(0, I), \quad \forall i, \; t_U(i) = 0. \quad (4)$
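The two birth cases of equations (3) and (4) can be sketched as follows. This is our own illustration (names are made up): a user born at $t > 0$ gets the coordinate-wise average of the existing users' posterior means as prior mean with isotropic variance $\sigma_{U0}^2$, while the very first users get a standard Gaussian prior.

```python
def init_user(existing_means, d, sigma_u02):
    """Prior moments for a newly born user topic of dimension d."""
    if not existing_means:
        # Very first users at t = 0, Eq. (4): standard Gaussian N(0, I).
        return [0.0] * d, [1.0] * d
    # Eq. (3): average the posterior means of all existing users.
    mean = [sum(m[k] for m in existing_means) / len(existing_means)
            for k in range(d)]
    return mean, [sigma_u02] * d

m, v = init_user([[1.0, 0.0], [0.0, 1.0]], d=2, sigma_u02=0.1)
# m == [0.5, 0.5], v == [0.1, 0.1]
```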

Temporal dynamics as well as the initial distributions for the item factors $V_j^t$ are similar to those shown in equations (2)-(4), replacing $U_i^{(\cdot)}$ with $V_j^{(\cdot)}$, $t_U(i)$ with $t_V(j)$, and $\sigma_U^2$ and $\sigma_{U0}^2$ with $\sigma_V^2$ and $\sigma_{V0}^2$, respectively.

2.3 Model Summary and Notation
To sum up, the proposed framework consists of the following hidden variables:

$\mathcal{Z} = \{U_i^t, V_j^t, X_{ij}^t\} \cup \{R_{ij}^t : r_{ij}^t \notin \mathcal{R}\};$

observed variables: $\{R_{ij}^t : r_{ij}^t \in \mathcal{R}\}$; and parameters to be estimated:

$\Theta = \{\sigma_E^2, \sigma_U^2, \sigma_V^2\}. \quad (5)$

For better understanding, a graphical model is shown in figure 2. Moreover, before introducing the inference and training scheme for the model, we list the notations that we will use throughout the rest of the paper.

Time-related notations and terminologies:
• An event: a new user/item introduced, or a rating assigned;
• t and t': the current time and any arbitrary time, respectively;
• $t_U(i)$, $t_V(j)$: the birth times of user i and item j, respectively;
• $\tau(t)$ and $\tau'(t)$: the time between the reference time t and the most recent event time preceding, and the next event time following, respectively, but not including, t;
• $\tau_U(i,t)$ and $\tau_V(j,t)$: the time between the reference time t and the most recent event time preceding, but not including, t that involves user i or item j, respectively;
• $\tau'_U(i,t)$ and $\tau'_V(j,t)$: the time between t and the upcoming event time after, but not including, t for user i and item j, respectively;
• $\mathcal{T}$: the set of all event times;
• $\mathcal{T}_U(i)$, $\mathcal{T}_V(j)$: the sets of all event times that are related to user i and item j, respectively.

Random variables and distributions:
• $r_{ij}^t$: the observed value of $R_{ij}^t$;
• $\mathcal{R}$: the set of values of all observed ratings $r_{ij}^t$;
• $\mathcal{R}^{0:t}$: the set of values of all observed ratings no later than t;
• $\mathcal{R}^t$: the set of values of all observed ratings at time t;
• $\mathcal{X}$, $\mathcal{X}^{0:t}$, $\mathcal{X}^t$: the sets of random variables $X_{ij}^t$ whose corresponding ratings have observed values in $\mathcal{R}$, $\mathcal{R}^{0:t}$, $\mathcal{R}^t$, respectively;
• $\bar{\mathcal{X}}^t$: the set of random variables $X_{ij}^t$ at time t whose corresponding ratings are not observed;
• $p_{A|B}$: the probability density function of random variable A conditional on B, function arguments omitted.

3. STREAMING RECOMMENDER SYSTEM
The proposed framework provides two components to address the following two problems, as shown in figure 3:

• Online prediction: Based on observations up to the current time, how can we predict an unseen rating?

Figure 3: Overall framework illustration. The sRec framework consists of two components. The first component is online update and prediction, which provides a principled way to handle data streams. The second component provides a data-driven method to estimate meaningful parameters from historical data. (Diagram labels: streaming inputs feed online inference, section 3.1, with real-time updates; historical data feeds offline parameter estimation, section 3.2, yielding $\sigma_E^2$, $\sigma_U^2$, $\sigma_V^2$, which may alternatively be manually defined.)

• Offline parameter estimation (learning): Given a subset of data, how can we estimate the parameters $\Theta$ defined in equation (5)?

For the first problem, we will apply posterior variational inference as shown in section 3.1, which leads to a very efficient streaming scheme: all the predictions are based on some posterior moments, which get updated only when a relevant event happens. Furthermore, the update only depends on the incoming event and the posterior moments at the previous event, but not on other historical events. For the second problem, we will apply a variational EM algorithm, which will be derived in section 3.2.

3.1 Online Inference
Given past observations, for any unseen rating, i.e., $R_{ij}^t : r_{ij}^t \notin \mathcal{R}^{0:t}$, the inferred rating $\hat{R}_{ij}^t$ is given by the posterior expectation:

$\hat{R}_{ij}^t = k, \ \text{if} \ \mathbb{E}[X_{ij}^t \,|\, \mathcal{R}^{0:t}] \in (\pi_k, \pi_{k+1}]. \quad (6)$

The predicted rating depends on the online posterior expectation of $X_{ij}^t$, which in turn depends on the posterior distributions of other hidden variables.

3.1.1 Recursive Meanfield Approximation
According to equation (6), online inference needs to evaluate the posterior distribution $p_{X^t, U^t, V^t | \mathcal{R}^{0:t}}$, $\forall t$, which can be decomposed into $p_{\mathcal{X}^t, U^t, V^t | \mathcal{R}^{0:t}} \cdot p_{\bar{\mathcal{X}}^t | U^t, V^t}$ by the Markov property. While the second term is readily defined by equation (1), it is challenging to calculate the first term due to the highly nonlinear model assumptions. Therefore, we propose a new method, called recursive meanfield approximation, to obtain an approximate posterior distribution $q_{\mathcal{X}^{t'}, U^{t'}, V^{t'} | \mathcal{R}^{0:t'}}$, which is defined alternatively and recursively by the following three equations.

First, $\forall t' \in \mathcal{T}$, $q_{\mathcal{X}^{t'}, U^{t'}, V^{t'} | \mathcal{R}^{0:t'}}$ is considered to be independent among $U^{t'}$, $V^{t'}$ and $\mathcal{X}^{t'}$, and approximates the distribution induced by the Markov property:

$q_{\mathcal{X}^{t'}, U^{t'}, V^{t'} | \mathcal{R}^{0:t'}} = q_{\mathcal{X}^{t'} | \mathcal{R}^{0:t'}} \, q_{U^{t'} | \mathcal{R}^{0:t'}} \, q_{V^{t'} | \mathcal{R}^{0:t'}} = \operatorname*{argmin}_{q = q_{U^{t'}} q_{V^{t'}} q_{\mathcal{X}^{t'}}} \mathrm{KL}\left( q_{\mathcal{X}^{t'}, U^{t'}, V^{t'} | \mathcal{R}^{0:t'-\tau(t')}} \; p_{\mathcal{R}^{t'} | \mathcal{X}^{t'}} \,\Big\|\, q \right). \quad (7)$


Second, $\forall t' \notin \mathcal{T}$, since there is no event at $t'$, $\mathcal{R}^{0:t'} = \mathcal{R}^{0:t'-\tau(t')}$, thus:

$q_{\mathcal{X}^{t'}, U^{t'}, V^{t'} | \mathcal{R}^{0:t'}} = q_{\mathcal{X}^{t'}, U^{t'}, V^{t'} | \mathcal{R}^{0:t'-\tau(t')}}. \quad (8)$

Finally, both (7) and (8) depend on $q_{\mathcal{X}^{t'}, U^{t'}, V^{t'} | \mathcal{R}^{0:t'-\tau(t')}}$, which is defined by the Markov property:

$q_{\mathcal{X}^{t'}, U^{t'}, V^{t'} | \mathcal{R}^{0:t'-\tau(t')}} = \int q_{U^{t'-\tau(t')} | \mathcal{R}^{0:t'-\tau(t')}} \, q_{V^{t'-\tau(t')} | \mathcal{R}^{0:t'-\tau(t')}} \, p_{U^{t'} | U^{t'-\tau(t')}} \, p_{V^{t'} | V^{t'-\tau(t')}} \, p_{\mathcal{X}^{t'} | U^{t'}, V^{t'}} \, p_{\mathcal{R}^{t'} | \mathcal{X}^{t'}} \, dU^{t'-\tau(t')} \, dV^{t'-\tau(t')}. \quad (9)$

The posterior distributions at non-event times are expressed by those at the most recent event times; hence the system only needs to keep track of the posterior distributions at the most recent event time, and updates them only when an event occurs, which makes the system efficient.

The following useful facts are stated here without proof.

Theorem 3.1. For all $t' \in \mathcal{T}$, under the recursive meanfield approximate distributions $q_{U^{t'}, V^{t'}, \mathcal{X}^{t'} | \mathcal{R}^{0:t'}}$ and $q_{U^{t'}, V^{t'}, \mathcal{X}^{t'} | \mathcal{R}^{0:t'-\tau(t')}}$:

• $U^{t'}$ and $V^{t'}$ follow multivariate normal distributions.

• All individual user topics $U_i^{t'}$ and item topics $V_j^{t'}$ are jointly independent.

The following corollary can be derived from theorem 3.1.

Corollary 3.1.1. $\forall t' \in \mathcal{T}$, for user i and item j that are not relevant to the current ratings, i.e., $r_{ij}^{t'} \notin \mathcal{R}^{t'}$:

$\mathbb{E}_q(U_i^{t'} | \mathcal{R}^{0:t'}) = \mathbb{E}_q(U_i^{t'} | \mathcal{R}^{0:t'-\tau(t')}) = \mathbb{E}_q(U_i^{t'} | \mathcal{R}^{0:t'-\tau_U(i,t')}),$

$\mathrm{Cov}_q(U_i^{t'} | \mathcal{R}^{0:t'}) = \mathrm{Cov}_q(U_i^{t'} | \mathcal{R}^{0:t'-\tau(t')}) + \sigma_U^2 \tau(t') I = \mathrm{Cov}_q(U_i^{t'} | \mathcal{R}^{0:t'-\tau_U(i,t')}) + \sigma_U^2 \tau_U(i,t') I, \quad (10)$

and similarly for $V_j$.

Theorem 3.1 and corollary 3.1.1 essentially state that 1) to keep track of the posterior distributions, we only need to keep track of their first and second moments; 2) we can update only those $U_i^t$ and $V_j^t$ that are associated with events at time t, and the rest are unaffected; and 3) the update does not depend on other users or items. The solution to the variational problem defined in (7) reveals the updating equations for $U_i^t$, $V_j^t$ and $X_{ij}^t$, $\forall t \in \mathcal{T}$.

Update Equations for $U_i^t$ and $V_j^t$ at Event Times:
For any $U_i^t$ associated with an event at time t, the posterior moments are given as

$\mathrm{Cov}_q(U_i^t | \mathcal{R}^{0:t}) = \left[ \sigma_E^{-2} \sum_{j:\, r_{ij}^t \in \mathcal{R}^t} \mathbb{E}_q\left(V_j^t (V_j^t)^T \,|\, \mathcal{R}^{0:t}\right) + \mathrm{Cov}_q(U_i^t | \mathcal{R}^{0:t-\tau(t)})^{-1} \right]^{-1},$

$\mathbb{E}_q(U_i^t | \mathcal{R}^{0:t}) = \mathrm{Cov}_q(U_i^t | \mathcal{R}^{0:t}) \left( \sigma_E^{-2} \sum_{j:\, r_{ij}^t \in \mathcal{R}^t} \mathbb{E}_q(V_j^t | \mathcal{R}^{0:t}) \, \mathbb{E}_q(X_{ij}^t | \mathcal{R}^{0:t}) + \mathrm{Cov}_q(U_i^t | \mathcal{R}^{0:t-\tau(t)})^{-1} \, \mathbb{E}_q(U_i^t | \mathcal{R}^{0:t-\tau(t)}) \right), \quad (11)$

where $\mathbb{E}_q(U_i^t | \mathcal{R}^{0:t-\tau(t)})$ and $\mathrm{Cov}_q(U_i^t | \mathcal{R}^{0:t-\tau(t)})$ are given by equations (10), (3), or (4), which correspond to existing users, newly born users, and the very first users, respectively.
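A one-dimensional (d = 1) sketch of the user update in equation (11), with our own variable names; in 1-D the matrix inversion reduces to a scalar reciprocal. Each entry of `ratings` holds $(\mathbb{E}_q[v_j], \mathbb{E}_q[v_j^2], \mathbb{E}_q[x_{ij}])$ for one item rated at time t.

```python
def update_user(prior_mean, prior_cov, ratings, sigma_e2):
    """Posterior moments of a 1-D user topic at an event time, Eq. (11)."""
    # Precision accumulates the items' second moments plus the prior precision.
    precision = sum(v2 for _, v2, _ in ratings) / sigma_e2 + 1.0 / prior_cov
    cov = 1.0 / precision
    # Mean combines the item-weighted likeness scores with the prior mean.
    mean = cov * (sum(v * x for v, _, x in ratings) / sigma_e2
                  + prior_mean / prior_cov)
    return mean, cov

m, c = update_user(prior_mean=0.0, prior_cov=1.0,
                   ratings=[(1.0, 1.0, 4.0)], sigma_e2=1.0)
# precision = 1 + 1 = 2, so cov = 0.5 and mean = 0.5 * 4 = 2.0
```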

Algorithm 1: Online Posterior Moments Update at Event Time
input: Current time t (must be an event instance); the event type(s); currently assigned rating(s) $\mathcal{R}^t$; last updated posterior moments of $\{U_i^t\}$, $\{V_j^t\}$.
output: Updated posterior moments of $\{U_i^t\}$, $\{V_j^t\}$ and $\{X_{ij}^t\}$ that are associated with the events at the current time.

for i : user i is born at time t do
    Initialize $\mathbb{E}_q(U_i^t | \mathcal{R}^{0:t-\tau(t)})$, $\mathrm{Cov}_q(U_i^t | \mathcal{R}^{0:t-\tau(t)})$ according to equations (3) or (4);
end
for j : item j is born at t do
    Initialize $\mathbb{E}_q(V_j^t | \mathcal{R}^{0:t-\tau(t)})$, $\mathrm{Cov}_q(V_j^t | \mathcal{R}^{0:t-\tau(t)})$ by symmetry to equations (3) or (4);
end
while convergence not yet reached do
    for i : user i rates at time t do
        Update $\mathbb{E}_q(U_i^t | \mathcal{R}^{0:t})$, $\mathrm{Cov}_q(U_i^t | \mathcal{R}^{0:t})$ according to equation (11);
    end
    for j : item j is rated at time t do
        Update $\mathbb{E}_q(V_j^t | \mathcal{R}^{0:t})$, $\mathrm{Cov}_q(V_j^t | \mathcal{R}^{0:t})$ by symmetry to equation (11);
    end
    for (i, j) : user i rates item j at time t do
        Update $\mathbb{E}_q(X_{ij}^t | \mathcal{R}^{0:t})$ according to equation (14);
    end
end

The update for $V_j^t$ is identical, except that U and V are interchanged, and subscripts i and j are interchanged.

Update Equations for $X_{ij}^t$ at Event Times:
On the other hand, $q_{X_{ij}^t | \mathcal{R}^{0:t}}$ is a truncated Gaussian distribution with Gaussian mean

$\mu_{ij}^t = \mathbb{E}_q(U_i^t | \mathcal{R}^{0:t})^T \, \mathbb{E}_q(V_j^t | \mathcal{R}^{0:t}), \quad (12)$

and variance $\sigma_E^2$. The truncation interval is $(\pi_{r_{ij}^t}, \pi_{r_{ij}^t+1}]$. Define

$e_{ij}^t = \frac{\pi_{r_{ij}^t} - \mu_{ij}^t}{\sigma_E}, \quad f_{ij}^t = \frac{\pi_{r_{ij}^t+1} - \mu_{ij}^t}{\sigma_E}. \quad (13)$

The expectation of this truncated Gaussian can be expressed as

$\mathbb{E}_q(X_{ij}^t | \mathcal{R}^{0:t}) = \mu_{ij}^t + \sigma_E \, \frac{\phi(e_{ij}^t) - \phi(f_{ij}^t)}{\Phi(f_{ij}^t) - \Phi(e_{ij}^t)}, \quad (14)$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the pdf and cdf of the standard normal distribution, respectively.
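Equations (12)-(14) translate directly into code. The sketch below uses only the standard library (`math.erf` for the normal cdf); function names are ours, not the paper's.

```python
import math

def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_mean(mu, sigma_e, lo, hi):
    """Mean of N(mu, sigma_e^2) truncated to (lo, hi], Eq. (14)."""
    e = (lo - mu) / sigma_e   # e_ij^t of Eq. (13)
    f = (hi - mu) / sigma_e   # f_ij^t of Eq. (13)
    return mu + sigma_e * (phi(e) - phi(f)) / (Phi(f) - Phi(e))

# Truncation symmetric around mu leaves the mean unchanged:
m = truncated_mean(3.0, 1.0, 2.0, 4.0)  # stays 3.0
```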

Update Equation for $X_{ij}^t$ at Non-event Times:
With all the online posterior expectations derived, we are now able to obtain an approximation of $\mathbb{E}(X_{ij}^t | \mathcal{R}^{0:t})$, which is essential for rating prediction as in (6):

$\mathbb{E}(X_{ij}^t | \mathcal{R}^{0:t}) \approx \mathbb{E}_q(X_{ij}^t | \mathcal{R}^{0:t}) = \mathbb{E}_q\left((U_i^t)^T V_j^t \,|\, \mathcal{R}^{0:t}\right) = \mathbb{E}_q(U_i^{t-\tau_U(i,t)} | \mathcal{R}^{0:t-\tau_U(i,t)})^T \, \mathbb{E}_q(V_j^{t-\tau_V(j,t)} | \mathcal{R}^{0:t-\tau_V(j,t)}). \quad (15)$

The last equality is given by corollary 3.1.1.
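In code, the non-event-time prediction of equation (15) is just the inner product of the last-updated posterior means (variable names are our own):

```python
def predict_likeness(user_mean, item_mean):
    """Predicted hidden likeness E[X_ij^t], Eq. (15): the inner product of
    the user's and item's most recently updated posterior mean vectors."""
    return sum(u * v for u, v in zip(user_mean, item_mean))

x = predict_likeness([0.5, 1.0], [1.0, 2.0])  # 0.5*1.0 + 1.0*2.0 = 2.5
```

Passing the result through the threshold rule of equation (6) then yields the discrete rating.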

3.1.2 Algorithm Table and Order Complexity
To sum up, the algorithm for updating the posterior moments at event times is given in algorithm 1. The posterior moments of a user/item topic are updated only when there is an event associated with it, so the total number of updates is equal to the total number of events. In the update for each new user/item, there is a summation over all existing user/item topics of dimension d, which can be accelerated by maintaining a global sum of all the existing user/item topics. This global sum is updated only once for every event. In the update for each new rating, there is an inversion of a size-d matrix, which needs no more than $O(d^3)$ operations. The total computational complexity over the entire time $[0, T]$ is $O\left(I \, |\mathcal{R}^{0:T}| \, d^3 + (m_t + n_t + |\mathcal{R}^{0:T}|) \, d\right)$, where I is the number of iterations to reach convergence at each time.
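The global-sum acceleration mentioned above can be sketched as follows; this is our own illustration with made-up names, not the authors' implementation.

```python
class GlobalUserSum:
    """Running sum of existing user posterior means, so the average needed
    by the new-user initialization of Eq. (3) costs O(d), not O(m_t * d).
    A full version would also subtract a user's old mean when it changes."""
    def __init__(self, d):
        self.total = [0.0] * d
        self.count = 0

    def add(self, mean):
        # Called once per event that introduces or moves a user mean.
        self.total = [t + m for t, m in zip(self.total, mean)]
        self.count += 1

    def average(self):
        # Prior mean for a newly born user.
        return [t / self.count for t in self.total]

s = GlobalUserSum(d=2)
s.add([1.0, 0.0])
s.add([0.0, 1.0])
# s.average() == [0.5, 0.5]
```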

3.2 Offline Parameter Estimation
The online streaming prediction relies on the set of parameters $\Theta$, which can be either manually predefined by domain expertise, or estimated from historical data. Define T as the end time of the historical data. This section applies expectation maximization (EM) to learn the model with incomplete observations by first defining the complete data as $(\mathcal{Z}'^{0:T}, \mathcal{R}^{0:T})$, where $\mathcal{Z}'^{0:T}$ is the subset of $\mathcal{Z}$ associated with events, or more concretely

$\mathcal{Z}'^{0:T} = \{U_i^t : t \in \mathcal{T}_U(i)\} \cup \{V_j^t : t \in \mathcal{T}_V(j)\} \cup \{X_{ij}^t : r_{ij}^t \in \mathcal{R}\}.$

3.2.1 M-step
The auxiliary function, or Q function, is given by

$Q(\Theta \,|\, \hat{\Theta}^{(k)}) = \mathbb{E}_{\mathcal{Z}'^{0:T} | \mathcal{R}^{0:T}}\left( \log p_{\mathcal{Z}'^{0:T}, \mathcal{R}^{0:T}; \Theta} \right).$

Update $\sigma_E^2$:
According to the first-order condition, we obtain the optimal estimation formula for $\sigma_E^2$ as

$(\hat{\sigma}_E^2)^{(k+1)} = \frac{1}{|\mathcal{R}^{0:T}|} \sum_{i,j,t':\, r_{ij}^{t'} \in \mathcal{R}^{0:T}} \mathbb{E}_p\left[ \left( X_{ij}^{t'} - (U_i^{t'})^T V_j^{t'} \right)^2 \,\Big|\, \mathcal{R}^{0:T}; \hat{\Theta}^{(k)} \right], \quad (16)$

where $|\mathcal{R}^{0:T}|$ denotes the total number of ratings in the historical data from time 0 to T.

Update $\sigma_U^2$ and $\sigma_V^2$:
Similar to the $\sigma_E^2$ case, the first-order condition yields

$(\hat{\sigma}_U^2)^{(k+1)} = \frac{1}{C} \sum_{i, \, t' \in \mathcal{T}_U(i) \setminus t_U(i)} \frac{\mathbb{E}_p\left[ \| U_i^{t'} - U_i^{t'-\tau_U(i,t')} \|^2 \,\big|\, \mathcal{R}^{0:T}; \hat{\Theta}^{(k)} \right]}{\tau_U(i, t')}. \quad (17)$

C is a constant defined as the total number of user rating events. Furthermore, the estimation formula for $\sigma_V^2$ is almost identical to that for $\sigma_U^2$, and is omitted.

3.2.2 E-step
The exact values of all offline posterior moments listed in equations (16) and (17) cannot be obtained due to the intractable inference of the posterior distributions

$p_{U^{t'}, U^{t'-\tau(t')} | \mathcal{R}^{0:T}}, \quad p_{V^{t'}, V^{t'-\tau(t')} | \mathcal{R}^{0:T}}, \quad p_{U^{t'}, V^{t'}, \mathcal{X}^{t'} | \mathcal{R}^{0:T}}.$

However, we can utilize the results from the online inference to recursively approximate these distributions, by first defining the approximated offline posterior distributions

$q_{U^{t'}, U^{t'-\tau(t')} | \mathcal{R}^{0:T}}, \quad q_{V^{t'}, V^{t'-\tau(t')} | \mathcal{R}^{0:T}}, \quad q_{U^{t'}, V^{t'}, \mathcal{X}^{t'} | \mathcal{R}^{0:T}},$

which are applied to the estimation of $\sigma_U^2$, $\sigma_V^2$ and $\sigma_E^2$, respectively.

First, $q_{U^{t'}, U^{t'-\tau(t')} | \mathcal{R}^{0:T}}$ is defined recursively by the Markov property:

$q_{U^{t'}, U^{t'-\tau(t')} | \mathcal{R}^{0:T}} = q_{U^{t'} | \mathcal{R}^{0:T}} \, q_{U^{t'-\tau(t')} | U^{t'}, \mathcal{R}^{0:t'-\tau(t')}} = q_{U^{t'} | \mathcal{R}^{0:T}} \, \frac{q_{U^{t'-\tau(t')} | \mathcal{R}^{0:t'-\tau(t')}} \, p_{U^{t'} | U^{t'-\tau(t')}}}{\int q_{U^{t'-\tau(t')} | \mathcal{R}^{0:t'-\tau(t')}} \, p_{U^{t'} | U^{t'-\tau(t')}} \, dU^{t'-\tau(t')}},$

where the first term is given backward-recursively by marginalizing $q_{U^{t'+\tau'(t')}, U^{t'} | \mathcal{R}^{0:T}}$; and $q_{U^{t'-\tau(t')} | \mathcal{R}^{0:t'-\tau(t')}}$ in the fraction is given by the online inference. $q_{V^{t'}, V^{t'-\tau(t')} | \mathcal{R}^{0:T}}$ is defined similarly and is omitted.

Second, $q_{U^{t'}, V^{t'}, \mathcal{X}^{t'} | \mathcal{R}^{0:T}}$, similar to (7), assumes $U^{t'}$, $V^{t'}$, $\mathcal{X}^{t'}$ are independent:

$q_{\mathcal{X}^{t'}, U^{t'}, V^{t'} | \mathcal{R}^{0:T}} = q_{\mathcal{X}^{t'} | \mathcal{R}^{0:T}} \, q_{U^{t'} | \mathcal{R}^{0:T}} \, q_{V^{t'} | \mathcal{R}^{0:T}},$

where $q_{U^{t'} | \mathcal{R}^{0:T}}$ and $q_{V^{t'} | \mathcal{R}^{0:T}}$ are already defined, and $q_{\mathcal{X}^{t'} | \mathcal{R}^{0:T}}$ is set to minimize the KL divergence to the true distribution:

$q_{\mathcal{X}^{t'} | \mathcal{R}^{0:T}} = \operatorname*{argmin}_{q} \mathrm{KL}\left( p_{U^{t'}, V^{t'}, \mathcal{X}^{t'} | \mathcal{R}^{0:T}} \,\Big\|\, q \cdot q_{U^{t'} | \mathcal{R}^{0:T}} \, q_{V^{t'} | \mathcal{R}^{0:T}} \right).$

Similar to the online case, these approximated distributions have good properties, which are stated without proof.

Corollary 3.1.2. Under the approximated offline distributions $q_{U^{t'}, U^{t'-\tau(t')} | \mathcal{R}^{0:T}}$ and $q_{V^{t'}, V^{t'-\tau(t')} | \mathcal{R}^{0:T}}$, all individual user topics $U_i^{t'}$ and item topics $V_j^{t'}$ are independent.

Using corollary 3.1.2, all the offline posterior expectations listed in equations (16) and (17) can be computed.

4. EXPERIMENTS
To assess the effectiveness of the proposed framework, we collect three real-world datasets; their detailed descriptions are as follows:

MovieLens Latest Small: It consists of 100,000 ratings of 8,570 movies by 706 users. In addition, all ratings are associated with timestamps ranging from 1996 to 2015. The ratings are given on a half-star scale from 0.5 to 5. For brevity, we use ML-latest in the following subsections.

MovieLens-10M: The dataset contains 10 million movie ratings from 1995 to 2009. The rating scale and temporal information are similar to those of the ML-latest dataset. In total, there are 10,000 movies and 71,000 users. We use ML-10M to denote this dataset.

Netflix: The Netflix dataset contains 100 million movie ratings from 1999 to 2006, distributed across 17,770 movies and 480,189 users. The rating scale is from 1 to 5. Furthermore, the temporal granularity is one day.

4.1 Baseline Methods
We compare our proposed framework sRec with several representative recommender systems, including:

PMF [19]: Probabilistic Matrix Factorization is a classical recommendation algorithm that is widely used.

MDCR [2]: Multi-Dimensional Collaborative Recommendation is a tensor-based matrix factorization algorithm, which can potentially incorporate both temporal and spatial information.

Time-SVD++ [14]: A time-variant version of the Netflix-winning algorithm SVD++ [13]. In addition, the model can be evolved efficiently via an online updating scheme.

GPFM [20]: Gaussian Process Factorization Machines is a variant of factorization machines [21], which is non-linear, probabilistic and time-aware.

In summary, PMF is a static model, and the other three baselines leverage the temporal factor in different ways. We utilize the existing C++ implementations from GraphChi [15] for PMF and Time-SVD++, while the code for GPFM is also available online^4.

4.2 Experimental Settings
We divide all data into halves. We call the first half the "base training set", which is used for parameter estimation; the remaining half is the "candidate set". The base training set can be considered as the historical data pool, while the candidate set mimics the streaming inputs. Both the proposed method and the baseline algorithms utilize the base training set to determine the best hyperparameters. For instance, the latent dimensionality for each algorithm is determined independently using this base training set. For online prediction, the actual testing set is generated from the candidate set by first randomly selecting a reference time. Such a reference time can be considered as the "current time". Then, our task is to predict user-item ratings for the following week starting from the reference time. All ratings prior to the reference time are used for the online inference. The sequential order of data streams ensures a prospective inference procedure: no future ratings can be used to predict past behaviors. It is worth mentioning that, except for our sRec model, the baselines cannot explicitly handle new user/item introduction during the testing phase. Therefore, for a fair comparison, all the testing ratings are from existing users and items which have appeared in the training set.

We evaluate the performance via the root mean square error (RMSE), which is a widely used metric in recommendation. Furthermore, in order to ensure reliability, all experimental results were averaged over 10 different runs. In addition, to keep temporal variability out of the evaluation, there is no temporal overlap between any testing sets.
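For reference, the RMSE metric used here is simply the square root of the mean squared prediction error; a small self-contained helper (our own code, not the authors'):

```python
import math

def rmse(predicted, actual):
    """Root mean square error between paired prediction and ground-truth lists."""
    assert len(predicted) == len(actual) and predicted
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(predicted))

# Predictions off by exactly one star everywhere give an RMSE of 1.0:
assert rmse([4.0, 2.0, 5.0], [3.0, 3.0, 4.0]) == 1.0
```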

4.3 Experimental Results

The best performance of each method on all three datasets is reported in Table 1. We observe that the proposed algorithm achieves the best performance across all three datasets. It is evident that explicitly modeling user/item dynamics significantly boosts recommendation performance under the streaming setting. The second-best algorithm is Time-SVD++, which consistently outperforms the other three baselines. This is because Time-SVD++ models temporal information in a way better suited to streaming settings. However, its inflexibility with respect to temporal granularity makes it insufficient to capture volatile changes. On the other hand, the tensor-based method MDCR yields the worst performance. One potential reason is that the algorithm treats temporal information retrospectively, which is inadequate to capture the prospective

4 https://github.com/trungngv/gpfm

Table 1: Performance comparison in terms of RMSE. The best performance is highlighted in bold.

            ML-Latest   ML-10M   Netflix
PMF         0.7372      0.8814   0.8610
MDCR        0.7604      0.9349   0.9326
GPFM        0.7710      0.9114   NA
Time-SVD++  0.7141      0.8579   0.8446
sRec        0.6780      0.7957   0.8093

Figure 4: The convergence analysis for learning parameters on the ML-Latest dataset (four panels plot σ²_E, σ²_U, σ²_V and the training RMSE against EM iteration).

process of data generation. Furthermore, although the PMF method does not utilize any time information, its performance is still better than MDCR. This result implies that inappropriate temporal modeling can even hurt performance. Last, GPFM suffers from scalability problems and thus provides no result on the Netflix dataset.

4.4 Parameter Understanding

In this subsection, we investigate the physical meanings of our learned parameters. The major parameters of the proposed method are σ²_E, σ²_U and σ²_V. According to our experimental settings, these parameters are learned from the so-called historical data. Note that in this part we only show results on the ML-Latest dataset; the results on the other datasets are similar.

We first perform an empirical convergence analysis for the proposed sRec method. Figure 4 shows the convergence paths of all three parameters, as well as the training error, against EM iterations. As we can see, all parameters converge, and the RMSE on the training set saturates around the 10th iteration. The converged values of σ²_E, σ²_U and σ²_V are 1.32, 1.17 × 10⁻⁵ and 7.33 × 10⁻⁴ respectively, while the training RMSE is around 0.64. Next, we analyze the interpretation of these parameter values.
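A simple stopping rule consistent with this kind of convergence check can be sketched as follows. The rule and names are illustrative assumptions, not the paper's implementation:

```python
def em_converged(history, tol=1e-3, window=2):
    """Declare EM convergence when the relative change of every tracked
    parameter stays below `tol` for `window` consecutive iterations.

    `history` maps a parameter name (e.g. "sigma2_E") to its list of
    per-iteration values; both the names and the rule are assumptions.
    """
    for values in history.values():
        if len(values) < window + 1:
            return False  # not enough iterations observed yet
        # Inspect the last `window` transitions of this parameter.
        for prev, cur in zip(values[-window - 1:-1], values[-window:]):
            if abs(cur - prev) > tol * max(abs(prev), 1e-12):
                return False
    return True
```

Monitoring every parameter jointly (rather than the training RMSE alone) guards against stopping while a slow-moving variance term is still drifting.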

The first parameter can be understood as the impact of other unknown factors on the ratings: the larger the value of σ²_E, the more susceptible the rating is to variations in unknown factors, and hence the less predictable it is by sRec.

More interestingly, σ²_U and σ²_V can intuitively be interpreted as the evolution speed of user/item taste based on our model



Figure 5: Correlation decay with time (user and item self-correlations, with the half-level marked). The half-time for users and items is 3.12 years and 13.06 years respectively.

assumptions in a global sense (across the whole population of users/items). We relate the learned values of these two parameters to our intuitions in a quantitative way via the following posterior correlation function:

Corr(U_i^{t₁}, U_i^{t₁+Δt} | R^{1:t₁})² = Tr(Cov(U_i^{t₁} | R^{1:t₁})) / ( Tr(Cov(U_i^{t₁} | R^{1:t₁})) + d·Δt·σ²_U ).  (18)

This quantity is a function of σ²_U and measures the user auto-correlation from a reference time t₁ to some future time t₁ + Δt; it ranges from 0 to 1. The item correlation function is identical to equation (18) with the corresponding U replaced by V. We vary Δt and plot the mean of these correlations over all users and items in Figure 5. The reference time t₁ is set to the last time instance in the training data, and σ²_U and σ²_V are obtained from the offline learning.

Furthermore, one critical value of the correlation measure is 0.5, usually referred to as the "half-time". In Figure 5, the half-times for users and items are the intersections of the red dashed line with their respective curves. The half-time for users is around 3.12 years, while that for items is 13.06 years, which can be understood as follows: a rating provided by a user can be trusted to reflect half of his/her taste for only 3.12 years. Although we do observe that item topics change over time, the change is significantly slower than for users. This finding helps explain why many existing recommender systems only consider the dynamics of user topics while assuming fixed item topics.
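Equation (18) and the half-time derived from it can be sketched numerically. Note that setting the squared correlation to 0.25 (correlation 0.5) gives the closed form Δt = 3·Tr(Cov)/(d·σ²); the trace value passed in is an input from the posterior, and the example numbers below are assumptions:

```python
import numpy as np

def posterior_autocorr(trace_cov, dt, sigma2, d):
    """Auto-correlation from equation (18): its square is
    Tr(Cov) / (Tr(Cov) + d * dt * sigma2)."""
    return np.sqrt(trace_cov / (trace_cov + d * dt * sigma2))

def half_time(trace_cov, sigma2, d):
    """Solve posterior_autocorr(..) = 0.5 for dt: the squared
    correlation equals 0.25 when d * dt * sigma2 = 3 * Tr(Cov)."""
    return 3.0 * trace_cov / (d * sigma2)
```

The closed form makes the roles of the learned variances explicit: a larger σ²_U (faster taste evolution) shortens the half-time proportionally, which is exactly the qualitative reading given above.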

4.5 Temporal Drifting

We demonstrate from another perspective that the proposed sRec can capture the temporal drifting phenomenon. We recorded all the posterior expectations of user/item topics as a function of time. Figure 6 shows the evolution of latent factors on the ML-Latest dataset, where the x-axis denotes time and the y-axis the latent dimension. The intensity of the user latent representation reflects a user's affinity for a specific hidden topic; the darker the color in Figure 6a, the less emphasis users place on that topic. On the other hand, the item latent vectors indicate the assignment of items to these latent topics. The observation is consistent with our experimental results in Section 4.4: user tastes change more remarkably than item topics.

Furthermore, several topics evolve much more drasticallycompared to others. Notable ones include the 1st, 6th, 12th

Figure 6: The averaged latent topics for both users (6a, overall user-interest drifting) and items (6b, overall item-topic drifting) at different times.

Figure 7: An example of user interest evolving over time, shown through predicted ratings for representative movies (The Silence of the Lambs, The Lion King, Ace Ventura: When Nature Calls).

and 16th user topics, as well as the 5th item topic. To understand the reason, we provide an example using the 6th user topic, which shows a volatile shift in interests around day 1,500 (year 2001). Table 2 lists the top movies with the strongest responses in the 6th latent topic. Most of these movies are popular commercial titles with the majority of their ratings provided after day 1,500. Moreover, the average rating for most of these movies is substantially higher than that of the whole dataset, which is 3.49. Another indicative event is that the movie "The Lord of the Rings" was also released in that year. This might correlate with user taste changes and affect the latent representation.

We next present a qualitative result demonstrating the advantage of our sRec model in tracking users' instant preferences, by plotting the predicted "likeness" (the posterior expectation of Xij) of one of the most active users for a set of movies as a function of time. In Figure 7, the preference of this individual user continuously evolves over the three selected movies. An initial strong interest in "The Silence of the Lambs" decreases slightly with time. In contrast, the comedy "Ace Ventura" becomes more and more preferable. The last movie, "The Lion King", is considered a favorite by the user across the entire time span.

5. RELATED WORK

In this section, we focus on two types of related work: the first concerns online updating for recommender systems; the other concerns time-aware recommendations.



Table 2: The statistics of the top movies that contain strong responses at their 6th latent dimension.

Rank | Movie name                   | Genre                     | Release | Ratings after day 1500 | Avg. rating after day 1500
1    | Star Wars V                  | Action, Sci-Fi, Adventure | 1980    | 75.43%                 | 4.17
2    | There's Something About Mary | Comedy, Romance           | 1998    | 88.39%                 | 3.64
3    | Alien                        | Horror, Sci-Fi            | 1979    | 78.87%                 | 4.02
4    | The Lord of the Rings I      | Adventure, Fantasy        | 2001    | 100%                   | 4.04
5    | Star Wars IV                 | Action, Sci-Fi, Adventure | 1977    | 69.15%                 | 4.14
6    | Star Wars VI                 | Action, Sci-Fi, Adventure | 1983    | 70.46%                 | 3.99

5.1 Online Recommender System

Most previous works focus on the traditional batch regression problem [20, 24, 25]. However, a recommender system is by nature an incremental process: user feedback arrives in sequence, and user interests evolve dynamically. To capture this system evolution, the recommender system should be updated as soon as a piece of user feedback is received. The authors of [1] proposed a fast online bilinear factor model that learns item-specific factors through online regression, where each item performs independent updates; this procedure is fast, scalable and easily parallelizable. [22] introduced a regularized kernel matrix factorization method, where kernels provide a flexible way to model nonlinear interactions, together with an online-update scheme. Das et al. [7] addressed the large-data and dynamic-content problems in recommender systems and proposed online models to generate personalized recommendations for Google News users. Diaz-Aviles et al. [8] presented Stream Ranking Matrix Factorization, which uses a pairwise approach to matrix factorization to optimize the personalized ranking of topics, and follows a selective sampling strategy to perform incremental model updates based on active learning principles. [6] extended the online ranking technique and proposed a temporal recommender system: when posting tweets, users receive recommendations of topics (hashtags) according to their real-time interests, and can also generate fast feedback on the obtained recommendations. However, the above methods focus on online updating and learning but neglect the temporal ordering of the input data.

Recently, beyond the online learning methodology, the concept of real-time recommender systems was introduced, which emphasizes scalability and real-time pruning. The authors of [12] present a practical, scalable item-based collaborative filtering (CF) algorithm with characteristics such as robustness to the implicit feedback problem, incremental updates, and real-time pruning. Zhao et al. [30] utilize a selective sampling scheme under the PMF framework [19] to achieve interactive updates. StreamRec [4] is built on a scalable CF recommendation model and emphasizes the database aspects of a stream processing system. A neighborhood-based method for streaming recommendations is discussed in [23]. The proposed framework is substantially different from the above real-time recommender systems: the aforementioned systems are memory-based methods designed only for certain applications, while our method provides a principled way to handle data as streams.

5.2 Time-aware Recommendations

An important aspect of time-aware recommendation is how to explicitly model users' interest change over time. In collaborative filtering, modeling temporal information has already shown its success (e.g. Ding and Li [9], Koren [14], Yin et al. [28] and Xiang et al. [26]). Zhang et al. [29] investigated the recurrence dynamics of social tagging. Xiong et al. [27] used a tensor factorization approach to model the temporal factor, where the latent factors follow a Markovian assumption. Bhargava et al. [2] further extended tensor factorization to handle not only temporal but also geographical information. Liang et al. [16] proposed to use the implicit information network and the temporal information of micro-blogs to find semantically and temporally relevant topics in order to profile user interest drifts. Liu et al. [17] extended collaborative filtering to a sequential matrix factorization model that enables time-aware parameterization with temporal smoothness regularization. Both [11] and [18] model temporal dynamics using Kalman filtering. Recently, Chang et al. [5] unified the tasks of link prediction and one-bit recommendation as a streaming positive-unlabeled learning problem. We differ from these methods by providing more principled inference procedures; in addition, our data-driven technique automatically estimates all the parameters, which have physical meanings (e.g. the half-time). Furthermore, as temporal factors attract more attention, Burke [3] studied the evaluation of the dynamic properties of recommendation algorithms, providing meaningful insights. Although many studies have addressed the temporal aspect of the data, most of them are retrospective and overlook the causality of the data generation process.

6. CONCLUSION

Data arrive at real-world recommender systems rapidly and continuously. The velocity and volume at which data arrive suggest the use of streaming settings for recommendation. We delineate three challenges for recommendation under streaming settings: real-time updating, unknown sizes and concept shift. We address these difficulties by proposing a streaming recommender system, sRec, in this paper. sRec manages stream inputs via a continuous-time random process. In addition, the proposed framework is not only able to track the dynamic changes of user/item topics, but also provides real-time recommendations in a prospective way. Furthermore, we conduct experiments on three real-world datasets, on which sRec significantly outperforms other state-of-the-art methods. This provides encouraging evidence for modeling data as streams.

7. ACKNOWLEDGMENT

This research was supported in part, for Shiyu Chang and Thomas Huang, by a research grant from Jump Labs. Yang Zhang and Mark Hasegawa-Johnson were supported in part by AHRQ grant number 1-483711-392030-191100.



8. REFERENCES

[1] D. Agarwal, B.-C. Chen, and P. Elango. Fast online learning through offline initialization for time-sensitive recommendation. In ACM SIGKDD, 2010.

[2] P. Bhargava, T. Phan, J. Zhou, and J. Lee. Who, what,when, and where: Multi-dimensional collaborativerecommendations using tensor factorization on sparseuser-generated data. In WWW, 2015.

[3] R. Burke. Evaluating the dynamic properties ofrecommendation algorithms. In ACM RecSys, 2010.

[4] B. Chandramouli, J. J. Levandoski, A. Eldawy, and M. Mokbel. StreamRec: A real-time recommender system. In SIGMOD, 2011.

[5] S. Chang, Y. Zhang, J. Tang, D. Yin, Y. Chang, M. A.Hasegawa-Johnson, and T. S. Huang. Positive-unlabeledlearning in streaming networks. In ACM SIGKDD, 2016.

[6] C. Chen, H. Yin, J. Yao, and B. Cui. TeRec: A temporal recommender system over tweet stream. Proc. VLDB Endow., 6(12), Aug. 2013.

[7] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Googlenews personalization: Scalable online collaborative filtering.In WWW, 2007.

[8] E. Diaz-Aviles, L. Drumond, L. Schmidt-Thieme, andW. Nejdl. Real-time top-n recommendation in socialstreams. In RecSys, 2012.

[9] Y. Ding and X. Li. Time weight collaborative filtering. InCIKM, 2005.

[10] W. H. Greene. Econometric analysis. Granite HillPublishers, 2008.

[11] S. Gultekin and J. Paisley. A collaborative kalman filter fortime-evolving dyadic processes. In ICDM, 2014.

[12] Y. Huang, B. Cui, W. Zhang, J. Jiang, and Y. Xu. TencentRec: Real-time stream recommendation in practice. In SIGMOD, 2015.

[13] Y. Koren. Factorization meets the neighborhood: amultifaceted collaborative filtering model. In ACMSIGKDD, 2008.

[14] Y. Koren. Collaborative filtering with temporal dynamics.Communications of the ACM, 53(4), 2010.

[15] A. Kyrola, G. E. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, 2012.

[16] H. Liang, Y. Xu, D. Tjondronegoro, and P. Christen.Time-aware topic recommendation based on micro-blogs. InCIKM, 2012.

[17] N. N. Liu, L. He, and M. Zhao. Social temporalcollaborative ranking for context aware movierecommendation. ACM Trans. Intell. Syst. Technol., 4(1),Feb. 2013.

[18] Z. Lu, D. Agarwal, and I. S. Dhillon. A spatio-temporal approach to collaborative filtering. In RecSys, 2009.

[19] A. Mnih and R. Salakhutdinov. Probabilistic matrixfactorization. In NIPS, pages 1257–1264, 2007.

[20] T. V. Nguyen, A. Karatzoglou, and L. Baltrunas. Gaussianprocess factorization machines for context-awarerecommendations. In ACM SIGIR. ACM, 2014.

[21] S. Rendle. Factorization machines. In ICDM, 2010.

[22] S. Rendle and L. Schmidt-Thieme. Online-updating regularized kernel matrix factorization models for large-scale recommender systems. In RecSys, 2008.

[23] K. Subbian, C. Aggarwal, and K. Hegde. Recommendationsfor streaming data. In CIKM, 2016.

[24] S. Wang, J. Tang, and H. Liu. Clare: a joint approach to label classification and tag recommendation. In AAAI, 2017.

[25] S. Wang, J. Tang, Y. Wang, and H. Liu. Exploring implicithierarchical structures for recommender systems. In IJCAI,2015.

[26] L. Xiang, Q. Yuan, S. Zhao, L. Chen, X. Zhang, Q. Yang,and J. Sun. Temporal recommendation on graphs via long-and short-term preference fusion. In ACM SIGKDD, 2010.

[27] L. Xiong, X. Chen, T. Huang, J. G. Schneider, and J. G.Carbonell. Temporal collaborative filtering with bayesianprobabilistic tensor factorization. In SDM, 2010.

[28] D. Yin, L. Hong, Z. Xue, and B. D. Davison. Temporal dynamics of user interests in tagging systems. In AAAI, 2011.

[29] D. Zhang, R. Mao, and W. Li. The recurrence dynamics ofsocial tagging. In WWW, 2009.

[30] X. Zhao, W. Zhang, and J. Wang. Interactive collaborativefiltering. In CIKM, 2013.
