
General Functional Matrix Factorization Using Gradient Boosting

Tianqi Chen [email protected]

Shanghai Jiao Tong University, China

Hang Li [email protected]

Huawei Noah’s Ark Lab, Hong Kong

Qiang Yang [email protected]

Huawei Noah’s Ark Lab, Hong Kong

Yong Yu [email protected]

Shanghai Jiao Tong University, China

Abstract

Matrix factorization is among the most successful techniques for collaborative filtering. One challenge of collaborative filtering is how to utilize available auxiliary information to improve prediction accuracy. In this paper, we study the problem of utilizing auxiliary information as features of factorization and propose formalizing the problem as general functional matrix factorization, whose model includes conventional matrix factorization models as its special cases. Moreover, we propose a gradient boosting based algorithm to efficiently solve the optimization problem. Finally, we give two specific algorithms for efficient feature function construction for two specific tasks. Our method can construct more suitable feature functions by searching in an infinite functional space based on training data and thus can yield better prediction accuracy. The experimental results demonstrate that the proposed method outperforms the baseline methods on three real-world datasets.

1. Introduction

Matrix factorization has proved to be one of the most successful approaches to collaborative filtering. It assumes that the users' preference matrix


$\mathbf{Y} \in \mathbb{R}^{n \times m}$ can be factorized into a product of two low-rank matrices $\mathbf{U} \in \mathbb{R}^{d \times n}$ and $\mathbf{V} \in \mathbb{R}^{d \times m}$ as $\hat{\mathbf{Y}}(i,j) = \mathbf{U}_i^T \mathbf{V}_j$, and conducts collaborative filtering by exploiting U and V. In real-world scenarios, many types of auxiliary information are available besides the matrix Y, and the use of them may enhance the prediction accuracy. For example, information such as time and location can help better predict users' preferences based on the contexts, while information such as users' ages and genders can help make better predictions on the basis of user profiles. From the viewpoint of learning, the use of auxiliary information can overcome the sparsity of the matrix data.

The use of auxiliary information in matrix factorization has been studied, and various models (Agarwal & Chen, 2009; Rendle et al., 2011; Weston et al., 2012) have been proposed. In these models, auxiliary information is encoded as features, and factorization of the matrix amounts to linearly mapping the feature vectors in the feature space into a latent space. The key question here is how to encode useful features to enhance the performance of the model. Ideally one would like to automatically construct “good” feature functions in the learning process.

To the best of our knowledge, little work has been done on automatic feature function construction in matrix factorization. In this paper, we try to tackle the challenge by employing general functional matrix factorization and gradient boosting. In our method, feature functions are automatically constructed via searching in a functional space during the process of learning. The learning task is effectively and efficiently carried out with the gradient boosting algorithm. The contributions of the paper are as follows. (1) We propose


Figure 1. Outline of Our Methods. (The figure sketches the model formalization and training procedure of Section 2, in which a new feature function is constructed for one latent factor dimension at a time under a training loss plus function complexity objective, and the two specific feature functions of Section 3: a time-dependent step function over a user's interest with learned split times, and a demographic decision tree with splits such as Age>15? and Major=CS?.)

a general formalization of functional matrix factorization. (2) We introduce an efficient algorithm to solve the optimization problem using gradient boosting. (3) We give two specific feature construction algorithms using time and demographic information. (4) We empirically demonstrate the effectiveness of the proposed method on three real-world datasets.

The rest of the paper is organized as follows. Section 2 gives the general description of our model. Section 3 describes two specific algorithms for feature function construction. Experimental results are provided in Section 4. Related work is introduced in Section 5. Finally, the paper is concluded in Section 6.

2. General Functional Matrix Factorization

2.1. Model

Matrix factorization with auxiliary information can be represented by the following equation, where the auxiliary information is denoted as x.

$$Y(i, j, \mathbf{x}) = U^T(i, \mathbf{x})\, V(j, \mathbf{x}) \qquad (1)$$

Note that for simplicity we always assume that there is an extra constant column in both U and V to model the bias effect. Further generalization, which involves the sum of multiple factorizations, can be considered. To simplify the discussion, we restrict our model to a single factorization. The latent factor model in Equation (1) assumes that the prediction Y is a function of user i, item j and the auxiliary information x in the current context. The matrix U is constructed by a linear combination of features. The k-th dimension of U is calculated by the following equation¹

$$U_k(i, \mathbf{x}) = \sum_s \alpha_{k,s} f_{k,s}(i, \mathbf{x}), \quad f_{k,s} \in \mathcal{F} \qquad (2)$$

¹ The indicator function can be used to encode the user/item index in the model.

The matrix V can be formalized similarly, while in many applications it seems to be sufficient to define one of them (either U or V) in a functional form. F is the set of candidate feature functions (a space of functions), which is closed under scaling and can be infinite. The novelty of our model lies in that it introduces functions as model parameters, which allows us to construct “good” feature functions f_{k,s} ∈ F in learning of the latent factor model. To simplify the notation, we will also write Y_ij for Y(i,j,x), U_ki for U_k(i,x), and V_kj for V_k(j,x).

One very important factor in our method is the definition of the functional space F. As shown in Fig. 1, we will give two examples of feature functions. The first one is a user-specific k-piece step function that captures users' interest change over time; the second one is a user-independent decision tree to model users with their properties. The details will be introduced in Section 3.
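To make the formalization concrete, the following is a minimal Python sketch (not the authors' implementation) of how a prediction could be computed when each latent dimension of U is a weighted sum of feature functions as in Equations (1) and (2); the class and helper names are illustrative assumptions.

import numpy as np

# Minimal sketch of the GFMF prediction in Equations (1)-(2); names are
# illustrative, not the authors' code. Each latent dimension k of U is a
# sum of weighted feature functions f_{k,s}(i, x); V is a plain d x m matrix.
class GFMFModel:
    def __init__(self, d, m):
        self.d = d
        # feature_funcs[k] holds (alpha, f) pairs for latent dimension k
        self.feature_funcs = [[] for _ in range(d)]
        self.V = 0.01 * np.random.randn(d, m)

    def U_k(self, k, i, x):
        # Equation (2): U_k(i, x) = sum_s alpha_{k,s} * f_{k,s}(i, x)
        return sum(alpha * f(i, x) for alpha, f in self.feature_funcs[k])

    def predict(self, i, j, x):
        # Equation (1): Y(i, j, x) = U(i, x)^T V(j, x)
        return sum(self.U_k(k, i, x) * self.V[k, j] for k in range(self.d))

# Example feature function: an indicator of the user index, which recovers a
# conventional per-user latent factor value as a special case.
def user_indicator(user_id):
    return lambda i, x: 1.0 if i == user_id else 0.0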

Relationship with Existing Models

General functional matrix factorization contains existing models as its special cases in which certain constraints are imposed on the model. The model degenerates to the conventional factorization models using auxiliary information when the feature functions f_{k,s} are predefined and fixed. The model becomes that of functional matrix factorization in (Zhou et al., 2011) when the feature functions in all the latent dimensions are assumed to be the same decision tree.

2.2. Training

The objective function of our method is defined by the following equation

$$L = \sum_{i,j} l(Y_{ij}, \hat{Y}_{ij}) + \sum_{k,s} \Omega(f_{k,s}) \qquad (3)$$

where l is a differentiable convex loss function that measures the difference between the prediction Ŷ and


Algorithm 1 GFMF Training

repeat
  update U:
  for k = 1 to d do
    collect statistics for L̃(f) in Equation (5)
    construct the function f* that optimizes L̃(f) + Ω(f)
    U_k(i,x) ← U_k(i,x) + ε f*(i,x)
  end for
  update V:
  for k = 1 to d do
    update each V_k in the same way as for U
  end for
until converge

the target Y, and the summation is over all the observed entries. The second term Ω measures the complexity of the model (i.e., the feature functions) to avoid over-fitting. With the objective function, a model employing simple and useful feature functions will be selected as the best model. Because the model includes functions as parameters, we cannot directly use traditional methods such as stochastic gradient descent to find the solution. We propose exploiting gradient boosting to tackle the challenge. More specifically, we construct the latent dimensions additively: in each step, we pick a latent dimension k and try to add a new feature function f to the dimension. During the step, a search over the functional space F is performed to find the feature function f that optimizes the objective function.

For a general convex loss function l, the search for f can be hard. We approximate it with a quadratic function to simplify the search objective. We define

$$g_{ij} = \frac{\partial}{\partial \hat{Y}_{ij}} l(Y_{ij}, \hat{Y}_{ij}), \qquad h_{ij} = \frac{\partial^2}{\partial \hat{Y}_{ij}^2} l(Y_{ij}, \hat{Y}_{ij}) \qquad (4)$$

Suppose that we add a new feature function f(i,x) to U_k(i,x) while fixing the rest of the parameters. The first part of the objective function can be approximated by Taylor expansion as:

$$\begin{aligned}
L(f) &= \sum_{i,j} l(Y_{ij}, \hat{Y}_{ij} + f(i,\mathbf{x}) V_{kj}) \\
&= \sum_{i,j} l(Y_{ij}, \hat{Y}_{ij}) + \sum_{i,j} (g_{ij} V_{kj}) f(i,\mathbf{x}) + \frac{1}{2} \sum_{i,j} (h_{ij} V_{kj}^2) f^2(i,\mathbf{x}) + o(f^2(i,\mathbf{x})) \\
&= \tilde{L}(f) + o(f^2(i,\mathbf{x}))
\end{aligned} \qquad (5)$$

We use L̃(f) + Ω(f) to find the best feature function f.
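As a concrete example (not spelled out in the paper, but following directly from Equation (4)): for the squared loss $l(y, \hat{y}) = \frac{1}{2}(\hat{y} - y)^2$, we get $g_{ij} = \hat{Y}_{ij} - Y_{ij}$ and $h_{ij} = 1$, so $\tilde{L}(f)$ reduces to a penalized least-squares fit of $f(i,\mathbf{x}) V_{kj}$ to the current residuals.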

The general procedure is shown in Algorithm 1, where ε is a shrinkage parameter to avoid over-fitting.
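To illustrate the procedure, here is a self-contained Python sketch of the boosting loop under squared loss. It is not the authors' code: the candidate pool of feature functions (per-user indicators) and all function and variable names are illustrative assumptions, and the paper's segmentation and tree construction of Section 3 would replace the brute-force search over candidates.

import numpy as np

# Sketch of the Algorithm 1 boosting loop under squared loss (g = residual,
# h = 1). The candidate pool F here is a toy set of per-user indicator
# functions; the paper instead constructs segmentations or trees (Section 3).
def train_gfmf(data, n_users, n_items, d=4, epsilon=0.3, lam=2.0, rounds=20):
    # data: list of (user i, item j, context x, target y)
    U_funcs = [[] for _ in range(d)]            # boosted feature functions per dimension
    V = 0.01 * np.random.randn(d, n_items)

    def U_k(k, i, x):
        return sum(w * f(i, x) for w, f in U_funcs[k])

    def predict(i, j, x):
        return sum(U_k(k, i, x) * V[k, j] for k in range(d))

    candidates = [lambda i, x, u=u: 1.0 if i == u else 0.0 for u in range(n_users)]

    for _ in range(rounds):
        for k in range(d):
            residuals = [(i, j, x, predict(i, j, x) - y) for (i, j, x, y) in data]
            best, best_G, best_H, best_score = None, 0.0, 0.0, 0.0
            for f in candidates:
                # statistics of Equation (5) restricted to this candidate
                G = sum(g * V[k, j] * f(i, x) for (i, j, x, g) in residuals)
                H = sum((V[k, j] * f(i, x)) ** 2 for (i, j, x, _) in residuals)
                score = G * G / (H + lam)       # proportional to the loss reduction, cf. Equation (10)
                if score > best_score:
                    best, best_G, best_H, best_score = f, G, H, score
            if best is not None:
                beta = -best_G / (best_H + lam)  # optimal weight, cf. Equation (9)
                U_funcs[k].append((epsilon * beta, best))
        # V could be updated symmetrically, or by SGD / coordinate descent
    return U_funcs, V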

Convergence of the Training

The choice of h_ij does not necessarily require that the second-order gradient be used to approximate the loss function. The algorithm converges as long as the following inequality holds

$$l(Y_{ij}, \hat{Y}_{ij} + \Delta y) \le l(Y_{ij}, \hat{Y}_{ij}) + g_{ij}\,\Delta y + \frac{1}{2} h_{ij}\,\Delta y^2 + o(\Delta y^2) \qquad (6)$$

Proof. Denoting the upper bound as L̃(f), we have L(εf*) + Ω(εf*) ≤ L̃(εf*) + Ω(εf*) ≤ L̃(0) + Ω(0) = L(0). Here we assume 0 ∈ F and Ω(0) = 0. Thus the objective function decreases at each step.²

Time Complexity

The time complexity for calculating the sufficient statistics in the general algorithm is O(kN) if we buffer the predictions of observed pairs in the training data, where N represents the number of observed pairs in the matrix. Furthermore, the time complexity of the training algorithm depends on the specific setting of feature function learning. We will provide the time complexity of the specific learning algorithms in Section 3, which is of log-linear order.

Integration with Existing Methods

The training algorithm is an iterative optimization procedure like coordinate descent. This allows us to easily integrate our approach with existing methods for learning a latent factor model: to use our update algorithm for the feature function construction and to use stochastic gradient descent or coordinate descent to optimize the other parts of the model.

2.3. Segmentation Function Learning

An important part of the training algorithm is to learn the feature function f in each step. In this section, we consider a general method for learning a class of feature functions. It first segments the data into several parts using auxiliary information, and then learns an independent parameter in each segment

$$f(i, \mathbf{x}) = \beta_{p(i,\mathbf{x})} \qquad (7)$$

The function p(i,x) defines the cluster index of an instance, and a weight β is assigned to each cluster.

² This proof holds up to o(Δy²). To make the loss strictly decrease, we can set h_ij to be a uniform upper bound of the second-order gradient to remove this error term.


Assume that there are |C| clusters and the instance set of each cluster is defined as I_c = {(i, j, x) | p(i,x) = c}. The complexity of the feature function is defined as

$$\Omega(f) = \gamma |C| + \frac{1}{2} \lambda \sum_{c=1}^{|C|} \beta_c^2 \qquad (8)$$

The first term represents the number of segments; the second term represents the norm of the weights in each cluster. For this problem, the optimal β can be found analytically for a given segmentation as follows

$$\beta_c^* = -\frac{\sum_{(i,j,\mathbf{x}) \in I_c} g_{ij} V_{kj}}{\sum_{(i,j,\mathbf{x}) \in I_c} h_{ij} V_{kj}^2 + \lambda} \qquad (9)$$

The value will be mainly determined by λ when the second-order term is small. Putting β* back into L̃(f), we obtain the objective function as

$$\tilde{L}(f) + \Omega(f) = L(0) - \frac{1}{2} \sum_{c=1}^{|C|} \frac{\left(\sum_{(i,j,\mathbf{x}) \in I_c} g_{ij} V_{kj}\right)^2}{\sum_{(i,j,\mathbf{x}) \in I_c} h_{ij} V_{kj}^2 + \lambda} + \gamma |C| \qquad (10)$$

We can use Equation (10) to guide the search for a segmentation function. The basic motivation of the segmentation function is to make the latent dimensions of the data in the same segment as similar as possible. One of the most widely used segmentation functions is the decision tree (Friedman et al., 2001). However, for specific types of segmentation functions, we can use specific algorithms rather than decision trees. We will use the general methodology in this section to guide the feature function constructions in Section 3.
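As a small illustration, the following Python sketch computes the optimal per-cluster weights of Equation (9) and the resulting objective value of Equation (10) for a given segmentation; the function name and input layout are illustrative assumptions, not the authors' code.

import numpy as np

# Sketch of the segmentation-function scoring in Equations (9)-(10). Given a
# candidate assignment of instances to clusters, it returns the optimal leaf
# weights beta_c and the change in the objective relative to f = 0.
def score_segmentation(instances, cluster_of, n_clusters, lam=2.0, gamma=1.0):
    # instances: list of (g_ij, h_ij, V_kj); cluster_of: list of cluster ids
    G = np.zeros(n_clusters)
    H = np.zeros(n_clusters)
    for (g, h, v), c in zip(instances, cluster_of):
        G[c] += g * v          # sum of g_ij * V_kj over the cluster
        H[c] += h * v * v      # sum of h_ij * V_kj^2 over the cluster
    beta = -G / (H + lam)                      # Equation (9)
    # change in L~(f) + Omega(f) relative to f = 0, cf. Equation (10)
    objective_change = -0.5 * np.sum(G * G / (H + lam)) + gamma * n_clusters
    return beta, objective_change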

3. Specific Feature Functions

3.1. Time dependent Feature Function

Time is very important auxiliary information in collaborative filtering. One important reason is that users' preferences change over time. To model users' local preferences, following the methodology in (Koren, 2009), we can define an indicator function for each localized time bin as a feature function; the model can be written as

$$Y(i, j, t) = U_{i, \mathrm{binid}(t)}^T V_j \qquad (11)$$

The idea is to learn a localized latent factor in each time bin to capture the time-dependent preference of the user. The time bin size is fixed and needs to be predefined. In real-world scenarios, different users may have different patterns of preference evolution. Some users may change their interests frequently, while others may stick to their interests.

Figure 2. Illustration of Time-dependent Feature Function. (The figure plots, for each latent factor dimension of user i, the step functions learned over time t; the feature function learned in the second iteration introduces a split time that tracks user i's preference on topic 3 against t.)

The patterns of change for different topics may also differ; a user might keep a constant liking for classical music for a long time while changing her preference for pop music very often. These patterns may be hard to capture with a fixed size of time bins. We can utilize our model to learn more flexible feature functions over time. Specifically, we define the user latent factor as

$$U_k(i, t) = \sum_s f_{k,s,i}(t) \qquad (12)$$

where each f_{k,s,i}(t) is a user-specific |C|-piece step function in t defined by segmentations on time. Fig. 2 gives an illustration of the model. Note that the time bin indicator function can also be viewed as a special type of our time feature function. The difference is that our method decides the number of segments, the change points, and the function values based on the training data, which offers more flexibility. Using the result in Section 2.3, the objective for discovering the time segmentation can be written as follows

$$\operatorname*{arg\,min}_{t_1, \cdots, t_{|C|-1}} \; -\frac{1}{2} \sum_{c=1}^{|C|} \frac{\left(\sum_{(i,j,t) \in I_c} g_{ij} V_{kj}\right)^2}{\sum_{(i,j,t) \in I_c} h_{ij} V_{kj}^2 + \lambda} + \gamma |C| \qquad (13)$$
$$\text{where } I_c = \{(i,j,t) \mid t_{c-1} \le t < t_c\}, \quad t_0 = -\infty, \quad t_{|C|} = +\infty$$

The optimal solution of this objective can be found using a dynamic programming algorithm with quadratic time complexity. We use a faster greedy algorithm: it starts from all instances as independent segments, then greedily combines segments together when an improvement of the objective function can be made, until a local minimum is reached.
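Below is a Python sketch of one way such a greedy procedure could be organized, assuming that only adjacent (time-contiguous) segments are merged and that candidate merges are kept in a priority queue, as the complexity discussion later suggests; the data layout and all names are illustrative assumptions, not the authors' code.

import heapq

# Sketch of the greedy time segmentation: start with every time-ordered
# instance as its own segment, then repeatedly merge the adjacent pair whose
# merge most improves the objective of Equation (13), stopping at a local
# minimum. Per-segment statistics are G = sum g_ij*V_kj, H = sum h_ij*V_kj^2;
# the per-segment value then follows from Equation (9).
def greedy_time_segments(stats, lam=2.0, gamma=1.0):
    segs = [[g, h, True] for (g, h) in stats]          # [G, H, alive]
    right = {i: i + 1 for i in range(len(segs) - 1)}   # next live segment
    left = {i + 1: i for i in range(len(segs) - 1)}    # previous live segment
    version = [0] * len(segs)                          # invalidates stale heap entries

    def score(G, H):
        return 0.5 * G * G / (H + lam)                 # one segment's loss reduction

    def improvement(a, b):
        # objective decrease from merging a and b: save one gamma, trade score terms
        merged = score(segs[a][0] + segs[b][0], segs[a][1] + segs[b][1])
        return merged - score(*segs[a][:2]) - score(*segs[b][:2]) + gamma

    heap = []
    def push(a, b):
        heapq.heappush(heap, (-improvement(a, b), a, b, version[a], version[b]))
    for i in range(len(segs) - 1):
        push(i, i + 1)

    while heap:
        neg_imp, a, b, va, vb = heapq.heappop(heap)
        if neg_imp >= 0:
            break                                      # best remaining merge does not help
        if version[a] != va or version[b] != vb or not segs[b][2]:
            continue                                   # stale entry, skip
        # merge segment b into segment a
        segs[a][0] += segs[b][0]; segs[a][1] += segs[b][1]; segs[b][2] = False
        version[a] += 1; version[b] += 1
        nxt = right.pop(b, None)
        if nxt is not None:
            right[a], left[nxt] = nxt, a
            push(a, nxt)
        else:
            right.pop(a, None)
        if a in left:
            push(left[a], a)
    return [(G, H) for G, H, alive in segs if alive]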


Algorithm 2 Tree Split Finding with Missing Value

Require: I: instance set, n_p: number of properties
gain ← 0
G_all ← Σ_{(i,j)∈I} g_ij V_kj,  H_all ← Σ_{(i,j)∈I} h_ij V_kj²
for p = 1 to n_p do
  I_p ← {(i, j, x) ∈ I | x_p ≠ missing}
  enumerate default: go to the right
  G_left ← 0, H_left ← 0
  for (i, j, x) in sorted(I_p, ascending order by x_p) do
    G_left ← G_left + g_ij V_kj
    H_left ← H_left + h_ij V_kj²
    G_right ← G_all − G_left, H_right ← H_all − H_left
    gain ← max(gain, G_left²/(H_left + λ) + G_right²/(H_right + λ) − G_all²/(H_all + λ))
  end for
  enumerate default: go to the left
  G_right ← 0, H_right ← 0
  for (i, j, x) in sorted(I_p, descending order by x_p) do
    G_right ← G_right + g_ij V_kj
    H_right ← H_right + h_ij V_kj²
    G_left ← G_all − G_right, H_left ← H_all − H_right
    gain ← max(gain, G_left²/(H_left + λ) + G_right²/(H_right + λ) − G_all²/(H_all + λ))
  end for
end for
output the split and default direction with maximum gain

Time Complexity

Let m_i be the number of observed entries in the i-th row. The time complexity of the greedy algorithm for user i is O(m_i log m_i), where the log factor comes from the cost of maintaining the priority queue. The overall training complexity for one iteration over the dataset is O(kN log m). As for the space complexity, the addition of two step functions can be combined into a single step function, which allows us to store U_k(i, t) efficiently without buffering U for each instance.
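A minimal sketch of this space optimization follows, under the assumption that each step function is stored as a sorted list of (breakpoint, value) pairs meaning "this value holds from the breakpoint onward"; the representation and names are illustrative, not the authors' code.

# Sketch: the sum of two step functions, each stored as a sorted list of
# (breakpoint, value) pairs, is again a single step function, so the boosted
# U_k(i, t) can be kept in merged form instead of as a list of addends.
def add_step_functions(f, g):
    points = sorted({t for t, _ in f} | {t for t, _ in g})
    def value_at(steps, t):
        v = 0.0
        for s, val in steps:          # steps are sorted by breakpoint
            if s <= t:
                v = val
            else:
                break
        return v
    return [(t, value_at(f, t) + value_at(g, t)) for t in points]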

3.2. Demographic Feature Function

Data sparsity is a major challenge for collaborative filtering. The problem becomes especially severe for handling new users, who may have only a few or even no records in the system. One important type of auxiliary information that can be used to deal with this problem is user demographic data such as age, gender and occupation. This information can be relatively easily obtained and has a strong indication of user preferences. For example, a young boy is more likely to favor animation, and a student of computer science may have an interest in IT news.

Figure 3. Illustration of Demographic Feature Function. (The figure shows the latent factor dimensions of U(x) built from learned feature functions; the feature function learned in the second iteration, for the topic "IT news source on microblog", is a decision tree with splits Age>15? and Major=CS? and example leaf values.)

To learn useful feature functions from the user demographic data for latent factor model construction, we use x to represent the demographic information and define the latent factor as

$$U_k(i, \mathbf{x}) = \sum_s f_{k,s}(\mathbf{x}) \qquad (14)$$

where each f_{k,s}(x) is a user-independent feature function. We define the feature function set F as the set of regression trees over the user properties x. Fig. 3 gives an intuitive explanation of the model. We place example values in the leaf nodes to illustrate the structure of the trees, whereas in the real feature trees they are calculated using Equation (9). The advantage of using trees as features is that it is possible to select useful attributes, create high-order combinations of attributes, and automatically segment continuous attributes to construct meaningful features.

Next, we describe the tree learning algorithm. The tree construction is also guided by Equation (10) in Section 2.3. In each round, we greedily enumerate the attributes and choose to make a split on the attribute that gives the maximum loss reduction. The process is repeated until some stopping condition is met (e.g., maximum depth), and then a pruning process is carried out to merge the nodes with small loss reduction. One problem that we may face here is the existence of missing values; that is, not all the attributes are available for each user. To solve this problem, we decide a default direction (either the left child or the right child) for each node and adopt the default direction for classification when a missing value is encountered. The split construction algorithm also enumerates default directions to find the best split; the algorithm is shown in Algorithm 2³.

³ All features are assumed to be numerical here, and categorical attributes are encoded in an indicator vector.
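The following Python sketch condenses the core of Algorithm 2 for a single attribute, assuming per-instance statistics g_ij V_kj and h_ij V_kj²; instances with a missing value implicitly follow whichever default direction is being enumerated, since they are counted only in the complementary side. All names are illustrative assumptions rather than the authors' code.

# Sketch of the split-gain sweep of Algorithm 2 for one attribute:
# instances with a missing value (x is None) follow the enumerated default
# direction; the others are swept in sorted order to score every split point.
def best_split_for_attribute(instances, lam=2.0):
    # instances: list of (x_p or None, g_ij * V_kj, h_ij * V_kj ** 2)
    G_all = sum(g for _, g, _ in instances)
    H_all = sum(h for _, _, h in instances)
    present = sorted((x, g, h) for x, g, h in instances if x is not None)
    root_score = G_all * G_all / (H_all + lam)
    best = (0.0, None, None)                 # (gain, split value, default direction)

    for default in ("right", "left"):
        order = present if default == "right" else list(reversed(present))
        G_near, H_near = 0.0, 0.0            # side the sweep accumulates into
        for x, g, h in order:
            G_near += g; H_near += h
            G_far, H_far = G_all - G_near, H_all - H_near   # includes missing values
            gain = (G_near * G_near / (H_near + lam)
                    + G_far * G_far / (H_far + lam) - root_score)
            if gain > best[0]:
                best = (gain, x, default)
    return best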


Figure 4. Illustration of Feature Construction. (a) Time-dependent feature construction: given a user's observed interest on a topic against time t, a wrong split point or too many splits incurs a high cost, while a good segmentation balances the training loss and the complexity penalty. (b) Tree construction: growing the tree on splits such as Age>15? and isStudent? creates the high-order feature "Age>15 and isStudent".

Time Complexity

The time complexity for one iteration over the dataset is O(kDN_p log(n) + kN), where D is the maximum depth of the tree and N_p is the number of observed entries in the user attribute matrix. We note that the computation only depends on the number of observed entries. We also note that the complexity of the tree construction (the first part) does not depend on the number of observed rating pairs N. This is because we can accumulate the statistics for each user before the tree construction, so that the construction step only depends on the number of users. Similarly, we only need to buffer U for each user.

3.3. Advantage of Our Method

Next, we will discuss why our method can enhance the prediction accuracy. As previously explained, user preference patterns are dynamic and hard to capture. Fig. 4 illustrates our method of feature construction. In the time-dependent case, different users may have different interest-changing patterns. Our method can effectively detect the dynamic changes in interest over some topics along the time axis. The regularization in our method will also lead to the selection of a simple model when no change occurs in user preference. In the user demographic case, our method can automatically create high-order features and build tree-based complex feature functions. The goal of optimization

Table 1. Statistics of Datasets

Dataset             #user    #item    #observed

ml-1M               6,040    3,952    1 M
Yahoo! Music        1 M      624 K    252 M
Tencent Microblog   2.3 M    6,095    73 M

is to find feature functions which are simple and fit the data well; automatically constructing high-order features and automatically detecting change points make this possible and give our model the potential to achieve better results.

4. Experiments

4.1. Dataset and Setup

We use three real-world datasets in our experiments. The statistics of the datasets are summarized in Table 1. The first dataset is the MovieLens dataset ml-1M, which contains about 1 million ratings of 3,952 movies by 6,040 users with timestamps. We use five-fold cross validation to evaluate the performance of our method and the baseline methods. The second dataset is the Yahoo! Music Track1 dataset⁴, which contains 253 million ratings by 1 million users over 624 thousand tracks from the Yahoo! Music website. The data is split into training, validation, and test sets by time. We use the official training and validation datasets to train the model and the test set for evaluation. For these two datasets, we use RMSE (root mean square error) as the evaluation measure. The first two datasets are utilized to evaluate the performance of our method with time-dependent feature function construction.

The third dataset we use is from Tencent Microblog⁵, one of the largest Twitter-like social media services in China. The dataset contains the recommendation records of 2.3 million users over a time period of about two months. The item set is defined as celebrities and information sources on Tencent Microblog. The dataset is split into training and testing data by time. Furthermore, the test set is split into a public and a private set for independent evaluations. We use the training set to train the model and the public test set to conduct the evaluation. We also separate 1/5 of the data from the training data by time for parameter selection. The dataset is extremely sparse, with on average only two positive records per user. Furthermore, about 70% of the users in the test set never occur in the training set. User demographic information (age and gender) and

⁴ http://kddcup.yahoo.com/datasets.php
⁵ http://kddcup2012.org/c/kddcup2012-track1/data


Table 2. Results on MovieLens in RMSE

Method        RMSE (d = 32)      RMSE (d = 64)

MF            0.8519 ± 0.0008    0.8497 ± 0.0010
TimeMF        0.8474 ± 0.0009    0.8450 ± 0.0011
GFMF          0.8436 ± 0.0013    0.8417 ± 0.0011
SVD++         0.8473 ± 0.0006    0.8456 ± 0.0005
TimeSVD++     0.8423 ± 0.0007    0.8410 ± 0.0006
GFMF++        0.8393 ± 0.0010    0.8377 ± 0.0008

social network information is available in the dataset. For this dataset, we use MAP@k (mean average precision) as the evaluation measure. The dataset is exploited to evaluate the performance of our method with demographic feature construction.
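For reference, the following is a sketch of one common MAP@k definition; competition-specific normalizations vary, so this is an assumption rather than the exact metric used in the experiments. It assumes a ranked recommendation list and a set of relevant (positively acted-on) items per user.

# Sketch of MAP@k: mean over users of the average precision at cutoff k.
def average_precision_at_k(ranked_items, relevant, k):
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k) if relevant else 0.0

def map_at_k(rankings, relevants, k):
    scores = [average_precision_at_k(r, rel, k)
              for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)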

In all the experiments, the parameter λ of our model is heuristically set to 2. We use the validation set (in ml-1M, via cross validation) to tune the meta-parameters of the baseline models and our model.

4.2. Results on MovieLens

We compare six methods on the ml-1M dataset. The compared methods are as follows:

• MF is the basic matrix factorization model that does not utilize auxiliary information.

• TimeMF uses predefined time bins to model users' preference changes over time. The size of the time bins is tuned through cross validation.

• GFMF is our model that learns time dependent feature functions.

• SVD++ is the matrix factorization model with implicit feedback information (Koren, 2008).

• TimeSVD++ uses the same time feature function as TimeMF, and it also utilizes implicit feedback information.

• GFMF++ utilizes the GFMF part to train a time dependent model and also uses implicit feedback information.

We train the baseline models and our model with latent dimensionality d of 8, 16, 32 and 64, and find that the results are consistent. We show the results for two parameter settings in Table 2. Both TimeMF and SVD++ give a 0.004 reduction over MF in terms of RMSE. GFMF is able to give a further 0.003 reduction of RMSE. Finally, the GFMF++ model that combines implicit feedback information into our model

Table 3. Results on Yahoo! Music in RMSE

Method       RMSE (d = 64)    RMSE (d = 128)

SVD++        23.06            23.02
TimeSVD++    22.88            22.84
GFMF++       22.80            22.77

achieves the best result. We can see that our model brings more than 75% improvement compared with TimeMF, which uses predefined time bins. This is due to the fact that our method adapts the time bin size and time split points for each user and can capture users' preference evolution more accurately.

4.3. Results on Yahoo! Music

We also investigate the performance of our method on the Yahoo! Music dataset. We compare three methods on this dataset, with the same model definitions as in the previous section. For a time stamp that does not appear in the training set, we assign a constant feature to it, equal to the feature of the corresponding last time bin in the training set. We add user and item time bias terms to all the models to offset the bias effect and focus on the modeling of latent dimensions. We compare the methods with different latent dimensionalities (ranging over 8, 16, 32, 64, and 128), and obtain consistent results. We report the results for d = 64 and d = 128.

The experimental results are shown in Table 3. The comparison shows that GFMF++ gives 40% more RMSE reduction over SVD++ than TimeSVD++ does. This indicates that our method can learn good feature functions on a large real-world dataset.

4.4. Results on Tencent Microblog

We compare six methods on the Tencent Microblog dataset. We use logistic loss to train all the models. Due to the extreme sparsity of the dataset, the performance of matrix factorization is similar to that of the bias model and thus is not reported. Because this is a social network dataset, it is interesting to see the impact of social information as well, and thus we also include social-aware models in the experiment.

• item bias is the basic model that only includes a bias term to measure the popularity of each item.

• DemoMF is a latent factor model that utilizes predefined feature functions on age and gender. We segment ages into intervals in advance and use an indicator function to represent the age feature of each interval.


Table 4. Results on Tencent Microblog in MAP

Method         MAP@1     MAP@3     MAP@5

item bias      22.58%    34.65%    38.26%
DemoMF         24.07%    36.38%    40.00%
GFMF           24.48%    36.82%    40.43%
SocialMF       24.67%    37.16%    40.80%
SocialDemoMF   25.41%    37.98%    41.60%
SocialGFMF     25.70%    38.28%    41.90%

• GFMF is our model that learns demographic feature functions during the training process.

• SocialMF follows the approach of social-aware matrix factorization (Jamali & Ester, 2010), which utilizes social network information to construct user latent factors.

• SocialDemoMF and SocialGFMF are the two demographic-aware models integrated with SocialMF, respectively.

We train the models with latent dimensionality d of 8, 16, 32, and 64. The results for all the models are optimal when d = 32, and thus we only report those results here, as shown in Table 4. From the results, we can first see that the bias effect is strong in this dataset. However, improving the performance is still possible using social network and user demographic information. The performance of GFMF is consistently better than that of DemoMF. DemoMF improves upon the bias model in terms of MAP@1 by about 1.5%, while GFMF is able to further improve the performance by about 0.4%. This means GFMF can learn good feature functions to utilize the user demographic information. The results of SocialMF and GFMF are similar, which means user demographic information is as important as social information in this task. Finally, SocialGFMF, which utilizes both the social and the user demographic information, achieves the best performance among all the methods.

5. Related Work

This work is concerned with matrix factorization (Salakhutdinov & Mnih, 2008; Koren et al., 2009; Lawrence & Urtasun, 2009), which is among the most popular approaches to collaborative filtering. There are many variants of factorization models that incorporate various auxiliary information, including time information (Koren, 2009), attribute information (Agarwal & Chen, 2009; Stern et al., 2009), context information (Rendle et al., 2011), and

content information (Weston et al., 2012). These models make use of predefined feature functions to encode the auxiliary information. In (Zhou et al., 2011) a latent factor model is proposed for a different task of guiding user interviews, which can be viewed as a special case of our model. Their method cannot be applied to solve the problems in our experiments because of its specific model formulation (e.g., it cannot handle time-dependent features). Our method is unique in that it simultaneously learns the feature functions and the latent factor model for collaborative filtering, which is, to the best of our knowledge, the first work of this kind in the literature.

From the perspective of an infinite feature space, one related general framework is (Abernethy et al., 2009), which implicitly defines features using kernel functions. Our method allows explicit definition and automatic construction of features, and is scalable to deal with a large amount of data. Our method employs gradient boosting (Friedman, 2001), which is equivalent to performing coordinate descent in the functional space. Gradient boosting has been successfully used in classification (Friedman et al., 2000) and learning to rank (Burges, 2010). To our knowledge, this is the first time it is used in matrix factorization.

6. Conclusion

In this paper, we have proposed a novel method for general functional matrix factorization using gradient boosting. The proposed method can automatically search in a functional space to construct useful feature functions based on data for accurate prediction. We also give two specific algorithms for feature function construction on the basis of time and demographic information. Experimental results show that the proposed method outperforms the baseline methods on three real-world datasets. As future work, we plan to explore other feature function settings. We are also interested in implementation of the method in a distributed environment to scale up to larger problems. Finally, it is also interesting to apply our method to problems other than collaborative filtering.

Acknowledgement

We thank the anonymous reviewers for their very helpful reviews. Yong Yu is supported by grants from NSFC-RGC joint research project 60931160445. Qiang Yang is supported in part by Hong Kong CERG projects 621010, 621307, 621211, and NSFC-RGC project N HKUST624/09. We also appreciate the discussions with Wei Fan, Zhengdong Lu, Weinan Zhang, and Fangwei Hu.


References

Abernethy, J., Bach, F., Evgeniou, T., and Vert, J.-P. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803–826, 2009.

Agarwal, D. and Chen, B.-C. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pp. 19–28, New York, NY, USA, 2009. ACM.

Burges, C. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23–581, 2010.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.

Friedman, J., Hastie, T., and Tibshirani, R. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, 2001.

Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232, 2001.

Jamali, M. and Ester, M. A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, pp. 135–142, New York, NY, USA, 2010. ACM.

Koren, Y. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pp. 426–434, New York, NY, USA, 2008. ACM.

Koren, Y. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 447–456, 2009.

Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42, August 2009.

Lawrence, N. D. and Urtasun, R. Non-linear matrix factorization with Gaussian processes. In Proceedings of the 26th International Conference on Machine Learning, pp. 601–608, Montreal, June 2009. Omnipress.

Rendle, S., Gantner, Z., Freudenthaler, C., and Schmidt-Thieme, L. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2011.

Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20, pp. 1257–1264. MIT Press, Cambridge, MA, 2008.

Stern, D. H., Herbrich, R., and Graepel, T. Matchbox: large scale online Bayesian recommendations. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pp. 111–120, New York, NY, USA, 2009. ACM.

Weston, J., Wang, C., Weiss, R., and Berenzweig, A. Latent collaborative retrieval. In International Conference on Machine Learning (ICML), Edinburgh, Scotland, June 2012.

Zhou, K., Yang, S. H., and Zha, H. Functional matrix factorizations for cold-start recommendation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pp. 315–324, New York, NY, USA, 2011. ACM.