evaluating prediction accuracy for collaborative filtering ...927356/fulltext01.pdfthis report...

IN DEGREE PROJECT TECHNOLOGY,FIRST CYCLE, 15 CREDITS

, STOCKHOLM SWEDEN 2016

Evaluating Prediction Accuracy for Collaborative Filtering Algorithms in Recommender Systems

SAFIR NAJAFI

ZIAD SALAM

KTH ROYAL INSTITUTE OF TECHNOLOGYSCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Evaluating Prediction Accuracy for CollaborativeFiltering Algorithms in Recommender Systems

SAFIR NAJAFIZIAD SALAM

Degree Project in Computer Science, DD143XSupervisor: Atsuto MakiExaminer: Örjan Ekeberg

CSC, KTH, 2016-05-11

AbstractRecommender systems are a relatively new technology thatis commonly used by e-commerce websites and streamingservices among others, to predict user opinion about prod-ucts. This report studies two specific recommender algo-rithms, namely FunkSVD, a matrix factorization algorithmand Item-based collaborative filtering, which utilizes itemsimilarity. This study aims to compare the prediction ac-curacy of the algorithms when ran on a small and a largedataset. By performing cross-validation on the algorithms,this paper seeks to obtain data that supposedly may clarifyambiguities regarding the accuracy of the algorithms. Thetests yielded results which indicated that the FunkSVD al-gorithm may be more accurate than the Item-based collab-orative filtering algorithm, but further research is requiredto come to a concrete conclusion.

ReferatBedömning av Förutsägelsers Precision för

Kollaborativa Filtreringsalgoritmer iRekommendationssystem

Rekommendationssystem är en relativt ny teknologi som of-ta förekommer på bland annat e-handelsidor och streaming-tjänster för att förutse användares åsikter om produkter.Denna rapport studerar två specifika rekommendationsal-goritmer, nämligen FunkSVD, en algoritm som bygger påmatrisfaktorisering, samt Item-based Collaborative Filte-ring, som använder sig av likheten mellan olika element.Denna studie har som mål att undersöka hur träffsäkraalgoritmernas förutsägelser blir mellan en mindre och enstörre datakollektion. Genom att använda korsvalidering påalgoritmerna för de olika datakollektionerna försöker rap-porten ta fram data som kan klargöra tvetydigheter kringalgoritmernas prestanda. Testerna visade på resultat somindikerar att FunkSVD-algoritmen verkar prestera bättreän Item-based Collaborative Filtering-algoritmen, men merforskning behöver bedrivas för att dra en konkret slutsats.

Contents

1 Introduction 11.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background 32.1 Recommender Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 The Content-based Approach . . . . . . . . . . . . . . . . . . . . . . 42.3 Collaborative Filtering Approach . . . . . . . . . . . . . . . . . . . . 42.4 Hybrid Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.5 Memory-based Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 62.6 Model-based Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 62.7 Problem Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.7.1 Implementation Problems . . . . . . . . . . . . . . . . . . . . 72.7.2 Performance Problems . . . . . . . . . . . . . . . . . . . . . . 72.7.3 User Behavior Problems . . . . . . . . . . . . . . . . . . . . . 72.7.4 Dataset Problems . . . . . . . . . . . . . . . . . . . . . . . . 8

2.8 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.8.1 Mean Absolute Error (MAE) . . . . . . . . . . . . . . . . . . 82.8.2 Root Mean Square Error (RMSE) . . . . . . . . . . . . . . . 92.8.3 Offline Experiments . . . . . . . . . . . . . . . . . . . . . . . 92.8.4 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Algorithms 103.1 Item-based Collaborative Filtering . . . . . . . . . . . . . . . . . . . 10

3.1.1 Cosine-based Similarity . . . . . . . . . . . . . . . . . . . . . 103.1.2 Correlation-based Similarity . . . . . . . . . . . . . . . . . . . 103.1.3 Computing Prediction . . . . . . . . . . . . . . . . . . . . . . 11

3.2 FunkSVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4 Experiment 134.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134.2 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.2.1 Item-based Collaborative Filtering Adjustment . . . . . . . . 144.2.2 FunkSVD Adjustment . . . . . . . . . . . . . . . . . . . . . . 14

4.3 The Scalability Experiment . . . . . . . . . . . . . . . . . . . . . . . 14

5 Results 155.1 Item-based Collaborative Filtering Optimization . . . . . . . . . . . 15

5.1.1 Cosine-based Similarity . . . . . . . . . . . . . . . . . . . . . 155.1.2 Adjusted Cosine Similarity . . . . . . . . . . . . . . . . . . . 155.1.3 Correlation-based Similarity . . . . . . . . . . . . . . . . . . . 165.1.4 Optimized Parameters . . . . . . . . . . . . . . . . . . . . . . 17

5.2 FunkSVD Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 175.3 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

6 Discussion 19

7 Conclusion 21

References 22

Appendices 23

A Evaluation Data 24

B FunkSVD Parameter Testing 26

Chapter 1IntroductionRecommender systems are an evolving technology often based on a learning systemproviding predictions to users of a web service or application, based on user behavior.The research on recommender systems is a consequence of the business possibilities,allowing for increased revenue, consumer-satisfaction, as well as user-friendliness[13]. Different approaches have been presented and studied in order to improve theperformance and accuracy of recommender systems [13, 16, 22].

Improvement is still a matter of importance for developers of recommender sys-tems. Lacking performance, accuracy and other problems are surfacing as servicesare frequently expanding in terms of supply range and customer base [22]. A recom-mender system might have trouble predicting user-preferred items for other reasonsas well, often depending on the implementation strategy [15]. Many different ap-proaches have been presented concerning several issues, but the prevention of oneproblem often increases the impact of others [22].

One important aspect of recommender systems is the requirement for preciserecommendations. For example, one group of recommender systems may not requireaccurate results to present relevant data of some sort, while others might be in needof an algorithm which can produce results in between a narrow interval. Certainalgorithms may be better suited for heavy constraints on precision and validity whileothers do better in contexts where low-cost computations are a more importantfeature.

Recommender systems operate on sets of data in order to provide recommenda-tions. Commonly, datasets start out small and grow over time as more data relatedto products are obtained or new products added. An algorithm working efficientlyand accurately for a particular dataset may perform differently on other datasets[3]. In order to determine the optimal algorithm for a specific dataset, tests can beconducted with different metrics to yield an approximation of an algorithm’s per-formance. This in turn would provide guidelines for optimal algorithms dependingon dataset properties and sizes.

1

CHAPTER 1. INTRODUCTION

1.1 Related WorkDifferent approaches and techniques have been proposed for predicting user behav-ior. A really interesting paper concerning the Item-based collaborative filtering algo-rithm has been published by Sarwar et al. [20]. The paper describes the item-basedapproach thoroughly and discusses optimal parameters and compares Item-basedcollaborative filtering to User-based collaborative filtering using the Movielens 100kdataset. Another interesting algorithm has been proposed by Simon Funk [10],namely FunkSVD, a matrix factorization algorithm. An experiment will be con-ducted in order to measure item-based collaborative filtering and FunkSVD thor-oughly by comparing the algorithms’ performances with each other on two differentdatasets. The objective is to measure which algorithm is more scalable.

1.2 Problem StatementThis report will address the issue of offline computation accuracy and how certainalgorithm predictions scale based on the size of a dataset. The aim of this reportis to acquire an understanding of when a given algorithm is the most effective fora growing dataset. Naturally, the result will not yield a definite answer as to whento use which algorithm, since there are external factors such as the structure ofdatasets and what kind of predictions are preferred. Hopefully this report will pro-vide some guidelines to which algorithm is the most suitable for a dataset withspecific properties and size.

Question: Which of the algorithms Item-based collaborative filtering and FunkSVDis optimal for predicting ratings for the Movielens 100k dataset? Does the resultdiffer for the Movielens 1M dataset?

2

Chapter 2BackgroundThis section will outline the concept of recommender systems, describe several differ-ent approaches and the advantages and disadvantages of various implementations.The scope of this report lies in collaborative filtering and matrix factorization, butthis section will brief the reader about related subjects necessary to make use of theresults and analysis later in this report.

2.1 Recommender SystemsA recommender system’s main goal is to estimate ratings for items and provide apersonalized list of the recommended items to a given user [1, 27]. There are manydifferent approaches to this task, such as the comparison of ratings, item or userattributes and demographic relations.[1, 15].

Every basic recommender system contain some type of background information,consisting of a set of users, a set of items and the relation between those which inmany cases is the rating a user gives to an item [1, 15]. The rating of an item isoften represented as an integer, for example a scale from one to five. [15].

These relations can for instance be described with bipartite graphs [15] andmatrices [14]. A common tool of recommender algorithms are user profiles [17] anditem attributes [15]. User profiles may contain certain characteristics, for examplegender, age and occupation [28]. Attributes of an item depends on their domain.For instance, movies might have a genre, an actor list, a specific producer or otherrelevant attributes that link items together [17].Recommender systems filtering can be categorized into three main approaches.

1. Content-based filtering: Recommends items by matching attributes withother items that a given user have rated [14, 15]. Further explained in section2.2.

2. Collaborative filtering: Recommends items by comparing a given user witha set of users that have rated other items similarly [15]. Collaborative filteringsystems have two main approaches for filtering, namely memory-based andmodel-based [19]. Further explained in section 2.3.

3. Hybrid filtering: Recommends items by combining different type of ap-proaches together [1]. Further explained in section 2.4.

3

CHAPTER 2. BACKGROUND

2.2 The Content-based ApproachIn content-based recommender systems, a recommendation is based on the relationbetween the attributes of items that a given user has previously rated and itemswhich the user have not yet rated [14]. The content-based approach utilizes theconcept of monotonic personal interest, meaning a person’s interest will most likelynot change in the near future [19].

The process of how a basic content-based approach recommends an item involveshaving a set of items and a set of users [19]. Assuming every user have rated asubset of the items, a user profile can be constructed for each user containing theirpreferences [14, 19]. User preferences are computed by analyzing the similaritiesbetween the items that the user has previously rated [14, 19]. Recommending anitem when a user profile is available is done by matching the profile attributes withitems not yet rated by the user [14]. However lacking information in user profiles anditems leads to less accurate results. The case of insufficient data to build a reliablemodel is called the start up problem [6]. Thus, the content-based approach is heavilyreliant on the available information. On the contrary, collaborative filtering requiresuser-item rating relations.

2.3 Collaborative Filtering ApproachCollaborative filtering system recommends items to users by comparing ratings[19, 23, 27]. Opinions of real-life related users play a role in personal decision making[19]. A collaborative system is based on the same principle, but instead of consid-ering geographically adjacent user opinions, the system searches globally for similarusers [23]. However, basing recommendation on all users may not result in accuratepredictions. In collaborative filtering, the issue is addressed by matching users thatshare rating similarities [19]. As opposed to content-based filtering, collaborativefiltering provides recommendations by examining user-item rating relations [25].

4


Table 2.1. User-item rating relation. Rating is represented in a scale 1-5 where 5 isbest and 1 is worst.

Rating Matrix The Avengers The Revenant The Martian DeadpoolFred 2 4 5 1Sara ? 5 ? 2John 5 2 2 4Jessica ? 1 ? 5

Table 2.1 is an example of how a collaborative filtering system may recommendan item using rating information from different users [19]. The user-item ratingrelation can be viewed as a matrix [23], as shown in the table. Each row in table2.1 represents a user and each column a movie, the ratings are in scalar form andranges from 1 to 5.

Given a user to provide a recommendation for, the system identifies other usersbased on the rating information by examining patterns in their rating behaviors [19].The potential items to be recommended to the user will therefore be dependent onprevious user ratings. For instance analyzing Sara in table 2.1 will most likely resultin the system predicting a high rating for The Martian and a low rating for TheAvengers to Sara. This is because Sara’s rating pattern is more similar to Fred’sthan it is to John’s and Jessica’s. However, a larger quantity of users would be takeninto consideration in a real system, possibly yielding a more accurate prediction [25].

One disadvantage present to collaborative algorithms is the requirement thatsystems have multiple users rating several items in order to produce accurate results[23].

2.4 Hybrid ApproachesIn the attempt to provide more accurate recommendations, several hybrid ap-proaches has been proposed [9]. A hybrid algorithm combines several approachesin order to provide more accurate results. The structure of such an implementationdiffer depending on the hybrid approach strategy. More specifically, a hybrid ap-proach incorporates evaluation features from the content-based approach and thecollaborative filtering approach [1]. The structural idea behind hybrid approachesmay vary, descriptions of common ideas are outlined below:

• A hybrid system can predict recommendations from a dataset using both acontent-based approach and a collaborative approach separately [1]. Whenthe predictions are presented, a specific metric may be used to determine thequality of the recommendations in order to choose result set. Alternatively,the results can be combined in order to produce the highest quality resultsfrom both the approaches.

5


• A hybrid system may predict recommendations from a dataset using a collab-orative approach incorporating some of the characteristics of a content basedapproach [1].

• A hybrid system may predict recommendations from a dataset using a content-based approach incorporating some of the characteristics of a collaborativeapproach [1].

• A unified approach, combining features from both the content-based and col-laborative approaches, resulting in an algorithm with a unique implementation[1].

2.5 Memory-based AlgorithmsMemory-based algorithms use statistical techniques to find sets of users, often calledneighborhoods, that are users similar to the target user [20]. The entire dataset ora subset is used and every user is in a neighborhood of users that are similar to eachother. A neighborhood-based collaborative filtering algorithm performs predictionby first calculating the similarity or weight. The similarity is computed by observingthe distance correlation between users and items. A prediction to a user is producedby taking the weighted sum of all the ratings for that item on another item, forexample [25]. Algorithms based on memory have been widely used in commercialpurposes such as on the Amazon.com website [25].

However there are several limitations. An algorithm tries to find a set of usersby evaluating similarities between items rated by users. Therefore the dataset mustnot be sparse or consist of a low number of items [25]. These type of limitationscauses unreliable predictions. To get better and more reliable recommendations,model-based collaborative filtering can be utilized.

2.6 Model-based AlgorithmsModel-based algorithms differ from memory-based by utilizing machine learning,data mining algorithms or other algorithms to find patterns and predict ratingsby learning. Some approaches for model-based algorithms are clustering models,Bayesian models and dependency models [25].

6


2.7 Problem DomainsToday, recommender systems face a variety of problems, including implementation[22], user behavior [26] and performance issues[22]. In addition, the properties ofdatasets may also introduce difficulties [22].

2.7.1 Implementation ProblemsRecommender systems are facing certain challenges, algorithms often have theirindependent disadvantages depending on their individual implementation. Whilesome approaches are time-efficient and produce accurate predictions, they may havecorner-case problems which produce exceptionally bad predictions [1]. Anotherexample could be recommender systems requiring to recompute the offline trainingmodel often in order to provide accurate predictions, which may be a strainingprocedure, taking a lot of time and computational power. Implementation problemscan generally be solved by incorporating other methods into an existing algorithm,creating new algorithms based on various principles, or gathering more data in orderfor predictions to become more precise [1]. Many of these solutions often affectanother problem domain. For example the cold-start problem [4], where recentlyintroduced items in a system are seldom recommended. The reason is lacking dataassociated with recently added items. There might be several solutions to such aproblem. However, trying to solve the problem may result in unintended side-effects.

2.7.2 Performance ProblemsGrowing databases containing item and user information may present performanceissues. Too vast systems contain a lot of data, which in turn demands a lot of com-putation to be made in order to present predictions to users. Services consisting ofmillions of items and users benefit from efficiently scaling recommender algorithms,otherwise the computation of recommendations may be too slow [22].

2.7.3 User Behavior ProblemsHow a user behaves when using a service is crucial to making accurate predictions[26]. The problem of user behavioral issues are possibly one of the hardest to addresssince a large part of the problem is due to human behavior in general. Users whodo not contribute to the database by rating items are one of the problems [12],another one may be users that tend to exclusively rate items which they have astrong opinion about [2]. In the latter case, computation of user rating might proveinaccurate, because the value input of the computation does not truly reflect thegeneral user opinion, but merely items which some users had a particular interestin submitting a rating for [2].

7


2.7.4 Dataset ProblemsDatasets used to evaluate algorithms have a major impact on the results. Whenevaluating an algorithm, the best practice is to do it on datasets with similar domainfeatures [12]. Dataset properties can have substantial effect on algorithms. Theseproperties are commonly [12]:

• Density of the dataset. How many cells of the item-user matrix that is popu-lated with data.

• Ratio between users and items.

Density is important since users only rate a small subset of all the items [2]. Datasetsusing implicit rating are more likely to be less sparse because no effort is requiredfrom the users [12]. Evaluating an algorithm on a sparse dataset can cause lessaccuracy, because it lacks rating data to find common items between users [2].The relation between users and items may also affect algorithm performance. Al-gorithms that find similarities between users or items may achieve an increase inperformance[11]. For example evaluating an item-based algorithm in a dataset withmore users than items may lead to better results [8].

2.8 EvaluationRecommender algorithm performance can be measured by testing against differentdatasets [20]. In this paper, cross-validation is used in order to test algorithmprediction performance which is further explained in subsection 2.8.4. Algorithmprediction performance is measured with different metrics. Which metric to usedepends on the purpose of the algorithm and the goal of the measurement [12, 25].The metrics relevant to this paper will be outlined below.

2.8.1 Mean Absolute Error (MAE)Mean absolute error is a metric used to compute the average of all the the absolutevalue differences between the true and the predicted rating [5]. The lower the MAEthe better the accuracy. In general MAE can range from 0 to infinity, where infinityis the maximum error depending on the rating scale of the measured application.Computing the Mean Absolute Error is accomplished with the following formula[12]:

MAE = 1n

∑ni=1 |di| − |d̂i|

di is the actual ratingd̂i is the predicted ratingn is the amount of ratings

8


2.8.2 Root Mean Square Error (RMSE)Root mean square error computes the mean value of all the differences squared betweenthe true and the predicted ratings and then proceeds to calculate the square root out ofthe result [5]. As a consequence, large errors may dramatically affect the RMSE rating,rendering the RMSE metric most valuable when significantly large errors are unwanted.The root mean square error between the true ratings and predicted ratings is given by [12]:

RMSE =√

1n

∑ni=1(di − d̂i)2

di is the actual ratingd̂i is the predicted ratingn is the amount of ratings

2.8.3 Offline ExperimentsIn prediction analysis, offline experiments may be conducted on a dataset in order to decideoptimal algorithms to predict certain information. The offline experiment assumes theuser behavior from the time the data was collected still is accurate. Otherwise, such anexperiment is more or less meaningless for a live application. In our case, the analysisrests solely on determining the general behavior of certain algorithms. Therefore we donot have to take into account the dataset’s validity. An offline experiment may includepredicting similarities between items, calculating neighborhood clusters, among other things[12, 21, 24].

2.8.4 Cross-validationCross-validation, sometimes referred to as rotation estimation, is a validation methodologyfor analyzing statistical data. One goal of cross-validation is to pair an algorithm anda dataset in order to build a model and determine the generalizability of the algorithm.The analysis of an algorithm can be compared to the analysis of another algorithm, thuscross-validation can be used to measure algorithm performance [18].

Cross-validation splits a dataset into z equally large partitions, one of the partitionsbeing the test partition while the z - 1 rest partitions are the training partitions. The algo-rithms then trains a model with the training partitions and when the training is complete,the model is tested with the test partition, producing test data. This procedure continuesuntil every partition has been the test partition [18].

9

Chapter 3Algorithms

3.1 Item-based Collaborative FilteringThe Item-based collaborative filtering approach compares users and their ratings to a cer-tain set of items. The algorithm calculates the nearest neighbors to a given user. Everyneighbor’s previously rated items are compared to the item in focus with an item simi-larity computation. When the most similar items have been determined, the prediction iscomputed by taking the weighted average of every neighbor’s ratings of those items. Thecalculation of item similarities can be done with several methods including cosine-basedsimilarity, correlation-based similarity and adjusted cosine similarity [20, 25].

3.1.1 Cosine-based SimilarityCosine-based similarity is a method for computing similarities between items in Item-basedbased collaborative filtering. In cosine-based similarity an item is represented as a vectorin the user-space. The angle between two of these vectors are computed and the cosine ofthe angle is the similarity of the items. For example two items m and n are computed asfollows [20]:

similarities(m, n) = cosine(m̄, n̄) = m̄∗n̄‖m̄‖2∗‖n̄‖2

Where the "*" represents the dot product of the two vectors.

The cosine-based similarity computation has one drawback. User rating scale is not takeninto consideration during the computation. The calculations are performed overinformation from several columns, and each column represents a different user [20, 25].The issue may be solved with adjusted cosine similarity which subtracts the correspondinguser average rating from each co-rated (a user that has rated both item m and n) pair[20, 25]. The similarity for the two items m and n, using adjusted cosine similarity, arecomputed as follows [20]:

similarity(m, n) =∑

u∈U(Ru,m−R̄u)(Ru,n−R̄u)√

(Ru,m−R̄u)2√

(Ru,n−R̄u)2

Ru,m is the rating of user u on item m.R̄u is the average rating of item m.

3.1.2 Correlation-based SimilarityCorrelation-based similarity between two items m and n are calculated by computing thePearson correlation [20]. The Pearson correlation for the item m and n rated by a set ofusers U is computed as follows [20, 25]:

similarity(m, n) =∑

u∈U(Ru,m−R̄m)(Ru,n−R̄n)√

(Ru,m−R̄m)2√

(Ru,n−R̄n)2

Where R̄u is the average rating of user u.

10

CHAPTER 3. ALGORITHMS

3.1.3 Computing PredictionComputing predictions can be achieved by several techniques. A simple method is com-puting the weighted sum of an item m for user u by taking the sum of all the rated itemsby user u that are similar to m. The purpose of this method is to apprehend how a userrates items that are similar to the item that the method is trying to predict rating for. Theweighted sum for predicting an item m for user u is computed as follows [20]:

Pu,m =∑

N(sm,k∗Ru,k)∑N

(|sm,k|)

Where Pu,m is the prediction of item m for user u.Ru,k is the rating from user u on item k.N is a set of similar items.k is an element of N.sm,k is the similarity value between two items.

3.2 FunkSVDFunkSVD is an algorithm utilizing a matrix factorization method, namely Singular valuedecomposition. Singular value decomposition is used to reduce one matrix to two matriceswith lower dimensions1. The two resulting matrices represent generalized items and usersrespectively. Predictions are made by computing the dot product of one or more item vectorsand a user vector, yielding values which indicate the most suitable recommendations [10].An example of how a Singular value decomposition reduces a matrix into two other matricesis displayed in figure 3.1. The amount of features are decided by an input variable to thefactorization [10]. In the figure, u1 to u4 denotes the users, i1 to i3 denotes the items andf1 to f3 denotes the features. Features are usually called latent factors where Singular valuedecomposition is concerned.

Figure 3.1. Singular value decomposition in FunkSVD on a dataset

1In reality, singular value decomposition results in three matrices, but in the FunkSVD algo-rithm the third matrix is incorporated into the other matrices [10]

11

CHAPTER 3. ALGORITHMS

FunkSVD trains the model in order to achieve as accurate feature values as possible foritems and users. When training is performed, the algorithm has a parameter which oftenis called iteration count. The iteration count decides how many times the training for acell in the matrices is supposed to run. The iteration count may be a fixed amount oftimes to loop, or some other kind of condition which states when the looping is supposedto terminate [10].

A useful advantage to Singular value decomposition is that the lower ranked matricesoften yield better estimates of the data than the original matrix [21]. The purpose ofthe matrix factorization is to generalize the data by extracting common user preferences,categorizing them to get an overview of the users. Due to the generalization, vast amountsof data can be represented as more intuitive information [10, 21].

12

Chapter 4ExperimentWe used the LensKit Recommender Toolkit 2.2.1, which is a recommender system evalu-ation framework [7], for our research. More detailed information about the purpose andusage of LensKit can be found at http://lenskit.org/. Our reasons for choosing LensKitRecommender Toolkit is the following:

• Relevant and interesting algorithms are already implemented and refers to valid sci-entific sources.

• Algorithm parameters are easily modifiable in order to alter algorithm performance.

• Intuitive and thorough documentation.

In order to represent our result data in graphs and tables we used MATLAB 8.4.0 and theR environment software because of the available types of graph plotting they contained.

We have conducted testing on the algorithms Item-based collaborative filtering andFunkSVD. Item-based collaborative filtering may be implemented in different ways, andthe implementation we chose to use is the algorithm built into Lenskit and proposed bySarwar et al. [20]. The reason being it appears to be a widely recognized implementation.FunkSVD is an algorithm implemented by Simon Funk, who placed third in the NetflixPrize. [10] Several algorithms have been implemented using matrix factorization, FunkSVDis one of them, but it has been studied and acknowledged.

4.1 Datasets

Testing has been conducted on two of the Movielens datasets provided by GroupLens1. TheMovielens datasets contain user ratings for movies. The Movielens datasets are differentlysized. Each dataset is independent from the other, the small dataset is not a subset ofthe bigger one [11]. The datasets used in this thesis are the 100k dataset released in 1988and the 1M dataset released in 2003 [11]. Both datasets contain users that have rated atleast 20 movies [11]. Parameter testing were conducted on the Movielens 100k dataset. Itconsists of 100.000 ratings from 943 users on 1682 movies with a density of 6.37% displayedin table 4.1. The parameters that gave best result on the Movielens 100k dataset werethen used to evaluate the algorithms on the 1M dataset. The 1M dataset consists of onemillion ratings from 6040 users on 3706 movies with a density of 4.47% displayed in table4.1. Both datasets are represented with a rating scale from 1 to 5 using a precision of onein the LensKit Tool Recommender.

The reason we chose these two specific datasets from Movielens is because they arewidely used in multiple reports. They also use conventional ratings which is preferred whenpredicting ratings. Both datasets use the same type of rating scale and precision. Sincethe RMSE and MAE values are depended on the rating scale, the results will be morecomparable. GroupLens provides multiple dataset sizes for research purposes but due tohardware performance issues this report is limited to evaluation on the 100k and the 1mdataset.

1http://grouplens.org

13

CHAPTER 4. EXPERIMENT

Table 4.1. Movielens dataset properties [11]

Name Users Movies Rating-Scale Ratings DensityML 100k 943 1682 1-5 100.000 6.30%ML 1m 6040 3706 1-5 1.000.209 4.47%

4.2 ParametersAlgorithms in LensKit has certain default values when unmodified. In order to optimize theprediction accuracy, we labored with the parameters and methods. In order to do this, wegathered information from papers [20] and methodically investigated the prediction perfor-mance of them by running iterative tests. The improvements achieved by the adjustmentare displayed in the results section, but the process of optimizing the algorithms are outlinedbelow.

4.2.1 Item-based Collaborative Filtering AdjustmentFrom the paper published by Sarwar et al. [20] we discovered that neighborhood size canbe an altering factor in the prediction accuracy of the Item-based collaborative filteringalgorithm. Therefore, we wanted to investigate which neighborhood size would optimizethe prediction results on our dataset and the LensKit implementation. We used the 100kdataset and cross-validated the algorithm with 10, 20, 30, .., 80 as neighborhood size input.The test results are displayed in the 5.1 results section.

The similarity computation of Item-based collaborative filtering may also affect pre-diction results, which was outlined in the Item-based collaborative filtering section in thebackground chapter of this paper. Therefore we conducted tests with the cosine-based simi-larity computation, adjusted cosine similarity computation and correlation-based similaritycomputation [20, 25]. The test results are displayed in the 5.1 results section.

4.2.2 FunkSVD AdjustmentThe iteration count in FunkSVD is an important parameter which drastically may affectprediction accuracy. We performed some testing on the iteration count 60, 80, 100, ..,300. In parallel to the iteration count testing, we also conducted tests on the optimalamount of latent factors (feature vectors) in order to determine when the two parameters incombination performed the best. For every iteration count magnitude, we tested the latentfactor count 10, 15, 20, .., 75. The results are displayed in the 5.2 results section.

4.3 The Scalability ExperimentWhen the algorithms have been optimized, the scalability may be measured. Both item-based collaborative filtering and FunkSVD will be cross-validated over the two datasets.The results will be analyzed, interpreted and compared with mean absolute error and rootmean square error. Furthermore the change factor in the results from the 100k datasetto the 1M dataset will be computed in order to determine the scalability. The results aredisplayed in the 5.2 results section.

14

Chapter 5ResultsThis chapter will outline the optimization and scalability experiment results by presentingthe average values of the cross-validations.

5.1 Item-based Collaborative Filtering OptimizationEvaluation values for each of the three item similarity methods were tested with differentneighborhood sizes. Each method was evaluated with both MAE and RMSE.

5.1.1 Cosine-based SimilarityResults from the running cosine-based similarity method with iterative neighborhood sizesare presented in figure 5.1. Both evaluation metrics indicated similar growth in ratingaccuracy when the neighborhood size is incremented. The lowest values for RMSE andMAE were obtained with different neighborhood sizes. RMSE indicates that neighborhoodsize 20 yields the best accuracy while MAE indicates that neighborhood size 10 yields thebest accuracy.

Figure 5.1. Cosine-based similarity. Iterating through different neighborhood sizeto find best possible accuracy.

5.1.2 Adjusted Cosine SimilarityResults from running adjusted cosine with different neighborhood sizes are presented infigure 5.2. The lowest value for adjusted cosine was achieved when neighborhood sizewere twenty with both RMSE and MAE. Both graphs grows similarly when reaching theirminimum point.

15

CHAPTER 5. RESULTS

Figure 5.2. Adjusted-cosine item similarity. Iterating through different neighbor-hood size to find best possible accuracy.

5.1.3 Correlation-based SimilarityResults from running Pearson correlation with different neighborhood sizes are presentedin figure 5.3. Both graphs resembles similar accuracy throughout the neighborhood sizes.The minimum value is computed for both RMSE and MAE when neighborhood size is 20.

Figure 5.3. Pearson correlation similarity. Iterating through different neighborhoodsize to find best possible accuracy.

16

CHAPTER 5. RESULTS

Table 5.1. The lowest values for each of the item similarity computations

Cosine Adjusted Cosine Pearson CorrelationNeighborhood size RMSE MAE RMSE MAE RMSE MAE

10 - 0.752503 - - - -20 1.076190 - 0.920979 0.717344 1.076190 0.855794

5.1.4 Optimized ParametersThe lowest RMSE and MAE values obtained from each of the different parameter combi-nations are presented in table 5.1. The table shows that the MAE and RMSE values are ata minimum when adjusted cosine similarity with neighborhood size twenty.

5.2 FunkSVD OptimizationThe lowest score for each iteration count and the corresponding latent factor is shown infigure 5.4. Latent factor 50 and iteration count 200 yielded the lowest RMSE and MAEscores. A summary of the data is presented in appendix B.

Figure 5.4. The lowest score for each iteration and corresponding latent factor

17

CHAPTER 5. RESULTS

5.3 Evaluation ResultsThe results from section 5.1 revealed which parameter combination that yields the MAEand RMSE values. Both algorithms were evaluated on the 100k dataset as well as the 1Mdataset. The distribution for each of the RMSE and MAE values after conducting cross-foldevaluation are presented as boxplots in appendix A. The average evaluation values for eachmetric are presented in table 5.2. The lowest values were obtained while cross-validatingFunkSVD over both datasets. The change factor from the 100k dataset to the 1M datasetare presented in table 5.3.

Table 5.2. Evaluations on optimized algorithms with 100k and 1m dataset.

100k 1mAlgorithm RMSE MAE RMSE MAEFunkSVD 0.905520 0.709531 0.863767 0.672917Item-based 0.920979 0.717344 0.885014 0.689355

Table 5.3. Improvement from 100k to 1m.

ImprovementAlgorithm RMSE MAEFunkSVD 4.83% 5.44%Item-based 4.06% 4.06%

18

Chapter 6DiscussionThe results indicate that the FunkSVD algorithm scales slightly better than the Item-based collaborative filtering algorithm when comparing the performance with regard to thedatasets. However, the actual reason for the difference is unclear, it could mean FunkSVDis superior to Item-based collaborative filtering when datasets grow larger. On the contrary,the difference in performance could be related to the properties of the datasets. The varyingdensities of the datasets could be a major factor, the relation between the amount of usersand items another, for example.

Assuming the algorithms work equally well no matter the dataset properties, we willnow outline what we can deduce from the test results. The assumption would indicate thatthe FunkSVD algorithm always predicts the ratings more accurately for larger datasetsthan the Item-based collaborative filtering algorithm. Further evaluation on larger datasetsis required in order verify the assumption. We may deduce that the improvement in per-formance for FunkSVD is higher both with respect to root mean square error and meanabsolute error when comparing the test results.

One matter to consider is what impact the prediction accuracy has on the actual recom-mendation. The predicted ratings of the FunkSVD algorithm are indeed closer to the realratings than the Item-based collaborative filtering algorithm, but it does not necessarily pro-pose the outcome of a recommendation to a user will be more accurate. The tests does notgive any clue as to which ratings have been most severely miscalculated, which makes theresults hard to interpret. For example in the 100k dataset cross-validation, the FunkSVDalgorithm has a RMSE of 0.90 and the Item-based collaborative filtering algorithm a RMSEof 0.92. The difference is significant, but it is hard to tell which of the predicted ratingswere evaluated differently between the algorithms. Suppose a recommendation is neededfor a user and both algorithms yield a set of suggestions. The suggestions could turn out tobe equally relevant, because it is unknown which ratings are causing the most errors. If thelow ratings distinguishes the algorithm errors, it is likely the two algorithms will performequally well in providing the user with a relevant recommendation. On the other hand,if the high ratings distinguishes the algorithm errors, the algorithm with the lowest errorwill likely provide more relevant recommendations. Thus, the results should be consideredas rough estimations, rather than definitive answers to what is truly optimal for an actualrecommender system.

Central to understanding the results is to take into consideration what role the densityof a dataset has. The results show that the FunkSVD algorithm scales better than theItem-based collaborative filtering algorithm when moving from a smaller dataset with higherdensity to a larger dataset with lower density. In order to determine whether the algorithmsscale equally during circumstances where the dataset properties does not change, furthertesting is required.

Similarly the relation between users and items may well affect the result. As earlier men-tioned, the Item-based approach tends to perform differently depending on the user-itemratio in a dataset. In effect, the scaling of the Item-based collaborative filtering algorithmmay be faulty, because the smaller dataset contains more items than users, and the largerdataset contains more users than items. One might suspect the Item-based collaborativefiltering algorithm would have done even worse if datasets with similar user-item ratio wouldhave been used for the testing.

19

CHAPTER 6. DISCUSSION

Another important matter to discuss is optimization. Our study does not suggest whichof the algorithms is actually the most effective on the larger dataset when optimized for itspecifically. However, for the purpose of this report, we do not believe such an experimentto be relevant. The goal is to provide a better perspective of how the algorithms in generalperform when a dataset grows. To find the optimized algorithm for any given dataset, it isprobably most accurate to test algorithms directly against the dataset. Hence, we do notbother to optimize the algorithms for the 1M dataset, since we strictly wanted to comparethe same specifications on two datasets with different sizes. However, further testing can beconducted to analyze how the algorithms should be altered in order to perform optimallyon datasets with varying properties.

20

Chapter 7ConclusionThe prediction accuracy of a recommender system is heavily dependent on various factors.The properties of a dataset plays a major role as well as the parameters of the algorithms.What can be seen is that when both algorithms are optimized for the 100k dataset, theFunkSVD algorithm performs better on the larger, less dense dataset than the Item-basedcollaborative filtering algorithm. To elicit more rigorous results, testing algorithms ondatasets with more similar properties to each other need to be performed. The fact thatthe Item-based collaborative filtering algorithm was given a convenient opportunity to scalewell due to the dataset properties and still got outperformed by FunkSVD is a strong indi-cator that the FunkSVD algorithm is more versatile. We can conclude that the FunkSVDalgorithm is seemingly more accurate than the Item-based collaborative filtering algorithmwhen it comes to scaling from small to large datasets.

AcknowledgementsWe would sincerely like to thank Atsuto Maki for supervising this project, greatly improvingthe style and structure of the report as well as assisting with ideas and thoughts.

21

References[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender sys-

tems: a survey of the state-of-the-art and possible extensions. IEEE Transactions onKnowledge and Data Engineering, 17(6):734–749, June 2005.

[2] G. Adomavicius and J. Zhang. Impact of data characteristics on recommender systemsperformance. ACM Trans. Manage. Inf. Syst., 3(1):3:1–3:17, Apr. 2012.

[3] I. C. G. Alejandro BellogÃn Kouki, Pablo Castells Azpilicueta. Recommender systemperformance evaluation and prediction: An information retrieval perspective. 2012.

[4] R. Burke. Hybrid recommender systems: Survey and experiments. User Modeling andUser-Adapted Interaction, 12(4):331–370, Nov. 2002.

[5] T. Chai and R. R. Draxler. Root mean square error (rmse) or mean absolute error(mae)? - arguments against avoiding rmse in the literature. 2014.

[6] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACMTrans. Inf. Syst., 22(1):143–177, Jan. 2004.

[7] M. D. Ekstrand, M. Ludwig, J. A. Konstan, and J. T. Riedl. Rethinking the recom-mender research ecosystem: Reproducibility, openness, and lenskit. In Proceedings ofthe Fifth ACM Conference on Recommender Systems, RecSys ’11, pages 133–140, NewYork, NY, USA, 2011. ACM.

[8] M. D. Ekstrand, J. T. Riedl, and J. A. Konstan. Collaborative filtering recommendersystems. Foundations and Trends in Human-Computer Interaction, 4(2):81–173, 2011.

[9] A. Felfernig, M. Jeran, G. Ninaus, F. Reinfrank, S. Reiterer, and M. Stettinger. Rec-ommendation Systems in Software Engineering, chapter Basic Approaches in Recom-mendation Systems, pages 15–37. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014.

[10] S. Funk. Netflix update: Try this at home, 2006.

[11] F. M. Harper and J. A. Konstan. The movielens datasets: History and context. ACMTrans. Interact. Intell. Syst., 5(4):19:1–19:19, Dec. 2015.

[12] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborativefiltering recommender systems. ACM Trans. Inf. Syst., 22(1):5–53, Jan. 2004.

[13] T. Iwata, K. Saito, and T. Yamada. Modeling user behavior in recommender systemsbased on maximum entropy. In Proceedings of the 16th International Conference onWorld Wide Web, WWW ’07, pages 1281–1282, New York, NY, USA, 2007. ACM.

[14] P. Lops, M. Gemmis, and G. Semeraro. Recommender Systems Handbook, chapterContent-based Recommender Systems: State of the Art and Trends, pages 73–105.Springer US, Boston, MA, 2011.

[15] L. Lü, M. Medo, C. H. Yeung, Y.-C. Zhang, Z.-K. Zhang, and T. Zhou. Recommendersystems. Physics Reports, 519(1):1 – 49, 2012. Recommender Systems.

[16] M. C. Pham, Y. Cao, R. Klamma, and M. Jarke. A clustering approach for collaborativefiltering recommendation using social network analysis. Journal of Universal ComputerScience, 17(4):583–604, feb 2011.

22

[17] A. Rajaraman and J. D. Ullman. Mining of Massive Datasets. Cambridge UniversityPress, New York, NY, USA, 2011.

[18] P. Refaeilzadeh, L. Tang, and H. Liu. Encyclopedia of Database Systems, chapterCross-Validation, pages 532–538. Springer US, Boston, MA, 2009.

[19] M. Robillard, R. Walker, and T. Zimmermann. Recommendation systems for softwareengineering. IEEE Software, 27(4):80–86, July 2010.

[20] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filteringrecommendation algorithms. In Proceedings of the 10th International Conference onWorld Wide Web, WWW ’01, pages 285–295, New York, NY, USA, 2001. ACM.

[21] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Incremental singular value decom-position algorithms for highly scalable recommender systems. In Fifth InternationalConference on Computer and Information Science, pages 27–28, 2002.

[22] B. M. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering, 2002.

[23] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen. The Adaptive Web: Methodsand Strategies of Web Personalization, chapter Collaborative Filtering RecommenderSystems, pages 291–324. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.

[24] G. Shani and A. Gunawardana. Evaluating recommender systems. Technical ReportMSR-TR-2009-159, November 2009.

[25] X. Su and M. T. Khoshgoftaar. A survey of collaborative filtering techniques, advancesin artificial intelligence. 2009.

[26] A. Tuzhilin and G. Adomavicius. Integrating user behavior and collaborative meth-ods in recommender systems. In CHI 99 Workshop. Interacting with RecommenderSystems, 1999.

[27] X. Yang, Y. Guo, Y. Liu, and H. Steck. A survey of collaborative filtering based socialrecommender systems. Computer Communications, 41:1 – 10, 2014.

[28] T. Yuan, J. Cheng, X. Zhang, S. Qiu, and H. Lu. Recommendation by mining multipleuser behaviors with group sparsity. 2014.

23

Appendix AEvaluation DataResults from evaluating the optimized algorithms on the 100k and 1m dataset. The plotsshow how the values are distributed before calculating the average performance.

Figure A.1. The RMSE values from each cross-fold evaluation on the 100k dataset

Figure A.2. The RMSE values from each cross-fold evaluation on the 1m dataset

24

APPENDIX A. EVALUATION DATA

Figure A.3. The MAE values from each cross-fold evaluation on the 100k dataset

Figure A.4. The MAE values from each cross-fold evaluation on the 1m dataset

25

Appendix BFunkSVD Parameter TestingAverage results from running evaluations on different parameter combinations containing.

Parameters RMSE.ByRating MAE.ByRatingFunkSVDIC60FC10 0.940418 0.739879FunkSVDIC60FC15 0.939640 0.739107FunkSVDIC60FC20 0.939169 0.738715FunkSVDIC60FC25 0.939057 0.738610FunkSVDIC60FC30 0.939308 0.738812FunkSVDIC60FC35 0.939863 0.739277FunkSVDIC60FC40 0.940555 0.739912FunkSVDIC60FC45 0.940608 0.740332FunkSVDIC60FC50 0.940182 0.740086FunkSVDIC60FC55 0.940521 0.740303FunkSVDIC60FC60 0.941430 0.741129FunkSVDIC60FC65 0.942264 0.741907FunkSVDIC60FC70 0.942765 0.742535FunkSVDIC60FC75 0.942976 0.742924

FunkSVDIC80FC10 0.938801 0.738877FunkSVDIC80FC15 0.938164 0.738282FunkSVDIC80FC20 0.938050 0.738172FunkSVDIC80FC25 0.938342 0.738410FunkSVDIC80FC30 0.939041 0.738935FunkSVDIC80FC35 0.939648 0.739597FunkSVDIC80FC40 0.938539 0.738971FunkSVDIC80FC45 0.939070 0.739525FunkSVDIC80FC50 0.939801 0.740264FunkSVDIC80FC55 0.940005 0.740602FunkSVDIC80FC60 0.939964 0.740867FunkSVDIC80FC65 0.939803 0.740993FunkSVDIC80FC70 0.939720 0.741139FunkSVDIC80FC75 0.939677 0.741213

FunkSVDIC100FC10 0.936488 0.737058FunkSVDIC100FC15 0.936398 0.736961FunkSVDIC100FC20 0.936525 0.737098FunkSVDIC100FC25 0.936752 0.737341FunkSVDIC100FC30 0.937540 0.738027FunkSVDIC100FC35 0.937773 0.738257FunkSVDIC100FC40 0.938139 0.738856FunkSVDIC100FC45 0.938307 0.739135FunkSVDIC100FC50 0.937082 0.738408FunkSVDIC100FC55 0.935003 0.737037FunkSVDIC100FC60 0.934816 0.737150FunkSVDIC100FC65 0.934623 0.737092FunkSVDIC100FC70 0.934553 0.737047

26

APPENDIX B. FUNKSVD PARAMETER TESTING

FunkSVDIC100FC75 0.934463 0.736962


FunkSVDIC140FC10 0.926911 0.728213FunkSVDIC140FC15 0.927124 0.728125FunkSVDIC140FC20 0.927541 0.728154FunkSVDIC140FC25 0.926747 0.727809FunkSVDIC140FC30 0.926916 0.728447FunkSVDIC140FC35 0.925386 0.728086FunkSVDIC140FC40 0.923706 0.726779FunkSVDIC140FC45 0.921922 0.725516FunkSVDIC140FC50 0.920540 0.724418FunkSVDIC140FC55 0.919502 0.723533FunkSVDIC140FC60 0.918742 0.722906FunkSVDIC140FC65 0.917889 0.722174FunkSVDIC140FC70 0.917211 0.721567FunkSVDIC140FC75 0.916635 0.721049FunkSVDIC140FC80 0.916168 0.720547FunkSVDIC140FC85 0.915526 0.719885FunkSVDIC140FC90 0.914794 0.719141FunkSVDIC140FC95 0.914316 0.718579FunkSVDIC140FC100 0.914277 0.718390FunkSVDIC140FC105 0.914643 0.718613

FunkSVDIC160FC10 0.921572 0.723172FunkSVDIC160FC15 0.921101 0.722168FunkSVDIC160FC20 0.920842 0.722063FunkSVDIC160FC25 0.920375 0.722120FunkSVDIC160FC30 0.919651 0.722210FunkSVDIC160FC35 0.918539 0.721725FunkSVDIC160FC40 0.915368 0.719686FunkSVDIC160FC45 0.912963 0.717654FunkSVDIC160FC50 0.911924 0.716676FunkSVDIC160FC55 0.911050 0.716013FunkSVDIC160FC60 0.910429 0.715560

27


FunkSVDIC160FC65 0.909965 0.715071FunkSVDIC160FC70 0.909973 0.714998FunkSVDIC160FC75 0.909941 0.714780FunkSVDIC160FC80 0.909944 0.714642FunkSVDIC160FC85 0.909951 0.714457




28


FunkSVDIC220FC75 0.915190 0.715378




FunkSVDIC300FC10 0.927199 0.722739FunkSVDIC300FC15 0.933433 0.727787FunkSVDIC300FC20 0.936392 0.729819

29


FunkSVDIC300FC25 0.940578 0.732279FunkSVDIC300FC30 0.942833 0.733320FunkSVDIC300FC35 0.941756 0.732432FunkSVDIC300FC40 0.941286 0.732240FunkSVDIC300FC45 0.940538 0.732974FunkSVDIC300FC50 0.943976 0.736105FunkSVDIC300FC55 0.948792 0.737838FunkSVDIC300FC60 0.946298 0.737685FunkSVDIC300FC65 0.948191 0.738852FunkSVDIC300FC70 0.950728 0.740835FunkSVDIC300FC75 0.952842 0.740651

30

www.kth.se

evaluating prediction accuracy for collaborative filtering ...927356/fulltext01.pdfthis report...

Documents