Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks
DESCRIPTION

Video available here: http://www.youtube.com/watch?v=1jHxGCl8RXc

Recommender systems research is often based on comparisons of predictive accuracy: the better the evaluation scores, the better the recommender. However, it is difficult to compare results from different recommender systems due to the many options in the design and implementation of an evaluation strategy. Additionally, algorithmic implementations can diverge from the standard formulation due to manual tuning and modifications that work better in some situations. In this work we compare common recommendation algorithms as implemented in three popular recommendation frameworks. To provide a fair comparison, we have complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics. We also include results obtained with the internal evaluation mechanisms of these frameworks. Our analysis points to large differences in recommendation accuracy across frameworks and strategies, i.e., the same baselines may perform orders of magnitude better or worse across frameworks. Our results show the necessity of clear guidelines when reporting the evaluation of recommender systems, to ensure reproducibility and comparability of results.

TRANSCRIPT
Slide 1
Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks
Alan Said (@alansaid), TU Delft
Alejandro Bellogín (@abellogin), Universidad Autónoma de Madrid
ACM RecSys 2014, Foster City, CA, USA
Slide 2
A RecSys paper outline:
– We have a new model – it’s great
– We used %DATASET% 100k to evaluate it
– It’s 10% better than our baseline
– It’s 12% better than [Authors, 2010]
Slide 3
Benchmarking
Slide 4
[Framework logos: LibRec, mrec, Python-recsys]
Slide 5
What are the differences? Some things just work differently:
• Data splitting
• Algorithm design (implementation; one concrete example is sketched below)
• Algorithm optimization
• Parameter values
• Evaluation
• Relevance/ranking
• Software architecture
• etc.
Different design choices!!
How do these choices affect evaluation results?
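To make one of these divergences concrete, below is a minimal sketch (not taken from any of the three frameworks) of a Pearson user similarity computed over co-rated items only. Implementations genuinely differ on the details marked in the comments: which mean is subtracted, how small an overlap is tolerated, and what value is returned when the correlation is undefined.

```java
import java.util.Map;

// Illustrative sketch only; not the implementation of any framework.
public final class PearsonExample {

    /** Pearson correlation over co-rated items, centered on co-rated means. */
    static double pearsonCoRated(Map<Long, Double> a, Map<Long, Double> b, int minOverlap) {
        double sumA = 0, sumB = 0;
        int n = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) { sumA += e.getValue(); sumB += rb; n++; }
        }
        // Design choice 1: below this overlap, return NaN, 0, or a damped value?
        if (n < minOverlap) return Double.NaN;
        // Design choice 2: center on co-rated means or on the users' global means?
        double meanA = sumA / n, meanB = sumB / n;
        double num = 0, denA = 0, denB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb == null) continue;
            double da = e.getValue() - meanA, db = rb - meanB;
            num += da * db;
            denA += da * da;
            denB += db * db;
        }
        // Design choice 3: constant ratings give a zero denominator; NaN or 0?
        return (denA == 0 || denB == 0) ? Double.NaN : num / Math.sqrt(denA * denB);
    }
}
```

Each of these choices is defensible in isolation, yet two "Pearson user-based kNN" recommenders built on different answers will produce different rankings.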
Slide 6
Evaluate evaluation:
• Comparison of frameworks
• Comparison of implementations
• Comparison of results
• Objective benchmarking
Slide 7
Algorithmic Implementation

| Type | Framework | Class | Similarity | Parameters |
|---|---|---|---|---|
| Item-based | LensKit | ItemItemScorer | CosineVectorSimilarity, PearsonCorrelation | |
| Item-based | Mahout | GenericItemBasedRecommender | UncenteredCosineSimilarity, PearsonCorrelationSimilarity | |
| Item-based | MyMediaLite | ItemKNN | Cosine, Pearson | |
| User-based | LensKit | UserUserItemScorer | CosineVectorSimilarity, PearsonCorrelation | SimpleNeighborhoodFinder, NeighborhoodSize |
| User-based | Mahout | GenericUserBasedRecommender | UncenteredCosineSimilarity, PearsonCorrelationSimilarity | NearestNUserNeighborhood, neighborhood size |
| User-based | MyMediaLite | UserKNN | Cosine, Pearson | neighborhood size |
| Matrix factorization | LensKit | FunkSVDItemScorer | | IterationsCountStoppingCondition, factors, iterations |
| Matrix factorization | Mahout | SVDRecommender | | FunkSVDFactorizer, factors, iterations |
| Matrix factorization | MyMediaLite | SVDPlusPlus | | factors, iterations |
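For orientation, here is a minimal sketch of wiring together the Mahout item-based row of this table with the Taste API. The class names come from the table; the input path and its userID,itemID,rating format are placeholder assumptions.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MahoutItemBasedExample {
    public static void main(String[] args) throws Exception {
        // "ratings.csv" is a placeholder: one "userID,itemID,rating" triple per line.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        ItemSimilarity similarity = new UncenteredCosineSimilarity(model);
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);
        // Top-10 recommendations for user 42.
        List<RecommendedItem> top10 = recommender.recommend(42L, 10);
        for (RecommendedItem item : top10) {
            System.out.println(item.getItemID() + "\t" + item.getValue());
        }
    }
}
```

Swapping in PearsonCorrelationSimilarity is a one-line change, yet the defaults each framework builds around these classes can move the scores far more than the similarity choice itself.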
Slide 8
There’s more than algorithms, though. There’s the data, the evaluation, and more.
Data splits:
• 80-20 cross-validation
• Random cross-validation
• User-based cross-validation
• Per-user splits (sketched below)
• Per-item splits
• Etc.

Evaluation:
• Metrics
• Relevance
• Strategies
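As promised above, a sketch of one of these variants: a per-user 80-20 split, where a fixed share of each user’s ratings is held out. The Rating record and method names are illustrative, not any framework’s API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative per-user holdout split; contrast with a single global
// random split, which can leave some users entirely out of training.
public final class PerUserSplit {
    record Rating(long user, long item, double value) {}

    static void split(List<Rating> ratings, double testRatio, long seed,
                      List<Rating> train, List<Rating> test) {
        Map<Long, List<Rating>> byUser = new HashMap<>();
        for (Rating r : ratings) {
            byUser.computeIfAbsent(r.user(), u -> new ArrayList<>()).add(r);
        }
        Random rnd = new Random(seed); // fixed seed: the split is reproducible
        for (List<Rating> userRatings : byUser.values()) {
            Collections.shuffle(userRatings, rnd);
            int testSize = (int) Math.round(userRatings.size() * testRatio);
            test.addAll(userRatings.subList(0, testSize));
            train.addAll(userRatings.subList(testSize, userRatings.size()));
        }
    }
}
```

The same dataset split globally instead of per-user yields a different test set, and therefore different scores, before any recommender has even run.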
Slide 9
Real-world examples
[Plot: MovieLens 1M results from Cremonesi et al., 2010]
[Plot: MovieLens 1M results from Yin et al., 2012]
Slide 10
Evaluation
[Pipeline diagram: Dataset → Training/Test split → Framework (Algorithm) → Evaluation → Results]
Slide 11
Internal Evaluation
[Same pipeline diagram; here the training/test split and the evaluation are performed by each framework’s own internal mechanisms]
Slide 12
Internal Evaluation Results

| Algorithm | Framework | nDCG |
|---|---|---|
| IB Cosine | Mahout | 0.00041478 |
| IB Cosine | LensKit | 0.94219205 |
| IB Pearson | Mahout | 0.00516923 |
| IB Pearson | LensKit | 0.92454613 |
| SVD50 | Mahout | 0.10542729 |
| SVD50 | LensKit | 0.94346409 |
| UB Cosine | Mahout | 0.16929545 |
| UB Cosine | LensKit | 0.94841356 |
| UB Pearson | Mahout | 0.16929545 |
| UB Pearson | LensKit | 0.94841356 |

| Algorithm | Framework | RMSE |
|---|---|---|
| IB Cosine | LensKit | 1.01390931 |
| IB Cosine | MyMediaLite | 0.92476162 |
| IB Pearson | LensKit | 1.05018614 |
| IB Pearson | MyMediaLite | 0.92933246 |
| SVD50 | LensKit | 1.01209290 |
| SVD50 | MyMediaLite | 0.93074012 |
| UB Cosine | LensKit | 1.02545490 |
| UB Cosine | MyMediaLite | 0.93419026 |
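Part of why internal evaluators disagree this dramatically is that “nDCG” is itself a family of metrics. The sketch below shows one common binary-relevance definition; it is illustrative, not the code of either framework. Evaluators differ in the gain function, the log base, and, above all, in which items get ranked at all, which alone can account for differences of this magnitude.

```java
import java.util.List;
import java.util.Set;

// One common nDCG@k: binary relevance, log2 discount, 1-based positions.
public final class NdcgExample {
    static double ndcgAtK(List<Long> ranking, Set<Long> relevant, int k) {
        double log2 = Math.log(2);
        double dcg = 0;
        for (int i = 0; i < Math.min(k, ranking.size()); i++) {
            if (relevant.contains(ranking.get(i))) {
                dcg += log2 / Math.log(i + 2); // gain 1 at position i + 1
            }
        }
        // Ideal DCG: every relevant item (capped at k) at the top of the list.
        double idcg = 0;
        for (int i = 0; i < Math.min(k, relevant.size()); i++) {
            idcg += log2 / Math.log(i + 2);
        }
        return idcg == 0 ? 0 : dcg / idcg;
    }
}
```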
Slide 13
We need a fair and common evaluation protocol!
Slide 14
Reproducible evaluation – benchmarking. Control all parts of the process:
- Data splitting strategy
- Recommendation (black box)
- Candidate item generation (which items to test)
- Evaluation

Split – select a strategy: by time, cross-validation, random, ratio
Recommend – select a framework (Apache Mahout, LensKit, MyMediaLite), select an algorithm, tune its settings
Candidate items – define a strategy: what is the ground truth, which users to evaluate, which items to evaluate
Evaluate – select error metrics (RMSE, MAE) and ranking metrics (nDCG, Precision/Recall, MAP)

Pipeline: Split → Recommend → Candidate items → Evaluate

http://rival.recommenders.net
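The protocol above can be read as four pluggable stages. The sketch below is a hypothetical rendering of that idea, not RiVal’s actual API: the point is that the split, the candidate set, and the metric are pinned down outside the recommender, so any framework can be dropped into the Recommender slot and scored under identical conditions.

```java
import java.util.List;
import java.util.Set;

// Hypothetical interfaces for the Split → Recommend → Candidate items →
// Evaluate pipeline; invented for illustration, not RiVal's API.
public final class ControlledPipeline {
    interface Recommender { List<Long> rank(long user, Set<Long> candidates); }   // black box per framework
    interface CandidateStrategy { Set<Long> candidatesFor(long user); }           // e.g. RPN, TrainItems, UserTest
    interface RankingMetric { double score(long user, List<Long> ranking); }      // e.g. nDCG@10, MAP

    /** Average a ranking metric over the test users under a fixed protocol. */
    static double averageScore(Set<Long> testUsers, Recommender recommender,
                               CandidateStrategy candidates, RankingMetric metric) {
        double sum = 0;
        for (long user : testUsers) {
            List<Long> ranking = recommender.rank(user, candidates.candidatesFor(user));
            sum += metric.score(user, ranking);
        }
        return sum / testUsers.size();
    }
}
```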
Slide 15
Controlled Evaluation
[Same pipeline diagram; here the data splitting, candidate items, and evaluation are controlled outside the frameworks, which only produce recommendations]
Slide 16
AN OBJECTIVE BENCHMARK
LensKit vs. Mahout vs. MyMediaLite, MovieLens 100k (additional datasets in the paper)
Slide 17
The Frameworks:
• AM: Apache Mahout
• LK: LensKit
• MML: MyMediaLite

The Candidate Items:
• RPN: Relevant + N [Koren, KDD 2008] (sketched below)
• TI: TrainItems
• UT: UserTest

Split Point:
• gl: Global
• pu: Per-user

Split Strategy:
• cv: 5-fold cross-validation
• rt: 80-20 random ratio

[Chart: nDCG@10 per algorithm]
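Of the three candidate strategies, RPN is the least obvious, so here is a minimal sketch of it under stated assumptions: each relevant test item is ranked against N items the user has never rated, following [Koren, KDD 2008]. Names and types are illustrative, not RiVal’s implementation.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Relevant + N: one relevant test item plus N sampled unrated items.
// Assumes allItems contains at least n items the user has not rated.
public final class RelevantPlusN {
    static Set<Long> candidates(long relevantItem, Set<Long> ratedByUser,
                                List<Long> allItems, int n, Random rnd) {
        Set<Long> result = new LinkedHashSet<>();
        result.add(relevantItem); // the single relevant item to be found
        while (result.size() < n + 1) {
            Long candidate = allItems.get(rnd.nextInt(allItems.size()));
            if (!ratedByUser.contains(candidate)) {
                result.add(candidate); // only items this user has never rated
            }
        }
        return result;
    }
}
```

The recommender then ranks these n + 1 items, and metrics such as recall@k reduce to asking where the single relevant item lands.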
Slide 18
User Coverage
Slide 19
Catalog Coverage
Slide 20
Time
Slide 21
Good accuracy?
Slide 22
Yes, at the cost of coverage
Slide 23
What’s the best result?
Slide 24
Difficult to say … it depends on what you’re evaluating!
Slide 25
In conclusion:
• Design choices matter!
  – Some more than others
• Evaluation needs to be documented
• Cross-framework comparison is not easy
  – You need to have control!
Slide 26
What have we learnt?
Slide 27
How did we do this? RiVal – an evaluation toolkit for RecSys:
• http://rival.recommenders.net
• http://github.com/recommenders/rival
• RiVal demo later today
• On Maven Central!
• RiVal was also used for this year’s RecSys Challenge
  – www.recsyschallenge.com
Slide 28
QUESTIONS? Thanks!
Special thanks:
• Zeno Gantner
• Michael Ekstrand
Slide 29
Image credits:
• https://www.flickr.com/photos/13698839@N00/3001363490/in/photostream/
• http://rick--hunter.deviantart.com/art/Unfair-scale-1-149667590