
Precision-oriented Evaluation of Recommender Systems: An Algorithmic Comparison

Alejandro Bellogín, Pablo Castells, Iván Cantador
Information Retrieval Group
Escuela Politécnica Superior, Universidad Autónoma de Madrid, Spain
{alejandro.bellogin, pablo.castells, ivan.cantador}@uam.es

5th ACM Conference on Recommender Systems (RecSys 2011)
Chicago, USA, 23-27 October 2011

Motivation

Evaluation of Recommender Systems is still an area of active research

Evaluation methodologies:

• Error-based (accuracy)

• Precision-oriented (ranking quality)

Growing realization that the quality of the ranking is more important than the accuracy of predicted rating values

Problem: difficult to compare results from different works

Precision-oriented metrics depend on:

• The number of relevant items

• The number of non-relevant items

Different assumptions about the non-relevant item set lead to biases in the measurements

Empirical comparison

Dataset: MovieLens 100K

• 5-fold cross-validation: 80% training / 20% test

Recommenders

• UB50: user-based recommender with 50 neighbors

• IB: item-based recommender using adjusted cosine similarity (see the sketch after this list)

• SVD: matrix factorization technique using 50 factors
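
For reference, a minimal sketch of the adjusted-cosine item similarity used by the item-based recommender above; this is an illustrative implementation (names and data layout are assumed), not the authors' code:

```python
import numpy as np

def adjusted_cosine(ratings, i, j):
    """Adjusted-cosine similarity between items i and j.

    ratings: users x items matrix with np.nan where a rating is missing.
    Each user's mean rating is subtracted before the cosine, which
    removes per-user rating bias (the 'adjusted' part).
    """
    both = ~np.isnan(ratings[:, i]) & ~np.isnan(ratings[:, j])
    if not both.any():
        return 0.0
    means = np.nanmean(ratings[both], axis=1)   # mean rating of each co-rating user
    di = ratings[both, i] - means
    dj = ratings[both, j] - means
    denom = np.linalg.norm(di) * np.linalg.norm(dj)
    return float(di @ dj / denom) if denom > 0 else 0.0
```

An item-based recommender then typically predicts r̂(u,i) as a similarity-weighted average of u's ratings over the items most similar to i.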

Metrics

• P@50: precision at 50

• Recall@50: recall at 50

• nDCG@50: normalized discounted cumulative gain at 50

• RMSE: root mean square error

References

(Bellogín et al. 2011) A. Bellogín, J. Wang, P. Castells. Text Retrieval Methods Applied to Ranking Items in Collaborative Filtering. In ECIR 2011.

(Cremonesi et al. 2010) P. Cremonesi, Y. Koren, R. Turrin. Performance of Recommender Algorithms on Top-N Recommendation Tasks. In RecSys 2010.

(Jambor & Wang 2010a) T. Jambor, J. Wang. Goal-driven Collaborative Filtering – A Directional Error Based Approach. In ECIR 2010.

(Jambor & Wang 2010b) T. Jambor, J. Wang. Optimizing Multiple Objectives in Collaborative Filtering. In RecSys 2010.

(Koren 2008) Y. Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. In KDD 2008.

Approach

A general methodology for evaluating ranked item lists: for each target user u, we select a set Lu of target items to rank

• For each item i in Lu, we request a rating prediction r̂(u,i)

• We sort the items by decreasing predicted rating value

Different methodologies used in the state-of-the-art

(Notation: Tr and Te denote training and test sets)

• TestRatings (TR): Lu = Teu. It needs a relevance threshold

• TestItems (TeI): Lu = ∪v Tev ∖ Tru

• TrainingItems (TrI): Lu = ∪v≠u Trv

• AllItems (AI): Lu = I ∖ Tru, where I is the set of all items

• One-Plus-Random (OPR): Lui = {i} ∪ NRu, for each i ∈ HRu ⊆ Teu, with |NRu| = 1000 (NRu: items not rated by u; HRu: highly relevant test items of u)
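
A minimal sketch of how these candidate sets could be built, assuming ratings are stored as per-user dictionaries train[u] and test[u] mapping item ids to rating values; names and data layout are assumptions, not the authors' implementation (which is linked at the bottom of the poster):

```python
import random

def candidate_items(u, train, test, all_items, methodology, n_random=1000):
    """Build the target item set L_u for user u under a given methodology.

    train, test: dicts user -> {item: rating}; all_items: set of every item id.
    """
    tr_u = set(train.get(u, {}))
    te_u = set(test.get(u, {}))
    if methodology == "TestRatings":       # TR: the user's own test items
        return te_u
    if methodology == "TestItems":         # TeI: items with any test rating, minus u's training items
        return {i for v in test for i in test[v]} - tr_u
    if methodology == "TrainingItems":     # TrI: items rated in training by users other than u
        return {i for v in train if v != u for i in train[v]}
    if methodology == "AllItems":          # AI: every item except those u rated in training
        return all_items - tr_u
    if methodology == "OnePlusRandom":     # OPR: one test item plus ~1000 items unrated by u
        unrated = list(all_items - tr_u - te_u)
        # The poster restricts i to HR_u (highly relevant test items); all of Te_u is used here.
        return {i: {i} | set(random.sample(unrated, min(n_random, len(unrated))))
                for i in te_u}
    raise ValueError(f"unknown methodology: {methodology}")
```

Under TrI and AI the candidate list covers nearly the whole item catalogue, which explains the much lower absolute precision values discussed below.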

Discussion

• Comparative results with precision metrics are not the same as with error metrics (IB is better than UB in RMSE, but not in precision)

• The TestRatings methodology only evaluates recommendations over items with known relevance, which is an unrealistic situation

• TestRatings’ ranked list consists of top-rated items, which may or may not be related to the recommendations the user would get in a real application

• The absolute performance values obtained by each methodology are very different

• TestItems obtains higher performance values than TrainingItems, since non-relevant items for each user are omitted from the candidate list

• TrainingItems and AllItems are, as expected, completely equivalent

• The five methodologies are consistent across the two datasets, even though the test size per user is different in each setting

(Figure caption) Solid triangle represents the target user; boxed ratings denote the test set.

Methodology        Reference(s)
TestRatings        (Jambor & Wang 2010a), (Jambor & Wang 2010b)
TestItems          (Bellogín et al. 2011)
One-Plus-Random    (Cremonesi et al. 2010), (Koren 2008)

Conclusions

Four out of five methodologies are consistent with each other

The other methodology (TestRatings) has proved to overestimate performance values

No direct equivalence was found between results with error-based and precision-based metrics

The range of performance values depends on the methodology

Future Work

Online experiment with real users’ feedback

Evaluate other metrics

• From IR: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR)

• From RS: Normalized Distance-based Performance Measure (NDPM), ROC curve

Alternative training / test generation, e.g., a temporal split (see the sketch below)
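
For the temporal split mentioned above, a minimal sketch assuming each rating is a (user, item, rating, timestamp) tuple (an illustrative layout, not the authors' data format):

```python
def temporal_split(ratings, train_fraction=0.8):
    """Split (user, item, rating, timestamp) tuples chronologically:
    the oldest train_fraction of the ratings form the training set,
    the most recent ones form the test set."""
    ordered = sorted(ratings, key=lambda r: r[3])   # sort by timestamp
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```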

Check the source code for the different methodologies: http://ir.ii.uam.es/evaluation/rs


Metric definitions:

P@50 = \frac{1}{|U|} \sum_{u \in U} \frac{|\mathrm{Rel}_u@50|}{50}

Recall@50 = \frac{1}{|U|} \sum_{u \in U} \frac{|\mathrm{Rel}_u@50|}{|\mathrm{Rel}_u|}

nDCG@50 = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\mathrm{IDCG}_u@50} \sum_{i=1}^{50} \frac{2^{\mathrm{rel}_u(i)} - 1}{\log_2(1 + i)}

RMSE = \sqrt{\frac{1}{|Te|} \sum_{(u,i) \in Te} \left( \hat{r}(u,i) - r(u,i) \right)^2}

where Rel_u@50 is the set of relevant items of user u ranked in the top 50, Rel_u is the full set of relevant test items of u, rel_u(i) is the graded relevance of the item ranked at position i, and IDCG_u@50 is the ideal DCG@50 for u.
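
A minimal sketch of how these metrics could be computed for a single user's ranked list, assuming graded relevance is taken from the test ratings with a relevance threshold (e.g., 3 or 4, as in the TR 3 / TR 4 variants); function names are illustrative:

```python
import math

def ranking_metrics_at_k(ranked_items, test_ratings, k=50, threshold=4):
    """P@k, Recall@k and nDCG@k for one user's ranked list.

    ranked_items: item ids sorted by decreasing predicted rating.
    test_ratings: dict item -> true rating in this user's test set.
    An item counts as relevant if its test rating >= threshold;
    the graded gain for nDCG is taken to be the raw test rating.
    The poster reports these values averaged over all users U.
    """
    relevant = {i for i, r in test_ratings.items() if r >= threshold}
    top_k = ranked_items[:k]
    hits = sum(1 for i in top_k if i in relevant)
    p_at_k = hits / k
    recall_at_k = hits / len(relevant) if relevant else 0.0

    # DCG@k with gains 2^rel - 1 and a log2(1 + rank) discount (rank is 1-based)
    dcg = sum((2 ** test_ratings.get(i, 0) - 1) / math.log2(rank + 1)
              for rank, i in enumerate(top_k, start=1))
    ideal_gains = sorted(test_ratings.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(ideal_gains, start=1))
    ndcg_at_k = dcg / idcg if idcg > 0 else 0.0
    return p_at_k, recall_at_k, ndcg_at_k

def rmse(predicted, test):
    """predicted, test: dicts (user, item) -> rating over the same test pairs."""
    se = [(predicted[ui] - test[ui]) ** 2 for ui in test]
    return math.sqrt(sum(se) / len(se))
```

P@50 and Recall@50 only require the binary relevance judgement, while nDCG@50 additionally uses the rating value as a graded gain.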

[Figures: bar charts of RMSE per algorithm (SVD, IB, UB) and of P@50, Recall@50, nDCG@50 per methodology (TR with relevance thresholds 3 and 4, TeI, TrI, AI, OPR) for SVD50, IB, and UB50, shown for two experimental settings; one of the settings uses 2 splits with 10 ratings per user in the test set.]