Offline Evaluation of Recommender Systems: All pain and no gain?

Mark Levy, Mendeley

DESCRIPTION

Keynote for the workshop on Reproducibility and Replication in Recommender Systems at ACM RecSys, Hong Kong, 12 October 2013.

TRANSCRIPT

Page 1

Offline Evaluation of Recommender Systems

All pain and no gain?

Mark Levy, Mendeley

Page 2

About me

Page 3

About me

Page 4

Some things I built

Page 5

Something I'm building

Page 6

What is a good recommendation?

Page 7

What is a good recommendation?

One that increases the usefulness of your product in the long run [1]

[1] WARNING: hard to measure directly

Page 8

What is a good recommendation?

● One that increased your bottom line:

– User bought item after it was recommended

– User clicked ad after it was shown

– User didn't skip track when it was played

– User added document to library...

– User connected with contact...

Page 9

Why was it good?

Page 10

Why was it good?

● Maybe it was

– Relevant

– Novel

– Familiar

– Serendipitous

– Well explained

● Note: some of these are mutually incompatible

Page 11

What is a bad recommendation?

Page 12

What is a bad recommendation?

(you know one when you see one)

Page 13

What is a bad recommendation?

Page 14

What is a bad recommendation?

Page 15

What is a bad recommendation?

Page 16

What is a bad recommendation?

● Maybe it was

– Not relevant

– Too obscure

– Too familiar

– I already have it

– I already know that I don't like it

– Badly explained

Page 17

What's the cost of getting it wrong?

● Depends on your product and your users

– Lost revenue

– Less engaged user

– Angry user

– Amused user

– Confused user

– User defects to a rival product

Page 18

Hypotheses

Good offline metrics express product goals

Most (really) bad recommendations can be caught by business logic

Page 19

Issues

● Real business goals concern long-term user behaviour e.g. Netflix

“we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service”

● Usually have to settle for short-term surrogate

● Only some user behaviour is visible

● Same constraints when collecting training data

Page 20

Least bad solution?

● “Back to the future” aka historical log analysis

● Decide which logged event(s) indicate success

● Be honest about “success”

● Usually care most about precision @ small k (see the sketch after this list)

● Recall will discriminate once this plateaus

● Expect to have to do online testing too
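As a concrete illustration of the precision and recall bullets above, here is a minimal sketch of precision@k and recall@k against a held-out "success" set; the data layout (a ranked list of item ids, a set of held-out ids) and the function names are illustrative assumptions, not taken from the talk.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the held-out successes."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / float(k)

def recall_at_k(recommended, relevant, k):
    """Fraction of the held-out successes that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / float(len(relevant))

# toy example: one user's ranked recommendations vs. logged successes
recommended = ['a', 'b', 'c', 'd', 'e']
relevant = {'b', 'e', 'x'}
print(precision_at_k(recommended, relevant, 3))  # 1 hit in the top 3 -> 0.33
print(recall_at_k(recommended, relevant, 3))     # 1 of 3 held-out items -> 0.33
```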

Page 21

Making metrics meaningful

● Building a test framework + data is hard

● Be sure to get best value from your work

● Don't use straw man baselines

● Be realistic – leave the ivory tower

● Make test setups and baselines reproducible

Page 22

Making metrics meaningful

● Old skool k-NN systems are better than you think

– Input numbers from mining logs

– Temporal “modelling” (e.g. fake users)

– Data pruning (scalability, popularity bias, quality)

– Preprocessing (tf-idf, log/sqrt, …)

– Hand-crafted similarity metric (see the sketch after this list)

– Hand-crafted aggregation formula

– Postprocessing (popularity matching)

– Diversification

– Attention profile
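To make the list above concrete, here is a minimal sketch of an item-based k-NN recommender with a log preprocessing step and a cosine similarity, in the spirit of the components listed. The numpy/scipy implementation and all names are illustrative assumptions, not the speaker's production code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def item_knn(counts, k=50):
    """counts: sparse user x item matrix of raw interaction counts.
    Returns a dense item x item similarity matrix keeping only the
    top-k cosine neighbours of each item."""
    X = counts.copy().astype(float)
    X.data = np.log1p(X.data)          # preprocessing: dampen heavy users/items
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=0)).ravel())
    norms[norms == 0] = 1.0
    sims = np.asarray(X.T.dot(X).todense()) / np.outer(norms, norms)  # cosine similarity
    np.fill_diagonal(sims, 0.0)
    for i in range(sims.shape[0]):     # keep only the k nearest neighbours per item
        sims[i, np.argsort(sims[i])[:-k]] = 0.0
    return sims

# toy usage: 3 users x 4 items
counts = csr_matrix(np.array([[3, 0, 1, 0],
                              [2, 1, 0, 0],
                              [0, 4, 0, 2]]))
print(item_knn(counts, k=2))
```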

Page 23

Making metrics meaningful

● Measure preference honestly

● Predicted items may not be “correct” just because they were consumed once

● Try to capture value

– Earlier recommendation may be better

– Don't need a recommender to suggest items by same artist/author

● Don't neglect side data

– At least use it for evaluation / sanity checking
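One cheap sanity check with side data, along the lines of the last two bullets: measure how many recommendations are by an artist or author the user already has, since a trivial rule could have produced those. A sketch under assumed data structures (all names here are hypothetical):

```python
def same_artist_rate(recs_by_user, history_by_user, artist_of):
    """Fraction of recommended items whose artist already appears in the
    user's history -- suggestions a trivial same-artist rule could have made."""
    trivial = total = 0
    for user, recs in recs_by_user.items():
        known_artists = {artist_of[i] for i in history_by_user.get(user, [])}
        for item in recs:
            total += 1
            if artist_of.get(item) in known_artists:
                trivial += 1
    return trivial / float(total) if total else 0.0

# toy usage
artist_of = {'t1': 'radiohead', 't2': 'radiohead', 't3': 'bjork'}
recs = {'u1': ['t2', 't3']}
history = {'u1': ['t1']}
print(same_artist_rate(recs, history, artist_of))  # 0.5: one rec is same-artist
```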

Page 24

Making metrics meaningful

● Public data isn't enough for reproducibility or fair comparison

● Need to document preprocessing

● Better:

Release your preparation/evaluation code too

Page 25

What's the cost of poor evaluation?

Page 26

What's the cost of poor evaluation?

Poor offline evaluation can lead to years of misdirected research

Page 27

Ex 1: Reduce playlist skips

● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”

● Use an audio similarity measure to compute transition distances, then order the tracks with a travelling salesman solver

● Metric: sum of transition distances (lower is better)

● 6 months work to develop solution

Page 28

Ex 1: Reduce playlist skips

● Result: users skipped more often

● Why?

Page 29

Ex 1: Reduce playlist skips

● Result: users skipped more often

● When a user skipped a track they didn't like, they were played something else just like it

● Better metric: average position of skipped tracks (based on logs, lower down is better)
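A minimal sketch of that improved metric, assuming each logged session is a list of (position, skipped) pairs; the session format is an assumption for illustration.

```python
def mean_skip_position(sessions):
    """sessions: one list of (position, skipped) pairs per logged session.
    Returns the average playlist position of skipped tracks: a higher average
    means skips happen further down the list, which is better."""
    positions = [pos for session in sessions
                 for pos, skipped in session if skipped]
    return sum(positions) / float(len(positions)) if positions else float('nan')

# toy usage: two logged sessions
sessions = [[(1, False), (2, True), (3, False)],
            [(1, True), (2, False), (3, True)]]
print(mean_skip_position(sessions))  # (2 + 1 + 3) / 3 = 2.0
```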

Page 30

Ex 2: Recommend movies

● Use a corpus of star ratings to improve movie recommendations

● Learn to predict ratings for un-rated movies

● Metric: average RMSE of predictions for a hidden test set (lower is better)

● 2+ years work to develop new algorithms

Page 31

Ex 2: Recommend movies

● Result: “best” solutions were never deployed

● Why?

Page 32

Ex 2: Recommend movies

● Result: “best” solutions were never deployed

● User behaviour correlates with rank, not RMSE

● Side datasets are an order of magnitude more valuable than algorithm improvements

● Explicit ratings are the exception not the rule

● RMSE still haunts research labs

Page 33

Can contests help?

● Good:

– Great for consistent evaluation

● Not so good:

– Privacy concerns mean obfuscated data

– No guarantee that metrics are meaningful

– No guarantee that train/test framework is valid

– Small datasets can become overexposed

Page 34

Ex 3: Yahoo! Music KDD Cup

● Largest music rating dataset ever released

● Realistic “loved songs” classification task

● Data fully obfuscated due to recent lawsuits

Page 35

Ex 3: Yahoo! Music KDD Cup

● Result: researchers hated it

● Why?

Page 36

Ex 3: Yahoo! Music KDD Cup

● Result: researchers hated it

● Research frontier focussed on audio content and metadata, not joinable to obfuscated ratings

Page 37

Ex 4: Million Song Challenge

● Large music dataset with rich metadata

● Anonymized listening histories

● Simple item recommendation task

● Reasonable MAP@500 metric (see the sketch after this list)

● Aimed to solve shortcomings of KDD Cup

● Only obfuscation was removal of timestamps
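For reference, MAP@k as used here is just truncated average precision, averaged over users. A minimal sketch using one common normalisation convention (the exact convention used in the challenge is not spelled out on the slide):

```python
def average_precision_at_k(recommended, relevant, k):
    """Average of precision@i over the ranks i at which a relevant item appears,
    truncated at k and normalised by min(len(relevant), k)."""
    score = hits = 0.0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1.0)
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(recs_by_user, relevant_by_user, k=500):
    """Mean of the per-user truncated average precision."""
    aps = [average_precision_at_k(recs_by_user[u], relevant_by_user[u], k)
           for u in relevant_by_user]
    return sum(aps) / float(len(aps)) if aps else 0.0

# toy usage
print(average_precision_at_k(['a', 'b', 'c'], {'b', 'c'}, k=3))  # (1/2 + 2/3) / 2
```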

Page 38

Ex 4: Million Song Challenge

● Result: winning entry didn't use side data

● Why?

Page 39

Ex 4: Million Song Challenge

● Result: winning entry didn't use side data

● No timestamps so test tracks chosen at random

● So “people who listen to A also listen to B”

● Traditional item similarity solves this well

● More honesty about “success” might have shown that contest data was flawed

Page 40

Ex 5: Yelp RecSys Challenge

● Small business review dataset with side data

● Realistic mix of input data types

● Rating prediction task

● Informal procedure to create train/test sets

Page 41

Ex 5: Yelp RecSys Challenge

● Result: baseline algorithms high up the leaderboard

● Why?

Page 42

Ex 5: Yelp RecSys Challenge

● Result: baseline algorithms high up the leaderboard

● Train/test split was corrupt (see the sketch after this list)

● Competition organisers moved fast to fix this

● But that left only one week before the deadline
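A split problem like this is cheap to catch before anyone models anything: check whether any test interaction also appears in the training set. A minimal sketch, assuming interactions are (user, item) pairs:

```python
def check_split(train_pairs, test_pairs):
    """Flag (user, item) interactions that leak from test back into train."""
    train = set(train_pairs)
    leaked = [pair for pair in test_pairs if pair in train]
    if leaked:
        print('LEAKAGE: %d of %d test interactions also in train, e.g. %s'
              % (len(leaked), len(test_pairs), leaked[:5]))
    return leaked

# toy usage
train = [('u1', 'i1'), ('u1', 'i2'), ('u2', 'i3')]
test = [('u1', 'i2'), ('u2', 'i4')]
check_split(train, test)  # reports the ('u1', 'i2') leak
```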

Page 43

Ex 6: MIREX Audio Chord Estimation

● Small dataset of audio tracks

● Task to label with predicted chord symbols

● Human labelled data hard to come by

● Contest hosted by premier forum in field

● Evaluate frame-level prediction accuracy (see the sketch after this list)

● Historical glass ceiling around 80%
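Frame-level accuracy here is simply the fraction of audio frames whose predicted chord label matches the annotation; a minimal sketch (the label format is an assumption):

```python
def frame_accuracy(predicted, reference):
    """predicted, reference: equal-length sequences of per-frame chord labels."""
    assert len(predicted) == len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / float(len(reference))

# toy usage: 3 of 4 frames correct
print(frame_accuracy(['C', 'C', 'G', 'Am'], ['C', 'F', 'G', 'Am']))  # 0.75
```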

Page 44

Ex 6: MIREX Audio Chord Estimation

● Result: 2011 winner ftw

● Why?

Page 45

Ex 6: MIREX Audio Chord Estimation

● Result: 2011 winner ftw

● Spoof entry relying on known test set

● Protest against inadequate test data

● Other research showed weak generalisation of winning algorithms from the same contest

● The next year, results dropped significantly

Page 46

So why evaluate offline at all?

● Building a test framework ensures clear goals

● Avoid wishful thinking if your data is too thin

● Be efficient with precious online testing

– Cut down huge parameter space

– Don't alienate users

● Need to publish

● Pursuing science as well as profit

Page 47

Online evaluation is tricky too

● No off-the-shelf solution for services

● Many statistical gotchas (see the sketch after this list)

● Same mismatch between short-term and long-term success criteria

● Results open to interpretation by management

● Can make incremental improvements look good when radical innovation is needed
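One of those gotchas is simply deciding whether an observed lift in a success rate is more than noise. A minimal sketch of a two-proportion z-test for an A/B split, using only the standard library; the numbers are made-up toy values.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic for the difference between two success rates
    (e.g. click-through rate in an A/B test)."""
    p_a = successes_a / float(n_a)
    p_b = successes_b / float(n_b)
    pooled = (successes_a + successes_b) / float(n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_b - p_a) / se

# toy usage: control vs. new recommender
z = two_proportion_z(successes_a=420, n_a=10000, successes_b=465, n_b=10000)
print(z)  # |z| > 1.96 would be significant at the usual 5% level
```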

Page 48

Ex 7: Article Recommendations

● Recommender for related research articles

● Massive download logs available

● Framework developed based on co-downloads

● Aim to improve on existing search solution

● Management “keen for it to work”

● Several weeks of live A/B testing available

● No offline evaluation

Page 49

Ex 7: Article Recommendations

● Result: worse than similar title search

● Why?

Page 50

Ex 7: Article Recommendations

● Result: worse than similar title search

● Inadequate business rules e.g. often suggesting other articles from same publication

● Users identified only by organisational IP range so value of “big data” very limited

● Establishing an offline evaluation protocol would have shown these in advance

Page 51

Isn't there software for that?

Rules of the game:

– Model fit metrics (e.g. validation loss) don't count

– Need a transparent “audit trail” of data to support genuine reproducibility

– Just using public datasets doesn't ensure this

Page 52

Isn't there software for that?

Wish list for reproducible evaluation:

– Integrate with recommender implementations

– Handle data formats and preprocessing

– Handle splitting, cross-validation, side datasets

– Save everything to file (see the sketch after this list)

– Work from file inputs so not tied to one framework

– Generate meaningful metrics

– Well documented and easy to use
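A minimal sketch of the "save everything to file" item from the wish list: write the split, the preprocessing settings and the resulting metrics side by side under a hash of the inputs, so any reported number can be traced back to its data. The layout and names are illustrative, not those of any existing tool.

```python
import hashlib, json, os

def save_run(out_dir, config, train_pairs, test_pairs, metrics):
    """Write split + config + metrics to one file so the result is auditable."""
    os.makedirs(out_dir, exist_ok=True)
    blob = json.dumps({'config': config, 'train': train_pairs, 'test': test_pairs},
                      sort_keys=True)
    run_id = hashlib.sha1(blob.encode('utf-8')).hexdigest()[:12]
    with open(os.path.join(out_dir, 'run_%s.json' % run_id), 'w') as f:
        json.dump({'config': config,
                   'train': train_pairs,
                   'test': test_pairs,
                   'metrics': metrics}, f, indent=2, sort_keys=True)
    return run_id

# toy usage
run_id = save_run('runs',
                  config={'preprocessing': 'log1p', 'k': 50},
                  train_pairs=[['u1', 'i1'], ['u2', 'i3']],
                  test_pairs=[['u1', 'i2']],
                  metrics={'precision@5': 0.21})
print('saved run', run_id)
```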

Page 53

Isn't there software for that?

Current offerings:

● GraphChi/GraphLab

● Mahout

● LensKit

● MyMediaLite

Page 54

Isn't there software for that?

Current offerings:

● GraphChi/GraphLab

– Model validation loss, doesn't count

● Mahout

– Only rating prediction accuracy, doesn't count

● LensKit

– Too hard to understand, won't use

Page 55

Isn't there software for that?

Current offerings:

● MyMediaLite

– Reports meaningful metrics

– Handles cross-validation

– Data splitting not transparent

– No support for pre-processing

– No built in support for standalone evaluation

– API is capable but current utils don't meet wishlist

Page 56

Eating your own dog food

● Built a small framework around new algorithm

● https://github.com/mendeley/mrec

– Reports meaningful metrics

– Handles cross-validation

– Supports simple pre-processing

– Writes everything to file for reproducibility

– Provides API and utility scripts

– Runs standalone evaluations

– Readable Python code

Page 57

Eating your own dog food

● Some lessons learned

– Usable frameworks are hard to write

– Tradeoff between clarity and scalability

– Should generate explicit validation sets

● Please contribute!

● Or use as inspiration to improve existing tools

Page 58

Where next?

● Shift evaluation online:

– Contests based around online evaluation

– Realistic but not reproducible

– Could some run continuously?

● Recommender Systems as a commodity:

– Software and services reaching maturity now

– Business users can tune/evaluate themselves

– Is there a way to report results?

Page 59

Where next?

● Support alternative query paradigms:

– More like this, less like that

– Metrics for dynamic/online recommenders

● Support recommendation with side data:

– LibFM, GenSGD, WARP research @google, …

– Open datasets?

Page 60

Thanks for listening

[email protected]

@gamboviol

https://github.com/gamboviol

https://github.com/mendeley/mrec