Offline Evaluation of Recommender Systems: All pain and no gain?

Mark Levy, Mendeley

DESCRIPTION

Keynote for the workshop on Reproducibility and Replication in Recommender Systems at ACM RecSys, Hong Kong, 12 October 2013.

TRANSCRIPT

Page 1

Offline Evaluation of Recommender Systems

All pain and no gain?

Mark Levy, Mendeley

Page 2

About me

Page 3

About me

Page 4

Some things I built

Page 5

Something I'm building

Page 6

What is a good recommendation?

Page 7

What is a good recommendation?

One that increases the usefulness of your product in the long run [1]

[1] WARNING: hard to measure directly

Page 8

What is a good recommendation?

● One that increased your bottom line:

– User bought item after it was recommended

– User clicked ad after it was shown

– User didn't skip track when it was played

– User added document to library...

– User connected with contact...

Page 9

Why was it good?

Page 10

Why was it good?

● Maybe it was

– Relevant

– Novel

– Familiar

– Serendipitous

– Well explained

● Note: some of these are mutually incompatible

Page 11

What is a bad recommendation?

Page 12

What is a bad recommendation?

(you know one when you see one)

Page 13

What is a bad recommendation?

Page 14

What is a bad recommendation?

Page 15

What is a bad recommendation?

Page 16

What is a bad recommendation?

● Maybe it was

– Not relevant

– Too obscure

– Too familiar

– I already have it

– I already know that I don't like it

– Badly explained

Page 17

What's the cost of getting it wrong?

● Depends on your product and your users

– Lost revenue

– Less engaged user

– Angry user

– Amused user

– Confused user

– User defects to a rival product

Page 18

Hypotheses

Good offline metrics express product goals

Most (really) bad recommendations can be caught by business logic

Page 19

Issues

● Real business goals concern long-term user behaviour e.g. Netflix

“we have reformulated the recommendation problem to the question of optimizing the probability a member chooses to watch a title and enjoys it enough to come back to the service”

● Usually have to settle for short-term surrogate

● Only some user behaviour is visible

● Same constraints when collecting training data

Page 20

Least bad solution?

● “Back to the future” aka historical log analysis

● Decide which logged event(s) indicate success

● Be honest about “success”

● Usually care most about precision @ small k (see the sketch after this list)

● Recall will discriminate once this plateaus

● Expect to have to do online testing too
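As a concrete illustration of the precision and recall bullets above, here is a minimal sketch of precision@k and recall@k against a held-out "success" set; the data layout (a ranked list of item ids, a set of held-out ids) and the function names are illustrative assumptions, not taken from the talk.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the held-out successes."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / float(k)

def recall_at_k(recommended, relevant, k):
    """Fraction of the held-out successes that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / float(len(relevant))

# toy example: one user's ranked recommendations vs. logged successes
recommended = ['a', 'b', 'c', 'd', 'e']
relevant = {'b', 'e', 'x'}
print(precision_at_k(recommended, relevant, 3))  # 1 hit in the top 3 -> 0.33
print(recall_at_k(recommended, relevant, 3))     # 1 of 3 held-out items -> 0.33
```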

Page 21

Making metrics meaningful

● Building a test framework + data is hard

● Be sure to get best value from your work

● Don't use straw man baselines

● Be realistic – leave the ivory tower

● Make test setups and baselines reproducible

Page 22

Making metrics meaningful

● Old skool k-NN systems are better than you think

– Input numbers from mining logs

– Temporal “modelling” (e.g. fake users)

– Data pruning (scalability, popularity bias, quality)

– Preprocessing (tf-idf, log/sqrt, …)

– Hand-crafted similarity metric (see the sketch after this list)

– Hand-crafted aggregation formula

– Postprocessing (popularity matching)

– Diversification

– Attention profile
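To make the list above concrete, here is a minimal sketch of an item-based k-NN recommender with a log preprocessing step and a cosine similarity, in the spirit of the components listed. The numpy/scipy implementation and all names are illustrative assumptions, not the speaker's production code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def item_knn(counts, k=50):
    """counts: sparse user x item matrix of raw interaction counts.
    Returns a dense item x item similarity matrix keeping only the
    top-k cosine neighbours of each item."""
    X = counts.copy().astype(float)
    X.data = np.log1p(X.data)          # preprocessing: dampen heavy users/items
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=0)).ravel())
    norms[norms == 0] = 1.0
    sims = np.asarray(X.T.dot(X).todense()) / np.outer(norms, norms)  # cosine similarity
    np.fill_diagonal(sims, 0.0)
    for i in range(sims.shape[0]):     # keep only the k nearest neighbours per item
        sims[i, np.argsort(sims[i])[:-k]] = 0.0
    return sims

# toy usage: 3 users x 4 items
counts = csr_matrix(np.array([[3, 0, 1, 0],
                              [2, 1, 0, 0],
                              [0, 4, 0, 2]]))
print(item_knn(counts, k=2))
```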

Page 23

Making metrics meaningful

● Measure preference honestly

● Predicted items may not be “correct” just because they were consumed once

● Try to capture value

– Earlier recommendation may be better

– Don't need a recommender to suggest items by same artist/author

● Don't neglect side data

– At least use it for evaluation / sanity checking
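One cheap sanity check with side data, along the lines of the last two bullets: measure how many recommendations are by an artist or author the user already has, since a trivial rule could have produced those. A sketch under assumed data structures (all names here are hypothetical):

```python
def same_artist_rate(recs_by_user, history_by_user, artist_of):
    """Fraction of recommended items whose artist already appears in the
    user's history -- suggestions a trivial same-artist rule could have made."""
    trivial = total = 0
    for user, recs in recs_by_user.items():
        known_artists = {artist_of[i] for i in history_by_user.get(user, [])}
        for item in recs:
            total += 1
            if artist_of.get(item) in known_artists:
                trivial += 1
    return trivial / float(total) if total else 0.0

# toy usage
artist_of = {'t1': 'radiohead', 't2': 'radiohead', 't3': 'bjork'}
recs = {'u1': ['t2', 't3']}
history = {'u1': ['t1']}
print(same_artist_rate(recs, history, artist_of))  # 0.5: one rec is same-artist
```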

Page 24

Making metrics meaningful

● Public data isn't enough for reproducibility or fair comparison

● Need to document preprocessing

● Better:

Release your preparation/evaluation code too

Page 25

What's the cost of poor evaluation?

Page 26

What's the cost of poor evaluation?

Poor offline evaluation can lead to years of misdirected research

Page 27

Ex 1: Reduce playlist skips

● Reorder a playlist of tracks to reduce skips by avoiding “genre whiplash”

● Use an audio similarity measure to compute transition distances, then order the tracks with a travelling salesman solver

● Metric: sum of transition distances (lower is better)

● 6 months work to develop solution

Page 28

Ex 1: Reduce playlist skips

● Result: users skipped more often

● Why?

Page 29

Ex 1: Reduce playlist skips

● Result: users skipped more often

● When a user skipped a track they didn't like, they were played something else just like it

● Better metric: average position of skipped tracks (based on logs, lower down is better)
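A minimal sketch of that improved metric, assuming each logged session is a list of (position, skipped) pairs; the session format is an assumption for illustration.

```python
def mean_skip_position(sessions):
    """sessions: one list of (position, skipped) pairs per logged session.
    Returns the average playlist position of skipped tracks: a higher average
    means skips happen further down the list, which is better."""
    positions = [pos for session in sessions
                 for pos, skipped in session if skipped]
    return sum(positions) / float(len(positions)) if positions else float('nan')

# toy usage: two logged sessions
sessions = [[(1, False), (2, True), (3, False)],
            [(1, True), (2, False), (3, True)]]
print(mean_skip_position(sessions))  # (2 + 1 + 3) / 3 = 2.0
```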

Page 30

Ex 2: Recommend movies

● Use a corpus of star ratings to improve movie recommendations

● Learn to predict ratings for un-rated movies

● Metric: average RMSE of predictions for a hidden test set (lower is better)

● 2+ years work to develop new algorithms

Page 31

Ex 2: Recommend movies

● Result: “best” solutions were never deployed

● Why?

Page 32

Ex 2: Recommend movies

● Result: “best” solutions were never deployed

● User behaviour correlates with rank, not RMSE

● Side datasets are an order of magnitude more valuable than algorithm improvements

● Explicit ratings are the exception not the rule

● RMSE still haunts research labs

Page 33

Can contests help?

● Good:

– Great for consistent evaluation

● Not so good:

– Privacy concerns mean obfuscated data

– No guarantee that metrics are meaningful

– No guarantee that train/test framework is valid

– Small datasets can become overexposed

Page 34

Ex 3: Yahoo! Music KDD Cup

● Largest music rating dataset ever released

● Realistic “loved songs” classification task

● Data fully obfuscated due to recent lawsuits

Page 35

Ex 3: Yahoo! Music KDD Cup

● Result: researchers hated it

● Why?

Page 36

Ex 3: Yahoo! Music KDD Cup

● Result: researchers hated it

● Research frontier focussed on audio content and metadata, not joinable to obfuscated ratings

Page 37

Ex 4: Million Song Challenge

● Large music dataset with rich metadata

● Anonymized listening histories

● Simple item recommendation task

● Reasonable MAP@500 metric (see the sketch after this list)

● Aimed to solve shortcomings of KDD Cup

● Only obfuscation was removal of timestamps
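For reference, MAP@k as used here is just truncated average precision, averaged over users. A minimal sketch using one common normalisation convention (the exact convention used in the challenge is not spelled out on the slide):

```python
def average_precision_at_k(recommended, relevant, k):
    """Average of precision@i over the ranks i at which a relevant item appears,
    truncated at k and normalised by min(len(relevant), k)."""
    score = hits = 0.0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1.0)
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(recs_by_user, relevant_by_user, k=500):
    """Mean of the per-user truncated average precision."""
    aps = [average_precision_at_k(recs_by_user[u], relevant_by_user[u], k)
           for u in relevant_by_user]
    return sum(aps) / float(len(aps)) if aps else 0.0

# toy usage
print(average_precision_at_k(['a', 'b', 'c'], {'b', 'c'}, k=3))  # (1/2 + 2/3) / 2
```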

Page 38

Ex 4: Million Song Challenge

● Result: winning entry didn't use side data

● Why?

Page 39

Ex 4: Million Song Challenge

● Result: winning entry didn't use side data

● No timestamps so test tracks chosen at random

● So “people who listen to A also listen to B”

● Traditional item similarity solves this well

● More honesty about “success” might have shown that contest data was flawed

Page 40

Ex 5: Yelp RecSys Challenge

● Small business review dataset with side data

● Realistic mix of input data types

● Rating prediction task

● Informal procedure to create train/test sets

Page 41

Ex 5: Yelp RecSys Challenge

● Result: baseline algorithms high up the leaderboard

● Why?

Page 42

Ex 5: Yelp RecSys Challenge

● Result: baseline algorithms high up the leaderboard

● Train/test split was corrupt (see the sketch after this list)

● Competition organisers moved fast to fix this

● But that left only one week before the deadline
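A split problem like this is cheap to catch before anyone models anything: check whether any test interaction also appears in the training set. A minimal sketch, assuming interactions are (user, item) pairs:

```python
def check_split(train_pairs, test_pairs):
    """Flag (user, item) interactions that leak from test back into train."""
    train = set(train_pairs)
    leaked = [pair for pair in test_pairs if pair in train]
    if leaked:
        print('LEAKAGE: %d of %d test interactions also in train, e.g. %s'
              % (len(leaked), len(test_pairs), leaked[:5]))
    return leaked

# toy usage
train = [('u1', 'i1'), ('u1', 'i2'), ('u2', 'i3')]
test = [('u1', 'i2'), ('u2', 'i4')]
check_split(train, test)  # reports the ('u1', 'i2') leak
```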

Page 43

Ex 6: MIREX Audio Chord Estimation

● Small dataset of audio tracks

● Task to label with predicted chord symbols

● Human labelled data hard to come by

● Contest hosted by premier forum in field

● Evaluate frame-level prediction accuracy (see the sketch after this list)

● Historical glass ceiling around 80%
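Frame-level accuracy here is simply the fraction of audio frames whose predicted chord label matches the annotation; a minimal sketch (the label format is an assumption):

```python
def frame_accuracy(predicted, reference):
    """predicted, reference: equal-length sequences of per-frame chord labels."""
    assert len(predicted) == len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / float(len(reference))

# toy usage: 3 of 4 frames correct
print(frame_accuracy(['C', 'C', 'G', 'Am'], ['C', 'F', 'G', 'Am']))  # 0.75
```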

Page 44

Ex 6: MIREX Audio Chord Estimation

● Result: 2011 winner ftw

● Why?

Page 45

Ex 6: MIREX Audio Chord Estimation

● Result: 2011 winner ftw

● Spoof entry relying on known test set

● Protest against inadequate test data

● Other research showed weak generalisation of winning algorithms from the same contest

● The next year, results dropped significantly

Page 46

So why evaluate offline at all?

● Building a test framework ensures clear goals

● Avoid wishful thinking if your data is too thin

● Be efficient with precious online testing

– Cut down huge parameter space

– Don't alienate users

● Need to publish

● Pursuing science as well as profit

Page 47

Online evaluation is tricky too

● No off-the-shelf solution for services

● Many statistical gotchas (see the sketch after this list)

● Same mismatch between short-term and long-term success criteria

● Results open to interpretation by management

● Can make incremental improvements look good when radical innovation is needed
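One of those gotchas is simply deciding whether an observed lift in a success rate is more than noise. A minimal sketch of a two-proportion z-test for an A/B split, using only the standard library; the numbers are made-up toy values.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic for the difference between two success rates
    (e.g. click-through rate in an A/B test)."""
    p_a = successes_a / float(n_a)
    p_b = successes_b / float(n_b)
    pooled = (successes_a + successes_b) / float(n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_b - p_a) / se

# toy usage: control vs. new recommender
z = two_proportion_z(successes_a=420, n_a=10000, successes_b=465, n_b=10000)
print(z)  # |z| > 1.96 would be significant at the usual 5% level
```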

Page 48

Ex 7: Article Recommendations

● Recommender for related research articles

● Massive download logs available

● Framework developed based on co-downloads

● Aim to improve on existing search solution

● Management “keen for it to work”

● Several weeks of live A/B testing available

● No offline evaluation

Page 49

Ex 7: Article Recommendations

● Result: worse than similar title search

● Why?

Page 50

Ex 7: Article Recommendations

● Result: worse than similar title search

● Inadequate business rules e.g. often suggesting other articles from same publication

● Users identified only by organisational IP range so value of “big data” very limited

● Establishing an offline evaluation protocol would have shown these in advance

Page 51

Isn't there software for that?

Rules of the game:

– Model fit metrics (e.g. validation loss) don't count

– Need a transparent “audit trail” of data to support genuine reproducibility

– Just using public datasets doesn't ensure this

Page 52

Isn't there software for that?

Wish list for reproducible evaluation:

– Integrate with recommender implementations

– Handle data formats and preprocessing

– Handle splitting, cross-validation, side datasets

– Save everything to file (see the sketch after this list)

– Work from file inputs so not tied to one framework

– Generate meaningful metrics

– Well documented and easy to use
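A minimal sketch of the "save everything to file" item from the wish list: write the split, the preprocessing settings and the resulting metrics side by side under a hash of the inputs, so any reported number can be traced back to its data. The layout and names are illustrative, not those of any existing tool.

```python
import hashlib, json, os

def save_run(out_dir, config, train_pairs, test_pairs, metrics):
    """Write split + config + metrics to one file so the result is auditable."""
    os.makedirs(out_dir, exist_ok=True)
    blob = json.dumps({'config': config, 'train': train_pairs, 'test': test_pairs},
                      sort_keys=True)
    run_id = hashlib.sha1(blob.encode('utf-8')).hexdigest()[:12]
    with open(os.path.join(out_dir, 'run_%s.json' % run_id), 'w') as f:
        json.dump({'config': config,
                   'train': train_pairs,
                   'test': test_pairs,
                   'metrics': metrics}, f, indent=2, sort_keys=True)
    return run_id

# toy usage
run_id = save_run('runs',
                  config={'preprocessing': 'log1p', 'k': 50},
                  train_pairs=[['u1', 'i1'], ['u2', 'i3']],
                  test_pairs=[['u1', 'i2']],
                  metrics={'precision@5': 0.21})
print('saved run', run_id)
```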

Page 53

Isn't there software for that?

Current offerings:

● GraphChi/GraphLab

● Mahout

● LensKit

● MyMediaLite

Page 54

Isn't there software for that?

Current offerings:

● GraphChi/GraphLab

– Model validation loss, doesn't count

● Mahout

– Only rating prediction accuracy, doesn't count

● LensKit

– Too hard to understand, won't use

Page 55

Isn't there software for that?

Current offerings:

● MyMediaLite

– Reports meaningful metrics

– Handles cross-validation

– Data splitting not transparent

– No support for pre-processing

– No built in support for standalone evaluation

– API is capable but current utils don't meet wishlist

Page 56

Eating your own dog food

● Built a small framework around new algorithm

● https://github.com/mendeley/mrec

– Reports meaningful metrics

– Handles cross-validation

– Supports simple pre-processing

– Writes everything to file for reproducibility

– Provides API and utility scripts

– Runs standalone evaluations

– Readable Python code

Page 57

Eating your own dog food

● Some lessons learned

– Usable frameworks are hard to write

– Tradeoff between clarity and scalability

– Should generate explicit validation sets

● Please contribute!

● Or use as inspiration to improve existing tools

Page 58

Where next?

● Shift evaluation online:

– Contests based around online evaluation

– Realistic but not reproducible

– Could some run continuously?

● Recommender Systems as a commodity:

– Software and services reaching maturity now

– Business users can tune/evaluate themselves

– Is there a way to report results?

Page 59

Where next?

● Support alternative query paradigms:

– More like this, less like that

– Metrics for dynamic/online recommenders

● Support recommendation with side data:

– LibFM, GenSGD, WARP research @google, …

– Open datasets?

Page 60

Thanks for listening

[email protected]

@gamboviol

https://github.com/gamboviol

https://github.com/mendeley/mrec