The MovieLens Datasets: History and Context

Max Harper (presenter)

Joe Konstan

http://tiis.acm.org/iui16/

MovieLens: 5 star movie ratings

userId,movieId,rating,timestamp

1,2,3.5,1112486027

1,29,3.5,1112484676

1,32,3.5,1112484819

1,47,3.5,1112484727

1,50,3.5,1112484580

1,112,3.5,1094785740

1,151,4.0,1094785734

1,223,4.0,1112485573

1,253,4.0,1112484940

138493,69644,3.0,1260209457

138493,70286,5.0,1258126944

138493,71619,2.5,1255811136
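The sample rows above can be read with a few lines of Python. This is a minimal sketch, not the project's own tooling: the `Rating` record and the `read_ratings` helper are hypothetical names introduced here, and the code assumes the timestamp column holds seconds since the Unix epoch in UTC.

```python
import csv
import io
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    """One row of the ratings file (hypothetical record type for this sketch)."""
    user_id: int
    movie_id: int
    rating: float
    rated_at: datetime


# A few rows copied from the sample above; the real file is much longer.
SAMPLE = """userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,151,4.0,1094785734
138493,71619,2.5,1255811136
"""


def read_ratings(handle):
    """Yield one Rating per CSV row, converting the epoch timestamp to UTC."""
    for row in csv.DictReader(handle):
        yield Rating(
            user_id=int(row["userId"]),
            movie_id=int(row["movieId"]),
            rating=float(row["rating"]),
            # Assumption: timestamps are seconds since 1970-01-01 UTC.
            rated_at=datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
        )


ratings = list(read_ratings(io.StringIO(SAMPLE)))
for r in ratings:
    print(r.user_id, r.movie_id, r.rating, r.rated_at.isoformat())
```

To read a downloaded file instead of the inline sample, pass an open file handle to `read_ratings` in place of the `io.StringIO` object.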

web site: dataset:

ratings data is interesting, intuitive, and pervasive

dataset impact

» 140,000 downloads in 2014

» a search for “movielens” yields

• 6,020 results in Google Books

• 8,920 results in Google Scholar

dataset uses

» research

» technical: programming books + blogs

» educational (including a MOOC)

» industrial R&D, demos

overview

» MovieLens datasets overview

» dataset stability, system change

<user, movie, rating, timestamp>

<Max, Toy Story, 4.0, 2010-12-01 12:00:00>

MovieLens benchmark datasets

Name     Dates      Users    Movies  Ratings     Density
ML 100K  ’97 – ’98  943      1,682   100,000     6.30%
ML 1M    ’00 – ’03  6,040    3,706   1,000,209   4.47%
ML 10M   ’95 – ’09  69,878   10,681  10,000,054  1.34%
ML 20M   ’95 – ’15  138,493  27,278  20,000,263  0.54%

designed for replicability
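The Density column is not defined on the slide. The sketch below assumes it means ratings divided by (users × movies), expressed as a percentage; `density_percent` is a hypothetical helper name. Under that assumption the first three rows reproduce the listed figures, while ML 20M comes out near 0.53% with the listed counts, slightly under the 0.54% shown, which may reflect a different movie count in the original calculation.

```python
# (name, users, movies, ratings) as listed in the benchmark table
BENCHMARKS = [
    ("ML 100K", 943, 1_682, 100_000),
    ("ML 1M", 6_040, 3_706, 1_000_209),
    ("ML 10M", 69_878, 10_681, 10_000_054),
    ("ML 20M", 138_493, 27_278, 20_000_263),
]


def density_percent(users: int, movies: int, ratings: int) -> float:
    """Share of the user-by-movie matrix that holds a rating, in percent.

    Assumption: density = ratings / (users * movies).
    """
    return 100.0 * ratings / (users * movies)


for name, users, movies, ratings in BENCHMARKS:
    print(f"{name:8s} {density_percent(users, movies, ratings):.2f}%")
```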

MovieLens latest datasets

Name             Dates      Users    Movies  Ratings     Density
ML Latest        ’95 – ’16  247,753  34,208  22,884,377  0.27%
ML Latest Small  ’96 – ’16  668      10,329  105,339     1.53%

designed for recency

overview

» MovieLens datasets overview

» dataset stability, system change

tension: datasets vs. system

» ideal (pure) vs. actual (it’s complex)

» systems want to change

• stay current, constant improvements

• A/B tests, beta testing, and other experiments

» context changes

• devices, competing sites, changing user base

some key changes

» core flow of browse/search

» rating widget

» recommender

» new user experience

» …

history of experiments

» both online field experiments and online lab experiments

» created temporary and permanent changes, changed pattern of use

in the paper

» the story of MovieLens (1997 origins)

• lessons learned from running a “real” system in a research lab

• lots of fun descriptive stats/charts

» best practices for dataset researchers

• limitations

• alternatives

people who made this possible

» John Riedl

» Istvan Albert, Al Borchers, Dan Cosley, Brent J. Dahlen, Rich Davies, Michael Ekstrand, Dan Frankowski, Nathaniel Good, Jon Herlocker, Daniel Kluver, Shyong (Tony) Lam, Michael Ludwig, Sean McNee, Chad Salvatore, Shilad Sen, and Loren Terveen

» MovieLens users

in ACM Transactions on Interactive Intelligent Systems, Dec. 2015

» feedback? contact us: grouplens-info@cs.umn.edu

presented by Max Harper, Research Scientist, University of Minnesota, harper@cs.umn.edu

written with Joe Konstan, Distinguished McKnight University Professor, University of Minnesota, konstan@cs.umn.edu

This material is based on work supported by the National Science Foundation under grants DGE-9554517, IIS-9613960, IIS-9734442, IIS-9978717, EIA-9986042, IIS-0102229, IIS-0324851, IIS-0534420, IIS-0808692, IIS-0964695, IIS-0968483, IIS-1017697, IIS-1210863. This project was also supported by the University of Minnesota’s Undergraduate Research Opportunities Program and by grants and/or gifts from Net Perceptions, Inc., CFK Productions, and Google.


[figure: MovieLens version 0 (1997) and version 4 (2014)]

one solution

» document change, include with datasets

key dataset limitations (1/2)

» system UI and recommender changes

» bias towards “successful” users

» possible bias towards users with tolerance for “research quality” design

» timestamps do not reflect time of consumption

key dataset limitations (2/2)

» recommender systems research community attitudes

• implicit behaviors > ratings?

• dataset-only research increasingly discouraged

MovieLens system evolution

key changes and experiments

alternative datasets

Name               Domain           Rating Scale  Ratings  Density
Book-Crossing      books            0 to 10       1.1m     0.003%
EachMovie          movies           0 to 14       2.7m     2.872%
Jester (dataset1)  jokes            -10 to 10     4.1m     57.463%
Amazon             many             1 to 5        82.8m    < 0.001%
Netflix Prize      movies           1 to 5        100.5m   1.178%
Yahoo Music (C15)  music (various)  0 to 100      262.8m   0.042%

EachMovie

lessons from running MovieLens

» lessons from startups apply (it’s hard, fail

» continual work, not one-time effort

» encourage code quality through good social coding conventions

» invest in tools that allow users to help

dataset uses

» recommender systems research

» recommender systems MOOC

• http://coursera.org/learn/recommender-systems

» code examples (popular press, blogs)

» higher education

» commercial – internal testing
