The MovieLens Datasets: History and Context

Max Harper (presenter), Joe Konstan

Upload: max-harper

Post on 28-Jan-2018


Page 1: The MovieLens Datasets: History and Context

The MovieLens Datasets:

History and Context

Max Harper (presenter)

Joe Konstan

Page 2: The MovieLens Datasets: History and Context

http://tiis.acm.org/iui16/

Page 3: The MovieLens Datasets: History and Context

MovieLens: 5 star movie ratings

web site: [image]

dataset:

userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940
...
138493,69644,3.0,1260209457
138493,70286,5.0,1258126944
138493,71619,2.5,1255811136
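As a rough illustration of how the rows above map onto the <user, movie, rating, timestamp> tuple, here is a minimal Python sketch. It assumes the timestamp column holds Unix epoch seconds read as UTC; the `Rating` type and `parse_ratings` helper are hypothetical names for this example, not part of any MovieLens tooling.

```python
import csv
import io
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    rated_at: datetime  # assumption: epoch seconds, interpreted as UTC


# A few rows copied from the slide, header included.
SAMPLE = """userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,151,4.0,1094785734
138493,70286,5.0,1258126944
"""


def parse_ratings(text: str) -> list[Rating]:
    """Parse ratings CSV text into typed records."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Rating(
            user_id=int(row["userId"]),
            movie_id=int(row["movieId"]),
            rating=float(row["rating"]),
            rated_at=datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
        )
        for row in reader
    ]


ratings = parse_ratings(SAMPLE)
for r in ratings:
    print(r.user_id, r.movie_id, r.rating, r.rated_at.isoformat())
```

Under that assumption, the first row decodes to a rating made on 2005-04-02. Turning `userId` and `movieId` into names would need lookup tables that are not shown on the slide.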

Page 4: The MovieLens Datasets: History and Context

ratings data is interesting, intuitive, and pervasive

Page 5: The MovieLens Datasets: History and Context

dataset impact

» 140,000 downloads in 2014

» a search for “movielens” yields

• 6,020 results in Google Books

• 8,920 results in Google Scholar

Page 6: The MovieLens Datasets: History and Context

dataset uses

» research

» technical: programming books + blogs

» educational (including a MOOC)

» industrial R&D, demos

Page 7: The MovieLens Datasets: History and Context

overview

» MovieLens datasets overview

» dataset stability, system change

Page 8: The MovieLens Datasets: History and Context

<user, movie, rating, timestamp>

Page 9: The MovieLens Datasets: History and Context

<user, movie, rating, timestamp>

<Max, Toy Story, 4.0, 2010-12-01 12:00:00>

Page 10: The MovieLens Datasets: History and Context

MovieLens benchmark datasets

| Name | Dates | Users | Movies | Ratings | Density |
|---|---|---|---|---|---|
| ML 100K | ’97 – ’98 | 943 | 1,682 | 100,000 | 6.30% |
| ML 1M | ’00 – ’03 | 6,040 | 3,706 | 1,000,209 | 4.47% |
| ML 10M | ’95 – ’09 | 69,878 | 10,681 | 10,000,054 | 1.34% |
| ML 20M | ’95 – ’15 | 138,493 | 27,278 | 20,000,263 | 0.54% |

designed for replicability
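The Density column is not defined on the slide. A minimal sketch, assuming density means ratings divided by (users × movies), reproduces the first three rows of the benchmark table; the `density` helper is a hypothetical name for this example.

```python
def density(users: int, movies: int, ratings: int) -> float:
    """Fraction of all user-movie pairs that carry a rating (assumed definition)."""
    return ratings / (users * movies)


# (users, movies, ratings) copied from the benchmark table.
BENCHMARKS = {
    "ML 100K": (943, 1_682, 100_000),
    "ML 1M": (6_040, 3_706, 1_000_209),
    "ML 10M": (69_878, 10_681, 10_000_054),
}

for name, (users, movies, ratings) in BENCHMARKS.items():
    print(f"{name}: {density(users, movies, ratings):.2%}")
```

With the listed counts this prints 6.30%, 4.47%, and 1.34%. The ML 20M row comes out near 0.53% rather than the 0.54% shown, so that figure may rest on a slightly different movie count.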

Page 11: The MovieLens Datasets: History and Context

MovieLens latest datasets

| Name | Dates | Users | Movies | Ratings | Density |
|---|---|---|---|---|---|
| ML Latest | ’95 – ’16 | 247,753 | 34,208 | 22,884,377 | 0.27% |
| ML Latest Small | ’96 – ’16 | 668 | 10,329 | 105,339 | 1.53% |

designed for recency

Page 12: The MovieLens Datasets: History and Context

overview

» MovieLens datasets overview

» dataset stability, system change

Page 13: The MovieLens Datasets: History and Context

tension: datasets vs. system

» ideal (pure) vs. actual (it’s complex)

» systems want to change

• stay current, constant improvements

• A/B tests, beta testing, and other experiments

» context changes

• devices, competing sites, changing user base

Page 19: The MovieLens Datasets: History and Context

some key changes

» core flow of browse/search

» rating widget

» recommender

» new user experience

» …

Page 20: The MovieLens Datasets: History and Context

history of experiments

» both online field experiments and online lab experiments

» created temporary and permanent changes, changed patterns of use

Page 22: The MovieLens Datasets: History and Context

in the paper

» the story of MovieLens (1997 origins)

• lessons learned from running a “real” system in a research lab

• lots of fun descriptive stats/charts

» best practices for dataset researchers

• limitations

• alternatives

Page 23: The MovieLens Datasets: History and Context

people who made this possible

» John Riedl

» Istvan Albert, Al Borchers, Dan Cosley, Brent J. Dahlen, Rich Davies, Michael Ekstrand, Dan Frankowski, Nathaniel Good, Jon Herlocker, Daniel Kluver, Shyong (Tony) Lam, Michael Ludwig, Sean McNee, Chad Salvatore, Shilad Sen, and Loren Terveen

» MovieLens users

Page 24: The MovieLens Datasets: History and Context

in ACM Transactions on Interactive Intelligent Systems, Dec. 2015

» feedback? contact us: [email protected]

presented by Max Harper, Research Scientist, University of Minnesota, [email protected]

written with Joe Konstan, Distinguished McKnight University Professor, University of Minnesota, [email protected]

This material is based on work supported by the National Science Foundation under grants DGE-9554517, IIS-9613960, IIS-9734442, IIS-9978717, EIA-9986042, IIS-0102229, IIS-0324851, IIS-0534420, IIS-0808692, IIS-0964695, IIS-0968483, IIS-1017697, IIS-1210863. This project was also supported by the University of Minnesota’s Undergraduate Research Opportunities Program and by grants and/or gifts from Net Perceptions, Inc., CFK Productions, and Google.

Page 26: The MovieLens Datasets: History and Context

[figure] version 0 (1997), version 4 (2014)

Page 27: The MovieLens Datasets: History and Context

one solution

» document change, include with datasets

Page 28: The MovieLens Datasets: History and Context

key dataset limitations (1/2)

» system UI and recommender changes

» bias towards “successful” users

» possible bias towards users with tolerance for “research quality” design

» timestamps do not reflect time of consumption
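The point about timestamps can be checked against the sample rows for userId 1 shown earlier in the deck. The sketch below groups that user's rating timestamps into bursts; the one-hour gap threshold and the `split_sessions` helper are hypothetical choices for illustration, not a method from the paper.

```python
# Rating timestamps for userId 1, copied from the sample CSV rows in the deck.
USER_1_TIMESTAMPS = [
    1112486027, 1112484676, 1112484819, 1112484727, 1112484580,
    1094785740, 1094785734, 1112485573, 1112484940,
]

SESSION_GAP_SECONDS = 3600  # hypothetical threshold: a longer gap starts a new burst


def split_sessions(timestamps: list[int], max_gap: int) -> list[list[int]]:
    """Group sorted timestamps into bursts separated by gaps larger than max_gap."""
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


sessions = split_sessions(USER_1_TIMESTAMPS, SESSION_GAP_SECONDS)
for session in sessions:
    print(len(session), "ratings in", session[-1] - session[0], "seconds")
```

For these nine rows the sketch reports 2 ratings in 6 seconds and 7 ratings in 1447 seconds. Nobody watches seven movies in 24 minutes, so the timestamps record when a rating was entered, not when the movie was seen.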

Page 29: The MovieLens Datasets: History and Context

key dataset limitations (2/2)

» recommender systems research community attitudes

• implicit behaviors > ratings?

• dataset-only research increasingly discouraged

Page 31: The MovieLens Datasets: History and Context

MovieLens system evolution

key changes and experiments

Page 32: The MovieLens Datasets: History and Context

alternative datasets

| Name | Domain | Rating Scale | Ratings | Density |
|---|---|---|---|---|
| Book-Crossing | books | 0 to 10 | 1.1m | 0.003% |
| EachMovie | movies | 0 to 14 | 2.7m | 2.872% |
| Jester (dataset1) | jokes | -10 to 10 | 4.1m | 57.463% |
| Amazon | many | 1 to 5 | 82.8m | < 0.001% |
| Netflix Prize | movies | 1 to 5 | 100.5m | 1.178% |
| Yahoo Music (C15) | music (various) | 0 to 100 | 262.8m | 0.042% |

Page 33: The MovieLens Datasets: History and Context

EachMovie

Page 34: The MovieLens Datasets: History and Context

lessons from running MovieLens

» lessons from startups apply (it’s hard, fail fast)

» continual work, not one-time effort

» encourage code quality through good social coding conventions

» invest in tools that allow users to help

Page 35: The MovieLens Datasets: History and Context

dataset uses

» recommender systems research

» recommender systems MOOC

• http://coursera.org/learn/recommender-systems

» code examples (popular press, blogs)

» higher education

» commercial – internal testing
