Matrix Factorization - Bamshad Mobasher, DePaul University

TRANSCRIPT

  • Slide 1
  • Matrix Factorization, Bamshad Mobasher, DePaul University
  • Slide 2
  • The $1 Million Question
  • Slide 3
  • Ratings Data (figure: a sparse matrix of 1-5 ratings, 480,000 users × 17,700 movies)
  • Slide 4
  • Training Data: 100 million ratings (the matrix is 99% sparse). Each rating is a tuple [user, movie-id, time-stamp, rating value]. Ratings were generated by users between Oct 1998 and Dec 2005; users were randomly chosen among the set with at least 20 ratings, and small perturbations were added to help with anonymity.
  • Slide 5
  • Ratings Data (figure: the same 480,000-user × 17,700-movie matrix, with the most recent ratings withheld as the test data set, shown as ?)
  • Slide 6
  • Scoring: minimize root mean square error, $\text{RMSE} = \sqrt{\frac{1}{|R|} \sum_{(u,i) \in R} (r_{ui} - \hat{r}_{ui})^2}$. RMSE does not necessarily correlate well with user satisfaction, but it is a widely used, well-understood quantitative measure. RMSE baseline scores on the test data: 1.054, just predict the mean user rating for each movie; 0.953, Netflix's own system (Cinematch) as of 2006; 0.941, a nearest-neighbor method using correlation; 0.857, the 10% reduction required to win the $1 million.
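The RMSE measure above can be sketched in a few lines. This is a generic illustration with made-up toy numbers, not Netflix data:

```python
import math

# Root mean square error over the set R of known (user, item) ratings.
def rmse(ratings, predictions):
    n = len(ratings)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(ratings, predictions)) / n)

# The "just predict the mean rating" baseline idea, on toy data:
observed = [1.0, 3.0, 4.0, 5.0, 3.0]
mean = sum(observed) / len(observed)
print(rmse(observed, [mean] * len(observed)))
```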
  • Slide 7
  • Matrix Factorization of Ratings Data: based on the idea of latent factor analysis, identify latent (unobserved) factors that explain observations in the data. In this case, the observations are user ratings of movies; the factors may represent combinations of features or characteristics of movies and users that result in the ratings. The $m$-user × $n$-movie ratings matrix is approximated by a product of two rank-$f$ factor matrices, $R \approx P Q^T$ ($P$ is $m \times f$, $Q$ is $n \times f$), so that $\hat{r}_{ui} \approx q_i^T p_u$.
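As a toy illustration of the latent-factor idea: with $f = 2$ factors, each user and each movie gets a factor vector, and a predicted rating is their dot product. All names and numbers below are invented for this sketch:

```python
# Hypothetical 2-factor vectors (entirely made-up values):
p = {"alice": [1.2, 0.3], "bob": [0.2, 1.1]}       # user factors (rows of P)
q = {"matrix": [1.0, 0.1], "titanic": [0.1, 1.0]}  # movie factors (rows of Q)

# Predicted rating is the dot product q_i . p_u
def predict(user, movie):
    return sum(qk * pk for qk, pk in zip(q[movie], p[user]))

print(predict("alice", "matrix"))   # high: user and movie load on the same factor
print(predict("alice", "titanic"))  # low: the factors barely overlap
```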
  • Slide 8
  • Matrix Factorization of Ratings Data. Figure from Koren, Bell, Volinsky, IEEE Computer, 2009
  • Slide 9
  • Matrix Factorization of Ratings Data. Credit: Alex Lin, Intelligent Mining
  • Slide 10
  • Predictions as Filling Missing Data. Credit: Alex Lin, Intelligent Mining
  • Slide 11
  • Learning Factor Matrices: we need to learn the feature vectors from the training data, e.g., a user feature vector $(a, b, c)$ and an item feature vector $(x, y, z)$. Approach: minimize the errors on the known ratings. Credit: Alex Lin, Intelligent Mining
  • Slide 12
  • Learning Factor Matrices: with $\hat{r}_{ui} \approx q_i^T p_u$, solve $\min_{q,p} \sum_{(u,i) \in R} (r_{ui} - q_i^T p_u)^2$. Add regularization: $\min_{q,p} \sum_{(u,i) \in R} (r_{ui} - q_i^T p_u)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2)$
  • Slide 13
  • Stochastic Gradient Descent (SGD): minimize $\sum_{(u,i) \in R} (r_{ui} - q_i^T p_u)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2)$, i.e., goodness of fit plus regularization. Online (stochastic) gradient update equations: $e_{ui} = r_{ui} - q_i^T p_u$, then $q_i \leftarrow q_i + \gamma (e_{ui} p_u - \lambda q_i)$ and $p_u \leftarrow p_u + \gamma (e_{ui} q_i - \lambda p_u)$
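The SGD update equations translate directly into a training loop. A minimal sketch, assuming made-up hyperparameters (learning rate `gamma`, regularization `lam`) and dense 0-based integer user/item ids:

```python
import random

def sgd_mf(ratings, n_users, n_items, f=2, gamma=0.01, lam=0.05, epochs=100, seed=0):
    """Fit P (user factors) and Q (item factors) by SGD on known ratings.

    ratings: list of (u, i, r_ui) triples with 0-based integer ids.
    """
    rng = random.Random(seed)
    P = [[rng.gauss(0.0, 0.1) for _ in range(f)] for _ in range(n_users)]
    Q = [[rng.gauss(0.0, 0.1) for _ in range(f)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - sum(Q[i][k] * P[u][k] for k in range(f))  # e_ui
            for k in range(f):
                qik, puk = Q[i][k], P[u][k]
                Q[i][k] += gamma * (e * puk - lam * qik)      # q_i update
                P[u][k] += gamma * (e * qik - lam * puk)      # p_u update
    return P, Q
```

The Netflix-scale version would also update the bias terms introduced on later slides and sweep over 100 million ratings.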
  • Slide 14
  • Components of a Rating Predictor: user bias, movie bias, and the user-movie interaction. The user-movie interaction characterizes the matching between users and movies, attracts most research in the field, and benefits from algorithmic and mathematical innovations. The baseline predictor separates users and movies, is often overlooked, benefits from insights into users' behavior, and is among the main practical contributions of the competition. Credit: Yehuda Koren, Google, Inc.
  • Slide 15
  • Modeling Systematic Biases: $\hat{r}_{ui} \approx \mu + b_u + b_i + q_i^T p_u$, where $\mu$ is the overall mean rating, $b_u$ the bias of user $u$, $b_i$ the bias of movie $i$, and $q_i^T p_u$ the user-movie interaction. Example: the mean rating is $\mu = 3.7$; you are a critical reviewer whose ratings are 1 lower than the mean, so $b_u = -1$; Star Wars gets a mean rating 0.5 higher than the average movie, so $b_i = +0.5$; the predicted rating for you on Star Wars is $3.7 - 1 + 0.5 = 3.2$. Credit: Padhraic Smyth, University of California, Irvine
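The worked example above, in code form (numbers taken straight from the example; the interaction term is taken as zero since no factor vectors are given):

```python
mu = 3.7    # overall mean rating
b_u = -1.0  # critical reviewer: rates 1 below the mean
b_i = 0.5   # Star Wars: rated 0.5 above the average movie

# Predicted rating = mu + b_u + b_i + q_i^T p_u (interaction assumed 0 here)
prediction = mu + b_u + b_i
print(prediction)
```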
  • Slide 16
  • Objective Function: $\min_{q,p,b} \sum_{(u,i) \in R} \left( r_{ui} - (\mu + b_u + b_i + q_i^T p_u) \right)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2 + b_u^2 + b_i^2)$, i.e., goodness of fit plus regularization, with $\lambda$ typically selected via grid search on a validation set. Credit: Padhraic Smyth, University of California, Irvine
  • Slide 17
  • Figure from Koren, Bell, Volinsky, IEEE Computer, 2009
  • Slide 18
  • (figure)
  • Slide 19
  • Explanation for the increase?
  • Slide 20
  • Adding Time Effects: replace $\hat{r}_{ui} \approx \mu + b_u + b_i + \text{user-movie interactions}$ with $\hat{r}_{ui} \approx \mu + b_u(t) + b_i(t) + \text{user-movie interactions}$, i.e., add time dependence to the biases. The time dependence is parametrized by linear trends, binning, and other methods. For details see Y. Koren, "Collaborative Filtering with Temporal Dynamics," ACM SIGKDD Conference, 2009. Credit: Padhraic Smyth, University of California, Irvine
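One concrete parametrization from the cited Koren (KDD 2009) paper is a signed, sub-linear drift of the user bias around the user's mean rating date. The sketch below assumes that form; all parameter values are invented for illustration:

```python
import math

def dev(t, t_mean_u, beta=0.4):
    """Signed time deviation dev_u(t) = sign(t - t_u) * |t - t_u|**beta."""
    d = t - t_mean_u
    return math.copysign(abs(d) ** beta, d)

def user_bias_at(t, b_u, alpha_u, t_mean_u):
    """Time-dependent user bias b_u(t) = b_u + alpha_u * dev_u(t)."""
    return b_u + alpha_u * dev(t, t_mean_u)

# A user whose ratings drift upward over time (invented parameters, days as t):
print(user_bias_at(400.0, b_u=-1.0, alpha_u=0.02, t_mean_u=300.0))
```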
  • Slide 21
  • Adding Time Effects: $\hat{r}_{ui} \approx \mu + b_u(t) + b_i(t) + q_i^T p_u(t)$, adding time dependence to the user factor weights as well. This models the fact that users' interests over genres (the $q$'s) may change over time.
  • Slide 22
  • Figure from Koren, Bell, Volinsky, IEEE Computer, 2009
  • Slide 23
  • The Kitchen Sink Approach: many options for modeling. Variants of the ideas we have seen so far: different numbers of factors, different ways to model time, different ways to handle implicit information. Other models (not described here): nearest-neighbor models, restricted Boltzmann machines. Model averaging was useful: linear model combining, neural network combining, gradient boosted decision tree combining. Note: the combining weights are learned on a validation set (stacking). Credit: Padhraic Smyth, University of California, Irvine
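The stacking note can be sketched as a least-squares fit of combining weights on held-out validation predictions. A toy illustration: the tiny normal-equations solver and the data are invented for this sketch, and real blends combined hundreds of models:

```python
# Linear blending ("stacking"): fit one weight per model by least squares on a
# validation set, then combine the models' predictions with those weights.
def blend_weights(preds, targets):
    m, n = len(preds), len(targets)
    # Normal equations A w = b with A = X^T X and b = X^T y
    A = [[sum(preds[a][t] * preds[c][t] for t in range(n)) for c in range(m)]
         for a in range(m)]
    b = [sum(preds[a][t] * targets[t] for t in range(n)) for a in range(m)]
    for col in range(m):                  # Gaussian elimination, no pivoting
        piv = A[col][col]
        A[col] = [x / piv for x in A[col]]
        b[col] /= piv
        for row in range(m):
            if row != col and A[row][col] != 0.0:
                f = A[row][col]
                A[row] = [x - f * y for x, y in zip(A[row], A[col])]
                b[row] -= f * b[col]
    return b  # one combining weight per model

def blend(preds, weights, t):
    return sum(w * p[t] for w, p in zip(weights, preds))
```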
  • Slide 24
  • (figure)
  • Slide 25
  • Other Aspects of Model Building: Automated parameter tuning: using a validation set and grid search, various parameters such as learning rates and regularization parameters can be optimized. Memory requirements: the model can fit within roughly 1 GByte of RAM. Training time: on the order of days, but achievable on commodity hardware rather than a supercomputer; some parallelization was used. Credit: Padhraic Smyth, University of California, Irvine
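The automated tuning described above amounts to a grid search over hyperparameter combinations. In this sketch, `train_fn` and `val_rmse_fn` are hypothetical stand-ins for the real training and validation-set evaluation routines:

```python
import itertools

# Grid search: train once per (gamma, lam) pair, keep the pair with the
# lowest validation RMSE.
def grid_search(train_fn, val_rmse_fn, gammas, lams):
    best_score, best_params = float("inf"), None
    for gamma, lam in itertools.product(gammas, lams):
        model = train_fn(gamma=gamma, lam=lam)
        score = val_rmse_fn(model)
        if score < best_score:
            best_score, best_params = score, (gamma, lam)
    return best_score, best_params
```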
  • Slide 26
  • Progress Prize 2008: On Sept 2nd, only 3 teams qualified with a 1% improvement over the previous year; by Oct 2nd, the leading team had a 9.4% overall improvement. The progress prize ($50,000) was awarded to the BellKor team of 3 AT&T researchers (same as before) plus 2 Austrian graduate students, Andreas Toscher and Martin Jahrer. The key winning strategy was clever blending of predictions from models used by both teams. There was speculation that 10% would be attained by mid-2009.
  • Slide 27
  • The Leading Team for the Final Prize: BellKor's Pragmatic Chaos. BellKor: Yehuda Koren (now Yahoo!), Bob Bell, Chris Volinsky (AT&T). BigChaos: Michael Jahrer, Andreas Toscher, 2 grad students from Austria. Pragmatic Theory: Martin Chabert, Martin Piotte, 2 engineers from Montreal (Quebec).
  • Slide 28
  • (figure)
  • Slide 29
  • June 26th, 2009: after 1,000 days and nights
  • Slide 30
  • Million Dollars Awarded, Sept 21st, 2009