next directions in mahout's recommenders

Next Directions in Mahout’s Recommenders Sebastian Schelter, Apache Software Foundation Bay Area Mahout Meetup

Uploaded by sscdotopen on 08-Sep-2014


DESCRIPTION

Slides from my talk "Next directions in Mahout's recommenders" given at the Bay Area Mahout Meetup

TRANSCRIPT

Page 1: Next directions in Mahout's recommenders

Next Directions in Mahout’s Recommenders
Sebastian Schelter, Apache Software Foundation
Bay Area Mahout Meetup

Page 2: Next directions in Mahout's recommenders


About me

PhD student at the Database Systems and Information Management Group of Technische Universität Berlin

Member of the Apache Software Foundation, committer on Mahout and Giraph

currently interning at IBM Research Almaden

Page 3: Next directions in Mahout's recommenders


Next Directions?

Mahout in Action is the prime source of information for using Mahout in practice.

As it is more than two years old (and only covers Mahout 0.5), it is missing a lot of recent developments.

This talk describes what has been added to the recommenders of Mahout since then and gives suggestions on directions for future versions of Mahout.

Page 4: Next directions in Mahout's recommenders

Collaborative Filtering 101

Page 5: Next directions in Mahout's recommenders


Collaborative Filtering

Problem: Given a user’s interactions with items, guess which other items would be highly preferred

Collaborative Filtering: infer recommendations from patterns found in the historical user-item interactions

data can be explicit feedback (ratings) or implicit feedback (clicks, pageviews), represented in the interaction matrix A

        item1  ...  item3  ...
user1     3    ...    4    ...
user2     −    ...    4    ...
user3     5    ...    1    ...
...      ...   ...   ...   ...
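A matrix like the one above can be written down in a few lines. This is an illustrative Python/NumPy sketch, not Mahout code (Mahout stores interactions behind its Java DataModel abstraction); the variable names are made up:

```python
import numpy as np

# Toy interaction matrix from the slide: rows are users, columns are
# items, np.nan marks a missing (unobserved) rating.
A = np.array([
    [3.0,    4.0],   # user1 rated item1 with 3 and item3 with 4
    [np.nan, 4.0],   # user2 never interacted with item1
    [5.0,    1.0],   # user3
])

# Implicit feedback would instead be a 0/1 (or count) matrix:
A_implicit = ~np.isnan(A)   # True where an interaction was observed
print(A_implicit.sum())     # number of observed interactions: 5
```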

Page 6: Next directions in Mahout's recommenders


Neighborhood Methods

User-based:
- for each user, compute a ”jury” of users with similar taste
- pick the recommendations from the ”jury’s” items

Item-based:
- for each item, compute a set of items with similar interaction patterns
- pick the recommendations from those similar items
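The item-based flavor can be sketched in a few lines. This is an illustrative Python/NumPy snippet, not Mahout’s Java implementation; the helper names and the cosine similarity choice are assumptions for the example:

```python
import numpy as np

# Rows = users, columns = items; 0 means "no interaction".
A = np.array([
    [3., 0., 4.],
    [0., 2., 4.],
    [5., 1., 0.],
])

def item_similarities(A):
    """Cosine similarity between item (column) interaction patterns."""
    norms = np.linalg.norm(A, axis=0)
    S = (A.T @ A) / np.outer(norms, norms)
    np.fill_diagonal(S, 0.0)        # an item is not its own neighbor
    return S

def recommend(A, S, user, top_n=1):
    """Score unseen items by a similarity-weighted sum of the user's ratings."""
    scores = S @ A[user]
    scores[A[user] > 0] = -np.inf   # never recommend already-seen items
    return np.argsort(-scores)[:top_n]

S = item_similarities(A)
print(recommend(A, S, user=1))     # top item for the second user
```

The second user has only item1 unseen, so it is the one recommended.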

Page 7: Next directions in Mahout's recommenders


Neighborhood Methods

the item-based variant is most popular:
- simple and intuitively understandable
- additionally gives non-personalized, per-item recommendations (people who like X might also like Y)
- recommendations for new users without model retraining
- comprehensible explanations (we recommend Y because you liked X)

Page 8: Next directions in Mahout's recommenders


Latent factor models

Idea: interactions are deeply influenced by a set of factors that are very specific to the domain (e.g. amount of action or complexity of characters in movies)

these factors are in general not obvious and need to be inferred from the interaction data

both users and items can be described in terms of these factors

Page 9: Next directions in Mahout's recommenders


Matrix factorization

Computing a latent factor model: approximately factor A into the product of two rank-k feature matrices U and M such that A ≈ UM.

U models the latent features of the users, M models the latent features of the items

the dot product uᵢᵀmⱼ in the latent feature space predicts the strength of interaction between user i and item j

A (u × i) ≈ U (u × k) × M (k × i)
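The prediction rule can be checked numerically. A toy Python/NumPy sketch with random factor matrices (sizes and seed are arbitrary; this is not how Mahout computes a factorization, it only illustrates the shapes and the dot-product prediction):

```python
import numpy as np

# A (u x i) is approximated by U (u x k) times M (k x i); entry (i, j)
# of U @ M is the dot product of user i's and item j's latent vectors.
rng = np.random.default_rng(0)
u, i, k = 4, 6, 2            # 4 users, 6 items, rank-2 model
U = rng.normal(size=(u, k))
M = rng.normal(size=(k, i))

A_hat = U @ M                # predicted interaction strengths

# The predicted strength for user 1 and item 3 is exactly u_1 . m_3:
assert np.isclose(A_hat[1, 3], U[1] @ M[:, 3])
```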

Page 10: Next directions in Mahout's recommenders

Single machine recommenders

Page 11: Next directions in Mahout's recommenders


Taste

- based on Sean Owen’s Taste framework (started in 2005)
- mature and stable codebase
- Recommender implementations encapsulate recommender algorithms
- DataModel implementations handle interaction data in memory, files, databases, key-value stores

but the focus was mostly on neighborhood methods:
- lack of implementations for latent factor models
- little support for scientific usecases (e.g. recommender contests)

Page 12: Next directions in Mahout's recommenders


Collaboration

MyMediaLite, a scientific library of recommender system algorithms: http://www.mymedialite.net/

Mahout now features a couple of popular latent factor models, mostly ported by Zeno Gantner.

Page 13: Next directions in Mahout's recommenders


Lots of different Factorizers for our SVDRecommender

RatingSGDFactorizer, biased matrix factorization
Koren et al.: Matrix Factorization Techniques for Recommender Systems, IEEE Computer ’09

SVDPlusPlusFactorizer, SVD++
Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD ’08

ALSWRFactorizer, matrix factorization using Alternating Least Squares
Zhou et al.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, AAIM ’08

Hu et al.: Collaborative Filtering for Implicit Feedback Datasets, ICDM ’08

ParallelSGDFactorizer, parallel version of biased matrix factorization (contributed by Peng Cheng)
Takács et al.: Scalable Collaborative Filtering Approaches for Large Recommender Systems, JMLR ’09

Niu et al.: Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, NIPS ’11

Page 14: Next directions in Mahout's recommenders


Next directions

- better tooling for cross-validation and hold-out tests (e.g. time-based splits of interactions)
- memory-efficient DataModel implementations tailored to specific usecases (e.g. matrix factorization with SGD)
- better support for computing recommendations for ”anonymous” users
- online recommenders

Page 15: Next directions in Mahout's recommenders


Usage

- researchers at TU Berlin and CWI Amsterdam regularly use Mahout for their recommender research published at international conferences
- ”Bayerischer Rundfunk”, one of Germany’s largest public TV broadcasters, uses Mahout to help users discover TV content in its online media library
- Berlin-based company plista runs a live contest for the best news recommender algorithm and provides Mahout-based ”skeleton code” to participants
- The Dutch Institute of Sound and Vision runs a web platform that uses Mahout for recommending content from its archive of Dutch audio-visual heritage collections of the 20th century

Page 16: Next directions in Mahout's recommenders

Parallel processing

Page 17: Next directions in Mahout's recommenders


Distribution

a difficult environment:
- data is partitioned and stored in a distributed filesystem
- algorithms must be expressed in MapReduce

our distributed implementations focus on two popular methods:
- item-based collaborative filtering
- matrix factorization with Alternating Least Squares

Page 18: Next directions in Mahout's recommenders

Scalable neighborhood methods

Page 19: Next directions in Mahout's recommenders


Cooccurrences

start with a simplified view: imagine the interaction matrix A was binary

→ we look at cooccurrences only

item similarity computation becomes matrix multiplication

S = AᵀA

scale-out of the item-based approach reduces to finding an efficient way to compute this item similarity matrix
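A quick numerical illustration of S = AᵀA on a binary toy matrix (Python/NumPy sketch, not Mahout code):

```python
import numpy as np

# With a binary interaction matrix, entry (f, j) of A^T A counts the
# users who interacted with both item f and item j; the diagonal holds
# the per-item interaction counts.
A = np.array([
    [1, 0, 1],   # user1 interacted with item1 and item3
    [0, 1, 1],
    [1, 1, 1],
])

S = A.T @ A
print(S[0, 2])   # users who interacted with both item1 and item3: 2
```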

Page 20: Next directions in Mahout's recommenders


Parallelizing S = AᵀA

the standard approach of computing item cooccurrences requires random access to both users and items

foreach item f do
  foreach user i who interacted with f do
    foreach item j that i also interacted with do
      S_fj = S_fj + 1

→ not efficiently parallelizable on partitioned data

the row outer product formulation of matrix multiplication is efficiently parallelizable on a row-partitioned A

S = AᵀA = Σᵢ aᵢaᵢᵀ

mappers compute the outer products of the rows of A and emit the results row-wise; reducers sum these up to form S
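The mapper/reducer dataflow can be simulated in a few lines to confirm that summing the per-row outer products reproduces AᵀA. This is an illustrative Python sketch of the idea, not the Hadoop implementation:

```python
from collections import defaultdict
import numpy as np

# Each "mapper" sees only one user row a_i, computes the outer product
# a_i a_i^T locally, and emits it row by row; the "reducer" for output
# row f sums all contributions to that row of S.
A = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])

reduced = defaultdict(lambda: np.zeros(A.shape[1]))  # reducer state

for a_i in A:                    # rows are processed independently
    outer = np.outer(a_i, a_i)   # a_i a_i^T, computable per partition
    for f, row in enumerate(outer):
        reduced[f] += row        # "emit" row f; the reducer sums it up

S = np.vstack([reduced[f] for f in range(A.shape[1])])
assert np.array_equal(S, A.T @ A)   # identical to the direct product
```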

Page 21: Next directions in Mahout's recommenders


Parallel similarity computation

many more details in the implementation:
- support for various similarity measures
- various optimizations (e.g. for symmetric similarity measures)
- downsampling of skewed interaction data

in-depth description available in:

Sebastian Schelter, Christoph Boden, Volker Markl: Scalable Similarity-Based Neighborhood Methods with MapReduce. ACM RecSys 2012

Page 22: Next directions in Mahout's recommenders


Implementation in Mahout

o.a.m.math.hadoop.similarity.cooccurrence.RowSimilarityJob computes the top-k pairwise similarities for each row of a matrix using some similarity measure

o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob computes the top-k similar items per item using RowSimilarityJob

o.a.m.cf.taste.hadoop.item.RecommenderJob computes recommendations and similar items using RowSimilarityJob

Page 23: Next directions in Mahout's recommenders


Scalable Neighborhood Methods: Experiments

Setup

- 6 machines running Java 7 and Hadoop 1.0.4
- two 4-core Opteron CPUs, 32 GB memory and four 1 TB disk drives per machine

Results

Yahoo Songs dataset (700M datapoints, 1.8M users, 136K items), similarity computation takes less than 100 minutes

Page 24: Next directions in Mahout's recommenders

Scalable matrix factorization

Page 25: Next directions in Mahout's recommenders


Alternating Least Squares

ALS rotates between fixing U and M. When U is fixed, the system recomputes M by solving a least-squares problem per item, and vice versa.

easy to parallelize, as all users (and, vice versa, all items) can be recomputed independently

additionally, ALS can be applied to usecases with implicit data (pageviews, clicks)

A (u × i) ≈ U (u × k) × M (k × i)
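The per-user least-squares half-step described above can be sketched as follows. This is an illustrative Python/NumPy version, not Mahout’s Java implementation; the helper name and the regularization constant `lam` are assumptions for the example:

```python
import numpy as np

def recompute_user_features(A, M, lam=0.05):
    """With M fixed, each user's feature vector is the solution of an
    independent regularized least-squares problem over the items that
    user rated (np.nan marks unobserved entries)."""
    k = M.shape[0]
    U = np.zeros((A.shape[0], k))
    for i, row in enumerate(A):
        observed = ~np.isnan(row)          # this user's rated items
        M_i = M[:, observed]               # features of those items
        lhs = M_i @ M_i.T + lam * np.eye(k)
        rhs = M_i @ row[observed]
        U[i] = np.linalg.solve(lhs, rhs)   # independent per user
    return U

A = np.array([[3., np.nan, 4.],
              [np.nan, 2., 4.],
              [5., 1., np.nan]])
rng = np.random.default_rng(0)
M = rng.normal(size=(2, 3))                # random rank-2 item features
U = recompute_user_features(A, M)
print(U.shape)  # (3, 2)
```

A full ALS run would alternate this step with the symmetric recomputation of M until convergence.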

Page 26: Next directions in Mahout's recommenders


Scalable Matrix Factorization: Implementation

Recompute the user feature matrix U using a broadcast-join:

1. run a map-only job using multithreaded mappers
2. load the item-feature matrix M into memory from HDFS to share it among the individual mappers
3. mappers read the interaction histories of the users
4. multithreaded: solve a least squares problem per user to recompute its feature vector

[Figure: the item features M are broadcast to machines 1–3; on each machine, a map-side hash-join plus re-computation reads the local partition of the user histories A and writes the corresponding part of the user features U via local forward.]

Page 27: Next directions in Mahout's recommenders


Implementation in Mahout

o.a.m.cf.taste.hadoop.als.ParallelALSFactorizationJob, different solvers for explicit and implicit data
Zhou et al.: Large-Scale Parallel Collaborative Filtering for the Netflix Prize, AAIM ’08

Hu et al.: Collaborative Filtering for Implicit Feedback Datasets, ICDM ’08

o.a.m.cf.taste.hadoop.als.RecommenderJob computes recommendations from a factorization

in-depth description available in:

Sebastian Schelter, Christoph Boden, Martin Schenck, Alexander Alexandrov, Volker Markl: Distributed Matrix Factorization with MapReduce using a series of Broadcast-Joins. To appear at ACM RecSys 2013

Page 28: Next directions in Mahout's recommenders


Scalable Matrix Factorization: Experiments

Cluster: 26 machines, two 4-core Opteron CPUs, 32 GB memory and four 1 TB disk drives each

Hadoop configuration: reuse JVMs, use JBlas as solver, run multithreaded mappers

Datasets: Netflix (0.5M users, 100M datapoints), Yahoo Songs (1.8M users, 700M datapoints), Bigflix (25M users, 5B datapoints)

[Plot: avg. duration per job (seconds) vs. number of features r (10, 20, 50, 100), shown separately for the U and M recomputation steps, on Yahoo Songs and Netflix.]

[Plot: avg. duration per job (seconds) vs. number of machines (5–25) for Bigflix, U and M recomputation steps.]

Page 29: Next directions in Mahout's recommenders


Next directions

- better tooling for cross-validation and hold-out tests (e.g. to find parameters for ALS)
- integration of more efficient solver libraries like JBlas
- it should be easier to modify and adjust the MapReduce code

Page 30: Next directions in Mahout's recommenders


A selection of users

- Mendeley, a data platform for researchers (2.5M users, 50M research articles): Mendeley Suggest for discovering relevant research publications
- ResearchGate, the world’s largest social network for researchers (3M users)
- a German online retailer with several million customers across Europe
- German online marketplaces for real estate and pre-owned cars with millions of users

Page 31: Next directions in Mahout's recommenders

Deployment

Page 32: Next directions in Mahout's recommenders


”Small data, low load”

- use GenericItembasedRecommender or GenericUserbasedRecommender, feed it with interaction data stored in a file, database or key-value store
- have it load the interaction data into memory and compute recommendations on request
- collect new interactions into your files or database and periodically refresh the recommender

To improve performance, try to:
- have your recommender look at fewer interactions by using SamplingCandidateItemsStrategy
- cache computed similarities with a CachingItemSimilarity

Page 33: Next directions in Mahout's recommenders


”Medium data, high load”

Assumption: interaction data still fits into main memory

- use a recommender that is able to leverage a precomputed model, e.g. GenericItembasedRecommender or SVDRecommender
- load the interaction data and the model into memory and compute recommendations on request
- collect new interactions into your files or database and periodically recompute the model and refresh the recommender

use BatchItemSimilarities or ParallelSGDFactorizer to precompute the model using multiple threads on a single machine

Page 34: Next directions in Mahout's recommenders


”Lots of data, high load”

Assumption: interaction data does not fit into main memory

- use a recommender that is able to leverage a precomputed model, e.g. GenericItembasedRecommender or SVDRecommender
- keep the interaction data in a (potentially partitioned) database or in a key-value store
- load the model into memory; the recommender will only use one (cacheable) query per recommendation request to retrieve the user’s interaction history
- collect new interactions into your files or database and periodically recompute the model offline

use ItemSimilarityJob or ParallelALSFactorizationJob to precompute the model with Hadoop

Page 35: Next directions in Mahout's recommenders


”Precompute everything”

- use RecommenderJob to precompute recommendations for all users with Hadoop
- directly serve those recommendations

successfully employed by Mendeley for their research paper recommender ”Suggest”

allowed them to run their recommender infrastructure serving 2 million users for less than $100 per month on AWS

Page 36: Next directions in Mahout's recommenders


Next directions

”Search engine based recommender infrastructure”
(work in progress, driven by Pat Ferrel)

- use RowSimilarityJob to find anomalously co-occurring items using Hadoop
- index those item pairs with a distributed search engine such as Apache Solr
- query based on a user’s interaction history and the search engine will answer with recommendations
- gives us an easy-to-use, scalable serving layer for free (Apache Solr)
- allows complex recommendation queries containing filters, geo-location, etc.

Page 37: Next directions in Mahout's recommenders


The shape of things to come

MapReduce is not well suited for certain ML usecases, e.g. when the algorithms to apply are iterative and the dataset fits into the aggregate main memory of the cluster

Mahout has always stated that it is not tied to Hadoop; however, there were no production-quality alternatives in the past

With the advent of YARN and the maturing of alternative systems, this situation is changing and we should embrace this change

Personally, I would love to see an experimental port of our distributed recommenders to another Apache-supported system such as Spark or Giraph

Page 38: Next directions in Mahout's recommenders

Thanks for listening!
Follow me on twitter at http://twitter.com/sscdotopen

Join Mahout’s mailinglists at http://s.apache.org/mahout-lists

picture on slide 3 by Tim Abott, http://www.flickr.com/photos/theabbott/
picture on slide 21 by Crimson Diabolics, http://crimsondiabolics.deviantart.com/