Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines


DESCRIPTION

A presentation I gave to senior-year students on recommender systems: specifically, how they work and how to build one using existing tools.

TRANSCRIPT

Page 1: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines


Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

Varad Meru, Software Development Engineer, Orzota, Inc.

© Varad Meru, 2013

Page 2: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Outline

Introduction

Introduction to Recommendation Engines

Algorithms for Recommendation Engines

Challenges in Recommendation Engines

What is Hadoop MapReduce?

What is the Netflix Prize?

Block diagram

System requirements

Conclusion

© Varad Meru, 2013

Page 3: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

Recommender Systems: Introduction and Project Scope

© Varad Meru, 2013

Page 4: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Introduction

The scope of our project is to build a recommender engine using clustering.

Recommender engines are used in e-commerce and other settings to recommend items to end users.

They are widely used by companies such as Amazon, Netflix, Flipkart, and Google News, among many others.

Collaborative filtering, clustering, and matrix decomposition are used to compute the recommendations.

© Varad Meru, 2013

Page 5: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Recommender System Example

© Varad Meru, 2013

Page 6: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Some Other Recommender Systems

Here are some snapshots of widely used recommendation engines at Amazon.

© Varad Meru, 2013

Page 7: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Collaborative Filtering in Action

Assume that each of the named users has seen some of the movies above.

Let 1 denote "seen".

Let 0 denote "not seen".

© Varad Meru, 2013
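To make the 0/1 "seen" matrix concrete, here is a minimal Java sketch (not from the slides) that treats each user's row as the set of movies they have seen and compares two users with Jaccard similarity; the names `alice`, `bob`, and `jaccard` are illustrative only.

```java
public class SeenMatrixSimilarity {

    // Jaccard similarity between two binary "seen" vectors:
    // |intersection| / |union| of the movies the two users have seen.
    static double jaccard(int[] a, int[] b) {
        int both = 0, either = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == 1 && b[i] == 1) both++;
            if (a[i] == 1 || b[i] == 1) either++;
        }
        return either == 0 ? 0.0 : (double) both / either;
    }

    public static void main(String[] args) {
        // Rows: users, columns: movies; 1 = seen, 0 = not seen.
        int[] alice = {1, 0, 1, 1};
        int[] bob   = {1, 0, 1, 0};
        System.out.println("similarity(alice, bob) = " + jaccard(alice, bob)); // 2/3
    }
}
```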

Page 8: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Collaborative Filtering in Action

© Varad Meru, 2013

Page 9: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Collaborative Filtering in Action

© Varad Meru, 2013

Page 10: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+MinHash Clustering in Action

We will implement a variation of this algorithm for our project.

It is a technique to find out how similar two sets are.

The scheme was invented by Andrei Broder (1997)1

The simplest version of the minhash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of hmin(S) for these k functions.

Google is known to have used this method to cluster news articles and recommend to users the news matching their tastes2.

1 Broder, Andrei Z. (1997), "On the resemblance and containment of documents".
2 Mayur Datar et al. (2007), "Google News Personalization: Scalable Online Collaborative Filtering".

© Varad Meru, 2013
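Below is a minimal sketch of the k-hash-function scheme described above, assuming item IDs are integers. The simple linear hash family and all names here (`MinHashSketch`, `signature`, `estimatedSimilarity`) are illustrative choices, not the project's actual implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class MinHashSketch {

    private final int k;          // number of hash functions (fixed integer parameter)
    private final int[] seedsA;   // parameters of the illustrative linear hash family
    private final int[] seedsB;
    private static final int PRIME = 2147483647; // a large prime (2^31 - 1)

    MinHashSketch(int k, long randomSeed) {
        this.k = k;
        Random rnd = new Random(randomSeed);
        seedsA = new int[k];
        seedsB = new int[k];
        for (int i = 0; i < k; i++) {
            seedsA[i] = 1 + rnd.nextInt(PRIME - 1);
            seedsB[i] = rnd.nextInt(PRIME);
        }
    }

    // Represent a set S by the k values h_min(S), one per hash function.
    int[] signature(Set<Integer> items) {
        int[] sig = new int[k];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int item : items) {
            for (int i = 0; i < k; i++) {
                int h = (int) (((long) seedsA[i] * item + seedsB[i]) % PRIME);
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // The fraction of hash functions on which two signatures agree
    // estimates the Jaccard similarity of the underlying sets.
    static double estimatedSimilarity(int[] s1, int[] s2) {
        int equal = 0;
        for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) equal++;
        return (double) equal / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(128, 42L);
        Set<Integer> u1 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        Set<Integer> u2 = new HashSet<>(Arrays.asList(3, 4, 5, 6));
        System.out.println(estimatedSimilarity(mh.signature(u1), mh.signature(u2)));
    }
}
```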

Page 11: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+MinHash Clustering Flow

Start.

Get a random permutation R of the product catalog.

Define a hash function h such that h(Ui) = the minimum-ranked product in R, where Ui is the set of all interactions performed by the user (an interaction can be a click, a purchase, a like, etc.).

Pass each user through the hash function to get the user's cluster number.

After the clusters have been formed, use covisitation to find recommendations.

Cache the recommendations in memory.

Stop.

(A code sketch of this flow follows after this slide.)

© Varad Meru, 2013
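A minimal, single-machine sketch of the flow above, assuming each user's interaction history is available as a set of product IDs; the covisitation and caching steps are only indicated in comments, and all names and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class MinHashClusteringFlow {

    public static void main(String[] args) {
        // Product catalog and a random permutation R of it.
        List<String> catalog = new ArrayList<>(Arrays.asList("p1", "p2", "p3", "p4", "p5"));
        List<String> r = new ArrayList<>(catalog);
        Collections.shuffle(r, new Random(7));

        // Rank of each product under the permutation R.
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < r.size(); i++) rank.put(r.get(i), i);

        // U_i: all interactions (clicks, purchases, likes, ...) of each user.
        Map<String, Set<String>> interactions = new HashMap<>();
        interactions.put("alice", new HashSet<>(Arrays.asList("p1", "p3")));
        interactions.put("bob",   new HashSet<>(Arrays.asList("p1", "p3", "p4")));
        interactions.put("carol", new HashSet<>(Arrays.asList("p2", "p5")));

        // h(U_i) = minimum-ranked product in R among the user's interactions;
        // users sharing the same min-hash value fall into the same cluster.
        Map<String, List<String>> clusters = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : interactions.entrySet()) {
            String minProduct = null;
            for (String p : e.getValue()) {
                if (minProduct == null || rank.get(p) < rank.get(minProduct)) minProduct = p;
            }
            clusters.computeIfAbsent(minProduct, x -> new ArrayList<>()).add(e.getKey());
        }
        System.out.println(clusters); // cluster id (min-hashed product) -> users

        // Next steps in the flow: compute covisitation-based recommendations per
        // cluster and cache them in memory (not shown in this sketch).
    }
}
```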

Page 12: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Some Recommender Systems Available

Apache Mahout1

Easyrec2

University of Minnesota's SUGGEST3

Other research implementations, such as UniRecSys and Taste

1 http://mahout.apache.org
2 http://easyrec.org/
3 http://www-users.cs.umn.edu/~karypis/suggest/

© Varad Meru, 2013

Page 13: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

MapReduce Paradigm: MapReduce and Hadoop

© Varad Meru, 2013

Page 14: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+MapReduce Programming Paradigm

A core idea behind MapReduce is mapping your data set into a collection of Key-Value pairs, and then reducing over all pairs with the same key.

Hadoop MapReduce is an open-source implementation of the MapReduce framework, along the lines of Google's MapReduce software framework.

It is used for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.

A Hadoop MapReduce job mainly consists of two user-defined functions: map and reduce.

© Varad Meru, 2013
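As a hedged illustration of how such a job is assembled with the Hadoop MapReduce API, the driver below wires a mapper and reducer into one job. The class names (`RatingCountJob`, `RatingCountMapper`, `RatingCountReducer`) and the rating-count task itself are assumptions for illustration, of the kind sketched on the next two slides, not the project's actual code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RatingCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "rating count");
        job.setJarByClass(RatingCountJob.class);

        // The two user-defined functions of the job: map and reduce.
        job.setMapperClass(RatingCountMapper.class);    // hypothetical, see the map() slide
        job.setReducerClass(RatingCountReducer.class);  // hypothetical, see the reduce() slide

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input ratings
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output counts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```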

Page 15: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+map() function

A list of data elements is passed, one element at a time, to the map() function, which transforms each data element into an individual output data element.

A map() call produces one or more intermediate <key, value> pair(s) from the input list.

[Diagram: an input list of pairs (k1, v1), (k2, v2), ..., (k6, v6), ... is passed element-by-element through map() to produce an intermediate output list (k'1, v'1), (k'2, v'2), ..., (k'6, v'6), ...]

© Varad Meru, 2013
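A hedged sketch of such a map() function: the hypothetical `RatingCountMapper` below assumes input lines of the form `user,movie,rating` and emits one intermediate <movie, 1> pair per line.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Transforms each input line into an intermediate <key, value> pair:
// here, one <movieId, 1> pair per rating line "user,movie,rating".
public class RatingCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text movieId = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length < 3) return;    // skip malformed lines
        movieId.set(fields[1].trim());    // key: the movie being rated
        context.write(movieId, ONE);      // value: a count of one rating
    }
}
```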

Page 16: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+reduce() function

After the map phase finishes, the intermediate values that share the same output key are reduced into one or more final values.

[Diagram: the intermediate map output (k'1, v'1), (k'2, v'2), ..., (k'6, v'6), ... is grouped by key and passed through reduce() to produce the final results (F1, R1), (F2, R2), (F3, R3), ...]

© Varad Meru, 2013
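Continuing the sketch from the previous slide, the hypothetical `RatingCountReducer` below receives all intermediate <movie, 1> pairs that share one movie key and reduces them to a single final <movie, count> value.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives all intermediate values that share one key (a movie id)
// and reduces them to a single final value: the number of ratings.
public class RatingCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text movieId, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(movieId, new IntWritable(sum));
    }
}
```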

Page 17: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Parallelism

map() functions run in parallel, creating different intermediate values from different input data elements

reduce() functions also run in parallel, each working on its assigned output keys

All values are processed independently

The reduce phase cannot start until the map phase has completely finished.

It is, in a way, a data-parallel implementation and thus works with humongous amounts of data.

© Varad Meru, 2013

Page 18: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Hadoop

Started by Doug Cutting, and then carried ahead by enterprises such as Yahoo! and Facebook

It is a collection of three frameworks: Common, MapReduce, and the DFS (distributed file system).

Free and Open Source with Apache Software License

Current largest cluster size of 4,000 nodes (at Yahoo!).

A whole ecosystem has been built around it to process large amounts of data (in GBs, TBs, PBs).

© Varad Meru, 2013

Page 19: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Evaluation of Recommendation Engine: Netflix and Comparison with Other Frameworks

© Varad Meru, 2013

Page 20: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Netflix Dataset

This dataset was released by Netflix on October 2, 2006 for the Netflix Prize challenge to build the world's best recommender for Netflix.

Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies.

Each training rating is a quadruplet of the form <user, movie, date of grade, grade>.

Used heavily in research on recommender engines1.

Used in our project to compare our implementation of the algorithm with other implementations, e.g. Mahout.

1 Google Scholar: about 3,190 results for the search term "netflix prize".

© Varad Meru, 2013
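For illustration, a small parsing sketch assuming the training data has been flattened to one comma-separated quadruplet per line in the order shown above; the raw Netflix Prize files are actually packaged per movie, so this layout and the `NetflixRating` class are assumptions.

```java
// Parses one flattened rating line of the assumed form: user,movie,date-of-grade,grade
public class NetflixRating {
    final int userId;
    final int movieId;
    final String dateOfGrade;   // e.g. "2005-09-06"
    final int grade;            // 1 to 5 stars

    NetflixRating(int userId, int movieId, String dateOfGrade, int grade) {
        this.userId = userId;
        this.movieId = movieId;
        this.dateOfGrade = dateOfGrade;
        this.grade = grade;
    }

    static NetflixRating parse(String line) {
        String[] f = line.split(",");
        return new NetflixRating(
                Integer.parseInt(f[0].trim()),
                Integer.parseInt(f[1].trim()),
                f[2].trim(),
                Integer.parseInt(f[3].trim()));
    }

    public static void main(String[] args) {
        NetflixRating r = parse("1488844,1,2005-09-06,3");
        System.out.println("movie " + r.movieId + " rated " + r.grade + " by user " + r.userId);
    }
}
```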

Page 21: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+High-level Architecture

MapReduce implementation of Clustering algorithms such as K-Means and MinHash Clustering.

Comparative analysis with existing frameworks such as Apache Mahout (see references 1, 2, and 3).

© Varad Meru, 2013

Page 22: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+Requisites

2 Linux machines (required; preferred OS: Ubuntu)

Pentium 4 or better machines (recommended: Core 2 Duo 2.53 GHz+)

RAM 1 GB per machine (Recommended – 4 GB per machine)

Apache Hadoop (from http://hadoop.apache.org )

Apache Mahout (from http://mahout.apache.org)

Java IDE (Eclipse preferred)

Java SDK 1.6+

© Varad Meru, 2013

Page 23: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines

+References

1. “Scalable Similarity-Based Neighborhood Methods with MapReduce” by Sebastian Schelter, Christoph Boden and Volker Markl. – RecSys 2012.

2. “Case Study Evaluation of Mahout as a Recommender Platform” by Carlos E. Seminario and David C. Wilson - Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012)

3. http://mahout.apache.org/ - Apache Mahout Project Page

4. http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout

5. [VIDEO] “Collaborative filtering at scale” by Sean Owen

6. [BOOK] "Mahout in Action" by Owen et al., Manning Pub.

© Varad Meru, 2013

Page 24: Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines


Thank You

© Varad Meru, 2013