collaborative filtering - rajashree. apache mahout in 2008 as a subproject of apache’s lucene...

41
Collaborative Filtering - Rajashree

Upload: jasmin-foster

Post on 25-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Collaborative Filtering

- Rajashree

Page 2: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Apache Mahout

• In 2008 as a subproject of Apache’s Lucene project

• Mahout absorbed the Taste open source collaborative filtering project

Page 3: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Apache Mahout

• A Mahout is an elephant trainer/driver/keeper, hence…

+Machine Learning

=

Page 4: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

What is Apache Mahout?

• Machine learning• For large data• Based on Hadoop• But can work on a non Hadoop cluster• Scalable • Licensed by Apache

Mahout is a Java written open source scalable machine learning library from Apache

Page 5: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Mahout inApache Software Foundation

Lucene: information retrieval software library

Page 6: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Mahout inApache Software Foundation

Hadoop: framework for distributed storage and programming based on MapReduce

Page 7: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Mahout inApache Software Foundation

Taste: collaborative filtering framework

Page 8: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Why Mahout?

• Mahout provides a rich set of components from which you can construct a customized recommender system from a selection of algorithms.

• Mahout is designed for performance, scalability and flexibility.

Page 9: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Why do we need a scalable framework?

Page 10: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

General Architecture

Three-tiers architecture (Application, Algorithms and Shared Libraries)

Page 11: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

General Architecture

Data Storage and Shared Libraries

Page 12: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

General Architecture

Business Logic

Page 13: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

General Architecture

External Applications invoking Mahout APIs

Page 14: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Use cases

•Currently Mahout supports mainly four use cases: –Recommendation - takes users' behavior and from that tries to find items users might like. –Clustering - takes e.g. text documents and groups them into groups of topically related documents. –Classification - learns from exisiting categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. –Frequent itemset mining - takes a set of item groups (terms in a query session, shopping cart content) and identifies, which individual items usually appear together.

Page 15: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

In this presentation we will focus on

Recommendation

Page 16: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Recommendation

Mahout implements a Collaborative Filtering framework –Popularized by Amazon and others –Uses hystorical data (ratings, clicks, and purchases) to provide recommendations •User-based: Recommend items by finding similar users. This is often harder to scale because of the dynamic nature of users. •Item-based: Calculate similarity between items and make recommendations. Items usually don't change much, so this often can be computed offline. •Slope-One: A very fast and simple item-based recommendation approach applicable when users have given ratings (and not just boolean preferences).

Page 17: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Collaborative Filtering

• Recommend people and products– User-User• User likes X, you might too

– Item-Item• People who bought X also bought Y

Page 18: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Example

Amazon.com

Page 19: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Recommendation - Architecture

Inceptive Idea:

A Java/J2EE application invokes a Mahout Recommender whose DataModel is based on a set of User Preferences that are built on the ground of a physical DataStore

Page 20: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Recommendation - Architecture

Physical Data

Data model

Recommender

External Application

Page 21: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Recommendation in Mahout

•Input: raw data (user preferences) •Output: preferences estimation

•Step 1 –Mapping raw data into a DataModel Mahout-compliant •Step 2 –Tuning recommender components

•Similarity measure, neighborhood, etc.

Page 22: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Recommendation Components

Mahout implements interfaces to these key abstractions: –DataModel •Methods for mapping raw data to a Mahout-compliant form –UserSimilarity •Methods to calculate the degree of correlation between two users –ItemSimilarity •Methods to calculate the degree of correlation between two items –UserNeighborhood •Methods to define the concept of ‘neighborhood’ –Recommender •Methods to implement the recommendation step itself

Page 23: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Components: DataModel

A DataModel is the interface to draw information about user preferences.

Which sources is it possible to draw? –Database •MySQLJDBCDataModel –External Files •FileDataModel –Generic (preferences directly feed through Java code) •GenericDataModel

Page 24: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Components: DataModel

Basic object: Preference –Preference is a triple (user,item,score)

Two implementations –GenericUserPreferenceArray •It stores numerical preference, as well. –BooleanUserPreferenceArray •It skips numerical preference values.

Page 25: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Components: UserSimilarity

•UserSimilarity defines a notion of similarity between two Users.

–(respectively) ItemSimilarity defines a notion of similarity between two Items.

•Which definition of similarity are available? –Pearson Correlation –Spearman Correlation –Euclidean Distance –Tanimoto Coefficient –LogLikelihood Similarity

Page 26: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Example: TanimotoDistance

Page 27: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Components: UserNeighborhood

•UserSimilarity defines a notion of similarity between two Users.

–(respectively) ItemSimilarity defines a notion of similarity between two Items.

•Which definition of neighborhood are available? –Nearest N users

•The first N users with the highest similarity are labeled as ‘neighbors’ –Tresholds

•Users whose similarity is above a threshold are labeled as ‘neighbors’

Page 28: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Components: Recommender

•Given a DataModel, a definition of similarity between users (items) and a definition of neighborhood, a recommender produces as output an estimation of relevance for each unseen item

•Which recommendation algorithms are implemented? –User-based CF –Item-based CF –SVD-based CF

Page 29: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Download Mahout

• Download –The latest Mahout release is 0.9 –Available at: http://apache.fastbull.org/mahout/0.9/mahout-distribution-0.9.zip -Extract all the libraries and include them in a new NetBeans (Eclipse) project

•Requirement: Java 1.6.x or greater. •Hadoop is not mandatory!

Page 30: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Exercise 1: First Recommender import org.apache.mahout.cf.taste.impl.model.file.*; import org.apache.mahout.cf.taste.impl.neighborhood.*; import org.apache.mahout.cf.taste.impl.recommender.*; import org.apache.mahout.cf.taste.impl.similarity.*; import org.apache.mahout.cf.taste.model.*; import org.apache.mahout.cf.taste.neighborhood.*; import org.apache.mahout.cf.taste.recommender.*; import org.apache.mahout.cf.taste.similarity.*;

class RecommenderIntro { private RecommenderIntro() { } public static void main(String[] args) throws Exception {

DataModel model = new FileDataModel(new File("data/mydata.dat ")); UserSimilarity similarity = new PearsonCorrelationSimilarity(model); UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); Recommender recommender = new GenericUserBasedRecommender( model, neighborhood, similarity); List<RecommendedItem> recommendations = recommender.recommend(1, 1);

for (RecommendedItem recommendation : recommendations) { System.out.println(recommendation); }

}

}

Page 31: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Exercise 1: First Recommender package dnslab.recommender;

import org.apache.mahout.cf.taste.impl.model.file.*; import org.apache.mahout.cf.taste.impl.neighborhood.*; import org.apache.mahout.cf.taste.impl.recommender.*; import org.apache.mahout.cf.taste.impl.similarity.*; import org.apache.mahout.cf.taste.model.*; import org.apache.mahout.cf.taste.neighborhood.*; import org.apache.mahout.cf.taste.recommender.*; import org.apache.mahout.cf.taste.similarity.*;

class RecommenderIntro { private RecommenderIntro() { } public static void main(String[] args) throws Exception {

DataModel model = new FileDataModel(new File("data/mydata.dat ")); UserSimilarity similarity = new PearsonCorrelationSimilarity(model); UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); Recommender recommender = new GenericUserBasedRecommender( model, neighborhood, similarity); List<RecommendedItem> recommendations = recommender.recommend(1, 1);

for (RecommendedItem recommendation : recommendations) { System.out.println(recommendation); }

}

}

FileData model

Page 32: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Exercise 1: First Recommender package dnslab.recommender;

import org.apache.mahout.cf.taste.impl.model.file.*; import org.apache.mahout.cf.taste.impl.neighborhood.*; import org.apache.mahout.cf.taste.impl.recommender.*; import org.apache.mahout.cf.taste.impl.similarity.*; import org.apache.mahout.cf.taste.model.*; import org.apache.mahout.cf.taste.neighborhood.*; import org.apache.mahout.cf.taste.recommender.*; import org.apache.mahout.cf.taste.similarity.*;

class RecommenderIntro { private RecommenderIntro() { } public static void main(String[] args) throws Exception {

DataModel model = new FileDataModel(new File("data/mydata.dat ")); UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); Recommender recommender = new GenericUserBasedRecommender( model, neighborhood, similarity); List<RecommendedItem> recommendations = recommender.recommend(1, 1);

for (RecommendedItem recommendation : recommendations) { System.out.println(recommendation); }

}

}

Neighbors

Page 33: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Exercise 1: First Recommender package dnslab.recommender;

import org.apache.mahout.cf.taste.impl.model.file.*; import org.apache.mahout.cf.taste.impl.neighborhood.*; import org.apache.mahout.cf.taste.impl.recommender.*; import org.apache.mahout.cf.taste.impl.similarity.*; import org.apache.mahout.cf.taste.model.*; import org.apache.mahout.cf.taste.neighborhood.*; import org.apache.mahout.cf.taste.recommender.*; import org.apache.mahout.cf.taste.similarity.*;

class RecommenderIntro { private RecommenderIntro() { } public static void main(String[] args) throws Exception {

DataModel model = new FileDataModel(new File("data/mydata.dat ")); UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); Recommender recommender = new GenericUserBasedRecommender( model, neighborhood, similarity);

List<RecommendedItem> recommendations = recommender.recommend(1, 1);

for (RecommendedItem recommendation : recommendations) { System.out.println(recommendation); }

}

} Top-1 Recommendation for User 1

Page 34: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Exercise 1: First Recommender import org.apache.mahout.cf.taste.impl.model.file.*; import org.apache.mahout.cf.taste.impl.neighborhood.*; import org.apache.mahout.cf.taste.impl.recommender.*; import org.apache.mahout.cf.taste.impl.similarity.*; import org.apache.mahout.cf.taste.model.*; import org.apache.mahout.cf.taste.neighborhood.*; import org.apache.mahout.cf.taste.recommender.*; import org.apache.mahout.cf.taste.similarity.*;

class RecommenderIntro { private RecommenderIntro() { } public static void main(String[] args) throws Exception {

DataModel model = new FileDataModel(new File("data/mydata.dat ")); UserSimilarity similarity = new PearsonCorrelationSimilarity(model); UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); Recommender recommender = new GenericUserBasedRecommender( model, neighborhood, similarity); List<RecommendedItem> recommendations = recommender.recommend(1, 1);

for (RecommendedItem recommendation : recommendations) { System.out.println(recommendation); }

}

}

Page 35: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Output

Page 36: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Exercise 2: MovieLens Recommender

• Download the MovieLens dataset (100k) –http://files.grouplens.org/datasets/movielens/ml-100k.zip

• similarity calculations with a bigger dataset •Next: now we can run the recommendation framework against a state-of-the-art dataset

Page 37: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

MovieLens Dataset

u.data • The full u data set, 100000 ratings by 943 users

on 1682 items. • Each user has rated at least 20 movies. • Users and items are numbered consecutively

from 1• The data is randomly ordered. This is a tab

separated list of • user id | item id | rating | timestamp

Page 38: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Data Processing

• Its format is not Mahout compliant

cat u.data | cut –f1,2,3 | tr “\\t” “,” >>u.data1

Page 39: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

package dnslab.recommender;import java.io.File;import java.io.IOException;import java.util.List;import org.apache.mahout.cf.taste.common.TasteException;import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;import org.apache.mahout.cf.taste.model.DataModel;import org.apache.mahout.cf.taste.recommender.RecommendedItem;import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommend {public static void main(String[] args) {try {

DataModel dm = new FileDataModel(new File("data/u.data1"));TanimotoCoefficientSimilarity sim = new TanimotoCoefficientSimilarity(dm);GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(dm, sim);int x=1;for(LongPrimitiveIterator items = dm.getItemIDs(); items.hasNext();) {long itemId = items.nextLong();List<RecommendedItem>recommendations = recommender.mostSimilarItems(itemId, 5);for(RecommendedItem recommendation : recommendations) {System.out.println(itemId + "," + recommendation.getItemID() + "," + recommendation.getValue());}x++;if(x>10) System.exit(1);}} catch (IOException e) {System.out.println("There was an error.");e.printStackTrace();} catch (TasteException e) {System.out.println("There was a Taste Exception");e.printStackTrace();}

}

}

Page 40: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Output

Similar items

Preferences similarity

items

Page 41: Collaborative Filtering - Rajashree. Apache Mahout In 2008 as a subproject of Apache’s Lucene project Mahout absorbed the Taste open source collaborative

Thank you