Collaborative Filtering
- Rajashree
Apache Mahout
• Started in 2008 as a subproject of Apache’s Lucene project
• Mahout absorbed the Taste open source collaborative filtering project
Apache Mahout
• A Mahout is an elephant trainer/driver/keeper, hence…
[Figure: graphic equation "… + Machine Learning = …"]
What is Apache Mahout?
• Machine learning
• For large data
• Based on Hadoop
• But can work on a non-Hadoop cluster
• Scalable
• Licensed by Apache
Mahout is an open source, scalable machine learning library from Apache, written in Java.
Mahout in Apache Software Foundation
• Lucene: information retrieval software library
• Hadoop: framework for distributed storage and programming based on MapReduce
• Taste: collaborative filtering framework
Why Mahout?
• Mahout provides a rich set of components from which you can construct a customized recommender system from a selection of algorithms.
• Mahout is designed for performance, scalability and flexibility.
Why do we need a scalable framework?
General Architecture
• Three-tier architecture (Application, Algorithms and Shared Libraries)
• Data Storage and Shared Libraries
• Business Logic
• External Applications invoking Mahout APIs
Use cases
• Currently Mahout supports mainly four use cases:
– Recommendation: takes users' behavior and from that tries to find items users might like.
– Clustering: takes e.g. text documents and groups them into groups of topically related documents.
– Classification: learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
– Frequent itemset mining: takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
In this presentation we will focus on Recommendation
Recommendation
Mahout implements a Collaborative Filtering framework
– Popularized by Amazon and others
– Uses historical data (ratings, clicks, and purchases) to provide recommendations
• User-based: Recommend items by finding similar users. This is often harder to scale because of the dynamic nature of users.
• Item-based: Calculate similarity between items and make recommendations. Items usually don't change much, so this can often be computed offline.
• Slope-One: A very fast and simple item-based recommendation approach applicable when users have given ratings (and not just boolean preferences).
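To make the Slope-One idea concrete, here is a minimal sketch of the weighted Slope One prediction in plain Java. It is not Mahout's implementation; the class name SlopeOneSketch, the method predict and the nested-map rating layout are assumptions made only for this illustration.

```java
import java.util.Map;

// Illustrative sketch only (hypothetical names); not Mahout's implementation.
class SlopeOneSketch {

    /**
     * Weighted Slope One prediction of one user's rating for targetItem.
     * ratings maps userId -> (itemId -> rating).
     * Returns NaN when no co-rated item supports a prediction.
     */
    static double predict(Map<Integer, Map<Integer, Double>> ratings,
                          int userId, int targetItem) {
        Map<Integer, Double> mine = ratings.get(userId);
        if (mine == null) {
            return Double.NaN;
        }
        double numerator = 0.0;
        long totalCount = 0;
        for (Map.Entry<Integer, Double> own : mine.entrySet()) {
            int item = own.getKey();
            if (item == targetItem) {
                continue;
            }
            // Sum of (rating of targetItem - rating of item) over users who rated both.
            double diffSum = 0.0;
            int count = 0;
            for (Map<Integer, Double> other : ratings.values()) {
                Double a = other.get(targetItem);
                Double b = other.get(item);
                if (a != null && b != null) {
                    diffSum += a - b;
                    count++;
                }
            }
            if (count > 0) {
                // (own rating + average deviation) weighted by count
                numerator += own.getValue() * count + diffSum;
                totalCount += count;
            }
        }
        return totalCount == 0 ? Double.NaN : numerator / totalCount;
    }
}
```

The prediction only needs rating differences between item pairs, which is why the approach requires numeric ratings rather than boolean preferences.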
Collaborative Filtering
• Recommend people and products
– User-User
• User likes X, you might too
– Item-Item
• People who bought X also bought Y
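The item-item idea ("people who bought X also bought Y") can be sketched as a simple co-occurrence count over purchase baskets. This is a plain-Java illustration, not Mahout code; the class CoOccurrenceSketch, the method alsoBought and the basket representation are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only (hypothetical names); not Mahout code.
class CoOccurrenceSketch {

    /**
     * For item x, counts how many baskets contain both x and each other item.
     * Items with the highest counts are candidates for "also bought" suggestions.
     */
    static Map<String, Integer> alsoBought(List<Set<String>> baskets, String x) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> basket : baskets) {
            if (!basket.contains(x)) {
                continue; // only baskets that include x are relevant
            }
            for (String other : basket) {
                if (!other.equals(x)) {
                    counts.merge(other, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Raw counts favor popular items, which is one reason normalized similarity measures are usually preferred in practice.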
Example
Amazon.com
Recommendation - Architecture
Core idea:
A Java/J2EE application invokes a Mahout Recommender whose DataModel is based on a set of user preferences built from a physical data store
[Figure: architecture layers: Physical Data, Data model, Recommender, External Application]
Recommendation in Mahout
• Input: raw data (user preferences)
• Output: preference estimates
• Step 1
– Map the raw data into a Mahout-compliant DataModel
• Step 2
– Tune the recommender components
• Similarity measure, neighborhood, etc.
Recommendation Components
Mahout implements interfaces to these key abstractions:
– DataModel
• Methods for mapping raw data to a Mahout-compliant form
– UserSimilarity
• Methods to calculate the degree of correlation between two users
– ItemSimilarity
• Methods to calculate the degree of correlation between two items
– UserNeighborhood
• Methods to define the concept of ‘neighborhood’
– Recommender
• Methods to implement the recommendation step itself
Components: DataModel
A DataModel is the interface used to access information about user preferences.
Which sources can it draw from?
– Database
• MySQLJDBCDataModel
– External files
• FileDataModel
– Generic (preferences fed directly through Java code)
• GenericDataModel
Components: DataModel
Basic object: Preference
– A Preference is a triple (user, item, score)
Two PreferenceArray implementations
– GenericUserPreferenceArray
• Stores the numerical preference values as well.
– BooleanUserPreferenceArray
• Skips the numerical preference values.
Components: UserSimilarity
• UserSimilarity defines a notion of similarity between two users.
– Likewise, ItemSimilarity defines a notion of similarity between two items.
• Which definitions of similarity are available?
– Pearson Correlation
– Spearman Correlation
– Euclidean Distance
– Tanimoto Coefficient
– LogLikelihood Similarity
Example: TanimotoDistance
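As an illustration of the Tanimoto coefficient, the sketch below computes it for two users from the sets of item IDs each has expressed a preference for: the size of the intersection divided by the size of the union. This is a plain-Java sketch, not Mahout's TanimotoCoefficientSimilarity; the class and method names are hypothetical, and returning 0 for two empty sets is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only (hypothetical names); not Mahout's implementation.
class TanimotoSketch {

    /** Tanimoto (Jaccard) coefficient: |A ∩ B| / |A ∪ B|, a value in [0, 1]. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumption: no evidence means no similarity
        }
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

Because only set membership matters, the measure ignores rating values and suits boolean preferences such as clicks or purchases.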
Components: UserNeighborhood
• UserNeighborhood defines which users count as ‘neighbors’ of a given user, based on a UserSimilarity.
• Which definitions of neighborhood are available?
– Nearest N users
• The N users with the highest similarity are labeled as ‘neighbors’
– Threshold
• Users whose similarity is above a threshold are labeled as ‘neighbors’
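The two neighborhood definitions can be sketched in plain Java as follows, assuming the similarities between the target user and every other user are already computed. The class NeighborhoodSketch and its methods are hypothetical names for this illustration and do not mirror Mahout's classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch only (hypothetical names); not Mahout's implementation.
class NeighborhoodSketch {

    /** Nearest N: ids of the n users with the highest similarity to the target. */
    static List<Long> nearestN(Map<Long, Double> similarityToTarget, int n) {
        List<Map.Entry<Long, Double>> entries =
                new ArrayList<>(similarityToTarget.entrySet());
        // Sort by similarity, highest first.
        entries.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        List<Long> neighbors = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            neighbors.add(entries.get(i).getKey());
        }
        return neighbors;
    }

    /** Threshold: ids of all users whose similarity is strictly above the threshold. */
    static List<Long> aboveThreshold(Map<Long, Double> similarityToTarget, double threshold) {
        List<Long> neighbors = new ArrayList<>();
        for (Map.Entry<Long, Double> e : similarityToTarget.entrySet()) {
            if (e.getValue() > threshold) {
                neighbors.add(e.getKey());
            }
        }
        Collections.sort(neighbors); // stable output order for readability
        return neighbors;
    }
}
```

Nearest N always returns a fixed-size neighborhood, even if some neighbors are only weakly similar; a threshold keeps quality constant but may return few or no neighbors.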
Components: Recommender
• Given a DataModel, a definition of similarity between users (items) and a definition of neighborhood, a recommender outputs an estimated relevance for each unseen item
• Which recommendation algorithms are implemented?
– User-based CF
– Item-based CF
– SVD-based CF
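One common way to turn a neighborhood into an estimate for an unseen item is a similarity-weighted average of the neighbors' ratings. The sketch below illustrates that idea in plain Java; it is not claimed to match Mahout's exact formula, the names are hypothetical, and ignoring neighbors with non-positive similarity is an assumption of this sketch.

```java
import java.util.Map;

// Illustrative sketch only (hypothetical names); not Mahout's implementation.
class UserBasedEstimateSketch {

    /**
     * neighborSimilarity: neighbor id -> similarity to the target user.
     * neighborRatings:    neighbor id -> rating given to the item (absent if unrated).
     * Returns NaN when no neighbor with positive similarity rated the item.
     */
    static double estimate(Map<Long, Double> neighborSimilarity,
                           Map<Long, Double> neighborRatings) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<Long, Double> e : neighborSimilarity.entrySet()) {
            Double rating = neighborRatings.get(e.getKey());
            double similarity = e.getValue();
            if (rating == null || similarity <= 0.0) {
                continue; // assumption: skip unrated items and non-positive weights
            }
            weightedSum += similarity * rating;
            totalWeight += similarity;
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }
}
```

Ranking the unseen items by this estimate and keeping the top few gives a recommendation list.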
Download Mahout
• Download
– The latest Mahout release is 0.9
– Available at: http://apache.fastbull.org/mahout/0.9/mahout-distribution-0.9.zip
– Extract all the libraries and include them in a new NetBeans (Eclipse) project
• Requirement: Java 1.6.x or greater.
• Hadoop is not mandatory!
Exercise 1: First Recommender
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.*;
import org.apache.mahout.cf.taste.impl.neighborhood.*;
import org.apache.mahout.cf.taste.impl.recommender.*;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.*;
import org.apache.mahout.cf.taste.neighborhood.*;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.*;

class RecommenderIntro {
    private RecommenderIntro() { }

    public static void main(String[] args) throws Exception {
        // Load the user preferences from a file
        DataModel model = new FileDataModel(new File("data/mydata.dat"));
        // Similarity between users: Pearson correlation
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Neighborhood: the 2 users most similar to the target user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top-1 recommendation for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 1);
        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }
    }
}
Exercise 1: First Recommender (FileDataModel)
DataModel model = new FileDataModel(new File("data/mydata.dat"));
Exercise 1: First Recommender (Neighbors)
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
Exercise 1: First Recommender (Top-1 Recommendation for User 1)
List<RecommendedItem> recommendations = recommender.recommend(1, 1);
Output
Exercise 2: MovieLens Recommender
• Download the MovieLens dataset (100k)
– http://files.grouplens.org/datasets/movielens/ml-100k.zip
• Next: now we can run the recommendation framework against a state-of-the-art dataset
• Similarity calculations with a bigger dataset
MovieLens Dataset
u.data
• The full u data set: 100000 ratings by 943 users on 1682 items.
• Each user has rated at least 20 movies.
• Users and items are numbered consecutively from 1.
• The data is randomly ordered.
• This is a tab separated list of: user id | item id | rating | timestamp
Data Processing
• The raw file is converted to comma-separated (user id, item id, rating) lines for use with Mahout
cut -f1,2,3 u.data | tr "\t" "," > u.data1
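For environments without cut and tr, the same conversion can be sketched in Java. The class MovieLensConvertSketch and the method toCsv are hypothetical names used for illustration; the sketch assumes each input line has at least the three tab-separated fields user id, item id and rating.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only (hypothetical names).
class MovieLensConvertSketch {

    /** Converts "user<TAB>item<TAB>rating<TAB>timestamp" lines to "user,item,rating". */
    static List<String> toCsv(List<String> tabSeparatedLines) {
        List<String> out = new ArrayList<>();
        for (String line : tabSeparatedLines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t");
            if (fields.length < 3) {
                throw new IllegalArgumentException("Expected at least 3 fields: " + line);
            }
            // Keep the first three columns and drop the timestamp.
            out.add(fields[0] + "," + fields[1] + "," + fields[2]);
        }
        return out;
    }
}
```

The input lines could be read with java.nio.file.Files.readAllLines and the result written with Files.write, for example from u.data to data/u.data1.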
package dnslab.recommender;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommend {
    public static void main(String[] args) {
        try {
            // Load the converted MovieLens ratings
            DataModel dm = new FileDataModel(new File("data/u.data1"));
            ItemSimilarity sim = new TanimotoCoefficientSimilarity(dm);
            GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(dm, sim);
            int processed = 0;
            for (LongPrimitiveIterator items = dm.getItemIDs(); items.hasNext();) {
                long itemId = items.nextLong();
                // The 5 items most similar to the current item
                List<RecommendedItem> recommendations = recommender.mostSimilarItems(itemId, 5);
                for (RecommendedItem recommendation : recommendations) {
                    System.out.println(itemId + "," + recommendation.getItemID() + "," + recommendation.getValue());
                }
                // Stop after the first 10 items
                if (++processed >= 10) {
                    break;
                }
            }
        } catch (IOException e) {
            System.out.println("There was an error.");
            e.printStackTrace();
        } catch (TasteException e) {
            System.out.println("There was a Taste Exception");
            e.printStackTrace();
        }
    }
}
Output
• Each output line lists: item, similar item, preference similarity
Thank you