SDEC2011 Essentials of Mahout
![Page 1: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/1.jpg)
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.
Essentials of Mahout
Mastering Hadoop Map-reduce for Data Analysis
Shashank Tiwari
blog: shanky.org | twitter: @tshanky
![Page 2: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/2.jpg)
What is Apache Mahout?
• A scalable machine learning infrastructure
• Built on top of Hadoop MapReduce
• Currently supports:
• Clustering, classification, collaborative filtering, and more
![Page 3: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/3.jpg)
A Little History
• Founded by folks active in the Lucene community
• Inspired by work at Stanford: “Map-Reduce for Machine Learning on Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
![Page 4: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/4.jpg)
Project Goal
• Create a community-driven, scalable, and robust machine learning infrastructure
• Leverage Hadoop for parallel processing and scalability
• Provide an abstraction on top of Hadoop so that machine-learning users need not deal with the map and reduce primitives when they build their solutions.
![Page 5: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/5.jpg)
Supported Algorithms
• Collaborative Filtering
• User-based and item-based recommenders
• K-Means, Fuzzy K-Means clustering
• Mean Shift clustering
• Dirichlet process clustering
![Page 6: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/6.jpg)
More Supported Algorithms
• Latent Dirichlet Allocation
• Singular value decomposition
• Parallel Frequent Pattern mining
• Complementary Naive Bayes classifier
• Random forest (decision-tree-based) classifier
• ...and growing
![Page 7: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/7.jpg)
Focus Areas
• Collaborative Filtering
• Clustering
• Classification
![Page 8: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/8.jpg)
Build and Install
• Required Software:
• Java 1.6.x
• Maven 2.0.11+
• Get source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
• Compile & install core & examples: mvn install
• Alternatively, run mvn compile, mvn package, and mvn install individually
![Page 9: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/9.jpg)
Recommendation Examples
• mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"
• https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples
![Page 10: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/10.jpg)
Common Use Cases
• Shopping: Amazon, Netflix
• Who to follow/friend: Twitter/Facebook
• Web resource classification, spam filtering, pattern recognition in financial markets
![Page 11: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/11.jpg)
Collaborative Filtering Basis
• User-based: recommend items by finding similar users. User preferences keep changing so this method poses challenges.
• Item-based: calculate similarity between items and make recommendations. Usually items don’t change much so the method is often reliable.
• Slope-one: fast and efficient item-based recommendation for cases where user ratings are more than boolean yes/no or like/dislike values.
• Model-based: provide recommendations by building a model of users and their ratings.
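To make the slope-one idea concrete, here is a minimal plain-Java sketch of the weighted Slope One scheme. It is an illustration only and not Mahout's implementation; the class name, method name, and nested-map layout are assumptions chosen for brevity.

```java
import java.util.Map;

/** Toy weighted Slope One predictor; a plain-Java illustration, not Mahout's code. */
final class SlopeOneSketch {

    /**
     * Predicts a rating for targetItem from one user's known ratings.
     * allRatings maps userId -> (itemId -> rating).
     * Returns NaN when the target shares no co-rated items with the user's items.
     */
    static double predict(Map<String, Map<String, Double>> allRatings,
                          Map<String, Double> userRatings,
                          String targetItem) {
        double numerator = 0.0;
        int totalCount = 0;
        for (Map.Entry<String, Double> known : userRatings.entrySet()) {
            String otherItem = known.getKey();
            if (otherItem.equals(targetItem)) {
                continue;
            }
            // Average difference (target - other) over users who rated both items.
            double diffSum = 0.0;
            int count = 0;
            for (Map<String, Double> r : allRatings.values()) {
                Double a = r.get(targetItem);
                Double b = r.get(otherItem);
                if (a != null && b != null) {
                    diffSum += a - b;
                    count++;
                }
            }
            if (count == 0) {
                continue;
            }
            double averageDiff = diffSum / count;
            // Weight each estimate by the number of users who support it.
            numerator += (known.getValue() + averageDiff) * count;
            totalCount += count;
        }
        return totalCount == 0 ? Double.NaN : numerator / totalCount;
    }
}
```

The scheme only needs average rating differences between item pairs, which is why it is cheap to compute and update.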
![Page 12: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/12.jpg)
Clustering Basis
• Clustering algorithms also use the notion of similarity to group similar items into a cluster.
• Both collaborative filtering and clustering use the notion of a distance, which can be calculated using a number of different techniques.
• Example: Euclidean distance, the square root of the sum of squared differences between two points' coordinates
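As a small illustration of the distance notion, the sketch below computes the Euclidean distance between two equal-length vectors in plain Java. The class and method names are hypothetical, and turning the distance into a similarity via 1 / (1 + d) is one common convention, assumed here rather than taken from the slides.

```java
/** Plain-Java sketch of Euclidean distance and a derived similarity score. */
final class EuclideanSketch {

    /** Square root of the sum of squared coordinate differences. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Maps a distance in [0, infinity) to a similarity in (0, 1]; identical points score 1. */
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }
}
```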
![Page 14: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/14.jpg)
Mahout Taste Framework
• Taste Collaborative Filtering:
• Taste is an open source project for CF started by Sean Owen on SourceForge and donated to Mahout in 2008.
• Has been applied to a number of different data sets successfully.
• Mahout supports building recommendation engines primarily on the basis of the Taste library.
• The library supports both user-based and item-based recommendations.
• Can be used with Java or over RESTful web-service endpoints.
![Page 15: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/15.jpg)
Taste Framework: Primary Classes
• DataModel: Model for Users, Items, and Preferences
• UserSimilarity: Interface defining the similarity between two users
• ItemSimilarity: Interface defining the similarity between two items
• Recommender: Interface for providing recommendations
• UserNeighborhood: Interface for computing a neighborhood of similar users. These are used by the Recommenders.
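The way these roles fit together can be sketched in plain Java. The class below is a toy stand-in, not Mahout's API: every name and signature in it is hypothetical, and the similarity measure (1 / (1 + Euclidean distance) over co-rated items) is an assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy user-based pipeline mirroring the roles listed above; names are hypothetical. */
final class TasteRolesSketch {

    /** Data model role: userId -> (itemId -> preference). */
    final Map<Long, Map<Long, Double>> prefs;

    TasteRolesSketch(Map<Long, Map<Long, Double>> prefs) {
        this.prefs = prefs;
    }

    /** User similarity role: 1 / (1 + distance) over co-rated items, 0 if none are shared. */
    double userSimilarity(long u, long v) {
        double sumOfSquares = 0.0;
        int shared = 0;
        for (Map.Entry<Long, Double> e : prefs.get(u).entrySet()) {
            Double other = prefs.get(v).get(e.getKey());
            if (other != null) {
                double diff = e.getValue() - other;
                sumOfSquares += diff * diff;
                shared++;
            }
        }
        return shared == 0 ? 0.0 : 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    /** Neighborhood role: the n users most similar to u. */
    List<Long> neighborhood(long u, int n) {
        List<Long> others = new ArrayList<>(prefs.keySet());
        others.remove(Long.valueOf(u));
        others.sort(Comparator.comparingDouble((Long v) -> -userSimilarity(u, v)));
        return others.subList(0, Math.min(n, others.size()));
    }

    /** Recommender role: rank items unseen by u by similarity-weighted neighbor preferences. */
    List<Long> recommend(long u, int neighbors, int howMany) {
        Map<Long, Double> weightedSum = new HashMap<>();
        Map<Long, Double> weightTotal = new HashMap<>();
        for (long v : neighborhood(u, neighbors)) {
            double sim = userSimilarity(u, v);
            if (sim <= 0.0) {
                continue;
            }
            for (Map.Entry<Long, Double> e : prefs.get(v).entrySet()) {
                if (prefs.get(u).containsKey(e.getKey())) {
                    continue; // skip items the user has already rated
                }
                weightedSum.merge(e.getKey(), sim * e.getValue(), Double::sum);
                weightTotal.merge(e.getKey(), sim, Double::sum);
            }
        }
        List<Long> items = new ArrayList<>(weightedSum.keySet());
        items.sort(Comparator.comparingDouble(
                (Long i) -> -(weightedSum.get(i) / weightTotal.get(i))));
        return items.subList(0, Math.min(howMany, items.size()));
    }
}
```

In the real library each role is a separate interface with interchangeable implementations; folding them into one class here only keeps the sketch short.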
![Page 16: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/16.jpg)
Taste Framework: Online vs Offline
• Can do online recommendations for a few thousand data sets.
• Leverages Hadoop for offline recommendation calculations on large data sets.
![Page 17: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/17.jpg)
Understanding the Group Lens Implementation
• Provides insight into a sample Mahout Taste framework implementation.
• Uses the publicly available GroupLens data set
• Part of the distribution so you can analyze it, modify it, and use it as an inspiration for your own implementation
• Easy to follow example
![Page 18: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/18.jpg)
Group Lens Implementation Source
• GroupLensDataModel.java
• GroupLensRecommender.java
• GroupLensRecommenderBuilder.java
• GroupLensRecommenderEvaluatorRunner.java
![Page 19: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/19.jpg)
Group Lens Runner -- evaluator
• Instantiates an evaluator:
• RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
• a “mean absolute error” metric
• Parses input parameters:
• File ratingsFile = TasteOptionParser.getRatings(args);
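The score such an evaluator reports is the average absolute difference between predicted and actual ratings. A minimal plain-Java sketch of that computation follows; the class and method names are hypothetical, and Mahout's evaluator gathers the predicted/actual pairs itself rather than taking arrays.

```java
/** Plain-Java sketch of the average absolute difference (mean absolute error). */
final class MeanAbsoluteErrorSketch {

    /** Averages |predicted - actual| over all pairs; lower values mean better predictions. */
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / predicted.length;
    }
}
```

A score of 1.0 on a five-star scale would mean the predictions are off by one star on average.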
![Page 20: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/20.jpg)
Group Lens Runner -- data model
• Parses a ratings file whose fields are delimited by colons:
• DataModel model = ratingsFile == null ? new GroupLensDataModel() : new GroupLensDataModel(ratingsFile);
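For a feel of what parsing such a file involves, here is a plain-Java sketch that reads one line in the layout userID::movieID::rating::timestamp, which is the layout assumed here for ratings.dat. The class is hypothetical and is not GroupLensDataModel itself.

```java
/** Plain-Java sketch: one parsed line of a "::"-delimited ratings file. */
final class RatingLineSketch {
    final long userId;
    final long itemId;
    final float rating;

    RatingLineSketch(long userId, long itemId, float rating) {
        this.userId = userId;
        this.itemId = itemId;
        this.rating = rating;
    }

    /** Parses "userID::itemID::rating[::timestamp]"; the timestamp, if present, is ignored. */
    static RatingLineSketch parse(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected userID::itemID::rating, got: " + line);
        }
        return new RatingLineSketch(
                Long.parseLong(fields[0]),
                Long.parseLong(fields[1]),
                Float.parseFloat(fields[2]));
    }

    /** Renders the preference as a comma-separated triple. */
    String toCsv() {
        return userId + "," + itemId + "," + rating;
    }
}
```

Converting each line to a comma-separated triple is a convenient intermediate form, since user, item, and rating are all a preference needs.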
![Page 21: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/21.jpg)
Group Lens Runner -- evaluate with recommendation builder
• evaluates using GroupLensRecommender
• double evaluation = evaluator.evaluate(new GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);
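In the call above, 0.9 is the training percentage and 0.3 is the evaluation percentage. The plain-Java sketch below shows roughly how two such fractions can drive a hold-out split: the evaluation percentage samples which users take part, and the training percentage divides each sampled user's preferences between training and held-out data. The class is hypothetical, and Mahout's evaluator may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Plain-Java sketch of a hold-out split driven by two percentages. */
final class HoldOutSplitSketch<T> {
    final List<T> training = new ArrayList<>();
    final List<T> heldOut = new ArrayList<>();

    /**
     * prefsByUser maps userId -> that user's preferences.
     * Both percentages are fractions between 0.0 and 1.0.
     */
    HoldOutSplitSketch(Map<Long, List<T>> prefsByUser,
                       double trainingPercentage,
                       double evaluationPercentage,
                       Random random) {
        for (List<T> userPrefs : prefsByUser.values()) {
            // Sample users: only about evaluationPercentage of them take part.
            if (random.nextDouble() >= evaluationPercentage) {
                continue;
            }
            // Split the sampled user's preferences into training and held-out sets.
            for (T pref : userPrefs) {
                if (random.nextDouble() < trainingPercentage) {
                    training.add(pref);
                } else {
                    heldOut.add(pref);
                }
            }
        }
    }
}
```

The recommender is then built from the training set and scored on how closely it predicts the held-out ratings; sampling fewer users trades accuracy of the estimate for a faster run.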
![Page 22: SDEC2011 Essentials of Mahout](https://reader033.vdocuments.net/reader033/viewer/2022051514/54b777ce4a795918738b467d/html5/thumbnails/22.jpg)
Questions?
• blog: shanky.org | twitter: @tshanky