Machine Learning and Apache Mahout: An Introduction
DESCRIPTION
An introductory presentation on Machine Learning and Apache Mahout. I presented it at the BigData Meetup - Pune Chapter's first meetup (http://www.meetup.com/Big-Data-Meetup-Pune-Chapter/).
TRANSCRIPT
+
Varad Meru
Software Development Engineer
Orzota, Inc.
about.me/vrdmr
Machine Learning
and Apache Mahout
© Varad Meru, 2013
+Who Am I
Orzota, Inc.: Making BigData Easy. Designing a cloud-based platform for ETL and analytics.
Past Work Experience: Persistent Systems Ltd. Recommendation engines and user behavior analytics.
Areas of Interest: Machine Learning, Distributed Systems, Recommendation Engines
+Outline
Introduction
Machine Learning: Introduction and History, Types of Learning Algorithms, Applications, What’s New
Apache Mahout: History, Architecture, Applications and Examples
Conclusion
+
Machine Learning: Rise of the Machine-Era
+Introduction
Term coined by Arthur Samuel: “Field of study that gives computers the ability to learn without being explicitly programmed”.
Branch of Artificial Intelligence and Statistics
Focuses on prediction based on known properties
Used as a sub-process in Data Mining. Data Mining focuses on discovering new, unknown properties.
“Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”
+Learning Algorithms
Supervised Learning: Labelled input data. Creating classifiers to predict unseen inputs.
Unsupervised Learning: Unlabelled input data. Creating a function to predict the relation and output.
Semi-Supervised Learning: Combines Supervised and Unsupervised Learning methodology.
Reinforcement Learning: Reward-Punishment based agent.
+Supervised Learning: Introduction
Learn from the Data
Data is already labelled: expert, crowd-sourced or case-based labelling of data.
Applications: Handwriting Recognition, Spam Detection, Information Retrieval, Personalisation based on ranks, Speech Recognition
+Supervised Learning: Algorithms
Decision Trees
k-Nearest Neighbours
Naive Bayes
Logistic Regression
Perceptron and Multi-layer Perceptrons
Neural Networks
SVM and Kernel estimation
+Supervised Learning (Example: Naive Bayes Classifier)
Figure: word map of a speech by President Obama
+Supervised Learning (Example: Naive Bayes Classifier)
Figure: word map of a spam document
+Supervised Learning (Example: Naive Bayes Classifier)
Running a test on the Classifier
Input to the classifier: “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”
Output of the classifier: Spam Bin
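The spam test above can be imitated with a very small Naive Bayes text classifier. This is only a sketch under stated assumptions: the four training sentences are made up for illustration, the function names `train` and `classify` are hypothetical, and the code uses word counts with add-one (Laplace) smoothing rather than whatever feature weighting the talk's classifier used.

```python
import math
from collections import Counter

def train(docs):
    """Count words per label and documents per label from (label, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the product.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data, invented for this sketch.
training_docs = [
    ("spam", "order a trial now new summer savings"),
    ("spam", "welcome new savings order today"),
    ("ham", "the president gave a speech on the economy"),
    ("ham", "congress and the president discussed jobs"),
]
word_counts, doc_counts = train(training_docs)
label = classify("order a trial summer savings welcome", word_counts, doc_counts)
print(label)
```

With a realistic corpus the word maps on the previous slides correspond to the per-label `Counter` objects built in `train`.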
+Unsupervised Learning: Introduction
Finding hidden structure in data
Unlabelled Data
SMEs (subject-matter experts) are needed in post-processing to verify, validate and use the output
Used in exploratory analysis rather than predictive analytics
Applications: Pattern Recognition; Groupings based on a distance measure (groups of people, objects, ...)
+Unsupervised Learning: Algorithms
Clustering: k-Means, MinHash, Hierarchical Clustering
Hidden Markov Models
Feature Extraction methods
Self-organizing Maps (Neural Nets)
+Unsupervised Learning (Example: K-Means)
Source: http://apandre.wordpress.com/visible-data/cluster-analysis/
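Since the K-Means figure does not survive in the transcript, a minimal sketch of the algorithm may help. It assumes 2-D points stored as tuples, squared Euclidean distance, a fixed number of iterations, and a seeded random choice of starting centroids; the function name `kmeans` and the sample points are illustrative only.

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain K-Means: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Two visibly separate groups of hypothetical points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

Each pass of the loop corresponds to one "iteration" picture in a typical K-Means animation: points are recoloured, then the centroids move.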
+Learning Problem: Cat and Dog Problem
Humans can easily classify which is a cat and which is a dog.
But how can a computer do that?
Some attempts used clustering mechanisms to solve it – Co-occurrence Clustering, Deep Learning
+
Apache Mahout: Scalable Machine Learning Library
+History and Etymology
Inspired by “Map-Reduce for Machine Learning on Multicore”, Ng et al.
Written in Java. Apache License.
Founders: Mahout – Isabel Drost, Grant Ingersoll, Karl Wettin. Taste – Sean Owen.
Mahout – Keeper/Driver of Elephants.
Current Release – 0.8 (stable)
+Need
BigData: Ever-growing data. Yesterday’s methods to process tomorrow’s data. Cheap storage.
Scalable from the ground up: Should be built on top of any existing distributed systems framework. Should contain distributed versions of ML algorithms.
Size | Classification | Tools
Lines | Sample Data | Analysis and Visualisation: Whiteboard, Bash, ...
KBs – low MBs | Prototype Data | Analysis and Visualisation: Matlab, Octave, R, Processing, Bash, ...
MBs – low GBs | Online Data | Storage: MySQL (DBs), ...; Analysis: NumPy, SciPy, Pandas, Weka, ...; Visualisation: Flare, AmCharts, Raphael
GBs – TBs – PBs | Big Data | Storage: HDFS, HBase, Cassandra, ...; Analysis: Hive, Giraph, Hama, Mahout
+Mahout Modules
Applications
Evolutionary Algorithms, Classification, Clustering, Recommenders, Regression, FPM, Dimension Reduction
Utilities: Lucene/Vectorizer
Math: Vectors/Matrices/SVD
Collections (Primitives)
Hadoop
+Recommender Systems
+Recommender Systems: Introduction
Types of Recommender Systems:
Content Based Recommendations
Collaborative Filtering Recommendations (User-User Recommendations, Item-Item Recommendations)
Dimensionality Reduction (SVD) Recommendations
Applications:
Products you would like to buy
People you might want to connect with
Potential Life-Partners
Recommending Songs you might like
...
+Recommender Systems: Collaborative Filtering in Action
Assuming people have seen at least one movie. Cold Start?
1: seen
0: not seen
+Collaborative Filtering in Action: Tanimoto Coefficient
N_A – Number of Customers who bought A
N_B – Number of Customers who bought B
N_C – Number of Customers who bought A and B
$$T(a, b) = \frac{N_C}{N_A + N_B - N_C}$$
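A small worked example of the Tanimoto coefficient, assuming each item is represented by the set of customers who bought it. The function name `tanimoto` and the customer IDs are hypothetical.

```python
def tanimoto(customers_a, customers_b):
    """Tanimoto coefficient: shared buyers divided by buyers of either item."""
    n_a = len(customers_a)
    n_b = len(customers_b)
    n_c = len(customers_a & customers_b)  # customers who bought both A and B
    denominator = n_a + n_b - n_c
    return n_c / denominator if denominator else 0.0

# Hypothetical purchase data: 4 buyers of A, 3 buyers of B, 2 in common.
bought_a = {"u1", "u2", "u3", "u4"}
bought_b = {"u3", "u4", "u5"}
score = tanimoto(bought_a, bought_b)
print(score)  # 2 / (4 + 3 - 2) = 0.4
```

The score is 1.0 when both items have exactly the same buyers and 0.0 when they share none.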
+Collaborative Filtering in Action: Cosine Coefficient
N_A – Number of Customers who bought A
N_B – Number of Customers who bought B
N_C – Number of Customers who bought A and B
$$C(a, b) = \frac{N_C}{\sqrt{N_A \cdot N_B}}$$
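For 1/0 (bought / not bought) data, the cosine coefficient reduces to the count of shared buyers divided by the square root of the product of the two buyer counts. The sketch below assumes each item is a list of 0/1 flags, one per customer; `cosine_binary` and the sample rows are illustrative names and data.

```python
import math

def cosine_binary(row_a, row_b):
    """Cosine coefficient for two 0/1 vectors of equal length."""
    n_a = sum(row_a)                                 # customers who bought A
    n_b = sum(row_b)                                 # customers who bought B
    n_c = sum(a & b for a, b in zip(row_a, row_b))   # customers who bought both
    if n_a == 0 or n_b == 0:
        return 0.0
    return n_c / math.sqrt(n_a * n_b)

# Hypothetical rows: one flag per customer, 1 = bought, 0 = not bought.
item_a = [1, 1, 1, 1, 0]
item_b = [0, 0, 1, 1, 1]
score = cosine_binary(item_a, item_b)
print(round(score, 4))  # 2 / sqrt(4 * 3)
```

On the same counts the cosine score (about 0.577 here) is higher than the Tanimoto score (0.4), because its denominator grows more slowly.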
+Apache Mahout: Recommender System Architecture
Two Modes: Stand-alone, non-distributed (“Taste”); Scalable, distributed algorithmic version for Collaborative Filtering
Top-level Packages: Data Model, User Similarity, Item Similarity, User Neighbourhood, Recommender
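To show how the packages listed above fit together, here is a toy user-based recommender. It is not Mahout's API: the dictionary `data_model` and the functions `user_similarity`, `user_neighbourhood` and `recommend` are hypothetical stand-ins that only mirror the roles of the Data Model, User Similarity, User Neighbourhood and Recommender components, and Tanimoto similarity is an assumed choice.

```python
# Data Model: which items each user has interacted with (hypothetical data).
data_model = {
    "alice": {"m1", "m2", "m3"},
    "bob": {"m1", "m2", "m4"},
    "carol": {"m2", "m3", "m5"},
    "dave": {"m6"},
}

def user_similarity(model, u, v):
    """User Similarity: Tanimoto coefficient between two users' item sets."""
    common = len(model[u] & model[v])
    union = len(model[u] | model[v])
    return common / union if union else 0.0

def user_neighbourhood(model, user, n):
    """User Neighbourhood: the n users most similar to the given user."""
    others = [v for v in model if v != user]
    others.sort(key=lambda v: user_similarity(model, user, v), reverse=True)
    return others[:n]

def recommend(model, user, n_neighbours=2, how_many=3):
    """Recommender: score unseen items by the similarity of neighbours who have them."""
    scores = {}
    for neighbour in user_neighbourhood(model, user, n_neighbours):
        similarity = user_similarity(model, user, neighbour)
        for item in model[neighbour] - model[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:how_many]

recommendations = recommend(data_model, "alice")
print(recommendations)
```

Swapping `user_similarity` for another measure, or ranking items against items instead of users against users, changes one component without touching the rest, which is the point of the layered package design.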
+Naive Bayes Classifier
Input to the classifier: “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”
+Naive Bayes Classifier
Naive Bayes is a fairly complex process in Mahout: training the classifier requires four separate Hadoop jobs.
Training:
1. Read the Features
2. Calculate per-Document Statistics
3. Normalize across Categories
4. Calculate normalizing factor of each label
Testing: Classification (fifth job, explicitly invoked)
+K-Means Clustering: Iterations
+K-Means Clustering: MapReduce Version
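One way to picture the MapReduce version is that each K-Means iteration becomes one map/reduce pass: the map step tags every point with its nearest centroid, and the reduce step averages the points that share a tag. The sketch below imitates that in a single Python process. It is not Mahout's Hadoop job, and `map_point`, `reduce_cluster` and `kmeans_iteration` are hypothetical names.

```python
from collections import defaultdict

def map_point(point, centroids):
    """Map: emit (index of nearest centroid, point)."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )
    return nearest, point

def reduce_cluster(points):
    """Reduce: the new centroid is the mean of all points with the same key."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans_iteration(points, centroids):
    """One iteration: map every point, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # A centroid that attracted no points keeps its previous position.
    return [
        reduce_cluster(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]

# Hypothetical data and starting centroids.
points = [(1.0, 1.0), (2.0, 1.0), (9.0, 9.0), (10.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
print(new_centroids)  # [(1.5, 1.0), (9.5, 9.0)]
```

In a distributed run the driver would repeat this pass, feeding `new_centroids` back in, until the centroids stop moving or an iteration limit is reached.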
+Summary
• Machine Learning
  • Learning Algorithms
  • Varied Applications
• Mahout
  • Scaling to Giga/Tera/Peta Scale
  • Free and Open Source
+More Info.
1. “Scalable Similarity-Based Neighborhood Methods with MapReduce” by Sebastian Schelter, Christoph Boden and Volker Markl. – RecSys 2012.
2. “Case Study Evaluation of Mahout as a Recommender Platform” by Carlos E. Seminario and David C. Wilson - Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012)
3. http://mahout.apache.org/ - Apache Mahout Project Page
4. http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout
5. [VIDEO] “Collaborative filtering at scale” by Sean Owen
6. [BOOK] “Mahout in Action” by Owen et al., Manning Pub.
+
Questions?
+Thank You
Go BigData!!!
© Varad Meru, 2014