mahout classification presentation
DESCRIPTION
These slides were presented in class on April 7th, 2014.
TRANSCRIPT
Classification on Mahout
Naoki Nakatani
San Jose State University
CS185C Spring 2014
Agenda
● Classification Overview
● Mahout Overview
  ○ Classification on Mahout
● Case Study with Demo
  ○ Problem Description
  ○ Working Environment
  ○ Data Preparation
  ○ ML Model Generation
Classification?
● Classifying examples into a given set of categories
● Supervised learning
  ○ Prepare data
  ○ Build classifier (train & test)
  ○ Apply classifier to new data
http://www.ndm.net/opentext/images/stories/images/extraction_cmyk_thumb.jpg
Mahout?
● Scalable machine learning library = can handle Big Data
● Runs on Hadoop, with data stored in HDFS
● Classification, Clustering, Collaborative Filtering, etc.
http://www.robinanil.com/wp-content/uploads/2010/03/mahout-logo-200.png
Classification on Mahout?
Classifying examples into a given set of categories
Scalable machine learning library that can handle big data
Classifying big data into a given set of categories
Case Study & Demo
Given a question with a title and a body, can we automatically generate tags for it?
Where can I find the LaTeX3 manual?
Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web.
Does anyone have a link?
Documentation
latex3
expl3
Dataset
File:
● TrainSmall.tsv
Fields:
● id, title, body, tags
Characteristics:
● Each question contains only one tag
Working Environment
● Mac OS 10.9.1
● Eclipse 4.3.2
● Hadoop 1.2.1
● Mahout 0.9
● Source code available here.
Prerequisite (Where are you?)
● You have the input TSV file at result > output-topfivetags.
● You are in the “result” directory in Terminal.
● The “hadoop” and “mahout” commands work.
Prepare Data
1. Convert the TSV file to Hadoop sequence file format. Specify the tag as the category. (Run TSVToSeq.java)
The output-tsvtoseq folder and the chunk-0 file are created.
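To make the conversion step more concrete, here is a minimal Python sketch of the records such a converter might write. It is not the TSVToSeq.java from the slides. The header row, the "/<tag>/<id>" key layout, and joining title and body into one value are all assumptions, chosen so that the category can later be read back from the key.

```python
import csv
import io

def tsv_to_records(tsv_text):
    """Yield (key, value) pairs in the shape a sequence-file writer could store."""
    # Assumption: the TSV has a header row naming id, title, body, tags.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Assumed key layout "/<tag>/<id>": the tag acts as the category.
        key = "/{}/{}".format(row["tags"], row["id"])
        # Assumed value: title and body joined into one text document.
        value = "{} {}".format(row["title"], row["body"])
        yield key, value

# Hypothetical one-row sample built from the question shown earlier.
sample = (
    "id\ttitle\tbody\ttags\n"
    "1\tWhere can I find the LaTeX3 manual?\tDoes anyone have a link?\tlatex3\n"
)
records = list(tsv_to_records(sample))
```

In the real pipeline these pairs would be written as Text keys and Text values into the sequence file; the sketch only shows the mapping from one row to one record.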
Prepare Data
1. Make a directory in HDFS and upload chunk-0 (the sequence file) to it.
hadoop fs -mkdir <directory>
hadoop fs -put <source> <destination>
Prepare Data
2. Transform questions into vectors. (mahout seq2sparse)
mahout seq2sparse -i <input directory> -o <output directory>
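As a rough illustration of what "transforming questions into vectors" means, the sketch below builds sparse TF-IDF vectors in plain Python. The tokenizer and the weighting formula (term count times log(N / document frequency)) are simplifying assumptions; seq2sparse uses its own analyzer and weighting, so the exact numbers it produces will differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(documents):
    """Return (term -> column index, list of sparse vectors as dicts)."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    index = {term: i for i, term in enumerate(vocabulary)}
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {index[t]: c * math.log(n_docs / doc_freq[t])
                   for t, c in counts.items()}
        # Keep the vector sparse: drop terms whose weight is zero.
        vectors.append({i: w for i, w in weights.items() if w > 0})
    return index, vectors

# Hypothetical mini-corpus.
docs = ["latex3 manual pdf", "latex3 syntax", "hadoop manual"]
index, vectors = tfidf_vectors(docs)
```

Terms that occur in few documents get a larger weight than terms that occur in many, which is the property the classifier benefits from.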
Prepare Data
3. Split data into
a. Train set: to train the model
b. Test set: to test the model
mahout split \
  -i <input directory> \
  --trainingOutput <output dir to train> \
  --testOutput <output dir to test> \
  --randomSelectionPct <integer> \
  --overwrite \
  --sequenceFiles \
  -xm sequential
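The idea behind the split can be sketched in a few lines of Python. Here test_pct stands in for the value passed to --randomSelectionPct, read as the percentage of items held out for testing. The function name, the seed parameter, and the selection mechanism are illustrative assumptions and may not match how Mahout picks items internally.

```python
import random

def split_train_test(items, test_pct, seed=0):
    """Randomly hold out test_pct percent of items as the test set."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    n_test = round(len(items) * test_pct / 100)
    test_idx = set(rng.sample(range(len(items)), n_test))
    train = [x for i, x in enumerate(items) if i not in test_idx]
    test = [x for i, x in enumerate(items) if i in test_idx]
    return train, test

# Hypothetical data: ten vector ids, 20 percent held out.
train, test = split_train_test(list(range(10)), 20)
```

Every item ends up in exactly one of the two sets, so the test set contains only examples the model has never seen during training.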
Build Classifier
1. Choose an algorithm to use for classification
Available algorithms:
○ Naive Bayes
  ■ trainnb, testnb
  ■ org.apache.mahout.classifier.naivebayes
○ Hidden Markov Model
  ■ baumwelch, hmmpredict
  ■ org.apache.mahout.classifier.sequencelearning.hmm
○ Logistic Regression
  ■ trainlogistic, runlogistic
  ■ org.apache.mahout.classifier.sgd
○ Random Forest
  ■ ?
  ■ ?
2. Train & test the model using the train set
Should yield high accuracy
Build Classifier (Naive Bayes)
mahout trainnb \
  -i <dir to train vectors> \
  -el \
  -li <dir to put label index> \
  -o <dir to put model> \
  -ow \
  -c
mahout testnb \
  -i <dir to train vectors> \
  -m <dir to model> \
  -l <dir to label index> \
  -ow \
  -o <output dir> \
  -c
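For intuition about what training and testing do, here is a small multinomial Naive Bayes sketch with Laplace smoothing in plain Python. It is a textbook version, not Mahout's implementation: the commands above pass -c, which as far as I know selects the complementary variant, and Mahout works on weighted vectors rather than raw token counts. All names and the training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples, alpha=1.0):
    """examples: list of (label, tokens). Returns a simple model dict."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for label, tokens in examples:
        label_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {"labels": label_counts, "terms": term_counts,
            "vocab": vocabulary, "alpha": alpha}

def classify_nb(model, tokens):
    """Return the label with the highest log-probability score."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["labels"].items():
        counts = model["terms"][label]
        total_terms = sum(counts.values())
        score = math.log(doc_count / total_docs)  # log prior
        for t in tokens:
            if t not in model["vocab"]:
                continue  # ignore terms never seen in training
            # Laplace-smoothed log likelihood of the term given the label.
            score += math.log((counts[t] + alpha) /
                              (total_terms + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (tag, tokens of the question).
training = [
    ("latex3", ["latex3", "manual", "syntax"]),
    ("hadoop", ["hdfs", "cluster", "job"]),
    ("latex3", ["expl3", "syntax"]),
    ("hadoop", ["hadoop", "job", "mapreduce"]),
]
model = train_nb(training)
prediction = classify_nb(model, ["syntax", "manual"])
```

Classifying the training examples themselves, as step 2 does, is a sanity check: a model that cannot reproduce the labels it was trained on is unlikely to do well on new data.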
Build Classifier (Naive Bayes)
3. Test the model using the test set
Check whether the accuracy is satisfactory
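Checking accuracy comes down to comparing predicted tags with the actual ones. The sketch below computes overall accuracy and a confusion count per (actual, predicted) pair, which is the kind of summary a test run reports. The labels and predictions are made-up values for illustration, not results from the demo.

```python
from collections import Counter

def evaluate(actual, predicted):
    """Return (accuracy, confusion) for two equal-length label lists."""
    # confusion[(a, p)] = number of examples with actual a predicted as p
    confusion = Counter(zip(actual, predicted))
    correct = sum(c for (a, p), c in confusion.items() if a == p)
    return correct / len(actual), confusion

# Hypothetical labels for four test questions.
actual = ["latex3", "latex3", "hadoop", "hadoop"]
predicted = ["latex3", "hadoop", "hadoop", "hadoop"]
accuracy, confusion = evaluate(actual, predicted)
```

If accuracy on the test set is much lower than on the train set, the model is probably memorizing the training data rather than generalizing.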
Apply Classifier
What do you have at this point?
● model
● label index
You can start classifying new data! (Check this example)
Model
Label Index
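A sketch of how the two pieces fit together when classifying a new question: the model produces one score per category id, and the label index turns the winning id back into a tag. This assumes the label index maps integer ids to tag names and that the highest score wins. The scores and the mapping below are hypothetical values, not output from the demo.

```python
def best_label(scores, label_index):
    """Pick the category id with the highest score and look up its tag."""
    best_id = max(range(len(scores)), key=lambda i: scores[i])
    return label_index[best_id]

# Hypothetical label index (id -> tag) and per-category scores.
label_index = {0: "documentation", 1: "expl3", 2: "latex3"}
scores = [-12.4, -15.1, -9.8]
tag = best_label(scores, label_index)
```

A new question has to go through the same vectorization as the training data (same dictionary and weighting) before the model can score it.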
References
● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages
● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages (part 2: distribute classification with hadoop)
Happy Machine Learning!