Composing Mahout clustering jobs
![Page 2: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/2.jpg)
Bio
● Frank Scholten
● Developer at Amsterdam, NL
● Mahout user / contributor
● http://blog.jteam.nl/author/frank
![Page 3: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/3.jpg)
Agenda
What is clustering?
Introducing Mahout
Clustering StackOverflow
![Page 4: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/4.jpg)
What is clustering?
![Page 5: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/5.jpg)
Clustering - Google News
![Page 6: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/6.jpg)
Why clustering?
● Summarizing data
● Applications
Market analysis – identify customer groups
Biology – identify species
Image compression
many more applications!
![Page 7: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/7.jpg)
Definition
“Cluster analysis or clustering is the assignment of a set of observations into
subsets (called clusters) so that observations in the same cluster are similar in some
sense.”
Source: Wikipedia
![Page 8: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/8.jpg)
2-D Clustering Example
[Figure: 2-D scatter plot. Legend: point, cluster, cluster center. Annotations: inter-cluster distance, intra-cluster distance.]
![Page 9: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/9.jpg)
Distance Measures
Euclidean distance measure
[Figure: points P and Q, with d the distance between them]
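The Euclidean distance d between P and Q can be sketched in plain Java as below. The class and method names are illustrative and not part of Mahout, which ships its own distance measure classes.

```java
final class EuclideanDistance {
    /** Straight-line distance d between points p and q of equal dimension. */
    static double distance(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double diff = p[i] - q[i];
            sum += diff * diff; // squared difference per dimension
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // P = (0, 0), Q = (3, 4)
        System.out.println(distance(new double[] {0, 0}, new double[] {3, 4})); // prints 5.0
    }
}
```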
![Page 10: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/10.jpg)
Vectorization
Vectorize data to measure distances
'The fox chased the dog' → [the => 2, fox => 1, chased => 1, dog => 1]
#0000CD → [wavelength => 475]
“Amsterdam” → [ lat => 52, long => 4]
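A minimal sketch of the text case: counting term frequencies turns a sentence into a sparse vector. The tokenization here (lowercase, split on non-word characters) is an assumption for illustration, not what Mahout's seq2sparse does internally.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class TermFrequency {
    /** Maps each term to the number of times it occurs in the text. */
    static Map<String, Integer> vectorize(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {the=2, fox=1, chased=1, dog=1}
        System.out.println(vectorize("The fox chased the dog"));
    }
}
```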
![Page 11: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/11.jpg)
K-Means Algorithm
Select K random vectors
Specify distance measure + threshold
Every iteration
● Add each vector to its closest cluster
● Recompute each cluster center
● Converged if no center moves more than the threshold
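The loop can be sketched in memory as follows. This is not Mahout's implementation. It makes three assumptions: the initial centers are passed in rather than picked at random, so the result is reproducible; the distance measure is Euclidean; and "converged" means that no center moved more than the threshold in the last iteration.

```java
import java.util.Arrays;

final class KMeansSketch {
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Index of the center closest to v. */
    static int closest(double[][] centers, double[] v) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(centers[c], v) < distance(centers[best], v)) {
                best = c;
            }
        }
        return best;
    }

    /** Iterates until no center moves more than threshold, or maxIter is reached. */
    static double[][] cluster(double[][] vectors, double[][] initial, double threshold, int maxIter) {
        double[][] centers = new double[initial.length][];
        for (int c = 0; c < initial.length; c++) {
            centers[c] = initial[c].clone();
        }
        int dim = vectors[0].length;
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            for (double[] v : vectors) {            // add each vector to its closest cluster
                int c = closest(centers, v);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += v[d];
                }
            }
            boolean converged = true;
            for (int c = 0; c < centers.length; c++) {
                if (counts[c] == 0) {
                    continue;                       // empty cluster keeps its old center
                }
                double[] next = new double[dim];    // recompute center as the mean
                for (int d = 0; d < dim; d++) {
                    next[d] = sums[c][d] / counts[c];
                }
                if (distance(centers[c], next) > threshold) {
                    converged = false;
                }
                centers[c] = next;
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] vectors = {{1, 1}, {1, 2}, {9, 9}, {10, 9}};
        double[][] initial = {{1, 1}, {9, 9}};
        // prints [[1.0, 1.5], [9.5, 9.0]]
        System.out.println(Arrays.deepToString(cluster(vectors, initial, 0.01, 10)));
    }
}
```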
![Page 12: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/12.jpg)
Introducing Mahout
![Page 13: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/13.jpg)
Scalable machine learning
On top of Hadoop, for the most part
Started in 2008
Version 0.5 released last week!
![Page 14: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/14.jpg)
Collaborative filtering
Clustering
Classification
Is this SPAM?
And much more!
![Page 15: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/15.jpg)
Composing several jobs
$ mahout seqdirectory <options>
$ mahout seq2sparse <options>
$ mahout kmeans <options>
$ mahout clusterdump <options>
![Page 16: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/16.jpg)
bin/mahout
$ mahout seqdirectory --help
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/conf
Usage:
[--keyPrefix <keyPrefix> --chunkSize <chunkSize> --charset <charset> --output
<output> --fileFilterClass <fileFilterClass> --help --input <input>]
![Page 17: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/17.jpg)
bin/mahout
● bin/mahout calls MahoutDriver
● MahoutDriver
● Parses options
● Configures other Drivers
![Page 18: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/18.jpg)
KMeansDriver
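import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.kmeans.KMeansDriver;

// Same options as on the command line: mahout kmeans <options>
String[] args = new String[] {
    "--input", input,
    "--output", output,
    "--clusters", clusters,
    "--clustering",
    "--numClusters", "10"
};
ToolRunner.run(conf, new KMeansDriver(), args);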
![Page 19: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/19.jpg)
KMeansDriver
Input vectors: V1 V2 V3 | V4 V5 V6
KMeansMapper (cluster closest to vector): (C1,V1) (C2,V2) (C2,V3) | (C3,V4) (C3,V5) (C4,V6)
KMeansCombiner (combine cluster observations): (C1,V1) (C2,[V2,V3]) | (C3,[V4,V5]) (C4,V6)
KMeansReducer: (C1, Centroid1) (C2, Centroid2) (C3, Centroid3) (C4, Centroid4)
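One iteration of this map, combine, reduce flow can be simulated in plain Java without Hadoop. The sketch below is an illustration under assumptions, not the code of KMeansMapper, KMeansCombiner or KMeansReducer: the combiner is assumed to keep a running sum and count per cluster, and all class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class KMeansMapReduceSketch {
    /** Partial result for one cluster: component-wise sum and number of vectors. */
    static final class Partial {
        final double[] sum;
        int count;

        Partial(int dim) {
            sum = new double[dim];
        }

        void add(double[] values, int n) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += values[i];
            }
            count += n;
        }
    }

    /** Map step: id of the center closest to v (squared Euclidean distance). */
    static int map(double[][] centers, double[] v) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < v.length; i++) {
                double diff = centers[c][i] - v[i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Map + combine for one input split: partial sums keyed by cluster id. */
    static Map<Integer, Partial> mapAndCombine(double[][] centers, double[][] split) {
        Map<Integer, Partial> out = new TreeMap<>();
        for (double[] v : split) {
            int clusterId = map(centers, v);
            out.computeIfAbsent(clusterId, k -> new Partial(v.length)).add(v, 1);
        }
        return out;
    }

    /** Reduce step: merge the partials of all splits and emit one centroid per cluster. */
    static Map<Integer, double[]> reduce(List<Map<Integer, Partial>> combined) {
        Map<Integer, Partial> merged = new TreeMap<>();
        for (Map<Integer, Partial> part : combined) {
            for (Map.Entry<Integer, Partial> e : part.entrySet()) {
                Partial p = e.getValue();
                merged.computeIfAbsent(e.getKey(), k -> new Partial(p.sum.length)).add(p.sum, p.count);
            }
        }
        Map<Integer, double[]> centroids = new TreeMap<>();
        for (Map.Entry<Integer, Partial> e : merged.entrySet()) {
            double[] centroid = e.getValue().sum.clone();
            for (int i = 0; i < centroid.length; i++) {
                centroid[i] /= e.getValue().count;
            }
            centroids.put(e.getKey(), centroid);
        }
        return centroids;
    }
}
```

Because the combiner only forwards sums and counts, the reducer receives one small record per cluster and split instead of every vector.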
![Page 20: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/20.jpg)
Clustering StackOverflow
![Page 21: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/21.jpg)
Clustering StackOverflow
● Publicly available monthly dumps
● Posts ~ 5.5 GB ~ 1.4 M questions
● Inspired by Mahout in Action book
![Page 22: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/22.jpg)
Goal - Tag cloud
(250 tags)
Tag cloud: Ruby, Unit Testing, Mahout, Hadoop, IPhone, Java
(unknown #) Questions
Deploying Mahout on a Hadoop cluster
Datasets for Apache Mahout
Classifying data using Apache Mahout
...
![Page 23: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/23.jpg)
How to cluster?
● Stopwords influence clustering big time!
● Option?
● Cluster on collocations, e.g. “Unit Testing”
● MAHOUT-415
Lucene filter for collocations
![Page 24: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/24.jpg)
Clustering on collocations
Preprocess → Find collocations → Vectorize → Cluster → Join clustered points → Cluster labels → Index
Preprocess: posts → text
Find collocations: Continuous Integration, Unit Testing, Version control
Vectorize: [ unit testing => 1, … ] [ version control => 1, … ]
![Page 25: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/25.jpg)
Preprocess
StackOverflow posts → Extract post body → Interpret & strip HTML → Plain text
<row Id="4234" PostTypeId="1" content="...">
<row Id="136" PostTypeId="2" content="...">
<row Id="985" PostTypeId="1" content="...">
"<p>How to use Mahout</p>"
How to use Mahout
Use Mahout's XMLInputFormat
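The "interpret & strip HTML" step could look roughly like this. The sketch assumes the post body arrives entity-escaped inside the XML attribute and uses a naive regular expression to drop tags. A real job would more likely use an HTML parser, and the class name is hypothetical.

```java
final class HtmlStripper {
    /** Interprets the most common XML entities (only one level of escaping). */
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
    }

    /** Turns an escaped HTML post body into plain text. */
    static String toPlainText(String escapedBody) {
        String html = unescape(escapedBody);
        return html.replaceAll("<[^>]+>", " ") // drop tags
                   .replaceAll("\\s+", " ")    // collapse whitespace
                   .trim();
    }

    public static void main(String[] args) {
        // prints How to use Mahout
        System.out.println(toPlainText("&lt;p&gt;How to use Mahout&lt;/p&gt;"));
    }
}
```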
![Page 26: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/26.jpg)
Find collocations
Unigrams
● “Java”, “Ruby”
Bigrams
● “Continuous Integration”, “Unit Testing”
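Candidate bigrams are simply pairs of adjacent tokens. A minimal sketch, assuming lowercase tokens split on non-word characters (the names are illustrative, not Mahout's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class Bigrams {
    /** Returns every pair of adjacent tokens as "first second". */
    static List<String> bigrams(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\W+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            result.add(tokens[i] + " " + tokens[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [unit testing, testing with, with junit]
        System.out.println(bigrams("Unit testing with JUnit"));
    }
}
```

Most of these pairs are noise ("testing with"), which is why a significance score is needed to keep only real collocations.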
![Page 27: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/27.jpg)
Find collocations
● Use Mahout's CollocDriver
● Compute LogLikelihood Ratio (LLR) (Dunning)
● Select bigrams with high LLR
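For reference, Dunning's log-likelihood ratio for a bigram "A B" is computed from a 2x2 table of counts. The sketch below is a plain-Java version of the standard G² formula, not Mahout's own code. The parameter names are assumptions: k11 counts A followed by B, k12 counts A followed by something else, k21 counts something else followed by B, and k22 counts all remaining bigrams.

```java
final class LogLikelihoodRatio {
    /** One cell's contribution: k * ln(k * N / (rowTotal * colTotal)), 0 for empty cells. */
    private static double term(long k, long row, long col, long total) {
        return k == 0 ? 0.0 : k * Math.log((double) k * total / ((double) row * col));
    }

    /** G^2 statistic; 0 means A and B occur together exactly as often as chance predicts. */
    static double llr(long k11, long k12, long k21, long k22) {
        long row1 = k11 + k12;
        long row2 = k21 + k22;
        long col1 = k11 + k21;
        long col2 = k12 + k22;
        long total = row1 + row2;
        return 2.0 * (term(k11, row1, col1, total)
                    + term(k12, row1, col2, total)
                    + term(k21, row2, col1, total)
                    + term(k22, row2, col2, total));
    }

    public static void main(String[] args) {
        System.out.println(llr(10, 10, 10, 10)); // independent words: prints 0.0
        System.out.println(llr(100, 0, 0, 100)); // words that always co-occur: large score
    }
}
```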
![Page 28: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/28.jpg)
Find collocations
Save collocations in Bloom filter
Add collocation “Unit Testing”
Generate k hash values for “Unit Testing”: 0, 2, 5, 4
Set bits in bitset: [1, 0, 1, 0, 1, 1, 0, 0, ...]
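A Bloom filter along these lines fits in a few lines of Java. This is a self-contained sketch on top of java.util.BitSet, not the filter class used in the talk. The double-hashing scheme (String.hashCode combined with FNV-1a) is an assumption chosen for brevity, and the membership check is included for completeness.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int k;

    SimpleBloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    /** 32-bit FNV-1a hash, used as the second hash function. */
    private static int fnv1a(String item) {
        int hash = 0x811C9DC5;
        for (byte b : item.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    /** Derives k bit positions from two hashes: h1 + i * h2 (mod size). */
    private int[] positions(String item) {
        int h1 = item.hashCode();
        int h2 = fnv1a(item);
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            result[i] = Math.floorMod(h1 + i * h2, size);
        }
        return result;
    }

    /** Sets the k bits for the item. */
    void add(String item) {
        for (int p : positions(item)) {
            bits.set(p);
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(String item) {
        for (int p : positions(item)) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 4);
        filter.add("Unit Testing");
        System.out.println(filter.mightContain("Unit Testing")); // prints true
    }
}
```

The filter never stores the collocations themselves, so its memory use is fixed by the bitset size, at the cost of occasional false positives.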
![Page 29: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/29.jpg)
Vectorize
Lucene analyzer emits collocations
[ Unit testing => 1, … ] [ Version control => 1, … ]
Is “Unit Testing” a significant colloc?
Generate k hash values for “Unit Testing”: 0, 2, 5, 4
Check bits in bitset: [1, 0, 1, 0, 1, 1, 0, 0, ...]
Bloom filter answers true. Can be a false positive!
![Page 30: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/30.jpg)
Cluster
KMeans
![Page 31: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/31.jpg)
Join clustered points
Posts sequence file
[ (id), (title, content) ]
Clustered points
( id = 23 , )
( id = 51 , )
( id = 78 , )
( id = 23 , )
( id = 34 , )
Map-side join
[ (id), (title, content, clusterId) ]
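The join itself can be illustrated in memory. The sketch assumes the clustered points fit into a lookup table from post id to cluster id, which each mapper could load before streaming over the posts. That is one common way to do a map-side join and not necessarily how the job in this talk does it. All class names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ClusterJoinSketch {
    static final class Post {
        final String title;
        final String content;

        Post(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    static final class ClusteredPost {
        final String title;
        final String content;
        final int clusterId;

        ClusteredPost(String title, String content, int clusterId) {
            this.title = title;
            this.content = content;
            this.clusterId = clusterId;
        }
    }

    /** Joins posts with their cluster id; posts without a cluster are skipped. */
    static Map<Integer, ClusteredPost> join(Map<Integer, Post> posts,
                                            Map<Integer, Integer> clusterIdByPostId) {
        Map<Integer, ClusteredPost> joined = new LinkedHashMap<>();
        for (Map.Entry<Integer, Post> e : posts.entrySet()) {
            Integer clusterId = clusterIdByPostId.get(e.getKey());
            if (clusterId == null) {
                continue; // no clustered point for this post id
            }
            Post p = e.getValue();
            joined.put(e.getKey(), new ClusteredPost(p.title, p.content, clusterId));
        }
        return joined;
    }
}
```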
![Page 32: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/32.jpg)
Cluster labels
Cluster centroid
[0.56, 0.32, 0.98, ...]
Dictionary file
0 => Unit testing
1 => Version control
2 => Continuous integration
...
Term with highest weight: 0.98
Label “Continuous integration”
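Picking the label amounts to an argmax over the centroid followed by a dictionary lookup. A minimal sketch, with the dictionary held in an array indexed by term id (an assumption for illustration):

```java
final class ClusterLabeler {
    /** Returns the dictionary term at the index of the highest centroid weight. */
    static String label(double[] centroid, String[] dictionary) {
        int best = 0;
        for (int i = 1; i < centroid.length; i++) {
            if (centroid[i] > centroid[best]) {
                best = i;
            }
        }
        return dictionary[best];
    }

    public static void main(String[] args) {
        double[] centroid = {0.56, 0.32, 0.98};
        String[] dictionary = {"Unit testing", "Version control", "Continuous integration"};
        System.out.println(label(centroid, dictionary)); // prints Continuous integration
    }
}
```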
![Page 33: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/33.jpg)
Index
[id,title,content,clusterId,clusterLabel]
View with web app & Solr
![Page 34: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/34.jpg)
Running the job on Amazon EC2
Mahout job jar → Amazon instances
Launch via Whirr
Submit via Java or CLI
![Page 35: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/35.jpg)
Apache Whirr
● Tool for launching clusters
● Whirr property file
whirr.provider=aws-ec2
whirr.instance-templates= 1 hadoop-jobtracker+hadoop-namenode, 10 hadoop-datanode+hadoop-tasktracker
whirr.identity=topsecret
![Page 36: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/36.jpg)
Apache Whirr
Launch!
$ whirr launch-cluster \
--config so-cluster.properties
$ export HADOOP_CONF_DIR=.whirr/so-cluster
Run!
$ hadoop fs -put posts.xml input
$ mahout seq2sparse ...
![Page 37: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/37.jpg)
Whirr
Launch!
Configuration prop = new PropertiesConfiguration(whirrConfigFile);
ClusterSpec clusterSpec = new ClusterSpec(prop);
Service service = new Service();
Cluster cluster = service.launchCluster(clusterSpec);
![Page 38: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/38.jpg)
Whirr
Submit!
Configuration configuration = new Configuration();
configuration.addResource(new Path("/home/frank/.whirr/so/hadoop-site.xml"));
Job job = new Job(configuration);
job.submit();
![Page 39: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/39.jpg)
Demo Time!
![Page 40: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/40.jpg)
Conclusions
![Page 41: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/41.jpg)
References
http://blog.jteam.nl/author/frank
Mahout mailing list
![Page 42: Composing Mahout clustering jobs](https://reader034.vdocuments.net/reader034/viewer/2022051405/58a1aa9b1a28abd04d8c494a/html5/thumbnails/42.jpg)
Q&A