Lecture 7 - CS 246H
Stanford CS 246H Winter ‘14
Stanford CS 246H: Mining Massive Data Sets Hadoop Lab
Machine Learning & Hadoop
Peanut Butter and Chocolate?
• The Promise of Big Data™
  • Sounds great, but how?
• Hadoop talent pool is small
• ML talent pool is tiny
• Tools and toolkits starting to appear
  • Mahout, Oryx, Alpine, Ayasdi, Skytree, etc.
• Summary: Hadoop is hard, and ML is hard
  1. Lots of people/companies are trying to make it easy
  2. Don’t believe anyone who tells you they make it easy
Hadoop & ML: A Brief History
• 2005 – Taste project started on SourceForge
• 2007 – Mahout project started at Apache
• 2008 – Taste donated to Mahout
• … time passes …
• 2012 – Myrrix is launched
• 2013 – Cloudera ML project started on GitHub
• Late 2013 – Oryx project started on GitHub
Hadoop ML Family Tree
[Diagram: family tree with nodes Taste, Mahout, Myrrix, Cloudera ML, Oryx, Lucene, and Andrew Ng.]
Apache Mahout
What is Mahout?
• “Scalable machine learning”
  • not just Hadoop-oriented machine learning
  • not entirely, that is. Just mostly.
• Components
  • math library
  • clustering
  • classification
  • decompositions
  • recommendations
©MapR Technologies 2013
Mahout Math
• Goals are
  • basic linear algebra,
  • and statistical sampling,
  • and good clustering,
  • decent speed,
  • extensibility,
  • especially for sparse data
• But not
  • totally badass speed
  • comprehensive set of algorithms
  • optimization, root finders, quadrature
Caveat Emptor
• Mahout is a toolkit
• There is a command line interface
  • You can’t always use it
  • Very often end up writing code
• Documentation is… ahem… scant
  • Best reference is Mahout in Action
• Varying levels of maturity
• Varying levels of Hadoop support
Matrices and Vectors
• At the core:
  • DenseVector, RandomAccessSparseVector
  • DenseMatrix, SparseRowMatrix
• Highly composable API
• Important ideas:
  • view*, assign and aggregate
  • iteration
    m.viewDiagonal().assign(v)
Assign? View?
• Why assign?
  • Copying is the major cost for naïve matrix packages
  • In-place operations critical to reasonable performance
  • Many kinds of updates required, so functional style very helpful
• Why view?
  • In-place operations often required for blocks, rows, columns or diagonals
  • With views, we need #assign + #views methods
  • Without views, we need #assign × #views methods
• Synergies
  • With both views and assign, many loops become single line
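The method-count argument above can be illustrated with a small self-contained sketch. This is plain Java, not Mahout code: the `VecView` interface and the `rowView` and `diagonalView` helpers are hypothetical names, chosen only to show how a view exposes a slice of a matrix so that a single `assign` implementation serves every kind of view. It assumes a non-empty rectangular matrix.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical stand-in for a vector view; not the Mahout Vector interface. */
interface VecView {
    int size();
    double get(int i);
    void set(int i, double value);

    /** One in-place assign, written once, works for every kind of view. */
    default VecView assign(DoubleUnaryOperator f) {
        for (int i = 0; i < size(); i++) {
            set(i, f.applyAsDouble(get(i)));
        }
        return this;
    }
}

final class ViewSketch {
    /** View of one row; writes go straight through to the backing array. */
    static VecView rowView(double[][] m, int row) {
        return new VecView() {
            public int size() { return m[row].length; }
            public double get(int i) { return m[row][i]; }
            public void set(int i, double v) { m[row][i] = v; }
        };
    }

    /** View of the main diagonal; assumes m is non-empty and rectangular. */
    static VecView diagonalView(double[][] m) {
        return new VecView() {
            public int size() { return Math.min(m.length, m[0].length); }
            public double get(int i) { return m[i][i]; }
            public void set(int i, double v) { m[i][i] = v; }
        };
    }
}
```

Adding a new view (say, a column view) needs only three small accessors, and adding a new `assign` variant needs one default method: the counts add rather than multiply.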
Assign
• Matrices
    Matrix assign(double value);
    Matrix assign(double[][] values);
    Matrix assign(Matrix other);
    Matrix assign(DoubleFunction f);
    Matrix assign(Matrix other, DoubleDoubleFunction f);
• Vectors
    Vector assign(double value);
    Vector assign(double[] values);
    Vector assign(Vector other);
    Vector assign(DoubleFunction f);
    Vector assign(Vector other, DoubleDoubleFunction f);
    Vector assign(DoubleDoubleFunction f, double y);
Views
• Matrices
    Matrix viewPart(int[] offset, int[] size);
    Matrix viewPart(int row, int rlen, int col, int clen);
    Vector viewRow(int row);
    Vector viewColumn(int column);
    Vector viewDiagonal();
• Vectors
    Vector viewPart(int offset, int length);
Aggregates
• Matrices
    double zSum();
    Vector aggregateRows(VectorFunction f);
    Vector aggregateColumns(VectorFunction f);
    double aggregate(DoubleDoubleFunction combiner,
                     DoubleFunction mapper);
• Vectors
    double zSum();
    double aggregate(DoubleDoubleFunction reduce,
                     DoubleFunction map);
    double aggregate(Vector other,
                     DoubleDoubleFunction aggregator,
                     DoubleDoubleFunction combiner);
Predefined Functions
• Many handy functions
    ABS        LOG2
    ACOS       NEGATE
    ASIN       RINT
    ATAN       SIGN
    CEIL       SIN
    COS        SQRT
    EXP        SQUARE
    FLOOR      SIGMOID
    IDENTITY   SIGMOIDGRADIENT
    INV        TAN
    LOGARITHM
Examples
• A = α
    double alpha;
    a.assign(alpha);
• A = αB + β
    a.assign(b, Functions.chain(
        Functions.plus(beta),
        Functions.mult(alpha)));
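The chaining idea in the second example can be sketched without Mahout, using only `java.util.function`. The class and method names here (`ChainSketch`, `scaleThenShift`, `assign`) are hypothetical; the sketch only shows that composing "multiply by α" with "add β" yields x → αx + β, which is then applied elementwise and in place.

```java
import java.util.function.DoubleUnaryOperator;

final class ChainSketch {
    /** Returns a function x -> alpha * x + beta built by composing two steps. */
    static DoubleUnaryOperator scaleThenShift(double alpha, double beta) {
        DoubleUnaryOperator mult = x -> alpha * x;
        DoubleUnaryOperator plus = x -> x + beta;
        return plus.compose(mult);   // plus(mult(x)): multiply first, then add
    }

    /** In-place a[i] = f(b[i]); the arrays are assumed to have equal length. */
    static void assign(double[] a, double[] b, DoubleUnaryOperator f) {
        for (int i = 0; i < a.length; i++) {
            a[i] = f.applyAsDouble(b[i]);
        }
    }
}
```

The order of composition matters: composing the other way round would compute α(x + β) instead.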
Sparse Optimizations
• DoubleDoubleFunction abstract properties
    public boolean isLikeRightPlus();
    public boolean isLikeLeftMult();
    public boolean isLikeRightMult();
    public boolean isLikeMult();
    public boolean isCommutative();
    public boolean isAssociative();
    public boolean isAssociativeAndCommutative();
    public boolean isDensifying();
• And Vector properties
    public boolean isDense();
    public boolean isSequentialAccess();
    public double getLookupCost();
    public double getIteratorAdvanceCost();
    public boolean isAddConstantTime();
Examples
• The trace of a matrix
    m.viewDiagonal().zSum()
• Set diagonal to zero
    m.viewDiagonal().assign(0)
• Set diagonal to negative of row sums excluding the diagonal
    Vector diag = m.viewDiagonal().assign(0);
    diag.assign(m.rowSums().assign(Functions.NEGATE));
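For comparison, here is roughly what the three one-liners above replace when written as explicit loops over a plain `double[][]`. This is a sketch with hypothetical names (`DiagonalSketch` and its methods), assuming a square matrix; it does not use Mahout.

```java
final class DiagonalSketch {
    /** Trace: the sum of the diagonal entries of a square matrix. */
    static double trace(double[][] m) {
        double sum = 0;
        for (int i = 0; i < m.length; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    /** Sets each diagonal entry to minus the sum of the other entries in its row. */
    static void setDiagonalToNegativeRowSums(double[][] m) {
        for (int i = 0; i < m.length; i++) {
            double offDiagonal = 0;
            for (int j = 0; j < m[i].length; j++) {
                if (j != i) {
                    offDiagonal += m[i][j];
                }
            }
            m[i][i] = -offDiagonal;
        }
    }
}
```

After `setDiagonalToNegativeRowSums`, every row of the matrix sums to zero.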
Iteration
• Matrices are Iterable in Mahout
    // compute both row and column sums in one pass
    for (MatrixSlice row : m) {
      rSums.set(row.index(), row.zSum());
      cSums.assign(row, Functions.PLUS);
    }
• Vectors are densely or sparsely iterable
    // entropy is -sum(p * log p), so subtract each term
    double entropy = 0;
    for (Vector.Element e : v.iterateNonZero()) {
      entropy -= e.get() * Math.log(e.get());
    }
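The two iteration patterns above can be mimicked in plain Java. In this sketch a sparse vector is modeled, as an assumption for illustration only, by a `Map<Integer, Double>` holding just the non-zero entries; `IterationSketch` and its method names are hypothetical and unrelated to the Mahout classes.

```java
import java.util.Map;

final class IterationSketch {
    /** Row and column sums in a single pass; returns {rowSums, colSums}. */
    static double[][] rowAndColumnSums(double[][] m) {
        double[] rSums = new double[m.length];
        double[] cSums = new double[m.length == 0 ? 0 : m[0].length];
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[r].length; c++) {
                rSums[r] += m[r][c];
                cSums[c] += m[r][c];
            }
        }
        return new double[][] {rSums, cSums};
    }

    /** Shannon entropy (natural log), visiting only stored non-zero entries. */
    static double entropy(Map<Integer, Double> sparse) {
        double h = 0;
        for (double p : sparse.values()) {
            if (p > 0) {
                h -= p * Math.log(p);
            }
        }
        return h;
    }
}
```

Skipping zeros is safe for entropy because the term p·log p tends to 0 as p tends to 0, which is why iterating only the non-zero elements gives the same result as a dense loop.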
Random Sampling
• Samples from some type
    public interface Sampler<T> {
      T sample();
    }

    public abstract class AbstractSamplerFunction
        extends DoubleFunction
        implements Sampler<Double>
• Lots of kinds
    ChineseRestaurant   Missing       Normal
    Empirical           Multinomial   PoissonSampler
    IndianBuffet        MultiNormal   Sampler
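A minimal sketch of the "sampler as function" idea, written against the JDK only. The `Sampler<T>` interface has the same shape as the one on the slide; `NormalSampler` is a hypothetical class, and the choice to ignore the function argument is an assumption made for this illustration, so that the sampler can be handed to anything that applies a function to each element.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

/** Same shape as the interface on the slide. */
interface Sampler<T> {
    T sample();
}

/** Hypothetical normal sampler that is also usable as a double function. */
final class NormalSampler implements Sampler<Double>, DoubleUnaryOperator {
    private final double mean;
    private final double sd;
    private final Random rng;

    NormalSampler(double mean, double sd, long seed) {
        this.mean = mean;
        this.sd = sd;
        this.rng = new Random(seed);
    }

    @Override
    public Double sample() {
        return mean + sd * rng.nextGaussian();
    }

    /** Ignores its argument and returns a fresh sample, so it can fill a vector. */
    @Override
    public double applyAsDouble(double ignored) {
        return sample();
    }
}
```

With this shape, filling an array with random draws is an ordinary elementwise "assign" of the sampler.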
Mahout Math Summary
• Matrices, Vectors
  • views
  • in-place assignment
  • aggregations
  • iterations
• Functions
  • lots built-in
  • cooperate with sparse vector optimizations
• Sampling
  • abstract samplers
  • samplers as functions
• Other stuff … clustering, SVD
Other Stuff
• Matrix Decomposition
• Classification
• Clustering
• Recommendations
Focus: Machine Learning
[Diagram: Mahout component stack. Boxes shown: Applications; Examples; Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic; Math (Vectors/Matrices/SVD); Utilities (Lucene/Vectorizer); Collections (primitives); Apache Hadoop.]
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
©Lucid Imagination 2010
Prepare Data from Raw content
• Data Sources:
  • Lucene integration
    • bin/mahout lucenevector …
  • Document Vectorizer
    • bin/mahout seqdirectory …
    • bin/mahout seq2sparse …
  • Programmatically
    • See the Utils module in Mahout
  • Database
  • File system
Recommendations
• Extensive framework for collaborative filtering
• Recommenders
  • User based, Item based, ALS, SlopeOne, SVD, others
• Online and Offline support
  • Offline can utilize Hadoop
• Many different Similarity measures
  • Cosine, LLR, Tanimoto, Pearson, others
Clustering
• Document level
  • Group documents based on a notion of similarity
  • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
  • Distance Measures
    • Manhattan, Euclidean, other
• Topic Modeling
  • Cluster words across documents to identify topics
  • Latent Dirichlet Allocation
Categorization
• Place new items into predefined categories:
  • Sports, politics, entertainment
• Mahout has several implementations
  • Naïve Bayes
  • Complementary Naïve Bayes
  • Decision Forests
  • Logistic Regression (SGD)
Freq. Pattern Mining
• Identify frequently co-occurrent items
• Useful for:
  • Query Recommendations
    • Apple -> iPhone, orange, OS X
  • Related product placement
    • “Beer and Diapers”
  • Spam Detection
    • Yahoo: http://www.slideshare.net/hadoopusergroup/mail-antispam
http://www.amazon.com
Evolutionary
• Map-Reduce ready fitness functions for genetic programming
• Integration with Watchmaker
  • http://watchmaker.uncommons.org/index.php
• Problems solved:
  • Traveling salesman
  • Class discovery
  • Many others
Singular Value Decomposition
• Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts
• Mahout has fully distributed Lanczos implementation
    <MAHOUT_HOME>/bin/mahout svd \
      -Dmapred.input.dir=path/to/corpus \
      --tempDir path/for/svd-output \
      --rank 300 \
      --numColumns <numcols> \
      --numRows <num rows in the input>
    <MAHOUT_HOME>/bin/mahout cleansvd \
      --eigenInput path/for/svd-output \
      --corpusInput path/to/corpus \
      --output path/for/cleanOutput \
      --maxError 0.1 \
      --minEigenvalue 10.0
• https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
How to: Command Line
• Most algorithms have a Driver program
  • Shell script in $MAHOUT_HOME/bin helps with most tasks
• Prepare the Data
  • Different algorithms require different setup
• Run the algorithm
  • Single Node
  • Hadoop
• Print out the results
  • Several helper classes:
    • LDAPrintTopics, ClusterDumper, etc.
Ugly Demo II - Prep
• Data Set: Reuters
  • http://www.daviddlewis.com/resources/testcollections/reuters21578/
  • Convert to Text via http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
• Convert to Sequence File:
    bin/mahout seqdirectory --input <PATH> --output <PATH> --charset UTF-8
• Convert to Sparse Vector:
    bin/mahout seq2sparse \
      --input <PATH>/content/reuters/seqfiles/ \
      --norm 2 --weight TF \
      --output <PATH>/content/reuters/seqfiles-TF/ \
      --minDF 5 --maxDFPercent 90
Ugly Demo II: Topic Modeling
• Latent Dirichlet Allocation
    ./mahout lda \
      --input <PATH>/content/reuters/seqfiles-TF/vectors/ \
      --output <PATH>/content/reuters/seqfiles-TF/lda-output \
      --numWords 34000 --numTopics 10
    ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics \
      --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 \
      --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 \
      --words 10 \
      --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics \
      --dictionaryType sequencefile
• Good feature reduction (stopword removal) required
Ugly Demo III: Clustering
• K-Means
  • Same Prep as UD II, except use TFIDF weight
    ./mahout kmeans \
      --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 \
      --k 15 \
      --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans \
      --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters
• Print out the clusters:
    ./mahout clusterdump \
      --seqFileDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ \
      --pointsDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/points/ \
      --dictionary <PATH>/content/reuters/seqfiles-TFIDF/dictionary.file-0 \
      --dictionaryType sequencefile \
      --substring 20
Ugly Demo IV: Frequent Pattern Mining
• Data: http://fimi.cs.helsinki.fi/data/
• ./mahout fpg -i <PATH>/content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
• ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000
Cloudera ML
Cloudera ML
• Collection of Java libraries and command-line tools
• Goal: make data scientists more productive with CDH
  • Exploratory data analysis
  • Data preparation
  • Model fitting
  • Model evaluation
• Apache 2.0 licensed
• Developed on GitHub
  • http://github.com/cloudera/ml
Cloudera ML: Building Blocks
• Apache Hadoop
  • Scalable data storage (HDFS) and processing (MapReduce)
• Apache Hive
  • Metadata for structured data in HDFS
• Apache Crunch
  • Easy MapReduce pipelines
• Apache Mahout
  • Vector interface
• Apache Avro
  • Serialization format
Cloudera ML Workflow: Clustering
Cloudera ML: summary
• client/bin/ml summary
    --input-paths kddcup.data_10_percent (HDFS)
    --format text
    --header-file examples/kdd99/header.csv (local FS)
    --summary-file examples/kdd99/s.json (local FS)
[Diagram: step 1, summary. Reads kddcup.data_10_percent (HDFS) and header.csv (local FS); writes s.json (local FS).]
• s.json
  • Categorical features: histogram
  • Numerical features: distribution summary
Cloudera ML: normalize
• client/bin/ml normalize
    --input-paths kddcup.data_10_percent (HDFS)
    --format text
    --summary-file examples/kdd99/s.json (local FS)
    --transform Z
    --output-path kdd99 (HDFS)
    --output-type avro
    --id-column category
    --compress
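As a rough illustration of what a normalization step does to one numeric column, the sketch below computes z-scores. It assumes that `--transform Z` refers to the usual z-score, (x - mean) / standard deviation, which the slides do not spell out; the class name `ZScoreSketch`, the use of the population standard deviation, and the handling of constant columns are all choices made for this example, not a description of the tool's implementation.

```java
final class ZScoreSketch {
    /** Returns (x - mean) / sd per element, using the population standard deviation. */
    static double[] zScores(double[] column) {
        int n = column.length;
        double sum = 0;
        for (double x : column) {
            sum += x;
        }
        double mean = sum / n;

        double squares = 0;
        for (double x : column) {
            squares += (x - mean) * (x - mean);
        }
        double sd = Math.sqrt(squares / n);

        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            // A constant column has sd == 0; map it to zeros rather than divide by zero.
            out[i] = sd == 0 ? 0 : (column[i] - mean) / sd;
        }
        return out;
    }
}
```

The mean and standard deviation are exactly the kind of per-column statistics that a summary pass can compute once and a later normalization pass can reuse.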
[Diagram: step 2, normalize. Reads kddcup.data_10_percent (HDFS) and s.json (local FS); writes kdd99/ (HDFS).]
• kdd99/part-m-0000[0|1].avro
  • Examples (rows)
    • Part 0: 442,454 vectors
    • Part 1: 51,567 vectors
    • Total: 494,021 vectors
  • Features (columns)
    • Before: 41 fields
    • After: 143 fields
Cloudera ML: ksketch
• client/bin/ml ksketch
    --input-paths kdd99 (HDFS)
    --format avro
    --points-per-iteration 500
    --output-file wc.avro (local FS)
    --seed 1729
    --iterations 5
    --cross-folds 2
[Diagram: step 3, ksketch. Reads kdd99/ (HDFS); writes wc.avro (local FS).]
• wc.avro
  • Examples (rows)
    • 2 “folds” of 2501 examples
      • 1 initial example
      • 500 examples from each iteration (5 iterations)
    • Each example has an associated weight
  • Features (columns)
    • 143 features (still)
Cloudera ML: kmeans
• client/bin/ml kmeans
    --input-file wc.avro (local FS)
    --centers-file centers.avro (local FS)
    --seed 19
    --clusters 1,10,25,35,45
    --best-of 2
    --num-threads 4
    --eval-stats-file kmeans_stats.csv (local FS)
[Diagram: step 4, kmeans. Reads wc.avro (local FS); writes centers.avro and kmeans_stats.csv (local FS).]
• centers.avro
  • 1 row for each run of k-means++
  • 9 total runs: 1 for k=1, 2 each for k=10, 25, 35, and 45
• kmeans_stats.csv
  • Clustering quality scores
Cloudera ML: kassign
• client/bin/ml kassign
    --input-paths kdd99 (HDFS)
    --format avro
    --centers-file centers.avro (local FS)
    --center-ids 4
    --output-path assigned (HDFS)
    --output-type csv
[Diagram: step 5, kassign. Reads kdd99/ (HDFS) and centers.avro (local FS); writes assigned/ (HDFS).]
• assigned/part-m-0000[0|1]
  • Rows
    • Part 0: 442,454
    • Part 1: 51,567
    • Total: 494,021
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster
Cloudera ML: sample
• client/bin/ml sample
    --input-paths assigned (HDFS)
    --format text
    --header-file examples/kdd99/kassign_header.csv (local FS)
    --weight-field squared_distance
    --group-fields clustering_id,closest_center_id
    --output-type csv
    --size 20
    --output-path extremal (HDFS)
[Diagram: step 6, sample. Reads assigned/ (HDFS) and kassign_header.csv (local FS); writes extremal/ (HDFS).]
• extremal/part-r-00000
  • Rows
    • Up to 20 examples from each cluster
    • Examples that are furthest from the center of the cluster
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster
Oryx
2014: Lab to Factory
Data Science Will Be Operational Analytics
I Built A Model. Now What?
[Diagram: cycle of Build Model, Query Model, Collect Input; repeat.]
I Built A Model On Hadoop. Now What?
[Diagram: the same cycle of Build Model, Query Model, Collect Input; repeat, with each step marked “?”.]
Example: Oryx
www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg
Gaps to fill, and Goals
• Model Building
  • Large-scale
  • Continuous
  • Apache Hadoop™-based
  • Few, good algorithms
• Model Serving
  • Real-time query
  • Real-time update
• Algorithms
  • Parallelizable
  • Updateable
  • Works on diverse input
• Interoperable
  • PMML model format
  • Simple REST API
  • Open source
Large-Scale or Real-Time?
Large-Scale, Offline, Batch vs Real-Time, Online, Streaming
Why Don’t We Have Both?
λ!
Lambda Architecture
• Batch, Stream Processing are different
• Tackle separately in 2+ Layers
• Batch Layer: offline, asynchronous
• Serving / Speed Layer: real-time, incremental, approximate
jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
… λ?
[Diagram: two layers, Batch and Serving/Speed.]
Two Layers
• Computation Layer
  • Java-based server process
  • Client of Hadoop 2.x
  • Periodically builds “generation” from recent data and past model
  • Baby-sits MapReduce* jobs (or, locally in-core)
  • Publishes models
• Serving Layer
  • Apache Tomcat™-based server process
  • Consumes models from HDFS (or local FS)
  • Serves queries from model in memory
  • Updates from new input
  • Also writes input to HDFS
  • Replicas for scale
* Apache Spark later
Collaborative Filtering : ALS
• Alternating Least Squares
• Latent-factor model
• Accepts implicit or explicit feedback
• Real-time update via fold-in of input
• No cold-start
• Parallelizable
[Diagram: factor matrices X and Yᵀ.]
Clustering : k-means++
• Well-known and understood
• Parallelizable
• Clusters updateable
cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
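For readers who have not seen it, the seeding step that gives k-means++ its name can be sketched as follows. This is the standard textbook procedure (pick the first center uniformly at random, then pick each further center with probability proportional to its squared distance from the nearest center chosen so far), written in plain Java with hypothetical names; it is not the Oryx or Mahout implementation, and it assumes 1 <= k <= number of points.

```java
import java.util.Random;

final class KMeansPlusPlusSketch {
    static double squaredDistance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            d += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return d;
    }

    /** Picks k initial centers, returned as indices into points. */
    static int[] chooseCenters(double[][] points, int k, Random rng) {
        int n = points.length;
        int[] centers = new int[k];
        centers[0] = rng.nextInt(n);            // first center: uniform at random
        double[] d2 = new double[n];            // squared distance to nearest chosen center
        for (int i = 0; i < n; i++) {
            d2[i] = squaredDistance(points[i], points[centers[0]]);
        }
        for (int c = 1; c < k; c++) {
            double total = 0;
            for (double d : d2) {
                total += d;
            }
            // Sample an index with probability proportional to d2[i].
            double target = rng.nextDouble() * total;
            int chosen = n - 1;                 // fallback if rounding leaves target >= 0
            for (int i = 0; i < n; i++) {
                target -= d2[i];
                if (target < 0) {
                    chosen = i;
                    break;
                }
            }
            centers[c] = chosen;
            for (int i = 0; i < n; i++) {
                d2[i] = Math.min(d2[i], squaredDistance(points[i], points[chosen]));
            }
        }
        return centers;
    }
}
```

Points that coincide with an existing center have weight zero, so the seeding tends to spread the initial centers across well-separated groups before the usual k-means iterations begin.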
Classification / Regression : RDF
• Random Decision Forests
• Ensemble method
• Numeric, categorical features and target
• Very parallel
• Nodes updateable
• Works well on many problems
[Diagram: example decision tree with tests age > 30, female?, and income > 20000, and Yes/No branches.]
PMML
• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support
    <PMML xmlns="http://www.dmg.org/PMML-4_1"
          version="4.1">
      <Header copyright="www.dmg.org"/>
      <DataDictionary numberOfFields="5">
        <DataField name="temperature"
                   optype="continuous"
                   dataType="double"/>
        …
      </DataDictionary>
      <TreeModel modelName="golfing"
                 functionName="classification">
        <MiningSchema>
          <MiningField name="temperature"/>
          …
        </MiningSchema>
        <Node score="will play">
          <Node score="will play">
            <SimplePredicate field="outlook"
                             operator="equal"
                             value="sunny"/>
            …
          </Node>
        </Node>
      </TreeModel>
    </PMML>
www.dmg.org/v4-1/TreeModel.html
HTTP REST API
• Convention for RPC-like request / response
• HTTP verbs, transport
  • GET : query
  • POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.
    GET /recommend/jwills

    HTTP/1.1 200 OK
    Content-Type: text/plain

    "Ray LaMontagne",0.951
    "Fleet Foxes",0.7905
    "The National",0.688
    "Shearwater",0.3017
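Because the response body is plain text with one `"item",score` pair per line, a client needs very little code to consume it. The sketch below parses a body of that shape into an ordered map. `RecommendResponseSketch` is a hypothetical helper, not part of Oryx, and it assumes every non-blank line is well formed as in the sample response above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RecommendResponseSketch {
    /** Parses lines of the form "item",score into an ordered item -> score map. */
    static Map<String, Double> parse(String body) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String line : body.split("\\R")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            // Split on the last comma, since an item name may itself contain commas.
            int comma = line.lastIndexOf(',');
            String item = line.substring(0, comma).replaceAll("^\"|\"$", "");
            scores.put(item, Double.parseDouble(line.substring(comma + 1)));
        }
        return scores;
    }
}
```

A `LinkedHashMap` keeps the lines in the order received, which matters if the server returns recommendations sorted by score.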
Wish List
• Revamp workflow
  • Spark / Crunch-like API, not raw M/R
• De-emphasize model building
  • Well-solved
  • Bring your own
• More component-ized
  • Less black-box service
• Emphasize integration
  • PMML, etc.
• “Pull” options
  • Kafka?
  • Hive / Impala?
Open Source
github.com/cloudera/oryx
100% Apache License 2.0