the hitchhiker's guide to machine learning with python & apache spark

The Hitchhiker's Guide to Machine Learning with Python & Apache Spark. October 29, 2014. @ksankar // doubleclix.wordpress.com. http://www.bigdatatechcon.com/classes.html#TheHitchhikersGuidetoMachineLearningwithPythonandApacheSparkPartI

Upload: krishna-sankar

Post on 21-Apr-2017


TRANSCRIPT

Page 1: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

October 29, 2014

@ksankar // doubleclix.wordpress.com

http://www.bigdatatechcon.com/classes.html#TheHitchhikersGuidetoMachineLearningwithPythonandApacheSparkPartI

“I want to die on Mars but not on impact”
-- Elon Musk, interview with Chris Anderson

“The shrewd guess, the fertile hypothesis, the courageous leap to a tentative conclusion – these are the most valuable coin of the thinker at work”
-- Jerome Seymour Bruner

"There are no facts, only interpretations."
-- Friedrich Nietzsche

Page 2: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Agenda

o Spark & Data Science DevOps
  • Spark, Python & Machine Learning
  • Goals/non-goals
  • Intro to Spark
  • Stack, Mechanisms – RDD
  • Datasets: SOTU, Titanic, Frequent Flier
  • Statistical Toolbox
  • Summary, Correlations
o “Mood Of the Union”
  • State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
  • Map reduce, parse text
o Clustering
  • K-means for Gallactic Hoppers!
o Break (3:15-3:45)
o Predicting Survivors with Classification
  • Decision Trees
  • NaiveBayes (Titanic data set)
o Linear Regression
o Recommendation Engine
  • Collab Filtering w/movie lens
o Discussions/Slack

Schedule, Oct 29: 2:00-3:15 (75 min) + 3:45-5:00 (75 min) = 150 min
[20] 2:00 – 2:20
[30] 2:20 – 2:50
[25] 2:50 – 3:15
[30] 3:45 – 4:15
[10] 4:15 – 4:25
[20] 4:25 – 4:45
[15] 4:45 – 5:00

Page 3: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Goals & non-goals

Goals

¤ Understand how to program Machine Learning with Spark & Python

¤ Focus on programming & ML application

¤ Give you a focused time to work thru examples
  § Work with me. I will wait if you want to catch up
¤ Less theory, more usage - let us see if this works
¤ As straightforward as possible
  § The programs can be optimized

Non-goals

¡ Go deep into the algorithms
  • We don’t have sufficient time. The topic can easily be a 5 day tutorial!
¡ Dive into Spark internals
  • That is for another day
¡ The underlying computation, communication, constraints & distribution is a fascinating subject
  • Paco does a good job explaining them
¡ A passive talk
  • Nope. Interactive & hands-on

Page 4: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

About Me

o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, PyData et al.
o Reviewing Packt Book “Machine Learning with Spark”
o Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark”
o Have done lots of things:
  • Big Data (Retail, Bioinformatics, Financial, AdTech)
  • Written Books (Web 2.0, Wireless, Java, …)
  • Standards, some work in AI
  • Guest Lecturer at Naval PG School, …
  • Planning MS-CFinance or Statistics or Computational Math
o Volunteer as Robotics Judge at First Lego League World Competitions
o @ksankar, doubleclix.wordpress.com

The  Nuthead  band  !  

Page 5: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Spark & Data Science DevOps

2:00

Page 6: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Close Encounters

o 1st
  ◦ This Tutorial
o 2nd
  ◦ Do More Hands-on Walkthrough
o 3rd
  ◦ Listen To Lectures
  ◦ More competitions …

Page 7: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Spark Installation

o Install Spark 1.1.0 on your local machine
o https://spark.apache.org/downloads.html
  • Pre-built for Hadoop 2.4 is fine
o Download & uncompress
o Remember the path & use it wherever you see /usr/local/spark/
o I have downloaded it to /usr/local & have a softlink "spark" to the latest version

Page 8: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Tutorial Materials

o Github: https://github.com/xsankar/cloaked-ironman
o Clone or download zip
o Open terminal
o cd ~/cloaked-ironman
o IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" /usr/local/spark/bin/pyspark
o Note:
  • I have a soft link “spark” in my /usr/local that points to the Spark version that I use. For example: ln -s spark-1.1.0/ spark
o Click on the ipython dashboard
o Just look thru the ipython notebooks

Page 9: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Data Science - Context

[Diagram: pipeline stages. Data Management: Collect, Store, Transform. Data Science: Reason, Model, Deploy. Also: Explore, Visualize, Recommend, Predict.]

Callouts on the diagram:
o Scalable Model Deployment
o Big Data automation & purpose-built appliances (soft/hard)
o Manage SLAs & response times
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
  § Data Subsets
  § Attribute sets
o Refine model with
  § Extended Data subsets
  § Engineered Attribute sets
o Validation run across a larger data set
o Dynamic Data Sets
o 2-way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics

¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots

Page 10: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Volume

Velocity

Variety

Data Science - Context

Context

Connectedness

Intelligence

Interface

Inference

“Data of unusual size” that can't be brute forced

o Three Amigos
o Interface = Cognition
o Intelligence = Compute (CPU) & Computational (GPU)
o Infer Significance & Causality

Page 11: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Day in the life of a (super) Model

Intelligence

Inference

Data Representation

Interface

Algorithms

Parameters, Attributes

Data (Scoring)

Model Selection

Reason & Learn

Models

Visualize, Recommend, Explore

Model Assessment

Feature Selection, Dimensionality Reduction

Page 12: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Data Science Maturity Model & Spark

Stages: Isolated Analytics | Integrated Analytics | Aggregated Analytics | Automated Analytics

Data: Small Data | Larger Data set | Big Data | Big Data Factory Model

Context: Local | Domain | Cross-domain + External | Cross-domain + External

Model, Reason & Deploy:
  Isolated:
    • Single set of boxes, usually owned by the Model Builders
    • Departmental
  Integrated:
    • Deploy - Central Analytics Infrastructure
    • Models still owned & operated by Modelers
    • Partly Enterprise-wide
  Aggregated:
    • Central Analytics Infrastructure
    • Model & Reason – by Model Builders
    • Deploy, Operate – by ops
    • Residuals and other metrics monitored by modelers
    • Enterprise-wide
  Automated:
    • Distributed Analytics Infrastructure
    • AI Augmented models
    • Model & Reason – by Model Builders
    • Deploy, Operate – by ops
    • Data as a monetized service, extending to eco system partners
  Reports | Dashboards | Dashboards + some APIs | Dashboards + Well defined APIs + programming models

Type: Descriptive & Reactive | + Predictive | + Adaptive | Adaptive

Datasets: All in the same box | Fixed data sets, usually in temp data spaces | Flexible Data & Attribute sets | Dynamic datasets with well-defined refresh policies

Workload: Skunk works | Business relevant apps with approx SLAs | High performance appliance clusters | Appliances and clusters for multiple workloads including real time apps; Infrastructure for emerging technologies

Strategy: • Informal definitions • Data definitions buried in the analytics models • Some data definitions • Data catalogue, metadata & Annotations • Big Data MDM Strategy

Page 13: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

The Sense & Sensibility of a Data Scientist DevOps

Factory = Operational

Lab = Investigative

http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/

Page 14: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Spark-The Stack

Page 15: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html

Page 16: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

RDD – The workhorse of Spark

o Resilient Distributed Datasets
  • Collection that can be operated on in parallel
o Transformations – create RDDs
  • Map, Filter, …
o Actions – get values
  • Collect, Take, …
o We will apply these operations during this tutorial
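As a minimal sketch of the transformation/action split (not taken from the tutorial notebooks), the snippet below chains two transformations and then calls two actions. The names `square`, `is_even` and `rdd_demo` are illustrative, and `sc` is assumed to be the SparkContext that the pyspark shell creates. The last lines repeat the same logic on a local list so the result can be checked without a cluster.

```python
def square(x):
    return x * x


def is_even(x):
    return x % 2 == 0


def rdd_demo(sc):
    """Needs a live SparkContext, e.g. the `sc` created by bin/pyspark."""
    numbers = sc.parallelize(range(1, 11))               # local collection -> RDD
    even_squares = numbers.map(square).filter(is_even)   # transformations: lazy, each returns a new RDD
    return even_squares.collect(), even_squares.take(2)  # actions: run the job and return values


# The same logic on a local list, runnable without Spark
local_result = [s for s in map(square, range(1, 11)) if is_even(s)]
print(local_result)
```

Nothing is computed when `map` and `filter` are called; the work happens only when an action such as `collect` or `take` asks for values.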

Page 17: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Algorithm spectrum

o Regression
o Logit
o CART
o Ensemble: Random Forest
o Clustering
o KNN
o Genetic Alg
o Simulated Annealing
o Collab Filtering
o SVM
o Kernels
o SVD
o NNet
o Boltzmann Machine
o Feature Learning

Spectrum: Machine Learning | Cute Math | Artificial Intelligence

Page 18: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Not all MLlib APIs are available in Python (as of 1.1.0)

API                            Java/Scala   Python
Basic Statistics               ✔            ✔
Linear Models                  ✔            ✔
Decision Trees                 ✔            ✔
Random Forest                  ✖            ✖
Collab Filtering               ✔            ✔
Clustering-KMeans              ✔            ✔
Clustering-Hierarchical        ✖            ✖
SVD                            ✔            ✖
PCA                            ✔            ✖
Standard Scaler, Normalizer    ✔            ✖
Model Evaluation-PR/ROC

Spark 1.2 MLlib JIRA: http://bit.ly/1ywotkm

Page 19: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Statistical Toolbox

o Sample data : Car mileage data

https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/correlations.py
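A hedged sketch of the two toolbox operations named in the agenda (summary and correlations). `spark_summary` assumes a live SparkContext `sc` and rows given as lists of numbers; the function names and the toy values in the usage note are illustrative, not the car mileage data. `pearson` is a plain-Python version of the same correlation so the arithmetic can be checked locally.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spark_summary(sc, rows):
    """rows: list of equal-length numeric lists. Needs a live SparkContext."""
    # Imported lazily so that pearson() above runs without Spark installed
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.stat import Statistics

    raw = sc.parallelize(rows)
    summary = Statistics.colStats(raw.map(Vectors.dense))  # column-wise statistics
    first = raw.map(lambda r: float(r[0]))
    second = raw.map(lambda r: float(r[1]))
    corr = Statistics.corr(first, second, method="pearson")
    return summary.mean(), summary.variance(), corr
```

For example, `pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` is 1.0 up to rounding, because the second list is an exact multiple of the first.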

Page 20: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Page 21: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

“Mood Of the Union” with TF-IDF

2:20

Page 22: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Scenario – Mood Of the Union

o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country
o If so, can we infer the mood of the country by analyzing SOTU?
o If we embark on this line of thought, how would we do it with Spark & Python?
o Is it different from Hadoop-MapReduce?
o Is it better?

Page 23: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

POA (Plan Of Action)

o Collect State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTU from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word frequency vectors
o Remove stock common words
o Inspect the top n words to see if they reflect the sentiment of the time
o Compute set difference and see how new words have cropped up
o Compute TF-IDF (homework!)

Page 24: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Lookout for these interesting Spark features

o RDD Map-Reduce
o How to parse input
o Removing common words
o Sort RDD by value

Page 25: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Read & Create word vector iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 26: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Remove Common Words – 1 of 3

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 27: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Remove Common Words – 2 of 3

Page 28: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Remove Common Words – 3 of 3

Page 29: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

FDR vs. Barack Obama as reflected by SOTU

Page 30: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Barack Obama vs. Bill Clinton

Page 31: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

GWB vs Abe Lincoln as reflected by SOTU

Page 32: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Epilogue

o Interesting Exercise
o Highlights
  • Map-reduce in a couple of lines!
  • But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen [1])
  • Set differences using subtractByKey
  • Ability to sort a map by values (or any arbitrary function, for that matter)
o To explore as homework:
  • TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
  • Haven’t seen it in Python for 1.1.

[1] http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
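For the TF-IDF homework, here is a small plain-Python sketch of the computation itself, with no Spark involved. It assumes raw term counts for TF and the smoothed form log((m + 1) / (df + 1)) for IDF, where m is the number of documents; the function name and the input layout (a dict of token lists) are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: dict of name -> list of tokens. Returns dict of name -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    weights = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)   # raw term frequency within this document
        weights[name] = {
            term: count * math.log((n_docs + 1.0) / (doc_freq[term] + 1.0))
            for term, count in term_freq.items()
        }
    return weights


speeches = {"a": ["war", "union", "union"], "b": ["union", "jobs"]}
scores = tf_idf(speeches)
```

A word that appears in every speech gets weight 0, which is the point of the exercise: it pushes down words that say nothing about one particular era.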

Page 33: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Clustering

2:50

Page 34: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Scenario – Clustering with Spark

o InterGallactic Airlines has the GallacticHoppers frequent flyer program & has data about the customers who participate in the program.
o The airline's execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy.
o So the business wants to customize promotions for its frequent flier program.
o Can they just have one type of promotion?
o Should they have different types of incentives?
o Who exactly are the customers in their GallacticHoppers program?
o Recently they have deployed an infrastructure with Spark
o Can Spark help with this business problem?

Page 35: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Clustering - Theory

o Clustering is unsupervised learning
o While computers can dissect a dataset into “similar” clusters, they still need human direction & domain knowledge to interpret & guide
o Two types:
  • Centroid based clustering – k-means clustering
  • Tree based clustering – hierarchical clustering
o Spark implements the Scalable K-means++
  • Paper: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

Page 36: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Lookout for these interesting Spark features

o Application of Statistics toolbox
o Center & Scale RDD
o Filter RDDs

Page 37: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Clustering - API

o from pyspark.mllib.clustering import KMeans
o KMeans.train
  • train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
  • k = number of clusters to create, default=2
  • initializationMode = the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||
o KMeansModel.predict
  • Maps a point to a cluster
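A rough sketch of how the two calls fit together, plus the error measure used to compare different values of k. The helper names (`squared_distance`, `nearest_center`, `wssse`, `train_and_score`) are made up for this example, and `train_and_score` assumes a live SparkContext `sc` with rows given as lists of numbers. The `runs` argument from the slide is left at its default.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Index of the closest center, i.e. what predict returns for one point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(squared_distance(p, centers[nearest_center(p, centers)]) for p in points)


def train_and_score(sc, rows, k):
    """rows: list of numeric lists. Needs a live SparkContext."""
    # Imported lazily so the helpers above run without Spark installed
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    data = sc.parallelize(rows).map(Vectors.dense).cache()
    model = KMeans.train(data, k, maxIterations=20, initializationMode="k-means||")
    centers = [list(c) for c in model.clusterCenters]
    labels = data.map(lambda p: model.predict(p))  # cluster index per point
    error = data.map(
        lambda p: squared_distance(p, centers[nearest_center(p, centers)])).sum()
    return model, labels, error
```

Calling `train_and_score` for several values of k and comparing the returned error is one way to run the "2 is too low, 10 is too high" experiment from the slides.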

Page 38: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Data iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 39: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Read Data & Create RDD

Page 40: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Train & Predict

Page 41: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Calculate error

Page 42: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

But Data is not even

Page 43: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

So let us center & scale the data and try again
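One way to center and scale, sketched under assumptions: each column has its mean subtracted and is divided by its standard deviation. `scale_rdd` assumes `data` is an RDD of plain Python lists and uses the MLlib summary statistics for mean and variance; the local helpers use the n - 1 (sample) denominator, which is assumed to match what `colStats` reports. All function names are illustrative.

```python
import math


def column_stats(rows):
    """Per-column mean and sample standard deviation for a local list of rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / float(n) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1.0))
            for col, m in zip(columns, means)]
    return means, stds


def center_and_scale(row, means, stds):
    """(value - mean) / std per column; constant columns map to 0."""
    return [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]


def scale_rdd(data):
    """data: RDD of numeric lists. Needs a live SparkContext behind the RDD."""
    from pyspark.mllib.stat import Statistics  # lazy import, see note above

    summary = Statistics.colStats(data)
    means = list(summary.mean())
    stds = [math.sqrt(v) for v in summary.variance()]
    return data.map(lambda row: center_and_scale(row, means, stds))
```

Without this step a column with large values (miles flown, say) dominates the distance calculation and k-means effectively clusters on that column alone.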

Page 44: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Looks Good

Let us try with 5 clusters

Page 45: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Let us map the cluster to our data

Page 46: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

And interpret them

We have multiple cluster types:
• 1: Very Active – Give them the most attention
• 3: Very Active on-line, few flights – Give them on-line coupons
• 4: Relatively new customers, not that active – Give them flight coupons to encourage them to fly more. Ask them why they are not flying. Maybe they are flying to destinations (say Jupiter) where InterGallactic has fewer gates

Note:
• This is just a sample interpretation.
• In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable.
• Maybe 3 is more suited to create targeted promotions

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 47: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Epilogue

o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right)
o We can see that the Scalable KMeans has control over runs, parallelism et al. (Homework: explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit the business viability. 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSE is the minimum.

Page 48: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Break

3:15

Page 49: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Predicting Survivors with Classification

3:45

Page 50: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Classification - Scenario

o This is a knowledge exercise
o Classify survival from the Titanic data
o Gives us a quick dataset to run & test classification

Titanic Passenger Metadata
• Small
• 3 Predictors
  • Class
  • Sex
  • Age
• Survived?

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 51: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Classifying Classifiers

[Diagram: taxonomy of classifiers. Labels on the slide: Statistical, Structural, Regression, Naïve Bayes, Bayesian Networks, Rule-based, Distance-based, Neural Networks, Production Rules, Decision Trees, Multi-layer Perceptron, Functional, Nearest Neighbor, Linear, Spectral Wavelet, kNN, Learning Vector Quantization, Ensemble, Random Forests, Logistic Regression (1), SVM, Boosting]

(1) Max Entropy Classifier

Ref: Algorithms of the Intelligent Web, Marmanis & Babenko

Page 52: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

[Diagram: Classifiers. Labels on the slide: Regression, Continuous Variables, Categorical Variables, Decision Trees, k-NN (Nearest Neighbors), Bias, Variance, Model Complexity, Over-fitting, Boosting, Bagging, CART]

Page 53: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Classification - Spark API

o Logistic Regression
o SVMWithSGD
o DecisionTrees
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o impurity – “entropy” or “gini”
o maxBins = control to throttle communication at the expense of accuracy
  • Larger = higher accuracy
  • Smaller = less communication (as # of bins = number of instances)
o Data adaptive – i.e. the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
o Intelligent framework - need this for scale
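A hedged sketch of how the Titanic predictors could be turned into LabeledPoints and passed to the call above. The input layout (tuples of survived, class, sex, age) and the 0/1 encoding of sex are assumptions for illustration, not necessarily the notebook's layout; `train_tree` needs a live SparkContext `sc`.

```python
def to_features(pclass, sex, age):
    """Encode one passenger as numbers: class 1-3, sex as 0/1, age in years."""
    return [float(pclass), 1.0 if sex == "male" else 0.0, float(age)]


def train_tree(sc, passengers):
    """passengers: list of (survived, pclass, sex, age) tuples. Needs a live SparkContext."""
    # Imported lazily so to_features() runs without Spark installed
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    points = sc.parallelize(passengers).map(
        lambda p: LabeledPoint(float(p[0]), to_features(p[1], p[2], p[3])))
    model = DecisionTree.trainClassifier(
        points,
        numClasses=2,                    # survived or not
        categoricalFeaturesInfo={1: 2},  # feature index 1 (sex) is categorical with 2 values
        impurity="gini",
        maxDepth=4,
        maxBins=100)
    return points, model
```

Features not listed in `categoricalFeaturesInfo` (class and age here) are treated as continuous, which is where the `maxBins` setting comes into play.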

Page 54: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Lookout for these interesting Spark features

o Concept of Labeled Point & how to create an RDD of LPs
o Print the tree
o Calculate Accuracy & MSE from RDDs

Page 55: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Read data & extract features

iPython notebook at https://github.com/xsankar/cloaked-ironman

Page 56: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Create the model

Page 57: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Extract labels & features

Page 58: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Calculate Accuracy & MSE
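Both measures come from (label, prediction) pairs. The sketch below shows the arithmetic on a local list and, in `evaluate`, one possible RDD version that zips labels with predictions. The function names are illustrative; `evaluate` assumes `points` is an RDD of LabeledPoint and that `model.predict` accepts an RDD of feature vectors.

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs that agree."""
    pairs = list(pairs)
    return sum(1 for label, pred in pairs if label == pred) / float(len(pairs))


def mse(pairs):
    """Mean squared error over (label, prediction) pairs."""
    pairs = list(pairs)
    return sum((label - pred) ** 2 for label, pred in pairs) / float(len(pairs))


def evaluate(model, points):
    """points: RDD of LabeledPoint. Returns (accuracy, mse) computed on the cluster."""
    labels = points.map(lambda lp: lp.label)
    predictions = model.predict(points.map(lambda lp: lp.features))
    pairs = labels.zip(predictions)  # RDD of (label, prediction)
    n = float(pairs.count())
    acc = pairs.filter(lambda lp: lp[0] == lp[1]).count() / n
    err = pairs.map(lambda lp: (lp[0] - lp[1]) ** 2).sum() / n
    return acc, err


sample = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
print(accuracy(sample), mse(sample))
```

With 0/1 labels the two numbers always sum to 1, so MSE carries extra information only when predictions are continuous.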

Page 59: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Use NaiveBayes Algorithm

Page 60: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Decision Tree – Best Practices

maxDepth: Tune with Data/Model Selection
maxBins: Set low, monitor communications, increase if needed
# RDD partitions: Set to # of cores

• Usually the recommendation is that RDDs should be over-partitioned, i.e. “more partitions than cores”: because tasks take different times, we need to utilize the compute power, and in the end they average out
• But for Machine Learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn’t help
• Joe Bradley's talk (reference below) has interesting insights

https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup

DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)

Page 61: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Future …

o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Leave it as homework
o In 1.2 …
o Random Forest
  • Bagging
  • PR for Random Forest
o Boosting
o Alpine Labs Sequoia Forest: coordinating merge
o Model Selection Pipeline; Design Doc

Page 62: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Boosting

Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)

◦ “Output of weak classifiers into a powerful committee”
◦ Final Prediction = weighted majority vote
◦ Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs Bagging
  - Bagging – independent trees <- Spark shines here
  - Boosting – successively weighted

Page 63: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Random Forests+

Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)

◦ Builds large collection of de-correlated trees & averages them
◦ Improves Bagging by selecting i.i.d.* random variables for splitting
◦ Simpler to train & tune
◦ “Do remarkably well, with very little tuning required” – ESLII
◦ Less susceptible to overfitting (than boosting)
◦ Many RF implementations
  - Original version - Fortran-77! By Breiman/Cutler
  - Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab

* i.i.d. – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Page 64: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Ensemble Methods

Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)

◦ Two Step
  - Develop a set of learners
  - Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
  - Using different algorithms,
  - Using the same algorithm with different settings
  - Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods

Ref: Machine Learning In Action

Page 65: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Random Forests

o While Boosting splits based on best among all variables, RF splits based on best among randomly chosen variables
o Simpler because it requires only two tuning parameters – no. of predictors (typically √k) & no. of trees (500 for large dataset, 150 for smaller)
o Error prediction
  • For each iteration, predict for the data that is not in the sample (OOB, out-of-bag, data)
  • Aggregate OOB predictions
  • Calculate prediction error for the aggregate, which is basically the OOB estimate of error rate
  • Can use this to search for the optimal # of predictors
  • We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers

Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective: Berk
A Brief Overview of RF by Dan Steinberg

Page 66: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Linear Regression

4:15

Page 67: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Linear Regression - API

LabeledPoint: the features and labels of a data point
LinearModel: weights, intercept
LinearRegressionModelBase: predict()
LinearRegressionModel
LinearRegressionWithSGD
  train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
LassoModel: least-squares fit with an l_1 penalty term
LassoWithSGD
  train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
RidgeRegressionModel: least-squares fit with an l_2 penalty term
RidgeRegressionWithSGD
  train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)

Page 68: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Basic Linear Regression

Page 69: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Use LR model for prediction & calculate MSE

Page 70: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Step size is important, the model can diverge !
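To see why the step size matters, here is a toy gradient descent on a one-parameter least-squares problem in plain Python. It illustrates the effect only and is not how MLlib implements SGD; `fit_slope` and the toy data are made up for this example. `train_with_step` shows where the same knob sits in the 1.x-era `LinearRegressionWithSGD.train` call listed on the API slide and assumes `points` is an RDD of LabeledPoint.

```python
def fit_slope(xs, ys, step, iterations=50):
    """Gradient descent for y ~ w * x (no intercept), minimizing mean squared error."""
    w = 0.0
    n = float(len(xs))
    for _ in range(iterations):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= step * grad  # too large a step overshoots further on every update
    return w


def train_with_step(points, step):
    """points: RDD of LabeledPoint. Uses the MLlib API named on the slide (Spark 1.x)."""
    from pyspark.mllib.regression import LinearRegressionWithSGD  # lazy import
    return LinearRegressionWithSGD.train(points, iterations=100, step=step)


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # exact relationship: y = 2x
good = fit_slope(xs, ys, step=0.1)
bad = fit_slope(xs, ys, step=0.3)
print(good, bad)
```

With step 0.1 the estimate settles on 2; with step 0.3 each update overshoots by more than the previous error, so the estimate oscillates with growing magnitude instead of converging.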

Page 71: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Interesting step size

Page 72: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Page 73: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Page 74: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Recommendation Engine

4:25

Page 75: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Recommendation & Personalization - Spark

Automated Analytics - Let Data tell story: Feature Learning, AI, Deep Learning

Learning Models - fit parameters as it gets more data

Dynamic Models – model selection based on context

o Knowledge Based
o Demographic Based
o Content Based
o Collaborative Filtering
  o Item Based
  o User Based
o Latent Factor based

o User Rating
o Purchased
o Looked/Not purchased

Spark (in 1.1.0) implements the user based ALS collaborative filtering

Ref:
ALS - Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu; AT&T Labs., Florham Park, NJ; Koren, Y.; Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan

Page 76: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Spark Collaborative Filtering API

o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
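A sketch of the read, split, train and evaluate steps using the calls above. The `::` separator, the 80/20 split, the rank and the function names are assumptions for illustration; `randomSplit` is assumed to be available in your PySpark version, and `train_and_evaluate` needs a live SparkContext `sc` and a path to a ratings file.

```python
def parse_rating(line, sep="::"):
    """'user::movie::rating::timestamp' -> (user, movie, rating)."""
    user, movie, rating = line.split(sep)[:3]
    return int(user), int(movie), float(rating)


def train_and_evaluate(sc, path, rank=10, iterations=5, lambda_=0.01):
    """Returns the trained model and the mean squared error on held-out ratings."""
    from pyspark.mllib.recommendation import ALS  # lazy import so parse_rating runs without Spark

    ratings = sc.textFile(path).map(parse_rating).cache()
    train, test = ratings.randomSplit([0.8, 0.2], seed=42)
    model = ALS.train(train, rank, iterations=iterations, lambda_=lambda_)

    # Predict the held-out (user, movie) pairs, then join with the actual ratings
    predicted = (model.predictAll(test.map(lambda r: (r[0], r[1])))
                      .map(lambda r: ((r[0], r[1]), r[2])))
    actual = test.map(lambda r: ((r[0], r[1]), r[2]))
    squared_errors = actual.join(predicted).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2)
    return model, squared_errors.mean()
```

Keying both RDDs by (user, movie) is what makes the join line up each prediction with its true rating.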

Page 77: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Read & Parse

Page 78: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Split & Train

Page 79: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Evaluate

Page 80: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Epilogue

o We explored interesting APIs in Spark
o ALS-Collab Filtering
o RDD Operations
  • Join (HashJoin)
  • In memory, Grace, Recursive hash join

http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx

Page 81: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Questions ?

4:45

Page 82: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Reference

1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/

Page 83: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

Essential Reading List

o A few useful things to know about machine learning - by Pedro Domingos
  • http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
  • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y.
  • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
  • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmo
  • https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
  • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf

Page 84: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

For your reading & viewing pleasure … An ordered List

① An Introduction to Statistical Learning
  • http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
  • http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
  • https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
  • https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data
  • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
  • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
  • http://statweb.stanford.edu/~tibs/ElemStatLearn/

http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/

Page 85: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

References:

o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas
  • http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel
  • http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
  • http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
  • http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf

Page 86: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark

The Beginning As The End

How did we do ? 4:45

Page 87: The Hitchhiker's Guide to Machine Learning with Python & Apache Spark