Website classification using Apache Spark
Amith Nambiar
Demo of the WebCat app
Business problem
Automatically classify new websites into one or more predefined categories.
Why?
Web logs collected from data providers contain new websites every day, and these need to be categorised before they are presented to customers in daily reports.
Website classification using Apache Spark's MLlib.
Training Data
The starting point was already-categorised data in the form:
URL, category_id
www.linux.com, 10 -> (Computers and Internet)
www.coles.com.au, 20 -> (Shopping and Classifieds)
Training Data
Developed a crawler to crawl each of the categorised websites
2,550 websites picked for initial training and test data.
URL, Category_Id -> URL, Category_Id, Features
www.coles.com.au, 20 ->
www.coles.com.au, 20, groceri deliv kitchen bench custom receiv deliveri first spend onlin liquorland cole card cole insur apparel cole credit card locat hour look hervey hervey today normal store hour monday friday 8am special store hour saturday decemb sunday decemb store store search suburb postcod search suburb postcod select locat suburb locat found pleas store store state recip inspir recip tast cole partner tast weekli plan easier visit tast cook month cole magazin everyday ingredi sensat meal famili friend latest cole cole handi video recip creativ kitchen visit cole youtub rang rang product product bakeri dairi fresh fruit cole mobil card heston liquor special diet gluten kosher foodtruck term condit corpor respons corpor respons supplier commit work …
Crawled, stemmed, and stop-word-filtered text for the website coles.com.au
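The stemmed tokens above ("groceri", "deliveri", …) are what that cleaning step produces. A minimal stdlib sketch of the idea, not the talk's actual crawler code: the stop-word list and the crude suffix-stripping stemmer below are illustrative stand-ins (a real pipeline would more likely use a Porter-style stemmer and a fuller stop-word list).

```python
import re

# Tiny illustrative stop-word list; the real pipeline would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "on", "is", "at"}

def crude_stem(word):
    # Very rough suffix stripping standing in for a real stemmer (e.g. Porter),
    # which is what produces tokens like "groceri" and "deliveri" on the slide.
    for suffix in ("ies", "ing", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, tokenise on letters, drop stop words, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Groceries delivered to the kitchen and online shopping"))
# ['grocer', 'deliver', 'kitchen', 'online', 'shopp']
```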
Bayes' theorem
Website classification using Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
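Concretely, the classifier picks the category c maximising P(c | document) ∝ P(c) · Π P(word | c). A toy stdlib sketch of that scoring (not MLlib's implementation; the word counts, categories, and smoothing choice below are made up for illustration):

```python
import math

# Hypothetical per-category word counts from training data.
counts = {
    "Shopping":   {"deliveri": 4, "onlin": 3, "store": 5},
    "Automotive": {"engin": 5, "wheel": 4, "store": 1},
}
priors = {"Shopping": 0.5, "Automotive": 0.5}

def log_posterior(category, words, alpha=1.0):
    # log P(c) + sum over words of log P(w | c),
    # with Laplace (add-alpha) smoothing so unseen words don't zero the score.
    vocab = {w for c in counts.values() for w in c}
    total = sum(counts[category].values()) + alpha * len(vocab)
    score = math.log(priors[category])
    for w in words:
        score += math.log((counts[category].get(w, 0) + alpha) / total)
    return score

doc = ["onlin", "store", "deliveri"]
best = max(priors, key=lambda c: log_posterior(c, doc))
print(best)  # Shopping
```

Working in log space avoids underflow when multiplying many small probabilities, which is also what production implementations do.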
tf-idf for weighting
In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
https://en.wikipedia.org/wiki/Tf-idf
tf-idf
The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
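A minimal sketch of the weighting, using the common tf · log(N / df) formulation (MLlib's `HashingTF`/`IDF` use a hashed feature space and a slightly different idf smoothing, so this is illustrative only):

```python
import math

def tf_idf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    # tf is the raw count in the document; idf(t) = log(N / df(t)).
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [
        {t: doc.count(t) * math.log(n / df[t]) for t in set(doc)}
        for doc in docs
    ]

docs = [["cole", "store", "deliveri"], ["bmw", "engin", "store"]]
weights = tf_idf(docs)
# "store" appears in both documents, so idf = log(2/2) = 0 and its weight vanishes.
print(weights[0]["store"])  # 0.0
```

This is exactly the offsetting effect described above: a term that appears in every document carries no discriminative weight.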
Training Data from Database/HDFS
TermDoc RDDs
tf-idf weights
Array of LabeledPoint(classId, vector)
Calculate tf-idf weights on the features.
Create a LabeledPoint for each training-data row.
model = NaiveBayes.train(labelPoints)
Train the NaiveBayes Model
model.predict(feature_vector)
Predict class
New data, e.g. "Automotive"
Each row of Training data (website) is turned into this form:
(ClassId, SparseVector) in the form: 5.0, [100, (1, 44, ..), (0.3, 0.12, …)]
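That triple is a label plus a sparse vector: total size, the indices of the non-zero features, and their tf-idf values. A stdlib sketch of the conversion, assuming a hypothetical vocabulary-to-index mapping (MLlib would build the equivalent with `Vectors.sparse` and `LabeledPoint`):

```python
# Hypothetical vocabulary index: term -> column position in the feature vector.
VOCAB = {"cole": 1, "store": 44, "deliveri": 77}
VOCAB_SIZE = 100

def to_sparse(tf_idf_weights):
    # Turn {term: weight} into the (size, indices, values) triple that a
    # sparse feature vector is built from. Indices must be sorted ascending.
    pairs = sorted((VOCAB[t], w) for t, w in tf_idf_weights.items() if t in VOCAB)
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    return VOCAB_SIZE, indices, values

label = 5.0
size, idx, vals = to_sparse({"cole": 0.3, "store": 0.12})
print((label, [size, tuple(idx), tuple(vals)]))
# (5.0, [100, (1, 44), (0.3, 0.12)])
```

Storing only the non-zero entries matters here: with thousands of vocabulary terms, each website's vector is overwhelmingly zeros.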
API first for Data science
http://engineering.pivotal.io/post/api-first-for-data-science/
High Level Architecture of WebCat
High level architecture of WebCat
Webcat App
Queues/Topics
Link Collector Service
Link Crawler Service
Classification Service
Training Data
Database
Apache Spark
Categorise www.coles.com.au
Category is “Shopping and Classifieds”
Scale the Crawler service independently of the rest of the services
WebCat dashboard on PWS - Pivotal Web Services
Note that the crawler service is scaled up to 6 instances for better performance.
Ideas for improving WebCat?
User feedback loop to update the model on incorrect predictions
Categorise www.bmw.com
We think it is "Electronics" - Did we get it right?
No. The category was "Automotive"
Upload your own data - (website, category) pairs
I know kogan.com.au belongs to category "Shopping and Classifieds" - add it to the training data please.
More data = Better predictions?
User-defined categories, e.g. realestate.com.au -> "Real Estate"
Create new category "Real Estate"
Provide a publicly available API for categorised websites
GET /websites/{id}/category
GET /websites/{id}/features
…
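A hypothetical response shape for the category endpoint, sketched only to illustrate the API; the field names and structure are not confirmed by the talk:

```json
{
  "id": "www.coles.com.au",
  "category": {
    "id": 20,
    "name": "Shopping and Classifieds"
  }
}
```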
WebCat on Apache Madlib
http://madlib.incubator.apache.org/