Website classification using Apache Spark
Amith Nambiar
Demo of the WebCat app
Business problem
Automatically classify new websites into one or more predefined categories.
Why?
Web logs collected from data providers contain new websites every day, and these need to be categorised before they are presented to customers in daily reports.
Website classification using Apache Spark's MLlib.
Training Data
The starting point was already-categorised data in the form:
URL, category_id
www.linux.com, 10 -> (Computers and Internet)
www.coles.com.au, 20 -> (Shopping and Classifieds)
Training Data
Developed a crawler to crawl each of the categorised websites
2,550 websites picked for initial training and test data.
URL, Category_Id -> URL, Category_Id, Features
www.coles.com.au, 20 ->
www.coles.com.au, 20, groceri deliv kitchen bench custom receiv deliveri first spend onlin liquorland cole card cole insur apparel cole credit card locat hour look hervey hervey today normal store hour monday friday 8am special store hour saturday decemb sunday decemb store store search suburb postcod search suburb postcod select locat suburb locat found pleas store store state recip inspir recip tast cole partner tast weekli plan easier visit tast cook month cole magazin everyday ingredi sensat meal famili friend latest cole cole handi video recip creativ kitchen visit cole youtub rang rang product product bakeri dairi fresh fruit cole mobil card heston liquor special diet gluten kosher foodtruck term condit corpor respons corpor respons supplier commit work …
Crawled, stemmed, and stop-word-filtered text for the website coles.com.au
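The stemmed tokens above ("groceri", "deliveri", …) are what that cleaning step produces. A minimal stdlib sketch of the idea, not the talk's actual crawler code: the stop-word list and the crude suffix-stripping stemmer below are illustrative stand-ins (a real pipeline would more likely use a Porter-style stemmer and a fuller stop-word list).

```python
import re

# Tiny illustrative stop-word list; the real pipeline would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "on", "is", "at"}

def crude_stem(word):
    # Very rough suffix stripping standing in for a real stemmer (e.g. Porter),
    # which is what produces tokens like "groceri" and "deliveri" on the slide.
    for suffix in ("ies", "ing", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, tokenise on letters, drop stop words, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Groceries delivered to the kitchen and online shopping"))
# ['grocer', 'deliver', 'kitchen', 'online', 'shopp']
```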
Bayes' theorem
Website classification using Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
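Concretely, the classifier picks the category c maximising P(c | document) ∝ P(c) · Π P(word | c). A toy stdlib sketch of that scoring (not MLlib's implementation; the word counts, categories, and smoothing choice below are made up for illustration):

```python
import math

# Hypothetical per-category word counts from training data.
counts = {
    "Shopping":   {"deliveri": 4, "onlin": 3, "store": 5},
    "Automotive": {"engin": 5, "wheel": 4, "store": 1},
}
priors = {"Shopping": 0.5, "Automotive": 0.5}

def log_posterior(category, words, alpha=1.0):
    # log P(c) + sum over words of log P(w | c),
    # with Laplace (add-alpha) smoothing so unseen words don't zero the score.
    vocab = {w for c in counts.values() for w in c}
    total = sum(counts[category].values()) + alpha * len(vocab)
    score = math.log(priors[category])
    for w in words:
        score += math.log((counts[category].get(w, 0) + alpha) / total)
    return score

doc = ["onlin", "store", "deliveri"]
best = max(priors, key=lambda c: log_posterior(c, doc))
print(best)  # Shopping
```

Working in log space avoids underflow when multiplying many small probabilities, which is also what production implementations do.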
tf-idf for weighting
In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
https://en.wikipedia.org/wiki/Tf-idf
tf-idf
The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
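A minimal sketch of the weighting, using the common tf · log(N / df) formulation (MLlib's `HashingTF`/`IDF` use a hashed feature space and a slightly different idf smoothing, so this is illustrative only):

```python
import math

def tf_idf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    # tf is the raw count in the document; idf(t) = log(N / df(t)).
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [
        {t: doc.count(t) * math.log(n / df[t]) for t in set(doc)}
        for doc in docs
    ]

docs = [["cole", "store", "deliveri"], ["bmw", "engin", "store"]]
weights = tf_idf(docs)
# "store" appears in both documents, so idf = log(2/2) = 0 and its weight vanishes.
print(weights[0]["store"])  # 0.0
```

This is exactly the offsetting effect described above: a term that appears in every document carries no discriminative weight.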
Training Data from Database/HDFS
TermDoc RDDs
tf-idf weights
Array of LabeledPoint(classId, vector)
Calculate tf-idf weights on the features.
Create a LabeledPoint for each training-data row.
model = NaiveBayes.train(labelPoints)
Train the NaiveBayes Model
model.predict(feature_vector)
Predict class
New data, e.g. "Automotive"
Each row of Training data (website) is turned into this form:
(ClassId, SparseVector) in the form: 5.0, [100, (1, 44, ..), (0.3, 0.12, …)]
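That triple is a label plus a sparse vector: total size, the indices of the non-zero features, and their tf-idf values. A stdlib sketch of the conversion, assuming a hypothetical vocabulary-to-index mapping (MLlib would build the equivalent with `Vectors.sparse` and `LabeledPoint`):

```python
# Hypothetical vocabulary index: term -> column position in the feature vector.
VOCAB = {"cole": 1, "store": 44, "deliveri": 77}
VOCAB_SIZE = 100

def to_sparse(tf_idf_weights):
    # Turn {term: weight} into the (size, indices, values) triple that a
    # sparse feature vector is built from. Indices must be sorted ascending.
    pairs = sorted((VOCAB[t], w) for t, w in tf_idf_weights.items() if t in VOCAB)
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    return VOCAB_SIZE, indices, values

label = 5.0
size, idx, vals = to_sparse({"cole": 0.3, "store": 0.12})
print((label, [size, tuple(idx), tuple(vals)]))
# (5.0, [100, (1, 44), (0.3, 0.12)])
```

Storing only the non-zero entries matters here: with thousands of vocabulary terms, each website's vector is overwhelmingly zeros.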
API first for Data science
http://engineering.pivotal.io/post/api-first-for-data-science/
High Level Architecture of WebCat
High level architecture of WebCat
Webcat App
Queues/Topics
Link Collector Service
Link Crawler Service
Classification Service
Training Data
Database
Apache Spark
Categorise www.coles.com.au
Category is “Shopping and Classifieds”
Scale the Crawler service independently of the rest of the services
WebCat dashboard on PWS - Pivotal Web Services
Note that the crawler service is scaled up to 6 instances for better performance.
Ideas for improving WebCat?
User feedback loop to update the model on incorrect predictions
Categorise www.bmw.com
We think it is "Electronics" - Did we get it right?
No. The category was "Automotive"
Upload your own data - (website, category) pairs
I know kogan.com.au belongs to category "Shopping and Classifieds" - add it to the training data please.
More data = Better predictions?
User-defined categories, e.g. realestate.com.au -> "Real Estate"
Create new category "Real Estate"
Provide a publicly available API for categorised websites
GET /websites/{id}/category
GET /websites/{id}/features
…
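A hypothetical response shape for the category endpoint, sketched only to illustrate the API; the field names and structure are not confirmed by the talk:

```json
{
  "id": "www.coles.com.au",
  "category": {
    "id": 20,
    "name": "Shopping and Classifieds"
  }
}
```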
WebCat on Apache Madlib
http://madlib.incubator.apache.org/