Running with Elephants: Predictive Analytics with HDInsight
DESCRIPTION
Amazon and Twitter do it, Wal-Mart and Facebook too. What about you? Big Data predictive analytics is pervasive, and with HDInsight it's never been more approachable. In this session you become part of the demo, as your clickstream data at our fictional e-commerce website drives user and product recommendations using the built-in Mahout (Taste) algorithms. In this action-packed session, real-world and practical solutions for moving data into and out of HDFS (with Sqoop), using MongoDB or HBase as a source/destination, and of course handling Mahout processing in distributed mode will all be covered.
TRANSCRIPT
Running with Elephants
Predictive Analytics with Mahout & HDInsight
Introduction
Chris Price Senior BI Consultant with Pragmatic Works
Author, Regular Speaker, Data Geek & Super Dad!
@BluewaterSQL http://bluewatersql.wordpress.com/ [email protected]
You are the demo….
SQL Brewhaus
http://sqlbrewhaus.azurewebsites.net
Create an Account…
Rate some beers…
Don’t worry, your info will only be sold to the HIGHEST bidder
Agenda
• Business Case for Recommendations
• How a Recommendation Engine Works
• Recommendation Implementation & Integration
• Evaluating Recommendations
• Challenges of Implementing Recommendations
Making the Business Case
Objective: Increase Revenue
• Increase # of Orders (fix website navigational inefficiency)
• Increase Items per Order (cross-sell)
• Increase Average Item Price (up-sell)
Business Case Example
• Up-Sell: Increase Unit Price
• Cross-Sell: Increase Unit Qty
• Both drive Increased Revenue
Recommendation Engines
• Take observation data and use data mining/machine learning algorithms to predict outcomes
• Assumptions:
  • People with similar interests have common preferences
  • A sufficiently large number of preferences is available
Recommendation Options
• Collaborative Filtering (Mahout)
  • User-Based
  • Item-Based
• Content-Based (Mahout Clustering)
• Data Mining (SSAS)
  • Association
  • Clustering
Technology
Mahout
• A scalable machine learning library
• Fast, Efficient & Pragmatic
• Many of the algorithms can be run on Hadoop
HDInsight
• Hadoop on Windows
• HDInsight on Windows Azure (seamlessly scale in the cloud)
• HortonWorks Data Platform/HDP (on-premise solution)
Generating Recommendations
1. Sources of Data
2. Clean & Prepare Data
3. Generate Recommendations
  • Build User/Item matrix
  • Calculate User Similarity
  • Form Neighborhoods
  • Generate Recommendations
Sources of Data
• Explicit
  • Ratings
  • Feedback
  • Demographics
  • Psychographics (Personality/Lifestyle/Attitude)
  • Ephemeral Need (need for a moment)
• Implicit
  • Purchase History
  • Click/Browse History
• Product/Item
  • Taxonomy
  • Attributes
  • Descriptions
Our focus for today
Data Preparation
• Clean-Up:
  • Remove Outliers (Z-Score)
  • Remove frequent buyers (skew)
  • Normalize Data (unity-based)
• Format data into a CSV input file:
  <User ID>,<Item ID>,<Rating>
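The clean-up step can be sketched in a few lines of plain Java. This is an illustrative stand-alone helper (the class name `PrepRatings` and its array inputs are assumptions, not part of Mahout): it drops ratings whose global z-score exceeds a cutoff, then emits the surviving rows in the `<User ID>,<Item ID>,<Rating>` CSV format above.

```java
import java.util.ArrayList;
import java.util.List;

public class PrepRatings {
    // Filter out outlier ratings by z-score, then format as UserID,ItemID,Rating lines.
    public static List<String> toCsv(long[] users, long[] items,
                                     double[] ratings, double zCutoff) {
        // Mean and (population) standard deviation of all ratings
        double mean = 0;
        for (double r : ratings) mean += r;
        mean /= ratings.length;
        double var = 0;
        for (double r : ratings) var += (r - mean) * (r - mean);
        double std = Math.sqrt(var / ratings.length);

        List<String> lines = new ArrayList<>();
        for (int i = 0; i < ratings.length; i++) {
            double z = (std == 0) ? 0 : (ratings[i] - mean) / std;
            if (Math.abs(z) <= zCutoff)   // drop outliers beyond the cutoff
                lines.add(users[i] + "," + items[i] + "," + ratings[i]);
        }
        return lines;
    }
}
```

In practice you would also apply the frequent-buyer and normalization passes before writing the file to HDFS.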
How it Works?
• Build a User/Item Matrix
[Matrix diagram: rows are Users 1..N, columns are Items 1..n; a 1 marks an item the user has rated]
Neighborhood Formation
[Diagram: users U1 through U7 clustered into neighborhoods]
Neighborhood Formation
• Requires some experimentation
• Similarity Metrics:
  • Pearson Correlation
  • Euclidean Distance
  • Spearman Correlation
  • Cosine
  • Tanimoto Coefficient
  • Log-Likelihood
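For intuition, two of these metrics are small enough to sketch directly. These are stand-alone illustrations in plain Java, not Mahout's implementations: cosine similarity compares two rating vectors, and the Tanimoto coefficient compares two boolean "has rated" sets as |A∩B| / |A∪B|.

```java
public class Similarity {
    // Cosine similarity between two users' rating vectors (same item ordering)
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Tanimoto coefficient over boolean "user rated this item" vectors:
    // size of the intersection divided by size of the union
    public static double tanimoto(boolean[] a, boolean[] b) {
        int inter = 0, union = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] && b[i]) inter++;
            if (a[i] || b[i]) union++;
        }
        return union == 0 ? 0 : (double) inter / union;
    }
}
```

Tanimoto and log-likelihood are useful when you only have boolean preferences (rated / not rated); Pearson and cosine need actual rating values.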
How it Works?
• Find users similar to U5
• Use a similarity metric (kNN)
• U1 & U7 are identified as most similar to U5
[User/Item matrix with U1 and U7 highlighted as the users most similar to U5]
How it Works?
• Generate Recommendations:
  • Find items that have not been reviewed (I1 and I6)
  • Predict each rating by taking a weighted sum
[Matrix from before, with predicted ratings (e.g. 0.5 and 0.7) filled in for the unrated items]
Pseudo-Code Implementation
for each item i that u has no preference for
  for each user v that has a preference for i
    compute similarity s between u and v
    add v's preference for i, weighted by s, to a running average
return the top-ranked items i (by weighted average)
Restrict to Neighborhood
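The core of that loop, the similarity-weighted average that predicts a rating, can be sketched as stand-alone Java. This is an illustration of the idea, not Mahout's actual implementation; the parallel-array inputs (one rating and one similarity per neighbor) are a simplifying assumption.

```java
public class Predict {
    // Estimate user u's rating for an item as the similarity-weighted average
    // of the neighbors' ratings for that item.
    //   neighborRatings[k] = neighbor k's rating for the item
    //   sims[k]            = similarity between u and neighbor k
    public static double weightedEstimate(double[] neighborRatings, double[] sims) {
        double num = 0, den = 0;
        for (int i = 0; i < neighborRatings.length; i++) {
            num += sims[i] * neighborRatings[i];
            den += Math.abs(sims[i]);   // normalize by total similarity weight
        }
        return den == 0 ? Double.NaN : num / den;
    }
}
```

With equal similarities this degenerates to a plain average; a neighbor twice as similar to u pulls the estimate twice as hard toward its own rating.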
Mahout Implementation
• Real-Time Recommendations
  • Write Java code and host in a JVM instance
  • Limited scalability
  • Requires training data
  • Integration typically handled through web services
• Batch-Based Recommendations
  • Uses MapReduce jobs on Hadoop
  • Offline and slow, yet scalable
  • Out-of-the-box recommender jobs
Mahout MapReduce Implementation
1 – Generate List of ItemIDs
2 – Create Preference Vectors
3 – Count Unique Users
4 – Transpose Preference Vectors
5 – Row Similarity
  • Compute Weights
  • Compute Similarities
  • Similarity Matrix
6 – Pre-Partial Multiply, Similarity Matrix
7 – Pre-Partial Multiply, Preferences
8 – Partial Multiply (Steps 6 & 7)
9 – Filter Items
10 – Aggregate & Recommend
Integrating Mahout
• Real-Time
  • Requires Java coding
  • Web service
  • Process:
    • Load training data (memory pressure)
    • Generate recommendations
• Batch
  • ETL from source
  • Generate input file (UserID, ItemID, Rating)
  • Load to HDFS
  • Process with Mahout/Hadoop
  • ETL output from HDFS/Hadoop
• Output format: UserID [ItemID:Estimated Rating, …]
• Example: 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
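The ETL step that pulls results back out of HDFS has to parse that line format. A sketch of such a parser in plain Java (the class name `RecLine` is hypothetical, and treating the separator between the user ID and the bracketed list as whitespace is an assumption about the job's text output):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RecLine {
    // Parse one output line, e.g. "7 [1:4.5,2:4.0]", into item -> estimated rating.
    public static Map<Long, Double> parse(String line) {
        // Split "UserID" from "[ItemID:Rating,...]" on the first run of whitespace
        String[] parts = line.trim().split("\\s+", 2);
        String body = parts[1].substring(1, parts[1].length() - 1); // strip [ and ]
        Map<Long, Double> recs = new LinkedHashMap<>();             // keep ranked order
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            recs.put(Long.parseLong(kv[0]), Double.parseDouble(kv[1]));
        }
        return recs;
    }
}
```

From here each (user, item, rating) triple can be written to Hive, MongoDB, HBase, or back to SQL Server, as covered next.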
Handling Recommendations
Storing Recommendations:
• Hive
  • Data warehouse system for Hadoop
  • Hive ODBC Driver
• MongoDB
  • Leading NoSQL database
  • JSON-like storage with flexible schema
  • C#/.NET MongoDB driver
• HBase
  • Open-source distributed, column-oriented database modeled after Google’s BigTable
  • Use Pig/MapReduce to process output files and load the HBase table
  • Java API for easy reading
• Source System (SQL Server, etc.)
Evaluating the Recommendations
• How good are your recommendations?
• How do you evaluate the recommendation engine?
• Two options, both of which split the data into test & training sets:
  • Average Difference
  • Root-Mean-Square
• How it works:

                      I1    I2    I3
Estimated Review      3.5   4.0   1.5
Actual Review         4.0   2.0   2.0
Absolute Difference   0.5   2.0   0.5

Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0
Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) ≈ 1.22
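Both measures are easy to verify with a few lines of stand-alone Java (a sketch independent of Mahout, whose evaluators compute the same quantities over the held-out test split):

```java
public class Metrics {
    // Mean absolute error between estimated and actual ratings
    public static double mae(double[] est, double[] actual) {
        double sum = 0;
        for (int i = 0; i < est.length; i++)
            sum += Math.abs(est[i] - actual[i]);
        return sum / est.length;
    }

    // Root-mean-square error: squaring penalizes large misses more heavily
    public static double rmse(double[] est, double[] actual) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) {
            double d = est[i] - actual[i];
            sum += d * d;
        }
        return Math.sqrt(sum / est.length);
    }
}
```

Note how RMSE (≈1.22) exceeds the average difference (1.0) for the same data: the 2.0 miss on I2 dominates once it is squared.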
Evaluating the Recommendations
DataModel model = new FileDataModel(new File("ratings.csv"));
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();

RecommenderBuilder bldr = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        // Use the Pearson Correlation to calculate similarity
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Generate neighborhoods of approx. 10 users
        UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
        return new GenericUserBasedRecommender(model, hood, similarity);
    }
};

// Use 70% of the data to train the model and 30% to test
double score = eval.evaluate(bldr, null, model, 0.7, 1.0);
Challenges
1. Context
2. Cold Start
3. Data Sparsity
4. Popularity Bias
5. Curse of Dimensionality
Context Challenges
??? January: 20 degrees & snowing…
Other Challenges
• Cold Start
  • Occurs when either a new item or a new user is introduced
  • Can be handled by:
    • Substituting an average item/user profile
    • Using another recommendation generation technique (content-based)
• Data Sparsity
  • Too many items/users make finding intersections difficult
• Popularity Bias
  • Skewed towards popular items; people with “unique” taste are left out
• Curse of Dimensionality
  • More items/users lead to more noise and greater error
Resources
Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman
Hadoop: The Definitive Guide, by Tom White