Running with Elephants: Predictive Analytics with HDInsight
DESCRIPTION
Amazon and Twitter do it, Wal-Mart and Facebook too. What about you? Big Data predictive analytics is pervasive, and with HDInsight it's never been more approachable. In this session you become part of the demo, as your clickstream data at our fictional e-commerce website drives user and product recommendations using the built-in Mahout (Taste) algorithms. In this action-packed session, real-world and practical solutions for moving data into and out of HDFS (with Sqoop), using MongoDB or HBase as a source/destination, and of course handling Mahout processing in distributed mode will all be covered.
TRANSCRIPT
Running with Elephants
Predictive Analytics with Mahout & HDInsight
Introduction
Chris Price Senior BI Consultant with Pragmatic Works
Author, Regular Speaker, Data Geek & Super Dad!
@BluewaterSQL http://bluewatersql.wordpress.com/ [email protected]
You are the demo….
SQL Brewhaus
http://sqlbrewhaus.azurewebsites.net
Create an Account…
Rate some beers…
Don’t worry, your info will only be sold to the HIGHEST bidder
Agenda
• Business Case for Recommendations
• How a Recommendation Engine Works
• Recommendation Implementation & Integration
• Evaluating Recommendations
• Challenges of Implementing Recommendations
Making the Business Case
Objective: Increase Revenue
• Increase # of Orders (fix website navigational inefficiency)
• Increase Items per Order (cross-sell)
• Increase Average Item Price (up-sell)
Business Case Example
• Up-Sell: Increase Unit Price
• Cross-Sell: Increase Unit Qty
• Both drive Increased Revenue
Recommendation Engines
• Take observation data and use data mining/machine learning algorithms to predict outcomes
• Assumptions:
  • People with similar interests have common preferences
  • A sufficiently large number of preferences is available
Recommendation Options
• Collaborative Filtering (Mahout)
  • User-Based
  • Item-Based
• Content-Based (Mahout Clustering)
• Data Mining (SSAS)
  • Association
  • Clustering
Technology
Mahout
• A scalable machine learning library
• Fast, Efficient & Pragmatic
• Many of the algorithms can be run on Hadoop
HDInsight
• Hadoop on Windows
• HDInsight on Windows Azure (seamlessly scale in the cloud)
• HortonWorks Data Platform/HDP (on-premise solution)
Generating Recommendations
1. Sources of Data
2. Clean & Prepare Data
3. Generate Recommendations
  • Build User/Item matrix
  • Calculate User Similarity
  • Form Neighborhoods
  • Generate Recommendations
Sources of Data
• Explicit
  • Ratings
  • Feedback
  • Demographics
  • Psychographics (Personality/Lifestyle/Attitude)
  • Ephemeral Need (need for a moment)
• Implicit
  • Purchase History
  • Click/Browse History
• Product/Item
  • Taxonomy
  • Attributes
  • Descriptions
Our focus for today
Data Preparation
• Clean-Up:
  • Remove Outliers (Z-Score)
  • Remove frequent buyers (skew)
  • Normalize Data (unity-based)
• Format data into a CSV input file:
  <User ID>,<Item ID>,<Rating>
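The clean-up step can be sketched in a few lines of plain Java. This is an illustrative stand-alone helper (the class name `PrepRatings` and its array inputs are assumptions, not part of Mahout): it drops ratings whose global z-score exceeds a cutoff, then emits the surviving rows in the `<User ID>,<Item ID>,<Rating>` CSV format above.

```java
import java.util.ArrayList;
import java.util.List;

public class PrepRatings {
    // Filter out outlier ratings by z-score, then format as UserID,ItemID,Rating lines.
    public static List<String> toCsv(long[] users, long[] items,
                                     double[] ratings, double zCutoff) {
        // Mean and (population) standard deviation of all ratings
        double mean = 0;
        for (double r : ratings) mean += r;
        mean /= ratings.length;
        double var = 0;
        for (double r : ratings) var += (r - mean) * (r - mean);
        double std = Math.sqrt(var / ratings.length);

        List<String> lines = new ArrayList<>();
        for (int i = 0; i < ratings.length; i++) {
            double z = (std == 0) ? 0 : (ratings[i] - mean) / std;
            if (Math.abs(z) <= zCutoff)   // drop outliers beyond the cutoff
                lines.add(users[i] + "," + items[i] + "," + ratings[i]);
        }
        return lines;
    }
}
```

In practice you would also apply the frequent-buyer and normalization passes before writing the file to HDFS.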
How it Works?
• Build a User/Item Matrix
[Matrix diagram: rows are Users 1..N, columns are Items 1..n; a 1 marks an item the user has rated]
Neighborhood Formation
[Diagram: users U1 through U7 clustered into neighborhoods]
Neighborhood Formation
• Requires some experimentation
• Similarity Metrics:
  • Pearson Correlation
  • Euclidean Distance
  • Spearman Correlation
  • Cosine
  • Tanimoto Coefficient
  • Log-Likelihood
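For intuition, two of these metrics are small enough to sketch directly. These are stand-alone illustrations in plain Java, not Mahout's implementations: cosine similarity compares two rating vectors, and the Tanimoto coefficient compares two boolean "has rated" sets as |A∩B| / |A∪B|.

```java
public class Similarity {
    // Cosine similarity between two users' rating vectors (same item ordering)
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Tanimoto coefficient over boolean "user rated this item" vectors:
    // size of the intersection divided by size of the union
    public static double tanimoto(boolean[] a, boolean[] b) {
        int inter = 0, union = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] && b[i]) inter++;
            if (a[i] || b[i]) union++;
        }
        return union == 0 ? 0 : (double) inter / union;
    }
}
```

Tanimoto and log-likelihood are useful when you only have boolean preferences (rated / not rated); Pearson and cosine need actual rating values.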
How it Works?
• Find users similar to U5
• Use a similarity metric (kNN)
• U1 & U7 are identified as most similar to U5
[User/Item matrix with U1 and U7 highlighted as the users most similar to U5]
How it Works?
• Generate Recommendations:
  • Find items that have not been reviewed (I1 and I6)
  • Predict each rating by taking a weighted sum
[Matrix from before, with predicted ratings (e.g. 0.5 and 0.7) filled in for the unrated items]
Pseudo-Code Implementation
for each item i that u has no preference for
  for each user v that has a preference for i
    compute similarity s between u and v
    add v's preference for i, weighted by s, to a running average
return the top-ranked items i (by weighted average)
Restrict to Neighborhood
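The core of that loop, the similarity-weighted average that predicts a rating, can be sketched as stand-alone Java. This is an illustration of the idea, not Mahout's actual implementation; the parallel-array inputs (one rating and one similarity per neighbor) are a simplifying assumption.

```java
public class Predict {
    // Estimate user u's rating for an item as the similarity-weighted average
    // of the neighbors' ratings for that item.
    //   neighborRatings[k] = neighbor k's rating for the item
    //   sims[k]            = similarity between u and neighbor k
    public static double weightedEstimate(double[] neighborRatings, double[] sims) {
        double num = 0, den = 0;
        for (int i = 0; i < neighborRatings.length; i++) {
            num += sims[i] * neighborRatings[i];
            den += Math.abs(sims[i]);   // normalize by total similarity weight
        }
        return den == 0 ? Double.NaN : num / den;
    }
}
```

With equal similarities this degenerates to a plain average; a neighbor twice as similar to u pulls the estimate twice as hard toward its own rating.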
Mahout Implementation
• Real-Time Recommendations
  • Write Java code and host in a JVM instance
  • Limited scalability
  • Requires training data
  • Integration typically handled through web services
• Batch-Based Recommendations
  • Uses MapReduce jobs on Hadoop
  • Offline and slow, yet scalable
  • Out-of-the-box recommender jobs
Mahout MapReduce Implementation
1 – Generate List of ItemIDs
2 – Create Preference Vectors
3 – Count Unique Users
4 – Transpose Preference Vectors
5 – Row Similarity
  • Compute Weights
  • Compute Similarities
  • Similarity Matrix
6 – Pre-Partial Multiply, Similarity Matrix
7 – Pre-Partial Multiply, Preferences
8 – Partial Multiply (Steps 6 & 7)
9 – Filter Items
10 – Aggregate & Recommend
Integrating Mahout
• Real-Time
  • Requires Java coding
  • Web service
  • Process:
    • Load training data (memory pressure)
    • Generate recommendations
• Batch
  • ETL from source
  • Generate input file (UserID, ItemID, Rating)
  • Load to HDFS
  • Process with Mahout/Hadoop
  • ETL output from HDFS/Hadoop
• Output format: UserID [ItemID:Estimated Rating, …]
• Example: 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
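The ETL step that pulls results back out of HDFS has to parse that line format. A sketch of such a parser in plain Java (the class name `RecLine` is hypothetical, and treating the separator between the user ID and the bracketed list as whitespace is an assumption about the job's text output):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RecLine {
    // Parse one output line, e.g. "7 [1:4.5,2:4.0]", into item -> estimated rating.
    public static Map<Long, Double> parse(String line) {
        // Split "UserID" from "[ItemID:Rating,...]" on the first run of whitespace
        String[] parts = line.trim().split("\\s+", 2);
        String body = parts[1].substring(1, parts[1].length() - 1); // strip [ and ]
        Map<Long, Double> recs = new LinkedHashMap<>();             // keep ranked order
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            recs.put(Long.parseLong(kv[0]), Double.parseDouble(kv[1]));
        }
        return recs;
    }
}
```

From here each (user, item, rating) triple can be written to Hive, MongoDB, HBase, or back to SQL Server, as covered next.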
Handling Recommendations
Storing Recommendations:
• Hive
  • Data warehouse system for Hadoop
  • Hive ODBC Driver
• MongoDB
  • Leading NoSQL database
  • JSON-like storage with flexible schema
  • C#/.NET MongoDB driver
• HBase
  • Open-source distributed, column-oriented database modeled after Google’s BigTable
  • Use Pig/MapReduce to process output files and load the HBase table
  • Java API for easy reading
• Source System (SQL Server, etc.)
Evaluating the Recommendations
• How good are your recommendations?
• How do you evaluate the recommendation engine?
• Two options, both of which split the data into test & training sets:
  • Average Difference
  • Root-Mean-Square
• How it works:

                      I1    I2    I3
Estimated Review      3.5   4.0   1.5
Actual Review         4.0   2.0   2.0
Absolute Difference   0.5   2.0   0.5

Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0
Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) ≈ 1.22
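Both measures are easy to verify with a few lines of stand-alone Java (a sketch independent of Mahout, whose evaluators compute the same quantities over the held-out test split):

```java
public class Metrics {
    // Mean absolute error between estimated and actual ratings
    public static double mae(double[] est, double[] actual) {
        double sum = 0;
        for (int i = 0; i < est.length; i++)
            sum += Math.abs(est[i] - actual[i]);
        return sum / est.length;
    }

    // Root-mean-square error: squaring penalizes large misses more heavily
    public static double rmse(double[] est, double[] actual) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) {
            double d = est[i] - actual[i];
            sum += d * d;
        }
        return Math.sqrt(sum / est.length);
    }
}
```

Note how RMSE (≈1.22) exceeds the average difference (1.0) for the same data: the 2.0 miss on I2 dominates once it is squared.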
Evaluating the Recommendations
DataModel model = new FileDataModel(new File("ratings.csv"));
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();

RecommenderBuilder bldr = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        // Use the Pearson Correlation to calculate similarity
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Generate neighborhoods of approx. 10 users
        UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
        return new GenericUserBasedRecommender(model, hood, similarity);
    }
};

// Use 70% of the data to train the model and 30% to test
double score = eval.evaluate(bldr, null, model, 0.7, 1.0);
Challenges
1. Context
2. Cold Start
3. Data Sparsity
4. Popularity Bias
5. Curse of Dimensionality
Context Challenges
??? January: 20 degrees & snowing…
Other Challenges
• Cold Start
  • Occurs when either a new item or a new user is introduced
  • Can be handled by:
    • Substituting an average item/user profile
    • Using another recommendation generation technique (content-based)
• Data Sparsity
  • Too many items/users make finding intersections difficult
• Popularity Bias
  • Skewed towards popular items; people with “unique” taste are left out
• Curse of Dimensionality
  • More items/users lead to more noise and greater error
Resources
Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman
Hadoop: The Definitive Guide, by Tom White