predictionio - building applications that predict user behavior through big data using open-source...

Donald Szeto Kenneth Chan Simon Chan [email protected] Big Data TechCon - Oct 17, 2013 Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Upload: predictionio

Post on 06-May-2015


DESCRIPTION

Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies Presented by PredictionIO at Big Data TechCon (Oct 17, 2013)

TRANSCRIPT

Page 1: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Donald Szeto, Kenneth Chan, Simon Chan

[email protected]

Big Data TechCon - Oct 17, 2013

Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Page 2: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Machine Learning is... computers learning to predict from data

Page 3: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Putting Machine Learning into practice

Page 4: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Existing data tools are like raw material

Page 5: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

PredictionIO

3 Product Principles

Page 6: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Love Developers

• By Developers, For Developers

• REST APIs & SDKs

• Streamlined Data Science Process with UIs

Page 7: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Be Open

• Open Source

• GitHub - github.com/predictionio

• Supported by Mozilla WebFWD

Page 8: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

For Production

Page 9: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

challenge #1

Scalability

Page 10: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Big Data Bottlenecks

Machine Learning Processing

Page 11: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

PredictionIO has a horizontally scalable architecture

Page 12: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Page 13: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

challenge #2

Productivity

Page 14: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Conventional Routine

‣ Collecting Data

‣ Picking an Algorithm

‣ Writing Scalable Code

‣ Evaluating Results

‣ Tuning Algorithm Parameters

‣ Iterate . . .

Page 15: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Algorithms

‣ Item Similarity

‣ Collaborative Filtering (CF)

‣ Item Recommendation

‣ kNN Item-based CF

‣ Threshold Item-based CF

‣ Alternating Least Squares

‣ SlopeOne

‣ Singular Value Decomposition

???

???

???

???

???

???

Page 16: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

MapReduce - Native Java

    public class WordCount {

        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context) throws ..... {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);
                }
            }
        }

        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }
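
To make the map and reduce steps in the Java job easier to follow, here is a minimal single-process sketch of the same word-count logic in Python. It is not Hadoop code and does not distribute any work; the function names map_phase and reduce_phase and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token in the line."""
    return [(word, 1) for word in line.split()]


def reduce_phase(pairs):
    """Sum the emitted counts per word."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


# Illustrative input, not taken from the talk.
lines = ["big data big", "data tools"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(pairs)
print(counts)
```

In a real MapReduce job the framework groups the pairs by key between the two phases and runs them on many machines; here a dictionary plays that role.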

Page 17: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

What else?

‣ Evaluating Results

  ‣ But how?

‣ Tuning Algorithm Parameters

  ‣ Lambda

  ‣ Regularization

  ‣ # Iterations

  ‣ . . .

Page 18: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Page 19: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Better Routine with PredictionIO

‣ Collecting Data

  ‣ Using PredictionIO SDKs and API

‣ Picking an Algorithm & Writing Scalable Code

  ‣ Ready-to-use Algorithms

‣ Evaluating Results & Tuning Algorithm Parameters

  ‣ Ready-to-use Metrics

  ‣ Baseline Algorithms

‣ All of the Above Driven by a GUI (!)

Page 20: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Page 21: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Page 22: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Page 23: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Efficiently tune algorithms as a developer.

Develop intuition for how algorithms behave.

Page 24: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

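(Very) Light Intro to kNN CF

Users and items

Predict user rating on unseen item

[Figure: ratings matrix, Items 1-5 (rows) by Users 1-10 (columns). Known ratings are listed per row, left to right; empty cells and column positions were not preserved. "?" marks the rating to predict.]

Item 1: 3 5 4 1 4 1

Item 2: 2 2 1 2

Item 3: 3 1 4 ? 2 5

Item 4: 3 1 1 3 1

Item 5: 5 2 5 2 2 1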

Page 25: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

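k-Nearest Neighbor (kNN)

Find the k items most similar to the target item, using a similarity measure such as:

• Pearson Correlation

• Cosine / Adjusted Cosine

• Jaccard coefficient

Take the similarity-weighted sum of the target user's known ratings on these items

[Figure: ratings matrix, Items 1-5 (rows) by Users 1-10 (columns); column positions of the ratings were not preserved]

Item 1: 3 5 4 1 4 1

Item 2: 2 2 1 2

Item 3: 3 1 4 ? 2 5

Item 4: 3 1 1 3 1

Item 5: 5 2 5 2 2 1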

Page 26: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

k-Nearest Neighbor (kNN)

Correlation as Item Similarity:

Item 1 & 3:

R1 = [3, 5, 1, 4]

R3 = [3, 4, 2, 5]

Corr(R1, R3) = 0.8
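
The correlation quoted for Items 1 and 3 can be checked with a few lines of Python. This is a minimal sketch that assumes Corr means the Pearson correlation over the co-rated users (the first measure listed on the previous slide); the helper name pearson is made up here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length rating lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either list has no variance.
    return cov / sqrt(var_x * var_y)


r1 = [3, 5, 1, 4]  # ratings of Item 1 by the users who rated both items
r3 = [3, 4, 2, 5]  # ratings of Item 3 by the same users
corr = pearson(r1, r3)
print(round(corr, 2))  # 0.83, which the slide rounds to 0.8
```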

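[Figure: ratings matrix, Items 1-5 (rows) by Users 1-10 (columns); column positions of the ratings were not preserved]

Item 1: 3 5 4 1 4 1

Item 2: 2 2 1 2

Item 3: 3 1 4 2 5

Item 4: 3 1 1 3 1

Item 5: 5 2 5 2 2 1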

Page 27: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

k-Nearest Neighbor (kNN)

Sij: similarity score of item i and item j

Let S13 = 0.8, S35 = 0.9. The predicted score of User 6 on Item 3 would be:

(4 * 0.8 + 2 * 0.9) / (0.8 + 0.9) = 5 / 1.7 ≈ 2.9
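
The same arithmetic as a small Python sketch: User 6's known ratings on the neighboring items (4 on Item 1, 2 on Item 5) are weighted by each item's similarity to Item 3. The function name predict_rating is made up for illustration.

```python
def predict_rating(neighbors):
    """Similarity-weighted average of (rating, similarity) pairs."""
    weighted = sum(rating * sim for rating, sim in neighbors)
    total = sum(sim for _, sim in neighbors)
    return weighted / total


# (User 6's rating on the neighbor item, similarity of that item to Item 3)
pred = predict_rating([(4, 0.8), (2, 0.9)])
print(round(pred, 1))  # 2.9
```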

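[Figure: ratings matrix, Items 1-5 (rows) by Users 1-10 (columns); column positions of the ratings were not preserved]

Item 1: 3 5 4 1 4 1

Item 2: 2 2 1 2

Item 3: 3 1 4 ? 2 5

Item 4: 3 1 1 3 1

Item 5: 5 2 5 2 2 1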

Page 28: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

PredictionIO In Action

Page 29: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Page 30: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Something is missing here...

Page 31: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

“If you follow this startup, you may also want to follow these X, Y, Z startups.”

Page 32: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How are we going to do this?

Data +

Page 33: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

What data do we need?

Users - the followers (AngelList users)

Items - the startups

Actions - the “follow” actions

Page 34: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

What data do we need?

For this demo...

- Use AngelList API (https://angel.co/api)

- Retrieve all startups (Items) belonging to the following accelerators: (accelerator logos shown on the slide)

- Retrieve all users following these startups (Users and Actions).

- 16382 users. 993 items. 204176 actions.

Page 35: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

What data do we need?

Generate these files...

users.csv - list of unique user ids

<user_id>

startup_id_name_url_incubator.csv - list of startups with ids, names, AngelList URLs and incubators

<startup_id>, <name>, <url>, <incubator>

follows.csv - list of ids for startups and users who follow the startup

<startup_id>, <user_id>
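
A minimal sketch of reading files in these three layouts with Python's standard csv module. The sample rows (including the name ExampleCo and its URL) are made up and held in memory so the snippet runs on its own; the helper read_rows is hypothetical and not part of the demo repository.

```python
import csv
import io


def read_rows(handle):
    """Yield non-empty CSV rows with surrounding whitespace stripped."""
    for row in csv.reader(handle, skipinitialspace=True):
        if row:
            yield [field.strip() for field in row]


# Made-up sample content standing in for the three generated files.
users_csv = io.StringIO("5\n8\n")
startups_csv = io.StringIO("17, ExampleCo, https://example.com/exampleco, y-combinator\n")
follows_csv = io.StringIO("17, 5\n17, 8\n")

users = [row[0] for row in read_rows(users_csv)]
startups = {
    startup_id: (name, url, incubator)
    for startup_id, name, url, incubator in read_rows(startups_csv)
}
follows = [(startup_id, user_id) for startup_id, user_id in read_rows(follows_csv)]

print(len(users), len(startups), len(follows))
```

With the real files, each io.StringIO(...) would be replaced by open("users.csv", newline="") and so on.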

Page 36: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to integrate with PredictionIO?

Page 37: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Page 38: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to integrate with PredictionIO?

Page 39: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to integrate with PredictionIO?

Python SDK is used in this demo (other SDKs are similar)...

import predictionio

# PredictionIO Client Object
client = predictionio.Client("APP_KEY")

# Import users and items
client.create_user("user_id")
client.create_item("startup_id", ("item_type_1",))

# Example
client.create_user("5")
client.create_item("17", ("startup", "y-combinator",))

Page 40: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to integrate with PredictionIO?

Python SDK is used in this demo (other SDKs are similar)...

PredictionIO comes with the following built-in actions:

"like", "dislike", "rate", "view", "conversion"

# Import actions
client.identify("user_id")
client.record_action_on_item("action_name", "startup_id")

# Example
client.identify("5")
client.record_action_on_item("like", "17")  # use like to represent follow
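
A sketch of how the follow pairs could be imported in a loop using only the two calls shown on this slide, identify and record_action_on_item. The function import_follows and the RecordingClient stand-in are hypothetical: the stand-in only logs calls so the snippet runs without a PredictionIO server, and a real predictionio.Client would be passed in its place.

```python
def import_follows(client, follows, action="like"):
    """Record one action per (startup_id, user_id) pair; return how many were sent."""
    count = 0
    for startup_id, user_id in follows:
        client.identify(user_id)  # select the acting user
        client.record_action_on_item(action, startup_id)  # "like" represents "follow"
        count += 1
    return count


class RecordingClient:
    """Hypothetical stand-in for predictionio.Client that only logs calls."""

    def __init__(self):
        self.current_user = None
        self.actions = []

    def identify(self, user_id):
        self.current_user = user_id

    def record_action_on_item(self, action, item_id):
        self.actions.append((self.current_user, action, item_id))


client = RecordingClient()
imported = import_follows(client, [("17", "5"), ("17", "8")])
print(imported, client.actions)
```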

Page 41: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to integrate with PredictionIO?

Python SDK is used in this demo (other SDKs are similar)...

# Asynchronous version
client.acreate_user("user_id")
client.acreate_item("item_id", ("item_type_1",))
client.arecord_action_on_item("action_name", "item_id")

# More advanced usage
client = predictionio.Client(APP_KEY, THREADS, API_URL, qsize=REQUEST_QSIZE)

Page 42: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to integrate with PredictionIO?

Page 43: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to integrate with PredictionIO?

Page 44: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to integrate with PredictionIO?

Page 45: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to retrieve prediction results?

Python SDK is used in this demo (other SDKs are similar)...

# Retrieve prediction results from engine
rec = client.get_itemsim_topn("engine_name", "startup_id", num)

# Example
rec = client.get_itemsim_topn("my-engine", "17", 10)

# Asynchronous version
req = client.aget_itemsim_topn("engine_name", "startup_id", num)
# can do something else in between...
rec = client.aresp(req)

Page 46: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Demo Setup

https://github.com/PredictionIO/Demo-AngelList

Store and retrieve startups data

Retrieve startups data through REST API

Batch import data through Python SDK

Retrieve prediction results through REST API

Page 47: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies
Page 48: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Analytics Backend as a Service (web + mobile + internet of things)

Analytics platform for mobile & web.

E-commerce across social, mobile, and web

Data analysis tool

Page 49: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Analytics Backend as a Service (web + mobile + internet of things)

Production and distribution of visual content

Mobile Application Performance Monitoring

Mobile monetization

Payments for marketplaces and crowdfunding

Xbox Kinect for iPhone

Page 50: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Analytics Backend as a Service (web + mobile + internet of things)

Live shows

YouTube network for Artists

Page 51: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Analytics Backend as a Service (web + mobile + internet of things)

Analytics platform for mobile & web.

Production and distribution of visual content

Mobile Application Performance Monitoring

E-commerce across social, mobile, and web

Mobile monetization

Data analysis tool

Live shows

YouTube network for Artists

Payments for marketplaces and crowdfunding

Xbox Kinect for iPhone

Page 52: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to try different algorithms?

Page 53: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to try different algorithms?

Page 54: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to try different algorithms?

Page 55: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to change algorithm parameters?

Page 56: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to change algorithm parameters?

Page 57: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How do we evaluate how good the prediction results are?

We need metrics!

Page 58: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

MAP@k

Page 59: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

MAP@k (Mean Average Precision at k)

For this user A...

Retrieve Top 5 items (prediction): item4, item10, item6, item15, item7

Relevant items (actual behavior): item7, item10, item25

Total relevant items: 3

Page 60: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

MAP@k (Mean Average Precision at k)

For this user A...

Retrieve Top 5 items (prediction): item4, item10, item6, item15, item7

Relevant items (actual behavior): item7, item10, item25

Total relevant items: 3

How good is this prediction?

Page 61: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

MAP@k (Mean Average Precision at k)

For this user A...

Retrieve Top 5 items (prediction). Total relevant items: 3

Position   Item     Number of relevant items up to this position   P@k
1          item4    0                                              0
2          item10   1                                              1/2
3          item6    1                                              0
4          item15   1                                              0
5          item7    2                                              2/5

AP@5 = (1/2 + 2/5) / min(5,3)

MAP@5 = average of AP@5 of all users
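
The AP@5 calculation for user A can be reproduced with a short Python sketch. It follows the formula on the slide: sum the precision at each position where a relevant item appears, then divide by min(k, number of relevant items). The function names are made up, and user B is a made-up second user added only to show the averaging step.

```python
def average_precision_at_k(predicted, relevant, k):
    """AP@k: precision summed at each relevant hit, divided by min(k, #relevant)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for position, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / position  # precision at this position
    return score / min(k, len(relevant))


def map_at_k(predictions, relevants, k):
    """MAP@k: mean of AP@k over all users in `predictions`."""
    scores = [
        average_precision_at_k(items, relevants.get(user, ()), k)
        for user, items in predictions.items()
    ]
    return sum(scores) / len(scores)


predictions = {
    "A": ["item4", "item10", "item6", "item15", "item7"],  # from the slide
    "B": ["item1", "item2"],                               # made-up user
}
relevants = {
    "A": ["item7", "item10", "item25"],  # from the slide
    "B": ["item1"],                      # made-up user
}

ap_a = average_precision_at_k(predictions["A"], relevants["A"], 5)
print(round(ap_a, 3))  # 0.3, i.e. (1/2 + 2/5) / 3
print(round(map_at_k(predictions, relevants, 5), 3))  # 0.65 with the made-up user B
```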

Page 62: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

ISMAP@k

- Extension to MAP@k for Item Similarity Engine

Page 63: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

MAP@k

For each user....

Retrieve Top 5 items (prediction): item4, item10, item6, item15, item7

Relevant items (actual behavior): item7, item10, item25

AP@5 = (1/2 + 2/5) / min(5,3)

MAP@5 = average of AP@5 of all users

Page 64: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

ISMAP@k

For each item X...

For each user who “follows” this item X....

Retrieve Top 5 similar items to X that the user may also follow (prediction): item4, item10, item6, item15, item7

Relevant items (actual behavior): item7, item10, item25

AP@5 = (1/2 + 2/5) / min(5,3)

MAP@5 = average of AP@5 of all users

ISMAP@5 = average of MAP@5 of all items

We have metrics, but how to evaluate?
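
One way to read the ISMAP@k description as code, offered as a sketch rather than PredictionIO's actual implementation: for each item X, score X's similar-item list against the relevant items of every user who follows X, average those AP@k values, then average across items. The data layout, the function names, and details such as whether X itself is excluded from a user's relevant items are assumptions.

```python
def average_precision_at_k(predicted, relevant, k):
    """AP@k: precision summed at each relevant hit, divided by min(k, #relevant)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for position, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / position
    return score / min(k, len(relevant))


def ismap_at_k(similar_items, followers, user_relevant, k):
    """Average over items X of the MAP@k over the users who follow X."""
    per_item = []
    for item, ranked in similar_items.items():
        users = followers.get(item, [])
        if not users:
            continue  # assumption: items without followers are skipped
        aps = [average_precision_at_k(ranked, user_relevant.get(u, ()), k) for u in users]
        per_item.append(sum(aps) / len(aps))
    return sum(per_item) / len(per_item) if per_item else 0.0


# One item X with the slide's top-5 list and a single follower, user A.
similar_items = {"X": ["item4", "item10", "item6", "item15", "item7"]}
followers = {"X": ["A"]}
user_relevant = {"A": {"item7", "item10", "item25"}}

score = ismap_at_k(similar_items, followers, user_relevant, 5)
print(round(score, 3))  # 0.3 for this single item and user
```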

Page 65: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Offline Evaluation

Page 66: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Offline Evaluation (Tested with unseen data)

Data -> Split:

‣ Training set (seen) -> Training -> Retrieve prediction

‣ Test set (treated like unseen) -> Get relevant items

Retrieve Top 5 items (prediction): item4, item10, item6, item15, item7

Relevant items (actual behavior): item7, item10, item25

What do we mean by “relevant”? What’s the prediction goal?
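
A minimal sketch of the split step, assuming a simple random hold-out of the recorded actions. The 20% test fraction, the fixed seed, and the function name split_actions are illustrative choices; the slides do not say how PredictionIO performs the split.

```python
import random


def split_actions(actions, test_fraction=0.2, seed=0):
    """Shuffle the actions and hold out a fraction as the test set."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    shuffled = list(actions)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


# Made-up (startup_id, user_id) follow actions.
actions = [(str(s), str(u)) for s in range(2) for u in range(5)]
train, test = split_actions(actions)
print(len(train), len(test))  # 8 2
```

The model is trained on the training set only; the held-out test actions then serve as the "relevant items" that predictions are scored against.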

Page 67: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to set the goal?

Page 68: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How to set the goal?

Page 69: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How do we evaluate how good the prediction results are?

Page 70: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

How do we evaluate how good the prediction results are?

Page 71: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Sample evaluation scores

0.0039

0.0335

Default-Algo

random

Page 72: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Predict User Preference

‣ Smarter Ride-Sharing Apps

‣ Personalized Meal Recommendation

‣ Intelligent Online Course Discovery

‣ App Discovery

‣ and many more...

Page 73: PredictionIO - Building Applications That Predict User Behavior Through Big Data Using Open-Source Technologies

Getting Started!

- @PredictionIO

- prediction.io

- github.com/predictionio