cloudera movies data science project on big data

Cloudera Movies Data Science Project On Big Data

-Abhishek M Shivalingaiah

https://www.linkedin.com/in/abhishekmshivalingaiah

Contents

• Project Introduction
• Data Exploration
• Data Cleaning
• Data Classification
• Data Clustering
• Predicting User Ratings

Project Introduction

Background

Cloudera Movies is an on-demand internet video streaming service whose viewership is currently rising. To cater to the rising demand for its content, the company plans to expand its hardware infrastructure and improve its software stack. It also wants to identify its consumer base and build a recommendation system.

Objective

As the data science team, we are tasked to:

• understand which user accounts are used most often by younger viewers
• segment sessions based on customer behaviour to improve the site's usability
• build a recommendation engine for Cloudera Movies users to increase time on site and reduce churn

Data Exploration

Data sources:

• Cloudera node: 17 GB of JSON log files (#68)
• Heckle: 103 MB of JSON log files (#10)
• Jeckle: 99 MB of JSON log files (#10)
• ~600K lines of log

Storage and processing:

• HDFS
• Local disk file storage system
• Data lake for streaming log data
• MapReduce jobs

Data Exploration contd.

Log record fields: Auth, createdAt, payload, refId, sessionId, type, user, userAgent

• 19 event type values
• N = 612,873 events (chart labels: 91.14%, 3.20%)
• The majority of events are content playback events

• 1000000 < user ID < 100000000
• Presence of non-numeric item IDs
• Timestamps in different time zones (UTC, UTC-8 and others)

[Charts: Avg. User Rating, Avg. Write Review]

Account events: Update Password, Update Payment Info, Parental Controls

Data Exploration contd.

Exploration and reviewing; aggregation, summarization and debugging

Unix commands in the Hadoop shell

• Regular expressions: grep -Eo regexp
• Reading and extracting unique values: cat, awk

MapReduce architecture

• Python libraries: json, sys
• uniq -c
• User-defined reducer
• bash -c

Debugging

• Job Detail page (Pass/Killed in the logs column)
• Job Failure page (identifying the problem in the stack trace of the error column)
• Task logs (to spot the error in the log)
• Broken pipe error
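The aggregation step above can be sketched as a small mapper and reducer pair in Python. This is only an illustration, not the project's actual job: the sample log lines and the event type names ("Play", "Login") are made up, and the reducer mimics what `uniq -c` does on sorted keys. In a real streaming job the lines would come from sys.stdin rather than from an in-memory list.

```python
import json
from collections import Counter


def map_event_types(lines):
    """Mapper: emit (event type, 1) for every log line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            # Malformed line: emit a marker instead of crashing the job.
            yield ("<unparseable>", 1)
            continue
        if not isinstance(record, dict):
            yield ("<unparseable>", 1)
            continue
        yield (str(record.get("type", "<missing>")), 1)


def reduce_counts(pairs):
    """Reducer: sum the counts per key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Hypothetical sample lines, for illustration only.
sample = [
    '{"type": "Play", "user": 1234567}',
    '{"type": "Play", "user": 7654321}',
    '{"type": "Login", "user": 1234567}',
    'not json',
]
event_counts = reduce_counts(map_event_types(sample))
print(event_counts)
```

Counting unparseable lines under their own key keeps one bad record from killing the task, which makes failures easier to spot than a stack trace in the task logs.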

Data Cleaning

Objective

• Summarize the data into a less verbose format, so that the volume of data is reduced for further analysis
• Fix the issues identified in the exploration phase

Issues:

• Camel case, snake case and spelling variances in field names
  - Broken pipe error
  - Key error
• Many non-meaningful variables with little variance
• Handling missing variables
• Same account, different behavior
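One way to handle the camel case and snake case variances, and to avoid key errors on missing fields, is to map every key to a canonical spelling before reading it. The sketch below is an assumption about how this could be done, not the project's code: the variant spellings `created_at` and `itemID` are hypothetical, and genuine misspellings would still need an explicit alias table.

```python
def canonical_key(key):
    """Collapse camelCase and snake_case variants to one lower-case spelling."""
    return key.replace("_", "").lower()


def normalize_record(record):
    """Return a copy of the record with canonical keys, recursing into nested dicts."""
    normalized = {}
    for key, value in record.items():
        if isinstance(value, dict):
            value = normalize_record(value)
        normalized[canonical_key(key)] = value
    return normalized


def get_field(record, name, default=None):
    """Read a field from a normalized record without raising KeyError."""
    return record.get(canonical_key(name), default)


# Hypothetical record mixing both naming styles.
raw = {"sessionId": "abc", "created_at": "t0", "payload": {"itemID": 42}}
clean = normalize_record(raw)
print(clean)
print(get_field(clean, "session_id"), get_field(clean, "userAgent", "n/a"))
```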

Data Cleaning contd.

Output record format (KEY : VALUE): UserID, Timestamp, SessionID, Flags, Parameters, with fields delimited by commas and a [TAB]

Python libraries:

- dateutil.parser
- json
- sys
- datetime (tzinfo, timedelta)

- C -> recommendation
- L -> Login
- l -> Logout
- P -> Popular
- R -> Recommended
- S -> Searched

- a -> Queued
- c -> subactions of an account
- h -> Hover
- i -> Browsed
- p -> Marker
- q -> ReviewedQueue
- r -> Recent
- t -> Rating
- v -> Verified Password
- w -> Review
- x -> Parental Control

4MB
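Because the timestamps arrive in several time zones, the cleaning step has to bring them to one reference. The slides list dateutil.parser for this; the sketch below uses only the standard library's datetime instead, to stay dependency-free, and assumes the timestamps are ISO 8601 strings with an explicit offset. The sample values are made up.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    text = timestamp.strip()
    if text.endswith("Z"):
        # Older Python versions do not accept the "Z" suffix in fromisoformat.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        raise ValueError("timestamp has no UTC offset: %r" % timestamp)
    return parsed.astimezone(timezone.utc)


def to_epoch_seconds(timestamp):
    """Return the timestamp as whole seconds since the Unix epoch."""
    return int(to_utc(timestamp).timestamp())


# The same instant written in UTC-8 and in UTC (hypothetical values).
pacific = "2020-05-06T08:00:00-08:00"
utc = "2020-05-06T16:00:00Z"
print(to_utc(pacific), to_epoch_seconds(pacific) == to_epoch_seconds(utc))
```

Rejecting timestamps without an offset is a deliberate choice in this sketch: guessing a zone silently would reintroduce the inconsistency the cleaning step is meant to remove.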

Classifying Users into Children and Adults

The purpose of this section is to help the Cloudera Movies legal team understand which user accounts are used most often by younger viewers.

Information used in this classification problem:

• Parental control events label an account as a child account.
• Only adult accounts are able to perform account control operations such as changing the password.

We use this information to label some of the accounts, and then use those known labels to infer the unknown labels.

Approach to solve the classification problem

We could use a logit or probit model to classify users as child or adult.

Challenges in this approach:

• Learning the labels from a behavior profile is hard: the signal is difficult to tease out because having parental controls enabled does not guarantee anything about the user's behavior.
• To create features from the content viewed, we can treat each content item as a boolean feature, which fits well with logistic regression.
• There are 2000 users, each of whom viewed zero or more content items, and over 8500 content items, each of which was viewed by zero or more users.
• There is a strict separation between the items viewed by accounts with parental controls enabled and those viewed by accounts without parental controls enabled.

Propagate the labels of known users to unlabeled users based on the content viewed.

Approach to solve the classification problem contd.

SimRank approach: propagate influence based on distances and similarities.

The process

Extracting the content items played

• Write a mapper program to generate a compound key (user ID, start and end) and a compound value (‘kid’, ‘item string’)
• Write a reducer program to find the adults, the kids and the content viewed

Preparing the SimRank algorithm

• Build the teleport sets for adults and kids
• Create training and test sets for both (80/20)

The process contd.

Adjacency matrix

Write a mapper program to output each content item with every user who viewed it, then reduce to aggregate all the users who viewed that item: (item1, (user1, user2, user3))
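The adjacency step can be sketched in a few lines of Python. The input shape (a user paired with the list of items played) and the identifiers are assumptions made for the example; the real job would read the cleaned key/value records.

```python
from collections import defaultdict


def map_item_users(records):
    """Mapper: emit (item, user) for every item a user played."""
    for user, items in records:
        for item in items:
            yield (item, user)


def reduce_item_users(pairs):
    """Reducer: collect the de-duplicated, sorted users for each item."""
    grouped = defaultdict(set)
    for item, user in pairs:
        grouped[item].add(user)
    return {item: sorted(users) for item, users in grouped.items()}


# Hypothetical play records: (user, items played).
plays = [
    ("user1", ["item1", "item2"]),
    ("user2", ["item1"]),
    ("user3", ["item1", "item3"]),
]
adjacency = reduce_item_users(map_item_users(plays))
print(adjacency["item1"])
```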

Implementing SimRank

• The mapper reads in the current SimRank vector and computes the matrix product with the adjacency matrix.
• For each non-zero entry in a column, it multiplies by the corresponding entry in the SimRank vector and emits the result with the row label as the key.
• The reducer sums the intermediate values for each row, adds the teleport contribution, and emits the final sum as the value for that row in the new SimRank vector.
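The iteration described above can be written as a single-machine sketch. Several things here are assumptions: the graph is an undirected user/item graph in which every node appears as a key, each column of the matrix is normalized by the node's degree, and the damping factor `beta = 0.8` and the iteration count are illustrative values that do not come from the slides.

```python
from collections import defaultdict


def simrank_step(neighbors, vector, teleport, beta=0.8):
    """One iteration: new = beta * M * vector + (1 - beta) * teleport."""
    # Map phase: each node (a column) spreads its score evenly over its
    # neighbors (the rows with a non-zero entry in that column).
    contributions = defaultdict(float)
    for node, score in vector.items():
        linked = neighbors.get(node, [])
        if not linked or score == 0.0:
            continue
        share = beta * score / len(linked)
        for row in linked:
            contributions[row] += share
    # Reduce phase: sum per row and add the teleport contribution.
    return {
        node: contributions[node] + (1.0 - beta) * teleport.get(node, 0.0)
        for node in neighbors
    }


def simrank(neighbors, teleport_nodes, beta=0.8, iterations=20):
    """Iterate from the teleport distribution for a fixed number of rounds."""
    weight = 1.0 / len(teleport_nodes)
    teleport = {node: weight for node in teleport_nodes}
    vector = dict(teleport)
    for _ in range(iterations):
        vector = simrank_step(neighbors, vector, teleport, beta)
    return vector


# Hypothetical graph: kid1 and ux both watched "cartoon", adult1 watched "drama".
neighbors = {
    "kid1": ["cartoon"],
    "ux": ["cartoon"],
    "adult1": ["drama"],
    "cartoon": ["kid1", "ux"],
    "drama": ["adult1"],
}
kid_vector = simrank(neighbors, ["kid1"])
print(kid_vector)
```

Starting from the known child account, the unlabeled user `ux` picks up score through the shared item, while `adult1` stays at zero because nothing connects it to the teleport set.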

The process contd.

Interpreting and comparing the SimRank vectors

• Normalize the vectors and assign each a sign
• Assign the label based on the larger absolute value
• Classify the observation as adult or child

Testing on the validation set, we obtain 99.64% accuracy with 9 mislabeled records.
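A minimal sketch of the comparison step follows. The slides do not say how the vectors are normalized, so scaling each vector by its largest score is an assumption, as are the choice of a positive sign for the adult vector and a negative sign for the kid vector, and the "unknown" label for ties.

```python
def normalize(vector):
    """Scale scores so the largest is 1.0; an all-zero vector is returned unchanged."""
    peak = max(vector.values(), default=0.0)
    if peak == 0.0:
        return dict(vector)
    return {node: score / peak for node, score in vector.items()}


def classify_users(adult_vector, kid_vector, users):
    """Label each user by whichever signed, normalized score is larger in magnitude."""
    adult = normalize(adult_vector)
    kid = normalize(kid_vector)
    labels = {}
    for user in users:
        adult_score = adult.get(user, 0.0)   # adult scores keep a positive sign
        kid_score = -kid.get(user, 0.0)      # kid scores are given a negative sign
        if abs(adult_score) == abs(kid_score):
            labels[user] = "unknown"
        elif abs(adult_score) > abs(kid_score):
            labels[user] = "adult"
        else:
            labels[user] = "child"
    return labels


# Hypothetical SimRank scores for three users; u4 appears in neither vector.
adult_scores = {"u1": 0.3, "u2": 0.0, "u3": 0.1}
kid_scores = {"u1": 0.0, "u2": 0.4, "u3": 0.1}
labels = classify_users(adult_scores, kid_scores, ["u1", "u2", "u3", "u4"])
print(labels)
```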

We look to profile consumer behavior

Clustering user sessions reveals:

• the number of natural groupings
• which behavioral group a session belongs to

The end goals of clustering are to:

• look for notable behavior groupings, such as a large group of sessions where the user searches unsuccessfully several times and then watches a video from the home page
• flag sessions that are outliers in the grouping, as these sessions may represent anomalies of interest, such as bots, fraud or system errors
• identify patterns reflective of the groups (system optimization)

Steps involved in clustering

• Create a list of features and try them out, using the statistics from Cloudera ML to evaluate the quality of the features
• Arrive at a set of features that gives optimal statistics
• Cloudera ML also gives the optimal number of clusters

Step 1: Determining features for testing

• Directly pull out some features from the cleaned data (actions, number of items hovered over, session duration, number of items played, number of items browsed, number of items reviewed, number of items rated, number of items searched, etc.)
• Ignore features that are unlikely to relate to the user's behavior during the session (e.g. whether the user logged in and out)
• Some less direct features can be extracted too (mean play time, shortest play time, total play time, etc.)

Step 2 & 3: Merging the data and generating feature vectors

• Merge sessions with parental controls that were previously split during cleaning
• Aggregate the records by session ID and then merge the records for the same ID
• Use the merged data to generate the features
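The merge and feature generation steps can be illustrated with a short sketch. The record shape used here, a session ID paired with a string of flag characters, is an assumption about the cleaned data, and only a handful of the flags from the cleaning step (h, i, t, w, S) are turned into count features.

```python
from collections import Counter, defaultdict

# Subset of the cleaning-step flags used as count features in this sketch.
FLAG_FEATURES = {
    "h": "hovered",
    "i": "browsed",
    "t": "rated",
    "w": "reviewed",
    "S": "searched",
}
FEATURE_NAMES = sorted(FLAG_FEATURES.values())


def session_features(records):
    """Merge records by session ID and count the flagged actions per session."""
    merged = defaultdict(Counter)
    for session_id, flags in records:
        counts = merged[session_id]  # creates the session even if no flag matches
        for flag in flags:
            name = FLAG_FEATURES.get(flag)
            if name is not None:
                counts[name] += 1
    return {sid: [counts[name] for name in FEATURE_NAMES]
            for sid, counts in merged.items()}


# Hypothetical cleaned records; session s1 and s2 are each split over two records.
records = [("s1", "hh"), ("s2", "S"), ("s1", "it"), ("s2", "SL")]
features = session_features(records)
print(FEATURE_NAMES)
print(features)
```

Login and logout flags are ignored on purpose, in line with Step 1's advice to drop features that say little about behavior within the session.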

Step 4: Normalize numeric values using z-score

Dimensions measured on large scales, such as total play time, would have a much larger effect than others (e.g. number of recommended items played).

Z-score for standardization:

• Subtract the mean from each observation and divide by the sample standard deviation
• The resulting data will have a mean of zero and a standard deviation of one (if you divide by s)

$z = \frac{x - \bar{x}}{s}$ OR $z = \frac{x - \bar{x}}{\mathrm{MAD}}$
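Both variants of the standardization can be checked on a small example. The slides do not spell out MAD; the sketch below assumes it means the mean absolute deviation around the mean, which matches the use of the mean in the numerator. The sample values are made up.

```python
import statistics


def zscores(values):
    """Standardize with the sample standard deviation s (n - 1 denominator)."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(x - mean) / s for x in values]


def mad_scores(values):
    """Standardize with the mean absolute deviation (an assumed reading of MAD)."""
    mean = statistics.mean(values)
    mad = statistics.mean([abs(x - mean) for x in values])
    if mad == 0:
        return [0.0 for _ in values]
    return [(x - mean) / mad for x in values]


# Hypothetical total play times.
play_times = [2.0, 4.0, 6.0]
z = zscores(play_times)
print(z, statistics.mean(z), statistics.stdev(z))
print(mad_scores(play_times))
```

Only the division by s yields a standard deviation of exactly one; dividing by MAD rescales the data but is less sensitive to extreme sessions.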

Step 5: Clustering with k-means

Partitioning goal: partition the log files into k clusters by session ID

• Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
• Global optimum: exhaustively enumerate all partitions
• Heuristic methods:
  - k-means (MacQueen 1967): each cluster is represented by the center of the cluster
  - k-medoids (Kaufman and Rousseeuw 1987): each cluster is represented by one of the objects in the cluster; also known as Partitioning Around Medoids (PAM)
• Since we'd like to be able to compare results, we specify the seed
• We use a k-means++ sketch for choosing the initial values (or "seeds") for the k-means clustering algorithm

Feature identification:

• Add all the features in one chunk
• Back them out incrementally when the clusters start to become less distinct

Code

$ ml kmeans --input-file part2sketch.avro --centers-file part2centers.avro --clusters 40,60,80,100,120,140,160,180,200 --best-of 3 --seed 1729 --num-threads 1 --eval-details-file part2evaldetails.csv --eval-stats-file part2evalstats.csv

https://en.wikipedia.org/wiki/K-means_clustering
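For readers without the `ml` tool at hand, the idea behind seeded k-means with k-means++ initialization can be shown in plain Python. This is a small teaching sketch, not the Cloudera ML implementation: it runs in memory, uses squared Euclidean distance, and the sample points are made up. Only the seed value 1729 is taken from the command above.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick new centers with probability proportional
    to the squared distance from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, seed=1729, iterations=50):
    """Lloyd's algorithm with k-means++ seeding and a fixed random seed."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Hypothetical two-feature session vectors forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, cluster_ids = kmeans(points, k=2)
print(centers, cluster_ids)
```

Fixing the seed makes repeated runs return the same clusters, which is what allows results for different feature sets to be compared.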

Step 5: Clustering with k-means contd.

Maximum number of features yielding cluster statistics: 23

Cluster selection threshold:

• Predictive strength: 0.8
• Stable clusters: 0.8

With the final results, cluster 22 seems to be the best choice for clustering:

• Number of clusters: 300
• Predictive strength: 1.0
• Stable clusters: 0.92

Predicting User Ratings (Recommendation Engine)

[Figure: user-user similarity and item-item similarity]

Preparation of data

Mahout requires that all IDs, user and item, be numeric.

Explicit data (user, item, rating) and implicit data (user, played, reviewed)

Training data (May 5th to May 10th) and testing data (May 10th to May 12th)

Similarity

TanimotoCoefficientSimilarity

LogLikelihoodSimilarity

Evaluation

RMSE (Root Mean Square Error)
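The Tanimoto coefficient and RMSE named above have short definitions that are easy to check by hand. The functions below are plain Python written for illustration and are not Mahout's API; the item sets and ratings are made up.

```python
import math


def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: shared items divided by all distinct items."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rmse(predicted, actual):
    """Root mean square error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical users who share two of four distinct items.
similarity = tanimoto({101, 102, 103}, {102, 103, 104})
# Hypothetical predicted versus actual ratings.
error = rmse([3.0, 4.0], [3.0, 2.0])
print(similarity, error)
```

The Tanimoto coefficient ignores rating values and only looks at which items two users have in common, which suits the played and reviewed signals; RMSE applies to the rating predictions on the held-out test period.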
