MapReduce: Beyond Word Count
Jeff Patti
https://github.com/jepatti/mrjob_recipes


DESCRIPTION

This talk was prepared for the November 2013 DataPhilly Meetup: Data in Practice (http://www.meetup.com/DataPhilly/events/149515412/). Have you ever wondered what MapReduce can be used for beyond the word-count example you see in all the introductory articles? Using Python and mrjob, this talk covers a few simple MapReduce algorithms that in part power Monetate's information pipeline.

Bio: Jeff Patti is a backend engineer at Monetate with a passion for algorithms, big data, and long walks on the beach. Prior to working at Monetate he performed software R&D for Lockheed Martin, where he worked on projects ranging from social network analysis to robotics.

TRANSCRIPT

Page 1: MapReduce: Beyond Word Count

Jeff Patti
https://github.com/jepatti/mrjob_recipes

Page 2: What is MapReduce?

"MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster." - Wikipedia

Map - given a line of a file, yield key: value pairs

Reduce - given a key and all values with that key from the prior map phase, yield key: value pairs
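For example, in a word count over the two lines "a b" and "b c", the map phase yields ("a", 1), ("b", 1), ("b", 1), ("c", 1); the framework then groups the values by key, so the reduce phase is called with ("b", [1, 1]) and yields ("b", 2).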

Page 3: Word Count

Problem: count frequencies of words in documents

Page 4: Word Count Using mrjob

    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)
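The slide shows only the two methods. As a complete file they would be wrapped in an MRJob subclass; a minimal runnable sketch (file and class names here are illustrative, not from the talk's repo):

    # word_count.py - run locally with: python word_count.py input.txt
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, key, line):
            # called once per input line; emit (word, 1) per token
            for word in line.split():
                yield word, 1

        def reducer(self, word, occurrences):
            # called once per distinct word, with all its mapped values
            yield word, sum(occurrences)

    if __name__ == '__main__':
        MRWordCount.run()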

Page 5: Sample Output

"ligula" 4
"ligula." 2
"lorem" 5
"lorem." 4
"luctus" 3
"magna" 5
"magna," 3
"magnis" 1

Page 6: Monetate Background

● Core products are merchandising, personalization, testing, etc.
● A/B & multivariate testing to determine impact of experiments
● Involved with >20% of ecommerce spend each holiday season for the past 2 years running

Page 7: Monetate Stack

● Distributed across multiple availability zones and regions for redundancy, scaling, and lower round-trip times
● Real-time decision engine using MySQL
● Nightly processing of each day's data via Hadoop using mrjob, a Python library for writing MapReduce jobs

Page 8: Beyond Word Count

● Activity stream sessionization
● Product recommendations
● User behavior statistics

Page 9: Activity Stream Sessionization

Goal: collate user activity, splitting it into separate sessions whenever a user is inactive for more than 5 minutes

Input format: timestamp, user_id

Page 10: Collate User Activity

    def mapper(self, key, line):
        timestamp, user_id = line.split()
        yield user_id, timestamp

    def reducer(self, uid, timestamps):
        yield uid, sorted(timestamps)
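Note that the timestamps reach the reducer as strings, so sorted() compares them lexicographically; that happens to match numeric order here only because all of these epoch timestamps have the same number of digits. The sessionization reducer on page 12 casts them to integers before doing arithmetic on them.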

Page 11: Sample Output

"998" ["1384389407", "1384389417", "1384389422", "1384389425", "1384390407", "1384390417", "1384391416", "1384392410", "1384392416", "1384395420", "1384396405"]
"999" ["1384388414", "1384388425", "1384389419", "1384389420", "1384390420", "1384391415", "1384391418", "1384393413", "1384393425", "1384394426", "1384395416", "1384396415", "1384396422"]

Page 12: Segment into Sessions

    MAX_SESSION_INACTIVITY = 60 * 5

    ...

    def reducer(self, uid, timestamps):
        # timestamps arrive as strings from the mapper; cast before comparing
        timestamps = sorted(int(t) for t in timestamps)
        start_index = 0
        for index, timestamp in enumerate(timestamps):
            if index > 0:
                if timestamp - timestamps[index - 1] > MAX_SESSION_INACTIVITY:
                    # gap exceeded: close out the previous session
                    yield uid, timestamps[start_index:index]
                    start_index = index
        yield uid, timestamps[start_index:]

Page 13: Sample Output

"999" [1384388414, 1384388425]
"999" [1384389419, 1384389420]
"999" [1384390420]
"999" [1384391415, 1384391418]
"999" [1384393413, 1384393425]
"999" [1384394426]
"999" [1384395416]
"999" [1384396415, 1384396422]

Page 14: Product Recommendations

Goal: for each product a client sells, generate a ‘people who bought this also bought that’ recommendation

Input: product_id_1, product_id_2, ...

Page 15: Coincident Purchase Frequency

    from itertools import permutations

    def mapper(self, key, line):
        purchases = set(line.split(','))
        # permutations (not combinations) emits both (a, b) and (b, a),
        # so every product ends up as a key with each of its co-purchases
        for p1, p2 in permutations(purchases, 2):
            yield (p1, p2), 1

    def reducer(self, pair, occurrences):
        p1, p2 = pair
        yield p1, (p2, sum(occurrences))
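For example, a basket "5,8" produces both (("5", "8"), 1) and (("8", "5"), 1) in the map phase, so after reducing, product 8's co-purchase counts include product 5 and vice versa.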

Page 16: Sample Output

"8" ["5", 11]
"8" ["6", 19]
"8" ["7", 14]
"8" ["9", 11]
"9" ["1", 20]
"9" ["10", 22]
"9" ["11", 21]
"9" ["12", 13]

Page 17: Top Recommendations

    def reducer(self, purchase_pair, occurrences):
        p1, p2 = purchase_pair
        # yield (count, product) so the next step can sort by count
        yield p1, (sum(occurrences), p2)

    def reducer_find_best_recos(self, p1, p2_occurrences):
        top_products = sorted(p2_occurrences, reverse=True)[:5]
        top_products = [p2 for occurrences, p2 in top_products]
        yield p1, top_products

    def steps(self):
        return [self.mr(mapper=self.mapper, reducer=self.reducer),
                self.mr(reducer=self.reducer_find_best_recos)]
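The slide declares the steps with mrjob's older self.mr(...) helper; current mrjob releases express multi-step jobs with MRStep instead. A sketch of the same two-step job in that style (the class name is illustrative):

    from itertools import permutations

    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MRTopRecommendations(MRJob):
        def mapper(self, key, line):
            purchases = set(line.split(','))
            for p1, p2 in permutations(purchases, 2):
                yield (p1, p2), 1

        def reducer(self, purchase_pair, occurrences):
            p1, p2 = purchase_pair
            yield p1, (sum(occurrences), p2)

        def reducer_find_best_recos(self, p1, p2_occurrences):
            # keep the five co-purchases with the highest counts
            top = sorted(p2_occurrences, reverse=True)[:5]
            yield p1, [p2 for occurrences, p2 in top]

        def steps(self):
            return [MRStep(mapper=self.mapper, reducer=self.reducer),
                    MRStep(reducer=self.reducer_find_best_recos)]

    if __name__ == '__main__':
        MRTopRecommendations.run()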

Page 18: Sample Output

"7" ["15", "18", "17", "16", "3"]
"8" ["14", "15", "20", "6", "3"]
"9" ["15", "17", "19", "6", "3"]

Page 19: Top Recommendations, Multi-Account

    def mapper(self, key, line):
        account_id, purchases = line.split()
        purchases = set(purchases.split(','))
        for p1, p2 in permutations(purchases, 2):
            yield (account_id, p1, p2), 1

    def reducer(self, purchase_pair, occurrences):
        account_id, p1, p2 = purchase_pair
        yield (account_id, p1), (sum(occurrences), p2)

2nd-step reducer unchanged

Page 20: Sample Output

["9", "20"] ["8", "14", "13", "10", "1"]
["9", "3"] ["2", "4", "16", "11", "17"]
["9", "4"] ["3", "18", "11", "16", "15"]
["9", "5"] ["2", "1", "7", "18", "17"]
["9", "6"] ["12", "3", "2", "17", "16"]
["9", "7"] ["18", "5", "17", "1", "9"]
["9", "8"] ["20", "14", "13", "10", "4"]
["9", "9"] ["18", "7", "6", "5", "4"]

Page 21: User Behavior Statistics

Goal: compute statistics about user behavior (conversion rate & time on site) by account and experiment in an efficient manner

Input: account_id, campaigns_viewed, user_id, purchased?, session_start_time, session_end_time

Page 22: Statistics Primer

With the sample count, mean, and variance for each side of an experiment, we can compute all the statistics our analytics package displays

Page 23: Statistics Primer (cont.)

y = a session's metric value, e.g. time on site

● Sample count: count the number of sessions that viewed the experiment
  ○ sum(y^0)
● Mean: sum of the metric / sample count
  ○ sum(y^1) / sum(y^0)

Page 24: Statistics Primer (cont.)

● Variance:
  ○ Variance = mean of square minus square of mean
  ○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0))^2

For each side of an experiment we only need to generate: sum(y^0), sum(y^1), sum(y^2)
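The next two slides point at repo files rather than showing code. A minimal sketch of the per-account rollup they describe, assuming the input format from page 21, treating session length as the metric, and assuming purchased? is encoded as 0/1 (the field handling is an illustration, not the repo's code):

    from mrjob.job import MRJob

    class MRStatisticRollup(MRJob):
        def mapper(self, key, line):
            (account_id, campaigns_viewed, user_id, purchased,
             session_start_time, session_end_time) = line.split(',')
            session_length = int(session_end_time) - int(session_start_time)
            converted = 1 if purchased == '1' else 0
            # emit (y^0, y^1, y^2) for each metric
            yield (account_id, 'average session length'), \
                  (1, session_length, session_length ** 2)
            yield (account_id, 'conversion rate'), \
                  (1, converted, converted ** 2)

        def reducer(self, key, triples):
            # sum the three power sums elementwise
            counts, sums, sums_of_squares = zip(*triples)
            yield key, (sum(counts), sum(sums), sum(sums_of_squares))

    if __name__ == '__main__':
        MRStatisticRollup.run()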

Page 25: Statistics by Account

statistic_rollup/statistic_summarize.py

Page 26: Sample Output

["8", "average session length"] [99, 24463, 7968891]
["8", "conversion rate"] [99, 45, 45]
["9", "average session length"] [115, 29515, 10071591]
["9", "conversion rate"] [115, 55, 55]

Page 27: Statistics by Experiment

statistic_rollup_by_experiment/statistic_summarize.py

Page 28: Sample Output

["9", 0, "average session length"] [32, 8405, 3031009]
["9", 0, "conversion rate"] [32, 20, 20]
["9", 1, "average session length"] [23, 5405, 1770785]
["9", 1, "conversion rate"] [23, 14, 14]
["9", 2, "average session length"] [39, 9481, 2965651]
["9", 2, "conversion rate"] [39, 20, 20]
["9", 3, "average session length"] [25, 6276, 2151014]
["9", 3, "conversion rate"] [25, 13, 13]
["9", 4, "average session length"] [27, 5721, 1797715]
["9", 4, "conversion rate"] [27, 16, 16]

Page 29: Questions?
