MapReduce: Beyond Word Count
Jeff Patti
https://github.com/jepatti/mrjob_recipes


DESCRIPTION

This talk was prepared for the November 2013 DataPhilly Meetup: Data in Practice (http://www.meetup.com/DataPhilly/events/149515412/). Have you ever wondered what MapReduce can be used for beyond the word-count example you see in all the introductory articles? Using Python and mrjob, this talk covers a few simple MapReduce algorithms that in part power Monetate's information pipeline.

Bio: Jeff Patti is a backend engineer at Monetate with a passion for algorithms, big data, and long walks on the beach. Prior to working at Monetate he performed software R&D for Lockheed Martin, where he worked on projects ranging from social network analysis to robotics.

TRANSCRIPT

Page 1: MapReduce: Beyond Word Count

Jeff Patti
https://github.com/jepatti/mrjob_recipes

Page 2: What is MapReduce?

"MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster." - Wikipedia

Map - given a line of a file, yield key: value pairs

Reduce - given a key and all values with that key from the prior map phase, yield key: value pairs
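For example, in a word count over the two lines "a b" and "b c", the map phase yields ("a", 1), ("b", 1), ("b", 1), ("c", 1); the framework then groups the values by key, so the reduce phase is called with ("b", [1, 1]) and yields ("b", 2).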

Page 3: Word Count

Problem: count frequencies of words in documents

Page 4: Word Count Using mrjob

    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)
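The slide shows only the two methods. As a complete file they would be wrapped in an MRJob subclass; a minimal runnable sketch (file and class names here are illustrative, not from the talk's repo):

    # word_count.py - run locally with: python word_count.py input.txt
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, key, line):
            # called once per input line; emit (word, 1) per token
            for word in line.split():
                yield word, 1

        def reducer(self, word, occurrences):
            # called once per distinct word, with all its mapped values
            yield word, sum(occurrences)

    if __name__ == '__main__':
        MRWordCount.run()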

Page 5: Sample Output

"ligula" 4
"ligula." 2
"lorem" 5
"lorem." 4
"luctus" 3
"magna" 5
"magna," 3
"magnis" 1

Page 6: Monetate Background

● Core products are merchandising, personalization, testing, etc.
● A/B & multivariate testing to determine impact of experiments
● Involved with >20% of ecommerce spend each holiday season for the past 2 years running

Page 7: Monetate Stack

● Distributed across multiple availability zones and regions for redundancy, scaling, and lower round-trip times
● Real-time decision engine using MySQL
● Nightly processing of each day's data via Hadoop using mrjob, a Python library for writing MapReduce jobs

Page 8: Beyond Word Count

● Activity stream sessionization
● Product recommendations
● User behavior statistics

Page 9: Activity Stream Sessionization

Goal: collate user activity, splitting it into separate sessions whenever a user is inactive for more than 5 minutes

Input format: timestamp, user_id

Page 10: Collate User Activity

    def mapper(self, key, line):
        timestamp, user_id = line.split()
        yield user_id, timestamp

    def reducer(self, uid, timestamps):
        yield uid, sorted(timestamps)
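Note that the timestamps reach the reducer as strings, so sorted() compares them lexicographically; that happens to match numeric order here only because all of these epoch timestamps have the same number of digits. The sessionization reducer on page 12 casts them to integers before doing arithmetic on them.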

Page 11: Sample Output

"998" ["1384389407", "1384389417", "1384389422", "1384389425", "1384390407", "1384390417", "1384391416", "1384392410", "1384392416", "1384395420", "1384396405"]
"999" ["1384388414", "1384388425", "1384389419", "1384389420", "1384390420", "1384391415", "1384391418", "1384393413", "1384393425", "1384394426", "1384395416", "1384396415", "1384396422"]

Page 12: Segment into Sessions

    MAX_SESSION_INACTIVITY = 60 * 5

    ...

    def reducer(self, uid, timestamps):
        # timestamps arrive as strings from the mapper; cast before comparing
        timestamps = sorted(int(t) for t in timestamps)
        start_index = 0
        for index, timestamp in enumerate(timestamps):
            if index > 0:
                if timestamp - timestamps[index - 1] > MAX_SESSION_INACTIVITY:
                    # gap exceeded: close out the previous session
                    yield uid, timestamps[start_index:index]
                    start_index = index
        yield uid, timestamps[start_index:]

Page 13: Sample Output

"999" [1384388414, 1384388425]
"999" [1384389419, 1384389420]
"999" [1384390420]
"999" [1384391415, 1384391418]
"999" [1384393413, 1384393425]
"999" [1384394426]
"999" [1384395416]
"999" [1384396415, 1384396422]

Page 14: Product Recommendations

Goal: for each product a client sells, generate a ‘people who bought this also bought that’ recommendation

Input: product_id_1, product_id_2, ...

Page 15: Coincident Purchase Frequency

    from itertools import permutations

    def mapper(self, key, line):
        purchases = set(line.split(','))
        # permutations (not combinations) emits both (a, b) and (b, a),
        # so every product ends up as a key with each of its co-purchases
        for p1, p2 in permutations(purchases, 2):
            yield (p1, p2), 1

    def reducer(self, pair, occurrences):
        p1, p2 = pair
        yield p1, (p2, sum(occurrences))
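For example, a basket "5,8" produces both (("5", "8"), 1) and (("8", "5"), 1) in the map phase, so after reducing, product 8's co-purchase counts include product 5 and vice versa.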

Page 16: Sample Output

"8" ["5", 11]
"8" ["6", 19]
"8" ["7", 14]
"8" ["9", 11]
"9" ["1", 20]
"9" ["10", 22]
"9" ["11", 21]
"9" ["12", 13]

Page 17: Top Recommendations

    def reducer(self, purchase_pair, occurrences):
        p1, p2 = purchase_pair
        # yield (count, product) so the next step can sort by count
        yield p1, (sum(occurrences), p2)

    def reducer_find_best_recos(self, p1, p2_occurrences):
        top_products = sorted(p2_occurrences, reverse=True)[:5]
        top_products = [p2 for occurrences, p2 in top_products]
        yield p1, top_products

    def steps(self):
        return [self.mr(mapper=self.mapper, reducer=self.reducer),
                self.mr(reducer=self.reducer_find_best_recos)]
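The slide declares the steps with mrjob's older self.mr(...) helper; current mrjob releases express multi-step jobs with MRStep instead. A sketch of the same two-step job in that style (the class name is illustrative):

    from itertools import permutations

    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MRTopRecommendations(MRJob):
        def mapper(self, key, line):
            purchases = set(line.split(','))
            for p1, p2 in permutations(purchases, 2):
                yield (p1, p2), 1

        def reducer(self, purchase_pair, occurrences):
            p1, p2 = purchase_pair
            yield p1, (sum(occurrences), p2)

        def reducer_find_best_recos(self, p1, p2_occurrences):
            # keep the five co-purchases with the highest counts
            top = sorted(p2_occurrences, reverse=True)[:5]
            yield p1, [p2 for occurrences, p2 in top]

        def steps(self):
            return [MRStep(mapper=self.mapper, reducer=self.reducer),
                    MRStep(reducer=self.reducer_find_best_recos)]

    if __name__ == '__main__':
        MRTopRecommendations.run()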

Page 18: Sample Output

"7" ["15", "18", "17", "16", "3"]
"8" ["14", "15", "20", "6", "3"]
"9" ["15", "17", "19", "6", "3"]

Page 19: Top Recommendations, Multi-Account

    def mapper(self, key, line):
        account_id, purchases = line.split()
        purchases = set(purchases.split(','))
        for p1, p2 in permutations(purchases, 2):
            yield (account_id, p1, p2), 1

    def reducer(self, purchase_pair, occurrences):
        account_id, p1, p2 = purchase_pair
        yield (account_id, p1), (sum(occurrences), p2)

2nd-step reducer unchanged

Page 20: Sample Output

["9", "20"] ["8", "14", "13", "10", "1"]
["9", "3"] ["2", "4", "16", "11", "17"]
["9", "4"] ["3", "18", "11", "16", "15"]
["9", "5"] ["2", "1", "7", "18", "17"]
["9", "6"] ["12", "3", "2", "17", "16"]
["9", "7"] ["18", "5", "17", "1", "9"]
["9", "8"] ["20", "14", "13", "10", "4"]
["9", "9"] ["18", "7", "6", "5", "4"]

Page 21: User Behavior Statistics

Goal: compute statistics about user behavior (conversion rate & time on site) by account and experiment in an efficient manner

Input: account_id, campaigns_viewed, user_id, purchased?, session_start_time, session_end_time

Page 22: Statistics Primer

With the sample count, mean, and variance for each side of an experiment, we can compute all the statistics our analytics package displays

Page 23: Statistics Primer (cont.)

y = a session's metric value, e.g. time on site

● Sample count: count the number of sessions that viewed the experiment
  ○ sum(y^0)
● Mean: sum of the metric / sample count
  ○ sum(y^1) / sum(y^0)

Page 24: Statistics Primer (cont.)

● Variance:
  ○ Variance = mean of square minus square of mean
  ○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0))^2

For each side of an experiment we only need to generate: sum(y^0), sum(y^1), sum(y^2)
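The next two slides point at repo files rather than showing code. A minimal sketch of the per-account rollup they describe, assuming the input format from page 21, treating session length as the metric, and assuming purchased? is encoded as 0/1 (the field handling is an illustration, not the repo's code):

    from mrjob.job import MRJob

    class MRStatisticRollup(MRJob):
        def mapper(self, key, line):
            (account_id, campaigns_viewed, user_id, purchased,
             session_start_time, session_end_time) = line.split(',')
            session_length = int(session_end_time) - int(session_start_time)
            converted = 1 if purchased == '1' else 0
            # emit (y^0, y^1, y^2) for each metric
            yield (account_id, 'average session length'), \
                  (1, session_length, session_length ** 2)
            yield (account_id, 'conversion rate'), \
                  (1, converted, converted ** 2)

        def reducer(self, key, triples):
            # sum the three power sums elementwise
            counts, sums, sums_of_squares = zip(*triples)
            yield key, (sum(counts), sum(sums), sum(sums_of_squares))

    if __name__ == '__main__':
        MRStatisticRollup.run()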

Page 25: Statistics by Account

statistic_rollup/statistic_summarize.py

Page 26: Sample Output

["8", "average session length"] [99, 24463, 7968891]
["8", "conversion rate"] [99, 45, 45]
["9", "average session length"] [115, 29515, 10071591]
["9", "conversion rate"] [115, 55, 55]

Page 27: Statistics by Experiment

statistic_rollup_by_experiment/statistic_summarize.py

Page 28: Sample Output

["9", 0, "average session length"] [32, 8405, 3031009]
["9", 0, "conversion rate"] [32, 20, 20]
["9", 1, "average session length"] [23, 5405, 1770785]
["9", 1, "conversion rate"] [23, 14, 14]
["9", 2, "average session length"] [39, 9481, 2965651]
["9", 2, "conversion rate"] [39, 20, 20]
["9", 3, "average session length"] [25, 6276, 2151014]
["9", 3, "conversion rate"] [25, 13, 13]
["9", 4, "average session length"] [27, 5721, 1797715]
["9", 4, "conversion rate"] [27, 16, 16]

Page 29: Questions?
