Map reduce: beyond word count
MapReduce: Beyond Word Count
Jeff Patti
https://github.com/jepatti/mrjob_recipes
What is MapReduce?
"MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster." - Wikipedia
Map - given a line of a file, yield (key, value) pairs
Reduce - given a key and all values emitted with that key during the prior map phase, yield (key, value) pairs
Word Count
Problem: count the frequencies of words in documents
Word Count Using mrjob

def mapper(self, key, line):
    for word in line.split():
        yield word, 1

def reducer(self, word, occurrences):
    yield word, sum(occurrences)
Sample Output
"ligula" 4
"ligula." 2
"lorem" 5
"lorem." 4
"luctus" 3
"magna" 5
"magna," 3
"magnis" 1
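The mapper/reducer pair above can be exercised without a Hadoop cluster. Below is a minimal pure-Python sketch of the three phases (map, shuffle/group, reduce); `run_job` is a helper invented here for illustration and is not part of mrjob.

```python
from collections import defaultdict

def mapper(key, line):
    for word in line.split():
        yield word, 1

def reducer(word, occurrences):
    yield word, sum(occurrences)

def run_job(lines, mapper, reducer):
    # Shuffle phase: group every mapped value under its key,
    # mimicking what Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, line in enumerate(lines):
        for k, v in mapper(key, line):
            groups[k].append(v)
    # Reduce phase: one call per distinct key
    return dict(pair for k, vals in groups.items()
                for pair in reducer(k, vals))

counts = run_job(["lorem ipsum lorem", "ipsum lorem"], mapper, reducer)
# counts == {"lorem": 3, "ipsum": 2}
```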
Monetate Background
● Core products are merchandising, personalization, testing, etc.
● A/B & multivariate testing to determine the impact of experiments
● Involved with >20% of ecommerce spend each holiday season for the past 2 years running
Monetate Stack
● Distributed across multiple availability zones and regions for redundancy, scaling, and lower round-trip times
● Real-time decision engine using MySQL
● Nightly processing of each day's data via Hadoop, using mrjob, a Python library for writing MapReduce jobs
Beyond Word Count
● Activity stream sessionization
● Product recommendations
● User behavior statistics
Activity Stream Sessionization
Goal: collate user activity, splitting it into separate sessions whenever the user is inactive for more than 5 minutes
Input format: timestamp, user_id
Collate User Activity

def mapper(self, key, line):
    timestamp, user_id = line.split()
    yield user_id, timestamp

def reducer(self, uid, timestamps):
    yield uid, sorted(timestamps)
Sample Output
"998" ["1384389407", "1384389417", "1384389422", "1384389425", "1384390407", "1384390417", "1384391416", "1384392410", "1384392416", "1384395420", "1384396405"]
"999" ["1384388414", "1384388425", "1384389419", "1384389420", "1384390420", "1384391415", "1384391418", "1384393413", "1384393425", "1384394426", "1384395416", "1384396415", "1384396422"]
Segment into Sessions

MAX_SESSION_INACTIVITY = 60 * 5
...
def reducer(self, uid, timestamps):
    # timestamps arrive as strings from the mapper; convert
    # to int before doing arithmetic on them
    timestamps = sorted(int(t) for t in timestamps)
    start_index = 0
    for index, timestamp in enumerate(timestamps):
        if index > 0:
            if timestamp - timestamps[index - 1] > MAX_SESSION_INACTIVITY:
                yield uid, timestamps[start_index:index]
                start_index = index
    yield uid, timestamps[start_index:]
Sample Output
"999" [1384388414, 1384388425]
"999" [1384389419, 1384389420]
"999" [1384390420]
"999" [1384391415, 1384391418]
"999" [1384393413, 1384393425]
"999" [1384394426]
"999" [1384395416]
"999" [1384396415, 1384396422]
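The reducer's splitting logic is easy to check on its own. Here is a standalone sketch; `sessionize` is a name invented for illustration, and the gap threshold matches the slide's MAX_SESSION_INACTIVITY.

```python
MAX_SESSION_INACTIVITY = 60 * 5

def sessionize(timestamps):
    # Split sorted epoch-second timestamps into sessions wherever the
    # gap between consecutive events exceeds 5 minutes
    timestamps = sorted(int(t) for t in timestamps)
    sessions, start = [], 0
    for i in range(1, len(timestamps)):
        if timestamps[i] - timestamps[i - 1] > MAX_SESSION_INACTIVITY:
            sessions.append(timestamps[start:i])
            start = i
    sessions.append(timestamps[start:])
    return sessions

sessions = sessionize(["1384388414", "1384388425", "1384389419", "1384389420"])
# → [[1384388414, 1384388425], [1384389419, 1384389420]]
```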
Product Recommendations
Goal: for each product a client sells, generate a 'people who bought this also bought this' recommendation
Input: product_id_1, product_id_2, ...
Coincident Purchase Frequency

from itertools import permutations

def mapper(self, key, line):
    purchases = set(line.split(','))
    for p1, p2 in permutations(purchases, 2):
        yield (p1, p2), 1

def reducer(self, pair, occurrences):
    p1, p2 = pair
    yield p1, (p2, sum(occurrences))
Sample Output
"8" ["5", 11]
"8" ["6", 19]
"8" ["7", 14]
"8" ["9", 11]
"9" ["1", 20]
"9" ["10", 22]
"9" ["11", 21]
"9" ["12", 13]
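Outside of mrjob, the same pair counting can be sketched with itertools and a Counter; `co_purchase_counts` is an illustrative name, not from the talk.

```python
from itertools import permutations
from collections import Counter

def co_purchase_counts(baskets):
    # For each basket, count every ordered pair of distinct products.
    # permutations (not combinations) yields both (p1, p2) and (p2, p1),
    # so each product later gets its own recommendation row.
    counts = Counter()
    for basket in baskets:
        for p1, p2 in permutations(set(basket), 2):
            counts[p1, p2] += 1
    return counts

counts = co_purchase_counts([["1", "2", "3"], ["1", "2"]])
# counts["1", "2"] == 2, counts["1", "3"] == 1
```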
Top Recommendations

def reducer(self, purchase_pair, occurrences):
    p1, p2 = purchase_pair
    yield p1, (sum(occurrences), p2)

def reducer_find_best_recos(self, p1, p2_occurrences):
    top_products = sorted(p2_occurrences, reverse=True)[:5]
    top_products = [p2 for occurrences, p2 in top_products]
    yield p1, top_products

def steps(self):
    return [self.mr(mapper=self.mapper, reducer=self.reducer),
            self.mr(reducer=self.reducer_find_best_recos)]
Sample Output
"7" ["15", "18", "17", "16", "3"]
"8" ["14", "15", "20", "6", "3"]
"9" ["15", "17", "19", "6", "3"]
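The second-step reducer keeps the five most frequent co-purchases because Python sorts the (count, product) tuples by count first. A standalone sketch of that selection step; `top_recommendations` is an invented helper name:

```python
from collections import defaultdict

def top_recommendations(pair_counts, n=5):
    # pair_counts maps (p1, p2) -> co-purchase count, as produced
    # by the first reduce step
    candidates = defaultdict(list)
    for (p1, p2), count in pair_counts.items():
        candidates[p1].append((count, p2))
    # Sorting (count, product) tuples in descending order puts the
    # most frequently co-purchased products first
    return {p1: [p2 for _, p2 in sorted(c, reverse=True)[:n]]
            for p1, c in candidates.items()}

recos = top_recommendations({("8", "5"): 11, ("8", "6"): 19, ("8", "7"): 14}, n=2)
# recos == {"8": ["6", "7"]}
```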
Top Recommendations, Multi Account

def mapper(self, key, line):
    account_id, purchases = line.split()
    purchases = set(purchases.split(','))
    for p1, p2 in permutations(purchases, 2):
        yield (account_id, p1, p2), 1

def reducer(self, purchase_pair, occurrences):
    account_id, p1, p2 = purchase_pair
    yield (account_id, p1), (sum(occurrences), p2)

2nd step reducer unchanged
Sample Output
["9", "20"] ["8", "14", "13", "10", "1"]
["9", "3"] ["2", "4", "16", "11", "17"]
["9", "4"] ["3", "18", "11", "16", "15"]
["9", "5"] ["2", "1", "7", "18", "17"]
["9", "6"] ["12", "3", "2", "17", "16"]
["9", "7"] ["18", "5", "17", "1", "9"]
["9", "8"] ["20", "14", "13", "10", "4"]
["9", "9"] ["18", "7", "6", "5", "4"]
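Adding account_id to the key partitions the whole computation per account. End to end, assuming input lines in the slide's "account_id purchases" format, the two steps can be sketched as one plain-Python function (`multi_account_recos` is an invented name):

```python
from itertools import permutations
from collections import Counter, defaultdict

def multi_account_recos(lines, n=5):
    # lines look like "9 1,2,3" (account_id, comma-separated purchases)
    counts = Counter()
    for line in lines:
        account_id, purchases = line.split()
        for p1, p2 in permutations(set(purchases.split(',')), 2):
            counts[account_id, p1, p2] += 1
    # Second step: per (account, product), keep the n best co-purchases
    candidates = defaultdict(list)
    for (account_id, p1, p2), c in counts.items():
        candidates[account_id, p1].append((c, p2))
    return {key: [p2 for _, p2 in sorted(c, reverse=True)[:n]]
            for key, c in candidates.items()}

recos = multi_account_recos(["9 1,2", "9 1,2,3", "7 1,2"], n=1)
# recos[("9", "1")] == ["2"]
```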
User Behavior Statistics
Goal: efficiently compute statistics about user behavior (conversion rate & time on site) by account and experiment
Input: account_id, campaigns_viewed, user_id, purchased?, session_start_time, session_end_time
Statistics Primer
With the sample count, mean, and variance for each side of an experiment, we can compute all the statistics our analytics package displays
Statistics Primer (cont.)
y = a session's metric value, e.g. time on site
● Sample count: count the number of sessions that viewed the experiment
  ○ sum(y^0)
● Mean: sum of the metric / sample count
  ○ sum(y^1)/sum(y^0)
Statistics Primer (cont.)
● Variance:
  ○ Variance = mean of squares minus square of mean
  ○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0))^2
For each side of an experiment we only need to generate: sum(y^0), sum(y^1), sum(y^2)
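Those three running sums are all a reducer has to emit; count, mean, and variance fall out afterward. A single-pass sketch (`summary_stats` is an invented name; this computes the population variance, matching the slide's formula):

```python
def summary_stats(values):
    # Accumulate sum(y^0), sum(y^1), sum(y^2) in a single pass
    s0 = s1 = s2 = 0
    for y in values:
        s0 += 1
        s1 += y
        s2 += y * y
    mean = s1 / s0
    # Variance = mean of squares minus square of mean
    variance = s2 / s0 - mean ** 2
    return s0, mean, variance

count, mean, variance = summary_stats([10, 20, 30])
# count == 3, mean == 20.0
```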
Statistics by Account
statistic_rollup/statistic_summarize.py
Sample Output
["8", "average session length"] [99, 24463, 7968891]
["8", "conversion rate"] [99, 45, 45]
["9", "average session length"] [115, 29515, 10071591]
["9", "conversion rate"] [115, 55, 55]
Statistics by Experiment
statistic_rollup_by_experiment/statistic_summarize.py
Sample Output
["9", 0, "average session length"] [32, 8405, 3031009]
["9", 0, "conversion rate"] [32, 20, 20]
["9", 1, "average session length"] [23, 5405, 1770785]
["9", 1, "conversion rate"] [23, 14, 14]
["9", 2, "average session length"] [39, 9481, 2965651]
["9", 2, "conversion rate"] [39, 20, 20]
["9", 3, "average session length"] [25, 6276, 2151014]
["9", 3, "conversion rate"] [25, 13, 13]
["9", 4, "average session length"] [27, 5721, 1797715]
["9", 4, "conversion rate"] [27, 16, 16]
Questions?