Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

39
Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster Dan Şerban

Transcript of Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Page 1: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Amazon-styleshopping cart analysis

using MapReduceon a Hadoop cluster

Dan Şerban

Page 2: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

:: Introduction- Real-world uses of MapReduce- The origins of Hadoop- Hadoop facts and architecture

:: Part 1- Deploying Hadoop

:: Part 2- MapReduce is machine learning

:: Q&A

Agenda

Page 3: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Why shopping cart analysisis useful to amazon.com

Page 4: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Linkedin and Google Reader

Page 5: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

:: Hadoop got its start in Nutch:: A few enthusiastic developers were attempting

to build an open source web search engineand having trouble managing computationsrunning on even a handful of computers

:: Once Google published their GoogleFS and MapReducewhitepapers, the way forward became clear

:: Google had devised systems to solve preciselythe problems the Nutch project was facing

:: Thus, Hadoop was born

The origins of Hadoop

Page 6: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

:: Hadoop is a distributed computing platformfor processing extremely large amounts of data

:: Hadoop is divided into two main components:- the MapReduce runtime- the Hadoop Distributed File System (HDFS)

:: The MapReduce runtime allows the user to submitMapReduce jobs

:: The HDFS is a distributed file system that providesa logical interface for persistent and redundantstorage of large data

:: Hadoop also provides the HadoopStreaming librarythat leverages STDIN and STDOUT so you can writemappers and reducers in your programming languageof choice

Hadoop facts

Page 7: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

:: Hadoop is based on the principleof moving computation to where the data is

:: Data stored on the Hadoop Distributed File Systemis broken up into chunks and replicated acrossthe cluster providing fault tolerant parallel processingand redundancy for both the data and the jobs

:: Computation takes the form of a job which consistsof a map phase and a reduce phase

:: Data is initially processed by map functions which runin parallel across the cluster

:: Map output is in the form of key-value pairs:: The reduce phase then aggregates the map results:: The reduce phase typically happens in multiple

consecutive waves until the job is complete

Hadoop facts

Page 8: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hadoop architecture

Page 9: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Part 1:Configuring and deploying

the Hadoop cluster

Page 10: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with Hadoop

Page 11: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

core-site.xml - before

Page 12: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

core-site.xml - after

Page 13: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

hdfs-site.xml - before

Page 14: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

hdfs-site.xml - after

Page 15: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

mapred-site.xml - before

Page 16: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

mapred-site.xml - after

Page 17: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

:: hadoop@hadoop-master needs to be able to ssh* into:- hadoop@hadoop-master- hadoop@chunkserver-a- hadoop@chunkserver-b- hadoop@chunkserver-c

:: hadoop@job-tracker needs to be able to ssh* into:- hadoop@job-tracker- hadoop@chunkserver-a- hadoop@chunkserver-b- hadoop@chunkserver-c

*Passwordless-ly and passphraseless-ly

Setting up SSH

Page 18: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with Hadoop

Page 19: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with Hadoop

Page 20: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with Hadoop

Page 21: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with Hadoop

Page 22: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with Hadoop

Page 23: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with Hadoop

Page 24: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with Hadoop

Page 25: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Part 2:MapReduce

is machine learning

Page 26: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Rolling your ownself-hosted alternative to ...

Page 27: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with MapReduce

Page 28: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with MapReduce

Page 29: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with MapReduce

Page 30: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

#!/usr/bin/python

import sys

for line in sys.stdin: line = line.strip() IDs = line.split() for firstID in IDs: for secondID in IDs: if secondID > firstID: print '%s_%s\t%s' % (firstID, secondID, 1)

mapper.py

Page 31: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

#!/usr/bin/python

import sys

subTotals = {}for line in sys.stdin: line = line.strip() word = line.split('\t')[0] count = int(line.split('\t')[1]) subTotals[word] = subTotals.get(word, 0) + countfor k, v in subTotals.items(): print "%s\t%s" % (k, v)

reducer.py

Page 32: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with MapReduce

Page 33: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with MapReduce

Page 34: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with MapReduce

Page 35: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with MapReduce

Page 36: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Hands-on with MapReduce

Page 37: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

:: Google Suggest:: Video recommendations (YouTube):: ClickStream Analysis (large web properties):: Spam filtering and contextual advertising (Yahoo):: Fraud detection (eBay, CC companies):: Firewall log analysis to discover exfiltration and

other undesirable (possibly malware-related) activity:: Finding patterns in social data, analyzing “likes” and

building a search engine on top of them (FaceBook):: Discovering microblogging trends and opinion leaders,

analyzing who follows who (Twitter):: Plain old supermarket shopping basket analysis:: The semantic web

Other MapReduce use cases

Page 38: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Questions / Feedback

Page 39: Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Bonus slide: Making of SQLite DB