Post on 13-Aug-2015
BloomReach and AWS Elastic MapReduce
Prateek Gupta – Lead Engineer
10/24/2014
The BloomReach Personalized Discovery Platform
http://bloomreach.com/what-we-do/
About BloomReach’s applications
Content understanding
Organic Search
What it does: Content optimization, management and measurement
Benefit: Enhanced discoverability and customer acquisition in organic search
SNAP
What it does: Personalized onsite search and navigation across devices
Benefit: Relevant and consistent onsite experiences for new and known users
Compass
What it does: Merchandising tool that understands products and identifies opportunities
Benefit: Prioritize and optimize online merchandising
BloomReach Organic Search - Merchant Integration
Merchant domain
Bloomreach domain (Amazon Web Services)
CloudFront
domain: brcdn.com
br-trk.js
pix.gif
Elastic Compute Cloud
domain: brsrvr.com
REST API request
Elastic Compute Cloud
domain: brsrvr.com
Javascript
API response
BloomReach Organic Search Architecture
API response
REST API request
Domain Name Server (DNS)
AWS Load balancer
Instance
Instance
Instance
Instance
Alternate Cloud Provider
Multiple Availability Zones
Domain request
Domain response
Example Workflow - Personalization
Compute User Features
Compute Recommendations
Compute User Profile
User/Product Database
Pixel Logs (S3)
Extract Related Users
Extract User Session
Elastic MapReduce (EMR) Usage
• We serve 150+ customer websites
• 100+ million pages processed per day
• More than 400M users seen per day
• Multiple Hadoop steps (clusters)
Usage Metric             BloomReach Volume
Clusters per day         1500-2000
Hadoop jobs per day      5000-6000
Instance hours per day   25,000 – 30,000
Elastic MapReduce Usage Growth
[Chart: instance hours per month by quarter, Q4 2009 through Q3 2014; y-axis from 0 to 800,000. Series: Organic, Compass, SNAP Desktop, SNAP Mobile, Spot Instance]
Challenges
• Cost containment: on-demand vs. spot usage
• Cost tracking: EMR tags
• Cluster setup delay: sharing clusters
• Cluster lifecycle management: terminate long-running clusters
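Two of these challenges, cost tracking by tag and spotting long-running clusters, reduce to simple bookkeeping once cluster metadata is available. The following is a minimal sketch under stated assumptions: `ClusterRecord`, its fields, the `product` tag key and the 24-hour threshold are all hypothetical and are not taken from the EMR API or from BloomReach's tooling.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ClusterRecord:
    """Hypothetical inventory record; not an EMR API object."""
    cluster_id: str
    started_at: datetime
    instance_hours: float
    tags: dict = field(default_factory=dict)


def instance_hours_by_tag(clusters, tag_key):
    """Sum instance hours per value of one tag (e.g. a product tag)."""
    totals = defaultdict(float)
    for c in clusters:
        totals[c.tags.get(tag_key, "untagged")] += c.instance_hours
    return dict(totals)


def long_running(clusters, now, max_age):
    """Return ids of clusters that have been up longer than max_age."""
    return [c.cluster_id for c in clusters if now - c.started_at > max_age]


# Illustrative data only.
now = datetime(2014, 10, 24, 12, 0)
clusters = [
    ClusterRecord("c-1", now - timedelta(hours=2), 40.0, {"product": "organic"}),
    ClusterRecord("c-2", now - timedelta(hours=30), 900.0, {"product": "snap"}),
    ClusterRecord("c-3", now - timedelta(hours=5), 60.0, {"product": "organic"}),
]
hours = instance_hours_by_tag(clusters, "product")
stale = long_running(clusters, now, timedelta(hours=24))
```

In practice the records would be filled from whatever cluster listing is available, and the ids in `stale` would be reviewed before anything is terminated.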
Resource Selection
• Dynamic resource (instance type) selection based on CPU, memory
maxCpuPerUnitPrice = 0
optimalInstanceType = null
For each instance_type in (Availability Zone, Region) {
    cpuPerUnitPrice = instance_type.cpuCores / instance_type.spotPrice
    if (maxCpuPerUnitPrice < cpuPerUnitPrice) {
        maxCpuPerUnitPrice = cpuPerUnitPrice
        optimalInstanceType = instance_type
    }
}
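The selection loop can be sketched in Python as follows. This is an illustration only: `InstanceOffer`, the placeholder type names and the prices are made up for the example, and a real version would read current spot prices for the chosen region and availability zone from AWS.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class InstanceOffer:
    """Hypothetical record for one instance type in one zone."""
    instance_type: str
    cpu_cores: int
    spot_price: float  # price per hour; illustrative values only


def pick_optimal(offers: Iterable[InstanceOffer]) -> Optional[InstanceOffer]:
    """Return the offer with the most CPU cores per unit of spot price."""
    best = None
    best_ratio = 0.0
    for offer in offers:
        if offer.spot_price <= 0:
            continue  # skip missing or invalid prices
        ratio = offer.cpu_cores / offer.spot_price
        if ratio > best_ratio:
            # Track the running maximum as well as the winning type.
            best_ratio = ratio
            best = offer
    return best


offers = [
    InstanceOffer("type-a", 4, 0.10),
    InstanceOffer("type-b", 8, 0.16),
    InstanceOffer("type-c", 16, 0.40),
]
best = pick_optimal(offers)
```

The same loop works for memory by swapping `cpu_cores` for a memory field; combining both into one score is a design choice the slide does not specify.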
Workflow Management
• Makefile
• A framework for flow control using Python metaprogramming
A
C B
D
Valid flows:
A->B->C->D
A->B->D->C
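The two valid flows are what one gets by listing every ordering that respects the step dependencies. Here is a small sketch, assuming the dependencies are A before B, and B before both C and D. That dependency set is inferred from the flows listed on the slide, and the code is not BloomReach's framework.

```python
def valid_flows(deps):
    """Yield every ordering in which each step runs after all its prerequisites.

    deps maps a step name to the set of steps that must finish first.
    """
    steps = sorted(deps)

    def extend(done):
        if len(done) == len(steps):
            yield list(done)
            return
        for step in steps:
            # A step is runnable once all of its prerequisites are done.
            if step not in done and deps[step] <= set(done):
                yield from extend(done + [step])

    yield from extend([])


# Assumed dependencies, inferred from the two valid flows on the slide.
deps = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}}
flows = ["->".join(f) for f in valid_flows(deps)]
```

A Makefile expresses the same idea declaratively: each target lists its prerequisites, and any execution order that respects them is acceptable.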
EMR Best Practices
• Use spot instances for cost optimization
• Use EMR tags for cost tracking
• Share EMR clusters for small jobs
• Keep track of long-running clusters
• Use optimal resource type based on resource usage (e.g. CPU, memory)
• Workflow management
Thank You!
Prateek Gupta, Lead Engineer
prateek@bloomreach.com
www.bloomreach.com