Big Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Examples and Guidelines for the Enterprise Decision Maker
Solutions Architect, MongoDB
Buzz [email protected]
#MongoDB
Who is your Presenter?
• Yes, I use “Buzz” on my business cards
• Former Investment Bank Chief Architect at JPMorganChase, and at Bear Stearns before that
• Over 25 years of designing and building systems
  • Big and small
  • Super-specialized to broadly useful in any vertical
  • “Traditional” to completely disruptive
• Advocate of language leverage and strong factoring
• Still programming – using emacs, of course
Agenda
• (Occasionally) Brutal Truths about Big Data
• Review of Directed Content Business Architecture
• A Simple Technical Implementation
Truths
• Clear definition of Big Data still maturing
• Efficiently operationalizing Big Data is non-trivial
  • Developing, debugging, understanding MapReduce
  • Cluster monitoring & management, job scheduling/recovery
  • If you thought regular ETL Hell was bad…
• Big Data is not about math/set accuracy
  • The last 25,000 items in a 25,497,612-item set “don’t matter”
• Big Data questions are best asked periodically
  • “Are we there yet?”
• Realtime means … realtime
It’s About The Functions, not the Terms
DON’T ASK:
• Is this an operations or an analytics problem?
• Is this online or offline?
• What query language should we use?
• What is my integration strategy across tools?
ASK INSTEAD:
• Am I incrementally addressing data (esp. writes)?
• Am I computing a precise answer or a trend?
• Do I need to operate on this data in realtime?
• What is my holistic architecture?
What We’re Going to “Build” Today
Realtime Directed Content System
• Based on what users click, “recommended” content is returned in addition to the target
• The example is sector-neutral (manufacturing, financial services, retail)
• System dynamically updates behavior in response to user activity
The Participants and Their Roles
Directed Content System
• Customers: click on and consume content
• Content Creators: generate and tag content from a known domain of tags
• Management/Strategy: make decisions based on trends and other summarized data
• Analysts/Data Scientists: operate on data to identify trends and develop tag domains
• Developers/ProdOps: bring it all together: apps, SDLC, integration, etc.
Priority #1: Maximizing User Value
Considerations/Requirements
• Maximize realtime user value and experience
• Provide management reporting and trend analysis
• Engineer for Day 2 agility on recommendation engine
• Provide scrubbed click history for customer
• Permit low-cost horizontal scaling
• Minimize technical integration
• Minimize technical footprint
• Use conventional and/or approved tools
• Provide a RESTful service layer
• …
The Architecture
[Diagram: App(s) ↔ mongoDB ↔ Hadoop MapReduce]
Complementary Strengths
mongoDB:
• Standard design paradigm (objects, tools, 3rd party products, IDEs, test drivers, skill pool, etc.)
• Language flexibility (Java, C#, C++, Python, Scala, …)
• Webscale deployment model: appservers, DMZ, monitoring
• High-performance, rich-shape CRUD
Hadoop:
• MapReduce design paradigm
• Node deployment model
• Very large set operations
• Computationally intensive, longer duration
• Read-dominated workload
“Legacy” Approach: Somewhat Unidirectional
• Extract data from mongoDB and other sources nightly (or weekly)
• Run analytics
• Generate reports for people to read
• Where’s the feedback?
Somewhat Better Approach
• Extract data from mongoDB and other sources nightly (or weekly)
• Run analytics
• Generate reports for people to read
• Move important summary data back to mongoDB for consumption by apps
…but the overall problem remains:
• How do you integrate and operate in realtime on both periodically generated data and current realtime data?
• Lackluster integration between OLTP and Hadoop
• It’s not just about the database: you need a realtime profile and a profile update function
The Legacy Problem in Pseudocode
onContentClick() {
    String[] tags = content.getTags();
    Resource[] r = f1(database, tags);
}
• Realtime intraday state not well-handled
• Baselining is a different problem than click handling
The Right Approach
• Users have a specific Profile entity
• The Profile captures trend analytics as baselining information
• The Profile has per-tag “counters” that are updated with each interaction/click
• Counters plus baselining are passed to the fetch function
• The fetch function itself could be dynamic!
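The counter-plus-baseline flow described above can be sketched in plain Java. Everything here is illustrative: the `Profile` class, its method names, and the placeholder ranking rule (recommend a tag when realtime clicks exceed the latest baseline value) are assumptions, not the deck's actual algorithm.

```java
import java.util.*;

// Hypothetical sketch: a Profile holding per-tag realtime counters and the
// baseline trend data produced offline in Hadoop.
public class Profile {
    // tag -> number of clicks observed since the last baseline run
    private final Map<String, Integer> counters = new HashMap<>();
    // tag -> baseline trend values computed offline
    private final Map<String, int[]> baselines = new HashMap<>();

    public void setBaseline(String tag, int[] baseline) {
        baselines.put(tag, baseline);
    }

    // Called on every click: bump the realtime counter for each tag on the content.
    public void onContentClick(String[] tags) {
        for (String tag : tags) {
            counters.merge(tag, 1, Integer::sum);
        }
    }

    public int counter(String tag) {
        return counters.getOrDefault(tag, 0);
    }

    // Counters plus baseline are handed to the (possibly dynamic) fetch function.
    public List<String> fetchRecommendations(String tag) {
        int clicks = counter(tag);
        int[] baseline = baselines.getOrDefault(tag, new int[0]);
        // Placeholder ranking rule: recommend the tag when realtime activity
        // exceeds the most recent baseline value.
        int recent = baseline.length > 0 ? baseline[baseline.length - 1] : 0;
        return clicks > recent ? List.of(tag) : List.of();
    }
}
```

In the real system the counters and baseline would live in the mongoDB Profile document; the in-memory maps here just make the data flow concrete.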
24 Hours in the Life of The System
• Assume some content has been created and tagged
• Two systematized tags: Pets & PowerTools
Monday, 1:30AM EST
• Fetch all user Profiles from mongoDB; load into Hadoop
• Or skip this step if using the mongoDB-Hadoop connector!
mongoDB-Hadoop MapReduce Example
public class ProfileMapper
        extends Mapper<Object, BSONObject, IntWritable, IntWritable> {

    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
            throws IOException, InterruptedException {
        String user = (String) pValue.get("user");
        Date d1 = (Date) pValue.get("lastUpdate");
        BSONObject tags = (BSONObject) pValue.get("tags");
        Set<String> keys = tags.keySet();
        int count = 0;
        for (String tag : keys) {
            BSONObject tagInfo = (BSONObject) tags.get(tag);
            count += ((List) tagInfo.get("hist")).size();
        }
        int avg = count / keys.size();
        pContext.write(new IntWritable(count), new IntWritable(avg));
    }
}
Monday, 1:45AM EST
• Grind through all content data and user Profile data to produce:
  • Tags based on feature extraction (vs. creator-applied tags)
  • Trend baseline per user for tags Pets and PowerTools
• Load Profiles with new baseline back into mongoDB
• Or skip this step if using the mongoDB-Hadoop connector!
Monday, 8AM EST
• User Bob logs in and Profile is retrieved from mongoDB
• Bob clicks on Content X, which is already tagged as “Pets”
• Bob has clicked on Pets-tagged content many times
• Adjust Profile for tag “Pets” and save back to mongoDB
• Analysis = f(Profile)
• Analysis can be “anything”; it is simply a result. It could trigger an ad, a compliance alert, etc.
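A minimal sketch of this click-time flow, assuming the click history is held in memory rather than in the mongoDB Profile document. `ClickFlow`, `recordClick`, `analyze`, and the three-click trigger threshold are invented for illustration; "Analysis = f(Profile)" could be any function of the Profile.

```java
import java.time.Instant;
import java.util.*;

// Hypothetical sketch of the click-time flow: record the click in the
// per-tag history, then derive an "analysis" result purely from the Profile.
public class ClickFlow {
    // tag -> click timestamps (the "hist" array in the Profile document)
    static Map<String, List<Instant>> hist = new HashMap<>();

    static void recordClick(String tag, Instant ts) {
        hist.computeIfAbsent(tag, k -> new ArrayList<>()).add(ts);
        // In the real system the updated Profile would be saved back to mongoDB here.
    }

    // Analysis = f(Profile): here, simply flag tags Bob has clicked often.
    static String analyze(String tag) {
        int clicks = hist.getOrDefault(tag, List.of()).size();
        return clicks >= 3 ? "TRIGGER" : "NONE";
    }
}
```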
Monday, 8:02AM EST
• Bob clicks on Content Y, which is already tagged as “Spices”
• “Spices” is a new tag type for Bob
• Adjust Profile for tag “Spices” and save back to mongoDB
• Analysis = f(Profile)
Profile in Detail
{ user: “Bob”,
  personalData: { zip: “10024”, gender: “M” },
  tags: {
    PETS: {
      algo: “A4”,
      baseline: [0,0,10,4,1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 },
        { ts: datetime2, url: url2 }
        // 100 more
      ]
    },
    SPICE: {
      hist: [ { ts: datetime3, url: url3 } ]
    }
  }
}
Tag-based Algorithm Detail
getRecommendedContent(profile, [“PETS”, other]) {
    if algo for a tag available {
        filter = algo(profile, tag);
    }
    fetch N recommendations (filter);
}

A4(profile, tag) {
    weight = get tag (“PETS”) global weighting;
    adjustForPersonalBaseline(weight, “PETS” baseline);
    if “PETS” clicked more than 2 times in past 10 mins then weight += 10;
    if “PETS” clicked more than 10 times in past 2 days then weight += 3;
    return new filter({“PETS”, weight}, globals);
}
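Assuming the click history is available as a list of timestamps, the A4 pseudocode could be realized roughly as below in plain Java. The time-window thresholds match the pseudocode, but the global-weighting input is taken as a parameter and the personal-baseline adjustment is omitted, so treat this as a sketch rather than the presenter's implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Illustrative realization of the A4 weighting: bump a tag's weight based on
// how many clicks fall inside recent time windows.
public class A4 {
    // Count clicks at or after (now - window).
    static long clicksWithin(List<Instant> hist, Instant now, Duration window) {
        return hist.stream().filter(ts -> !ts.isBefore(now.minus(window))).count();
    }

    // Returns the adjusted weight for a tag given its click history.
    static int weight(int globalWeight, List<Instant> hist, Instant now) {
        int weight = globalWeight; // adjustForPersonalBaseline() omitted in this sketch
        if (clicksWithin(hist, now, Duration.ofMinutes(10)) > 2) weight += 10;
        if (clicksWithin(hist, now, Duration.ofDays(2)) > 10) weight += 3;
        return weight;
    }
}
```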
Tuesday, 1AM EST
• Fetch all user Profiles from mongoDB; load into Hadoop
• Or skip this step if using the mongoDB-Hadoop connector!
Tuesday, 1:30AM EST
• Grind through all content data and user Profile data to produce:
  • Tags based on feature extraction (vs. creator-applied tags)
  • Trend baseline for Pets, PowerTools, and Spices
• Data can be specific to an individual or by group
• Load baselines back into mongoDB
• Or skip this step if using the mongoDB-Hadoop connector!
New Profile in Detail
{ user: “Bob”,
  personalData: { zip: “10024”, gender: “M” },
  tags: {
    PETS: {
      algo: “A4”,
      baseline: [0,0,10,4,1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 },
        { ts: datetime2, url: url2 }
        // 100 more
      ]
    },
    SPICE: {
      baseline: [0],
      hist: [ { ts: datetime3, url: url3 } ]
    }
  }
}
Tuesday, 1:35AM EST
• Perform maintenance on user Profiles
  • Click history trimming (variety of algorithms)
  • “Dead tag” removal
  • Update of auxiliary reference data
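One possible realization of the trimming and dead-tag removal above, assuming the Profile's tags are available as a map from tag to click history. The class name, the keep-last-N policy, and the "empty history means dead tag" rule are assumptions; the deck only says a variety of algorithms is possible.

```java
import java.util.*;

// Hypothetical maintenance pass: trim each tag's click history to the most
// recent `keep` entries, and drop tags with no history at all ("dead tags").
public class ProfileMaintenance {
    static void maintain(Map<String, List<String>> tagHist, int keep) {
        Iterator<Map.Entry<String, List<String>>> it = tagHist.entrySet().iterator();
        while (it.hasNext()) {
            List<String> hist = it.next().getValue();
            if (hist.isEmpty()) {
                it.remove();                         // "dead tag" removal
            } else if (hist.size() > keep) {
                // Keep only the most recent entries (assume append order = time order).
                hist.subList(0, hist.size() - keep).clear();
            }
        }
    }
}
```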
New Profile in Detail
{ user: “Bob”,
  personalData: { zip: “10022”, gender: “M” },
  tags: {
    PETS: {
      algo: “A4”,
      baseline: [ 1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 }
        // 50 more
      ]
    },
    SPICE: {
      algo: “Z1”,
      baseline: [0],
      hist: [ { ts: datetime3, url: url3 } ]
    }
  }
}
Feel free to run the baselining more frequently
… but avoid “Are We There Yet?”
Nearterm/Realtime Questions & Actions
With respect to the Customer:
• What has Bob done over the past 24 hours?
• Given an input, make a logic decision in 100ms or less
With respect to the Provider:
• What are all current users doing or looking at?
• Can we nearterm correlate single events to shifts in behavior?
Longterm/Not Realtime Questions & Actions
With respect to the Customer:
• Any way to explain historic performance/actions?
• What are recommendations for the future?
With respect to the Provider:
• Can we correlate multiple events from multiple sources over a long period of time to identify trends?
• What is my entire customer base doing over 2 years?
• Show me a time vs. aggregate tag hit chart
• Slice and dice and aggregate tags vs. XYZ
• What tags are trending up or down?
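The "time vs. aggregate tag hit chart" rollup can be sketched as a simple group-by-day count over a tag's click dates. `TagHits` and `hitsPerDay` are invented names; in the deck's architecture this kind of long-horizon aggregation would run on the Hadoop side over all Profiles.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative rollup: bucket a tag's click dates and count hits per day,
// the raw data behind a time vs. aggregate tag hit chart.
public class TagHits {
    static Map<LocalDate, Long> hitsPerDay(List<LocalDate> clickDates) {
        return clickDates.stream()
                .collect(Collectors.groupingBy(d -> d, Collectors.counting()));
    }
}
```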
The Key To Success: It is One System
[Diagram: App(s), mongoDB, and Hadoop MapReduce operating as one system]
Webex Q&A