Big Data: Guidelines and Examples for the Enterprise Decision Maker
Big Data: Examples and Guidelines for the Enterprise Decision Maker
Solutions Architect, MongoDB
Buzz [email protected]
#MongoDB
Who is your Presenter?
• Yes, I use “Buzz” on my business cards
• Former Investment Bank Chief Architect at JPMorganChase, and at Bear Stearns before that
• Over 25 years of designing and building systems
  • Big and small
  • Super-specialized to broadly useful in any vertical
  • “Traditional” to completely disruptive
• Advocate of language leverage and strong factoring
• Still programming – using emacs, of course
Agenda
• (Occasionally) Brutal Truths about Big Data
• Review of Directed Content Business Architecture
• A Simple Technical Implementation
Truths
• Clear definition of Big Data still maturing
• Efficiently operationalizing Big Data is non-trivial
  • Developing, debugging, understanding MapReduce
  • Cluster monitoring & management, job scheduling/recovery
  • If you thought regular ETL Hell was bad…
• Big Data is not about math/set accuracy
  • The last 25,000 items in a 25,497,612-item set “don’t matter”
• Big Data questions are best asked periodically
  • “Are we there yet?”
• Realtime means … realtime
It’s About The Functions, not the Terms
DON’T ASK:
• Is this an operations or an analytics problem?
• Is this online or offline?
• What query language should we use?
• What is my integration strategy across tools?
ASK INSTEAD:
• Am I incrementally addressing data (esp. writes)?
• Am I computing a precise answer or a trend?
• Do I need to operate on this data in realtime?
• What is my holistic architecture?
What We’re Going to “Build” Today
Realtime Directed Content System
• Based on what users click, “recommended” content is returned in addition to the target
• The example is sector-neutral (manufacturing, financial services, retail)
• System dynamically updates behavior in response to user activity
The Participants and Their Roles
Directed Content System
• Customers: click on and consume content
• Content Creators: generate and tag content from a known domain of tags
• Management/Strategy: make decisions based on trends and other summarized data
• Analysts/Data Scientists: operate on data to identify trends and develop tag domains
• Developers/ProdOps: bring it all together: apps, SDLC, integration, etc.
Priority #1: Maximizing User Value
Considerations/Requirements
• Maximize realtime user value and experience
• Provide management reporting and trend analysis
• Engineer for Day 2 agility on recommendation engine
• Provide scrubbed click history for customer
• Permit low-cost horizontal scaling
• Minimize technical integration
• Minimize technical footprint
• Use conventional and/or approved tools
• Provide a RESTful service layer
• …
The Architecture
[Diagram: App(s) ↔ mongoDB ↔ Hadoop MapReduce]
Complementary Strengths
mongoDB:
• Standard design paradigm (objects, tools, 3rd party products, IDEs, test drivers, skill pool, etc.)
• Language flexibility (Java, C#, C++, Python, Scala, …)
• Webscale deployment model: appservers, DMZ, monitoring
• High-performance, rich-shape CRUD
Hadoop:
• MapReduce design paradigm
• Node deployment model
• Very large set operations
• Computationally intensive, longer duration
• Read-dominated workload
“Legacy” Approach: Somewhat Unidirectional
• Extract data from mongoDB and other sources nightly (or weekly)
• Run analytics
• Generate reports for people to read
• Where’s the feedback?
Somewhat Better Approach
• Extract data from mongoDB and other sources nightly (or weekly)
• Run analytics
• Generate reports for people to read
• Move important summary data back to mongoDB for consumption by apps
…but the overall problem remains:
• How do you integrate and operate in realtime on both periodically generated data and current realtime data?
• Lackluster integration between OLTP and Hadoop
• It’s not just about the database: you need a realtime profile and a profile update function
The Legacy Problem in Pseudocode
onContentClick() {
    String[] tags = content.getTags();
    Resource[] r = f1(database, tags);
}
• Realtime intraday state not well-handled
• Baselining is a different problem than click handling
The Right Approach
• Users have a specific Profile entity
• The Profile captures trend analytics as baselining information
• The Profile has per-tag “counters” that are updated with each interaction/click
• Counters plus baselining are passed to the fetch function
• The fetch function itself could be dynamic!
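The counter-plus-baseline flow described above can be sketched in plain Java. Everything here is illustrative: the `Profile` class, its method names, and the placeholder ranking rule (recommend a tag when realtime clicks exceed the latest baseline value) are assumptions, not the deck's actual algorithm.

```java
import java.util.*;

// Hypothetical sketch: a Profile holding per-tag realtime counters and the
// baseline trend data produced offline in Hadoop.
public class Profile {
    // tag -> number of clicks observed since the last baseline run
    private final Map<String, Integer> counters = new HashMap<>();
    // tag -> baseline trend values computed offline
    private final Map<String, int[]> baselines = new HashMap<>();

    public void setBaseline(String tag, int[] baseline) {
        baselines.put(tag, baseline);
    }

    // Called on every click: bump the realtime counter for each tag on the content.
    public void onContentClick(String[] tags) {
        for (String tag : tags) {
            counters.merge(tag, 1, Integer::sum);
        }
    }

    public int counter(String tag) {
        return counters.getOrDefault(tag, 0);
    }

    // Counters plus baseline are handed to the (possibly dynamic) fetch function.
    public List<String> fetchRecommendations(String tag) {
        int clicks = counter(tag);
        int[] baseline = baselines.getOrDefault(tag, new int[0]);
        // Placeholder ranking rule: recommend the tag when realtime activity
        // exceeds the most recent baseline value.
        int recent = baseline.length > 0 ? baseline[baseline.length - 1] : 0;
        return clicks > recent ? List.of(tag) : List.of();
    }
}
```

In the real system the counters and baseline would live in the mongoDB Profile document; the in-memory maps here just make the data flow concrete.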
24 Hours in the Life of The System
• Assume some content has been created and tagged
• Two systematized tags: Pets & PowerTools
Monday, 1:30AM EST
• Fetch all user Profiles from mongoDB; load into Hadoop
• Or skip this step if using the mongoDB-Hadoop connector!
mongoDB-Hadoop MapReduce Example
public class ProfileMapper
        extends Mapper<Object, BSONObject, IntWritable, IntWritable> {

    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
            throws IOException, InterruptedException {
        String user = (String) pValue.get("user");
        Date d1 = (Date) pValue.get("lastUpdate");
        BSONObject tags = (BSONObject) pValue.get("tags");
        Set<String> keys = tags.keySet();
        int count = 0;
        for (String tag : keys) {
            BSONObject tagInfo = (BSONObject) tags.get(tag);
            count += ((List) tagInfo.get("hist")).size();
        }
        int avg = count / keys.size();
        pContext.write(new IntWritable(count), new IntWritable(avg));
    }
}
Monday, 1:45AM EST
• Grind through all content data and user Profile data to produce:
  • Tags based on feature extraction (vs. creator-applied tags)
  • Trend baseline per user for tags Pets and PowerTools
• Load Profiles with new baseline back into mongoDB
• Or skip this step if using the mongoDB-Hadoop connector!
Monday, 8AM EST
• User Bob logs in and Profile is retrieved from mongoDB
• Bob clicks on Content X, which is already tagged as “Pets”
• Bob has clicked on Pets-tagged content many times
• Adjust Profile for tag “Pets” and save back to mongoDB
• Analysis = f(Profile)
• Analysis can be “anything”; it is simply a result. It could trigger an ad, a compliance alert, etc.
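A minimal sketch of this click-time flow, assuming the click history is held in memory rather than in the mongoDB Profile document. `ClickFlow`, `recordClick`, `analyze`, and the three-click trigger threshold are invented for illustration; "Analysis = f(Profile)" could be any function of the Profile.

```java
import java.time.Instant;
import java.util.*;

// Hypothetical sketch of the click-time flow: record the click in the
// per-tag history, then derive an "analysis" result purely from the Profile.
public class ClickFlow {
    // tag -> click timestamps (the "hist" array in the Profile document)
    static Map<String, List<Instant>> hist = new HashMap<>();

    static void recordClick(String tag, Instant ts) {
        hist.computeIfAbsent(tag, k -> new ArrayList<>()).add(ts);
        // In the real system the updated Profile would be saved back to mongoDB here.
    }

    // Analysis = f(Profile): here, simply flag tags Bob has clicked often.
    static String analyze(String tag) {
        int clicks = hist.getOrDefault(tag, List.of()).size();
        return clicks >= 3 ? "TRIGGER" : "NONE";
    }
}
```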
Monday, 8:02AM EST
• Bob clicks on Content Y, which is already tagged as “Spices”
• “Spices” is a new tag type for Bob
• Adjust Profile for tag “Spices” and save back to mongoDB
• Analysis = f(Profile)
Profile in Detail
{ user: “Bob”,
  personalData: { zip: “10024”, gender: “M” },
  tags: {
    PETS: {
      algo: “A4”,
      baseline: [0,0,10,4,1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 },
        { ts: datetime2, url: url2 }
        // 100 more
      ]
    },
    SPICE: {
      hist: [ { ts: datetime3, url: url3 } ]
    }
  }
}
Tag-based Algorithm Detail
getRecommendedContent(profile, [“PETS”, other]) {
    if algo for a tag available {
        filter = algo(profile, tag);
    }
    fetch N recommendations (filter);
}

A4(profile, tag) {
    weight = get tag (“PETS”) global weighting;
    adjustForPersonalBaseline(weight, “PETS” baseline);
    if “PETS” clicked more than 2 times in past 10 mins then weight += 10;
    if “PETS” clicked more than 10 times in past 2 days then weight += 3;
    return new filter({“PETS”, weight}, globals);
}
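Assuming the click history is available as a list of timestamps, the A4 pseudocode could be realized roughly as below in plain Java. The time-window thresholds match the pseudocode, but the global-weighting input is taken as a parameter and the personal-baseline adjustment is omitted, so treat this as a sketch rather than the presenter's implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Illustrative realization of the A4 weighting: bump a tag's weight based on
// how many clicks fall inside recent time windows.
public class A4 {
    // Count clicks at or after (now - window).
    static long clicksWithin(List<Instant> hist, Instant now, Duration window) {
        return hist.stream().filter(ts -> !ts.isBefore(now.minus(window))).count();
    }

    // Returns the adjusted weight for a tag given its click history.
    static int weight(int globalWeight, List<Instant> hist, Instant now) {
        int weight = globalWeight; // adjustForPersonalBaseline() omitted in this sketch
        if (clicksWithin(hist, now, Duration.ofMinutes(10)) > 2) weight += 10;
        if (clicksWithin(hist, now, Duration.ofDays(2)) > 10) weight += 3;
        return weight;
    }
}
```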
Tuesday, 1AM EST
• Fetch all user Profiles from mongoDB; load into Hadoop
• Or skip this step if using the mongoDB-Hadoop connector!
Tuesday, 1:30AM EST
• Grind through all content data and user Profile data to produce:
  • Tags based on feature extraction (vs. creator-applied tags)
  • Trend baseline for Pets, PowerTools, and Spices
• Data can be specific to an individual or by group
• Load baselines back into mongoDB
• Or skip this step if using the mongoDB-Hadoop connector!
New Profile in Detail
{ user: “Bob”,
  personalData: { zip: “10024”, gender: “M” },
  tags: {
    PETS: {
      algo: “A4”,
      baseline: [0,0,10,4,1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 },
        { ts: datetime2, url: url2 }
        // 100 more
      ]
    },
    SPICE: {
      baseline: [0],
      hist: [ { ts: datetime3, url: url3 } ]
    }
  }
}
Tuesday, 1:35AM EST
• Perform maintenance on user Profiles
  • Click history trimming (variety of algorithms)
  • “Dead tag” removal
  • Update of auxiliary reference data
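One possible realization of the trimming and dead-tag removal above, assuming the Profile's tags are available as a map from tag to click history. The class name, the keep-last-N policy, and the "empty history means dead tag" rule are assumptions; the deck only says a variety of algorithms is possible.

```java
import java.util.*;

// Hypothetical maintenance pass: trim each tag's click history to the most
// recent `keep` entries, and drop tags with no history at all ("dead tags").
public class ProfileMaintenance {
    static void maintain(Map<String, List<String>> tagHist, int keep) {
        Iterator<Map.Entry<String, List<String>>> it = tagHist.entrySet().iterator();
        while (it.hasNext()) {
            List<String> hist = it.next().getValue();
            if (hist.isEmpty()) {
                it.remove();                         // "dead tag" removal
            } else if (hist.size() > keep) {
                // Keep only the most recent entries (assume append order = time order).
                hist.subList(0, hist.size() - keep).clear();
            }
        }
    }
}
```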
New Profile in Detail
{ user: “Bob”,
  personalData: { zip: “10022”, gender: “M” },
  tags: {
    PETS: {
      algo: “A4”,
      baseline: [ 1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 }
        // 50 more
      ]
    },
    SPICE: {
      algo: “Z1”,
      baseline: [0],
      hist: [ { ts: datetime3, url: url3 } ]
    }
  }
}
Feel free to run the baselining more frequently
… but avoid “Are We There Yet?”
Nearterm/Realtime Questions & Actions
With respect to the Customer:
• What has Bob done over the past 24 hours?
• Given an input, make a logic decision in 100ms or less
With respect to the Provider:
• What are all current users doing or looking at?
• Can we nearterm correlate single events to shifts in behavior?
Longterm/Not Realtime Questions & Actions
With respect to the Customer:
• Any way to explain historic performance/actions?
• What are recommendations for the future?
With respect to the Provider:
• Can we correlate multiple events from multiple sources over a long period of time to identify trends?
• What is my entire customer base doing over 2 years?
• Show me a time vs. aggregate tag hit chart
• Slice and dice and aggregate tags vs. XYZ
• What tags are trending up or down?
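The "time vs. aggregate tag hit chart" rollup can be sketched as a simple group-by-day count over a tag's click dates. `TagHits` and `hitsPerDay` are invented names; in the deck's architecture this kind of long-horizon aggregation would run on the Hadoop side over all Profiles.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative rollup: bucket a tag's click dates and count hits per day,
// the raw data behind a time vs. aggregate tag hit chart.
public class TagHits {
    static Map<LocalDate, Long> hitsPerDay(List<LocalDate> clickDates) {
        return clickDates.stream()
                .collect(Collectors.groupingBy(d -> d, Collectors.counting()));
    }
}
```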
The Key To Success: It is One System
[Diagram: App(s), mongoDB, and Hadoop MapReduce operating as one system]
Webex Q&A