Essential Tools For Your Big Data Arsenal

Post on 08-May-2015


description

For some, Hadoop is synonymous with “Big Data,” but Hadoop is just one component of a successful Big Data architecture. Depending on the application, it may not even be the most important part. NoSQL solutions like MongoDB also play a dominant role in storage and real-time data processing, helping companies keep pace with the scale of their data requirements. NoSQL figures even more prominently in helping enterprises consume a wide variety of data sources at speeds not currently possible in Hadoop. NoSQL, then, is a useful complement both to Hadoop and to the transaction-based data of traditional RDBMSs. Tackling Big Data is not a one-tool job, so orchestrating the appropriate NoSQL database with Hadoop and an RDBMS is essential. In this session, we’ll dig deep into the different types of NoSQL, identifying how they differ and the types of Big Data workloads for which they’re best suited. We’ll also explore the trade-offs one makes in choosing NoSQL databases like MongoDB or Neo4j over an RDBMS like MySQL, when it makes sense to use both Hadoop and NoSQL, and when it’s more appropriate to use NoSQL on its own.

Transcript of Essential Tools For Your Big Data Arsenal

Matt Asay (@mjasay), VP, Business Development & Strategy, MongoDB

Essential Tools For Your Big Data Arsenal

The Big Data Unknown


Top Big Data Challenges?

Translation? Most struggle to know what Big Data is, how to manage it and who can manage it

Source: Gartner


Understanding Big Data – It’s Not Very “Big”

from Big Data Executive Summary – 50+ top executives from Government and F500 firms

64% - Ingest diverse, new data in real-time

15% - More than 100TB of data

20% - Less than 100TB (average of all? <20TB)

Innovation As Iteration

“I have not failed. I've just found 10,000 ways that won't work.” ― Thomas A. Edison


Back in 1970…Cars Were Great!


So Were Computers!


Lots of Great Innovations Since 1970


Including the Relational Database


RDBMS Makes Development Hard

Relational Database

Object Relational Mapping

Application

Code, XML Config, DB Schema


And Even Harder To Iterate

New Table

New Table

New Column

Name Pet Phone Email

New Column

3 months later…
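A minimal sketch of the iteration cost this slide describes, using Python's built-in sqlite3 module as a stand-in relational database. The table name `contacts` and the sample rows are hypothetical; only the column names (Name, Pet, Phone, Email) come from the slide. Each new attribute on the relational side needs a schema change, while on the document side it is just another key.

```python
import sqlite3

# Relational side: every new attribute is a schema change applied to the table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, pet TEXT)")
conn.execute("INSERT INTO contacts VALUES ('Ana', 'cat')")

# Later iterations: phone and email arrive, each needing its own migration.
conn.execute("ALTER TABLE contacts ADD COLUMN phone TEXT")
conn.execute("ALTER TABLE contacts ADD COLUMN email TEXT")
conn.execute(
    "INSERT INTO contacts VALUES ('Ben', 'dog', '555-0100', 'ben@example.com')"
)

# PRAGMA table_info yields one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(contacts)")]

# Document side: records are plain dicts, so a new field is just a new key
# and older records simply lack it.
documents = [
    {"name": "Ana", "pet": "cat"},
    {"name": "Ben", "pet": "dog", "phone": "555-0100", "email": "ben@example.com"},
]
emails = [doc.get("email") for doc in documents]
```

In a real deployment the migration also has to be coordinated with the ORM mapping and application code, which is where the slide's "3 months later" comes from.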


RDBMS

From Complexity to Simplicity

MongoDB

{
  _id : ObjectId("4c4ba5e5e8aabf3"),
  employee_name : "Dunham, Justin",
  department : "Marketing",
  title : "Product Manager, Web",
  report_up : "Neray, Graham",
  pay_band : "C",
  benefits : [
    { type : "Health", plan : "PPO Plus" },
    { type : "Dental", plan : "Standard" }
  ]
}
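To make the contrast concrete, here is a rough sketch that stores part of the same employee record both ways. The relational layout (tables `employees` and `benefits`, the integer key) is an assumption for illustration and not taken from the slides; sqlite3 stands in for the RDBMS and a plain dict stands in for the MongoDB document. Some fields (`_id`, `report_up`, `pay_band`) are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, employee_name TEXT,
                            department TEXT, title TEXT);
    CREATE TABLE benefits (employee_id INTEGER, type TEXT, plan TEXT);
    INSERT INTO employees VALUES
        (1, 'Dunham, Justin', 'Marketing', 'Product Manager, Web');
    INSERT INTO benefits VALUES
        (1, 'Health', 'PPO Plus'), (1, 'Dental', 'Standard');
""")

# Relational read: the join returns one row per benefit, which the
# application then has to fold back into a single object.
rows = conn.execute("""
    SELECT e.employee_name, e.department, e.title, b.type, b.plan
    FROM employees e JOIN benefits b ON b.employee_id = e.id
    WHERE e.id = 1
    ORDER BY b.rowid
""").fetchall()

employee = {
    "employee_name": rows[0][0],
    "department": rows[0][1],
    "title": rows[0][2],
    "benefits": [],
}
for _, _, _, benefit_type, plan in rows:
    employee["benefits"].append({"type": benefit_type, "plan": plan})

# Document read: the record is already stored in the shape the application uses.
document = {
    "employee_name": "Dunham, Justin",
    "department": "Marketing",
    "title": "Product Manager, Web",
    "benefits": [
        {"type": "Health", "plan": "PPO Plus"},
        {"type": "Dental", "plan": "Standard"},
    ],
}
```

Both paths end with the same object; the difference is how much mapping code sits between storage and application.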


So…Use Open Source


Big Data != Big Upfront Payment


RDBMS Is Expensive To Scale

“Clients can also opt to run zEC12 without a raised datacenter floor -- a first for high-end IBM mainframes.”

IBM Press Release 28 Aug, 2012


Spoiled for choice

Rank  DBMS                  Database model         Score     Change
1     Oracle                Relational DBMS        1583.84    54.23
2     MySQL                 Relational DBMS        1331.34    25.58
3     Microsoft SQL Server  Relational DBMS        1207     -106.78
4     PostgreSQL            Relational DBMS         177.01    -5.22
5     DB2                   Relational DBMS         175.83     3.58
6     MongoDB               NoSQL Document Store    149.48    -2.71
7     Microsoft Access      Relational DBMS         142.49    -4.21
8     SQLite                Relational DBMS          77.88    -4.9
9     Sybase                Relational DBMS          73.66    -1.68
10    Teradata              Relational DBMS          54.41     3.32

DB-Engines.com Database Ranking


Remember the Long Tail?


It Didn’t Work Out So Well


Use Popular, Well-Known Technologies

Source: Silicon Angle, 2012


Ask the Right Questions…

“Organizations already have people who know their own data better than mystical data scientists….Learning Hadoop [or MongoDB] is easier than learning the company’s business.”

(Gartner, 2012)


Leverage Existing Skills


Search as a Sign?

When To Use Hadoop, NoSQL


Enterprise Big Data Stack

Applications: CRM, ERP, Collaboration, Mobile, BI

Data Management: Online Data, Offline Data (RDBMS, RDBMS, Hadoop, EDW)

Infrastructure: OS & Virtualization, Compute, Storage, Network

Management & Monitoring

Security & Auditing


Consideration – Online vs. Offline

Online
• Real-time
• Low-latency
• High availability

vs.

Offline
• Long-running
• High-latency
• Availability is lower priority


Consideration – Online vs. Offline

Online vs. Offline


Hadoop Is Good for…

• Risk Modeling
• Churn Analysis
• Recommendation Engine
• Ad Targeting
• Transaction Analysis
• Trade Surveillance
• Network Failure Prediction
• Search Quality
• Data Lake


MongoDB/NoSQL Is Good for…

• 360° View of the Customer
• Mobile & Social Apps
• Fraud Detection
• User Data Management
• Content Management & Delivery
• Reference Data
• Product Catalogs
• Machine to Machine Apps
• Data Hub

How To Use The Two Together?


Finding Waldo


Customer example: Online Travel

Travel
• Flights, hotels and cars
• Real-time offers
• User profiles, reviews
• User metadata (previous purchases, clicks, views)

Algorithms
• User segmentation
• Offer recommendation engine
• Ad serving engine
• Bundling engine

MongoDB Connector for Hadoop


Predictive Analytics

Government
• Predictive analytics system for crime, health issues
• Diverse, unstructured (incl. geospatial) data from 30+ agencies
• Correlate data in real-time

Algorithms
• Long-form trend analysis
• MongoDB data dumped into Hadoop, analyzed, re-inserted into MongoDB for better real-time response

MongoDB + Hadoop
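The dump, analyze, re-insert loop on this slide can be sketched in plain Python. This is only an illustration of the data flow under stated assumptions: the list `incidents` stands in for the online MongoDB collection, the function names are hypothetical, the field names and values are invented, and a simple count stands in for the long-running Hadoop job.

```python
from collections import Counter

# Stand-in for the operational (online) collection; fields are hypothetical.
incidents = [
    {"agency": "police", "district": "north", "kind": "burglary"},
    {"agency": "health", "district": "north", "kind": "flu"},
    {"agency": "police", "district": "south", "kind": "burglary"},
    {"agency": "police", "district": "north", "kind": "burglary"},
]

def export_snapshot(collection):
    """Step 1: dump a copy of the online data for offline analysis."""
    return [dict(doc) for doc in collection]

def batch_trend_analysis(snapshot):
    """Step 2: the offline job; here just a count per (district, kind)."""
    return Counter((doc["district"], doc["kind"]) for doc in snapshot)

def reinsert(trends):
    """Step 3: write results back, keyed by district for fast lookup."""
    summaries = {}
    for (district, kind), count in trends.items():
        summaries.setdefault(district, {})[kind] = count
    return summaries

district_summaries = reinsert(batch_trend_analysis(export_snapshot(incidents)))
```

The online side then answers a real-time question such as "what is trending in the north district" with a single keyed lookup instead of rescanning raw events.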


Data Hub

Insurance
• Insurance policies
• Demographic data
• Customer web data
• Call center data
• Real-time churn detection

Churn Analysis
• Customer action analysis
• Churn prediction algorithms

MongoDB Connector for Hadoop


Machine Learning

Ad-Serving
• Catalogs and products
• User profiles
• Clicks
• Views
• Transactions

Algorithms
• User segmentation
• Recommendation engine
• Prediction engine

MongoDB Connector for Hadoop


MongoDB + Hadoop Connector

• Makes MongoDB a Hadoop-enabled file system
• Read and write to live data, in-place
• Copy data between Hadoop and MongoDB
• Full support for data processing
  – Hive
  – MapReduce
  – Pig
  – Streaming
  – EMR

MongoDB Connector for Hadoop
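For readers new to the MapReduce model listed above, here is a toy, single-process sketch of the kind of job that would run over documents. It does not use the connector's API; the `clicks` documents, the `mapper`, `reducer` and `run_job` names, and the in-memory sort that plays the role of Hadoop's shuffle phase are all assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical click documents, shaped like records read from a collection.
clicks = [
    {"user": "u1", "product": "hotel"},
    {"user": "u2", "product": "flight"},
    {"user": "u1", "product": "flight"},
    {"user": "u3", "product": "hotel"},
]

def mapper(doc):
    """Emit one (key, value) pair per document."""
    yield doc["product"], 1

def reducer(key, values):
    """Combine all values that share a key."""
    return key, sum(values)

def run_job(docs):
    # Map phase: apply the mapper to every document.
    pairs = [pair for doc in docs for pair in mapper(doc)]
    # Shuffle/sort phase: bring identical keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return dict(
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )

clicks_per_product = run_job(clicks)
```

In the setup the slide describes, the input and output would be live MongoDB collections, and Hadoop would distribute the map and reduce calls across a cluster.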

@mjasay