Expect More from Hadoop

Post on 13-Jan-2015


description

MapR Technologies Chief Marketing Officer, Jack Norris, talks about the advantages of Hadoop. He elaborates on multiple use cases and explains why MapR Technologies offers the best Hadoop distribution.

Transcript of Expect More from Hadoop

1©MapR Technologies

Expect More from Hadoop

Jack Norris, MapR Technologies

Hadoop Growth

Important Drivers for Hadoop

Data on compute

You don’t need to know what questions to ask beforehand

Simple algorithms on Big Data

Analysis of unstructured data

The Cost of Enterprise Storage

SAN Storage: $2 - $10/Gigabyte; $1M gets: 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec

NAS Filers: $1 - $5/Gigabyte; $1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec

Local Storage: $0.02/Gigabyte; $1M gets: 50 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec

1/100 to 1/20 the cost
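
The capacity figures on this slide follow from the per-gigabyte prices. A minimal sketch of that arithmetic, assuming the low end of each price range and decimal units (1 Petabyte = 1,000,000 Gigabytes); the function and variable names are illustrative and not from the presentation:

```python
# Capacity that a fixed budget buys at a given price per gigabyte.
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 10^6 GB


def petabytes_for_budget(budget_usd: float, usd_per_gb: float) -> float:
    """Return how many petabytes budget_usd buys at usd_per_gb."""
    return budget_usd / usd_per_gb / GB_PER_PB


# Low end of each price range quoted on the slide (USD per gigabyte).
low_end_price = {"SAN Storage": 2.00, "NAS Filers": 1.00, "Local Storage": 0.02}

capacity = {
    tier: petabytes_for_budget(1_000_000, price)
    for tier, price in low_end_price.items()
}

for tier, pb in capacity.items():
    print(f"{tier}: {pb:g} PB per $1M")
```

At these prices, $1M works out to the 0.5, 1, and 50 Petabytes quoted for the three storage tiers.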

MapReduce: A Paradigm Shift

Distributed, scalable computing platform

– Data/Compute framework

– Commodity hardware

Pioneered at Google

Commercially available as Hadoop
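
To make the paradigm concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It is an illustration in plain Python, not Hadoop's API; the function names are hypothetical, and a real cluster would run the map and reduce steps in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["expect more from hadoop", "more data more compute"])
print(counts)
```

Because each map call looks at one line and each reduce call looks at one key, both steps can be spread across many commodity machines without changing the logic.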

MapR Distribution for Apache Hadoop

Complete Hadoop distribution

Comprehensive management suite

Industry-standard interfaces

Enterprise-grade dependability

Higher performance

Pig, Hive, HBase, Mahout, Oozie, Whirr, Avro, Cascading, Nagios, Ganglia, Flume, Sqoop, HCatalog, Zookeeper, Drill, MapReduce

MapR Control System

MapR Data Platform

How do you Benefit?

Expanding data for existing applications

Use Case #1

Major telecom vendor

Key step in billing pipeline handled by data warehouse (EDW)

EDW at maximum capacity

Multiple rounds of software optimization already done

Revenue limiting (= career limiting) bottleneck

Original Flow

CDR billing records

Extract and Load

Transformation

Data Warehouse

Billing reports

Customer bills

Problem Analysis

70% of EDW load is related to call detail record (CDR) normalization

– < 10% of total lines of code

– CDR normalization difficult within the EDW

– Binary extraction and conversion

Data rates are too high for upstream transform

– Requires high volume joins
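
As an illustration of the "binary extraction and conversion" step, the sketch below unpacks fixed-width binary records into normalized fields. The record layout, field names, and sample values are entirely hypothetical (the presentation does not describe the vendor's CDR format); the point is only that this kind of work is simple record-at-a-time code that suits a map step better than a data warehouse.

```python
import struct
from datetime import datetime, timezone
from typing import Iterator

# HYPOTHETICAL layout, not from the slides: big-endian caller id (uint64),
# callee id (uint64), call start as epoch seconds (uint32), duration in
# seconds (uint16).
CDR_FORMAT = ">QQIH"
CDR_SIZE = struct.calcsize(CDR_FORMAT)


def normalize_cdr(raw: bytes) -> dict:
    """Unpack one binary record and convert it to normalized fields."""
    caller, callee, start, duration = struct.unpack(CDR_FORMAT, raw)
    return {
        "caller": str(caller),
        "callee": str(callee),
        "start_utc": datetime.fromtimestamp(start, tz=timezone.utc).isoformat(),
        "duration_s": duration,
    }


def normalize_stream(blob: bytes) -> Iterator[dict]:
    """Walk a buffer of back-to-back records, skipping any trailing partial one."""
    for offset in range(0, len(blob) - CDR_SIZE + 1, CDR_SIZE):
        yield normalize_cdr(blob[offset:offset + CDR_SIZE])


# Made-up sample record for demonstration only.
sample = struct.pack(CDR_FORMAT, 14155550100, 14155550199, 1421107200, 95)
record = normalize_cdr(sample)
print(record)
```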

With ETL Offload

CDR billing records

ETL

Hadoop Cluster

Data Warehouse

Billing reports

Customer billing

Simplified Analysis

70% of EDW consumed by ETL processing – Offload frees capacity

EDW direct hardware cost is approximately $30 million vs. Hadoop cluster at 1/50 the cost

Additional EDW only increases capacity by 50% due to poor division of labor
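
One way to read these numbers is with a simple capacity model. The sketch below assumes (this is not stated in the slides) that the whole ETL share moves off the EDW and that the remaining work is unchanged, so the freed capacity scales as 1 / (1 - offloaded fraction). The names are illustrative.

```python
def capacity_multiplier(offloaded_fraction: float) -> float:
    """Factor by which the remaining workload can grow after offloading.

    Assumes the offloaded share disappears from the EDW entirely and the
    rest of the workload is unaffected (a simplifying assumption).
    """
    if not 0 <= offloaded_fraction < 1:
        raise ValueError("offloaded_fraction must be in [0, 1)")
    return 1.0 / (1.0 - offloaded_fraction)


etl_share = 0.70                                 # share of EDW load spent on ETL
offload_gain = capacity_multiplier(etl_share)    # roughly 3.3x headroom
added_edw_gain = 1.0 + 0.50                      # extra EDW: +50% capacity
hadoop_cost_musd = 30 / 50                       # 1/50 of $30 million, in $M

print(f"offload: {offload_gain:.1f}x, extra EDW: {added_edw_gain:.1f}x, "
      f"Hadoop cluster: ${hadoop_cost_musd:.1f}M")
```

Under this model, offloading 70% of the load frees a bit more than three times the capacity for the remaining work, compared with 1.5 times from adding EDW hardware.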

The Results

EDW strategy

– 1.5x performance

– $30 million

MapR strategy

– 3x faster

– 20x cost/performance advantage for MapR strategy

– With High Availability and data protection

Use Case #2

Combine Many Different Data Sources

Use Case #2 – Customer Example

Global Credit Card Issuer

Launching a New Location Based Service

Benefits both Merchants and Consumers

Combining different feeds on one platform

Hadoop and HBase Storage and Processing

Real-time data feed from social network

Stored in Hadoop

Historical Purchase Information

Predictive Analytics from Historical data combined with NoSQL querying on real-time social networking data

Billing Data

Results

New Service Rolled out in 1 quarter

Processing time cut from 20 hours per day to 3

Recommendation engine load time decreased from 8 hours to 3 minutes

Includes data versioning support for easier development and updating of models

Use Case #3

New Application from New Data Source

Ancestry.com – Family Tree

Overview and Requirements

Collect and Collate information from disparate sources (Text files, Images, etc.)

Leverage new data source: Spit

Machine learning techniques and DNA Matching Algorithms

The Results

Storage Infrastructure for billions of small and large files

Blob Store for large images through NoSQL solutions

Multi-tenant capability for data-mining and machine-learning algorithm development

One highly available, efficient platform

MapR M7: Making HBase Enterprise Grade

Other Distributions: Disks, ext3, JVM, DFS, JVM, HBase

MapR M7: Disks, Unified

Easy: No RegionServers, Seamless splits, Automatic merges, In-memory column families

Dependable: No compactions, Instant recovery from node failure, Snapshots, Mirroring

Fast: Consistent low latency, Real-time in-memory configuration, Disk and network compression, Reduced I/O to disk

Use Case

New Analytics on Existing Data

Analytic Flexibility

MapReduce enabled Machine learning algorithms

Enhanced Search

Real-time event processing

No need to sample the data

Fraud Detection, Target Marketing, Consumer Behavior Analysis, …

Hadoop Expands Analytics

“Simple algorithms and lots of data trump complex models”

Halevy, Norvig, and Pereira, Google

IEEE Intelligent Systems

Use Case #4

Combine All Three

Where do you Start?

One Platform for Big Data

Batch, Interactive, Realtime

Batch Processing, File-Based Applications, SQL, Database, Search, Stream Processing

99.999% HA

Data Protection

Disaster Recovery

Scalability & Performance

Enterprise Integration

Multi-tenancy
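
For a sense of scale, the 99.999% availability figure on this slide can be converted into a downtime budget. A minimal sketch, assuming a 365-day year; the function name is illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def max_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime allowed by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


five_nines = max_downtime_minutes(99.999)
print(f"99.999% availability allows about {five_nines:.2f} minutes of downtime per year")
```

That works out to a little over five minutes of downtime per year.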

World Record Performance

Why is MapR faster and more efficient?

– C/C++ vs. Java

– Distributed metadata

– Optimized shuffle

New Minute Sort World Record

1.5 TB in 1 minute

2103 nodes
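
The record figures can be broken down into throughput numbers. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes) and an even spread of data across nodes; variable names are illustrative:

```python
data_bytes = 1.5e12   # 1.5 TB, assuming decimal terabytes
nodes = 2103
seconds = 60

# Aggregate sort throughput across the cluster, in gigabytes per second.
cluster_gb_s = data_bytes / seconds / 1e9

# Average share handled by each node, in megabytes per second.
per_node_mb_s = data_bytes / nodes / seconds / 1e6

print(f"cluster: {cluster_gb_s:.0f} GB/s, per node: {per_node_mb_s:.1f} MB/s")
```

Under these assumptions the cluster sorts about 25 GB every second, or roughly 12 MB per second per node.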

Thank You
