Posted on 13-Jan-2015
1©MapR Technologies
Expect More from Hadoop
Jack Norris, MapR Technologies
Hadoop Growth
Important Drivers for Hadoop
Data on compute
You don’t need to know what questions to ask beforehand
Simple algorithms on Big Data
Analysis of unstructured data
The Cost of Enterprise Storage
SAN Storage: $2 - $10/Gigabyte
$1M gets: 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec
NAS Filers: $1 - $5/Gigabyte
$1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbytes/sec
Local Storage: $0.02/Gigabyte
$1M gets: 50 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec
1/100 to 1/20 the cost
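The "$1M gets" capacities follow directly from the per-gigabyte prices. A minimal sketch of that arithmetic, assuming the low end of each quoted price range and decimal units (1 Petabyte = 1,000,000 Gigabytes); the function and variable names are illustrative and not from the slides:

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 10^6 GB


def capacity_pb(budget_usd, price_per_gb):
    """Petabytes of raw capacity that a budget buys at a given $/GB price."""
    return budget_usd / price_per_gb / GB_PER_PB


# Low end of each price range quoted on the slide (assumption: low end used)
prices = {"SAN Storage": 2.00, "NAS Filers": 1.00, "Local Storage": 0.02}
capacities = {name: capacity_pb(1_000_000, p) for name, p in prices.items()}

for name, pb in capacities.items():
    print(f"{name}: {pb:.1f} PB per $1M")
```

Under these assumptions the results line up with the 0.5, 1, and 50 Petabyte figures quoted for SAN, NAS, and local storage.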
MapReduce: A Paradigm Shift
Distributed, scalable computing platform
– Data/Compute framework
– Commodity hardware
Pioneered at Google
Commercially available as Hadoop
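To make the paradigm concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using a word count. It is plain Python written only for illustration, not the Hadoop API; all function names are hypothetical. On a real cluster each phase would run in parallel across nodes, close to where the data is stored.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["big data on hadoop", "simple algorithms on big data"])
print(counts)
```

Because each map call sees one record and each reduce call sees one key, both phases can be split across many commodity machines without changing the logic.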
MapR Distribution for Apache Hadoop
Complete Hadoop distribution
Comprehensive management suite
Industry-standard interfaces
Enterprise-grade dependability
Higher performance
Components shown in the distribution diagram: Pig, Hive, HBase, Mahout, Oozie, Whirr, Avro, Cascading, Nagios, Ganglia, Flume, Sqoop, HCatalog, Zookeeper, Drill, MapReduce
MapR Control System
MapR Data Platform
How do you Benefit?
Expanding data for existing applications
Use Case #1
Major telecom vendor
Key step in billing pipeline handled by data warehouse (EDW)
EDW at maximum capacity
Multiple rounds of software optimization already done
Revenue limiting (= career limiting) bottleneck
Original Flow
Diagram labels: CDR billing records; Extract and Load; Transformation; Data Warehouse; Billing reports; Customer bills
Problem Analysis
70% of EDW load is related to call detail record (CDR) normalization
– < 10% of total lines of code
– CDR normalization difficult within the EDW
– Binary extraction and conversion
Data rates are too high for upstream transform
– Requires high-volume joins
With ETL Offload
Diagram labels: CDR billing records; ETL; Hadoop Cluster; Data Warehouse; Billing reports; Customer billing
Simplified Analysis
70% of EDW consumed by ETL processing – Offload frees capacity
EDW direct hardware cost is approximately $30 million vs. Hadoop cluster at 1/50 the cost
Additional EDW only increases capacity by 50% due to poor division of labor
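A back-of-the-envelope sketch of why offloading frees so much capacity. It assumes the whole 70% ETL share moves to Hadoop and that the remaining warehouse work scales linearly with freed capacity; both are simplifying assumptions, and the variable names are illustrative:

```python
EDW_COST_USD = 30_000_000      # approximate EDW hardware cost from the slide
ETL_SHARE = 0.70               # share of EDW load consumed by ETL processing
HADOOP_COST_FRACTION = 1 / 50  # Hadoop cluster cost relative to the EDW

# Assumption: all ETL work is offloaded, so only the remaining share
# of the current load stays on the warehouse.
remaining_load = 1 - ETL_SHARE

# Capacity available to the remaining work, relative to what it has today.
headroom = 1 / remaining_load

# Hardware cost of the Hadoop cluster at 1/50 of the EDW cost.
hadoop_cost_usd = EDW_COST_USD * HADOOP_COST_FRACTION

print(f"Headroom for remaining EDW work: {headroom:.1f}x")
print(f"Hadoop cluster cost: ${hadoop_cost_usd:,.0f}")
```

Under these assumptions the existing warehouse gains roughly 3x headroom for a small fraction of the cost of adding EDW hardware, which only adds 50% capacity.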
The Results
EDW strategy
– 1.5x performance
– $30 million
MapR strategy
– 3x faster
– 20x cost/performance advantage for MapR strategy
– With High Availability and data protection
Use Case #2
Combine Many Different Data Sources
Use Case #2 – Customer Example
Global Credit Card Issuer
Launching a New Location Based Service
Benefits both Merchants and Consumers
Combining different feeds on one platform
Hadoop and HBase Storage and Processing
…
Real-time data feed from social network
Stored in Hadoop
Historical Purchase Information
Predictive Analytics from Historical data combined with NoSQL querying on real-time social networking data
Billing Data
Results
New Service Rolled out in 1 quarter
Processing time cut from 20 hours per day to 3
Recommendation engine load time decreased from 8 hours to 3 minutes
Includes data versioning support for easier development and updating of models
Use Case #3
New Application from New Data Source
Ancestry.com – Family Tree
Overview and Requirements
Collect and Collate information from disparate sources (Text files, Images, etc.)
Leverage new data source: Spit
Machine learning techniques and DNA Matching Algorithms
The Results
Storage Infrastructure for billions of small and large files
Blob Store for large images through NoSQL solutions
Multi-tenant capability for data-mining and machine-learning algorithm development
One highly available, efficient platform
MapR M7: Making HBase Enterprise Grade
Other Distributions (layers as listed): Disks, ext3, JVM, DFS, JVM, HBase
M7 (layers as listed): Disks, Unified
Easy: No RegionServers; Seamless splits; Automatic merges; In-memory column families
Dependable: No compactions; Instant recovery from node failure; Snapshots; Mirroring
Fast: Consistent low latency; Real-time in-memory configuration; Disk and network compression; Reduced I/O to disk
Use Case
New Analytics on Existing Data
Analytic Flexibility
MapReduce enabled Machine learning algorithms
Enhanced Search
Real-time event processing
No need to sample the data
Fraud Detection, Target Marketing, Consumer Behavior Analysis, …
Hadoop Expands Analytics
“Simple algorithms and lots of data trump complex models”
Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems
Use Case #4
Combine All Three
Where do you Start?
One Platform for Big Data
Batch Processing, File-Based Applications, SQL, Database, Search, Stream Processing, …
Batch, Interactive, Realtime
99.999% HA
Data Protection
Disaster Recovery
Scalability & Performance
Enterprise Integration
Multi-tenancy
World Record Performance
Why is MapR faster and more efficient?
– C/C++ vs. Java
– Distributed metadata
– Optimized shuffle
New Minute Sort World Record
1.5 TB in 1 minute
2103 nodes
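For a sense of scale, the record figures can be converted into sustained sort throughput. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and an even spread of data across nodes; the variable names are illustrative:

```python
TOTAL_BYTES = 1.5e12  # 1.5 TB, assuming decimal units
NODES = 2103
SECONDS = 60

# Aggregate sort throughput across the whole cluster, in GB/s
aggregate_gb_per_s = TOTAL_BYTES / SECONDS / 1e9

# Average throughput per node, in MB/s (assumes an even data distribution)
per_node_mb_per_s = TOTAL_BYTES / NODES / SECONDS / 1e6

print(f"Aggregate: {aggregate_gb_per_s:.1f} GB/s")
print(f"Per node:  {per_node_mb_per_s:.1f} MB/s")
```

Under these assumptions the cluster sorts about 25 GB every second, or just under 12 MB per second per node.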
Thank You