Post on 13-Jan-2015
Big Data Lessons from the Cloud
Jack Norris, MapR Technologies
The Challenge of Big Data
– Data volume is growing 44x: from 1.2 zettabytes in 2010 to 35.2 zettabytes in 2020
– Data is growing faster than Moore’s Law
– Business analytics requires a new approach
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
What are the Requirements for Big Data?
– Process it quickly
– Combine multiple data sources
– Expand analysis
Big Data in the Cloud
– Distributed, scalable computing platform
– Data/compute framework
– Commodity hardware
– Pioneered at Google
– Commercially available as Hadoop
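The data/compute framework behind Hadoop is MapReduce. As a rough illustration of the model (a minimal in-memory sketch, not actual Hadoop API code), word count splits into a map phase, a shuffle that groups by key, and a reduce phase:

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in one input record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one word.
    return (key, sum(values))

records = ["Big Data", "big data lessons"]
pairs = [p for r in records for p in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {"big": 2, "data": 2, "lessons": 1}
```

The same three stages run distributed across commodity machines in a real cluster; only the plumbing changes, not the programming model.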
Important Drivers for Hadoop
– Data on compute
– You don’t need to know what questions to ask beforehand
– Simple algorithms on Big Data
– Analysis of unstructured data

Hadoop Growth
Apache Hadoop Distribution
A combination of various packages, integrated, tested, and hardened: MapReduce, Pig, Hive, HBase, Mahout, Oozie, Whirr, Avro, Cascading, Flume, Sqoop, HCatalog, Zookeeper, and Drill across the data platform, with Nagios and Ganglia in the management control system.
Hadoop in the Cloud

Amazon Example: Elastic MapReduce (EMR)
EMR provides Hadoop as a Service in the Cloud
How does it work?
– You can store the data in S3 and/or on the cluster (HDFS)
– You decide which Hadoop distribution to run, how many nodes, and what types of nodes
– You can easily add additional nodes
– When processing is complete, you can shut down the cluster (and stop paying)
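The choices above (which distribution, how many nodes, what node types) correspond to fields in EMR's cluster-launch request. Below is a minimal sketch built as a plain dict whose keys mirror boto3's `run_job_flow` parameters; nothing is sent to AWS here, and the release label, instance types, and bucket name are placeholder assumptions:

```python
# Sketch of an EMR launch request; all values are illustrative placeholders.
cluster_request = {
    "Name": "analytics-cluster",
    "ReleaseLabel": "emr-6.15.0",              # which Hadoop release to run
    "LogUri": "s3://my-bucket/logs/",          # hypothetical S3 bucket
    "Instances": {
        "MasterInstanceType": "m5.xlarge",     # what type of nodes
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,                   # how many nodes
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down when processing completes
    },
}

def resize(request, extra_nodes):
    # Adding nodes is just raising the instance count; EMR lets you grow a
    # running cluster, and this models only that arithmetic.
    request["Instances"]["InstanceCount"] += extra_nodes
    return request

resize(cluster_request, 5)
# cluster_request["Instances"]["InstanceCount"] == 15
```

With `KeepJobFlowAliveWhenNoSteps` set to `False`, the cluster terminates itself once its steps finish, which is what "stop paying" means in practice.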
Launching a Cluster
Thousands of customers, 2 million+ clusters
Hadoop in the Cloud is a Flexible Infrastructure for Big Data

Cloud Example of Scalability
– MinuteSort: the amount of data that can be sorted in 60 seconds; the benchmark is technology-agnostic
– The previous record was 1.4 TB, set by Microsoft Research using specially designed software across physical hardware
– The previous Hadoop MinuteSort record was 578 GB

A New MinuteSort World Record
– New world record: 1.5 TB in 60 seconds
– 3x more data processed than the previous Hadoop record
Cloud Deployment Comparison
– Previous record: 3,452 physical servers; prepare the datacenter, rack and stack servers, maintain hardware. Time: months.
– Cloud record: 2,103 instances; invoke a gcutil command. Time: minutes.

Cost Comparison
– Previous record: 3,452 1U servers x $4K/server = $13,808,000
– Cloud record: 2,103 n1-standard-4-d instances x $0.58/instance-hour x 60 seconds = $20.33
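The two cost figures can be checked with simple arithmetic, assuming the quoted $4K/server and $0.58/instance-hour rates and billing the cloud run for exactly its 60 seconds of use, as the slide does:

```python
# Previous record: capital cost of the physical fleet.
physical_cost = 3452 * 4000              # 3,452 1U servers at $4K each

# Cloud record: 2,103 instances at $0.58/instance-hour, run for 60 seconds.
cloud_cost = 2103 * 0.58 * (60 / 3600)   # fraction of an hour actually used

print(physical_cost)         # 13808000
print(round(cloud_cost, 2))  # 20.33
```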
Use Case 1: Expand Data for Analysis

Comparing an EDW to Hadoop
– Major telecom vendor
– Key step in billing pipeline handled by an enterprise data warehouse (EDW)
– EDW at maximum capacity
– Multiple rounds of software optimization already done
– Revenue-limiting (= career-limiting) bottleneck
Original Flow
CDR billing records → extract, load, and transformation in the data warehouse → billing reports and customer bills
Problem Analysis
– 70% of the EDW load is related to call detail record (CDR) normalization
  – Less than 10% of total lines of code
  – CDR normalization is difficult within the EDW
  – Binary extraction and conversion
– Data rates are too high for an upstream transform
  – Requires high-volume joins
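CDR normalization of this kind is naturally a map-only job: each record is parsed and converted independently, so it parallelizes trivially across a Hadoop cluster. A toy sketch follows; the fixed-width layout and field names are entirely hypothetical, not the vendor's actual binary format:

```python
# Hypothetical fixed-width CDR layout:
# caller (10 chars), callee (10 chars), duration in seconds (6 chars).
def normalize_cdr(raw):
    caller = raw[0:10].strip()
    callee = raw[10:20].strip()
    seconds = int(raw[20:26])
    # Normalize to the schema the warehouse expects:
    # prefixed numbers and billing minutes rounded up.
    minutes = -(-seconds // 60)  # ceiling division
    return {"caller": "+" + caller,
            "callee": "+" + callee,
            "billed_minutes": minutes}

# In MapReduce terms this function is the mapper; no reducer is needed.
normalized = [normalize_cdr(r) for r in ["55512345675552345678000185"]]
# normalized[0]["billed_minutes"] == 4  (185 s rounds up to 4 min)
```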
With ETL Offload
CDR billing records → ETL on the Hadoop cluster → data warehouse → billing reports and customer bills

ETL Offload
The Hadoop distribution handles the extract and transformation work offloaded from the warehouse.
Simplified Analysis
– 70% of the EDW is consumed by ETL processing; offload frees capacity
– EDW direct hardware cost is approximately $30 million vs. a Hadoop cluster at 1/50 the cost
– An additional EDW would only increase capacity by 50% due to poor division of labor

The Results
– EDW strategy: 1.5x performance, $30 million
– Hadoop strategy: 3x faster, a 20x cost/performance advantage, with high availability and data protection
Use Case 2: Combine Many Different Data Sources

Combining different feeds on one platform
– Hadoop and HBase for storage and processing
– Real-time data feed from a social network, stored in Hadoop
– Historical purchase information
– Billing data
– Predictive analytics from historical data combined with NoSQL querying on real-time social networking data
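The combination above, batch history plus a real-time keyed store plus billing, amounts to a per-user join. A toy in-memory illustration; the dicts stand in for the Hadoop and HBase sides, and all field names are invented:

```python
# Stand-in for historical purchase data held in Hadoop (batch side).
historical_purchases = {
    "user42": ["sports movie", "headphones"],
}
# Stand-in for a real-time social feed written to an HBase-like store.
realtime_events = {
    "user42": {"last_search": "video games"},
}
# Stand-in for billing data.
billing = {"user42": {"plan": "premium"}}

def profile(user_id):
    # Combine all three feeds for one user, as a recommender would
    # before scoring candidate offers.
    return {
        "purchases": historical_purchases.get(user_id, []),
        "recent": realtime_events.get(user_id, {}),
        "billing": billing.get(user_id, {}),
    }

p = profile("user42")
# p["recent"]["last_search"] == "video games"
```

On the real platform the join key is the same; what changes is that the batch side is a MapReduce output and the real-time side is an HBase row lookup.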
Results
– New service rolled out in one quarter
– Processing time cut from 20 hours per day to 3
– Recommendation engine load time decreased from 8 hours to 3 minutes
– Includes data versioning support for easier development and updating of models
Collect Data from Dispersed Data Sources

Leading Veterinary Equipment Manufacturer
– Aggregates data across 6,000 veterinary clinics
– Nightly extracts from each clinic
– One job runs once a week for a few hours
– Expanding applications to include vaccination analysis for 300M vaccinations
– Predictive analytics for disease prevalence and prevention
Use Case 3: New Application from New Data Source

Ancestry.com – Family Tree

Overview and Requirements
– Collect and collate information from disparate sources (text files, images, etc.)
– Leverage a new data source: spit (saliva samples for DNA analysis)
– Machine learning techniques and DNA matching algorithms

The Results
– Storage infrastructure for billions of small and large files
– Blob store for large images through NoSQL solutions
– Multi-tenant capability for data-mining and machine-learning algorithm development
Use Case 4: New Analytics on Existing Data

Analytic Flexibility
– MapReduce-enabled machine learning algorithms
– Enhanced search
– Real-time event processing
– No need to sample the data
– Applications: fraud detection, target marketing, consumer behavior analysis, and more
Hadoop Expands Analytics
“Simple algorithms and lots of data trump complex models”
– Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

Advanced Simple Analytics
– Fraud detection: detect small frauds using transaction patterns across the entire portfolio; identify the compromise signature to prevent further exploits and provide solid case explanations
– Google Flu Trends vs. traditional flu surveillance systems and modeling
– Netflix recommendation engine: complex models vs. adding IMDB data
Combine Them All

Clickstream Analysis
– Big-box retailer came to Razorfish
– 3.5 billion records
– 71 million unique cookies
– 1.7 million targeted ads required per day
– Problem: improve Return on Ad Spend (ROAS)
– Example targeted ad: a user who recently purchased a sports movie and is searching for video games (1.7 million such ads per day)
– Processing time dropped from 2+ days to 8 hours (with lots more data)
– Increased Return on Ad Spend by 500%
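At this scale the core computation is still simple counting: aggregate events per cookie, then pick an ad. A toy per-cookie aggregation follows; the event strings are invented for illustration:

```python
from collections import Counter

# Toy clickstream: (cookie_id, event) pairs; real logs hold billions.
clicks = [
    ("c1", "view:sports-movie"),
    ("c1", "search:video-games"),
    ("c2", "view:garden-tools"),
    ("c1", "search:video-games"),
]

# Count events per cookie: the shape of the MapReduce aggregation a
# targeting pipeline runs before choosing which ad to serve.
per_cookie = {}
for cookie, event in clicks:
    per_cookie.setdefault(cookie, Counter())[event] += 1

top_for_c1 = per_cookie["c1"].most_common(1)[0]
# top_for_c1 == ("search:video-games", 2)
```

The same grouping, sharded across a cluster by cookie ID, is what turns a 2+ day batch into an 8-hour one.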
Hadoop in the Cloud/EMR Applications
– Targeted advertising / clickstream analysis
– Security: anti-virus, fraud detection, image recognition
– Pattern matching / recommendations
– Data warehousing / BI
– Bioinformatics (genome analysis)
– Financial simulation (Monte Carlo simulation)
– File processing (resizing JPEGs, video encoding)
– Web indexing
Big Data Processing
– Platform requirements: 99.999% HA, data protection, disaster recovery, scalability and performance, enterprise integration, multi-tenancy
– Workloads: MapReduce, file-based applications, SQL, database, search, stream processing
– Batch orientation: enterprise logfile analysis, ETL offload, object archive, fraud detection, clickstream analytics
– Real-time orientation: sensor analysis, “Twitterscraping”, telematics, process optimization
– Interactive orientation: forensic analysis, analytic modeling, BI user focus
Big Data Lessons from the Cloud
1. Big Data requires a new approach
2. Hadoop is a paradigm shift
3. Easy to get started with Hadoop in the Cloud
4. Scale clusters up and down in the Cloud
5. Only pay for what you use
6. Expand data for analysis
7. Combine data sources
8. New applications from new data sources
9. New analytics on existing data
10. Wide variety of applications appropriate for Hadoop