Post on 13-Jan-2015
Big Data Lessons from the Cloud
Jack Norris, MapR Technologies
The Challenge of Big Data
– Data volume is growing 44x: from 1.2 zettabytes in 2010 to 35.2 zettabytes in 2020
– Data is growing faster than Moore’s Law
– Business analytics requires a new approach
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
What are the Requirements for Big Data?
– Process it quickly
– Combine multiple data sources
– Expand analysis
Big Data in the Cloud
– Distributed, scalable computing platform
– Data/compute framework
– Commodity hardware
– Pioneered at Google
– Commercially available as Hadoop
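The data/compute framework behind Hadoop is MapReduce. As a rough illustration of the model (a minimal in-memory sketch, not actual Hadoop API code), word count splits into a map phase, a shuffle that groups by key, and a reduce phase:

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in one input record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one word.
    return (key, sum(values))

records = ["Big Data", "big data lessons"]
pairs = [p for r in records for p in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {"big": 2, "data": 2, "lessons": 1}
```

The same three stages run distributed across commodity machines in a real cluster; only the plumbing changes, not the programming model.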
Important Drivers for Hadoop
– Data on compute
– You don’t need to know what questions to ask beforehand
– Simple algorithms on Big Data
– Analysis of unstructured data

Hadoop Growth
Apache Hadoop Distribution
A combination of various packages, integrated, tested, and hardened: MapReduce, Pig, Hive, HBase, Mahout, Oozie, Whirr, Avro, Cascading, Flume, Sqoop, HCatalog, Zookeeper, and Drill across the data platform, with Nagios and Ganglia in the management control system.
Hadoop in the Cloud

Amazon Example: Elastic MapReduce (EMR)
EMR provides Hadoop as a Service in the Cloud
How does it work?
– You can store the data in S3 and/or on the cluster (HDFS)
– You decide which Hadoop distribution to run, how many nodes, and what types of nodes
– You can easily add additional nodes
– When processing is complete, you can shut down the cluster (and stop paying)
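The choices above (which distribution, how many nodes, what node types) correspond to fields in EMR's cluster-launch request. Below is a minimal sketch built as a plain dict whose keys mirror boto3's `run_job_flow` parameters; nothing is sent to AWS here, and the release label, instance types, and bucket name are placeholder assumptions:

```python
# Sketch of an EMR launch request; all values are illustrative placeholders.
cluster_request = {
    "Name": "analytics-cluster",
    "ReleaseLabel": "emr-6.15.0",              # which Hadoop release to run
    "LogUri": "s3://my-bucket/logs/",          # hypothetical S3 bucket
    "Instances": {
        "MasterInstanceType": "m5.xlarge",     # what type of nodes
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,                   # how many nodes
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down when processing completes
    },
}

def resize(request, extra_nodes):
    # Adding nodes is just raising the instance count; EMR lets you grow a
    # running cluster, and this models only that arithmetic.
    request["Instances"]["InstanceCount"] += extra_nodes
    return request

resize(cluster_request, 5)
# cluster_request["Instances"]["InstanceCount"] == 15
```

With `KeepJobFlowAliveWhenNoSteps` set to `False`, the cluster terminates itself once its steps finish, which is what "stop paying" means in practice.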
Launching a Cluster
Thousands of customers, 2 million+ clusters
Hadoop in the Cloud is a Flexible Infrastructure for Big Data

Cloud Example of Scalability
– MinuteSort: the amount of data that can be sorted in 60 seconds; the benchmark is technology-agnostic
– The previous record was 1.4 TB, set by Microsoft Research using specially designed software across physical hardware
– The previous Hadoop MinuteSort record was 578 GB

A New MinuteSort World Record
– New world record: 1.5 TB in 60 seconds
– 3x more data processed than the previous Hadoop record
Cloud Deployment Comparison
– Previous record: 3,452 physical servers; prepare the datacenter, rack and stack servers, maintain hardware. Time: months.
– Cloud record: 2,103 instances; invoke a gcutil command. Time: minutes.

Cost Comparison
– Previous record: 3,452 1U servers x $4K/server = $13,808,000
– Cloud record: 2,103 n1-standard-4-d instances x $0.58/instance-hour x 60 seconds = $20.33
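The two cost figures can be checked with simple arithmetic, assuming the quoted $4K/server and $0.58/instance-hour rates and billing the cloud run for exactly its 60 seconds of use, as the slide does:

```python
# Previous record: capital cost of the physical fleet.
physical_cost = 3452 * 4000              # 3,452 1U servers at $4K each

# Cloud record: 2,103 instances at $0.58/instance-hour, run for 60 seconds.
cloud_cost = 2103 * 0.58 * (60 / 3600)   # fraction of an hour actually used

print(physical_cost)         # 13808000
print(round(cloud_cost, 2))  # 20.33
```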
Use Case 1: Expand Data for Analysis

Comparing an EDW to Hadoop
– Major telecom vendor
– Key step in billing pipeline handled by an enterprise data warehouse (EDW)
– EDW at maximum capacity
– Multiple rounds of software optimization already done
– Revenue-limiting (= career-limiting) bottleneck
Original Flow
CDR billing records → extract, load, and transformation in the data warehouse → billing reports and customer bills
Problem Analysis
– 70% of the EDW load is related to call detail record (CDR) normalization
  – Less than 10% of total lines of code
  – CDR normalization is difficult within the EDW
  – Binary extraction and conversion
– Data rates are too high for an upstream transform
  – Requires high-volume joins
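CDR normalization of this kind is naturally a map-only job: each record is parsed and converted independently, so it parallelizes trivially across a Hadoop cluster. A toy sketch follows; the fixed-width layout and field names are entirely hypothetical, not the vendor's actual binary format:

```python
# Hypothetical fixed-width CDR layout:
# caller (10 chars), callee (10 chars), duration in seconds (6 chars).
def normalize_cdr(raw):
    caller = raw[0:10].strip()
    callee = raw[10:20].strip()
    seconds = int(raw[20:26])
    # Normalize to the schema the warehouse expects:
    # prefixed numbers and billing minutes rounded up.
    minutes = -(-seconds // 60)  # ceiling division
    return {"caller": "+" + caller,
            "callee": "+" + callee,
            "billed_minutes": minutes}

# In MapReduce terms this function is the mapper; no reducer is needed.
normalized = [normalize_cdr(r) for r in ["55512345675552345678000185"]]
# normalized[0]["billed_minutes"] == 4  (185 s rounds up to 4 min)
```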
With ETL Offload
CDR billing records → ETL on the Hadoop cluster → data warehouse → billing reports and customer bills

ETL Offload
The Hadoop distribution handles the extract and transformation work offloaded from the warehouse.
Simplified Analysis
– 70% of the EDW is consumed by ETL processing; offload frees capacity
– EDW direct hardware cost is approximately $30 million vs. a Hadoop cluster at 1/50 the cost
– An additional EDW would only increase capacity by 50% due to poor division of labor

The Results
– EDW strategy: 1.5x performance, $30 million
– Hadoop strategy: 3x faster, a 20x cost/performance advantage, with high availability and data protection
Use Case 2: Combine Many Different Data Sources

Combining different feeds on one platform
– Hadoop and HBase for storage and processing
– Real-time data feed from a social network, stored in Hadoop
– Historical purchase information
– Billing data
– Predictive analytics from historical data combined with NoSQL querying on real-time social networking data
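The combination above, batch history plus a real-time keyed store plus billing, amounts to a per-user join. A toy in-memory illustration; the dicts stand in for the Hadoop and HBase sides, and all field names are invented:

```python
# Stand-in for historical purchase data held in Hadoop (batch side).
historical_purchases = {
    "user42": ["sports movie", "headphones"],
}
# Stand-in for a real-time social feed written to an HBase-like store.
realtime_events = {
    "user42": {"last_search": "video games"},
}
# Stand-in for billing data.
billing = {"user42": {"plan": "premium"}}

def profile(user_id):
    # Combine all three feeds for one user, as a recommender would
    # before scoring candidate offers.
    return {
        "purchases": historical_purchases.get(user_id, []),
        "recent": realtime_events.get(user_id, {}),
        "billing": billing.get(user_id, {}),
    }

p = profile("user42")
# p["recent"]["last_search"] == "video games"
```

On the real platform the join key is the same; what changes is that the batch side is a MapReduce output and the real-time side is an HBase row lookup.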
Results
– New service rolled out in one quarter
– Processing time cut from 20 hours per day to 3
– Recommendation engine load time decreased from 8 hours to 3 minutes
– Includes data versioning support for easier development and updating of models
Collect Data from Dispersed Data Sources

Leading Veterinary Equipment Manufacturer
– Aggregates data across 6,000 veterinary clinics
– Nightly extracts from each clinic
– One job runs once a week for a few hours
– Expanding applications to include vaccination analysis for 300M vaccinations
– Predictive analytics for disease prevalence and prevention
Use Case 3: New Application from New Data Source

Ancestry.com – Family Tree

Overview and Requirements
– Collect and collate information from disparate sources (text files, images, etc.)
– Leverage a new data source: spit (saliva samples for DNA analysis)
– Machine learning techniques and DNA matching algorithms

The Results
– Storage infrastructure for billions of small and large files
– Blob store for large images through NoSQL solutions
– Multi-tenant capability for data-mining and machine-learning algorithm development
Use Case 4: New Analytics on Existing Data

Analytic Flexibility
– MapReduce-enabled machine learning algorithms
– Enhanced search
– Real-time event processing
– No need to sample the data
– Applications: fraud detection, target marketing, consumer behavior analysis, and more
Hadoop Expands Analytics
“Simple algorithms and lots of data trump complex models”
– Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

Advanced Simple Analytics
– Fraud detection: detect small frauds using transaction patterns across the entire portfolio; identify the compromise signature to prevent further exploits and provide solid case explanations
– Google Flu Trends vs. traditional flu surveillance systems and modeling
– Netflix recommendation engine: complex models vs. adding IMDB data
Combine Them All

Clickstream Analysis
– Big-box retailer came to Razorfish
– 3.5 billion records
– 71 million unique cookies
– 1.7 million targeted ads required per day
– Problem: improve Return on Ad Spend (ROAS)
– Example targeted ad: a user who recently purchased a sports movie and is searching for video games (1.7 million such ads per day)
– Processing time dropped from 2+ days to 8 hours (with lots more data)
– Increased Return on Ad Spend by 500%
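At this scale the core computation is still simple counting: aggregate events per cookie, then pick an ad. A toy per-cookie aggregation follows; the event strings are invented for illustration:

```python
from collections import Counter

# Toy clickstream: (cookie_id, event) pairs; real logs hold billions.
clicks = [
    ("c1", "view:sports-movie"),
    ("c1", "search:video-games"),
    ("c2", "view:garden-tools"),
    ("c1", "search:video-games"),
]

# Count events per cookie: the shape of the MapReduce aggregation a
# targeting pipeline runs before choosing which ad to serve.
per_cookie = {}
for cookie, event in clicks:
    per_cookie.setdefault(cookie, Counter())[event] += 1

top_for_c1 = per_cookie["c1"].most_common(1)[0]
# top_for_c1 == ("search:video-games", 2)
```

The same grouping, sharded across a cluster by cookie ID, is what turns a 2+ day batch into an 8-hour one.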
Hadoop in the Cloud/EMR Applications
– Targeted advertising / clickstream analysis
– Security: anti-virus, fraud detection, image recognition
– Pattern matching / recommendations
– Data warehousing / BI
– Bioinformatics (genome analysis)
– Financial simulation (Monte Carlo simulation)
– File processing (resizing JPEGs, video encoding)
– Web indexing
Big Data Processing
– Platform requirements: 99.999% HA, data protection, disaster recovery, scalability and performance, enterprise integration, multi-tenancy
– Workloads: MapReduce, file-based applications, SQL, database, search, stream processing
– Batch orientation: enterprise logfile analysis, ETL offload, object archive, fraud detection, clickstream analytics
– Real-time orientation: sensor analysis, “Twitterscraping”, telematics, process optimization
– Interactive orientation: forensic analysis, analytic modeling, BI user focus
Big Data Lessons from the Cloud
1. Big Data requires a new approach
2. Hadoop is a paradigm shift
3. Easy to get started with Hadoop in the Cloud
4. Scale clusters up and down in the Cloud
5. Only pay for what you use
6. Expand data for analysis
7. Combine data sources
8. New applications from new data sources
9. New analytics on existing data
10. Wide variety of applications appropriate for Hadoop