Hadoop Update: Big Data Analytics, May 23rd 2012, Matt Mead, Cloudera
Hadoop Update: Big Data Analytics
May 23rd 2012, Matt Mead, Cloudera
Hadoop Distributed File System (HDFS)
Self-Healing, High Bandwidth Clustered
Storage
MapReduce
Distributed Computing Framework
Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed.
CORE HADOOP SYSTEM COMPONENTS
Provides storage and computation in a single, scalable system.
What is Hadoop?
Why Use Hadoop?
Move beyond rigid legacy frameworks.
Hadoop handles any data type, in any quantity.
Structured, unstructured
Schema, no schema
High volume, low volume
All kinds of analytic applications
Hadoop is 100% Apache® licensed and open source.
No vendor lock-in
Community development
Rich ecosystem of related projects
Hadoop grows with your business.
Proven at petabyte scale
Capacity and performance grow simultaneously
Leverages commodity hardware to mitigate costs
Hadoop helps you derive the complete value of all
your data.
Drives revenue by extracting value from data that was previously out of reach
Controls costs by storing data more affordably than any other platform
The Need for CDH
1. The Apache Hadoop ecosystem is complex
– Many different components, lots of moving parts
– Most companies require more than just HDFS and MapReduce
– Creating a Hadoop stack is time-consuming and requires specific expertise
• Component and version selection
• Integration (internal & external)
• System test w/ end-to-end workflows
2. Enterprises consume software in a certain way
– System, not silo
– Tested and stable
– Documented and supported
– Predictable release schedule
Core Values of CDH
Storage
Computation
Integration
Coordination
Access
Components of the CDH Stack
A Hadoop system with everything you need for production use.
Coordination – APACHE ZOOKEEPER
Data Integration – APACHE FLUME, APACHE SQOOP
Fast Read/Write Access – APACHE HBASE
Languages / Compilers – APACHE PIG, APACHE HIVE, APACHE MAHOUT
Workflow Scheduling – APACHE OOZIE
Metadata – APACHE HIVE
File System Mount – FUSE-DFS
UI Framework – HUE
SDK – HUE SDK
Storage / Computation – HDFS, MAPREDUCE
The Need for CDH
A set of open source components, packaged into a single system.
CORE APACHE HADOOP
HDFS – Distributed, scalable, fault tolerant file system
MapReduce – Parallel processing framework for large data sets
QUERY / ANALYTICS
Apache Hive – SQL-like language and metadata repository
Apache Pig – High level language for expressing data analysis programs
Apache HBase – Hadoop database for random, real-time read/write access
Apache Mahout – Library of machine learning algorithms for Apache Hadoop
DATA INTEGRATION
Apache Sqoop – Integrating Hadoop with RDBMS
Apache Flume – Distributed service for collecting and aggregating log and event data
Fuse-DFS – Module within Hadoop for mounting HDFS as a traditional file system
WORKFLOW / COORDINATION
Apache Oozie – Server-based workflow engine for Hadoop activities
Apache Zookeeper – Highly reliable distributed coordination service
GUI / SDK
Hue – Browser-based desktop interface for interacting with Hadoop
CLOUD
Apache Whirr – Library for running Hadoop in the cloud
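The MapReduce model these components build on can be illustrated with a plain-Python word count. This is a sketch, not Hadoop API code: `run_job` stands in for the shuffle/sort step that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a record."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce phase: sum the counts collected for one key."""
    return word, sum(counts)

def run_job(records):
    """Simulate Hadoop's shuffle/sort: group map output by key,
    then hand each key's values to the reducer."""
    pairs = sorted(kv for line in records for kv in mapper(line))
    return dict(reducer(k, (c for _, c in g))
                for k, g in groupby(pairs, key=itemgetter(0)))

counts = run_job(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

In a real cluster the same `mapper` and `reducer` logic would run in parallel across nodes (e.g. via Hadoop Streaming), with HDFS holding the input and output.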
Core Hadoop Use Cases
Two Core Use Cases, Applied Across Verticals:
1. Data Processing
2. Advanced Analytics
VERTICAL        INDUSTRY TERM                  INDUSTRY TERM
Web             Social Network Analysis        Clickstream Sessionization
Media           Content Optimization           Engagement
Telco           Network Analytics              Mediation
Retail          Loyalty & Promotions Analysis  Data Factory
Financial       Fraud Analysis                 Trade Reconciliation
Federal         Entity Analysis                SIGINT
Bioinformatics  Genome Mapping                 Sequencing Analysis
FMV & Image Processing
Data Processing – Full Motion Video & Image Processing
• Record by record -> easy parallelization
– "Unit of work" is important
– Raw data in HDFS
• Adaptation of existing image analyzers to Map Only / Map Reduce
• Scales horizontally
• Simple detections
– Vehicles
– Structures
– Faces
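The record-by-record pattern above can be sketched as a map-only job. `detect_objects` here is a hypothetical stand-in for an existing image analyzer; the point is that each frame is an independent unit of work, so no reduce phase is needed.

```python
def detect_objects(frame):
    # Hypothetical analyzer: flags frames whose pixel sum exceeds a
    # threshold. A real FMV pipeline would wrap an existing detector.
    return ["vehicle"] if sum(frame["pixels"]) > 10 else []

def map_only_job(frames):
    # Map-only: every record is processed independently, so Hadoop can
    # run the same mapper on each node and scale horizontally.
    return [(f["id"], detect_objects(f)) for f in frames]

frames = [{"id": 1, "pixels": [9, 9]}, {"id": 2, "pixels": [1, 1]}]
print(map_only_job(frames))  # [(1, ['vehicle']), (2, [])]
```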
Cybersecurity Analysis
Advanced Analytics – Cybersecurity Analysis
• Rates and flows – ingest can exceed multiple gigabytes per second
• Can be complex because of mixed-workload clusters
• Typically involves ad-hoc analysis
– Question-oriented analytics
• “Productionized” use cases allow insight by non-analysts
• Existing open source solution: SHERPASURFING
– Focuses on the cybersecurity analysis underpinnings for common data-sets (pcap, netflow, audit logs, etc.)
– Provides a means to ask questions without reinventing all the plumbing
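A question-oriented query over flow data might look like the following sketch. The record fields and the `top_talkers` helper are illustrative, not part of SHERPASURFING or any netflow schema.

```python
# Illustrative netflow-style records; field names are hypothetical.
flows = [
    {"src": "10.0.0.5", "dst": "192.0.2.1", "bytes": 50_000_000},
    {"src": "10.0.0.7", "dst": "192.0.2.9", "bytes": 1_200},
]

def top_talkers(flows, threshold):
    """One 'question': which sources sent more than `threshold` bytes?"""
    totals = {}
    for f in flows:
        totals[f["src"]] = totals.get(f["src"], 0) + f["bytes"]
    return sorted(src for src, total in totals.items() if total > threshold)

print(top_talkers(flows, 1_000_000))  # ['10.0.0.5']
```

Productionizing a question like this (e.g. as a Hive query or scheduled Oozie job) is what lets non-analysts get insight from the same data.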
Index Preparation
Data Processing – Index Preparation
• Hadoop’s Seminal Use Case
• Dynamic Partitioning -> Easy Parallelization
• String Interning
• Inverse Index Construction
• Dimensional data capture
• Destination indices
– Lucene/Solr (and derivatives)
– Endeca
• Existing solution: USA Search (http://usasearch.howto.gov/)
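Inverse index construction, the core of index preparation, can be sketched in a few lines. This runs outside Hadoop for illustration; at scale each document would be handled by a mapper and the per-term merging by reducers.

```python
def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        # set() dedupes terms within a document (simple string interning
        # and tokenization; real pipelines use proper analyzers).
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "hadoop scales", 2: "hadoop stores data"}
index = build_inverted_index(docs)
print(index["hadoop"])  # [1, 2]
```

The resulting term-to-postings map is the shape of input that engines like Lucene/Solr build their indices from.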
Data Processing – Schema-less Enterprise Data Warehouse / Landing Zone
• Begins as storage, light ingest processing, retrieval
• Capacity scales horizontally
• Schema-less -> holds arbitrary content
• Schema-less -> allows ad-hoc fusion and analysis
• Additional analytic workload forces decisions
Data Landing Zone
Hadoop: Getting Started
• Reactive
– Forced by scale or cost of scaling
• Proactive
– Seek talent ahead of need to build
– Identify data-sets
– Determine high-value use cases that change organizational outcomes
– Start with 10-20 nodes and 10+TB unless data-sets are super-dimensional
• Either way
– Talent is a major challenge
– Start with “Data Processing” use cases
– Physical infrastructure is complex; make the software infrastructure simple to manage
Customer Success
[Chart: cost in $ millions vs. time to production deployment in months]

Option 1: Use Cloudera Enterprise
Estimated Cost: $2 million
Deployment Time: ~2 months

Option 2: Self-Source
Estimated Cost: $4.8 million
Deployment Time: ~6 months

Note: Cost estimates include personnel, software & hardware.
Source: Cloudera internal estimates.
Self-Source Deployment vs. Cloudera Enterprise – 500 node deployment
Customer Success
Item | Cloudera Enterprise | Self-Source or Contract
Support Offering | World-class, global, dedicated contributors and committers | Must recruit, hire, train and retain Hadoop experts
Monitoring and Management | Fully integrated application for Hadoop intelligence | Must be developed and maintained in house
Support for the Full Hadoop Stack | Full stack* | Unknown
Regular Scheduled Releases | Yearly major, quarterly minor, hot fixes | N/A
Training and Certification for the Full Hadoop Stack | Available worldwide | None
Support for Full Lifecycle | All inclusive, development through production | Community support
Rich Knowledge-base | 500+ articles | None
Production Solution Guides | Included | None
* Flume, FuseDFS, HBase, HDFS, Hive, Hue, Mahout, MR1, MR2, Oozie, Pig, Sqoop, Zookeeper
Cloudera Enterprise Subscription vs. Self-Source
• Erin Hawley– Business Development, Cloudera DoD Engagement
• Matt Mead– Sr. Systems Engineer, Cloudera Federal Engagements
Contact Us