Hadoop: What It Is and What It's Not
Transcript of Hadoop: What It Is and What It's Not
Twitter Tag: #briefr
The Briefing Room
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
• November: Cloud
• December: Innovators
• January: Big Data
• February: Performance
• March: Integration
• The Data Warehouse was once considered the Holy Grail of Business Intelligence, but as data volumes increase exponentially, we’re finding that data warehousing cannot be all things for all users.
• Hadoop was initially developed at Yahoo! to support a search engine project and has since turned into the poster child for open source Big Data processing.
• While Hadoop is not a data warehouse, its capabilities can help organizations store and analyze huge volumes of data.
Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark has received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institution. He is an international speaker and a contributor at Forbes Online and Information Management. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
• Hortonworks is an enterprise software company that focuses on the development and support of Apache Hadoop.
• Its product is the Hortonworks Data Platform, an open source platform for storing, processing and analyzing large volumes of data from many sources and in a variety of formats.
• Hortonworks recently introduced its Hive ODBC Driver 1.0, which allows users to integrate its Hadoop platform with the BI apps running on top.
Jim Walker is the Director of Product Marketing at Hortonworks. He is a recovering developer, professional marketer and amateur photographer with nearly twenty years of experience building products and developing emerging technologies. During his career, he has brought multiple products to market in a variety of fields, including data loss prevention, master data management and now big data. At Hortonworks, Jim is focused on accelerating the development and adoption of Apache Hadoop.
© Hortonworks Inc. 2012
Hadoop: What It Is & Isn’t October 2012
Jim Walker Director, Product Marketing Hortonworks
Big Data: Organizational Game Changer
[Figure: data volume (megabytes, gigabytes, terabytes, petabytes) plotted against increasing data variety and complexity, in four nested tiers: ERP, CRM, WEB, BIG DATA.]
Data types shown in the figure: purchase detail, purchase record, payment record, offer details, support contacts, customer touches, segmentation, web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels, user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, audio, images, speech to text, product/service logs, social interactions & feeds, business data feeds, user click stream, sensors / RFID / devices, spatial & GPS coordinates.
Transactions + Interactions + Observations = BIG DATA
What is a Data Driven Business?
• DEFINITION Better use of available data in the decision making process
• RULE Key metrics derived from data should be tied to goals
• PROVEN RESULTS Firms that adopt Data-Driven Decision Making have output and productivity that is 5-6% higher than what would be expected given their investments and usage of information technology*
* “Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance?” Brynjolfsson, Hitt and Kim (April 22, 2011)
Big Data: Optimize Outcomes at Scale
• Media: optimize content
• Intelligence: optimize detection
• Finance: optimize algorithms
• Advertising: optimize performance
• Fraud: optimize prevention
• Retail / Wholesale: optimize inventory turns
• Manufacturing: optimize supply chains
• Healthcare: optimize patient outcomes
• Education: optimize learning outcomes
• Government: optimize citizen services
Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.
Enterprise Big Data Flows
[Figure: Business Transactions & Interactions (CRM, ERP, web, mobile, point of sale) and Business Intelligence & Analytics (dashboards, reports, visualization, …) are linked by classic data integration & ETL. A Big Data Platform sits alongside them, fed by unstructured data, log files, DB data, exhaust data, social media, and sensors and devices.]
1. Capture Big Data: collect data from all sources, structured & unstructured
2. Process: transform, refine, aggregate, analyze, report
3. Distribute Results: interoperate and share data with applications/analytics
4. Feedback: use operational data within the big data platform, preserve data
Data Platform for Big Data
Data Platform Requirements for Big Data
Capture
• Collect data from all sources - structured and unstructured data
• At all speeds: batch, async, streaming, real-time
Process
• Transform, refine, aggregate, analyze, report
Exchange
• Deliver data with enterprise data systems
• Share data with analytic applications and processing
Operate
• Provision, monitor, diagnose, manage at scale
• Reliability, availability, affordability, scalability, interoperability
Across all deployment models: operating systems, virtual platforms, cloud platforms, big data appliances
Apache Hadoop & Big Data Use Cases
[Figure: big data (transactions, interactions, observations) feeds three business cases: Refine, Explore, Enrich.]
Refine
Operational Data Refinery
Hadoop as platform for ETL modernization
Capture
• Capture new unstructured data along with log files, all alongside existing sources
• Retain inputs in raw form for audit and continuity purposes
Process
• Parse the data & cleanse
• Apply structure and definition
• Join datasets together across disparate data sources
Exchange
• Push to existing data warehouse for downstream consumption
• Feeds operational reporting and online systems
[Figure: unstructured data, log files and DB data are uploaded into the refinery (capture and archive, parse & cleanse, structure and join) and the result is pushed to the enterprise data warehouse.]
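The capture, process and exchange steps of the refinery can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Hortonworks code or a Hadoop API: the log format, the customer table, and the names `parse_line` and `refine` are all hypothetical, and a real refinery would run equivalent logic as Pig, Hive or MapReduce jobs over data held in HDFS.

```python
# Hypothetical raw inputs: web log lines (kept unchanged for audit purposes)
# and a small reference table exported from a database. All data is made up.
RAW_LOGS = [
    "2012-10-01T10:00:00 user=42 action=view sku=A1",
    "2012-10-01T10:00:05 user=42 action=buy sku=A1",
    "corrupted line",
    "2012-10-01T10:01:00 user=7 action=view sku=B2",
]
CUSTOMERS = {42: "Alice", 7: "Bob"}  # user id -> customer name


def parse_line(line):
    """Parse one raw line into a dict, or return None if it is malformed."""
    parts = line.split()
    if len(parts) != 4:
        return None
    timestamp, *pairs = parts
    try:
        fields = dict(pair.split("=", 1) for pair in pairs)
        return {
            "timestamp": timestamp,
            "user": int(fields["user"]),
            "action": fields["action"],
            "sku": fields["sku"],
        }
    except (KeyError, ValueError):
        return None


def refine(raw_lines, customers):
    """Parse and cleanse the lines, then join them to the customer table."""
    rows = []
    for line in raw_lines:
        record = parse_line(line)
        if record is None:
            continue  # cleanse: drop lines that cannot be parsed
        record["customer"] = customers.get(record["user"], "unknown")
        rows.append(record)
    return rows


# The refined rows are what would be pushed to the warehouse;
# RAW_LOGS itself is left untouched.
refined = refine(RAW_LOGS, CUSTOMERS)
```

The raw list is never modified, which mirrors the "retain inputs in raw form" bullet: the structured rows can always be rebuilt with different parsing rules.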
Explore
Big Data Exploration & Visualization
Hadoop as agile, ad-hoc data mart
Capture
• Capture multi-structured data and retain inputs in raw form for iterative analysis
Process
• Parse the data into queryable format
• Explore & analyze using Hive, Pig, Mahout and other tools to discover value
• Label data and type information for compatibility and later discovery
• Pre-compute stats, groupings, patterns in data to accelerate analysis
Exchange
• Use visualization tools to facilitate exploration and find key insights
• Optionally move actionable insights into EDW or datamart
[Figure: unstructured data, log files and DB data are uploaded into Hadoop (capture and archive, structure and join, categorize into tables) and explored from visualization tools over JDBC / ODBC, with an optional load into the EDW / datamart.]
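To make "pre-compute stats, groupings, patterns" concrete, here is a minimal Python sketch of a grouped summary that a visualization tool could read instead of rescanning every event. The event fields and the function name `precompute_stats` are assumptions made for illustration; in a Hadoop deployment the same aggregation would more likely be written as a Hive or Pig group-by.

```python
from collections import Counter, defaultdict

# Hypothetical parsed events, standing in for rows of a queryable table.
EVENTS = [
    {"page": "/home", "ms": 120},
    {"page": "/cart", "ms": 300},
    {"page": "/home", "ms": 80},
    {"page": "/home", "ms": 100},
]


def precompute_stats(events):
    """Group events by page and pre-compute hit counts and mean latency."""
    hits = Counter()
    total_ms = defaultdict(int)
    for event in events:
        hits[event["page"]] += 1
        total_ms[event["page"]] += event["ms"]
    return {
        page: {"hits": hits[page], "avg_ms": total_ms[page] / hits[page]}
        for page in hits
    }


# A small summary table, computed once and reused by later queries.
STATS = precompute_stats(EVENTS)
```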
Enrich
Application Enrichment
Deliver Hadoop analysis to online apps
Capture
• Capture data that was once too bulky and unmanageable
Process
• Uncover aggregate characteristics across data
• Use Hive, Pig and MapReduce to identify patterns
• Filter useful data from mass streams (Pig)
• Micro or macro batch oriented schedules
Exchange
• Push results to HBase or other NoSQL alternative for real time delivery
• Use patterns to deliver right content/offer to the right person at the right time
[Figure: unstructured data, log files and DB data are captured, parsed, and derived/filtered on scheduled and near-real-time runs, then served to online applications from a low-latency NoSQL store such as HBase.]
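The enrichment flow above follows the MapReduce pattern: emit key/value pairs, group them by key, then reduce each group. The sketch below imitates those three phases in a single Python process and writes the result to a dictionary that stands in for HBase. It is a teaching sketch, not the Hadoop API; the click data and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical click stream: (user, product category) pairs.
CLICKS = [
    ("u1", "shoes"), ("u2", "books"), ("u1", "shoes"),
    ("u1", "books"), ("u2", "books"),
]


def map_phase(records):
    """Map: emit a ((user, category), 1) pair for every click."""
    for user, category in records:
        yield (user, category), 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def top_category(counts):
    """Keep each user's most-clicked category for low-latency lookup."""
    best = {}
    for (user, category), n in counts.items():
        if user not in best or n > best[user][1]:
            best[user] = (category, n)
    return {user: category for user, (category, _) in best.items()}


counts = reduce_phase(shuffle(map_phase(CLICKS)))
# Plain dict standing in for HBase or another NoSQL serving store.
serving_store = top_category(counts)
```

An online application would then look up `serving_store["u1"]` at request time, while the batch job that produces the store runs on its own schedule.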
Hadoop in Enterprise Data Architectures
[Figure: architecture diagram showing Hadoop between big data sources and the existing business infrastructure.]
• Existing business infrastructure: EDW, ODS & datamarts, applications & spreadsheets, visualization & intelligence, discovery tools, IDE & dev tools, low latency/NoSQL, web applications, operations (custom, existing)
• Hadoop platform components: Templeton, Sqoop, WebHDFS, Flume, HCatalog, Pig, HBase, Hive, Ambari, HA, Oozie, ZooKeeper, MapReduce, HDFS
• Big data sources (transactions, observations, interactions): CRM, ERP, exhaust data, logs, files, financials, social media
• New tech: Datameer, Tableau, Karmasphere, Splunk
Where Does It Fit into Your Business?
Use cases by vertical (columns on the original slide: Refine, Explore, Enrich)
Retail & Web
• Log Analysis/Site Optimization
• Social Network Analysis
• Dynamic Pricing
• Session & Content Optimization
Retail
• Loyalty Program Optimization
• Brand and Sentiment Analysis
• Dynamic Pricing/Targeted Offer
Intelligence
• Threat Identification
• Person of Interest Discovery
• Cross Jurisdiction Queries
Finance
• Risk Modeling & Fraud Identification
• Trade Performance Analytics
• Surveillance and Fraud Detection
• Customer Risk Analysis
• Real-time upsell, cross sales marketing offers
Energy
• Smart Grid: Production Optimization
• Grid Failure Prevention
• Smart Meters
• Individual Power Grid
Manufacturing
• Supply Chain Optimization
• Customer Churn Analysis
• Dynamic Delivery
• Replacement parts
Healthcare & Payer
• Electronic Medical Records (EMPI)
• Clinical Trials Analysis
• Insurance Premium Determination
Hortonworks Vision & Leadership
We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop.
Trusted
• Stewards of core Hadoop
• Original builders and operators of Hadoop
• 100+ years Hadoop development experience
• Managed every viable, stable Hadoop release
• HDP built on Hadoop 1.0
Open
• 100% open platform
• No POS holdback
• Open to the Hadoop community
• Open to the Hadoop ecosystem
• Closely aligned to Hadoop core
Innovative
• Innovating current platform with HCatalog, Ambari, HA
• Innovating future platform with YARN, HA
• Complete vision for Hadoop-based platform
• Enable the Hadoop ecosystem
Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor and manage any size cluster with a familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future-proof your cluster growth
✓ Reduce risks and cost of adoption
✓ Lower the total cost to administer and provision
✓ Integrate with your existing ecosystem
© Third Nature Inc.
“In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.”
Grace Hopper
What’s different today?
We’re not getting more CPU speed, but more CPU cycles.
There are too many CPUs relative to other resources, creating an imbalance in hardware platforms.
We therefore use nodes to aggregate memory, network bandwidth and IOPS.
Most software is designed for a single worker, not high degrees of parallelism, and won’t scale well.
Analytics makes the data volume problem bigger
Many of the processing problems are O(n²) or worse, so moderate data can be a problem for DW architectures
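A quick way to see why O(n²) hurts even at moderate volumes is to count pairwise comparisons, a common shape for similarity and matching analytics. The sketch below is generic arithmetic, not a benchmark of any data warehouse; the function name `pair_count` and the record counts are chosen only for illustration.

```python
from itertools import combinations


def pair_count(n):
    """Number of distinct pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Brute-force check of the formula on a small input.
assert pair_count(5) == len(list(combinations(range(5), 2)))

# Each tenfold increase in records costs roughly a hundredfold more comparisons.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {pair_count(n):>13,} pairs")
```

Going from 1,000 to 100,000 records is a 100x increase in data but takes the comparison count from 499,500 to 4,999,950,000, roughly a 10,000x increase in work.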
A common problem with new projects or unexpected business problems…
“It would be logical to keep all the data in one place.”
“I need that data now.”
“It will take 6 months.”
Welcome to the Hadoop schema!
Why soft / no schema can be good:
• Easier programming
• Easier modeling, since you don’t have to be perfect in advance, and it’s change-resilient
• Join elimination = I/O savings (if no updates)
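The "change-resilient" point can be illustrated with a schema applied at read time rather than at load time. In this minimal Python sketch, the records, field names and the helper `read_with_schema` are all hypothetical; it only shows the idea that stored data stays as-is while each reader projects the fields it needs.

```python
import json

# Hypothetical raw records stored as-is; the second one gained a new field.
RAW = [
    '{"user": 1, "page": "/home"}',
    '{"user": 2, "page": "/cart", "referrer": "search"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: keep the requested fields, default to None."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two readers project different schemas over the same stored data.
old_view = list(read_with_schema(RAW, ["user", "page"]))
new_view = list(read_with_schema(RAW, ["user", "referrer"]))
```

The older reader keeps working after `referrer` appears, and the newer reader sees `None` for records written before the field existed, so no reload or migration is needed.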
Whether to switch from a DB isn’t the right discussion
[Figure labels: “SQL...”, “SQL!”, “SQL?”, “SQL”, “Hadoop”]
Questions for discussion
1. Is scale of data really that much of a problem for most organizations?
2. Hadoop is designed for batch work – how good is it for interactive use? Real-time use cases?
3. How do you define “platform”?
4. ETL modernization is mentioned, but isn’t this a reversion to manual coding?
5. How do you design for long-term use rather than one-off analysis projects?
6. Does open source really matter for this part of the stack?
CC Image Attributions
Thanks to the people who supplied the Creative Commons licensed images used in this presentation:
Phone dump - Richard Barnes
ponies in field.jpg - http://www.flickr.com/photos/bulle_de/352732514/
• This Month: Database
• November: Cloud
• December: Innovators
• January: Big Data
• 2013 Editorial Calendar (www.insideanalysis.com)