Post on 13-Aug-2015
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Introduction to Hadoop
Eric Mizell – Director, Solution Engineering
Hortonworks. We do Hadoop.
Quick Audience Poll
Which best describes how your org is using Hadoop?
A. We’re using Hadoop
B. We’re in the process of getting Hadoop integrated
C. We don’t have Hadoop installed
D. What’s Hadoop?
E. I don’t know
Big Data, Hadoop, and the Modern Data Architecture
Big Data Explosion
Big Data Market Trends & Projections
20%: percentage by which organizations leveraging modern information management systems outperform peers by 2015
1 Zettabyte (ZB) = 1 Billion TBs
15x: growth rate of machine-generated data by 2020
The US has 1/3 of the world’s data
Big Data is 1 of 5 US GDP Game Changers: $325 billion incremental annual GDP from big data analytics in retail and manufacturing by 2020
Existing Siloed Data Architectures Under Pressure
[Diagram: siloed architecture in three layers]
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP (each shown as a separate SILO)
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)
Data growth: New Data Types
OLTP, ERP, CRM Systems
Unstructured docs, emails
Clickstream
Server logs
Social/Web Data
Sensor, Machine Data
Geolocation
85% (Source: IDC)
• Can’t manage new data paradigm
• Constrains data to specific schema
• Siloed data
• Limited scalability
• Economically unfeasible
• Limited analytics
Hadoop is Driving the New Data-driven Era of IT
1st Era
Real-time Data Driven
RDBMS
2nd Era 3rd Era
Automation + Efficiency Processing Power
Mainframe
Goal Data Technology
Key Drivers of Hadoop
A. Unlock New Approach to Analytics
• Agile analytics via “Schema on Read” with ability to store all data in native format
• Create new apps from new types of data
B. Optimize Investments, Cut Costs
• Focus EDW on high value workloads
• Use commodity servers & storage to enable all data (original and historical) to be accessible for ongoing exploration
C. Enable a Modern Data Architecture
• Integrate new & existing data sets
• Make all data available for shared access and processing in multitenant infrastructure
• Batch, interactive & real-time use cases
• Integrated with existing tools & skills
[Diagram: Modern Data Architecture]
OPERATIONS TOOLS: Provision, Manage & Monitor
DEV & DATA TOOLS: Build & Test
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP); YARN: Data Operating System (Interactive, Real-Time, Batch); HDFS: Hadoop Distributed File System
SOURCES: EXISTING Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
Shift to Data-driven Means Treating Data like Capital
Hadoop enables organizations to cost-effectively store and use all of the data available in a way that shifts the business from Reactive to Proactive:
• A shift in Advertising: from mass branding to 1x1 targeting
• A shift in Financial Services: from educated investing to automated algorithms
• A shift in Healthcare: from mass treatment to designer medicine
• A shift in Retail: from static branding to real-time personalization
• A shift in Manufacturing: from break then fix to repair before break
Enterprise Goals for the Modern Data Architecture
✓ Centrally manage new and existing data
✓ Data needs flexibility and lands in Hadoop without schema
✓ Prepare data with no predetermined questions
✓ User self-service: no limit to questions
✓ Run batch, interactive & real-time analytic applications on shared datasets
✓ Leverage new and existing data center infrastructure investments
✓ Scalable and affordable; low cost per TB
[Diagram: Modern Data Architecture]
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP; YARN: Data Operating System (Interactive, Real-Time, Batch) on HDFS (Hadoop Distributed File System), nodes 1 to N
SOURCES: EXISTING Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
YARN and HDP Enable the Modern Data Architecture
YARN is the architectural center of Hadoop and HDP
• YARN enables a common data set across all applications
• Batch, interactive & real-time workloads
• Support multi-tenant access & processing
HDP enables Apache Hadoop to become an Enterprise Viable Data Platform with centralized services
• Security
• Governance
• Operations
• Productization
Enabled broad ecosystem adoption
Hortonworks drove this innovation of Hadoop through YARN
Hortonworks Data Platform 2.2
[Diagram: HDP 2.2 components]
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System):
• Script: Pig (Tez)
• SQL: Hive (Tez)
• Java, Scala: Cascading (Tez)
• Stream: Storm
• Search: Solr
• NoSQL: HBase, Accumulo (Slider)
• In-Memory: Spark
• Others: ISV Engines
GOVERNANCE: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
SECURITY: Authentication, Authorization, Audit, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox)
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
Deployment Choice: Linux, Windows, On-Premises, Cloud
Modern Data Architecture
[Diagram: Modern Data Architecture with partner integrations]
OPERATIONAL TOOLS
DEV & DATA TOOLS
INFRASTRUCTURE
APPLICATIONS: BusinessObjects BI
DATA SYSTEM: RDBMS, EDW, HANA; YARN: Data Operating System (Interactive, Real-Time, Batch) on HDFS (Hadoop Distributed File System), nodes 1 to N
SOURCES: EXISTING Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
Deep Partnerships: Hortonworks engages in deep engineered relationships with the leaders in the data center, such as Microsoft, HP, Teradata, SAS, SAP & Red Hat
Broad Partnerships: Over 600 partners work with us to certify their applications to work with Hadoop so they can extend big data to their users
Hadoop unlocks a new approach: Iterative Analytics
Current Reality: Apply schema on write
• Dependent on IT
• Determine list of questions
• Design solutions
• Collect structured data
• Ask questions from list
• Detect additional questions
• Repeatable Process: SQL Only
✚ Augment w/ Hadoop: Apply schema on read
• Support range of access patterns to data stored in HDFS: polymorphic access
• HADOOP: Iterate over structure; Transform and Analyze
• Right Engine, Right Job: batch, interactive, real-time, in-memory
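To make the schema-on-write versus schema-on-read contrast on this slide concrete, here is a minimal Python sketch. It is an illustration only, not Hadoop or Hive code: the sample records, the field names, and the `read_with_schema` helper are all hypothetical. The raw lines are stored untouched, and each question applies its own schema only at read time.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema enforced on write).
RAW_LINES = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart", "ms": 340, "coupon": "X1"}',
    '{"user": "a", "sensor": "t1", "temp_c": 21.5}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: keep records that have every field, cast each value."""
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}

# Two different questions, two different schemas, same stored data.
clickstream = list(read_with_schema(RAW_LINES, {"user": str, "page": str, "ms": int}))
sensors = list(read_with_schema(RAW_LINES, {"sensor": str, "temp_c": float}))

print(clickstream)
print(sensors)
```

With schema on write, the sensor record would have been rejected or forced into the clickstream table at load time. Here it simply stays in the raw store until a question needs it.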
Hadoop delivers compelling economics
EDW Optimization
Current Reality: OPERATIONS 50%, ANALYTICS 20%, ETL PROCESS 30%
• EDW at capacity: some usage from low value workloads
• Older data archived, unavailable for ongoing exploration
• Source data often discarded
✚ Augment w/ Hadoop: OPERATIONS 50%, ANALYTICS 50%
• Free up EDW resources from low value tasks
• Keep 100% of source data and historical data for ongoing exploration
• Mine data for value after loading it because of schema-on-read
Hadoop: Parse, Cleanse, Apply Structure, Transform
Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure
[Chart: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), axis from $0 to $180,000; categories: MPP, SAN, Engineered System, NAS, HADOOP, Cloud Storage; Commodity Compute & Storage]
Try Hadoop Today
Download the Hortonworks Sandbox http://hortonworks.com/products/hortonworks-sandbox/
Learn Hadoop
Build a Proof of Concept
Test New Functionality
© Hortonworks Inc. 2013
5 Reasons Hadoop is Kicking Cans and Taking Names
Hadoop’s momentum is unstoppable as its open source roots grow wildly into enterprises. Its refreshingly unique approach to data management is transforming how companies store, process, analyze, and share big data.
Forrester believes that Hadoop will become must-have infrastructure for large enterprises.
Here are five reasons firms should adopt Hadoop today:
1. Build a data lake with the Hadoop file system (HDFS)
2. Enjoy cheap, quick processing with MapReduce
3. Data scientists can wrangle big data faster
4. Even the POC can make you money
5. The future of Hadoop is real-time and transactional
http://blogs.forrester.com/mike_gualtieri/13-10-22-5_reasons_hadoop_is_kicking_can_and_taking_names
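The MapReduce model named in the second reason above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not the Hadoop MapReduce API; the function names and the two input lines are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data"]  # hypothetical input
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data between them; the sketch only shows the shape of the computation.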