Hadoop past, present and future
Transcript of Hadoop past, present and future
© Hortonworks Inc. 2013
Hadoop: Past, Present and Future. Chris Harris (email: [email protected], Twitter: cj_harris5)
Past
A little history… it’s 2005
A Brief History of Apache Hadoop
Timeline, 2004-2013:
- 2005: Yahoo! creates a team under E14 to work on Hadoop
- Apache project established
- Yahoo! begins to operate at scale
- Hortonworks Data Platform
- 2013: Enterprise Hadoop
Key Hadoop Data Types
1. Sentiment Understand how your customers feel about your brand and products – right now
2. Clickstream Capture and analyze website visitors’ data trails and optimize your website
3. Sensor/Machine Discover patterns in data streaming automatically from remote sensors and machines
4. Geographic Analyze location-based data to manage operations where they occur
5. Server Logs Research logs to diagnose process failures and prevent security breaches
6. Unstructured (txt, video, pictures, etc.) Understand patterns in files across millions of web pages, emails, and documents
Hadoop is NOT
- ESB
- NoSQL
- HPC
- Relational
- Real-time
- The "Jack of all Trades"
Hadoop 1
- Limited to about 4,000 nodes per cluster
- Scalability bounded by O(number of tasks in the cluster)
- JobTracker bottleneck: resource management, job scheduling, and monitoring all in one daemon
- Only one namespace for managing HDFS
- Map and Reduce slots are static
- MapReduce is the only kind of job that can run
Hadoop 1 - Basics
[Diagram: HDFS blocks A, B and C, each replicated across several DataNodes]
- MapReduce (computation framework)
- HDFS (storage framework)
Hadoop 1 - Reading Files
[Diagram: Hadoop client, NameNode, Secondary NameNode, and DataNode/TaskTracker (DN | TT) pairs across Rack1..RackN]
- The client asks the NameNode (which serves metadata from its fsimage/edit log) to read a file
- The NameNode returns the DataNodes, block ids, etc. for that file
- The client then reads the blocks directly from the DataNodes
- DataNodes send heartbeats and block reports to the NameNode
- The Secondary NameNode periodically checkpoints the fsimage/edit log
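The read path on this slide can be sketched in miniature. This is an illustrative toy model, not the real HDFS client or NameNode code; every class and method name here is invented for the example:

```python
# Toy sketch of the Hadoop 1 HDFS read path: the NameNode serves only
# metadata (file -> block locations); the client reads block data
# directly from DataNodes. Names are illustrative, not Hadoop's API.

class NameNode:
    def __init__(self):
        # file name -> list of (block_id, [DataNodes holding a replica])
        self.metadata = {}

    def add_file(self, name, blocks):
        self.metadata[name] = blocks

    def get_block_locations(self, name):
        # Returns DataNodes and block ids; no file data flows through here.
        return self.metadata[name]

class DataNode:
    def __init__(self, host):
        self.host = host
        self.blocks = {}  # block_id -> bytes

def read_file(namenode, datanodes, name):
    data = b""
    for block_id, hosts in namenode.get_block_locations(name):
        # Read each block from the first listed replica.
        data += datanodes[hosts[0]].blocks[block_id]
    return data
```

The key property the sketch preserves is that the NameNode is never on the data path, only on the metadata path.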
Hadoop 1 - Writing Files
[Diagram: Hadoop client, NameNode, Secondary NameNode, and DN | TT pairs across Rack1..RackN]
- The client requests a write from the NameNode (fsimage/edit)
- The NameNode returns the target DataNodes, etc.
- The client writes blocks to the first DataNode; replication is pipelined from DataNode to DataNode
- DataNodes send block reports; the Secondary NameNode checkpoints
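The "replication pipelining" step is the interesting part of the write path: the client ships each block to only the first DataNode, and the nodes forward it down the chain. A hedged toy sketch (invented names, not Hadoop's implementation):

```python
# Toy model of HDFS replication pipelining: the client contacts only
# the head of the pipeline; each DataNode persists the block and then
# forwards it to the next node, and acks flow back up the chain.

def pipeline_write(block_id, data, pipeline, stores):
    """pipeline: ordered DataNode names; stores: node name -> {block_id: bytes}."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    stores[head][block_id] = data  # this DataNode persists the block...
    # ...then forwards the same data to the next node in the pipeline
    return [head] + pipeline_write(block_id, data, rest, stores)
```

The point of the design is that the client's outbound bandwidth is spent once per block, not once per replica.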
Hadoop 1 - Running Jobs
[Diagram: Hadoop client, JobTracker, and DN | TT pairs across Rack1..RackN]
- The client submits a job to the JobTracker
- The JobTracker deploys the job to TaskTrackers
- Map tasks run, their output is shuffled to reduce tasks, and the reducers write the results (part 0, ...)
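The job flow above is the classic map, shuffle, reduce model. As a minimal illustration (pure Python, not the Hadoop API), word count looks like:

```python
# Minimal pure-Python sketch of the MapReduce model the JobTracker
# coordinates: map tasks emit (key, value) pairs, the shuffle groups
# them by key, and reduce tasks aggregate each group. Illustrative
# only; real jobs use the org.apache.hadoop.mapreduce Java API.
from collections import defaultdict

def map_phase(records):
    for line in records:
        for word in line.split():
            yield word, 1          # emit (word, 1) for every occurrence

def shuffle(pairs):
    groups = defaultdict(list)     # group all values by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b a"])))
```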
Hadoop 1 - Security
[Diagram: users pass through a firewall to a client node/spoke server, which talks to LDAP/AD and a KDC for authN/authZ before issuing service requests to the Hadoop cluster]
- Block token: used for accessing data
- Delegation token: used for running jobs
- Encryption available via plugin
Hadoop 1 - APIs
- org.apache.hadoop.mapreduce.Partitioner
- org.apache.hadoop.mapreduce.Mapper
- org.apache.hadoop.mapreduce.Reducer
- org.apache.hadoop.mapreduce.Job
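These are the core classes of the Java mapreduce API. As one concrete piece of the contract, a Partitioner maps each key to a reducer; the default HashPartitioner-style behaviour can be sketched like this (a Python stand-in, with CRC32 substituting for Java's key.hashCode()):

```python
# Sketch of HashPartitioner semantics: the partition for a key is
# hash(key) mod numReduceTasks, so the same key always lands on the
# same reducer. CRC32 stands in for Java's hashCode(); this is an
# illustration, not the Hadoop implementation.
import zlib

def hash_partition(key: str, num_reduce_tasks: int) -> int:
    return zlib.crc32(key.encode()) % num_reduce_tasks
```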
Present
Hadoop 2
- Potentially up to 10,000 nodes per cluster
- Scalability bounded by O(cluster size) rather than O(number of tasks)
- Supports multiple namespaces for managing HDFS (federation)
- Efficient cluster utilization through YARN
- MRv1 backward- and forward-compatible
- Any application can integrate with Hadoop, beyond MapReduce and beyond Java
Hadoop 2 - Basics
Hadoop 2 - Reading Files (w/ NN Federation)
[Diagram: Hadoop client, federated NameNodes NN1/ns1 .. NN4/ns4, and DataNode/NodeManager (DN | NM) pairs across Rack1..RackN]
- Each NameNode manages its own namespace (ns1..ns4) and serves reads from its fsimage/edit copy
- The client asks a NameNode to read a file; it returns the DataNodes, block ids, etc., and the client reads the blocks directly
- DataNodes register and send heartbeats/block reports
- Each NameNode has its own Secondary NameNode (checkpoint) or a Backup NameNode (fs sync)
- Block pools: each namespace owns its own set of blocks on the shared DataNodes (e.g. ns1 → dn1, dn2; ns2 → dn1, dn3; ns3 → dn4, dn5; ns4 → dn4, dn5)
Hadoop 2 - Writing Files
[Diagram: Hadoop client, federated NameNodes NN1/ns1 .. NN4/ns4, and DN | NM pairs across Rack1..RackN]
- The client requests a write from a NameNode (fsimage/edit copy), which returns the target DataNodes, etc.
- The client writes blocks with replication pipelining across DataNodes
- DataNodes send block reports; each NameNode has a Secondary NameNode (checkpoint) or a Backup NameNode (fs sync)
Hadoop 2 - Running Jobs
[Diagram: two Hadoop clients, the ResourceManager, and NodeManagers across Rack1..RackN]
- Clients create and submit applications (app1, app2) to the ResourceManager
- The ResourceManager's ApplicationsManager (ASM) and Scheduler partition cluster resources into queues and containers
- Each application gets its own ApplicationMaster (AM1, AM2), which negotiates containers (C1.1..C1.4, C2.1..C2.3) with the Scheduler
- NodeManagers launch the containers and send status reports to the ResourceManager
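The allocation flow above can be caricatured in a few lines. This is a deliberately simplified toy model with invented names, not YARN's actual ResourceManager logic:

```python
# Toy model of YARN-style scheduling: the ResourceManager tracks free
# container slots per NodeManager and hands out containers to each
# submitted application; the first container of an app hosts its
# ApplicationMaster. Names and policy are illustrative only.

class ResourceManager:
    def __init__(self, node_capacity):
        self.free = dict(node_capacity)   # NodeManager -> free container slots
        self.apps = {}                    # app id -> list of (node, container id)

    def submit(self, app_id, num_containers):
        allocated = []
        for node in sorted(self.free):    # naive first-fit placement
            while self.free[node] > 0 and len(allocated) < num_containers:
                self.free[node] -= 1
                allocated.append((node, f"{app_id}.C{len(allocated) + 1}"))
        self.apps[app_id] = allocated     # allocated[0] runs the AM
        return allocated
```

A real scheduler adds queues, locality preferences, and incremental negotiation via the ApplicationMaster; the sketch only shows the division of labour.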
Hadoop 2 - Security
[Diagram: JDBC clients, REST clients, and browsers (HUE) pass through a firewall into a DMZ hosting the Knox Gateway cluster, and through a second firewall to the Hadoop cluster]
- Knox integrates with LDAP/AD, a KDC, and an enterprise/cloud SSO provider
- Native Hive/HBase encryption inside the cluster
Hadoop 2 - APIs
- org.apache.hadoop.yarn.api.ApplicationClientProtocol
- org.apache.hadoop.yarn.api.ApplicationMasterProtocol
- org.apache.hadoop.yarn.api.ContainerManagementProtocol
Future
Apache Tez: A New Hadoop Data Processing Framework
HDP: Enterprise Hadoop Distribution
Hortonworks Data Platform (HDP): Enterprise Hadoop
- The only 100% open source and complete distribution
- Enterprise grade, proven and tested at scale
- Ecosystem endorsed to ensure interoperability

HDP layers:
- PLATFORM SERVICES: enterprise readiness (high availability, disaster recovery, rolling upgrades, security, and snapshots)
- DATA SERVICES: Hive & HCatalog, Pig, HBase; load & extract via Sqoop, Flume, NFS, and WebHDFS; Knox*
- OPERATIONAL SERVICES: Oozie, Ambari, Falcon*
- HADOOP CORE: HDFS, MapReduce, YARN*, Tez*, other*
Tez (“Speed”)
- What is it?
  - A data processing framework built as an alternative to MapReduce
  - A new incubation project at the ASF
- Who else is involved?
  - 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
- Why does it matter?
  - Widens the platform for Hadoop use cases
  - Crucial to improving the performance of low-latency applications
  - Core to the Stinger initiative
  - Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop
Moving Hadoop Beyond MapReduce
- Low-level data-processing execution engine, built on YARN
- Enables pipelining of jobs
- Removes task and job launch times
- Does not write intermediate output to HDFS, so disk and network usage are much lighter
- New base for MapReduce, Hive, Pig, Cascading, etc.
- Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
Tez - Core Idea
- A Tez task is a triple of pluggable parts: Task = <Input, Processor, Output>
- A YARN ApplicationMaster runs a DAG of Tez tasks
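A minimal sketch of that abstraction, assuming nothing about the real org.apache.tez classes (all names here are illustrative):

```python
# Toy version of the Tez core idea: a task is a pluggable triple of
# Input, Processor, and Output; swapping any part yields a different
# building block. Invented names, not the Tez API.

class TezTask:
    def __init__(self, read_input, processor, write_output):
        self.read_input = read_input      # Input: where records come from
        self.processor = processor        # Processor: the computation
        self.write_output = write_output  # Output: where results go

    def run(self):
        return self.write_output(self.processor(self.read_input()))

# Example wiring: an HDFS-like input, a map processor, a sorted output.
task = TezTask(
    read_input=lambda: ["b", "a", "c"],
    processor=lambda rows: [r.upper() for r in rows],
    write_output=sorted,
)
result = task.run()
```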
Building Blocks for Tasks
- MapReduce 'Map' task: HDFS Input → Map Processor → Sorted Output
- MapReduce 'Reduce' task: Shuffle Input → Reduce Processor → HDFS Output
- Intermediate 'Reduce' for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
- Special Pig/Hive 'Map': HDFS Input → Map Processor → Pipeline Sorter Output
- Special Pig/Hive 'Reduce': Shuffle Skip-merge Input → Reduce Processor → Sorted Output
- In-memory Map: HDFS Input → Map Processor → In-memory Sorted Output
Pig/Hive on MapReduce versus Pig/Hive on Tez

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

- Pig/Hive on MapReduce: the query runs as three chained jobs (Job 1, Job 2, Job 3), with an I/O synchronization barrier (intermediate results written to HDFS) between each pair of jobs
- Pig/Hive on Tez: the same query runs as a single job
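A back-of-the-envelope way to see the difference: count the HDFS materialization barriers each engine needs for an n-stage plan. This is an illustrative model only, not a measurement:

```python
# Why the single Tez DAG wins: a chain of n MapReduce jobs hits an
# I/O synchronization barrier (intermediate results written to and
# re-read from HDFS) at every job boundary, while one Tez DAG streams
# between stages. Illustrative accounting only.

def hdfs_barriers(num_stages, engine):
    if engine == "mr":
        return num_stages - 1   # every job boundary materializes to HDFS
    if engine == "tez":
        return 0                # one DAG; intermediates stay off HDFS
    raise ValueError(engine)

# The query above needs 3 stages (join, join, group-by) as MR jobs:
mr_barriers = hdfs_barriers(3, "mr")
tez_barriers = hdfs_barriers(3, "tez")
```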
Tez on YARN: Going Beyond Batch
- Tez optimizes execution: a new runtime engine for more efficient data processing
- Always-on Tez service: low-latency processing for all Hadoop data processing
Apache Knox: Secure Access to Hadoop
Knox Initiative: Make Hadoop Security Simple
- Simplify Security: simplify security for both users and operators; provide seamless access for users while securing the cluster at the perimeter, shielding the intricacies of the security implementation
- Aggregate Access: deliver unified and centralized access to the Hadoop cluster; make Hadoop feel like a single application to users
- Client Agility: ensure service users are abstracted from where services are located and how services are configured and scaled
Knox: Make Hadoop Security Simple
[Diagram: clients send {REST} requests to the Knox Gateway, which performs authentication and verification against a user store (KDC, AD, LDAP) before forwarding them to the Hadoop cluster]
Knox: Next Generation of Hadoop Security
- All users see one end-point website
- All online systems see one end-point RESTful service
- Consistency across all interfaces and capabilities
- Firewalled cluster that no end users need to access
- More IT-friendly: enables systems admins, DB admins, security admins, and network admins
[Diagram: end users, online apps, and analytics tools reach the gateway through a firewall; a second firewall isolates the Hadoop cluster behind it]
Apache Falcon: Data Lifecycle Management for Hadoop
Data Lifecycle on Hadoop is Challenging
- Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management
- Tools used today: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs
Problem: a patchwork of tools complicates data lifecycle management. Result: long development cycles and quality challenges.
Falcon: One-stop Shop for Data Lifecycle
Apache Falcon provides the data management needs (data processing, replication, retention, scheduling, reprocessing, multi-cluster management) and orchestrates the underlying tools (Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs).
Falcon provides a single interface to orchestrate the data lifecycle, so sophisticated DLM is easily added to Hadoop applications.
Falcon At A Glance
> Falcon provides the key services data processing applications need: data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.
> Applications drive the Falcon data lifecycle management service through spec files or REST APIs.
> Complex data processing logic is handled by Falcon instead of hard-coded in apps.
> Faster development and higher quality for ETL, reporting, and other data processing apps on Hadoop.
Falcon Core Capabilities
- Core functionality: pipeline processing, replication, retention, late data handling
- Automates: scheduling and retry; recording audit, lineage, and metrics
- Operations and management: monitoring, management, metering; alerts and notifications; multi-cluster federation
- CLI and REST API
Falcon Example: Multi-Cluster Failover
> Falcon manages workflow, replication, or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.
[Diagram: the primary Hadoop cluster holds staged, cleansed, conformed, and presented data; replication copies the staged and presented data to a failover cluster, which serves BI and analytics]
Falcon Example: Retention Policies
> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or data re-processing.
Example policies:
- Staged data: retain 5 years
- Cleansed data: retain 3 years
- Conformed data: retain 3 years
- Presented data: retain last copy only
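Those policies amount to a simple rule per dataset. A hedged sketch of the evaluation logic (illustrative only, not Falcon's implementation or API):

```python
# Toy retention evaluator: given dataset instances and their ages,
# decide what a policy like "retain 5 years" or "retain last copy
# only" would keep. Invented function, not Falcon's engine.

def apply_retention(instances, retain_years=None, retain_last_only=False):
    """instances: list of (name, age_in_years), oldest first."""
    if retain_last_only:
        return instances[-1:] if instances else []
    return [(n, age) for n, age in instances if age <= retain_years]

staged = [("2007-run", 6.0), ("2010-run", 3.0), ("2013-run", 0.1)]
kept_staged = apply_retention(staged, retain_years=5)            # 5-year policy
kept_presented = apply_retention(staged, retain_last_only=True)  # last copy only
```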
Falcon Example: Late Data Handling
> Processing waits until all data is available.
> Developers don't write complex data handling rules within applications.
Example: online transaction data (pulled via Sqoop) and web log data (pushed via FTP) land in a staging area and are merged into a combined dataset; processing waits up to 4 hours for the FTP data to arrive.
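The rule in this example is a readiness check with a grace window. A toy sketch with an injected clock so the logic is testable (invented function, not Falcon's API):

```python
# Toy late-data handler: combine the two feeds as soon as both are
# present, but give the pushed (FTP) feed a grace window; past the
# cutoff, the run is marked timed-out. Illustrative logic only.

def combine_when_ready(sqoop_ready, ftp_arrival_hour, now_hour, cutoff_hours=4):
    """Return 'combined', 'waiting', or 'timed-out' for this scheduling tick."""
    if sqoop_ready and ftp_arrival_hour is not None and ftp_arrival_hour <= now_hour:
        return "combined"
    if now_hour < cutoff_hours:
        return "waiting"      # still inside the grace window
    return "timed-out"        # late-data window expired
```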
Multi Cluster Management with Prism
> Prism is the part of Falcon that handles multi-cluster operation.
> Key use cases: replication and data processing that spans clusters.
Hortonworks Sandbox: Go from Zero to Big Data in 15 minutes
Sandbox: A Guided Tour of HDP
- Tutorials and videos give a guided tour of HDP and Hadoop
- Perfect for beginners or anyone learning more about Hadoop
- Installs easily on your laptop or desktop
Browse and manage HDFS files
Easily import data and create tables
Easy-to-use editors for Apache Pig and Hive
Latest tutorials pushed directly to your Sandbox
THANK YOU! Chris Harris
Download Sandbox
hortonworks.com/sandbox