Introduction to Hadoop
![Page 1: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/1.jpg)
Hadoop – Taming Big Data
Jax ArcSig, June 2012
Ovidiu Dimulescu
![Page 2: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/2.jpg)
About @odimulescu
• Working on the Web since 1997
• Likes stuff well done
• Into engineering cultures and all-around automation
• Speaker at local user groups
• Organizer for the local Mobile User Group, jaxmug.com
![Page 3: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/3.jpg)
Agenda
• Introduction
• Use cases
• Architecture
• MapReduce Examples
• Q&A
![Page 4: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/4.jpg)
What is Hadoop?
• Apache Hadoop is an open source Java software framework for running data-intensive applications on large clusters of commodity hardware
• Created by Doug Cutting (Lucene & Nutch creator)
• Named after Doug’s son’s toy elephant
![Page 5: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/5.jpg)
What is Hadoop solving, and how?
• Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simpler programming model
![Page 6: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/6.jpg)
Why does it matter?
• Volume, Velocity, Variety and Value
• Datasets do not fit on local HDDs let alone RAM
• Data grows at tremendous pace
• Data is heterogeneous
• Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.)
• Scaling up has a ceiling (physical, technical, etc.)
![Page 7: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/7.jpg)
Why does it matter?
Data types (chart): roughly 80% complex data, 20% structured data
• Complex data: images, video, logs, documents, call records, sensor data, mail archives
• Structured data: user profiles, CRM, HR records
* Chart source: IDC White Paper
![Page 8: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/8.jpg)
Why does it matter?
• Need to process a 10TB dataset
• Assume sustained transfer of 75MB/s
• On 1 node - scanning data ~ 2 days
• On 10 node cluster - scanning data ~ 5 hrs
• Low $/TB for commodity drives
• Low-end servers are multicore capable
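The scan-time estimate above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes), perfectly even parallel reads, and no coordination overhead; the function name `scan_hours` is made up for illustration. Under those assumptions the raw numbers come out near 37 hours for one node and under 4 hours for ten, so the slide's "~ 2 days" and "~ 5 hrs" presumably round up to leave room for real-world overhead.

```python
def scan_hours(dataset_bytes: float, mb_per_sec: float, nodes: int = 1) -> float:
    """Hours needed to read the dataset once, split evenly across `nodes`."""
    bytes_per_sec = mb_per_sec * 1_000_000 * nodes
    return dataset_bytes / bytes_per_sec / 3600


TEN_TB = 10 * 10**12  # assumption: decimal terabytes

one_node = scan_hours(TEN_TB, 75)        # single machine
ten_nodes = scan_hours(TEN_TB, 75, 10)   # ten machines reading in parallel

print(f"1 node: {one_node:.1f} h, 10 nodes: {ten_nodes:.1f} h")
```

The point of the slide survives the rounding: scan time shrinks in proportion to the number of disks reading in parallel.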
![Page 9: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/9.jpg)
Use Cases
• ETL - Extract, Transform, Load
• Recommendation Engines
• Customer Churn Analysis
• Ad Targeting
• Data “sandbox”
![Page 10: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/10.jpg)
Use Cases - Typical ETL
[Diagram components: Live DB, Logs, ETL 1, ETL 2, Reporting DB, Data Warehouse, BI Applications]
![Page 11: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/11.jpg)
Use Cases - Hadoop ETL
[Diagram components: Live DB, Logs, Data Loading (two steps), Hadoop, Reporting DB, Data Warehouse, BI Applications]
![Page 12: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/12.jpg)
Use Cases – Analysis methods
• Pattern recognition
• Index building
• Text mining
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Graph creation and traversal
![Page 13: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/13.jpg)
Who uses it?
![Page 14: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/14.jpg)
Who supports it?
![Page 15: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/15.jpg)
Why use Hadoop?
• Makes it practical to do things that previously were not
  ✓ Shorter execution time
  ✓ Costs less
  ✓ Simpler programming model
• Open system with greater flexibility
• Large and growing ecosystem
![Page 16: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/16.jpg)
Hadoop – Silver bullet?
• Not a database replacement
• Not a data warehouse (complements it)
• Not for interactive reporting
• Not a general purpose storage mechanism
• Not for problems that are not parallelizable in a shared-nothing fashion
![Page 17: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/17.jpg)
Architecture – Design Axioms
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Should Move to Data
• Simple Core, Modular and Extensible
![Page 18: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/18.jpg)
Architecture – Core Components
HDFS
Distributed filesystem designed for low cost storage and high bandwidth access across the cluster.
MapReduce
Programming model for processing and generating large data sets.
![Page 19: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/19.jpg)
Architecture – Official Extensions
• Storage: HDFS, HBase
• Data Processing: MapReduce Framework
• Management: ZooKeeper, Chukwa
• Data Access: Pig (Data Flow), Hive (SQL), Avro
![Page 20: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/20.jpg)
Architecture – CDH Distribution
1. CDH – Cloudera’s Distribution of Hadoop
2. Image credit - Cloudera presentation @ Microstrategy World 2011
![Page 21: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/21.jpg)
HDFS - Design
• Based on Google’s GFS
• Files are stored as blocks (64MB default size)
• Configurable data replication (3x, Rack Aware)
• Fault Tolerant, Expects HW failures
• HUGE files, Expects Streaming not Low Latency
• Mostly WORM (write once, read many)
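To make the block and replication figures concrete, here is a small arithmetic sketch. It assumes the 64MB block size and 3x replication quoted above and that a file's last block only occupies the bytes it actually holds; the helper names `block_count` and `raw_storage_bytes` are invented for this example and are not part of any Hadoop API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB default block size, as quoted on the slide
REPLICATION = 3                # 3x replication, as quoted on the slide


def block_count(file_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file is split into (ceiling division)."""
    return -(-file_bytes // block_size)


def raw_storage_bytes(file_bytes: int, replication: int = REPLICATION) -> int:
    """Cluster-wide bytes consumed once every block has been replicated."""
    return file_bytes * replication


one_gib = 1024**3
print(block_count(one_gib), "blocks,", block_count(one_gib) * REPLICATION, "block replicas")
print(raw_storage_bytes(one_gib) // 1024**2, "MB of raw disk for a 1024 MB file")
```

Large blocks keep the amount of metadata small relative to the data, which fits the "HUGE files, streaming access" design point.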
![Page 22: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/22.jpg)
HDFS - Architecture
Namenode (NN) - Master
• Filesystem metadata
• Controls read/write to files
• Manages block replication
• Applies transaction log on startup
Datanodes (1 to N) - Slaves
• Read / write blocks to/from clients
• Replicate blocks at master’s request
Client read flow
1. Client asks NN for file
2. NN returns DNs that host it
3. Client asks DN for data
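The read flow on this slide can be modeled with a toy in-memory simulation. Everything here (`NameNode`, `DataNode`, `read_file`, the block IDs and paths) is a hypothetical stand-in written for illustration and is not the real HDFS client API. The sketch only shows the division of labor: the NameNode answers with locations, and the bytes themselves come from DataNodes.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and which nodes host them."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [datanode names])

    def add_file(self, path, block_locations):
        self.block_map[path] = block_locations

    def locate(self, path):
        return self.block_map[path]


class DataNode:
    """Stores block contents and serves them directly to clients."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """Client side: ask the NameNode where blocks live, then fetch from DataNodes."""
    chunks = []
    for block_id, hosts in namenode.locate(path):
        chunks.append(datanodes[hosts[0]].read(block_id))  # first listed replica
    return b"".join(chunks)


datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
datanodes["dn1"].store("blk_1", b"hello ")
datanodes["dn2"].store("blk_1", b"hello ")
datanodes["dn2"].store("blk_2", b"world")
datanodes["dn3"].store("blk_2", b"world")

namenode = NameNode()
namenode.add_file("/logs/a.txt", [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])])

content = read_file("/logs/a.txt", namenode, datanodes)
print(content)
```

Keeping file data off the master is what lets read bandwidth grow with the number of DataNodes.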
![Page 23: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/23.jpg)
HDFS – Fault tolerance
• DataNode
  ▪ Uses CRC checksums to detect corruption
  ▪ Data is replicated on other nodes (3x)
• NameNode
  ▪ Checkpoint NameNode
  ▪ Backup NameNode
  ▪ Failover is manual
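The DataNode side of this slide combines two ideas: a checksum detects a damaged block, and replication supplies a healthy copy. The sketch below illustrates that combination with Python's standard `zlib.crc32`. It is an assumption-laden toy, not HDFS's actual checksum layout, and the function names are invented for the example.

```python
import zlib


def store_block(data: bytes):
    """Return the block together with the CRC32 computed at write time."""
    return data, zlib.crc32(data)


def verify_block(data: bytes, checksum: int) -> bool:
    """A block is considered healthy if its CRC32 still matches."""
    return zlib.crc32(data) == checksum


def read_with_failover(replicas):
    """Try each replica in turn and return the first one that passes verification."""
    for data, checksum in replicas:
        if verify_block(data, checksum):
            return data
    raise IOError("all replicas failed checksum verification")


good_data, crc = store_block(b"block contents")
corrupted = (b"block c0ntents", crc)  # simulated bit rot on one node
replicas = [corrupted, (good_data, crc), (good_data, crc)]

recovered = read_with_failover(replicas)
print(recovered)
```

A checksum alone cannot repair anything; it is the extra replicas that turn detection into recovery.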
![Page 24: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/24.jpg)
MapReduce - Design
• Based on Google’s MR paper
• Borrows from functional programming
• Simpler programming model
  ▪ map (in_key, in_value) -> (out_key, intermediate_value) list
  ▪ reduce (out_key, intermediate_value list) -> out_value list
• No user synchronization and coordination
Input -> Map -> Reduce -> Output
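The two signatures above are enough to build a single-process model of the whole pipeline. The following is a minimal sketch, not how Hadoop is implemented: `run_mapreduce`, `wc_map`, and `wc_reduce` are names chosen for illustration, and grouping happens in a dictionary instead of a distributed shuffle.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Input -> Map -> (group by key) -> Reduce -> Output, in one process."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, intermediate in map_fn(in_key, in_value):
            groups[out_key].append(intermediate)

    output = []
    for out_key in sorted(groups):  # reducers see keys in sorted order
        for out_value in reduce_fn(out_key, groups[out_key]):
            output.append((out_key, out_value))
    return output


def wc_map(offset, line):
    """map (in_key, in_value) -> (out_key, intermediate_value) list"""
    for word in line.split():
        yield word.lower(), 1


def wc_reduce(word, counts):
    """reduce (out_key, intermediate_value list) -> out_value list"""
    yield sum(counts)


lines = [(0, "the quick brown fox"), (20, "the lazy dog")]
result = run_mapreduce(lines, wc_map, wc_reduce)
print(result)
```

Note that the user code contains no locks or messaging: the framework owns grouping and scheduling, which is what "no user synchronization and coordination" refers to.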
![Page 25: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/25.jpg)
MapReduce - Architecture
JobTracker (JT) - Master
• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers, data locality aware
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution
TaskTrackers (1 to N) - Slaves
• Run Map and Reduce tasks received from JobTracker
• Manage storage and transmission of intermediate output
Client launches a job through the Jobs API
- Configuration
- Mapper
- Reducer
- Input
- Output
![Page 26: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/26.jpg)
Hadoop - Core Architecture
[Diagram: the JobTracker (Jobs API) and the NameNode (HDFS) act as masters; each worker node 1 to N runs a TaskTracker alongside a DataNode]
Mini OS
• File system
• Scheduler
![Page 27: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/27.jpg)
MapReduce – Head First Style
Source: http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
![Page 28: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/28.jpg)
MapReduce – Mapper Types
One-to-One: map(k, v) = emit (k, transform(v))
Exploder: map(k, v) = foreach p in v: emit (k, p)
Filter: map(k, v) = if cond(v) then emit (k, v)
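The three mapper shapes translate directly into Python generators, where `yield` plays the role of `emit`. This is an illustrative sketch: the default `transform` and `cond` arguments are arbitrary choices made for the example, not something the slides specify.

```python
def one_to_one(k, v, transform=str.upper):
    """Exactly one output record per input record."""
    yield k, transform(v)


def exploder(k, v):
    """One output record per element of the input value."""
    for p in v:
        yield k, p


def filter_mapper(k, v, cond=lambda value: value > 0):
    """Zero or one output record, depending on the predicate."""
    if cond(v):
        yield k, v


print(list(one_to_one("id1", "hadoop")))
print(list(exploder("doc1", ["a", "b", "c"])))
print(list(filter_mapper("t1", -5)), list(filter_mapper("t2", 5)))
```

The only difference between the three is how many records come out per record that goes in: exactly one, many, or at most one.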
![Page 29: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/29.jpg)
MapReduce – Reducer Types
Sum Reducer
reduce(k, vals) =
    sum = 0
    foreach v in vals: sum += v
    emit (k, sum)
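Written as runnable Python, the sum reducer pseudocode looks like the sketch below. The `grouped` dictionary is hypothetical sample data standing in for what the framework would hand to the reducer after grouping values by key.

```python
def sum_reducer(k, vals):
    """Add up every value seen for key k and emit a single (k, total) record."""
    total = 0
    for v in vals:
        total += v
    yield k, total


# Hypothetical grouped input: each key with all of its intermediate values.
grouped = {"error": [1, 1, 1], "warn": [1, 1]}

totals = [record for key in sorted(grouped) for record in sum_reducer(key, grouped[key])]
print(totals)
```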
![Page 30: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/30.jpg)
MapReduce – High level pipeline
[Diagram: records keyed K1 and K2 flowing through the pipeline stages]
![Page 31: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/31.jpg)
MapReduce – Detailed pipeline
Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
![Page 32: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/32.jpg)
MapReduce – Combiner Phase
• Optional
• Runs on mapper nodes after map phase
• “Mini-reduce,” only on local map output
• Used to save bandwidth before sending data to full reducer
• The Reducer can be used as the Combiner if
  1. Output key, values have the same types as input key, values
  2. The operation is commutative and associative (SUM, MAX ok but AVG not)
Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
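A small numeric example shows why SUM is safe to combine and AVG is not. The values below are made up: they represent the outputs for one key produced on two different mapper nodes. The (sum, count) workaround at the end is a common technique added here for illustration; it does not come from the slides.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical values for one key, as produced on two separate mapper nodes.
mapper_outputs = [[1, 2, 3], [10]]
all_values = [v for part in mapper_outputs for v in part]

# SUM: combining locally first gives the same answer as reducing everything at once.
direct_sum = sum(all_values)
combined_sum = sum(sum(part) for part in mapper_outputs)

# AVG: averaging local averages weights each node equally, which is wrong.
true_avg = mean(all_values)
avg_of_avgs = mean([mean(part) for part in mapper_outputs])

# Workaround: have the combiner emit (sum, count) pairs, which do combine safely.
partials = [(sum(part), len(part)) for part in mapper_outputs]
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
fixed_avg = total / count

print(direct_sum, combined_sum)
print(true_avg, avg_of_avgs, fixed_avg)
```

With these values the local averages are 2.0 and 10.0, so averaging them gives 6.0 while the true average of the four values is 4.0.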
![Page 33: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/33.jpg)
Installation
1. Download & configure single-node cluster
   hadoop.apache.org/common/releases.html
2. Download a demo VM
   Cloudera, Hortonworks
3. Use a hosted environment (Amazon’s EMR, Azure)
![Page 34: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/34.jpg)
Installation – Platform Notes
• Production: Linux (official)
• Development: Linux, OS X, Windows via Cygwin, *nix
![Page 35: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/35.jpg)
MapReduce – Client Languages
• Java, any JVM language - native
  hadoop jar jar_path main_class input_path output_path
• C++ - Pipes framework - socket I/O
  hadoop pipes -input path_in -output path_out -program exec_program
• Any language - Streaming - stdin / stdout
  hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
• Pig Latin, Hive HQL, C via JNI
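For the Streaming option, the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout, with the framework sorting by key between the two. The sketch below models that contract as two functions so it can be tried locally. The names `map_lines` and `reduce_lines` are invented for this example, and `sorted()` stands in for the framework's sort step.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper contract: text lines in, 'key<TAB>value' lines out."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer contract: lines arrive sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run; sorted() plays the role of the sort between map and reduce.
mapped = sorted(map_lines(["a b a", "b a"]))
reduced = list(reduce_lines(mapped))
print(reduced)
```

In an actual Streaming job, each function would live in its own script that loops over `sys.stdin` and prints each result, and those scripts would be passed as `map_prog` and `reduce_prog` in the command shown above.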
![Page 36: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/36.jpg)
MapReduce – Client Anatomy
• Main Program (aka Driver)
  ▪ Configures the Job
  ▪ Initiates the Job
• Input Location
• Mapper
• Combiner (optional)
• Reducer
• Output Location
![Page 37: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/37.jpg)
MapReduce – Word Count Example
![Page 38: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/38.jpg)
MapReduce – C# Mapper
![Page 39: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/39.jpg)
MapReduce – C# Reducer
![Page 40: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/40.jpg)
MapReduce – Java Mapper
![Page 41: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/41.jpg)
MapReduce – Java Reducer
![Page 42: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/42.jpg)
MapReduce – JavaScript Mapper
![Page 43: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/43.jpg)
MapReduce – JavaScript Reducer
![Page 44: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/44.jpg)
Summary
Hadoop is an economical, scalable, distributed data processing system which enables data:
✓ Consolidation (Structured or Not)
✓ Query Flexibility (Any Language)
✓ Agility (Evolving Schemas)
![Page 45: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/45.jpg)
Questions?
![Page 46: Introduction to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022061223/54c6bab94a79599e578b456b/html5/thumbnails/46.jpg)
References
• Hadoop at Yahoo!, by Y! Developer Network
• MapReduce in Simple Terms, by Saliya Ekanayake
• Hadoop Architecture, by Phillipe Julio
• 10 Hadoop-able Problems, by Cloudera
• Hadoop, An Industry Perspective, by Amr Awadallah
• Anatomy of a MapReduce Job Run, by Tom White
• MapReduce Jobs in Hadoop