Hadoop Release 2.0
Transcript of Hadoop Release 2.0
www.edureka.in/hadoop
How It Works…
LIVE On-Line classes
Class recordings in Learning Management System (LMS)
Module wise Quizzes, Coding Assignments
24x7 on-demand technical support
Project work on large Datasets
Online certification exam
Lifetime access to the LMS
Complimentary Java Classes
Module 1: Understanding Big Data, Hadoop Architecture
Module 2: Introduction to Hadoop 2.x, Data Loading Techniques, Hadoop Project Environment
Module 3: Hadoop MapReduce Framework, Programming in MapReduce
Module 4: Advanced MapReduce, YARN (MRv2) Architecture, Programming in YARN
Module 5: Analytics Using Pig, Understanding Pig Latin
Module 6: Analytics Using Hive, Understanding HiveQL
Module 7: NoSQL Databases, Understanding HBase, ZooKeeper
Module 8: Real-World Datasets and Analysis, Project Discussion
Course Topics
What is Big Data?
Limitations of the existing solutions
Solving the problem with Hadoop
Introduction to Hadoop
Hadoop Eco-System
Hadoop Core Components
HDFS Architecture
MapReduce Job Execution
Anatomy of a File Write and Read
Hadoop 2.0 (YARN or MRv2) Architecture
Topics for Today
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information.
The NYSE generates about one terabyte of new trade data per day, used to perform stock-trading analytics to determine trends for optimal trades.
What Is Big Data?
Unstructured Data is Exploding
Volume, Velocity, Variety
Characteristics of Big Data
IBM’s Definition
Hello there!! My name is Annie. I love quizzes and
puzzles, and I am here to make you guys think and
answer my questions.
Annie’s Introduction
Annie’s Question
Map the following to corresponding data type:
- XML Files
- Word Docs, PDF files, Text files
- E-Mail body
- Data from Enterprise systems (ERP, CRM etc.)
XML Files -> Semi-structured data
Word Docs, PDF files, Text files -> Unstructured Data
E-Mail body -> Unstructured Data
Data from Enterprise systems (ERP, CRM etc.) -> Structured Data
Annie’s Answer
More on Big Data: http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop: http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop: http://www.edureka.in/blog/jobs-in-hadoop/
Big Data: http://en.wikipedia.org/wiki/Big_Data
IBM's definition – Big Data Characteristics: http://www-01.ibm.com/software/data/bigdata/
Further Reading
Web and e-tailing: Recommendation Engines, Ad Targeting, Search Quality, Abuse and Click Fraud Detection
Telecommunications: Customer Churn Prevention, Network Performance Optimization, Calling Data Record (CDR) Analysis, Analyzing the Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios
Government: Fraud Detection and Cyber Security, Welfare Schemes, Justice
Healthcare & Life Sciences: Health Information Exchange, Gene Sequencing, Serialization, Healthcare Service Quality Improvements, Drug Safety
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
Banks and Financial Services: Modeling True Risk, Threat Analysis, Fraud Detection, Trade Surveillance, Credit Scoring and Analysis
Retail: Point-of-Sales Transaction Analysis, Customer Churn Analysis, Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
Insight into data can provide business advantage.
Some key early indicators can mean fortunes to business.
More precise analysis comes with more data.
Case Study: Sears Holding Corporation
*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process the customer activity and sales data.
Hidden Treasure
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
The pre-Hadoop pipeline:
Instrumentation -> Collection -> Storage-only Grid (original raw data, mostly append) -> ETL Compute Grid -> RDBMS (aggregated data) -> BI Reports + Interactive Apps
90% of the ~2 PB was archived in the storage-only grid; a meagre 10% of the ~2 PB of data was available for BI.
1. Can't explore original high-fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death
Limitations of Existing Data Analytics Architecture
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.
The new pipeline:
Instrumentation -> Collection -> Hadoop: Storage + Compute Grid (both storage and processing; original raw data, mostly append, no data archiving) -> RDBMS (aggregated data) -> BI Reports + Interactive Apps
The entire ~2 PB of data is available for processing.
1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
Solution: A Combined Storage + Compute Layer
Differentiating Factors: Accessible, Robust, Simple, Scalable
Hadoop Differentiating Factors
                RDBMS / EDW / MPP               NoSQL / Hadoop
Data Types      Structured                      Multi- and unstructured
Processing      Limited, no data processing     Processing coupled with data
Governance      Standards & structured          Loosely structured
Schema          Required on write               Required on read
Speed           Reads are fast                  Writes are fast
Cost            Software license                Support only
Resources       Known entity                    Growing, complex, wide
Best Fit Use    Interactive OLAP analytics,     Data discovery, processing
                complex ACID transactions,      unstructured data, massive
                operational data store          storage/processing
Hadoop – It’s about Scale And Structure
Read 1 TB of data:
1 Machine, 4 I/O channels, each channel 100 MB/s -> 45 minutes
10 Machines, 4 I/O channels each, each channel 100 MB/s -> 4.5 minutes
Why DFS?
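The arithmetic behind those figures can be checked with a small sketch, assuming the slide's round numbers (4 I/O channels per machine at 100 MB/s, and 1 TB taken as 10^6 MB). The exact result is ~41.7 minutes, which the slide rounds up to 45.

```python
# Back-of-the-envelope scan times, assuming the slide's hardware:
# 4 I/O channels per machine, each sustaining 100 MB/s.
def read_minutes(data_mb, machines, channels=4, mb_per_sec=100):
    """Minutes to scan data_mb spread evenly across the machines."""
    throughput = machines * channels * mb_per_sec  # aggregate MB/s
    return data_mb / throughput / 60

ONE_TB_MB = 1_000_000  # 1 TB ~ 10^6 MB, matching the slide's round numbers

print(round(read_minutes(ONE_TB_MB, machines=1), 1))   # 41.7 (slide rounds to 45)
print(round(read_minutes(ONE_TB_MB, machines=10), 2))  # 4.17 (slide rounds to 4.5)
```

Ten machines give ten times the aggregate bandwidth, which is exactly why a distributed file system pays off for full-data scans.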
Apache Hadoop is a framework that allows for the distributed processing of large data sets
across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
What Is Hadoop?
Hadoop Features: Reliable, Economical, Flexible, Scalable
Hadoop Key Characteristics
Hadoop is a framework that allows for the distributed processing of:
- Small Data Sets
- Large Data Sets
Annie’s Question
Annie’s Answer
Large Data Sets. Hadoop can also process small data sets; however,
to experience the true power of Hadoop one needs data in terabytes,
because that is where an RDBMS takes hours or fails outright,
whereas Hadoop does the same in a couple of minutes.
Apache Oozie (Workflow)
Hive (DW System), Pig Latin (Data Analysis), Mahout (Machine Learning)
MapReduce Framework, HBase
HDFS (Hadoop Distributed File System)
Flume: import of unstructured or semi-structured data; Sqoop: import or export of structured data
Hadoop Eco-System
Hadoop and MapReduce magic in action: write intelligent applications using Apache Mahout (e.g. LinkedIn recommendations).
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
Machine Learning with Mahout
Hadoop is a system for large-scale data processing.
It has two main components:
HDFS – Hadoop Distributed File System (Storage)
Distributed across "nodes"
Natively redundant
NameNode tracks locations
MapReduce (Processing)
Splits a task across processors, "near" the data, and assembles results
Self-healing, high bandwidth
Clustered storage
JobTracker manages the TaskTrackers
Hadoop Core Components
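The division of labour can be illustrated with a toy word count in plain Python; this is a stand-in for a real MapReduce job, not Hadoop's API: map tasks emit (word, 1) pairs near the data, the framework groups pairs by key (the shuffle), and reduce tasks aggregate each group.

```python
# Toy word count mimicking the MapReduce flow: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(split):
    """Map task: emit one (word, 1) pair per word in its input split."""
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce task: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data big cluster", "data node data block"]  # two input splits
pairs = [p for s in splits for p in map_phase(s)]
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 3, 'cluster': 1, 'node': 1, 'block': 1}
```

In real Hadoop, each input split is served by one map task, scheduled close to the DataNode holding the block.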
The MapReduce engine consists of a JobTracker managing TaskTrackers; the HDFS cluster consists of a NameNode (on the admin node) managing DataNodes. Each slave node runs a DataNode together with a TaskTracker.
Hadoop Core Components (Contd.)
The NameNode stores the metadata (name, replicas, ...), e.g. /home/foo/data, 3, ...
Clients perform metadata ops against the NameNode and read or write blocks directly on DataNodes, which are spread across racks (Rack 1, Rack 2). The NameNode issues block ops to the DataNodes, and DataNodes replicate written blocks among themselves.
HDFS Architecture
NameNode:
master of the system
maintains and manages the blocks which are present on the
DataNodes
DataNodes:
slaves which are deployed on each machine and provide the
actual storage
responsible for serving read and write requests for the clients
Main Components Of HDFS
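To make the NameNode/DataNode split concrete, here is a back-of-the-envelope block calculation. The 64 MB block size was the Hadoop 1.x default and 3 is the default replication factor; `block_layout` is a made-up helper for illustration.

```python
import math

def block_layout(file_mb, block_mb=64, replication=3):
    """Blocks for one file and raw MB stored across the cluster.
    A partial last block only occupies the space it actually needs,
    so raw usage is file size times the replication factor."""
    blocks = math.ceil(file_mb / block_mb)
    raw_mb = file_mb * replication
    return blocks, raw_mb

print(block_layout(400))  # (7, 1200): 6 full 64 MB blocks + one 16 MB block, 3 copies each
```

The NameNode tracks only the (small) block list per file; the DataNodes hold the 1200 MB of actual replica data.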
Metadata in memory: the entire metadata is held in main memory; there is no demand paging of FS metadata.
Types of metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes, e.g. access time, replication factor.
A transaction log records file creations, file deletions, etc.
The NameNode stores metadata only: it keeps track of the overall file directory structure and the placement of data blocks, e.g.:
/user/doug/hinfo -> 1 3 5
/user/doug/pdetail -> 4 2
NameNode Metadata
Secondary NameNode:
Not a hot standby for the NameNode
Connects to the NameNode every hour*
Housekeeping and backup of NameNode metadata
Saved metadata can be used to rebuild a failed NameNode
The NameNode (the "Single Point of Failure") hands its metadata to the Secondary NameNode every hour, which keeps it secure.
Secondary Name Node
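The housekeeping idea can be sketched in a few lines, reusing the metadata example from the NameNode slide. The dict-based fsimage and the edit-log tuples are illustrative stand-ins, not HDFS's on-disk FsImage/edits formats: the checkpoint folds logged namespace edits into a fresh snapshot so the edit log stays short.

```python
# Minimal sketch of a NameNode checkpoint: apply the edit log to a
# copy of the fsimage snapshot. Data structures are illustrative only.
def checkpoint(fsimage, edits):
    """Return a new fsimage with the logged namespace edits applied."""
    image = dict(fsimage)  # work on a copy; the live image stays untouched
    for op, path, value in edits:
        if op == "create":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
    return image  # after this, the edit log could be truncated

fsimage = {"/user/doug/hinfo": [1, 3, 5]}
edits = [("create", "/user/doug/pdetail", [4, 2]),
         ("delete", "/user/doug/hinfo", None)]
print(checkpoint(fsimage, edits))  # {'/user/doug/pdetail': [4, 2]}
```

This is why the saved metadata can rebuild a failed NameNode: snapshot plus edit log together reconstruct the namespace.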
The NameNode:
a) is the “Single Point of Failure” in a cluster
b) runs on ‘Enterprise-class’ hardware
c) stores meta-data
d) All of the above
Annie’s Question
Annie’s Answer
All of the above. The NameNode stores metadata and runs on reliable,
high-quality hardware because it is a single point of failure in a
Hadoop cluster.
Annie’s Question
When the NameNode fails, Secondary NameNode takes over
instantly and prevents Cluster Failure:
a) TRUE
b) FALSE
Annie’s Answer
False. The Secondary NameNode is used for creating NameNode
checkpoints. A failed NameNode can be manually recovered using the
'edits' log and 'FsImage' stored by the Secondary NameNode.
Job submission (client side):
1. Copy input files: the user copies the input files to the DFS.
2. Submit job: the user submits the job to the client.
3. Get input files' info: the client queries the DFS.
4. Create splits: the client computes the input splits.
5. Upload job information: the client uploads Job.xml and Job.jar to the DFS.
6. Submit job: the client submits the job to the JobTracker.
JobTracker
Job initialization (JobTracker side):
6. Submit job: the client submits the job to the JobTracker.
7. Initialize job: the JobTracker places it in the job queue.
8. Read job files: the JobTracker reads Job.xml and Job.jar from the DFS.
9. Create maps and reduces: as many maps as input splits.
JobTracker (Contd.)
Task assignment:
10. Heartbeat: TaskTrackers (e.g. on hosts H1-H4) send periodic heartbeats to the JobTracker.
11. Picks tasks: the JobTracker picks tasks from the job queue (data-local if possible).
12. Assign tasks: the JobTracker assigns the tasks to the TaskTrackers.
JobTracker (Contd.)
Which of the following daemons does the Hadoop framework pick
for scheduling a task?
a) namenode
b) datanode
c) task tracker
d) job tracker
Annie’s Question
Annie’s Answer
Job tracker. The JobTracker takes care of all the job scheduling and
assigns tasks to the TaskTrackers.
1. Create: the HDFS client calls the Distributed File System.
2. Create: the Distributed File System asks the NameNode to create the file.
3. Write: the client streams the data.
4. Write packet: packets flow down a pipeline of DataNodes, each node forwarding to the next.
5. Ack packet: acknowledgements flow back up the pipeline.
6. Close: the client closes the stream.
7. Complete: the Distributed File System notifies the NameNode.
Anatomy of A File Write
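The pipelined write above can be mimicked with a toy sketch; list-backed "DataNodes" stand in for real storage, and `write_block` is a made-up helper, purely illustrative of steps 4 and 5:

```python
# Toy model of the HDFS write pipeline: each packet is forwarded down
# the chain of DataNodes, then acknowledged back to the client.
def write_block(packets, pipeline):
    """Send packets through the pipeline; return the acked packets."""
    acked = []
    for packet in packets:
        for node in pipeline:      # step 4: packet flows down the pipeline
            node.append(packet)
        acked.append(packet)       # step 5: ack once every node stored it
    return acked

dn1, dn2, dn3 = [], [], []         # three "DataNodes" in the pipeline
acks = write_block(["pkt1", "pkt2"], [dn1, dn2, dn3])
print(acks)               # ['pkt1', 'pkt2']
print(dn1 == dn2 == dn3)  # True: every replica holds the same packets
```

The point of the pipeline is that the client sends each packet once; the DataNodes propagate it among themselves.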
1. Open: the HDFS client calls the Distributed File System.
2. Get block locations: the Distributed File System asks the NameNode.
3. Read: the client starts reading.
4. Read: data is streamed from the DataNode holding the first blocks.
5. Read: the client continues with the DataNodes holding the subsequent blocks.
Anatomy of A File Read
Replication and Rack Awareness
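The default rack-aware placement for a replication factor of 3 puts the first replica on the writer's own node, the second on a node in a different rack, and the third on another node in that same remote rack. A sketch of that policy follows; the rack and node names are invented for illustration.

```python
# Sketch of HDFS's default replica placement for replication factor 3.
import random

def place_replicas(writer, topology):
    """topology: {rack: [nodes]}; return three (rack, node) placements."""
    writer_rack = next(r for r, nodes in topology.items() if writer in nodes)
    first = (writer_rack, writer)                      # 1st: the writer's node
    other_rack = random.choice([r for r in topology if r != writer_rack])
    second_node = random.choice(topology[other_rack])  # 2nd: survives a rack failure
    third_node = random.choice(
        [n for n in topology[other_rack] if n != second_node]
    )                                                  # 3rd: same remote rack, other node
    return [first, (other_rack, second_node), (other_rack, third_node)]

topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
replicas = place_replicas("n1", topology)
print(replicas[0])  # ('rack1', 'n1')
```

Two racks hold all three copies: the block survives a whole-rack failure while only one replica crosses the inter-rack link during the write.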
In HDFS, blocks of a file are written in parallel; however,
the replication of the blocks is done sequentially:
a) TRUE
b) FALSE
Annie’s Question
Annie’s Answer
True. A file is divided into blocks; these blocks are
written in parallel, but the block replication happens in
sequence.
Annie’s Question
A 400 MB file is being copied to HDFS, and the system has finished copying 250 MB. What happens if a client tries to access that file?
a) It can read up to the last block that was successfully written.
b) It can read up to the last bit successfully written.
c) It will throw an exception.
d) It cannot see the file until copying has finished.
Annie’s Answer
The client can read up to the last successfully written data block,
so the answer is (a).
Hadoop 2.x introduces HDFS High Availability and YARN.
HDFS HA: an Active NameNode and a Standby NameNode. All namespace edits are logged to shared edit logs on shared NFS storage with a single writer (fencing); the Standby NameNode reads the edit logs and applies them to its own namespace.
YARN: a cluster-wide Resource Manager plus a Node Manager on each slave node. Each slave node runs a DataNode together with a Node Manager; applications run in Containers coordinated by a per-application App Master, and the Client submits jobs to the Resource Manager.
Hadoop 2.x (YARN or MRv2)
Apache Hadoop and HDFS: http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture: http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Hadoop 2.0 and YARN: http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/
Further Reading
Setup the Hadoop development environment using the documents present in the LMS.
Hadoop Installation – Setup Cloudera CDH3 Demo VM
Hadoop Installation – Setup Cloudera CDH4 QuickStart VM
Execute Linux Basic Commands
Execute HDFS Hands On commands
Attempt the Module-1 Assignments present in the LMS.
Module-2 Pre-work
Thank You! See You in Class Next Week