BIGDATA - Survey on Scheduling Methods in Hadoop MapReduce Framework
Big Data and Hadoop
What is the need for Big Data technology when we already have robust, high-performing relational database management systems?
Data was stored in a structured format: primary keys, rows, columns, tuples, and foreign keys.
It was used only for transactional data analysis; later, data warehouses were used for offline data (analysis done within the enterprise).
With massive use of the Internet and social networking (Facebook, LinkedIn), data became less structured.
Data is stored on a central server.
'Big Data' is similar to 'small data', but bigger
…and being bigger, it requires different approaches: techniques, tools, and architectures
…with the aim of solving new problems, or old problems in a better way
Volume
• Data quantity
Velocity
• Data Speed
Variety
• Data Types
HADOOP
Open-source data storage and processing API
Massively scalable, automatically parallelizable
Based on work from Google
GFS + MapReduce + BigTable
Current Distributions based on Open Source and Vendor Work
Apache Hadoop
Cloudera – CDH4 w/ Impala
Hortonworks
MapR
AWS
Windows Azure HDInsight
HDFS (storage): self-healing, high-bandwidth clustered storage
MapReduce (processing): fault-tolerant distributed processing
HDFS is a file system written in Java
› Sits on top of a native file system
› Provides redundant storage for massive amounts of data
› Uses cheap, unreliable computers
Data is split into blocks and stored on multiple nodes in the cluster
› Each block is usually 64 MB or 128 MB (configurable)
› Each block is replicated multiple times (configurable)
› Replicas are stored on different DataNodes
Optimized for large files, 100 MB+
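As a minimal sketch of how a client might load a file into HDFS with these settings (the paths and values are illustrative assumptions, not from the slides):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.blocksize", "134217728"); // 128 MB blocks (configurable)
    conf.set("dfs.replication", "3");       // three replicas per block
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical paths: copy a local file into the cluster
    fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                         new Path("/user/hadoop/input.txt"));
    fs.close();
  }
}
```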
Master Nodes
• NameNode
  › only 1 per cluster
  › metadata server and database
  › SecondaryNameNode helps with some housekeeping
• JobTracker
  › only 1 per cluster
  › job scheduler
Slave Nodes
• DataNodes
  › 1-4000 per cluster
  › block data storage
• TaskTrackers
  › 1-4000 per cluster
  › task execution
A single NameNode stores all metadata
› Filenames, locations on DataNodes of each block, owner, group, etc.
› All information is maintained in RAM for fast lookup
› File system metadata size is therefore limited to the amount of available RAM on the NameNode
DataNodes store file contents
Stored as opaque ‘blocks’ on the underlying filesystem
Different blocks of the same file will be stored on different DataNodes
Same block is stored on three (or more) DataNodes for redundancy
DataNodes send heartbeats to NameNode
› After a period without any heartbeats, a DataNode is assumed to be lost
› NameNode determines which blocks were on the lost node
› NameNode finds other DataNodes with copies of these blocks
› These DataNodes are instructed to copy the blocks to other nodes
› Replication is actively maintained
The SecondaryNameNode is not a failover NameNode
› Performs memory-intensive administrative functions on behalf of the NameNode
› Should run on a separate machine
[Figure: cluster layout. Each slave node runs a DataNode daemon and a TaskTracker on top of the Linux file system; the NameNode daemon runs on the namenode, and the JobTracker runs on the job submission node.]
[Figure: MapReduce job flow. A MapReduce job is submitted by a client computer to the JobTracker on the master node, which dispatches task instances to TaskTrackers on the slave nodes. In our case the cluster is circe.rc.usf.edu.]
Preparing for MapReduce: Loading Files
› Files are loaded from the native file system (or the cloud) into HDFS
› Files are split into blocks of 64 MB or 128 MB
› Output is immutable
You define: Input, Map, Reduce, Output
› Use Java or another programming language
› Work with key/value pairs
Input: a set of key/value pairs
User supplies two functions:
› map(k, v) → list(k1, v1)
› reduce(k1, list(v1)) → v2
(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs
For example, in word counting: map over the line "the cat sat" emits ("the", 1), ("cat", 1), ("sat", 1), and reduce("the", [1, 1]) emits ("the", 2).
InputFormat
Map function
Partitioner
Sorting & Merging
Combiner
Shuffling
Merging
Reduce function
OutputFormat
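Most of these stages correspond to pluggable classes on the Job object; sorting, shuffling, and merging are handled by the framework itself. A minimal configuration sketch, where MyMapper, MyPartitioner, MyCombiner, and MyReducer are hypothetical class names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PipelineSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pipeline-sketch");
    job.setJarByClass(PipelineSketch.class);
    job.setInputFormatClass(TextInputFormat.class);   // InputFormat
    job.setMapperClass(MyMapper.class);               // Map function (hypothetical)
    job.setPartitionerClass(MyPartitioner.class);     // Partitioner (hypothetical)
    job.setCombinerClass(MyCombiner.class);           // Combiner (hypothetical)
    job.setReducerClass(MyReducer.class);             // Reduce function (hypothetical)
    job.setOutputFormatClass(TextOutputFormat.class); // OutputFormat
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```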
[Figure: a Master Node hosting the Name Node and Job Tracker, with Slave Nodes 1-3 each hosting a Task Tracker and a Data Node.]
MapReduce Example: WordCount
InputFormat:
› TextInputFormat
› KeyValueTextInputFormat
› SequenceFileInputFormat
OutputFormat:
› TextOutputFormat
› SequenceFileOutputFormat
Probably the most complex aspect of MapReduce!
Map side
› Map outputs are buffered in memory in a circular buffer
› When the buffer reaches a threshold, its contents are "spilled" to disk
› Spills are merged into a single, partitioned file (sorted within each partition): the combiner runs here
Reduce side
› First, map outputs are copied over to the reducer machine
› "Sort" is a multi-pass merge of map outputs (happens in memory and on disk): the combiner runs here
› The final merge pass goes directly into the reducer
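The buffer size and spill threshold are tunable. A sketch using Hadoop 2.x property names; the values here are illustrative, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    // Size of the in-memory circular buffer for map output, in MB
    conf.setInt("mapreduce.task.io.sort.mb", 256);
    // Fraction of the buffer that, once filled, triggers a spill to disk
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
    return conf;
  }
}
```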
[Figure: shuffle and sort. The mapper fills a circular buffer in memory; spills go to disk and are merged (the combiner runs here) into intermediate files on disk. These are copied, along with output from other mappers, to the reducers, which merge them (the combiner may run again) before the final pass into reduce.]
Writable: defines a de/serialization protocol. Every data type in Hadoop is a Writable.
WritableComparable: defines a sort order. All keys must be of this type (but not values).
IntWritable, LongWritable, Text, …: concrete classes for different data types.
SequenceFiles: a binary encoding of a sequence of key/value pairs.
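A sketch of a custom key type: the YearKey class itself is hypothetical, but the write/readFields/compareTo contract is Hadoop's WritableComparable interface:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: a record keyed and sorted by year
public class YearKey implements WritableComparable<YearKey> {
  private int year;

  public YearKey() {}                  // Hadoop requires a no-arg constructor
  public YearKey(int year) { this.year = year; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(year);                // serialization
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    year = in.readInt();               // deserialization
  }

  @Override
  public int compareTo(YearKey other) {
    return Integer.compare(year, other.year);  // sort order for keys
  }
}
```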
Map function and Reduce function: run this program as a MapReduce job (sketched below).
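A sketch of the canonical WordCount map function, reduce function, and driver, following the standard Hadoop pattern (the slides' exact code is not reproduced in the transcript):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(k, v) -> list(k1, v1): emit (word, 1) for every word in the line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce(k1, list(v1)) -> v2: sum the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configure and submit the job
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the reducer doubles as combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```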
You ↔ Hadoop Cluster:
1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job (3a. go back to Step 2)
4. Retrieve data from HDFS
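Steps 1 and 4 can be done from the command line or programmatically; a sketch of step 4 with illustrative paths (part-r-00000 is the default reducer output file name):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchOutput {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Step 4: copy job output back to the local disk (hypothetical paths)
    fs.copyToLocalFile(new Path("out/part-r-00000"),
                       new Path("/tmp/wordcount.txt"));
    fs.close();
  }
}
```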
Applications for Big Data Analytics
› Homeland Security
› Finance
› Smarter Healthcare
› Multi-channel sales
› Telecom
› Manufacturing
› Traffic Control
› Trading Analytics
› Fraud and Risk
› Log Analysis
› Search Quality
› Retail: Churn, NBO
CASE STUDY 1: Environment Change Prediction to Assist Farmers Using Hadoop
Thank you
Questions?