HADOOP ADMIN: Session 2
What is Hadoop?
AGENDA
- Hadoop Demo using Cygwin
- HDFS Daemons
- MapReduce Daemons
- Hadoop Ecosystem Projects
Hadoop Using Cygwin
- What is Cygwin? Cygwin provides a Unix-like shell environment on Windows, which Hadoop's launch scripts need
- Hadoop needs Java version 1.6 or higher
Word count example, run through the bin/hadoop launcher with the bundled examples jar:
  bin/hadoop jar hadoop-examples-1.0.4.jar wordcount input output
Tokenization problem
Modifying the program
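The tokenization problem in the stock example is that splitting purely on whitespace counts "see" and "see," as different words. A minimal sketch of how the mapper's tokenization might be modified, assuming the org.apache.hadoop.mapreduce API; the class name and normalization rule are illustrative, not the original program's:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class NormalizingTokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Lowercase the line and replace punctuation with spaces before
            // tokenizing, so "See" and "see," both count as "see".
            String line = value.toString().toLowerCase().replaceAll("[^a-z\\s]", " ");
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }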
HDFS Daemons
[Diagram: NameNode (metadata in RAM), DataNodes, and Secondary NameNode]
- DataNodes send block reports and heartbeats to the NameNode
- Clients read data blocks directly from the DataNodes
- The Secondary NameNode is not a backup/standby node
- Secondary NameNode checkpoint process:
  1. Roll edits on the NameNode
  2. Copy the fsimage and edits to the Secondary NameNode
  3. Replay all edits and create a new fsimage
  4. Send the new fsimage back to the NameNode
  5. Rename the new edits file
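The read path in the diagram can be seen from a client program: the HDFS client asks the NameNode for block locations, then streams each block straight from a DataNode. A minimal Java sketch using the FileSystem API (the file path comes from the command line; cluster settings from core-site.xml on the classpath):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);
            // open() contacts the NameNode for block locations; the returned
            // stream then reads each block directly from a DataNode replica.
            FSDataInputStream in = fs.open(new Path(args[0]));
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
            fs.close();
        }
    }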
MapReduce V1 Daemons
- JobTracker
- TaskTracker
[Diagram: one JobTracker coordinating many TaskTrackers]
Word Count over a Given Set of Web Pages
Input:
  see bob throw
  see spot run

Map output:
  see 1, bob 1, throw 1
  see 1, spot 1, run 1

Reduce output (grouped by word, counts summed):
  bob 1
  run 1
  see 2
  spot 1
  throw 1
Can we do word count in parallel?
The MapReduce Framework (pioneered by Google)
Automatic Parallel Execution in MapReduce (Google)
Handles failures automatically: e.g., restarts tasks if a node fails, and runs multiple copies of the same task to avoid a slow task slowing down the whole job
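The "multiple copies of the same task" behavior is speculative execution. A hedged sketch of toggling it from a job driver, using the Hadoop 1.x property names (both default to on):

    // Fragment for a job driver; property names are from the Hadoop 1.x line.
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);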
MapReduce in Hadoop (1)
MapReduce in Hadoop (2)
Data Flow in a MapReduce Program in Hadoop
InputFormat → Map function → Partitioner → Sorting & Merging → Combiner → Shuffling → Merging → Reduce function → OutputFormat
(1:many — a single input record can produce many intermediate key-value pairs)
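The Partitioner stage decides which reducer each intermediate key goes to; the default is HashPartitioner (hash of the key modulo the number of reduce tasks). A sketch of a custom one, with an illustrative class name and routing rule (assumes the job runs with two reduce tasks):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (numPartitions < 2) return 0;
            String k = key.toString().toLowerCase();
            // Words starting with a-m go to reducer 0, the rest to reducer 1.
            return (!k.isEmpty() && k.charAt(0) <= 'm') ? 0 : 1;
        }
    }

It would be wired into the job with job.setPartitionerClass(FirstLetterPartitioner.class).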
Lifecycle of a MapReduce Job
- Map function
- Reduce function
- Run this program as a MapReduce job
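The slides show the map and reduce functions of the word-count program without the code itself; below is a minimal version against the org.apache.hadoop.mapreduce API, close to (but not necessarily identical to) the WordCount shipped in hadoop-examples-1.0.4.jar:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map function: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce function: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: run this program as a MapReduce job.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // the Combiner stage above
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }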
Lifecycle of a MapReduce Job (over time)
[Diagram: input splits feed Map Wave 1 and Map Wave 2; reduces run as Reduce Wave 1 and Reduce Wave 2]
How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?
Job Configuration Parameters
- 190+ parameters in Hadoop
- Set manually, or defaults are used
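A hedged sketch of setting a few of those parameters explicitly in a driver (Hadoop 1.x property names; anything left unset falls back to the defaults in the *-default.xml files):

    // Fragment for a job driver; property names are from the Hadoop 1.x line.
    Configuration conf = new Configuration();
    conf.setInt("mapred.reduce.tasks", 4);              // number of reduce tasks
    conf.set("mapred.child.java.opts", "-Xmx512m");     // heap for each task JVM
    conf.setLong("mapred.max.split.size", 134217728L);  // caps split size, so influences split count

    Job job = new Job(conf, "configured job");
    job.setNumReduceTasks(4); // the same knob through the Job API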
Hadoop Ecosystem / Sub-Projects
PIG
- One frequent complaint about MR is that it is difficult to program
- Another criticism of MapReduce is that the development cycle is very long
- As you implement the program in MapReduce, you have to think at the level of mapper and reducer functions and job chaining
Pig started as a research project within Yahoo! in the summer of 2006, joining Apache Incubator in September of 2007
Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin
Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop’s simple scalability and reliability
Yahoo! runs 40% of all its Hadoop jobs with Pig, and Twitter uses Pig as well. Indeed, it was created at Yahoo! to make it easier for researchers and engineers to mine the huge datasets there
PIG: What it looks like
- The name on the left-hand side is not a variable but a relation
- LOAD reads a data file into a relation, with a defined schema
Word count example in Pig:
  text = LOAD 'text' USING TextLoader();  -- loads each line as one column
  tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word;
  wordcount = FOREACH (GROUP tokens BY word) GENERATE group AS word, COUNT_STAR($1);
[Diagram: Pig job → MR transformation → MR jobs → HDFS]
Pig vs. Hive
- Pig is a new language, easy to learn if you know languages similar to Perl
- Hive is a subset of SQL with very simple variations to enable MapReduce-style computation
- If you come from a SQL background you will find HiveQL extremely easy to pick up (many of your SQL queries will run as-is), while if you come from a procedural programming background (without SQL knowledge), Pig will be much more suitable for you
Hive is a bit easier to integrate with other systems and tools since it speaks the language they already speak (i.e. SQL).
Ultimately the choice of whether to use Hive or Pig will depend on the exact requirements of the application domain and the preferences of the implementers and those writing queries.
HIVE (HQL)
- Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MR jobs and runs them on the Hadoop cluster
- Invented at Facebook for their own problems
- SQL-like query language (HQL/HiveQL) to retrieve and process the data
- JDBC/ODBC access is provided
- Currently used together with HBase
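A minimal sketch of the JDBC access, assuming the original HiveServer on its default port 10000 and the driver class from the Hive releases of this era; the host and the table queried are illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            // The HiveQL below is compiled into MR jobs behind this call.
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, cnt FROM wordcounts LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }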
HBase
- HBase is not a high-level language that compiles to MapReduce; HBase is about allowing Hadoop to support lookups/transactions on key/value pairs
- HBase allows you to do quick random lookups (versus scanning all of the data sequentially) and to insert/update/delete in the middle of the data, not just add/append
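A minimal sketch of that random read/write access through the HTable client API of the HBase 0.9x era; the table, column family, and row key are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");

            // Insert/update a single cell in the middle of the table, by row key.
            Put put = new Put(Bytes.toBytes("user42"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("city"),
                    Bytes.toBytes("Pune"));
            table.put(put);

            // Quick random lookup by key -- no sequential scan of all the data.
            Get get = new Get(Bytes.toBytes("user42"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));

            table.close();
        }
    }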
Sqoop
- Loads bulk data into Hadoop from relational databases
- Imports individual tables or entire databases to files in HDFS
- Provides the ability to import from SQL databases straight into your Hive data warehouse
Importing this table into HDFS could be done with the command:
  you@db$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
      --local --hive-import