HADOOP ADMIN: Session 2
What is Hadoop?
AGENDA
- Hadoop Demo using Cygwin
- HDFS Daemons
- MapReduce Daemons
- Hadoop Ecosystem Projects
Hadoop Using Cygwin
- What is Cygwin? Cygwin provides a Unix-like shell environment on Windows, which Hadoop's launch scripts need
- Hadoop needs Java version 1.6 or higher
Word count example, run through the bin/hadoop launcher with the bundled examples jar:
  bin/hadoop jar hadoop-examples-1.0.4.jar wordcount input output
Tokenization problem
Modifying the program
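The tokenization problem in the stock example is that splitting purely on whitespace counts "see" and "see," as different words. A minimal sketch of how the mapper's tokenization might be modified, assuming the org.apache.hadoop.mapreduce API; the class name and normalization rule are illustrative, not the original program's:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class NormalizingTokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Lowercase the line and replace punctuation with spaces before
            // tokenizing, so "See" and "see," both count as "see".
            String line = value.toString().toLowerCase().replaceAll("[^a-z\\s]", " ");
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }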
HDFS Daemons
[Diagram: NameNode (metadata in RAM), DataNodes, and Secondary NameNode]
- DataNodes send block reports and heartbeats to the NameNode
- Clients read data blocks directly from the DataNodes
- The Secondary NameNode is not a backup/standby node
- Secondary NameNode checkpoint process:
  1. Roll edits on the NameNode
  2. Copy the fsimage and edits to the Secondary NameNode
  3. Replay all edits and create a new fsimage
  4. Send the new fsimage back to the NameNode
  5. Rename the new edits file
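The read path in the diagram can be seen from a client program: the HDFS client asks the NameNode for block locations, then streams each block straight from a DataNode. A minimal Java sketch using the FileSystem API (the file path comes from the command line; cluster settings from core-site.xml on the classpath):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);
            // open() contacts the NameNode for block locations; the returned
            // stream then reads each block directly from a DataNode replica.
            FSDataInputStream in = fs.open(new Path(args[0]));
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
            fs.close();
        }
    }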
MapReduce V1 Daemons
- JobTracker
- TaskTracker
[Diagram: one JobTracker coordinating many TaskTrackers]
Word Count over a Given Set of Web Pages
Input:
  see bob throw
  see spot run

Map output:
  see 1, bob 1, throw 1
  see 1, spot 1, run 1

Reduce output (grouped by word, counts summed):
  bob 1
  run 1
  see 2
  spot 1
  throw 1
Can we do word count in parallel?
The MapReduce Framework (pioneered by Google)
Automatic Parallel Execution in MapReduce (Google)
Handles failures automatically: e.g., restarts tasks if a node fails, and runs multiple copies of the same task to avoid a slow task slowing down the whole job
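The "multiple copies of the same task" behavior is speculative execution. A hedged sketch of toggling it from a job driver, using the Hadoop 1.x property names (both default to on):

    // Fragment for a job driver; property names are from the Hadoop 1.x line.
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);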
MapReduce in Hadoop (1)
MapReduce in Hadoop (2)
Data Flow in a MapReduce Program in Hadoop
InputFormat → Map function → Partitioner → Sorting & Merging → Combiner → Shuffling → Merging → Reduce function → OutputFormat
(1:many — a single input record can produce many intermediate key-value pairs)
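The Partitioner stage decides which reducer each intermediate key goes to; the default is HashPartitioner (hash of the key modulo the number of reduce tasks). A sketch of a custom one, with an illustrative class name and routing rule (assumes the job runs with two reduce tasks):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (numPartitions < 2) return 0;
            String k = key.toString().toLowerCase();
            // Words starting with a-m go to reducer 0, the rest to reducer 1.
            return (!k.isEmpty() && k.charAt(0) <= 'm') ? 0 : 1;
        }
    }

It would be wired into the job with job.setPartitionerClass(FirstLetterPartitioner.class).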
Lifecycle of a MapReduce Job
- Map function
- Reduce function
- Run this program as a MapReduce job
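The slides show the map and reduce functions of the word-count program without the code itself; below is a minimal version against the org.apache.hadoop.mapreduce API, close to (but not necessarily identical to) the WordCount shipped in hadoop-examples-1.0.4.jar:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map function: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce function: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: run this program as a MapReduce job.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // the Combiner stage above
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }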
Lifecycle of a MapReduce Job (over time)
[Diagram: input splits feed Map Wave 1 and Map Wave 2; reduces run as Reduce Wave 1 and Reduce Wave 2]
How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?
Job Configuration Parameters
- 190+ parameters in Hadoop
- Set manually, or defaults are used
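A hedged sketch of setting a few of those parameters explicitly in a driver (Hadoop 1.x property names; anything left unset falls back to the defaults in the *-default.xml files):

    // Fragment for a job driver; property names are from the Hadoop 1.x line.
    Configuration conf = new Configuration();
    conf.setInt("mapred.reduce.tasks", 4);              // number of reduce tasks
    conf.set("mapred.child.java.opts", "-Xmx512m");     // heap for each task JVM
    conf.setLong("mapred.max.split.size", 134217728L);  // caps split size, so influences split count

    Job job = new Job(conf, "configured job");
    job.setNumReduceTasks(4); // the same knob through the Job API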
Hadoop Ecosystem / Sub-Projects
PIG
- One frequent complaint about MR is that it is difficult to program
- Another criticism of MapReduce is that the development cycle is very long
- As you implement the program in MapReduce, you have to think at the level of mapper and reducer functions and job chaining
Pig started as a research project within Yahoo! in the summer of 2006, joining Apache Incubator in September of 2007
Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin
Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop’s simple scalability and reliability
Yahoo! runs 40% of all its Hadoop jobs with Pig, and Twitter uses Pig as well. Indeed, it was created at Yahoo! to make it easier for researchers and engineers to mine the huge datasets there
PIG: What it looks like
- The name on the left-hand side is not a variable but a relation
- LOAD reads a data file into a relation, with a defined schema
Word count example in Pig:
  text = LOAD 'text' USING TextLoader();  -- loads each line as one column
  tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word;
  wordcount = FOREACH (GROUP tokens BY word) GENERATE group AS word, COUNT_STAR($1);
[Diagram: Pig job → MR transformation → MR jobs → HDFS]
Pig vs. Hive
- Pig is a new language, easy to learn if you know languages similar to Perl
- Hive is a subset of SQL with very simple variations to enable MapReduce-style computation
- If you come from a SQL background you will find HiveQL extremely easy to pick up (many of your SQL queries will run as-is), while if you come from a procedural programming background (without SQL knowledge), Pig will be much more suitable for you
Hive is a bit easier to integrate with other systems and tools since it speaks the language they already speak (i.e. SQL).
Ultimately the choice of whether to use Hive or Pig will depend on the exact requirements of the application domain and the preferences of the implementers and those writing queries.
HIVE (HQL)
- Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MR jobs and runs them on the Hadoop cluster
- Invented at Facebook for their own problems
- SQL-like query language (HQL/HiveQL) to retrieve and process the data
- JDBC/ODBC access is provided
- Currently used together with HBase
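A minimal sketch of the JDBC access, assuming the original HiveServer on its default port 10000 and the driver class from the Hive releases of this era; the host and the table queried are illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            // The HiveQL below is compiled into MR jobs behind this call.
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, cnt FROM wordcounts LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }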
HBase
- HBase is not a high-level language that compiles to MapReduce; HBase is about allowing Hadoop to support lookups/transactions on key/value pairs
- HBase allows you to do quick random lookups (versus scanning all of the data sequentially) and to insert/update/delete in the middle of the data, not just add/append
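A minimal sketch of that random read/write access through the HTable client API of the HBase 0.9x era; the table, column family, and row key are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");

            // Insert/update a single cell in the middle of the table, by row key.
            Put put = new Put(Bytes.toBytes("user42"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("city"),
                    Bytes.toBytes("Pune"));
            table.put(put);

            // Quick random lookup by key -- no sequential scan of all the data.
            Get get = new Get(Bytes.toBytes("user42"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));

            table.close();
        }
    }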
Sqoop
- Loads bulk data into Hadoop from relational databases
- Imports individual tables or entire databases to files in HDFS
- Provides the ability to import from SQL databases straight into your Hive data warehouse
Importing this table into HDFS could be done with the command:
  you@db$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
      --local --hive-import