Hadoop introduction


Hadoop Framework

• Map-Reduce introduction
• Hadoop introduction
• Hadoop Application Architecture
• Developing a typical Hadoop Application
• Practice on Hadoop

Agenda

• A programming model specification from Google.
• Typically used for processing terabyte (1,024 GB) and petabyte (1,024 TB) scale data.
• Breaks large or complex processing into smaller, independent pieces, modeled as key-value pairs.
• Runs on a cluster of commodity machines.
• Scales by adding more workers, not bigger workers.
• Consists of two phases:
– Map: written by the user; takes an input pair and produces a set of intermediate key/value pairs.
– Reduce: aggregates and collates the intermediate results.
– (input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output), as illustrated in the sketch below.

Map-Reduce concept
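To make the flow above concrete, here is a minimal, hypothetical word-count sketch in plain Java (no Hadoop involved yet): the map step emits an intermediate <word, 1> pair for every word, and the reduce step sums the values that share a key.

import java.util.*;

public class WordCountConcept {

    // Map phase: takes an input pair <fileName, contents> and produces
    // a set of intermediate <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String fileName, String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : contents.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce phase: aggregates all values that share one intermediate key.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Group the intermediate pairs by key (the shuffle step), then reduce.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : map("demo.txt", "to be or not to be")) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        grouped.forEach((word, counts) -> System.out.println(word + "\t" + reduce(word, counts)));
    }
}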

Map-Reduce flow sample

Map-Reduce overall flow

• The user program splits the input file into M pieces (see the split-size sketch below).
• One copy of the program becomes the master; the rest are slaves.
• The master selects idle slaves and assigns a map or reduce task to each of them.
• Map slaves parse their input into key-value pairs and pass each pair to the map function.
• The slaves buffer the intermediate key-value pairs in memory and spill them to the local hard disk; the locations are sent back to the master.
• The master notifies the reduce slaves of the locations of the intermediate pairs.
• The reduce slaves fetch the intermediate pairs and sort them by key.
• Each intermediate key and its list of values is passed to the reduce function.
• The reduce slaves run the reduce function and produce the output for the user.
• At the end of the process, the master returns the result and control to the user program.

Map-Reduce overall flow
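In Hadoop (introduced below), the "M pieces" correspond to input splits, each of which is handed to one map task. A minimal sketch, assuming the Hadoop 2 mapreduce API, of how a job can bound the split size and therefore the number of map pieces; the path and sizes are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");

        // Every split becomes one map task, so these bounds effectively
        // control M, the number of pieces the input is broken into.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
        FileInputFormat.addInputPath(job, new Path("/data/input"));     // hypothetical path
    }
}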

• An open-source Apache project implementing the Map-Reduce specification in Java.
• Distributed processing for large or computationally complex problems.
• Core tenets:
– Scale out, not up
– Move the processing to the data
– Expect and embrace failure
• Normally batch processing over a massive data set.
• Consists of two main parts:
– A data store used for processing (HDFS).
– A parallel processing engine (the MapReduce APIs).
• Current main players: Amazon Elastic MapReduce, Cloudera, MapR, Hortonworks.

Hadoop framework

Hadoop Overall Architecture

• Used to stage data temporarily for Map-Reduce processing.
• A typical file in HDFS is gigabytes to terabytes in size.
• Large files are divided into smaller blocks; the default block size is 64 MB.
• Structured like any existing file system: files, directories, permissions.
• Supports Linux-style commands for interaction: ls, rm, put…
• Communication is over the TCP/IP protocol.
• Provides a Java-based API for access (sketched below).

Hadoop Distributed File System

Hadoop Distributed File System
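A minimal sketch of the Java-based HDFS API mentioned above, assuming a cluster whose fs.defaultFS is configured in core-site.xml; the paths are hypothetical, and each call mirrors one of the shell commands (put, ls, rm):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Connects to the file system named by fs.defaultFS (assumed to be HDFS).
        FileSystem fs = FileSystem.get(new Configuration());

        // Equivalent of "hadoop fs -put data.txt /user/demo/data.txt"
        fs.copyFromLocalFile(new Path("data.txt"), new Path("/user/demo/data.txt"));

        // Equivalent of "hadoop fs -ls /user/demo"
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }

        // Equivalent of "hadoop fs -rm /user/demo/data.txt" (false = not recursive)
        fs.delete(new Path("/user/demo/data.txt"), false);
    }
}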

Hadoop working model

• A client submits a Job to Hadoop (see the driver sketch below):
– The job consists of a Mapper, a Reducer, and a list of inputs.
– It is a collection of Java classes packaged into a JAR file.
• The Job is sent to the JobTracker process on the master node.
• Each slave node runs a process called the TaskTracker.
• The JobTracker instructs and monitors the TaskTrackers.
• A Map or Reduce over a piece of data is a single task.
• A task attempt is an instance of a task running on a slave node.

Hadoop working model
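A minimal driver sketch of the "client submits a Job" step, assuming the Hadoop 2 mapreduce API. It configures an identity job (the stock Mapper and Reducer simply pass records through) so that the example stays self-contained; a real application would plug in its own classes, as in the word-count sketch later on.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IdentityJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "identity job");

        // These classes are packaged into the JAR file shipped to the cluster.
        job.setJarByClass(IdentityJobDriver.class);
        job.setMapperClass(Mapper.class);     // default Mapper: pass-through
        job.setReducerClass(Reducer.class);   // default Reducer: pass-through
        job.setOutputKeyClass(LongWritable.class);  // TextInputFormat keys are byte offsets
        job.setOutputValueClass(Text.class);        // values are the input lines

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the JobTracker and wait until all task attempts finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}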

Hadoop Programming model

• The Map-Reduce framework relies on the InputFormat of the job to:
– Validate the input specification of the job.
– Split the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
– Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.
• The Mapper task processes its records, producing intermediate key-value pairs and emitting them to the reducers with context.write(k, v).
• The Reducer reduces a set of intermediate values which share a key to a smaller set of values, and has three primary phases:
– Shuffle: copies the sorted output from each Mapper across the network.
– Sort: sorts the inputs by key (since different Mappers may output the same key).
– Reduce: calls the reduce method defined by the user.
• Hadoop defines "box" classes such as Text for strings and IntWritable for integers to optimize serialization over the network.

Hadoop Programming model
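A minimal word-count sketch of this programming model, using the box classes and context.write(k, v) described above (Hadoop 2 mapreduce API). These hypothetical classes would be wired into a driver with job.setMapperClass(WordCount.TokenizerMapper.class) and job.setReducerClass(WordCount.IntSumReducer.class).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: <byte offset, line> -> <word, 1>
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit an intermediate key-value pair
            }
        }
    }

    // Reduce: <word, [1, 1, ...]> -> <word, count>
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}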

Hadoop Application Architecture

• Use Sqoop or Flume to import/export data between various external data sources and HDFS for processing:
– The transfer is executed in map tasks of Hadoop.
– Works with RDBMS or NoSQL stores.
– Sample: sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root --password pass --table employees
• Use Apache Hive as data warehouse software that facilitates querying and managing large datasets:
– Organizes the data model as tables, rows, columns, and partitions.
– Supports data types such as integer, float, double, string, list, and struct.
– Supports Join, Group, Filter… via built-in operators and functions.
• Use Spring Data (Spring for Apache Hadoop) to simplify developing Apache Hadoop applications:
– Create and configure applications that use MapReduce, Streaming, Hive, Pig, or HBase.
– Integrates with Spring Boot and uses dependency injection…

Typical Hadoop Application Architecture

Concrete Hadoop Application Architecture

• Choose appropriate frameworks for each application:
– Hive or Pig for logged/relational data.
– Sqoop for working with databases; Flume for collecting log data from web servers, because it is event-driven.
– HDFS or HBase for storing temporary data during processing.
– Crunch APIs for joins/aggregations rather than the raw Hadoop APIs.
• Apply best practices (see the configuration sketch below):
– Choose the number of Mappers and Reducers wisely: total mappers or reducers = number of nodes × maximum number of tasks per node.
– Set the number of Reducers to zero if you are not using them.
– Make sure Mappers process an optimal amount of data.
– Always use a Combiner, if possible, for local aggregation.
– Minimize your Mapper output.
– Always write unit tests and run them on a small data set.

Developing a typical Hadoop Application
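A short sketch of how some of those practices show up in the job configuration (Hadoop 2 mapreduce API); the reducer count and the combiner class are hypothetical choices for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tuning demo");

        // Rule of thumb from above: total reducers ≈ nodes * max tasks per node.
        job.setNumReduceTasks(10);        // hypothetical cluster capacity

        // For a map-only job, skip the shuffle and reduce phases entirely:
        // job.setNumReduceTasks(0);

        // Local aggregation of map output before the shuffle, to minimize the
        // data sent over the network; the combiner is often the same class as
        // the reducer (e.g. the IntSumReducer sketched earlier):
        // job.setCombinerClass(WordCount.IntSumReducer.class);
    }
}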

• Tune Hadoop using configuration parameters:
– Hadoop provides a lot of parameters for tuning.
• What to do when a task fails:
– It happens regularly.
– Try again (retries are possible because tasks are idempotent).
– Report the failure.
• Slow tasks:
– Run another attempt of the same task in parallel (speculative execution), as sketched below.
• Apply Java coding best practices.

Developing Typical Hadoop Application
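A minimal sketch of the configuration parameters behind "try again" and speculative execution. The property names assume Hadoop 2.x (older releases use mapred.* equivalents); the values shown are the usual defaults, set explicitly here only to make the knobs visible:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureHandlingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // "Try again": how many attempts a single task gets before the job fails.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Slow tasks: speculative execution starts a second attempt of a
        // straggling task on another node; the first attempt to finish wins.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "failure-handling demo");
        // ... set mapper, reducer, input and output as usual, then submit.
    }
}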

• Supports standalone, pseudo-distributed, and fully distributed modes.
• Implement a word-count problem.
• Debug a Hadoop program (see the sketch below):
– Using log files.
– Using remote debugging.

Setup environment and practice
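A sketch of two debugging set-ups for the practice session, assuming Hadoop 2.x property names: running the job in standalone (local) mode so it can be stepped through in an IDE, or attaching a remote debugger to the map task JVMs. The port and options are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DebugSetupExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Option 1 - standalone/local mode: the whole job runs in a single JVM
        // against the local file system, so breakpoints and logs are easy to follow.
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");

        // Option 2 - remote debugging on a cluster: make each map task JVM wait
        // for a debugger on port 8000, then attach the IDE to the slave node
        // running the task attempt.
        // conf.set("mapreduce.map.java.opts",
        //          "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000");

        Job job = Job.getInstance(conf, "debug demo");
        // ... configure and submit the job as usual.
    }
}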

A sample demo

THANK YOU