Hadoop introduction

Hadoop Framework

Transcript of Hadoop introduction

Page 1: Hadoop introduction

Hadoop Framework

Page 2: Hadoop introduction

• Map-Reduce introduction
• Hadoop introduction
• Hadoop Application Architecture
• Developing a typical Hadoop Application
• Practice on Hadoop

Agenda

Page 3: Hadoop introduction

• A programming model specification from Google.
• Typically used to process terabytes (1,024 GB) or petabytes (1,024 TB) of data.
• Breaks large or complex processing into smaller, independent pieces modeled as key-value pairs.
• Runs on a cluster of commodity machines.
• Scales by adding more workers, not a bigger worker.
• Consists of two phases:

– Map: written by the user, takes an input pair and produces a set of intermediate key/value pairs.

– Reduce: aggregates and collates the intermediate results.
– (input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output), as illustrated in the sketch below.

Map-Reduce concept
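
As a minimal sketch of this key/value flow (using plain Java collections rather than the Hadoop API; the class and method names here are purely illustrative), word count maps every word of the input to the pair (word, 1) and reduces each group of values by summing them:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountConcept {

    // map: one input record (a line of text) -> intermediate (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // reduce: one key plus all of its intermediate values -> a single count
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // "shuffle": group the intermediate pairs by key before reducing
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : map("to be or not to be")) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        grouped.forEach((word, values) -> System.out.println(word + "\t" + reduce(word, values)));
    }
}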

Page 4: Hadoop introduction

Map-Reduce flow sample

Page 5: Hadoop introduction

Map-Reduce overall flow

Page 6: Hadoop introduction

• The user program splits the input file into M pieces.
• One copy of the program is the master; the rest are slaves.
• The master selects idle slaves and assigns a map or reduce task to each of them.
• A map slave parses its input piece into key-value pairs and passes each pair to the map function.
• The map slaves write the emitted key-value pairs to buffer memory and local disk, and send these locations to the master.
• The master notifies the reduce slaves of the locations of the intermediate key-value pairs.
• The reduce slaves fetch the key-value pairs and sort them by key.
• For each key, a reduce slave passes the intermediate key and its values to the reduce function.
• The reduce slaves run the reduce function and produce output for the user.
• At the end of the process, the master returns the result and control to the user.

Map-reduce overall flow

Page 7: Hadoop introduction

• An open-source Apache implementation of the Map-Reduce specification, written in Java.

• Distributed processing for large or computationally complex problems.

• Main core tenets:
– Scale out, not up.
– Move processing to the data.
– Expect and embrace failure.

• Normally batch processing over massive data sets.
• Consists of two main parts:

– A data storage layer used for processing (HDFS).
– A parallel processing engine (the MapReduce APIs).

• Current main players: Amazon Elastic MapReduce, Cloudera, MapR, Hortonworks.

Hadoop framework

Page 8: Hadoop introduction

Hadoop Overall Architecture

Page 9: Hadoop introduction

• Used to temporarily store data for Map-Reduce processing.
• A typical file in HDFS is gigabytes to terabytes in size.
• Divides large files into smaller blocks; the default block size is 64 MB.
• Structured like any existing file system: files, directories, permissions.
• Supports Linux-style commands for interaction: ls, rm, put…
• Communicates via the TCP/IP protocol.
• Provides a Java-based API for access (see the sketch below).

Hadoop Distributed File System
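
A minimal sketch of that Java API (assuming the Hadoop client libraries are on the classpath and the cluster configuration is on the default search path; the /user/demo path and file name are purely illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system

        Path dir = new Path("/user/demo");          // illustrative directory
        fs.mkdirs(dir);                             // comparable to "hadoop fs -mkdir"

        try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
            out.writeUTF("hello HDFS");             // comparable to "hadoop fs -put"
        }

        System.out.println("exists: " + fs.exists(new Path(dir, "hello.txt")));
        fs.close();
    }
}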

Page 10: Hadoop introduction

Hadoop Distributed File System

Page 11: Hadoop introduction

Hadoop working model

Page 12: Hadoop introduction

• A client submits a job to Hadoop:
– The job specifies a Mapper, a Reducer, and the list of inputs.
– It is a collection of Java classes packaged into a JAR file (see the submission sketch below).

• The job is sent to the JobTracker process on the master node.
• Each slave node runs a process called the TaskTracker.
• The JobTracker instructs the TaskTrackers and monitors them.
• A map or reduce over a piece of data is a single task.
• A task attempt is an instance of a task running on a slave node.

Hadoop working model
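
A minimal submission sketch using the org.apache.hadoop.mapreduce Job API (the WordCountMapper and WordCountReducer classes are the ones sketched under the programming model below, and the input/output paths come from the command line; all names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count"); // the unit of work submitted to the cluster
        job.setJarByClass(WordCountDriver.class);      // tells Hadoop which JAR to ship to the nodes

        job.setMapperClass(WordCountMapper.class);     // map and reduce classes, packaged in the JAR
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet

        // submit the job; map and reduce tasks are scheduled onto the slave nodes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}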

Page 13: Hadoop introduction

Hadoop Programming model

Page 14: Hadoop introduction

• The Map-Reduce framework relies on the InputFormat of the job to:
– Validate the input specification of the job.
– Split the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
– Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.

• The Mapper task processes its records, producing intermediate key-value pairs and sending them to the reducers via context.write(k, v) (the Mapper.Context class).

• The Reducer reduces a set of intermediate values that share a key to a smaller set of values, and has three primary phases:
– Shuffle: copies the sorted output from each Mapper across the network.
– Sort: sorts the inputs by key (since different Mappers may output the same key).
– Reduce: calls the reduce method defined by the user.

• Hadoop defines "box" classes such as Text (for strings) and IntWritable (for integers) to optimize serialization over the network (see the sketch below).

Hadoop Programming model
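
A minimal Mapper/Reducer pair for word count, showing context.write(k, v) and the Text/IntWritable box classes (a sketch against the org.apache.hadoop.mapreduce API; these are the classes referenced by the driver sketch earlier):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (byte offset, line of text) -> (word, 1) for every word in the line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit an intermediate key-value pair
        }
    }
}

// reduce: (word, [1, 1, ...]) -> (word, total count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}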

Page 15: Hadoop introduction

Hadoop Application Architecture

Page 16: Hadoop introduction

• Using Sqoop or Flume to import/export data between external data sources and HDFS for processing:
– The transfer is executed in map tasks of Hadoop.
– Works with both RDBMS and NoSQL sources.
– Sample: sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root --password pass --table employees

• Using Apache Hive as data warehouse software that facilitates querying and managing large datasets:
– Organizes the data model into tables, rows, columns, and partitions.
– Supports data types such as integer, float, double, string, list, and struct.
– Supports Join, Group, Filter… through built-in operators and functions.

• Using Spring Data (Spring for Apache Hadoop) to simplify developing Apache Hadoop applications:
– Create and configure applications that use MapReduce, Streaming, Hive, Pig, or HBase.
– Integrates with Spring Boot and uses Dependency Injection…

Typical Hadoop Application Architecture

Page 17: Hadoop introduction

Concrete Hadoop Application Architecture

Page 18: Hadoop introduction

• Choose appropriate frameworks for each application:
– Hive or Pig for logged/relational data.
– Sqoop for working with databases; Flume for collecting log data from web servers, because it is event-driven.
– HDFS or HBase for storing temporary data during processing.
– The Crunch APIs for joins/aggregations rather than the raw Hadoop APIs.

• Apply best practices (two of them are shown in the sketch after this list):
– Choose the number of Mappers and Reducers wisely: total mappers or reducers = number of nodes × maximum number of tasks per node.
– Set the number of Reducers to zero if you are not using them.
– Have each Mapper process an optimal amount of data.
– Always use a Combiner if possible, for local aggregation.
– Minimize your mapper output.
– Always write unit tests and run them against a small data set.

Developing a typical Hadoop Application
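
A minimal sketch of how two of these practices look in code, reusing the Job and classes from the earlier driver sketch (the reducer count of 8 is purely illustrative, e.g. 4 nodes with 2 reduce slots each):

job.setCombinerClass(WordCountReducer.class); // local aggregation on the map side before the shuffle
job.setNumReduceTasks(8);                     // e.g. 4 nodes * 2 reduce tasks per node
// for a map-only job, skip the reduce phase entirely:
// job.setNumReduceTasks(0);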

Page 19: Hadoop introduction

• Tune Hadoop using configuration parameters:
– Hadoop provides many parameters for tuning (see the sketch below).

• What to do when a task fails:
– It happens frequently.
– Try again (retries are possible because tasks are idempotent).
– Report the failure.

• Slow tasks:
– Run another attempt of the same task in parallel (speculative execution).

• Apply Java coding best practices.

Developing Typical Hadoop Application
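
A minimal sketch of setting two such parameters programmatically (the property names below are the Hadoop 2.x names for the map retry limit and map-side speculative execution; older releases use different names, so treat them as assumptions to verify against your version):

Configuration conf = new Configuration();
conf.setInt("mapreduce.map.maxattempts", 4);        // retries of a map task before the job is failed
conf.setBoolean("mapreduce.map.speculative", true); // launch a parallel attempt for slow map tasks
Job job = Job.getInstance(conf, "tuned job");       // the rest of the job setup is as before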

Page 20: Hadoop introduction

• Supports standalone, pseudo-distributed, and fully distributed modes.
• Implement a word-count problem.
• Debug a Hadoop program:

– Using log files.
– Using remote debugging.

Setup environment and practice

Page 21: Hadoop introduction

A sample demo

Page 22: Hadoop introduction

THANK YOU