Introduction to YARN: Next Gen Hadoop

Bhupesh Chawda, DataTorrent


Image Source: https://memegenerator.net/instance/64508420

Why YARN

Hadoop v1 (MR1) Architecture

● Job Tracker
  ○ Manages cluster resources
  ○ Job scheduling
  ○ Bottleneck
● Task Tracker
  ○ Per-node agent
  ○ Manages tasks
  ○ Map / Reduce task slots

[Diagram: Clients send job submissions to the JobTracker. TaskTrackers run the tasks and report MapReduce status.]

Limitations with MR1

• Scalability
  ○ Maximum cluster size: 4,000 nodes
  ○ Maximum concurrent tasks: 40,000
• Availability - Job Tracker is a single point of failure (SPOF)
• Resource Utilization - fixed Map / Reduce slots; an idle slot of one type cannot run a task of the other type
• Runs only MapReduce applications
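The resource utilization limitation can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: every task needs exactly one unit of capacity, and the function names `slot_utilization` and `pooled_utilization` are hypothetical, not part of Hadoop.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fraction of slots in use when a slot can only run its own task type."""
    busy = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return busy / (map_slots + reduce_slots)


def pooled_utilization(total_units, map_tasks, reduce_tasks):
    """Fraction of capacity in use when any unit can run any task."""
    busy = min(total_units, map_tasks + reduce_tasks)
    return busy / total_units


# A node with 6 map slots and 4 reduce slots, during a map-only phase
# with 10 pending map tasks and no reduce tasks.
print(slot_utilization(6, 4, 10, 0))   # 0.6: the 4 reduce slots sit idle
print(pooled_utilization(10, 10, 0))   # 1.0: all capacity is usable
```

In the slot model, four of the ten slots stay idle even though map tasks are waiting. A shared pool of capacity avoids this.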

Why YARN (Cont…)

Image Source: memegenerator.net

Introducing YARN

● YARN - Yet Another Resource Negotiator
● Framework that facilitates writing arbitrary distributed processing frameworks and applications.
● YARN applications/frameworks: e.g. MapReduce2, Apache Spark, Apache Giraph, Apache Apex, etc.

Image Source: http://tm.durusau.net/?cat=1525

Hadoop beyond Batch

YARN for better resource utilization

More applications than MapReduce

Image Source: http://tm.durusau.net/?cat=1525

Comparing MapReduce with YARN

MapReduce                 ≈  YARN
Job Tracker               ≈  Resource Manager + Application Master
Task Tracker              ≈  Node Manager
Map Slot / Reduce Slot    ≈  Container

Backward Compatibility Maintained!

● Existing MapReduce jobs run as is on the YARN framework

● No Job Tracker and Task Tracker processes

Hadoop v2 (YARN) Architecture

• Resource Manager
  ○ Manages and allocates cluster resources
  ○ Application scheduling
  ○ Applications Manager
• Node Manager
  ○ Per-machine agent
  ○ Manages life-cycle of container
  ○ Monitors resources
• Application Master
  ○ Per-application
  ○ Manages application scheduling and task execution

Image Source: hadoop.apache.org
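The split between the Resource Manager (cluster-wide allocation) and the Node Manager (per-machine agent) can be sketched with a toy model. This is an illustration only, not the YARN API: the class and method names, the memory-only resource model, and the first-fit placement policy are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy per-machine agent: tracks free memory and hosted containers."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, container_id, mb):
        # Reserve the memory and record the container on this node.
        self.free_mb -= mb
        self.containers.append(container_id)


class ResourceManager:
    """Toy cluster-wide allocator using first-fit placement (an assumption)."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.next_id = 1

    def allocate(self, mb):
        """Return (container_id, node_name), or None if no node has room."""
        for node in self.nodes:
            if node.free_mb >= mb:
                container_id = f"container_{self.next_id}"
                self.next_id += 1
                node.launch(container_id, mb)
                return container_id, node.name
        return None


rm = ResourceManager([NodeManager("node1", 2048), NodeManager("node2", 4096)])
first = rm.allocate(1536)   # fits on node1
second = rm.allocate(1024)  # node1 has only 512 MB left, so node2 is used
third = rm.allocate(8192)   # exceeds the free memory of every node
print(first, second, third)
```

The real scheduler is pluggable and considers more than memory. The sketch only shows that the Resource Manager makes the placement decision while each Node Manager tracks the containers on its own machine.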

Application Submission workflow

[Diagram: A YarnClient talks to the Resource Manager (Applications Manager + Scheduler) on the RM node. Node Managers on the NM nodes host the Application Master and the containers. Heartbeats are marked with a separate arrow style.]

RM = Resource Manager, NM = Node Manager, AM = Application Master

1) Submit application
2) Launch Application Master
3) AM registers with RM
4) AM negotiates for containers
5) Launch container
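The order of the five steps can be traced with a minimal simulation. This is a sketch, not real YARN client code: `ToyCluster` and its methods are hypothetical names, and heartbeats, failures, and resource limits are left out.

```python
class ToyCluster:
    """Records the application submission steps in the order they occur."""

    def __init__(self):
        self.events = []

    def submit_application(self, name, containers_needed):
        # Steps 1 and 2 are driven by the client and the Resource Manager.
        self.events.append(f"1) client submits {name} to RM")
        self.events.append("2) RM asks an NM to launch the AM")
        self._run_application_master(containers_needed)
        return self.events

    def _run_application_master(self, containers_needed):
        # Steps 3 to 5 are driven by the Application Master once it is running.
        self.events.append("3) AM registers with RM")
        self.events.append(f"4) AM negotiates for {containers_needed} containers")
        for i in range(1, containers_needed + 1):
            self.events.append(f"5) NM launches container {i}")


for line in ToyCluster().submit_application("demo-app", 2):
    print(line)
```

The point of the sketch is that the client is involved only in step 1. From step 3 onward the Application Master drives the application.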

Application Masters - One for each Application Type

● MapReduce Application → MapReduce Application Master (already provided by Hadoop as a backward compatibility option for MapReduce)
● Apex Application → Apex Application Master (StrAM) (provided by Apache Apex)
● Flink Application → Flink Application Master
● Giraph Application → Giraph Application Master

Key Takeaways

● YARN enables non-MapReduce applications to run in a distributed fashion
● Each application first asks for a container for the Application Master
  ○ The Application Master then talks to YARN to get the resources needed by the application
  ○ Once YARN allocates the requested containers to the Application Master, the Application Master starts the application components in those containers
● Hadoop is no longer just batch processing!

Image Source: memegenerator.net

References

● Simple YARN code example
  ○ https://github.com/hortonworks/simple-yarn-app
● Document references
  ○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
  ○ http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/
  ○ http://www.slideshare.net/
● Acknowledgements
  ○ Priyanka Gugale, DataTorrent - Some of the slides

Thank You!!

Please send your questions to: [email protected]