YARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster beyond traditional HADOOP
Omkar Joshi, YARN team
Hortonworks, Inc.
About me..
• Software developer at Hortonworks Inc., Palo Alto
• Contributor to the Apache YARN and MAPREDUCE projects
• Worked on resource localization (distributed cache) and security
• Currently working on resource manager restart
Agenda
• Classical HADOOP MAPREDUCE framework
• YARN architecture
• Resource scheduling
• Resource localization (distributed cache)
• Security
• Future work
• How to write a custom application on YARN
• How to contribute to open source
• Q&A
Classical MAPREDUCE Framework
[Diagram: multiple clients communicate with a single Job Tracker, which schedules Map and Reduce tasks on Task Trackers and reports MAPREDUCE job status back to the clients.]
Drawbacks
• Scalability
  – Limited to ~4,000 cluster nodes
  – Maximum ~40,000 concurrent tasks
  – Synchronization in the Job Tracker becomes tricky
• If the Job Tracker fails then everything fails; users have to resubmit all their jobs.
• Very poor cluster utilization
  – Fixed map and reduce slots
Drawbacks contd..
• Lacks support to run and share cluster resources with non-MAPREDUCE applications.
• Lacks support for wire compatibility: all clients need to have the same version.
So what do we need?
• Better scalability: 10K+ nodes, 10K+ jobs
• High availability
• Better resource utilization
• Support for multiple application frameworks
• Support for aggregating logs
• Wire compatibility
• Easy to upgrade the cluster
Thinktank!!
• Let's separate the logic of managing cluster resources from managing the application itself
• All applications, including MAPREDUCE, will run in user land
• Better isolation in a secure cluster
• More fault tolerant
Architecture
• Application: the job submitted by the user
• Application Master: just like the job tracker; for MAPREDUCE it manages all the map and reduce tasks (progress, restart, etc.)
• Container: the unit of allocation (a simple process), replacing the fixed map and reduce slots, e.g. Container 1 = 2 GB, 4 CPU (see the sketch below)
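To make the container-sizing idea concrete, the minimal sketch below expresses "Container 1 = 2 GB, 4 CPU" as a YARN Resource record. It assumes the Hadoop 2.x Java API (Resource.newInstance); nothing here comes from the talk itself.

```java
import org.apache.hadoop.yarn.api.records.Resource;

public class ContainerSizeSketch {
  public static void main(String[] args) {
    // "Container 1 = 2 GB, 4 CPU" expressed as a YARN Resource record:
    // memory in MB, plus a number of virtual cores.
    Resource containerSize = Resource.newInstance(2048, 4);
    System.out.println(containerSize); // e.g. <memory:2048, vCores:4>
  }
}
```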
Architecture contd...
• Resource Manager (RM)
  – Single resource scheduler (pluggable)
  – Stores application state (no need to resubmit an application if the RM restarts)
• Node Manager (NM)
  – One per machine; think of it like a task tracker
  – Manages the container life cycle
  – Aggregates application logs
Architecture contd...
[Diagram: clients submit jobs to the Resource Manager; Node Managers report node status to the RM and host containers, including per-application App Masters; each App Master sends resource requests and application status to the RM and receives MapReduce task status from its containers.]
How does a job get executed?
[Diagram: a client, the Resource Manager (RM), three Node Managers (NM), an App Master container and a task container, annotated with the numbered steps below.]
1. The client submits an application (e.g. MAPREDUCE).
2. The RM asks an NM to start the Application Master (AM).
3. The Node Manager starts the Application Master inside a container (a process).
4. The Application Master first registers with the RM and then keeps requesting new resources. On the same AMRM protocol it also reports the application status to the RM.
5. When the RM allocates a new container to the AM, the AM goes to the specified NM and requests it to launch the container (e.g. a map task).
6. The newly started container then follows the application logic and keeps reporting its progress to the AM.
7. Once done, the AM informs the RM that the application is successful.
8. The RM then informs the NMs about the finished application and asks them to start aggregating logs and cleaning up container-specific files.
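A minimal client-side sketch of step 1, using the YarnClient API from Hadoop 2.x. The application name, the AM command line (com.example.MyAppMaster) and the resource sizes are illustrative placeholders, not something from the talk.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitAppSketch {
  public static void main(String[] args) throws Exception {
    // Step 1: the client talks to the RM to submit an application.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the RM for a new application id and submission context.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("my-yarn-app");

    // Describe the container that will run the Application Master (steps 2-3).
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "java -Xmx256m com.example.MyAppMaster"                             // hypothetical AM class
            + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
            + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
    appContext.setAMContainerSpec(amContainer);

    // Resources for the AM container itself.
    appContext.setResource(Resource.newInstance(512, 1));

    // Hand the request to the RM; the RM then picks an NM to launch the AM on.
    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);
  }
}
```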
Resource Scheduler
• Pluggable (the default is the Capacity Scheduler)
• Capacity Scheduler
  – Hierarchical queues; can think of them as queues per organization
  – User limit (range of resources a user can use)
  – Elasticity
  – Black/white listing of resources
  – Supports resource priorities
  – Security: queue-level ACLs
  – Find more about the Capacity Scheduler (a configuration sketch follows)
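As a rough illustration of hierarchical queues and queue-level settings, the sketch below sets the usual capacity-scheduler properties (normally placed in capacity-scheduler.xml) on a Hadoop Configuration object. The queue names A and B, the percentages and the user list are illustrative values, not taken from the talk.

```java
import org.apache.hadoop.conf.Configuration;

public class CapacitySchedulerConfigSketch {
  public static Configuration queueConfig() {
    Configuration conf = new Configuration(false);
    // Two top-level queues under root, e.g. one per organization.
    conf.set("yarn.scheduler.capacity.root.queues", "A,B");
    conf.set("yarn.scheduler.capacity.root.A.capacity", "40");
    conf.set("yarn.scheduler.capacity.root.B.capacity", "60");
    // Elasticity: queue A may grow into idle cluster capacity up to this limit.
    conf.set("yarn.scheduler.capacity.root.A.maximum-capacity", "80");
    // Per-user limit inside the queue, as a percentage of the queue.
    conf.set("yarn.scheduler.capacity.root.A.minimum-user-limit-percent", "25");
    // Queue-level ACL: who may submit applications to queue A.
    conf.set("yarn.scheduler.capacity.root.A.acl_submit_applications", "alice,bob");
    return conf;
  }
}
```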
Capacity Scheduler
[Diagram: elasticity example with two queues, A (40%) and B (60%). A single application can expand beyond its queue's configured share while the cluster is idle; as more applications (App1 through App4) are submitted, the scheduler rebalances each of them back toward its queue's configured capacity.]
Resource Localization
• When the node manager launches a container it needs the executable (and other files) to run
• Resources (files) to be downloaded should be specified as part of the container launch context (see the sketch below)
• Resource types
  – PUBLIC: accessible to all
  – PRIVATE: accessible to all containers of a single user
  – APPLICATION: accessible only to a single application
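A minimal sketch of attaching an HDFS file to a ContainerLaunchContext so the NM localizes it before starting the container, assuming the Hadoop 2.x Java API; the resource name "my-resource" is an illustrative placeholder.

```java
import java.util.Collections;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public class LocalResourceSketch {
  // Attach one HDFS file to a container launch context; the NM downloads
  // (localizes) it into the right cache before the container starts.
  static ContainerLaunchContext withLocalResource(Path hdfsPath) throws Exception {
    FileSystem fs = FileSystem.get(new YarnConfiguration());
    FileStatus status = fs.getFileStatus(hdfsPath);

    LocalResource resource = Records.newRecord(LocalResource.class);
    resource.setResource(ConverterUtils.getYarnUrlFromPath(hdfsPath));
    resource.setType(LocalResourceType.FILE);                    // ARCHIVE would be unpacked
    resource.setVisibility(LocalResourceVisibility.APPLICATION); // or PUBLIC / PRIVATE
    resource.setSize(status.getLen());
    resource.setTimestamp(status.getModificationTime());         // NM verifies size and timestamp

    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setLocalResources(Collections.singletonMap("my-resource", resource));
    return ctx;
  }
}
```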
Resource Localization contd..
• The public localizer downloads public resources (owned by the NM).
• Private localizers download private and application resources (owned by the user).
• A per-user quota is not supported yet.
• LRU cache with configurable size.
• As soon as a resource is localized it loses any connection with the remote location.
• The public localizer supports parallel downloads, whereas a private localizer supports only limited parallel downloads.
Resource Localization contd..
[Diagram: the AM requests two resources while starting a container, R1 (PUBLIC) and R2 (APPLICATION). The public localizer downloads R1 from HDFS into the NM-owned public cache; a private localizer downloads R2 into the user's cache, which is split into a per-user private cache and per-application app caches.]
Security
• Not all users can be trusted; confidential data and application data need to be protected.
• The Resource Manager and Node Managers are started as the "yarn" (super) user.
• All applications and containers run as the user who submitted the job.
• The LinuxContainerExecutor is used to launch user processes (see container-executor.c and the configuration sketch below).
• Private localizers also run as the application user.
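A sketch of the yarn-site.xml settings (expressed here through a Hadoop Configuration object) that switch the NM from the default executor to the LinuxContainerExecutor, so containers run as the submitting user rather than the "yarn" user; the group name "hadoop" is an illustrative value.

```java
import org.apache.hadoop.conf.Configuration;

public class SecureExecutorConfigSketch {
  public static Configuration secureNodeManagerConfig() {
    Configuration conf = new Configuration(false);
    // Launch containers through the setuid binary built from container-executor.c.
    conf.set("yarn.nodemanager.container-executor.class",
        "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor");
    // Group that owns the container-executor binary.
    conf.set("yarn.nodemanager.linux-container-executor.group", "hadoop");
    return conf;
  }
}
```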
Security contd..
• Kerberos (TGT) is used while submitting the job.
• AMRMToken: for the AM to talk to the RM.
• NMToken: for the AM to talk to an NM to launch new containers.
• ContainerToken: the way for the RM to pass container information (resources and user) from the RM to the NM via the AM.
• LocalizerToken: used by the private localizer during resource localization.
• RMDelegationToken: useful when Kerberos (TGT) is not available.
Security contd..
[Diagram: the job-execution sequence from earlier, annotated with the token used at each step: Kerberos (TGT) for job submission, AMRMToken for AM-RM communication, NMToken and ContainerToken for launching containers on NMs, and the LocalizerToken passed as part of launching a container for resource localization.]
Resource manager restart
• Saves application state.
• Support for ZooKeeper- and HDFS-based state stores (a configuration sketch follows).
• Can recover applications from the saved state; no need to resubmit the application.
• Today only a non-work-preserving mode is supported.
• Lays the foundation for RM HA.
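A rough sketch of the properties that enable RM restart with a ZooKeeper-based state store. The property names follow the Hadoop 2.x documentation (they may differ slightly across versions) and the ZooKeeper address is an illustrative value.

```java
import org.apache.hadoop.conf.Configuration;

public class RMRestartConfigSketch {
  public static Configuration rmRecoveryConfig() {
    Configuration conf = new Configuration(false);
    // Persist application state so the RM can recover it after a restart.
    conf.set("yarn.resourcemanager.recovery.enabled", "true");
    conf.set("yarn.resourcemanager.store.class",
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
    conf.set("yarn.resourcemanager.zk-address", "zk1.example.com:2181"); // illustrative
    return conf;
  }
}
```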
The YARN paper received the best paper award!! :)
YARN-paper
Future work
• RM restart
  – Non-work-preserving mode: almost done
  – Work-preserving mode: needs more effort
• RM HA: just started
• Task / container preemption
• Rolling upgrades
• Support for long-running services
Different applications already running on YARN
• Apache Giraph (graph processing)
• Spark (real-time processing)
• Apache Tez
• MapReduce (MRv2)
• Apache HBase (HOYA)
• Apache Helix (incubator project)
• Apache Samza (incubator project)
• Storm
Writing an application on YARN
• Take a look at the Distributed Shell example
• Write an Application Master which, once started, will
  – First register itself with the RM on the AMRM protocol
  – Keep heartbeating and requesting resources via "allocate"
  – Use the container management protocol to launch further containers on NMs
  – Once done, notify the RM via finishApplicationMaster
• Always use AMRMClient and NMClient while talking to the RM / NM (see the sketch below).
• Use the distributed cache wisely.
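A stripped-down Application Master along those lines, using the AMRMClient and NMClient APIs from Hadoop 2.x. The requested container size and the "sleep 10" command are illustrative placeholders; a real AM would also track completed containers and handle failures.

```java
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class AppMasterSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Register with the RM on the AMRM protocol.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();

    // Ask the RM for one 1 GB / 1 vcore container.
    Resource capability = Resource.newInstance(1024, 1);
    rmClient.addContainerRequest(
        new ContainerRequest(capability, null, null, Priority.newInstance(0)));

    // Heartbeat via allocate() until the RM hands over the container, then launch it on its NM.
    boolean launched = false;
    while (!launched) {
      List<Container> allocated = rmClient.allocate(0.1f).getAllocatedContainers();
      for (Container container : allocated) {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList("sleep 10")); // placeholder task
        nmClient.startContainer(container, ctx);
        launched = true;
      }
      Thread.sleep(1000);
    }

    // ... wait for the work to finish, then tell the RM the application succeeded.
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```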
Want to contribute to Open source?
• Follow this post
• Subscribe to the Apache user and yarn dev/issues mailing lists (link)
• Track YARN issues
• Post your questions on the user mailing list
  – Try to be specific and add more information to get better and quicker replies
  – Try to be patient
• Start with simple tickets to get an idea about the underlying component
Thank You!!
Questions??
Check out the blog on YARN