Transcript of June 10 145pm hortonworks_tan & welch_v2
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Enabling diverse workload scheduling in YARN
June, 2015
Wangda Tan, Hortonworks ([email protected])
Craig Welch, Hortonworks ([email protected])
Page 2
About us
Wangda Tan
• Last 5+ years in the big data field: Hadoop, Open-MPI, etc.
• Past
– Pivotal (PHD team, brought OpenMPI/GraphLab to YARN)
– Alibaba (ODPS team, platform for distributed data mining)
• Now
– Apache Hadoop Committer @Hortonworks, all in YARN
– Spending most of his time on resource scheduling enhancements
Craig Welch
• YARN Contributor
Page 3
Hadoop+YARN is the home of
big data processing.
Page 4
Our workloads vary: Service | Batch | Interactive/Real-time
Page 5
They have different (crazy!) requirements:
• "I wanna be fast!"
• "When the cluster is busy, don't take away MY RESOURCES!"
• "A huge job needs to be scheduled at a special time."
Page 6
We want to make them
AS HAPPY AS POSSIBLE
to run together in YARN.
Page 7
Let’s start…
Page 8
Agenda today
• Overview
• Node Label
• Resource Preemption
• Reservation system
• Pluggable behavior for Scheduler
• Docker support
• Resource scheduling beyond memory
Page 9
Overview
Page 10
Background
• Resources are managed by a hierarchy of queues.
• One queue can have multiple applications.
• A container is the result of resource scheduling: a bundle of resources on which process(es) can run.
Page 11
How to manage your workload by queues
• By organization
– Marketing/Finance queues
• By workload
– Interactive/Batch queues
• Hybrid
– Finance-batch / Marketing-realtime queues
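A hierarchy like this is expressed in the Capacity Scheduler's capacity-scheduler.xml. A minimal sketch of the hybrid layout above, assuming hypothetical queue names and capacity percentages (not values from the talk):

```xml
<!-- Hypothetical sketch: a "hybrid" hierarchy with per-organization
     top-level queues, split further by workload. Names and capacity
     percentages are illustrative only. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>finance,marketing</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.finance.queues</name>
  <value>batch,interactive</value>
</property>
<property>
  <!-- percent of the parent's capacity guaranteed to this queue -->
  <name>yarn.scheduler.capacity.root.finance.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.marketing.capacity</name>
  <value>40</value>
</property>
```

Sibling capacities under one parent must sum to 100; each child's guarantee is a share of its parent's.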
Page 12
Node Label
Page 13
Node Label – Overview
• Types of node labels
– Node partition (since 2.6)
– Node constraints (WIP)
• Node partition (today's focus)
– One node belongs to only one partition
– Related to resource planning
• Node constraints
– One node can be assigned multiple constraints
– Not related to resource planning
Page 14
Node partition – Resource planning
• Nodes belong to the "default partition" if no partition is specified.
• It's possible to specify different queue capacities on different partitions.
– For example, the sales queue can use different amounts of resource on the GPU partition and the default partition.
• It's possible to specify that a partition may only be used by certain queues (ACLs for partitions).
– For example, only the sales queue can access the "large memory" partition.
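As a sketch of how this planning looks in capacity-scheduler.xml, here is a hypothetical "sales" queue given access to a "gpu" partition with its own capacity there (queue name, label name, and percentages are illustrative):

```xml
<!-- Hypothetical sketch: queue "sales" gets 20% of the default
     partition and 50% of the "gpu" partition. -->
<property>
  <!-- which labeled partitions this queue may use -->
  <name>yarn.scheduler.capacity.root.sales.accessible-node-labels</name>
  <value>gpu</value>
</property>
<property>
  <!-- this queue's guaranteed share of the "gpu" partition -->
  <name>yarn.scheduler.capacity.root.sales.accessible-node-labels.gpu.capacity</name>
  <value>50</value>
</property>
<property>
  <!-- this queue's share of the default (unlabeled) partition -->
  <name>yarn.scheduler.capacity.root.sales.capacity</name>
  <value>20</value>
</property>
```

Restricting a partition to one queue (the ACL case) amounts to listing that label in only that queue's accessible-node-labels.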
Page 15
Node partition – Exclusive vs. Non-exclusive
Page 16
Node Partition – Use cases & best practice
• Dedicate nodes to run important services.
– E.g., running HBase region servers using Apache Slider
• Reserve nodes with special hardware for specific organizations.
– E.g., you may want a queue dedicated to the marketing department to use 80% of these memory-heavy nodes.
• Use non-exclusive node partitions for better resource utilization.
• Be careful about user limits, capacity, etc., to make sure jobs can be launched.
I will cover more details about implementation and usage in Thursday morning's session "YARN Node Labels" with Mayank Bansal from eBay.
Page 17
Resource Preemption
Page 18
Resource Preemption – Overview
• Each queue has a configured minimum resource.
• The preemption policy (which performs the actual preempting of resources) uses this minimum to ensure that:
– When a queue is under its minimum resource and the cluster has no available resources, the policy can reclaim resource from queues that are using more than their minimum.
Page 19
Resource Preemption – Example
• When preemption is not enabled
• When preemption is enabled
Page 20
Resource Preemption – best practice
• Configurations to control the pace of preemption:
– yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill
– yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round
– yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor
• Configurations to control when or if preemption happens:
– yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity (deadzone)
– yarn.scheduler.capacity.<queue-path>.disable_preemption
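For context, preemption is driven by a scheduling monitor that must be switched on in yarn-site.xml before any of the knobs above take effect. A sketch, with illustrative (not recommended) values:

```xml
<!-- Sketch: enable the Capacity Scheduler preemption monitor.
     The values below are examples, not tuning advice. -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>
<property>
  <!-- grace period (ms) between marking a container for preemption
       and killing it, giving the app a chance to checkpoint/exit -->
  <name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
  <value>15000</value>
</property>
<property>
  <!-- at most this fraction of total cluster resource is preempted
       in a single round -->
  <name>yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round</name>
  <value>0.1</value>
</property>
```

Lower total_preemption_per_round and a higher natural_termination_factor make preemption gentler; the deadzone avoids churn over small over-capacity amounts.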
Page 21
Reservation System
Page 22
Reservation System – Overview
• Reserving resources ahead of time, just like booking a table in a restaurant:
– "I need a table for X people at Y time."
– "Wait a moment … Reservation confirmed, sir."
– (After some time) "Your table is ready."
• What the Reservation System does:
– The client sends a reservation request
– The RM checks its time table
– The RM sends back a reservation confirmation ID
– The RM notifies the client when the reservation is ready
• Enables more predictable start and run times for time-critical / resource-intensive applications
Page 23
Reservation System – Use cases
• Gang scheduling
– Currently, YARN applications can only do gang scheduling from the application side (holding resources until the requirements are met).
– Resources can be wasted this way, and there is a risk of deadlock.
– The Reservation System lays the foundation for proper gang scheduling.
• Workflow support
– I want to run jobs in stages:
– Stage 1 at 1 AM tomorrow, needs 10k containers
– Stage 2 after stage 1, needs 5k containers
– Stage 3 after stage 2, needs 2k containers
– You can submit such requests to the Reservation System!
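On the configuration side, my understanding from the Reservation System work (YARN-1051) is that a Capacity Scheduler queue must be marked reservable before the planner will accept reservations against it. A sketch, with a hypothetical queue name:

```xml
<!-- Sketch, based on my reading of the Reservation System docs:
     mark queue "dedicated" (hypothetical name) as reservable so the
     RM's planner can place reservations over its capacity. -->
<property>
  <name>yarn.scheduler.capacity.root.dedicated.reservable</name>
  <value>true</value>
</property>
```

Applications then attach the reservation ID returned by the RM when submitting, so their containers are drawn from the reserved capacity.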
Page 24
Reservation System – Result & References
• Before & after the Reservation System (reports from MSR)
– It increased cluster utilization significantly!
• References
– Design / discussion / report: YARN-1051
– More detail and examples: YARN-2609
Page 25
Pluggable scheduler behavior
Page 26
Why
• Problem
– It's difficult to share functionality between schedulers
– Users cannot achieve the same behavior with all schedulers
– Fixes and enhancements tend to end up in one scheduler, not all, leading to fragmentation
– No simple mechanism exists to mix behaviors for a given feature in a single cluster
• Solution
– Move to sharable, pluggable scheduler behavior
Page 27
How
• The goal: recast scheduler behavior as policies. Candidates include:
– Resource limits for apps, users, ...
– Ordering for allocation and preemption
• With this, we can:
– Maximize feature availability and reduce fragmentation
– Configure different queues for different workloads in a single cluster
Flexible scheduler configuration, as simple as building with Legos!
Page 28
Ordering Policy of Capacity Scheduler
• Pluggable ordering policies for LeafQueues in the Capacity Scheduler
– Enables different policies for ordering the assignment and preemption of containers for applications
– Initial implementations include FIFO (the Capacity Scheduler's original behavior) and Fair
– User limits and queue capacity limits are still respected
• Fair scheduling inside the Capacity Scheduler
– Based on the fair-sharing logic in the FairScheduler
– Assigns containers to applications in order of least to greatest resource usage
– Allows many applications to make progress concurrently
– Lets short jobs finish in a reasonable time while not starving long-running jobs
Page 29
Configuration and tuning
• Rough guidelines for when to use the Fair and FIFO ordering policies:

  Workload                               Policy
  On-demand/interactive/exploratory      Fair
  Predictable/recurring batch            FIFO
  Mix of the above two                   Fair

• Configuration
– yarn.scheduler.capacity.<queue>.ordering-policy ("fifo" or "fair", default "fifo")
– yarn.scheduler.capacity.<queue>.ordering-policy.fair.enable-size-based-weight (true or false)
• Tuning
– Use max-am-resource-percent to avoid "peanut buttering" from having too many apps running at once
– Sometimes it's necessary to separate large and small apps into different queues, or to use size-based-weight, to avoid large-app starvation
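Putting the two properties named above into capacity-scheduler.xml form, a minimal sketch (the queue name "interactive" is hypothetical):

```xml
<!-- Sketch: switch one leaf queue to fair ordering, weighting small
     apps ahead of large ones. Queue name is illustrative. -->
<property>
  <name>yarn.scheduler.capacity.root.interactive.ordering-policy</name>
  <value>fair</value>
</property>
<property>
  <!-- factor resource usage by app size so large apps are not starved -->
  <name>yarn.scheduler.capacity.root.interactive.ordering-policy.fair.enable-size-based-weight</name>
  <value>true</value>
</property>
```

Since the policy is per queue, a single cluster can run a Fair-ordered interactive queue next to a FIFO-ordered batch queue.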
Page 30
Docker container support
Page 31
Docker container support – Overview
• Containers for the cluster
– Brings the sandboxing and dependency isolation of container technology to Hadoop
– Containers make it simple to use Hadoop resources for a wider range of applications
Page 32
Docker container support – Status
• Done
– (V1) An initial implementation translating Kubernetes requests into an Application Master launching Docker containers on the cluster met with success.
– (V2) A custom container launcher for Docker containers. This brought the capability more fully under YARN's management, but a single cluster could not support both traditional YARN applications (MapReduce, etc.) and Docker concurrently.
• Next phase
– (V3, WIP) Adding support for running Docker and traditional YARN applications side by side in a single cluster
Page 33
It’s not all about memory
Page 34
It’s not all about Memory - CPU
• What's in a CPU
– Some workloads are CPU-intensive; without accounting for this, nodes may end up CPU-bound, or CPU may be underutilized cluster-wide.
– CPU awareness at the scheduler level is enabled by selecting the DominantResourceCalculator.
– Dominant? "Dominant" stands for the "dominant factor", or the "bottleneck". In simplified terms, the resource type which is the most constrained becomes the dominant factor for any given comparison or calculation.
– For example, if there is enough memory but not enough CPU for a resource request, the CPU component is dominant (and the answer is "no").
– See https://www.cs.berkeley.edu/~alig/papers/drf.pdf for more detail.
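Selecting the calculator is a one-property change in capacity-scheduler.xml; a sketch:

```xml
<!-- Sketch: make the Capacity Scheduler compare resources by their
     dominant (most constrained) component instead of memory only. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```

The default, DefaultResourceCalculator, considers memory alone, which is why CPU-heavy requests can oversubscribe cores without this change.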
Page 35
It’s not all about Memory – CPU - Vcores
• What's in a CPU
– The unit used to abstract CPU capability in YARN is the vcore.
– Vcore counts are configured per node in yarn-site.xml, typically 1:1 vcore to physical CPU.
– If some nodes' CPUs outclass other nodes', the number of vcores per physical CPU can be adjusted upward to compensate.
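A sketch of the per-node setting in yarn-site.xml (the value 16 is illustrative):

```xml
<!-- Sketch: advertise this NodeManager's CPU capacity as vcores.
     A node with notably faster cores could advertise proportionally
     more vcores per physical CPU to compensate. -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>
```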
Page 36
Q & A
?