Wei's notes on MapReduce Scheduling
1. Wei's Notes on Map-Reduce Job Scheduling
Feb 2011
2. [Map-Reduce] Workflow
Master splits a job into small chunks (SIMD model)
Master assigns chunks to slaves with available mapper slots (taking
data locality into account)
Mapper collects the required data and passes it through the
user-defined mapper function
Mapper writes intermediate results to local disk, reports to Master
with the location of the results
Master records status, picks slaves with available reducer slots, and
pushes over the location info for the reduce phase (*locality? Yes!)
Reducer copies data from mappers via RPC, waits for all mappers to
finish, then sorts by intermediate key and finally passes the data
through the user-defined reducer function
Reducer writes final output to DFS, reports to Master
3. [Map-Reduce] Data flow
Raw
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v2) *why not v3?
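A minimal single-process sketch of the Map/Reduce signatures above, using word count as the example. The names run_mapreduce, map_fn and reduce_fn are made up for illustration, and the in-memory grouping only stands in for the real shuffle/sort between mappers and reducers.

```python
from collections import defaultdict

def map_fn(k1, v1):
    # Map(k1, v1) -> list(k2, v2): k1 is a document name, v1 its text.
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    # Reduce(k2, list(v2)) -> list(v2): sum the counts for one word.
    return [sum(values)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Group intermediate values by key (stand-in for shuffle and sort).
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Apply the reducer to each key in sorted key order.
    return {k2: reduce_fn(k2, vs) for k2, vs in sorted(groups.items())}

result = run_mapreduce([("a.txt", "x y x"), ("b.txt", "y z")], map_fn, reduce_fn)
print(result)
```

Here the reducer output has the same type as the intermediate values (an int count), which matches the list(v2) in the signature.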
4. [Map-Reduce] Fault Tolerance
Upon machine failure:
5. [Map-Reduce] To-Dos
Splitting:
When: upon arrival or upon head-of-queue
How is size M determined? (based on chunk size)
Chunks can be processed in parallel by different machines
Cost of re-execution
Map & reduce
6. [Fair Scheduler] 3-phase allocation
Satisfy the pool whose min share >= demand
Allocate resources to the other pools up to their min share
Residual given to the unfilled, starting with the least
fulfilled
Notes
Resource allocation is pool based instead of job based
Pool: min share is user specified
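A rough sketch of the 3-phase allocation as described in these notes, not the actual Hadoop Fair Scheduler code. The function name allocate and the (min_share, demand) tuple layout are made up, and "least fulfilled" is assumed here to mean the pool currently holding the fewest slots.

```python
def allocate(total_slots, pools):
    """pools: dict name -> (min_share, demand). Returns dict name -> slots."""
    alloc = {name: 0 for name in pools}
    free = total_slots

    # Phase 1: pools whose min share covers their demand get their full demand.
    for name, (min_share, demand) in pools.items():
        if min_share >= demand:
            give = min(demand, free)
            alloc[name] = give
            free -= give

    # Phase 2: every other pool is filled up to its min share.
    for name, (min_share, demand) in pools.items():
        if min_share < demand:
            give = min(min_share, free)
            alloc[name] = give
            free -= give

    # Phase 3: hand out the residual one slot at a time to the pool with
    # unmet demand that currently holds the fewest slots (assumption).
    while free > 0:
        unfilled = [n for n, (_, d) in pools.items() if alloc[n] < d]
        if not unfilled:
            break
        target = min(unfilled, key=lambda n: alloc[n])
        alloc[target] += 1
        free -= 1
    return alloc

example = allocate(10, {"A": (2, 1), "B": (3, 6), "C": (0, 8)})
print(example)
```

Note that allocation is per pool; how a pool divides its slots among its own jobs is a separate decision.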
7. [Fair Scheduler] reschedule
Policy: wait & kill
Algorithm:
Wait Tmin. If min share not achieved, kill others
Wait Tfair. If fair share not achieved, kill more.
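The wait & kill policy can be sketched as a function that decides how many slots to reclaim for one pool. This is an illustration under assumptions: the name tasks_to_preempt and all parameters are hypothetical, and the starved_* arguments stand for how long the pool has been below its min share or fair share.

```python
def tasks_to_preempt(running, min_share, fair_share, demand,
                     starved_min_for, starved_fair_for, t_min, t_fair):
    """Return how many tasks of other pools to kill on behalf of this pool."""
    target = 0
    # After waiting Tmin below min share, claim up to the min share.
    if starved_min_for >= t_min:
        target = max(target, min(min_share, demand))
    # After waiting Tfair below fair share, claim up to the fair share.
    if starved_fair_for >= t_fair:
        target = max(target, min(fair_share, demand))
    # Only the gap between the target and what is already running is killed.
    return max(0, target - running)

print(tasks_to_preempt(1, 3, 5, 10, 20, 20, t_min=15, t_fair=60))
```

Capping the target at demand keeps the scheduler from killing tasks for a pool that could not use the freed slots.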
8. [Fair Scheduler] Issues & Solutions
Data Locality
Delay scheduling: address sticky slots issue
IO-rate biasing: address hotspot node
Map/Reduce interdependency
Copy-Compute Splitting: overlap the IO-intensive copy with the
CPU-intensive reduce
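A simplified sketch of delay scheduling, to make the idea concrete: a job with no node-local task for the free slot is skipped, and only allowed to run non-locally once it has waited max_delay. The dict layout, pick_task and max_delay are assumptions for illustration, not the scheduler's real data structures.

```python
def pick_task(jobs, node, now, max_delay):
    """jobs: list in scheduling order; each job is a dict with
    'name', 'pending' (list of {'id', 'nodes'}) and 'waiting_since'."""
    for job in jobs:
        if not job["pending"]:
            continue
        local = [t for t in job["pending"] if node in t["nodes"]]
        if local:
            # A node-local task exists: launch it and reset the wait clock.
            job["waiting_since"] = None
            task = local[0]
            job["pending"].remove(task)
            return job["name"], task["id"]
        # No local task: start (or continue) waiting.
        if job["waiting_since"] is None:
            job["waiting_since"] = now
        if now - job["waiting_since"] >= max_delay:
            # Waited long enough: give up on locality for this task.
            task = job["pending"].pop(0)
            return job["name"], task["id"]
        # Otherwise skip this job and let the next one try the slot.
    return None

jobs = [
    {"name": "j1", "pending": [{"id": "t1", "nodes": {"n2"}}], "waiting_since": None},
    {"name": "j2", "pending": [{"id": "t2", "nodes": {"n1"}}], "waiting_since": None},
]
first = pick_task(jobs, "n1", now=0, max_delay=5)
second = pick_task(jobs, "n1", now=3, max_delay=5)
third = pick_task(jobs, "n1", now=5, max_delay=5)
print(first, second, third)
```

In the run above, j1 is skipped at time 0 in favour of j2's local task, finds nothing at time 3, and is finally launched non-locally at time 5.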
9. [Fair Scheduler] Tradeoffs
Batch response time: fairness vs. utilization tradeoff
(throughput)
Average Response Time
Space Usage with Intermediate Data
User Isolation: ability to provide worst-case performance
comparable to owning a small private cluster regardless of user
workload
10. [Fair Scheduler] To-Dos
Reschedule/Reassignment
Every UPDATE_INTERVAL, FairScheduler checks all pools for tasks to
preempt, sets the status of those tasks, and places them in the
action queue.
The next heartbeat picks up the changes in task status and carries
out the kills.
Relationship between batch response time and throughput: they
measure the same thing.
Relationship between average response time and user isolation:
could be correlated, but not all the time. ART is not a
quantitative measurement of user isolation
11. [Quincy]
Model the problem as a flow network
Flow network: a directed graph in which
each edge e is annotated with a non-negative integer capacity and a
cost, and
each node v is annotated with an integer supply, where the total
supply of the graph equals zero
To do: construct the simplest graph, with no starvation as the only
hard constraint
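A small sketch of the flow-network definition above, with a toy task-to-machine assignment. This is far simpler than the graph Quincy actually builds; the node names, costs, and the helpers validate_network and flow_cost are all made up for illustration. Positive supply marks a source (a task), negative supply the sink.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    src: str
    dst: str
    capacity: int  # non-negative integer capacity
    cost: int      # cost per unit of flow

def validate_network(supply, edges):
    # Total supply over all nodes must be zero.
    if sum(supply.values()) != 0:
        raise ValueError("total supply must equal zero")
    for e in edges:
        if e.capacity < 0:
            raise ValueError("capacity must be non-negative")
        if e.src not in supply or e.dst not in supply:
            raise ValueError("edge references unknown node")

def flow_cost(supply, edges, flow):
    """Check that flow (one int per edge) is feasible; return its total cost."""
    balance = dict(supply)
    total = 0
    for e, f in zip(edges, flow):
        if not 0 <= f <= e.capacity:
            raise ValueError("flow violates capacity")
        balance[e.src] -= f
        balance[e.dst] += f
        total += f * e.cost
    if any(b != 0 for b in balance.values()):
        raise ValueError("flow does not satisfy node supplies")
    return total

# Two tasks, two machines, one sink; lower cost means better data locality.
supply = {"t1": 1, "t2": 1, "m1": 0, "m2": 0, "sink": -2}
edges = [
    Edge("t1", "m1", 1, 0), Edge("t1", "m2", 1, 5),
    Edge("t2", "m1", 1, 4), Edge("t2", "m2", 1, 1),
    Edge("m1", "sink", 1, 0), Edge("m2", "sink", 1, 0),
]
validate_network(supply, edges)
local_cost = flow_cost(supply, edges, [1, 0, 0, 1, 1, 1])
remote_cost = flow_cost(supply, edges, [0, 1, 1, 0, 1, 1])
print(local_cost, remote_cost)
```

A min-cost flow solver would pick the first assignment (cost 1) over the second (cost 9); the solver itself is left out of this sketch.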
12. Quincy vs. Fair Scheduler
13. Readings
MapReduce. Jeffrey Dean*
Google: Cluster Computing and MR
Job Scheduling for Multi-User. Matei Zaharia*
Max-min fairness. Wikipedia + algo*
Quincy. Michael Isard*
An update on Google's infrastructure
14. Topic
Before: Existing systems predetermine and fix the allocation of
resources/slots to queries/tasks. Intuitively, if resources can be
allocated to tasks dynamically, they can be better utilized.
After: Enable the scheduler to make resource-aware decisions (IO,
CPU, memory) + bring the fair scheduler from pool level to job level.
15. Tips from Prof Tan
Keep references for all the literature reviewed and note where each
item was published