Wei's notes on MapReduce Scheduling

1. Wei's Notes on Map-Reduce Job Scheduling
Feb 2011

2. [Map-Reduce] Workflow
Master splits a job into small chunks (SIMD-style model)
Master assigns chunks to slaves with available mapper slots, taking data locality into account
Mapper collects the required data and passes it through the user-defined map function
Mapper writes intermediate results to local disk and reports the location of the results to the Master
Master records the status, picks slaves with available reducer slots, and pushes the location info to them for the reduce phase (*locality? Yes!)
Reducer copies data from the mappers via RPC, waits for all mappers to finish, sorts by intermediate key, and finally passes the data through the user-defined reduce function
Reducer writes the final output to DFS and reports to the Master
3. [Map-Reduce] Data flow
Raw
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v2) *why not v3?
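The Map and Reduce signatures above can be illustrated with a minimal single-process sketch. Word count is used as the example job; `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for illustration, and the in-memory grouping stands in for the copy-and-sort step that a real cluster does across machines.

```python
from collections import defaultdict


def map_fn(k1, v1):
    """Map(k1, v1) -> list(k2, v2): k1 is a document name, v1 its text."""
    return [(word, 1) for word in v1.split()]


def reduce_fn(k2, values):
    """Reduce(k2, list(v2)) -> list(v2): sum the counts for one word."""
    return [sum(values)]


def run_job(inputs, map_fn, reduce_fn):
    """Run map, group by intermediate key, then reduce (all in memory)."""
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)
    # Reducers see keys in sorted order, as in the workflow notes.
    return {k2: reduce_fn(k2, intermediate[k2]) for k2 in sorted(intermediate)}


result = run_job([("d1", "a b a"), ("d2", "b c")], map_fn, reduce_fn)
```

In this sketch the reducer output has the same type as the intermediate values (integer counts), which matches the list(v2) return type in the signature.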
4. [Map-Reduce] Fault Tolerance
Upon machine failure:
5. [Map-Reduce] To-Dos
Splitting:
When: upon arrival or upon reaching the head of the queue
How is the size M determined? (based on chunk size)
Chunks can be processed in parallel by different machines
Cost of re-execution
Map & reduce
6. [Fair Scheduler] 3-phase allocation
Fully satisfy the pools whose min share >= demand
Allocate resources to the other pools up to their min share
Give the residual to the unfilled pools, starting with the least fulfilled
Notes
Resource allocation is pool based instead of job based
Pool: min share is user specified
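The three allocation phases on this slide can be sketched as follows. This is a rough illustration, not the Hadoop implementation: `allocate` is a hypothetical function, each pool is given as a (min share, demand) pair, and "least fulfilled" is assumed to mean the lowest ratio of allocated slots to demand.

```python
def allocate(total_slots, pools):
    """pools maps pool name -> (min_share, demand); returns name -> slots."""
    alloc = {name: 0 for name in pools}
    remaining = total_slots

    # Phase 1: pools whose demand fits within their min share get all of it.
    for name, (min_share, demand) in pools.items():
        if demand <= min_share:
            give = min(demand, remaining)
            alloc[name] = give
            remaining -= give

    # Phase 2: every other pool is filled up to its min share.
    for name, (min_share, demand) in pools.items():
        if demand > min_share:
            give = min(min_share, remaining)
            alloc[name] = give
            remaining -= give

    # Phase 3: hand out the residual one slot at a time to the pool with
    # the lowest allocation/demand ratio (an assumed fulfilment measure).
    while remaining > 0:
        unfilled = [n for n in pools if alloc[n] < pools[n][1]]
        if not unfilled:
            break
        neediest = min(unfilled, key=lambda n: alloc[n] / pools[n][1])
        alloc[neediest] += 1
        remaining -= 1
    return alloc


shares = allocate(10, {"A": (2, 1), "B": (3, 6), "C": (3, 10)})
```

Here pool A is fully satisfied in phase 1, B and C reach their min share in phase 2, and the last three slots are split between B and C in phase 3.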
7. [Fair Scheduler] reschedule
Policy: wait & kill
Algorithm:
Wait Tmin. If the min share is still not achieved, kill tasks of other pools.
Wait Tfair. If the fair share is still not achieved, kill more.
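The wait-and-kill policy above can be sketched as a function that decides how many slots to reclaim for one pool. `PoolState`, `starved_since`, and `slots_to_reclaim` are hypothetical names, and the sketch assumes Tmin <= Tfair and a single starvation timestamp per pool; the real scheduler's bookkeeping may differ.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    running: int          # tasks currently running for this pool
    min_share: int        # user-specified minimum share
    fair_share: int       # computed fair share
    starved_since: float  # time the pool first fell below its share


def slots_to_reclaim(pool, now, t_min, t_fair):
    """Number of slots to free by killing other pools' tasks."""
    waited = now - pool.starved_since
    # After Tfair, escalate: reclaim up to the fair share.
    if waited >= t_fair and pool.running < pool.fair_share:
        return pool.fair_share - pool.running
    # After Tmin, reclaim only up to the min share.
    if waited >= t_min and pool.running < pool.min_share:
        return pool.min_share - pool.running
    return 0


pool = PoolState(running=1, min_share=3, fair_share=5, starved_since=0.0)
```

With Tmin = 10 and Tfair = 30, this pool reclaims nothing at time 5, two slots at time 15, and four slots at time 40.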
8. [Fair Scheduler] Issues & Solutions
Data Locality
Delay scheduling: addresses the sticky-slots issue
IO-rate biasing: addresses hotspot nodes
Map/Reduce interdependency
Copy-Compute Splitting: overlapping IO intensive copy and CPU intensive reducing
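Delay scheduling is only named on this slide, so the following is a minimal sketch of its skip-count form as I understand it, not a definitive implementation. `pick_task`, `max_skips`, and the job dictionaries are hypothetical: each job holds a list of pending tasks, each task carries the set of nodes that store its input, and a job is allowed a non-local launch only after it has been skipped `max_skips` times.

```python
def pick_task(jobs, node, max_skips):
    """Pick a task for a free slot on `node`; jobs are in fairness order."""
    for job in jobs:
        # Prefer a task whose input data lives on this node.
        for task in job["pending"]:
            task_id, local_nodes = task
            if node in local_nodes:
                job["pending"].remove(task)
                job["skips"] = 0
                return task_id
        # No local task: skip the job unless it has waited long enough.
        if job["pending"]:
            if job["skips"] >= max_skips:
                task_id, _ = job["pending"].pop(0)
                return task_id
            job["skips"] += 1
    return None


jobs = [
    {"pending": [("a1", {"n2"})], "skips": 0},
    {"pending": [("b1", {"n1"})], "skips": 0},
]
first = pick_task(jobs, "n1", max_skips=1)   # job a is skipped once
second = pick_task(jobs, "n1", max_skips=1)  # job a now runs non-locally
third = pick_task(jobs, "n1", max_skips=1)   # nothing left to run
```

The first job gives up its turn once so that a later job with local data can use the slot, which is the trade of a little fairness for locality that the slide refers to.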
9. [Fair Scheduler] Tradeoffs
Batch response time: fairness vs. utilization tradeoff (throughput)
Average Response Time
Space Usage with Intermediate Data
User Isolation: ability to provide worst-case performance comparable to owning a small private cluster regardless of user workload
10. [Fair Scheduler] To-Dos
Reschedule/Reassignment
Every UPDATE_INTERVAL, FairScheduler checks all pools for tasks to preempt, sets the status of those tasks, and places them in the action queue.
Next heartbeat will pick up the changes in task status and carry out the kills.
Relationship between batch response time and throughput: they measure the same thing.
Relationship between average response time (ART) and user isolation: they could be correlated, but not all the time. ART is not a quantitative measurement of user isolation.
11. [Quincy]
Model the problem as a flow network
Flow network: a directed graph in which
each edge e is annotated with a non-negative integer capacity and a cost, and
each node v is annotated with an integer supply, where the total supply of the graph equals zero
To do: construct the simplest graph, whose only hard constraint is no starvation
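The flow-network definition above can be written down as a small data structure. This sketch only represents the graph and checks a given flow; it does not solve the min-cost flow problem. `FlowNetwork` and its methods are hypothetical names, and the example graph (two tasks, two machines, one sink, with made-up costs) is an assumption for illustration rather than Quincy's actual construction.

```python
from dataclasses import dataclass, field


@dataclass
class FlowNetwork:
    supply: dict                               # node -> integer supply
    edges: dict = field(default_factory=dict)  # (u, v) -> (capacity, cost)

    def add_edge(self, u, v, capacity, cost):
        if capacity < 0:
            raise ValueError("capacity must be non-negative")
        self.edges[(u, v)] = (capacity, cost)

    def is_balanced(self):
        """Total supply over all nodes must equal zero."""
        return sum(self.supply.values()) == 0

    def is_feasible(self, flow):
        """Check capacities and that outflow - inflow equals each supply."""
        net = {n: 0 for n in self.supply}
        for (u, v), f in flow.items():
            capacity, _ = self.edges[(u, v)]
            if not 0 <= f <= capacity:
                return False
            net[u] += f
            net[v] -= f
        return all(net[n] == self.supply[n] for n in self.supply)

    def cost(self, flow):
        return sum(f * self.edges[e][1] for e, f in flow.items())


# Each task supplies one unit of flow; the sink absorbs both units.
g = FlowNetwork(supply={"t1": 1, "t2": 1, "m1": 0, "m2": 0, "sink": -2})
g.add_edge("t1", "m1", 1, 0)
g.add_edge("t1", "m2", 1, 5)
g.add_edge("t2", "m1", 1, 5)
g.add_edge("t2", "m2", 1, 1)
g.add_edge("m1", "sink", 1, 0)
g.add_edge("m2", "sink", 1, 0)
flow = {("t1", "m1"): 1, ("t2", "m2"): 1, ("m1", "sink"): 1, ("m2", "sink"): 1}
```

Routing every task's unit of flow to the sink means every task is placed somewhere, which is one way to read the no-starvation constraint in this toy graph.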
12. Quincy vs. Fair Scheduler
13. Readings
MapReduce. Jeffrey Dean*
Google: Cluster Computing and MR
Job Scheduling for Multi-User. Matei Zaharia*
Max-min fairness. Wikipedia + algo*
Quincy. Michael Isard*
An update on Google's infrastructure
14. Topic
Before: Existing systems use a predetermined, fixed allocation of resources/slots to queries/tasks. Intuitively, if resources can be allocated to tasks dynamically, they can be better utilized.
After: Enable the scheduler to make resource-aware decisions (IO, CPU, memory) and bring the fair scheduler from pool level to job level.
15. Tips from Prof Tan
Keep references for all the literature reviewed and note where each item was published
