![Page 1: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/1.jpg)
Towards General-Purpose Resource Management
in Shared Cloud Services
Jonathan Mace, Brown University
Peter Bodik, MSR Redmond
Rodrigo Fonseca, Brown University
Madanlal Musuvathi, MSR Redmond
![Page 2: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/2.jpg)
Shared-tenant cloud services
Processes service requests from multiple clients
✓ Great for cost and efficiency
✘ Performance is a challenge
Aggressive tenants and system maintenance tasks
Resource starvation and bottlenecks
Degraded performance, violated SLOs, system outages
![Page 3: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/3.jpg)
Shared-tenant cloud services
Ideally
manage resources to provide end-to-end guarantees and isolation
Challenge
OS/hypervisor mechanisms insufficient
✘ Shared threads & processes
✘ Application-level resource bottlenecks (locks, queues)
✘ Resources across multiple processes and machines
Today
lack of guarantees, isolation
some ad-hoc solutions
![Page 4: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/4.jpg)
This paper
• 5 design principles for resource policies in shared-tenant systems
• Retro – prototype for principled resource management
• Preliminary demonstration of Retro in HDFS
![Page 5: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/5.jpg)
Hadoop Distributed File System (HDFS)
[Diagram: one HDFS NameNode (filesystem metadata) and three HDFS DataNodes (replicated block storage)]
![Page 6: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/6.jpg)
![Page 7: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/7.jpg)
![Page 8: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/8.jpg)
![Page 9: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/9.jpg)
![Page 10: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/10.jpg)
![Page 11: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/11.jpg)
![Page 12: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/12.jpg)
Principle 1: Consider all resources and request types
• Fine-grained resources within processes
• Resources shared between processes (disk, network)
• Many different API calls
• Bottlenecks can crop up in many places
  hardware resources: disk, network, CPU, …
  software resources: locks, queues, …
  data structures: transaction logs, shared batches, …
![Page 13: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/13.jpg)
![Page 14: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/14.jpg)
![Page 15: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/15.jpg)
![Page 16: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/16.jpg)
![Page 17: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/17.jpg)
Principle 2: Distinguish between tenants
• Tenants might send different types of requests
• Tenants might be utilizing different machines
• To be efficient, a policy should be able to target the cause of contention, e.g.,
  if a tenant is causing contention, throttle it;
  otherwise, leave the tenant alone
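The throttle-or-leave-alone rule can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the policy used by the authors: the class `ContentionPolicy`, the method `throttleFactors`, the per-tenant load map, and the equal-share threshold are all hypothetical. The sketch treats a resource as contended when total measured load exceeds its capacity, and throttles only the tenants whose load exceeds an equal share.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: throttle only the tenants that cause contention.
final class ContentionPolicy {
    /**
     * @param usage    tenant -> measured load on one resource (arbitrary units)
     * @param capacity load the resource can sustain (same units)
     * @return tenant -> rate multiplier in (0, 1]; 1.0 means "leave the tenant alone"
     */
    static Map<String, Double> throttleFactors(Map<String, Double> usage, double capacity) {
        double total = 0.0;
        for (double load : usage.values()) {
            total += load;
        }
        boolean contended = total > capacity;
        // Assumption: an equal split of capacity is the per-tenant threshold.
        double fairShare = usage.isEmpty() ? 0.0 : capacity / usage.size();

        Map<String, Double> factors = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : usage.entrySet()) {
            boolean aggressor = e.getValue() > fairShare;
            // Scale an aggressor down to its fair share; everyone else is untouched.
            double factor = (contended && aggressor) ? fairShare / e.getValue() : 1.0;
            factors.put(e.getKey(), factor);
        }
        return factors;
    }
}
```

With loads of 80, 10, and 10 units against a capacity of 60, only the first tenant receives a factor below 1.0. A production policy would likely also redistribute unused capacity, which this sketch does not attempt.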
![Page 18: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/18.jpg)
![Page 19: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/19.jpg)
[Diagram: HDFS NameNode and DataNodes, with Admission Control]
![Page 20: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/20.jpg)
[Diagram: HDFS NameNode and DataNodes, with Admission Control, annotated with the loop:]
while (!Thread.currentThread().isInterrupted()) {
    sendPacket();
}
![Page 21: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/21.jpg)
[Diagram: HDFS NameNode and DataNodes, with Admission Control, annotated with the loop:]
while (!Thread.currentThread().isInterrupted()) {
    rate_limit();
    sendPacket();
}
Principle 5:
Schedule early, schedule often
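The slide does not define the `rate_limit()` call in the loop above. One common way to implement such a call is a token bucket, sketched below. This is an assumption for illustration, not Retro's actual scheduler: the class `TokenBucket` and its methods `reserve` and `rateLimit` are hypothetical names. The clock is passed in explicitly so the refill arithmetic can be checked without sleeping.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical token bucket: at most `burst` tokens, refilled at `ratePerSec`.
final class TokenBucket {
    private final double ratePerSec; // tokens added per second
    private final double burst;      // maximum tokens that can accumulate
    private double tokens;           // may go negative: a debt the caller waits out
    private long lastNanos;          // time of the last refill

    TokenBucket(double ratePerSec, double burst, long nowNanos) {
        this.ratePerSec = ratePerSec;
        this.burst = burst;
        this.tokens = burst;
        this.lastNanos = nowNanos;
    }

    /** Takes one token; returns how many nanoseconds the caller should wait (0 if none). */
    synchronized long reserve(long nowNanos) {
        double elapsedSec = (nowNanos - lastNanos) / 1e9;
        tokens = Math.min(burst, tokens + elapsedSec * ratePerSec);
        lastNanos = nowNanos;
        tokens -= 1.0;
        if (tokens >= 0.0) {
            return 0L;
        }
        // Time needed for the refill to pay off the debt.
        return (long) Math.ceil(-tokens / ratePerSec * 1e9);
    }

    /** Blocking form, playing the role of rate_limit() in the loop. */
    void rateLimit() throws InterruptedException {
        long waitNanos = reserve(System.nanoTime());
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
    }
}
```

In the loop, `rate_limit()` would become `bucket.rateLimit()` on a bucket created with `System.nanoTime()` as its start time. Because the check runs on every iteration rather than once per request, the rate can be adjusted while a long transfer is in flight, which is in the spirit of "schedule early, schedule often".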
![Page 22: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/22.jpg)
Resource Management Design Principles
1. Consider all request types and all resources
2. Distinguish between tenants
3. Treat foreground and background tasks uniformly
4. Estimate resource usage at runtime
5. Schedule early, schedule often
Retro – prototype for principled resource management in shared-tenant systems
![Page 23: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/23.jpg)
Retro: end-to-end tracing
![Page 24: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/24.jpg)
![Page 25: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/25.jpg)
Retro: application-level resource interception
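The slide gives only a title here, so the following is a guess at what application-level interception of a software resource might look like, not Retro's API. It combines a thread-local tenant ID (standing in for the tracing step) with a lock wrapper that charges each acquisition and its wait time to the current tenant. `TenantContext` and `MeteredLock` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical: carries the tenant ID of the request a thread is serving.
final class TenantContext {
    private static final ThreadLocal<String> CURRENT =
            ThreadLocal.withInitial(() -> "unknown");

    static void set(String tenant) { CURRENT.set(tenant); }
    static String get() { return CURRENT.get(); }
}

// Hypothetical: a lock that attributes its usage to the current tenant.
final class MeteredLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, LongAdder> acquisitions = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> waitNanos = new ConcurrentHashMap<>();

    void lock() {
        long start = System.nanoTime();
        lock.lock();
        long waited = System.nanoTime() - start;
        String tenant = TenantContext.get();
        acquisitions.computeIfAbsent(tenant, t -> new LongAdder()).increment();
        waitNanos.computeIfAbsent(tenant, t -> new LongAdder()).add(waited);
    }

    void unlock() { lock.unlock(); }

    long acquisitionsBy(String tenant) {
        LongAdder a = acquisitions.get(tenant);
        return a == null ? 0L : a.sum();
    }

    long waitNanosBy(String tenant) {
        LongAdder w = waitNanos.get(tenant);
        return w == null ? 0L : w.sum();
    }
}
```

A request handler would call `TenantContext.set(...)` when it starts serving a request, and the per-tenant counters could then be reported periodically to a central component.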
![Page 26: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/26.jpg)
Retro: aggregation and centralized reporting
![Page 27: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/27.jpg)
Retro: application-level enforcement
![Page 28: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/28.jpg)
Retro: distributed scheduling
![Page 29: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/29.jpg)
![Page 30: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/30.jpg)
Early Results
[Figure: HDFS NNBench benchmark. Two bar charts compare HDFS and HDFS w/ Retro across the Open, Read, Create, Rename, and Delete operations: normalized throughput (y-axis 0.9 to 1.1) and normalized latency (y-axis 0.8 to 1.2).]
0.01% to 2% average overhead on end-to-end latency, throughput
![Page 31: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/31.jpg)
![Page 32: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/32.jpg)
![Page 33: Towards General-Purpose Resource Management in Shared ...cs.brown.edu/~jcmace/presentations/mace14resource... · Towards General-Purpose Resource Management in Shared Cloud Services](https://reader033.fdocuments.us/reader033/viewer/2022060423/5f19c0dd705ff4711154d6ab/html5/thumbnails/33.jpg)
Retrospective
Thus far:
• Per-tenant identification
• Resource measurements
• Schedule enforcement
Next steps:
• Abstractions for writing simplified high-level policies
• Low-level enforcement mechanisms
• Policies to monitor system, find bottlenecks, provide guarantees