Introduction & Motivation Problem Statementiraicu/teaching/CS554-S15/lecture06... · 2015. 2....
• Introduction & Motivation • Problem Statement • Proposed Work • Evaluation • Conclusions • Future Work
Slurm++: a Distributed Workload Manager for Extreme-Scale High-Performance Computing Systems
• Today (2014): Peta-scale Computing − 34 PFlops − 3 million cores
• Near Future (~2020): Exascale Computing − 1000 PFlops − 1 billion cores
http://s.top500.org/static/lists/2014/06/TOP500_201406_Poster.pdf http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/ECSS%20report%20101909.pdf
• Manages resources − compute − storage − network
• Job scheduling − resource allocation − job launch
• Data management − data movement − caching
• PetaScale → Exascale (10^18 Flops, 1B cores) • Workload Diversity
− Traditional large-scale High Performance Computing (HPC) jobs − HPC ensemble runs − Many-Task Computing (MTC) applications
• State-of-the-art Resource Managers − Centralized paradigm: job scheduling − HPC resource managers: SLURM, PBS, Moab, Torque, Cobalt − HTC resource manager: Condor, Falkon, Mesos, YARN
• Next-generation Resource Managers − Distributed paradigm − Scalable, efficient, reliable
Decade | Uses and computers involved
1970s | Weather forecasting, aerodynamic research (Cray-1)
1980s | Probabilistic analysis, radiation shielding modeling (CDC Cyber)
1990s | Brute-force code breaking (EFF DES cracker)
2000s | 3D nuclear test simulations as a substitute for live testing under the Nuclear Non-Proliferation Treaty (ASCI Q)
2010s | Molecular dynamics simulation (Tianhe-1A)
Medical Image Processing: Functional Magnetic Resonance
Chemistry Domain: MolDyn
Molecular Dynamics: DOCK
Production Runs in Drug Design
Economic Modeling: MARS
Large-scale Astronomy Application Evaluation
Astronomy Domain: Montage
Data Analytics: Sort and WordCount
http://elib.dlr.de/64768/1/EnSIM_-_Ensemble_Simulation_on_HPC_Computes_EN.pdf
• Introduction & Motivation • Problem Statement • Proposed Work • Evaluation • Conclusions • Future Work
• Scalability − System scale is increasing − Workload size is increasing − Processing capacity needs to increase
• Efficiency − Allocating resources fast − Making fast enough scheduling decisions − Maintaining a high system utilization
• Reliability − Still functioning well under failures
• Introduction & Motivation • Problem Statement • Proposed Work • Evaluation • Conclusions • Future Work
[Figure: four scheduling architectures. MTC centralized scheduling (Falkon): a single scheduler serves all clients. MTC distributed scheduling (MATRIX): fully-connected schedulers, with a scheduler, executor, and KVS server on every compute node. HPC centralized scheduling (Slurm): one controller manages all compute daemons (cd). HPC distributed scheduling (Slurm++): fully-connected controllers, each co-located with a KVS server and managing its own partition of compute daemons, with clients submitting to any controller.]
Slurm++ is an extension of the Slurm workload manager.
• Distributed • Partition-based • Resource sharing • Distributed metadata management
[Figure: Slurm++ architecture — fully-connected controllers, each co-located with a KVS server and managing a partition of compute daemons (cd); clients can submit jobs to any controller.]
Key | Value | Description
controller id | free node list | Free nodes in a partition
job id | original controller id | The original controller that is responsible for a submitted job
job id + original controller id | involved controller list | The controllers that participate in launching a job
job id + original controller id + involved controller id | participated node list | The nodes in each partition that are involved in launching a job
• Balancing resources among all the partitions • Trivial for centralized architecture
• Challenging for distributed architecture – No global state information – Needs to be achieved dynamically – Resource conflicts may happen – Distributed resource stealing
• Different controllers may modify the same resource concurrently
• Resolved through an atomic operation

ALGORITHM 1: COMPARE AND SWAP
Input: key (key), value seen before (seen_value), new value intended to insert (new_value), and the storage hash map (map) of a ZHT server.
Output: a Boolean indicating success (TRUE) or failure (FALSE), and the current actual value (current_value).

current_value = map.get(key);
if (current_value == seen_value) then
    map.put(key, new_value);
    return TRUE;
else
    return FALSE, current_value;
end
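As a rough illustration, the compare-and-swap logic can be sketched in C (the project's implementation language). The single static `store_value` slot here is a hypothetical stand-in for one key of a ZHT server's hash map; the real store is ZHT's distributed hash table:

```c
#include <string.h>

/* Hypothetical stand-in for one key's slot in a ZHT server's hash map. */
static char store_value[64] = "node1,node2";

/* Sketch of Algorithm 1: apply new_value only if the stored value still
   equals seen_value; otherwise report the current value back so the
   caller can retry with a fresh view. Returns 1 (TRUE) or 0 (FALSE). */
int compare_and_swap(const char *seen_value, const char *new_value,
                     char *current_value, size_t len) {
    strncpy(current_value, store_value, len - 1);
    current_value[len - 1] = '\0';
    if (strcmp(store_value, seen_value) == 0) {
        strncpy(store_value, new_value, sizeof store_value - 1);
        store_value[sizeof store_value - 1] = '\0';
        return 1;  /* our view was fresh, update applied */
    }
    return 0;      /* another controller changed the value first */
}
```

A controller that loses the race receives current_value and retries its allocation against the fresh state.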
ALGORITHM 2: RANDOM RESOURCE STEALING
Input: number of nodes required (num_node_req), number of nodes allocated (*num_node_alloc), number of controllers/partitions (num_ctl), controller membership list (ctl_id[num_ctl]), sleep length (sleep_length), number of retries (num_retry).
Output: list of involved controller ids (list_ctl_id_inv), participated nodes of each involved controller (part_node[]).

num_try = 0; num_ctl_inv = 0; default_sleep_length = sleep_length;
while *num_node_alloc < num_node_req do
    sel_ctl_id = ctl_id[Random(num_ctl)]; sel_ctl_node = zht_lookup(sel_ctl_id);
    if (sel_ctl_node != NULL) then
        num_more_node = num_node_req - *num_node_alloc;
again:  num_free_node = sel_ctl_node.num_free_node;
        if (num_free_node > 0) then
            num_try = 0; sleep_length = default_sleep_length;
            num_node_try = num_free_node > num_more_node ? num_more_node : num_free_node;
            list_node = allocate(sel_ctl_node, num_node_try);
            if (list_node != NULL) then
                *num_node_alloc += num_node_try; part_node[num_ctl_inv++] = list_node; list_ctl_id_inv.add(sel_ctl_id);
            else
                goto again;
            end
        else if (++num_try > num_retry) then
            release(list_ctl_id_inv, part_node); *num_node_alloc = 0; sleep_length = default_sleep_length;
        else
            sleep(sleep_length); sleep_length *= 2;
        end
    end
end
return list_ctl_id_inv, part_node;
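The core of the stealing loop can be sketched under two simplifying assumptions not in the original: partitions are a plain array of free-node counts (no ZHT, no compare-and-swap), and the random controller selection is replaced by a caller-supplied visit order so the behavior is deterministic. The retry, backoff, and release logic of Algorithm 2 is omitted:

```c
/* Simplified resource stealing: walk the partitions in visit_order,
   taking min(available, still-needed) nodes from each, until the demand
   is met or every partition has been visited. Returns the number of
   nodes actually gathered. */
int steal_nodes(int *free_nodes, int num_ctl,
                const int *visit_order, int num_node_req) {
    int allocated = 0;
    for (int i = 0; i < num_ctl && allocated < num_node_req; i++) {
        int c = visit_order[i];          /* stands in for Random(num_ctl) */
        int need = num_node_req - allocated;
        int take = free_nodes[c] < need ? free_nodes[c] : need;
        if (take > 0) {
            free_nodes[c] -= take;       /* in Slurm++ this is a CAS on ZHT */
            allocated += take;
        }
    }
    return allocated;
}
```

When the return value falls short of num_node_req, the real algorithm backs off, retries, and eventually releases everything to avoid deadlock.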
• Poor performance (resource deadlock) – For big jobs (e.g. full-scale, half-scale) – Under high system utilization
[Figure: number of stealing operations vs. scale (no. of nodes), one curve per (job size, utilization) pair (j,u), from (1,0) down to (0.25,0.5)]
• Increases linearly with the system scale
• Increases exponentially with both the job size and the system utilization
• At 1M-node scale, a full-scale job needs about 8000 stealing operations
• Monitoring-based, weakly consistent resource sharing
• A centralized monitoring service
  − collects the resources of all partitions periodically (global view)
• Controllers: two-phase tuning
  − pull all the resources periodically (macro-tuning)
  − store the resources in a binary search tree (BST) (local view)
  − find the most suitable partition according to the BST
  − look up the resource of that partition
  − update the BST (micro-tuning, not consistent)
ALGORITHM 3: MONITORING SERVICE LOGIC
Input: number of controllers or partitions (num_ctl), controller membership list (ctl_id[num_ctl]), update frequency (sleep_length).
Output: void.

global_key = "global resource specification"; global_value = "";
while TRUE do
    global_value = "";
    for each i in 0 to num_ctl - 1 do
        cur_ctl_res = zht_lookup(ctl_id[i]);
        if (cur_ctl_res == NULL) then
            exit(1);
        else
            global_value += cur_ctl_res;
        end
    end
    zht_insert(global_key, global_value); sleep(sleep_length);
end
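One aggregation round of the monitoring loop can be sketched as follows. Here each partition's resource record is assumed, hypothetically, to be just its free-node count rendered as text; the real service pulls full records from ZHT and republishes the concatenation under the global key:

```c
#include <stdio.h>
#include <string.h>

/* One round of Algorithm 3's loop: concatenate every partition's
   resource record into one semicolon-separated global value. Returns
   global_value for convenience; the real service would zht_insert() it
   under "global resource specification". */
char *aggregate_round(const int *free_nodes, int num_ctl,
                      char *global_value, size_t len) {
    global_value[0] = '\0';
    for (int i = 0; i < num_ctl; i++) {
        char rec[32];
        snprintf(rec, sizeof rec, i ? ";%d" : "%d", free_nodes[i]);
        strncat(global_value, rec, len - strlen(global_value) - 1);
    }
    return global_value;
}
```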
• Property
  − left child <= root <= right child
• Operations
  − BST_insert(BST*, char*, int)
  − BST_delete(BST*, char*, int)
  − BST_delete_all(BST*)
  − BST_search_best(BST*, int)
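A minimal sketch of the tree and the best-fit search follows. Controller ids and the delete operations are omitted, and "most suitable partition" is interpreted (as an assumption) as the smallest free-node count that still satisfies the demand:

```c
#include <stdlib.h>

/* BST keyed on a partition's free-node count (controller ids omitted). */
typedef struct bst_node {
    int free_nodes;
    struct bst_node *left, *right;
} bst_node;

bst_node *bst_insert(bst_node *root, int free_nodes) {
    if (root == NULL) {
        bst_node *n = calloc(1, sizeof *n);
        n->free_nodes = free_nodes;
        return n;
    }
    if (free_nodes <= root->free_nodes)   /* left child <= root */
        root->left = bst_insert(root->left, free_nodes);
    else                                  /* root <= right child */
        root->right = bst_insert(root->right, free_nodes);
    return root;
}

/* Best fit: the smallest free-node count that still satisfies the
   demand, so stealing fragments the partitions as little as possible.
   Returns -1 if no single partition can satisfy the demand. */
int bst_search_best(const bst_node *root, int demand) {
    int best = -1;
    while (root != NULL) {
        if (root->free_nodes >= demand) {
            best = root->free_nodes;      /* candidate; try a tighter fit */
            root = root->left;
        } else {
            root = root->right;
        }
    }
    return best;
}
```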
PHASE 1: PULL-BASED MACRO-TUNING
Input: number of controllers or partitions (num_ctl), controller membership list (ctl_id[num_ctl]), update frequency (sleep_length), binary search tree of the number of free nodes of each partition (*bst).
Output: void.

global_key = "global resource specification"; global_value = "";
while TRUE do
    global_value = zht_lookup(global_key);
    global_res[num_ctl] = split(global_value);
    pthread_mutex_lock(&bst_lock);
    BST_delete_all(bst);
    for each i in 0 to num_ctl - 1 do
        BST_insert(bst, ctl_id[i], global_res[i].num_free_node);
    end
    pthread_mutex_unlock(&bst_lock);
    sleep(sleep_length);
end
PHASE 2: MICRO-TUNING RESOURCE ALLOCATION
Input: number of nodes required (num_node_req), number of nodes allocated (*num_node_alloc), self id (self_id), sleep length (sleep_length), number of retries (num_retry), binary search tree of the number of free nodes of each partition (*bst).
Output: list of nodes allocated to the job.

*num_node_alloc = 0; num_try = 0; alloc_node_list = NULL; target_ctl = self_id;
while *num_node_alloc < num_node_req do
    resource = zht_lookup(target_ctl);
    nodelist = allocate_resource(target_ctl, resource, num_node_req - *num_node_alloc);
    if (nodelist != NULL) then
        num_try = 0; alloc_node_list += nodelist; *num_node_alloc += nodelist.size();
        new_resource = resource - nodelist;
        pthread_mutex_lock(&bst_lock);
        old_resource = BST_search_exact(bst, target_ctl);
        if (old_resource != NULL) then
            BST_delete(bst, target_ctl, old_resource.size());
        end
        BST_insert(bst, target_ctl, new_resource.size());
        pthread_mutex_unlock(&bst_lock);
    else if (num_try++ > num_retry) then
        release_nodes(nodelist); alloc_node_list = NULL; *num_node_alloc = 0; num_try = 0; sleep(sleep_length);
    end
    if (*num_node_alloc < num_node_req) then
        pthread_mutex_lock(&bst_lock);
        if ((data_item = BST_search_best(bst, num_node_req - *num_node_alloc)) != NULL) then
            target_ctl = data_item.ctl_id; BST_delete(bst, target_ctl, data_item.size());
        else
            target_ctl = self_id;
        end
        pthread_mutex_unlock(&bst_lock);
    end
end
return alloc_node_list;
• Code available (developed in C)
  − https://github.com/kwangiit/SLURMPP_V2
• Code complexity
  − 5K lines of new code
  − plus 500K lines of Slurm code
  − 8K lines of ZHT code
  − 1K lines of code auto-generated by Google Protocol Buffers
• Dependencies
  − Google Protocol Buffers
  − ZHT key-value store
  − Slurm resource manager
• Introduction & Motivation • Problem Statement • Proposed Work • Evaluation • Conclusions • Future Work
• Test-bed
  − The Probe 500-node Kodiak cluster at LANL
  − Each node has two AMD Opteron 252 processors (2.6 GHz) and 8 GB of memory
  − The network supports both Ethernet and InfiniBand
  − Our experiments use 10 Gbit Ethernet (the default configuration of Kodiak)
  − Experiments up to 500 nodes
• Workloads
  − Micro-benchmarking NOOP sleep jobs
  − Application traces from IBM Blue Gene/P supercomputers
  − Scientific numeric MPI applications from the PETSc tool package
[Figures: throughput (jobs/sec) vs. scale (no. of nodes, 50–500) for Slurm++, Slurm++ v0, and Slurm, shown separately for one-node jobs, half-partition jobs, and full-partition jobs; plus a speedup summary (Slurm++ / Slurm) vs. scale for all three job sizes]
Workload Specification
[Figure: CDF of job sizes in the workload — percentage of jobs vs. percentage of job size relative to scale: (1%, 19.52%), (2%, 70.44%), (3%, 85.20%), (5%, 90.91%), (10%, 95.50%), (20%, 98.77%), (40%, 99.34%), (63%, 99.35%), (80%, 99.78%), (100%, 100%)]
Slurm++ vs. Slurm
[Figure: Slurm++ vs. Slurm throughput (jobs/sec) and average ZHT message count per job vs. scale (no. of nodes, 50–500)]
[Figure: average scheduling latency per job (ms) vs. scale (no. of nodes, 50–500) — Slurm++ vs. Slurm]
[Figure: throughput (jobs/sec, log scale) vs. scale (no. of nodes) for partition sizes 1, 4, 16, and 64]
Validation
[Figure: Slurm++ vs. SimSlurm++ throughput and normalized difference vs. scale (no. of nodes, 50–500); the normalized difference stays between 1.00% and 11.48%]
Scalability
[Figure (simulation): average scheduling latency per job (ms) and average ZHT message count per job vs. scale (1024–65536 nodes); plotted series values: 7.5, 17.5, 42.3, 130.4 and 4.00, 6.56, 28.98, 85.89]
• Introduction & Motivation • Problem Statement • Proposed Work • Evaluation • Conclusions • Future Work
• Workloads are becoming finer-grained
• Key-value stores are a viable building block
• Distributed resource managers are in demand
• A fully distributed architecture is scalable
• Slurm++ is 10X faster than Slurm in making scheduling decisions
• Simulation results show the technique is scalable
• Introduction & Motivation • Problem Statement • Proposed Work • Evaluation • Conclusions • Future Work
• Contribute to the Intel ORCM cluster manager
  − Built on top of Open MPI
  − Add the key-value store as a new plugin
• Distributed data-aware scheduling for HPC applications
• Elastic resource allocation
• More information: – http://datasys.cs.iit.edu/~kewang/
• Contact: – [email protected]
• Questions?