Distributed communication-aware load balancing with TreeMatch in Charm++
The 11th workshop of the Joint Laboratory for Petascale Computing, Sophia-Antipolis
Emmanuel Jeannot, Guillaume Mercier, Francois Tessier
In collaboration with the Charm++ team from the PPL:
Esteban Meneses-Rojas, Gengbin Zheng, Sanjay Kale
June 9, 2014
Francois Tessier TreeMatch in Charm++ 1/ 19
Introduction
Scalable execution of parallel applications
Number of cores is increasing
But memory per core is decreasing
Applications will need to communicate even more than they do now
Our solution
Process placement should take process affinity into account. Here: load balancing in Charm++ considering:
CPU load
process affinity (or affinity of other communicating objects)
topology: network and intra-node
Charm++
Features
Parallel object-oriented programming language based on C++
Programs are decomposed into a number of cooperating message-driven objects called chares. In general, there are more chares than processing units
Chares are mapped to physical processors by an adaptive runtime system
Load balancers can be called to migrate chares
Charm++ can use MPI for inter-process communication
Process Placement
Why we should consider it
Many current and future parallel platforms have several levels of hierarchy
Application chares/processes do not exchange the same amount of data (affinity). The process placement policy may have an impact on performance
Cache hierarchy, memory bus, high-performance network...
Figure: Hierarchical machine topology: switch, cabinets, nodes, processors, cores.
Problems
Given
The parallel machine topology
The application communication pattern
Map application processes to physical resources (cores) so as to reduce the communication costs (an NP-complete problem)
Figure: Communication matrix (zeus16.map): amount of data exchanged per sender rank and receiver rank.
TreeMatch
The TreeMatch Algorithm
Algorithm and environment to compute process placement based on process affinities and the NUMA topology
Input:
the communication pattern of the application (preliminary execution with a monitored MPI implementation for static placement; dynamic recovery on iterative applications with Charm++)
a model (tree) of the underlying architecture: hwloc can provide this
Output:
a process permutation σ such that σ(i) is the core number on which process i has to be bound
TreeMatch can only work on tree topologies. How do we deal with a 3D torus?
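To make the matching idea concrete, here is a toy version for 4 processes on a two-level tree (2 nodes of 2 cores): enumerate the possible pairings, keep the one that puts the heaviest-communicating pairs on the same node, and emit the permutation σ. Real TreeMatch works recursively on arbitrary trees; this brute-force pairing and its communication matrix are our simplification.

```cpp
#include <array>
#include <vector>

// Toy TreeMatch-like matching: 4 processes, cores 0-1 on node 0 and
// cores 2-3 on node 1. Pick the pairing maximizing intra-node traffic,
// then return sigma (sigma[i] = core assigned to process i).
std::vector<int> toyTreeMatch(const std::array<std::array<double, 4>, 4>& comm) {
    // the three ways to split {0,1,2,3} into two pairs
    const std::array<std::array<int, 4>, 3> pairings = {{
        {0, 1, 2, 3},   // {0,1} share a node, {2,3} share a node
        {0, 2, 1, 3},   // {0,2} and {1,3}
        {0, 3, 1, 2}    // {0,3} and {1,2}
    }};
    int best = 0;
    double bestGain = -1.0;
    for (int p = 0; p < 3; ++p) {
        double gain = comm[pairings[p][0]][pairings[p][1]]
                    + comm[pairings[p][2]][pairings[p][3]];
        if (gain > bestGain) { bestGain = gain; best = p; }
    }
    std::vector<int> sigma(4);
    for (int k = 0; k < 4; ++k)
        sigma[pairings[best][k]] = k;   // k-th slot of the pairing = core k
    return sigma;
}
```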
Network placement
libtopomap
T. Hoefler and M. Snir, "Generic Topology Mapping Strategies for Large-Scale Parallel Architectures," Proc. Int'l Conf. Supercomputing (ICS), pp. 75-84, 2011.
Library that enables mapping processes onto various network topologies
Used in TreeMatchLB to take the Blue Waters 3D torus into account
Figure: 3d Torus and a Cray Gemini router
Load balancing
Principle
Iterative applications
Load balancer called at regular intervals
Migrate chares in order to optimize several criteria
The Charm++ runtime system provides:
chare loads
chare affinities
etc.
Constraints
Dealing with complex modern architectures
Taking into account communications between elements
Some other communication-aware load-balancing algorithms
[L. L. Pilla et al., 2012] NUCOLB, shared memory machines
[L. L. Pilla et al., 2012] HwTopoLB
Some "built-in" Charm++ load balancers: RefineCommLB, GreedyCommLB...
Several issues raised
Not so easy...
Several issues raised!
Scalability of TreeMatch
Need to find a relevant compromise between process affinities and load balancing
What about load balancing time?
The next slides present our load balancer, relying on TreeMatch and libtopomap, which performs a parallel and distributed communication-aware load balancing.
Strategy for Charm++ - Network Placement
First step: minimize the communication cost on the network
libtopomap reorders the processes of a communicator. How can we use it to reorder groups of processes (or chares)? Example: groups of chares on nodes
Charm++ uses MPI: full access to the MPI API. New MPI communicator with MPI_Comm_split
Figure: Nodes 0 to 3 on the network (3D torus, tree, ...) gathered into a new communicator.
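A sketch of this splitting step: every rank on the same node computes the same "color", so that MPI_Comm_split yields one communicator per node. Only the color computation is shown as runnable code (with hypothetical hostnames); the MPI call itself, shown in a comment, needs a running MPI job.

```cpp
#include <map>
#include <string>
#include <vector>

// Assign one color per distinct node, in order of first appearance.
// hostnameOfRank[r] is the node that rank r runs on.
std::vector<int> nodeColors(const std::vector<std::string>& hostnameOfRank) {
    std::map<std::string, int> colorOfHost;
    std::vector<int> color(hostnameOfRank.size());
    for (size_t rank = 0; rank < hostnameOfRank.size(); ++rank) {
        auto it = colorOfHost.find(hostnameOfRank[rank]);
        if (it == colorOfHost.end())
            it = colorOfHost.emplace(hostnameOfRank[rank],
                                     (int)colorOfHost.size()).first;
        color[rank] = it->second;
    }
    return color;
}

// Each rank would then build its per-node communicator with:
//   MPI_Comm nodecomm;
//   MPI_Comm_split(MPI_COMM_WORLD, color[myrank], myrank, &nodecomm);
```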
Strategy for Charm++ - Intra-node placement
TreeMatch load balancer
1st step: remap groups of chares on nodes according to the communication on the network
libtopomap (example: part of a 3D torus)
2nd step: reorder chares inside each node (distributed)
Apply TreeMatch on the NUMA topology and the chares' communication pattern
Bind chares according to their load (leveling toward the least loaded)
Each node carries out its own placement in parallel
Figure: Part of a 3D torus attributed by the resource manager; groups of chares are assigned to nodes, then to cores within each node according to their CPU load, over the network (3D torus, hierarchical, ...).
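The load-leveling part of the second step can be sketched as a greedy heuristic (our simplification of the "leveling" bullet, not the exact TreeMatchLB code): heaviest chare first, always onto the currently least loaded core of the node.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Greedy longest-processing-time leveling: returns coreOfChare[i],
// the core of this node that chare i is bound to.
std::vector<int> levelLoads(const std::vector<double>& chareLoad, int ncores) {
    std::vector<int> order(chareLoad.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return chareLoad[a] > chareLoad[b];   // heaviest first
    });
    // min-heap of (accumulated load, core id)
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> cores;
    for (int c = 0; c < ncores; ++c) cores.push({0.0, c});
    std::vector<int> coreOfChare(chareLoad.size());
    for (int chare : order) {
        auto [load, core] = cores.top();      // least loaded core so far
        cores.pop();
        coreOfChare[chare] = core;
        cores.push({load + chareLoad[chare], core});
    }
    return coreOfChare;
}
```

Because each node only levels its own chares, all nodes can run this step in parallel, which is what makes the load balancer distributed.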
Results
kNeighbor
Benchmark application designed to generate intensive communication between processes
Experiments on 8 nodes with 8 cores each (Intel Xeon 5550) - PlaFRIM cluster
Particularly compared to RefineCommLB, which:
takes into account load and communication
minimizes migrations
Figure: Execution time (in seconds) of kNeighbor on 64 cores with 1 MB messages, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased: 128 elements (left) and 256 elements (right).
Results
kNeighbor
Experiments on 16 nodes with 8 cores each (Intel Xeon 5550) - PlaFRIM cluster
1 MB messages - 100 iterations - 7-Neighbor
Figure: Average time per 7-kNeighbor iteration (in ms) vs. number of chares per core, for DummyLB and TreeMatchLB.
Results
kNeighbor
Experiments on 16 nodes with 32 cores each (AMD Interlagos 6276) - Blue Waters cluster
1 MB messages - 100 iterations - 7-Neighbor
Poor performance...
Figure: Average time per 7-kNeighbor iteration (in ms) vs. number of chares per core, for DummyLB and TreeMatchLB.
Results
Stencil3D
Three-dimensional stencil with regular communication between fixed neighbors
One chare per core: balancing considers only communications
Only one load balancing step, after 10 iterations
Experiments on 8 nodes with 8 cores each (Intel Xeon 5550)
Figure: Execution time (in seconds) of Stencil3D on 64 cores with 64 elements, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased.
Results
What about the load balancing time?
Comparison between the sequential and the distributed versions of TreeMatchLB
The master node distributes the data to each node, which then computes the placement of its own chares. This data distribution can be done in parallel (around 20% improvement)
Figure: Time (in seconds) spent in each step of the load balancing process (initialization, sequential TreeMatch, parallel TreeMatch), sequential vs. distributed versions, for 4096, 8192 and 16384 chares.
Figure: Timeline of one distributed load balancing step with 4096 chares: Init, Process results, Distribute, Calculate and Return phases on the master and on each node.
Results
What about the load balancing time?
Load balancing time grows linearly as the number of chares is doubled
TreeMatchLB is slower than the other greedy strategies
RefineCommLB, which provides good results for communication-bound applications, is not scalable (fails beyond 8192 chares)
Figure: Load balancing time (in ms, log scale, running on 128 cores) of GreedyCommLB, GreedyLB, RefineCommLB and TreeMatchLB vs. number of chares (128 to 8192) for the kNeighbor application.
Future work and Conclusion
Topology is not flat!
Process affinities are not homogeneous
Taking this information into account when mapping chares yields improvements
Algorithm adapted to large problems (Distributed)
JLPC collaborations
10 days during August 2013 at the PPL
Paper accepted at IEEE Cluster 2013
Future work
Find a better way to gather the topology (hwloc?)
Poor performance on Blue Waters... need to understand why (the architecture?)
Perform more large scale experiments
Hybrid architecture? Intel MIC?
Evaluate our solution on other applications
Compare to other load balancers (NUCOLB, application-specific LBs)
The End
Thanks for your attention!
Any questions?