Scalable Algorithms for Constructing Balanced Spanning...
Transcript of Scalable Algorithms for Constructing Balanced Spanning...
Scalable Algorithms for Constructing Balanced SpanningTrees on System-ranked Process Groups
Akhil Langer, Ramprasad Venkataraman, Laxmikant V. Kale
Parallel Programming LaboratoryUniversity of Illinois
September 25 2012
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 1 / 26
Pitch
Not all messaging needs fully-capable communicators
Its worthwhile to consider cheaper constructs
We propose:
Unranked or System-ranked Process Groups
User cannot choose member ranks
Cheap and Scalable Creation Mechanisms
I Shrink-and-Balance
I Rank-and-Hash
∼ 100X faster than MPI Comm split on 32K cores of IBM BG/P
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 2 / 26
Is process group creation/management scalable?
Memory capacity is growing slower than available concurrency
Runtime systems have to adopt resource-conserving mechanisms
Typical Process Group ImplementationsI Each member can id everyone elseI Storage: O(n) (on each member process)I Time for creation: O(n log n)
Applications can create many such groups simultaneously
How can we use less than O(n) memory?
Distributed enrollmentDistributed storage
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 3 / 26
Is process group creation/management scalable?
Memory capacity is growing slower than available concurrency
Runtime systems have to adopt resource-conserving mechanisms
Typical Process Group ImplementationsI Each member can id everyone elseI Storage: O(n) (on each member process)I Time for creation: O(n log n)
Applications can create many such groups simultaneously
How can we use less than O(n) memory?
Distributed enrollmentDistributed storage
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 3 / 26
Achieving distributed enrollment
Central specification of membership
MPI Group incl(MPI Group group,int n,int *ranks, ← not scalableMPI Group *newgroup)
Distributed enrollment
MPI Comm split(MPI Comm comm,int color,int key,MPI Comm *newcomm)
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 4 / 26
Achieving distributed storage
Distributed Tables
EuroMPI 2010
A Scalable MPI Comm split Algorithm for Exascale Computing.Sack, P., Gropp, W.In: Recent Advances in the Message Passing Interface. pp. 110. EuroMPI10 (2010)
Process Chains
EuroMPI 2011
Exascale algorithms for generalized MPI comm split.Moody, A., Ahn, D., Supinski, B.In: Recent Advances in the Message Passing Interface. pp. 9-18. EuroMPI11 (2011)
Unranked / System-Ranked Process Groups
EuroMPI 2012
Scalable Algorithms for Constructing Balanced Spanning Trees on System-RankedProcess Groups Langer, A., Venkataraman, R., Kale L.In: Recent Advances in the Message Passing Interface. pp. 9-18. EuroMPI12 (2012)
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 5 / 26
Achieving distributed storage
Distributed Tables
EuroMPI 2010
A Scalable MPI Comm split Algorithm for Exascale Computing.Sack, P., Gropp, W.In: Recent Advances in the Message Passing Interface. pp. 110. EuroMPI10 (2010)
Process Chains
EuroMPI 2011
Exascale algorithms for generalized MPI comm split.Moody, A., Ahn, D., Supinski, B.In: Recent Advances in the Message Passing Interface. pp. 9-18. EuroMPI11 (2011)
Unranked / System-Ranked Process Groups
EuroMPI 2012
Scalable Algorithms for Constructing Balanced Spanning Trees on System-RankedProcess Groups Langer, A., Venkataraman, R., Kale L.In: Recent Advances in the Message Passing Interface. pp. 9-18. EuroMPI12 (2012)
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 5 / 26
Achieving distributed storage
Distributed Tables
EuroMPI 2010
A Scalable MPI Comm split Algorithm for Exascale Computing.Sack, P., Gropp, W.In: Recent Advances in the Message Passing Interface. pp. 110. EuroMPI10 (2010)
Process Chains
EuroMPI 2011
Exascale algorithms for generalized MPI comm split.Moody, A., Ahn, D., Supinski, B.In: Recent Advances in the Message Passing Interface. pp. 9-18. EuroMPI11 (2011)
Unranked / System-Ranked Process Groups
EuroMPI 2012
Scalable Algorithms for Constructing Balanced Spanning Trees on System-RankedProcess Groups Langer, A., Venkataraman, R., Kale L.In: Recent Advances in the Message Passing Interface. pp. 9-18. EuroMPI12 (2012)
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 5 / 26
System-Ranked Process Group
User cannot specify or influence ranks of members
MPI Comm split(MPI Comm comm,int color,int key,MPI Comm *newcomm)
Ranks are assigned by runtime system
Hence, any mapping of application logic / data to ranks has to behandled manually after creation
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 6 / 26
Are user-supplied ranks needed all the time?
barrier, broadcast, reduce, allreduce
Input / output not dependent on ranks
Assume commutative operators
Sizeable fraction of collective communication in applications involvethese operations 1, 2, 3
Several algorithms can be expressed with just these collectives
1NERSC6 Workload Analysis and Benchmark Selection Process.Antypas, K., Shalf, J., Wasserman, H.Tech. Rep. LBNL-1014E, Lawrence Berkeley National Lab(2008)
2Automatic MPI Counter Profiling.Rabenseifner, R.In: 42nd CUG Conference(2000)
3Parallel Scaling Characteristics of Selected NERSC User Project CodesSkinner, D., Verdier, F., Anand, H., Carter, J., Durst, M., Gerber, R.Tech. Rep. LBNL/PUB-904, Lawrence Berkeley National Lab (2005)
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 7 / 26
Problem Statement
Represent process groups using spanning trees
Low memory footprint(distributed storage)
Recursive, splitting of original tree(distributed enrollment)
Immediate availability of efficientsynchronization / housekeeping
Can use spanning tree for the targetcollectives too
0
1
3
7 8
4
9 10
2
5
11 12
6
13 14
To support system-ranked groups
Starting from a parent tree, construct balanced spanning tree over enrolledmembers only
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 8 / 26
The Reference Centralized Algorithm
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 9 / 26
The Reference Centralized Algorithm
Upward pass: gathervMembers of new group contribute their process ids
Downward passpick immediate children and split the remaining list
O(m+ log n) time and O(m) memory 4
4m is size of the new subtreeLanger, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 10 / 26
The Shrink-and-Balance Algorithm
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 11 / 26
Algorithm: Shrink-and-BalanceUpward Pass
Use enrollment data to shrink original spanning tree by excludingnon-participating processes
“fill” holes with member processesI leaf processI immediate child process
leaf process - send min(di,k , subtree size(v)) = O(log n) candidatefillers to the parent
O(log n) space and O(log2 n) time
4di,k is depth of a rank i process in a balanced spanning tree of branching factor kdi,k = blogk(i(k − 1) + 1)c
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 12 / 26
Algorithm: Shrink-and-BalanceUpward pass
0
1
3
7 8
4
9 10
2
5
11 12
6
13 14
(a)
0
1
7 4
9
2
5 6
13 14
(b)
0
1
7 4
9
13
6
14
(c)
9
1
7 4
13
6
14
(d)
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 13 / 26
Algorithm: Shrink-and-BalanceDownward Pass
Upward pass yields participants-only spanning tree that need not bebalanced
Balance tree while minimizing vertex migrationsI compute ideal height of a perfectly balanced spanning treeI target height yields max size of subtrees5
I based on current size, mark subtrees as vertex suppliers and consumers,respectively
I request supplier for vertex if child is missing (takes O(log n) time)I “matchmaking” step to assign suppliers to one or more consumersI vertex concludes its role by calling balancing step on its children
O(log2 n) time, as a child could be missing at each level
5max size = kh−1k−1
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 14 / 26
The Rank-and-Hash Algorithm
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 15 / 26
Algorithm: Rank-and-HashUpward Pass
Reduction
Store size of each subtree
07
14
31
7
1
8
0
42
9
1
10
0
23
50
11
0
12
0
63
13
1
14
1
(a) Subtree sizes after the upward pass
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 16 / 26
Algorithm: Rank-and-HashUpward Pass
Reduction
Store size of each subtree
07
14
31
7
1
8
0
42
9
1
10
0
23
50
11
0
12
0
63
13
1
14
1
(a) Subtree sizes after the upward pass
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 16 / 26
Algorithm: Rank-and-HashDownward Pass
Size of tree determines available ranks [0,m)
Range split amongst subtrees based on their sizes
Splitting continues down the original spanning tree until all availableranks divided
Non-participating processes not assigned any ranks
10
7
1
42
9
3
64
13
5
14
6
(b) Ranks after the downward pass
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 17 / 26
Algorithm: Rank-and-HashDownward Pass
Size of tree determines available ranks [0,m)
Range split amongst subtrees based on their sizes
Splitting continues down the original spanning tree until all availableranks divided
Non-participating processes not assigned any ranks
10
7
1
42
9
3
64
13
5
14
6
(c) Ranks after the downward pass
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 17 / 26
Algorithm: Rank-and-HashIdentifying Tree Neighbors
Process ids of parent and children discovered through intermediary processes
id of intermediary process (Hi), for rank i computed via a hash function
Each rank i, sends its id to Hi and Hp (where, p is rank of its parent)
Receive msgs from Hi and Hp with ids of children and parent, respectively
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 18 / 26
Algorithm: Rank-and-HashIdentifying Tree Neighbors
Process ids of parent and children discovered through intermediary processes
id of intermediary process (Hi), for rank i computed via a hash function
Each rank i, sends its id to Hi and Hp (where, p is rank of its parent)
Receive msgs from Hi and Hp with ids of children and parent, respectively
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 18 / 26
Experimental Setup
Time measured between broadcast on original spanning tree andreduction on the newly constructed tree
Sample from a uniform distribution u(0, 1) and use participationprobability p to determine participation of a process in the group.
Repeatable seeds to ensure identical groups across runs
Algorithms implemented in Charm++
Runs on BG/P “Intrepid”
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 19 / 26
ResultsPerformance Comparison on up to 128k cores of BG/P
4k 8k 16k 32k 64k 128k28
29
210
211
212
213 p=0.01
4k 8k 16k 32k 64k 128k28
29
210
211
212
213 p=0.1
4k 8k 16k 32k 64k 128k28
29
210
211
212
213 p=0.3
4k 8k 16k 32k 64k 128k28
29
210
211
212
213 p=0.6
4k 8k 16k 32k 64k 128k28
29
210
211
212
213 p=0.9
0.0 0.2 0.4 0.6 0.8 1.00.0
0.2
0.4
0.6
0.8
1.0
Legend
centralized
Rank-and-Hash
Shrink-and-Balance
tim
e (
in m
icro
seco
nds)
tim
e (
in m
icro
seco
nds)
number of processes
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 20 / 26
ResultsPerformance Comparison on up to 128k cores of BG/P
Distributed schemes outperform the centralized scheme at modestprocess counts (except for very small p)
Shrink-and-Balance slower than Rank-and-HashI longer critical path
Both schemes attain the goal of reduced memory footprint!
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 21 / 26
ResultsPerformance Comparison on up to 128k cores of BG/P
Distributed schemes outperform the centralized scheme at modestprocess counts (except for very small p)
Shrink-and-Balance slower than Rank-and-HashI longer critical path
Both schemes attain the goal of reduced memory footprint!
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 21 / 26
ResultsNormalized message counts w.r.t. the centralized scheme on 128k cores of BG/P
0.001 0.01 0.1 0.3 0.6 0.9 0.99participation probability
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
Norm
aliz
ed M
ess
age C
ount
Shrink-and-BalanceRank-and-Hash
Shrink-and-Balance has far fewer messagesthan Rank-and-Hash
at p = 0.6, number of messages sent byCentralized, Shrink-and-Balance andRank-and-Hash were 2.1, 2.6 and4.9× 105, respectively
Shrink-and-Balance may perform better whenI multiple groups are being formed simultaneouslyI group formation occurs simultaneous with other communication in the application
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 22 / 26
ResultsNormalized message counts w.r.t. the centralized scheme on 128k cores of BG/P
0.001 0.01 0.1 0.3 0.6 0.9 0.99participation probability
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
Norm
aliz
ed M
ess
age C
ount
Shrink-and-BalanceRank-and-Hash
Shrink-and-Balance has far fewer messagesthan Rank-and-Hash
at p = 0.6, number of messages sent byCentralized, Shrink-and-Balance andRank-and-Hash were 2.1, 2.6 and4.9× 105, respectively
Shrink-and-Balance may perform better whenI multiple groups are being formed simultaneouslyI group formation occurs simultaneous with other communication in the application
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 22 / 26
ResultsComparison with MPI Comm split on 32k cores of BG/P
MPI Comm split comparison with multi-color Rank-and-Hash
Group Creation Time (in milliseconds)
# splits MPI-Comm-split Rank-and-Hash
1 134.968 0.7082 106.573 0.7134 96.989 0.7608 93.536 0.785
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 23 / 26
Related Work
Moody et al6 proposed generalized MPI Comm split
I process groups as chain - O(1) space and O(logn) timeI requires O(n) messaging to exchange process ids during collective callI does collective communication using binary spanning trees
Several differences
I lesser dependencies on remote information for progress of collectivehence, more prominent for one-sided transfer calls supported by some networkmessaging APIs
I construct spanning trees of arbitrary branching factors
2 3 4 5 6 7branching factor
200
400
600
800
1000
1200
1400
1600
1800
time
(mic
rose
cond
s)
Message size (bytes)32256409616384
Broadcast time on 32k core of BG/P
6Moody, A., Ahn, D., de Supinski, B.: Exascale algorithms for generalized MPI comm split. In: Recent Advances in the
Message Passing Interface. pp. 9 - 18. EuroMPI11 (2011)
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 24 / 26
Summary
Space and time complexities for different group creation schemes
MPI(typical) Centralized Shrink-&-Balance Rank-&-Hash
Space O(n) O(m) O(logn) O(1)Time O(n + m logm) O(m + logn) O(log2 n) O(logn)Msg Count n logn n + m Ω(n + m) n + 4m + m
k
Max Msg Size O(n) O(m) O(logn) O(1)
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 25 / 26
Summary
System assigned ranks eliminate sorting of user-supplied keys
Spanning-Tree based groupsI BalancedI k-aryI Low memory usageI Outperforms traditional creation mechanisms
Evaluate performance in the presence of other computation andcommunication akin to real application execution scenarios
Account for network-topology by executing these algorithmshierarchically
Langer, Venkataraman, Kale (PPL, UIUC) Spanning Trees on Unranked Process Groups September 25 2012 26 / 26