Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in...

71
Basic Communication Operations Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text “Introduction to Parallel Computing”, Addison Wesley, 2003.

Transcript of Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in...

Page 1: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Basic Communication Operations

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

To accompany the text “Introduction to Parallel Computing”,Addison Wesley, 2003.

Page 2: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Topic Overview

• One-to-All Broadcast and All-to-One Reduction

• All-to-All Broadcast and Reduction

• All-Reduce and Prefix-Sum Operations

• Scatter and Gather

• All-to-All Personalized Communication

• Circular Shift

• Improving the Speed of Some Communication Operations

– Typeset by FoilTEX – 1

Page 3: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Basic Communication Operations: Introduction

• Many interactions in practical parallel programs occur in well-defined patterns involving groups of processors.

• Efficient implementations of these operations can improveperformance, reduce development effort and cost, andimprove software quality.

• Efficient implementations must leverage underlying architecture.For this reason, we refer to specific architectures here.

• We select a descriptive set of architectures to illustrate theprocess of algorithm design.

– Typeset by FoilTEX – 2

Page 4: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Basic Communication Operations: Introduction

• Group communication operations are built using point-to-pointmessaging primitives.

• Recall from our discussion of architectures that communicatinga message of size m over an uncongested network takes timets + tmw.

• We use this as the basis for our analyses. Where necessary, wetake congestion into account explicitly by scaling the tw term.

• We assume that the network is bidirectional and thatcommunication is single-ported.

– Typeset by FoilTEX – 3

Page 5: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

One-to-All Broadcast and All-to-One Reduction

• One processor has a piece of data (of size m) it needs to sendto everyone.

• The dual of one-to-all broadcast is all-to-one reduction.

• In all-to-one reduction, each processor has m units of data.These data items must be combined piece-wise (using someassociative operator, such as addition or min), and the resultmade available at a target processor.

– Typeset by FoilTEX – 4

Page 6: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

One-to-All Broadcast and All-to-One Reduction

p-11 1 p-1

All-to-one Reduction0 0. . . . . .M M M M

One-to-all Broadcast

One-to-all broadcast and all-to-one reduction among pprocessors.

– Typeset by FoilTEX – 5

Page 7: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

One-to-All Broadcast and All-to-One Reduction onRings

• Simplest way is to send p − 1 messages from the source to theother p − 1 processors – this is not very efficient.

• Use recursive doubling: source sends a message to a selectedprocessor. We now have two independent problems derinedover halves of machines.

• Reduction can be performed in an identical fashion byinverting the process.

– Typeset by FoilTEX – 6

Page 8: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

One-to-All Broadcast2

33

2

0 1 2 3

4567

1

33

One-to-all broadcast on an eight-node ring. Node 0 is thesource of the broadcast. Each message transfer step is shown bya numbered, dotted arrow from the source of the message to its

destination. The number on an arrow indicates the time stepduring which the message is transferred.

– Typeset by FoilTEX – 7

Page 9: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-One Reduction

1

0 1 2 3

4567

2

2

1 1

3

1

Reduction on an eight-node ring with node 0 as the destinationof the reduction.

– Typeset by FoilTEX – 8

Page 10: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction: Example

Consider the problem of multiplying a matrix with a vector.

• The n × n matrix is assigned to an n × n (virtual) processor grid.The vector is assumed to be on the first row of processors.

• The first step of the product requires a one-to-all broadcastof the vector element along the corresponding column ofprocessors. This can be done concurrently for all n columns.

• The processors compute local product of the vector elementand the local matrix entry.

• In the final step, the results of these products are accumulatedto the first row using n concurrent all-to-one reductionoperations along the oclumns (using the sum operation).

– Typeset by FoilTEX – 9

Page 11: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction: Matrix-VectorMultiplication Example

P

0

4

8

12

P

P

P

P

0

4

8

12

P

P

P 1

5

9

13P

P

P

P

2

6

10

14P

P

P

P

3

7

Matrix11

15P

P

P

P

All-to-onereduction

P P PP0 1 2 3

Output

One-to-all broadcast

Vector

Input Vector

One-to-all broadcast and all-to-one reduction in themultiplication of a 4 × 4 matrix with a 4 × 1 vector.

– Typeset by FoilTEX – 10

Page 12: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction on a Mesh

• We can view each row and column of a square mesh of pnodes as a linear array of

√p nodes.

• Broadcast and reduction operations can be performed in twosteps – the first step does the operation along a row and thesecond step along each column concurrently.

• This process generalizes to higher dimensions as well.

– Typeset by FoilTEX – 11

Page 13: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction on a Mesh: Example3

0

10

15

4

4

4

4

4

4

4

4

3333

22

1

1

2

4

5

6

8

9

11

14

7

13

12

One-to-all broadcast on a 16-node mesh.

– Typeset by FoilTEX – 12

Page 14: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction on a Hypercube

• A hypercube with 2d nodes can be regarded as a d-dimensional mesh with two nodes in each dimension.

• The mesh algorithm can be generalized to a hypercube andthe operation is carried out in d (= log p) steps.

– Typeset by FoilTEX – 13

Page 15: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction on a Hypercube: Example

0 1

32

(001)

4 5

76

3

3

3

12

2

(000)

(011)

(100) (101)

(111)

3

(010)

(110)

One-to-all broadcast on a three-dimensional hypercube. Thebinary representations of node labels are shown in parentheses.

– Typeset by FoilTEX – 14

Page 16: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction on a Balanced Binary Tree

• Consider a binary tree in which processors are (logically) at theleaves and internal nodes are routing nodes.

• Assume that source processor is the root of this tree. In the firststep, the source sends the data to the right child (assumingthe source is also the left child). The problem has nowbeen decomposed into two problems with half the number ofprocessors.

– Typeset by FoilTEX – 15

Page 17: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction on a Balanced Binary Tree

30 1 2 3 4 6 75

1

2

2

3 3 3

One-to-all broadcast on an eight-node tree.

– Typeset by FoilTEX – 16

Page 18: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction Algorithms

• All of the algorithms described above are adaptations of thesame algorithmic template.

• We illustrate the algorithm for a hypercube, but the algorithm,as has been seen, can be adapted to other architectures.

• The hypercube has 2d nodes and my id is the label for a node.

• X is the message to be broadcast, which initially resides at thesource node 0.

– Typeset by FoilTEX – 17

Page 19: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction Algorithms

1. procedure GENERAL ONE TO ALL BC(d, my id, source, X)2. begin3. my virtual id := my id XOR source;4. mask := 2d − 1;5. for i := d − 1 downto 0 do /* Outer loop */6. mask := mask XOR 2i; /* Set bit i of mask to 0 */7. if (my virtual id AND mask) = 0 then8. if (my virtual id AND 2i) = 0 then9. virtual dest := my virtual id XOR 2i;10. send X to (virtual dest XOR source);

/* Convert virtual dest to the label of the physical destination */11. else12. virtual source := my virtual id XOR 2i;13. receive X from (virtual source XOR source);

/* Convert virtual source to the label of the physical source */14. endelse;15. endfor;16. end GENERAL ONE TO ALL BC

One-to-all broadcast of a message X from source on a hypercube.

– Typeset by FoilTEX – 18

Page 20: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Broadcast and Reduction Algorithms

1. procedure ALL TO ONE REDUCE(d, my id, m, X, sum)2. begin3. for j := 0 to m − 1 do sum[j] := X[j];4. mask := 0;5. for i := 0 to d − 1 do

/* Select nodes whose lower i bits are 0 */6. if (my id AND mask) = 0 then7. if (my id AND 2i) 6= 0 then8. msg destination := my id XOR 2i;9. send sum to msg destination;10. else11. msg source := my id XOR 2i;12. receive X from msg source;13. for j := 0 to m − 1 do14. sum[j] :=sum[j] + X[j];15. endelse;16. mask := mask XOR 2i; /* Set bit i of mask to 1 */17. endfor;18. end ALL TO ONE REDUCE

Single-node accumulation on a d-dimensional hypercube. Each nodecontributes a message X containing m words, and node 0 is the destination.

– Typeset by FoilTEX – 19

Page 21: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Cost Analysis

• The broadcast or reduction procedure involves log p point-to-point simple message transfers, each at a time cost of ts + twm.

• The total time is therefore given by:

T = (ts + twm) log p. (1)

– Typeset by FoilTEX – 20

Page 22: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Broadcast and Reduction

• Generalization of broadcast in which each processor is thesource as well as destination.

• A process sends the same m-word message to every otherprocess, but different processes may broadcast differentmessages.

– Typeset by FoilTEX – 21

Page 23: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Broadcast and Reduction

1 pM -1M 0 M 0

M 1

M 0

M 1

M 0

M 1

pM -1 pM -1 pM

M

-1

All-to-all reduction

......

...

p-11 1 p-10 0. . . . . .

All-to-all broadcast

All-to-all broadcast and all-to-all reduction.

– Typeset by FoilTEX – 22

Page 24: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Broadcast and Reduction on a Ring

• Simplest approach: perform p one-to-all broadcasts. This is notthe most efficient way, though.

• Each node first sends to one of its neighbors the data it needsto broadcast.

• In subsequent steps, it forwards the data received from one ofits neighbors to its other neighbor.

• The algorithm terminates in p − 1 steps.

– Typeset by FoilTEX – 23

Page 25: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Broadcast and Reduction on a Ring

.

.

....

7 (4)7 (3)7 (2)

(3,2,1,0,7,6,5)(1,0,7,6,5,4,3) (2,1,0,7,6,5,4)(0,7,6,5,4,3,2)

(5) (4)

(3)(2)(1)

(6)(7)

(0)

(7,6) (6,5) (5,4) (4,3)

(3,2)(2,1)(1,0)

7th communication step

(0,7)

7 (0) 7 (7) 7 (6)

0 1

67

2 3

45

2 (7) 2 (0) 2 (1)

2 (4) 2 (3)2 (5)

0 1

67

2 3

45

1 (0) 1 (1) 1 (2)

1 (6) 1 (5) 1 (4)

(7,6,5,4,3,2,1) (6,5,4,3,2,1,0) (5,4,3,2,1,0,7) (4,3,2,1,0,7,6)

0 1

67

2 3

45

7 (1) 7 (5)

2 (2)2 (6)

1 (7) 1 (3)

1st communication step

2nd communication step

All-to-all broadcast on an eight-node ring.

– Typeset by FoilTEX – 24

Page 26: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Broadcast and Reduction on a Ring

1. procedure ALL TO ALL BC RING(my id, my msg, p, result)2. begin3. left := (my id − 1) mod p;4. right := (my id + 1) mod p;5. result := my msg;6. msg := result;7. for i := 1 to p − 1 do8. send msg to right;9. receive msg from left;10. result := result ∪ msg;11. endfor;12. end ALL TO ALL BC RING

All-to-all broadcast on a p-node ring.

All-to-all reduction is simply a dual of this operation and can beperformed in an identical fashion.

– Typeset by FoilTEX – 25

Page 27: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-all Broadcast on a Mesh

• Performed in two phases – in the first phase, each row of themesh performs an all-to-all broadcast using the procedure forthe linear array.

• In this phase, all nodes collect√

p messages corresponding tothe

√p nodes of their respective rows. Each node consolidates

this information into a single message of size m√

p.

• The second communication phase is a columnwise all-to-allbroadcast of the consolidated messages.

– Typeset by FoilTEX – 26

Page 28: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-all Broadcast on a Mesh

7

0 1 2

53 4

86

(3,4,5) (3,4,5)(3,4,5)

0 1 2

53 4

876

(6) (8)

(3) (4) (5)

(0) (1) (2)

(7)

(a) Initial data distribution

(0,1,2)

(b) Data distribution after rowwise broadcast

(6,7,8) (6,7,8) (6,7,8)

(0,1,2) (0,1,2)

All-to-all broadcast on a 3 × 3 mesh. The groups of nodescommunicating with each other in each phase are enclosed bydotted boundaries. By the end of the second phase, all nodes

get (0,1,2,3,4,5,6,7) (that is, a message from each node).

– Typeset by FoilTEX – 27

Page 29: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-all Broadcast on a Mesh1. procedure ALL TO ALL BC MESH(my id, my msg, p, result)2. begin/* Communication along rows */3. left := my id − (my id mod

√p) + (my id − 1)mod

√p;

4. right := my id − (my id mod√

p) + (my id + 1) mod√

p;5. result := my msg;6. msg := result;7. for i := 1 to

√p − 1 do

8. send msg to right;9. receive msg from left;10. result := result ∪ msg;11. endfor;/* Communication along columns */12. up := (my id − √

p) mod p;13. down := (my id +

√p) mod p;

14. msg := result;15. for i := 1 to

√p − 1 do

16. send msg to down;17. receive msg from up;18. result := result ∪ msg;19. endfor;20. end ALL TO ALL BC MESH

All-to-all broadcast on a square mesh of p nodes.

– Typeset by FoilTEX – 28

Page 30: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-all broadcast on a Hypercube

• Generalization of the mesh algorithm to log p dimensions.

• Message size doubles at each of the log p steps.

– Typeset by FoilTEX – 29

Page 31: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-all broadcast on a Hypercube

(0,...,7)

(0,...,7)(0,...,7)

(0,1,

(0,...,7)

(b) Distribution before the second step

(0,...,7)

6,7)

(4,5,

6,7)

(4,5,

6,7)

(4,5,

6,7)

(4,5,

2,3)

(0,1,

2,3)

(0,1,

2,3)

(0,1,

2,3)0 1

32

4 5

76

(c) Distribution before the third step

0 1

32

4 5

76

(d) Final distribution of messages

(0,...,7) (0,...,7)

(0,...,7)

0 1

32

4 5

76

(0)

(2)

(4)

(1)

(5)

(3)

(7)(6)

(a) Initial distribution of messages

0 1

32

4 5

76

(0,1)

(2,3) (2,3)

(0,1)

(6,7) (6,7)

(4,5) (4,5)

All-to-all broadcast on an eight-node hypercube.

– Typeset by FoilTEX – 30

Page 32: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-all broadcast on a Hypercube

1. procedure ALL TO ALL BC HCUBE(my id, my msg, d, result)2. begin3. result := my msg;4. for i := 0 to d − 1 do5. partner := my id XOR 2i;6. send result to partner;7. receive msg from partner;8. result := result ∪ msg;9. endfor;10. end ALL TO ALL BC HCUBE

All-to-all broadcast on a d-dimensional hypercube.

– Typeset by FoilTEX – 31

Page 33: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-all Reduction

• Similar communication pattern to all-to-all broadcast, exceptin the reverse order.

• On receiving a message, a node must combine it with the localcopy of the message that has the same destination as thereceived message before forwarding the combined messageto the next neighbor.

– Typeset by FoilTEX – 32

Page 34: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Cost Analysis

• On a ring, the time is given by: (ts + twm)(p − 1).

• On a mesh, the time is given by: 2ts(√

p − 1) + twm(p − 1).

• On a hypercube, we have:

T =

log p∑

i=1

(ts + 2i−1twm)

= ts log p + twm(p − 1). (2)

– Typeset by FoilTEX – 33

Page 35: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-all broadcast: Notes

• All of the algorithms presented above are asymptoticallyoptimal in message size.

• It is not possible to port algorithms for higher dimensionalnetworks (such as a hypercube) into a ring because this wouldcause contention.

– Typeset by FoilTEX – 34

Page 36: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-all broadcast: Notes

messages

0 1

67

2 3

45

Contention for a singlechannel by multiple

Contention for a channel when the hypercube is mapped ontoa ring.

– Typeset by FoilTEX – 35

Page 37: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-Reduce and Prefix-Sum Operations

• In all-reduce, each node starts with a buffer of size m and thefinal results of the operation are identical buffers of size m oneach node that are formed by combining the original p buffersusing an associative operator.

• Identical to all-to-one reduction followed by a one-to-allbroadcast. This formulation is not the most efficient. Uses thepattern of all-to-all broadcast, instead. The only differenceis that message size does not increase here. Time for thisoperation is (ts + twm) log p.

• Different from all-to-all reduction, in which p simultaneous all-to-one reductions take place, each with a different destination forthe result.

– Typeset by FoilTEX – 36

Page 38: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

The Prefix-Sum Operation

• Given p numbers n0, n1, . . . , np−1 (one on each node), theproblem is to compute the sums sk = Σk

i=0ni for all k between 0and p − 1.

• Initially, nk resides on the node labeled k, and at the end of theprocedure, the same node holds sk.

– Typeset by FoilTEX – 37

Page 39: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

The Prefix-Sum Operation

0

(c) Distribution of sums before third step

1

32

4 5

76

0 1

32

4 5

76

(3)

(7)(6)

(4) [4]

(6+7)(6)

[4]

(4+5)

(2)

[2]

(2+3)

[2]

(4+5)

(0+1) [0+1](0+1)

[0]

(0)

[0]

(2+3)

(5)

(1)

[6] [7]

[3]

[5]

[1]

[6]

[2+3]

[4+5]

[6+7]

0 1

32

4 5

76

0 1

32

4 5

76

[0+ .. +7][0+ .. +6]

[0+1+2](0+1+

2+3)

[0+1+2]

(4+5)

[0+1+2+3+4] [0+ .. +5]

[4]

(4+5)

2+3)

(0+1+

[0]

2+3)

(0+1+[0] [0+1]

[0+1+2+3][0+1+2+3]

(4+5+6+7) [4+5+6+7](4+5+6) [4+5+6]

[4+5]

[0+1]

(0+1+2+3)

(a) Initial distribution of values

(d) Final distribution of prefix sums

(b) Distribution of sums before second step

Computing prefix sums on an eight-node hypercube. At eachnode, square brackets show the local prefix sum accumulated in

the result buffer and parentheses enclose the contents of theoutgoing message buffer for the next step.

– Typeset by FoilTEX – 38

Page 40: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

The Prefix-Sum Operation

• The operation can be implemented using the all-to-allbroadcast kernel.

• We must account for the fact that in prefix sums the node withlabel k uses information from only the k-node subset whoselabels are less than or equal to k.

• This is implemented using an additional result buffer. Thecontent of an incoming message is added to the result bufferonly if the message comes from a node with a smaller labelthan the recipient node.

• The contents of the outgoing message (denoted by parenthesesin the figure) are updated with every incoming message.

– Typeset by FoilTEX – 39

Page 41: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

The Prefix-Sum Operation

1. procedure PREFIX SUMS HCUBE(my id, my number, d, result)2. begin3. result := my number;4. msg := result;5. for i := 0 to d − 1 do6. partner := my id XOR 2i;7. send msg to partner;8. receive number from partner;9. msg := msg + number;10. if (partner < my id) then result := result + number;11. endfor;12. end PREFIX SUMS HCUBE

Prefix sums on a d-dimensional hypercube.

– Typeset by FoilTEX – 40

Page 42: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Scatter and Gather

• In the scatter operation, a single node sends a unique messageof size m to every other node (also called a one-to-allpersonalized communication).

• In the gather operation, a single node collects a uniquemessage from each node.

• While the scatter operation is fundamentally different frombroadcast, the algorithmic structure is similar, except fordifferences in message sizes (messages get smaller in scatterand stay constant in broadcast).

• The gather operation is exactly the inverse of the scatteroperation and can be executed as such.

– Typeset by FoilTEX – 41

Page 43: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Gather and Scatter OperationsM -1

M 0

M 1

...

M 1 pM -1M 0

p

Scatter

p-11 1 p-10 0. . . . . .Gather

Scatter and gather operations.

– Typeset by FoilTEX – 42

Page 44: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Example of the Scatter Operation

2,3)

(0,1,

(4,5,

0 1

6,7)

3

(b) Distribution before the second step

2

4 5

76

0 1

32

4 5

76

(0,1,2,3,

4,5,6,7)

0 1

32

4 5

76

0 1

32

4 5

76

(6,7)

(4) (5)

(7)(6)

(0,1)

(2,3)

(4,5)

(0)

(2)

(1)

(3)

(d) Final distribution of messages

(a) Initial distribution of messages

(c) Distribution before the third step

The scatter operation on an eight-node hypercube.

– Typeset by FoilTEX – 43

Page 45: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Cost of Scatter and Gather

• There are log p steps, in each step, the machine size halves andthe data size halves.

• We have the time for this operation to be:

T = ts log p + twm(p − 1). (3)

• This time hpnds for a linear array as well as a 2-D mesh.

• These times are asymptotically optimal in message size.

– Typeset by FoilTEX – 44

Page 46: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication

• Each node has a distinct message of size m for every othernode.

• This is unlike all-to-all broadcast, in which each node sends thesame message to all other nodes.

• All-to-all personalized communication is also known as totalexchange.

– Typeset by FoilTEX – 45

Page 47: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication

..

pM -1,0...

p -1Mp -1,...

p -1M 0,

p -1M 1,

1 p-10 . . .

M

M 0,0

1,0

pM

M

M 0,1

1,1

-1,1.

All-to-all personalized

.

communication

p -1Mp -1,p -1M 1,p -1M 0,

p-110 . . .

...M

M

.....M

M 0,0

0,1

1,0

1,1

pM -1,0

pM -1,1

All-to-all personalized communication.

– Typeset by FoilTEX – 46

Page 48: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication: Example

Consider the problem of transposing a matrix.

• Each processor contains one full row of the matrix.

• The transpose operation in this case is identical to an all-to-allpersonalized communication operation.

– Typeset by FoilTEX – 47

Page 49: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication: Example0P

3

n

P

P

P1

2

All-to-all personalized communication in transposing a 4 × 4matrix using four processes.

– Typeset by FoilTEX – 48

Page 50: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on a Ring

• Each node sends all pieces of data as one consolidatedmessage of size m(p − 1) to one of its neighbors.

• Each node extracts the information meant for it from the datareceived, and forwards the remaining (p − 2) pieces of size meach to the next node.

• The algorithm terminates in p − 1 steps.

• The size of the message reduces by m at each step.

– Typeset by FoilTEX – 49

Page 51: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on a Ring

({2,0},({1,0})

({0,1} ... {0,5}) ({1,2} ... {1,0})

({0,2} ... {0,5})

({2,1}) ({3,2})

({5,2} ... {5,4})

({5,1} ... {5,4})

({4,1} ... {4,3})

({4,2}, {4,3})({3,1}, {3,2})

({2,3}, {2,4}, {2,5}, {2,0},

{1,0}) {1,5}, {1,4},({1,3},

{0,5}) {0,4},({0,3}, ({5,3},

{5,4})

{2,1})

({4,3}) {4,2}, {4,3}) {5,4})

{5,3}, {5,2}, {5,1},({5,0},

1

{4,1},

{3,2}) {3,1},({3,0}, ({4,0},

{2,1})

0 1 2

345

({3,4} ... {3,2})

({2,4} ... {2,1})

({1,4} ... {1,0})

({4,5} ... {4,3})

({3,5} ... {3,2})

({2,5} ... {2,1})

({0,4}, {0,5})({1,5}, {1,0})

({0,5}) ({5,4})

3 3

1

2

3

4

5

1

2

3

4

5

2345

2 43 5

1

1

1

2

4

5 5

4

2

All-to-all personalized communication on a six-node ring. Thelabel of each message is of the form {x, y}, where x is the label

of the node that originally owned the message, and y is thelabel of the node that is the final destination of the message. The

label ({x1, y1}, {x2, y2}, . . . , {xn, yn}) indicates a message that isformed by concatenating n individual messages.

– Typeset by FoilTEX – 50

Page 52: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on a Ring:Cost

• We have p − 1 steps in all.

• In step i, the message size is m(p − i).

• The total time is given by:

T =

p−1∑

i=1

(ts + twm(p − i))

= ts(p − 1) +

p−1∑

i=1

itwm

= (ts + twmp/2)(p − 1). (4)

• The tw term in this equation can be reduced by a factor of 2 bycommunicating messages in both directions.

– Typeset by FoilTEX – 51

Page 53: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on a Mesh

• Each node first groups its p messages according to the columnsof their destination nodes.

• All-to-all personalized communication is performed independentlyin each row with clustered messages of size m

√p.

• Messages in each node are sorted again, this time accordingto the rows of their destination nodes.

• All-to-all personalized communication is performed independentlyin each column with clustered messages of size m

√p.

– Typeset by FoilTEX – 52

Page 54: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on a Mesh

{2,0},{2,3},{2,6})

{1,0},{1,3},{1,6},

{5,0},{5,3},{5,6})

{4,0},{4,3},{4,6},

({3,0},{3,3},{3,6},

{8,0},{8,3},{8,6})

{7,0},{7,3},{7,6},

({6,0},{6,3},{6,6},

{8,1},{8,4},{8,7})

({6,1},{6,4},{6,7},

{7,1},{7,4},{7,7},

{8,2},{8,5},{8,8})

{7,2},{7,5},{7,8},

0 1 2

53 4

({6,2},{6,5},{6,8},

8

beginning of first phase

76

(b) Data distribution at the beginning of second phase

{4,4},{4,7},

{5,1},{5,,4},

{5,7})

({0,2},{0,5},

{0,8},{1,2},

{1,5},{1,8},

{2,2},{2,5},

{2,8})

({3,2},{3,5},

{3,8},{4,2},

{4,5},{4,8},

{5,2},{5,5},

{5,8})

({3,1},{3,4},

{3,7},{4,1},

{2,7})

{2,1},{2,4},

{1,4},{1,7},

{0,7},{1,1},

({0,1},{0,4},

({0,0},{0,3},{0,6},

0 1 2

53 4

876

{1,1},{1,4},{1,7},

({0,0},{0,3},{0,6},

({3,0},{3,3},{3,6},

{4,1},{4,4},{4,7},

{5,2},{5,5},{5,8})

{8,2},{8,5},{8,8})

{7,1},{7,4},{7,7},

({6,0},{6,3},{6,6},

{2,2},{2,5},{2,8})

({1,0},{1,3},{1,6},

{0,1},{0,4},{0,7},

{0,2},{0,5},{0,8}) {1,2},{1,5},{1,8})

{2,1},{2,4},{2,7},

({2,0},{2,3},{2,6},

{3,1},{3,4},{3,7},

{3,2},{3,5},{3,8})

({4,0},{4,3},{4,6},

{4,2},{4,5},{4,8})

({5,0},{5,3},{5,6},

{5,1},{5,4},{4,7},

{6,1},{6,4},{6,7},

{6,2},{6,5},{6,8})

({7,0},{7,3},{7,6},

{7,2},{7,5},{7,8})

({8,0},{8,3},{8,6},

{8,1},{8,4},{8,7},

(a) Data distribution at the

The distribution of messages at the beginning of each phase ofall-to-all personalized communication on a 3 × 3 mesh. At the

end of the second phase, node i has messages ({0,i}, . . . ,{8,i}),where 0 ≤ i ≤ 8. The groups of nodes communicating together in

each phase are enclosed in dotted boundaries.

– Typeset by FoilTEX – 53

Page 55: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on a Mesh:Cost

• Time for the first phase is identical to that in a ring with√

pprocessors, i.e., (ts + twmp/2)(

√p − 1).

• Time in the second phase is identical to the first phase.Therefore, total time is twice of this time, i.e.,

T = (2ts + twmp)(√

p − 1). (5)

• It can be shown that the time for rearrangement is less muchless than this communication time.

– Typeset by FoilTEX – 54

Page 56: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on aHypercube

• Generalize the mesh algorithm to log p steps.

• At any stage in all-to-all personalized communication, everynode holds p packets of size m each.

• While communicating in a particular dimension, every nodesends p/2 of these packets (consolidated as one message).

• A node must rearrange its messages locally before each of thelog p communication steps.

– Typeset by FoilTEX – 55

Page 57: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on aHypercube

({0,0} ... {0,7})

({4,1},{6,1},

{4,5},{6,5},

{5,1},{7,1},

{5,5},{7,5})

({1,0} ... {1,7})

({4,0} ... {4,7}) ({5,0} ... {5,7})

({3,0} ... {3,7})({2,0} ... {2,7})

({7,0} ... {7,7})({6,0} ... {6,7})

(a) Initial distribution of messages

6 7

54

2 3

10

{1,0},{1,2},{1,4},{1,6})

({0,0},{0,2},{0,4},{0,6},

{3,4},{3,6})

{3,0},{3,2},

{2,4},{2,6},

({0,6} ... {7,6})

({2,0},{2,2},

({6,0},{6,2},{6,4},{6,6}, ({6,1},{6,3},{6,5},{6,7},

0 1

32

4 5

76

({1,1},{1,3},{1,5},{1,7},

{0,1},{0,3},{0,5},{0,7})

{7,0},{7,2},{7,4},{7,6}) {7,1},{7,3},{7,5},{7,7})

({4,1},{4,3},

{4,5},{4,7},

{5,1},{5,3},

{5,5},{5,7})

0 1

32

4 5

76

(b) Distribution before the second step

(d) Final distribution of messages

({0,0} ... {7,0}) ({0,1} ... {7,1})

({0,5} ... {7,5})({0,4} ... {7,4})

({0,7} ... {7,7})

({0,3} ... {7,3})({0,2} ... {7,2})

{1,0},{1,4},{3,0},{3,4}) {0,1},{0,5},{2,1},{2,5})

0 1

32

4 5

76

({0,0},{0,4},{2,0},{2,4}, ({1,1},{1,5},{3,1},{3,5},

({6,2},{6,6},{4,2},{4,6},

{7,2},{7,6},{5,2},{5,6})

({7,3},{7,7},{5,3},{5,7},

{6,3},{6,7},{4,3},{4,7})

({0,2},{2,2},

{0,6},{2,6},

{1,2},{3,2},

{1,6},{3,6})

(c) Distribution before the third step

An all-to-all personalized communication algorithm on athree-dimensional hypercube.

– Typeset by FoilTEX – 56

Page 58: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on aHypercube: Cost

• We have log p iterations and mp/2 words are communicated ineach iteration. Therefore, the cost is:

T = (ts + twmp/2) log p. (6)

• This is not optimal!

– Typeset by FoilTEX – 57

Page 59: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on aHypercube: Optimal Algorithm

• Each node simply performs p − 1 communication steps,exchanging m words of data with a different node in everystep.

• A node must choose its communication partner in each stepso that the hypercube links do not suffer congestion.

• In the jth communication step, node i exchanges data withnode (i XOR j).

• In this schedule, all paths in every communication step arecongestion-free, and none of the bidirectional links carry morethan one message in the same direction.

– Typeset by FoilTEX – 58

Page 60: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on aHypercube: Optimal Algorithm

2

6

6

(a)

(d)

0 1

32

4 5

76

0 1

32

4 5

76

7

(c)

(f)

0 1

32

4 5

76

0 1

32

4 5

76

0467

1576

45

354

4023

5

6

132

201

7310

(b)

(e)

(g)

0 1

32

4 5

76

0 1

32

4 5

76

0 1

32

4 5

7

– Typeset by FoilTEX – 59

Page 61: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Seven steps in all-to-all personalized communication on aneight-node hypercube.

– Typeset by FoilTEX – 60

Page 62: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on aHypercube: Optimal Algorithm

1. procedure ALL TO ALL PERSONAL(d, my id)2. begin3. for i := 1 to 2d − 1 do4. begin5. partner := my id XOR i;6. send Mmy id,partner to partner;7. receive Mpartner,my id from partner;8. endfor;9. end ALL TO ALL PERSONAL

A procedure to perform all-to-all personalized communication on ad-dimensional hypercube. The message Mi,j initially resides on node i and is

destined for node j.

– Typeset by FoilTEX – 61

Page 63: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

All-to-All Personalized Communication on aHypercube: Cost Analysis of Optimal Algorithm

• There are p − 1 steps and each step involves non-congesting messagetransfer of m words.

• We have:T=(ts + twm)(p − 1). (7)

• This is asymptotically optimal in message size.

– Typeset by FoilTEX – 62

Page 64: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Circular Shift

• A special permutation in which node i sends a data packet to node (i + q)

mod p in a p-node ensemble (0 < q < p).

– Typeset by FoilTEX – 63

Page 65: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Circular Shift on a Mesh

• The implementation on a ring is rather intuitive. It can be performed inmin{q, p − q} neighbor communications.

• Mesh algorithms follow from this as well. We shift in one direction (allprocessors) followed by the next direction.

• The associated time has an upper bound of:

T = (ts + twm)(√

p + 1).

– Typeset by FoilTEX – 64

Page 66: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Circular Shift on a Mesh

11

(14)(13)(12)

(8)

(0) (2)

(10)(9)

(6)(5)(4)

(1)(15)

(3)

(7)

(c) Column shifts in the third communication step

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15(12) (13) (14)(11)

0 1 2 3

4 5 6 7

8 9 10

12 13 14 15

(3)

(7)

(11)

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

(3)

(7)

(1)

(4) (5) (6)

(2)(0)

(11) (8) (9) (10)

(14)(13)(12)(15)

(d) Final distribution of the data

(a) Initial data distribution and thefirst communication step

(b) Step to compensate for backward row shifts

(1)

(4) (5) (6)

(9) (10)

(2)(0)

(8)

(12) (13) (14) (15)

(11)

(3)

(7)

(15) (1)

(4) (5) (6)

(9) (10)

(2)(0)

(8)

The communication steps in a circular 5-shift on a 4 × 4 mesh.

– Typeset by FoilTEX – 65

Page 67: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Circular Shift on a Hypercube

• Map a linear array with 2d nodes onto a d-dimensional hypercube.• To perform a q-shift, we expand q as a sum of distinct powers of 2.• If q is the sum of s distinct powers of 2, then the circular q-shift on a

hypercube is performed in s phases.• The time for this is upper bounded by:

T = (ts + twm)(2 log p − 1). (8)

• If E-cube routing is used, this time can be reduced to

T = ts + twm. (9)

– Typeset by FoilTEX – 66

Page 68: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Circular Shift on a Hypercube

0 1

23

4 5

6

(c) Final data distribution after the 5-shift

7

(7)

(0)

(3)

(4)

(6)

(1)

(2)

(5)

0 1

23

4 5

67

(2)(3)

(0)(1)

(4)

(7)

(5)

(6)

0 1

23

4 5

67

(3)

(6)

(7) (0)

(5)

(4)

(1)(2)

0 1

23

4 5

67

(4)

(7)

(0)

(3)

(1)

(2)

(6)

(5)

First communication step of the 4-shift Second communication step of the 4-shift

(a) The first phase (a 4-shift)

(b) The second phase (a 1-shift)

The mapping of an eight-node linear array onto a three-dimensionalhypercube to perform a circular 5-shift as a combination of a 4-shift and a

1-shift.

– Typeset by FoilTEX – 67

Page 69: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Circular Shift on a Hypercube

0 1

32

4 5

76

0 1

3

4 5

76

2 3

10

2

4 5

76

0 1

32

4 5

76

(g) 7-shift

(f) 6-shift(d) 4-shift (e) 5-shift

6

(c) 3-shift(a) 1-shift (b) 2-shift

0 1

32

4 5

76

0 1

32

4 5

76

0 1

32

4 5

7

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8.

– Typeset by FoilTEX – 68

Page 70: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Improving Performance of Operations

• Splitting and routing messages into parts: If the message can be split into p

parts, a one-to-all broadcase can be implemented as a scatter operationfollowed by an all-to-all broadcast operation. The time for this is:

T = 2 × (ts log p + tw(p − 1)m

p)

≈ 2 × (ts log p + twm). (10)

• All-to-one reduction can be performed by performing all-to-all reduction(dual of all-to-all broadcast) followed by a gather operation (dual ofscatter).

– Typeset by FoilTEX – 69

Page 71: Basic Communication OperationsBasic Communication Operations: Introduction Many interactions in practical parallel programs occur in well-dened patterns involving groups of processors.

Improving Performance of Operations

• Since an all-reduce operation is semantically equivalent to an all-to-onereduction followed by a one-to-all broadcast, the asymptotically optimalalgorithms for these two operations can be used to construct a similaralgorithm for the all-reduce operation.

• The intervening gather and scatter operations cancel each other.Therefore, an all-reduce operation requires an all-to-all reduction and anall-to-all broadcast.

– Typeset by FoilTEX – 70