All-Reduce and Prefix-Sum Operations
Uploaded by syed-zaid-irshad. Category: Engineering.
All-Reduce and Prefix-Sum Operations
• In all-reduce, each node starts with a buffer of size m, and the final results of the operation are identical buffers of size m on each node, formed by combining the original p buffers using an associative operator.
• Semantically, this is identical to an all-to-one reduction followed by a one-to-all broadcast, but that formulation is not the most efficient. A better implementation uses the communication pattern of all-to-all broadcast instead; the only difference is that the message size does not increase here. The time for this operation is (ts + twm) log p.
• All-reduce is different from all-to-all reduction, in which p simultaneous all-to-one reductions take place, each with a different destination for the result.
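The all-reduce pattern described above can be sketched as a small simulation. This is a minimal illustration, assuming a hypercube with p = 2^d nodes and one value per node; the function name hypercube_all_reduce and the list-based "network" are assumptions made for this example, not part of the original slides.

```python
def hypercube_all_reduce(values, op):
    """Simulate all-reduce on a hypercube; values[i] is the buffer of node i."""
    p = len(values)
    d = p.bit_length() - 1
    assert p == 1 << d, "p must be a power of two"
    buf = list(values)
    for k in range(d):
        # Every node exchanges its whole buffer with its neighbor along
        # dimension k and combines the two. The message size never grows.
        # The lower-labeled operand goes on the left so that an associative
        # but non-commutative operator gives the same result on every node.
        buf = [op(buf[min(i, i ^ (1 << k))], buf[max(i, i ^ (1 << k))])
               for i in range(p)]
    return buf
```

Each of the log p rounds moves one message of size m per node, which is where the (ts + twm) log p estimate comes from.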
The Prefix-Sum Operation
• Given p numbers n_0, n_1, …, n_{p-1} (one on each node), the problem is to compute the sums $s_k = \sum_{i=0}^{k} n_i$ for all k between 0 and p-1.
• Initially, n_k resides on the node labeled k, and at the end of the procedure, the same node holds s_k.
The Prefix-Sum Operation
Computing prefix sums on an eight-node hypercube. At each node, square brackets show the local prefix sum accumulated in the result buffer and parentheses enclose
the contents of the outgoing message buffer for the next step.
The Prefix-Sum Operation
• The operation can be implemented using the all-to-all broadcast kernel.
• We must account for the fact that in prefix sums the node with label k uses information only from the (k+1)-node subset of nodes whose labels are less than or equal to k.
• This is implemented using an additional result buffer. The content of an incoming message is added to the result buffer only if the message comes from a node with a smaller label than the recipient node.
• The contents of the outgoing message (denoted by parentheses in the figure) are updated with every incoming message.
The Prefix-Sum Operation
Prefix sums on a d-dimensional hypercube.
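The procedure itself did not survive in this transcript, so the following is a hedged sketch of the idea in the bullets above, not the original pseudocode. It assumes p = 2^d nodes with one number each; result plays the role of the square-bracket buffer and msg the role of the parenthesized outgoing buffer. The function name is hypothetical.

```python
def hypercube_prefix_sums(values):
    """Simulate prefix sums on a hypercube; values[k] starts on node k."""
    p = len(values)
    d = p.bit_length() - 1
    assert p == 1 << d, "p must be a power of two"
    result = list(values)  # local prefix sum (square brackets in the figure)
    msg = list(values)     # outgoing message (parentheses in the figure)
    for k in range(d):
        # All nodes exchange their outgoing buffers along dimension k.
        incoming = [msg[i ^ (1 << k)] for i in range(p)]
        for i in range(p):
            partner = i ^ (1 << k)
            msg[i] += incoming[i]         # always fold into the outgoing buffer
            if partner < i:
                result[i] += incoming[i]  # only lower-labeled data counts
    return result
```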
Scatter and Gather
• In the scatter operation, a single node sends a unique message of size m to every other node (also called a one-to-all personalized communication).
• In the gather operation, a single node collects a unique message from each node.
• While the scatter operation is fundamentally different from broadcast, the algorithmic structure is similar, except for differences in message sizes (messages get smaller in scatter and stay constant in broadcast).
• The gather operation is exactly the inverse of the scatter operation and can be executed as such.
Gather and Scatter Operations
Scatter and gather operations.
Example of the Scatter Operation
The scatter operation on an eight-node hypercube.
Cost of Scatter and Gather
• There are log p steps; in each step, the machine size halves and the data size halves.
• The time for this operation is:
$T = t_s \log p + t_w m (p - 1)$
• This time holds for a linear array as well as a 2-D mesh.
• These times are asymptotically optimal in message size.
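The halving behavior can be checked with a small simulation. This is a sketch under stated assumptions: node 0 is the source, the network is a hypercube with p = 2^d nodes, and the function name and the dictionary representation of held messages are illustrative choices.

```python
def hypercube_scatter(messages):
    """Simulate scatter from node 0; messages[i] is destined for node i.

    Returns (held, sizes): held[i] maps destination -> message at node i after
    the last step; sizes[s] is the number of messages per transfer in step s.
    """
    p = len(messages)
    d = p.bit_length() - 1
    assert p == 1 << d, "p must be a power of two"
    held = [dict() for _ in range(p)]
    held[0] = dict(enumerate(messages))  # the source starts with all p messages
    sizes = []
    for k in range(d - 1, -1, -1):                # highest dimension first
        for node in range(0, p, 1 << (k + 1)):    # nodes that hold data now
            partner = node | (1 << k)
            # Hand over every message whose destination is in the partner's half.
            outgoing = {dst: m for dst, m in held[node].items() if (dst >> k) & 1}
            for dst in outgoing:
                del held[node][dst]
            held[partner] = outgoing
        sizes.append(p >> (d - k))
    return held, sizes
```

With p = 8 the transfers carry 4, 2, and 1 messages, so the total data term is m(p - 1) words, matching the tw term of the cost expression.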
All-to-All Personalized Communication
• Each node has a distinct message of size m for every other node.
• This is unlike all-to-all broadcast, in which each node sends the same message to all other nodes.
• All-to-all personalized communication is also known as total exchange.
All-to-All Personalized Communication
All-to-all personalized communication.
All-to-All Personalized Communication: Example
• Consider the problem of transposing a matrix.
• Each processor contains one full row of the matrix.
• The transpose operation in this case is identical to an all-to-all personalized communication operation.
All-to-All Personalized Communication: Example
All-to-all personalized communication in transposing a 4 x 4 matrix using four processes.
All-to-All Personalized Communication on a Ring
• Each node sends all pieces of data as one consolidated message of size m(p – 1) to one of its neighbors.
• Each node extracts the information meant for it from the data received, and forwards the remaining (p – 2) pieces of size m each to the next node.
• The algorithm terminates in p – 1 steps.
• The size of the message reduces by m at each step.
All-to-All Personalized Communication on a Ring
All-to-all personalized communication on a six-node ring. The label of each message is of the form {x,y}, where x is the label of the node that originally owned the message, and y is the label of the node that is the final destination of the message. The label ({x1,y1}, {x2,y2},…, {xn,yn}) indicates a message that is formed by concatenating n individual messages.
All-to-All Personalized Communication on a Ring: Cost
• We have p – 1 steps in all.
• In step i, the message size is m(p – i).
• The total time is given by:
$T = \sum_{i=1}^{p-1} (t_s + t_w m (p - i)) = (t_s + t_w m p / 2)(p - 1)$
• The tw term in this equation can be reduced by a factor of 2 by communicating messages in both directions.
All-to-All Personalized Communication on a Mesh
• Each node first groups its p messages according to the columns of their destination nodes.
• All-to-all personalized communication is performed independently in each row with clustered messages of size m√p.
• Messages in each node are sorted again, this time according to the rows of their destination nodes.
• All-to-all personalized communication is performed independently in each column with clustered messages of size m√p.
All-to-All Personalized Communication on a Mesh
The distribution of messages at the beginning of each phase of all-to-all personalized communication on a 3 x 3 mesh. At the end of the second phase, node i has messages ({0,i},…,{8,i}), where 0 ≤ i ≤ 8. The groups of nodes communicating together in each phase are enclosed
in dotted boundaries.
All-to-All Personalized Communication on a Mesh: Cost
• Time for the first phase is identical to that in a ring with √p processors, i.e., (ts + twmp/2)(√p – 1).
• Time in the second phase is identical to the first phase. Therefore, the total time is twice this time, i.e.,
$T = (2 t_s + t_w m p)(\sqrt{p} - 1)$
• It can be shown that the time for rearrangement is much less than this communication time.
All-to-All Personalized Communication on a Hypercube
• Generalize the mesh algorithm to log p steps.
• At any stage in all-to-all personalized communication, every node holds p packets of size m each.
• While communicating in a particular dimension, every node sends p/2 of these packets (consolidated as one message).
• A node must rearrange its messages locally before each of the log p communication steps.
All-to-All Personalized Communication on a Hypercube
An all-to-all personalized communication algorithm on a three-dimensional hypercube.
All-to-All Personalized Communication on a Hypercube: Cost
• We have log p iterations, and mp/2 words are communicated in each iteration. Therefore, the cost is:
$T = (t_s + t_w m p / 2) \log p$
• This is not optimal!
All-to-All Personalized Communication on a Hypercube: Optimal Algorithm
• Each node simply performs p – 1 communication steps, exchanging m words of data with a different node in every step.
• A node must choose its communication partner in each step so that the hypercube links do not suffer congestion.
• In the jth communication step, node i exchanges data with node (i XOR j).
• In this schedule, all paths in every communication step are congestion-free, and none of the bidirectional links carry more than one message in the same direction.
All-to-All Personalized Communication on a Hypercube: Optimal Algorithm
Seven steps in all-to-all personalized communication on an eight-node hypercube.
All-to-All Personalized Communication on a Hypercube: Optimal Algorithm
A procedure to perform all-to-all personalized communication on a d-dimensional hypercube. The message Mi,j initially resides on node i and is
destined for node j.
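The original procedure is not reproduced in this transcript. As a hedged sketch of the XOR schedule, the simulation below pairs node i with node i XOR j in step j and checks the congestion-free claim under one assumption that is not stated on the slide: messages follow E-cube routing (differing bits are corrected from least to most significant). All names are illustrative.

```python
def ecube_path(src, dst, d):
    """Directed links (node, dimension) used by E-cube routing from src to dst."""
    links, node = [], src
    for dim in range(d):                  # correct differing bits, LSB first
        if ((node ^ dst) >> dim) & 1:
            links.append((node, dim))
            node ^= 1 << dim
    return links


def total_exchange(d):
    """Simulate the p - 1 step schedule; inbox[j][i] ends up holding M(i, j)."""
    p = 1 << d
    inbox = [{i: ("M", i, i)} for i in range(p)]  # each node keeps its own piece
    for j in range(1, p):
        used = set()                      # directed links busy in this step
        for i in range(p):
            partner = i ^ j
            for link in ecube_path(i, partner, d):
                assert link not in used, "link congestion"
                used.add(link)
            inbox[partner][i] = ("M", i, partner)
    return inbox
```

If two messages ever shared a directed link in the same step, the internal assertion would fail; for the XOR schedule it does not.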
All-to-All Personalized Communication on a Hypercube: Cost Analysis of Optimal Algorithm
• There are p – 1 steps and each step involves non-congesting message transfer of m words.
• We have:
$T = (t_s + t_w m)(p - 1)$
• This is asymptotically optimal in message size.
Dense Matrix Algorithms
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
To accompany the text “Introduction to Parallel Computing”, Addison Wesley, 2003.
Topic Overview
• Matrix-Vector Multiplication
• Matrix-Matrix Multiplication
• Solving a System of Linear Equations
Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data-decomposition.
• Typical algorithms rely on input, output, or intermediate data decomposition.
• Most algorithms use one- and two-dimensional block, cyclic, and block-cyclic partitionings.
Matrix-Vector Multiplication
• We aim to multiply a dense n x n matrix A with an n x 1 vector x to yield the n x 1 result vector y.
• The serial algorithm requires n^2 multiplications and additions.
Matrix-Vector Multiplication: Rowwise 1-D Partitioning
• The n x n matrix is partitioned among n processors, with each processor storing one complete row of the matrix.
• The n x 1 vector x is distributed such that each process owns one of its elements.
Matrix-Vector Multiplication: Rowwise 1-D Partitioning
Multiplication of an n x n matrix with an n x 1 vector using rowwise block 1-D partitioning. For the one-row-per-process case, p = n.
Matrix-Vector Multiplication: Rowwise 1-D Partitioning
• Since each process starts with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processes.
• Process Pi now computes $y[i] = \sum_{j=0}^{n-1} A[i, j] \, x[j]$.
• The all-to-all broadcast and the computation of y[i] both take time Θ(n). Therefore, the parallel time is Θ(n).
Matrix-Vector Multiplication: Rowwise 1-D Partitioning
• Consider now the case when p < n and we use block 1-D partitioning.
• Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.
• The all-to-all broadcast takes place among p processes and involves messages of size n/p.
• This is followed by n/p local dot products.
• Thus, the parallel run time of this procedure is
$T_P = \frac{n^2}{p} + t_s \log p + t_w n$
This is cost-optimal.
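The block rowwise scheme can be sketched as a sequential simulation. This is a minimal illustration, assuming p divides n; the function name and the way the all-to-all broadcast is modeled (each process simply receives the concatenation of all vector slices) are assumptions for the example.

```python
def matvec_rowwise_1d(A, x, p):
    """Simulate y = A x with block rowwise 1-D partitioning on p processes."""
    n = len(A)
    assert n % p == 0, "this sketch assumes p divides n"
    rows = n // p
    # Process r owns rows r*rows .. (r+1)*rows - 1 and the matching slice of x.
    local_x = [x[r * rows:(r + 1) * rows] for r in range(p)]
    # All-to-all broadcast: afterwards every process holds the full vector.
    full_x = [sum(local_x, []) for _ in range(p)]
    y = []
    for r in range(p):
        # n/p local dot products of length n on process r.
        for i in range(r * rows, (r + 1) * rows):
            y.append(sum(A[i][j] * full_x[r][j] for j in range(n)))
    return y
```

Each process does (n/p) · n multiply-adds, which is the n^2/p term; the broadcast accounts for the ts log p + tw n terms.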
Matrix-Vector Multiplication: Rowwise 1-D Partitioning
Scalability Analysis:
• We know that T0 = pTP - W; therefore, we have
$T_0 = t_s p \log p + t_w n p$
• For isoefficiency, we have W = KT0, where K = E/(1 – E) for desired efficiency E.
• From this, we have W = O(p^2) (from the tw term).
• There is also a bound on isoefficiency because of concurrency. In this case, p < n; therefore, W = n^2 = Ω(p^2).
• Overall isoefficiency is W = O(p^2).
Matrix-Vector Multiplication: 2-D Partitioning
• The n x n matrix is partitioned among n^2 processors such that each processor owns a single element.
• The n x 1 vector x is distributed only in the last column of n processors.
Matrix-Vector Multiplication: 2-D Partitioning
Matrix-vector multiplication with block 2-D partitioning. For the one-element-per-process case, p = n^2 if the matrix size is n x n.
Matrix-Vector Multiplication: 2-D Partitioning
• We must first align the vector with the matrix appropriately.
• The first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix.
• The second step copies the vector elements from each diagonal process to all the processes in the corresponding column, using n simultaneous broadcasts, one among the processes of each column.
• Finally, the result vector is computed by performing an all-to-one reduction along the rows.
Matrix-Vector Multiplication: 2-D Partitioning
• Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column, and all-to-one reduction in each row.
• Each of these operations takes Θ(log n) time and the parallel time is Θ(log n).
• The cost (process-time product) is Θ(n^2 log n); hence, the algorithm is not cost-optimal.
Matrix-Vector Multiplication: 2-D Partitioning
• When using fewer than n^2 processors, each process owns an (n/√p) x (n/√p) block of the matrix.
• The vector is distributed in portions of n/√p elements in the last process-column only.
• In this case, the message sizes for the alignment, broadcast, and reduction are all n/√p.
• The computation is a product of an (n/√p) x (n/√p) submatrix with a vector of length n/√p.
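Matrix-Vector Multiplication: 2-D Partitioning
• The first alignment step takes time $t_s + t_w n / \sqrt{p}$
• The broadcast and reductions take time $(t_s + t_w n / \sqrt{p}) \log \sqrt{p}$
• Local matrix-vector products take time $n^2 / p$ (counting one multiply-add as unit time)
• Total time is $T_P \approx \frac{n^2}{p} + t_s \log p + t_w \frac{n}{\sqrt{p}} \log p$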
Matrix-Vector Multiplication: 2-D Partitioning
• Scalability Analysis:
• $T_0 = p T_P - W = t_s p \log p + t_w n \sqrt{p} \log p$
• Equating T0 with W, term by term, for isoefficiency, we have $W = K^2 t_w^2 p \log^2 p$ as the dominant term.
• The isoefficiency due to concurrency is O(p).
• The overall isoefficiency is Θ(p log^2 p) (due to the network bandwidth).
• For cost optimality, we have $W = n^2 = \Omega(p \log^2 p)$. For this, we have $p = O(n^2 / \log^2 n)$.
Matrix-Matrix Multiplication
• Consider the problem of multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B.
• The serial complexity is O(n^3).
• We do not consider better serial algorithms (such as Strassen's method), although these can be used as serial kernels in the parallel algorithms.
• A useful concept in this case is called block operations. In this view, an n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix.
• In this view, we perform q^3 matrix multiplications, each involving (n/q) x (n/q) matrices.
Matrix-Matrix Multiplication
• Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.
• All-to-all broadcast blocks of A along rows and B along columns.
• Perform local submatrix multiplication.
Matrix-Matrix Multiplication
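• The two broadcasts take time $2 (t_s \log \sqrt{p} + t_w \frac{n^2}{p} (\sqrt{p} - 1))$
• The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices.
• The parallel run time is approximately
$T_P = \frac{n^3}{p} + t_s \log p + 2 t_w \frac{n^2}{\sqrt{p}}$
• The algorithm is cost optimal and the isoefficiency is O(p^1.5) due to the bandwidth term tw and concurrency.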
• Major drawback of the algorithm is that it is not memory optimal.
Matrix-Matrix Multiplication: Cannon's Algorithm
• In this algorithm, we schedule the computations of the √p processes of the ith row such that, at any given time, each process is using a different block Ai,k.
• These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.
Matrix-Matrix Multiplication: Cannon's Algorithm
Communication steps in Cannon's algorithm on 16 processes.
Matrix-Matrix Multiplication: Cannon's Algorithm
• Align the blocks of A and B in such a way that each process multiplies its local submatrices. This is done by shifting all submatrices Ai,j to the left (with wraparound) by i steps and all submatrices Bi,j up (with wraparound) by j steps.
• Perform local block multiplication.
• Each block of A moves one step left and each block of B moves one step up (again with wraparound).
• Perform the next block multiplication, add to the partial result, and repeat until all blocks have been multiplied.
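The align, multiply, and shift steps above can be sketched as a sequential simulation on a q x q grid of blocks (so p = q^2). This is an illustration under the assumption that q divides n; the function and helper names are hypothetical, and the shifts are modeled by re-indexing lists rather than by real message passing.

```python
def cannon_matmul(A, B, q):
    """Simulate Cannon's algorithm for C = A B on a q x q process grid."""
    n = len(A)
    assert n % q == 0, "this sketch assumes q divides n"
    bs = n // q  # block size

    def block(M, i, j):
        return [row[j * bs:(j + 1) * bs] for row in M[i * bs:(i + 1) * bs]]

    def mul_add(C, X, Y):
        for r in range(bs):
            for c in range(bs):
                C[r][c] += sum(X[r][k] * Y[k][c] for k in range(bs))

    # Alignment: row i of A shifts left by i, column j of B shifts up by j.
    a = [[block(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    b = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    c = [[[[0] * bs for _ in range(bs)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):
        for i in range(q):
            for j in range(q):
                mul_add(c[i][j], a[i][j], b[i][j])  # local block multiply-add
        # Shift A one step left and B one step up, with wraparound.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # Reassemble the grid of result blocks into an n x n matrix.
    return [[c[i // bs][j // bs][i % bs][j % bs] for j in range(n)]
            for i in range(n)]
```

At every step each process holds exactly one block of A and one block of B, which is why this formulation is memory optimal compared with the broadcast-based algorithm.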
Matrix-Matrix Multiplication: Cannon's Algorithm
• In the alignment step, since the maximum distance over which a block shifts is √p – 1, the two shift operations require a total of $2 (t_s + t_w n^2 / p)$ time.
• Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes $t_s + t_w n^2 / p$ time.
• The computation time for multiplying √p matrices of size (n/√p) x (n/√p) is n^3/p.
• The parallel time is approximately:
$T_P = \frac{n^3}{p} + 2 \sqrt{p} \, t_s + 2 t_w \frac{n^2}{\sqrt{p}}$
• The cost-efficiency and isoefficiency of the algorithm are identical to the first algorithm, except that this one is memory optimal.
Matrix-Matrix Multiplication: DNS Algorithm
• Uses a 3-D partitioning.
• Visualize the matrix multiplication algorithm as a cube: matrices A and B come in on two orthogonal faces, and the result C comes out of the other orthogonal face.
• Each internal node in the cube represents a single add-multiply operation (and thus the complexity).
• The DNS algorithm partitions this cube using a 3-D block scheme.
Matrix-Matrix Multiplication: DNS Algorithm
• Assume an n x n x n mesh of processors.
• Move the columns of A and rows of B and perform broadcast.
• Each processor computes a single add-multiply.
• This is followed by an accumulation along the C dimension.
• Since each add-multiply takes constant time and accumulation and broadcast take log n time, the total runtime is Θ(log n).
• This is not cost optimal. It can be made cost optimal by using n / log n processors along the direction of accumulation.
Matrix-Matrix Multiplication: DNS Algorithm
The communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes.
Matrix-Matrix Multiplication: DNS Algorithm
Using fewer than n^3 processors.
• Assume that the number of processes p is equal to q^3 for some q < n.
• The two matrices are partitioned into blocks of size (n/q) x (n/q).
• Each matrix can thus be regarded as a q x q two-dimensional square array of blocks.
• The algorithm follows from the previous one, except that in this case we operate on blocks rather than on individual elements.
Matrix-Matrix Multiplication: DNS Algorithm
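Using fewer than n^3 processors.
• The first one-to-one communication step is performed for both A and B, and takes $t_s + t_w (n/q)^2$ time for each matrix.
• The two one-to-all broadcasts take $t_s \log q + t_w (n/q)^2 \log q$ time for each matrix.
• The reduction takes time $t_s \log q + t_w (n/q)^2 \log q$.
• Multiplication of (n/q) x (n/q) submatrices takes (n/q)^3 time.
• The parallel time is approximated by:
$T_P \approx \frac{n^3}{p} + t_s \log p + t_w \frac{n^2}{p^{2/3}} \log p$
• The isoefficiency function is Θ(p (log p)^3).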
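Solving a System of Linear Equations
• Consider the problem of solving linear equations of the kind:
$a_{0,0} x_0 + a_{0,1} x_1 + \cdots + a_{0,n-1} x_{n-1} = b_0$
$a_{1,0} x_0 + a_{1,1} x_1 + \cdots + a_{1,n-1} x_{n-1} = b_1$
$\vdots$
$a_{n-1,0} x_0 + a_{n-1,1} x_1 + \cdots + a_{n-1,n-1} x_{n-1} = b_{n-1}$
• This is written as Ax = b, where A is an n x n matrix with A[i, j] = ai,j, b is an n x 1 vector [b0, b1, …, bn-1]T, and x is the solution.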
Solving a System of Linear Equations
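The two steps in the solution are reduction to triangular form and back-substitution. The triangular form is:
$x_0 + u_{0,1} x_1 + u_{0,2} x_2 + \cdots + u_{0,n-1} x_{n-1} = y_0$
$x_1 + u_{1,2} x_2 + \cdots + u_{1,n-1} x_{n-1} = y_1$
$\vdots$
$x_{n-1} = y_{n-1}$
We write this as Ux = y. A commonly used method for transforming a given matrix into an upper-triangular matrix is Gaussian elimination.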
Gaussian Elimination
Serial Gaussian Elimination
Gaussian Elimination
• The computation has three nested loops. In the kth iteration of the outer loop, the algorithm performs (n-k)^2 computations. Summing from k = 1..n, we have roughly n^3/3 multiplications and subtractions.
A typical computation in Gaussian elimination.
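The serial procedure is not reproduced in this transcript. The following is a minimal sketch of the three nested loops, assuming every pivot is nonzero (no pivoting) and that the diagonal is normalized to 1 so the output has the form Ux = y. The function name and the choice to return new lists are assumptions for the example.

```python
def gaussian_elimination(A, b):
    """Reduce Ax = b to Ux = y with unit-diagonal upper-triangular U."""
    n = len(A)
    U = [[float(v) for v in row] for row in A]
    y = [float(v) for v in b]
    for k in range(n):                       # outer loop over pivot rows
        pivot = U[k][k]                      # assumed nonzero in this sketch
        for j in range(k + 1, n):
            U[k][j] /= pivot                 # division step
        y[k] /= pivot
        U[k][k] = 1.0
        for i in range(k + 1, n):            # rows below the pivot row
            factor = U[i][k]
            for j in range(k + 1, n):
                U[i][j] -= factor * U[k][j]  # elimination step
            y[i] -= factor * y[k]
            U[i][k] = 0.0
    return U, y
```

For example, the system 2x0 + x1 = 3, 4x0 + 5x1 = 9 reduces to x0 + 0.5 x1 = 1.5 and x1 = 1.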
Parallel Gaussian Elimination
• Assume p = n with each row assigned to a processor.
• The first step of the algorithm normalizes the row. This is a serial operation and takes time (n-k) in the kth iteration.
• In the second step, the normalized row is broadcast to all the processors. This takes time $(t_s + t_w (n - k - 1)) \log n$.
• Each processor can independently eliminate this row from its own. This requires (n-k-1) multiplications and subtractions.
• The total parallel time can be computed by summing from k = 1 … n-1 as
$T_P = \frac{3}{2} n (n - 1) + t_s n \log n + \frac{1}{2} t_w n (n - 1) \log n$
• The formulation is not cost optimal because of the tw term.
Parallel Gaussian Elimination
Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 x 8 matrix partitioned rowwise among eight processes.
Parallel Gaussian Elimination: Pipelined Execution
• In the previous formulation, the (k+1)st iteration starts only after all the computation and communication for the kth iteration is complete.
• In the pipelined version, there are three steps: normalization of a row, communication, and elimination. These steps are performed in an asynchronous fashion.
• A processor Pk waits to receive and eliminate all rows prior to k.
• Once it has done this, it forwards its own row to processor Pk+1.
Parallel Gaussian Elimination: Pipelined Execution
Pipelined Gaussian elimination on a 5 x 5 matrix partitioned with one row per process.
Parallel Gaussian Elimination: Pipelined Execution
• The total number of steps in the entire pipelined procedure is Θ(n).
• In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row.
• The parallel time is therefore O(n^2).
• This is cost optimal.
Parallel Gaussian Elimination: Pipelined Execution
The communication in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix distributed among four processes using block 1-D partitioning.
Parallel Gaussian Elimination: Block 1D with p < n
• The above algorithm can be easily adapted to the case when p < n.
• In the kth iteration, a processor with all rows belonging to the active part of the matrix performs (n – k – 1) n/p multiplications and subtractions.
• In the pipelined version, for n > p, computation dominates communication.
• The parallel time is given by $2 (n/p) \sum_{k=0}^{n-1} (n - k - 1)$, or approximately n^3/p.
• While the algorithm is cost optimal, the cost of the parallel algorithm is higher than the sequential run time by a factor of 3/2.
Parallel Gaussian Elimination: Block 1D with p < n
Computation load on different processes in block and cyclic 1-D partitioning of an 8 x 8 matrix on four processes during the
Gaussian elimination iteration corresponding to k = 3.
Parallel Gaussian Elimination: Block 1D with p < n
• The load imbalance problem can be alleviated by using a cyclic mapping.
• In this case, other than processing of the last p rows, there is no load imbalance.
• This corresponds to a cumulative load imbalance overhead of O(n^2 p) (instead of O(n^3) in the previous case).
Gaussian Elimination with Partial Pivoting
• For numerical stability, one generally uses partial pivoting.
• In the kth iteration, we select a column i (called the pivot column) such that A[k, i] is the largest in magnitude among all A[k, j] such that k ≤ j < n.
• The kth and the ith columns are interchanged.
• This is simple to implement with row-partitioning and does not add overhead, since the division step takes the same time as computing the max.
• Column-partitioning, however, requires a global reduction, adding a log p term to the overhead.
• Pivoting precludes the use of pipelining.
Gaussian Elimination with Partial Pivoting: 2-D Partitioning
• Partial pivoting restricts use of pipelining, resulting in performance loss.
• This loss can be alleviated by restricting pivoting to specific columns.
• Alternatively, we can use faster algorithms for broadcast.
Solving a Triangular System: Back-Substitution
• The upper triangular matrix U undergoes back-substitution to determine the vector x.
A serial algorithm for back-substitution.
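The serial algorithm itself is missing from this transcript. A minimal sketch of back-substitution follows; unlike the unit-diagonal form Ux = y used above, it divides by U[k][k] so that it also works for a general upper-triangular matrix, which is a choice made for this example. The function name is hypothetical.

```python
def back_substitution(U, y):
    """Solve Ux = y for upper-triangular U with nonzero diagonal."""
    n = len(U)
    x = [0.0] * n
    for k in range(n - 1, -1, -1):  # last unknown first
        # Subtract the contribution of the unknowns that are already solved.
        s = y[k] - sum(U[k][j] * x[j] for j in range(k + 1, n))
        x[k] = s / U[k][k]
    return x
```

The inner sum shrinks from n - 1 terms to none, which gives the roughly n^2/2 multiplications and subtractions.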
Solving a Triangular System: Back-Substitution
• The algorithm performs approximately n^2/2 multiplications and subtractions.
• Since the complexity of this part is asymptotically lower, we should optimize the data distribution for the factorization part.
• Consider a rowwise block 1-D mapping of the n x n matrix U with vector y distributed uniformly.
• The value of the variable solved at a step can be pipelined back.
• Each step of a pipelined implementation requires a constant amount of time for communication and Θ(n/p) time for computation.
• The parallel run time of the entire algorithm is Θ(n^2/p).
Solving a Triangular System: Back-Substitution
• If the matrix is partitioned by using 2-D partitioning on a √p x √p logical mesh of processes, and the elements of the vector are distributed along one of the columns of the process mesh, then only the √p processes containing the vector perform any computation.
• Using pipelining to communicate the appropriate elements of U to the process containing the corresponding elements of y for the substitution step (line 7), the algorithm can be executed in Θ(n^2/√p) time.
• While this is not cost optimal, since this does not dominate the overall computation, the cost optimality is determined by the factorization.
Sorting Algorithms
Topic Overview
• Issues in Sorting on Parallel Computers
• Sorting Networks
• Bubble Sort and its Variants
• Quicksort
• Bucket and Sample Sort
• Other Sorting Algorithms
Sorting: Overview
• One of the most commonly used and well-studied kernels.
• Sorting can be comparison-based or noncomparison-based.
• The fundamental operation of comparison-based sorting is compare-exchange.
• The lower bound on any comparison-based sort of n numbers is Ω(n log n).
• We focus here on comparison-based sorting algorithms.
Sorting: Basics
What is a parallel sorted sequence? Where are the input and output lists stored?
• We assume that the input and output lists are distributed.
• The sorted list is partitioned with the property that each partitioned list is sorted and each element in processor Pi's list is less than that in Pj's list if i < j.
Sorting: Parallel Compare Exchange Operation
A parallel compare-exchange operation. Processes Pi and Pj send their elements to each other. Process Pi keeps min{ai,aj}, and Pj keeps
max{ai, aj}.
Sorting: Basics
What is the parallel counterpart to a sequential comparator?
• If each processor has one element, the compare-exchange operation stores the smaller element at the processor with the smaller id. This can be done in ts + tw time.
• If we have more than one element per processor, we call this operation a compare-split. Assume each of the two processors has n/p elements.
• After the compare-split operation, the smaller n/p elements are at processor Pi and the larger n/p elements at Pj, where i < j.
• The time for a compare-split operation is (ts+ twn/p), assuming that the two partial lists were initially sorted.
Sorting: Parallel Compare Split Operation
A compare-split operation. Each process sends its block of size n/p to the other process. Each process merges the received block with its own block and retains only the appropriate half of the merged block. In this example, process Pi retains the smaller elements and process Pj retains the larger elements.
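The compare-split step can be sketched in a few lines. This is a minimal illustration that assumes both blocks are already sorted and have the same length n/p; the function name is hypothetical, and both processes' views are computed in one call rather than by exchanging messages.

```python
import heapq


def compare_split(block_i, block_j):
    """Return (smaller half, larger half) of two sorted, equal-length blocks."""
    k = len(block_i)
    assert len(block_j) == k, "this sketch assumes equal block sizes"
    merged = list(heapq.merge(block_i, block_j))  # linear-time merge
    # Process Pi keeps the smaller n/p elements, Pj the larger n/p elements.
    return merged[:k], merged[k:]
```

Because the inputs are sorted, the merge is linear in n/p, which is consistent with the Θ(n/p) cost of a compare-split.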
Sorting Networks
• Networks of comparators designed specifically for sorting.
• A comparator is a device with two inputs x and y and two outputs x' and y'. For an increasing comparator, x' = min{x,y} and y' = max{x,y}; for a decreasing comparator, x' = max{x,y} and y' = min{x,y}.
• We denote an increasing comparator by ⊕ and a decreasing comparator by Ө.
• The speed of the network is proportional to its depth.
Sorting Networks: Comparators
A schematic representation of comparators: (a) an increasing comparator, and (b) a decreasing comparator.
Sorting Networks
A typical sorting network. Every sorting network is made up of a series of columns, and each column contains a number of
comparators connected in parallel.
Sorting Networks: Bitonic Sort
• A bitonic sorting network sorts n elements in Θ(log^2 n) time.
• A bitonic sequence has two tones: increasing and decreasing, or vice versa. Any cyclic rotation of such a sequence is also considered bitonic.
• 1,2,4,7,6,0 is a bitonic sequence, because it first increases and then decreases. 8,9,2,1,0,4 is another bitonic sequence, because it is a cyclic shift of 0,4,8,9,2,1.
• The kernel of the network is the rearrangement of a bitonic sequence into a sorted sequence.
Sorting Networks: Bitonic Sort
• Let s = a_0, a_1, …, a_{n-1} be a bitonic sequence such that a_0 ≤ a_1 ≤ ··· ≤ a_{n/2-1} and a_{n/2} ≥ a_{n/2+1} ≥ ··· ≥ a_{n-1}.
• Consider the following subsequences of s:
s1 = min{a_0, a_{n/2}}, min{a_1, a_{n/2+1}}, …, min{a_{n/2-1}, a_{n-1}}
s2 = max{a_0, a_{n/2}}, max{a_1, a_{n/2+1}}, …, max{a_{n/2-1}, a_{n-1}} (1)
• Note that s1 and s2 are both bitonic and each element of s1 is no greater than any element of s2.
• We can apply the procedure recursively on s1 and s2 to get the sorted sequence.
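The bitonic split and its recursive application can be sketched as follows. This is a minimal illustration that assumes the input length is a power of two and that the input is already bitonic; the function names are hypothetical.

```python
def bitonic_split(s):
    """Split a bitonic sequence into s1 (element-wise mins) and s2 (maxes)."""
    half = len(s) // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2


def bitonic_merge(s, ascending=True):
    """Sort a bitonic sequence whose length is a power of two."""
    if len(s) <= 1:
        return list(s)
    s1, s2 = bitonic_split(s)
    low = bitonic_merge(s1, ascending)
    high = bitonic_merge(s2, ascending)
    # Every element of s1 is no greater than any element of s2.
    return low + high if ascending else high + low
```

The recursion has depth log n, and each level performs n/2 compare operations in total, mirroring the log n columns of n/2 comparators in a merging network.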
Sorting Networks: Bitonic Sort
Merging a 16-element bitonic sequence through a series of log 16 bitonic splits.
Sorting Networks: Bitonic Sort
• We can easily build a sorting network to implement this bitonic merge algorithm.
• Such a network is called a bitonic merging network.
• The network contains log n columns. Each column contains n/2 comparators and performs one step of the bitonic merge.
• We denote a bitonic merging network with n inputs by BM[n].
• Replacing the ⊕ comparators by Ө comparators results in a decreasing output sequence; such a network is denoted by ӨBM[n].
Sorting Networks: Bitonic Sort
A bitonic merging network for n = 16. The input wires are numbered 0,1,…, n - 1, and the binary representation of these numbers is shown. Each column
of comparators is drawn separately; the entire figure represents a BM[16] bitonic merging network. The network takes a bitonic sequence and outputs it
in sorted order.
Sorting Networks: Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?
• We must first build a single bitonic sequence from the given sequence.
• A sequence of length 2 is a bitonic sequence.
• A bitonic sequence of length 4 can be built by sorting the first two elements using BM[2] and next two, using ӨBM[2].
• This process can be repeated to generate larger bitonic sequences.
Sorting Networks: Bitonic Sort
A schematic representation of a network that converts an input sequence into a bitonic sequence. In this example, BM[k] and ӨBM[k] denote bitonic merging networks of input size k that use ⊕ and Ө comparators, respectively. The last merging network (BM[16]) sorts the input. In this example, n = 16.
Sorting Networks: Bitonic Sort
The comparator network that transforms an input sequence of 16 unordered numbers into a bitonic sequence.
Sorting Networks: Bitonic Sort
• The depth of the network is Θ(log^2 n).
• Each stage of the network contains n/2 comparators. A serial implementation of the network would have complexity Θ(n log^2 n).
Mapping Bitonic Sort to Hypercubes
• Consider the case of one item per processor. The question becomes one of how the wires in the bitonic network should be mapped to the hypercube interconnect.
• Note from our earlier examples that the compare-exchange operation is performed between two wires only if their labels differ in exactly one bit!
• This implies a direct mapping of wires to processors. All communication is nearest neighbor!
Mapping Bitonic Sort to Hypercubes
Communication during the last stage of bitonic sort. Each wire is mapped to a hypercube process; each connection represents a compare-
exchange between processes.
Mapping Bitonic Sort to Hypercubes
Communication characteristics of bitonic sort on a hypercube. During each stage of the algorithm, processes communicate along the
dimensions shown.
Mapping Bitonic Sort to Hypercubes
Parallel formulation of bitonic sort on a hypercube with n = 2^d processes.
Mapping Bitonic Sort to Hypercubes
• During each step of the algorithm, every process performs a compare-exchange operation (single nearest neighbor communication of one word).
• Since each step takes Θ(1) time, the parallel time is
Tp = Θ(log^2 n) (2)
• This algorithm is cost optimal w.r.t. its serial counterpart, but not w.r.t. the best sorting algorithm.
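The hypercube formulation is not reproduced in this transcript, so the following is a hedged sketch rather than the original procedure. It assumes one element per process and n = 2^d processes, and it simulates each compare-exchange along dimension j by pairing label with label XOR 2^j. The function name and the rule for choosing the sort direction (bit i+1 of the label during stage i) are the usual textbook formulation, written here as an assumption.

```python
def hypercube_bitonic_sort(values):
    """Simulate bitonic sort with one element per hypercube process."""
    n = len(values)
    d = n.bit_length() - 1
    assert n == 1 << d, "n must be a power of two"
    a = list(values)
    for i in range(d):               # stage i merges runs of length 2^(i+1)
        for j in range(i, -1, -1):   # compare-exchange along dimension j
            for label in range(n):
                partner = label ^ (1 << j)
                if label < partner:  # handle each pair once
                    # Bit i+1 of the label selects the direction of this run.
                    ascending = ((label >> (i + 1)) & 1) == 0
                    if (a[label] > a[partner]) == ascending:
                        a[label], a[partner] = a[partner], a[label]
    return a
```

The two outer loops execute (1 + log n)(log n)/2 compare-exchange steps, and every pair of labels differs in exactly one bit, so all communication is between hypercube neighbors.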
Mapping Bitonic Sort to Meshes
• The connectivity of a mesh is lower than that of a hypercube, so we must expect some overhead in this mapping.
• Consider the row-major shuffled mapping of wires to processors.
Mapping Bitonic Sort to Meshes
Different ways of mapping the input wires of the bitonic sorting network to a mesh of processes: (a) row-major mapping, (b) row-major snakelike
mapping, and (c) row-major shuffled mapping.
Mapping Bitonic Sort to Meshes
The last stage of the bitonic sort algorithm for n = 16 on a mesh, using the row-major shuffled mapping. During each step, process pairs compare-exchange their elements. Arrows indicate the pairs of
processes that perform compare-exchange operations.
Mapping Bitonic Sort to Meshes
• In the row-major shuffled mapping, wires that differ at the ith least-significant bit are mapped onto mesh processes that are 2^⌊(i-1)/2⌋ communication links away.
• The total amount of communication performed by each process is $\sum_{i=1}^{\log n} \sum_{j=1}^{i} 2^{\lfloor (j-1)/2 \rfloor} \approx 7 \sqrt{n}$, or Θ(√n). The total computation performed by each process is Θ(log^2 n).
• The parallel runtime is:
$T_P = \Theta(\log^2 n) + \Theta(\sqrt{n})$
• This is not cost optimal.
Block of Elements Per Processor
• Each process is assigned a block of n/p elements.
• The first step is a local sort of the local block.
• Each subsequent compare-exchange operation is replaced by a compare-split operation.
• We can effectively view the bitonic network as having (1 + log p)(log p)/2 steps.
Block of Elements Per Processor: Hypercube
• Initially the processes sort their n/p elements (using merge sort) in time Θ((n/p) log(n/p)) and then perform Θ(log² p) compare-split steps.
• The parallel run time of this formulation is T_P = Θ((n/p) log(n/p)) + Θ((n/p) log² p) + Θ((n/p) log² p), where the terms are the local sort, the comparisons, and the communication.
• Compared to an optimal sort, the algorithm can efficiently use up to $p = \Theta(2^{\sqrt{\log n}})$ processes.
• The isoefficiency function due to both communication and extra work is $\Theta(p^{\log p} \log^2 p)$.
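Block of Elements Per Processor: Mesh
• The parallel runtime in this case is given by: T_P = Θ((n/p) log(n/p)) + Θ((n/p) log² p) + Θ(n/√p), where the terms are the local sort, the comparisons, and the communication.
• This formulation can efficiently use up to p = Θ(log² n) processes.
• The isoefficiency function is $\Theta(2^{\sqrt{p}} \sqrt{p})$.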
Performance of Parallel Bitonic Sort
The performance of parallel formulations of bitonic sort for n elements on p processes.
Bubble Sort and its Variants
The sequential bubble sort algorithm compares and exchanges adjacent elements in the sequence to be sorted:
Sequential bubble sort algorithm.
Bubble Sort and its Variants
• The complexity of bubble sort is Θ(n²).
• Bubble sort is difficult to parallelize since the algorithm has no concurrency.
• A simple variant, though, uncovers the concurrency.
Odd-Even Transposition
Sequential odd-even transposition sort algorithm.
Odd-Even Transposition
Sorting n = 8 elements, using the odd-even transposition sort algorithm. During each phase, n = 8 elements are compared.
Odd-Even Transposition
• After n phases of odd-even exchanges, the sequence is sorted.
• Each phase of the algorithm (either odd or even) requires Θ(n) comparisons.
• Serial complexity is Θ(n²).
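A minimal sketch of sequential odd-even transposition sort. It uses 0-based indices, so which phase is called "odd" or "even" is a labeling assumption; the function name is also hypothetical.

```python
def odd_even_transposition_sort(seq):
    """Sort by n alternating phases of adjacent compare-exchanges."""
    a = list(seq)
    n = len(a)
    for phase in range(n):
        # Alternate between pairs (0,1), (2,3), ... and pairs (1,2), (3,4), ...
        start = phase % 2
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


if __name__ == "__main__":
    print(odd_even_transposition_sort([3, 2, 3, 8, 5, 6, 4, 1]))
```

All compare-exchanges inside one phase touch disjoint pairs, which is the concurrency that plain bubble sort lacks.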
Parallel Odd-Even Transposition
• Consider the one item per processor case.
• There are n iterations; in each iteration, every processor performs one compare-exchange.
• The parallel run time of this formulation is Θ(n).
• This is cost optimal with respect to the base serial algorithm but not the optimal one.
Parallel Odd-Even Transposition
Parallel formulation of odd-even transposition.
Parallel Odd-Even Transposition
• Consider a block of n/p elements per processor.
• The first step is a local sort.
• In each subsequent step, the compare exchange operation is replaced by the compare split operation.
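• The parallel run time of the formulation is T_P = Θ((n/p) log(n/p)) + Θ(n) + Θ(n), where the terms are the local sort, the comparisons over p phases of Θ(n/p) each, and the communication.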
Parallel Odd-Even Transposition
• The parallel formulation is cost-optimal for p = O(log n).
• The isoefficiency function of this parallel formulation is $\Theta(p\,2^p)$.
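The block version can be sketched as a serial simulation: a local sort followed by p phases in which neighboring processes compare-split. The function name `block_odd_even_sort` is an assumption, as is the requirement that p divides the input length.

```python
import heapq


def block_odd_even_sort(data, p):
    """Simulate odd-even transposition with n/p elements per process."""
    if p < 1 or len(data) % p != 0:
        raise ValueError("p must be positive and divide len(data)")
    size = len(data) // p
    # Step 1: every process sorts its own block.
    blocks = [sorted(data[k * size:(k + 1) * size]) for k in range(p)]
    # Steps 2..p+1: p phases of compare-split between neighbors.
    for phase in range(p):
        for left in range(phase % 2, p - 1, 2):
            merged = list(heapq.merge(blocks[left], blocks[left + 1]))
            blocks[left], blocks[left + 1] = merged[:size], merged[size:]
    return [x for block in blocks for x in block]


if __name__ == "__main__":
    print(block_odd_even_sort([9, 4, 7, 1, 8, 2, 6, 3, 5, 0, 11, 10], 4))
```

Each of the p phases costs Θ(n/p) per process, which gives the Θ(n) comparison and communication terms.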
Shellsort
• Let n be the number of elements to be sorted and p be the number of processes.
• During the first phase, processes that are far away from each other in the array compare-split their elements.
• During the second phase, the algorithm switches to an odd-even transposition sort.
Parallel Shellsort
• Initially, each process sorts its block of n/p elements internally.
• Each process is now paired with its corresponding process in the reverse order of the array. That is, process P_i, where i < p/2, is paired with process P_(p-i-1).
• A compare-split operation is performed.
• The processes are split into two groups of size p/2 each and the process repeated in each group.
Parallel Shellsort
An example of the first phase of parallel shellsort on an eight-process array.
Parallel Shellsort
• Each process performs d = log p compare-split operations.
• With O(p) bisection width, each communication can be performed in time Θ(n/p) for a total time of Θ((n log p)/p).
• In the second phase, l odd and even phases are performed, each requiring time Θ(n/p).
• The parallel run time of the algorithm is: T_P = Θ((n/p) log(n/p)) + Θ((n/p) log p) + Θ(l n/p), where the terms are the local sort, the first phase, and the second phase.
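A serial simulation of both phases, as a sketch. It assumes p is a power of two that divides the input length; the names `parallel_shellsort` and `compare_split` are hypothetical. The second phase stops after two consecutive phases that change nothing, so the returned count includes those two confirming phases.

```python
import heapq


def compare_split(low_block, high_block):
    """Keep the smaller half in low_block and the larger half in high_block."""
    merged = list(heapq.merge(low_block, high_block))
    return merged[:len(low_block)], merged[len(low_block):]


def parallel_shellsort(data, p):
    if p < 1 or p & (p - 1) or len(data) % p != 0:
        raise ValueError("p must be a power of two that divides len(data)")
    size = len(data) // p
    blocks = [sorted(data[k * size:(k + 1) * size]) for k in range(p)]

    # Phase 1: pair each process with its mirror image inside its group,
    # then halve the groups; log p compare-split steps in total.
    group = p
    while group > 1:
        for start in range(0, p, group):
            for i in range(group // 2):
                a, b = start + i, start + group - 1 - i
                blocks[a], blocks[b] = compare_split(blocks[a], blocks[b])
        group //= 2

    # Phase 2: odd-even transposition until nothing changes.
    phases, quiet, parity = 0, 0, 0
    while quiet < 2:
        changed = False
        for left in range(parity, p - 1, 2):
            lo, hi = compare_split(blocks[left], blocks[left + 1])
            if lo != blocks[left]:
                changed = True
            blocks[left], blocks[left + 1] = lo, hi
        phases += 1
        quiet = 0 if changed else quiet + 1
        parity ^= 1
    return [x for block in blocks for x in block], phases


if __name__ == "__main__":
    print(parallel_shellsort([13, 2, 9, 16, 5, 11, 1, 8, 15, 4, 7, 12, 3, 14, 6, 10], 4))
```

The first phase moves elements across long distances early, so the second phase typically needs far fewer than p phases.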
Quicksort
• Quicksort is one of the most common sorting algorithms for sequential computers because of its simplicity, low overhead, and optimal average complexity.
• Quicksort selects one of the entries in the sequence to be the pivot and divides the sequence into two: one containing the elements less than the pivot and the other containing the elements greater than it.
• The process is recursively applied to each of the sublists.
Quicksort
The sequential quicksort algorithm.
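A minimal sketch of sequential quicksort, assuming the first element of each subsequence is chosen as the pivot. That pivot rule and the function name are assumptions; other pivot choices fit the same structure.

```python
def quicksort(a, lo=0, hi=None):
    """Sort the list a in place between indices lo and hi (inclusive)."""
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        pivot = a[lo]
        s = lo  # a[lo+1..s] holds the elements <= pivot seen so far
        for i in range(lo + 1, hi + 1):
            if a[i] <= pivot:
                s += 1
                a[s], a[i] = a[i], a[s]
        a[lo], a[s] = a[s], a[lo]   # put the pivot between the two parts
        quicksort(a, lo, s - 1)     # recurse on the smaller elements
        quicksort(a, s + 1, hi)     # recurse on the larger elements
    return a


if __name__ == "__main__":
    print(quicksort([3, 2, 1, 5, 8, 4, 3, 7]))
```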
Quicksort
Example of the quicksort algorithm sorting a sequence of size n = 8.
Quicksort
• The performance of quicksort depends critically on the quality of the pivot.
• In the best case, the pivot divides the list in such a way that the larger of the two lists does not have more than αn elements (for some constant α < 1).
• In this case, the complexity of quicksort is O(nlog n).
Parallelizing Quicksort
• Let's start with recursive decomposition: the list is partitioned serially and each of the subproblems is handled by a different processor.
• The time for this algorithm is lower-bounded by Ω(n), because the first partitioning step is performed serially over all n elements.
• Can we parallelize the partitioning step? In particular, if we can use n processors to partition a list of length n around a pivot in O(1) time, we have a winner.
• This is difficult to do on real machines, though.
Parallelizing Quicksort: PRAM Formulation
• We assume a CRCW (concurrent read, concurrent write) PRAM with concurrent writes resulting in an arbitrary write succeeding.
• The formulation works by creating pools of processors. Every processor is assigned to the same pool initially and has one element.
• Each processor attempts to write its element to a common location (for the pool).
• Each processor tries to read back the location. If the value read back is greater than the processor's value, it assigns itself to the 'left' pool, else, it assigns itself to the 'right' pool.
• Each pool performs this operation recursively.
• Note that the algorithm generates a tree of pivots. The depth of this tree determines the parallel runtime; its expected value is O(log n).
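The pool mechanism can be mimicked serially, as a rough sketch. A random choice stands in for the arbitrary CRCW write winner, processors with a value equal to the pivot are sent to the right pool, and the name `pram_quicksort` is hypothetical. Pools at the same tree depth would be processed concurrently on a PRAM; here they are visited one after another.

```python
import random


def pram_quicksort(values, seed=0):
    """Return the sorted values and the depth of the generated pivot tree."""
    rng = random.Random(seed)

    def build(pool):
        # pool is a list of processor ids; processor q holds values[q].
        if not pool:
            return None, 0
        winner = rng.choice(pool)          # the write that happens to succeed
        pivot = values[winner]
        left = [q for q in pool if q != winner and values[q] < pivot]
        right = [q for q in pool if q != winner and values[q] >= pivot]
        left_tree, left_depth = build(left)
        right_tree, right_depth = build(right)
        return (left_tree, winner, right_tree), 1 + max(left_depth, right_depth)

    tree, depth = build(list(range(len(values))))

    ordered = []

    def inorder(node):
        if node is not None:
            left_tree, winner, right_tree = node
            inorder(left_tree)
            ordered.append(values[winner])
            inorder(right_tree)

    inorder(tree)
    return ordered, depth


if __name__ == "__main__":
    print(pram_quicksort([33, 21, 13, 54, 82, 33, 40, 72]))
```

An in-order traversal of the pivot tree yields the sorted sequence, and the depth is the number of partitioning iterations.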
Parallelizing Quicksort: PRAM Formulation
A binary tree generated by the execution of the quicksort algorithm. Each level of the tree represents a different array-partitioning iteration. If
pivot selection is optimal, then the height of the tree is Θ(log n), which is also the number of iterations.
Parallelizing Quicksort: PRAM Formulation
The execution of the PRAM algorithm on the array shown in (a).
Parallelizing Quicksort: Shared Address Space Formulation
• Consider a list of size n equally divided across p processors.
• A pivot is selected by one of the processors and made known to all processors.
• Each processor partitions its list into two, say L_i and U_i, based on the selected pivot.
• All of the L_i lists are merged and all of the U_i lists are merged separately.
• The set of processors is partitioned into two (in proportion to the sizes of lists L and U). The process is recursively applied to each of the lists.
Shared Address Space Formulation
Parallelizing Quicksort: Shared Address Space Formulation
• The only thing we have not described is the global reorganization (merging) of local lists to form L and U.
• The problem is one of determining the right location for each element in the merged list.
• Each processor computes the number of elements locally less than and greater than pivot.
• It computes two sum-scans to determine the starting location for its elements in the merged L and U lists.
• Once it knows the starting locations, it can write its elements safely.
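A sketch of the global rearrangement using two exclusive prefix sums. The names `exclusive_scan` and `global_rearrange` are assumptions, as is sending elements equal to the pivot to the L side; a shared array is modeled by a Python list.

```python
from itertools import accumulate


def exclusive_scan(counts):
    """Prefix sums shifted by one: entry i is the sum of counts[0..i-1]."""
    return [0] + list(accumulate(counts))[:-1]


def global_rearrange(blocks, pivot):
    """Place all elements <= pivot first, then all larger elements.

    blocks[i] is the local list of processor i. Returns the rearranged
    shared array and the size of the merged L list.
    """
    lows = [[x for x in b if x <= pivot] for b in blocks]
    highs = [[x for x in b if x > pivot] for b in blocks]
    low_counts = [len(part) for part in lows]
    high_counts = [len(part) for part in highs]

    total_low = sum(low_counts)
    low_start = exclusive_scan(low_counts)                        # scan 1
    high_start = [total_low + s for s in exclusive_scan(high_counts)]  # scan 2

    shared = [None] * (total_low + sum(high_counts))
    for i in range(len(blocks)):
        # Each processor writes to a disjoint range, so no conflicts arise.
        shared[low_start[i]:low_start[i] + low_counts[i]] = lows[i]
        shared[high_start[i]:high_start[i] + high_counts[i]] = highs[i]
    return shared, total_low


if __name__ == "__main__":
    blocks = [[7, 13, 18, 2], [17, 1, 14, 20], [6, 10, 15, 9],
              [3, 16, 19, 4], [11, 12, 5, 8]]
    print(global_rearrange(blocks, 7))
```

The scans give every processor a private, non-overlapping destination range, which is why the writes are safe without locking.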
Parallelizing Quicksort: Shared Address Space Formulation
Efficient global rearrangement of the array.
Parallelizing Quicksort: Shared Address Space Formulation
• The parallel time depends on the split and merge time, and the quality of the pivot.
• The latter is an issue independent of parallelism, so we focus on the first aspect, assuming ideal pivot selection.
• The algorithm executes in four steps: (i) determine and broadcast the pivot; (ii) locally rearrange the array assigned to each process; (iii) determine the locations in the globally rearranged array that the local elements will go to; and (iv) perform the global rearrangement.
• The first step takes time Θ(log p), the second, Θ(n/p) , the third, Θ(log p) , and the fourth, Θ(n/p).
• The overall complexity of splitting an n-element array is Θ(n/p) + Θ(log p).
Parallelizing Quicksort: Shared Address Space Formulation
• The process recurses until there are p lists, at which point, the lists are sorted locally.
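• Therefore, the total parallel time is: T_P = Θ((n/p) log(n/p)) + Θ((n/p) log p) + Θ(log² p), where the terms are the local sort, the log p array splits of Θ(n/p) each, and the log p rounds of broadcast and scan of Θ(log p) each.
• The corresponding isoefficiency is Θ(p log² p) due to broadcast and scan operations.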
Parallelizing Quicksort: Message Passing Formulation
• A simple message passing formulation is based on the recursive halving of the machine.
• Assume that each processor in the lower half of a p processor ensemble is paired with a corresponding processor in the upper half.
• A designated processor selects and broadcasts the pivot.
• Each processor splits its local list into two lists, one less (Li), and other greater (Ui) than the pivot.
• A processor in the low half of the machine sends its list Ui to the paired processor in the other half. The paired processor sends its list Li.
• It is easy to see that after this step, all elements less than the pivot are in the low half of the machine and all elements greater than the pivot are in the high half.
Parallelizing Quicksort: Message Passing Formulation
• The above process is recursed until each processor has its own local list, which is sorted locally.
• The time for a single reorganization is Θ(log p) for broadcasting the pivot element, Θ(n/p) for splitting the locally assigned portion of the array, Θ(n/p) for exchange and local reorganization.
• We note that this time is identical to that of the corresponding shared address space formulation.
• It is important to remember that the reorganization of elements is a bandwidth sensitive operation.
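The recursive-halving scheme can be sketched as a serial simulation. It assumes p is a power of two, sends elements equal to the pivot to the low half, and uses the median of the first non-empty block in the group as the pivot; the pivot rule and the function name are assumptions, not something the slides specify.

```python
def message_passing_quicksort(blocks):
    """blocks[r] is the local list of processor r; returns the final lists."""
    p = len(blocks)
    if p < 1 or p & (p - 1):
        raise ValueError("the number of processors must be a power of two")
    blocks = [list(b) for b in blocks]

    def recurse(lo, hi):                      # processors lo .. hi-1
        if hi - lo == 1:
            blocks[lo].sort()                 # final local sort
            return
        source = next((blocks[r] for r in range(lo, hi) if blocks[r]), None)
        if source is None:
            return                            # nothing to sort in this group
        pivot = sorted(source)[len(source) // 2]   # "broadcast" to the group
        half = (hi - lo) // 2
        for r in range(lo, lo + half):
            partner = r + half
            low_r = [x for x in blocks[r] if x <= pivot]
            up_r = [x for x in blocks[r] if x > pivot]
            low_p = [x for x in blocks[partner] if x <= pivot]
            up_p = [x for x in blocks[partner] if x > pivot]
            blocks[r] = low_r + low_p         # r sends U_r, receives L_partner
            blocks[partner] = up_r + up_p
        recurse(lo, lo + half)
        recurse(lo + half, hi)

    recurse(0, p)
    return blocks


if __name__ == "__main__":
    print(message_passing_quicksort([[7, 13, 18, 2], [17, 1, 14, 20],
                                     [6, 10, 15, 9], [3, 16, 19, 4]]))
```

The final lists may differ in length: a poor pivot leaves the processors unevenly loaded, which is why pivot quality matters in this formulation.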
Graph Algorithms
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
To accompany the text "Introduction to Parallel Computing", Addison Wesley, 2003
Topic Overview
• Definitions and Representation
• Minimum Spanning Tree: Prim's Algorithm
• Single-Source Shortest Paths: Dijkstra's Algorithm
• All-Pairs Shortest Paths
• Transitive Closure
• Connected Components
• Algorithms for Sparse Graphs
Definitions and Representation
• An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite set of edges.
• An edge e ∈ E is an unordered pair (u,v), where u,v ∈ V.
• In a directed graph, the edge e is an ordered pair (u,v). An edge (u,v) is incident from vertex u and is incident to vertex v.
• A path from a vertex v to a vertex u is a sequence <v0,v1,v2,…,vk> of vertices where v0 = v, vk = u, and (vi, vi+1) ∈ E for i = 0, 1,…, k-1.
• The length of a path is defined as the number of edges in the path.
Definitions and Representation
(a) An undirected graph and (b) a directed graph.
Definitions and Representation
• An undirected graph is connected if every pair of vertices is connected by a path.
• A forest is an acyclic graph, and a tree is a connected acyclic graph.
• A graph that has weights associated with each edge is called a weighted graph.
Definitions and Representation
• Graphs can be represented by their adjacency matrix or an edge (or vertex) list.
• Adjacency matrices have a value ai,j = 1 if nodes i and j share an edge; 0 otherwise. In case of a weighted graph, ai,j = wi,j, the weight of the edge.
• The adjacency list representation of a graph G = (V,E) consists of an array Adj[1..|V|] of lists. Each list Adj[v] is a list of all vertices adjacent to v.
• For a graph with n nodes, adjacency matrices take Θ(n²) space and adjacency lists take Θ(n + |E|) space (Θ(|E|) when the graph has at least as many edges as nodes).
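A small sketch of both representations for an undirected, unweighted graph. It uses 0-based vertex labels rather than the 1-based Adj[1..|V|] convention above, and the function names are assumptions.

```python
def adjacency_matrix(n, edges):
    """n x n matrix with a[i][j] = 1 if vertices i and j share an edge."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = 1
        a[v][u] = 1          # undirected: the matrix is symmetric
    return a


def adjacency_list(n, edges):
    """adj[v] lists every vertex adjacent to v."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
    print(adjacency_matrix(4, edges))
    print(adjacency_list(4, edges))
```

The matrix always stores n² entries, while the list stores one entry per vertex plus two per undirected edge, which is why lists are preferred for sparse graphs.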
Definitions and Representation
An undirected graph and its adjacency matrix representation.
An undirected graph and its adjacency list representation.
Minimum Spanning Tree
• A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all the vertices of G.
• In a weighted graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph.
• A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight.
Minimum Spanning Tree
An undirected graph and its minimum spanning tree.
Minimum Spanning Tree: Prim's Algorithm
• Prim's algorithm for finding an MST is a greedy algorithm.
• Start by selecting an arbitrary vertex, include it into the current MST.
• Grow the current MST by inserting into it the vertex closest to one of the vertices already in current MST.
Minimum Spanning Tree: Prim's Algorithm
Prim's minimum spanning tree algorithm.
Minimum Spanning Tree: Prim's Algorithm
Prim's sequential minimum spanning tree algorithm.
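A minimal sketch of sequential Prim's algorithm on a weighted adjacency matrix, running in Θ(n²) time. Using `math.inf` for a missing edge, the function name, and the returned (weight, edge list) pair are assumptions for illustration.

```python
import math


def prim_mst(weights, root=0):
    """weights[u][v] is the edge weight, or math.inf if there is no edge."""
    n = len(weights)
    in_tree = [False] * n
    in_tree[root] = True
    d = list(weights[root])      # d[v]: cheapest known edge from the tree to v
    parent = [root] * n
    total, edges = 0, []
    for _ in range(n - 1):
        # Select the vertex outside the tree that is closest to the tree.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        if d[u] == math.inf:
            raise ValueError("the graph is not connected")
        in_tree[u] = True
        total += d[u]
        edges.append((parent[u], u))
        # Update the distance vector with the edges of the new vertex.
        for v in range(n):
            if not in_tree[v] and weights[u][v] < d[v]:
                d[v] = weights[u][v]
                parent[v] = u
    return total, edges


if __name__ == "__main__":
    inf = math.inf
    w = [[inf, 1, 4, inf],
         [1, inf, 2, 5],
         [4, 2, inf, 3],
         [inf, 5, 3, inf]]
    print(prim_mst(w))
```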
Prim's Algorithm: Parallel Formulation
• The algorithm works in n outer iterations - it is hard to execute these iterations concurrently.
• The inner loop is relatively easy to parallelize. Let p be the number of processes, and let n be the number of vertices.
• The adjacency matrix is partitioned in a 1-D block fashion, with distance vector d partitioned accordingly.
• In each step, a processor selects the locally closest node, followed by a global reduction to select globally closest node.
• This node is inserted into MST, and the choice broadcast to all processors.
• Each processor updates its part of the d vector locally.
Prim's Algorithm: Parallel Formulation
The partitioning of the distance array d and the adjacency matrix A among p processes.
Prim's Algorithm: Parallel Formulation
• The cost to select the minimum entry is O(n/p + log p).
• The cost of a broadcast is O(log p).
• The cost of locally updating the d vector is O(n/p).
• The parallel time per iteration is O(n/p + log p).
• The total parallel time is given by O(n²/p + n log p).
• The corresponding isoefficiency is O(p² log² p).
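The per-iteration structure (local selection, global reduction, broadcast, local update) can be sketched as a serial simulation of the 1-D block partitioning. The function name `parallel_prim_weight`, the use of `math.inf` for missing edges, and the block boundaries k·n/p are assumptions; the inner loops over k would run concurrently on p processes.

```python
import math


def parallel_prim_weight(weights, p, root=0):
    """Return the MST weight; process k owns vertices bounds[k]..bounds[k+1]-1."""
    n = len(weights)
    bounds = [k * n // p for k in range(p + 1)]
    in_tree = [False] * n
    in_tree[root] = True
    d = list(weights[root])
    total = 0
    for _ in range(n - 1):
        # Each process finds the closest vertex in its own block: O(n/p).
        local_minima = []
        for k in range(p):
            candidates = [(d[v], v) for v in range(bounds[k], bounds[k + 1])
                          if not in_tree[v]]
            if candidates:
                local_minima.append(min(candidates))
        # Global min-reduction, then broadcast of the winner: O(log p) each.
        dist, u = min(local_minima)
        if dist == math.inf:
            raise ValueError("the graph is not connected")
        in_tree[u] = True
        total += dist
        # Each process updates its part of d from its columns of A: O(n/p).
        for k in range(p):
            for v in range(bounds[k], bounds[k + 1]):
                if not in_tree[v] and weights[u][v] < d[v]:
                    d[v] = weights[u][v]
    return total


if __name__ == "__main__":
    inf = math.inf
    w = [[inf, 1, 4, inf],
         [1, inf, 2, 5],
         [4, 2, inf, 3],
         [inf, 5, 3, inf]]
    print(parallel_prim_weight(w, p=2))
```

Summing the O(n/p + log p) cost of one iteration over the n outer iterations gives the O(n²/p + n log p) total.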