Design of parallel algorithms
Design of parallel algorithms
Matrix operations
J. Porras
Matrix x vector
• Sequential approach MAT_VECT(A,x,y)
for (i = 0; i < n; i++) {
  y[i] = 0;
  for (j = 0; j < n; j++) {
    y[i] = y[i] + A[i][j] * x[j];
  }
}
• Work = Θ(n²)
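As a concrete check, the same loop in runnable form (Python here, since the slides use C-style pseudocode; the name `mat_vect` mirrors the slide):

```python
def mat_vect(A, x):
    """y = A x with the same double loop as MAT_VECT: Theta(n^2) work."""
    n = len(A)
    y = [0.0] * n
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j]
    return y
```

For A = [[1, 2], [3, 4]] and x = [1, 1] this yields [3.0, 7.0].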
Parallelization of matrix operations: Matrix x vector
• Three ways to implement
– rowwise striping
– columnwise striping
– checkerboarding
• DRAW each of these approaches!
Rowwise striping
• The n × n matrix is distributed among n processors (one row each)
• The n × 1 vector is distributed among n processors (one element each)
• All processors need the whole vector, so an all-to-all broadcast is required
Rowwise striping
• All-to-all broadcast requires Θ(n) time
• One row takes Θ(n) time for multiplications
• Rows are calculated in parallel, thus the total time is Θ(n) and the work Θ(n²)
– Algorithm is cost-optimal
Block striping
• Assume that p < n and the matrix is partitioned using block striping
• Each processor contains n/p rows of the matrix and n/p elements of the vector
• All processors require the whole vector, thus an all-to-all broadcast is required (message size n/p)
Block striping in hypercube
• All-to-all broadcast in a hypercube with n/p-sized messages takes
ts log p + tw (n/p)(p − 1)
• If p is considered large enough, this is approximately
ts log p + tw n
• Multiplication requires n²/p time (n/p rows to multiply with the vector)
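The computation phase can be pictured with a sequential stand-in: after the all-to-all broadcast every "processor" holds the whole vector and multiplies only its own n/p rows, which is the n²/p term above. A sketch (assumes p divides n; `striped_mat_vect` is an illustrative name, not from the slides):

```python
def striped_mat_vect(A, x, p):
    """Block-striped matrix-vector product, simulated sequentially."""
    n = len(A)
    rows_per = n // p
    y = [0.0] * n
    for proc in range(p):  # the p processors would run this loop in parallel
        # after the all-to-all broadcast, x is available in full everywhere
        for i in range(proc * rows_per, (proc + 1) * rows_per):
            y[i] = sum(A[i][j] * x[j] for j in range(n))  # n^2/p work each
    return y
```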
Block striping in hypercube
• Parallel execution time TP = n²/p + ts log p + tw n
• Cost pTP = n² + ts p log p + tw n p
• Algorithm is cost-optimal if p = O(n)
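Written out, the condition compares the cost with the sequential work W = Θ(n²):

```latex
\begin{aligned}
pT_P &= n^2 + t_s\,p\log p + t_w\,np,\\
pT_P &= O(n^2) \;\Longleftrightarrow\; t_w\,np = O(n^2) \;\Longleftrightarrow\; p = O(n).
\end{aligned}
```

For p = O(n) the ts term is O(n log n) = O(n²) as well, so the tw term is the binding one.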
Block striping in mesh
• All-to-all broadcast in a mesh with wraparound takes 2ts(√p − 1) + tw (n/p)(p − 1)
• Parallel execution requires TP = n²/p + 2ts(√p − 1) + tw n
Scalability of block striping
• Overhead (T0 = pTP − W):
T0 = ts p log p + tw n p
• Isoefficiency (W = K T0) for the hypercube:
W = K ts p log p
W = K tw n p
• Since W = n², the second relation gives n = K tw p, so
W = K² tw² p²
Scalability of block striping
• Because p = O(n): n = Ω(p), n² = Ω(p²), and W = Ω(p²)
• The equation gives the highest asymptotic rate at which the problem size must increase with the number of processors to maintain fixed efficiency
Scalability of block striping
• Isoefficiency in the hypercube is Θ(p²)
• A similar analysis can be done for the mesh architecture, giving the same value Θ(p²)
• Thus with striped partitioning the algorithm is no more scalable on a hypercube than on a mesh
Checkerboard
• The n × n matrix is partitioned among n² processors (one element per processor)
• The n × 1 vector is located in the last column (or on the diagonal)
• The vector is distributed to the corresponding processors
• Multiplications are calculated in parallel and the results are collected with single-node accumulation into the last column of processors
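The fine-grain scheme can be mimicked sequentially: each of the n² grid positions forms exactly one product, and the partial results of each row are then summed, standing in for the single-node accumulation (illustrative sketch only):

```python
def checkerboard_mat_vect(A, x):
    n = len(A)
    # after distribution, processor (i, j) holds x[j] and forms one product
    partial = [[A[i][j] * x[j] for j in range(n)] for i in range(n)]
    # single-node accumulation along each row collects y[i]
    return [sum(partial[i]) for i in range(n)]
```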
Checkerboard
• Three communication steps are required
– One-to-one communication to send the vector elements to the diagonal
– One-to-all broadcast to distribute the elements of the vector along the columns
– Single-node accumulation to sum the partial results
Checkerboard
• The mesh requires Θ(n) time for all the operations (SF routing), the hypercube Θ(log n)
• Multiplication happens in constant time
• Parallel execution time is Θ(n) in the mesh and Θ(log n) in the hypercube architecture
• Cost is Θ(n³) for the mesh and Θ(n² log n) for the hypercube
• The algorithms are not cost-optimal
Checkerboard p < n²
• Cost-optimality can be achieved if the granularity is increased
• Consider a two-dimensional mesh of p processors in which each processor stores an (n/√p) × (n/√p) block of the matrix
• Similarly, each processor stores n/√p elements of the vector
Checkerboard p < n²
• Vector elements are sent to the diagonal
• Vector elements are distributed to the other processors
• Each processor performs n²/p multiplications to form its n/√p partial sums
• Partial sums are collected with single-node accumulation
Scalability of checkerboard p < n²
• Assume that the p processors are connected in a two-dimensional √p × √p cut-through routing mesh (no wraparounds)
• Sending to the diagonal takes
ts + tw n/√p + th √p
• One-to-all broadcast in the columns takes
(ts + tw n/√p) log √p + th √p
Scalability of checkerboard p < n²
• Single-node accumulation takes
(ts + tw n/√p) log √p + th √p
• Multiplication in each processor takes n²/p time
• Thus
TP = n²/p + ts log p + (tw n/√p) log p + 3th √p
• T0 = pTP − W gives for the overhead:
T0 = ts p log p + tw n √p log p + 3th p^(3/2)
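Term by term, the overhead expression follows from TP:

```latex
\begin{aligned}
T_0 &= pT_P - W
     = p\left(\frac{n^2}{p} + t_s\log p + t_w\frac{n}{\sqrt p}\log p + 3t_h\sqrt p\right) - n^2\\
    &= t_s\,p\log p + t_w\,n\sqrt{p}\,\log p + 3t_h\,p^{3/2}.
\end{aligned}
```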
Scalability of checkerboard p < n²
• Isoefficiency for the ts term:
W = K ts p log p
• Isoefficiency for the tw term:
W = n² = K tw n √p log p
n = K tw √p log p
n² = K² tw² p log² p
W = K² tw² p log² p
• Isoefficiency for the th term:
W = 3K th p^(3/2)
Scalability of checkerboard p < n²
• If p = O(n²): n² = Ω(p) and W = Ω(p)
• The tw and th terms dominate the ts term
Scalability of checkerboard p < n²
• Concentrate on the th term Θ(p^(3/2)) and the tw term Θ(p log² p)
• Because p^(3/2) > p log² p only for p > 65536, either term may dominate
• Assume that the Θ(p log² p) term dominates
Scalability of checkerboard p < n²
• The maximum number of processors that can be used cost-optimally for problem size W is determined by
p log² p = O(n²)
log p + 2 log log p = O(log n)
log p = O(log n)
Scalability of checkerboard p < n²
• Substituting log n for log p:
p log² n = O(n²)
p = O(n² / log² n)
• This gives the upper limit on the number of processors that can be used cost-optimally
SF and CT
• Parallel execution takes n²/p + 2ts √p + 3tw n time on a p-processor mesh with SF routing (isoefficiency Θ(p²) due to tw)
• CT routing performs much better
• Note that this is true for cases with several elements per processor
• HOW about the fine-grain case?
Striped and checkerboard
• Comparison shows that the checkerboard approach is faster than the striped approach with the same number of processors
• If p > n, the striped approach is not available
• How about the effect of the architecture?
• Scalability?
• Isoefficiency?
Sequential matrix multiplication
• Procedure MAT_MULT(A, B, C)
for i := 0 to n-1 do
  for j := 0 to n-1 do
    C[i,j] := 0;
    for k := 0 to n-1 do
      C[i,j] := C[i,j] + A[i,k] · B[k,j]
• Θ(n³) work (Strassen's algorithm has better complexity)
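As a runnable counterpart of the pseudocode (Python for illustration):

```python
def mat_mult(A, B):
    """C = A B with three nested loops: Theta(n^3) work."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```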
Block approach
• Partition into (n/q) × (n/q) submatrices
• Procedure BLOCK_MAT_MULT(A, B, C)
for i := 0 to q-1 do
  for j := 0 to q-1 do
    Initialize block C[i,j] to zero
    for k := 0 to q-1 do
      C[i,j] := C[i,j] + A[i,k] · B[k,j]   (block multiply-add)
• Same complexity Θ(n³)
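The block formulation performs the same Θ(n³) scalar operations, just grouped by (n/q) × (n/q) blocks; a sketch (assumes q divides n; `block_mat_mult` is an illustrative name):

```python
def block_mat_mult(A, B, q):
    """Blocked C = A B: outer loops over a q x q grid of blocks."""
    n = len(A)
    b = n // q  # block size n/q
    C = [[0.0] * n for _ in range(n)]
    for bi in range(q):
        for bj in range(q):
            for bk in range(q):  # C[bi,bj] += A[bi,bk] * B[bk,bj] blockwise
                for i in range(bi * b, (bi + 1) * b):
                    for j in range(bj * b, (bj + 1) * b):
                        for k in range(bk * b, (bk + 1) * b):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```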
Simple parallel approach
• Matrices A and B are partitioned into p blocks of size (n/√p) × (n/√p)
• Map onto a √p × √p mesh
• Processors P0,0 ... P√p−1,√p−1
• Pi,j stores Ai,j and Bi,j and computes Ci,j
• Ci,j requires all Ai,k and Bk,j
• A blocks need to be communicated within rows
• B blocks need to be communicated within columns
Performance on hypercube
• Requires 2 all-to-all broadcasts (in rows and in columns)
• Message size n²/p
• tc = 2(ts log √p + tw (n²/p)(√p − 1))
• tm = √p (n/√p)³ = n³/p
• TP ≈ n³/p + ts log p + 2tw n²/√p, when p » 1
Performance on mesh
• Store-and-forward routing
• tc = 2(ts √p + tw n²/√p)
• tm = √p (n/√p)³ = n³/p
• TP = n³/p + 2ts √p + 2tw n²/√p
Cannon's algorithm
• Partition into blocks as usual
• Processors P0,0 ... P√p−1,√p−1
• Pi,j contains Ai,j and Bi,j
• Rotate the blocks!
• A blocks to the left
• B blocks upwards
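With one element per processor (i.e. blocks of size 1), Cannon's alignment-and-rotation schedule can be simulated sequentially; this sketch only illustrates the data movement, it is not a message-passing implementation:

```python
def cannon(A, B):
    """Cannon's algorithm with 1x1 blocks, simulated on one machine."""
    n = len(A)
    # initial alignment: shift row i of A left by i, column j of B up by j
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0.0] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]  # local multiply-accumulate
        # rotate A blocks one step left, B blocks one step up (wraparound)
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

After the alignment, each of the n rounds multiplies co-resident operands, so every position sees exactly the n pairs it needs.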
Fox’s algorithm
• Partition into blocks as usual
• Pi,j contains Ai,j and Bi,j
• Uses one-to-all broadcasts; √p iterations
• (1) broadcast the selected A block to its row
• (2) multiply by the local B block
• (3) send B upwards
• (4) select Ai,(j+1) mod √p
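The four steps, again simulated with one element per processor (illustrative only):

```python
def fox(A, B):
    """Fox's algorithm with 1x1 blocks: in round t, row i broadcasts
    A[i][(i+t) % n], multiplies by the resident B, then B shifts up."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    b = [row[:] for row in B]
    for t in range(n):
        for i in range(n):
            k = (i + t) % n  # selected A block for this row and round
            for j in range(n):
                C[i][j] += A[i][k] * b[i][j]
        # send B blocks upwards (with wraparound)
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```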
DNS
• Dekel, Nassimi and Sahni
• n³ processors available
• Uses a 3D structure
• Pi,j,k computes A[i,k] × B[k,j]
• C[i,j] = sum of the partial products in Pi,j,0 ... Pi,j,n−1
• Θ(log n) time
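A sequential sketch of the idea: every (i, j, k) position forms one product, and each C[i,j] is reduced over k with a binary tree, which is where the Θ(log n) comes from (illustrative only):

```python
def dns(A, B):
    """DNS scheme with n^3 'processors', simulated on one machine."""
    n = len(A)
    # each processor (i, j, k) forms its single product, all 'in parallel'
    partial = [[[A[i][k] * B[k][j] for k in range(n)]
                for j in range(n)] for i in range(n)]
    # single-node accumulation over k as a binary tree: Theta(log n) steps
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            vals = partial[i][j]
            while len(vals) > 1:  # one parallel reduction step
                vals = [vals[t] + (vals[t + 1] if t + 1 < len(vals) else 0)
                        for t in range(0, len(vals), 2)]
            C[i][j] = vals[0]
    return C
```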
DNS for hypercube
• The 3D structure is mapped onto a hypercube with n³ = 2^(3d) processors
• Processor Pi,j,0 contains A[i,j] and B[i,j]
• 3 steps:
• (1) move A and B to the correct plane
• (2) replicate on each plane
• (3) single-node accumulation
DNS with p < n³ processors
• Processors p = q³, q < n
• Partition the matrices into (n/q) × (n/q) blocks
• The matrices are treated as q × q arrays of blocks
• Since 1 ≤ q ≤ n, p ranges from 1 to n³