Lecture 6
• Objectives
• Communication Complexity Analysis
• Collective Operations
– Reduction
– Binomial Trees
– Gather and Scatter Operations
• Review Communication Analysis of Floyd’s Algorithm
Parallel Reduction Evolution
Binomial Trees
• A binomial tree is a subgraph of a hypercube
Finding Global Sum
(figure: a parallel sum reduction; the number of partial sums halves at each step)
Step 0 (16 values): 4 2 0 7 | -3 5 -6 -3 | 8 1 2 3 | -4 4 6 -1
Step 1 (8 values): 1 7 -6 4 | 4 5 8 2
Step 2 (4 values): 8 -2 | 9 10
Step 3 (2 values): 17 8
Step 4 (1 value): 25
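The halving pattern in the figure can be sketched serially. This is a sketch only, not MPI code (a real program would use MPI_Reduce or point-to-point sends); it simulates the combining steps on one array:

```c
/* Serial sketch of the reduction in the figure above: p values are
 * combined in log2(p) steps; in each step the upper half of the
 * remaining partial sums is folded onto the lower half, which is the
 * same pairing a binomial-tree reduction performs with messages. */
static int binomial_tree_sum(int *vals, int p) {
    for (int half = p / 2; half >= 1; half /= 2)   /* log2(p) steps */
        for (int i = 0; i < half; i++)
            vals[i] += vals[i + half];             /* partner i+half -> i */
    return vals[0];
}
```

With the 16 values from the figure, four steps produce the global sum 25.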
Binomial Tree
Agglomeration
(figure: tasks agglomerated so that each process computes a local sum before the binomial-tree combination)
Gather
All-gather
Complete Graph for All-gather
Hypercube for All-gather
Analysis of Communication
• λ (lambda) is the latency
– the message delay, i.e., the overhead to send one message
• β (beta) is the bandwidth
– the number of data items (or bytes) that can be transferred per unit time
• Sending a message with n data items therefore costs $\lambda + n/\beta$
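As a small sanity check of this cost model (the parameter values here are the ones used later in the lecture):

```c
/* Time to send one message carrying n data items under the model above:
 * latency lambda plus n/beta transmission time. Keeping units consistent
 * is the caller's responsibility (e.g. beta in items per second). */
static double message_time(double lambda, double beta, double n) {
    return lambda + n / beta;
}
```

With λ = 250 μs and β = 10⁷ items/s, a 1000-item message costs 250 μs + 100 μs = 350 μs.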
Communication Time for All-Gather
• Hypercube: $\sum_{i=1}^{\log p} \left( \lambda + \frac{2^{i-1} n}{\beta p} \right) = \lambda \log p + \frac{(p-1)n}{\beta p}$
• Complete graph: $(p-1)\left( \lambda + \frac{n}{\beta p} \right) = (p-1)\lambda + \frac{(p-1)n}{\beta p}$
Adding Data Input
Scatter
Scatter in log p Steps
(figure: the root's eight items 12345678 are halved at each step: after step 1 two processes hold 1234 and 5678; after step 2 four processes hold 12, 34, 56, 78)
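A serial sketch of those halving steps (a hypothetical helper, not the MPI_Scatter implementation):

```c
#include <string.h>

#define NPROCS 8

/* Simulates the figure: the root (process 0) starts with all NPROCS
 * blocks; in each of the log2(NPROCS) steps, every process holding data
 * ships the upper half of its blocks to a partner `stride` away.
 * Returns 1 if every process ends up holding exactly its own block. */
static int scatter_demo(void) {
    int blocks[NPROCS][NPROCS] = {{0}};
    int count[NPROCS] = {0};
    for (int i = 0; i < NPROCS; i++)
        blocks[0][i] = i + 1;              /* root holds items 1..8 */
    count[0] = NPROCS;

    for (int stride = NPROCS / 2; stride >= 1; stride /= 2)
        for (int src = 0; src < NPROCS; src += 2 * stride) {
            int half = count[src] / 2;     /* upper half moves away */
            memcpy(blocks[src + stride], blocks[src] + half,
                   (size_t)half * sizeof(int));
            count[src + stride] = half;
            count[src] = half;
        }

    for (int i = 0; i < NPROCS; i++)
        if (count[i] != 1 || blocks[i][0] != i + 1)
            return 0;
    return 1;
}
```

After log₂ 8 = 3 steps, process i holds item i+1, matching the figure.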
Communication Time for Scatter
• Hypercube: $\sum_{i=1}^{\log p} \left( \lambda + \frac{2^{i-1} n}{\beta p} \right) = \lambda \log p + \frac{(p-1)n}{\beta p}$
• Complete graph: $(p-1)\left( \lambda + \frac{n}{\beta p} \right) = (p-1)\lambda + \frac{(p-1)n}{\beta p}$
Recall Parallel Floyd’s Computational Complexity
• Innermost loop has complexity $\Theta(n)$
• Middle loop executed at most ⌈n/p⌉ times
• Outer loop executed n times
• Overall computational complexity $\Theta(n^3/p)$
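Those three loops are just Floyd's relaxation step; a minimal sequential version for reference (the parallel version splits the middle loop's rows across processes):

```c
#define N 4   /* small example size; the lecture's experiments use n = 1000 */

/* Floyd's all-pairs shortest paths: outer loop over intermediate
 * vertices k, middle loop over rows i, inner loop over columns j.
 * With the n rows of the middle loop divided among p processes, each
 * process does Theta(n^3 / p) work. */
static void floyd(int a[N][N]) {
    for (int k = 0; k < N; k++)             /* outer: n iterations        */
        for (int i = 0; i < N; i++)         /* middle: ~n/p rows each     */
            for (int j = 0; j < N; j++)     /* inner: Theta(n)            */
                if (a[i][k] + a[k][j] < a[i][j])
                    a[i][j] = a[i][k] + a[k][j];
}
```

(Here 100 stands in for "no edge" in the test graph; a real implementation would guard against overflow on an infinity sentinel.)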
Floyd’s Communication Complexity
• No communication in inner loop
• No communication in middle loop
• Broadcast in outer loop: complexity $\Theta(n \log p)$
• Executed n times
Execution Time Expression (1)

$n \lceil n/p \rceil n \chi + n \lceil \log p \rceil (\lambda + 4n/\beta)$

• First term: iterations of outer loop (n) × iterations of middle loop (⌈n/p⌉) × iterations of inner loop (n) × cell update time (χ)
• Second term: iterations of outer loop (n) × messages per broadcast (⌈log p⌉) × time per message (λ + 4n/β, at 4 bytes per matrix element)
Accounting for Computation/communication Overlap
Note that after the 1st broadcast all the wait times overlap the computation time of Process 0.
Execution Time Expression (2)

$n \lceil n/p \rceil n \chi + \lambda n \lceil \log p \rceil + 4n/\beta$

• First term: iterations of outer loop (n) × iterations of middle loop (⌈n/p⌉) × iterations of inner loop (n) × cell update time (χ)
• Second term: iterations of outer loop (n) × messages per broadcast (⌈log p⌉) × message-passing latency (λ)
• Third term: message transmission (4n/β) for the final message, the only one not overlapped with computation
Predicted vs. Actual Performance

Processes   Predicted (sec)   Actual (sec)
1           25.54             25.54
2           13.02             13.89
3            9.01              9.60
4            6.89              7.29
5            5.86              5.99
6            5.01              5.16
7            4.40              4.50
8            3.94              3.98

Model parameters: χ = 25.5 nsec, λ = 250 μsec, β = 10 MB/sec, n = 1000
Summary
• Two matrix decompositions
– Rowwise block striped
– Columnwise block striped
• Blocking send/receive functions
– MPI_Send
– MPI_Recv
• Overlapping communications with computations