1 Parallelizing Compiler Technology Dr. Stephen Tse [email protected] Lesson 7.
1
Parallelizing Compiler Technology
Dr. Stephen Tse
Lesson 7
2
Vectorizing by Properly Restructuring the Innermost Loop
• The presence of a cycle in the data dependence graph usually prevents vectorization, but if one of the edges forming the cycle is a counter-dependence (anti-dependence) or an output dependence, the compiler may still vectorize by inserting a substitution statement that copies the conflicting values into a temporary array.
(a) Original loop:
    do I=1,N
      C(I)=A(I)+B(I)
      D(I)=C(I)+C(I+1)
    end do

(b) With a temporary array:
    do I=1,N
      TEMP(I)=C(I+1)
      C(I)=A(I)+B(I)
      D(I)=C(I)+TEMP(I)
    end do

(c) Vectorized:
    TEMP(1:N)=C(2:N+1)
    C(1:N)=A(1:N)+B(1:N)
    D(1:N)=C(1:N)+TEMP(1:N)
(a) This do loop has a cycle consisting of a flow dependence and a counter-dependence.
(b) The loop can be rewritten by introducing the temporary array TEMP(I).
(c) Finally, it can be vectorized.
Node splitting (nodal division) restructuring method
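The transformation above can be sanity-checked with a small sketch. The following plain Python is an illustrative stand-in for the slide's Fortran, with made-up data; it confirms that the original loop (a) and the node-split versions (b)/(c) compute the same C and D:

```python
N = 5
A = [1, 2, 3, 4, 5]
B = [10, 20, 30, 40, 50]
C0 = [100, 101, 102, 103, 104, 105]  # C(1..N+1), 0-indexed here

# (a) Original loop: C(I) is written while C(I+1) is read, so a flow
# dependence and a counter-dependence form a cycle.
C = C0[:]
D = [0] * N
for i in range(N):
    C[i] = A[i] + B[i]
    D[i] = C[i] + C[i + 1]

# (b)/(c) Node splitting: copy the old C(I+1) values into TEMP first;
# every statement can then run as a whole-array operation.
C2 = C0[:]
TEMP = [C2[i + 1] for i in range(N)]        # TEMP(1:N) = C(2:N+1)
for i in range(N):                          # C(1:N) = A(1:N)+B(1:N)
    C2[i] = A[i] + B[i]
D2 = [C2[i] + TEMP[i] for i in range(N)]    # D(1:N) = C(1:N)+TEMP(1:N)

assert C == C2 and D == D2
```

The key point is that TEMP captures the values of C(2:N+1) before the loop overwrites them, which removes the counter-dependence edge from the cycle.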
3
Scalar Expansion Method
• These two methods (node splitting and scalar expansion) vectorize by properly restructuring the innermost loop. The loops cannot be vectorized as written because, according to the data dependence analysis, they contain counter-dependences or output dependences. The compiler may perform the vectorization by inserting substitution statements that use a temporary array.
(a) Original loop:
    do I=1,N
      T=A(I)*B(I)
      C(I)=T+B(I+1)
    end do

(b) With the scalar expanded:
    do I=1,N
      TEMP(I)=A(I)*B(I)
      C(I)=TEMP(I)+B(I+1)
    end do
    T=TEMP(N)

(c) Vectorized:
    TEMP(1:N)=A(1:N)*B(1:N)
    C(1:N)=TEMP(1:N)+B(2:N+1)
    T=TEMP(N)
(a) This method cancels out the dependences on the scalar variable T in the loop.
(b) T is expanded into the temporary array TEMP(I); after the loop, T is restored from TEMP(N).
(c) Finally, the do loop can be vectorized.
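Scalar expansion can be checked the same way. The following plain Python sketch (standing in for the slide's Fortran, with made-up data) shows that the expanded version computes the same C, and that copying TEMP(N) back preserves the scalar's final value:

```python
N = 5
A = [1, 2, 3, 4, 5]
B = [10, 20, 30, 40, 50, 60]  # B(1..N+1)

# (a) Original loop: the scalar T is written and read every iteration,
# creating dependences that block vectorization.
C = [0] * N
for i in range(N):
    T = A[i] * B[i]
    C[i] = T + B[i + 1]

# (b)/(c) Scalar expansion: T becomes the temporary array TEMP, each
# statement runs over the whole index range, and T keeps its final
# value by copying TEMP(N) back afterwards.
TEMP = [A[i] * B[i] for i in range(N)]       # TEMP(1:N) = A(1:N)*B(1:N)
C2 = [TEMP[i] + B[i + 1] for i in range(N)]  # C(1:N) = TEMP(1:N)+B(2:N+1)
T2 = TEMP[N - 1]                              # T = TEMP(N)

assert C == C2 and T == T2
```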
4
Other Restructuring Method
• Many other restructuring methods have been proposed for better processing efficiency:
  – increasing the vector length
  – setting plural pipelines in parallel operation
  – making effective use of the vector registers
5
A Loop Interchange Method
• When the vector length is short, loop interchange can increase the vector length and obtain better processing efficiency.
(a) Original order:
    do I=1,100
      do J=2,20
        B(I,J)=A(I,J)+B(I,J-1)
      end do
    end do

(b) After interchange:
    do J=2,20
      do I=1,100
        B(I,J)=A(I,J)+B(I,J-1)
      end do
    end do
(a) The recurrence of this dual loop is carried by the inner loop, so the vector length is short.
(b) Exchanging the outer and inner loops increases the vector length for better processing efficiency.
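Interchanging the loops is legal here because the dependence distance vector stays lexicographically positive. A small Python sketch (standing in for the Fortran, with made-up data) confirms that both loop orders produce the same B; after interchange, the dependence-free I loop of length 100 is the inner, vectorizable one:

```python
import copy

M, NJ = 100, 20
A = [[(i + j) % 7 for j in range(NJ + 1)] for i in range(M + 1)]
B0 = [[(i * j) % 5 for j in range(NJ + 1)] for i in range(M + 1)]

# (a) Original order: the recurrence B(I,J)=...+B(I,J-1) runs along the
# inner J loop, which therefore cannot be vectorized.
B = copy.deepcopy(B0)
for i in range(1, M + 1):
    for j in range(2, NJ + 1):
        B[i][j] = A[i][j] + B[i][j - 1]

# (b) Interchanged order: J outside, I inside. The I loop carries no
# dependence, so it vectorizes with length 100.
B2 = copy.deepcopy(B0)
for j in range(2, NJ + 1):
    for i in range(1, M + 1):
        B2[i][j] = A[i][j] + B2[i][j - 1]

assert B == B2
```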
6
Loop Decay Method
• Changing a dual loop into a single loop is intended to increase the vector length.
(a) Dual loop:
    real A(6,6), B(6,6), C(6,6)
    do I=1,6
      do J=1,6
        A(I,J)=B(I,J)+C(I,J)
      end do
    end do

(b) Single loop:
    real A(36), B(36), C(36)
    do IJ=1,36
      A(IJ)=B(IJ)+C(IJ)
    end do
(a) The iteration count of the inner loop of the dual loop is low.
(b) Exchanging the dual loop for a single loop is called loop decay (loop collapsing).
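The collapse is valid because the 6x6 arrays occupy 36 contiguous elements. The following Python sketch (standing in for the Fortran, assuming Fortran's column-major storage order) checks that the single 36-iteration loop computes the same values as the 6x6 dual loop:

```python
N = 6
B2 = [[i * N + j for j in range(N)] for i in range(N)]
C2 = [[(i + j) % 4 for j in range(N)] for i in range(N)]

# (a) Dual loop over the 6x6 arrays: each loop iterates only 6 times.
A2 = [[0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        A2[i][j] = B2[i][j] + C2[i][j]

# (b) Loop decay: view the arrays as flat length-36 vectors (column-
# major, as Fortran stores them) and run one loop of 36 iterations.
B1 = [B2[i][j] for j in range(N) for i in range(N)]
C1 = [C2[i][j] for j in range(N) for i in range(N)]
A1 = [0] * (N * N)
for ij in range(N * N):
    A1[ij] = B1[ij] + C1[ij]

assert A1 == [A2[i][j] for j in range(N) for i in range(N)]
```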
7
Strip Mining Method
• The strip mining method applies when a single loop has a large N value. Contrary to the loop decay method, we break it up into multiple loops.
• This method is used mainly for the vector register control in the vectorization process.
(a) Single loop:
    do I=1,N
      C(I)=A(I)+B(I)
    end do

(b) Strip-mined dual loop:
    do I=1,N,64
      do J=I,min(I+63,N)
        C(J)=A(J)+B(J)
      end do
    end do
(a) If the machine has only 64-element vector registers, this loop cannot be handled in one register load when N is large.
(b) To use those registers effectively, we convert the loop into a dual loop that processes one register-sized strip at a time.
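The equivalence of the single loop and the strip-mined dual loop can be checked with a short sketch (plain Python standing in for the Fortran; the 64-element register length is taken from the slide, and the data are made up):

```python
N = 200          # problem size larger than one vector register
VL = 64          # assumed vector register length, as on the slide
A = list(range(N))
B = [2 * x for x in range(N)]

# (a) Single loop: one long loop of N iterations.
C = [A[i] + B[i] for i in range(N)]

# (b) Strip mining: the outer loop steps in strips of 64 and the inner
# loop processes one register-sized strip; min() handles the last,
# possibly partial, strip.
C2 = [0] * N
for i in range(0, N, VL):
    for j in range(i, min(i + VL, N)):
        C2[j] = A[j] + B[j]

assert C == C2
```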
8
Parallelizing Compilers for Multiprocessor System
• For further improvement of the performance of pipelined supercomputers in the hardware area, the methods considered include:
  – increasing the number of pipe stages
  – raising the speed of processing at each stage (increasing the element speed)
  – increasing the number of pipes which can operate in parallel
• In reality, there are difficulties in stepping up the performance by means of increasing the number of stages and pipes.
• If we rely on increases in element speed to boost effective performance, then even with the current fastest silicon bipolar logic ICs (gate delays as low as 70 ps, at about 20,000-gate integration), it is still difficult to attain a several-digit speed increase in the future.
• Therefore, the multiprocessor format, which connects multiple conventional pipeline processors, is important.
• The second-generation supercomputers all incorporate the multiprocessor format; e.g., the CRAY-2, CRAY X-MP, and Nichiden (NEC) SX-3 are each made up of four processors connected by shared memory.
9
Automatic Parallelizing Compiler
with Multiprocessor
The multiprocessor system is a combination of plural processors connected by an interconnection network or shared memory. Each processor can process a different stream of instructions while giving data to, and receiving data from, the other processors.
[Figure: Multiprocessor system. Processors 1, 2, and 3, each with its own local memory, are connected through an interconnection network (bus, crossbar switch, multistage switching network, etc.) to a shared memory.]
10
Software: Parallel Programming Language
Observation 1: A regular sequential programming language (C, Fortran, C++, etc.) plus four communication statements (send, receive, myid, numnodes) is necessary and sufficient to form a parallel computing language.
1. send: A processor sends a message to the network. Note that this processor does not have to know which processor it is sending the message to, but it does give a "name" to the message.
2. receive: A processor receives a message from the network. Note that this processor does not have to know which processor sent the message; it retrieves the message by name.
3. myid: An integer between 0 and P-1 identifying the processor. myid is always unique within one partition.
4. numnodes: An integer giving the total number of nodes in the system.
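Observation 1 can be illustrated with a small simulation. The following Python program is a stand-in for a real message-passing system (real codes would use MPI or similar): threads play the processors, and a shared mailbox keyed by message name plays the network, so send/receive/myid/numnodes are exactly the four statements above:

```python
import threading
import queue

NUMNODES = 4                      # plays the role of numnodes
mailbox = {"partial": queue.Queue()}  # the "network": messages by name

def send(name, data):
    """Put a named message on the network; sender need not know the receiver."""
    mailbox[name].put(data)

def receive(name):
    """Retrieve a message by name; receiver need not know the sender."""
    return mailbox[name].get()

results = {}

def worker(myid):
    # Every processor computes a partial sum; nonzero ids send it,
    # processor 0 receives all partials and totals them.
    partial = sum(range(myid * 10, (myid + 1) * 10))
    if myid != 0:
        send("partial", partial)
    else:
        total = partial + sum(receive("partial") for _ in range(NUMNODES - 1))
        results["total"] = total

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUMNODES)]
for t in threads: t.start()
for t in threads: t.join()
assert results["total"] == sum(range(40))
```

Note that receive("partial") retrieves by name only, matching the slide's point that neither side needs to know the other's identity.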
11
Send and Receive
Figure 1: Basic Message Passing
Sender: The circle on the left represents the "Sender", whose responsibility is to send a message to the "Network Buffer" without knowing who the receiver is.
Receiver: The "Receiver" on the right issues a request to the "buffer" to retrieve a message that is labeled for it.
Notes:
1. This is the so-called single-sided message passing, which is popular on most distributed-memory supercomputers.
2. The Network Buffer, as labeled, does not in fact exist as an independent entity; it is only temporary storage, created either in the sender's RAM or in the receiver's RAM depending on the readiness of the message routing information. For example, if a message's destination is known but the exact location at the destination is not, the message will be copied to the receiver's RAM for easier transmission.
[Figure 1: The basic concept of message passing: Sender -> Network Buffer -> Receiver.]
12
THREE WAYS TO COMMUNICATE
1. Synchronous: The sender will not proceed to the next task until the receiver retrieves the message from the network (hand delivery: slow!).
2. Asynchronous: The sender will proceed to the next task whether or not the receiver has retrieved the message from the network (mailing a letter: does not tie up the sender!). There is no protection for the message in the buffer.
3. Interrupt: The receiver interrupts the sender's current activity to pull messages from the sender (ordering a package: interrupts the sender!).
(An example of asynchronous message passing is given on a later slide.)
13
Synchronous Communication
Figure 2: Synchronous Message Passing
The circle on the left sends msg 1 toward the imaginary Network "Buffer", which requests the destination to stop its current activities and get ready to receive a message from the sender. In synchronous mode, the Receiver immediately halts its current processing stream and issues an acknowledgement to the Sender saying "OK" to send the message. After receiving this acknowledgement, the Sender delivers the intended message to the Receiver at the exact location.
[Figure 2: Synchronous message passing. Sender: Send(msg 1); handshake "OK?" / "Yes"; Receiver: Receive(msg 1).]
14
Asynchronous Communication
Figure 3: Asynchronous message passing. The Sender issues a message with the appropriate addressing header (envelope information) and, regardless of whether the message has arrived at the Receiver end, continues its execution without waiting for any confirmation from the Receiver. The Receiver, on the other hand, also continues its own execution stream until the "receive" statement is reached.
Note: The advantage of asynchronous message passing is its speed: neither party needs to wait. The risk is that the message sits unprotected in the buffer and may be misused.
[Figure 3: Asynchronous message passing. Sender: Send(msg 1) and continue; Receiver: Receive(msg 1) when ready.]
15
Asynchronous Message Passing Example
SRC Processor (Sender):
    doing_something_useful
    ...
    msg_sid = isend()        /* send msg */
    ...
    doing_sth_without_messing_msg

DST Processor (Receiver):
    Choice I: msgwait
        msg_rid = irecv()    /* no need of msgs yet */
        ...
        doing_sth_without_needing_msgs
        ...
        msgwait(msg_rid);    /* does not return until msg arrives */
        doing_sth_using_msgs
    Choice II: msgdone
        if (msgdone(msg_rid)) doing_sth_with_it;
        else doing_other_stuff;
    Choice III: msgignore
        msgignore(msg_rid);  /* oops, wrong number */
    Choice IV: msgmerge
        mid = msgmerge(mid1, mid2);  /* to group msgs for a purpose */
Figure 4: Asynchronous message passing example. The Sender issues a message and then continues its execution regardless of the Receiver's response. The Receiver has several options with regard to the message already issued by the Sender, which now stays in a place called the Buffer:
1. The first option is to wait until the message has arrived and then make use of it.
2. The second is to check whether the message has indeed arrived; if yes, do something with it, otherwise continue with its own work.
3. The third is to ignore the message, telling the buffer that this message was not for it.
4. The fourth is to merge this message with another existing message in the Buffer.
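The receiver's first two choices can be mimicked with Python's standard concurrent.futures module. This is an illustrative analogy, not the NX-style calls on the slide: holding a Future corresponds to irecv, done() to msgdone, and result() to msgwait:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def network_deliver():
    """Simulate a message that is still in flight for a moment."""
    time.sleep(0.1)
    return "msg 1"

with ThreadPoolExecutor() as pool:
    msg_rid = pool.submit(network_deliver)   # like irecv(): returns at once

    # Choice II (msgdone): poll without blocking; the receiver can keep
    # doing its own work if the message has not arrived yet.
    arrived_early = msg_rid.done()

    # ... doing_sth_without_needing_msgs ...

    # Choice I (msgwait): block until the message arrives, then use it.
    msg = msg_rid.result()

assert msg == "msg 1"
```

The ignore and merge choices have no direct Future equivalent; they would be modeled by dropping the Future or by combining two results once both complete.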
16
Interrupt Communication
Figure 5: Interrupt message passing:
1. The Sender first issues a short message to interrupt the Receiver's current execution stream, so that the Receiver is ready to receive a long message from the Sender.
2. After an appropriate delay (for the interrupt to return the operation pointer to the messaging process), the Sender pushes the message through to the right location for the Receiver without further delay.
[Figure 5: Interrupt message passing. Sender: send a short message to interrupt the Receiver, then send the message itself.]
17
NINE COMMUNICATION PATTERNS
1. 1 to 1
2. 1 to Partial
3. 1 to All
4. Partial to 1
5. Partial to Partial
6. Partial to All
7. All to 1
8. All to Partial
9. All to All
18
Communication Patterns
Figure 6: Nine Communication Patterns:
(A) A single processor can send one message (the same one) to one processor, to a sub-group of M processors, or to the entire system.
(B) A sub-group of M processors, or all processors, can send M different messages, or P different messages, to one processor.
(C) A sub-group of K processors (how the messages are partitioned is a separate issue) can send messages to the entire system.
Finally, the entire system of P processors can send P different messages to one processor, to a sub-group of N processors, or to the entire system.
Notes:
1. In the obvious case of one message sent to 1, K, or P processors (and the same case in reverse), messages are partitioned naturally.
2. In the case of M messages sent to K processors, the matter is a different problem, which we will discuss later.
[Figure 6: Sender-to-receiver patterns: (A) 1->1, 1->M, 1->All; (B) M->1, All->1; (C) K->P, All(P)->All(P).]
19
PARALLEL PROGRAMMING TOOLS
1. Parallel computing languages (parallel FORTRAN, C, C++, etc.)
   1.1 Message-passing assistant
   1.2 Portability helpers: PVM, MPI, ...
2. Debuggers
3. Performance analyzers
4. Queuing system (same as in sequential computing)
20
Parallel Performance Measurement
1. Speedup
Let T(1,N) be the time required for the best serial algorithm to solve a problem of size N on 1 processor, and T(P,N) the time for a given parallel algorithm to solve the same problem of the same size N on P processors. Speedup is defined as
S(P, N) = T(1,N)/T(P, N)
Remarks:
1. Normally, S(P,N) < P; ideally, S(P,N) = P; rarely, S(P,N) > P (super speedup).
2. Linear speedup: S(P,N) = c*P, where c is a constant independent of N and P.
3. Algorithms with S(P,N) = c*P are called scalable algorithms.
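The definitions translate directly into code. The helpers below use hypothetical timings (100 s serial, 16 s on 8 processors) purely for illustration; efficiency E(P,N) = S(P,N)/P is defined on the next slide:

```python
def speedup(t_serial, t_parallel):
    """S(P,N) = T(1,N) / T(P,N)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(P,N) = S(P,N) / P."""
    return speedup(t_serial, t_parallel) / p

# Hypothetical timings: best serial run 100 s; 8 processors take 16 s.
S = speedup(100.0, 16.0)        # 6.25 < 8: the normal, sub-linear case
E = efficiency(100.0, 16.0, 8)  # 0.78125: above the ~0.6 rule of thumb
assert S == 6.25 and abs(E - 0.78125) < 1e-12
```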
21
2. Parallel Efficiency
Let T(1,N) be the time required for the best serial algorithm to solve a problem of size N on 1 processor, and T(P,N) the time for a given parallel algorithm to solve the same problem of the same size N on P processors. Parallel efficiency is defined as
E(P,N)= T(1, N)/[T(P, N)P] = S(P,N)/P
Remarks:
1. Normally, E(P,N) < 1; ideally, E(P,N) = 1; rarely, E(P,N) > 1. E(P,N) ≈ 0.6 is acceptable; of course, it is problem-dependent.
2. Linear speedup: E(P,N) = c, where c is a constant independent of N and P.
3. Algorithms with E(P,N) = c are called scalable algorithms.
22
3. Load Imbalance Ratio I(P,N)
• Processor i spends time ti doing useful work; tmax = max{ti} is the maximum time spent by any processor, and tavg = (Σ ti)/P is the average time, where the sum runs over i = 0, ..., P-1. The total time spent on useful work (computation and communication) is Σ ti, while the time the system is occupied (computation, communication, or idle) is P*tmax. Thus, we define a parameter called the load imbalance ratio:

I(P,N) = [P*tmax - Σ ti] / Σ ti = tmax/tavg - 1
Remarks:
1. I(P,N) measures the time wasted, relative to the useful work, due to load imbalance.
2. If ti = t for all i, then tmax = tavg = t and I(P,N) = 0: complete load balance.
3. One slow processor (a large tmax) can hold up the entire team. This observation shows that the slave-master scheme is often very inefficient, because a slow master processor causes load imbalance.
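The ratio is straightforward to compute from per-processor timings; the sketch below uses made-up times for illustration:

```python
def load_imbalance(times):
    """I(P,N) = [P*tmax - sum(ti)] / sum(ti) = tmax/tavg - 1."""
    p = len(times)
    tmax, tavg = max(times), sum(times) / p
    return tmax / tavg - 1

# Perfectly balanced: every processor works the same time, so I = 0.
assert load_imbalance([4.0, 4.0, 4.0, 4.0]) == 0.0

# One slow processor drags the whole team: tavg = 5, tmax = 8, I = 0.6,
# i.e. 60% of the useful time is wasted waiting at the synchronization.
assert abs(load_imbalance([4.0, 4.0, 4.0, 8.0]) - 0.6) < 1e-12
```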
23
Load Balance: ti on P Nodes Within Synchronization
24