Midterm Solution
Mid-Term Solution, CS423 Computer Architecture
Q1. Suppose we have made the following measurements:
Frequency of all FP operations = 25%
Average CPI of FP operations excluding FPSQR = 4.0
Average CPI of all other (non-FP) instructions = 1.33
Frequency of FPSQR = 3%
CPI of FPSQR = 20
Assume that the two design alternatives are to reduce the CPI of FPSQR to 2, or to reduce the CPI of all FP operations to 2.
a) What is the effective CPI of the original machine, without any design enhancements?
b) Which of the two design alternatives has better performance? Show your analysis.
c) What is the speedup of the better design alternative, compared to the original machine?
Solution
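a) CPI original = 1.33 * (1 - 0.25) + 4 * (0.25 - 0.03) + 20 * (0.03)
= 2.4775
b) CPI new FPSQR = 1.33 * (1 - 0.25) + 4 * (0.22) + 2 * (0.03)
= 1.9375
CPI new FP = 1.33 * (1 - 0.25) + 2 * (0.22) + 2 * (0.03)
= 1.4975
Reducing the CPI of all FP operations to 2 gives the lower CPI, so it is the better alternative.
c) Speedup = 2.4775 / 1.4975 = 1.65 (approximately)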
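The arithmetic above can be checked with a short script. This is only a sketch: the function and variable names are illustrative and not part of the exam, and it assumes the instruction count and clock rate are the same for all three designs, so that speedup reduces to a ratio of CPIs.

```python
def effective_cpi(mix):
    """Weighted CPI for a list of (frequency, cpi) pairs; frequencies should sum to 1."""
    return sum(freq * cpi for freq, cpi in mix)

F_FP = 0.25       # frequency of all FP operations (includes FPSQR)
F_FPSQR = 0.03    # frequency of FPSQR
CPI_OTHER = 1.33  # average CPI of non-FP instructions

# Original machine: FP operations other than FPSQR at CPI 4, FPSQR at CPI 20.
cpi_original = effective_cpi(
    [(1 - F_FP, CPI_OTHER), (F_FP - F_FPSQR, 4.0), (F_FPSQR, 20.0)])

# Alternative 1: only FPSQR improves, to CPI 2.
cpi_new_fpsqr = effective_cpi(
    [(1 - F_FP, CPI_OTHER), (F_FP - F_FPSQR, 4.0), (F_FPSQR, 2.0)])

# Alternative 2: every FP operation (FPSQR included) runs at CPI 2.
cpi_new_fp = effective_cpi([(1 - F_FP, CPI_OTHER), (F_FP, 2.0)])

# Assumption: same instruction count and clock, so speedup is the CPI ratio.
speedup = cpi_original / min(cpi_new_fpsqr, cpi_new_fp)

print(f"{cpi_original:.4f} {cpi_new_fpsqr:.4f} {cpi_new_fp:.4f} {speedup:.2f}")
```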
Q2. Three enhancements with the following speedups are proposed for a new architecture.
Speedup 1 = 30
Speedup 2 = 20
Speedup 3 = 15
Assume that the three enhancements are non-overlapping, i.e. only one enhancement is usable at any point in time.
a) If enhancements 1 and 2 are each usable for 25% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?
b) Assume, for some benchmark, the fraction of use is 15% for each of enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which one should it be, to achieve the best overall performance?
c) Assume the enhancements can be used 25%, 35%, and 10% of the time for enhancements 1, 2, and 3, respectively. For what fraction of the reduced execution time is no enhancement in use?
Solution
a) Speedup overall = 1 / ((1 - 0.25 - 0.25 - X) + (0.25/30 + 0.25/20 + X/15))
10 = 1 / ((1 - 0.25 - 0.25 - X) + (0.25/30 + 0.25/20 + X/15))
X = 45.09%
b) Speedup 1 = 1 / ((1 - 0.15) + 0.15/30) = 1.1696
Speedup 2 = 1 / ((1 - 0.15) + 0.15/20) = 1.1662
Speedup 3 = 1 / ((1 - 0.7) + 0.7/15) = 2.8846
The 3rd enhancement gives the best performance, so it should be implemented.
c) Un-enhanced portion = 1 - 0.25 - 0.35 - 0.1 = 0.3
Enhanced portion after speedup = 0.25/30 + 0.35/20 + 0.1/15 = 0.0325
% time un-enhanced after speedup = 0.3 / (0.3 + 0.0325) = 90.23%
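All three parts apply Amdahl's law for several non-overlapping enhancements, and a small script can check them. This is a sketch only: the helper `overall_speedup` and the variable names are illustrative, and the closed form for X in part a) comes from rearranging the speedup equation by hand.

```python
def overall_speedup(fractions, speedups):
    """Amdahl's law for non-overlapping enhancements.

    fractions[i] is the share of the ORIGINAL execution time in which
    enhancement i can be used; speedups[i] is its speedup.
    """
    unenhanced = 1.0 - sum(fractions)
    enhanced = sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / (unenhanced + enhanced)

SPEEDUPS = [30.0, 20.0, 15.0]

# a) Solve 1/target = (1 - 0.25 - 0.25 - x) + 0.25/30 + 0.25/20 + x/15 for x.
target = 10.0
x = (1 - 0.25 - 0.25 + 0.25 / 30 + 0.25 / 20 - 1 / target) / (1 - 1 / 15)

# b) Speedup obtained when only one enhancement is implemented.
usage = [0.15, 0.15, 0.70]
single = [overall_speedup([f], [s]) for f, s in zip(usage, SPEEDUPS)]
best = single.index(max(single)) + 1  # 1-based enhancement number

# c) Share of the REDUCED execution time in which no enhancement is in use.
fractions = [0.25, 0.35, 0.10]
unenhanced = 1.0 - sum(fractions)
enhanced = sum(f / s for f, s in zip(fractions, SPEEDUPS))
unenhanced_share = unenhanced / (unenhanced + enhanced)

print(f"a) x = {x:.4f}")
print("b) speedups:", [round(s, 4) for s in single], "best:", best)
print(f"c) un-enhanced share = {unenhanced_share:.4f}")
```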
Q3. Use the following code fragment:
Loop: LW R1, 0(R2)
      LW R4, 0(R3)
      ADD R1, R1, R4
      SW 0(R2), R1
      ADDI R2, R2, #4
      ADDI R3, R3, #4
      SUB R6, R5, R2
      BNEZ R6, Loop
Assume the initial value of R5 is R2 + 96.
a) For the simple DLX pipeline (Figure 3.4 in the book) with no forwarding or bypassing hardware, draw the timeline for the execution and calculate the number of cycles the above loop will take.
b) Now, for the architecture shown in Figure 3.22 (book), with normal forwarding and bypassing hardware, draw the timeline for the execution. Assume a predict-taken scheme; you are allowed to reschedule the instruction sequence.
Solution
a)
Instruction         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
LW R1, 0(R2)        F  D  X  M  W
LW R4, 0(R3)           F  D  X  M  W
ADD R1, R1, R4            F  S  S  D  X  M  W
SW 0(R2), R1                 S  S  F  S  S  D  X  M  W
ADDI R2, R2, #4                 S  S  S  S  F  D  X  M  W
ADDI R3, R3, #4                    S  S  S  S  F  D  X  M  W
SUB R6, R5, R2                        S  S  S  S  F  S  D  X  M  W
BNEZ R6, Loop                            S  S  S  S  S  F  S  S  D  X  M  W
LW R1, 0(R2)                                S  S  S  S  S  F  S  S        F
Number of Cycles = 24 * 18 + 1 = 433
b)
Instruction         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
LW R1, 0(R2)        F  D  X  M  W
LW R4, 0(R3)           F  D  X  M  W
ADDI R2, R2, #4           F  D  X  M  W
ADD R1, R1, R4               F  D  X  M  W
SUB R6, R5, R2                  F  D  X  M  W
SW -4(R2), R1                      F  D  X  M  W
BNEZ R6, Loop                         F  D  X  M  W
ADDI R3, R3, #4                          F  D  X  M  W
LW R1, 0(R2)                                   F
Number of Cycles = 24 * 9 + 3 = 219
Q4. Consider the following code fragment:
Loop: LD F0, 0(R2)
LD F4, 0(R3)
MULTD F0, F0, F4
ADDD F2, F0, F2
ADDI R2, R2, #8
ADDI R3, R3, #8
SUB R5, R4, R2
BNEZ R5, Loop
Assume that the initial value of R4 is R2 + 792. Assume the pipeline latencies shown in Figure 4.2. Also assume integer latencies as in the DLX architecture, with forwarding and with additional branch hardware in the second (ID) stage to determine the branch target address and the branch condition.
a) For the code segment shown above, how many clock cycles will be required per iteration, without any
rescheduling? Show the number of stalls and their positions with respect to the instructions in the
loop.
b) Now, reschedule the loop to minimize the number of stall cycles per iteration. How many clock cycles
per iteration are required for the execution, after rescheduling?
c) Unroll the original loop twice so that two copies of the loop are in the unrolled code. Eliminate any obviously redundant computations and name dependences between the unrolled copies of the loop, by
renaming the registers. How many clock cycles are required per iteration of the unrolled loop? How
many clock cycles are required per iteration for each of the two elements of the unrolled loop?
Solution
a) 14 cycles per iteration:
LD F0, 0(R2)
LD F4, 0(R3)
Stall
MULTD F0, F0, F4
Stall
Stall
Stall
ADDD F2, F0, F2
ADDI R2, R2, #8
ADDI R3, R3, #8
SUB R5, R4, R2
Stall
BNEZ R5, Loop
Stall
b) 8 clock cycles per iteration:
LD F0, 0(R2)
LD F4, 0(R3)
ADDI R2, R2, #8
MULTD F0, F0, F4
SUB R5, R4, R2
ADDI R3, R3, #8
BNEZ R5, Loop
ADDD F2, F0, F2
c) 12 clock cycles in total, so 6 cycles per element:
LD F0, 0(R2)
LD F4, 0(R3)
LD F8, 8(R2)
MULTD F0, F0, F4
LD F12, 8(R3)
ADDI R2, R2, #16
MULTD F8, F8, F12
ADDD F2, F0, F2
SUB R5, R4, R2
ADDI R3, R3, #16
BNEZ R5, Loop
ADDD F2, F8, F2
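
The cycle counts for the three schedules can be reproduced with a small in-order issue model. This is a rough sketch, not the book's pipeline. The latency table is an assumption inferred from the stall positions in part a): 1 stall between a load and a dependent FP operation, 3 between dependent FP operations, and 1 between an integer operation and a dependent branch. The model assumes a single branch-delay cycle that is wasted unless an instruction follows the branch in the listing, and it ignores WAR/WAW hazards and structural hazards. All names are illustrative.

```python
# Assumed stall cycles between a producer and a dependent consumer.
LATENCY = {("load", "fp"): 1, ("fp", "fp"): 3, ("int", "branch"): 1}

KIND = {"LD": "load", "MULTD": "fp", "ADDD": "fp",
        "ADDI": "int", "SUB": "int", "BNEZ": "branch"}

def cycles_per_iteration(code):
    """code: list of (opcode, destination register or None, [source registers])."""
    produced = {}  # register -> (issue cycle of its producer, producer kind)
    cycle = 0
    for op, dest, srcs in code:
        kind = KIND[op]
        issue = cycle + 1  # in-order issue, at most one instruction per cycle
        for reg in srcs:
            if reg in produced:
                p_cycle, p_kind = produced[reg]
                issue = max(issue, p_cycle + 1 + LATENCY.get((p_kind, kind), 0))
        cycle = issue
        if dest is not None:
            produced[dest] = (cycle, kind)
    # A branch in the last position leaves its delay cycle unused: one more stall.
    if KIND[code[-1][0]] == "branch":
        cycle += 1
    return cycle

original = [
    ("LD", "F0", ["R2"]), ("LD", "F4", ["R3"]),
    ("MULTD", "F0", ["F0", "F4"]), ("ADDD", "F2", ["F0", "F2"]),
    ("ADDI", "R2", ["R2"]), ("ADDI", "R3", ["R3"]),
    ("SUB", "R5", ["R4", "R2"]), ("BNEZ", None, ["R5"]),
]
rescheduled = [
    ("LD", "F0", ["R2"]), ("LD", "F4", ["R3"]), ("ADDI", "R2", ["R2"]),
    ("MULTD", "F0", ["F0", "F4"]), ("SUB", "R5", ["R4", "R2"]),
    ("ADDI", "R3", ["R3"]), ("BNEZ", None, ["R5"]),
    ("ADDD", "F2", ["F0", "F2"]),
]
unrolled = [
    ("LD", "F0", ["R2"]), ("LD", "F4", ["R3"]), ("LD", "F8", ["R2"]),
    ("MULTD", "F0", ["F0", "F4"]), ("LD", "F12", ["R3"]),
    ("ADDI", "R2", ["R2"]), ("MULTD", "F8", ["F8", "F12"]),
    ("ADDD", "F2", ["F0", "F2"]), ("SUB", "R5", ["R4", "R2"]),
    ("ADDI", "R3", ["R3"]), ("BNEZ", None, ["R5"]),
    ("ADDD", "F2", ["F8", "F2"]),
]

for name, code in [("a", original), ("b", rescheduled), ("c", unrolled)]:
    print(name, cycles_per_iteration(code))
```

Under these assumptions the model gives the same per-iteration counts as the hand analysis for parts a), b), and c).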