Midterm Solution

Mid-Term Solution, CS423 Computer Architecture

Q1. Suppose we have made the following measurements:

Frequency of all FP operations = 25%

Average CPI of FP operations excluding FPSQR = 4.0

Average CPI of all other (non-FP) instructions = 1.33

Frequency of FPSQR = 3%

CPI of FPSQR = 20

Assume that the two design alternatives are to reduce the CPI of FPSQR to 2, or to reduce the CPI of all FP operations to 2.

a) What is the effective CPI of the original machine, without any design enhancements?

b) Which of the two design alternatives has better performance? Show your analysis.

c) What is the speedup of the better design alternative, compared to the original machine?

Solution

a) CPI original = 1.33 * (1 - 0.25) + 4 * (0.25 - 0.03) + 20 * (0.03)

= 2.4775

b) CPI new FPSQR = 1.33 * (1 - 0.25) + 4 * (0.22) + 2 * (0.03)

= 1.9375

CPI new FP = 1.33 * (1 - 0.25) + 2 * (0.22) + 2 * (0.03)

= 1.4975

Reducing the CPI of all FP operations to 2 gives the lower CPI, so it is the better alternative.

c) Speedup = 2.4775 / 1.4975 = 1.65
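The arithmetic for Q1 can be checked with a short script. This is only a sketch: the helper name `effective_cpi` and the dictionary layout are illustrative assumptions, and the speedup is taken as a ratio of CPIs on the assumption that instruction count and clock rate are unchanged.

```python
def effective_cpi(mix):
    """Weighted-average CPI; `mix` maps a class name to (frequency, CPI)."""
    total_freq = sum(freq for freq, _ in mix.values())
    assert abs(total_freq - 1.0) < 1e-9, "frequencies must sum to 1"
    return sum(freq * cpi for freq, cpi in mix.values())


# Instruction mix from Q1: 75% non-FP, 22% FP other than FPSQR, 3% FPSQR.
original = {
    "non_fp": (0.75, 1.33),
    "fp_other": (0.22, 4.0),
    "fpsqr": (0.03, 20.0),
}
# Alternative 1: only the CPI of FPSQR drops to 2.
fast_fpsqr = dict(original, fpsqr=(0.03, 2.0))
# Alternative 2: the CPI of every FP operation (FPSQR included) drops to 2.
fast_fp = dict(original, fp_other=(0.22, 2.0), fpsqr=(0.03, 2.0))

cpi_original = effective_cpi(original)
cpi_fast_fpsqr = effective_cpi(fast_fpsqr)
cpi_fast_fp = effective_cpi(fast_fp)

# Same instruction count and clock rate assumed, so speedup is a CPI ratio.
speedup = cpi_original / cpi_fast_fp

print(round(cpi_original, 4), round(cpi_fast_fpsqr, 4), round(cpi_fast_fp, 4))
print(round(speedup, 3))
```

Changing a frequency or CPI in `original` shows how sensitive the comparison is to the instruction mix.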

Q2. Three enhancements with the following speedups are proposed for a new architecture.

Speedup 1 = 30

Speedup 2 = 20

Speedup 3 = 15

Assume that the three enhancements are non-overlapping, i.e., only one enhancement is usable at any point in time.

a) If enhancements 1 and 2 are each usable for 25% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?

b) Assume that, for some benchmark, the fraction of use is 15% for each of enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which one should it be to achieve the best overall performance?

c) Assume the enhancements can be used 25%, 35%, and 10% of the time for enhancements 1, 2, and 3, respectively. For what fraction of the reduced execution time is no enhancement in use?

Solution

a) Speedup overall = 1 / ((1 - 0.25 - 0.25 - X) + (0.25/30 + 0.25/20 + X/15))

10 = 1 / ((1 - 0.25 - 0.25 - X) + (0.25/30 + 0.25/20 + X/15))

X = 45.09%

b) Speedup 1 = 1 / ((1 - 0.15) + 0.15/30) = 1.1696

Speedup 2 = 1 / ((1 - 0.15) + 0.15/20) = 1.1662

Speedup 3 = 1 / ((1 - 0.7) + 0.7/15) = 2.8846

The 3rd enhancement gives the best performance, so it should be implemented.

c) Un-enhanced portion = 1 - 0.25 - 0.35 - 0.1 = 0.3

Enhanced portion (after speedup) = 0.25/30 + 0.35/20 + 0.1/15 = 0.0325

% time un-enhanced after speedup = 0.3 / (0.3 + 0.0325) = 90.23%
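The three parts of Q2 all apply Amdahl's law for non-overlapping enhancements, so they can be checked together. The sketch below is illustrative only: the function name `amdahl_speedup` and the variable names are assumptions, and part a) is solved by rearranging the equation for X by hand. The printed values should agree with the figures above up to rounding in the last digit.

```python
def amdahl_speedup(enhancements):
    """Overall speedup for non-overlapping enhancements.

    `enhancements` is a list of (fraction_of_original_time, speedup) pairs.
    """
    unenhanced = 1.0 - sum(f for f, _ in enhancements)
    enhanced = sum(f / s for f, s in enhancements)
    return 1.0 / (unenhanced + enhanced)


# a) Solve 1/10 = (1 - 0.25 - 0.25 - x) + 0.25/30 + 0.25/20 + x/15 for x.
target = 10.0
fixed = (1.0 - 0.25 - 0.25) + 0.25 / 30 + 0.25 / 20
x = (fixed - 1.0 / target) / (1.0 - 1.0 / 15)

# b) Overall speedup when only one enhancement is implemented.
single = {
    1: amdahl_speedup([(0.15, 30)]),
    2: amdahl_speedup([(0.15, 20)]),
    3: amdahl_speedup([(0.70, 15)]),
}
best = max(single, key=single.get)

# c) Share of the reduced execution time with no enhancement in use.
unenhanced = 1.0 - 0.25 - 0.35 - 0.10
enhanced = 0.25 / 30 + 0.35 / 20 + 0.10 / 15
share_unenhanced = unenhanced / (unenhanced + enhanced)

print(f"a) x = {x:.4f}")
print(f"b) best enhancement = {best}, speedup = {single[best]:.4f}")
print(f"c) un-enhanced share = {share_unenhanced:.4f}")
```

Plugging `x` back into `amdahl_speedup([(0.25, 30), (0.25, 20), (x, 15)])` should return 10, which is a quick way to confirm part a).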

Q3. Use the following code fragment:

Loop: LW R1, 0(R2)

LW R4, 0(R3)

ADD R1, R1, R4

SW 0(R2), R1

ADDI R2, R2, #4

ADDI R3, R3, #4

SUB R6, R5, R2

BNEZ R6, Loop

Assume the initial value of R5 is R2 + 96.

a) For the simple DLX pipeline with no forwarding or bypassing hardware (Figure 3.4 from the book), draw the timeline for the execution and calculate the number of cycles the above loop will take.

b) Now, for the architecture shown in Figure 3.22 (book), with normal forwarding and bypassing hardware, draw the timeline for the execution. Assume a predict-taken scheme; you are allowed to reschedule the instruction sequence.

Solution

a) Timeline without forwarding (S = stall):

Instruction        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
LW R1, 0(R2)       F  D  X  M  W
LW R4, 0(R3)          F  D  X  M  W
ADD R1, R1, R4           F  S  S  D  X  M  W
SW 0(R2), R1                S  S  F  S  S  D  X  M  W
ADDI R2, R2, #4             S  S     S  S  F  D  X  M  W
ADDI R3, R3, #4             S  S     S  S     F  D  X  M  W
SUB R6, R5, R2              S  S     S  S        F  S  D  X  M  W
BNEZ R6, Loop               S  S     S  S           S  F  S  S  D  X  M  W
LW R1, 0(R2)                S  S     S  S           S           F  S  S  F

Number of Cycles = 24 * 18 + 1 = 433

b) Timeline with forwarding, after rescheduling:

Instruction        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
LW R1, 0(R2)       F  D  X  M  W
LW R4, 0(R3)          F  D  X  M  W
ADDI R2, R2, #4          F  D  X  M  W
ADD R1, R1, R4              F  D  X  M  W
SUB R6, R5, R2                 F  D  X  M  W
SW -4(R2), R1                     F  D  X  M  W
BNEZ R6, Loop                        F  D  X  M  W
ADDI R3, R3, #4                         F  D  X  M  W
LW R1, 0(R2)                                  F

Number of Cycles = 24 * 9 + 3 = 219

Q4. Consider the following code fragment:

Loop: LD F0, 0(R2)

LD F4, 0(R3)

MULTD F0, F0, F4

ADDD F2, F0, F2

ADDI R2, R2, #8

ADDI R3, R3, #8

SUB R5, R4, R2

BNEZ R5, Loop

Assume that the initial value of R4 is R2 + 792. Assume the pipeline latencies shown in Figure 4.2. Also assume integer latencies as in the DLX architecture, with forwarding and with additional branch hardware in the second (ID) stage to determine the branch target address and the branch condition.

a) For the code segment shown above, how many clock cycles will be required per iteration, without any rescheduling? Show the number of stalls and their positions with respect to the instructions in the loop.

b) Now, reschedule the loop to minimize the number of stall cycles per iteration. How many clock cycles per iteration are required for the execution after rescheduling?

c) Unroll the original loop twice so that two copies of the loop are in the unrolled code. Eliminate any obviously redundant computations and name dependences between the unrolled copies of the loop by renaming the registers. How many clock cycles are required per iteration of the unrolled loop? How many clock cycles are required for each of the two elements of the unrolled loop?

Solution

a) LD F0, 0(R2)

LD F4, 0(R3)

Stall

MULTD F0, F0, F4

Stall

Stall

Stall

ADDD F2, F0, F2

ADDI R2, R2, #8

ADDI R3, R3, #8

SUB R5, R4, R2

Stall

BNEZ R5, Loop

Stall

14 cycles per iteration

b) LD F0, 0(R2)

LD F4, 0(R3)

ADDI R2, R2, #8

MULTD F0, F0, F4

SUB R5, R4, R2

ADDI R3, R3, #8

BNEZ R5, Loop

ADDD F2, F0, F2

8 clock cycles per iteration

c) LD F0, 0(R2)

LD F4, 0(R3)

LD F8, 8(R2)

MULTD F0, F0, F4

LD F12, 8(R3)

ADDI R2, R2, #16

MULTD F8, F8, F12

ADDD F2, F0, F2

SUB R5, R4, R2

ADDI R3, R3, #16

BNEZ R5, Loop

ADDD F2, F8, F2

12 clock cycles in total, so 6 cycles per element
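The cycle counts for the three Q4 schedules can be cross-checked with a small in-order issue model. This is a minimal sketch under stated assumptions, not the method used in the solution: the stall values in `LATENCY` are assumed, chosen to reproduce the stalls listed in part a) (one after a load feeding an FP operation, three between dependent FP operations, one between an integer result and the branch that tests it). The tuple encoding of instructions and the name `cycles_per_iteration` are hypothetical. Dependences carried from one iteration to the next are ignored.

```python
# Stall cycles needed between a producer and a dependent consumer.
# Assumed values, picked to match the stalls listed in the solution.
LATENCY = {
    ("load", "fp"): 1,     # LD result used by MULTD/ADDD
    ("fp", "fp"): 3,       # MULTD/ADDD result used by ADDD
    ("int", "branch"): 1,  # SUB result tested by BNEZ in the ID stage
}


def cycles_per_iteration(loop):
    """Cycles for one iteration on an in-order, single-issue pipeline.

    `loop` lists (kind, dest, sources) tuples in program order. The branch
    has one delay slot: the instruction after it fills the slot, and a
    branch placed last leaves the slot empty (one extra stall cycle).
    """
    last_write = {}  # register -> (issue cycle, kind) of its latest writer
    cycle = 0
    for kind, dest, sources in loop:
        earliest = cycle + 1
        for reg in sources:
            if reg in last_write:
                producer_cycle, producer_kind = last_write[reg]
                gap = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, producer_cycle + gap + 1)
        cycle = earliest
        if dest is not None:
            last_write[dest] = (cycle, kind)
    if loop[-1][0] == "branch":
        cycle += 1  # unfilled branch delay slot
    return cycle


original_loop = [
    ("load", "F0", ["R2"]),
    ("load", "F4", ["R3"]),
    ("fp", "F0", ["F0", "F4"]),
    ("fp", "F2", ["F0", "F2"]),
    ("int", "R2", ["R2"]),
    ("int", "R3", ["R3"]),
    ("int", "R5", ["R4", "R2"]),
    ("branch", None, ["R5"]),
]

rescheduled_loop = [
    ("load", "F0", ["R2"]),
    ("load", "F4", ["R3"]),
    ("int", "R2", ["R2"]),
    ("fp", "F0", ["F0", "F4"]),
    ("int", "R5", ["R4", "R2"]),
    ("int", "R3", ["R3"]),
    ("branch", None, ["R5"]),
    ("fp", "F2", ["F0", "F2"]),
]

unrolled_loop = [
    ("load", "F0", ["R2"]),
    ("load", "F4", ["R3"]),
    ("load", "F8", ["R2"]),
    ("fp", "F0", ["F0", "F4"]),
    ("load", "F12", ["R3"]),
    ("int", "R2", ["R2"]),
    ("fp", "F8", ["F8", "F12"]),
    ("fp", "F2", ["F0", "F2"]),
    ("int", "R5", ["R4", "R2"]),
    ("int", "R3", ["R3"]),
    ("branch", None, ["R5"]),
    ("fp", "F2", ["F8", "F2"]),
]

for name, loop in [("a", original_loop), ("b", rescheduled_loop), ("c", unrolled_loop)]:
    print(name, cycles_per_iteration(loop))
```

Under these assumptions the model gives 14, 8, and 12 cycles for parts a), b), and c). Reordering the tuples in a list is a quick way to try other schedules.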
