Computer Organization Hamacher Instructor Manual Solution Chapter 81


Transcript of Computer Organization Hamacher Instructor Manual Solution Chapter 81

Page 1: Computer Organization Hamacher Instructor Manual Solution Chapter 81

Chapter 8 – Pipelining

8.1. (a) The operation performed in each step and the operands involved are as given in the figure below.

Clock cycle   1      2                 3              4                5                 6      7
I1: Add       Fetch  Decode, 20, 2000  Add            R1←2020
I2: Mul              Fetch             Decode, 3, 50  Mul              R3←150
I3: And                                Fetch          Decode, $3A, 50  And               R4←50
I4: Add                                               Fetch            Decode, 2000, 50  Add    R5←2050

(b) Buffer contents in clock cycles 2 to 5:

Buffer B1
  Cycle 2: Add instruction (I1)
  Cycle 3: Mul instruction (I2)
  Cycle 4: And instruction (I3)
  Cycle 5: Add instruction (I4)

Buffer B2
  Cycle 2: Information from a previous instruction
  Cycle 3: Decoded I1. Source operands: 20, 2000
  Cycle 4: Decoded I2. Source operands: 3, 50
  Cycle 5: Decoded I3. Source operands: $3A, 50

Buffer B3
  Cycle 2: Information from a previous instruction
  Cycle 3: Information from a previous instruction
  Cycle 4: Result of I1: 2020, destination R1
  Cycle 5: Result of I2: 150, destination R3
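The timing and results in 8.1 can be checked with a minimal sketch, assuming an ideal four-stage pipeline (fetch, decode, execute, write) in which one instruction enters per cycle and nothing stalls. The function name `ideal_schedule` and the stage labels are illustrative and do not come from the manual.

```python
STAGES = ["Fetch", "Decode", "Execute", "Write"]

def ideal_schedule(instructions):
    """Map each instruction to {stage: clock cycle}, assuming no stalls."""
    schedule = {}
    for index, name in enumerate(instructions):
        first_cycle = index + 1  # the i-th instruction is fetched in cycle i
        schedule[name] = {
            stage: first_cycle + offset for offset, stage in enumerate(STAGES)
        }
    return schedule

schedule = ideal_schedule(["I1: Add", "I2: Mul", "I3: And", "I4: Add"])
for name, cycles in schedule.items():
    print(name, cycles)

# Operand arithmetic taken from the decode steps in the figure.
results = {"R1": 20 + 2000, "R3": 3 * 50, "R4": 0x3A & 50, "R5": 2000 + 50}
print(results)
```

Under these assumptions the last write step falls in cycle 7, matching the seven clock cycles shown in the figure.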


8.2. (a) [Figure: pipeline timing over clock cycles 1 to 7. The steps of each instruction, in order, are listed below.]

I1: Add   Fetch; Decode, 20, 2000; Add; R1←2020
I2: Mul   Fetch; Decode, 3, 50; Mul; R3←150
I3: And   Fetch; Decode, $3A, ?; Decode, $3A, 2020; And; R4←32
I4: Add   Fetch; Decode, 2000, 50; Add; R5←2050

(b) Cycles 2 to 4 are the same as in P8.1, but the contents of R1 are not available until cycle 5. In cycle 5, B1 and B2 have the same contents as in cycle 4. B3 contains the result of the multiply instruction.
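The one-cycle delay of the And instruction can be sketched as follows. This is a simplified model, assuming that a register written in cycle w can first be read by a decode step in cycle w + 1; the function name `decode_cycle` is hypothetical.

```python
def decode_cycle(nominal_decode, producer_write):
    """Cycle in which decode succeeds, assuming a value written in
    cycle w becomes readable by a decode step in cycle w + 1."""
    return max(nominal_decode, producer_write + 1)

# I1 writes R1 in cycle 4; I3 (And) would normally decode in cycle 4.
and_decode = decode_cycle(nominal_decode=4, producer_write=4)
stall_cycles = and_decode - 4

r1 = 20 + 2000   # result of I1
r4 = 0x3A & r1   # And uses the new contents of R1
print(and_decode, stall_cycles, r4)
```

With these inputs the decode step completes in cycle 5 and the And result is 32, in agreement with R4←32 above.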

8.3. Step D2 may be abandoned, to be repeated in cycle 5, as shown below. But instruction I2 must remain in buffer B1. For I3 to proceed, buffer B1 must be capable of holding two instructions. The decode step for I3 has to be delayed as shown, assuming that only one instruction can be decoded at a time.

[Figure: pipeline timing over clock cycles 1 to 7 for I1 (Mul), I2 (Add), I3 and I4, showing steps F, D, E and W for each instruction; D2 appears twice.]


8.4. If all decode and execute stages can handle two instructions at a time, only instruction I2 is delayed, as shown below. In this case, all buffers must be capable of holding information for two instructions. Note that completing instruction I3 before I2 could cause problems. See Section 8.6.1.

[Figure: pipeline timing over clock cycles 1 to 7 for I1 (Mul), I2 (Add), I3 and I4, showing steps F, D, E and W for each instruction.]

8.5. Execution proceeds as follows.

[Figure: pipeline timing over clock cycles 1 to 7 for I1 to I4, showing steps F, D, E and W for each instruction.]

8.6. The instruction immediately preceding the branch should be placed after the branch.

LOOP  Instruction 1                  LOOP  Instruction 1
      ...                                  ...
      Instruction n−1                      Instruction n−1
      Instruction n                        Conditional Branch LOOP
      Conditional Branch LOOP              Instruction n

This reorganization is possible only if the branch instruction does not depend on instruction n.
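The reordering rule in 8.6 can be illustrated with a small sketch. The instruction representation (a text string paired with its destination register), the function name `fill_delay_slot` and the sample instructions are all hypothetical; the sketch also assumes the loop body contains at least one instruction and falls back to a NOP when the move is unsafe.

```python
def fill_delay_slot(loop_body, branch, branch_reads):
    """Move the last instruction of the loop body behind the branch.

    loop_body    : list of (text, destination_register) tuples
    branch       : text of the conditional branch
    branch_reads : set of registers the branch condition depends on
    Returns the instruction texts in their new order. If the branch
    depends on the last instruction, the order is kept and a NOP
    fills the delay slot instead.
    """
    *head, (last_text, last_dest) = loop_body
    if last_dest in branch_reads:
        return [text for text, _ in loop_body] + [branch, "NOP"]
    return [text for text, _ in head] + [branch, last_text]

# Hypothetical two-instruction loop body.
body = [("Decrement R2", "R2"), ("Add R1,R3", "R3")]

safe = fill_delay_slot(body, "Branch>0 LOOP", {"R2"})
unsafe = fill_delay_slot(body, "Branch>0 LOOP", {"R3"})
print(safe)
print(unsafe)
```

In the first call the branch does not depend on the last instruction, so that instruction moves into the delay slot; in the second call it does, so the original order is kept.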


8.7. The UltraSPARC arrangement is advantageous when the branch instruction is at the end of the loop and it is possible to move one instruction from the body of the loop into the delay slot. The alternative arrangement is advantageous when the branch instruction is at the beginning of the loop.

8.8. The instruction executed on a speculative basis should be one that is likely to be the correct choice most often. Thus, the conditional branch should be placed at the end of the loop, with an instruction from the body of the loop moved to the delay slot if possible. Alternatively, a copy of the first instruction in the loop body can be placed in the delay slot and the branch address changed to that of the second instruction in the loop.

8.9. The first branch (BLE) has to be followed by a NOP instruction in the delay slot, because none of the instructions around it can be moved. The inner and outer loop controls can be adjusted as shown below. The first instruction in the outer loop is duplicated in the delay slot following BLE. It will be executed one more time than in the original program, changing the value left in R3. However, this should cause no difficulty provided the contents of R3 are not needed once the sort is completed. The modified program is as follows:

        ADD       R0,LIST,R3
        ADD       R0,N,R1
        SUB       R1,1,R1
        SUB       R1,1,R2
OUTER   LDUB      [R3+R1],R5      Get LIST(j)
        LDUB      [R3+R2],R6      Get LIST(k)
INNER   SUB       R6,R5,R0
        BLE,pt    NEXT
        SUB       R2,1,R2         k ← k − 1
        STUB      R5,[R3+R2]
        STUB      R6,[R3+R1]
        OR        R0,R6,R5
NEXT    BGE,pt,a  INNER
        LDUB      [R3+R2],R6      Get LIST(k)
        SUB       R1,1,R1
        BGT,pt    OUTER
        SUB       R1,1,R2


8.10. Without conditional instructions:

          Compare   A,B        Check A − B
          Branch>0  Action1
Action2   ...       ...        One or more instructions
          Branch    Next
Action1   ...       ...        One or more instructions
Next      ...

If conditional instructions are available, we can use:

          Compare   A,B        Check A − B
          ...       ...        Action1 instruction(s), conditional
          ...       ...        Action2 instruction(s), conditional
Next      ...

In the second case, all Action 1 and Action 2 instructions must be fetched and decoded to determine whether they are to be executed. Hence, this approach is beneficial only if each action consists of one or two instructions.

[Figure: pipeline timing over clock cycles 1 to 6, in two panels. Without conditional instructions: rows for Compare A,B, Branch>0, Branch, Action1 and Next. With conditional instructions: rows for Compare A,B, Action1 (if >0 then action1), Action2 (if ≤0 then action2) and Next.]


8.11. Buffer contents will be as shown below.

Clock cycle no.   3      4      5
ALU operation     +      Shift  O3
R3                45     130    260
RSLT              198    130    260

8.12. Using Load and Store instructions, the program may be revised as follows:

INSERTION  Test      RHEAD
           Branch>0  HEAD
           Move      RNEWREC,RHEAD
           Return
HEAD       Load      RTEMP1,(RHEAD)
           Load      RTEMP2,(RNEWREC)
           Compare   RTEMP1,RTEMP2
           Branch>0  SEARCH
           Store     RHEAD,4(RNEWREC)
           Move      RNEWREC,RHEAD
           Return
SEARCH     Move      RHEAD,RCURRENT
LOOP       Load      RNEXT,4(RCURRENT)
           Test      RNEXT
           Branch=0  TAIL
           Load      RTEMP1,(RNEXT)
           Load      RTEMP2,(RNEWREC)
           Compare   RTEMP1,RTEMP2
           Branch<0  INSERT
           Move      RNEXT,RCURRENT
           Branch    LOOP
INSERT     Store     RNEXT,4(RNEWREC)
TAIL       Store     RNEWREC,4(RCURRENT)
           Return

This program contains many dependencies and branch instructions. There are very few possibilities for instruction reordering. The critical part where optimization should be attempted is the loop. Given that no information is available on branch behavior or delay slots, the only optimization possible is to separate instructions that depend on each other. This would reduce the probability of stalling the pipeline.

The loop may be reorganized as follows.


LOOP       Load      RNEXT,4(RCURRENT)
           Load      RTEMP2,(RNEWREC)
           Test      RNEXT
           Load      RTEMP1,(RNEXT)
           Branch=0  TAIL
           Compare   RTEMP1,RTEMP2
           Branch<0  INSERT
           Move      RNEXT,RCURRENT
           Branch    LOOP
INSERT     Store     RNEXT,4(RNEWREC)
TAIL       Store     RNEWREC,4(RCURRENT)
           Return

Note that we have assumed that the Load instruction does not affect the condition code flags.

8.13. Because of branch instructions, 120 clock cycles are needed to execute 100 program instructions when delay slots are not used. Using the delay slots will eliminate 0.85 of the idle cycles. Thus, the improvement is given by:

[Equation illegible in source.]

That is, instruction throughput will increase by 8.1%.

8.14. Number of cycles needed to execute 100 instructions:

Without optimization                                       140
With optimization ([expression illegible in source])       127

Thus, throughput improvement is 140/127 = 1.102, or 10.2%.

8.15. Throughput improvement due to pipelining is n, where n is the number of pipeline stages.

           Number of cycles needed to execute one instruction    Throughput
4-stage:   [expression illegible in source] = 1.04               4/1.04 = 3.85
6-stage:   [expression illegible in source] = 1.19               6/1.19 = 5.04

Thus, the 6-stage pipeline leads to higher performance.
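The comparison in 8.15 amounts to dividing the ideal speedup by the average number of cycles per instruction. A minimal sketch, using the figures 1.04 and 1.19 from the solution above (the function name `relative_throughput` is illustrative):

```python
def relative_throughput(stages, cycles_per_instruction):
    """Ideal speedup of an n-stage pipeline, scaled down by the
    average number of cycles each instruction actually needs."""
    return stages / cycles_per_instruction

four_stage = relative_throughput(4, 1.04)
six_stage = relative_throughput(6, 1.19)
print(round(four_stage, 2), round(six_stage, 2))
```

The six-stage value comes out larger, which is the basis for the conclusion above.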


8.16. For a “do while” loop, the termination condition is tested at the beginning of the loop. A conditional branch at that location will be taken when exiting the loop. Hence, it should be predicted not taken. That is, the state machine should be started in the state LNT, unless the loop is not likely to be executed at all.

A “do until” loop is executed at least once, and the branch condition is tested at the end of the loop. Assuming that the loop is likely to be executed several times, the branch should be predicted taken. That is, the state machine should be started in state LT.
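The effect of the starting state can be illustrated with a rough sketch. It assumes a generic four-state predictor that behaves as a two-bit saturating counter over the states SNT, LNT, LT and ST; the textbook's state machine may use different transitions, and the five-iteration loops are made-up examples.

```python
STATES = ["SNT", "LNT", "LT", "ST"]  # strongly/likely not taken, likely/strongly taken

def count_mispredictions(start, outcomes):
    """Count wrong predictions for a sequence of branch outcomes
    (True = taken), assuming a two-bit saturating counter."""
    state = STATES.index(start)
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2      # LT and ST predict taken
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at the ends.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

# Branch at the end of the loop: taken four times, then falls through.
do_until = [True] * 4 + [False]
# Branch at the start of the loop: not taken four times, taken on exit.
do_while = [False] * 4 + [True]

print(count_mispredictions("LT", do_until), count_mispredictions("LNT", do_until))
print(count_mispredictions("LNT", do_while), count_mispredictions("LT", do_while))
```

In this model, starting in the recommended state leaves only the unavoidable misprediction at loop exit, while the opposite starting state adds one more at loop entry.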

8.17. An instruction fetched in cycle j reaches the head of the queue and enters the decode stage in cycle j + 6. Assume that the instruction preceding I1 is decoded and instruction I6 is fetched in cycle 1. This leads to instructions I1 to I6 being in the queue at the beginning of cycle 2. Execution would then proceed as shown below.

Note that the queue is always full, because at most one instruction is dispatched and up to two instructions are fetched in any given cycle. Under these conditions, the queue length would drop below 6 only in the case of a cache miss.

[Figure: execution timing over clock cycles 1 to 10 for instructions I1 to I4, I5 (Branch), I6, Ik and Ik+1, showing the F, D, E and W steps. E1 occupies three cycles, and one entry is marked X. The queue length is 6 in every cycle.]
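The claim that the queue stays full can be checked with a simplified model. The sketch below assumes a queue of capacity 6 in which dispatch happens before fetch within a cycle; the function name `queue_lengths` and the per-cycle fetch and dispatch counts are made-up inputs, not values read from the figure.

```python
def queue_lengths(fetched_per_cycle, dispatched_per_cycle, capacity=6):
    """Queue length at the end of each cycle, starting from a full queue.
    Each cycle first dispatches, then fetches into the free slots."""
    length = capacity
    lengths = []
    for fetched, dispatched in zip(fetched_per_cycle, dispatched_per_cycle):
        length -= min(dispatched, length)           # at most one in practice
        length += min(fetched, capacity - length)   # cannot overfill the queue
        lengths.append(length)
    return lengths

# Up to two fetches per cycle, at most one dispatch: the queue stays full.
normal = queue_lengths([2] * 9, [1, 0, 0, 1, 1, 1, 1, 1, 1])
# Two cycles with no fetch (modelling a cache miss) let the queue drain.
miss = queue_lengths([2, 2, 0, 0, 2, 2], [1] * 6)
print(normal)
print(miss)
```

Because the fetch rate exceeds the dispatch rate, the queue refills within a few cycles once fetching resumes.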
