Page 1:

ECE 371 Microprocessors
Chapter 3: Instruction Scheduling on Pipelined Architectures
Herbert G. Mayer, PSU, Status 10/21/2015
For use at CCUT Fall 2015

Page 2:

Syllabus:
  Scheduling on Pipelined Architecture
  Idealized Pipeline
  Goal of Scheduling
  Causes for Dependences
  Stalls and Hazards
  Realistic Constraints
  Reservation Tables
  Collision Vector
  Vertical Expansion
  Horizontal Expansion
  IBM Measurements
  Appendix

Page 3:

Scheduling on Pipelined Architecture

Pipelining is an old architectural HW design method for accelerating processor execution, in use since the late 1970s

Pipelining improves performance not by adding HW, but by separating individual HW modules of a conventional uni-processor (UP) architecture

Instead of designing one composite, complex piece of HW for a CPU, the architect for a pipelined machine designs a sequence of simpler and thus faster, consecutive modules

Ideally, all HW modules mi would be of similar complexity

These separate modules mi are each significantly simpler than the original composite; they execute overlapped, simultaneously progressing on more than one machine instruction at any one time

Page 4:

Scheduling on Pipelined Architecture

Instead of executing one complex instruction in one longer cycle required for one complex step, a pipelined architecture executes a sequence of multiple, faster sub-instructions

Each sub-instruction is much simpler and thus much faster to execute than the single complex instruction

Such single-cycle, pipelined sub-instructions are initiated once per short clock-cycle

Each instruction then progresses to completion while migrating through the various stages of the separate hardware modules mi, called the pipeline

Page 5:

Scheduling on Pipelined Architecture

Left: Traditional Hardware Architecture. Right: equivalent Pipelined Architecture.

[Diagram: both designs contain the same six stages, I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store; in the traditional design they form one composite unit, in the pipelined design they are separate, consecutive modules]

Page 6:

Idealized Pipeline

Arithmetic Logic Unit (ALU) split into separate, sequential modules

Each of which can be initiated once per cycle, but with a shorter, pipelined clock cycle

Each module mi is present in HW just once, as in a regular UP; note exceptions: when some module is used more than once by the original, non-pipelined instruction, it is OK to duplicate it

  example: the normalize operation in an FP instruction

  also the FP adder (FPA) in an FP multiply operation

Multiple modules operate in parallel on different instructions, at different stages of each simultaneous instruction

Page 7:

Idealized Pipeline

Ideally, all modules would require unit time (1 cycle)! Ideal only; that would be simple

Ideally, all original, complex operations (fp-add, divide, fetch, store, increment by integer 1, etc.) require the same number n of steps to completion

But they do not! E.g. fp-divide takes far longer than, say, an integer increment or a no-op

Differing numbers of cycles per instruction cause instructions to terminate at different times

Operations may abort in intermediate stages, e.g. in case of a pipeline hazard, caused by: branch, call, return, conditional branch, exception

An operation also stalls in case of operand dependence

Page 8:

Idealized Pipeline

Idealized 6-stage pipeline, 8 instructions i1..i8:

       t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  t11  t12  t13
i1     if   de   op1  op2  exec wb
i2          if   de   op1  op2  exec wb
i3               if   de   op1  op2  exec wb
i4                    if   de   op1  op2  exec wb
i5                         if   de   op1  op2  exec wb
i6                              if   de   op1  op2  exec wb
i7                                   if   de   op1  op2  exec wb
i8                                        if   de   op1  op2  exec wb

Retire:                         i1   i2   i3   i4   i5   i6   i7   i8

From t6 on: 1 clock per new instruction retirement, CPI = 1

Horizontal: time, in units of cycles. Vertical: consecutive instructions
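The retirement arithmetic above can be sketched in a few lines; the function name is illustrative:

```python
# Completion time of k instructions on an ideal n-stage pipeline:
# the first instruction needs n cycles to drain the pipe, and each
# later instruction retires 1 cycle after its predecessor (CPI = 1
# in the steady state).
def ideal_pipeline_cycles(n_stages: int, k_instructions: int) -> int:
    return n_stages + (k_instructions - 1)

# The 6-stage pipeline above (if, de, op1, op2, exec, wb) retires
# 8 instructions in 13 cycles, versus 8 * 6 = 48 without pipelining.
print(ideal_pipeline_cycles(6, 8))  # 13
```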

Page 9:

Goal of Scheduling, Obstacles to Scheduling

Page 10:

Goal of Scheduling

Goal: to complete instructions at a rate faster than the CPI-rate would allow without pipelining

CPI = number of Cycles Per Instruction

Program completion time on a pipelined architecture is shorter than on a non-pipelined architecture, achieved by having the separate hardware modules progress in parallel on multiple instructions at the same time

Pipelined instructions must be retired in original, sequential order, or in an order semantically equivalent

Stalls and hazards must be minimized: via branch prediction

HW resolves dependence conflicts (hazards) via interlocking

Page 11:

Causes for Dependences

Load into register ri in one instruction, followed by use of that register ri: True Dependence, AKA Flow Dependence

Load into register ri in one instruction, followed by use of any register (if hardware does not bother to check register id; e.g. early HP PA); no longer an issue on contemporary processors

Definition of register ri in one instruction (other than a load), followed by use of register ri (on old HP HW designed with severe limitations)

Store into memory followed by a load from memory; unless the memory subsystem checks whether the load comes from the same address as the earlier store; if not, the wait can be bypassed

So done in PCI-X protocols

Page 12:

Causes for Dependences

Use of register ri in one instruction, followed by load into that register ri: Anti Dependence

Load into register ri in one instruction, followed by load into register ri in a later instruction with use of register ri in between: Output Dependence

We'll learn that both of these are false dependences

Page 13:

Basic Block: Find Dependences

-- result: is leftmost operand after opcode, except for st
-- other operands, if any, are sources
-- Mxx is Memory address at xx, implies indirection for ld
-- The parens () in (Mxx) render such indirection explicit
-- 8(sp) means indirect through sp register, offset by 8
-- #4 stands for literal value 4, decimal base

1  ld  r2, (M0)
2  add sp, r2, #12
3  st  r0, (M1)
4  ld  r3, -4(sp)
5  ld  r4, -8(sp)
6  add sp, sp, #4
7  st  r2, 0(sp)
8  ld  r5, (M2)
9  add r4, r0, #1
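The dependence hunt over this block can be sketched mechanically. The per-instruction encoding below (register written, registers read) is an assumption for illustration, and memory disambiguation is deliberately coarse: every st followed by a ld is simply flagged.

```python
# Basic block from above: (number, opcode, register written, registers read).
# Memory operands are not tracked by address, only via the st/ld rule.
block = [
    (1, 'ld',  'r2', []),            # ld r2, (M0)
    (2, 'add', 'sp', ['r2']),        # add sp, r2, #12
    (3, 'st',  None, ['r0']),        # st r0, (M1)
    (4, 'ld',  'r3', ['sp']),        # ld r3, -4(sp)
    (5, 'ld',  'r4', ['sp']),        # ld r4, -8(sp)
    (6, 'add', 'sp', ['sp']),        # add sp, sp, #4
    (7, 'st',  None, ['r2', 'sp']),  # st r2, 0(sp)
    (8, 'ld',  'r5', []),            # ld r5, (M2)
    (9, 'add', 'r4', ['r0']),        # add r4, r0, #1
]

def find_dependences(block):
    deps = []
    for i, (ni, opi, wi, ri) in enumerate(block):
        for nj, opj, wj, rj in block[i + 1:]:
            if wi and wi in rj:
                deps.append((ni, nj, 'flow'))        # write, then read
            if wi and wj == wi:
                deps.append((ni, nj, 'output'))      # write, then write
            if wj and wj in ri:
                deps.append((ni, nj, 'anti'))        # read, then write
            if opi == 'st' and opj == 'ld':
                deps.append((ni, nj, 'store-load'))  # may alias in memory
    return deps
```

This flags, among others, 1-2 (flow), 2-4 and 2-5 (flow on sp), 6-7 (flow on sp), and the store-load pairs 3-4, 3-5, and 7-8 discussed on the next slide; the any-register quirk of early HP PA is not modeled.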

Page 14:

Basic Block: Find Dependences

1-2 load of a register followed by use of that register

3-4 load from memory at -4(sp) while write to memory in progress (M1)

3-5 load from memory at -8(sp) while write to memory in progress (M1)

2-4, 2-5 define register sp, followed by use of the same register; the distance is sufficient to avoid a stall on typical architectures

6-7 define register sp before use; forces sequential execution, reduces pipelining

7-8 store followed by load!

8-9 load into register r5 followed by use of any register (on few, simple architectures, e.g. early PA)

Page 15:

Stalls and Hazards

Hardware interlock slows down execution due to delay of dependent instruction

Benefit is correct result

Cost is sequential execution and delay

Programmer can sometimes re-arrange instructions or insert delays at selected places

Compiler can re-schedule instructions, or insert delays, as can the programmer

Unless the programmer's or compiler's efforts are provably complete, the hardware interlock must still be provided

Page 16:

Stalls and Hazards

The CDC 6000 and IBM 360/91 already used automatic hardware interlocks

It is advisable to have the compiler re-schedule the instruction sequence, since re-ordering may minimize the number of interlocks that need to occur

Page 17:

Stalls and Hazards

Not all HW modules are used exactly once, and not only for one single cycle

Some HW modules mi are used more than once in one instruction; e.g. normalizer in floating-point operations is used repeatedly

Basic Block analysis is insufficient to detect all stalls or hazards; dependences may span separate Basic Blocks

Can add delay circuits to HW modules, which slows down execution, but ensures correct result

Def: Basic Block is a sequence of 1 instruction or more with 1 entry and 1 exit point; i.e. no branches in between, no labels branched-to in between!
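The definition above translates directly into a partitioning pass; the opcode set and label convention below are hypothetical, for illustration only:

```python
# Split a linear instruction list into Basic Blocks: a block ends at
# any branch-like opcode, and a branched-to label starts a new block.
BRANCHES = {'br', 'beq', 'call', 'ret'}

def split_basic_blocks(instrs):
    blocks, current = [], []
    for label, op in instrs:           # instrs: (label or None, opcode)
        if label is not None and current:
            blocks.append(current)     # a label in mid-stream starts a block
            current = []
        current.append((label, op))
        if op in BRANCHES:
            blocks.append(current)     # a branch ends the current block
            current = []
    if current:
        blocks.append(current)
    return blocks

code = [(None, 'ld'), (None, 'add'), (None, 'beq'),
        (None, 'st'), ('L1', 'add'), (None, 'ret')]
print(len(split_basic_blocks(code)))   # 3 blocks
```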

Page 18:

Instruction Reservation Tables & Collision Vectors

Page 19:

Reservation Tables

Table 1: Instructions i1 to i3 use 6 HW modules mi

       t1  t2  t3  t4  t5  t6  t7  t8
m6                         i1  i2  i3
m5                     i1  i2  i3
m4                 i1  i2  i3
m3             i1  i2  i3
m2         i1  i2  i3
m1     i1  i2  i3

Page 20:

Reservation Tables

Table 1, known as a reservation table, shows an ideal schedule using hardware modules m1 to m6, required for execution of one instruction

Ideal: because each module mi requires exactly 1 cycle, and each HW module is used exactly once

Always 1 cycle is also unrealistic, yet simple for didactic purposes

The time to complete 3 instructions i1, i2, and i3 is 8 cycles, while the time for any single instruction is 6 cycles, a net saving over time; better than 3 * 6 = 18 cycles!

Completion time per instruction on the pipelined architecture is 1 cycle in the steady state

Page 21:

Reservation Tables

For fairness' sake: on a non-pipelined architecture any one of these 3 instructions would NOT necessarily take the equivalent of 6 pipelined-machine cycles; perhaps only 4 or 5

Also for fairness' sake: it is not usual that 3 identical instructions are arranged one after the other

That model is used here to explain Reservation Tables

Page 22:

Reservation Tables

Key learning: a pipelined architecture does NOT speed up execution of a single instruction, and may even slow it down; but it improves throughput of multiple instructions in a row, though only in the steady state

Table 1: Instructions i1 to i3 use 6 HW modules mi

       t1  t2  t3  t4  t5  t6  t7  t8
m6                         i1  i2  i3
m5                     i1  i2  i3
m4                 i1  i2  i3
m3             i1  i2  i3
m2         i1  i2  i3
m1     i1  i2  i3

Page 23:

Reservation Tables

Table 2 shows a more realistic schedule of HW modules required by one single instruction

In this schedule some modules are used repeatedly; for example m3 is used 3 times in a row, and m6 4 times in a row

But all these cycles are contiguous; even that is not always a realistic constraint

Instead, a HW module mi may be used at various moments during the execution of a single instruction

The schedule in Table 2 attempts to exploit the greedy approach for instruction i2: i2 is initiated as soon as possible after i1, but we see that this doesn’t help the time to completion

Page 24:

Reservation Tables

We could have let instruction i2 start at cycle t4 so no additional delay would be caused by m3

Or instruction i2 could start at t3 with just one additional delay due to m3

However, in both cases m6 would cause a delay later anyway

To schedule i2 we must consider these multi-use resources, m3 and m6, that are in use continuously

In case of a load, the actual module would wait many more cycles, until the data arrive, but would not progress until then

Page 25:

Reservation Tables

Table 2: Instructions i1 to i3 use 6 HW modules mi for 1-4 cycles
(d2: i2 delayed in that module; d: i3 delayed)

       t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18 t19
m6                                 i1  i1  i1  i1  i2  i2  i2  i2  i3  i3  i3  i3
m5                             i1          i2  d2                  i3
m4                         i1          i2                      i3
m3             i1  i1  i1  i2  i2  i2          i3  i3  i3
m2         i1  i2  d2  d2                  i3
m1     i1  i2  d   d   d   d   d   d   i3

Page 26:

Reservation Tables

Instead of using the single resources m3 and m6 repeatedly and continuously, an architect can replicate them as many times as needed in HW

This costs more hardware, and does not speed up execution of a single instruction

For a single operation, all would still have to progress in sequence

But it avoids delaying the start of a subsequent instruction that needs the same type of HW module

See Table 3: shaded areas indicate the duplicated modules

Page 27:

Reservation Tables

Table 3: Instructions i1 to i3 use replicated HW modules m3 and m6

       t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14
m6,4                                           i1  i2  i3
m6,3                                       i1  i2  i3
m6,2                                   i1  i2  i3
m6                                 i1  i2  i3
m5                             i1  i2  i3
m4                         i1  i2  i3
m3,3                   i1  i2  i3
m3,2               i1  i2  i3
m3             i1  i2  i3
m2         i1  i2  i3
m1     i1  i2  i3

Page 28:

Reservation Tables

These replicated circuits in Table 3 do not speed up the execution of any individual instruction

But by avoiding the delay for other instructions, a higher degree of parallelism is enabled, and multiple instructions can retire earlier

Even this is unrealistically simplistic

Some of the modules mi are used for more than one cycle, not necessarily in sequence

Instead, a Reservation Table offers a more realistic representation

Use the Reservation Table in Table 4 to figure out how closely the same instruction can be scheduled back-to-back

Page 29:

Collision Vector CV

The collision vector is a tool that identifies how soon 2 instructions can be scheduled one after the other:

Best case in general: the next identical instruction can be scheduled at the next cycle

Worst case in general: next instruction must be scheduled n cycles after the start of the first, if the first requires n cycles for completion

The goal for a HW designer is to find how many instructions can be scheduled in between on a regular basis

To analyze this for speed, we use the Reservation Table and Collision Vector (CV)

Page 30:

Collision Vector CV

Goal: Find CV by overlapping two identical Reservation Tables (e.g. plastic transparencies ) within the window of the cycles of one operation

If, after shifting a second, duplicate transparency with i = 1..n-1 time steps, two resource-marks of a row land on the same field, we have a collision: both instructions claim a resource at the same time!

Collision means: the second instruction cannot yet be scheduled. So mark field i in the CV with a 1

Otherwise mark field i with a 0, or leave blank

Do so n-1 times, and the CV is complete. But do check for all rows, i.e. for all HW modules mj
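The transparency-shifting procedure can be sketched directly. A reservation table is modeled as a mapping from module to the set of cycles it is busy; the placements for Table 4 below follow the conflicts described on the next slides (m2 at cycles 2-3, m4 at 5 and 7), while the exact positions of m1, m3, and m5 are inferred assumptions:

```python
# CV[i-1] == 1 means a second, identical instruction initiated i cycles
# after the first collides with it on some module.
def collision_vector(table, n_cycles):
    cv = []
    for shift in range(1, n_cycles):
        hit = any(cycles & {c + shift for c in cycles}
                  for cycles in table.values())
        cv.append(1 if hit else 0)
    return cv

# Table 4: the 7-step, 5-module instruction
table4 = {'m1': {1}, 'm2': {2, 3}, 'm3': {3, 4}, 'm4': {5, 7}, 'm5': {6}}
print(collision_vector(table4, 7))  # [1, 1, 0, 0, 0, 0], i.e. Table 6
```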

Page 31:

Collision Vector CV

Table 4: Reservation Table for 7-step, 5-Module instruction

       t1  t2  t3  t4  t5  t6  t7
m1     X
m2         X   X
m3             X   X
m4                     X       X
m5                         X

The Collision Vector has n-1 entries for an n-cycle instruction

Table 5: Find the Collision Vector for the above instruction

Page 32:

Collision Vector CV

If a second instruction of the kind shown in Table 4 were initiated one cycle after the first, resource m2 would cause a conflict

This is because instruction 2 would require m2 at cycles 3 and 4

However, instruction 1 is already using m2 at cycles 2 and 3

At step 3 there would also be a conflict

Also resource m3 would cause a conflict

The good news, however, is that this double-conflict causes no further entry in CV

Page 33:

Collision Vector CV

Similarly, a new instruction cannot be initiated 2 cycles after the start of the first

This is because instruction 2 would require m4 at cycles t7 and t9

However, instruction 1 is already using m4 at t5 and t7. At step t7 there would be a conflict

At all other steps a second instruction may be initiated. See the completed CV in Table 6 below:

Table 6: Collision Vector for above 7-cycle, 5-module instruction

1 1 0 0 0 0

Page 34:

Reservation Table: Main Example

The next example is an abstract instruction of a hypothetical microprocessor, characterized by the Reservation Table 7, 7 cycles long, using 4 HW modules m1 to m4

Will be used throughout this section

Table 7: Reservation Table for the 7-cycle, 4-module Main Example

       t1  t2  t3  t4  t5  t6  t7
m1     X       X   X
m2         X               X
m3                     X       X
m4                         X

Page 35:

Reservation Table: Main Example

The Collision Vector for the Main Example says: we can start a new instruction of the same kind at step t6 or t7

Of course, we can always start a new instruction, identical or of another type, after the current one has completed; no resource will be in use then

The challenge is to start another one while the current is still executing, to maximize parallel execution

Table 8: Collision Vector for Main Example

1 1 1 1 0 0

The goal now is to show that by adding delays we can sometimes speed up execution of pipelined ops!

That is the beauty of microprocessor architecture!

Page 36:

Main Example Pipelined

For Main Example, initiate a second, pipelined instruction Y at step t6, i.e. 5 cycles after start of X

Greedy Approach to pipeline X and Y as follows:

Observe the two-cycle overlap. This is all the speed gain we can get. Starting Y earlier (greedy approach) would create delays, and not retire the second instruction Y any earlier

Table 9: Pipelining 2 Instructions of Main Example

       t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1     X       X   X       Y       Y   Y       Z
m2         X               X   Y               Y   Z
m3                     X       X           Y       Y
m4                         X                   Y

Page 37:

Main Example Pipelined

The 3rd pipelined instruction Z can start at time step t11, by which time the first is retired and the second is half-way through

The fourth instruction can start at step t16, etc.

Table 10: Pipelining 3 Instructions of Main Example

       t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17
m1     X       X   X       Y       Y   Y       Z       Z   Z
m2         X               X   Y               Y   Z               Z
m3                     X       X           Y       Y           Z       Z
m4                         X                   Y                   Z

Page 38:

Main Example Pipelined

Though the Reservation Table for the Main Example, Table 7, is sparsely populated, no high degree of pipelining is possible

The maximum overlap is 2 cycles

Can one already infer this low degree of pipelining from the Collision Vector?

During pipelined execution there are 5 cycles per instruction retirement in the steady state, cpi = 5

That means 5 cycles per completion of an instruction, assuming the same instruction is executed over and over again, once the steady state is reached; but not during the priming phase!

We’ll come back to the Main Example and analyze it after further study of Examples 2 and 3

Page 39:

Pipeline Example 2

The Reservation Table for Example 2 has 7 entries, i.e. 7 X-es, in 24 fields over 6 steps, density = 0.29166. The Main Example has 8 entries in 28 fields, density = 0.28571

We’ll attempt to pipeline as many identical Example 2 instructions as possible

Table 12': Collision Vector for Example 2, to be figured out by students

Table 11': Reservation Table for Example 2

       t1  t2  t3  t4  t5  t6
m1     X                   X
m2         X   X
m3                 X
m4             X           X

Page 40:

Pipeline Example 2, With CV

The Reservation Table for Example 2 has 7 entries, i.e. 7 X-es, in 24 fields over 6 steps, density = 0.29166. The Main Example has 8 entries in 28 fields, density = 0.28571

We'll attempt to pipeline as many identical Example 2 instructions as possible

Table 11: Reservation Table for Example 2

       t1  t2  t3  t4  t5  t6
m1     X                   X
m2         X   X
m3                 X
m4             X           X

Table 12: Collision Vector for Example 2

1 0 1 0 1
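Whether the promised throughput is reachable can be checked with a greedy cycle-by-cycle simulation. The X placements below are an assumption: one layout with the stated per-module use counts whose collision vector comes out as 1 0 1 0 1:

```python
# Initiate an identical instruction at the earliest cycle at which no
# needed module is already claimed by an instruction in flight.
def greedy_starts(table, n_issues):
    busy = {m: set() for m in table}   # module -> absolute busy cycles
    starts, t = [], 1
    while len(starts) < n_issues:
        uses = {m: {t + c - 1 for c in cs} for m, cs in table.items()}
        if all(not (busy[m] & uses[m]) for m in table):
            for m in table:
                busy[m] |= uses[m]
            starts.append(t)
        t += 1
    return starts

# Assumed placement for Example 2 (7 X-es over 6 cycles, CV = 10101)
example2 = {'m1': {1, 6}, 'm2': {2, 3}, 'm3': {4}, 'm4': {3, 6}}
print(greedy_starts(example2, 4))  # [1, 3, 5, 7]: one start every 2 cycles, cpi = 2
```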

Page 41:

Pipeline Example 2

The Collision Vector suggests initiating a new pipelined instruction at time t3, t5, t7, etc.

That would allow 3 instructions X, Y, and Z simultaneously, overlapped, pipelined. By step t7 the first instruction would already be retired

Table 13: Schedule for Pipelining Example 2

Page 42:

Pipeline Example 2

Example 2 is lucky: it can pipeline 3 identical instructions at the same time

Caution: The CV is not a direct indicator. The reader was mildly misled to make inferences that don’t strictly follow

However, if all positions in the CV were marked 1, there would be no pipelining

For Example 2 the number of cycles per instruction retirement is an amazing cpi = 2

Even though the operation density is slightly higher than in the Main Example, the pipelining overlap in Example 2 is significantly higher, which is counter-intuitive!

On to Example 3!

Page 43:

Pipeline Example 3

It is interesting to see Example 3, analyzing its Collision Vector, to see how much we can parallelize!

The Reservation Table has numerous resource fields filled, yet the Collision Vector is sparser than the one in Example 2

Table 14: Reservation Table for Example 3

       t1  t2  t3  t4  t5  t6
m1     X       X       X
m2         X       X       X
m3     X       X       X
m4         X       X       X

Table 15: Collision Vector for Example 3

0 1 0 1 0

Page 44:

Pipeline Example 3

The Collision Vector suggests starting a new pipelined instruction 1, 3, or 5 cycles after initiation of the first

The CV of Example 3 is less densely packed with 1s than Example 2, where we could overlap 3 identical instructions and get a rate of cpi = 2

Goal now: find the best cpi rate for Example 3

Table 16: Schedule for Pipelining Example 3
(X1 and Y2 denote the second pair of instructions)

       t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1     X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1  Y2
m2         X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1
m3     X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1  Y2
m4         X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1

Page 45:

Pipeline Example 3

Example 2 earlier, with Collision Vector 10101, allows a higher degree of pipelining

Here in Example 3, cpi = 3: every 6 cycles two instructions can retire

This is in contrast to cpi = 2 of Example 2

The reason for the lower retirement rate is clear:

All 4 HW modules are used every other cycle by one of two instructions, thus one cannot overlap more than twice!

The non-pipelined cpi rate for Example 3 is cpi = 6; the pipelined rate is cpi = 3

Page 46:

Vertical Expansion for Example 3

If we need a higher degree of pipelining for Example 3, with its fill-factor of 0.5, we must pay! Vertically with more hardware, or horizontally with more time for added delays

Let's analyze a vertically expanded Reservation Table, now with 8 modules; every hardware resource m1 to m4 is replicated once; lower new density = 0.25

Table 17: Reservation Table Example 3 with Replicated HW

       t1  t2  t3  t4  t5  t6
m1     X               X
m2         X               X
m3             X
m4                 X
m1,2           X
m2,2               X
m3,2   X               X
m4,2       X               X

Page 47:

Vertical Expansion for Example 3

Let us pipeline multiple identical instructions for Reservation Table 17 as densely as possible

With twice the HW, can we overlap perhaps twice as much? The previous rate with half the hardware was cpi = 3. Ideal would be cpi = 1.5; a plausible schedule is shown in Table 18

Table 18: Schedule for Pipelining Example 3

       t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1     X   Y   Z   A   X   Y   Z   A   X   Y   Z   A
m2         X   Y   Z   A   X   Y   Z   A   X   Y   Z
m3             X   Y   Z   A                   X   Y
m4                 X   Y   Z   A                   X
m1,2           X   Y   Z   A                   X   Y
m2,2               X   Y   Z   A                   X
m3,2   X   Y   Z   A   X   Y   Z   A   X   Y   Z   A
m4,2       X   Y   Z   A   X   Y   Z   A   X   Y
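The burst schedule of Table 18 can be validated mechanically. The split below, i.e. which original module use went to which replica, is an assumption chosen to be consistent with the schedule:

```python
# Replicated reservation table for Example 3: each module holds the
# set of (1-based) cycles it is busy during one instruction.
table17 = {
    'm1':   {1, 5}, 'm2':   {2, 6}, 'm3':   {3},    'm4':   {4},
    'm1,2': {3},    'm2,2': {4},    'm3,2': {1, 5}, 'm4,2': {2, 6},
}

def schedule_is_valid(table, starts):
    # A schedule is valid if no module is claimed twice in one cycle.
    for cycles in table.values():
        claimed = [s + c - 1 for s in starts for c in cycles]
        if len(claimed) != len(set(claimed)):
            return False
    return True

# Bursts of 4 back-to-back starts, repeated every 8 cycles:
starts = [1, 2, 3, 4, 9, 10, 11, 12]
print(schedule_is_valid(table17, starts))  # True: 4 per 8 cycles, cpi = 2
```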

Page 48:

Vertical Expansion for Example 3

Initiation and retirement rates are 4 instructions per 8 cycles, cpi = 2

This is, as suspected, better than the rate of the original Example 3; not surprising with double the hardware modules

But this is not twice as good a retirement rate, despite twice the HW

Original rate was cpi = 3, the improved rate with double the hardware is cpi = 2

Page 49:

Horizontal Expansion, Main Example

Our next case, a variation of the Main Example, shows an expansion of the Reservation Table horizontally

I.e. delays are built-in; HW modules are kept constant

Only the 4 modules m1 to m4 from Main Example are provided

Motivation of the architect: if a delay can speed up execution, by all means build it in; delays are cheap!

Common sense tells us delays tend to slow things down; the real architect also looks at counter-intuitive situations

Page 50:

Horizontal Expansion, Main Example

After Examples 2 and 3, we expand the Main Example, repeated below, by adding delays, AKA Horizontal Expansion

If we insert delay cycles, clearly execution for a single instruction will slow down

However, if this yields a sufficient increase in the overall degree of pipelining, it may still be a win

Building circuits to delay an instruction is cheap

We analyze this variation next:

Page 51:

Horizontal Expansion, Main Example

Table 19: Original Reservation Table for Main Example

       t1  t2  t3  t4  t5  t6  t7
m1     X       X   X
m2         X               X
m3                     X       X
m4                         X

Inserting a Delay Cycle after t3; the delay will be the new t4

Page 52:

Horizontal Expansion, Main Example

We'll insert delays; but where? A systematic way to compute the optimum position is not shown here

Instead, we’ll suggest a sample position for a single delay and analyze the performance impact

Table 20 shows delay inserted after t3, new t4

Table 20: Reservation Table for Main Example with 1 Delay, at t4

       t1  t2  t3  t4  t5  t6  t7  t8
m1     X       X       X
m2         X                   X
m3                         X       X
m4                             X

Table 21: Collision Vector for Main Example with 1 Delay

0 1 0 1 1 0 0

Page 53:

Horizontal Expansion, Main Example

The Greedy Approach is to schedule instruction Y as soon as possible, i.e. as soon as the CV has a 0 entry

This would lead us to initiate a second instruction Y at time step t2, one cycle after instruction X. Is this optimal?

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16
m1    X   Y   X   Y   X   Y       Z   A   Z   A   Z   A
m2        X   Y               X   Y   Z   A               Z   A
m3                        X   Y   X   Y               Z   A   Z   A
m4                                X   Y                       Z   A

Table 22: Schedule for Main Example Pipelined Instructions X, Y, Z, A,
With Delay Slot, Using Greedy Approach

Page 54

Horizontal Expansion, Main Example

Initiation and retirement rates are 2 instructions every 7 cycles, or cpi = 3.5; see the purple header at each retired instruction in Table 22

This is already better than cpi = 5 for the original Main Example without the delay

Hence we have shown that adding delays can speed up throughput of pipelined instructions

But can we do better?

After all, we have only tried out the first sample of a Greedy Approach!

Page 55

Horizontal Expansion, Main Example

In this experiment we start the second instruction at cycle t4, three cycles after the start of the first

Which cpi shall we get? See Table 23

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18
m1    X       X   Y   X   Y   Z   Y   Z   X   Z   X   Y   X   Y   Z   Y   Z
m2        X           Y       X   Z       Y   X       Z   Y       X   Z
m3                        X       X   Y       Y   Z       Z   X       X   Y
m4                                X           Y           Z           X

Table 23: Another Schedule for Pipelined Main Example, with Delay;
Initiation Later than First Opportunity in Table 22, Resulting in Better Throughput

Message: Starting later renders Execution Faster

Page 56

Horizontal Expansion, Main Example

In the more patient schedule of Table 23 we complete one identical instruction every 3 cycles in the steady state

Purple cells indicate instruction retirement: X retires at completion t8, Y after t11, and Z after t14; then X again after t17

Now cpi = 3 with the Not-So-Greedy Approach

Key learning: To speed up pipelined execution, one can sometimes enhance throughput by adding delay circuits, or by replicating hardware, or by postponing instruction initiation, or a combination of the above

The greedy approach is not necessarily optimal

The collision vector only states when one cannot initiate a new instruction (value 1); a 0 value is not a solid hint for initiating a new instruction
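The two cpi figures can be checked with a small brute-force simulator: issue a stream of identical instructions under a repeating latency pattern and verify that no module is ever claimed twice in the same cycle. The module-usage offsets below are an assumed reconstruction consistent with the schedules in Tables 22 and 23 (the source tables show only the X counts per module):

```python
# Assumed module-usage offsets from the issue cycle (offset 0 = t1)
# for the Main Example with one delay; a reconstruction, since only
# per-module X counts appear in the flattened source tables.
usage = {"m1": [0, 2, 4], "m2": [1, 6], "m3": [5, 7], "m4": [7]}

def conflict_free(latency_pattern, n=40):
    """Issue n instructions, spacing them by the repeating latency
    pattern; return False if any module is claimed twice in a cycle."""
    t, issues = 0, []
    for i in range(n):
        issues.append(t)
        t += latency_pattern[i % len(latency_pattern)]
    for offsets in usage.values():
        busy = set()
        for start in issues:
            for off in offsets:
                if start + off in busy:
                    return False          # two instructions hit one module
                busy.add(start + off)
    return True

greedy = [1, 6]    # issue at t1, t2, t8, t9, ...: cpi = 7/2 = 3.5
patient = [3]      # issue every 3 cycles: cpi = 3
assert conflict_free(greedy) and conflict_free(patient)
assert not conflict_free([1])  # back-to-back issue every cycle collides
print(sum(greedy) / len(greedy), sum(patient) / len(patient))  # 3.5 3.0
```

Both patterns are conflict-free, so the steadier latency-3 schedule really does beat the greedy one, confirming the cpi = 3 result above.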

Page 57

IBM Measurements
Agarwala and Cocke 1987; see [1]

Memory Bandwidth:
1 word/cycle to fetch 1 instruction/cycle from I-cache
40% of instructions are memory accesses (load/store)
Those could all benefit from access to D-cache

Code Characteristics, dynamic:
25% of all instructions: loads
15% of all instructions: stores
40% of all instructions: ALU/RR
20% of all instructions: branches
1/3 unconditional
1/3 conditional taken
1/3 conditional not taken

Page 58

How Can Pipelining Work?

About 1 out of 4 or 5 instructions will be a branch

Branches include all transfer-of-control instructions; these are: call, return, unconditional and conditional branch, abort, exception and similar machine instructions

If a processor pipeline is deeper than, say, 5 stages, there will almost always be a branch in the pipeline, rendering several prefetched operations useless

Some processors (e.g. Intel Willamette, [6]) have over 20 stages. For such a processor, regular pipelining would practically always cause a stall!

Page 59

How Can Pipelining Work?

The remedy is branch prediction

If the processor knows dynamically from which address to fetch, instead of blindly assuming the subsequent code address pc+1, this would prevent the pipeline flushes

Luckily, branch prediction in the 2010s has become at or above 97% accurate, only rarely causing the need to re-prime the pipe

Also, processors are no longer designed with a pipeline as deep as the Willamette's 20+ stages

Here we see interesting interactions of several computer architecture principles: pipelining and branch prediction, one helping the other to become exceedingly advantageous
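The payoff of accurate prediction can be quantified with a back-of-the-envelope model. The 20% branch fraction comes from the IBM mix above and the 97% accuracy from this slide; the 15-cycle flush penalty is an assumed figure for a deep pipeline, not from the source:

```python
def effective_cpi(base_cpi, branch_frac, accuracy, flush_penalty):
    """Average cycles per instruction when each mispredicted branch
    flushes the pipe and costs flush_penalty extra cycles."""
    return base_cpi + branch_frac * (1 - accuracy) * flush_penalty

# 20% branches (IBM mix above), 97% prediction accuracy,
# assumed 15-cycle flush penalty for a ~20-stage pipe:
print(effective_cpi(1.0, 0.20, 0.97, 15))  # ≈ 1.09
```

So at 97% accuracy the ideal cpi of 1 degrades only to about 1.09; at 50% accuracy the same model yields 2.5, which is why deep pipelines are unworkable without good prediction.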

Page 60

Appendix: Some Definitions

Page 61

Definitions
Basic Block

Sequence of instructions (one or more) with a single entry point and a single exit point; entry and exit w.r.t. transfer of control

Entry point may be the destination of a branch, a fall-through from a conditional branch, or the program entry point, i.e. the destination of an OS jump

Exit point may be an unconditional branch instruction, a call, a return, or a fall-through

Fall-through means: one instruction is a conditional flow-of-control change, and the subsequent instruction is executed by default, if the change in control flow does not take place

Or fall-through can mean: the successor of the exit point is a branch or call target

Page 62

Definitions

Collision Vector

Observation: An instruction requiring n cycles to completion may be initiated a second time n cycles after the first without possibility of conflict

For each of the n-1 cycles before that, a further instruction of identical type causes a resource conflict, if initiated

The Boolean vector of length n-1 that represents this fact –stating whether or not re-issue is possible– is referred to as collision vector

It can be derived from the Reservation Table

Page 63

Definitions

Cycles Per Instruction: cpi

cpi quantifies how long it takes for a single instruction to execute

Generally, the number of execution cycles per instruction cpi > 1 on a CISC architecture

However, on a pipelined UP architecture, where a new instruction can be initiated at each cycle, it is conceivable to reach a cpi rate of 1; assuming no hazards

Note the different meanings of cycle!

On a UP pipelined architecture the cpi rate cannot shrink below one

Yet on an MP or superscalar architecture, the cpi rate may be < 1

Page 64

Definitions

Dependence

If the logic of the underlying program imposes an order between two instructions, there exists dependence (data or other dependence) between them

Generally, the order of execution cannot be permuted

It is conventional in CS to call this dependence, not dependency

Page 65

Definitions

Early Pipelined Computers/Processors:

1. CDC 6000 Series of the late 1960s

2. CDC Cyber series of the 1970s

3. IBM 360/91 series

4. Intel® Pentium® IV or XeonTM processor families

Page 66

Definitions

Flushing

When a hazard occurs due to a change in flow of control, the partially executed instructions after the hazard are discarded

This discarding is called flushing

Antonym: priming

Flushing is not needed in case of a stall caused by dependences; waiting instead will resolve this

Page 67

Definitions

Hazard

Instruction i+1 is pre-fetched under the assumption it would be executed after instruction i

Yet after decoding i it becomes clear that operation i is a control-transfer operation

Hence subsequently pre-fetched instructions i+1… and on are wasted

This is called a hazard

A hazard causes part of the pipeline to be flushed, while a stall (caused by data dependence) also causes a delay, but a simple wait will resolve such a stall conflict

Page 68

Definitions

ILP

Instruction Level Parallelism: Architectural attribute, allowing multiple instructions to be executed at the same time

Related: Superscalar

Page 69

Definitions

Interlock

If HW detects a conflict during execution of instructions i and j and i was initiated earlier, such a conflict, called a stall, delays execution of some j and perhaps subsequent instructions

Interlock is the architecture’s way to respond to and resolve a stall at the expense of degraded performance

Advantage: computation of correct result!

Synonym: stall or wait

Page 70

Definitions

IPC

Instructions per cycle: A measure for Instruction Level Parallelism. How many different instructions are being executed (not necessarily to completion) during one single cycle?

Desired to have an IPC rate > 1

Ideally, given suitable parallelism, IPC >> 1

On conventional, non-pipelined UP CISC architectures it is typical to have IPC << 1

Page 71

Definitions

Pipelining

Mode of execution, in which one instruction is initiated every cycle and ideally one retires every cycle, even though each requires multiple (possibly many) cycles to complete

Highly pipelined Xeon processors, for example, have a greater than 20-stage pipeline

Page 72

Definitions

Prefetch (Instruction Prefetch)

Bringing an instruction to the execution engine before it is reached by the instruction pointer (ip) is called instruction prefetch

Generally this is done, because some other knowledge exists proving that the instruction will likely be executed soon

It is possible to have a branch in between, in which case the prefetch may have been wasted

Page 73

Definitions

Priming

Filling the various modules of a pipelined processor (the stages) with different instructions to the point of retirement of the first instruction is called priming

Antonym: flushing

Page 74

Definitions

Register Definition

If an arithmetic or logical operation places the result into register ri we say that ri is being defined

Synonym: Writing a register

Antonym: Register use

Page 75

Definitions

Reservation Table

Table that shows which hardware resource i (AKA module mi) is being used at which cycle in a multi-cycle instruction

Typically, an X written in the Reservation Table Matrix indicates use

Empty field indicates the corresponding resource is free during that cycle

Page 76

Definitions

Retire

When all parts of an instruction have successfully migrated through all execution stages, that instruction is complete

Hence it can be discarded; this is called being retired

All results have been posted

Page 77

Definitions

Stall

If instruction i requires operand o that is being computed by another instruction j, and j is not complete when i needs o, there exists dependence between the two instructions i and j; the wait thus created is called a stall

A stall prevents the two instructions from being executed simultaneously, since the instruction at step i must wait for the other to complete. See also: hazard, interlock

Stall can also be caused by HW resource conflict: Some earlier instruction i may use HW resource m, while another instruction j needs m

Generally j has to wait until i frees m, causing a stall for j

Page 78

Conclusion

Page 79

Summary

Pipelining can speed up execution

Yet the speedup is not due to the faster clock rate

That fast clock rate is for significantly simpler sub-instructions and cannot be equated with the original, i.e. non-pipelined clock

Pipelining may even benefit from inserting delays

May also benefit from initiating an instruction later than the earliest possible cycle

And benefits, not surprisingly, from added HW resources

Branch prediction is a necessary architecture attribute to make pipelining work

Page 80

Bibliography

1. Cocke and Schwartz, Programming Languages and their Compilers, unpublished, 1969, http://portal.acm.org/citation.cfm?id=1097042

2. Harold Stone, High Performance Computer Architecture, Addison-Wesley, 1993

3. cpi rate: http://en.wikipedia.org/wiki/Cycles_per_instruction

4. Introduction to PCI: http://electrofriends.com/articles/computer-science/protocol/introduction-to-pci-protocol/

5. Wiki PCI page: http://en.wikipedia.org/wiki/Conventional_PCI

6. http://en.wikipedia.org/wiki/NetBurst_(microarchitecture)