Introduction - University of New Mexico (ece-research.unm.edu/pollard/classes/538/CAQA5e_ch3.pdf)


The University of Adelaide, School of Computer Science 31 October 2014

Chapter 2 — Instructions: Language of the Computer 1

Copyright © 2012, Elsevier Inc. All rights reserved.

Chapter 3

Instruction-Level Parallelism and Its Exploitation

Computer Architecture: A Quantitative Approach, Fifth Edition

Introduction

• Pipelining became a universal technique in 1985
  • Overlaps execution of instructions
  • Exploits “Instruction-Level Parallelism”
• Beyond this, there are two main approaches:
  • Hardware-based dynamic approaches
    • Used in server and desktop processors
    • Not used as extensively in PMD (personal mobile device) processors
  • Compiler-based static approaches
    • Not as successful outside of scientific applications

Instruction-Level Parallelism

• When exploiting instruction-level parallelism, the goal is to minimize CPI
  • Pipeline CPI =
    • Ideal pipeline CPI +
    • Structural stalls +
    • Data hazard stalls +
    • Control stalls
• Parallelism within a basic block is limited
  • Typical size of a basic block = 3-6 instructions
  • Must optimize across branches
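As a small worked illustration of the CPI breakdown on this slide, the sketch below adds the three stall components to an ideal CPI. The function name `pipeline_cpi` and all of the stall figures are made up for illustration; none of them come from the slides.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical per-instruction stall averages, chosen only for illustration.
cpi = pipeline_cpi(ideal_cpi=1.0, structural_stalls=0.05,
                   data_hazard_stalls=0.30, control_stalls=0.15)
print(f"Pipeline CPI = {cpi:.2f}")
```

Every technique in the rest of the chapter can be read as an attempt to shrink one of the three stall terms.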


Data Dependence

• Loop-Level Parallelism
  • Unroll loop statically or dynamically
  • Use SIMD (vector processors and GPUs)
• Challenges:
  • Data dependency
    • Instruction j is data dependent on instruction i if either:
      • Instruction i produces a result that may be used by instruction j, or
      • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
  • Dependent instructions cannot be executed simultaneously

Data Dependence

• Dependencies are a property of programs
• Pipeline organization determines if a dependence is detected and if it causes a stall
• Data dependence conveys:
  • Possibility of a hazard
  • Order in which results must be calculated
  • Upper bound on exploitable instruction-level parallelism
• Dependencies that flow through memory locations are difficult to detect

Data Dependences in Floating, Integer Stuff

Loop:   L.D     F0,0(R1)      ;F0=array element
        ADD.D   F4,F0,F2      ;add scalar in F2
        S.D     F4,0(R1)      ;store result
        DADDUI  R1,R1,#-8     ;decrement pointer 8 bytes
        BNE     R1,R2,Loop    ;branch R1!=R2

Floating-point data dependences: L.D to ADD.D (through F0), ADD.D to S.D (through F4)
Integer data dependence: DADDUI to BNE (through R1)

Name Dependence

• Two instructions use the same name but there is no flow of information
  • Not a true data dependence, but is a problem when reordering instructions
  • Antidependence: instruction j writes a register or memory location that instruction i reads
    • Initial ordering (i before j) must be preserved
  • Output dependence: instruction i and instruction j write the same register or memory location
    • Ordering must be preserved
• To resolve, use renaming techniques

Data Hazards

A hazard exists whenever there is a name or data dependence between instructions, and they are close enough that the overlap during execution would change the order of access to the operand involved in the dependence. Because of the dependence, we must preserve what is called program order, that is, the order that the instructions would execute in if executed sequentially one at a time as determined by the original source program. The goal of both our software and hardware techniques is to exploit parallelism by preserving program order only where it affects the outcome of the program. Detecting and avoiding hazards ensures that necessary program order is preserved.

Other Factors

• Data Hazards
  • Read after write (RAW)
  • Write after write (WAW)
  • Write after read (WAR)
• Control Dependence
  • Ordering of instruction i with respect to a branch instruction
    • An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
    • An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch

Data Hazards

RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value.

WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination.

WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value. This hazard arises from an antidependence. WAR hazards cannot occur in most static issue pipelines (even deeper pipelines or floating-point pipelines) because all reads are early (in ID) and all writes are late (in WB).
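The three hazard definitions can be checked mechanically from the registers each instruction reads and writes. The following is a minimal sketch, assuming a simplified instruction record (`Instr`, with one destination and a tuple of sources) that is invented here for illustration; it ignores memory operands and pipeline timing, so it reports only which hazards are possible, not whether a given pipeline would actually stall.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination register, a tuple of sources.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def hazards(i, j):
    """Return the hazard types that could arise if j (later in program
    order) were allowed to overlap with or pass i (earlier)."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")   # j reads what i writes: true data dependence
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")   # both write the same name: output dependence
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")   # j writes what i reads: antidependence
    return found


add = Instr("ADD.D", "F6", ("F0", "F8"))
sub = Instr("SUB.D", "F8", ("F10", "F14"))
mul = Instr("MUL.D", "F6", ("F10", "F8"))

print(hazards(add, sub))   # antidependence on F8
print(hazards(add, mul))   # output dependence on F6
print(hazards(sub, mul))   # true dependence on F8
```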


Control Dependences

A control dependence determines the ordering of an instruction, i, with respect to a branch instruction so that the instruction i is executed in correct program order and only when it should be.

if p1 {
    S1;
};
if p2 {
    S2;
}

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. In general, there are two constraints imposed by control dependences:

1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then portion of an if statement and move it before the if statement.

2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch. For example, we cannot take a statement before the if statement and move it into the then portion.

When processors preserve strict program order, they ensure that control dependences are also preserved. We may be willing to execute instructions that should not have been executed, however, thereby violating the control dependences, if we can do so without affecting the correctness of the program. Control dependence is not the critical property that must be preserved. Instead, the two properties critical to program correctness (and normally preserved by maintaining both data and control dependence) are the exception behavior and the data flow.

Examples

• Example 1:
        DADDU   R1,R2,R3
        BEQZ    R4,L
        DSUBU   R1,R1,R6
  L:    …
        OR      R7,R1,R8
  • OR instruction dependent on DADDU and DSUBU
• Example 2:
        DADDU   R1,R2,R3
        BEQZ    R12,skip
        DSUBU   R4,R5,R6
        DADDU   R5,R4,R9
  skip:
        OR      R7,R8,R9
  • Assume R4 isn’t used after skip
    • Possible to move DSUBU before the branch

Compiler Techniques for Exposing ILP

• Pipeline scheduling
  • Separate dependent instruction from the source instruction by the pipeline latency of the source instruction
• Example:
    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;

Pipeline Stalls

Loop:   L.D     F0,0(R1)
        stall
        ADD.D   F4,F0,F2
        stall
        stall
        S.D     F4,0(R1)
        DADDUI  R1,R1,#-8
        stall                 (assume integer load latency is 1)
        BNE     R1,R2,Loop

Pipeline Scheduling

Scheduled code:

Loop:   L.D     F0,0(R1)
        DADDUI  R1,R1,#-8
        ADD.D   F4,F0,F2
        stall
        stall
        S.D     F4,8(R1)
        BNE     R1,R2,Loop
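The stall counts in the two listings can be reproduced with a short in-order issue model. This is a sketch under stated assumptions: the `LATENCY` table (extra cycles between a producer and a dependent consumer) is not given on the slides and was chosen to be consistent with the stalls shown, and the tuple encoding of instructions and the name `issue_cycles` are invented for illustration.

```python
# Assumed producer-to-consumer latencies in cycles; pairs not listed are 0.
LATENCY = {
    ("load", "fp_alu"): 1,
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("int_alu", "branch"): 1,
}


def issue_cycles(program):
    """program: list of (kind, dest, srcs). Returns the issue cycle of each
    instruction for a single-issue, in-order pipeline (one loop iteration)."""
    cycles = []
    producer = {}   # register -> index of the most recent writer
    for idx, (kind, dest, srcs) in enumerate(program):
        cycle = cycles[-1] + 1 if cycles else 1
        for reg in srcs:
            if reg in producer:
                p = producer[reg]
                ready = cycles[p] + 1 + LATENCY.get((program[p][0], kind), 0)
                cycle = max(cycle, ready)   # wait until the operand is usable
        cycles.append(cycle)
        if dest is not None:
            producer[dest] = idx
    return cycles


unscheduled = [
    ("load",    "F0", ("R1",)),        # L.D    F0,0(R1)
    ("fp_alu",  "F4", ("F0", "F2")),   # ADD.D  F4,F0,F2
    ("store",   None, ("F4", "R1")),   # S.D    F4,0(R1)
    ("int_alu", "R1", ("R1",)),        # DADDUI R1,R1,#-8
    ("branch",  None, ("R1", "R2")),   # BNE    R1,R2,Loop
]
scheduled = [
    ("load",    "F0", ("R1",)),        # L.D    F0,0(R1)
    ("int_alu", "R1", ("R1",)),        # DADDUI R1,R1,#-8
    ("fp_alu",  "F4", ("F0", "F2")),   # ADD.D  F4,F0,F2
    ("store",   None, ("F4", "R1")),   # S.D    F4,8(R1)
    ("branch",  None, ("R1", "R2")),   # BNE    R1,R2,Loop
]
print(issue_cycles(unscheduled)[-1])   # 9 cycles per iteration
print(issue_cycles(scheduled)[-1])     # 7 cycles per iteration
```

Moving DADDUI up hides both the load-use stall and the stall before the branch; the two stalls between ADD.D and S.D remain because nothing independent is left to fill them.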


Loop Unrolling

• Loop unrolling
  • Unroll by a factor of 4 (assume # elements is divisible by 4)
  • Eliminate unnecessary instructions

Loop:   L.D     F0,0(R1)
        ADD.D   F4,F0,F2
        S.D     F4,0(R1)      ;drop DADDUI & BNE
        L.D     F6,-8(R1)
        ADD.D   F8,F6,F2
        S.D     F8,-8(R1)     ;drop DADDUI & BNE
        L.D     F10,-16(R1)
        ADD.D   F12,F10,F2
        S.D     F12,-16(R1)   ;drop DADDUI & BNE
        L.D     F14,-24(R1)
        ADD.D   F16,F14,F2
        S.D     F16,-24(R1)
        DADDUI  R1,R1,#-32
        BNE     R1,R2,Loop

• Note: number of live registers vs. original loop

Loop Unrolling/Pipeline Scheduling

• Pipeline schedule the unrolled loop:

Loop:   L.D     F0,0(R1)
        L.D     F6,-8(R1)
        L.D     F10,-16(R1)
        L.D     F14,-24(R1)
        ADD.D   F4,F0,F2
        ADD.D   F8,F6,F2
        ADD.D   F12,F10,F2
        ADD.D   F16,F14,F2
        S.D     F4,0(R1)
        S.D     F8,-8(R1)
        DADDUI  R1,R1,#-32
        S.D     F12,16(R1)
        S.D     F16,8(R1)
        BNE     R1,R2,Loop

Summary of the Loop Unrolling and Scheduling

• Determine that unrolling the loop would be useful by finding that the iterations are independent
• Use different registers to avoid unnecessary constraints
• Eliminate extra test and branch instructions
• Determine that loads and stores can be interchanged by checking target addresses
• Schedule the code, preserving dependencies

Strip Mining

• Unknown number of loop iterations?
  • Number of iterations = n
  • Goal: make k copies of the loop body
  • Generate pair of loops:
    • First executes n mod k times
    • Second executes n / k times
    • “Strip mining”
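The pair of loops described on this slide can be sketched as follows for the running example x[i] = x[i] + s. The function name `add_scalar_strip_mined` is invented for illustration, and the inner `for offset` loop stands in for the k straight-line copies of the body that a compiler would actually emit.

```python
def add_scalar_strip_mined(x, s, k=4):
    """Add s to every element of x using a cleanup loop of n mod k
    iterations followed by a main loop of n // k iterations whose body
    handles k elements."""
    n = len(x)
    i = 0
    # First loop: n mod k single-element iterations.
    for _ in range(n % k):
        x[i] += s
        i += 1
    # Second loop: n // k iterations, each standing in for k unrolled copies.
    for _ in range(n // k):
        for offset in range(k):   # a compiler would emit k straight-line copies
            x[i + offset] += s
        i += k
    return x


print(add_scalar_strip_mined([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 100, k=4))
```

With n = 10 and k = 4, the first loop runs twice and the second loop runs twice, covering all ten elements.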


Branch Prediction

• Basic 2-bit predictor:
  • For each branch:
    • Predict taken or not taken
    • If the prediction is wrong two consecutive times, change prediction
• Correlating predictor:
  • Multiple 2-bit predictors for each branch
  • One for each possible combination of outcomes of preceding n branches
• Local predictor:
  • Multiple 2-bit predictors for each branch
  • One for each possible combination of outcomes for the last n occurrences of this branch
• Tournament predictor:
  • Combine correlating predictor with local predictor
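A basic 2-bit predictor is commonly built as a saturating counter, and the sketch below assumes that encoding (states 0 and 1 predict not taken, states 2 and 3 predict taken). The class and function names are invented for illustration, and a real predictor would keep one such counter per table entry rather than a single object.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2   # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch taken 9 times then not taken once, with the loop run twice.
outcomes = ([True] * 9 + [False]) * 2
print(count_mispredictions(outcomes, TwoBitPredictor(state=3)))   # 2
```

Starting from the strongly-taken state, the only misses are the two loop exits: a single not-taken outcome weakens the counter without flipping the prediction, which is the point of using two bits instead of one.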


Branch Prediction Performance

[Figure: Branch predictor performance]

Dynamic Scheduling

• Rearrange order of instructions to reduce stalls while maintaining data flow
• Advantages:
  • Compiler doesn’t need to have knowledge of microarchitecture
  • Handles cases where dependencies are unknown at compile time
• Disadvantages:
  • Substantial increase in hardware complexity
  • Complicates exceptions

Dynamic Scheduling

• Dynamic scheduling implies:
  • Out-of-order execution
  • Out-of-order completion
• Creates the possibility for WAR and WAW hazards
• Tomasulo’s Approach
  • Tracks when operands are available
  • Introduces register renaming in hardware
    • Minimizes WAW and WAR hazards

Register Renaming

• Example:

        DIV.D   F0,F2,F4
        ADD.D   F6,F0,F8
        S.D     F6,0(R1)
        SUB.D   F8,F10,F14
        MUL.D   F6,F10,F8

  • Antidependence: ADD.D reads F8, which SUB.D later writes
  • Antidependence: S.D reads F6, which MUL.D later writes
  • Plus a name (output) dependence on F6 between ADD.D and MUL.D

Register Renaming

• Example:

        DIV.D   F0,F2,F4
        ADD.D   S,F0,F8
        S.D     S,0(R1)
        SUB.D   T,F10,F14
        MUL.D   F6,F10,T

• Now only RAW hazards remain, which can be strictly ordered
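The renaming step can be sketched as a single pass over the instruction list. This is an illustrative model only: the `rename` function and the temporaries T0, T1, ... are invented here, and unlike the slide (which renames only the two conflicting writes to S and T) the sketch gives every register write a fresh name.

```python
import itertools


def rename(program):
    """program: list of (op, dest, srcs). Give every register write a fresh
    name and rewrite later reads to use it, so only RAW dependences remain."""
    fresh = (f"T{n}" for n in itertools.count())
    current = {}   # architectural register -> latest temporary holding it
    renamed = []
    for op, dest, srcs in program:
        # Sources are looked up before the destination is renamed.
        new_srcs = tuple(current.get(r, r) for r in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
for instr in rename(program):
    print(instr)
```

After the pass, SUB.D writes T2 instead of F8 and MUL.D writes T3 instead of F6, so neither can disturb the values that ADD.D and S.D still need to read.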


Register Renaming

• Register renaming is provided by reservation stations (RS)
  • Contains:
    • The instruction
    • Buffered operand values (when available)
    • Reservation station number of instruction providing the operand values
  • RS fetches and buffers an operand as soon as it becomes available (not necessarily involving register file)
  • Pending instructions designate the RS to which they will send their output
    • Result values broadcast on a result bus, called the common data bus (CDB)
  • Only the last output updates the register file
  • As instructions are issued, the register specifiers are renamed with the reservation station
  • May be more reservation stations than registers

In Tomasulo’s scheme, register renaming is provided by reservation stations, which buffer the operands of instructions waiting to issue. The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. In addition, pending instructions designate the reservation station that will provide their input. Finally, when successive writes to a register overlap in execution, only the last one is actually used to update the register. As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station, which provides register renaming.

The use of reservation stations, rather than a centralized register file, leads to two other important properties. First, hazard detection and execution control are distributed: The information held in the reservation stations at each functional unit determines when an instruction can begin execution at that unit. Second, results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This bypassing is done with a common result bus that allows all units waiting for an operand to be loaded simultaneously (on the 360/91 this is called the common data bus, or CDB). In pipelines with multiple execution units and issuing multiple instructions per clock, more than one result bus will be needed.

Tomasulo’s Algorithm

• Load and store buffers
  • Contain data and addresses, act like reservation stations
• Top-level design: [figure not recovered]

Tomasulo’s Algorithm

• Three Steps:
  • Issue
    • Get next instruction from FIFO queue
    • If an RS is available, issue the instruction to the RS with operand values if available
    • If no RS is available, stall the instruction; if operand values are not available, record the RS that will produce them
  • Execute
    • When an operand becomes available, store it in any reservation stations waiting for it
    • When all operands are ready, begin execution of the instruction
    • Loads and stores are maintained in program order through effective address
    • No instruction is allowed to initiate execution until all branches that precede it in program order have completed
  • Write result
    • Write result on CDB into reservation stations and store buffers
    • (Stores must wait until address and value are received)

Issue: Get the next instruction from the head of the instruction queue, which is maintained in FIFO order to ensure the maintenance of correct data flow. If there is a matching reservation station that is empty, issue the instruction to the station with the operand values, if they are currently in the registers. If there is not an empty reservation station, then there is a structural hazard and the instruction stalls until a station or buffer is freed. If the operands are not in the registers, keep track of the functional units that will produce the operands.

Execute: If one or more of the operands is not yet available, monitor the common data bus while waiting for it to be computed. When an operand becomes available, it is placed into any reservation station awaiting it. When all the operands are available, the operation can be executed at the corresponding functional unit. By delaying instruction execution until the operands are available, RAW hazards are avoided. (Some dynamically scheduled processors call this step “issue,” but we use the name “execute,” which was used in the first dynamically scheduled processor, the CDC 6600.)

Write result: When the result is available, write it on the CDB and from there into the registers and into any reservation stations (including store buffers) waiting for this result. Stores are buffered in the store buffer until both the value to be stored and the store address are available, then the result is written as soon as the memory unit is free.

Reservation Station Fields

(borrowed from CDC scoreboard)

Op – Operation to perform on source operands S1 and S2
Qj, Qk – Reservation stations that will produce the corresponding source operand
Vj, Vk – Values of the source operands. Note that only one of the V field or the Q field is valid for each operand
A – Used to hold information for the memory address calculation for a load or store
Busy – Indicates that this reservation station and its accompanying functional unit are occupied
Qi – (Register file) Number of the reservation station whose result should be stored into this register
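The fields above map naturally onto a small record type, and a CDB write becomes a loop that hands the value to every station and register waiting on the producing station's tag. The sketch below is a simplified model, not the full algorithm: the names `ReservationStation` and `broadcast`, the station names, and the sample values are all assumptions made for illustration, and the A field is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first operand, once known
    vk: Optional[float] = None   # value of second operand, once known
    qj: Optional[str] = None     # station that will produce the first operand
    qk: Optional[str] = None     # station that will produce the second operand

    def ready(self):
        # An instruction may execute only when neither operand is pending.
        return self.busy and self.qj is None and self.qk is None


def broadcast(stations, register_status, registers, tag, value):
    """Model a CDB write: every waiting station and register captures value."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, qi in list(register_status.items()):
        if qi == tag:
            registers[reg] = value
            del register_status[reg]


# Hypothetical state: Mult1 waits on Load2 for its first operand.
mult1 = ReservationStation("Mult1", busy=True, op="MUL.D", vk=2.5, qj="Load2")
register_status = {"F2": "Load2", "F0": "Mult1"}   # the Qi field per register
registers = {"F4": 2.5}
broadcast([mult1], register_status, registers, tag="Load2", value=4.0)
print(mult1.ready(), mult1.vj, registers["F2"])
```

Because each operand is tracked by either a V field or a Q field, clearing the Q tag on broadcast is what marks the operand as available.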


1. L.D F6,32(R2)

2. L.D F2,44(R3)

3. MUL.D F0,F2,F4

4. SUB.D F8,F2,F6

5. DIV.D F10,F0,F6

6. ADD.D F6,F8,F2

Example

[Figure not recovered]

Hardware-Based Speculation

• Execute instructions along predicted execution paths but only commit the results if the prediction was correct
• Instruction commit: allowing an instruction to update the register file when the instruction is no longer speculative
• Need an additional piece of hardware to prevent any irrevocable action until an instruction commits
  • I.e., updating state or taking an exception

Reorder Buffer

• Reorder buffer – holds the result of an instruction between completion and commit
• Four fields:
  • Instruction type: branch/store/register
  • Destination field: register number
  • Value field: output value
  • Ready field: completed execution?
• Modify reservation stations:
  • Operand source is now reorder buffer instead of functional unit

Reorder Buffer

• Register values and memory values are not written until an instruction commits
• On misprediction:
  • Speculated entries in ROB are cleared
• Exceptions:
  • Not recognized until it is ready to commit
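In-order commit from a reorder buffer can be sketched with a queue whose head is the oldest instruction. This is a minimal model under stated assumptions: the `ROBEntry` fields follow the four fields listed for the reorder buffer plus a hypothetical `mispredicted` flag, the `commit` function is invented for illustration, and exceptions and refetching from the correct path are left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    kind: str                      # "register", "store", or "branch"
    dest: Optional[str] = None     # register name (or address for a store)
    value: Optional[float] = None
    ready: bool = False            # has the instruction finished executing?
    mispredicted: bool = False     # only meaningful for branches


def commit(rob, registers, memory):
    """Retire entries from the head of the ROB, in program order, until an
    unfinished instruction or a mispredicted branch is reached."""
    while rob and rob[0].ready:
        entry = rob.popleft()
        if entry.kind == "register":
            registers[entry.dest] = entry.value
        elif entry.kind == "store":
            memory[entry.dest] = entry.value
        elif entry.kind == "branch" and entry.mispredicted:
            rob.clear()            # discard everything younger than the branch
            break


registers, memory = {}, {}
rob = deque([
    ROBEntry("register", "F0", 1.5, ready=True),
    ROBEntry("branch", ready=True, mispredicted=True),
    ROBEntry("register", "F6", 9.0, ready=True),   # speculative, wrong path
])
commit(rob, registers, memory)
print(registers, len(rob))
```

The write to F6 finished executing but never reaches the register file, because architectural state is only updated at the head of the buffer.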


Multiple Issue and Static Scheduling

• To achieve CPI < 1, need to complete multiple instructions per clock
• Solutions:
  • Statically scheduled superscalar processors
  • VLIW (very long instruction word) processors
  • Dynamically scheduled superscalar processors

Multiple Issue

[Figure not recovered]

Explicitly parallel instruction computing (EPIC) is a term coined in 1997 by the HP–Intel alliance[1] to describe a computing paradigm that researchers had been investigating since the early 1980s.[2] This paradigm is also called Independence architectures. It was the basis for Intel and HP development of the Intel Itanium architecture,[3] and HP later asserted that "EPIC" was merely an old term for the Itanium architecture.[4] EPIC permits microprocessors to execute software instructions in parallel by using the compiler, rather than complex on-die circuitry, to control parallel instruction execution. This was intended to allow simple performance scaling without resorting to higher clock frequencies.

VLIW Processors

• Package multiple operations into one instruction
• Example VLIW processor:
  • One integer instruction (or branch)
  • Two independent floating-point operations
  • Two independent memory references
• Must be enough parallelism in code to fill the available slots

VLIW Processors

• Disadvantages:
  • Statically finding parallelism
  • Code size
  • No hazard detection hardware
  • Binary code compatibility

Dynamic Scheduling, Multiple Issue, and Speculation

• Modern microarchitectures:
  • Dynamic scheduling + multiple issue + speculation
• Two approaches:
  • Assign reservation stations and update pipeline control table in half clock cycles
    • Only supports 2 instructions/clock
  • Design logic to handle any possible dependencies between the instructions
  • Hybrid approaches
• Issue logic can become bottleneck

Overview of Design

[Figure not recovered]

Multiple Issue

• Limit the number of instructions of a given class that can be issued in a “bundle”
  • I.e., one FP, one integer, one load, one store
• Examine all the dependencies among the instructions in the bundle
• If dependencies exist in bundle, encode them in reservation stations
• Also need multiple completion/commit

Example

Loop:   LD      R2,0(R1)      ;R2=array element
        DADDIU  R2,R2,#1      ;increment R2
        SD      R2,0(R1)      ;store result
        DADDIU  R1,R1,#8      ;increment pointer
        BNE     R2,R3,Loop    ;branch if not last element

Dyn

am

ic S

ch

ed

ulin

g, M

ultip

le Is

su

e, a

nd

Sp

ecu

latio

n

Example (No Speculation)


Example

Adv. Techniques for Instruction Delivery and Speculation

Branch-Target Buffer

• Need high instruction bandwidth!
  • Branch-target buffers
    • Next-PC prediction buffer, indexed by current PC
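A branch-target buffer can be modeled as a small table that maps the current PC to a predicted next PC. The sketch below assumes a direct-mapped organization with full-PC tags, 4-byte instructions, and 16 entries. The class and method names are illustrative and not taken from the slides.

```python
class BranchTargetBuffer:
    """Direct-mapped next-PC predictor indexed by the current PC (sizes are assumptions)."""

    def __init__(self, entries=16, instr_bytes=4):
        self.entries = entries
        self.instr_bytes = instr_bytes
        # Each slot holds (branch_pc, predicted_target) or None.
        self.table = [None] * entries

    def _index(self, pc):
        return (pc // self.instr_bytes) % self.entries

    def predict_next_pc(self, pc):
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]                # hit: predict a taken branch to the stored target
        return pc + self.instr_bytes      # miss: continue with sequential fetch

    def update(self, pc, taken, target):
        idx = self._index(pc)
        if taken:
            self.table[idx] = (pc, target)   # remember (or replace with) this branch
        elif self.table[idx] is not None and self.table[idx][0] == pc:
            self.table[idx] = None           # branch fell through: drop its entry


btb = BranchTargetBuffer()
btb.update(0x40, taken=True, target=0x10)
print(hex(btb.predict_next_pc(0x40)), hex(btb.predict_next_pc(0x44)))
```

Because the lookup uses only the PC, the prediction is available during fetch, before the instruction has been decoded as a branch.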

Branch Folding

• Optimization:
  • Larger branch-target buffer
  • Add target instruction into buffer to deal with longer decoding time required by larger buffer
    • “Branch folding”
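One way to picture branch folding: when the buffer stores the target instruction itself, the fetch stage can deliver that instruction in place of an unconditional branch, so the branch occupies no slot in the instruction stream. The sketch below is a simplified model under that assumption. `fetch`, `imem`, and `folding_buffer` are hypothetical names, and instructions are plain strings.

```python
INSTR_BYTES = 4  # assumed fixed instruction size


def fetch(pc, imem, folding_buffer):
    """Return (instruction, next_pc) for one fetch slot.

    imem maps address -> instruction text; folding_buffer maps the PC of an
    unconditional branch -> (target_pc, target_instruction).
    """
    if pc in folding_buffer:
        target_pc, target_instr = folding_buffer[pc]
        # Fold: deliver the target instruction instead of the branch itself.
        return target_instr, target_pc + INSTR_BYTES
    return imem[pc], pc + INSTR_BYTES


imem = {0: "ADD", 4: "J 100", 8: "SUB", 100: "LD", 104: "SD"}
folding_buffer = {4: (100, "LD")}

pc, stream = 0, []
for _ in range(3):
    instr, pc = fetch(pc, imem, folding_buffer)
    stream.append(instr)
print(stream)
```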

Return Address Predictor

• Most indirect jumps come from function returns
• The same procedure can be called from multiple sites
  • Causes the buffer to potentially forget about the return address from previous calls
• Create return address buffer organized as a stack
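A return address buffer organized as a stack can be sketched as follows. The depth of 8, the 4-byte instruction size, the absence of a branch delay slot, and the policy of discarding the oldest entry on overflow are all assumptions made for the illustration.

```python
class ReturnAddressStack:
    """Small hardware-style stack of predicted return addresses."""

    def __init__(self, depth=8, instr_bytes=4):
        self.depth = depth
        self.instr_bytes = instr_bytes
        self.stack = []

    def on_call(self, call_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)                        # overflow: discard the oldest entry
        self.stack.append(call_pc + self.instr_bytes)  # address of the instruction after the call

    def predict_return(self):
        # Pop the most recent call site; None means "no prediction available".
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x100)   # outer call
ras.on_call(0x200)   # nested call
print(hex(ras.predict_return()), hex(ras.predict_return()))
```

Unlike a branch-target buffer entry keyed by the return instruction's PC, the stack gives a correct prediction even when the same procedure is called from different sites, as long as calls and returns are properly nested and the stack does not overflow.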

Integrated Instruction Fetch Unit

• Design monolithic unit that performs:
  • Branch prediction
  • Instruction prefetch
    • Fetch ahead
  • Instruction memory access and buffering
    • Deal with crossing cache lines
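The "crossing cache lines" issue is simple address arithmetic: a fetch block that does not start on a line boundary may need instructions from two lines. A minimal sketch, assuming 64-byte lines (the function name and sizes are illustrative):

```python
def lines_touched(fetch_pc, fetch_bytes, line_bytes=64):
    """Return the indices of the cache lines that a fetch block overlaps."""
    first = fetch_pc // line_bytes
    last = (fetch_pc + fetch_bytes - 1) // line_bytes
    return list(range(first, last + 1))


# A 16-byte fetch (four 4-byte instructions) at different alignments.
for pc in (0, 56, 64):
    print(pc, lines_touched(pc, 16))
```

When more than one line is returned, the fetch unit has to access and buffer both lines to deliver a full group of instructions.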

Register Renaming

• Register renaming vs. reorder buffers
  • Instead of virtual registers from reservation stations and reorder buffer, create a single register pool
    • Contains visible registers and virtual registers
  • Use hardware-based map to rename registers during issue
  • WAW and WAR hazards are avoided
  • Speculation recovery occurs by copying during commit
  • Still need a ROB-like queue to update table in order
  • Simplifies commit:
    • Record that mapping between architectural register and physical register is no longer speculative
    • Free up physical register used to hold older value
    • In other words: SWAP physical registers on commit
  • Physical register de-allocation is more difficult
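The renaming map and the commit step can be sketched with a dictionary and a free list. This is a simplified model: the `RenameTable` class, its method names, and the initial identity mapping are assumptions, and recovery from mis-speculation is not modeled.

```python
class RenameTable:
    """Maps architectural registers onto a single pool of physical registers."""

    def __init__(self, arch_regs, num_phys):
        # Initially architectural register i lives in physical register i.
        self.map = {r: i for i, r in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (new_phys_dest, phys_srcs, old_phys_dest)."""
        phys_srcs = [self.map[s] for s in srcs]   # read sources before remapping dest
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        new = self.free.pop(0)
        old = self.map[dest]
        self.map[dest] = new
        return new, phys_srcs, old

    def commit(self, old_phys_dest):
        # The new mapping is no longer speculative, so the register that held
        # the older value of the architectural register can be reused.
        self.free.append(old_phys_dest)


rt = RenameTable(["R1", "R2", "R3"], num_phys=6)
first = rt.rename("R2", ["R1"])    # e.g. LD     R2,0(R1)
second = rt.rename("R2", ["R2"])   # e.g. DADDIU R2,R2,#1
print(first, second)
rt.commit(first[2])
```

The two writes to R2 receive different physical registers, which is how WAW and WAR hazards disappear. The second instruction reads the physical register allocated by the first, which preserves the true dependence.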

Integrated Issue and Renaming

• Combining instruction issue with register renaming:
  • Issue logic pre-reserves enough physical registers for the bundle (fixed number?)
  • Issue logic finds dependencies within bundle, maps registers as necessary
  • Issue logic finds dependencies between current bundle and already in-flight bundles, maps registers as necessary
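The three steps on this slide can be sketched as one function that renames a whole bundle. Real hardware performs the dependence checks in parallel. The sequential loop below is only a functional model, and `rename_bundle` with its argument layout is an assumption made for illustration.

```python
def rename_bundle(bundle, rename_map, free_list):
    """Rename a bundle given as a list of (dest, srcs) pairs in program order.

    dest may be None for instructions that write no register (e.g. stores).
    Returns a list of (phys_dest, phys_srcs), or None if the bundle must wait.
    """
    # Step 1: pre-reserve one physical register per destination in the bundle.
    needed = sum(1 for dest, _ in bundle if dest is not None)
    if needed > len(free_list):
        return None  # not enough physical registers: the whole bundle waits
    reserved = [free_list.pop(0) for _ in range(needed)]

    renamed = []
    for dest, srcs in bundle:
        # Steps 2 and 3: a source maps either to a register renamed earlier in
        # this bundle or to the mapping left by an in-flight (older) bundle.
        phys_srcs = [rename_map[s] for s in srcs]
        phys_dest = None
        if dest is not None:
            phys_dest = reserved.pop(0)
            rename_map[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed


rename_map = {"R1": "P1", "R2": "P2"}
free_list = ["P10", "P11", "P12"]
bundle = [("R2", ["R1"]), ("R2", ["R2"]), (None, ["R2", "R1"])]
print(rename_bundle(bundle, rename_map, free_list))
```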

How Much?

• How much to speculate
  • Mis-speculation hurts both performance and power relative to no speculation
    • May cause additional misses (cache, TLB)
  • Prevent speculative code from causing more costly misses (e.g., L2 misses)
• Speculating through multiple branches
  • Complicates speculation recovery
  • As of the fifth edition (2012), no processor can resolve multiple branches per cycle
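A back-of-the-envelope calculation shows why speculating through several branches is risky. If each prediction is correct with probability p and the predictions are treated as independent (a simplifying assumption), the work beyond n unresolved branches is useful only with probability p^n. The accuracy values below are hypothetical.

```python
def prob_all_correct(accuracy, branches):
    """Probability that every one of `branches` independent predictions is correct."""
    return accuracy ** branches


for n in (1, 2, 4, 8):
    print(n, round(prob_all_correct(0.95, n), 3))
```

Even with a good predictor, the chance that deeply speculated instructions are on the correct path falls off geometrically, and every wrong-path instruction costs energy without contributing to performance.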


Hardware vs Software Speculation

• To speculate extensively, must disambiguate memory references, which is difficult to do at compile time
• Hardware-based speculation works better when control flow is unpredictable and when hardware-based branch prediction is superior to software-based branch prediction
• Hardware-based speculation maintains a completely precise exception model (recent software-based approaches do as well)
• Hardware-based: no compensation or bookkeeping code
• Compiler-based: can see further into the code sequence
• Hardware-based speculation with dynamic scheduling does not require different code sequences for different implementations of an architecture

Energy Efficiency

• Speculation and energy efficiency
  • Note: speculation is only energy efficient when it significantly improves performance
• Value prediction
  • Uses:
    • Loads that load from a constant pool
    • Instructions that produce a value from a small set of values
  • Has not been incorporated into modern processors
  • A similar idea, address aliasing prediction, is used on some processors
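The note about energy follows from energy = average power × execution time. If speculation raises average power by some factor and shortens execution time by the speedup it delivers, the energy ratio is the power factor divided by the speedup. The numbers below are hypothetical, chosen only to show both outcomes.

```python
def relative_energy(power_ratio, speedup):
    """Energy with speculation divided by energy without it.

    power_ratio: average power with speculation / average power without.
    speedup:     execution time without speculation / time with speculation.
    """
    return power_ratio / speedup


print(relative_energy(1.3, 1.1))  # small speedup: more energy than without speculation
print(relative_energy(1.3, 1.5))  # large speedup: less energy than without speculation
```

Speculation saves energy only when the speedup exceeds the increase in average power.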

Threading

• Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
• Thread-level parallelism (TLP)
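One simple way for threads to share a processor's issue slot in an overlapping fashion is to rotate among them each cycle and skip any thread that is stalled. The sketch below models that policy, often called fine-grained multithreading. The single issue slot, the `(op, stall)` instruction encoding, and the function name are assumptions made for this illustration.

```python
def fine_grained_schedule(threads, max_cycles=100):
    """threads: dict name -> list of (op, stall_cycles_after_issue).

    Returns one entry per cycle: (thread, op), or None for an idle cycle.
    """
    names = list(threads)
    pos = {n: 0 for n in names}        # next instruction index per thread
    ready_at = {n: 0 for n in names}   # first cycle at which the thread may issue again
    trace = []
    start = 0                          # round-robin pointer
    for cycle in range(max_cycles):
        if all(pos[n] == len(threads[n]) for n in names):
            break
        issued = None
        for k in range(len(names)):
            n = names[(start + k) % len(names)]
            if pos[n] < len(threads[n]) and ready_at[n] <= cycle:
                op, stall = threads[n][pos[n]]
                pos[n] += 1
                ready_at[n] = cycle + 1 + stall
                issued = (n, op)
                start = (start + k + 1) % len(names)
                break
        trace.append(issued)
    return trace


a = [("a0", 0), ("a1", 2), ("a2", 0)]   # a1 stalls thread A for two cycles
b = [("b0", 0), ("b1", 0), ("b2", 0)]
print(fine_grained_schedule({"A": a}))
print(fine_grained_schedule({"A": a, "B": b}))
```

Run alone, thread A leaves the issue slot idle while it is stalled. With thread B present, those cycles are filled with B's instructions, which is the thread-level parallelism the slide refers to.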

Multi-Threading

Fallacies and Pitfalls

• Fallacy: It is easy to predict the performance and energy efficiency of two different versions of the same ISA, if we hold the technology constant.
• Fallacy: Processors with lower CPIs will always be faster.
• Fallacy: Processors with faster clock rates will always be faster.
• Pitfall: Sometimes bigger and dumber is better.
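The CPI and clock-rate fallacies both follow from the CPU time equation, CPU time = instruction count × CPI / clock rate: neither factor alone determines performance. The three processors below are hypothetical and serve only as counterexamples.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """CPU time = instruction count * cycles per instruction / clock rate."""
    return instruction_count * cpi / clock_hz


IC = 1_000_000_000  # same program, same ISA

time_a = cpu_time_seconds(IC, cpi=1.0, clock_hz=2e9)  # low CPI, slower clock
time_b = cpu_time_seconds(IC, cpi=1.5, clock_hz=4e9)  # higher CPI, faster clock
time_c = cpu_time_seconds(IC, cpi=3.0, clock_hz=4e9)  # much higher CPI, faster clock

print(time_a, time_b, time_c)
```

Processor B beats A despite its higher CPI, and processor C loses to A despite its faster clock, so both "always faster" claims fail.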