Introduction - University of New Mexicoece-research.unm.edu/pollard/classes/538/CAQA5e_ch3.pdfFor...

The University of Adelaide, School of Computer Science 31 October 2014


Copyright © 2012, Elsevier Inc. All rights reserved.

Chapter 3

Instruction-Level Parallelism

and Its Exploitation

Computer Architecture: A Quantitative Approach, Fifth Edition


Introduction

• Pipelining became a universal technique in 1985
  • Overlaps execution of instructions
  • Exploits “instruction-level parallelism”
• Beyond this, there are two main approaches:
  • Hardware-based dynamic approaches
    • Used in server and desktop processors
    • Not used as extensively in PMP processors
  • Compiler-based static approaches
    • Not as successful outside of scientific applications





Instruction-Level Parallelism

• When exploiting instruction-level parallelism, the goal is to minimize CPI
  • Pipeline CPI =
    • Ideal pipeline CPI +
    • Structural stalls +
    • Data hazard stalls +
    • Control stalls
• Parallelism within a basic block is limited
  • Typical size of basic block = 3-6 instructions
  • Must optimize across branches
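
The CPI breakdown above can be illustrated with a small worked calculation. This is only a sketch: the function name `pipeline_cpi` and the stall figures are hypothetical values chosen for illustration, not measurements from the slides.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Sum the ideal CPI and the average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical per-instruction stall contributions (assumed, not from the source).
cpi = pipeline_cpi(ideal_cpi=1.0,
                   structural_stalls=0.05,
                   data_hazard_stalls=0.25,
                   control_stalls=0.125)
print(f"Pipeline CPI = {cpi:.3f}")
```

Every technique in this chapter attacks one of the stall terms, which is why lowering (not raising) CPI is the goal.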



Data Dependence

• Loop-level parallelism
  • Unroll loop statically or dynamically
  • Use SIMD (vector processors and GPUs)
• Challenges:
  • Data dependency
    • Instruction j is data dependent on instruction i if either:
      • Instruction i produces a result that may be used by instruction j, or
      • Instruction j is data dependent on instruction k and instruction k is data dependent on instruction i
  • Dependent instructions cannot be executed simultaneously





Data Dependence

• Dependencies are a property of programs
• Pipeline organization determines if dependence is detected and if it causes a stall
• Data dependence conveys:
  • Possibility of a hazard
  • Order in which results must be calculated
  • Upper bound on exploitable instruction-level parallelism
• Dependencies that flow through memory locations are difficult to detect



Data Dependences in Floating-Point and Integer Code

Loop: L.D    F0,0(R1)    ;F0=array element
      ADD.D  F4,F0,F2    ;add scalar in F2
      S.D    F4,0(R1)    ;store result
      DADDUI R1,R1,#-8   ;decrement pointer 8 bytes
      BNE    R1,R2,Loop  ;branch R1!=R2





Name Dependence

• Two instructions use the same name but there is no flow of information
  • Not a true data dependence, but is a problem when reordering instructions
  • Antidependence: instruction j writes a register or memory location that instruction i reads
    • Initial ordering (i before j) must be preserved
  • Output dependence: instruction i and instruction j write the same register or memory location
    • Ordering must be preserved
• To resolve, use renaming techniques



Data Hazards

A hazard exists whenever there is a name or data dependence between instructions, and they are close enough that the overlap during execution would change the order of access to the operand involved in the dependence. Because of the dependence, we must preserve what is called program order, that is, the order that the instructions would execute in if executed sequentially one at a time as determined by the original source program. The goal of both our software and hardware techniques is to exploit parallelism by preserving program order only where it affects the outcome of the program. Detecting and avoiding hazards ensures that necessary program order is preserved.




Other Factors

• Data hazards
  • Read after write (RAW)
  • Write after write (WAW)
  • Write after read (WAR)
• Control dependence
  • Ordering of instruction i with respect to a branch instruction
    • An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
    • An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch



Data Hazards

RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value.

WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination.

WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value. This hazard arises from an antidependence. WAR hazards cannot occur in most static issue pipelines, even deeper pipelines or floating-point pipelines, because all reads are early (in ID) and all writes are late (in WB).
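
The three definitions above can be turned into a small check over register names. This is a minimal sketch: the tuple encoding `(destination, sources)` and the name `classify_hazards` are assumptions made for illustration. It reports which hazards are possible between two instructions; whether a stall actually occurs depends on the pipeline.

```python
def classify_hazards(first, second):
    """Return the potential hazards between two instructions.

    Each instruction is (dest, sources); `first` precedes `second`
    in program order. `dest` is None for instructions that write no register.
    """
    dest_i, srcs_i = first
    dest_j, srcs_j = second
    hazards = []
    if dest_i is not None and dest_i in srcs_j:
        hazards.append("RAW")   # j reads what i writes (true data dependence)
    if dest_j is not None and dest_j in srcs_i:
        hazards.append("WAR")   # j writes what i reads (antidependence)
    if dest_i is not None and dest_i == dest_j:
        hazards.append("WAW")   # both write the same name (output dependence)
    return hazards


load = ("F0", ["R1"])           # L.D   F0,0(R1)
add = ("F4", ["F0", "F2"])      # ADD.D F4,F0,F2
print(classify_hazards(load, add))
```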




Control Dependences

A control dependence determines the ordering of an instruction, i, with respect to a branch instruction so that the instruction i is executed in correct program order and only when it should be.

if p1 {
    S1;
};
if p2 {
    S2;
}



S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. In general, there are two constraints imposed by control dependences:

1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then portion of an if statement and move it before the if statement.

2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch. For example, we cannot take a statement before the if statement and move it into the then portion.




When processors preserve strict program order, they ensure that control dependences are also preserved. We may be willing to execute instructions that should not have been executed, however, thereby violating the control dependences, if we can do so without affecting the correctness of the program. Control dependence is not the critical property that must be preserved. Instead, the two properties critical to program correctness, and normally preserved by maintaining both data and control dependence, are the exception behavior and the data flow.


Examples

• Example 1:

      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R1,R6
  L:  …
      OR    R7,R1,R8

  • OR instruction is dependent on DADDU and DSUBU

• Example 2:

      DADDU R1,R2,R3
      BEQZ  R12,skip
      DSUBU R4,R5,R6
      DADDU R5,R4,R9
  skip:
      OR    R7,R8,R9

  • Assume R4 isn’t used after skip
  • Possible to move DSUBU before the branch




Compiler Techniques for Exposing ILP

• Pipeline scheduling
  • Separate dependent instruction from the source instruction by the pipeline latency of the source instruction
• Example:

    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;



Pipeline Stalls

Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall              (assume integer load latency is 1)
      BNE    R1,R2,Loop





Pipeline Scheduling

Scheduled code:

Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop
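
The effect of scheduling can be checked with a small in-order issue model. This is a sketch under stated assumptions: the `LATENCY` table is not given on the slides and is chosen here so that the model reproduces the stalls shown in the two listings (one stall from L.D to ADD.D, two from ADD.D to S.D, one from DADDUI to BNE). The names `LATENCY` and `cycles_per_iteration` are hypothetical.

```python
# Stall cycles needed between a producer and a dependent consumer.
# Assumed values, chosen to match the stalls shown in the listings.
LATENCY = {
    ("L.D", "ADD.D"): 1,
    ("ADD.D", "S.D"): 2,
    ("DADDUI", "BNE"): 1,
}


def cycles_per_iteration(program):
    """Issue instructions in order, one per cycle, stalling on data dependences.

    Each instruction is (opcode, dest, sources); returns the issue cycle
    of the last instruction, i.e. the cycles used by one loop iteration.
    """
    last_write = {}  # register -> (producer opcode, producer issue cycle)
    cycle = 0
    for op, dest, sources in program:
        cycle += 1  # earliest possible slot: the next cycle
        for reg in sources:
            if reg in last_write:
                producer_op, producer_cycle = last_write[reg]
                earliest = producer_cycle + 1 + LATENCY.get((producer_op, op), 0)
                cycle = max(cycle, earliest)
        if dest is not None:
            last_write[dest] = (op, cycle)
    return cycle


unscheduled = [
    ("L.D", "F0", ["R1"]),
    ("ADD.D", "F4", ["F0", "F2"]),
    ("S.D", None, ["F4", "R1"]),
    ("DADDUI", "R1", ["R1"]),
    ("BNE", None, ["R1", "R2"]),
]
scheduled = [
    ("L.D", "F0", ["R1"]),
    ("DADDUI", "R1", ["R1"]),
    ("ADD.D", "F4", ["F0", "F2"]),
    ("S.D", None, ["F4", "R1"]),
    ("BNE", None, ["R1", "R2"]),
]

print(cycles_per_iteration(unscheduled), cycles_per_iteration(scheduled))
```

Under these assumptions the original order takes 9 cycles per iteration and the scheduled order takes 7, matching a count of the instruction and stall lines in each listing.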



Loop Unrolling

• Loop unrolling
  • Unroll by a factor of 4 (assume # elements is divisible by 4)
  • Eliminate unnecessary instructions

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)      ;drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)     ;drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)   ;drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop

• Note: number of live registers vs. original loop




Loop Unrolling/Pipeline Scheduling

• Pipeline schedule the unrolled loop:

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop



Summary of the Loop Unrolling and Scheduling

• Determine that unrolling the loop would be useful: iterations are independent
• Use different registers to avoid unnecessary constraints
• Eliminate extra test and branch instructions
• Determine that loads and stores can be interchanged by checking target addresses
• Schedule code, preserving dependencies




Strip Mining

• Unknown number of loop iterations?
  • Number of iterations = n
  • Goal: make k copies of the loop body
  • Generate pair of loops:
    • First executes n mod k times
    • Second executes n / k times
  • “Strip mining”
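
The pair of loops described above can be sketched in a high-level language. This is an illustrative sketch only: it uses k = 4 and the scalar-add loop body from the earlier example, and the names `K` and `add_scalar_strip_mined` are hypothetical.

```python
K = 4  # unroll factor (assumed for illustration)


def add_scalar_strip_mined(x, s):
    """Add s to every element of x using a cleanup loop plus an unrolled loop."""
    n = len(x)
    first = n % K
    # First loop: runs n mod k times with the original, un-unrolled body.
    for i in range(first):
        x[i] += s
    # Second loop: runs n // k times with k copies of the body per iteration.
    for i in range(first, n, K):
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
    return x


print(add_scalar_strip_mined([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 10.0))
```

Running the remainder iterations first means the unrolled loop never needs a bounds check inside its body, whatever the value of n.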



Branch Prediction

• Basic 2-bit predictor:
  • For each branch:
    • Predict taken or not taken
    • If the prediction is wrong two consecutive times, change prediction
• Correlating predictor:
  • Multiple 2-bit predictors for each branch
  • One for each possible combination of outcomes of preceding n branches
• Local predictor:
  • Multiple 2-bit predictors for each branch
  • One for each possible combination of outcomes for the last n occurrences of this branch
• Tournament predictor:
  • Combine correlating predictor with local predictor
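
A basic 2-bit predictor is commonly built as a saturating counter, and the sketch below assumes that variant. The class name `TwoBitPredictor`, the helper `accuracy`, and the choice to start in the strongly-taken state are illustrative assumptions. Starting from a strong state, two consecutive mispredictions are needed before the prediction flips.

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, initial=3):
        self.counter = initial

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        # Move toward 3 on taken, toward 0 on not taken, saturating at the ends.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def accuracy(outcomes, predictor):
    """Fraction of branch outcomes predicted correctly, updating as we go."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# A loop branch taken 9 times then not taken once, executed twice.
loop_branch = ([True] * 9 + [False]) * 2
print(accuracy(loop_branch, TwoBitPredictor()))
```

In this pattern the single not-taken outcome at each loop exit costs one misprediction but does not flip the prediction, so the branch is predicted correctly again on re-entry.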





Branch Prediction Performance

[Figure: branch predictor performance]


Dynamic Scheduling

• Rearrange order of instructions to reduce stalls while maintaining data flow
• Advantages:
  • Compiler doesn’t need to have knowledge of microarchitecture
  • Handles cases where dependencies are unknown at compile time
• Disadvantages:
  • Substantial increase in hardware complexity
  • Complicates exceptions





Dynamic Scheduling

• Dynamic scheduling implies:
  • Out-of-order execution
  • Out-of-order completion
• Creates the possibility for WAR and WAW hazards
• Tomasulo’s approach
  • Tracks when operands are available
  • Introduces register renaming in hardware
    • Minimizes WAW and WAR hazards



Register Renaming

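• Example:

    DIV.D F0,F2,F4
    ADD.D F6,F0,F8
    S.D   F6,0(R1)
    SUB.D F8,F10,F14
    MUL.D F6,F10,F8

  • Antidependence on F8: ADD.D reads F8 before SUB.D writes it
  • Antidependence on F6: S.D reads F6 before MUL.D writes it
  • Plus a name (output) dependence on F6 between ADD.D and MUL.D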




Register Renaming

• Example:

    DIV.D F0,F2,F4
    ADD.D S,F0,F8
    S.D   S,0(R1)
    SUB.D T,F10,F14
    MUL.D F6,F10,T

• Now only RAW hazards remain, which can be strictly ordered
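
The renaming step can be sketched as a single pass that gives every destination a fresh name and rewrites later sources through a map. This is an illustration, not the slide's exact transformation: it renames every destination (producing T0, T1, ...) rather than only the two conflicting ones, and the function name `rename` and the tuple encoding are assumptions.

```python
def rename(program):
    """Rename each destination to a fresh temporary; remap later sources.

    Each instruction is (opcode, dest, sources); dest is None for stores.
    """
    mapping = {}   # architectural register -> most recent fresh name
    renamed = []
    for op, dest, sources in program:
        # Sources see the newest name of each register, or the original value.
        new_sources = [mapping.get(reg, reg) for reg in sources]
        new_dest = None
        if dest is not None:
            new_dest = f"T{len(mapping_history)}" if False else f"T{len(renamed_dests)}"
            renamed_dests.append(new_dest)
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed


renamed_dests = []  # fresh names handed out so far (module-level for simplicity)

program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D", None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]
result = rename(program)
for instruction in result:
    print(instruction)
```

After renaming, ADD.D still reads the original F8 while SUB.D writes a different name, and the two writes to F6 target different names, so only the RAW orderings remain.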



Register Renaming

• Register renaming is provided by reservation stations (RS)
  • Each RS contains:
    • The instruction
    • Buffered operand values (when available)
    • Reservation station number of instruction providing the operand values
• RS fetches and buffers an operand as soon as it becomes available (not necessarily involving register file)
• Pending instructions designate the RS to which they will send their output
  • Result values broadcast on a result bus, called the common data bus (CDB)
  • Only the last output updates the register file
• As instructions are issued, the register specifiers are renamed with the reservation station
• May be more reservation stations than registers





In Tomasulo’s scheme, register renaming is provided by reservation stations, which buffer the operands of instructions waiting to issue. The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. In addition, pending instructions designate the reservation station that will provide their input. Finally, when successive writes to a register overlap in execution, only the last one is actually used to update the register. As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station, which provides register renaming.


The use of reservation stations, rather than a centralized register file, leads to two other important properties. First, hazard detection and execution control are distributed: The information held in the reservation stations at each functional unit determines when an instruction can begin execution at that unit. Second, results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This bypassing is done with a common result bus that allows all units waiting for an operand to be loaded simultaneously (on the 360/91 this is called the common data bus, or CDB). In pipelines with multiple execution units and issuing multiple instructions per clock, more than one result bus will be needed.




Tomasulo’s Algorithm

• Load and store buffers
  • Contain data and addresses, act like reservation stations
• Top-level design:



Tomasulo’s Algorithm

• Three steps:
  • Issue
    • Get next instruction from FIFO queue
    • If an RS is available, issue the instruction to the RS with operand values if available
    • If no RS is available, stall the instruction; if operand values are not available, record the reservation stations that will produce them
  • Execute
    • When an operand becomes available, store it in any reservation stations waiting for it
    • When all operands are ready, begin executing the instruction
    • Loads and stores are maintained in program order through the effective address
    • No instruction is allowed to initiate execution until all branches that precede it in program order have completed
  • Write result
    • Write result on CDB into reservation stations and store buffers
      • (Stores must wait until address and value are received)





Issue: Get the next instruction from the head of the instruction queue, which is maintained in FIFO order to ensure the maintenance of correct data flow. If there is a matching reservation station that is empty, issue the instruction to the station with the operand values, if they are currently in the registers. If there is not an empty reservation station, then there is a structural hazard and the instruction stalls until a station or buffer is freed. If the operands are not in the registers, keep track of the functional units that will produce the operands.


Execute: If one or more of the operands is not yet available, monitor the common data bus while waiting for it to be computed. When an operand becomes available, it is placed into any reservation station awaiting it. When all the operands are available, the operation can be executed at the corresponding functional unit. By delaying instruction execution until the operands are available, RAW hazards are avoided. (Some dynamically scheduled processors call this step “issue,” but we use the name “execute,” which was used in the first dynamically scheduled processor, the CDC 6600.)




Write result: When the result is available, write it on the CDB and from there into the registers and into any reservation stations (including store buffers) waiting for this result. Stores are buffered in the store buffer until both the value to be stored and the store address are available, then the result is written as soon as the memory unit is free.


Reservation Station Fields

(borrowed from the CDC scoreboard)

Op - Operation to perform on source operands S1 and S2

Qj, Qk - Reservation stations that will produce the corresponding source operand

Vj, Vk - Values of the source operands. Note that only one of the V field or the Q field is valid for each operand

A - Used to hold information for the memory address calculation for a load or store

Busy - Indicates that this reservation station and its accompanying functional unit are occupied

Qi - (Register file) number of the reservation station whose result should be stored into this register
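
The fields above map naturally onto a small record, and the issue and write-result steps become updates to that record. The following is a simplified sketch, not a full Tomasulo implementation: it omits the A field, execution timing, and freeing of stations, and the names `ReservationStation`, `issue`, `broadcast`, and `reg_status` (standing in for the per-register Qi field) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # names of stations that will produce operands
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Fill a station: copy values that are available, else record the producer."""
    rs.busy, rs.op = True, op
    for src, v_field, q_field in ((src_j, "vj", "qj"), (src_k, "vk", "qk")):
        producer = reg_status.get(src)
        if producer is None:
            setattr(rs, v_field, regs[src])
        else:
            setattr(rs, q_field, producer)
    reg_status[dest] = rs.name   # Qi: this station now owns the destination


def broadcast(stations, regs, reg_status, producer, value):
    """Write result: deliver a value on the CDB to waiting stations and registers."""
    for rs in stations:
        if rs.qj == producer:
            rs.vj, rs.qj = value, None
        if rs.qk == producer:
            rs.vk, rs.qk = value, None
    for reg in [r for r, q in reg_status.items() if q == producer]:
        regs[reg] = value
        del reg_status[reg]


regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0}
reg_status = {}
mult1 = ReservationStation("Mult1")
add1 = ReservationStation("Add1")
issue(mult1, "MUL.D", "F0", "F2", "F4", regs, reg_status)  # MUL.D F0,F2,F4
issue(add1, "ADD.D", "F6", "F0", "F2", regs, reg_status)   # ADD.D F6,F0,F2
print(mult1.ready(), add1.ready(), add1.qj)
```

Here the ADD.D issues without stalling even though F0 is not yet computed; it simply records Mult1 in its Qj field and becomes ready when Mult1 broadcasts.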




Example

1. L.D   F6,32(R2)
2. L.D   F2,44(R3)
3. MUL.D F0,F2,F4
4. SUB.D F8,F2,F6
5. DIV.D F10,F0,F6
6. ADD.D F6,F8,F2




Hardware-Based Speculation

• Execute instructions along predicted execution paths but only commit the results if prediction was correct
• Instruction commit: allowing an instruction to update the register file when instruction is no longer speculative
• Need an additional piece of hardware to prevent any irrevocable action until an instruction commits
  • I.e. updating state or taking an exception



Reorder Buffer

• Reorder buffer: holds the result of an instruction between completion and commit
• Four fields:
  • Instruction type: branch/store/register
  • Destination field: register number
  • Value field: output value
  • Ready field: completed execution?
• Modify reservation stations:
  • Operand source is now reorder buffer instead of functional unit





Reorder Buffer

• Register values and memory values are not written until an instruction commits
• On misprediction:
  • Speculated entries in ROB are cleared
• Exceptions:
  • Not recognized until it is ready to commit
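
In-order commit from a reorder buffer can be sketched with a queue holding the four fields listed earlier. This is a simplified illustration: the `mispredicted` flag, the use of a memory address as the store destination, and the names `ROBEntry` and `commit` are assumptions, and exception handling is left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ROBEntry:
    kind: str                    # "register", "store", or "branch"
    dest: object = None          # register name, or address for a store
    value: object = None
    ready: bool = False          # has the instruction completed execution?
    mispredicted: bool = False   # only meaningful for branches (assumed field)


def commit(rob, regs, memory):
    """Retire completed entries from the head of the ROB, in program order.

    Stops at the first entry that is not ready. A mispredicted branch
    clears every younger (speculative) entry. Returns the number retired.
    """
    committed = 0
    while rob and rob[0].ready:
        entry = rob.popleft()
        committed += 1
        if entry.kind == "register":
            regs[entry.dest] = entry.value
        elif entry.kind == "store":
            memory[entry.dest] = entry.value
        elif entry.kind == "branch" and entry.mispredicted:
            rob.clear()
            break
    return committed


regs, memory = {}, {}
rob = deque([
    ROBEntry("register", "F0", 1.5, ready=True),
    ROBEntry("register", "F2"),                    # still executing
    ROBEntry("register", "F4", 3.0, ready=True),   # done, but must wait
])
print(commit(rob, regs, memory), regs)
```

The third entry has finished executing but stays in the buffer until the older F2 instruction completes, which is what keeps register and memory state precise.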



Multiple Issue and Static Scheduling

• To achieve CPI < 1, need to complete multiple instructions per clock
• Solutions:
  • Statically scheduled superscalar processors
  • VLIW (very long instruction word) processors
  • Dynamically scheduled superscalar processors





Multiple Issue





Explicitly parallel instruction computing (EPIC) is a term coined in 1997 by the HP–Intel alliance[1] to describe a computing paradigm that researchers had been investigating since the early 1980s.[2] This paradigm is also called Independence architectures. It was the basis for Intel and HP development of the Intel Itanium architecture,[3] and HP later asserted that "EPIC" was merely an old term for the Itanium architecture.[4] EPIC permits microprocessors to execute software instructions in parallel by using the compiler, rather than complex on-die circuitry, to control parallel instruction execution. This was intended to allow simple performance scaling without resorting to higher clock frequencies.


VLIW Processors

• Package multiple operations into one instruction
• Example VLIW processor:
  • One integer instruction (or branch)
  • Two independent floating-point operations
  • Two independent memory references
• Must be enough parallelism in code to fill the available slots





VLIW Processors

• Disadvantages:
  • Statically finding parallelism
  • Code size
  • No hazard detection hardware
  • Binary code compatibility



Dynamic Scheduling, Multiple Issue, and Speculation

• Modern microarchitectures:
  • Dynamic scheduling + multiple issue + speculation
• Two approaches:
  • Assign reservation stations and update pipeline control table in half clock cycles
    • Only supports 2 instructions/clock
  • Design logic to handle any possible dependencies between the instructions
  • Hybrid approaches
• Issue logic can become bottleneck





Overview of Design


Multiple Issue

• Limit the number of instructions of a given class that can be issued in a “bundle”
  • I.e. one FP, one integer, one load, one store
• Examine all the dependencies among the instructions in the bundle
• If dependencies exist in bundle, encode them in reservation stations
• Also need multiple completion/commit




Example

Loop: LD     R2,0(R1)    ;R2=array element
      DADDIU R2,R2,#1    ;increment R2
      SD     R2,0(R1)    ;store result
      DADDIU R1,R1,#8    ;increment pointer
      BNE    R2,R3,Loop  ;branch if not last element


Example (No Speculation)




Example


Branch-Target Buffer

• Need high instruction bandwidth!
  • Branch-target buffers
    • Next PC prediction buffer, indexed by current PC
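
A branch-target buffer can be sketched as a table from the current PC to a predicted next PC. This is a minimal illustration under assumptions: only taken branches are stored, a branch that turns out not taken is evicted, instructions are 4 bytes, and the table is unbounded. The class name `BranchTargetBuffer` and its methods are hypothetical.

```python
class BranchTargetBuffer:
    """Maps the PC of a branch to its predicted target (taken branches only)."""

    def __init__(self, instruction_size=4):
        self.entries = {}
        self.instruction_size = instruction_size

    def predict_next_pc(self, pc):
        # Hit: fetch from the stored target. Miss: fall through sequentially.
        return self.entries.get(pc, pc + self.instruction_size)

    def update(self, pc, taken, target):
        # Called once the branch outcome is known.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.predict_next_pc(0x100)))   # miss: sequential fetch
btb.update(0x100, taken=True, target=0x40)
print(hex(btb.predict_next_pc(0x100)))   # hit: predicted target
```

Because the lookup uses only the fetch address, the prediction is available before the instruction has even been decoded as a branch.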




Branch Folding

• Optimization:
  • Larger branch-target buffer
  • Add target instruction into buffer to deal with longer decoding time required by larger buffer
  • “Branch folding”


Return Address Predictor

• Most unconditional branches come from function returns
• The same procedure can be called from multiple sites
  • Causes the buffer to potentially forget about the return address from previous calls
• Create return address buffer organized as a stack
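
The return address buffer can be sketched as a small bounded stack: calls push, returns pop. The depth of 8 and the policy of discarding the oldest entry on overflow are assumptions for illustration, as are the names `ReturnAddressStack`, `on_call`, and `predict_return`.

```python
class ReturnAddressStack:
    """Bounded stack of return addresses used to predict procedure returns."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)        # overflow: discard the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        # Most recent call returns first; None means no prediction available.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x1004)   # main calls f
ras.on_call(0x2008)   # f calls g
print(hex(ras.predict_return()), hex(ras.predict_return()))
```

Unlike a branch-target buffer entry, which remembers only the last target of a given return instruction, the stack predicts correctly when the same procedure is called from several sites.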




Integrated Instruction Fetch Unit

• Design monolithic unit that performs:
  • Branch prediction
  • Instruction prefetch
    • Fetch ahead
  • Instruction memory access and buffering
    • Deal with crossing cache lines


Register Renaming

• Register renaming vs. reorder buffers
  • Instead of virtual registers from reservation stations and reorder buffer, create a single register pool
    • Contains visible registers and virtual registers
  • Use hardware-based map to rename registers during issue
  • WAW and WAR hazards are avoided
  • Speculation recovery occurs by copying during commit
  • Still need a ROB-like queue to update table in order
  • Simplifies commit:
    • Record that mapping between architectural register and physical register is no longer speculative
    • Free up physical register used to hold older value
    • In other words: SWAP physical registers on commit
  • Physical register de-allocation is more difficult




Integrated Issue and Renaming

• Combining instruction issue with register renaming:
  • Issue logic pre-reserves enough physical registers for the bundle (fixed number?)
  • Issue logic finds dependencies within bundle, maps registers as necessary
  • Issue logic finds dependencies between current bundle and already in-flight bundles, maps registers as necessary


How Much?

• How much to speculate
  • Mis-speculation degrades performance and power relative to no speculation
    • May cause additional misses (cache, TLB)
  • Prevent speculative code from causing more costly misses (e.g. L2)
• Speculating through multiple branches
  • Complicates speculation recovery
  • No processor can resolve multiple branches per cycle




Hardware vs Software Speculation

• To speculate extensively, must disambiguate memory references, which is difficult to do at compile time
• Hardware-based speculation works better when control flow is unpredictable and when hardware-based branch prediction is superior to software-based branch prediction
• Hardware-based speculation maintains a completely precise exception model (recent software as well)
• Hardware-based: no compensation or bookkeeping code
• Compiler-based: can see further
• Hardware-based with dynamic scheduling does not require different code sequences


Energy Efficiency

• Speculation and energy efficiency
  • Note: speculation is only energy efficient when it significantly improves performance
• Value prediction
  • Uses:
    • Loads that load from a constant pool
    • Instruction that produces a value from a small set of values
  • Has not been incorporated into modern processors
  • Similar idea (address aliasing prediction) is used on some processors




Threading

• Multithreading allows multiple threads to share the functional units of a single processor in overlapping fashion.
• Thread-level parallelism (TLP)



Multi-Threading





Fallacies and Pitfalls

• Fallacy: It is easy to predict the performance and energy efficiency of two different versions of the same ISA, if we hold the technology constant.
• Fallacy: Processors with lower CPIs will always be faster.
• Fallacy: Processors with faster clock rates will always be faster.
• Pitfall: Sometimes bigger and dumber is better.
