Introduction - University of New Mexico (ece-research.unm.edu/pollard/classes/538/CAQA5e_ch3.pdf)
The University of Adelaide, School of Computer Science 31 October 2014
Chapter 2 — Instructions: Language of the Computer 1
Copyright © 2012, Elsevier Inc. All rights reserved.
Chapter 3
Instruction-Level Parallelism
and Its Exploitation
Computer Architecture: A Quantitative Approach, Fifth Edition
Introduction
• Pipelining became a universal technique in 1985
  - Overlaps execution of instructions
  - Exploits “instruction-level parallelism”
• Beyond this, there are two main approaches:
  - Hardware-based dynamic approaches
    - Used in server and desktop processors
    - Not used as extensively in PMP processors
  - Compiler-based static approaches
    - Not as successful outside of scientific applications
Instruction-Level Parallelism
• When exploiting instruction-level parallelism, the goal is to minimize CPI
  - Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
• Parallelism within a basic block is limited
  - Typical size of basic block = 3-6 instructions
  - Must optimize across branches
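The CPI breakdown above is a plain sum, which a minimal sketch can make concrete. The function name and the stall figures below are hypothetical, chosen only to show the arithmetic; they are not measurements from the slides.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls

# Hypothetical stall cycles per instruction (illustrative values only).
cpi = pipeline_cpi(ideal_cpi=1.0, structural_stalls=0.05,
                   data_hazard_stalls=0.30, control_stalls=0.15)
print(f"Pipeline CPI = {cpi:.2f}")  # prints "Pipeline CPI = 1.50"
```

Reducing any one stall term lowers the total, which is why the techniques in this chapter each target a specific kind of stall.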
Data Dependence
• Loop-level parallelism
  - Unroll loop statically or dynamically
  - Use SIMD (vector processors and GPUs)
• Challenges:
  - Data dependency
    - Instruction j is data dependent on instruction i if
      - Instruction i produces a result that may be used by instruction j, or
      - Instruction j is data dependent on instruction k and instruction k is data dependent on instruction i
  - Dependent instructions cannot be executed simultaneously
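The two-part definition above (direct use of a result, or a chain through an intermediate instruction k) can be sketched as a small reachability check. This is an illustrative sketch only: the `(dest, sources)` encoding and the function names are assumptions, and only register dependences are tracked, not dependences through memory.

```python
def direct_producers(program):
    """For each instruction, the indices of earlier instructions whose result it reads."""
    last_writer = {}          # register -> index of most recent writer
    producers = []
    for idx, (dest, srcs) in enumerate(program):
        producers.append({last_writer[s] for s in srcs if s in last_writer})
        if dest is not None:
            last_writer[dest] = idx
    return producers

def data_dependent(program, j, i):
    """True if instruction j is data dependent on instruction i, directly or via a chain."""
    producers = direct_producers(program)
    stack, seen = [j], set()
    while stack:
        k = stack.pop()
        for p in producers[k]:
            if p == i:
                return True
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return False

# A five-instruction loop body in MIPS-style notation, encoded as (dest, sources).
loop = [
    ("F0", ["R1"]),        # L.D    F0,0(R1)
    ("F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    (None, ["F4", "R1"]),  # S.D    F4,0(R1)
    ("R1", ["R1"]),        # DADDUI R1,R1,#-8
    (None, ["R1", "R2"]),  # BNE    R1,R2,Loop
]
print(data_dependent(loop, 2, 0))  # True: S.D depends on L.D through ADD.D
print(data_dependent(loop, 3, 0))  # False: DADDUI does not use the loaded value
```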
Data Dependence
• Dependencies are a property of programs
• Pipeline organization determines if a dependence is detected and if it causes a stall
• Data dependence conveys:
  - Possibility of a hazard
  - Order in which results must be calculated
  - Upper bound on exploitable instruction-level parallelism
• Dependencies that flow through memory locations are difficult to detect
Data Dependences in Floating-Point and Integer Code
Loop: L.D    F0,0(R1)   ;F0=array element
      ADD.D  F4,F0,F2   ;add scalar in F2
      S.D    F4,0(R1)   ;store result
      DADDUI R1,R1,#-8  ;decrement pointer 8 bytes
      BNE    R1,R2,Loop ;branch R1!=R2
Name Dependence
• Two instructions use the same name but there is no flow of information
  - Not a true data dependence, but is a problem when reordering instructions
  - Antidependence: instruction j writes a register or memory location that instruction i reads
    - Initial ordering (i before j) must be preserved
  - Output dependence: instruction i and instruction j write the same register or memory location
    - Ordering must be preserved
• To resolve, use renaming techniques
Data Hazards
A hazard exists whenever there is a name or data dependence between instructions, and they are close enough that the overlap during execution would change the order of access to the operand
involved in the dependence. Because of the dependence, we
must preserve what is called program order, that is, the order that
the instructions would execute in if executed sequentially one at a
time as determined by the original source program. The goal of
both our software and hardware techniques is to exploit parallelism
by preserving program order only where it affects the outcome
of the program. Detecting and avoiding hazards ensures that
necessary program order is preserved.
Other Factors
• Data hazards
  - Read after write (RAW)
  - Write after write (WAW)
  - Write after read (WAR)
• Control dependence
  - Ordering of instruction i with respect to a branch instruction
    - An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
    - An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch
Data Hazards
RAW (read after write) — j tries to read a source before i
writes it, so j incorrectly gets the old value.
WAW (write after write) — j tries to write an operand before it is
written by i. The writes end up being performed in the wrong
order, leaving the value written by i rather than the value
written by j in the destination.
WAR (write after read) — j tries to write a destination before it is
read by i, so i incorrectly gets the new value. This hazard arises
from an antidependence. WAR hazards cannot occur in most
static issue pipelines — even deeper pipelines or floating-point
pipelines — because all reads are early (in ID) and all
writes are late (in WB).
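The three hazard classes can be read off the register names of an instruction pair. The sketch below is illustrative only: it takes i as the earlier instruction and j as the later one, encodes each as a hypothetical `(dest, sources)` tuple, and reports the potential hazards. Whether a real pipeline stalls on them depends on its organization.

```python
def classify(i, j):
    """Potential hazards between instruction i and a later instruction j."""
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    hazards = set()
    if i_dest is not None and i_dest in j_srcs:
        hazards.add("RAW")   # j reads what i writes (true data dependence)
    if j_dest is not None and j_dest in i_srcs:
        hazards.add("WAR")   # j writes what i reads (antidependence)
    if i_dest is not None and i_dest == j_dest:
        hazards.add("WAW")   # both write the same name (output dependence)
    return hazards

div = ("F0", ["F2", "F4"])    # DIV.D F0,F2,F4
add = ("F6", ["F0", "F8"])    # ADD.D F6,F0,F8
sub = ("F8", ["F10", "F14"])  # SUB.D F8,F10,F14
mul = ("F6", ["F10", "F8"])   # MUL.D F6,F10,F8

print(classify(div, add))  # {'RAW'}
print(classify(add, sub))  # {'WAR'}
print(classify(add, mul))  # {'WAW'}
```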
Control Dependences
A control dependence determines the ordering of an
instruction, i, with respect to a branch instruction so that
the instruction i is executed in correct program order and
only when it should be.
if p1 {
S1;
};
if p2 {
S2;
}
S1 is control dependent on p1, and S2 is control dependent on p2 but
not on p1. In general, there are two constraints imposed by control
dependences:
1. An instruction that is control dependent on a branch cannot be
moved before the branch so that its execution is no longer
controlled by the branch. For example, we cannot take an
instruction from the then portion of an if statement and move it
before the if statement.
2. An instruction that is not control dependent on a branch cannot
be moved after the branch so that its execution is controlled by the
branch. For example, we cannot take a statement before the if
statement and move it into the then portion.
When processors preserve strict program order, they ensure
that control dependences are also preserved. We may be
willing to execute instructions that should not have been
executed, however, thereby violating the control dependences,
if we can do so without affecting the correctness of the program. Control dependence is not the critical property that must be preserved. Instead, the two properties critical to program correctness — and normally preserved by maintaining both data and control dependence—are the exception behavior and
the data flow.
Examples
• Example 1:
      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R1,R6
  L:  …
      OR    R7,R1,R8
  - OR instruction dependent on DADDU and DSUBU
• Example 2:
      DADDU R1,R2,R3
      BEQZ  R12,skip
      DSUBU R4,R5,R6
      DADDU R5,R4,R9
  skip:
      OR    R7,R8,R9
  - Assume R4 isn’t used after skip
    - Possible to move DSUBU before the branch
Compiler Techniques for Exposing ILP
• Pipeline scheduling
  - Separate dependent instruction from the source instruction by the pipeline latency of the source instruction
• Example:
      for (i=999; i>=0; i=i-1)
          x[i] = x[i] + s;
Pipeline Stalls
Loop: L.D F0,0(R1)
stall
ADD.D F4,F0,F2
stall
stall
S.D F4,0(R1)
DADDUI R1,R1,#-8
stall (assume integer load latency is 1)
BNE R1,R2,Loop
Pipeline Scheduling
Scheduled code:
Loop: L.D F0,0(R1)
DADDUI R1,R1,#-8
ADD.D F4,F0,F2
stall
stall
S.D F4,8(R1)
BNE R1,R2,Loop
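The effect of scheduling on the cycle count can be checked with a small in-order, single-issue model. This is a sketch under stated assumptions, not the textbook's pipeline: the latency table (1 stall from FP load to FP ALU, 2 from FP ALU to store, 1 from integer ALU to branch), the instruction encoding, and the function name are all chosen here to reproduce the stall pattern in the listings.

```python
# Assumed stall cycles between a producer and a dependent consumer.
LATENCY = {
    ("fp_load", "fp_alu"): 1,
    ("fp_alu", "store"): 2,
    ("int_alu", "branch"): 1,
}

def count_cycles(program):
    """Issue cycle of the last instruction in a one-issue, in-order pipeline."""
    last_writer = {}   # register -> (producer kind, issue cycle)
    cycle = 0
    for kind, dest, srcs in program:
        earliest = cycle + 1
        for reg in srcs:
            if reg in last_writer:
                producer_kind, issued = last_writer[reg]
                stalls = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, issued + stalls + 1)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (kind, cycle)
    return cycle

unscheduled = [
    ("fp_load", "F0", ["R1"]),        # L.D    F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),   # ADD.D  F4,F0,F2
    ("store", None, ["F4", "R1"]),    # S.D    F4,0(R1)
    ("int_alu", "R1", ["R1"]),        # DADDUI R1,R1,#-8
    ("branch", None, ["R1", "R2"]),   # BNE    R1,R2,Loop
]
scheduled = [
    ("fp_load", "F0", ["R1"]),        # L.D    F0,0(R1)
    ("int_alu", "R1", ["R1"]),        # DADDUI R1,R1,#-8
    ("fp_alu", "F4", ["F0", "F2"]),   # ADD.D  F4,F0,F2
    ("store", None, ["F4", "R1"]),    # S.D    F4,8(R1)
    ("branch", None, ["R1", "R2"]),   # BNE    R1,R2,Loop
]
print(count_cycles(unscheduled), count_cycles(scheduled))  # 9 7
```

Moving DADDUI into the load delay removes two stall cycles; the two stalls between ADD.D and S.D remain.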
Loop Unrolling
• Loop unrolling
  - Unroll by a factor of 4 (assume # elements is divisible by 4)
  - Eliminate unnecessary instructions
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1) ;drop DADDUI & BNE
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1) ;drop DADDUI & BNE
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1) ;drop DADDUI & BNE
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
• Note: number of live registers vs. original loop
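At source level, the same transformation looks as follows. This sketch only illustrates the shape of the unrolled loop (one index update and one loop test per four elements); the function names are hypothetical, and Python itself gains no instruction-level parallelism from it.

```python
def add_scalar(x, s):
    """Original loop: one element per iteration, counting down."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s

def add_scalar_unrolled4(x, s):
    """Unrolled by 4: assumes len(x) is divisible by 4, as on the slide."""
    assert len(x) % 4 == 0
    for i in range(len(x) - 1, -1, -4):   # one decrement and test per 4 elements
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s

a = list(range(8))
b = list(range(8))
add_scalar(a, 10)
add_scalar_unrolled4(b, 10)
print(a == b)  # True
```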
Loop Unrolling/Pipeline Scheduling
• Pipeline schedule the unrolled loop:
Loop: L.D F0,0(R1)
L.D F6,-8(R1)
L.D F10,-16(R1)
L.D F14,-24(R1)
ADD.D F4,F0,F2
ADD.D F8,F6,F2
ADD.D F12,F10,F2
ADD.D F16,F14,F2
S.D F4,0(R1)
S.D F8,-8(R1)
DADDUI R1,R1,#-32
S.D F12,16(R1)
S.D F16,8(R1)
BNE R1,R2,Loop
Summary of the Loop Unrolling and Scheduling
• Determine that unrolling the loop would be useful: iterations are independent
• Use different registers to avoid unnecessary constraints
• Eliminate extra test and branch instructions
• Determine that loads and stores can be interchanged by checking target addresses
• Schedule the code, preserving dependencies
Strip Mining
• Unknown number of loop iterations?
  - Number of iterations = n
  - Goal: make k copies of the loop body
• Generate pair of loops:
  - First executes n mod k times
  - Second executes n / k times
  - “Strip mining”
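The pair of loops can be sketched as below. The function name is hypothetical, and the inner `for u in range(k)` loop merely stands in for k textual copies of the body that a compiler would emit.

```python
def add_scalar_strip_mined(x, s, k=4):
    """Process n elements as (n mod k) single iterations plus (n // k) unrolled ones."""
    n = len(x)
    remainder = n % k
    # First loop: runs n mod k times with the original body.
    for i in range(remainder):
        x[i] += s
    # Second loop: runs n // k times; each pass covers k elements.
    for i in range(remainder, n, k):
        for u in range(k):        # stands in for k unrolled copies of the body
            x[i + u] += s

data = list(range(10))            # n = 10, k = 4: 2 cleanup iterations, then 2 strips
add_scalar_strip_mined(data, 5, k=4)
print(data == [v + 5 for v in range(10)])  # True
```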
Branch Prediction
• Basic 2-bit predictor:
  - For each branch:
    - Predict taken or not taken
    - If the prediction is wrong two consecutive times, change prediction
• Correlating predictor:
  - Multiple 2-bit predictors for each branch
  - One for each possible combination of outcomes of preceding n branches
• Local predictor:
  - Multiple 2-bit predictors for each branch
  - One for each possible combination of outcomes for the last n occurrences of this branch
• Tournament predictor:
  - Combine correlating predictor with local predictor
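A basic 2-bit predictor is commonly modeled as a saturating counter per branch. The sketch below assumes that encoding (0-1 predict not taken, 2-3 predict taken), an initial state of 0, and an unbounded table keyed by branch PC; the class name and the PC value are hypothetical.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address."""

    def __init__(self):
        self.counters = {}   # branch PC -> counter in 0..3 (assumed to start at 0)

    def predict(self, pc):
        return self.counters.get(pc, 0) >= 2   # True means "predict taken"

    def update(self, pc, taken):
        c = self.counters.get(pc, 0)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

# A loop branch taken 9 times and then not taken, executed twice.
predictor = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 2
mispredictions = 0
for taken in outcomes:
    if predictor.predict(0x400) != taken:
        mispredictions += 1
    predictor.update(0x400, taken)
print(mispredictions)  # 4
```

After warm-up, the single not-taken outcome at loop exit costs one misprediction but does not flip the prediction for the next pass, which is the point of requiring two consecutive misses.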
Branch Prediction Performance
[Figure: branch predictor performance]
Dynamic Scheduling
• Rearrange order of instructions to reduce stalls while maintaining data flow
• Advantages:
  - Compiler doesn’t need to have knowledge of microarchitecture
  - Handles cases where dependencies are unknown at compile time
• Disadvantages:
  - Substantial increase in hardware complexity
  - Complicates exceptions
Dynamic Scheduling
• Dynamic scheduling implies:
  - Out-of-order execution
  - Out-of-order completion
• Creates the possibility for WAR and WAW hazards
• Tomasulo’s approach
  - Tracks when operands are available
  - Introduces register renaming in hardware
    - Minimizes WAW and WAR hazards
Register Renaming
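• Example:
      DIV.D F0,F2,F4
      ADD.D F6,F0,F8
      S.D   F6,0(R1)
      SUB.D F8,F10,F14
      MUL.D F6,F10,F8
• Antidependence on F8 (ADD.D reads it, SUB.D writes it)
• Antidependence on F6 (S.D reads it, MUL.D writes it)
• Plus a name (output) dependence on F6 (ADD.D and MUL.D both write it)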
Register Renaming
• Example:
      DIV.D F0,F2,F4
      ADD.D S,F0,F8
      S.D   S,0(R1)
      SUB.D T,F10,F14
      MUL.D F6,F10,T
• Now only RAW hazards remain, which can be strictly ordered
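Renaming can be sketched as a single pass that gives every result a fresh name and redirects later readers to it. This is an illustrative model, not the hardware mechanism: the `P0`, `P1`, ... names, the `(op, dest, sources)` encoding, and the function name are assumptions, and unlike the slide it renames every destination rather than only the conflicting ones.

```python
def rename(program):
    """Give each destination a fresh name; sources follow the latest mapping."""
    mapping = {}    # architectural register -> current fresh name
    counter = 0
    renamed = []
    for op, dest, srcs in program:
        new_srcs = [mapping.get(s, s) for s in srcs]   # read the newest version
        new_dest = None
        if dest is not None:
            new_dest = f"P{counter}"                   # hypothetical fresh name
            counter += 1
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D", None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]
renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

In the output, ADD.D still reads the original F8 while SUB.D writes a new name, and ADD.D and MUL.D write different names, so the WAR and WAW conflicts disappear while the RAW chains (DIV.D to ADD.D to S.D, SUB.D to MUL.D) are kept.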
Register Renaming
• Register renaming is provided by reservation stations (RS)
  - Contains:
    - The instruction
    - Buffered operand values (when available)
    - Reservation station number of instruction providing the operand values
  - RS fetches and buffers an operand as soon as it becomes available (not necessarily involving register file)
  - Pending instructions designate the RS to which they will send their output
  - Result values broadcast on a result bus, called the common data bus (CDB)
  - Only the last output updates the register file
• As instructions are issued, the register specifiers are renamed with the reservation station
• May be more reservation stations than registers
In Tomasulo’s scheme, register renaming is provided by
reservation stations, which buffer the operands of
instructions waiting to issue. The basic idea is that a
reservation station fetches and buffers an operand as
soon as it is available, eliminating the need to get the
operand from a register. In addition, pending instructions
designate the reservation station that will provide their input.
Finally, when successive writes to a register overlap in
execution, only the last one is actually used to update the
register. As instructions are issued, the register specifiers
for pending operands are renamed to the names of the
reservation station, which provides register renaming.
The use of reservation stations, rather than a centralized register
file, leads to two other important properties. First, hazard detection
and execution control are distributed: The information held in the
reservation stations at each functional unit determines when an
instruction can begin execution at that unit. Second, results are
passed directly to functional units from the reservation stations where
they are buffered, rather than going through the registers. This
bypassing is done with a common result bus that allows all units
waiting for an operand to be loaded simultaneously (on the 360/91
this is called the common data bus, or CDB). In pipelines with multiple
execution units and issuing multiple instructions per clock, more than
one result bus will be needed.
Tomasulo’s Algorithm
• Load and store buffers
  - Contain data and addresses, act like reservation stations
• Top-level design: [Figure: top-level design]
Tomasulo’s Algorithm
• Three steps:
  - Issue
    - Get next instruction from FIFO queue
    - If an RS is available, issue the instruction to the RS with operand values if available
    - If operand values are not available, stall the instruction
  - Execute
    - When an operand becomes available, store it in any reservation stations waiting for it
    - When all operands are ready, execute the instruction
    - Loads and stores are maintained in program order through effective address
    - No instruction is allowed to initiate execution until all branches that precede it in program order have completed
  - Write result
    - Write result on CDB into reservation stations and store buffers
    - (Stores must wait until address and value are received)
Issue—Get the next instruction from the head of the
instruction queue, which is maintained in FIFO order to
ensure the maintenance of correct data flow. If there is
a matching reservation station that is empty, issue the
instruction to the station with the operand values, if they
are currently in the registers. If there is not an empty
reservation station, then there is a structural hazard and
the instruction stalls until a station or buffer is freed. If the
operands are not in the registers, keep track of the
functional units that will produce the operands.
Execute—If one or more of the operands is not yet
available, monitor the common data bus while waiting for
it to be computed. When an operand becomes available,
it is placed into any reservation station awaiting it. When
all the operands are available, the operation can be
executed at the corresponding functional unit. By delaying
instruction execution until the operands are available,
RAW hazards are avoided. (Some dynamically scheduled
processors call this step “issue,” but we use the name
“execute,” which was used in the first dynamically
scheduled processor, the CDC 6600.)
Write result—When the result is available, write it on
the CDB and from there into the registers and into any
reservation stations (including store buffers) waiting
for this result. Stores are buffered in the store buffer
until both the value to be stored and the store address
are available, then the result is written as soon as the
memory unit is free.
Reservation Station Fields
(borrowed from CDC scoreboard)
OP - Operation to perform on S1 and S2
Qj,Qk – reservation stations that will produce
the corresponding Source Operand
Vj,Vk – Value of source operands. Note that only one
of the V field or the Q field is valid for each operand
A - Used to hold information for memory address
calculation for a load or store
Busy – Indicates that this reservation station and
its accompanying functional unit are occupied
Qi – (Register file) number of the reservation station
whose result should be stored into this register
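The fields above map naturally onto a small record, and the write-result step becomes a broadcast that fills in every waiting Q field. This is a simplified sketch: the class and function names are hypothetical, the A field and functional-unit timing are omitted, and the register-status table is modeled as a plain dictionary from register name to Qi tag.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of stations that will produce the operands
    qk: Optional[str] = None
    busy: bool = False

    def ready(self):
        """An occupied station may execute once neither operand is pending."""
        return self.busy and self.qj is None and self.qk is None

def broadcast(stations, register_status, registers, tag, value):
    """Model a CDB write: deliver `value` to everything waiting on `tag`."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, qi in list(register_status.items()):
        if qi == tag:                 # this register was waiting on the same producer
            registers[reg] = value
            del register_status[reg]

# MUL.D F0,F2,F4 waits in Mult1 for F2, which the load in Load2 will produce.
registers = {"F4": 2.0}
register_status = {"F2": "Load2", "F0": "Mult1"}
mult1 = ReservationStation("Mult1", op="MUL.D", vk=2.0, qj="Load2", busy=True)

print(mult1.ready())   # False
broadcast([mult1], register_status, registers, tag="Load2", value=3.0)
print(mult1.ready())   # True
```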
1. L.D F6,32(R2)
2. L.D F2,44(R3)
3. MUL.D F0,F2,F4
4. SUB.D F8,F2,F6
5. DIV.D F10,F0,F6
6. ADD.D F6,F8,F2
Example
Hardware-Based Speculation
• Execute instructions along predicted execution paths but only commit the results if the prediction was correct
• Instruction commit: allowing an instruction to update the register file when the instruction is no longer speculative
• Need an additional piece of hardware to prevent any irrevocable action until an instruction commits
  - I.e., updating state or taking an exception
Reorder Buffer
• Reorder buffer: holds the result of an instruction between completion and commit
• Four fields:
  - Instruction type: branch/store/register
  - Destination field: register number
  - Value field: output value
  - Ready field: completed execution?
• Modify reservation stations:
  - Operand source is now reorder buffer instead of functional unit
Reorder Buffer
• Register values and memory values are not written until an instruction commits
• On misprediction:
  - Speculated entries in ROB are cleared
• Exceptions:
  - Not recognized until it is ready to commit
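In-order commit and misprediction recovery can be sketched with a queue. This is a simplified model under stated assumptions: the class and method names are hypothetical, entries carry the four fields listed for the reorder buffer, and exceptions and store addresses are reduced to the bare minimum.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ROBEntry:
    kind: str              # "register", "store", or "branch"
    dest: object           # register name or memory address (None for branches)
    value: object = None
    ready: bool = False    # set once execution has completed

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()     # head (left) is the oldest instruction

    def allocate(self, kind, dest):
        entry = ROBEntry(kind, dest)
        self.entries.append(entry)
        return entry

    def commit(self, registers, memory):
        """Retire completed entries from the head only, i.e. in program order."""
        committed = 0
        while self.entries and self.entries[0].ready:
            e = self.entries.popleft()
            if e.kind == "register":
                registers[e.dest] = e.value
            elif e.kind == "store":
                memory[e.dest] = e.value
            committed += 1
        return committed

    def flush_after(self, branch_entry):
        """On a misprediction, clear every entry younger than the branch."""
        while self.entries and self.entries[-1] is not branch_entry:
            self.entries.pop()

rob, registers, memory = ReorderBuffer(), {}, {}
e1 = rob.allocate("register", "F0")
e2 = rob.allocate("register", "F6")
e2.value, e2.ready = 5.0, True               # younger instruction completes first
first = rob.commit(registers, memory)        # 0: the head is not ready yet
e1.value, e1.ready = 1.0, True
second = rob.commit(registers, memory)       # 2: both retire, in order

branch = rob.allocate("branch", None)
speculative = rob.allocate("register", "F8")
rob.flush_after(branch)                      # drops the speculative entry
print(first, second, len(rob.entries))       # 0 2 1
```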
Multiple Issue and Static Scheduling
• To achieve CPI < 1, need to complete multiple instructions per clock
• Solutions:
  - Statically scheduled superscalar processors
  - VLIW (very long instruction word) processors
  - Dynamically scheduled superscalar processors
Multiple Issue
Explicitly parallel instruction computing (EPIC) is a term coined in
1997 by the HP–Intel alliance[1] to describe a computing paradigm that
researchers had been investigating since the early 1980s.[2] This
paradigm is also called Independence architectures. It was the basis
for Intel and HP development of the Intel Itanium architecture,[3] and HP
later asserted that "EPIC" was merely an old term for the Itanium
architecture.[4] EPIC permits microprocessors to execute software
instructions in parallel by using the compiler, rather than complex
on-die circuitry, to control parallel instruction execution. This was
intended to allow simple performance scaling without resorting
to higher clock frequencies.
VLIW Processors
• Package multiple operations into one instruction
• Example VLIW processor:
  - One integer instruction (or branch)
  - Two independent floating-point operations
  - Two independent memory references
• Must be enough parallelism in code to fill the available slots
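The slot mix of the example VLIW gives a quick resource bound on how densely code can be packed. The sketch below computes only that lower bound, ignoring dependences and latencies, so a real schedule may need more instructions; the function name and the operation counts (a loop body with 8 memory, 4 FP and 2 integer/branch operations) are assumptions for illustration.

```python
import math

# Slots per VLIW instruction in the example processor above.
SLOTS = {"int": 1, "fp": 2, "mem": 2}

def min_bundles(op_counts):
    """Lower bound on VLIW instructions needed, from slot counts alone."""
    return max(math.ceil(op_counts.get(kind, 0) / width)
               for kind, width in SLOTS.items())

ops = {"mem": 8, "fp": 4, "int": 2}
bundles = min_bundles(ops)
utilization = sum(ops.values()) / (bundles * sum(SLOTS.values()))
print(bundles, utilization)  # 4 0.7
```

Even at this optimistic bound, 30% of the slots stay empty, which illustrates why enough parallelism is needed to fill them.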
VLIW Processors
• Disadvantages:
  - Statically finding parallelism
  - Code size
  - No hazard detection hardware
  - Binary code compatibility
Dynamic Scheduling, Multiple Issue, and Speculation
• Modern microarchitectures:
  - Dynamic scheduling + multiple issue + speculation
• Two approaches:
  - Assign reservation stations and update pipeline control table in half clock cycles
    - Only supports 2 instructions/clock
  - Design logic to handle any possible dependencies between the instructions
  - Hybrid approaches
• Issue logic can become bottleneck
Overview of Design
� Limit the number of instructions of a given class that can be issued in a “bundle”� I.e. one FP, one integer, one load, one store
� Examine all the dependencies among the instructions in the bundle
� If dependencies exist in bundle, encode them in reservation stations
� Also need multiple completion/commit
Dyn
am
ic S
ch
ed
ulin
g, M
ultip
le Is
su
e, a
nd
Sp
ecu
latio
n
Multiple Issue
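The issue-time checks on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the logic of any real processor: the `Instr` record, the limit of one instruction per class, and both function names are invented for the example.

```python
from collections import Counter
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    op_class: str            # "fp", "int", "load", or "store"
    dest: Optional[str]      # destination register, None if nothing is written
    srcs: tuple              # source registers

# Assumed limit: at most one instruction of each class per bundle.
CLASS_LIMIT = {"fp": 1, "int": 1, "load": 1, "store": 1}

def fits_class_limits(bundle):
    """Return True if no instruction class exceeds its per-bundle limit."""
    counts = Counter(i.op_class for i in bundle)
    return all(counts[c] <= CLASS_LIMIT.get(c, 0) for c in counts)

def bundle_dependencies(bundle):
    """List (producer_index, consumer_index, register) for every
    read-after-write dependence between instructions of one bundle."""
    deps = []
    for j, consumer in enumerate(bundle):
        for src in consumer.srcs:
            # Scan backwards so the consumer is tied to the most recent producer.
            for i in range(j - 1, -1, -1):
                if bundle[i].dest == src:
                    deps.append((i, j, src))
                    break
    return deps

bundle = [
    Instr("load", "R2", ("R1",)),        # LD     R2,0(R1)
    Instr("int", "R2", ("R2",)),         # DADDIU R2,R2,#1
    Instr("store", None, ("R2", "R1")),  # SD     R2,0(R1)
]
print(fits_class_limits(bundle))
print(bundle_dependencies(bundle))
```

Each tuple returned by `bundle_dependencies` is the kind of information the issue logic would have to encode in the reservation stations, so that a consumer waits for the producer issued in the same cycle.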
Example
Loop: LD     R2,0(R1)    ;R2=array element
      DADDIU R2,R2,#1    ;increment R2
      SD     R2,0(R1)    ;store result
      DADDIU R1,R1,#8    ;increment pointer
      BNE    R2,R3,Loop  ;branch if not last element
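To make the dependence chain in this loop explicit, it can be transcribed into Python. This is only a sketch: `run_loop`, the dictionary model of memory keyed by byte address, and the sample values are assumptions for illustration. The branch is modelled literally as the slide writes it, comparing R2 with R3.

```python
def run_loop(memory, r1, r3):
    """Python rendering of the five-instruction loop.

    memory maps a byte address to the 64-bit value stored there.
    """
    while True:
        r2 = memory[r1]      # LD     R2,0(R1)   load array element
        r2 += 1              # DADDIU R2,R2,#1   needs the loaded value
        memory[r1] = r2      # SD     R2,0(R1)   needs the incremented value
        r1 += 8              # DADDIU R1,R1,#8   independent of R2
        if r2 == r3:         # BNE    R2,R3,Loop falls through when R2 == R3
            return memory

result = run_loop({0: 5, 8: 7, 16: 9}, r1=0, r3=10)
print(result)
```

The load, the first add, the store, and the branch form one serial chain through R2, while the pointer increment depends only on R1. That is why multiple issue alone gains little on this loop unless iterations can overlap.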
Example (No Speculation)
Example
Branch-Target Buffer
• Need high instruction bandwidth!
  ◦ Branch-target buffers
    ▪ Next PC prediction buffer, indexed by current PC
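A branch-target buffer can be modelled as a small table looked up with the current PC during fetch. The sketch below is a simplified model under stated assumptions: it is direct-mapped, it assumes 4-byte instructions, and it evicts an entry when its branch turns out not taken. The class and method names are invented for illustration.

```python
class BranchTargetBuffer:
    """Direct-mapped table mapping a branch's PC to its predicted next PC."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries   # each slot: (tag_pc, predicted_pc) or None

    def _index(self, pc):
        # Assumes 4-byte instructions, so the two low-order bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return the predicted next PC, or None on a miss (fetch pc + 4)."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return None

    def update(self, pc, taken, target):
        """After the branch resolves: record taken branches, evict not-taken ones."""
        idx = self._index(pc)
        if taken:
            self.table[idx] = (pc, target)
        elif self.table[idx] is not None and self.table[idx][0] == pc:
            self.table[idx] = None

btb = BranchTargetBuffer()
first = btb.predict(0x40)          # miss: this branch has not been seen yet
btb.update(0x40, taken=True, target=0x10)
second = btb.predict(0x40)         # hit: predicted next PC is 0x10
print(first, hex(second))
```

Storing the full PC as a tag keeps an unrelated instruction that maps to the same slot from being treated as a branch.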
Branch Folding
• Optimization:
  ◦ Larger branch-target buffer
  ◦ Add target instruction into buffer to deal with longer decoding time required by larger buffer
    ▪ “Branch folding”
Return Address Predictor
• Most indirect jumps come from function returns
• The same procedure can be called from multiple sites
  ◦ Causes the buffer to potentially forget about the return address from previous calls
• Create return address buffer organized as a stack
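The stack organization can be sketched as follows. This is an illustrative model, not a description of any particular processor: the depth of 8, the policy of dropping the oldest entry on overflow, and the names `ReturnAddressStack`, `on_call`, and `on_return` are all assumptions.

```python
class ReturnAddressStack:
    """Small hardware-style stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        """Push the address of the instruction after the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overflow: the oldest entry is lost
        self.stack.append(return_pc)

    def on_return(self):
        """Pop the predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()

# The same procedure is called from two different sites.
ras.on_call(0x104)                 # call at 0x100
from_site_a = ras.on_return()
ras.on_call(0x204)                 # call at 0x200
from_site_b = ras.on_return()
print(hex(from_site_a), hex(from_site_b))
```

A buffer indexed only by the PC of the return instruction would still hold 0x104 on the second return and mispredict. The stack yields the right address for each call site.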
Integrated Instruction Fetch Unit
• Design monolithic unit that performs:
  ◦ Branch prediction
  ◦ Instruction prefetch
    ▪ Fetch ahead
  ◦ Instruction memory access and buffering
    ▪ Deal with crossing cache lines
Register Renaming
• Register renaming vs. reorder buffers
  ◦ Instead of virtual registers from reservation stations and reorder buffer, create a single register pool
    ▪ Contains visible registers and virtual registers
  ◦ Use hardware-based map to rename registers during issue
  ◦ WAW and WAR hazards are avoided
  ◦ Speculation recovery occurs by copying during commit
  ◦ Still need a ROB-like queue to update table in order
  ◦ Simplifies commit:
    ▪ Record that mapping between architectural register and physical register is no longer speculative
    ▪ Free up physical register used to hold older value
    ▪ In other words: SWAP physical registers on commit
  ◦ Physical register de-allocation is more difficult
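The map-based scheme can be illustrated with a small Python model. It is a sketch under simplifying assumptions: a free list handed out in FIFO order, no misprediction recovery, and an initial identity mapping. `RenameTable` and its methods are hypothetical names.

```python
class RenameTable:
    """Map architectural registers onto a single pool of physical registers."""

    def __init__(self, arch_regs, num_physical):
        # Initially architectural register i lives in physical register i.
        self.map = {reg: phys for phys, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, srcs):
        """Rename one instruction at issue.

        Returns (new_dest_phys, src_phys_list, old_dest_phys). The old
        mapping is returned so it can be freed when the instruction commits.
        """
        src_phys = [self.map[s] for s in srcs]   # read sources before remapping dest
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        new_phys = self.free.pop(0)
        old_phys = self.map[dest]
        self.map[dest] = new_phys
        return new_phys, src_phys, old_phys

    def commit(self, old_phys):
        """The new mapping is no longer speculative; release the older register."""
        self.free.append(old_phys)

rt = RenameTable(["R1", "R2", "R3"], num_physical=6)
first = rt.rename("R2", ["R1"])    # LD     R2,0(R1)
second = rt.rename("R2", ["R2"])   # DADDIU R2,R2,#1
print(first, second)
rt.commit(first[2])                # the LD commits: R2's previous register is freed
```

The two writes to R2 land in different physical registers, which removes the WAW hazard. The second instruction reads the physical register the first one writes, so the true dependence is preserved.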
Integrated Issue and Renaming
• Combining instruction issue with register renaming:
  ◦ Issue logic pre-reserves enough physical registers for the bundle (fixed number?)
  ◦ Issue logic finds dependencies within bundle, maps registers as necessary
  ◦ Issue logic finds dependencies between current bundle and already in-flight bundles, maps registers as necessary
How Much?
• How much to speculate
  ◦ Mis-speculation degrades performance and power relative to no speculation
    ▪ May cause additional misses (cache, TLB)
  ◦ Prevent speculative code from causing more costly misses (e.g. L2)
• Speculating through multiple branches
  ◦ Complicates speculation recovery
  ◦ No processor can resolve multiple branches per cycle
Hardware vs Software Speculation
• To speculate extensively, must disambiguate memory references, which is difficult to do at compile time
• Hardware-based speculation works better when control flow is unpredictable and when hardware-based branch prediction is superior to software-based branch prediction
• Hardware-based speculation maintains a completely precise exception model (recent software-based approaches do as well)
• Hardware-based: no compensation or bookkeeping code
• Compiler-based: can see further
• Hardware-based with dynamic scheduling does not require different code sequences for different implementations of an architecture
Energy Efficiency
• Speculation and energy efficiency
  ◦ Note: speculation is only energy efficient when it significantly improves performance
• Value prediction
  ◦ Uses:
    ▪ Loads that load from a constant pool
    ▪ Instruction that produces a value from a small set of values
  ◦ Has not been incorporated into modern processors
  ◦ Similar idea, address aliasing prediction, is used on some processors
Threading
• Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.
• Thread-level parallelism (TLP)
Multi-Threading
Fallacies and Pitfalls
• Fallacy: It is easy to predict the performance and energy efficiency of two different versions of the same ISA, if we hold the technology constant.
• Fallacy: Processors with lower CPIs will always be faster.
• Fallacy: Processors with faster clock rates will always be faster.
• Pitfall: Sometimes bigger and dumber is better.
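The CPI and clock-rate fallacies both follow from the standard performance equation, CPU time = instruction count × CPI / clock rate. The numbers below are hypothetical, chosen only to show that neither factor decides the outcome by itself.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x cycles per instruction / clock rate."""
    return instruction_count * cpi / clock_rate_hz

N = 1_000_000_000  # same program, same instruction count on every design

# Hypothetical designs, not measurements of real processors.
a = cpu_time_seconds(N, cpi=1.0, clock_rate_hz=2.0e9)  # lowest CPI, slowest clock
b = cpu_time_seconds(N, cpi=1.5, clock_rate_hz=4.0e9)  # higher CPI, faster clock
c = cpu_time_seconds(N, cpi=3.0, clock_rate_hz=4.0e9)  # fast clock, poor CPI

print(a, b, c)
```

Design A has a lower CPI than B yet takes longer, and design C has a faster clock than A yet takes longer. Only the product of the terms determines execution time.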