ILP, Memory and Synchronization
Joseph B. Manzano
Instruction Level Parallelism
• Parallelism that is found between instructions
• Dynamic and Static Exploitation
  – Dynamic: Hardware related
  – Static: Software related (compiler and system software)
• VLIW and Superscalar
• Micro-Dataflow and Tomasulo's Algorithm
Hazards
• Structural Hazards
  – Non-pipelined functional units
  – One-port register bank and one-port memory bank
• Data Hazards
  – For some: Forwarding
  – For others: Pipeline interlock

LD  R1, A         ; load
ADD R4, R1, R7    ; uses R1 -> needs a bubble / stall
Data Dependency: A Review
B + C → A
A + D → E
Flow Dependency — RAW Conflicts

A + C → B
E + D → A
Anti Dependency — WAR Conflicts

B + C → A
E + D → A
Output Dependency — WAW Conflicts

RAR is not really a problem
Instruction Level Parallelism
• Static Scheduling
  – Simple Scheduling
  – Loop Unrolling
  – Loop Unrolling + Scheduling
  – Software Pipelining
• Dynamic Scheduling
  – Out-of-order execution
  – Data Flow computers
• Speculation
Advanced Pipelining
• Instruction reordering and scheduling within the loop body
• Loop Unrolling
  – Code size suffers
• Superscalar
  – Compact code
  – Multiple issue of different instruction types
• VLIW
An Example: X[i] = X[i] + a

Loop: LD   F0, 0(R1)    ; load the vector element
      ADDD F4, F0, F2   ; add the scalar in F2
      SD   0(R1), F4    ; store the vector element
      SUB  R1, R1, #8   ; decrement the pointer by 8 bytes (per DW)
      BNEZ R1, Loop     ; branch when it's not zero
Instruction Producer Instruction Consumer Latency
FP ALU op FP ALU op 3
FP ALU op Store Double 2
Load Double FP ALU op 1
Load Double Store Double 0
Load can bypass the store. Assume that the latency for integer ops is zero and the latency for an integer load is 1.
An Example: X[i] = X[i] + a

Loop: LD    F0, 0(R1)    ; 1
      STALL              ; 2  <- load latency
      ADDD  F4, F0, F2   ; 3
      STALL              ; 4  <- FP ALU latency
      STALL              ; 5
      SD    0(R1), F4    ; 6
      SUB   R1, R1, #8   ; 7
      BNEZ  R1, Loop     ; 8
      STALL              ; 9
This requires 9 Cycles per iteration
An Example: X[i] = X[i] + a

Loop: LD    F0, 0(R1)    ; 1
      STALL              ; 2
      ADDD  F4, F0, F2   ; 3
      SUB   R1, R1, #8   ; 4
      BNEZ  R1, Loop     ; 5
      SD    8(R1), F4    ; 6  <- fills the delay slot; offset adjusted after SUB
This requires 6 Cycles per iteration
Scheduling
An Example: X[i] = X[i] + a

Loop: LD   F0, 0(R1)     ; 1
      NOP                ; 2
      ADDD F4, F0, F2    ; 3
      NOP                ; 4
      NOP                ; 5
      SD   0(R1), F4     ; 6
      LD   F6, -8(R1)    ; 7
      NOP                ; 8
      ADDD F8, F6, F2    ; 9
      NOP                ; 10
      NOP                ; 11
      SD   -8(R1), F8    ; 12
      LD   F10, -16(R1)  ; 13
      NOP                ; 14
      ADDD F12, F10, F2  ; 15
      NOP                ; 16
      NOP                ; 17
      SD   -16(R1), F12  ; 18
      LD   F14, -24(R1)  ; 19
      NOP                ; 20
      ADDD F16, F14, F2  ; 21
      NOP                ; 22
      NOP                ; 23
      SD   -24(R1), F16  ; 24
      SUB  R1, R1, #32   ; 25
      BNEZ R1, LOOP      ; 26
      NOP                ; 27
This requires 6.8 Cycles per iteration
Unrolling
An Example: X[i] = X[i] + a

Loop: LD   F0, 0(R1)     ; 1
      LD   F6, -8(R1)    ; 2
      LD   F10, -16(R1)  ; 3
      LD   F14, -24(R1)  ; 4
      ADDD F4, F0, F2    ; 5
      ADDD F8, F6, F2    ; 6
      ADDD F12, F10, F2  ; 7
      ADDD F16, F14, F2  ; 8
      SD   0(R1), F4     ; 9
      SD   -8(R1), F8    ; 10
      SD   -16(R1), F12  ; 11
      SUB  R1, R1, #32   ; 12
      BNEZ R1, LOOP      ; 13
      SD   8(R1), F16    ; 14  <- delay slot; offset adjusted after SUB
This requires 3.5 Cycles per iteration
Unrolling + Scheduling
ILP
• ILP of a program
  – Average number of instructions that a superscalar processor might be able to execute at the same time
    • Data dependencies
    • Latencies and other processor difficulties
• ILP of a machine
  – The ability of a processor to take advantage of the ILP
    • Number of instructions that can be fetched and executed at the same time by such a processor
Multi Issue Architectures
• Superscalar
  – Machines that issue multiple independent instructions per clock cycle when they are properly scheduled by the compiler and the runtime scheduler
• Very Long Instruction Word (VLIW)
  – A machine where the compiler has complete responsibility for creating a package of instructions that can be simultaneously issued, and the hardware does not dynamically make any decisions about multiple issue
Patterson & Hennessy P317 and P318
Multiple Instruction Issue
• Multiple Issue + Static Scheduling → VLIW
• Dynamic Scheduling
  – Tomasulo
  – Scoreboarding
• Multiple Issue + Dynamic Scheduling → Superscalar
• Decoupled Architectures
  – Static scheduling of register-register instructions
  – Dynamic scheduling of memory ops
    • Buffers
Software Pipeline
• Reorganizing loops such that each iteration is composed of instruction sequences chosen from different iterations
• Uses less code space than unrolling
• Some architectures have specific software support
  – Rotating register banks
  – Predicated instructions
Software Pipelining
• Overlap instructions without unrolling the loop
• Given the vector M in memory, and ignoring the start-up and finishing code, we have:

Loop: SD   0(R1), F4    ; stores into M[i]
      ADDD F4, F0, F2   ; adds to M[i+1]
      LD   F0, -8(R1)   ; loads M[i+2]
      BNEZ R1, LOOP
      SUB  R1, R1, #8   ; subtract in delay slot
This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions.
Software Pipeline
• Overhead for software pipelining: paid twice — once for the prologue and once for the epilogue
• Overhead for the unrolled loop: paid m/n times, for m loop iterations and n-way unrolling

[Figure: software-pipelined code (prologue, kernel, epilogue) vs. unrolled code, plotting the number of overlapped instructions over time]
Loop Unrolling V.S. Software Pipelining
• When not running at maximum rate
  – Unrolling: pay the overhead m/n times, for m iterations and n-way unrolling
  – Software Pipelining: pay the overhead twice
    • Once at the prologue and once at the epilogue
• Moreover
  – Code compactness
  – Optimal runtime
  – Storage constraints
Comparison of Static Methods
Method                   Cycles per iteration
w/o scheduling           9
Scheduling               6
Unrolling                6.8
Unrolling + Scheduling   3.5
2-issue                  2.4
4-issue                  1.28
SP 1-issue               5
SP 5-issue               1
Limitations of VLIW
• Limited parallelism in (statically scheduled) code
  – Basic blocks may be too small
  – Global code motion is difficult
• Limited hardware resources
• Code size
• Memory port limitations
• A stall is serious
• Caches are difficult to use effectively
  – i-cache misses have the potential to multiply the miss rate by a factor of n, where n is the issue width
  – The cache miss penalty is increased because of the length of the instruction word
A VLIW Example
TMS32C62x/C67 Block Diagram
Source: TMS320C600 Technical Brief. February 1999
A VLIW Example
TMS32C62x/C67 Data Paths
Source: TMS320C600 Technical Brief. February 1999
Assembly Example
Introduction to SuperScalar
Instruction Issue Policy
• It determines the processor's look-ahead policy
  – Ability to examine instructions beyond the current PC
• Look-ahead must ensure correctness at all costs
• Issue policy
  – Protocol used to issue instructions
• Note the distinction between issue, execution and completion
Achieve High Performance in Multiple Issued Instruction Machines
• Detection and resolution of storage conflicts
  – Extra "shadow" registers
  – Special bit for reservation
• Organization and control of the buses between the various units in the PU
  – Special controllers to detect write-backs and reads
Data Dependencies & SuperScalar
• Hardware mechanisms (dynamic scheduling)
  – Scoreboarding
    • Limited out-of-order issue/completion
    • Centralized control
  – Renaming with a reorder buffer is another attractive approach (based on Tomasulo's algorithm)
    • Micro dataflow
• Advantage: exact runtime information
  – Load/cache miss
  – Resolve storage-location-related dependences
Scoreboarding

• Named after the CDC 6600
• Effective when there are enough resources and no data dependencies
• Out-of-order execution
• Issue: check the scoreboard; a WAW hazard will cause a stall
• Read operands: check availability of operands and resolve RAW dynamically at this step; WAR will not cause a stall
• EX
• Write result: WAR will be checked and will cause a stall
[Figure: registers, an integer unit, FP add, FP divide and two FP multiply units, the scoreboard, the data buses, and control/status lines]
The basic structure of a DLX processor with a scoreboard
Scoreboarding
[CDC6600, Thorton70], [WeissSmith84]
• A bit (the "scoreboard bit") is associated with each register
  – bit = 1: the register is reserved by a write
• An instruction that has a source operand with bit = 1 will still be issued, but is put into an instruction window, with the register identifier denoting the "to-be-written" operand
• Copies of valid operands are also read along with the pending instruction (this solves anti-dependence)
• When the missing operand is finally written, the register id in the pending instruction is matched and the value written, so the instruction can be issued
• An instruction whose result register R is reserved will stall, so output dependence (WAW) is correctly handled by the stall!
Micro Data Flow
• Fundamental Concepts
  – "Data Flow"
    • Instructions can only be fired when their operands are available
  – Single assignment and register renaming
• Implementation– Tomasulo’s Algorithm– Reorder Buffer
Renaming/Single Assignment
Before renaming:

R0 = R2 / R4     (1)
R6 = R0 + R8     (2)
R1[0] = R6       (3)
R8 = R10 - R14   (4)
R6 = R10 * R8    (5)

After renaming (single assignment, with temporaries S and T):

R0 = R2 / R4     (1)
S  = R0 + R8     (2)
R1[0] = S        (3)
T  = R10 - R14   (4)
R6 = R10 * T     (5)
Baseline Superscalar Model
Instruction Fetch → Instruction Decode / Renaming → Issue Window (Wake-up & Select) → Register File read → Execute / Bypass / Data Cache Access → Register Write & Instruction Commit
Micro Data FlowConceptual Model
[Figure: dataflow graph for the sequence R1 = Load A; R2 = R1 * B; R1 = R2 / C; R4 = R4 + R1, with operands carried by tags (OR1–OR6) instead of register numbers R1–R4]
ROB Stages
• Issue
  – Dispatch an instruction from the instruction queue
  – Reserve a ROB entry and a reservation station
• Execute
  – Stall until operands are ready
  – RAW resolved
• Write Result
  – Write back to any reservation stations waiting for the value, and to the ROB
• Commit
  – Normal commit: update registers
  – Store commit: update memory
  – False branch: flush the ROB and restart execution
Tomasulo’s Algorithm
• Tomasulo, R.M. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", IBM Journal of Research and Development 11:1 (Jan. 1967), pp. 25-33
• IBM 360/91 (three years after the CDC 6600 and just before caches)
• Features:
  • CDB: Common Data Bus
  • Reservation stations: hardware features which allow the fetch, use and reuse of data as soon as it becomes available. They allow register renaming, and the scheme is decentralized in nature (as opposed to scoreboarding)
Tomasulo’s Algorithm
• Control and buffers distributed with the functional units
• Hardware renaming of registers
• CDB broadcasting
• Load/store buffers act as functional units
• Reservation stations:
  – Hazard detection and instruction control
  – 4-bit tag field to specify which station or buffer will produce the result
• Register renaming
  – Tag assigned on issue
  – Tag discarded after write-back
Comparison
• Scoreboarding
  – Centralized data structure and control
  – Register bit
    • Simple, low cost
  – Structural hazards stall at the FU
  – Solves RAW by the register bit
  – Solves WAR at write
  – Solves WAW by stalling on issue
• Tomasulo's Algorithm
  – Distributed control
  – Tagged registers + register renaming
  – Structural hazards stall on reservation stations
  – Solves RAW by the CDB
  – Solves WAR by copying operands to the reservation station
  – Solves WAW by renaming
  – Limited by the CDB
    • Broadcast
    • 1 per cycle
The Architecture
[Figure: Tomasulo's architecture — the instruction unit feeds a floating-point operation queue and FP registers; load buffers (from memory) and store buffers (to memory) sit beside reservation stations in front of the FP adders and FP multipliers, all connected by the operand bus, the operation bus, and the Common Data Bus (CDB)]

- 3 adders
- 2 multipliers
- Load buffers (6)
- Store buffers (3)
- FP queue
- FP registers
- CDB: Common Data Bus
Tomasulo’s Algorithm’s Steps
• Issue
  – Issue if an empty reservation station is found; fetch operands if they are in registers, otherwise assign a tag
  – If no empty reservation station is found, stall and wait for one to become free
  – Renaming is performed here, and WAW and WAR are resolved
• Execute
  – If operands are not ready, monitor the CDB for them
  – RAWs are resolved
  – When they are ready, execute the op in the FU
• Write Back
  – Send the results to the CDB and update the registers and the store buffers
  – Store buffers will write to memory during this step
• Exception Behavior
  – During execute: no instructions are allowed to be issued until all branches before them have been completed
Tomasulo’s Algorithm
• Note that:
  • Upon entering a reservation station, source operands are either filled with values or renamed
  • The new names are in 1-to-1 correspondence with FU names
• Questions:
  • How are output dependences resolved?
    • Two pending writes to a register
  • How do we determine that a read will get the most recent value if writes complete out of order?
Features of T. Alg.
• The value of an operand (for any instruction already issued to a reservation station) will be read from the CDB; it will not be read from the register field.
• Instructions can be issued even before their operands have been produced (knowing that they will arrive on the CDB)
Memory Models
Programming Execution Models
• A set of rules to create programs
• Message Passing Model
  – De facto multicomputer programming model
  – Multiple address spaces
  – Explicit communication / implicit synchronization
• Shared Memory Models
  – De facto multiprocessor programming model
  – Single address space
  – Implicit communication / explicit synchronization
Shared Memory Execution Model
Thread Virtual Machine:
• Thread Model: a set of rules for thread creation, scheduling and destruction
• Memory Model: a group of rules that deals with data replication, coherency, and memory ordering
  – Private data: data that is not visible to other threads
  – Shared data: data that can be accessed by other threads
• Synchronization Model: rules that deal with access to shared data
Grand Challenge Problems
• Making shared memory multiprocessors effective at scales of thousands of units
• Optimizing and compiling parallel applications
• Main areas: assumptions about
– Memory Coherency– Memory Consistency
Memory Consistency & Coherence
Memory [Cache] CoherencyThe Problem
[Figure: three processors P1, P2, P3 share a memory location u. (1) P1 reads u = 5; (2) P3 reads u = 5; (3) P3 writes u = 7; (4)–(5) P1 and P2 then read u.]

What values will P1 and P2 read?
MCM: Category of Access
As presented in Mosberger 93

Memory Access
  – Private
  – Shared
    • Non-Competing
    • Competing
      – Non-Synchronizing
      – Synchronizing
        • Acquire (Exclusive / Non-exclusive)
        • Release

Uniform vs. Hybrid
10/03/2007 ELEG652-07F 49
Conventional MCM
• Sequential Consistency– “… the result of any execution is the same as if the
operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport 79]
Memory Consistency Problem
Thread 1:        Thread 2:
B = 0            A = 0
...              ...
A = 1            B = 1
L1: print B      L2: print A

Assume that L1 and L2 are issued only after the other four instructions have completed. What are the possible values printed on the screen? Is 0, 0 a possible combination?
The MCM: A software and hardware contract
MCM Attributes
• Memory Operations• Location of Access
– Near memory (cache, near memory modules, etc) V.S. far memory• Direction of Access
– Write or Read• Value Transmitted in Access
– Size• Causality of Access
– Check if two access are “causually” related and if they are in which order are they completed
• Category of Access– Static Property of Accesses
Synchronization and Its Cost
Synchronization
• The orchestration of two or more threads (or processes) to complete a task in a correct manner and to avoid any data races
• Data Race or Race Condition
  – "There is an anomaly of concurrent accesses by two or more threads to a shared memory and at least one of the accesses is a write"
• Atomicity and/or serializability
Atomicity
• Atomic: from the Greek "atomos", which means indivisible
• An "all or none" scheme
• An instruction (or a group of them) will appear as if it was (they were) executed in a single try
  – All side effects of the instruction(s) in the block are seen in their totality or not at all
• Side effects: writes and (causal) reads to the variables inside the atomic block
Synchronization
• Applied to shared variables
• Synchronization might enforce ordering or not
• High-level synchronization types
  – Semaphores
  – Mutexes
  – Barriers
  – Critical Sections
  – Monitors
  – Condition Variables
Types of (Software) LocksThe Spin Lock Family
• The Simple Test-and-Set Lock
  – Polls a shared Boolean variable: a binary semaphore
  – Uses Fetch-and-Φ operations to operate on the binary semaphore
  – Expensive!!!!
    • Wastes bandwidth
    • Generates extra bus transactions
  – The test-and-test-and-set approach
    • Just poll (read) while the lock is in use
Types of (Software) LocksThe Spin Lock Family
• Delay-based Locks
  – Spin locks in which a delay has been introduced into the testing of the lock
  – Constant delay
  – Exponential back-off
    • Best results
  – The test-and-test-and-set scheme is not needed
Types of (Software) LocksThe Spin Lock Family
Pseudocode:

enum LOCK_ACTIONS = {LOCKED, UNLOCKED};

void acquire_lock(lock_t L) {
    int delay = 1;
    while (test_and_set(L, LOCKED) == LOCKED) {  /* spin while it was held */
        sleep(delay);
        delay *= 2;                              /* exponential back-off */
    }
}

void release_lock(lock_t L) {
    L = UNLOCKED;
}
Types of (Software) LocksThe Ticket Lock
• Reduce the # of Fetch and Φ operations – Only one per lock acquisition
• Strongly fair lock– No starvation
• A FIFO service• Implementation: Two counters
– A Request and Release Counters
Types of (Software) LocksThe Ticket Lock
Pseudocode:

unsigned int next_ticket = 0;
unsigned int now_serving = 0;

void acquire_lock() {
    unsigned int my_ticket = fetch_and_increment(next_ticket);
    while (1) {
        if (now_serving == my_ticket) return;
        sleep(my_ticket - now_serving);   /* proportional back-off */
    }
}

void release_lock() {
    now_serving = now_serving + 1;
}
Types of (Software) LocksThe Array Based Queue Lock
• Contention on the release counter
• Cache coherence and memory traffic
  – Invalidation of the counter variable and requests to a single memory bank
• Two elements
  – An array and a tail pointer that indexes the array
  – The array is as big as the number of processors
  – Fetch-and-store → address of the array element
  – Fetch-and-increment → tail pointer
• FIFO ordering
Types of (Software) LocksThe Queue Locks
• It uses too much memory
  – Linear space (relative to the number of processors) per lock
• Array
  – Easy to implement
• Linked list: QNODE
  – Cache management
Types of (Software) LocksThe MCS Lock
• Characteristics
  – FIFO ordering
  – Spins on locally accessible flag variables
  – Small amount of space per lock
  – Works equally well on machines with and without coherent caches
• Similar to the QNODE implementation of queue locks
  – QNODEs are assigned to local memory
  – Threads spin on local memory
MCS: How Does It Work?

• Each processor enqueues its own private lock variable into a queue and spins on it
  – Key: spin locally
    • CC model: spin in local cache
    • DSM model: spin in local private memory
  – No contention
• On lock release, the releaser unlocks the next lock in the queue
  – Bus/network contention occurs only on the actual unlock
  – No starvation (the order of lock acquisitions is defined by the queue)
MCS Lock
• Requires atomic instructions:
  – compare-and-swap
  – fetch-and-store
• If there is no compare-and-swap
  – an alternative release algorithm exists, with:
    • extra complexity
    • loss of strict FIFO ordering
    • a theoretical possibility of starvation
• Details: Mellor-Crummey and Scott's 1991 paper
ImplementationModern Alternatives
• Fetch-and-Φ operations
  – They are restrictive
  – Not all architectures support all of them
• Problem: a single general-purpose atomic operation is hard to build!
• Solution: provide two primitives from which atomic operations can be constructed
  – Load Linked and Store Conditional
  – Remember the PowerPC lwarx and stwcx instructions
Performance Penalty
Example
Suppose there are 10 processors on a bus that each try to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won’t matter much!) Determine the performance penalty.
Answer
It takes over 12,000 cycles total for all processors to pass through the lock!
Note the contention for the lock and the serialization of the bus transactions.
See example on pp 596, Henn/Patt, 3rd Ed.