ILP, Memory and Synchronization
Joseph B. Manzano
Instruction Level Parallelism
• Parallelism that is found between instructions
• Dynamic and Static Exploitation
  – Dynamic: Hardware related
  – Static: Software related (compiler and system software)
• VLIW and Superscalar
• Micro-Dataflow and Tomasulo's Algorithm
Hazards
• Structural Hazards
  – Non-pipelined functional units
  – One-port register bank and one-port memory bank
• Data Hazards
  – For some: Forwarding
  – For others: Pipeline interlock

LD  R1, A         ; load
ADD R4, R1, R7    ; uses R1 -> needs a bubble / stall
Data Dependency: A Review
B + C → A
A + D → E
Flow Dependency — RAW Conflicts

A + C → B
E + D → A
Anti Dependency — WAR Conflicts

B + C → A
E + D → A
Output Dependency — WAW Conflicts

RAR is not really a problem
Instruction Level Parallelism
• Static Scheduling
  – Simple Scheduling
  – Loop Unrolling
  – Loop Unrolling + Scheduling
  – Software Pipelining
• Dynamic Scheduling
  – Out-of-order execution
  – Data Flow computers
• Speculation
Advanced Pipelining
• Instruction reordering and scheduling within the loop body
• Loop Unrolling
  – Code size suffers
• Superscalar
  – Compact code
  – Multiple issue of different instruction types
• VLIW
An Example: X[i] = X[i] + a

Loop: LD   F0, 0(R1)    ; load the vector element
      ADDD F4, F0, F2   ; add the scalar in F2
      SD   0(R1), F4    ; store the vector element
      SUB  R1, R1, #8   ; decrement the pointer by 8 bytes (per DW)
      BNEZ R1, Loop     ; branch when it's not zero
Instruction Producer Instruction Consumer Latency
FP ALU op FP ALU op 3
FP ALU op Store Double 2
Load Double FP ALU op 1
Load Double Store Double 0
Load can bypass the store. Assume that the latency for integer ops is zero and the latency for an integer load is 1.
An Example: X[i] = X[i] + a

Loop: LD    F0, 0(R1)    ; 1
      STALL              ; 2  <- load latency
      ADDD  F4, F0, F2   ; 3
      STALL              ; 4  <- FP ALU latency
      STALL              ; 5
      SD    0(R1), F4    ; 6
      SUB   R1, R1, #8   ; 7
      BNEZ  R1, Loop     ; 8
      STALL              ; 9
This requires 9 Cycles per iteration
An Example: X[i] = X[i] + a

Loop: LD    F0, 0(R1)    ; 1
      STALL              ; 2
      ADDD  F4, F0, F2   ; 3
      SUB   R1, R1, #8   ; 4
      BNEZ  R1, Loop     ; 5
      SD    8(R1), F4    ; 6  <- fills the delay slot; offset adjusted after SUB
This requires 6 Cycles per iteration
Scheduling
An Example: X[i] = X[i] + a

Loop: LD   F0, 0(R1)     ; 1
      NOP                ; 2
      ADDD F4, F0, F2    ; 3
      NOP                ; 4
      NOP                ; 5
      SD   0(R1), F4     ; 6
      LD   F6, -8(R1)    ; 7
      NOP                ; 8
      ADDD F8, F6, F2    ; 9
      NOP                ; 10
      NOP                ; 11
      SD   -8(R1), F8    ; 12
      LD   F10, -16(R1)  ; 13
      NOP                ; 14
      ADDD F12, F10, F2  ; 15
      NOP                ; 16
      NOP                ; 17
      SD   -16(R1), F12  ; 18
      LD   F14, -24(R1)  ; 19
      NOP                ; 20
      ADDD F16, F14, F2  ; 21
      NOP                ; 22
      NOP                ; 23
      SD   -24(R1), F16  ; 24
      SUB  R1, R1, #32   ; 25
      BNEZ R1, LOOP      ; 26
      NOP                ; 27
This requires 6.8 Cycles per iteration
Unrolling
An Example: X[i] = X[i] + a

Loop: LD   F0, 0(R1)     ; 1
      LD   F6, -8(R1)    ; 2
      LD   F10, -16(R1)  ; 3
      LD   F14, -24(R1)  ; 4
      ADDD F4, F0, F2    ; 5
      ADDD F8, F6, F2    ; 6
      ADDD F12, F10, F2  ; 7
      ADDD F16, F14, F2  ; 8
      SD   0(R1), F4     ; 9
      SD   -8(R1), F8    ; 10
      SD   -16(R1), F12  ; 11
      SUB  R1, R1, #32   ; 12
      BNEZ R1, LOOP      ; 13
      SD   8(R1), F16    ; 14  <- delay slot; offset adjusted after SUB
This requires 3.5 Cycles per iteration
Unrolling + Scheduling
ILP
• ILP of a program
  – Average number of instructions that a superscalar processor might be able to execute at the same time
    • Data dependencies
    • Latencies and other processor difficulties
• ILP of a machine
  – The ability of a processor to take advantage of the ILP
    • Number of instructions that can be fetched and executed at the same time by such a processor
Multi Issue Architectures
• Superscalar
  – Machines that issue multiple independent instructions per clock cycle when they are properly scheduled by the compiler and the runtime scheduler
• Very Long Instruction Word (VLIW)
  – A machine where the compiler has complete responsibility for creating a package of instructions that can be simultaneously issued, and the hardware does not dynamically make any decisions about multiple issue
Patterson & Hennessy P317 and P318
Multiple Instruction Issue
• Multiple Issue + Static Scheduling → VLIW
• Dynamic Scheduling
  – Tomasulo
  – Scoreboarding
• Multiple Issue + Dynamic Scheduling → Superscalar
• Decoupled Architectures
  – Static scheduling of register-register instructions
  – Dynamic scheduling of memory ops
    • Buffers
Software Pipeline
• Reorganizing loops such that each iteration is composed of instruction sequences chosen from different iterations
• Uses less code space than unrolling
• Some architectures have specific software support
  – Rotating register banks
  – Predicated instructions
Software Pipelining
• Overlap instructions without unrolling the loop
• Given the vector M in memory, and ignoring the start-up and finishing code, we have:

Loop: SD   0(R1), F4    ; stores into M[i]
      ADDD F4, F0, F2   ; adds to M[i+1]
      LD   F0, -8(R1)   ; loads M[i+2]
      BNEZ R1, LOOP
      SUB  R1, R1, #8   ; subtract in delay slot
This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions.
Software Pipeline
• Overhead for software pipelining: paid twice — once for the prologue and once for the epilogue
• Overhead for the unrolled loop: paid m/n times, for m loop iterations and n-way unrolling

[Figure: software-pipelined code (prologue, kernel, epilogue) vs. unrolled code, plotting the number of overlapped instructions over time]
Loop Unrolling V.S. Software Pipelining
• When not running at maximum rate
  – Unrolling: pay the overhead m/n times, for m iterations and n-way unrolling
  – Software Pipelining: pay the overhead twice
    • Once at the prologue and once at the epilogue
• Moreover
  – Code compactness
  – Optimal runtime
  – Storage constraints
Comparison of Static Methods
Method                   Cycles per iteration
w/o scheduling           9
Scheduling               6
Unrolling                6.8
Unrolling + Scheduling   3.5
2-issue                  2.4
4-issue                  1.28
SP 1-issue               5
SP 5-issue               1
Limitations of VLIW
• Limited parallelism in (statically scheduled) code
  – Basic blocks may be too small
  – Global code motion is difficult
• Limited hardware resources
• Code size
• Memory port limitations
• A stall is serious
• Caches are difficult to use effectively
  – i-cache misses have the potential to multiply the miss rate by a factor of n, where n is the issue width
  – The cache miss penalty is increased because of the length of the instruction word
A VLIW Example
TMS32C62x/C67 Block Diagram
Source: TMS320C600 Technical Brief. February 1999
A VLIW Example
TMS32C62x/C67 Data Paths
Source: TMS320C600 Technical Brief. February 1999
Assembly Example
Introduction to SuperScalar
Instruction Issue Policy
• It determines the processor's look-ahead policy
  – Ability to examine instructions beyond the current PC
• Look-ahead must ensure correctness at all costs
• Issue policy
  – Protocol used to issue instructions
• Note the distinction between issue, execution and completion
Achieve High Performance in Multiple Issued Instruction Machines
• Detection and resolution of storage conflicts
  – Extra "shadow" registers
  – Special bit for reservation
• Organization and control of the buses between the various units in the PU
  – Special controllers to detect write-backs and reads
Data Dependencies & SuperScalar
• Hardware mechanisms (dynamic scheduling)
  – Scoreboarding
    • Limited out-of-order issue/completion
    • Centralized control
  – Renaming with a reorder buffer is another attractive approach (based on Tomasulo's algorithm)
    • Micro dataflow
• Advantage: exact runtime information
  – Load/cache miss
  – Resolve storage-location-related dependences
Scoreboarding

• Named after the CDC 6600
• Effective when there are enough resources and no data dependencies
• Out-of-order execution
• Issue: check the scoreboard; a WAW hazard will cause a stall
• Read operands: check availability of operands and resolve RAW dynamically at this step; WAR will not cause a stall
• EX
• Write result: WAR will be checked and will cause a stall
[Figure: registers, an integer unit, FP add, FP divide and two FP multiply units, the scoreboard, the data buses, and control/status lines]
The basic structure of a DLX processor with a scoreboard
Scoreboarding
[CDC6600, Thorton70], [WeissSmith84]
• A bit (the "scoreboard bit") is associated with each register
  – bit = 1: the register is reserved by a write
• An instruction that has a source operand with bit = 1 will still be issued, but is put into an instruction window, with the register identifier denoting the "to-be-written" operand
• Copies of valid operands are also read along with the pending instruction (this solves anti-dependence)
• When the missing operand is finally written, the register id in the pending instruction is matched and the value written, so the instruction can be issued
• An instruction whose result register R is reserved will stall, so output dependence (WAW) is correctly handled by the stall!
Micro Data Flow
• Fundamental Concepts
  – "Data Flow"
    • Instructions can only be fired when their operands are available
  – Single assignment and register renaming
• Implementation– Tomasulo’s Algorithm– Reorder Buffer
Renaming/Single Assignment
Before renaming:

R0 = R2 / R4     (1)
R6 = R0 + R8     (2)
R1[0] = R6       (3)
R8 = R10 - R14   (4)
R6 = R10 * R8    (5)

After renaming (single assignment, with temporaries S and T):

R0 = R2 / R4     (1)
S  = R0 + R8     (2)
R1[0] = S        (3)
T  = R10 - R14   (4)
R6 = R10 * T     (5)
Baseline Superscalar Model
Instruction Fetch → Instruction Decode / Renaming → Issue Window (Wake-up & Select) → Register File read → Execute / Bypass / Data Cache Access → Register Write & Instruction Commit
Micro Data FlowConceptual Model
[Figure: dataflow graph for the sequence R1 = Load A; R2 = R1 * B; R1 = R2 / C; R4 = R4 + R1, with operands carried by tags (OR1–OR6) instead of register numbers R1–R4]
ROB Stages
• Issue
  – Dispatch an instruction from the instruction queue
  – Reserve a ROB entry and a reservation station
• Execute
  – Stall until operands are ready
  – RAW resolved
• Write Result
  – Write back to any reservation stations waiting for the value, and to the ROB
• Commit
  – Normal commit: update registers
  – Store commit: update memory
  – False branch: flush the ROB and restart execution
Tomasulo’s Algorithm
• Tomasulo, R.M. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", IBM Journal of Research and Development 11:1 (Jan. 1967), pp. 25-33
• IBM 360/91 (three years after the CDC 6600 and just before caches)
• Features:
  • CDB: Common Data Bus
  • Reservation stations: hardware features which allow the fetch, use and reuse of data as soon as it becomes available. They allow register renaming, and the scheme is decentralized in nature (as opposed to scoreboarding)
Tomasulo’s Algorithm
• Control and buffers distributed with the functional units
• Hardware renaming of registers
• CDB broadcasting
• Load/store buffers act as functional units
• Reservation stations:
  – Hazard detection and instruction control
  – 4-bit tag field to specify which station or buffer will produce the result
• Register renaming
  – Tag assigned on issue
  – Tag discarded after write-back
Comparison
• Scoreboarding
  – Centralized data structure and control
  – Register bit
    • Simple, low cost
  – Structural hazards stall at the FU
  – Solves RAW by the register bit
  – Solves WAR at write
  – Solves WAW by stalling on issue
• Tomasulo's Algorithm
  – Distributed control
  – Tagged registers + register renaming
  – Structural hazards stall on reservation stations
  – Solves RAW by the CDB
  – Solves WAR by copying operands to the reservation station
  – Solves WAW by renaming
  – Limited by the CDB
    • Broadcast
    • 1 per cycle
The Architecture
[Figure: Tomasulo's architecture — the instruction unit feeds a floating-point operation queue and FP registers; load buffers (from memory) and store buffers (to memory) sit beside reservation stations in front of the FP adders and FP multipliers, all connected by the operand bus, the operation bus, and the Common Data Bus (CDB)]

- 3 adders
- 2 multipliers
- Load buffers (6)
- Store buffers (3)
- FP queue
- FP registers
- CDB: Common Data Bus
Tomasulo’s Algorithm’s Steps
• Issue
  – Issue if an empty reservation station is found; fetch operands if they are in registers, otherwise assign a tag
  – If no empty reservation station is found, stall and wait for one to become free
  – Renaming is performed here, and WAW and WAR are resolved
• Execute
  – If operands are not ready, monitor the CDB for them
  – RAWs are resolved
  – When they are ready, execute the op in the FU
• Write Back
  – Send the results to the CDB and update the registers and the store buffers
  – Store buffers will write to memory during this step
• Exception Behavior
  – During execute: no instructions are allowed to be issued until all branches before them have been completed
Tomasulo’s Algorithm
• Note that:
  • Upon entering a reservation station, source operands are either filled with values or renamed
  • The new names are in 1-to-1 correspondence with FU names
• Questions:
  • How are output dependences resolved?
    • Two pending writes to a register
  • How do we determine that a read will get the most recent value if writes complete out of order?
Features of T. Alg.
• The value of an operand (for any instruction already issued to a reservation station) will be read from the CDB; it will not be read from the register field.
• Instructions can be issued even before their operands have been produced (knowing that they will arrive on the CDB)
Memory Models
Programming Execution Models
• A set of rules to create programs
• Message Passing Model
  – De facto multicomputer programming model
  – Multiple address spaces
  – Explicit communication / implicit synchronization
• Shared Memory Models
  – De facto multiprocessor programming model
  – Single address space
  – Implicit communication / explicit synchronization
Shared Memory Execution Model
Thread Virtual Machine:
• Thread Model: a set of rules for thread creation, scheduling and destruction
• Memory Model: a group of rules that deals with data replication, coherency, and memory ordering
  – Private data: data that is not visible to other threads
  – Shared data: data that can be accessed by other threads
• Synchronization Model: rules that deal with access to shared data
Grand Challenge Problems
• Making shared memory multiprocessors effective at scales of thousands of units
• Optimizing and compiling parallel applications
• Main areas: assumptions about
– Memory Coherency– Memory Consistency
Memory Consistency & Coherence
Memory [Cache] CoherencyThe Problem
[Figure: three processors P1, P2, P3 share a memory location u. (1) P1 reads u = 5; (2) P3 reads u = 5; (3) P3 writes u = 7; (4)–(5) P1 and P2 then read u.]

What values will P1 and P2 read?
MCM: Category of Access
As presented in Mosberger 93

Memory Access
  – Private
  – Shared
    • Non-Competing
    • Competing
      – Non-Synchronizing
      – Synchronizing
        • Acquire (Exclusive / Non-exclusive)
        • Release

Uniform vs. Hybrid
10/03/2007 ELEG652-07F 49
Conventional MCM
• Sequential Consistency– “… the result of any execution is the same as if the
operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport 79]
Memory Consistency Problem
Thread 1:        Thread 2:
B = 0            A = 0
...              ...
A = 1            B = 1
L1: print B      L2: print A

Assume that L1 and L2 are issued only after the other four instructions have completed. What are the possible values printed on the screen? Is 0, 0 a possible combination?
The MCM: A software and hardware contract
MCM Attributes
• Memory Operations• Location of Access
– Near memory (cache, near memory modules, etc) V.S. far memory• Direction of Access
– Write or Read• Value Transmitted in Access
– Size• Causality of Access
– Check if two access are “causually” related and if they are in which order are they completed
• Category of Access– Static Property of Accesses
Synchronization and Its Cost
Synchronization
• The orchestration of two or more threads (or processes) to complete a task in a correct manner and to avoid any data races
• Data Race or Race Condition
  – "There is an anomaly of concurrent accesses by two or more threads to a shared memory and at least one of the accesses is a write"
• Atomicity and/or serializability
Atomicity
• Atomic: from the Greek "atomos", which means indivisible
• An "all or none" scheme
• An instruction (or a group of them) will appear as if it was (they were) executed in a single try
  – All side effects of the instruction(s) in the block are seen in their totality or not at all
• Side effects: writes and (causal) reads to the variables inside the atomic block
Synchronization
• Applied to shared variables
• Synchronization might enforce ordering or not
• High-level synchronization types
  – Semaphores
  – Mutexes
  – Barriers
  – Critical Sections
  – Monitors
  – Condition Variables
Types of (Software) LocksThe Spin Lock Family
• The Simple Test-and-Set Lock
  – Polls a shared Boolean variable: a binary semaphore
  – Uses Fetch-and-Φ operations to operate on the binary semaphore
  – Expensive!!!!
    • Wastes bandwidth
    • Generates extra bus transactions
  – The test-and-test-and-set approach
    • Just poll (read) while the lock is in use
Types of (Software) LocksThe Spin Lock Family
• Delay-based Locks
  – Spin locks in which a delay has been introduced into the testing of the lock
  – Constant delay
  – Exponential back-off
    • Best results
  – The test-and-test-and-set scheme is not needed
Types of (Software) LocksThe Spin Lock Family
Pseudocode:

enum LOCK_ACTIONS = {LOCKED, UNLOCKED};

void acquire_lock(lock_t L) {
    int delay = 1;
    while (test_and_set(L, LOCKED) == LOCKED) {  /* spin while it was held */
        sleep(delay);
        delay *= 2;                              /* exponential back-off */
    }
}

void release_lock(lock_t L) {
    L = UNLOCKED;
}
Types of (Software) LocksThe Ticket Lock
• Reduce the # of Fetch and Φ operations – Only one per lock acquisition
• Strongly fair lock– No starvation
• A FIFO service• Implementation: Two counters
– A Request and Release Counters
Types of (Software) LocksThe Ticket Lock
Pseudocode:

unsigned int next_ticket = 0;
unsigned int now_serving = 0;

void acquire_lock() {
    unsigned int my_ticket = fetch_and_increment(next_ticket);
    while (1) {
        if (now_serving == my_ticket) return;
        sleep(my_ticket - now_serving);   /* proportional back-off */
    }
}

void release_lock() {
    now_serving = now_serving + 1;
}
Types of (Software) LocksThe Array Based Queue Lock
• Contention on the release counter
• Cache coherence and memory traffic
  – Invalidation of the counter variable and requests to a single memory bank
• Two elements
  – An array and a tail pointer that indexes the array
  – The array is as big as the number of processors
  – Fetch-and-store → address of the array element
  – Fetch-and-increment → tail pointer
• FIFO ordering
Types of (Software) LocksThe Queue Locks
• It uses too much memory
  – Linear space (relative to the number of processors) per lock
• Array
  – Easy to implement
• Linked list: QNODE
  – Cache management
Types of (Software) LocksThe MCS Lock
• Characteristics
  – FIFO ordering
  – Spins on locally accessible flag variables
  – Small amount of space per lock
  – Works equally well on machines with and without coherent caches
• Similar to the QNODE implementation of queue locks
  – QNODEs are assigned to local memory
  – Threads spin on local memory
MCS: How Does It Work?

• Each processor enqueues its own private lock variable into a queue and spins on it
  – Key: spin locally
    • CC model: spin in local cache
    • DSM model: spin in local private memory
  – No contention
• On lock release, the releaser unlocks the next lock in the queue
  – Bus/network contention occurs only on the actual unlock
  – No starvation (the order of lock acquisitions is defined by the queue)
MCS Lock
• Requires atomic instructions:
  – compare-and-swap
  – fetch-and-store
• If there is no compare-and-swap
  – an alternative release algorithm exists, with:
    • extra complexity
    • loss of strict FIFO ordering
    • a theoretical possibility of starvation
• Details: Mellor-Crummey and Scott's 1991 paper
ImplementationModern Alternatives
• Fetch-and-Φ operations
  – They are restrictive
  – Not all architectures support all of them
• Problem: a single general-purpose atomic operation is hard to build!
• Solution: provide two primitives from which atomic operations can be constructed
  – Load Linked and Store Conditional
  – Remember the PowerPC lwarx and stwcx instructions
Performance Penalty
Example
Suppose there are 10 processors on a bus that each try to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won’t matter much!) Determine the performance penalty.
Answer
It takes over 12,000 cycles total for all processors to pass through the lock!
Note the contention for the lock and the serialization of the bus transactions.
See example on pp 596, Henn/Patt, 3rd Ed.