Introduction
Spring 2011
4541.775 Topics on Compilers
Today’s lecture
● Introduction to Compiler Challenges for Modern Architectures
● Overview
● Pipelining
● Vector Instructions
● Superscalar and VLIW Processors
● Processor Parallelism
● Memory Hierarchies
● Summing it Up
Overview
● Moore’s Law upheld thanks to various kinds of parallelism
(source: J. Dongarra, Univ. of Tennessee)
Pipelining
● Definition: dividing a complex operation into a sequence of independent sub-operations such that, if the sub-operations use different resources, operations can be overlapped: the next operation starts as soon as its predecessor has completed the first sub-operation.
● several types of pipelining
● pipelined instruction units
● pipelined execution units
● parallel function units
Pipelining
● Pipelined Instruction Units
● as early as 1962 in the IBM 7094
● typical five-stage pipeline
– instruction fetch (IF)
– instruction decode (ID)
– execute (EX)
– memory access (MEM)
– write back (WB)
handles all kinds of typical RISC instructions (ALU, memory, and branch)
● operation latency: 1 cycle
● throughput: n operations take n + (pipeline stages − 1) cycles, i.e. (n + pipeline stages − 1)/n cycles per operation (n: number of operations); optimal throughput for large n: 1 operation/cycle (worked example below)
[Diagram: three consecutive instructions flowing through the IF–ID–EX–MEM–WB pipeline, each starting one clock cycle after its predecessor]
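A worked example (not on the slide, following the formula above): for n = 100 instructions on the five-stage pipeline, execution takes 100 + 5 − 1 = 104 cycles, i.e. 104/100 = 1.04 cycles per operation, a throughput of roughly 0.96 operations/cycle, which approaches the optimum of 1 operation/cycle as n grows.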
Pipelining
● Pipelined Execution Units
● complex instructions cannot perform the computation in one cycle only
● representative example: execution stage of a floating-point adder
– fetch operands (FO)
– equate exponents (EE)
– add mantissas (AM)
– normalize result (NR)
● combined with pipelined instruction units
● operation latency: l cycles (l: depth of the execution pipeline)
● throughput: n operations take n + (pipeline stages − 1) cycles, i.e. (n + pipeline stages − 1)/n cycles per operation (n: number of operations); optimal throughput for large n: again 1 operation/cycle
[Diagram: the EX stage replaced by the floating-point add sub-stages FO–EE–AM–NR; three overlapped instructions flow through the combined IF ID FO EE AM NR MEM WB pipeline]
Pipelining
● Parallel Functional Units
● replicate whole functional units
● fine-grained parallelism
+ operational freedom
− cost (transistor count, die area, energy, …)
− complicated coordination
● combining with pipelining possible
[Diagram: a dispatch unit feeding several parallel adders, whose results are collected]
Pipelining
● Compilation Issues with Scalar Pipelines
● pipeline stalls
● a pipeline stall (i.e., the next operation cannot be inserted into the pipeline at the beginning of a new cycle) is caused by one of three hazard conditions (Hennessy and Patterson, Computer Architecture: A Quantitative Approach)
– structural hazard
– data hazard
– control hazard
Pipelining
● Compilation Issues with Scalar Pipelines
● structural hazard: H/W restricts overlapping of certain (sub-)operations
● example: pipelined unit with one memory port
→ structural hazard when executing the sequence Memory, ALU, ALU, ALU
● cannot be avoided through compiler strategies
[Diagram: four instructions (Memory, ALU, ALU, ALU) in the five-stage pipeline; with a single port to core memory, the fourth instruction's instruction fetch conflicts with the first instruction's data memory access in its MEM stage]
Pipelining
● Compilation Issues with Scalar Pipelines
● data hazard: occurs when input operands are not ready yet
● examples (instruction pairs listed below):
– no forwarding
– zero-cycle forwarding solves the above hazard, but not the load-use case
– instruction latency
● avoid by good instruction scheduling (see the sketch after the examples)
[Diagrams: pipeline timing for three dependent instruction pairs]
– sub r3 ← r2, r1 ; add r4 ← r3, r5 (stalls without forwarding; zero-cycle forwarding removes the stall)
– ld r3 ← mem[r2, r1] ; add r4 ← r3, r5 (load-use latency: even with forwarding, the add must wait for the MEM stage)
– add_f r3 ← r2, r1 ; add_f r4 ← r3, r5 (multi-cycle execution latency of the FO–EE–AM–NR adder)
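A minimal instruction-scheduling sketch (my example, not from the slide): the compiler moves an operation that does not depend on the load into the load delay, so the pipeline does useful work instead of stalling. The or instruction is a hypothetical independent operation.

before scheduling:
ld  r3 ← mem[r2, r1]
add r4 ← r3, r5        (stalls waiting for r3)
or  r6 ← r7, r8

after scheduling:
ld  r3 ← mem[r2, r1]
or  r6 ← r7, r8        (independent operation fills the load delay)
add r4 ← r3, r5        (r3 is available by now)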
Pipelining
● Compilation Issues with Scalar Pipelines
● control hazard: caused by (mispredicted) control transfers
● example: the branch target is not known until the EX stage of the branch completes
● avoid through (a combination of)
– performing the comparison in the ID stage (1 stall cycle) [H/W]
– branch prediction buffer [H/W]
– S/W branch prediction hints [compiler] (see the sketch below)
– expose the branch delay slot [compiler]
● principal compiler strategy to avoid stalls: instruction scheduling
[Diagram: pipeline timing for the conditional branch beq r1, r2 → .bb5; the instruction at .bb5 cannot be fetched until the branch outcome is known]
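A minimal sketch of a software branch prediction hint (my example, GCC/Clang specific, not from the slide): __builtin_expect tells the compiler which outcome of a condition is likely, so it can lay out and schedule the likely path first.

#include <stddef.h>

/* Sum the non-negative elements; the negative case is hinted as rare. */
double sum_nonnegative(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (__builtin_expect(a[i] < 0.0, 0))   /* hint: rarely taken */
            continue;
        s += a[i];
    }
    return s;
}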
Vector Instructions
● Definition: a single instruction that executes an element-wise operation on two vector quantities held in special vector registers or in memory.
● Introduced in the 1970s to simplify instruction processing: to keep the pipeline full, complex hardware strategies such as lookahead with out-of-order execution had been developed, but these had become a burden for hardware designers.
+ simple to keep the pipeline full
− complicated decoder logic to support a large number of instructions
− increased processor state (H/W cost, context switching)
− causes problems with traditional memory hierarchy design
Vector Instructions
● Compilation Issues with Vector Pipelines
● retaining the program semantics
→ proper data dependence analysis is necessary to decide whether a loop is vectorizable or not
Example 1:

for (i = 0; i < 64; i++)
    C[i] = A[i] + B[i];

→ vectorization → C[0:63] = A[0:63] + B[0:63]
→ code generation →

vload  v1, A
vload  v2, B
vadd   v3, v1, v2
vstore C, v3

correct?

Example 2:

for (i = 0; i < 64; i++)
    A[i+1] = A[i] + B[i];

→ vectorization → A[1:64] = A[0:63] + B[0:63]
→ code generation →

vload  v1, A
vload  v2, B
vadd   v3, v1, v2
vstore A+1, v3

correct? (see the check below)
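A minimal check (my example, not from the slides): the second loop is a recurrence, so each iteration reads a value written by the previous one, while the vector statement reads all of A[0:63] before writing any element of A[1:64]. The small C program below simulates both semantics and shows they differ, so the naive vectorization of Example 2 is incorrect.

#include <stdio.h>
#define N 64

int main(void) {
    double As[N + 1], Av[N + 1], old[N], B[N];
    for (int i = 0; i <= N; i++) As[i] = Av[i] = 1.0;
    for (int i = 0; i < N; i++)  B[i] = 1.0;

    /* serial loop semantics: each iteration uses the A[i] written
       by the previous iteration */
    for (int i = 0; i < N; i++)
        As[i + 1] = As[i] + B[i];

    /* vector statement semantics: all of A[0:63] is read before any
       element of A[1:64] is written */
    for (int i = 0; i < N; i++) old[i] = Av[i];
    for (int i = 0; i < N; i++) Av[i + 1] = old[i] + B[i];

    printf("serial A[64] = %g, vector A[64] = %g\n", As[N], Av[N]);
    /* prints: serial A[64] = 65, vector A[64] = 2 */
    return 0;
}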
Superscalar and VLIW Processors
● idea: keep the instruction set design simple, yet keep the pipeline busy by issuing several instructions per cycle
● superscalar processors: hardware looks ahead in the instruction stream and searches for operations that are ready to execute, i.e., have all required inputs ready. Some superscalar processors can even execute instructions out of order.
● VLIW (very long instruction word) processors: the processor executes instructions in bundles. Typically, each instruction in a bundle corresponds to an operation on a different functional unit. The compiler/programmer is expected to bundle instructions correctly, i.e., no instruction executes before all its inputs are available.
Superscalar and VLIW Processors
+ superscalar/VLIW processors can achieve the speed of vector machines
− require high bandwidth to memory and large instruction caches
− stride-one accesses are critical to good performance for cached data
Superscalar and VLIW Processors
● Compilation Issues with Superscalar/VLIW Processors
● careful reordering of (machine) instructions is required to exploit all of the available hardware resources; to generate well-performing code from high-level languages the compiler must
i. recognize independent operations → dependence analysis (vectorization)
ii. generate the shortest possible schedule → instruction scheduling
● example: assume all operations have a 2-cycle latency
ld_f  r3 ← mem[x]
ld_f  r2 ← mem[y]
add_f r1 ← r3, r2
st_f  mem[u] ← r1
ld_f  r5 ← mem[x]
add_f r4 ← r5, r1
st_f  mem[v] ← r4
What is the current schedule length? Is there a better schedule? What if the machine is a 2-issue VLIW processor?
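One possible answer sketch (mine, not from the slides), assuming a single-issue pipeline in which a result becomes available two cycles after its producer issues and a store needs its data when it issues: in the original order every dependent instruction waits on the instruction directly before it, so four stall cycles are inserted. Hoisting the independent load of r5 removes two of them:

ld_f  r3 ← mem[x]
ld_f  r2 ← mem[y]
ld_f  r5 ← mem[x]      (independent load moved up)
add_f r1 ← r3, r2
st_f  mem[u] ← r1      (still waits one cycle for r1)
add_f r4 ← r5, r1
st_f  mem[v] ← r4      (still waits one cycle for r4)

On a 2-issue VLIW, under the same assumptions, independent operations such as the two loads of x and y could additionally be bundled into one issue slot, shortening the schedule further.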
Processor Parallelism
● Definition: processor parallelism reduces the execution time of an application by running the same or several tasks operating on different data sets using multiple processors.
● two main variations:
● synchronous processor parallelism: execute the same thread in lockstep on different parts of the data
+ cheap task creation
+ cheap synchronization
− does not handle control flow well
● asynchronous processor parallelism: execute different threads/parts of a program simultaneously
+ can handle control flow well
− expensive synchronization through memory
− expensive task creation
Processor Parallelism
● Compilation Issues with Asynchronous Parallelism
● exploit coarse-grain parallelism, i.e., parallelize whole loop iterations
● Bernstein’s conditions: determine whether two loop iterations can safely be executed in parallel. I = input set, O = output set; subscripts denote loop iterations:
i. I_i ∩ O_k = Ø
ii. I_k ∩ O_i = Ø
iii. O_i ∩ O_k = Ø
● granularity

Examples (see the worked reading below):

for (i = 0; i < N; i++) {
    A[i+1] = A[i] + B[i];
}

for (i = 0; i < N; i++) {
    A[i-1] = A[i] + B[i];
}

for (i = 0; i < N; i++) {
    S = A[i] + B[i];
}
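A worked reading of the examples (my analysis, not from the slide), comparing the sets of two consecutive iterations:
– A[i+1] = A[i] + B[i]: iteration i writes A[i+1] and iteration i+1 reads it, so an input set and an output set overlap; a flow dependence is carried across iterations and they cannot safely run in parallel.
– A[i-1] = A[i] + B[i]: iteration i reads A[i] and iteration i+1 writes it, so the conditions are again violated (an anti-dependence); running the iterations in parallel as written is unsafe.
– S = A[i] + B[i]: every iteration writes S, so the output sets overlap (condition iii); as written the iterations conflict, although S could be privatized per iteration.
Granularity matters because each iteration here performs only a few operations, so the cost of creating and synchronizing parallel tasks can easily outweigh the useful work.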
Memory Hierarchies
● Measures of a Memory System
● latency: number of cycles required to fetch a single element from the memory
● bandwidth: number of data elements the memory can deliver in each cycle
● Latency avoidance vs. tolerance
● latency avoidance: reduce the memory latencies incurred in a computation
→ memory hierarchies
● latency tolerance: do something else while waiting for the data to arrive
→ prefetching, non-blocking loads, Cray/Tera MTA (see the sketch below)
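A minimal latency-tolerance sketch (my example, GCC/Clang specific, not from the slide): __builtin_prefetch asks the hardware to start fetching data that will be needed a few iterations later, so the memory accesses overlap with useful work. The prefetch distance of 16 is an arbitrary assumption.

#include <stddef.h>

/* Dot product with software prefetching of both input streams. */
double dot(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n) {                  /* assumed prefetch distance */
            __builtin_prefetch(&a[i + 16]);
            __builtin_prefetch(&b[i + 16]);
        }
        s += a[i] * b[i];
    }
    return s;
}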
Memory Hierarchies
● Compilation Issues with Memory Hierarchies
● efficiency of the code depends on the problem size and the cache size
● as long as the data fits in the data cache, high performance is achieved
● as soon as the problem size exceeds the size of the cache, extensive thrashing may occur
For an LRU cache, thrashing occurs in the loop below as soon as M exceeds the cache size:

for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        A[i] = A[i] + B[j];
    }
}

Basic idea: process the data in chunks, e.g., strips of length L:

for (k = 0; k < M; k = k + L) {
    for (i = 0; i < N; i++) {
        for (j = k; j < k + L; j++) {   /* one strip of B that fits in the cache */
            A[i] = A[i] + B[j];
        }
    }
}

● machine-specific optimization (the strip length L depends on the cache size)
Case Study
● Case Study: Matrix Multiplication
● A: m×p matrix, B: p×n matrix, C: m×n matrix
C = A × B, with

C_{i,j} = \sum_{k=1}^{p} A_{i,k} \times B_{k,j}
Case Study
● for simplicity, m = n = p = N
● straightforward C implementation: optimal on a scalar, non-pipelined machine with no cache
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        C[i,j] = 0.0;
        for (int k = 0; k < N; k++) {
            C[i,j] = C[i,j] + A[i,k] * B[k,j];
        }
    }
}
Case Study
● pipelined floating-point unit: multiply-adder with four pipeline stages
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j = j + 4) {
        C[i,j+0] = 0.0;
        C[i,j+1] = 0.0;
        C[i,j+2] = 0.0;
        C[i,j+3] = 0.0;
        for (int k = 0; k < N; k++) {
            /* four independent accumulations keep the 4-stage multiply-adder pipeline full */
            C[i,j+0] = C[i,j+0] + A[i,k] * B[k,j+0];
            C[i,j+1] = C[i,j+1] + A[i,k] * B[k,j+1];
            C[i,j+2] = C[i,j+2] + A[i,k] * B[k,j+2];
            C[i,j+3] = C[i,j+3] + A[i,k] * B[k,j+3];
        }
    }
}
Case Study
● vector machine: 32-element vector registers
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j = j + 32) {
        C[i,j:j+31] = 0.0;
        for (int k = 0; k < N; k++) {
            C[i,j:j+31] = C[i,j:j+31] + A[i,k] * B[k,j:j+31];
        }
    }
}
Case Study
● 4-issue pipelined VLIW: four floating-point multiply-adders, each with four pipeline stages
for (int i = 0; i < N; i = i + 4) {
    for (int j = 0; j < N; j = j + 4) {
        C[i+0,j:j+3] = 0.0;
        C[i+1,j:j+3] = 0.0;
        C[i+2,j:j+3] = 0.0;
        C[i+3,j:j+3] = 0.0;
        for (int k = 0; k < N; k++) {
            C[i+0,j:j+3] = C[i+0,j:j+3] + A[i+0,k] * B[k,j:j+3];
            C[i+1,j:j+3] = C[i+1,j:j+3] + A[i+1,k] * B[k,j:j+3];
            C[i+2,j:j+3] = C[i+2,j:j+3] + A[i+2,k] * B[k,j:j+3];
            C[i+3,j:j+3] = C[i+3,j:j+3] + A[i+3,k] * B[k,j:j+3];
        }
    }
}
Case Study
● symmetric multiprocessor
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        C[i,j] = 0.0;
        for (int k = 0; k < N; k++) {
            C[i,j] = C[i,j] + A[i,k] * B[k,j];
        }
    }
}
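Usage note (mine, not from the slides): with GCC or Clang the OpenMP pragma only takes effect when the file is compiled with the -fopenmp flag; without it the pragma is ignored and the loop runs serially.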
Case Study
● unpipelined scalar uniprocessor with a cache: fully associative cache that can hold more than 3L² floats
for (int I = 0; I < N; I = I + L) {
    for (int J = 0; J < N; J = J + L) {
        for (int i = I; i < I + L; i++) {
            for (int j = J; j < J + L; j++) {
                C[i,j] = 0.0;
            }
        }
        for (int K = 0; K < N; K = K + L) {
            /* one L×L block each of C, A, and B is live here: 3L² floats */
            for (int i = I; i < I + L; i++) {
                for (int j = J; j < J + L; j++) {
                    for (int k = K; k < K + L; k++) {
                        C[i,j] = C[i,j] + A[i,k] * B[k,j];
                    }
                }
            }
        }
    }
}
Case Study
● good code for a 4-issue pipelined VLIW with a cache?
● Lesson learned:
● different machine types require different explicit representations of parallelism to guarantee optimal use of the hardware
→ hardware tailoring vs. portability
● original code is the same for all machines. The optimal parallel version can be derived from it through relatively simple program transformations
→ let the compiler do it!
Summing it Up
● John Backus, The History of FORTRAN I, II, and III. ACM SIGPLAN Notices 13(8):165–180, August 1978
“It was our belief that if Fortran, during the first months, were to translate any reasonable “scientific” source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger […] In fact, I believe that we are in a similar, but unrecognized, situation today: in spite of all the fuss that has been made over myriad language details, current conventional languages are still very weak programming aids, and far more powerful languages would be in use today if anyone had found a way to make them run with adequate efficiency.”
Summing it Up
● Trend in modern architectures: shift the burden of achieving high performance from the hardware to the software
● Programming languages and compilers do not keep up with the advances in complex modern architectures
● Programming to achieve optimal performance still requires tricky hand transformations tailored to a specific system.
● explicitly managed memory hierarchies
● loop optimizations
● …
● Most of these hand transformations should really be performed by compilers.
Outlook
● next class: Monday, March 7, 11:00 a.m.
● assignments: none!