Introduction
Spring 2011
4541.775 Topics on Compilers
Today’s lecture
● Introduction to Compiler Challenges for Modern Architectures
● Overview
● Pipelining
● Vector Instructions
● Superscalar and VLIW Processors
● Processor Parallelism
● Memory Hierarchies
● Summing it Up
Overview
● Moore’s Law upheld thanks to various kinds of parallelism
(source: J. Dongarra, Univ. of Tennessee)
Pipelining
● Definition: dividing a complex operation into a sequence of independent sub-operations such that, if the sub-operations use different resources, operations can be overlapped: the next operation starts as soon as its predecessor has completed the first sub-operation.
● several types of pipelining
● pipelined instruction units
● pipelined execution units
● parallel function units
Pipelining
● Pipelined Instruction Units
● as early as 1962 in the IBM 7094
● typical five-stage pipeline
– instruction fetch (IF)
– instruction decode (ID)
– execute (EX)
– memory access (MEM)
– write back (WB)
handles all kinds of typical RISC instructions (ALU, memory, and branch)
● operation latency: 1 cycle
● throughput: n operations take n + (pipeline stages − 1) cycles, i.e. (n + pipeline stages − 1)/n cycles per operation (n: number of operations); optimal throughput for large n: 1 operation/cycle (worked example below)
[Diagram: three consecutive instructions flowing through the IF–ID–EX–MEM–WB pipeline, each starting one clock cycle after its predecessor]
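A worked example (not on the slide, following the formula above): for n = 100 instructions on the five-stage pipeline, execution takes 100 + 5 − 1 = 104 cycles, i.e. 104/100 = 1.04 cycles per operation, a throughput of roughly 0.96 operations/cycle, which approaches the optimum of 1 operation/cycle as n grows.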
Pipelining
● Pipelined Execution Units
● complex instructions cannot perform the computation in one cycle only
● representative example: execution stage of a floating-point adder
– fetch operands (FO)
– equate exponents (EE)
– add mantissas (AM)
– normalize result (NR)
● combined with pipelined instruction units
● operation latency: l cycles (l: depth of the execution pipeline)
● throughput: n operations take n + (pipeline stages − 1) cycles, i.e. (n + pipeline stages − 1)/n cycles per operation (n: number of operations); optimal throughput for large n: again 1 operation/cycle
[Diagram: the EX stage replaced by the floating-point add sub-stages FO–EE–AM–NR; three overlapped instructions flow through the combined IF ID FO EE AM NR MEM WB pipeline]
Pipelining
● Parallel Functional Units
● replicate whole functional units
● fine-grained parallelism
+ operational freedom
− cost (transistor count, die area, energy, …)
− complicated coordination
● combining with pipelining possible
[Diagram: a dispatch unit feeding several parallel adders, whose results are collected]
Pipelining
● Compilation Issues with Scalar Pipelines
● pipeline stalls
● a pipeline stall (i.e., the next operation cannot be inserted into the pipeline at the beginning of a new cycle) is caused by one of three hazard conditions (Hennessy and Patterson, Computer Architecture: A Quantitative Approach)
– structural hazard
– data hazard
– control hazard
Pipelining
● Compilation Issues with Scalar Pipelines
● structural hazard: H/W restricts overlapping of certain (sub-)operations
● example: pipelined unit with one memory port
→ structural hazard when executing the sequence Memory, ALU, ALU, ALU
● cannot be avoided through compiler strategies
[Diagram: four instructions (Memory, ALU, ALU, ALU) in the five-stage pipeline; with a single port to core memory, the fourth instruction's instruction fetch conflicts with the first instruction's data memory access in its MEM stage]
Pipelining
● Compilation Issues with Scalar Pipelines
● data hazard: occurs when input operands are not ready yet
● examples (instruction pairs listed below):
– no forwarding
– zero-cycle forwarding solves the above hazard, but not the load-use case
– instruction latency
● avoid by good instruction scheduling (see the sketch after the examples)
[Diagrams: pipeline timing for three dependent instruction pairs]
– sub r3 ← r2, r1 ; add r4 ← r3, r5 (stalls without forwarding; zero-cycle forwarding removes the stall)
– ld r3 ← mem[r2, r1] ; add r4 ← r3, r5 (load-use latency: even with forwarding, the add must wait for the MEM stage)
– add_f r3 ← r2, r1 ; add_f r4 ← r3, r5 (multi-cycle execution latency of the FO–EE–AM–NR adder)
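A minimal instruction-scheduling sketch (my example, not from the slide): the compiler moves an operation that does not depend on the load into the load delay, so the pipeline does useful work instead of stalling. The or instruction is a hypothetical independent operation.

before scheduling:
ld  r3 ← mem[r2, r1]
add r4 ← r3, r5        (stalls waiting for r3)
or  r6 ← r7, r8

after scheduling:
ld  r3 ← mem[r2, r1]
or  r6 ← r7, r8        (independent operation fills the load delay)
add r4 ← r3, r5        (r3 is available by now)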
Pipelining
● Compilation Issues with Scalar Pipelines
● control hazard: caused by (mispredicted) control transfers
● example: the branch target is not known until the EX stage of the branch completes
● avoid through (a combination of)
– performing the comparison in the ID stage (1 stall cycle) [H/W]
– branch prediction buffer [H/W]
– S/W branch prediction hints [compiler] (see the sketch below)
– expose the branch delay slot [compiler]
● principal compiler strategy to avoid stalls: instruction scheduling
[Diagram: pipeline timing for the conditional branch beq r1, r2 → .bb5; the instruction at .bb5 cannot be fetched until the branch outcome is known]
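A minimal sketch of a software branch prediction hint (my example, GCC/Clang specific, not from the slide): __builtin_expect tells the compiler which outcome of a condition is likely, so it can lay out and schedule the likely path first.

#include <stddef.h>

/* Sum the non-negative elements; the negative case is hinted as rare. */
double sum_nonnegative(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (__builtin_expect(a[i] < 0.0, 0))   /* hint: rarely taken */
            continue;
        s += a[i];
    }
    return s;
}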
Vector Instructions
● Definition: a single instruction that executes an element-wise operation on two vector quantities held in special vector registers or in memory.
● Introduced in the 1970s to simplify instruction processing: to keep the pipeline full, complex hardware strategies such as lookahead with out-of-order execution had been developed, but these had become a burden for hardware designers.
+ simple to keep the pipeline full
− complicated decoder logic to support a large number of instructions
− increased processor state (H/W cost, context switching)
− causes problems with traditional memory hierarchy design
Vector Instructions
● Compilation Issues with Vector Pipelines
● retaining the program semantics
→ proper data dependence analysis is necessary to decide whether a loop is vectorizable or not
Example 1:

for (i = 0; i < 64; i++)
    C[i] = A[i] + B[i];

→ vectorization → C[0:63] = A[0:63] + B[0:63]
→ code generation →

vload  v1, A
vload  v2, B
vadd   v3, v1, v2
vstore C, v3

correct?

Example 2:

for (i = 0; i < 64; i++)
    A[i+1] = A[i] + B[i];

→ vectorization → A[1:64] = A[0:63] + B[0:63]
→ code generation →

vload  v1, A
vload  v2, B
vadd   v3, v1, v2
vstore A+1, v3

correct? (see the check below)
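A minimal check (my example, not from the slides): the second loop is a recurrence, so each iteration reads a value written by the previous one, while the vector statement reads all of A[0:63] before writing any element of A[1:64]. The small C program below simulates both semantics and shows they differ, so the naive vectorization of Example 2 is incorrect.

#include <stdio.h>
#define N 64

int main(void) {
    double As[N + 1], Av[N + 1], old[N], B[N];
    for (int i = 0; i <= N; i++) As[i] = Av[i] = 1.0;
    for (int i = 0; i < N; i++)  B[i] = 1.0;

    /* serial loop semantics: each iteration uses the A[i] written
       by the previous iteration */
    for (int i = 0; i < N; i++)
        As[i + 1] = As[i] + B[i];

    /* vector statement semantics: all of A[0:63] is read before any
       element of A[1:64] is written */
    for (int i = 0; i < N; i++) old[i] = Av[i];
    for (int i = 0; i < N; i++) Av[i + 1] = old[i] + B[i];

    printf("serial A[64] = %g, vector A[64] = %g\n", As[N], Av[N]);
    /* prints: serial A[64] = 65, vector A[64] = 2 */
    return 0;
}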
Superscalar and VLIW Processors
● idea: keep the instruction set design simple, yet keep the pipeline busy by issuing several instructions per cycle
● superscalar processors: hardware looks ahead in the instruction stream and searches for operations that are ready to execute, i.e., have all required inputs ready. Some superscalar processors can even execute instructions out of order.
● VLIW (very long instruction word) processors: the processor executes instructions in bundles. Typically, each instruction in a bundle corresponds to an operation on a different functional unit. The compiler/programmer is expected to bundle instructions correctly, i.e., no instruction executes before all its inputs are available.
Superscalar and VLIW Processors
+ superscalar/VLIW processors can achieve the speed of vector machines
− require high bandwidth to memory and large instruction caches
− stride-one accesses are critical to good performance for cached data
Superscalar and VLIW Processors
● Compilation Issues with Superscalar/VLIW Processors
● careful reordering of (machine) instructions is required to exploit all of the available hardware resources; to generate well-performing code from high-level languages the compiler must
i. recognize independent operations → dependence analysis (vectorization)
ii. generate the shortest possible schedule → instruction scheduling
● example: assume all operations have a 2-cycle latency
ld_f  r3 ← mem[x]
ld_f  r2 ← mem[y]
add_f r1 ← r3, r2
st_f  mem[u] ← r1
ld_f  r5 ← mem[x]
add_f r4 ← r5, r1
st_f  mem[v] ← r4
What is the current schedule length? Is there a better schedule? What if the machine is a 2-issue VLIW processor?
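One possible answer sketch (mine, not from the slides), assuming a single-issue pipeline in which a result becomes available two cycles after its producer issues and a store needs its data when it issues: in the original order every dependent instruction waits on the instruction directly before it, so four stall cycles are inserted. Hoisting the independent load of r5 removes two of them:

ld_f  r3 ← mem[x]
ld_f  r2 ← mem[y]
ld_f  r5 ← mem[x]      (independent load moved up)
add_f r1 ← r3, r2
st_f  mem[u] ← r1      (still waits one cycle for r1)
add_f r4 ← r5, r1
st_f  mem[v] ← r4      (still waits one cycle for r4)

On a 2-issue VLIW, under the same assumptions, independent operations such as the two loads of x and y could additionally be bundled into one issue slot, shortening the schedule further.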
Processor Parallelism
● Definition: processor parallelism reduces the execution time of an application by running the same or several tasks operating on different data sets using multiple processors.
● two main variations:
● synchronous processor parallelism: execute the same thread in lockstep on different parts of the data
+ cheap task creation
+ cheap synchronization
− does not handle control flow well
● asynchronous processor parallelism: execute different threads/parts of a program simultaneously
+ can handle control flow well
− expensive synchronization through memory
− expensive task creation
Processor Parallelism
● Compilation Issues with Asynchronous Parallelism
● exploit coarse-grain parallelism, i.e., parallelize whole loop iterations
● Bernstein’s conditions: determine whether two loop iterations can safely be executed in parallel. I = input set, O = output set; subscripts denote loop iterations:
i. I_i ∩ O_k = Ø
ii. I_k ∩ O_i = Ø
iii. O_i ∩ O_k = Ø
● granularity

Examples (see the worked reading below):

for (i = 0; i < N; i++) {
    A[i+1] = A[i] + B[i];
}

for (i = 0; i < N; i++) {
    A[i-1] = A[i] + B[i];
}

for (i = 0; i < N; i++) {
    S = A[i] + B[i];
}
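A worked reading of the examples (my analysis, not from the slide), comparing the sets of two consecutive iterations:
– A[i+1] = A[i] + B[i]: iteration i writes A[i+1] and iteration i+1 reads it, so an input set and an output set overlap; a flow dependence is carried across iterations and they cannot safely run in parallel.
– A[i-1] = A[i] + B[i]: iteration i reads A[i] and iteration i+1 writes it, so the conditions are again violated (an anti-dependence); running the iterations in parallel as written is unsafe.
– S = A[i] + B[i]: every iteration writes S, so the output sets overlap (condition iii); as written the iterations conflict, although S could be privatized per iteration.
Granularity matters because each iteration here performs only a few operations, so the cost of creating and synchronizing parallel tasks can easily outweigh the useful work.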
Memory Hierarchies
● Measures of a Memory System
● latency: number of cycles required to fetch a single element from the memory
● bandwidth: number of data elements the memory can deliver in each cycle
● Latency avoidance vs. tolerance
● latency avoidance: reduce the memory latencies incurred in a computation
→ memory hierarchies
● latency tolerance: do something else while waiting for the data to arrive
→ prefetching, non-blocking loads, Cray/Tera MTA (see the sketch below)
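A minimal latency-tolerance sketch (my example, GCC/Clang specific, not from the slide): __builtin_prefetch asks the hardware to start fetching data that will be needed a few iterations later, so the memory accesses overlap with useful work. The prefetch distance of 16 is an arbitrary assumption.

#include <stddef.h>

/* Dot product with software prefetching of both input streams. */
double dot(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n) {                  /* assumed prefetch distance */
            __builtin_prefetch(&a[i + 16]);
            __builtin_prefetch(&b[i + 16]);
        }
        s += a[i] * b[i];
    }
    return s;
}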
Memory Hierarchies
● Compilation Issues with Memory Hierarchies
● efficiency of the code depends on the problem size and the cache size
● as long as the data fits in the data cache, high performance is achieved
● as soon as the problem size exceeds the size of the cache, extensive thrashing may occur
For an LRU cache, thrashing occurs in the loop below as soon as M exceeds the cache size:

for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        A[i] = A[i] + B[j];
    }
}

Basic idea: process the data in chunks, e.g., strips of length L:

for (k = 0; k < M; k = k + L) {
    for (i = 0; i < N; i++) {
        for (j = k; j < k + L; j++) {   /* one strip of B that fits in the cache */
            A[i] = A[i] + B[j];
        }
    }
}

● machine-specific optimization (the strip length L depends on the cache size)
Case Study
● Case Study: Matrix Multiplication
● A: m×p matrix, B: p×n matrix, C: m×n matrix
C = A × B, with

C_{i,j} = \sum_{k=1}^{p} A_{i,k} \times B_{k,j}
Case Study
● for simplicity, m = n = p = N
● straightforward C implementation: optimal on a scalar, non-pipelined machine with no cache
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        C[i,j] = 0.0;
        for (int k = 0; k < N; k++) {
            C[i,j] = C[i,j] + A[i,k] * B[k,j];
        }
    }
}
Case Study
● pipelined floating-point unit: multiply-adder with four pipeline stages
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j = j + 4) {
        C[i,j+0] = 0.0;
        C[i,j+1] = 0.0;
        C[i,j+2] = 0.0;
        C[i,j+3] = 0.0;
        for (int k = 0; k < N; k++) {
            /* four independent accumulations keep the 4-stage multiply-adder pipeline full */
            C[i,j+0] = C[i,j+0] + A[i,k] * B[k,j+0];
            C[i,j+1] = C[i,j+1] + A[i,k] * B[k,j+1];
            C[i,j+2] = C[i,j+2] + A[i,k] * B[k,j+2];
            C[i,j+3] = C[i,j+3] + A[i,k] * B[k,j+3];
        }
    }
}
Case Study
● vector machine: 32-element vector registers
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j = j + 32) {
        C[i,j:j+31] = 0.0;
        for (int k = 0; k < N; k++) {
            C[i,j:j+31] = C[i,j:j+31] + A[i,k] * B[k,j:j+31];
        }
    }
}
Case Study
● 4-issue pipelined VLIW: four floating-point multiply-adders, each with four pipeline stages
for (int i = 0; i < N; i = i + 4) {
    for (int j = 0; j < N; j = j + 4) {
        C[i+0,j:j+3] = 0.0;
        C[i+1,j:j+3] = 0.0;
        C[i+2,j:j+3] = 0.0;
        C[i+3,j:j+3] = 0.0;
        for (int k = 0; k < N; k++) {
            C[i+0,j:j+3] = C[i+0,j:j+3] + A[i+0,k] * B[k,j:j+3];
            C[i+1,j:j+3] = C[i+1,j:j+3] + A[i+1,k] * B[k,j:j+3];
            C[i+2,j:j+3] = C[i+2,j:j+3] + A[i+2,k] * B[k,j:j+3];
            C[i+3,j:j+3] = C[i+3,j:j+3] + A[i+3,k] * B[k,j:j+3];
        }
    }
}
Case Study
● symmetric multiprocessor
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        C[i,j] = 0.0;
        for (int k = 0; k < N; k++) {
            C[i,j] = C[i,j] + A[i,k] * B[k,j];
        }
    }
}
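Usage note (mine, not from the slides): with GCC or Clang the OpenMP pragma only takes effect when the file is compiled with the -fopenmp flag; without it the pragma is ignored and the loop runs serially.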
Case Study
● unpipelined scalar uniprocessor with a cache: fully associative cache that can hold more than 3L² floats
for (int I = 0; I < N; I = I + L) {
    for (int J = 0; J < N; J = J + L) {
        for (int i = I; i < I + L; i++) {
            for (int j = J; j < J + L; j++) {
                C[i,j] = 0.0;
            }
        }
        for (int K = 0; K < N; K = K + L) {
            /* one L×L block each of C, A, and B is live here: 3L² floats */
            for (int i = I; i < I + L; i++) {
                for (int j = J; j < J + L; j++) {
                    for (int k = K; k < K + L; k++) {
                        C[i,j] = C[i,j] + A[i,k] * B[k,j];
                    }
                }
            }
        }
    }
}
Case Study
● good code for a 4-issue pipelined VLIW with a cache?
● Lesson learned:
● different machine types require different explicit representations of parallelism to guarantee optimal use of the hardware
→ hardware tailoring vs. portability
● original code is the same for all machines. The optimal parallel version can be derived from it through relatively simple program transformations
→ let the compiler do it!
Summing it Up
● John Backus, The History of FORTRAN I, II, and III. ACM SIGPLAN Notices 13(8):165–180, August 1978
“It was our belief that if Fortran, during the first months, were to translate any reasonable “scientific” source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger […] In fact, I believe that we are in a similar, but unrecognized, situation today: in spite of all the fuss that has been made over myriad language details, current conventional languages are still very weak programming aids, and far more powerful languages would be in use today if anyone had found a way to make them run with adequate efficiency.”
Summing it Up
● Trend in modern architectures: shift the burden of achieving high performance from the hardware to the software
● Programming languages and compilers do not keep up with the advances in complex modern architectures
● Programming to achieve optimal performance still requires tricky hand transformations tailored to a specific system.
● explicitly managed memory hierarchies
● loop optimizations
● …
● Most of these hand transformations should really be performed by compilers.
Outlook
● next class: Monday, March 7, 11:00 a.m.
● assignments: none!