Transcript of 4541.775 Topics on Compilers

Page 1

Introduction

Spring 2011

4541.775 Topics on Compilers

Page 2

Today’s lecture

● Introduction to Compiler Challenges for Modern Architectures

● Overview

● Pipelining

● Vector Instructions

● Superscalar and VLIW Processors

● Processor Parallelism

● Memory Hierarchies

● Summing it Up

Page 3

Overview

● Moore’s Law upheld thanks to various kinds of parallelism

[figure omitted] (source: J. Dongarra, Univ. of Tennessee)

Page 4

Pipelining

● Definition: dividing a complex operation into a sequence of independent sub-operations in such a manner that, if the sub-operations use different resources, operations can be overlapped by starting the next operation as soon as its predecessor has completed the first sub-operation.

● several types of pipelining

● pipelined instruction units

● pipelined execution units

● parallel functional units

Page 5

Pipelining

● Pipelined Instruction Units

● as early as 1962 in the IBM 7094

● typical five-stage pipeline

– instruction fetch (IF)

– instruction decode (ID)

– execute (EX)

– memory access (MEM)

– write back (WB)

handles all kinds of typical RISC instructions (ALU, memory, and branch)

● operation latency: 1 cycle

● execution time for n operations: n + (pipeline stages − 1) cycles, i.e., (n + pipeline stages − 1)/n cycles per operation; for large n this approaches the optimal throughput of 1 operation/cycle
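For concreteness (a worked example, not from the slide): with 5 pipeline stages and n = 100 operations, execution takes 100 + 5 − 1 = 104 cycles, i.e., 1.04 cycles per operation, already close to the optimum of 1 operation/cycle.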

[Pipeline diagram: three instructions overlapped across clock cycles]

cycle:  1   2   3   4   5   6   7
op 1:   IF  ID  EX  MEM WB
op 2:       IF  ID  EX  MEM WB
op 3:           IF  ID  EX  MEM WB

Page 6

Pipelining

● Pipelined Execution Units

● complex instructions cannot complete their computation in a single cycle

● representative example: execution stage of a floating-point adder

– fetch operands (FO)

– equate exponents (EE)

– add mantissas (AM)

– normalize result (NR)

● combined with pipelined instruction units

● operation latency: l cycles (l: number of execution pipeline stages)

● execution time for n operations: n + (pipeline stages − 1) cycles, i.e., (n + pipeline stages − 1)/n cycles per operation; for large n the throughput again approaches 1 operation/cycle

[Pipeline diagram: the EX stage is replaced by the FP-adder stages FO EE AM NR]

cycle:  1   2   3   4   5   6   7   8   9   10
op 1:   IF  ID  FO  EE  AM  NR  MEM WB
op 2:       IF  ID  FO  EE  AM  NR  MEM WB
op 3:           IF  ID  FO  EE  AM  NR  MEM WB

Page 7

Pipelining

● Parallel Functional Units

● replicate whole functional units

● fine-grained parallelism

● + operational freedom
  − cost (transistor count, die area, energy, …)
  − complicated coordination

● combining with pipelining possible

[Diagram: a dispatch unit feeding several parallel adders (adder 1, adder 2, …), whose results are collected afterwards]

Page 8

Pipelining

● Compilation Issues with Scalar Pipelines

● pipeline stalls

● a pipeline stall (i.e., the next operation cannot be inserted into the pipeline at the beginning of a new cycle) is caused by one of three hazard conditions (Hennessy and Patterson, Computer Architecture: A Quantitative Approach)

– structural hazard

– data hazard

– control hazard

Page 9

Pipelining

● Compilation Issues with Scalar Pipelines

● structural hazard: H/W restricts overlapping of certain (sub-)operations

● example: pipelined unit with one memory port

→ structural hazard when executing Memory-ALU-ALU-ALU

● cannot be avoided through compiler strategies

[Diagram: four instructions (Memory, ALU, ALU, ALU) in flight in a pipeline whose core memory serves both instruction fetch and data memory access; in the cycle where the Memory instruction reaches MEM, the fourth instruction's IF needs the same memory port]

Page 10

Pipelining

● Compilation Issues with Scalar Pipelines

● data hazard: occurs when input operands are not yet ready

● examples:

– no forwarding

– zero-cycle forwarding (solves the above hazard, but not instruction latency)

– instruction latency (e.g., loads and multi-cycle floating-point operations)

● avoid by good instruction scheduling (a sketch follows the diagrams below)

[Pipeline diagrams for four data-hazard cases]

1) no forwarding:
   sub r3 ← r2, r1
   add r4 ← r3, r5
   (add stalls until sub's WB has written r3)

2) zero-cycle forwarding:
   sub r3 ← r2, r1
   add r4 ← r3, r5
   (sub's EX result is forwarded directly to add's EX: no stall)

3) load latency:
   ld r3 ← mem[r2, r1]
   add r4 ← r3, r5
   (even with forwarding, r3 is only available after MEM: one stall)

4) floating-point latency:
   add_f r3 ← r2, r1
   add_f r4 ← r3, r5
   (the multi-stage execution EE AM NR delays the dependent add_f)
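For example (a hypothetical sketch, not from the slides, of how scheduling hides a load's latency): the compiler moves an independent instruction into the stall cycle between a load and its use:

before:                          after:
ld  r3 ← mem[r2, r1]             ld  r3 ← mem[r2, r1]
add r4 ← r3, r5   (1 stall)      sub r7 ← r6, r8    (independent)
sub r7 ← r6, r8                  add r4 ← r3, r5    (no stall)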

Page 11

Pipelining

● Compilation Issues with Scalar Pipelines

● control hazard: caused by (mispredicted) control transfers

● example: branch target not known until the EX stage of the branch completes

● avoid through (a combination of)

– performing the comparison in the ID stage (1 stall cycle) [H/W]

– branch prediction buffer [H/W]

– S/W branch prediction hints [compiler]

– expose the branch delay slot [compiler]

● principal compiler strategy to avoid stalls: instruction scheduling

[Diagram: beq r1, r2 → .bb5, followed by the instruction at .bb5; its IF cannot start until the branch's EX stage has computed the target]

Page 12

Vector Instructions

● Definition: a single instruction that executes an element-wise operation on two vector quantities in special vector registers or memory.

● Introduced in the 1970s to simplify instruction processing: to keep the pipeline full, complex hardware strategies such as lookahead with out-of-order execution had been developed, but these had become a burden for hardware designers.

● + simple to keep the pipeline full
  − complicated decoder logic to support a large number of instructions
  − increased processor state (H/W cost, context switching)
  − causes problems with traditional memory hierarchy design

Page 13

Vector Instructions

● Compilation Issues with Vector Pipelines

● retaining the program semantics

→ proper data dependence analysis is necessary to decide whether a loop is vectorizable or not

example 1:

for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

vectorization →

C[0:63] = A[0:63] + B[0:63]

code generation →

vload  v1, A
vload  v2, B
vadd   v3, v1, v2
vstore C, v3

correct?

example 2:

for (i=0; i<64; i++)
  A[i+1] = A[i] + B[i];

vectorization →

A[1:64] = A[0:63] + B[0:63]

code generation →

vload  v1, A
vload  v2, B
vadd   v3, v1, v2
vstore A+1, v3

correct?
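A note on the second chain (the answer is not on the slide): the loop carries a flow dependence, since iteration i+1 reads the A[i+1] written by iteration i, whereas the vector code reads all of A[0:63] before writing A[1:64]. A minimal C sketch (hypothetical, for illustration only) contrasting the two semantics:

#include <stdio.h>
#include <string.h>

#define N 64

int main(void) {
    double As[N + 1] = {0}, Av[N + 1] = {0}, B[N], old[N];
    for (int i = 0; i < N; i++) B[i] = 1.0;

    /* scalar loop semantics: iteration i+1 sees the value iteration i wrote */
    for (int i = 0; i < N; i++) As[i + 1] = As[i] + B[i];

    /* vector semantics: all of A[0:63] is read before A[1:64] is written */
    memcpy(old, Av, N * sizeof(double));
    for (int i = 0; i < N; i++) Av[i + 1] = old[i] + B[i];

    /* prints 64 vs 1: the naive vectorization changed the program's meaning */
    printf("scalar: A[%d] = %g, vector: A[%d] = %g\n", N, As[N], N, Av[N]);
    return 0;
}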

Page 14

Superscalar and VLIW Processors

● idea: keep the instruction set design simple, yet keep the pipeline busy by issuing several instructions per cycle

● superscalar processors: hardware looks ahead in the instruction stream and searches for operations that are ready to execute, i.e., have all required inputs ready. Some superscalar processors can even execute instructions out of order.

● VLIW (very long instruction word) processors: the processor executes instructions in bundles. Typically, each instruction in a bundle corresponds to an operation on a different functional unit. The compiler/programmer is expected to bundle instructions correctly, i.e., no instruction executes before all its inputs are available.
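For illustration (a hypothetical 2-issue bundle, written in the instruction notation used on the following slides; not from the source):

{ ld_f r2 ← mem[y]  |  add_f r6 ← r7, r8 }

Both operations issue in the same cycle on different functional units; the bundle is legal only because neither operation depends on a result that is not yet available.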

Page 15

Superscalar and VLIW Processors

● + superscalar/VLIW processors can achieve the speed of vector machines
  − require high bandwidth to memory and large instruction caches
  − stride-one accesses are critical to good performance for cached data

Page 16

Superscalar and VLIW Processors

● Compilation Issues with Superscalar/VLIW Processors

● careful reordering of (machine) instructions is required to exploit all of the available hardware resources. To generate well-performing code from high-level languages the compiler must

i. recognize independent operations → dependence analysis (vectorization)

ii. generate the shortest possible schedule → instruction scheduling

● example: assume all operations have a 2-cycle latency

ld_f  r3 ← mem[x]
ld_f  r2 ← mem[y]
add_f r1 ← r3, r2
st_f  mem[u] ← r1
ld_f  r5 ← mem[x]
add_f r4 ← r5, r1
st_f  mem[v] ← r4

What is the current schedule length? Is there a better schedule? What if the machine is a 2-issue VLIW processor?
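A possible answer (a sketch under assumed rules not stated on the slide: single issue, one memory access per cycle, and a dependent operation can issue two cycles after its producer): in program order the sequence needs 11 issue cycles, because each add_f and st_f waits on the instruction just before it. Hoisting ld_f r5 above the first add_f gives

cycle 1: ld_f  r3 ← mem[x]
cycle 2: ld_f  r2 ← mem[y]
cycle 3: ld_f  r5 ← mem[x]
cycle 4: add_f r1 ← r3, r2
cycle 6: add_f r4 ← r5, r1
cycle 7: st_f  mem[u] ← r1
cycle 8: st_f  mem[v] ← r4

i.e., 8 cycles. On a 2-issue VLIW with two memory ports, the two leading loads can share cycle 1 and st_f mem[u] can pair with the second add_f, giving roughly 7 cycles.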

Page 17

Processor Parallelism

● Definition: processor parallelism reduces the execution time of an application by running the same task, or several tasks, operating on different data sets using multiple processors.

● two main variations:

● synchronous processor parallelism: execute the same thread in lock-step on different parts of the data
  + cheap task creation
  + cheap synchronization
  − does not handle control flow well

● asynchronous processor parallelism: execute different threads/parts of a program simultaneously
  + can handle control flow well
  − expensive synchronization through memory
  − expensive task creation

Page 18

Processor Parallelism

● Compilation Issues with Asynchronous Parallelism

● exploit coarse-grain parallelism, i.e., parallelize whole loop iterations

● Bernstein's conditions: determine whether two loop iterations i and k can safely be executed in parallel or not (I = input set, O = output set, subscript = loop iteration index)

i. I_i ∩ O_k = Ø

ii. I_k ∩ O_i = Ø

iii. O_i ∩ O_k = Ø

● granularity

for (i=0; i<N; i++) {
  A[i+1] = A[i] + B[i];
}

for (i=0; i<N; i++) {
  A[i-1] = A[i] + B[i];
}

for (i=0; i<N; i++) {
  S = A[i] + B[i];
}
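Applying the conditions to the three loops above (analysis not on the slide): in the first two, one iteration writes an element of A that a neighboring iteration reads (I_i ∩ O_k ≠ Ø), and in the third every iteration writes S (O_i ∩ O_k ≠ Ø), so none of them passes all three tests as written. A loop whose iterations touch disjoint data does pass and can be parallelized; a minimal C/OpenMP sketch with hypothetical names:

#include <omp.h>

void vec_add(int n, const double *A, const double *B, double *C) {
    /* iteration i reads A[i], B[i] and writes only C[i], so for i != k
       I_i ∩ O_k = Ø and O_i ∩ O_k = Ø: Bernstein's conditions hold and
       the iterations may execute in parallel */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}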

Page 19

Memory Hierarchies

● Measures of a Memory System

● latency: number of cycles required to fetch a single element from the memory

● bandwidth: number of data elements the memory can deliver in each cycle

● Latency avoidance vs. tolerance

● latency avoidance: reduce the memory latencies incurred in a computation

→ memory hierarchies

● latency tolerance: do something else while waiting for the data to arrive → prefetching, non-blocking loads, Cray/Tera MTA
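As a small illustration of latency tolerance via prefetching (a sketch using the GCC/Clang builtin __builtin_prefetch; the distance of 16 elements is an arbitrary assumption that would be tuned to the actual memory latency):

void sum_with_prefetch(int n, double *A, const double *B) {
    for (int i = 0; i < n; i++) {
        /* request B[i+16] now so it is (hopefully) in the cache by the
           time it is needed; 0 = read access, 1 = low temporal locality */
        if (i + 16 < n)
            __builtin_prefetch(&B[i + 16], 0, 1);
        A[i] = A[i] + B[i];
    }
}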

Page 20

Memory Hierarchies

● Compilation Issues with Memory Hierarchies

● efficiency of the code depends on the problem size and the cache size

● as long as the data fits in the data cache, high performance is achieved

● as soon as the problem size exceeds the size of the cache, extensive thrashing may occur

thrashing occurs for an LRU cache with M > cache size.

Basic idea: process data in chunks, e.g., strips:

● machine­specific optimization

for (i=0; i<N; i++) {
  for (j=0; j<M; j++) {
    A[i] = A[i] + B[j];
  }
}

for (k=0; k<M; k=k+L) {
  for (i=0; i<N; i++) {
    for (j=k; j<k+L; j++) {
      A[i] = A[i] + B[j];
    }
  }
}

Page 21

Case Study

● Case Study: Matrix Multiplication

● A: m×p matrix, B: p×n matrix, C: m×n matrix

[Figure: C = A × B]

C_{i,j} = ∑_{k=1}^{p} A_{i,k} × B_{k,j}

Page 22

Case Study

● for simplicity, m=n=N

● straightforward C implementation: optimal on a scalar, non-pipelined machine with no cache

for (int i=0; i<N; i++) {
  for (int j=0; j<N; j++) {
    C[i,j] = 0.0;
    for (int k=0; k<N; k++) {
      C[i,j] = C[i,j] + A[i,k] * B[k,j];
    }
  }
}

Page 23

Case Study

● pipelined floating-point unit: multiply-adder with four pipeline stages (the j loop is unrolled by four, giving four independent recurrences that keep the pipeline full)

for (int i=0; i<N; i++) {
  for (int j=0; j<N; j=j+4) {
    C[i,j+0] = 0.0;
    C[i,j+1] = 0.0;
    C[i,j+2] = 0.0;
    C[i,j+3] = 0.0;
    for (int k=0; k<N; k++) {
      C[i,j+0] = C[i,j+0] + A[i,k] * B[k,j+0];
      C[i,j+1] = C[i,j+1] + A[i,k] * B[k,j+1];
      C[i,j+2] = C[i,j+2] + A[i,k] * B[k,j+2];
      C[i,j+3] = C[i,j+3] + A[i,k] * B[k,j+3];
    }
  }
}

Page 24

Case Study

● vector machine with 32-element vector registers

for (int i=0; i<N; i++) {
  for (int j=0; j<N; j=j+32) {
    C[i,j:j+31] = 0.0;
    for (int k=0; k<N; k++) {
      C[i,j:j+31] = C[i,j:j+31] + A[i,k] * B[k,j:j+31];
    }
  }
}

Page 25

Case Study

● 4-issue pipelined VLIW: four floating-point multiply-adders, each with four pipeline stages

for (int i=0; i<N; i=i+4) {
  for (int j=0; j<N; j=j+4) {
    C[i+0,j:j+3] = 0.0;
    C[i+1,j:j+3] = 0.0;
    C[i+2,j:j+3] = 0.0;
    C[i+3,j:j+3] = 0.0;
    for (int k=0; k<N; k++) {
      C[i+0,j:j+3] = C[i+0,j:j+3] + A[i+0,k] * B[k,j:j+3];
      C[i+1,j:j+3] = C[i+1,j:j+3] + A[i+1,k] * B[k,j:j+3];
      C[i+2,j:j+3] = C[i+2,j:j+3] + A[i+2,k] * B[k,j:j+3];
      C[i+3,j:j+3] = C[i+3,j:j+3] + A[i+3,k] * B[k,j:j+3];
    }
  }
}

Page 26

Case Study

● symmetric multiprocessor 

#pragma omp parallel for
for (int i=0; i<N; i++) {
  for (int j=0; j<N; j++) {
    C[i,j] = 0.0;
    for (int k=0; k<N; k++) {
      C[i,j] = C[i,j] + A[i,k] * B[k,j];
    }
  }
}

Page 27

Case Study

● unpipelined scalar uniprocessor with a cache: fully-associative cache that can hold more than 3L² floats (one L×L block each of A, B, and C)

for (int I=0; I<N; I=I+L) {
  for (int J=0; J<N; J=J+L) {
    for (int i=I; i<I+L; i++) {
      for (int j=J; j<J+L; j++) {
        C[i,j] = 0.0;
      }
    }
    for (int K=0; K<N; K=K+L) {
      for (int i=I; i<I+L; i++) {
        for (int j=J; j<J+L; j++) {
          for (int k=K; k<K+L; k++) {
            C[i,j] = C[i,j] + A[i,k] * B[k,j];
          }
        }
      }
    }
  }
}

Page 28

Case Study

● good code for a 4-issue pipelined VLIW with a cache?

● Lesson learned:

● different machine types require different explicit representations of parallelism to guarantee optimal use of the hardware

→ hardware tailoring, portability

● the original code is the same for all machines; the optimal parallel version can be derived from it through relatively simple program transformations

 → let the compiler do it!

Page 29

Summing it Up

● John Backus, The history of FORTRAN I, II, and III. ACM SIGPLAN Notices 13(8):165–180, August 1978

“It was our belief that if Fortran, during the first months, were to translate any reasonable ‘scientific’ source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger […]

In fact, I believe that we are in a similar, but unrecognized, situation today: in spite of all the fuss that has been made over myriad language details, current conventional languages are still very weak programming aids, and far more powerful languages would be in use today if anyone had found a way to make them run with adequate efficiency.”

Page 30

Summing it Up

● Trend in modern architectures: shift the burden of achieving high performance from the hardware to the software

● Programming languages and compilers do not keep up with the advances in complex modern architectures

● Programming to achieve optimal performance still requires tricky hand transformations tailored to a specific system.

● explicitly managed memory hierarchies

● loop optimizations

● …

● Most of these hand transformations should really be performed by compilers.

Page 31

Outlook

● next class: Monday, March 7, 11:00 a.m.

● assignments: none!

Page 32

Case Study: Matrix Multiplication