Embedded Systems in Silicon TD5102 Compilers with emphasis on ILP compilation


Henk Corporaal
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems

Technical University Eindhoven

DTI / NUS Singapore

2005/2006

H.C. TD 5102 2

Compiling for ILP Architectures

Overview:

• Motivation and Goals

• Measuring and exploiting available parallelism

• Compiler basics

• Scheduling for ILP architectures

• Summary and Conclusions


Motivation

• Performance requirements increase
• Applications may contain much instruction-level parallelism
• Processors offer lots of hardware concurrency

Problem to be solved:
– how to exploit this concurrency automatically?


Goals of code generation

• High speedup
– Exploit all the hardware concurrency
– Extract all application parallelism
    • obey true dependencies only
    • resolve false dependencies by renaming
• No code rewriting: automatic parallelization
– However: application tuning may be required
• Limit code expansion


Measuring and exploiting available parallelism

• How to measure parallelism within applications?
– Using existing compiler
– Using trace analysis
    • Track all the real data dependencies (RaWs) of instructions from issue window
        – register dependence
        – memory dependence
    • Check for correct branch prediction
        – if prediction correct continue
        – if wrong, flush schedule and start in next cycle


Trace analysis

Program

For i := 0..2

A[i] := i;

S := X+3;

Compiled code

set r1,0

set r2,3

set r3,&A

Loop: st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

add r1,r5,3

Execution trace

set r1,0

set r2,3

set r3,&A

st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

add r1,r5,3

How parallel can this code be executed?


Trace analysis

Parallel Trace

set r1,0 set r2,3 set r3,&A

st r1,0(r3) add r1,r1,1 add r3,r3,4

st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop

st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop

brne r1,r2,Loop

add r1,r5,3

The serial trace has 16 instructions; the parallel trace needs 6 cycles.

Max ILP = Speedup = $L_{serial} / L_{parallel} = 16 / 6 \approx 2.7$
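The parallel trace can be reproduced with a small trace analyzer. The following Python sketch is illustrative only: it tracks register RaW dependences, ignores WAR/WAW hazards (renaming) and memory dependences, and assumes that every branch is predicted correctly except the final loop exit, after which the schedule restarts in the next cycle. The tuple layout and the function names are made up for this example.

```python
def build_trace():
    """Execution trace: (text, destination register, source registers, mispredicted)."""
    trace = [
        ("set r1,0", "r1", [], False),
        ("set r2,3", "r2", [], False),
        ("set r3,&A", "r3", [], False),
    ]
    for iteration in range(3):
        last = iteration == 2  # assumption: only the loop exit is mispredicted
        trace += [
            ("st r1,0(r3)", None, ["r1", "r3"], False),
            ("add r1,r1,1", "r1", ["r1"], False),
            ("add r3,r3,4", "r3", ["r3"], False),
            ("brne r1,r2,Loop", None, ["r1", "r2"], last),
        ]
    trace.append(("add r1,r5,3", "r1", ["r5"], False))
    return trace


def schedule(trace):
    """Assign each instruction the earliest cycle allowed by RaW dependences."""
    produced = {}  # register -> cycle in which its newest value is produced
    barrier = 1    # earliest cycle allowed after the last mispredicted branch
    cycles = []
    for _text, dest, sources, mispredicted in trace:
        cycle = max([barrier] + [produced.get(reg, 0) + 1 for reg in sources])
        cycles.append(cycle)
        if dest is not None:
            produced[dest] = cycle  # renaming: WAR and WAW hazards are ignored
        if mispredicted:
            barrier = cycle + 1     # flush: later instructions start next cycle
    return cycles


trace = build_trace()
cycles = schedule(trace)
ilp = len(trace) / max(cycles)
print(len(trace), max(cycles), round(ilp, 1))  # prints: 16 6 2.7
```

Under these assumptions the per-cycle instruction counts are 3, 3, 4, 4, 1, 1, which matches the rows of the parallel trace.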

Ideal Processor

Assumptions for ideal/perfect processor:

1. Register renaming – infinite number of virtual registers => all register WAW & WAR hazards avoided

2. Branch and Jump prediction – Perfect => all program instructions available for execution

3. Memory-address alias analysis – addresses are known. A store can be moved before a load provided the addresses are not equal

Also: – unlimited number of instructions issued/cycle (unlimited resources), and

– unlimited instruction window

– perfect caches

– 1 cycle latency for all instructions (including FP *, /)

Programs were compiled using MIPS compiler with maximum optimization level


Upper Limit to ILP: Ideal Processor

[Figure: bar chart of instruction issues per cycle (IPC) per program]

Program     IPC
gcc         54.8
espresso    62.6
li          17.9
fpppp       75.2
doduc       118.7
tomcatv     150.1

Integer: 18 - 60    FP: 75 - 150


Different effects reduce the exploitable parallelism

• Reducing window size
– i.e., the number of instructions to choose from
• Non-perfect branch prediction
– perfect (oracle model)
– dynamic predictor (e.g. 2 bit prediction table with finite number of entries)
– static prediction (using profiling)
– no prediction
• Restricted number of registers for renaming
– typical superscalars have O(100) registers
• Restricted number of other resources, like FUs

Different effects reduce the exploitable parallelism

• Non-perfect alias analysis (memory disambiguation)
Models to use:
– perfect
– inspection: no dependence in following cases:

    r1 := 0(r9)         r1 := 0(fp)
    4(r9) := r2         0(gp) := r2

  A more advanced analysis may disambiguate most stack and global references, but not the heap references
– none
• Important:
– good branch prediction, 128 registers for renaming, alias analysis on stack and global accesses, and for floating-point code a large window size

Summary

• Amount of parallelism is limited
– higher in Multi-Media
– higher in kernels
• Trace analysis detects all types of parallelism
– task, data and operation types
• Detected parallelism depends on
– quality of compiler
– hardware
– source-code transformations

Overview

• Motivation and Goals
• Measuring and exploiting available parallelism
• Compiler basics
• Scheduling for ILP architectures
• Source level transformations
• Compilation frameworks
• Summary and Conclusions

Compiler basics

• Overview
– Compiler trajectory / structure / passes
– Abstract Syntax Tree (AST)
– Control Flow Graph (CFG)
– Data Dependence Graph (DDG)
– Basic optimizations
– Register allocation
– Code selection

Compiler basics: trajectory

Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program

Other labels in the figure: Error messages, Library code

Compiler basics: structure / passes

Passes and their tasks:
• Lexical analyzer: token generation
• Parsing: check syntax, check semantics, parse tree generation
• Code optimization: data flow analysis, local optimizations, global optimizations
• Code generation: code selection, peephole optimizations
• Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
• Scheduling and allocation: exploiting ILP

Code forms named in the figure: Source code, Intermediate code, Sequential code, Object code

Compiler basics: structure
Simple compilation example

    position := initial + rate * 60

Lexical analyzer:

    id1 := id2 + id3 * 60

Syntax analyzer:

    [Figure: syntax tree with root :=, left child id1, right child + with operands id2 and * (id3, 60)]

Intermediate code generator:

    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3

Code optimizer:

    temp1 := id3 * 60.0
    id1 := id2 + temp1

Code generator:

    movf id3, r2
    mulf #60, r2, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1


Compiler basics: structure - SUIF-1 toolkit example

pre-processing

C front-end

converting non-standard structures to SUIF

constant propagation

forward propagation

induction variable identification

scalar privatization analysis

reduction analysis

locality optimization and parallelism analysis

parallel code generation

FORTRAN specific transformations

SUIF to text SUIF to postscript SUIF to C

SUIF text postscript C

FORTRAN to C

FORTRAN C

high-SUIF to low-SUIF

constant propagation

strength reduction

dead-code elimination

register allocation

assembly code generation

assembly code

Compiler basics: Abstract Syntax Tree (AST)

C input code:

if (a > b) { r = a % b; } else { r = b % a; }

Parse tree: ‘infinite’ nesting:

Stat IF
    Cmp >
        Var a
        Var b
    Statlist
        Stat Expr Assign
            Var r
            Binop %
                Var a
                Var b
    Statlist
        Stat Expr Assign
            Var r
            Binop %
                Var b
                Var a

Compiler basics: Control flow graph (CFG)

C input code:

if (a > b) { r = a % b; } else { r = b % a; }

CFG:

1:  sub t1, a, b
    bgz t1, 2, 3

2:  rem r, a, b
    goto 4

3:  rem r, b, a
    goto 4

4:  …………..
    …………..

A program is a collection of functions, each function is a collection of basic blocks, each basic block (BB) contains a set of instructions, and each instruction consists of several transports, ...

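Data Dependence Graph (DDG)

a := b + 15;
c := 3.14 * d;
e := c / f;

Translation to DDG:

[Figure: DDG with three chains. ld(&b) and 15 feed +, whose result goes to st(&a); ld(&d) and 3.14 feed *, whose result goes to st(&c) and to /; ld(&f) also feeds /, whose result goes to st(&e)]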

Compiler basics: Basic optimizations

• Machine independent optimizations
• Machine dependent optimizations

(details are in any good compiler book)

Machine independent optimizations

– Common subexpression elimination
– Constant folding
– Copy propagation
– Dead-code elimination
– Induction variable elimination
– Strength reduction
– Algebraic identities
    • Commutative expressions
    • Associativity: Tree height reduction
        – Note: not always allowed (due to limited precision)
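The precision caveat for tree height reduction can be seen with four floating-point terms. The Python sketch below is an illustration with arbitrarily chosen values: the left-to-right chain has height 3, the balanced tree has height 2, and the two give different results in IEEE double precision although they are algebraically equal.

```python
# Values chosen (as an assumption for this illustration) so that rounding matters.
a, b, c, d = 1e16, 1.0, -1e16, 1.0

chain = ((a + b) + c) + d      # height 3: three dependent additions
balanced = (a + b) + (c + d)   # height 2: the two inner additions are independent

print(chain, balanced)         # the results differ because of rounding

# With integers the reassociation is exact, so the transformation is safe.
ia, ib, ic, id_ = 10**16, 1, -10**16, 1
assert ((ia + ib) + ic) + id_ == (ia + ib) + (ic + id_)
```

The balanced form shortens the critical path, which is what an ILP scheduler wants, but a compiler may only apply it to floating-point code when the language rules or an explicit option permit the change in rounding.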

Machine dependent optimization example

What’s the optimal implementation of a*34 ?
– Use multiplier: mul Tb,Ta,34
    • Pro: No thinking required
    • Con: May take many cycles
– Alternative:
    SHL Tc, Ta, 1
    ADD Tb, Tc, Tzero
    SHL Tc, Tc, 4
    ADD Tb, Tb, Tc
    • Pro: May take fewer cycles
    • Cons:
        – Uses more registers
        – Additional instructions (I-cache load / code size)
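The shift-and-add sequence works because 34 = 2 + 32. A short Python check, mirroring the four instructions one by one (the function name is made up for this sketch):

```python
def mul34_shift_add(a):
    """Mirror of the SHL/ADD sequence; the temporaries follow the slide's names."""
    tc = a << 1    # SHL Tc, Ta, 1      -> Tc = 2*a
    tb = tc + 0    # ADD Tb, Tc, Tzero  -> Tb = 2*a
    tc = tc << 4   # SHL Tc, Tc, 4      -> Tc = 32*a
    tb = tb + tc   # ADD Tb, Tb, Tc     -> Tb = 34*a
    return tb


# Python integers do not overflow, so this check ignores fixed register width.
assert all(mul34_shift_add(a) == a * 34 for a in range(-1000, 1000))
```

Whether the four-instruction version is actually faster depends on the multiplier latency of the target machine, which is why this counts as a machine dependent optimization.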

Compiler basics: Register allocation

• Register Organization
Conventions needed for parameter passing and register usage across function calls; a MIPS example:

[Figure: register file r0 .. r31, divided at r0 | r1 - r10 | r11 - r20 | r21 - r31. Labels in the figure: Hard-wired 0, Argument and result transfer, Caller saved registers, Callee saved registers, Temporaries]


Register allocation using graph coloring

Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?

• A variable is defined at a point in program when a value is assigned to it.

• A variable is used at a point in a program when its value is referenced in an expression.

• The live range of a variable is the execution range between definitions and uses of a variable.

Example:

Program:

    a :=
    c :=
    b :=
      := b
    d :=
      := a
      := c
      := d

[Figure: live ranges of a, b, c and d drawn next to the program]

Interference Graph

[Figure: interference graph with nodes a, b, c, d, shown uncolored and colored]

Coloring:
a = red
b = green
c = blue
d = green

Graph needs 3 colors (chromatic nr = 3) => program needs 3 registers
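The coloring can be reproduced with a few lines of Python. This is a sketch under assumptions: the example program is read as eight statements (definitions of a, c, b, a use of b, a definition of d, then uses of a, c and d), a live range runs from the definition to the last use, and a simple greedy coloring is used, which is not guaranteed to be optimal in general.

```python
# (defined variable or None, used variables) per statement: assumed reading
# of the example program.
program = [
    ("a", []), ("c", []), ("b", []), (None, ["b"]),
    ("d", []), (None, ["a"]), (None, ["c"]), (None, ["d"]),
]


def live_ranges(program):
    """Map each variable to [definition position, last use position]."""
    ranges = {}
    for pos, (defined, used) in enumerate(program):
        if defined is not None:
            ranges.setdefault(defined, [pos, pos])
        for var in used:
            ranges[var][1] = pos
    return ranges


def interference_graph(ranges):
    """Two variables interfere when their live ranges overlap."""
    graph = {var: set() for var in ranges}
    for u, (u_start, u_end) in ranges.items():
        for v, (v_start, v_end) in ranges.items():
            if u != v and u_start < v_end and v_start < u_end:
                graph[u].add(v)
    return graph


def greedy_coloring(graph, colors):
    """Give each node the first color not used by an already colored neighbor."""
    coloring = {}
    for node in sorted(graph):
        taken = {coloring[n] for n in graph[node] if n in coloring}
        coloring[node] = next(c for c in colors if c not in taken)
    return coloring


graph = interference_graph(live_ranges(program))
coloring = greedy_coloring(graph, ["red", "green", "blue"])
print(coloring)
```

With this reading, b and d do not interfere (b is dead before d is defined), so they can share a register, and three registers suffice.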

Register allocation using graph coloring
Spill / Reload code

Spill / Reload code is needed when there are not enough colors (registers) to color the interference graph

Example: Only two registers available !!

Program:

    a :=
    c :=
    store c
    b :=
      := b
    d :=
      := a
    load c
      := c
      := d

[Figure: live ranges of a, b, c and d drawn next to the program; the live range of c is split by the store and the load]

Compiler basics: Code selection

• CISC era
– Code size important
– Determine shortest sequence of code
    • Many options may exist
– Pattern matching
  Example M68020:

    D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]
    ADD ([10,A1], D2*16, 20), D1

• RISC era
– Performance important
– Only few possible code sequences
– New implementations of old architectures optimize the RISC part of the instruction set only; e.g. i486 / Pentium / M68020


What is scheduling?

• Time allocation:
– Assigning instructions or operations to time slots
– Preserve dependences:
    • Register dependences
    • Memory dependences
– Optimize code with respect to performance / code size / power consumption / ..
• Space allocation
– satisfy resource constraints:
    • Bind operations to FUs
    • Bind variables to registers / register files
    • Bind transports to buses

Why scheduling?

Let’s look at the execution time:

$T_{execution} = N_{cycles} \times T_{cycle} = N_{instructions} \times CPI \times T_{cycle}$

Scheduling may reduce $T_{execution}$

– Reduce CPI (cycles per instruction)
    • early scheduling of long latency operations
    • avoid pipeline stalls due to structural, data and control hazards
    • allow $N_{issue} > 1$ and therefore $CPI < 1$
– Reduce $N_{instructions}$
    • compact many operations into each instruction (VLIW)

Scheduling data hazards: RaW dependence

Avoiding RaW stalls:

Reordering of instructions by the compiler

Example: avoiding one-cycle load interlock

Code:

    a = b + c
    d = e - f

Unscheduled code:

    Lw  R1,b
    Lw  R2,c
    Add R3,R1,R2      interlock
    Sw  a,R3
    Lw  R1,e
    Lw  R2,f
    Sub R4,R1,R2      interlock
    Sw  d,R4

Scheduled code:

    Lw  R1,b
    Lw  R2,c
    Lw  R5,e          extra reg. needed!
    Add R3,R1,R2
    Lw  R2,f
    Sw  a,R3
    Sub R4,R5,R2
    Sw  d,R4
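The effect of the reordering can be counted mechanically. The Python sketch below assumes a pipeline with a one-cycle load interlock, that is, one stall whenever an instruction uses the result of the load immediately before it; the tuple encoding of the instructions is made up for this illustration.

```python
def count_load_stalls(code):
    """Count stalls: an instruction reads the register loaded by its predecessor."""
    stalls = 0
    for previous, current in zip(code, code[1:]):
        opcode, dest, _sources = previous
        if opcode == "lw" and dest in current[2]:
            stalls += 1
    return stalls


# (opcode, destination register, source registers); memory operands are omitted.
unscheduled = [
    ("lw", "R1", []), ("lw", "R2", []), ("add", "R3", ["R1", "R2"]),
    ("sw", None, ["R3"]), ("lw", "R1", []), ("lw", "R2", []),
    ("sub", "R4", ["R1", "R2"]), ("sw", None, ["R4"]),
]
scheduled = [
    ("lw", "R1", []), ("lw", "R2", []), ("lw", "R5", []),
    ("add", "R3", ["R1", "R2"]), ("lw", "R2", []), ("sw", None, ["R3"]),
    ("sub", "R4", ["R5", "R2"]), ("sw", None, ["R4"]),
]

for name, code in (("unscheduled", unscheduled), ("scheduled", scheduled)):
    stalls = count_load_stalls(code)
    print(name, "stalls:", stalls, "cycles:", len(code) + stalls)
```

Under this model the unscheduled sequence takes 10 cycles and the scheduled one 8, at the cost of the extra register R5.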

Scheduling control hazards

Branch requires 3 actions:
• Compute new address
• Determine condition
• Perform the actual branch (if taken): PC := new address

[Figure: pipeline diagram over time (stages IF ID OF EX WB) showing "Branch L", the instructions fetched under "Predict not taken", and the instruction at the target label "L:"]

Control hazards: what's the penalty?

$CPI = CPI_{ideal} + f_{branch} \times P_{branch}$

$P_{branch} = N_{delayslots} \times miss\_rate$

• Superscalars tend to have large branch penalty $P_{branch}$ due to
– many pipeline stages
– multiple instructions (or operations) / cycle
• Note:
– the lower CPI the larger the effect of penalties
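A small numeric example makes the note concrete. In the Python sketch below all numbers (branch frequency, miss rate, number of delay slots, ideal CPI) are invented for illustration; only the two formulas come from the slide.

```python
def cpi_with_branches(cpi_ideal, f_branch, n_delay_slots, miss_rate):
    """CPI = CPI_ideal + f_branch * P_branch, with P_branch = N_delayslots * miss_rate."""
    p_branch = n_delay_slots * miss_rate
    return cpi_ideal + f_branch * p_branch


# Hypothetical scalar pipeline: CPI_ideal = 1, 2 delay slots.
scalar = cpi_with_branches(1.0, f_branch=0.2, n_delay_slots=2, miss_rate=0.1)
# Hypothetical 4-issue superscalar: CPI_ideal = 0.25, 8 delay slots
# (deeper pipeline and several instructions per cycle).
superscalar = cpi_with_branches(0.25, f_branch=0.2, n_delay_slots=8, miss_rate=0.1)

print("scalar slowdown:      %.2f" % (scalar / 1.0))
print("superscalar slowdown: %.2f" % (superscalar / 0.25))
```

With the same branch frequency and miss rate, the relative slowdown of the wide machine is much larger, because the penalty term is added to a smaller ideal CPI.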

What can we do about control hazards and CPI penalty?

• Keep penalty $P_{branch}$ low:
– Early computation of new PC
– Early determination of condition
– Visible delay slots filled by compiler (MIPS)
• Branch prediction
• Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro’95]
• Remove branches: if-conversion
– Conditional instructions: CMOVE, cond skip next
– Guarding all instructions: TriMedia

Scheduling: Conditional instructions

• Example: Cmove (supported by Alpha)

If (A=0) S = T;

assume: r1: A, r2: S, r3: T

Object code:

        Bnez r1, L
        Mov  r2, r3
    L:  . . . .

After conversion:

        Cmovz r2, r3, r1

Scheduling: Conditional instructions

Conditional instructions are useful, however:
• Squashed instructions still take execution time and execution resources
– Consequence: long target blocks can not be if-converted
• Condition has to be known early
• Moving operations across multiple branches requires complicated predicates
• Compatibility: change of ISA (instruction set architecture)

Practice:
• Current superscalars support a limited set of conditional instructions
• CMOVE: Alpha, MIPS, PowerPC, SPARC
• HP PA: any RR instruction can conditionally squash next instruction

Large VLIWs profit from making all instructions conditional
• guarded execution: TriMedia, Intel/HP IA-64, TI C6x

Guarded execution

Original code:

          SLT  r1,r2,r3
          BEQ  r1,r0, else
    then: ADDI r2,r2,1
          ..X..
          j    cont
    else: SUBI r2,r2,1
          ..Y..
    cont: MUL  r4,r2

After IF-conversion:

          SLT  b1,r2,r3
    b1:   ADDI r2,r2,1        !b1: SUBI r2,r2,1
    b1:   ..X..               !b1: ..Y..
          MUL  r4,r2

Full guard support

If-conversion of conditional code

Assume:
• $t_{branch}$: branch latency
• $p_{branch}$: branching probability
• $t_{true}$: execution time of the TRUE branch
• $t_{false}$: execution time of the FALSE branch

Execution times of original and if-converted code for non-ILP architecture:

$t_{original\_code} = (1 + p_{branch}) \times t_{branch} + p_{branch} \times t_{true} + (1 - p_{branch}) \times t_{false}$

$t_{if\_converted\_code} = t_{true} + t_{false}$



Speedup of if-converted code for non-ILP architectures

Only interesting for short target blocks!

Scheduling: Conditional instructions

Speedup of if-converted code for ILP architectures with sufficient resources

Much larger area of interest !!

$t_{if\_converted} = \max(t_{true}, t_{false})$
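The two cases can be compared numerically. The Python sketch below uses the execution-time model of the if-conversion slides, reading the bare p in the original-code formula as p_branch: original code costs (1 + p_branch) × t_branch + p_branch × t_true + (1 − p_branch) × t_false, if-converted code costs t_true + t_false on a non-ILP machine and max(t_true, t_false) on an ILP machine with sufficient resources. The sample latencies are invented.

```python
def t_original(p_branch, t_branch, t_true, t_false):
    """Execution time of the branching code."""
    return (1 + p_branch) * t_branch + p_branch * t_true + (1 - p_branch) * t_false


def speedups(p_branch, t_branch, t_true, t_false):
    """Return (speedup on non-ILP machine, speedup on ILP machine)."""
    original = t_original(p_branch, t_branch, t_true, t_false)
    non_ilp = original / (t_true + t_false)     # both blocks execute serially
    ilp = original / max(t_true, t_false)       # both blocks execute in parallel
    return non_ilp, ilp


# Hypothetical numbers: branch latency 2 cycles, both outcomes equally likely.
short_blocks = speedups(p_branch=0.5, t_branch=2, t_true=2, t_false=2)
long_blocks = speedups(p_branch=0.5, t_branch=2, t_true=10, t_false=10)
print("short target blocks:", short_blocks)
print("long target blocks: ", long_blocks)
```

For the short blocks both machines gain; for the long blocks the non-ILP machine loses (speedup below 1) while the ILP machine still gains, which is the "much larger area of interest".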

• Full guard support for large ILP architectures has a number of advantages:
– Removing unpredictable branches
– Enlarging scheduling scope
– Enabling software pipelining
– Enhancing code motion when speculation is not allowed
– Resource sharing; even when speculation is allowed guarding may be profitable

Scheduling: Overview

Transforming a sequential program into a parallel program:

    read sequential program
    read machine description file
    for each procedure do
        perform function inlining
    for each procedure do
        transform an irreducible CFG into a reducible CFG
        perform control flow analysis
        perform loop unrolling
        perform data flow analysis
        perform memory reference disambiguation
        perform register allocation
        for each scheduling scope do
            perform instruction scheduling
    write parallel program

Scheduling: Int.Lin.Programming

Integer linear programming scheduling method
• Introduce:
– Decision variables: $x_{i,j} = 1$ if operation i is scheduled in cycle j
– Constraints like:
    – Limited resources:

      $\sum_i x^t_{i,j} \le M_t \quad \forall j, t$

      where $x^t$ is an operation of type t and $M_t$ is the number of resources of type t
    – Data dependence constraints
    – Timing constraints
• Problem: too many decision variables

List Scheduling

• Make a dependence graph
• Determine minimal length
• Determine ASAP, ALAP, and slack of each operation
• Place each operation in first cycle with sufficient resources

Note:
– Scheduling order sequential
– Priority determined by used heuristic; e.g. slack

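Basic Block Scheduling

[Figure: data dependence graph of a basic block with operations LD, ADD, MUL, SUB and NEG, inputs A, B, C and results X, y, z. Each operation is annotated with <ASAP cycle, ALAP cycle>; the annotations in the figure are <1,1>, <2,2>, <3,3>, <4,4>, <1,3>, <2,3>, <2,4> and <1,4>. The slack of an operation is the difference between its ALAP and ASAP cycle]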

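ASAP and ALAP formulas

$$asap(v) = \begin{cases} \max\{\, asap(u) + delay(u,v) \mid (u,v) \in E \,\} & \text{if } pred(v) \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

$$alap(v) = \begin{cases} \min\{\, alap(u) - delay(v,u) \mid (v,u) \in E \,\} & \text{if } succ(v) \neq \emptyset \\ L_{max} & \text{otherwise} \end{cases}$$

$$slack(v) = alap(v) - asap(v)$$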
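The ASAP, ALAP and slack definitions translate directly into a forward and a backward pass over the dependence graph. The Python sketch below is illustrative: the five-node graph and its delays are hypothetical, the nodes must be given in topological order, and L_max defaults to the largest ASAP value.

```python
def asap_alap(nodes, edges, l_max=None):
    """nodes: topologically ordered list; edges: {(u, v): delay(u, v)}."""
    preds = {v: [] for v in nodes}
    succs = {v: [] for v in nodes}
    for (u, v), delay in edges.items():
        preds[v].append((u, delay))
        succs[u].append((v, delay))

    asap = {}
    for v in nodes:                      # forward pass
        asap[v] = max((asap[u] + d for u, d in preds[v]), default=0)

    if l_max is None:
        l_max = max(asap.values())

    alap = {}
    for v in reversed(nodes):            # backward pass
        alap[v] = min((alap[w] - d for w, d in succs[v]), default=l_max)

    slack = {v: alap[v] - asap[v] for v in nodes}
    return asap, alap, slack


# Hypothetical DDG: two loads, a 2-cycle multiply, an add and a store.
nodes = ["ld_a", "ld_b", "mul", "add", "st"]
edges = {("ld_a", "mul"): 1, ("mul", "add"): 2, ("ld_b", "add"): 1, ("add", "st"): 1}

asap, alap, slack = asap_alap(nodes, edges)
print(slack)
```

Operations with slack 0 lie on the critical path; in this example only `ld_b` can be delayed (by two cycles) without lengthening the schedule.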

Cycle based list scheduling

proc Schedule(DDG = (V,E))
begin
    ready  = { v | ¬∃ (u,v) ∈ E }       // all nodes which have no predecessor
    ready’ = ready                       // all nodes which can be scheduled in current cycle
    sched  = ∅
    current_cycle = 0
    while sched ≠ V do
        for each v ∈ ready’ do
            if ¬ResourceConfl(v, current_cycle, sched) then
                cycle(v) = current_cycle
                sched = sched ∪ {v}
            endif
        endfor
        current_cycle = current_cycle + 1
        ready  = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
        ready’ = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
    endwhile
endproc
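A runnable version of cycle based list scheduling helps to see the role of the priority heuristic. The Python sketch below simplifies the resource model to "at most `width` operations per cycle" (an assumption; a real scheduler checks functional units, registers and buses), takes the priority from the order of the node list, and uses a hypothetical five-node dependence graph.

```python
def list_schedule(nodes, edges, width):
    """nodes: priority-ordered list; edges: {(u, v): delay}; returns {node: cycle}."""
    preds = {v: [] for v in nodes}
    for (u, v), delay in edges.items():
        preds[v].append((u, delay))

    cycle = {}
    current_cycle = 0
    while len(cycle) < len(nodes):
        issued = 0
        for v in nodes:
            if v in cycle:
                continue
            # v is ready when all predecessors are scheduled and their results arrived
            ready = all(u in cycle and cycle[u] + d <= current_cycle
                        for u, d in preds[v])
            if ready and issued < width:      # simplified resource conflict check
                cycle[v] = current_cycle
                issued += 1
        current_cycle += 1
    return cycle


# Hypothetical DDG: two loads, a 2-cycle multiply, an add and a store.
edges = {("ld_a", "mul"): 1, ("mul", "add"): 2, ("ld_b", "add"): 1, ("add", "st"): 1}

source_order = ["ld_a", "ld_b", "mul", "add", "st"]
slack_order = ["ld_a", "mul", "add", "st", "ld_b"]   # zero-slack operations first

print(list_schedule(source_order, edges, width=1))
print(list_schedule(slack_order, edges, width=1))
```

On a single-issue machine the source order finishes the store in cycle 5, while giving priority to the critical-path operations finishes it in cycle 4.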

Problem with basic block scheduling

• Basic blocks contain on average only about 6 instructions

• Unrolling may help for loops

• Go beyond basic blocks:
    1. Extended basic block scheduling
    2. Software pipelining

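Extended basic block scheduling: Scope

Partitioning a CFG into scheduling scopes:

[Figure: a CFG with basic blocks A to G, partitioned into traces (left) and into superblocks (right). The superblock version is obtained by tail duplication, which adds the copies D’, E’ and G’]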

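Partitioning a CFG into scheduling scopes:

[Figure: the same CFG with basic blocks A to G, treated as a hyperblock / region (left) and as decision trees (right). The decision tree version is obtained by tail duplication, which adds the copies D’, E’, F’, G’ and G’’]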

Comparing scheduling scopes:

                           Trace   Sup. block   Hyp. block   Dec. Tree   Region
Multiple exc. paths        No      No           Yes          Yes         Yes
Side-entries allowed       Yes     No           No           No          No
Join points allowed        Yes     No           Yes          No          Yes
Code motion down joins     Yes     No           No           No          No
Must be if-convertible     No      No           Yes          No          No
Tail dup. before sched.    No      Yes          No           Yes         No


Extended basic block scheduling: Code Motion

Basic blocks:

    A:  a) add r4, r4, 4
        b) beq . . .
    B:  c) add r1, r1, r2
    C:  d) sub r1, r1, r2
    D:  e) st r1, 8(r4)

• Downward code motions?
– a → B, a → C, a → D, c → D, d → D

• Upward code motions?
– c → A, d → A, e → B, e → C, e → A

Extended basic block scheduling: Code Motion

[Figure: CFG fragment with basic blocks marked D, I, M, b and b’ (and combinations such as D/b, ID, II, MD)]

Legend:
– Source basic blocks
– Destination basic blocks
– Basic blocks between source and destination basic blocks
– Basic blocks where duplications have to be placed
– Control flow edges where off-liveness checks have to be performed

• SCP (single copy on a path) rule: no path may exist between 2 different D blocks

Extended basic block scheduling: Code Motion

• A dominates B ⇒ A is always executed before B
– Consequently:
    • A does not dominate B ⇒ code motion from B to A requires code duplication

• B post-dominates A ⇒ B is always executed after A
– Consequently:
    • B does not post-dominate A ⇒ code motion from B to A is speculative

[Figure: CFG with block A at the top, blocks B and C below it, then blocks D and E, and block F at the bottom]

Q1: does C dominate E?

Q2: does C dominate D?

Q3: does F post-dominate D?

Q4: does D post-dominate B?
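Questions like these can be answered mechanically with the classic iterative dominator computation. The edges of the example CFG are not recoverable from the transcript, so the Python sketch below uses an assumed edge set (A→B, A→C, B→D, B→E, C→E, D→F, E→F); the answers it prints hold only for that assumption. Post-dominators are computed as dominators of the reversed graph.

```python
def dominators(cfg, entry):
    """cfg: {node: [successors]}. Returns {node: set of nodes dominating it}."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for u, successors in cfg.items():
        for v in successors:
            preds[v].add(u)

    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            # assumes every non-entry node has at least one predecessor
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def reverse(cfg):
    """Reverse all edges, so that dominators of the result are post-dominators."""
    reversed_cfg = {n: [] for n in cfg}
    for u, successors in cfg.items():
        for v in successors:
            reversed_cfg[v].append(u)
    return reversed_cfg


# Assumed edges; the figure's real edges are not known.
cfg = {"A": ["B", "C"], "B": ["D", "E"], "C": ["E"], "D": ["F"], "E": ["F"], "F": []}

dom = dominators(cfg, "A")
pdom = dominators(reverse(cfg), "F")

print("Q1 C dominates E:     ", "C" in dom["E"])
print("Q2 C dominates D:     ", "C" in dom["D"])
print("Q3 F post-dominates D:", "F" in pdom["D"])
print("Q4 D post-dominates B:", "D" in pdom["B"])
```

With the assumed edges only Q3 is answered with yes: E can be reached through B as well as C, D is reached only through B, and B can leave through E without passing D.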

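Scheduling: Loops

Loop Optimizations:

[Figure: CFG of a loop with basic blocks A, B, C and D, shown in its original form, after loop peeling and after loop unrolling; the transformed versions contain the copies C’ and C’’]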

Scheduling: Loops

Problems with unrolling:

• Exploits only parallelism within sets of n iterations

• Iteration start-up latency

• Code expansion

[Figure: resource utilization over time for basic block scheduling, for basic block scheduling and unrolling, and for software pipelining]

Software pipelining

• Software pipelining a loop is:
– Scheduling the loop such that iterations start before preceding iterations have finished
Or:
– Moving operations across the backedge

Example: y = a.x (each iteration: LD, ML, ST)

Basic block scheduling: 3 cycles/iteration

    LD
    ML
    ST

Unrolling (3 iterations): 5/3 cycles/iteration

    LD
    LD ML
    LD ML ST
       ML ST
          ST

Software pipelining: 1 cycle/iteration

    LD
    LD ML
    LD ML ST      (kernel, repeated)
       ML ST
          ST

Software pipelining: Modulo scheduling

Example: Modulo scheduling a loop

(a) Example loop

    for (i = 0; i < n; i++)
        a[i+6] = 3* a[i] - 1;

(b) Code without loop control

    ld  r1,(r2)
    mul r3,r1,3
    sub r4,r3,1
    st  r4,(r5)

(c) Software pipeline (one column per iteration)

    Prologue    ld  r1,(r2)
                mul r3,r1,3    ld  r1,(r2)
                sub r4,r3,1    mul r3,r1,3    ld  r1,(r2)
    Kernel      st  r4,(r5)    sub r4,r3,1    mul r3,r1,3    ld  r1,(r2)
    Epilogue                   st  r4,(r5)    sub r4,r3,1    mul r3,r1,3
                                              st  r4,(r5)    sub r4,r3,1
                                                             st  r4,(r5)

• Prologue fills the SW pipeline with iterations
• Epilogue drains the SW pipeline
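The prologue / kernel / epilogue structure follows from starting a new iteration every II cycles (the initiation interval). The Python sketch below generates that schedule for the four-operation loop body, assuming one cycle per operation, II = 1 and enough resources to issue four operations per cycle; the function name and data layout are made up for this illustration.

```python
def software_pipeline(stages, iterations, ii=1):
    """Return one row per cycle; each row lists (operation, iteration) pairs."""
    n_cycles = (iterations - 1) * ii + len(stages)
    rows = [[] for _ in range(n_cycles)]
    for iteration in range(iterations):
        for offset, operation in enumerate(stages):
            rows[iteration * ii + offset].append((operation, iteration))
    return rows


stages = ["ld", "mul", "sub", "st"]
rows = software_pipeline(stages, iterations=4)

for cycle, row in enumerate(rows):
    # a row belongs to the kernel when every stage is active in that cycle
    phase = "kernel" if len(row) == len(stages) else (
        "prologue" if cycle < len(stages) - 1 else "epilogue")
    print(cycle, phase, " ".join("%s[%d]" % pair for pair in row))

print("cycles:", len(rows), "instead of", len(stages) * 4)
```

For more iterations only the number of kernel rows grows, so the cost approaches one cycle per iteration while prologue and epilogue stay at three cycles each.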


Summary and Conclusions

• Compilation for ILP architectures is getting mature and enters the commercial area.

• However:
– Great discrepancy between available and exploitable parallelism

What if you need more parallelism?

- source-to-source transformations

- use other algorithms
