
CS8803: Compilers for Embedded System
Santosh Pande – Summer 2007

Chapter 8: Compiling for VLIWs and ILP


Outline

• 8.1 Profiling
• 8.2 Scheduling
  – Acyclic Region Types and Shapes
  – Region Formation
  – Schedule Construction
  – Resource Management During Scheduling
  – Loop Scheduling
  – Clustering
• 8.3 Register Allocation
• 8.4 Speculation and Predication
• 8.5 Instruction Selection

Overview

• This chapter focuses on optimizations, or code transformations
  – These topics are common across all types of ILP processors, for both general-purpose and embedded applications
  – Compilers and toolchains used for embedded processors are very similar to those used for general-purpose computers

1. Profiling

• Profiles
  – Statistics about how a program spends its time and resources
  – Many ILP optimizations require good profile information
• Two types of profiles
  – "Point profiles"
    • Call graphs and CFGs
  – "Path profiles"

Types of Profiles

• Call graph
  – Nodes: procedures
  – Edges: procedure calls
  – Information
    • How many times was each procedure called?
    • How many times did each caller invoke a particular callee?
  – Limitation
    • Cannot tell what to do within potentially beneficial procedures (no detail below the procedure level)

Types of Profiles (cont.)

• Control Flow Graph (CFG)
  – Nodes: basic blocks
    • Basic block: a sequence of instructions that always execute together
  – Edges: one basic block can execute immediately after another
  – Information
    • How many times was a particular basic block executed?
    • How many times did control flow from one basic block to one of its immediate neighbors?

[Figure: example call graph and example control flow graph]

Types of Profiles (cont.)

• Path profiles
  – Measure the number of times a path (a sequence of contiguous blocks in the CFG) is executed
  – Optimizations using path profiles have appeared in research compilers, but have not made it into production compilers
  – Note that call graphs and CFGs are "point profiles"

Profile Collection

• Instrumentation
  – Extra code is inserted into the program to gather data (see the sketch below)
  – Can be done by the compiler or by a post-compilation tool
    • e.g. Pin: a dynamic instrumentation tool and API
      – http://rogue.colorado.edu/pin/
• Hardware techniques
  – Special registers record statistics about various events
  – Statistical-sampling profilers
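A minimal sketch of what compiler- or tool-inserted profiling instrumentation looks like. The tiny function, its block numbering, and the counter arrays are made up for illustration; this is not Pin's API.

    /* Hand-written illustration of block and edge counting: the counter
     * arrays and the increments at the top of each "basic block" are the
     * instrumentation; block/edge names are hypothetical. */
    #include <stdio.h>

    static unsigned long block_count[4];   /* one counter per basic block  */
    static unsigned long edge_count[2];    /* one counter per CFG edge     */

    static int abs_val(int x)
    {
        block_count[0]++;                  /* B0: entry block              */
        if (x < 0) {
            edge_count[0]++;               /* edge B0 -> B1 (taken)        */
            block_count[1]++;              /* B1: negate                   */
            x = -x;
        } else {
            edge_count[1]++;               /* edge B0 -> B2 (fall-through) */
            block_count[2]++;              /* B2: empty arm                */
        }
        block_count[3]++;                  /* B3: join block               */
        return x;
    }

    int main(void)
    {
        int data[] = { 3, -7, 0, -2, 9 };
        for (int i = 0; i < 5; i++)
            abs_val(data[i]);
        for (int b = 0; b < 4; b++)
            printf("block B%d executed %lu times\n", b, block_count[b]);
        return 0;
    }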


Synthetic Profiles (Heuristics in Lieu of Profiles)

• Synthetic profile
  – Assigns weights to each part of the program based solely on the structure of the source program
  – Pros
    • No need to collect statistics from actual program runs
  – Cons
    • Cannot see how the program behaves with real data
  – None of the synthetic-profile techniques does as well as actual profiling

2. Scheduling

• Instruction scheduling
  – Directly responsible for identifying and grouping operations that can be executed in parallel
• Taxonomy
  – Cyclic: operates on loops in the program
  – Acyclic: handles loop-free regions, not loops directly
  – Current compilers include both kinds of schedulers
• Hardware support
  – Shapes the choices available to the scheduler

Acyclic Region Types and Shapes

• Shapes of regions
  – Basic blocks, traces, …
• Basic blocks
  – A "degenerate" form of region
  – Maximal straight-line code fragments

Acyclic Region Types and Shapes (cont.)

• Traces: the first proposed region type
  – Linear paths through the code: multiple entrances and multiple exits
  – A trace consists of the operations from a list of basic blocks with the following properties
    • Each basic block is a predecessor of the next one on the list
      – e.g. Bk falls through or branches to Bk+1
    • For any i and k, there is no path Bi -> Bk -> Bi except for those that go through B0
      – i.e. the code is cycle free, although the entire region can be part of some encompassing loop
  – Traces allow forward branches into and out of the region, which makes them complex to schedule

[Figure: a trace is a linear, multiple-entry, multiple-exit region of basic blocks; a branch into the middle of the trace is a side entrance]

Acyclic Region Types and Shapes (cont.)

• Superblocks
  – Traces with an added restriction: single-entry, multiple-exit traces
  – Same properties as traces, plus one addition
    • There may be no branches into a block in the region, except to B0; these outlawed branches are referred to in the superblock literature as side entrances
  – Tail duplication: a region-enlarging technique
    • Avoids side entrances by duplicating blocks and adding compensation code

[Figure: tail duplication to eliminate side entrances, forming a superblock; duplicated blocks inherit scaled profile weights, e.g. 70 * 0.8 = 56]

Acyclic Region Types and Shapes (cont.)

• Hyperblocks
  – Single-entry, multiple-exit regions with internal control flow
  – Variants of superblocks that employ predication to fold multiple control paths into a single superblock
  – Removes some control-flow complexity

[Figure: hyperblock formed by if-conversion of basic blocks B2 and B5]

Acyclic Region Types and Shapes (cont.)

• Treegions
  – Regions containing a tree of basic blocks within the control flow of the program
  – Properties
    • Each basic block Bj except for B0 has exactly one predecessor
    • That predecessor, Bi, is on the list, where i < j
  – Any path through a treegion yields a superblock
    • i.e. a trace with no side entrances
  – Treegion-2: a variant without the restriction on side entrances

[Figure: a CFG partitioned into three treegions (Treegion 1, Treegion 2, Treegion 3)]

Acyclic Region Types and Shapes (cont.)

• Percolation scheduling
  – Many code-motion rules applied to regions that resemble traces
  – One of the earliest versions of DAG scheduling
    • DAG scheduling: the most general form of acyclic scheduling
• Cyclic schedulers, by contrast, handle limited region shapes
  – A single innermost loop
  – An inner loop that has very simple control flow

Region Formation

• So far we have discussed region shapes; two questions remain
  – Region formation
    • How does one divide a program into regions?
    • Region formation is more than selecting good regions from the CFG; it also includes duplication (region enlargement)
  – Schedule construction
    • How does one build schedules for the selected regions?
    • Well-selected regions are critical for schedule construction
      – Profiles tell us how frequently each region is executed

Region Formation (cont.)

• Region selection
  – Trace growing
    • The most popular algorithm
    • Uses the "mutual most likely" heuristic
    • Steps (see the sketch below)
      – Let A be the last block of the current trace
      – If block B is A's most likely successor, and A is B's most likely predecessor, then A and B are "mutually most likely"
      – Add B to the trace
      – Repeat until there is no mutually-most-likely successor
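A minimal sketch of trace growing with the mutual-most-likely heuristic. The small CFG and its edge counts are made up for illustration; a real implementation would also stop at already-visited blocks and loop back edges.

    #include <stdio.h>

    #define NBLOCKS 6

    /* count[a][b] = profiled number of times control flowed from Ba to Bb */
    static const int count[NBLOCKS][NBLOCKS] = {
        /* B0 */ {0, 90, 10,  0,  0,  0},
        /* B1 */ {0,  0,  0, 72, 18,  0},
        /* B2 */ {0,  0,  0, 10,  0,  0},
        /* B3 */ {0,  0,  0,  0,  0, 82},
        /* B4 */ {0,  0,  0,  0,  0, 18},
        /* B5 */ {0},
    };

    static int most_likely_succ(int a)
    {
        int best = 0;
        for (int b = 1; b < NBLOCKS; b++)
            if (count[a][b] > count[a][best]) best = b;
        return count[a][best] > 0 ? best : -1;
    }

    static int most_likely_pred(int b)
    {
        int best = 0;
        for (int a = 1; a < NBLOCKS; a++)
            if (count[a][b] > count[best][b]) best = a;
        return count[best][b] > 0 ? best : -1;
    }

    int main(void)
    {
        int trace[NBLOCKS], len = 0;
        int a = 0;                        /* seed block */
        trace[len++] = a;
        for (;;) {
            int b = most_likely_succ(a);
            /* grow only while A and B are mutually most likely */
            if (b < 0 || most_likely_pred(b) != a) break;
            trace[len++] = b;
            a = b;
        }
        printf("trace:");
        for (int i = 0; i < len; i++) printf(" B%d", trace[i]);
        printf("\n");                     /* expected: B0 B1 B3 B5 */
        return 0;
    }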


Region Formation (cont.)

• Region selection
  – Shortcomings of using point profiles
    • Cumulative effect of conditional probability: point profiles measure each branch probability independently, so the apparent probability of remaining on the trace rapidly decreases
    • Example: a trace that crosses ten splits, each with a 90% chance of staying on the trace, appears to have only a 35% (0.9^10) probability of running from start to end
    • Solutions: build differently shaped regions, or use predication to remove branches

Region Formation (cont.)

• Region selection
  – Hyperblock formation
    • Based on mutual-most-likely trace formation
    • Considers block size and execution frequency
    • Predication can remove unpredictable branches
  – Research on better statistics
    • Using global, bounded-length path profiles to improve static branch prediction

Region Formation (cont.)

• Enlargement techniques
  – Region selection alone is not enough
  – ILP can be increased further through region enlargement
    • Code size increases, but the code schedules better
    • Based on the fact that programs iterate (loop)
  – Loop unrolling
    • Performed before region selection so that the larger unrolled bodies are available to the region selector
    • Induction-variable simplification and related transformations are performed to expose more parallelism across iterations

Region Formation (cont.)

• Simplified example of loop-unrolling variants (a C sketch of both follows)
  – For a while loop: the most general case, since the trip count is unknown and each unrolled copy of the body must keep its own exit test
  – For a for loop: counted loops, where exit tests can be dropped from the unrolled copies and a remainder loop handles leftover iterations
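A minimal sketch of both unrolling variants, assuming a made-up summation loop: sum_for shows a counted loop unrolled by 4 with a remainder loop; sum_while shows the general case where every unrolled copy keeps its exit test.

    #include <stdio.h>

    static int sum_for(const int *a, int n)
    {
        int s = 0, i = 0;
        /* unrolled by 4: the exit test runs once per 4 iterations */
        for (; i + 3 < n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; i++)              /* remainder loop */
            s += a[i];
        return s;
    }

    static int sum_while(const int *a)
    {
        int s = 0, i = 0;
        /* terminated by a sentinel value; the trip count is unknown,
         * so each unrolled copy keeps its own exit test */
        while (a[i] != -1) {
            s += a[i++];
            if (a[i] == -1) break;
            s += a[i++];
            if (a[i] == -1) break;
            s += a[i++];
            if (a[i] == -1) break;
            s += a[i++];
        }
        return s;
    }

    int main(void)
    {
        int v[] = { 1, 2, 3, 4, 5, 6, 7, -1 };
        printf("%d %d\n", sum_for(v, 7), sum_while(v));  /* 28 28 */
        return 0;
    }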

• Induction-variable manipulations for loops
  [Figure: example of induction-variable manipulation applied to an unrolled loop]

Region Formation (cont.)

• Enlargement techniques
  – A different set of approaches for superblocks
    • Superblock loop unrolling
      – Unrolls superblock loops (superblocks whose most likely exit jumps back to their own beginning)
    • Superblock loop peeling
      – Used when the profile suggests a small number of iterations for the superblock loop
      – The expected number of iterations is copied (peeled) off the front of the loop
    • Superblock target expansion
      – Similar to the mutual-most-likely heuristic for growing traces
      – If superblock A ends in a likely branch to B, then B is appended to A

[Figure: superblock-enlarging optimizations: target expansion, loop unrolling, loop peeling]

Region Formation (cont.)

• Phase-ordering considerations
  – Which comes first, enlargement or selection?
    • The Multiflow compiler performed enlargement before trace selection
    • Superblock-based compilers chose and formed superblocks first, then enlarged them
    • Neither ordering is clearly preferable
  – Other transformations
    • e.g. dependence-height reduction should be run before region formation

Schedule Construction

• So far we have discussed region formation
  – Selecting and enlarging individual regions
• A schedule
  – A set of annotations that indicate the unit assignment and cycle time of the operations in a region
  – Its form depends on the shape of the region
• Goal: minimize an objective function
  – Estimated completion time, possibly combined with code size or energy efficiency (important in embedded systems)

Schedule Construction (cont.)

• Analyzing programs for schedule construction
  – Dependences (data and control) prohibit reordering
    • They form a partial order on the pieces of code
    • Represented as a DAG or one of its variants
      – DDG (data dependence graph)
      – PDG (program dependence graph)
    • Building the DDG or PDG typically takes O(n^2) time, where n is the number of operations

[Figure: data dependence example showing true (flow) and output dependences]

[Figure: control dependence example and the corresponding control flow]

Schedule Construction (cont.)

• Compaction techniques
  – Cycle versus operation scheduling
    • Two strategies for minimizing an objective function
    • 1) Operation scheduling
      – Selects an operation in the region and places it in the "best" cycle that does not violate dependences
    • 2) Cycle scheduling
      – Fills a cycle with operations from the region, proceeding to the next cycle only after exhausting the available operations
    • Operation scheduling is theoretically more powerful because it can give priority to long-latency operations

Schedule Construction (cont.)

• Compaction techniques
  – Linear techniques
    • Algorithms that build a full DDG cost O(n^2); in practice, linear O(n) scans are also used in modern compilers
    • Two techniques
    • 1) As-soon-as-possible (ASAP) scheduling
      – Places each operation in the earliest possible cycle (top-down linear scan)
    • 2) As-late-as-possible (ALAP) scheduling
      – Places each operation in the latest possible cycle (bottom-up linear scan)
    • Example: critical-path scheduling uses ASAP followed by ALAP to identify the operations on the critical path (see the sketch below)
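A minimal sketch of ASAP and ALAP passes over a small, made-up DDG. Operations whose ASAP and ALAP times coincide have zero slack and lie on the critical path.

    #include <stdio.h>

    #define N 5

    /* dep[i][j] != 0 means op j depends on op i (edge i -> j);
     * ops are numbered in topological order */
    static const int dep[N][N] = {
        /* op0 */ {0, 1, 1, 0, 0},
        /* op1 */ {0, 0, 0, 1, 0},
        /* op2 */ {0, 0, 0, 1, 0},
        /* op3 */ {0, 0, 0, 0, 1},
        /* op4 */ {0},
    };
    static const int latency[N] = {2, 1, 3, 1, 1};

    int main(void)
    {
        int asap[N], alap[N];

        /* ASAP: top-down over the topological order */
        for (int j = 0; j < N; j++) {
            asap[j] = 0;
            for (int i = 0; i < j; i++)
                if (dep[i][j] && asap[i] + latency[i] > asap[j])
                    asap[j] = asap[i] + latency[i];
        }
        int length = 0;                       /* critical-path length */
        for (int i = 0; i < N; i++)
            if (asap[i] + latency[i] > length) length = asap[i] + latency[i];

        /* ALAP: bottom-up over the reverse topological order */
        for (int i = N - 1; i >= 0; i--) {
            alap[i] = length - latency[i];
            for (int j = i + 1; j < N; j++)
                if (dep[i][j] && alap[j] - latency[i] < alap[i])
                    alap[i] = alap[j] - latency[i];
        }

        for (int i = 0; i < N; i++)
            printf("op%d: asap=%d alap=%d %s\n", i, asap[i], alap[i],
                   asap[i] == alap[i] ? "(critical)" : "");
        return 0;
    }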


Schedule Construction (cont.)

• Compaction techniques
  – Graph-based techniques (list scheduling)
    • Linear techniques cannot see global properties of the DDG
    • List scheduling repeatedly assigns a cycle to an operation without backtracking (a greedy algorithm): O(n log n)
    • Steps (see the sketch below)
      – Select an operation from a data-ready queue (DRQ)
      – An operation is ready when all of its DDG predecessors have been scheduled
      – Once scheduled, the operation is removed from the DRQ
    • Performance depends on the order in which candidates are selected, i.e. on the scheduler's greediness heuristic
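A minimal sketch of cycle-based list scheduling with a data-ready queue. The DDG, latencies, priority function, and 2-wide issue limit are made up; a real scheduler would also model functional-unit types.

    #include <stdio.h>

    #define N      6
    #define WIDTH  2                 /* operations issued per cycle */

    static const int dep[N][N] = {   /* dep[i][j]: edge i -> j */
        {0, 1, 1, 0, 0, 0},
        {0, 0, 0, 1, 0, 0},
        {0, 0, 0, 0, 1, 0},
        {0, 0, 0, 0, 0, 1},
        {0, 0, 0, 0, 0, 1},
        {0},
    };
    static const int latency[N]  = {1, 2, 1, 1, 1, 1};
    static const int priority[N] = {5, 4, 3, 2, 2, 1};  /* e.g. path height */

    int main(void)
    {
        int cycle_of[N] = {0};
        int done[N] = {0};
        int scheduled = 0;

        for (int cycle = 0; scheduled < N && cycle < 100; cycle++) {
            int issued = 0;
            while (issued < WIDTH) {
                /* pick the highest-priority ready operation (the DRQ) */
                int best = -1;
                for (int j = 0; j < N; j++) {
                    if (done[j]) continue;
                    int ready = 1;
                    for (int i = 0; i < N; i++)
                        if (dep[i][j] &&
                            (!done[i] || cycle_of[i] + latency[i] > cycle))
                            ready = 0;
                    if (ready && (best < 0 || priority[j] > priority[best]))
                        best = j;
                }
                if (best < 0) break;     /* nothing ready: advance a cycle */
                cycle_of[best] = cycle;
                done[best] = 1;
                scheduled++;
                issued++;
            }
        }
        for (int j = 0; j < N; j++)
            printf("op%d -> cycle %d\n", j, cycle_of[j]);
        return 0;
    }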


Schedule Construction (cont.)

• Compensation code
  – Restores the correct flow of data and control after code motion
  – Four basic scenarios
• (a) No compensation
  – The code motion does not change the relative order of operations with respect to joins and splits
  – Also covers moving operations above a split point (the operations become speculative)
  – Recall that compensation code for speculative code motions depends on the recovery model

Schedule Construction (cont.)

• Compensation code (cont.)
  – (b) Join compensation
    • Operation B moves above a join point A
    • Drop a copy of B (B') in the joining path
  – (c) Split compensation
    • A split operation B (i.e. a branch) moves above a previous operation A
    • Produces a copy of A (A') in the split path

Schedule Construction (cont.)

• Compensation code (cont.)
  – (d) Join-split compensation
    • Splits moved above joins (in the figure, this duplicates the Z-B-W path)
    • Splits moved above splits
• Summary
  – In general, make sure all paths from the original sequence are preserved in the transformed control flow after scheduling

Resource Management During Scheduling

• Resource hazards
  – Scheduling must respect dependences, operation latencies, and the available resources (i.e., functional units)
• Approaches
  – Reservation tables: a simple and early method
  – Finite-state automata

Resource Management During Scheduling (cont.)

• Resource vectors (reservation tables)
  – A simple way to check instruction placement (sketch below)
  – Row: each cycle of the schedule
  – Column: each resource in the machine
  – A cell is marked busy while that resource is in use
  – Recent work focuses on reducing the table size
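A minimal sketch of reservation-table checking. The two-resource machine model and the per-operation usage patterns are made up for illustration.

    #include <stdio.h>
    #include <string.h>

    #define CYCLES    16
    #define RESOURCES 2              /* 0: ALU, 1: MEM port */

    static int table[CYCLES][RESOURCES];   /* 1 = busy */

    /* usage[c][r]: the operation needs resource r in its c-th cycle */
    typedef struct { int len; int usage[4][RESOURCES]; } OpClass;

    static const OpClass load_op = { 3, { {0,1}, {0,1}, {0,0} } };  /* MEM busy 2 cycles */
    static const OpClass alu_op  = { 1, { {1,0} } };                /* ALU busy 1 cycle  */

    static int fits(const OpClass *op, int start)
    {
        for (int c = 0; c < op->len; c++)
            for (int r = 0; r < RESOURCES; r++)
                if (op->usage[c][r] && table[start + c][r])
                    return 0;        /* resource collision */
        return 1;
    }

    static void reserve(const OpClass *op, int start)
    {
        for (int c = 0; c < op->len; c++)
            for (int r = 0; r < RESOURCES; r++)
                if (op->usage[c][r])
                    table[start + c][r] = 1;
    }

    int main(void)
    {
        memset(table, 0, sizeof table);
        reserve(&load_op, 0);
        printf("second load at cycle 0: %s\n", fits(&load_op, 0) ? "ok" : "conflict");
        printf("second load at cycle 2: %s\n", fits(&load_op, 2) ? "ok" : "conflict");
        printf("alu at cycle 0:         %s\n", fits(&alu_op, 0) ? "ok" : "conflict");
        return 0;
    }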


Resource Management During Scheduling (cont.)

• Finite-state automata
  – Intuition
    • "Is this instruction sequence a resource-legal schedule?" is analogous to "Does this FSA accept this string?"
    • A schedule is a sequence of instructions, just as a string is a sequence of alphabet characters
    • The set of resource-valid schedules forms a language
  – FSAs are sufficient to accept this language (see the sketch below)
  – Several approaches improve efficiency
    • Breaking the automaton into "factor" automata, reversing automata, and non-determinism
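A minimal sketch of resource modeling with a finite-state automaton, assuming a made-up machine with one ALU (busy one cycle), one non-pipelined multiplier (busy two cycles), and one operation issued per cycle. A sequence is resource-legal exactly when the automaton accepts it.

    #include <stdio.h>

    enum op { NOP, ALU, MUL };

    /* next_state[state][op]: -1 = reject (resource collision), else next state;
     * the state records whether the multiplier is still busy this cycle */
    static const int next_state[2][3] = {
        /* state 0: multiplier free       */ { 0, 0,  1 },
        /* state 1: multiplier still busy */ { 0, 0, -1 },
    };

    static int accepts(const enum op *seq, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++) {
            s = next_state[s][seq[i]];
            if (s < 0) return 0;
        }
        return 1;
    }

    int main(void)
    {
        enum op ok[]  = { MUL, ALU, MUL, NOP, ALU };   /* legal sequence        */
        enum op bad[] = { MUL, MUL, ALU };             /* back-to-back MULs collide */
        printf("seq1: %s\n", accepts(ok, 5)  ? "resource-legal" : "illegal");
        printf("seq2: %s\n", accepts(bad, 3) ? "resource-legal" : "illegal");
        return 0;
    }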


Resource Management During Scheduling (cont.)

• Finite-state automata (example)
  – Original automaton: represents a two-resource machine
  – Factored automata: a "Letter" automaton and a "Number" automaton, since the two resources are used by independent operations
  – The cross-product of the factored automata is equivalent to the original automaton
[Figure: the original automaton and its two factor automata]

• Reverse automata and nondeterministic automata are further refinements (not detailed here)

Loop Scheduling

• Loop scheduling approaches
  – Most of a program's execution time is spent in loops
  – The simplest approach is loop unrolling
  – Software pipelining
    • Exploits inter-iteration ILP: parallelism across iterations
    • Modulo scheduling
      – Produces a kernel of code
      – Kernel: overlapped copies of multiple loop iterations, arranged so that there are neither data-dependence violations nor resource conflicts
    • Prologue and epilogue code is needed for correctness
      – This increases code size; hardware techniques can reduce it

[Figure: conceptual illustration of software pipelining (overlapped loop iterations)]

Loop Scheduling (cont.)

• Modulo scheduling
  – Initiation Interval (II)
    • The length of the kernel: the constant interval between the starts of successive kernel iterations
  – Minimum II (MII)
    • A lower bound on the achievable II
    • Two constraints determine the MII
      – Recurrence-constrained minimum II (RecMII)
      – Resource-constrained minimum II (ResMII)

Loop Scheduling (cont.)

• Modulo scheduling
  – Goal
    • Arrange operations so that they can be repeated at the smallest possible II (this maximizes throughput)
      – Rather than minimizing the stage count of each iteration, which would minimize latency
      – But stage count still matters, because it determines the size of the prologue (pipeline filling) and epilogue (pipeline draining)
  – Downsides of modulo scheduling
    • Hard to handle nested loops
    • Control flow in the loop body can be handled only through predication

[Figure: conceptual model of modulo scheduling on a 4-wide machine with a 3-cycle load and 2-cycle multiply and compare; question posed: how many inter-iteration dependences are there?]

Loop Scheduling (cont.)

• Modulo scheduling
  – Modulo Reservation Table (MRT)
    • Used to find a resource conflict-free schedule that repeats every II cycles
    • Ensures the same resource is not used more than once in the same kernel cycle
    • The MRT records and checks resource usage at row (issue cycle mod II); see the sketch below
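A minimal sketch of the modulo-reservation-table check, where an operation scheduled at cycle t on resource r occupies row (t mod II). The II, resource set, and placements are made up for illustration.

    #include <stdio.h>
    #include <string.h>

    #define II        3
    #define RESOURCES 2              /* 0: MEM port, 1: ALU */

    static int mrt[II][RESOURCES];   /* 1 = slot already taken */

    /* try to place an op that uses resource `res` at absolute cycle `t` */
    static int place(int t, int res)
    {
        int row = t % II;
        if (mrt[row][res])
            return 0;                /* modulo resource conflict */
        mrt[row][res] = 1;
        return 1;
    }

    int main(void)
    {
        memset(mrt, 0, sizeof mrt);
        printf("load at cycle 0: %s\n", place(0, 0) ? "placed" : "conflict");
        /* cycle 3 mod 3 == cycle 0 mod 3, so both loads would need the MEM
         * port in the same kernel cycle of every iteration */
        printf("load at cycle 3: %s\n", place(3, 0) ? "placed" : "conflict");
        printf("load at cycle 4: %s\n", place(4, 0) ? "placed" : "conflict");
        printf("add  at cycle 0: %s\n", place(0, 1) ? "placed" : "conflict");
        return 0;
    }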

[Figure: modulo reservation table example]

Loop Scheduling (cont.)

• Modulo scheduling
  – Searching for the II
    • First find the two bounds: minII and maxII
      – maxII is trivial: the sum of the latencies of all operations in the loop
      – minII is more complex: max(ResMII, RecMII)
        » Considers resource constraints and both intra- and inter-iteration dependences
    • Then search for a legal schedule within that range (a sketch of the bound computation follows)
      – Usually a modified list scheduler is used, with resource checking for each assignment done through the MRT
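A minimal sketch of the two MII bounds. The machine description and the recurrence cycles are made up; a real compiler enumerates the recurrences from the DDG rather than hard-coding them.

    /* ResMII = max over resource classes of ceil(uses / units)
     * RecMII = max over recurrence cycles of ceil(latency / distance) */
    #include <stdio.h>

    static int ceil_div(int a, int b) { return (a + b - 1) / b; }

    int main(void)
    {
        /* resource classes: {units available, operations in loop using them} */
        struct { int units, uses; } res[] = {
            { 1, 3 },             /* MEM: 3 memory ops share 1 memory port */
            { 2, 4 },             /* ALU: 4 ALU ops share 2 ALUs           */
        };
        /* recurrence cycles: {total latency around cycle, iteration distance} */
        struct { int latency, distance; } rec[] = {
            { 4, 1 },             /* e.g. acc = acc + x, 4-cycle chain     */
            { 6, 2 },             /* dependence spanning two iterations    */
        };

        int resmii = 0, recmii = 0;
        for (int i = 0; i < 2; i++) {
            int r = ceil_div(res[i].uses, res[i].units);
            if (r > resmii) resmii = r;
        }
        for (int i = 0; i < 2; i++) {
            int r = ceil_div(rec[i].latency, rec[i].distance);
            if (r > recmii) recmii = r;
        }
        int minii = resmii > recmii ? resmii : recmii;
        printf("ResMII=%d RecMII=%d => minII=%d\n", resmii, recmii, minii);
        return 0;
    }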


Loop Scheduling (cont.)

• Modulo scheduling
  – Searching for the II: the basic scheme of iterative modulo scheduling

    minII = compute_minII();
    maxII = compute_maxII();
    found = false;
    II = minII;
    while (!found && II <= maxII) {
        found = try_to_modulo_schedule(II, budget);
        II = II + 1;
    }
    if (!found)
        trouble();   /* should not happen: maxII was computed wrong */

Loop Scheduling (cont.)

• Modulo scheduling
  – Prologues and epilogues
    • Partial copies of the kernel
    • More complex for multiple-exit loops
    • In practice, multiple epilogues are almost always a necessity (beyond our scope here)
  – Kernel-only loop scheduling
    • Condition 1: prologues and epilogues are proper subsets of the kernel code in which some operations have been disabled
    • Condition 2: the architecture is fully predicated
[Figure: kernel-only code controlled by predicates]

Loop Scheduling (cont.)

• Modulo scheduling
  – Modulo Variable Expansion (MVE)
    • The MRT gives a correct resource schedule for a given II
    • But what about register allocation when the lifetime of a value within an iteration exceeds the II?
      – A simple register-allocation policy will not work: the value gets overwritten by the next iteration
    • Solution: effectively extend the II without losing performance by unrolling the kernel, i.e. modulo variable expansion
    • Must unroll by at least a factor k = ceil(v / II), where v is the length of the longest lifetime (small worked example below)
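A small worked example of the unroll factor k = ceil(v / II), with made-up lifetimes.

    #include <stdio.h>

    static int ceil_div(int a, int b) { return (a + b - 1) / b; }

    int main(void)
    {
        int ii = 2;
        int lifetimes[] = { 1, 3, 5 };   /* cycles each value stays live */
        int v = 0;
        for (int i = 0; i < 3; i++)
            if (lifetimes[i] > v) v = lifetimes[i];
        /* v = 5, II = 2 -> k = 3: the kernel is unrolled 3 times and the
         * value rotated through 3 registers so that copies from successive
         * iterations do not clobber each other */
        printf("k = %d\n", ceil_div(v, ii));
        return 0;
    }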


Loop Scheduling (cont.)

• Modulo scheduling
  – Modulo variable expansion increases kernel length, register pressure, …
  – Solution: rotating registers
    • Physical register instantiation: the combination of a logical identifier and a register base that is incremented at every iteration
    • A reference to register r at iteration i therefore points to a different physical location than at iteration i+1
    • With rotating registers it is possible to avoid modulo variable expansion entirely

[Figure: register r1 needs to hold the same variable in two different iterations, but the lifetimes overlap, so the kernel must be unrolled twice]

[Figure: using two registers (r1, r11) resolves the overlap; throughput is unchanged, but code size grows]

Loop Scheduling (cont.)

• Modulo scheduling
  – Iterative modulo scheduling
    • It is sometimes hard to find a schedule because of a complex MRT
    • To improve the probability of finding a schedule, a controlled form of backtracking (unscheduling and rescheduling of instructions) is allowed
  – Advanced modulo scheduling techniques
    • So far, several heuristics: e.g. guessing a good minII
    • Recent techniques
      – e.g. Hypernode reduction modulo scheduling (HRMS): reduces loop-variant lifetimes while keeping the II constant

Loop Scheduling (cont.)

• Clustering
  – Why clustering is needed
    • A practical way to meet high register demand, as opposed to a heavily multiported register file or large bypassing logic
      – Multiported register files are expensive and scale poorly
    • A clustered architecture divides the datapath into separate clusters
    • Each cluster has its own register bank and functional units
    • In general, explicit intercluster copy operations are needed
  – The compiler's new role
    • Minimize intercluster moves and balance the load across clusters

Loop Scheduling (cont.)

• Clustering
  – Preassignment techniques
    • In general, clustering is performed before scheduling
    • Two techniques
      – Bottom-up greedy (BUG)
        » Two phases: a traversal from exit to entry, followed by assignment
      – Partial-component clustering (PCC)
        » Reduces complexity by constructing macronodes
  – Clustering overheads
    • Two clusters: roughly 15-20% lost cycles; four clusters: 25-30%

3. Register Allocation

• Register allocation
  – Memory is far larger than the register space
  – An NP-hard problem, old and well known
    • Standard technique: coloring of the interference graph
    • More recent: nonstandard register-allocation techniques
      – Faster than, and in some cases competitive with, graph coloring
      – Linear-scan allocators (see the sketch below)
        » Of particular interest for JITs and dynamic translation
  – Trade-offs between compile time and run time
    • Spending more compile time is feasible today because machines are faster
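A minimal sketch in the spirit of linear-scan allocation. The live intervals and two-register machine are made up, and unlike the classic algorithm this simplified version spills the newly arriving interval rather than the active one with the furthest end point.

    #include <stdio.h>

    #define NREGS 2
    #define NVARS 5

    typedef struct { const char *name; int start, end, reg, spilled; } Interval;

    int main(void)
    {
        Interval iv[NVARS] = {            /* sorted by increasing start */
            { "a", 0, 8, -1, 0 },
            { "b", 1, 3, -1, 0 },
            { "c", 2, 9, -1, 0 },
            { "d", 4, 5, -1, 0 },
            { "e", 6, 7, -1, 0 },
        };
        int reg_free_at[NREGS] = {0, 0};  /* cycle at which each register frees up */

        for (int i = 0; i < NVARS; i++) {
            int r = -1;
            /* find a register whose previous interval has expired */
            for (int k = 0; k < NREGS; k++)
                if (reg_free_at[k] <= iv[i].start) { r = k; break; }
            if (r >= 0) {
                iv[i].reg = r;
                reg_free_at[r] = iv[i].end + 1;
            } else {
                iv[i].spilled = 1;        /* no free register: spill */
            }
        }
        for (int i = 0; i < NVARS; i++) {
            if (iv[i].spilled) printf("%s: spilled\n", iv[i].name);
            else               printf("%s: r%d\n", iv[i].name, iv[i].reg);
        }
        return 0;
    }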


Phase-ordering Issues

• Phase ordering is a hard problem
  – Should register allocation be done before, after, or at the same time as scheduling?
  – Register allocation and scheduling have conflicting goals
    • The register allocator tries to minimize spills and restores, creating sequential constraints through register reuse
    • The scheduler tries to fill all parallel units
  – How should they be ordered? A very tricky problem

Phase-ordering Issues

• Scheduling, then register allocation, then post-scheduling
  – The most popular choice (common for modern RISC compilers)
  – Favors ILP over efficient register utilization
    • Assumes enough registers are available
  – The post-scheduler rearranges the code after allocation
  – The three phases (from the figure)
    • Scheduling: performed without regard for the number of physical registers actually available
    • Register allocation: no allocation might exist that makes the schedule legal as-is, so spills and restores are inserted
    • Post-scheduling: after spills/restores are inserted, the schedule is fixed up to make it legal again with the fewest possible added cycles

Phase-ordering Issues

• Register allocation followed by scheduling
  – Favors register use over exploiting ILP
  – Works well with few GPRs (e.g. x86)
  – But the register allocator introduces additional dependences every time it reuses a register
  – The two phases (from the figure)
    • Register allocation: produces code with all registers assigned
    • Scheduling: runs afterwards, though not very effectively, because register allocation has inserted many false dependences

Phase-ordering Issues

• Combined register allocation and scheduling
  – Potentially very powerful, but very complex
  – A list-scheduling algorithm may not converge
  – Doing the two together is difficult engineering, and it is difficult to ensure that scheduling will ever terminate
• Cooperative approaches
  – The scheduler monitors register resources and estimates register pressure in its heuristics

4. Speculation and Predication

• Speculation and predication
  – Remove and transform control dependences
  – They are usually independent techniques, and in a given situation one is often much more appropriate than the other
  – Note that predication is important for software pipelining

Control and Data Speculation

• Control and data speculation
  – Recall exception behavior under the recovery model
    • Operations are split into nonexcepting parts and sentinel (checking) parts
    • From the compiler's perspective, supporting nonexcepting loads is complicated because of the recovery code that must be generated and handled
  – Speculative code motion (code hoisting)
    • Removes actual control dependences, unlike predication
    • The compiler must consider the supported exception model and speculative memory operations

[Figure: speculative code motion example, where a load hoisted above a branch becomes a speculative load (load.s)]

Predicated Execution

• Compiler techniques for predication
  – Examples: if-conversion, logical reduction of predicates, reverse if-conversion, and hyperblock-based scheduling
  – If-conversion (C-level sketch below)
    • Translates control dependence into data dependence
    • Converts an acyclic subset of the CFG from unpredicated code into straight-line code with predication
    • Also tries to minimize the number of predicate values used
      – This is the role of logical reduction of predicates
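A C-level sketch of the effect of if-conversion. The example function is made up, and the C conditional expression stands in for the predicated operations and select that a real predicated machine would use.

    #include <stdio.h>

    /* original, unpredicated control flow */
    static int before(int a, int b, int c)
    {
        int x;
        if (c > 0)            /* branch: control dependence */
            x = a + b;
        else
            x = a - b;
        return x * 2;
    }

    /* after if-conversion: straight-line code guarded by predicates */
    static int after(int a, int b, int c)
    {
        int p  = (c > 0);     /* predicate define (compare)            */
        int t1 = a + b;       /* would execute under predicate p       */
        int t2 = a - b;       /* would execute under predicate !p      */
        int x  = p ? t1 : t2; /* select merges the two results         */
        return x * 2;
    }

    int main(void)
    {
        printf("%d %d\n", before(3, 4, 1),  after(3, 4, 1));   /* 14 14 */
        printf("%d %d\n", before(3, 4, -1), after(3, 4, -1));  /* -2 -2 */
        return 0;
    }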


Predicated Execution (cont.)

• Compiler techniques for predication
  – Reverse if-conversion
    • Removes predicates, returning to unpredicated code
    • Makes it worthwhile to if-convert aggressively: when predicate registers run short, code can be selectively reverse if-converted
  – Hyperblock-based scheduling
    • A unified framework for both speculation and predication
    • First choose a hyperblock region, then apply if-conversion
      – Gives the schedule constructor much more freedom to schedule and removes speculative constraints

[Figure: example of predicated code, with an annotation marking the operations that are always executed]

Predicated Execution (cont.)

• Case studies in embedded systems
  – Embedded ISAs are usually not fully predicated the way the IPF architecture is
  – ARM includes a 4-bit condition field in every operation
    • This makes every instruction look predicated
    • But the "predicate register" is the usual set of condition-code flags rather than an index into a general predicate register file
  – TI C6x supports full predication
    • Five of the general-purpose registers can be specified as condition registers

Prefetching

• Memory prefetching
  – A form of speculation, invisible to the program's semantics
  – Compiler-supported prefetching beats pure hardware prefetching in many cases
• Compiler assistance for prefetching (sketch below)
  – The ISA includes a prefetch instruction
    • It is only a hint to the hardware
  – Automatic insertion requires the compiler to understand loop behavior
  – Unneeded prefetches waste resources
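A minimal sketch of compiler-style prefetch insertion, written by hand with the GCC/Clang __builtin_prefetch hint. The array size, stride, and prefetch distance are made up; a real compiler derives the distance from memory latency and the loop bounds.

    #include <stdio.h>

    #define N        4096
    #define DISTANCE 16              /* elements ahead to prefetch */

    static double a[N], b[N];

    static double dot(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++) {
            if (i + DISTANCE < N) {
                /* hint only: read access, high temporal locality */
                __builtin_prefetch(&a[i + DISTANCE], 0, 3);
                __builtin_prefetch(&b[i + DISTANCE], 0, 3);
            }
            s += a[i] * b[i];
        }
        return s;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }
        printf("%f\n", dot());       /* 8192.0 */
        return 0;
    }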


Other Topics

• Data layout methods
  – Increase locality by laying data out with cache lines in mind
• Static and hybrid branch prediction
  – Profiles are used to set static branch predictions
  – A more sophisticated approach: hybrid methods that predict either statically or dynamically
    • e.g. IPF includes four branch-hint encodings: static taken, static not-taken, dynamic taken, and dynamic not-taken

5. Instruction Selection

• Instruction selection
  – Translates a tree-structured, linguistically oriented IR into an operation- and machine-oriented IR
  – Especially important for complex instruction sets
  – Recent technique: cost-based pattern-matching rewriting systems
    • "Match" or "cover" the parse tree produced by the front end using a minimum-cost set of operation subtrees
    • e.g. BURS (bottom-up rewriting systems): a first pass labels each node in the parse tree; a second pass reads the labels and generates target machine operations (see the sketch below)
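A minimal, BURS-flavored sketch of cost-based tree tiling. The IR node kinds, the three patterns (add, mul, fused multiply-add), and their costs are made up, and real systems generate the matcher from a machine grammar rather than hand-coding it.

    #include <stdio.h>

    typedef enum { REG, ADD, MUL } Kind;
    typedef struct Node {
        Kind kind;
        const char *name;            /* for REG leaves              */
        struct Node *l, *r;
        int cost;                    /* min cost to cover subtree   */
        int use_madd;                /* chosen rule at an ADD node  */
    } Node;

    /* pass 1: bottom-up labeling with minimal covering cost */
    static void label(Node *n)
    {
        if (n->kind == REG) { n->cost = 0; return; }
        label(n->l); label(n->r);
        if (n->kind == MUL) { n->cost = n->l->cost + n->r->cost + 2; return; }
        /* ADD: either a plain add, or a fused multiply-add if one child is MUL */
        int plain = n->l->cost + n->r->cost + 1;
        n->cost = plain; n->use_madd = 0;
        if (n->l->kind == MUL) {
            int fused = n->l->l->cost + n->l->r->cost + n->r->cost + 2;
            if (fused < plain) { n->cost = fused; n->use_madd = 1; }
        }
    }

    /* pass 2: top-down emission following the chosen rules */
    static int tmp = 0;
    static int emit(Node *n)
    {
        if (n->kind == REG) { printf("  t%d = %s\n", ++tmp, n->name); return tmp; }
        if (n->kind == ADD && n->use_madd) {
            int a = emit(n->l->l), b = emit(n->l->r), c = emit(n->r);
            printf("  t%d = madd t%d, t%d, t%d\n", ++tmp, a, b, c); return tmp;
        }
        int a = emit(n->l), b = emit(n->r);
        printf("  t%d = %s t%d, t%d\n", ++tmp, n->kind == MUL ? "mul" : "add", a, b);
        return tmp;
    }

    int main(void)
    {
        /* tree for (a * b) + c */
        Node a = {REG, "a"}, b = {REG, "b"}, c = {REG, "c"};
        Node m = {MUL, 0, &a, &b}, root = {ADD, 0, &m, &c};
        label(&root);
        printf("min cost = %d\n", root.cost);   /* 2 with madd vs 3 without */
        emit(&root);
        return 0;
    }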
