Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

57
Topic 6a Topic 6a Topic 6a Topic 6a Basic Back Basic Back Basic Back Basic Back- - -End Optimization End Optimization End Optimization End Optimization Instruction Selection 2008/4/15 \course\cpeg421-08s\Topic6a.ppt 1 Instruction scheduling Register allocation

Transcript of Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Page 1: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Topic 6a Topic 6a Topic 6a Topic 6a

Basic BackBasic BackBasic BackBasic Back----End OptimizationEnd OptimizationEnd OptimizationEnd Optimization

Instruction Selection

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 1

Instruction Selection

Instruction scheduling

Register allocation

Page 2: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

ABET Outcome

� Ability to apply knowledge of basic code generation techniques,

e.g. Instruction scheduling, register allocation, to solve code generation problems.

� An ability to identify, formulate and solve loops scheduling problems using software pipelining techniques

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 2

problems using software pipelining techniques

� Ability to analyze the basic algorithms on the above techniques and conduct experiments to show their effectiveness.

� Ability to use a modern compiler development platform and tools for the practice of above.

� A Knowledge on contemporary issues on this topic.

Page 3: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Reading ListReading ListReading ListReading List

(1) K. D. Cooper & L. Torczon, Engineering a Compiler, Chapter 12

(2) Dragon Book, Chapter 10.1 ~ 10.4

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 3

Page 4: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

A Short Tour on A Short Tour on A Short Tour on A Short Tour on

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 4

Data DependenceData DependenceData DependenceData Dependence

Page 5: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Basic Concept and MotivationBasic Concept and MotivationBasic Concept and MotivationBasic Concept and Motivation

�Data dependence between 2 accesses

The same memory location

Exist an execution path between them

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 5

At least one of them is a write

�Three types of data dependencies

�Dependence graphs

�Things are not simple when dealing with loops

Page 6: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Data Dependencies

�There is a data dependence between statements Si and Sj if and only if

Both statements access the same memory location and at least one of the statements

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 6

location and at least one of the statements writes into it, and

There is a feasible run-time execution path from Si to Sj

Page 7: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Types of Types of Data Dependencies

� Flow (true) Dependencies - write/read (δ)x := 4;

y := x + 1;

� Output Dependencies - write/write (δo)

δ

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 7

� Output Dependencies - write/write (δo)x := 4;

x := y + 1;

� Anti-dependencies - read/write (δ-1)y := x + 1;

x := 4;

-1

δ--

Page 8: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

(1) x := 4

(2) y := 6

(3) p := x + 2

(4) z := y + p

(5) x := z

x := 4 y := 6

p := x + 2

An Example of An Example of Data Dependencies

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 8

(5) x := z

(6) y := pz := y + p

y := p x := z

Flow

Output

Anti

Page 9: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Data Dependence Graph (DDG)

� Forms a data dependence graph between statements

nodes = statements

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 9

nodes = statements

edges = dependence relation (type label)

Page 10: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

Example 1:

S1: A = 0

S1

S2

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 10

S2: B = A

S3: C = A + D

S4: D = 2

Sx δ Sy ⇒ flow dependence

S3

S4

Page 11: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Example 2:

S1: A = 0

S2: B = A

S3: A = B + 1

S4: C = A

S1

S2

S3

Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 11

S4: C = A S3

S4

Page 12: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Should we consider input Should we consider input dependence?dependence?

Should we consider input Should we consider input dependence?dependence?

= XIs the reading of the

same X important?

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 12

= X Well, it may be!

(if we intend to group the 2 reads

together for cache optimization!)

Page 13: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

- register allocation- instruction scheduling- loop scheduling

Applications of Data Applications of Data

Dependence GraphDependence Graph

Applications of Data Applications of Data

Dependence GraphDependence Graph

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 13

- loop scheduling- vectorization- parallelization - memory hierarchy optimization- ……

Page 14: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Data Dependence in Loops

Problem: How to extend the concept to loops?

(s1) do i = 1,5

(s2) x := a + 1; s2 δ-1 s3, s2 δ s3

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 14

(s2) x := a + 1; s2 δ-1 s3, s2 δ s3

(s3) a := x - 2;

(s4) end do s3 δ s2 (next iteration)

Page 15: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Reordering Transformation

A reordering transformation is any program transformation that merely changes the order of execution of the code, without adding or deleting any executions of any statements.

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 15

deleting any executions of any statements.

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence.

Page 16: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

� Instruction Scheduling

� Loop restructuring

� Exploiting Parallelism

Reordering Transformations (Con’t)

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 16

� Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location. Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them.

� …

Page 17: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Reordering Transformation using DDG

�Given a correct data dependence graph, any order-based optimization that does not change the dependences of a

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 17

not change the dependences of a program is guaranteed not to change the results of the program.

Page 18: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Instruction SchedulingInstruction SchedulingInstruction SchedulingInstruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

Motivation

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 18

multiple independent instructions through

pipelining and multiple functional units.

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions.

Page 19: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Instruction scheduling (con’t)

• Assume all instructions are essential, i.e., we

have finished optimizing the IR.

Instruction

Schedular

Original Code Reordered Code

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 19

have finished optimizing the IR.

• Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP).

• It is one of the instruction-level optimizations

• Instruction scheduling (IS) is NP-complete, so

heuristics must be used.

Page 20: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Instruction scheduling:

A Simple Example

a = 1 + x

b = 2 + y

time

a = 1 + x b = 2 + y c = 3 + z

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 20

c = 3 + z

Since all three instructions are independent, we can

execute them in parallel, assuming adequate hardware

processing resources.

Page 21: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Hardware ParallelismHardware ParallelismHardware ParallelismHardware Parallelism

Three forms of parallelism are found in

modern hardware:

¤ pipelining

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 21

¤ superscalar processing

¤ multiprocessing

Of these, the first two forms are commonly

exploited by instruction scheduling.

Page 22: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Pipelining & Superscalar Pipelining & Superscalar Pipelining & Superscalar Pipelining & Superscalar

ProcessingProcessingProcessingProcessing

PipeliningDecompose an instruction’s execution into a sequence of

stages, so that multiple instruction executions can be

overlapped. It has the same principle as the assembly

line.

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 22

line.

Superscalar ProcessingMultiple instructions proceed simultaneously through the

same pipeline stages. This is accomplished by adding

more hardware, for parallel execution of stages and for

dispatching instructions to them.

Page 23: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

A Classic FiveA Classic FiveA Classic FiveA Classic Five----Stage PipelineStage PipelineStage PipelineStage Pipeline

IF RF EX ME WB

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 23

time

- memory access

- write back to register file

Page 24: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Pipeline IllustrationPipeline IllustrationPipeline IllustrationPipeline Illustration

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard Von Neumann model

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 24

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

In a given cycle, each

instruction is in a different

stage, but every stage is active

The pipeline is “full” here

Page 25: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Parallelism in a pipeline

Example:i1: add r1, r1, r2

i2: add r3 r3, r1

i3: lw r4, 0(r1)

i4: add r5 r3, r4

Assume:

Register instruction 1 cycle

Memory instruction 3 cycle

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 25

Consider two possible instruction schedules (permutations):`

i1 i2 i3 i4

2 Idle Cycles

Schedule S1 (completion time = 6 cycles):

i1 i3 i2 i4Schedule S2 (completion time = 5 cycles):

1 Idle Cycle

Page 26: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Superscalar IllustrationSuperscalar IllustrationSuperscalar IllustrationSuperscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

FU1

FU2

FU1

FU

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 26

IF RF EX ME WB

IF RF EX ME WB

the same pipeline stage at

the same timeIF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

FU2

FU1

FU2

FU1

FU2

FU1

FU2

Page 27: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

A Quiz

Give the following instructions:i1: move r1 ← r0

i2: mul r4 ← r2, r3

i3: mul r5 ← r4, r1

i4: add r6 ← r4, r2

Assume mul takes 2 cycles, other instructions take 1 cycle.

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 27

Assume mul takes 2 cycles, other instructions take 1 cycle.

Schedule the instructions in a “clean” pipeline.

Q1. For above sequence, can the pipeline issue an instruction in

each cycle? Why?

Q2. Is there a possible instruction scheduling such that the

pipeline can issue an instruction in each cycle?

i2 i1 i3 i4

Yes!. There is a schedule: i1 i2

i3 i4

No. think about i2 and i3.

Page 28: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B, then B can’t

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 28

read by instruction B, then B can’t

execute before A is completed.

Resource hazards

Finiteness of hardware function units

means limited parallelism.

Page 29: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Scheduling Complications

�� Hardware Resources

• finite set of FUs with instruction type, and width,

and latency constraints

�� Data Dependences

• can’t consume a result before it is produced

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 29

• can’t consume a result before it is produced

• ambiguous dependences create many challenges

�� Control Dependences

• impractical to schedule for all possible paths

• choosing an “expected” path may be difficult

• recovery costs can be non-trivial if you are wrong

Page 30: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Legality Constraint for Legality Constraint for Legality Constraint for Legality Constraint for Instruction SchedulingInstruction SchedulingInstruction SchedulingInstruction Scheduling

Question: when must we preserve the

order of two instructions, i

and j ?

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 30

and j ?

Answer: when there is a dependence from i

to j.

Page 31: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

General Approaches ofGeneral Approaches ofGeneral Approaches ofGeneral Approaches ofInstruction SchedulingInstruction SchedulingInstruction SchedulingInstruction Scheduling

�Trace scheduling

�Software pipelining

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 31

�Software pipelining

�List scheduling

� ……

Page 32: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Trace Scheduling

A technique for scheduling instructions across basic blocks.

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 32

�The Basic Idea of trace scheduling

:: Uses information about actual program behaviors to select regions for scheduling.

Page 33: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Software Pipelining

A technique for scheduling instructions across loop iterations.

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 33

� The Basic Idea of software pipelining

:: Rewrite the loop as a repeating pattern that overlaps instructions from different iterations.

Page 34: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

List Scheduling

The basic idea of list scheduling:

:: Maintain a list of instructions that are ready to execute

A most common technique for scheduling instructions within a basic block.

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 34

• data dependence constraints would be preserved

• machine resources are available

:: Moving cycle-by-cycle through the schedule template:

• choose instructions from the list & schedule them

• update the list for the next cycle

:: Uses a greedy heuristic approach

:: Has forward and backward forms

:: Is the basis for most algorithms that perform scheduling over regions larger than a single block.

Page 35: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline/function unit as follows:

• Each DDG node is labeled a resource-reservation

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 35

• Each DDG node is labeled a resource-reservation

table whose value is the resource-reservation table

associated with the operation type of this node.

• Each edge e from node j to node k is labeled with a

weight (latency or delay) de indicting that the

destination node k must be issued no earlier than de

cycles after the source node j is issued.

Dragon book 722

Page 36: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Example of a Weighted Example of a Weighted Example of a Weighted Example of a Weighted Data Dependence GraphData Dependence GraphData Dependence GraphData Dependence Graph

i1: add r1, r1, r2

i2: add r3 r3, r1

i3: lw r4, (r1)

i4: add r5 r3, r4

i21 1

ALU

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 36

i4: add r5 r3, r4

i3

i1 i4

1 1

31

ALU

ALU

Mem

Assume:

Register instruction 1 cycle

Memory instruction 3 cycle

Page 37: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Legal Schedules for Pipeline

Consider a basic block with m instructions,

i1, …, im.

A legal sequence, S, for the basic block on a

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 37

A legal sequence, S, for the basic block on a

pipeline consists of:

A permutation f on 1…m such that f(j) (j =1,…,m)

identifies the new position of instruction j in the

basic block. For each DDG edge form j to k, the

schedule must satisfy f(j) <= f(k)

Page 38: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Legal Schedules Pipeline (Con’t)

• Instruction start-time

An instruction start-time satisfies the following

conditions:

• Start-time (j) >= 0 for each instruction j

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 38

• Start-time (j) >= 0 for each instruction j

• No two instructions have the same start-time value

• For each DDG edge from j to k,

start-time(k) >= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

Page 39: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

• Schedule length

The length of a schedule S is defined as:

L(S) = completion time of schedule S

= MAX ({ completion-time (j)})

Legal Schedules Pipeline (Con’t)

1 ≤ j ≤ m

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 39

1 ≤ j ≤ m

• Time-optimal schedule

A schedule Si is time-optimal if L(Si) ≤ L(Sj) for all

other schedule Sj that contain the same set of operations.

The schedule S must have at least one operation n with start-time(n) = 1

Page 40: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Instruction SchedulingInstruction SchedulingInstruction SchedulingInstruction Scheduling(Simplified)(Simplified)(Simplified)(Simplified)

Given an acyclic weighted data dependence graph G

1

2 3

d12

d23

d13

d35dd

Problem Statement:

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 40

data dependence graph G with:• Directed edges: precedence

• Undirected edges: resource constraints

Determine a schedule S

such that the length of the schedule is minimized!

4 5

6

d35d34

d45

d56

d46

d26

d24

Page 41: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Simplify Resource Simplify Resource Simplify Resource Simplify Resource ConstraintsConstraintsConstraintsConstraints

Assume a machine M with n functional units or a “clean”

pipeline with n stages.

What is the complexity of a optimal scheduling algorithm

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 41

under such constraints ?

Scheduling of M is still hard!

• n = 2 : exists a polynomial time algorithm [CoffmanGraham]

• n = 3 : remain open, conjecture: NP-hard

Page 42: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

A Heuristic Rank (priority) Function Based on Critical paths

1. Attach a dummy node START as the virtual beginning node of the block, and a dummy node END as the virtual terminating node.

2. Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows (this is a forward pass):

EST[START] = 0

Critical Path: The longest path through the DDG. It determines overall

execution time of the instruction sequence represented by this DDG.

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 42

EST{y] = MAX ({EST[x] + edge_weight (x, y) |

there exists an edge from x to y } )

3. Set CPL := EST[END], the critical path length of the augmented DDG.

4. Compute LST (Latest Starting Time) of all nodes (this is a backward pass):

LST[END] = EST(END)

LST{y] = MIN ({LST[x] – edge_weight (y, x ) |

there exists an edge from y to x } )

5. Set rank (i) = LST [i] - EST [i], for each instruction i

NOTE: there are other heuristics

Build a priority list L of the instructions in non-decreasing order of ranks.

(all instructions on a critical path will have zero rank) Why?

Page 43: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Example of Rank ComputationExample of Rank ComputationExample of Rank ComputationExample of Rank Computation

i2

i3

i1 i4

1

3END

1START0

1

1

i1: add r1, r1, r2

i2: add r3 r3, r1

i3: lw r4, (r1)

i4: add r5 r3, r4

Register instruction 1 cycle

Memory instruction 3 cycle

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 43

Node x EST[X] LST[x] rank (x)Start 0 0 0

i1 0 0 0

i2 1 3 2

i3 1 1 0

i4 4 4 0

END 5 5 0

==> Priority list = (i1, i3, i4, i2)

i3Memory instruction 3 cycle

Page 44: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Other Heuristics for Ranking

� Node’s rank is the number of immediate successors?� Node’s rank is the total number of descendants ?� Node’s rank is determined by long latency?� Node’s rank is determined by the last use of a value?Critical Resources?

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 44

� Critical Resources?� Source Ordering?� Others?

Note: these heuristics help break ties, but none dominates the others.

Page 45: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Heuristic Solution:Heuristic Solution:Heuristic Solution:Heuristic Solution:Greedy List Scheduling AlgorithmGreedy List Scheduling AlgorithmGreedy List Scheduling AlgorithmGreedy List Scheduling Algorithm

for each instruction j dopred-count[j] := #predecessors of j in DDG // initializeready-instructions := {j | pred-count[j] = 0 }

while (ready-instructions is non-empty) doj := first ready instruction according to the order in priority list L;output j as the next instruction in the schedule;

Holds any operations that can

execute in the current cycle.

Remove the node j from the

processors node set of each

successor node of j. If the set

is empty, means one of the

successor of j can be issued –

no predecessor!

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 45

output j as the next instruction in the schedule;ready-instructions := ready-instructions- { j };for each successor k of j in the DDG do

pred-count[k] := pred-count[k] - 1;if (pred-count[k] = 0 ) then

ready-instructions := ready-instruction + { k };end if

end forend while Consider resource constraints beyond a single clean pipeline

execute in the current cycle.

Initially contains all the leave

nodes of the DDG because they

depend on no other operations.

If there are more than one ready

instructions, choose one

according to their orders in L.

Issue the instruction. Note: no

timing information is considered

here!

Page 46: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Instruction Scheduling for a Basic Block

Goal: find a legal schedule with minimum completion time

1. Rename to avoid output/anti-depedences (optional).2. Build the data dependence graph (DDG) for the basic block

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 46

• Node = target instruction• Edge = data dependence (flow/anti/output)

3. Assign weights to nodes and edges in the DDG so as to model target processor.• For all nodes, attach a resource reservation table• Edge weight = latency

4. Create priority list5. Iteratively select an operation and schedule

Page 47: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

QuizQuizQuizQuiz

The list scheduling algorithm does not consider the timing constraints (delay time for each instruction). How to change the algorithm so that it works with the timing information?

Question 1

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 47

it works with the timing information?

The list scheduling algorithm does not consider the resource constraints. How to change the algorithm so that it works with the resource constraints?

Question 2

Page 48: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Special Performance Special Performance Special Performance Special Performance BoundsBoundsBoundsBounds

The list scheduling produces a schedule that is within a factor of 2 of optimal for a machine with one or more identical pipelines and a factor of

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 48

one or more identical pipelines and a factor of p+1 for a machine that has p pipelines with different functions. [Lawler et al. Pipeline

Scheduling: A Survey, 1987]

Page 49: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Properties of List Scheduling

• Complexity: O(n2) --- where n is the number of nodes in the DDG

• In practice, it is dominated by DDG

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 49Note : we are considering basic block scheduling here

• In practice, it is dominated by DDG building which itself is also O(n2)

Page 50: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Local vs. Global SchedulingLocal vs. Global SchedulingLocal vs. Global SchedulingLocal vs. Global Scheduling

1. Straight-line code (basic block) – Local scheduling

2. Acyclic control flow – Global scheduling

• Trace scheduling

• Hyperblock/superblock scheduling

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 50

• Hyperblock/superblock scheduling

• IGLS (integrated Global and Local Scheduling)

3. Loops - a solution for this case is loop

unrolling+scheduling, another solution is

software pipelining or modulo scheduling i.e., to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations.

Page 51: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

SummarySummarySummarySummary

1. Data Dependence and DDG

2. Reordering Transformations

3. Hardware Parallelism

4. Parallelism Constraints

5. Scheduling Complications

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 51

5. Scheduling Complications

6. Legal Schedules for Pipeline

7. List Scheduling• Weighted DDG

• Rank Function Based on Critical paths

• Greedy List Scheduling Algorithm

Page 52: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Case Study

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 52

Instruction Scheduling in Open64

Page 53: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Phase Ordering

Amenable for

SWP ?

Acyclic Global sched

-Multiple block-Large loop body-And else

YesNo

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 53

SWP

SWP-RA

Global Register Alloc

Local Register Alloc

Acyclic Global sched

Code emit

Page 54: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Global Acyclic Instruction Scheduling

�Perform scheduling within loop or a area not enclosed by any loop

�It is not capable of moving instruction across iterations or out (or into a loop)

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 54

iterations or out (or into a loop)

�Instructions are moved across basic block boundary

�Primary priority function: the dep-height weighted by edge-frequency

�Prepass schedule: invoked before register allocator

Page 55: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Scheduling Region Hierarchy

Rgn formation

Nested regions are visited (by scheduler)

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 55

Global CFG

formation

Region hierarchy

scheduler) prior to their enclosing outer region

SWP candidate

Irreducible loop, not for global sched

Page 56: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Global Scheduling Example

After scheduleBefore schedule

bb3

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 56

Page 57: Topic 6a Basic BackBasic Back- ---End OptimizationEnd ...

Local Instruction Scheduling

�Postpass : after register allocation

�On demand: only schedule those block whose instruction are changed after global scheduling.

2008/4/15 \course\cpeg421-08s\Topic6a.ppt 57

�Forward List scheduling

�Priority function: dependence height + others used to break tie. (compare dep-height and slack)