Optimal Scheduling of Mathematical Transformations for Array Architectures
Amanjyot Singh Johar
Department of Electrical and Computer Engineering
University of Illinois at Chicago
Abstract
Mathematical transforms play an important role in digital signal processing systems. In
high-level synthesis for digital signal processing systems with an array-structured
architecture, one of the most important steps is scheduling the operations that govern
transformations such as the FFT, DFT, DCT and CORDIC. When allocating operations to
processors, it is mandatory to take the communication time between processors into
account. This project proposes a scheduling method which derives an optimal schedule,
achieving the minimum iteration period and latency for a given signal processing
algorithm on the specified processor array. The scheduling problem is modeled as an
integer linear program and solved by an ILP solver.
Introduction
With the development of VLSI technology, wire delay is becoming relatively larger than
gate delay. In high-level descriptions and designs for implementing a high-speed VLSI, it
is essential to estimate not only the gate delay but the wire delay as well. A parallel
processing system on an array architecture is one of the architectures suited to high-speed
VLSIs. It realizes parallel processing, which is the key to fully utilizing the enormous
number of gates on a VLSI chip. An array architecture consists of a large number of
processing elements (PEs) interconnected as shown in Figure 1. In the array architecture,
direct data communication is limited to PEs which are physically adjacent on the
VLSI chip. Data communication between PEs that are not physically adjacent is achieved by
intermediate PEs relaying the data. In this communication model, it is easy to estimate
the wire delay (data communication delay) in the high-level design of an array architecture:
the data communication time is proportional to the distance between the source and the
destination PEs.
Figure 1: Interconnection of Processing Elements in an Array architecture.
One of the most important procedures of high-level synthesis is scheduling. In general,
scheduling consists of time assignment and processor allocation for a particular
operation. The time assignment determines when each operation is executed. The
processor allocation determines which PE executes each operation. It is well known that
optimal scheduling must consider the time assignment and the processor allocation
simultaneously, and that it is an NP-hard problem. Most scheduling techniques split
the scheduling problem into time scheduling and processor allocation, handled separately,
to reduce the CPU time needed for scheduling. For an array architecture, however, the
scheduling has to consider time assignment and processor allocation simultaneously. This
is because the processor allocation affects the data communication time between
operations when two back-to-back operations are not scheduled on the same or adjacent
processors. Further, the time assignment depends on the data communication time. In
addition, the time assignment used to resolve resource conflicts affects the processor
allocation. In order to obtain optimized schedules, the problem can be modeled as an
integer linear programming (ILP) problem and solved by an ILP solver. An ILP model of
scheduling for an array architecture is formulated in this project.
High-Level Synthesis and Scheduling
High-level synthesis can be described as the process of translating a behavioral
description into a structural description that consists of a set of connected components
called the data-path, and a controller that sequences and controls the functioning of these
components. High-level synthesis starts at the system level and proceeds downwards to the
register transfer (RT) level, the logic level and finally the circuit level, each time adding
the additional information needed at the next level of synthesis. The five major tasks
involved in high-level synthesis are described below. The first three steps lead to the
data-path formation and the last step leads to the formation of the controller.
1. Compilation : Compilation involves translation of the design description into an
intermediate representation that is most suitable for high-level synthesis.
2. Partitioning : Partitioning deals with division of the intermediate representation
(i.e., the behavioral description or the design) into sub-representations in order to
reduce the problem size.
3. Scheduling : Scheduling partitions the intermediate representation into time steps,
thereby generating a finite state machine model.
4. Allocation : Allocation, though closely intertwined with scheduling, involves
partitioning of the intermediate representation with respect to space (hardware
resources), which is also known as spatial mapping.
5. Control generation : Finally, this step involves the derivation of the controller that
sequences the design and controls the functional and storage units in the datapath.
Scheduling is one of the most important and primary tasks in high-level synthesis.
Scheduling can be described as the process of dividing the intermediate representation
into states and control steps, in such a way that it can be directly synthesized into a Finite
State Machine with Datapath (FSMD) model. In other words, scheduling performs a temporal
mapping of the given representation. A behavioral description, and hence the intermediate
representation, consists of a sequence of operations to be performed by the synthesized
hardware. The task of scheduling partitions these operations into time steps such that
each operation is executed in one time step. Each time/control step corresponds to one
state of the controlling finite state machine in the FSMD model.
Scheduling
Scheduling determines the precise start time of each operation for a given data flow
graph. The start times must satisfy the original dependencies of the graph, which limit the
amount of parallelism of the operations. This means that the scheduling determines the
concurrency of the resultant implementation, which, in turn, affects the performance. The
maximum number of concurrent operations of any given type at any step of the schedule
is a lower bound on the number of required hardware resources of that type. Therefore,
the choice of a schedule affects both the area and the performance of the design. Three commonly used
scheduling algorithms are ASAP (As Soon As Possible), ALAP (As Late As possible),
and list scheduling.
As we have seen, DFGs expose parallelism in the design. Consequently, each node has a
range of control steps to which it can be assigned. Most of the algorithms require the
earliest and the latest bounds within which operations in the DFG can be scheduled. The
first and simplest schemes used to determine these bounds are the As Soon
As Possible (ASAP) and the As Late As Possible (ALAP) algorithms.
Fig. 2: A DFG (Paulin’s DFG) to be scheduled
Fig. 3: ASAP scheduling of Paulin’s DFG
The Basic Scheduling Algorithms
ASAP Algorithm
The ASAP algorithm starts with the highest nodes (those that have no parents) in the DFG and
assigns time steps in increasing order as it proceeds downwards. It follows the simple
rule that a successor node can execute only after its parents have executed. This algorithm
clearly gives the fastest schedule possible; in other words, it schedules in the least number of
control steps, but it never takes resource constraints into account. This technique has
proved useful for near-optimal microcode compaction.
ASAP scheduling algorithm:
ASAP (G(V, E)) {
  schedule v0 by setting t0S = 1;
  repeat {
    select a vertex vi whose predecessors are all scheduled;
    schedule vi by setting tiS = max{ tjS + dj : (vj, vi) ∈ E };
  } until (vn is scheduled);
  return (tS);
}
In ASAP scheduling, the start time of each operation is assigned its as-soon-as-possible
value. This scheduling solves an unconstrained minimum-latency scheduling
problem in polynomial time. An example DFG and the corresponding ASAP schedule
are shown in Figs. 2 and 3.
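The ASAP pseudocode above can be sketched in Python. The five-node DFG below is a hypothetical stand-in (not Paulin's DFG), with every operation assumed to take one control step:

```python
def asap(preds, delay):
    """ASAP schedule. preds maps node -> list of predecessor nodes,
    delay maps node -> execution time in control steps."""
    t = {}
    remaining = set(preds)
    while remaining:
        # Select a vertex whose predecessors are all scheduled.
        v = next(n for n in remaining if all(p in t for p in preds[n]))
        # Source nodes (no predecessors) start at control step 1.
        t[v] = max((t[p] + delay[p] for p in preds[v]), default=1)
        remaining.remove(v)
    return t

# Hypothetical 5-node DFG: a and b feed c; c and d feed e.
preds = {"a": [], "b": [], "c": ["a", "b"], "d": [], "e": ["c", "d"]}
delay = {n: 1 for n in preds}
print(asap(preds, delay))  # a, b, d start at step 1; c at 2; e at 3
```

Each pass through the loop schedules one vertex whose predecessors are all placed, exactly as in the pseudocode.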
ALAP Algorithm
This approach is a refinement of the ASAP scheduling concept with conditional
postponement of operations. This postponement occurs whenever the operation
concurrency is higher than the number of available functional units. The ALAP algorithm
works exactly the same way as the ASAP algorithm, except that it starts at the bottom
of the DFG and proceeds upwards. This algorithm gives the slowest possible schedule,
taking the maximum number of control steps. However, this doesn't necessarily reduce
the number of functional units used.
ALAP scheduling algorithm:
ALAP (G(V, E), λ′) {
  schedule vn by setting tnL = λ′ + 1;
  repeat {
    select a vertex vi whose successors are all scheduled;
    schedule vi by setting tiL = min{ tjL : (vi, vj) ∈ E } − di;
  } until (v0 is scheduled);
  return (tL);
}
where λ′ = tnS − t0S
Fig. 4: ALAP scheduling of Paulin’s DFG
Fig. 5: List scheduling of Paulin’s DFG
In ALAP scheduling, the start time of each operation is assigned its as-late-as-possible
value, and the scheduling is usually constrained in its latency. When it is applied to an
unconstrained scheduling problem, the latency bound λ′ (the upper bound on latency) is the
length of the schedule computed by the ASAP algorithm. When the ALAP algorithm is
applied to the DFG in Fig. 2, the resultant schedule is as shown in Fig. 4.
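The ALAP pseudocode can likewise be sketched in Python. As with the ASAP sketch, the five-node DFG is a hypothetical stand-in with unit delays, and the latency bound of 3 is the length of its ASAP schedule:

```python
def alap(succs, delay, lam):
    """ALAP schedule under latency bound lam.
    succs maps node -> list of successor nodes."""
    t = {}
    remaining = set(succs)
    while remaining:
        # Select a vertex whose successors are all scheduled.
        v = next(n for n in remaining if all(s in t for s in succs[n]))
        # Sinks start as late as the bound allows; other nodes must
        # finish before their earliest successor starts.
        t[v] = min((t[s] for s in succs[v]), default=lam + 1) - delay[v]
        remaining.remove(v)
    return t

# Hypothetical 5-node DFG: a and b feed c; c and d feed e (unit delays).
succs = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
delay = {n: 1 for n in succs}
print(alap(succs, delay, lam=3))  # d has slack: ALAP start 2 vs ASAP start 1
```

Comparing the ASAP and ALAP start times of each node gives its mobility range, which the ILP formulation later in this report relies on.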
List Scheduling Algorithm
List scheduling, one of the most popular heuristic methods, is used to solve scheduling
problems with resource constraints or latency constraints. A list scheduler maintains a
priority list of the operations. A commonly used priority list is obtained by labeling each
vertex with the weight of its longest path to the sink and ranking the vertices in
decreasing order. The most urgent operations are scheduled first. The algorithm constructs
a schedule that satisfies the constraints; however, the computed schedule may not have the
minimum latency. Fig. 5 shows the result of list scheduling for the DFG.
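The labeling-and-ranking scheme described above can be sketched as follows. The DFG and the two-unit resource limit are hypothetical, all operations are assumed to take one control step, and a single functional-unit type is assumed:

```python
def list_schedule(preds, succs, n_units):
    """Resource-constrained list scheduling with unit-delay operations
    and one functional-unit type (n_units instances available)."""
    # Priority label: length of the longest path from a node to a sink.
    label = {}
    def longest(v):
        if v not in label:
            label[v] = 1 + max((longest(s) for s in succs[v]), default=0)
        return label[v]
    for v in preds:
        longest(v)

    start = {}
    step = 1
    while len(start) < len(preds):
        # Ready = unscheduled ops whose predecessors have all finished.
        ready = [v for v in preds if v not in start
                 and all(p in start and start[p] < step for p in preds[v])]
        ready.sort(key=lambda v: -label[v])   # most urgent first
        for v in ready[:n_units]:             # fill the available units
            start[v] = step
        step += 1
    return start

# Hypothetical 5-node DFG: a and b feed c; c and d feed e.
preds = {"a": [], "b": [], "c": ["a", "b"], "d": [], "e": ["c", "d"]}
succs = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
print(list_schedule(preds, succs, n_units=2))
```

With two units, the three ready operations at step 1 cannot all be issued; the priority labels decide that the two critical-path operations go first.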
Time Constrained Scheduling
Time constrained scheduling is also called the fixed-control-step approach. Time
constrained scheduling is important for designs targeted towards applications in real-time
systems like digital signal processing systems where the main objective is to minimize
the cost of the hardware. Time constrained scheduling algorithms usually use three
different techniques:
1. Mathematical Programming : One of the most popular techniques is the integer
linear programming method.
2. Constructive heuristics : Force directed scheduling method is an example of a
constructive heuristic.
3. Iterative Refinement : Iterative rescheduling is a common example of this type.
Integer Linear Programming (ILP)
The integer linear programming (ILP) formulation tries to find an optimal schedule using
a branch-and-bound search algorithm. It involves some amount of backtracking, i.e.,
decisions made earlier may be changed later on. A simplified formulation of the ILP method
is given below.
First, it calculates the mobility range for each operation, based on the ASAP and ALAP
values. The mobility range determines the bounds within which the operations can be
scheduled. The general scheduling problem in ILP is defined by the following equations:
Minimize Σk=1..m (Ck · Nk)
subject to
Σj=Ei..Li xi,j = 1, for i = 1, …, n (n = number of operations)
where m operation types are available, Nk is the number of FUs of operation type k, Ck is
the cost of one FU of type k, and [Ei, Li] is the mobility range of operation i. Each xi,j is
1 if operation i is assigned to control step j and 0 otherwise. Two more equations that
enforce the resource and data dependency constraints are:
Σi of type k xi,j ≤ Nk, for every control step j and every operation type k
q − p ≥ 1, for every data dependency oi → oj
where p and q are the control steps assigned to the operations oi and oj respectively
(p = Σj j · xi,j).
The size of the ILP formulation grows rapidly with the number of control steps: each unit
increase in the number of control steps adds n more x variables. Therefore the time of
execution of the algorithm also increases rapidly. In practice the ILP approach is
applicable only to very small problems.
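Since exact methods only pay off on small instances, the ingredients of the formulation (mobility ranges, the q − p ≥ 1 dependency constraint, and minimization of the FU count) can be illustrated by exhaustive search instead of a real ILP solver. The four-operation example and its mobility ranges below are hypothetical, with a single operation type:

```python
from itertools import product

# Hypothetical mobility ranges [ASAP, ALAP] for four unit-delay operations
# of a single type; (lo, hi) means the op may start at any step in range.
mobility = {"o1": (1, 2), "o2": (1, 2), "o3": (2, 3), "o4": (2, 3)}
deps = [("o1", "o3"), ("o2", "o4")]   # o1 precedes o3, o2 precedes o4

best = None
ops = list(mobility)
for steps in product(*(range(lo, hi + 1) for lo, hi in mobility.values())):
    t = dict(zip(ops, steps))
    # Dependency constraint: q - p >= 1 for every edge (p precedes q).
    if any(t[b] - t[a] < 1 for a, b in deps):
        continue
    # Resource cost N = max number of ops sharing one control step.
    n_fu = max(sum(1 for v in t.values() if v == j) for j in set(t.values()))
    if best is None or n_fu < best[0]:
        best = (n_fu, t)

print(best)  # minimum FU count and one schedule achieving it
```

Here four operations must fit into three control steps, so by pigeonhole at least two must share a step and the minimum FU count is 2; a real ILP solver reaches the same answer without enumerating every assignment.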
If the backtracking involved in the ILP method could be eliminated, a considerable
amount of computation time would be saved. Heuristic methods do the job by scheduling
one operation at a time based on some criterion. The following section describes one such
method.
The Discrete Cosine Transform
The Discrete Cosine Transform is primarily applied to real data values, and has found
wide applications in data compression, filtering, etc. A number of fast algorithms have
been published. A two-dimensional DCT can be obtained by first applying a one-
dimensional DCT over the rows of an input data matrix and then over the columns of the
matrix. The N-point DCT is defined as follows:
A given data sequence {x(n), n = 0, 1, 2, …, N−1} is transformed into another sequence
{y(k), k = 0, 1, 2, …, N−1} by the equation:

y(k) = C · a(k) · Σn=0..N−1 x(n) · cos( 2(2n+1)kπ / 4N ),  for k = 0, 1, …, N−1

where a(0) = cos(π/4), a(k) = 1 for all k = 1, 2, …, N−1, and C is a normalization
constant.
Hence the structure of an 8-point DCT is as shown in Figure 6 below.
Figure 6: The data flow graph for an 8-point Discrete Cosine Transform
The above data flow graph clearly shows the operations needed for evaluating the DCT
based on the AT&T Bell Labs algorithm. The 8-point DCT consists of 11 multiplication
operations and 29 addition operations for the required transformation. The multiplications
shown in the above DFG signify a multiplication by a factor of 2^0.5 = √2. To obtain a fast
response in signal processing systems, the DCT computation itself must be sped up.
Using a large number of processing elements to compute the transformation increases the
hardware cost of the VLSI chip and hence is not encouraged.
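The DCT definition above can be evaluated directly. The scale factor C is left unspecified in the definition, so the sketch below assumes the common orthonormal choice C = sqrt(2/N):

```python
from math import cos, pi, sqrt

def dct(x):
    """N-point DCT per the definition above, with the assumed
    normalization C = sqrt(2/N)."""
    n_pts = len(x)
    c = sqrt(2.0 / n_pts)
    out = []
    for k in range(n_pts):
        a_k = cos(pi / 4) if k == 0 else 1.0
        # 2(2n+1)k*pi / 4N simplifies to (2n+1)k*pi / 2N.
        s = sum(x[n] * cos((2 * n + 1) * k * pi / (2 * n_pts))
                for n in range(n_pts))
        out.append(c * a_k * s)
    return out

y = dct([1.0] * 8)               # constant input
print([round(v, 6) for v in y])  # all energy lands in y[0]; y[1..7] are ~0
```

For a constant input the cosine sums for k ≥ 1 cancel in pairs, so only the DC coefficient y(0) is nonzero; this is a quick sanity check on the formula.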
Scheduling model for high level synthesis
The scheduling model of array architecture is defined as follows:
1. Array topology
The total number of Processing Elements (PEs) and the topology (consisting of
the interconnectivity information) of array structure are given as a specification. Figure 7
shows the array topology used for scheduling the DCT algorithm. The input sequence is
obtained at PE2 and the computed output sequence is available at PE5.
Figure 7: The array topology showing the interconnectivity of the six processors
2. Processing element
A Processing element (PE) can execute operations and data communications with
adjacent PEs simultaneously. Adjacent PEs are those which are connected through a
common communication link. In addition, a PE can relay data from an adjacent PE to
another adjacent PE as long as there is no conflict on the communication links. The PE is
capable of performing common operations like addition and multiplication. For this
particular application, it is further assumed that a processing element uses two clock
cycles for a multiplication operation and a single clock cycle for an addition operation.
3. Data communication
Data communication links are limited between physically adjacent PEs. Data
communication between physically distant PEs is achieved by intermediate PEs relaying
the data. Therefore, data communication time is proportional to the distance between the
sender PE and the receiver PE. This distance information is a feature of the topology. For
array topology with N processing elements, the maximum distance between any two
nodes is N/2, which is also the diameter of the network of PEs.
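Figure 7's exact interconnect cannot be recovered from the text, so the sketch below assumes a hypothetical ring of six PEs. Hop distance is then a breadth-first search over the topology, and the diameter comes out as N/2 = 3:

```python
from collections import deque

def hop_distance(adj, src, dst):
    """Minimum number of links between two PEs (BFS over the topology)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist[dst]

# Hypothetical ring of 6 PEs: each PE connects to its two neighbours.
n = 6
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
diameter = max(hop_distance(ring, 0, d) for d in range(n))
print(diameter)  # 3, i.e. N/2 for a 6-PE ring
```

The distance returned here is exactly the quantity the communication-delay model is proportional to: data between non-adjacent PEs is relayed one hop per intermediate PE.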
4. Data Input/Output
The locations of PEs which input and/or output the data are given as specification.
Moreover, if the processing algorithm consumes and produces multiple data, then the
data format of input and output is also specified.
Based on the scheduling model defined above, scheduling is done to satisfy the following
scheduling constraints.
1. Satisfy precedence relations
If there is a data dependency between operations, the precedence relation between
these operations must be satisfied: if an operation depends on data produced by
another operation, the dependent operation cannot start until the producing operation
completes its execution and the produced data has been delivered, thus accounting for both
the processing time of the producing element and the communication delay in sending the data.
2. No resource conflict
Resource conflict is defined as the situation that the resource (PE or a
communication link) is used at the same time by more than one operation or
communication. Hence if resource conflict occurs in a schedule, the schedule cannot be
realized. Only one operation can be executed on a PE at a particular time instant, and
only one datum can be sent or received on a data communication link at the same time.
Objective: The objective of the scheduling is to find a schedule which achieves the
minimum iteration period for a given processing algorithm and a given array topology. If
there exist more than one such schedule, then choose one which achieves the minimum
latency.
Basic Scheduling Strategy
At first, an ILP model is constructed to decide whether a schedule of the processing
algorithm exists that satisfies all the scheduling constraints for a specified
iteration period and latency on a PE array of a given topology. The lower bounds of the
iteration period and of the latency are computed. Then the complete model is
generated and run to decide whether a schedule exists. If the complete model does not
terminate with a solution, i.e., no schedule satisfying scheduling constraints exists for this
iteration period and the latency, then the latency or the iteration period is increased to get
a feasible solution. By repeating the process, the complete model eventually terminates
with a solution, i.e., a schedule satisfying all the scheduling constraints and returning a
minimum iteration period and minimum latency for the scheduling. This approach always
terminates because a schedule where all the operations are executed sequentially on one
of the PEs is a valid schedule and it can be obtained if the iteration period and the latency
are sufficiently large.
The basic Algorithm is given as:
1. Identify the inputs of the array topology
2. Compute the lower bound of iteration period Ti
3. Compute the lower bound of latency Lt
4. Solve for the complete model taking into account the PEs and the CLs together
5. If the model is solved, go to 9
6. Increment Lt
7. If Lt exceeds its upper bound, set Ti = Ti + 1 and go to 3
8. Go to 4
9. Obtain the optimal solution
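The steps above can be sketched as a driver loop. Here `ilp_feasible` is a hypothetical stand-in for generating and solving the complete ILP model, and the upper bound on the latency is passed in explicitly:

```python
def find_schedule(ti, lt_lower, lt_upper, ilp_feasible):
    """Search for the smallest feasible (iteration period Ti, latency Lt).
    ilp_feasible(ti, lt) stands in for generating and solving the
    complete ILP model; a real run would invoke the ILP solver here."""
    lt = lt_lower
    while True:
        if ilp_feasible(ti, lt):   # steps 4-5: model solved
            return ti, lt
        lt += 1                    # step 6: increment the latency
        if lt > lt_upper:          # step 7: latency bound exceeded,
            ti += 1                # so increase the iteration period
            lt = lt_lower          # and restart from the latency bound

# Stub solver: pretend the model becomes feasible once Ti >= 9 and
# Lt >= 20, mirroring the DCT result reported later.
print(find_schedule(9, 12, 25, lambda ti, lt: ti >= 9 and lt >= 20))
```

The loop terminates for the reason the text gives: a fully sequential schedule on one PE is always valid once the iteration period and latency are large enough, so the stand-in check eventually succeeds.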
The flowchart for the algorithm is given below as:
Figure 8: Flowchart for an unrefined ILP formulation for scheduling
The model is expressed as a set of linear equations focusing on latency minimization.
The objective here is to minimize the period of a single iteration while keeping the latency
as small as possible. First, the lower bounds for the iteration period and the latency are
computed and used as initial estimates. The complete model is generated and run to
satisfy these bounds. If the model does not terminate with a solution, the latency and the
iteration-period estimate are increased until it does. The model eventually terminates because a
schedule can always be found where all the operations are executed sequentially on one
of the processing elements and this is a valid schedule if the iteration period and the
latency are large enough.
Refined Scheduling method
This basic scheduling method can be modified suitably to get a refined scheduling
method. To strictly constrain precedence relations and check resource conflict, the above
model requires many binary variables for a large processing algorithm and therefore its
solution time is very long and sometimes it cannot be solved at all. The refined scheduling
method instead uses two linear programming formulations, handling resource and
communication allocation separately, and can therefore be solved faster. This is because
the communication allocator checks only valid schedules, skipping all schedules already
ruled infeasible by the resource allocator. This formulation is given as:
1. Identify the inputs and the structure of the array topology
2. Compute the lower bound of iteration period Ti
3. Compute the lower bound of latency Lt
4. Solve the “datamodel”, taking into account only the PEs
5. If the “datamodel” is solved, go to 9
6. Increment Lt
7. If Lt exceeds its upper bound, set Ti = Ti + 1 and go to 3
8. Go to 4
9. Solve the “commodel”, taking into account the CLs
10. Obtain the optimal solution
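The two-phase flow above can be sketched as a driver. Both `datamodel` and `commodel` below are hypothetical stubs standing in for the two linear-programming solves:

```python
def refined_schedule(datamodel, commodel, m_max=10):
    """Two-phase refined flow: 'datamodel' returns operation start times
    ignoring link conflicts (or None if infeasible); 'commodel' checks
    link conflicts, allowing each start time to shift by at most m steps."""
    starts = datamodel()
    if starts is None:
        return None                  # caller must relax Ti or Lt first
    for m in range(m_max + 1):       # widen the shift window gradually
        sched = commodel(starts, m)
        if sched is not None:
            return sched             # link conflicts resolved
    return None

# Stubs: the datamodel fixes start times; the commodel needs a shift
# of 2 steps to clear a link conflict on one operation.
base = {"o1": 1, "o2": 3}
fixed = refined_schedule(
    lambda: base,
    lambda s, m: {**s, "o2": s["o2"] + 2} if m >= 2 else None)
print(fixed)  # {'o1': 1, 'o2': 5}
```

This mirrors the text: with m = 0 the commodel only verifies the datamodel's start times, and m grows until a conflict-free schedule appears (or, for very large m, the search degenerates into the full ILP model).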
The “datamodel” is the complete model described earlier except that it does not check for
the resource conflicts on the data communication channels.
The purpose of this model is to determine the start and end times of the operations
and schedule them optimally, in a manner such that the precedence relations are
satisfied. The “commodel” that follows the “datamodel” is then the entire complete
model defined earlier, but with strict limits on the start and end times of each
operation based on the “datamodel”. The two linear programming formulations can be
solved more quickly than a complete model with no bounds on the operation start
times. The “datamodel” determines the start time for each operation so that the
precedence relations are satisfied and no resource conflicts occur on the processing
elements. Based on these start times, the “commodel” finds a schedule where all the
precedence relations are satisfied and no resource conflicts occur on data communication
links or processing elements. The “commodel” also checks whether an operation can be
shifted by “m” time units to avoid any resource conflicts (on the data communication
channels) that might arise from considering only the “datamodel”.
The “commodel” with m = 0 checks the existence of a schedule by fixing the start time of
all the operations as determined by the “datamodel”. If the model terminates without a
solution, m is incremented by 1 and the model is run again until a solution is found. It may
be necessary to increase the latency to find a solution that satisfies all the
constraints. For a sufficiently large value of m, the new model is equivalent to the
earlier complete ILP model.
The flowchart is given below:
Figure 9: Flowchart for a refined ILP formulation for scheduling
The linear equations governing the solution of the “datamodel” and the “commodel” are
formulated with respect to the following guidelines:
1) Each operation is executed only once
2) Only one operation is executed on one element at a given time, thus avoiding resource
conflict
3) Precedence relations hold between operation and operation, operation and
communication, and communication and communication
4) Information flows from one processor to another through a communication link
5) Data flow out of a cutset of PEs is equal to the data flow into another cutset of PEs
These five constraints are modeled by the equations given below:
(1) Σj∈Rx,i Σk∈P Xi,j,k = 1, for every operation i ∈ N
(each operation is executed exactly once)
(2) Σi∈N Xi,j,k ≤ 1, for every time step j, 1 ≤ j ≤ Ti, and every PE k ∈ P
(at most one operation per PE per time step)
(3) start(J) ≥ start(I) + dI + dist(k, k′), for every dependent pair of operations I → J
executed on PEs k and k′
(precedence, including a communication delay proportional to the PE distance)
(4) flowfk,i ≥ Σj∈Rx,i Xi,j,k − Σj∈Rx,p Xp,j,k, for every dependent pair i → p and every k ∈ P
(data leaves PE k whenever i executes on k but its successor p does not)
(5) Σi∈N flowfk,i ≤ Ti, for every PE k ∈ P
(a link carries at most one datum per cycle, so the flow out of a PE within one iteration
cannot exceed the iteration period)
In the above equations, the following terminology is used:
Xi,j,k = 1 implies that operation i is scheduled at time j on processing element k
Ti is the iteration period of the algorithm
flowfk,i = 1 implies that a datum produced by operation i is output from processing element k
P is the set of all processors
ALAPi, ASAPi are the ALAP and ASAP scheduling times for operation i
Rx,i is the set of times ranging from ASAPi to ALAPi in which operation i can be scheduled
N is the set of all operation nodes in the corresponding DFG
An optimal solution is found when all resource conflicts on data communication links as
well as processing elements are resolved. For a pair of operations I and J such that I
precedes J, if the time difference from the execution of operation I to the execution of
operation J is large, then it is easy to resolve resource conflict by modifying the execution
time of these operations without violating the precedence relation between operations I
and J. Hence the objective function to be maximized can be stated as
MAX Σ(start time of operation J - end time of operation I)
where the summation is over all the operations N to be performed in the DFG, given
that the constraints 1–5 above are satisfied.
Results:
The constraint equations as described above were modeled as an integer linear
programming code and solved for an optimal schedule. The operation execution
time is assumed to be 2 units of time for a multiplication and 1 unit of time for an addition
operation, matching the processing-element model above. It is also assumed that the
operations are not pipelined. For the DCT schedule, there are 29 addition operations and
11 multiplication operations, so the total sequential execution time is 51 units of time
(29 + 2·11). Since there are 6 processing elements, the lower bound on the iteration period
is ⌈51/6⌉ = 9. This means that there cannot exist a schedule with an iteration period of
less than 9 time units. The
appendix gives the variables for the formulation of the ILP equations. The formulation
was implemented and simulated using “lpsolve”, a non-commercial linear programming code
written in ANSI C by Michel Berkelaar, which has reportedly solved problems with as many
as 30,000 variables and 50,000 constraints. The simulations were carried out on a 100 MHz
Sun SPARC workstation and the model took 2 hours and 48 minutes to solve. The entire
model was not run due to difficulties in programming the “commodel”, which checks
the final constraints. Hence a sufficiently large latency of 20 time units was assumed
to terminate the iterations.
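The lower-bound arithmetic used in this section (two-cycle multiplies per the processing-element model, 51 units of sequential work spread over 6 PEs) can be checked in a few lines:

```python
from math import ceil

n_add, n_mul = 29, 11        # operation counts in the 8-point DCT DFG
t_add, t_mul = 1, 2          # clock cycles per add and per multiply
n_pe = 6                     # processing elements in the array

total = n_add * t_add + n_mul * t_mul   # total sequential work
lower_bound = ceil(total / n_pe)        # no iteration period can beat this
print(total, lower_bound)               # 51 9
```

Perfectly dividing 51 units of work over 6 PEs would need 8.5 steps, so 9 is the smallest integer iteration period any schedule can achieve.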
The results of the formulation are tabulated below:
Time | Processing Elements 1–6 (operations executed)
T1 O4
T2 O5 O6
T3 O16 O7 O3
T4 O16 O1 O2
T5 O20 O12 O8 O22
T6 O23 O11 O14 O22
T7 O23 O28 O17 O24
T8 O19 O31 O17 O24 O18
T9 O31 O13 O21 O18
T10 O15 O30
T11 O34 O30 O10
T12 O32 O33 O36
T13 O32 O25 O38
T14 O40 O38
T15 O27 O26
T16 O35
T17 O39
T18 O39
T19 O29
T20 O37
This is a correct schedule, as it takes 20 time units (the maximum allowed) to determine
the first set of outputs:
– 4 of the 6 PEs are occupied for 9 time units
– 1 PE is occupied for 8 time units
– 1 PE is occupied for 7 time units
This shows that the solution is optimally reached with respect to processor allocation as
the total time for iteration is 51 time units.
References:
1. J. Lee, Y. Hsu, and Y. Lin, “A New Integer Linear Programming Formulation for
the Scheduling Problem in Data-Path Synthesis,” Proc. Int. Conf. on
Computer-Aided Design, pp. 20-23, 1989.
2. C. Loeffler, A. Ligtenberg, and G. S. Moshytz, “Practical, fast, 1-D DCT
algorithms with 11 multiplications,” Proc. IEEE ICASSP, pp. 988-991, 1989.
3. C. T. Hwang, J. H. Lee, and Y. C. Hsu, “A Formal Approach to the Scheduling
Problem in High Level Synthesis,” IEEE Trans. Computer-Aided Design, vol. 10,
pp. 464-475, April 1991.
4. R. Brinkmann and R. Drechsler, “RTL-datapath verification using integer linear
programming,” Proc. ASP-DAC 2002 / 15th Int. Conf. on VLSI Design,
pp. 741-746, 2002.
5. G. W. Chang, M. Aganagic, J. G. Waight, J. Medina, T. Burton, S. Reeves, and
M. Christoforidis, “Experiences with mixed integer linear programming based
approaches on short-term hydro scheduling,” IEEE Trans. Power Systems,
vol. 16, no. 4, pp. 743-749, Nov. 2001.
6. K. Chakrabarty, “Test scheduling for core-based systems using mixed-integer
linear programming,” IEEE Trans. Computer-Aided Design of Integrated Circuits
and Systems, vol. 19, no. 10, pp. 1163-1174, Oct. 2000.
7. K. Chakrabarty, “Design of system-on-a-chip test access architectures using
integer linear programming,” Proc. 18th IEEE VLSI Test Symposium,
pp. 127-134, 2000.
8. D. E. Kaufman, J. Nonis, and R. L. Smith, “A mixed integer linear programming
formulation of the dynamic traffic assignment problem,” Proc. IEEE Int. Conf.
on Systems, Man and Cybernetics, vol. 1, pp. 232-235, 1992.
9. N. Park and A. C. Parker, “Sehwa: A Software Package for Synthesis of
Pipelines from Behavioral Specifications,” IEEE Trans. Computer-Aided Design,
vol. 7, March 1988.
10. N. Liu and K. J. Cios, “Learning rules by integer linear programming,”
Proc. IEEE Int. Symposium on Industrial Electronics, pp. 246-250, 1992.
Appendix
Operation, Earliest Start Time (EST), Latest Start Time (LST):
Op EST LST | Op EST LST | Op EST LST | Op EST LST
1 1 14 | 11 2 16 | 21 5 18 | 31 4 18
2 1 14 | 12 2 16 | 22 2 16 | 32 3 18
3 1 14 | 13 5 18 | 23 3 16 | 33 6 19
4 1 14 | 14 2 15 | 24 2 16 | 34 6 19
5 1 14 | 15 5 18 | 25 6 20 | 35 6 19
6 1 14 | 16 2 16 | 26 6 20 | 36 6 19
7 1 14 | 17 3 16 | 27 6 20 | 37 7 20
8 1 14 | 18 2 16 | 28 6 17 | 38 7 20
9 2 19 | 19 2 16 | 29 6 20 | 39 7 20
10 2 19 | 20 2 15 | 30 3 18 | 40 7 20