Optimal Scheduling of Mathematical Transformations for Array Architectures
Amanjyot Singh Johar
Department of Electrical and Computer Engineering
University of Illinois at Chicago
Abstract
Mathematical transforms play an important role in digital signal processing systems. In
high-level synthesis for digital signal processing systems with an array-structured
architecture, one of the most important steps is scheduling the operations that govern
transformations such as the FFT, DFT, DCT and CORDIC. When allocating operations to
processors, it is mandatory to take the communication time between processors into
account. This project proposes a scheduling method which derives an optimal schedule,
achieving the minimum iteration period and latency for a given signal processing
algorithm on the specified processor array. The scheduling problem is modeled as an
integer linear program and solved by an ILP solver.
Introduction
With the development of VLSI technology, wire delay is becoming relatively larger than
gate delay. In high-level descriptions and designs for implementing a high-speed VLSI, it
is essential to estimate not only the gate delay but the wire delay as well. A parallel
processing system on an array architecture is one of the architectures suited to high-speed
VLSIs. It realizes parallel processing, which is the key to fully utilizing the enormous
number of gates on a VLSI chip. An array architecture consists of a large number of
processing elements (PEs) interconnected as shown in Figure 1. In the array architecture,
direct data communication is limited to PEs which are physically adjacent on the
VLSI chip. Data communication between PEs that are not physically adjacent is achieved by
intermediate PEs relaying the data. In this communication model, it is easy to estimate
the wire delay (data communication delay) in the high-level design of an array architecture:
the data communication time is proportional to the distance between the source and the
destination PEs.
Figure 1: Interconnection of Processing Elements in an Array architecture.
One of the most important procedures of high-level synthesis is scheduling. In general,
scheduling consists of time assignment and processor allocation for a particular
operation. The time assignment determines when each operation is executed. The
processor allocation determines which PE executes each operation. It is well known that
optimal scheduling must consider the time assignment and the processor allocation
simultaneously, and that it is an NP-hard problem. Most scheduling techniques split
the scheduling problem into time scheduling and processor allocation, handled separately,
to reduce the CPU time needed for scheduling. For an array architecture, however, the
scheduling has to consider time assignment and processor allocation simultaneously. This
is because the processor allocation affects the data communication time between
operations when two back-to-back operations are not scheduled on the same or adjacent
processors. Further, the time assignment depends on the data communication time. In
addition, the time assignment used to resolve resource conflicts affects the processor
allocation. In order to obtain optimized schedules, the problem can be modeled as an
integer linear programming (ILP) problem and solved by an ILP solver. An ILP model of
scheduling for an array architecture is formulated in this project.
High-Level Synthesis and Scheduling
High-level synthesis can be described as the process of translating a behavioral
description into a structural description that consists of a set of connected components
called the data-path, and a controller that sequences and controls the functioning of these
components. High-level synthesis starts at the system level and proceeds downwards to the
register transfer (RT) level, the logic level and finally the circuit level, each time adding
the additional information needed at the next level of synthesis. The five major tasks
involved in high-level synthesis are described below. The first three steps lead to the
data-path formation and the last step leads to the formation of the controller.
1. Compilation : Compilation involves translation of the design description into an
intermediate representation that is most suitable for high-level synthesis.
2. Partitioning : Partitioning deals with division of the intermediate representation
(i.e., the behavioral description or the design) into sub-representations in order to
reduce the problem size.
3. Scheduling : Scheduling partitions the intermediate representation into time steps,
thereby generating a finite state machine model.
4. Allocation : Allocation, though closely intertwined with scheduling, involves
partitioning of the intermediate representation with respect to space (hardware
resources), which is also known as spatial mapping.
5. Control generation : Finally, this step involves the derivation of the controller that
sequences the design and controls the functional and storage units in the datapath.
Scheduling is one of the most important and primary tasks in high-level synthesis.
Scheduling can be described as the process of dividing the intermediate representation
into states and control steps, in such a way that it can be directly synthesized into a Finite
State Machine with Datapath (FSMD) model. In other words, scheduling performs a temporal
mapping of the given representation. A behavioral description, and hence the intermediate
representation, consists of a sequence of operations to be performed by the synthesized
hardware. The task of scheduling partitions these operations into time steps such that
each operation is executed in one time step. Each time/control step corresponds to one
state of the controlling finite state machine in the FSMD model.
Scheduling
Scheduling determines the precise start time of each operation for a given data flow
graph. The start times must satisfy the original dependencies of the graph, which limit the
amount of parallelism of the operations. This means that the scheduling determines the
concurrency of the resultant implementation, which, in turn, affects the performance. The
maximum number of concurrent operations of any given type at any step of the schedule
is a lower bound on the number of required hardware resources of that type. Therefore,
the choice of a schedule affects both the area and the performance of the design. Three commonly used
scheduling algorithms are ASAP (As Soon As Possible), ALAP (As Late As possible),
and list scheduling.
As we have seen, DFGs expose parallelism in the design. Consequently, each node has a
range of control steps to which it can be assigned. Most of the algorithms require the
earliest and the latest bounds within which operations in the DFG can be scheduled. The
first and simplest schemes used to determine these bounds are the As Soon
As Possible (ASAP) and the As Late As Possible (ALAP) algorithms.
Fig. 2: A DFG (Paulin’s DFG) to be scheduled
Fig. 3: ASAP scheduling of Paulin’s DFG
The Basic Scheduling Algorithms
ASAP Algorithm
The ASAP algorithm starts with the highest nodes (those that have no parents) in the DFG and
assigns time steps in increasing order as it proceeds downwards. It follows the simple
rule that a successor node can execute only after its parents have executed. This algorithm
clearly gives the fastest schedule possible; in other words, it schedules in the least number of
control steps, but it never takes resource constraints into account. This technique has
proved useful for near-optimal microcode compaction.
ASAP scheduling algorithm:
ASAP (G(V, E)) {
  schedule v0 by setting t0S = 1;
  repeat {
    select a vertex vi whose predecessors are all scheduled;
    schedule vi by setting tiS = max{ tjS + dj : (vj, vi) ∈ E };
  } until (vn is scheduled);
  return (tS);
}
In ASAP scheduling, the start time of each operation is assigned its as-soon-as-possible
value. This scheduling solves an unconstrained minimum-latency scheduling
problem in polynomial time. An example DFG and the corresponding ASAP schedule
are shown in Figs. 2 and 3.
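The ASAP pseudocode above can be sketched in Python. The five-node DFG below is a hypothetical stand-in (not Paulin's DFG), with every operation assumed to take one control step:

```python
def asap(preds, delay):
    """ASAP schedule. preds maps node -> list of predecessor nodes,
    delay maps node -> execution time in control steps."""
    t = {}
    remaining = set(preds)
    while remaining:
        # Select a vertex whose predecessors are all scheduled.
        v = next(n for n in remaining if all(p in t for p in preds[n]))
        # Source nodes (no predecessors) start at control step 1.
        t[v] = max((t[p] + delay[p] for p in preds[v]), default=1)
        remaining.remove(v)
    return t

# Hypothetical 5-node DFG: a and b feed c; c and d feed e.
preds = {"a": [], "b": [], "c": ["a", "b"], "d": [], "e": ["c", "d"]}
delay = {n: 1 for n in preds}
print(asap(preds, delay))  # a, b, d start at step 1; c at 2; e at 3
```

Each pass through the loop schedules one vertex whose predecessors are all placed, exactly as in the pseudocode.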
ALAP Algorithm
This approach is a refinement of the ASAP scheduling concept with conditional
postponement of operations. This postponement occurs whenever the operation
concurrency is higher than the number of available functional units. The ALAP algorithm
works exactly the same way as the ASAP algorithm, except that it starts at the bottom
of the DFG and proceeds upwards. This algorithm gives the slowest possible schedule,
taking the maximum number of control steps. However, this doesn't necessarily reduce
the number of functional units used.
ALAP scheduling algorithm:
ALAP (G(V, E), λ′) {
  schedule vn by setting tnL = λ′ + 1;
  repeat {
    select a vertex vi whose successors are all scheduled;
    schedule vi by setting tiL = min{ tjL : (vi, vj) ∈ E } − di;
  } until (v0 is scheduled);
  return (tL);
}
where λ′ = tnS − t0S
Fig. 4: ALAP scheduling of Paulin’s DFG
Fig. 5: List scheduling of Paulin’s DFG
In ALAP scheduling, the start time of each operation is assigned its as-late-as-possible
value, and the scheduling is usually constrained in its latency. When it is applied to an
unconstrained scheduling problem, the latency bound λ′ (the upper bound on latency) is the
length of the schedule computed by the ASAP algorithm. When the ALAP algorithm is
applied to the DFG in Fig. 2, the resultant schedule is as shown in Fig. 4.
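The ALAP pseudocode can likewise be sketched in Python. As with the ASAP sketch, the five-node DFG is a hypothetical stand-in with unit delays, and the latency bound of 3 is the length of its ASAP schedule:

```python
def alap(succs, delay, lam):
    """ALAP schedule under latency bound lam.
    succs maps node -> list of successor nodes."""
    t = {}
    remaining = set(succs)
    while remaining:
        # Select a vertex whose successors are all scheduled.
        v = next(n for n in remaining if all(s in t for s in succs[n]))
        # Sinks start as late as the bound allows; other nodes must
        # finish before their earliest successor starts.
        t[v] = min((t[s] for s in succs[v]), default=lam + 1) - delay[v]
        remaining.remove(v)
    return t

# Hypothetical 5-node DFG: a and b feed c; c and d feed e (unit delays).
succs = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
delay = {n: 1 for n in succs}
print(alap(succs, delay, lam=3))  # d has slack: ALAP start 2 vs ASAP start 1
```

Comparing the ASAP and ALAP start times of each node gives its mobility range, which the ILP formulation later in this report relies on.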
List Scheduling Algorithm
List scheduling, one of the most popular heuristic methods, is used to solve scheduling
problems with resource constraints or latency constraints. A list scheduler maintains a
priority list of the operations. A commonly used priority list is obtained by labeling each
vertex with the weight of its longest path to the sink and ranking the vertices in
decreasing order. The most urgent operations are scheduled first. The algorithm constructs
a schedule that satisfies the constraints; however, the computed schedule may not have the
minimum latency. Fig. 5 shows the result of list scheduling for the DFG.
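The labeling-and-ranking scheme described above can be sketched as follows. The DFG and the two-unit resource limit are hypothetical, all operations are assumed to take one control step, and a single functional-unit type is assumed:

```python
def list_schedule(preds, succs, n_units):
    """Resource-constrained list scheduling with unit-delay operations
    and one functional-unit type (n_units instances available)."""
    # Priority label: length of the longest path from a node to a sink.
    label = {}
    def longest(v):
        if v not in label:
            label[v] = 1 + max((longest(s) for s in succs[v]), default=0)
        return label[v]
    for v in preds:
        longest(v)

    start = {}
    step = 1
    while len(start) < len(preds):
        # Ready = unscheduled ops whose predecessors have all finished.
        ready = [v for v in preds if v not in start
                 and all(p in start and start[p] < step for p in preds[v])]
        ready.sort(key=lambda v: -label[v])   # most urgent first
        for v in ready[:n_units]:             # fill the available units
            start[v] = step
        step += 1
    return start

# Hypothetical 5-node DFG: a and b feed c; c and d feed e.
preds = {"a": [], "b": [], "c": ["a", "b"], "d": [], "e": ["c", "d"]}
succs = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
print(list_schedule(preds, succs, n_units=2))
```

With two units, the three ready operations at step 1 cannot all be issued; the priority labels decide that the two critical-path operations go first.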
Time Constrained Scheduling
Time constrained scheduling is also called the fixed-control-step approach. Time
constrained scheduling is important for designs targeted towards applications in real-time
systems like digital signal processing systems where the main objective is to minimize
the cost of the hardware. Time constrained scheduling algorithms usually use three
different techniques:
1. Mathematical Programming : One of the most popular techniques is the integer
linear programming method.
2. Constructive heuristics : Force directed scheduling method is an example of a
constructive heuristic.
3. Iterative Refinement : Iterative rescheduling is a common example of this type.
Integer Linear Programming (ILP)
The integer linear programming (ILP) formulation tries to find an optimal schedule using
a branch-and-bound search algorithm. It involves some amount of backtracking, i.e.,
decisions made earlier may be changed later on. A simplified formulation of the ILP method
is given below.
First, it calculates the mobility range for each operation, based on the ASAP and ALAP
values. The mobility range determines the bounds within which the operations can be
scheduled. The general scheduling problem in ILP is defined by the following equations:
Minimize Σk=1..m (Ck · Nk)
subject to
Σj=Ei..Li xi,j = 1, for i = 1, …, n (n = number of operations)
where m operation types are available, Nk is the number of FUs of operation type k, Ck is
the cost of one FU of type k, and [Ei, Li] is the mobility range of operation i. Each xi,j is
1 if operation i is assigned to control step j and 0 otherwise. Two more equations that
enforce the resource and data dependency constraints are:
Σi of type k xi,j ≤ Nk, for every control step j and every operation type k
q − p ≥ 1, for every data dependency oi → oj
where p and q are the control steps assigned to the operations oi and oj respectively
(p = Σj j · xi,j).
The size of the ILP formulation grows rapidly with the number of control steps: each unit
increase in the number of control steps adds n more x variables. Therefore the time of
execution of the algorithm also increases rapidly. In practice the ILP approach is
applicable only to very small problems.
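Since exact methods only pay off on small instances, the ingredients of the formulation (mobility ranges, the q − p ≥ 1 dependency constraint, and minimization of the FU count) can be illustrated by exhaustive search instead of a real ILP solver. The four-operation example and its mobility ranges below are hypothetical, with a single operation type:

```python
from itertools import product

# Hypothetical mobility ranges [ASAP, ALAP] for four unit-delay operations
# of a single type; (lo, hi) means the op may start at any step in range.
mobility = {"o1": (1, 2), "o2": (1, 2), "o3": (2, 3), "o4": (2, 3)}
deps = [("o1", "o3"), ("o2", "o4")]   # o1 precedes o3, o2 precedes o4

best = None
ops = list(mobility)
for steps in product(*(range(lo, hi + 1) for lo, hi in mobility.values())):
    t = dict(zip(ops, steps))
    # Dependency constraint: q - p >= 1 for every edge (p precedes q).
    if any(t[b] - t[a] < 1 for a, b in deps):
        continue
    # Resource cost N = max number of ops sharing one control step.
    n_fu = max(sum(1 for v in t.values() if v == j) for j in set(t.values()))
    if best is None or n_fu < best[0]:
        best = (n_fu, t)

print(best)  # minimum FU count and one schedule achieving it
```

Here four operations must fit into three control steps, so by pigeonhole at least two must share a step and the minimum FU count is 2; a real ILP solver reaches the same answer without enumerating every assignment.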
If the backtracking involved in the ILP method could be eliminated, a considerable
amount of computation time would be saved. Heuristic methods do the job by scheduling
one operation at a time based on some criterion. The following section describes one such
method.
The Discrete Cosine Transform
The Discrete Cosine Transform is primarily applied to real data values, and has found
wide applications in data compression, filtering, etc. A number of fast algorithms have
been published. A two-dimensional DCT can be obtained by first applying a one-
dimensional DCT over the rows of an input data matrix and then over the columns of the
matrix. The N-point DCT is defined as follows:
A given data sequence {x(n), n = 0, 1, 2, …, N−1} is transformed into another sequence
{y(k), k = 0, 1, 2, …, N−1} by the equation:

y(k) = C · a(k) · Σn=0..N−1 x(n) · cos( 2(2n+1)kπ / 4N ),  for k = 0, 1, …, N−1

where a(0) = cos(π/4), a(k) = 1 for all k = 1, 2, …, N−1, and C is a normalization
constant.
Hence the structure of an 8-point DCT is as shown in Figure 6 below.
Figure 6: The data flow graph for an 8-point Discrete Cosine Transform
The above data flow graph clearly shows the operations needed for evaluating the DCT
based on the AT&T Bell Labs algorithm. The 8-point DCT consists of 11 multiplication
operations and 29 addition operations for the required transformation. The multiplications
shown in the above DFG signify a multiplication by a factor of 2^0.5 = √2. To obtain a fast
response in signal processing systems, the DCT computation itself must be sped up.
Using a large number of processing elements to compute the transformation increases the
hardware cost of the VLSI chip and hence is not encouraged.
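The DCT definition above can be evaluated directly. The scale factor C is left unspecified in the definition, so the sketch below assumes the common orthonormal choice C = sqrt(2/N):

```python
from math import cos, pi, sqrt

def dct(x):
    """N-point DCT per the definition above, with the assumed
    normalization C = sqrt(2/N)."""
    n_pts = len(x)
    c = sqrt(2.0 / n_pts)
    out = []
    for k in range(n_pts):
        a_k = cos(pi / 4) if k == 0 else 1.0
        # 2(2n+1)k*pi / 4N simplifies to (2n+1)k*pi / 2N.
        s = sum(x[n] * cos((2 * n + 1) * k * pi / (2 * n_pts))
                for n in range(n_pts))
        out.append(c * a_k * s)
    return out

y = dct([1.0] * 8)               # constant input
print([round(v, 6) for v in y])  # all energy lands in y[0]; y[1..7] are ~0
```

For a constant input the cosine sums for k ≥ 1 cancel in pairs, so only the DC coefficient y(0) is nonzero; this is a quick sanity check on the formula.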
Scheduling model for high level synthesis
The scheduling model of array architecture is defined as follows:
1. Array topology
The total number of Processing Elements (PEs) and the topology (consisting of
the interconnectivity information) of array structure are given as a specification. Figure 7
shows the array topology used for scheduling the DCT algorithm. The input sequence is
obtained at PE2 and the computed output sequence is available at PE5.
Figure 7: The array topology showing the interconnectivity of the six processors
2. Processing element
A Processing element (PE) can execute operations and data communications with
adjacent PEs simultaneously. Adjacent PEs are those which are connected through a
common communication link. In addition, a PE can relay data from an adjacent PE to
another adjacent PE as long as there is no conflict on the communication links. The PE is
capable of performing common operations like addition and multiplication. For this
particular application, it is further assumed that a processing element uses two clock
cycles for a multiplication operation and a single clock cycle for an addition operation.
3. Data communication
Data communication links are limited between physically adjacent PEs. Data
communication between physically distant PEs is achieved by intermediate PEs relaying
the data. Therefore, data communication time is proportional to the distance between the
sender PE and the receiver PE. This distance information is a feature of the topology. For
array topology with N processing elements, the maximum distance between any two
nodes is N/2, which is also the diameter of the network of PEs.
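Figure 7's exact interconnect cannot be recovered from the text, so the sketch below assumes a hypothetical ring of six PEs. Hop distance is then a breadth-first search over the topology, and the diameter comes out as N/2 = 3:

```python
from collections import deque

def hop_distance(adj, src, dst):
    """Minimum number of links between two PEs (BFS over the topology)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist[dst]

# Hypothetical ring of 6 PEs: each PE connects to its two neighbours.
n = 6
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
diameter = max(hop_distance(ring, 0, d) for d in range(n))
print(diameter)  # 3, i.e. N/2 for a 6-PE ring
```

The distance returned here is exactly the quantity the communication-delay model is proportional to: data between non-adjacent PEs is relayed one hop per intermediate PE.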
4. Data Input/Output
The locations of PEs which input and/or output the data are given as specification.
Moreover, if the processing algorithm consumes and produces multiple data, then the
data format of input and output is also specified.
Based on the scheduling model defined above, scheduling is done to satisfy the following
scheduling constraints.
1. Satisfy precedence relations
If there is a data dependency between operations, the precedence relation between
these operations must be satisfied: if an operation depends on data produced by
another operation, the dependent operation cannot start until the producing operation
completes its execution and the produced data has been delivered, thus accounting for both
the processing time of the producing element and the communication delay in sending the data.
2. No resource conflict
Resource conflict is defined as the situation that the resource (PE or a
communication link) is used at the same time by more than one operation or
communication. Hence if resource conflict occurs in a schedule, the schedule cannot be
realized. Only one operation can be executed on a PE at a particular time instant, and
only one datum can be sent or received on a data communication link at the same time.
Objective: The objective of the scheduling is to find a schedule which achieves the
minimum iteration period for a given processing algorithm and a given array topology. If
there exist more than one such schedule, then choose one which achieves the minimum
latency.
Basic Scheduling Strategy
At first, an ILP model is constructed to decide whether a schedule of the processing
algorithm exists that satisfies all the scheduling constraints for a specified
iteration period and latency on a PE array of a given topology. The lower bounds of the
iteration period and of the latency are computed. Then the complete model is
generated and run to decide whether a schedule exists. If the complete model does not
terminate with a solution, i.e., no schedule satisfying scheduling constraints exists for this
iteration period and the latency, then the latency or the iteration period is increased to get
a feasible solution. By repeating the process, the complete model eventually terminates
with a solution, i.e., a schedule satisfying all the scheduling constraints and returning a
minimum iteration period and minimum latency for the scheduling. This approach always
terminates because a schedule where all the operations are executed sequentially on one
of the PEs is a valid schedule and it can be obtained if the iteration period and the latency
are sufficiently large.
The basic Algorithm is given as:
1. Identify the inputs of the array topology
2. Compute the lower bound of iteration period Ti
3. Compute the lower bound of latency Lt
4. Solve for the complete model taking into account the PEs and the CLs together
5. If the model is solved, go to 9
6. Increment Lt
7. If Lt exceeds its upper bound, set Ti = Ti + 1 and go to 3
8. Go to 4
9. Obtain the optimal solution
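The steps above can be sketched as a driver loop. Here `ilp_feasible` is a hypothetical stand-in for generating and solving the complete ILP model, and the upper bound on the latency is passed in explicitly:

```python
def find_schedule(ti, lt_lower, lt_upper, ilp_feasible):
    """Search for the smallest feasible (iteration period Ti, latency Lt).
    ilp_feasible(ti, lt) stands in for generating and solving the
    complete ILP model; a real run would invoke the ILP solver here."""
    lt = lt_lower
    while True:
        if ilp_feasible(ti, lt):   # steps 4-5: model solved
            return ti, lt
        lt += 1                    # step 6: increment the latency
        if lt > lt_upper:          # step 7: latency bound exceeded,
            ti += 1                # so increase the iteration period
            lt = lt_lower          # and restart from the latency bound

# Stub solver: pretend the model becomes feasible once Ti >= 9 and
# Lt >= 20, mirroring the DCT result reported later.
print(find_schedule(9, 12, 25, lambda ti, lt: ti >= 9 and lt >= 20))
```

The loop terminates for the reason the text gives: a fully sequential schedule on one PE is always valid once the iteration period and latency are large enough, so the stand-in check eventually succeeds.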
The flowchart for the algorithm is given below as:
Figure 8: Flowchart for an unrefined ILP formulation for scheduling
The model is expressed as a set of linear equations focusing on latency minimization.
The objective here is to minimize the period of a single iteration while keeping the latency
as small as possible. First, the lower bounds for the iteration period and the latency are
computed and used as initial estimates. The complete model is generated and run to
satisfy these bounds. If the model does not terminate with a solution, the latency and the
iteration-period estimate are increased until it does. The model eventually terminates because a
schedule can always be found where all the operations are executed sequentially on one
of the processing elements and this is a valid schedule if the iteration period and the
latency are large enough.
Refined Scheduling method
This basic scheduling method can be modified suitably to get a refined scheduling
method. To strictly constrain precedence relations and check resource conflict, the above
model requires many binary variables for a large processing algorithm and therefore its
solution time is very long and sometimes it cannot be solved at all. The refined scheduling
method instead uses two linear programming formulations, handling resource and
communication allocation separately, and can therefore be solved faster. This is because
the communication allocator checks only valid schedules, skipping all schedules already
ruled infeasible by the resource allocator. This formulation is given as:
1. Identify the inputs and the structure of the array topology
2. Compute the lower bound of iteration period Ti
3. Compute the lower bound of latency Lt
4. Solve the “datamodel”, taking into account only the PEs
5. If the “datamodel” is solved, go to 9
6. Increment Lt
7. If Lt exceeds its upper bound, set Ti = Ti + 1 and go to 3
8. Go to 4
9. Solve the “commodel”, taking into account the CLs
10. Obtain the optimal solution
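The two-phase flow above can be sketched as a driver. Both `datamodel` and `commodel` below are hypothetical stubs standing in for the two linear-programming solves:

```python
def refined_schedule(datamodel, commodel, m_max=10):
    """Two-phase refined flow: 'datamodel' returns operation start times
    ignoring link conflicts (or None if infeasible); 'commodel' checks
    link conflicts, allowing each start time to shift by at most m steps."""
    starts = datamodel()
    if starts is None:
        return None                  # caller must relax Ti or Lt first
    for m in range(m_max + 1):       # widen the shift window gradually
        sched = commodel(starts, m)
        if sched is not None:
            return sched             # link conflicts resolved
    return None

# Stubs: the datamodel fixes start times; the commodel needs a shift
# of 2 steps to clear a link conflict on one operation.
base = {"o1": 1, "o2": 3}
fixed = refined_schedule(
    lambda: base,
    lambda s, m: {**s, "o2": s["o2"] + 2} if m >= 2 else None)
print(fixed)  # {'o1': 1, 'o2': 5}
```

This mirrors the text: with m = 0 the commodel only verifies the datamodel's start times, and m grows until a conflict-free schedule appears (or, for very large m, the search degenerates into the full ILP model).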
The “datamodel” is the complete model described earlier except that it does not check for
the resource conflicts on the data communication channels.
The purpose of this model is to determine the start and end times of the operations
and schedule them optimally, in a manner such that the precedence relations are
satisfied. The “commodel” that follows the “datamodel” is then the entire complete
model defined earlier, but with strict limits on the start and end times of each
operation based on the “datamodel”. The two linear programming formulations can be
solved more quickly than a complete model with no bounds on the operation start
times. The “datamodel” determines the start time for each operation so that the
precedence relations are satisfied and no resource conflicts occur on the processing
elements. Based on these start times, the “commodel” finds a schedule where all the
precedence relations are satisfied and no resource conflicts occur on data communication
links or processing elements. The “commodel” also checks whether an operation can be
shifted by “m” time units to avoid any resource conflicts (on the data communication
channels) that might arise from considering only the “datamodel”.
The “commodel” with m = 0 checks the existence of a schedule by fixing the start time of
all the operations as determined by the “datamodel”. If the model terminates without a
solution, m is incremented by 1 and the model is run again until a solution is found. It may
be necessary to increase the latency to find a solution that satisfies all the
constraints. For a sufficiently large value of m, the new model is equivalent to the
earlier complete ILP model.
The flowchart is given below:
Figure 9: Flowchart for a refined ILP formulation for scheduling
The linear equations governing the solution of the “datamodel” and the “commodel” are
formulated with respect to the following guidelines:
1) Each operation is executed only once
2) Only one operation is executed on one element at a given time, thus avoiding resource
conflict
3) Precedence relations hold between operation and operation, operation and
communication, and communication and communication
4) Information flows from one processor to another through a communication link
5) Data flow out of a cutset of PEs is equal to the data flow into another cutset of PEs
These five constraints are modeled by the equations given below:
(1) Σj∈Rx,i Σk∈P Xi,j,k = 1, for every operation i ∈ N
(each operation is executed exactly once)
(2) Σi∈N Xi,j,k ≤ 1, for every time step j, 1 ≤ j ≤ Ti, and every PE k ∈ P
(at most one operation per PE per time step)
(3) start(J) ≥ start(I) + dI + dist(k, k′), for every dependent pair of operations I → J
executed on PEs k and k′
(precedence, including a communication delay proportional to the PE distance)
(4) flowfk,i ≥ Σj∈Rx,i Xi,j,k − Σj∈Rx,p Xp,j,k, for every dependent pair i → p and every k ∈ P
(data leaves PE k whenever i executes on k but its successor p does not)
(5) Σi∈N flowfk,i ≤ Ti, for every PE k ∈ P
(a link carries at most one datum per cycle, so the flow out of a PE within one iteration
cannot exceed the iteration period)
In the above equations, the following terminology is used:
Xi,j,k = 1 implies that operation i is scheduled at time j on processing element k
Ti is the iteration period of the algorithm
flowfk,i = 1 implies that a datum produced by operation i is output from processing element k
P is the set of all processors
ALAPi, ASAPi are the ALAP and ASAP scheduling times for operation i
Rx,i is the set of times ranging from ASAPi to ALAPi in which operation i can be scheduled
N is the set of all operation nodes in the corresponding DFG
An optimal solution is found when all resource conflicts on data communication links as
well as processing elements are resolved. For a pair of operations I and J such that I
precedes J, if the time difference from the execution of operation I to the execution of
operation J is large, then it is easy to resolve resource conflict by modifying the execution
time of these operations without violating the precedence relation between operations I
and J. Hence the objective function to be maximized can be stated as
MAX Σ(start time of operation J - end time of operation I)
where the summation is over all the operations N to be performed in the DFG, given
that the constraints 1–5 above are satisfied.
Results:
The constraint equations as described above were modeled as an integer linear
programming code and solved for an optimal schedule. The operation execution
time is assumed to be 2 units of time for a multiplication and 1 unit of time for an addition
operation, matching the processing-element model above. It is also assumed that the
operations are not pipelined. For the DCT schedule, there are 29 addition operations and
11 multiplication operations, so the total sequential execution time is 51 units of time
(29 + 2·11). Since there are 6 processing elements, the lower bound on the iteration period
is ⌈51/6⌉ = 9. This means that there cannot exist a schedule with an iteration period of
less than 9 time units. The
appendix gives the variables for the formulation of the ILP equations. The formulation
was implemented and simulated using “lpsolve”, a non-commercial linear programming code
written in ANSI C by Michel Berkelaar, which has reportedly solved problems with as many
as 30,000 variables and 50,000 constraints. The simulations were carried out on a 100 MHz
Sun SPARC workstation and the model took 2 hours and 48 minutes to solve. The entire
model was not run due to difficulties in programming the “commodel”, which checks
the final constraints. Hence a sufficiently large latency of 20 time units was assumed
to terminate the iterations.
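The lower-bound arithmetic used in this section (two-cycle multiplies per the processing-element model, 51 units of sequential work spread over 6 PEs) can be checked in a few lines:

```python
from math import ceil

n_add, n_mul = 29, 11        # operation counts in the 8-point DCT DFG
t_add, t_mul = 1, 2          # clock cycles per add and per multiply
n_pe = 6                     # processing elements in the array

total = n_add * t_add + n_mul * t_mul   # total sequential work
lower_bound = ceil(total / n_pe)        # no iteration period can beat this
print(total, lower_bound)               # 51 9
```

Perfectly dividing 51 units of work over 6 PEs would need 8.5 steps, so 9 is the smallest integer iteration period any schedule can achieve.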
The results of the formulation are tabulated below:
Time | Processing Elements 1–6 (operations executed)
T1 O4
T2 O5 O6
T3 O16 O7 O3
T4 O16 O1 O2
T5 O20 O12 O8 O22
T6 O23 O11 O14 O22
T7 O23 O28 O17 O24
T8 O19 O31 O17 O24 O18
T9 O31 O13 O21 O18
T10 O15 O30
T11 O34 O30 O10
T12 O32 O33 O36
T13 O32 O25 O38
T14 O40 O38
T15 O27 O26
T16 O35
T17 O39
T18 O39
T19 O29
T20 O37
This is a correct schedule, as it takes 20 time units (the maximum allowed) to determine
the first set of outputs:
– 4 of the 6 PEs are occupied for 9 time units
– 1 PE is occupied for 8 time units
– 1 PE is occupied for 7 time units
This shows that the solution is optimally reached with respect to processor allocation as
the total time for iteration is 51 time units.
References:
1. J. Lee, Y. Hsu, and Y. Lin, “A New Integer Linear Programming Formulation for
the Scheduling Problem in Data-Path Synthesis,” Proc. Int. Conf. on
Computer-Aided Design, pp. 20-23, 1989.
2. C. Loeffler, A. Ligtenberg, and G. S. Moshytz, “Practical, fast, 1-D DCT
algorithms with 11 multiplications,” Proc. IEEE ICASSP, pp. 988-991, 1989.
3. C. T. Hwang, J. H. Lee, and Y. C. Hsu, “A Formal Approach to the Scheduling
Problem in High Level Synthesis,” IEEE Trans. Computer-Aided Design, vol. 10,
pp. 464-475, April 1991.
4. R. Brinkmann and R. Drechsler, “RTL-datapath verification using integer linear
programming,” Proc. ASP-DAC 2002 / 15th Int. Conf. on VLSI Design,
pp. 741-746, 2002.
5. G. W. Chang, M. Aganagic, J. G. Waight, J. Medina, T. Burton, S. Reeves, and
M. Christoforidis, “Experiences with mixed integer linear programming based
approaches on short-term hydro scheduling,” IEEE Trans. Power Systems,
vol. 16, no. 4, pp. 743-749, Nov. 2001.
6. K. Chakrabarty, “Test scheduling for core-based systems using mixed-integer
linear programming,” IEEE Trans. Computer-Aided Design of Integrated Circuits
and Systems, vol. 19, no. 10, pp. 1163-1174, Oct. 2000.
7. K. Chakrabarty, “Design of system-on-a-chip test access architectures using
integer linear programming,” Proc. 18th IEEE VLSI Test Symposium,
pp. 127-134, 2000.
8. D. E. Kaufman, J. Nonis, and R. L. Smith, “A mixed integer linear programming
formulation of the dynamic traffic assignment problem,” Proc. IEEE Int. Conf.
on Systems, Man and Cybernetics, vol. 1, pp. 232-235, 1992.
9. N. Park and A. C. Parker, “Sehwa: A Software Package for Synthesis of
Pipelines from Behavioral Specifications,” IEEE Trans. Computer-Aided Design,
vol. 7, March 1988.
10. N. Liu and K. J. Cios, “Learning rules by integer linear programming,”
Proc. IEEE Int. Symposium on Industrial Electronics, pp. 246-250, 1992.
Appendix
Operation, Earliest Start Time (EST), Latest Start Time (LST):
Op EST LST | Op EST LST | Op EST LST | Op EST LST
1 1 14 | 11 2 16 | 21 5 18 | 31 4 18
2 1 14 | 12 2 16 | 22 2 16 | 32 3 18
3 1 14 | 13 5 18 | 23 3 16 | 33 6 19
4 1 14 | 14 2 15 | 24 2 16 | 34 6 19
5 1 14 | 15 5 18 | 25 6 20 | 35 6 19
6 1 14 | 16 2 16 | 26 6 20 | 36 6 19
7 1 14 | 17 3 16 | 27 6 20 | 37 7 20
8 1 14 | 18 2 16 | 28 6 17 | 38 7 20
9 2 19 | 19 2 16 | 29 6 20 | 39 7 20
10 2 19 | 20 2 15 | 30 3 18 | 40 7 20