[IEEE 2010 23rd International Conference on VLSI Design: concurrently with the 9th International...

Instruction Selection in ASIP Synthesis using Functional Matching

Nidhi AroraCSE Dept, IIT Delhi

New Delhi, [email protected]

Kiran ChandramohanSynfora India

Bangalore, [email protected]

Nagaraju PothineniGoogle India Pvt Ltd.

Bangalore, [email protected]

Anshul KumarCSE Dept., IIT Delhi

New Delhi, [email protected]

Abstract

In embedded systems, Application Specific InstructionSet Processors (ASIPs) are used commonly with the aim toget high performance without losing flexibility. A crucialoperation required during ASIP synthesis (in particular, se-lection of custom instructions) as well as code generationfor ASIPs is identifying portions of an application programthat can be executed by custom functional units (CFUs).Most existing solutions achieve this by matching structure ofpatterns corresponding to CFUs with sub-graphs of appli-cation data flow graphs. Often it happens that the computa-tions performed by the two are equivalent, but due to struc-tural dissimilarities the match is missed. What is neededis a method that can match two graphs functionally ratherthan structurally. In this paper, we present a novel methodto do this and give implementation results to show its effec-tiveness.

1 Introduction

Application Specific Instruction Set Processors (ASIPs)are commonly used in embedded system applications whichare compute intensive, for example, network routing, imageprocessing, wireless communication, security etc. Thereare several ways of designing application specific proces-sors [10]. A common approach is to take an applica-tion and a base processor as inputs and extend the latterwith a set of custom instructions to meet certain perfor-mance/power/chip area goals.

Custom instructions implemented using Custom Func-tion Units (CFUs) in an ASIP, are designed to execute somecommonly occurring computations in the given application.There can be a large number of candidates for custom in-structions, out of which a smaller sub-set is required to beselected, keeping in view the given design goals and con-straints. For selecting custom instructions, it is required toidentify portions of an application program that can be im-plemented by custom instructions. This can be viewed as a

pattern matching and covering problem over directed graphas follows. The application represented as a Data FlowGraph (DFG) is considered as the subject graph. The can-didate instructions are also represented by graph called pat-tern graph. A sub-graph of the subject graph that matchesa pattern graph can be replaced with a node representingthe custom instruction corresponding to that pattern, pos-sibly leading to saving in energy and/or computation cy-cles. To derive maximum benefit of custom instructions,it is required to cover as many non-overlapping sub-graphsof the subject graph with matching pattern graphs as possi-ble, within the given constraints. The sub-set of candidateinstructions that provide the best cover are selected for ex-tending the base processor. A key step in this process ismatching of two sub-graphs. Most of the previous workconsidered only structural matching (checking graph iso-morphism) during covering, e.g. [6, 8].

Structural matching approach misses many potentialmatches. For example, graphs in Figure 1(a) and Figure1(b) are not isomorphic and therefore do not match struc-turally. However, taking into account distributive propertyof the operations, a match can be considered. As anotherexample, the graphs in Figure 2(a) and Figure 2(b) are alsonot isomorphic. But by assigning A = P,B = Q,E = Rand C, D = 0 , the pattern in Figure 2(a) can perform thecomputation of the sub-graph in Figure 2(b).

In this paper we introduce an approach for match-ing which uses functional matching rather than structuralmatching. This is based on a definition of normal form ofexpression which exploits the algebraic properties of oper-ators and allows expression graphs which are functionallyequivalent to be matched.

The paper is organized as follows. Section 2 summarizessome related work. Section 3 introduces the definitions ofnormal form of expressions. Section 4 describes the processof transforming expressions into their normal form. Sec-tion 5 gives procedures for matching and covering based onthe normal form. Section 6 describes implementation andexperimental results. Section 7 records the overall conclu-sions.

2010 23rd International Conference on VLSI Design

1063-9667/10 $26.00 © 2010 IEEE

DOI 10.1109/VLSI.Design.2010.68

146

P

<<

R Q

<<

R

S

-

+

(a) Pattern (b) Subgraph

Figure 1. Mismatch due to distributivity

(a) Pattern (b) Subgraph

Figure 2. Mismatch due to missing inputs

2 Related Work

The problem of functional semantic matching of two ex-pressions has caught attention of researchers in many con-texts, for example, formal verification, high level synthesis,code generation etc. There have been varied approaches tothe problem. One of the approaches is to convert the expres-sions into a lower level, such as boolean level [4] or arith-metic bit level [13], and then perform matching. The ap-proach by Ghodrat et.al. [9] involves finding sub-domainsof operand values where the two expressions have same val-ues or different. This is done using interval analysis and par-tial evaluation. Solutions specific to expressions in the formof polynomials of single variable [12] or multiple variables[5] have also been developed.

In context of instruction set extension, a symbolic al-gebra based method has been presented by Peymandoustet. al. in [11]. In this work, a data-flow graph is repre-sented as a polynomial and the CFUs as a set of side rela-tions. Then the polynomial is simplified modulo these siderelations. For finding the best possible solution, the side-relation set has to be set equal to all subsets of the instruc-tion set with all possible permutations of input variables.Therefore choosing the side-relation set is computationallyexpensive. Also, symbolic algebra methods cannot handlebit-wise logical operators.

The approach based on Normal Form presented in thispaper is quite distinct from all these works. It is computa-tionally efficient and covers all the operations of interest ininstruction selection for ASIP synthesis.

3 Normal Form of Expressions

Normal forms for boolean expressions, such as disjunc-tive normal form and conjunctive normal form, are wellknown. However, such forms have not been defined forexpressions involving other operations such as arithmetic.The basic idea behind the normal form proposed here is tokeep the operations in certain order in the DAGs (DirectedAcyclic Graphs) involved in matching. DAGs which arenot trees are converted into trees by node duplication, be-cause existence of unique paths in trees from root to variousleaves makes it easy to define ordering of operations. Op-eration ordering is defined group-wise as shown in Table 1.Since trees and expressions have a direct correspondence,we use these two terms interchangeably. This ordering isachieved by applying some transformation rules (defined innext section) to the expressions. Transformation of graph toreorder operations has also been used in [14] in context ofdata-path synthesis.

Group Operations with orderingfrom root to leaf

Arithmetic subtract → add → multiply(integer ”-” ”+” ”*”Bit-wise or → and → notLogical ”|” ”&” ”¬”Shift left-shift right-shift (any order)

”�” ”�”

Table 1. Order of operations in Normal Form

The set of operations has been chosen based on severalconsiderations. Firstly, the operations that have been se-lected were found to have a high frequency of occurrence insome benchmarks that were studied [1, 2]. Secondly, op-erations like compare, modulus, and sign extension havebeen excluded because of their lower frequency of occur-rence and lack of rules to move operations across these inan attempt to obtain a desired order. Thirdly, memory ac-cess operations, though not infrequent, are excluded assum-ing that CFUs access only register file.

Grouping of operations given in Table 1, stratifies ex-pression trees into clusters (maximal sub-trees) of opera-tions from the same group, separated by operations that areexcluded from the table. These clusters are called Arith-metic (A), Bit-wise (B) and Shift (S) clusters. Combinedclusters or clusters with operations from multiple groupsare called SA (Shift-Arithmetic), SB (Shift-Bit-wise), BA(Bit-wise-Arithmetic) and SBA (Shift-Bit-wise-Arithmetic)

147

clusters. Normal forms for A, B and S clusters follow Ta-ble 1. In a normal A or B cluster, on every path from rootto leaf, each operator can appear atmost once. In a normalS cluster, on every path from root to leaf no two left shiftsor no two right shifts can appear together. Normal formsfor combined cluster are composed of normal A, B and Sclusters, satisfying the following conditions.

1. A left shift operation does not have a left sub-tree withan arithmetic operation as its root.

2. The right shift or left shift operation does not have aleft sub-tree with a bit-wise operation other than notoperation as its root.

4 Conversion to Normal Form

4.1 Normalization Procedure

The process of conversion to normal form can be viewedas a series of application of Rewrite Rules shown in Table2. The rules are grouped into categories depending upon thetype of cluster they operate on. Rules in a category are usedfor some common purpose and the same is mentioned in theright column. For example, CAT A2 rules are applicable to’A’ clusters and these rules shift + operations below * oper-ations (towards root). First the rule categories for individ-ual clusters S, A and B are selected and then for combinedclusters SA and SB. This sequence is repeated until no morerules are applicable.

After applying rules of category CAT A1, CAT A2 andCAT A3 to A clusters, there may be a possibility of can-celling terms in the two sub-trees of a ”-” node or collapsingof some multiplicative constants. Rules for doing this areshown in Table 3. The form obtained after applying theserules is called Reduced Normal Form.

After applying CAT B1, CAT B2 and CAT B3 rules toa B cluster, we get an expression in which not occurs onlybefore literals and not, and, or are the only operations oc-curring in the expression. CAT B4 rules convert these tosum of products. Each product term is then replaced with asum of corresponding minterms. Finally CAT B2 rules areapplied again to simplify and remove duplicates.

Figure 3(a) shows a Combined Cluster which is not inthe normalized form. This combined cluster in its normal-ized form is shown in Figure 3(b).

4.2 Correctness of the procedure

The normalization procedure given here specifies the or-der in which rules across the categories are to be considered,but does not specify the order of rules within the same cat-egory. Therefore the procedure is non-deterministic. How-ever, it has the confluence property, as defined below.

Cat A1 Rulesa ∗ (b − c) → (a ∗ b) − (a ∗ c) Shifting subtract(b − c) ∗ a → (b ∗ a) − (c ∗ a) towards the roota + (b − c) → (a + b) − c (below add and(b − c) + a → (b + a) − c multiply operator)a − (b − c) → (a − b) + c Reducing the(a − b) − c → a − (b + c) subtract operator

Cat A2 Rulesc ∗ (a + b) → (c ∗ a) + (c ∗ b) Shift add(a + b) ∗ c → (a ∗ c) + (b ∗ c) below multiply

Cat A3 RulesSimplifying rules involving identity elements.

Cat B1 RulesConversion of TRIMARAN operationsandcm, orcm, nand, nor, xor and xorcm

Cat B2 RulesSimplifying rules involving identity elementsand those based on idempotent and absorption

Cat B3 Rules¬(¬a) −→ a reducing not operator¬(a|b) −→ ¬a&¬b Shifting not¬(a&b) −→ ¬a|¬b operator towards leaf

Cat B4 Rules(a|b)&c → (a&c)|(b&c) Shifting ora&(b|c) → (a&b)|(a&c) below and

Cat S1 Rules(a � b) � c → a � (b + c) Reducing the(a � b) � c → a � (b + c) shift operators

Cat SA1 Rules(a + b) � c → (a � c) + (b � c) Shifting arithmetic(a − b) � c → (a � c) − (b � c) operators towards(a ∗ b) � c → (a � c) ∗ b root

Cat SB1 Rules(a&b) � c → (a � c)&(b � c) Shifting(a&b) � c → (a � c)&(b � c) bit-wise operators(a | b) � c → (a � c) | (b � c) below(a | b) � c → (a � c) | (b � c) shift operator

Table 2. Normalization Rules

Normal Form → Reduced Normal Form(a + b) − (c + b) → a − c

m ∗ a + n ∗ a → p ∗ a(p,m,n are constants and p=m+n)

Table 3. Reduced Normal Form Rules

• Confluence: If there are two distinct reductions P andQ starting from the same term, then there exists a termthat is reachable via a sequence of reductions fromboth P and Q.

Further, the number of steps required to reach the normalform is not predetermined, but the procedure has conver-gence property, as defined below.

• Convergence: The procedure terminates with a definitevalue in finite number of steps.

The proof of confluence and convergence proprerties isomitted here for brevity.

5 Matching and Covering

Instruction Selection using functional matching is doneby finding a large set of equivalent pattern graphs and

148

c

f

d ec d

c

e&&

&

|

>>

a

−

+

+

g

b

e’

d’a |b c

d e

>> &

>>

−

+

g

f

a) Subject Graph b) Normalized Graph

Figure 3. Normalization

sub-graphs of DFG of the given application generated us-ing Cong’s algorithm [8, 7]. We match normalized pat-tern graphs and normalized sub-graphs instead of matchingthem directly. By doing normalization, matches are foundeven when the graphs are not isomorphic (not structurallymatching) to each other but are functioanally equivalent.Matched sub-graphs are covered with corresponding pat-tern graphs p such that execution of DFG requires mini-mum compute cycles. We present two methods of matchingand their corresponding covering algorithms in the follow-ing paragraphs.

5.1 Method 1

In the first method, normalized sub-graphs of DFG Gare simply matched with normalized patterns.Matching A graph x is converted into its normal form(as discussed in section 3) by using the function Norm(x).To check whether two graphs x1 and x2 match, functionMatch(x1, x2) is used which returns true if x1 and x2

match structurally taking into account commutativity andassociativity of the operators as well as substitution ofconstants for operands. A set M of matched pairs of patterngraph p ∈ P and sub-graph s ∈ S is obtained by using thefunction MatchSet(P,S) defined below.

MatchSet(P,S) = {(s,p)| s ∈ S, p ∈ P,Match(Norm(s),Norm(p))}

Covering Covering involves replacing matched sub-graphswith nodes denoting corresponding patterns. We usefunction Replace(X,Y) to do these replacements. Here X isa DFG and Y is a set of pair of graphs y1 and y2 such thaty1 is sub-graph of X and y2 is a pattern that matched withit. The function replaces all y1’s with nodes representingcorresponding y2’s. Let M be the match set found inmatching step. Then MinCover(M,G) returns a sub-set Lof M containing mutually disjoint sub-graphs, such thaton replacing all sub-graphs with the nodes representingcorresponding pattern graphs, number of computationcycles is minimized. This is defined as follows.

CoveredGraph(M,G) = Replace(G,L∗)where L∗ = MinCover(M,G)

Subject GraphPattern Graph

Subject GraphNormalized

Pattern GraphNormalized

+

bca

uw

x vy

z

+ +

−

+

−

+

−

u

w x y z

+va b

+ c

+

Figure 4. Matching in method 2

MinCover(M,G) =argminL∈P(M)|Disjoint(L)(Cycles(Replace(G,L)))

Cycle(X) denotes the total compute cycles required tosimulate the graph X, Disjoint(L) denotes whether intersec-tion of nodes of all sub-graphs in the list L is null or not andP(M) denotes the power set of M.

5.2 Method 2

Structure of a sub-graph s gets changed when it is nor-malized to s’. In some cases it might be possible that apattern graph p which is not equivalent to s becomes equiv-alent to sub-graph of s’ as shown in Figure 4 where shadednodes are the matched nodes. Such matchings which arenot found by method 1 are taken care by this method.

Matching Here we use the function MatchSet2(P,S)which generates a set of triplets. Each triplet containsa sub-graph s ∈ S, its normalized form s’, and a set ofmatched pairs of sub-graphs s” of s’ and equivalent patternsp. This function is defined asMatchSet2(P,S) = {(s,s’,L) | s ∈ S, s’ = Norm(s),

{L = (s”,p) | s” ⊂ s’, p ∈ P, Match(s”,Norm(p))}}

Covering In this phase, sub-graphs of DFG G arereplaced with their normalized graphs and then sub-graphsof the normalized sub-graphs are replaced with nodesrepresenting matched pattern graphs. First we find a localminimum covers L∗ from each triplet (s,s’,L) generated inmatching phase.

CoveredSubgraph((s,s’,L),G) = Replace(s,L∗)where L∗ = MinCover(L,Subs(G,{(s,s’)}))

Here the function Subs is similar to Replace except that itreplaces a sub-graph with another sub-graph rather than anode. Then we construct a set M as following.

M = {(s,s∗) | (s,s’,L) ∈ MatchSet2(P,S),s∗ = CoveredSubgraph((s,s’,L),G) }

Now minimum cover obtained by method 2 can be definedas follows.

149

MinCover2(M,G)=argminM ′∈P(M)|Disjoint(M ′) (Cycles(Subs(G,M ′)))

6 Implementation and Results

6.1 Implementation

To test effectiveness of the proposed matching and cov-ering algorithms, these algorithms were implemented andintegrated in Trimaran 3.7 simulator tool chain [3] as shownin Figure 5. These algorithms form a part of the new moduleadded in the elcor part of trimaran. As trimaran has an up-per limit of 4 on number of operands of the functional units,elcor and simu parts of trimaran were also updated to acceptmore number of operands in functional units. There are twoinputs to the system - a) the machine description in MDESformat (the base architecture augmented with custom func-tional units implementing the new instructions) and b) anapplication program in C. For each basic block of the bench-mark, its DFG G transformed into a DFG G’ after matchingand covering. Latencies of the custom functional units wereobtained by using the Synopsys synthesis tool with 130nmUMC library.

ApplicationC Source

CompilerOpen Impact

Lcode IR

DFG G

& Covering

MatchingASIP

DFG G’

Rebel IR

SimulatorSimu

StatisticsExecuting

FUs + CFUs)

(Mdes for

Description

Machine

CompilerElcor

Figure 5. Insrtuction Selection Design Flow

6.2 Results

The first matching and covering method was exper-imented over various benchmarks, results of which areshown in Table 4. This table shows the percentage increasein matches using functional matching over structural match-ing and percentage reduction in the compute cycles of bothstructural(S) and functional(F) matching methods over base

Benchmark Increase in Reduction Reduction Total TotalMatching in Compute in Compute time in time in

Cycles(S) Cycles(F) sec (S) sec (F)imagefilter 70% 11.0% 13.13% 0.79 0.82crc32 100% 18% 18% 0.53 0.64g721encode 690% 11% 13.14% 9.56 16.16fir 500% 2.3% 2.5% 1.33 1.42fft 371% 0.39% 15.83% 6.57 8.93djpeg 172% 13.77% 26.77% 84.45 92.38dct int2 600% 38.81% 69% 367.03 369.75

Table 4. Simulation Results

processor. The last two columns in the table record the com-pilation time for matching and covering and show that over-head of functional matching is very small.

It was observed that as we moved from structural tofunctional matching, increase in the number of matches var-ied from moderate (70% in image filter) to very high (690%in g721encode), whereas increase in compute cycles reduc-tion varied from nil(18% → 18% in crc32) to significant(38.8% → 69% in dct int2). The main reason for this isthat many matched sub-graphs are overlapping and are notselected in covering. In some cases, after covering, criticalpath does not reduce much. Though all the above factorshinder the results, but if the size of DFGs of the frequentlyoccurring basic blocks is large and is computationally heavythen this method gives much better results as shown fordct int2 and djpeg.

Benchmarks which showed good results are fft, dct intand djpeg. Some of them namely, dct int and djpeg werefurther tested with the following variations

• both the methods for matching and covering describedin Section 5 were applied.

• issue width of the base VLIW processor was variedfrom 1 to 5.

• number of operands in the patterns was increased from4 to 6 in djpeg and 4 to 9 in dct int.

Results for djpeg are shown in Figure 6. Where B repre-sents base processor, S4 and S6 represent ASIP synthesisusing structural matching, F4 and F6 represent ASIP syn-thesis using method 1 and FM4 and FM6 represent ASIPsynthesis using method 2. The number 4 and 6 refer to theconstraints on the number of operands in custom instruc-tions. Similarly results of dct int are shown in Figure 7

From these tables, it could be seen that as the size of pat-tern graphs increases, reduction in compute cycles also in-creases (compared to base processor) using method 1. Thishappens because method 1 finds matches only if normalizedpattern is equivalent to normalized sub-graph. If after nor-malization, size of sub-graph becomes larger than the sizeof pattern graph, then they are not considered for matching.Thus as the size of pattern graph increases, match set in-creases and more reduction in compute cycles is observed.By using method 2, a large amount of improvement is seen

150

Figure 6. Djpeg Cycles V/S Issue width

Figure 7. Dct int Cycles v/s Issue width

even for small size patterns because in this method sub-graphs of the normalized sub-graphs are matched with thepattern graphs, thus fewer candidate matches are rejected.This method does not give significant improvement with theincrease in size of pattern graphs because in many cases alarge pattern graph replaces the small sub-graph, thus in-stead of reducing the computation cycle it increases thecomputation time of the part of the DFG. Thus the choicebetween the two methods could be made on the bases of ei-ther the number of inputs in the patterns permitted (method1 preferred over method 2 if this number is large and viceverse) or trade-off between synthesis time and compute cy-cle reduction (method 2 took nearly double the computationtime).

7 Conclusion

In this paper we have defined an approach for functionalmatching of expressions for the purpose of ASIP instruc-tions selection and code generation. This method is basedon normalization of the expressions before comparing theirstructures. Expressions of certain class can be normalized

by applying a finite set of transformation rules. The pro-posed process of functional matching is efficient as wellas effective in improving utilization of custom functionalunits. This method is currently developed for expressionsor single output computation patterns. In future these maybe extented to multiple output patterns.

References

[1] Mediabench benchmarks. http://euler.slu.edu/ fritts/ media-bench.

[2] Mibench benchmarks. http://www.eecs.umich.edu/mibench.[3] Trimaran simulator. http://trimaran.org.[4] N. Cheung, S. Parameswaran, J. Henkel, and J. Chan.

Mince: matching instructions using combinational equiva-lence for extensible processor. In Conference on Design,Automation and Test in Europe, pages 1020–1025, 2004.

[5] M. J. Ciesielski, P. Kalla, and S. Askar. Taylor expansiondiagrams: A canonical representation for verification of dataflow designs. IEEE Trans. Computers, 55(9):1188–1201,2006.

[6] N. T. Clark, H. Zhong, and S. A. Mahlke. Automated cus-tom instruction generation for domain-specific processor ac-celeration. IEEE Transactions on Computers, 54(10):1258–1270, 2005.

[7] J. Cong and Y. Ding. Combinational logic synthesis for LUTbased field programmable gate arrays. ACM Transactions onDesign Automation of Electronic Systems, 1:145–204, 1996.

[8] J. Cong, Y. Fan, G. Han, and Z. Zhang. Application-specificinstruction generation for configurable processor architec-tures. In FPGA ’04: Proceedings of the 2004 ACM/SIGDA12th international symposium on Field programmable gatearrays, pages 183–189, New York, NY, USA, 2004.

[9] M. A. Ghodrat, T. Givargis, and A. Nicolau. Expressionequivalence checking using interval analysis. IEEE Trans.VLSI Syst., 14(8):830–842, 2006.

[10] M. K. Jain, M. Balakrishnan, and A. Kumar. Asip designmethodologies: Survey and issues. In VLSI01: Proceedingsof the IEEE / ACM International Conference on VLSI De-sign., pages 76–81, 2001.

[11] A. Peymandoust, L. Pozzi, P. Ienne, and G. D. Micheli. Au-tomatic instruction set extension and utilization for embed-ded processors. In Proceedings. IEEE International Con-ference on Application-Specific Systems, Architectures, andProcessors, 2003., pages 108– 118, 2003.

[12] N. Shekhar, P. Kalla, S. Gopalakrishnan, and F. Enescu. Ex-ploiting vanishing polynomials for equivalence veri.cationof fixed-size arithmetic datapaths. In ICCD ’05: Proceed-ings of the 2005 International Conference on Computer De-sign, pages 215–220, Washington, DC, USA, 2005.

[13] D. Stoffel and W. Kunz. Equivalence checking of arithmeticcircuits on the arithmetic bit level. IEEE Trans. on CAD ofIntegrated Circuits and Systems, 23(5):586–597, 2004.

[14] A. K. Verma and P. Ienne. Improved Use of the Carry-Save Representation for the Synthesis of Complex Arith-metic Circuits. In Proceedings of the International Confer-ence on Computer Aided Design, pages 791–98, 2004.

151

[IEEE 2010 23rd International Conference on VLSI Design: concurrently with the 9th International...

Documents

Transcript of [IEEE 2010 23rd International Conference on VLSI Design: concurrently with the 9th International...