
Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology

Frank Hannig, Hritam Dutta, and Jürgen Teich
Department of Computer Science 12, Hardware-Software-Co-Design,
University of Erlangen-Nuremberg, Germany,
{hannig, dutta, teich}@codesign.informatik.uni-erlangen.de

URL: http://www12.informatik.uni-erlangen.de

Abstract

Apart from academic ones, recently more and more commercial coarse-grained reconfigurable arrays have been developed. Computationally intensive applications from the area of video and wireless communication seek to exploit the computational power of such massively parallel SoCs. Conventionally, DSP processors are used in the digital signal processing domain; thus, the existing compilation techniques are closely related to approaches from the DSP world. These approaches employ several loop transformations, like pipelining or temporal partitioning, but they are not able to exploit the full parallelism of a given algorithm and the computational potential of a typical 2-dimensional array.

In this paper, (i) we present an overview of the constraints which have to be considered when mapping applications to coarse-grained reconfigurable arrays, (ii) we present our design methodology for mapping regular algorithms onto massively parallel arrays, which is characterized by loop parallelization in the polytope model, and (iii), in a first case study, we adapt our design methodology to target reconfigurable arrays. The case study shows that the presented regular mapping methodology may lead to highly efficient implementations taking into account the constraints of the architecture.

1 Introduction

These days, semiconductor technology allows implementing arrays of hundreds of 32-bit microprocessors and more on a single die. Computationally intensive applications like video and image processing in consumer electronics and the rapidly evolving market of mobile and personal digital devices are the driving forces of this technology. The increasing amount of functionality and adaptability in these devices has led to a growing consideration of coarse-grained reconfigurable architectures, which provide the flexibility of software combined with the performance of hardware. On the other hand, there is the dilemma of not being able to exploit the hardware complexity of such devices because of a lack of mapping tools. Hence, parallelization techniques and compilers will be of utmost importance in order to map computationally intensive algorithms efficiently to these coarse-grained reconfigurable arrays.

In this context, our paper deals with the specific problem of mapping a certain class of regular nested loop programs onto a dedicated processor array. This work can be classified into the area of loop parallelization in the polytope model [5].

The rest of the paper is structured as follows. In Section 2, a brief survey of previous work on reconfigurable computing is presented. Section 3 introduces our design flow and the class of algorithms we are dealing with. Similar to programmable devices such as processors or microcontrollers, reconfigurable logic devices can also be programmed like software, by programming the configuration of the array. Therefore, reconfigurable logic devices are constrained in several respects. In Section 4, we give an overview of which architectural constraints have to be taken into account during compilation of algorithms onto coarse-grained reconfigurable architectures. Afterwards, in Section 5, a case study of our mapping methodology is given and results are discussed. Future work and concluding remarks are presented in Section 6.

2 Related Work

Reconfigurable architectures span a wide range of abstraction levels, from fine-grained Look-Up Table (LUT) based reconfigurable logic devices, like FPGAs (field-programmable gate arrays) [15], to distributed and hierarchical systems with heterogeneous reconfigurable components.


A taxonomy of reconfigurable logic devices is given in [19]. In [3], the authors characterize the structure of reconfigurable architectures and compare the efficiencies of architectural points across broad application characteristics. One conclusion of their work is that standard arithmetic computing is less efficient on fine-grained architectures because of the large routing area overhead. In order to handle this routing area overhead, and herewith the size of configurations, coarse-grained reconfigurable architectures are another approach. Several academic coarse-grained reconfigurable arrays have been developed [10] and, for some time now, more and more commercial ones are being developed, like the D-Fabrix [4], the DRP from NEC [14], the PACT XPP [1], or QuickSilver Technology's Adaptive Computing Machine (ACM) [18].

Both fine- and coarse-grained architectures have a lack of programmability in common, due to their own paradigms, which differ fundamentally from the von Neumann paradigm. Some research work exists to overcome this obstacle, for instance the Nimble framework [13] for compiling applications specified in C to an FPGA. A hardware/software partitioning algorithm that partitions applications onto a CPU and the datapath (the FPGA) is also contained in the framework. In [25], the authors present a software pipeline vectorization technique to synthesize hardware pipelines. A regular mapping methodology for mapping nested loop programs onto FPGAs is also presented in the work of [17] and will be described later in this paper.

Only little research has been published which deals with compilation to coarse-grained reconfigurable architectures. The authors in [24] describe a compiler framework to analyze SA-C programs, perform optimizations, and automatically map the application onto the MorphoSys architecture [20], a row-parallel or column-parallel SIMD (Single Instruction-stream, Multiple Data-stream) architecture. This approach is limited since the order of the synthesis is predefined by the loop order and no data dependencies between iterations are allowed.

Another approach for mapping loops onto coarse-grained reconfigurable architectures is presented by Lee et al. in [12]. Outstanding in their compilation flow is the target architecture, the DRAA, a generic reconfigurable architecture template which can represent a wide range of coarse-grained reconfigurable arrays. The mapping technique itself is based on loop pipelining and on partitioning the program tree into clusters which can be placed on a line of the array.

In this paper we present a case study based on the mapping of regular algorithms and loop parallelization in the polytope model. The advantage of this approach is the exploitation of full data parallelism, and that an efficient algorithm mapping in terms of space and time may be derived directly from the polytope model.

[Figure 1. PARO Design Flow: C code (subset) is parallelized into SAC code; core transformations (localization, operator splitting, affine transformations, partitioning, control generation, ...) and a design space exploration (scheduling, cost estimation, energy estimation, search space reduction, allocation, placement), backed by mathematical libraries and solvers (PolyLib, LPSolve, CPLEX, PIP), are followed by HDL generation (array structure, processor element, controller) and HW synthesis.]

3 Design Flow for Regular Mapping

In this section we give an overview of our existing mapping methodology PARO [2, 17] for generating synthesizable descriptions of massively parallel processor arrays from regular algorithms. The design flow of our approach is depicted in Fig. 1. The main transformations during the design flow are briefly described in the following.

Starting from a given nested loop program in a sequential high-level language (a subset of C), the program is parallelized by data dependence analysis into single assignment code (SAC). This algorithm class can be written as a class of recurrence equations defined as follows:

Definition 3.1 (Piecewise Regular Algorithm). A piecewise regular algorithm contains $N$ quantified equations

$S_1[I], \;\ldots,\; S_i[I], \;\ldots,\; S_N[I].$


Each equation $S_i[I]$ is of the form

$x_i[I] = \mathcal{F}_i(\ldots, x_j[I - d_{ji}], \ldots) \qquad \forall\, I \in \mathcal{I}_i,$

where $I \in \mathbb{Z}^n$, $x_i[I]$ are indexed variables, $\mathcal{F}_i$ are arbitrary functions, $d_{ji} \in \mathbb{Z}^n$ are constant data dependence vectors, and $\ldots$ denote similar arguments.

The domains $\mathcal{I}_i$ are called index spaces, and in our case are defined as follows:

Definition 3.2 (Linearly Bounded Lattice). A linearly bounded lattice denotes an index space of the form

$\mathcal{I} = \{\, I \in \mathbb{Z}^n \mid I = M \kappa + c \;\wedge\; A \kappa \ge b \,\},$

where $\kappa \in \mathbb{Z}^l$, $M \in \mathbb{Z}^{n \times l}$, $c \in \mathbb{Z}^n$, $A \in \mathbb{Z}^{m \times l}$, and $b \in \mathbb{Z}^m$. The set $\{\, \kappa \in \mathbb{Z}^l \mid A \kappa \ge b \,\}$ defines an integral convex polyhedron or, in case of boundedness, a polytope in $\mathbb{Z}^l$. This set is affinely mapped onto iteration vectors $I$ using an affine transformation ($I = M \kappa + c$).
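For illustration (our example, not given in this form in the text), the index space of the N × N × N matrix multiplication considered in Section 5 can be written as a linearly bounded lattice with M the identity matrix and c = 0:

$$\mathcal{I} = \left\{ I = (i, j, k)^T \in \mathbb{Z}^3 \;\middle|\; \begin{pmatrix} 1 & 0 & 0 \\ -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & -1 \end{pmatrix} I \ge \begin{pmatrix} 1 \\ -N \\ 1 \\ -N \\ 1 \\ -N \end{pmatrix} \right\},$$

which is just the polytope $1 \le i, j, k \le N$.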

Definition 3.3 (Block Pipelining Period). The block pipelining period of an allocated and scheduled piecewise regular algorithm is the time interval between the initiations of two successive problem instances and is denoted by $P$.

With this representation of equations and index spaces, several combinations of parallelizing transformations in the polytope model can be applied:

Affine Transformations, like skewing of data spaces or the embedding of variables into a common index space.

Localization of affine data dependencies to uniform data dependencies by propagation of variables from one index point to a neighboring index point.

Operator Splitting: equations with more than one operation can be split into equations with only one operation (two operands) each.

Exploration of Space-Time Mappings. Linear transformations as in Eq. (1) are used as space-time mappings in order to assign a processor index $p$ (space) and a sequencing index $t$ (time) to index vectors:

$\begin{pmatrix} p \\ t \end{pmatrix} = \begin{pmatrix} Q \\ \lambda \end{pmatrix} I \qquad (1)$

In Eq. (1), $Q \in \mathbb{Z}^{(n-1) \times n}$ and $\lambda \in \mathbb{Z}^{1 \times n}$. The main reason for using linear allocation and scheduling functions is that the data flow between PEs is local and regular, which is essential for low-power VLSI implementations. The interpretation of such a linear transformation is as follows: the set of operations defined at index points with $t = \lambda I = \mathrm{const.}$ is scheduled at the same time step. The index space of allocated processing elements (processor space) is denoted by $\mathcal{Q}$ and is given by the set $\mathcal{Q} = \{\, p \mid p = Q I \,\wedge\, I \in \mathcal{I} \,\}$. This set can also be obtained by choosing a projection of the dependence graph along a vector $u \in \mathbb{Z}^n$, i.e., any coprime vector $u$ (the absolute value of the greatest common divisor of its elements is one) satisfying $Q u = 0$ [11] describes the allocation equivalently. (A small code sketch of applying such a mapping is given after this list.)

The scheduling vector $\lambda$ is obtained by the formulation of a latency minimization problem as a mixed integer linear program (MILP) [22, 23]. This well-known method is used here during exploration as a subroutine. In this MILP, the number of resources inside each processing element can be limited. Also given is the possibility that an operation can be mapped onto different resource types (module selection), and pipelining is also possible.

Besides latency as a measure of performance, also area cost and energy consumption can be considered during the exploration of space-time mappings [7–9].

Partitioning. In order to match resource constraints such as a limited number of processing elements, partitioning techniques have to be applied [21].

Control Generation. If the functionality of a processing element can change over time, control mechanisms are necessary. Further control structures are necessary to control the internal schedule of a PE.

HDL Generation & Synthesis. Finally, after all the refining transformations, a synthesizable description in a hardware description language like VHDL may be generated. This is done by generating one PE and then repetitively generating the entire array.
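As an illustration of such a space-time mapping (our sketch, not part of the PARO flow), the following C fragment applies an example allocation matrix Q and an assumed scheduling vector lambda to the N x N x N index space of the matrix multiplication used later in Section 5; the concrete Q and lambda chosen in the case study may differ.

/* Sketch: applying a space-time mapping (p, t) = (Q I, lambda I) to the
 * N x N x N index space of the matrix multiplication example.
 * Q and lambda below are illustrative values only (a projection along the
 * k-axis and a unit schedule), not the vectors selected in the paper. */
#include <stdio.h>

#define N 4

int main(void) {
    /* allocation matrix Q (2 x 3) and scheduling vector lambda (1 x 3) */
    const int Q[2][3]   = { { 1, 0, 0 },
                            { 0, 1, 0 } };   /* projects out the k dimension */
    const int lambda[3] = { 1, 1, 1 };       /* assumed schedule             */

    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            for (int k = 1; k <= N; k++) {
                int I[3] = { i, j, k };
                int p[2] = { 0, 0 };
                int t = 0;
                for (int c = 0; c < 3; c++) {
                    p[0] += Q[0][c] * I[c];
                    p[1] += Q[1][c] * I[c];
                    t    += lambda[c] * I[c];
                }
                /* operation (i,j,k) runs on PE(p[0],p[1]) at time step t */
                printf("I=(%d,%d,%d) -> PE(%d,%d), t=%d\n",
                       i, j, k, p[0], p[1], t);
            }
    return 0;
}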

4 Special Constraints related to Coarse-grained Reconfigurable Arrays

Full custom ASIC designs have the advantage that they are only constrained by technological and economic parameters. Programmable devices and reconfigurable logic devices, in contrast, are predetermined and limited in terms of resources, even if they are given as an IP (intellectual property) model. In the following we outline which constraints have to be taken into account when mapping algorithms onto homogeneous coarse-grained (re)configurable architectures (a machine-description sketch summarizing these parameters is given after the list). Briefly, such an architecture consists of an array of processor elements, memory, interconnect structures, I/O ports, and synchronization and reconfiguration mechanisms.




Array. The size of the processor array: how many processor elements are in the array and how they are arranged, as one line of processing elements, several lines of processing elements, or an array of N × M processing elements.

Processor element. Each processor element consists of resources for functional operations. These can be either one or more dedicated functional units (multipliers, adders, etc.) or one or more arithmetic logic units (ALUs). In case of an ALU, we have to consider whether this unit is configured in advance or during a reconfiguration phase, or whether the ALU can be programmed with an instruction set. In case of programmability, it has to be considered whether a local program is only executed cyclically (modulo) by a sequencer or whether the instruction set also includes conditional branches.

Memory. Memory can be divided into local memory in the form of register files inside each processor element and into memory banks with storage capacities in the range from hundreds to thousands of words. The alignment and the number of such memory banks are important for the data mapping. Furthermore, knowledge of the available memory modes is helpful, e.g., configuration as a FIFO (no address inputs needed). If a processor element contains an instruction-programmable ALU, an instruction memory is given besides the internal register file.

Interconnect. Here, the structure and number of communication channels are of interest. Which type of interconnect is used, buses or point-to-point connections? How are these channels aligned, vertically, horizontally, or in both directions? How long can point-to-point connections be without incurring delay, or how many cycles have to be taken into account when communicating data from one processor element to another? Additionally, similar structures are required to handle the control flow.

Synchronization. Whether synchronization is explicit or implicit, as in a packet-oriented dataflow architecture.

I/O ports. The maximum bandwidth is defined by the number and width of the I/O ports. The placement of these I/O ports is important, since they are responsible for feeding data in and out. Furthermore, it has to be considered whether an I/O port is a streaming port or an interface to external memory.

Reconfiguration. Here, the configuration time and the number of configuration contexts have to be considered, as well as the possibilities of partial and dynamic reconfiguration during execution.
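The parameters listed above can be collected in a small machine-description record that a mapping tool could consume. The following C sketch is only an illustration of such a parameter set; the field names are ours and do not stem from any particular tool:

/* Sketch of a machine-description record collecting the architecture
 * parameters listed above; all field names are illustrative. */
typedef struct {
    /* array */
    int rows, cols;               /* size of the PE array                        */
    /* processor element */
    int alus_per_pe;              /* functional units / ALUs per PE              */
    int pe_is_programmable;       /* 1 if the ALU executes an instruction stream */
    /* memory */
    int regs_per_pe;              /* local register file size (words)            */
    int mem_banks;                /* number of memory banks                      */
    int bank_words;               /* capacity of one bank (words)                */
    int bank_width_bits;          /* word width of one bank                      */
    /* interconnect */
    int channels_horizontal;      /* routing channels per row                    */
    int channels_vertical;        /* routing channels per column                 */
    int hop_delay_cycles;         /* cycles per PE-to-PE hop                     */
    /* I/O */
    int io_ports;                 /* number of I/O interfaces                    */
    int io_width_bits;            /* width of one I/O port                       */
    /* reconfiguration */
    int contexts;                 /* number of configuration contexts            */
    int reconfig_cycles;          /* configuration time (cycles)                 */
    int partial_reconfig;         /* 1 if partial/dynamic reconfiguration        */
} machine_desc_t;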

5 Case Study

In this case study, our regular mapping methodology has been applied to a matrix multiplication algorithm. Different solutions are discussed and compared to existing results. But first, in the next subsection, the PACT XPP64-A processor that is used as the target architecture of our case study is described. Afterwards, our mapping methodology is applied.

5.1 The PACT XPP64-A Reconfigurable Processor

The PACT XPP64-A [1, 16] is a high-performance, run-time reconfigurable processor array; a schematic diagram of its internal structure is shown in Figure 2. The XPP64-A contains 64 ALU-PAEs (processing array elements) with a data width of 24 bit, arranged in an 8 × 8 array. As depicted in Figure 3, each ALU-PAE consists of three objects. The ALU object performs several arithmetic, Boolean, and shift operations. The Back-Register object (BREG) provides routing channels for data and events from bottom to top, additional arithmetic functions, and a barrel shifter. The Forward-Register object (FREG) is mainly intended for routing of data from top to bottom and for control of the data flow by means of event signals. All objects are connected to horizontal routing channels. RAM-PAEs are located in two columns at the left and right borders of the array. The dual-ported RAM has two separate ports for independent read and write operations. Furthermore, the RAM can be configured to FIFO mode (no address inputs needed). Each of the 16 RAM-PAEs has a storage capacity of 512 × 24 bit. Four independent I/O interfaces in the corners of the array provide high-bandwidth connections from the internal data path to external streaming data or direct access to external RAMs.
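For illustration, the machine-description sketch from Section 4 could be instantiated with the XPP64-A parameters mentioned above; values not stated in the text are left as -1 placeholders (this is an illustration, not a datasheet excerpt):

/* Illustrative instantiation of machine_desc_t for the PACT XPP64-A,
 * filled only with values mentioned in the text above; -1 marks
 * parameters not stated here. */
static const machine_desc_t xpp64a = {
    .rows = 8, .cols = 8,            /* 64 ALU-PAEs in an 8 x 8 array             */
    .alus_per_pe = 1,                /* one ALU object per ALU-PAE                */
    .pe_is_programmable = 0,         /* configured, not instruction-programmed    */
    .regs_per_pe = -1,
    .mem_banks = 16,                 /* 16 RAM-PAEs                               */
    .bank_words = 512,               /* 512 x 24-bit words per RAM-PAE            */
    .bank_width_bits = 24,
    .channels_horizontal = -1,
    .channels_vertical = -1,
    .hop_delay_cycles = -1,
    .io_ports = 4,                   /* four I/O interfaces in the corners        */
    .io_width_bits = 24,             /* assumed equal to the data width           */
    .contexts = -1,
    .reconfig_cycles = -1,
    .partial_reconfig = -1,
};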

5.2 Matrix Multiplication Algorithm

In the following regular mapping study we consider a matrix multiplication algorithm. The product $C = A \cdot B$ of two matrices $A \in \mathbb{R}^{N \times N}$ and $B \in \mathbb{R}^{N \times N}$ is defined as follows:

$c_{ij} = \sum_{k=1}^{N} a_{ik} \, b_{kj}, \qquad \forall\, 1 \le i, j \le N. \qquad (2)$

Let the matrix product defined in Eq. (2) be given as a C program (Fig. 4). It is assumed that the input matrices A and B are already stored in the arrays a[N][N] and b[N][N], and that the elements of the array c are initialized with zero.


[Figure 2. Structure of the PACT XPP64-A reconfigurable processor: Configuration Manager, RAM-PAEs, ALU-PAEs, and I/O elements.]

[Figure 3. ALU-PAE objects: FREG, ALU, and BREG with their registers and a configuration LUT.]

After successively performing parallelization, embedding, and localization of variables, the algorithm in Fig. 5 is derived. The matrices A and B are embedded into the arrays as follows: a[i][0][k] = $a_{ik}$ and b[0][j][k] = $b_{kj}$. Considering the amount of allocated memory, it is obvious that resources are wasted by this C program. On the other hand, full data parallelism is explicitly represented by the program. Furthermore, the algorithm is regular since affine data dependencies have been transformed to uniform data dependencies by localization. The computations of such a regular algorithm may be represented by a dependence graph (DG). The DG of the algorithm in Fig. 5 is shown in Fig. 6. The DG expresses the partial order between the operations. Each variable of the algorithm is represented by a node at every index point $I \in \mathcal{I}$. The edges correspond to the data dependencies of the algorithm. They are regular throughout the algorithm, i.e., a[i][j][k] is directly dependent on a[i][j-1][k]. The DG implicitly specifies all legal execution orderings of operations: if there is a directed path in the DG from a node at index $I_1$ to a node at index $I_2$ with $I_1 \neq I_2$, then the computation at $I_1$ must precede the computation at $I_2$. The corresponding dependence graphs can then be represented in a reduced form, the so-called reduced dependence graph.

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    for (k = 0; k < N; k++) {
      c[i][j] = a[i][k] * b[k][j] + c[i][j];
    }
  }
}

Figure 4. Matrix multiplication algorithm, C code.

for (i = 1; i < N+1; i++) {
  for (j = 1; j < N+1; j++) {
    for (k = 1; k < N+1; k++) {
      a[i][j][k] = a[i][j-1][k];
      b[i][j][k] = b[i-1][j][k];
      z[i][j][k] = a[i][j][k] * b[i][j][k];
      c[i][j][k] = c[i][j][k-1] + z[i][j][k];
    }
  }
}

Figure 5. Matrix multiplication algorithm after parallelization, operator splitting, embedding, and localization.
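As a sanity check (not part of the original flow), the localized single-assignment program of Fig. 5 can be executed directly and compared against the plain loop nest of Fig. 4; a minimal sketch for N = 4:

/* Sketch: executing the localized single-assignment program of Fig. 5
 * and checking it against the plain loop nest of Fig. 4 for N = 4. */
#include <stdio.h>
#include <stdlib.h>

#define N 4

int main(void) {
    int A[N][N], B[N][N], C_ref[N][N] = {{0}};
    /* SAC variables, indexed 0..N as in Fig. 5 (index 0 holds the inputs);
     * static storage also provides the zero initialization of c[i][j][0]. */
    static int a[N + 1][N + 1][N + 1], b[N + 1][N + 1][N + 1],
               z[N + 1][N + 1][N + 1], c[N + 1][N + 1][N + 1];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = rand() % 10;
            B[i][j] = rand() % 10;
        }

    /* reference: Fig. 4 */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C_ref[i][j] += A[i][k] * B[k][j];

    /* embedding of the inputs: a[i][0][k] = a_ik, b[0][j][k] = b_kj */
    for (int i = 1; i <= N; i++)
        for (int k = 1; k <= N; k++)
            a[i][0][k] = A[i - 1][k - 1];
    for (int j = 1; j <= N; j++)
        for (int k = 1; k <= N; k++)
            b[0][j][k] = B[k - 1][j - 1];

    /* localized single-assignment program of Fig. 5 */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            for (int k = 1; k <= N; k++) {
                a[i][j][k] = a[i][j - 1][k];
                b[i][j][k] = b[i - 1][j][k];
                z[i][j][k] = a[i][j][k] * b[i][j][k];
                c[i][j][k] = c[i][j][k - 1] + z[i][j][k];
            }

    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            if (c[i][j][N] != C_ref[i - 1][j - 1])
                printf("mismatch at (%d,%d)\n", i, j);
    return 0;
}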


Definition 5.1 (Reduced Dependence Graph). A reduced dependence graph (RDG) $G = (V, E, D)$ of dimension $n$ is a network where $V$ is a set of nodes and $E \subseteq V \times V$ is a set of edges. To each edge $e = (v_i, v_j)$ there is associated a dependence vector $d_{ij} \in \mathbb{Z}^n$.

In Fig. 7, the RDG of the algorithm introduced in Fig. 5 is shown.

Virtual Processor Elements (VPEs) are used to map the PEs obtained from the design flow to the given architecture, because a single processing element of the reconfigurable architecture may not be able to implement the full functionality of a PE obtained from the design flow.

In order to map the matrix multiplication algorithm to the 2-dimensional PACT XPP array, a design space exploration [7] of possible space-time mappings is performed. This exploration gives a set of mappings that are optimal in terms of the number of VPEs and in terms of latency. From this set of optimal mappings we have selected a projection in direction $u = (0\;0\;1)^T$, i.e., along the k axis, and a corresponding scheduling vector $\lambda$. In the following we consider matrices of size 4 × 4. The processor array for this projection is shown in Figure 8. Due to the localization procedure, the matrix elements can be read into the array at the borders: matrix A is fed sequentially from the left side into the array, where each row of the matrix is stored in a separate FIFO, and matrix B is fed sequentially, row by row,


[Figure 6. Dependence graph of the transformed matrix multiplication algorithm for N = 4: at each index point (i, j, k), nodes a and b feed a multiply node whose result is accumulated by an add node.]

[Figure 7. Reduced dependence graph: nodes a, b, *, and + with their associated dependence vectors.]

from the top into the array, where each column of B is likewise stored in a separate FIFO. Afterwards, the elements of A (B) are propagated cycle by cycle to the right (lower) neighboring processor element. Similar to this data propagation, an event signal is propagated to reset the accumulators of the processor elements every fourth clock cycle. This is the same rate as the block pipelining period of the whole array: after an initial delay of seven clock cycles, a new matrix multiplication can be started every fourth cycle. Thus, the utilization of the 16 ALU-PAEs is 100%, since in every cycle 16 multiply-accumulate (MAC) operations are completed. Since the result matrix C is distributed over all 16 processor elements, the readout of the result data has to be serialized.

[Figure 8. 4 × 4 processor array: matrix A is fed row-wise from FIFOs at the left border (A1–A4), matrix B column-wise from FIFOs at the top border (B1–B4), and an event generator (Gen) supplies the reset events to PE(1,1)–PE(4,4).]
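The behavior of a single processor element under this mapping can be sketched as follows (our illustration in C, not PACT configuration code): in every cycle the PE multiplies the operands arriving from its left and upper neighbors, accumulates the product, forwards the operands, and outputs and clears the accumulator when the propagated reset event arrives every fourth cycle.

/* Illustrative per-PE behavior for the projected 4 x 4 array (not PACT code).
 * Called once per clock cycle with the operands arriving from the left and
 * upper neighbors and the propagated reset event. */
typedef struct {
    int acc;        /* local accumulator for one element of C   */
    int a_out;      /* operand forwarded to the right neighbor  */
    int b_out;      /* operand forwarded to the lower neighbor  */
} pe_state_t;

static int pe_cycle(pe_state_t *pe, int a_in, int b_in, int reset_event) {
    int result = -1;                 /* -1: no output this cycle           */
    pe->acc += a_in * b_in;          /* multiply-accumulate                */
    pe->a_out = a_in;                /* forward A to the right             */
    pe->b_out = b_in;                /* forward B downwards                */
    if (reset_event) {               /* asserted every fourth cycle        */
        result = pe->acc;            /* element of C is read out           */
        pe->acc = 0;                 /* start the next problem instance    */
    }
    return result;
}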

5.3 Serialization of Output Data

Let $\mathcal{O}_x$ be the output-variable space of a variable $x$ of the space-time mapped or partitioned index space. The output space can be two-dimensional, i.e., the transformed output variables are not aligned to one border of the array but rather are distributed over the entire array.

As most coarse-grained reconfigurable architectures have horizontal or vertical interconnect structures, henceforth, without loss of generality, we consider such structures to collect the data from one line of processors and feed them out to an array border. Let $P$ be the block pipelining period of the scheduled regular algorithm, let $\mathcal{R}_i = \{\, p \in \mathcal{Q} \mid p_2 = i = \mathrm{const.} \,\}$ be the $i$-th row of processors, and let $\mathcal{C}_j = \{\, p \in \mathcal{Q} \mid p_1 = j = \mathrm{const.} \,\}$ be the $j$-th column of processors, respectively. Furthermore, let $\mathcal{T}_x(p) = \{\, t \mid t = \lambda I \,\wedge\, p = Q I \,\wedge\, I \in \mathcal{O}_x \,\}$ denote the time instances at which the variable $x$ produces an output at processor element $p$. Then, the output data can be serialized if one of the following conditions holds:

$\sum_{p \in \mathcal{R}_i} |\mathcal{T}_x(p)| \le P \qquad \text{or} \qquad \sum_{p \in \mathcal{C}_j} |\mathcal{T}_x(p)| \le P.$

This condition can be seen as a bandwidth constraint, since the number of output variables per line cannot be greater than the block pipelining period. If the set $\mathcal{T}_x(p)$ is not dense, the output variables can be fed out in an interleaved manner. The output packets of different processor elements of a line have to be delayed to avoid overlapping. Finally, in order to feed out the data, new equations may be inserted into the algorithm using Boolean OR operations, or, since it can be guaranteed that only one data packet arrives in any clock cycle, all data output ports can be connected to the same data input port.
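As an illustration of the bandwidth condition (our sketch, with made-up output counts), the check for one row of processors sums the number of output time instances |T_x(p)| of the PEs in that row and compares the sum against the block pipelining period P:

/* Sketch: checking the output-serialization condition for one row of PEs.
 * outputs_per_pe[j] stands for |T_x(p)| of the j-th PE in the row; the
 * values below are made up for illustration. */
#include <stdio.h>

#define COLS 4

static int row_can_be_serialized(const int outputs_per_pe[COLS], int P) {
    int total = 0;
    for (int j = 0; j < COLS; j++)
        total += outputs_per_pe[j];
    return total <= P;          /* bandwidth constraint per line */
}

int main(void) {
    const int outputs_per_pe[COLS] = { 1, 1, 1, 1 };  /* hypothetical counts      */
    const int P = 4;                                  /* block pipelining period  */
    printf("row serializable: %s\n",
           row_can_be_serialized(outputs_per_pe, P) ? "yes" : "no");
    return 0;
}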

5.4 Partitioned Implementation of the Matrix Multiplication Algorithm

The example of the 4 × 4 matrix multiplication algorithm benefits from the parallel access to several internal memory blocks, which have to be very closely coupled to a surrounding architecture, e.g., to a cache.


[Figure 9. (a) Dataflow graph of the LPGS-partitioned 4 × 4 matrix multiplication example; (b) dataflow graph after performing localization inside each tile; (c) array implementation of the partitioned example with 2 × 2 VPEs, multiplexers, delay registers, event generators, address generators, and external memories EXMEM0 and EXMEM1.]

Considering a more realistic scenario, there are three external memory blocks, one for each of the matrices A, B, and the result matrix C. Each memory block can be accessed via an independent I/O port, i.e., in one clock cycle one element of matrix A and one element of B can be read, and one element of C can be written. From this it follows that the reading of one column of A or one row of B, respectively, takes 4 clock cycles, and a MAC operation is performed only every fourth cycle per PE. Due to this I/O bottleneck, the processor elements (the PEs which are responsible for the arithmetic part of the calculation) are utilized only by 25%. To increase the utilization of the VPEs, their number is reduced by 75% to four VPEs. In order to map the 4 × 4 matrix multiplication algorithm to the reduced number of VPEs, the algorithm has to be partitioned. This technique also has to be applied when large problems that do not fit onto the array have to be mapped. Fig. 9 depicts the partitioning scheme chosen here and the resulting processor array. In this example, an LPGS (local parallel, global sequential) partitioning scheme has been used: each index point within one tile corresponds to one physical processor, which executes the index points at the same position of the other tiles sequentially. The sequential execution order is indicated by the numbers inside the tiles. In order to save resources (local registers), first the partitioning is performed, Fig. 9 (a), and afterwards the localization, Fig. 9 (b) [21]. The 2 × 2 processor array is again regular and similar to the architecture of the non-partitioned example. The sole difference is that for the accumulation of the variable c, a shift register of length four is necessary in each VPE. The inputs to the virtual processor elements are taken internally from other VPEs or externally from multiplexers, as shown in Fig. 9 (c). The multiplexers select delayed or incoming data from the external memory according to the event generators. The address generators are used to access the matrix coefficients in external memory. The implementation of the partitioned version requires a different addressing scheme, which is stored as a LUT in an internal memory.
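To make the LPGS scheme concrete, the following sketch (our illustration, assuming 2 × 2 tiles over the original 4 × 4 processor space) shows how each physical VPE coordinate (local, executed in parallel) is combined with a tile coordinate (global, executed sequentially):

/* Sketch of LPGS (local parallel, global sequential) partitioning of the
 * 4 x 4 processor space onto 2 x 2 VPEs; our illustration, not PARO output.
 * Each VPE executes the index points at its local position of all tiles
 * one after the other. */
#include <stdio.h>

#define ARRAY 4   /* original processor space: ARRAY x ARRAY */
#define TILE  2   /* tile (= VPE array) size: TILE x TILE    */

int main(void) {
    /* global sequential loop over the tiles ... */
    for (int ti = 0; ti < ARRAY / TILE; ti++)
        for (int tj = 0; tj < ARRAY / TILE; tj++)
            /* ... local parallel execution inside one tile (all VPEs at once) */
            for (int li = 0; li < TILE; li++)
                for (int lj = 0; lj < TILE; lj++) {
                    int pi = ti * TILE + li;   /* original processor row    */
                    int pj = tj * TILE + lj;   /* original processor column */
                    printf("PE(%d,%d) is executed on VPE(%d,%d) in tile (%d,%d)\n",
                           pi + 1, pj + 1, li + 1, lj + 1, ti + 1, tj + 1);
                }
    return 0;
}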

5.5 Results and Comparison

In both of our implementations, the full-size and the partitioned version of the matrix-matrix multiplication with input matrices of dimension 4 × 4, optimal utilization of resources is observed, i.e., each configured MAC unit performs one operation in each cycle.

Comparing our 4 × 4 array implementation with the matrix-vector multiplication implemented by Gunnarsson et al. [6], it is observed that with fewer resources and a better implementation, more performance per cycle can be achieved. The number of ALUs is reduced from […] to […]. The comparison was done by implementing our matrix-matrix multiplication as N matrix-vector multiplications. Furthermore, in our implementation the output serialization, i.e., the merging and writing of output data streams, is overlapped with computations in the PEs, so we have no overhead associated with writing the data. This is a considerable improvement over the implementation presented in [6], where the overhead associated with merging and writing data is […]. Therefore, the computational time (block pipelining period) is reduced from […] to […], or, in case Gunnarsson et al. use two Processing Array Clusters (PACs), from […] to […].

6 Conclusions and Future Work

In this paper we presented, for the first time, a mapping methodology based on loop parallelization in the polytope model in order to map nested loop kernels onto coarse-grained reconfigurable arrays. The obtained results are efficient in terms of utilization of resources and execution time. In the future we would like to adapt our PARO design methodology to coarse-grained reconfigurable arrays in order to perform automatic compilation of nested loop programs.



7 Acknowledgments

Supported in part by the German Science Foundation (DFG), Project SFB 376 "Massively Parallel Computation". Furthermore, the authors would like to thank PACT XPP Technologies for their collaboration and support.

References

[1] V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt. PACT XPP – A Self-Reconfigurable Data Processing Architecture. The Journal of Supercomputing, 26(2):167–184, 2003.

[2] M. Bednara and J. Teich. Synthesis of FPGA Implementations from Loop Algorithms. In First International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'01), pages 1–7, Las Vegas, NV, June 2001.

[3] A. DeHon. Reconfigurable Architectures for General-Purpose Computing. Technical Report A.I. TR No. 1586, MIT AI Lab, Massachusetts, 1996.

[4] Elixent Ltd. www.elixent.com.

[5] P. Feautrier. Automatic Parallelization in the Polytope Model. Technical Report 8, Laboratoire PRiSM, Université de Versailles St-Quentin en Yvelines, 45, avenue des États-Unis, F-78035 Versailles Cedex, June 1996.

[6] F. Gunnarsson, C. Hansson, D. Johnsson, and B. Svensson. Implementing High Speed Matrix Processing on a Reconfigurable Parallel Dataflow Processor. In Second International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'02), pages 74–80, Las Vegas, NV, June 2002.

[7] F. Hannig and J. Teich. Design Space Exploration for Massively Parallel Processor Arrays. In V. Malyshkin, editor, Parallel Computing Technologies, 6th International Conference, PaCT 2001, Proceedings, volume 2127 of Lecture Notes in Computer Science (LNCS), pages 51–65, Novosibirsk, Russia, Sept. 2001. Springer.

[8] F. Hannig and J. Teich. Energy Estimation of Nested Loop Programs. In Proceedings 14th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA 2002), Winnipeg, Manitoba, Canada, Aug. 2002. ACM Press.

[9] F. Hannig and J. Teich. Domain-Specific Processors: Systems, Architectures, Modeling, and Simulation, chapter 6, Energy Estimation and Optimization for Piecewise Regular Processor Arrays, pages 107–126. Number 20 in Signal Processing and Communications. Marcel Dekker, New York, U.S.A., 2004.

[10] R. Hartenstein. A Decade of Reconfigurable Computing: A Visionary Retrospective. In Proceedings of Design, Automation and Test in Europe, pages 642–649, Munich, Germany, Mar. 2001. IEEE Computer Society.

[11] R. Kuhn. Transforming Algorithms for Single-Stage and VLSI Architectures. In Workshop on Interconnection Networks for Parallel and Distributed Processing, pages 11–19, West Lafayette, IN, Apr. 1980.

[12] J. Lee, K. Choi, and N. Dutt. An Algorithm for Mapping Loops onto Coarse-grained Reconfigurable Architectures. In Languages, Compilers, and Tools for Embedded Systems (LCTES'03), pages 183–188, San Diego, CA, June 2003. ACM Press.

[13] Y. Li, T. Callahan, E. Darnell, R. Harr, U. Kurkure, and J. Stockwood. Hardware-Software Co-Design of Embedded Reconfigurable Architectures. In 37th Design Automation Conference, pages 507–512, Los Angeles, CA, June 2000.

[14] M. Motomura. A Dynamically Reconfigurable Processor Architecture. In Microprocessor Forum, CA, 2002.

[15] J. Oldfield and R. Dorf. Field Programmable Gate Arrays: Reconfigurable Logic for Rapid Prototyping and Implementation of Digital Systems. John Wiley & Sons, Chichester, New York, 1995.

[16] PACT XPP Technologies. XPP64-A1 Reconfigurable Processor – Datasheet. Munich, Germany, 2003.

[17] PARO Design System Project. www12.informatik.uni-erlangen.de/research/paro.

[18] QuickSilver Technology. www.qstech.com.

[19] P. Schaumont, I. Verbauwhede, K. Keutzer, and M. Sarrafzadeh. A Quick Safari Through the Reconfiguration Jungle. In 38th Design Automation Conference, pages 172–177, Las Vegas, NV, June 2001.

[20] H. Singh, M.-H. Lee, G. Lu, N. Bagherzadeh, F. J. Kurdahi, and E. M. C. Filho. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Transactions on Computers, 49(5):465–481, 2000.

[21] J. Teich and L. Thiele. Exact Partitioning of Affine Dependence Algorithms. In E. Deprettere, J. Teich, and S. Vassiliadis, editors, Embedded Processor Design Challenges, volume 2268 of Lecture Notes in Computer Science (LNCS), pages 135–153, Mar. 2002.

[22] J. Teich, L. Thiele, and L. Zhang. Scheduling of Partitioned Regular Algorithms on Processor Arrays with Constrained Resources. Journal of VLSI Signal Processing, 17(1):5–20, Sept. 1997.

[23] L. Thiele. Resource Constrained Scheduling of Uniform Algorithms. Journal of VLSI Signal Processing, 10:295–310, 1995.

[24] G. Venkataramani, W. Najjar, F. Kurdahi, N. Bagherzadeh, W. Böhm, and J. Hammes. Automatic Compilation to a Coarse-grained Reconfigurable System-on-Chip. ACM Transactions on Embedded Computing Systems, to appear November 2003.

[25] M. Weinhardt and W. Luk. Pipeline Vectorization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20:234–248, Feb. 2001.
