Tool #2 Introduction DESIGN OF FAST AND EFFICIENT HYBRID-FPGAs FOR NUMERICALLY INTENSIVE...

1
Tool #2 Introduction DESIGN OF FAST AND EFFICIENT HYBRID-FPGAs FOR NUMERICALLY INTENSIVE APPLICATIONS IN FLUID DYNAMICS, MOLECULAR MODELING AND IMAGE/VIDEO PROCESSING A. Akoglu, A. Dasu, S. Panchanathan Center for Ubiquitous Computing, Arizona State University, Tempe, AZ {akoglu,dasu,panch}@asu.edu Abstract This research work presents results obtained from hybrid FPGA architecture design methodology proposed in earlier work. Hybrid architecture is formed of ASIC units and LUT based processing elements. ASIC units represent tasks or core clusters obtained through common sub-graph analysis between basic blocks within and across routines of computation intensive applications and are basically recurring patterns. Results show that partial reconfiguration with the use of computation cores embedded in a sea of LUTs offer potential for massive savings in gate density by eliminating the need for redundant sub-circuit pattern configurations. Since ASICs cover only parts of data flow graphs, remaining computations are implemented on LUT based reconfigurable hardware. A new packing function is proposed to form LUT based processing elements. Packing cost function prioritizes reduction of input/output pins of the clusters being formed. Results show that significant savings in number of nets to be routed are obtained through proposed method. Captions to be set in Times or Times New Roman or equivalent, italic, 18 to 24 points, to the length of the column in case a figure takes more than 2/3 of column width. Captions to be set in Times or Times New Roman or equivalent, italic, between 18 and 24 points. Left aligned if it refers to a figure on its left. Caption starts right at the top edge of the picture (graph or photo). applications. We have conducted experiments on MPEG-4 VVM, Gnu Scientific Library (GSL) and NAMD molecular modeling library. A map report based on Spartan 2E architecture was obtained based on the synthesis report. Results show that partial reconfiguration with the use of computation cores embedded in a sea of LUTs offer the potential for massive savings in gate density by eliminating the need for redundant sub-circuit pattern configurations (Table 1). Since CIPEs cover only parts of DFGs, remaining computations (reconfigurable data flow computations) are then implemented on LUT based reconfigurable hardware. Methodology in Figure 4 proposes to provide optimum interconnection pathways between different hierarchy levels with variable size processing elements, allocating just enough switching and wiring resources as a result of profiling the computational characteristics of the application domains. In existing approaches packing threats number of intersecting nets as positive gain and doesn’t address how wiring requirement grows after including an LUT into a cluster. We also argue that cost function should give priority to the nets causing a decrease in the number of input or output pins of the target cluster. Rent’s rule based packing mechanism (Figure 5) designed to improve the routing architecture by reducing the number of nets to be routed has been implemented. This mechanism prioritizes (Figure 6,7) nets that lead to reduction of number of input/output pins during packing in addition to routability driven cost metrics defined by other researchers. Table 2 presents the performance of packing compared to V-Pack ( Rose et. al) and R- Pack( Sarrafzadeh et. al) analysis between basic blocks within and across routines are basically recurring computation patterns implemented as ASICs on non- reconfigurable area(CIPEs). Designing processing elements based on identifying correlated compute intensive regions within each application and between applications result in large amounts of processing in localized regions of the chip. This reduces the amount of reconfigurations and on- chip communication hence results with faster application switching and reduced power consumption. This task comprises of finding the Common Sub-graphs (Figure 3), which is closely related to the Largest Common Sub-Graph problem (a proven NP complete problem). Core reusable regions that have been detected as common within or across applications by peer research efforts, have either been at the granularity of MAC units (2 nodes) or at the granularity of entire function modules. There has been no reported work that has detected core reusable regions consisting of several operation (multiply, add, divide etc) nodes between basic blocks in applications, with emphasis on accelerating data flow on hardware. Our method generates ASIC cores of higher granularity by specifically focusing on Dataflow graphs of Hardware Computations and taking advantages of the restrictions that they offer. In Hybrid-FPGA model, we propose that the CIPE region be constrained and mapped onto a slab (a region of LUTs isolated by MacroBus as in the Virtex architecture) or implemented as gates in ASIC technology. Even though a large ASIC on chip increases the costs of mask design, it offers the maximum amount of gate savings. Remaining slabs are implemented on LUT based reconfigurable. To the best of our knowledge currently there exists no known technology that maps regions within a single DFG into multiple Slabs for Partial Reconfiguration We have conducted experiments on several complex routines from the target Conclusion From this research effort we believe that partial reconfiguration with the use of computation cores embedded in a sea of LUTs offer the potential for massive savings in gate density and by eliminating the need for unnecessary and redundant sub-circuit pattern configurations. We believe that this direction will lead to the next generation FPGA devices geared towards computationally intensive applications such as bio-chemical algorithms and scientific applications. Selected Publications 1. “Reconfigurable Media Processing”, A. Dasu et.al, Parallel Computing Vol. 28, August 2002. Pg(s) 1111 - 1139. 2. A.Akoglu, A. Dasu and S. Panchanathan ,”A Framework for the Design of the Heterogeneous Hierarchical Routing Architecture of a Dynamically Reconfigurable Application Specific Media Processor”,Workshop on Embedded Systems for Media Processing,Dec 17, 2003,Hyderabad, India 3. A. Dasu, A.Akoglu, S.Panchanathan , “An Analysis Tool Set for Reconfigurable Media Processing”,The International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'03),June 2003,Las Vegas 4. 3 Patents pending under US and International Application Source code Application Source code Application Source code C O M P I L E R Lance, SUIF, gcc A N A L Y S I S T O O L S •Function level ? •BB level ? •Or something in between ? Figure 1. Methodology Figure 2a. Hybrid-FPGA Table-1 configuration bits and clock cycles savings Figure 5. Rent ‘s Rule based Packing Parameters 7 1 ) ( i i Case Gain Figure 7. Packing Cost Function Table-2 Amount of nets and tracks savings Figure 4. LUT Based Architecture Methodology Figure 6. How to prioritize net reduction Methodology (Figure 1) involves extraction of tasks or core clusters in Control Data Flow Graphs (CDFGs) of applications followed by designing the architecture to embed them in Hybrid-FPGA environments (Figure 2a,2b). By Hybrid, we mean that the proposed FPGA architectures will involve LUT and ASIC regions. Tasks or core clusters obtained through the common-sub-graph Numerical simulations in Computational Fluid Dynamics, Molecular Modeling have some common computation features, and performed iteratively. Similarly video and image processing applications involve tasks (mosaic building to compress video into images, image compression such as DCT, DWT etc.) which also have some common computation features and require iterative processing. It has been shown by several researchers that these applications are well suited to be executed on spatially parallel processor architectures. FPGAs in particular offer large amounts of on-chip spatial parallel units, thus capable of performing orders of magnitude faster than regular serial processors. But FPGAs suffer from the drawbacks of being application agnostic and hence incur penalties of loss of clock cycles in redundant reconfigurations, generic routing and poor memory architectures which impact speed, power and silicon area. All these factors have led us into exploring the reconfigurable architecture design space with the application domain being prioritized. Figure 2b. Hybrid-FPGA Finding Comon Sub-Graph Dominant Sub-Graph Figure 3 ) ( m B r m m CB P B R 3 2 2 R C W 1 1 1 1 2 1 1 1 1 1 1 1 1 1 ' ' ' ' ' ' ' ' N K N K L T L T C N S L C N S L size size size size ' 1 L Win 1 L Win ' 1 L Wout 1 L Wout 1 ' 1 ' size size size size s size s s size C C S S C C C L S Step-1 Step-2 Step-3 Step-5 L i T i L i T i i C r C R 0 0 Step-4

Transcript of Tool #2 Introduction DESIGN OF FAST AND EFFICIENT HYBRID-FPGAs FOR NUMERICALLY INTENSIVE...

Page 1: Tool #2 Introduction DESIGN OF FAST AND EFFICIENT HYBRID-FPGAs FOR NUMERICALLY INTENSIVE APPLICATIONS IN FLUID DYNAMICS, MOLECULAR MODELING AND IMAGE/VIDEO.

Tool #2

Introduction

DESIGN OF FAST AND EFFICIENT HYBRID-FPGAs FOR NUMERICALLY INTENSIVE APPLICATIONS IN FLUID DYNAMICS, MOLECULAR MODELING AND IMAGE/VIDEO PROCESSING

A. Akoglu, A. Dasu, S. Panchanathan Center for Ubiquitous Computing, Arizona State University, Tempe, AZ

{akoglu,dasu,panch}@asu.edu

AbstractThis research work presents results obtained from hybrid FPGA architecture design methodology proposed in earlier work. Hybrid architecture is formed of ASIC units and LUT based processing elements. ASIC units represent tasks or core clusters obtained through common sub-graph analysis between basic blocks within and across routines of computation intensive applications and are basically recurring patterns. Results show that partial reconfiguration with the use of computation cores embedded in a sea of LUTs offer potential for massive savings in gate density by eliminating the need for redundant sub-circuit pattern configurations. Since ASICs cover only parts of data flow graphs, remaining computations are implemented on LUT based reconfigurable hardware. A new packing function is proposed to form LUT based processing elements. Packing cost function prioritizes reduction of input/output pins of the clusters being formed. Results show that significant savings in number of nets to be routed are obtained through proposed method.

Captions to be set in Times or Times New Roman or equivalent, italic, 18 to 24 points, to the length of the column in case a figure takes more than 2/3 of column width.

Captions to be set in Times or Times New Roman or equivalent, italic, between 18 and 24 points. Left aligned if it refers to a figure on its left. Caption starts right at the top edge of the picture (graph or photo).

applications. We have conducted experiments on MPEG-4 VVM, Gnu Scientific Library (GSL) and NAMD molecular modeling library. A map report based on Spartan 2E architecture was obtained based on the synthesis report. Results show that partial reconfiguration with the use of computation cores embedded in a sea of LUTs offer the potential for massive savings in gate density by eliminating the need for redundant sub-circuit pattern configurations (Table 1).

Since CIPEs cover only parts of DFGs, remaining computations (reconfigurable data flow computations) are then implemented on LUT based reconfigurable hardware. Methodology in Figure 4 proposes to provide optimum interconnection pathways between different hierarchy levels with variable size processing elements, allocating just enough switching and wiring resources as a result of profiling the computational characteristics of the application domains. In existing approaches packing threats number of intersecting nets as positive gain and doesn’t address how wiring requirement grows after including an LUT into a cluster. We also argue that cost function should give priority to the nets causing a decrease in the number of input or output pins of the target cluster. Rent’s rule based packing mechanism (Figure 5) designed to improve the routing architecture by reducing the number of nets to be routed has been implemented. This mechanism prioritizes (Figure 6,7) nets that lead to reduction of number of input/output pins during packing in addition to routability driven cost metrics defined by other researchers. Table 2 presents the performance of packing compared to V-Pack ( Rose et. al) and R-Pack( Sarrafzadeh et. al)

analysis between basic blocks within and across routines are basically recurring computation patterns implemented as ASICs on non-reconfigurable area(CIPEs). Designing processing elements based on identifying correlated compute intensive regions within each application and between applications result in large amounts of processing in localized regions of the chip. This reduces the amount of reconfigurations and on-chip communication hence results with faster application switching and reduced power consumption. This task comprises of finding the Common Sub-graphs (Figure 3), which is closely related to the Largest Common Sub-Graph problem (a proven NP complete problem). Core reusable regions that have been detected as common within or across applications by peer research efforts, have either been at the granularity of MAC units (2 nodes) or at the granularity of entire function modules. There has been no reported work that has detected core reusable regions consisting of several operation (multiply, add, divide etc) nodes between basic blocks in applications, with emphasis on accelerating data flow on hardware. Our method generates ASIC cores of higher granularity by specifically focusing on Dataflow graphs of Hardware Computations and taking advantages of the restrictions that they offer. In Hybrid-FPGA model, we propose that the CIPE region be constrained and mapped onto a slab (a region of LUTs isolated by MacroBus as in the Virtex architecture) or implemented as gates in ASIC technology. Even though a large ASIC on chip increases the costs of mask design, it offers the maximum amount of gate savings. Remaining slabs are implemented on LUT based reconfigurable. To the best of our knowledge currently there exists no known technology that maps regions within a single DFG into multiple Slabs for Partial Reconfiguration We have conducted experiments on several complex routines from the target

ConclusionFrom this research effort we believe that partial reconfiguration with the use of computation cores embedded in a sea of LUTs offer the potential for massive savings in gate density and by eliminating the need for unnecessary and redundant sub-circuit pattern configurations. We believe that this direction will lead to the next generation FPGA devices geared towards computationally intensive applications such as bio-chemical algorithms and scientific applications.

Selected Publications1. “Reconfigurable Media Processing”, A. Dasu et.al, Parallel

Computing Vol. 28, August 2002. Pg(s) 1111 - 1139. 2. A.Akoglu, A. Dasu and S. Panchanathan ,”A Framework for the

Design of the Heterogeneous Hierarchical Routing Architecture of a Dynamically Reconfigurable Application Specific Media Processor”,Workshop on Embedded Systems for Media Processing,Dec 17, 2003,Hyderabad, India

3. A. Dasu, A.Akoglu, S.Panchanathan , “An Analysis Tool Set for Reconfigurable Media Processing”,The International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'03),June 2003,Las Vegas

4. 3 Patents pending under US and International Protection in the technologies for “Designing High Performance Reconfigurable Processing”

Application

Source code

Application

Source code

Application

Source code

COMPILER

Lance, SUIF, gcc

ANALYSIS

TOOLS

•Function level ?•BB level ?•Or something in between ?

Figure 1. Methodology

Figure 2a. Hybrid-FPGA

Table-1 configuration bits and clock cycles savings

Figure 5. Rent ‘s Rule based Packing Parameters

7

1

)(i

iCaseGain

Figure 7. Packing Cost Function

Table-2 Amount of nets and tracks savingsFigure 4. LUT Based Architecture Methodology

Figure 6. How to prioritize net reduction

Methodology (Figure 1) involves extraction of tasks or core clusters in Control Data Flow Graphs (CDFGs) of applications followed by designing the architecture to embed them in Hybrid-FPGA environments (Figure 2a,2b). By Hybrid, we mean that the proposed FPGA architectures will involve LUT and ASIC regions. Tasks or core clusters obtained through the common-sub-graph

Numerical simulations in Computational Fluid Dynamics, Molecular Modeling have some common computation features, and performed iteratively. Similarly video and image processing applications involve tasks (mosaic building to compress video into images, image compression such as DCT, DWT etc.) which also have some common computation features and require iterative processing. It has been shown by several researchers that these applications are well suited to be executed on spatially parallel processor architectures. FPGAs in particular offer large amounts of on-chip spatial parallel units, thus capable of performing orders of magnitude faster than regular serial processors. But FPGAs suffer from the drawbacks of being application agnostic and hence incur penalties of loss of clock cycles in redundant reconfigurations, generic routing and poor memory architectures which impact speed, power and silicon area. All these factors have led us into exploring the reconfigurable architecture design space with the application domain being prioritized.

Figure 2b. Hybrid-FPGA

Finding Comon Sub-Graph Dominant Sub-GraphFigure 3

)( mBrmm CBP BR

3

2

2

RCW

11112

11111

1

1

1

1

''

''

''

''

NKNK

LTLT

CN

SL

CN

SL

size

size

size

size

'1LWin

1LWin

'1LWout

1LWout

1'

1'

sizesize

sizesize

ssize

sssize

CC

SS

CC

CLS

Step-1 Step-2 Step-3

Step-5

L

iT

iL

iT

i

i

C

rCR

0

0

Step-4