A Hierarchical C2RTL Framework for FIFO-Connected Stream Applications Shuangchen Li, Yongpan Liu,...

download A Hierarchical C2RTL Framework for FIFO-Connected Stream Applications Shuangchen Li, Yongpan Liu, Daming Zhang, Xinyu He, Pei Zhang, Huazhong Yang ypliu@tsinghua.edu.cn.

If you can't read please download the document

Transcript of A Hierarchical C2RTL Framework for FIFO-Connected Stream Applications Shuangchen Li, Yongpan Liu,...

  • Slide 1

A Hierarchical C2RTL Framework for FIFO-Connected Stream Applications Shuangchen Li, Yongpan Liu, Daming Zhang, Xinyu He, Pei Zhang, Huazhong Yang [email protected] 2012-Jan-30 Slide 2 Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 2 Slide 3 Background A rapid rising demand for the high quality C language to RTL (C2RTL) tools [1] Design gap between design productivity and transistor resources Huge silicon capacity Extensive use of embedded processors Reuse of behavior IPs Extensive adoption of accelerators More time-to-market pressure 3 Slide 4 Background Designers have successfully developed various applications using C2RTL tools with much shorter design time.[2]-[4] FFocusing on Stream applications GAUT[5], ROCCC[6], Impulse C[7], ASC[8] FFocusing on general purpose applications Catapult C[9] Little control on the architecture leads to suboptimal results.[10] Hierarchical C2RTL framework Leaving the user to determine the FIFO capacity between modules. Algorithm for FIFO-connected blocks 4 Slide 5 To design a stream application, researchers also had investigated on FIFO sizing.[11]-[14] All of them adopts a more complicated behavior model for PE streams, which is not necessary in the hierarchical C2RTL framework. Our Hierarchical C2RTL Framework for FIFO-Connected Stream Applications: The hierarchical implementation provides even10.43 times speedup compared to the flatten design. We develop an algorithm to find the optimal FIFO capacity in a multiple-module design. We demonstrate the proposed method in seven real applications. 5 Slide 6 Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 6 Slide 7 Motivation Hierarchical vs Flatten Approach The synthesized quality for larger algorithms is generally not so good as small ones. The translating time is unacceptable. 10.43 times performance speedup can be made by JPEG encoder case. 7 Performance with Different FIFO Capacity FIFO capacity affect both speed and area. Designers need to decide the FIFO size in an iterative way which is time consuming. Slide 8 Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 8 Slide 9 Hierarchical C2RTL framework Step 1: We partition C codes into appropriate-size functions. Step 2: We use C2RTL tools to translate each function into a hardware process element (PE), which has a FIFO interface. Step 3: We connect those PEs with proper sized FIFOs. 9 Slide 10 Hierarchical C2RTL framework Step 1: The C code partition has a great impact on the hardware performance. A example of GSM case is shown in the figure. Currently, we use a manual partition strategy. An integer linear programming based partition approach is presented in [16]. 10 Slide 11 Hierarchical C2RTL framework Step 2: We use the C2RTL tool of eXCite from Y Exploration, Inc[15] Step 3: We define the FIFO interface and Modeling the FIFO Define the PE interface Modeling PEs. 11 Slide 12 Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 12 Slide 13 Given a design consisting of N PEs, we need to determine the depth D (i-1)i of each FIFO, which maximizes the throughput TH all and minimizes the area A all. That is: Without losing generality, we set TH ref =TH best and A ref =. Algorithm for FIFO-connected blocks 13 FIFO Interconnection Formulation: Slide 14 Algorithm for FIFO-connected blocks Optimal FIFO Capacity for Two PEs: For PE 1 and PE 2, we assume that FIFO 01 never empty and FIFO 23 never full, and then we calculate FIFO 12 s capacity. 14 Slide 15 Algorithm for FIFO-connected blocks Step 1: Find out the bottleneck Talk about later Step 2: FIFO sizing for two block Optimal FIFO Capacity for Two PEs theory Step 3: Merging two PEs into a new one Talk about later 15 FIFO Capacity for Multiple Blocks: Slide 16 Step 1(Find out bottleneck): We set TH n as the throughput of the system consisting of PE 1 to PE n. We have a recursive equation As TH all =TH N, we can express TH all : Step 3(Merging two PEs): We give out the equations to compute the interface parameters for the merged PE Algorithm for FIFO-connected blocks 16 FIFO Capacity for Multiple Blocks: Slide 17 Outline Background Motivation Hierarchical C2RTL framework Algorithm for FIFO-connected blocks Experiments Future works Reference 17 Slide 18 Experiments Experimental Configurations eXCite + ModelSim + Quartus Seven benchmarks from real applications. They are: JPEG encode/decode AES encryption/decryption GSM ADPCM Filter Group The benefit of optimize FIFO capacity algorithm: Area saving Timing efficiency The average time cost by the entire exploration time is N *log2(p) *C, using a binary searching algorithm 18 Slide 19 Experiments Flatten vs Hierarchical Approach: 19 Slide 20 Experiments Optimal FIFO Capacity: 20 Compared our results with exhaustive simulations under random inputs. Lists both the analytical results and the experimental ones on FIFO sizing in seven real cases. Slide 21 Future works The automatically C code partition Using our algorithm in more complex architectures with feedback and branches 21 Slide 22 Reference [1] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, High-level synthesis for fpgas: From prototyping to deployment, Computer-Aided Design of Integrated Circuits and Systems, IEEE Trans-actions on, vol. 30, no. 4, pp. 473491, 2011. [2] B. Schafer, A. Trambadia, and K. Wakabayashi, Design of complex image processing systems in esl, in Asia and South Pacific Design Automation Conference. IEEE Press, 2010, pp. 809814. [3] Y. Guo and D. McCain, Rapid prototyping and vlsi exploration for 3g/4g mimo wireless systems using integrated catapult-c methodology, in Wireless Communications and Networking Conference. IEEE, 2006, pp. 958963. [4] M. Rossler, H. Wang, U. Heinkel, N. Engin, and W. Drescher, Rapid prototyping of a dvb-sh turbo decoder using high-level-synthesis, in Specification & Design Languages. IEEE Forum on, 2009, pp. 16. [5] G. Lhairech-Lebreton, P. Coussy, and E. Martin, Hierarchical and multiple-clock domain high-level synthesis for low-power design on fpga, in International Conference on Field Programmable Logic and Applications. IEEE, 2010, pp. 464468. [6] J. Villarreal, A. Park, W. Najjar, and R. Halstead, Designing mod-ular hardware accelerators in c with roccc 2.0, in IEEE Annual International Symposium on Field- Programmable Custom Computing Machines. IEEE, 2010, pp. 127134. 22 Slide 23 Reference [7] M. Gokhale, J. Stone, J. Arnold, and M. Kalinowski, Stream-oriented fpga computing in the streams-c high level language, in Field-Programmable Custom Computing Machines. IEEE Symposium, 2000, pp. 4956. [8] O. Mencer, Asc: a stream compiler for computing with fpgas, Computer-Aided Design of Integrated Circuits and Systems, IEEE Trans-actions on, vol. 25, no. 9, pp. 16031617, 2006. [9] http://www.mentor.com. [10] A. Agarwal, Comparison of high level design methodologies for algorithmic ips: Bluespec and c-based synthesis, Ph.D. dissertation, Massachusetts Institute of Technology, 2009. [11] K. Lahiri, A. Raghunathan, and S. Dey, System-level performance anal-ysis for designing on-chip communication architectures, in Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 20, no. 6, 2001, pp. 768 783. [12] R. Cruz, Quality of service guarantees in virtual circuit switched networks, in Selected Areas in Communications, IEEE Journal on, vol. 13, no. 6, 1995, pp. 1048 1056. [13] A. Maxiaguine, S. K unzli, S. Chakraborty, and L. Thiele, Rate analysis for streaming applications with on-chip buffer constraints, in Asia and South Pacific Design Automation Conference. IEEE Press, 2004, pp. 131136. 23 Slide 24 Reference [14] Y. Liu, S. Chakraborty, and R. Marculescu, Generalized rate analysis for media- processing platforms, in International Conference on Em-bedded and Real-Time Computing Systems and Applications, RTCSA. IEEE, 2006, pp. 305314. [15] http://www.yxi.com. [16] Y. Hara, H. Tomiyama, S. Honda, and H. Takada, Partitioning of behavioral descriptions with exploiting function-level parallelism, in Fundamentals,IEICE Trans on, vol. E93-A, 2010, pp. 488499. [17] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, Chstone: A benchmark program suite for practical c-based high-level synthesis, in Circuits and Systems. IEEE International Symposium on, 2008, pp. 11921195. 24 Slide 25 Thank you !