DesignCon 2010

Designing Scalable Wireless Application-Specific Accelerators Using PICO High Level Synthesis

Yang Sun, Rice University
Kiarash Amiri, Rice University
Joseph R. Cavallaro, Rice University
Tai Ly, Synfora Inc.

Abstract

This paper presents a system-level methodology for designing and exploring scalable and flexible wireless application-specific accelerators. In current practice, there is a large time gap between the development of algorithms for new wireless standards and their hardware implementation. Hardware design using traditional HDL flows takes so long that by the end of the design cycle the algorithms have already moved on to the next wireless standard, making the hardware design outdated. High level synthesis tools create application accelerators for complex processing hardware from high abstraction-level, un-timed C, which greatly shortens the design cycle while still maintaining area and power efficiency. This paper presents two complex wireless designs built with the program-in chip-out (PICO) high level synthesis methodology: 1) a high-speed multiple-input multiple-output (MIMO) signal detector, and 2) a high-throughput low-density parity-check (LDPC) code decoder.

Authors' Biographies

Yang Sun received the B.S. degree in Testing Technology & Instrumentation from Zhejiang University, Hangzhou, China, in 2000, and the M.S. degree in Instrument Science & Technology from Zhejiang University, Hangzhou, China, in 2003. From 2003 to 2004, he was with S3 Graphics Co. Ltd. as an ASIC design engineer. From 2004 to 2005, he was with Conexant Systems Inc. as an ASIC design engineer. During the summers of 2007 and 2008, he worked at the broadband communication lab, DSP Solution R&D Center, Texas Instruments, Dallas, as an intern developing LDPC/Turbo error correction code decoders for 4G wireless systems. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at Rice University, Houston, TX. His research interests include parallel algorithms and VLSI architectures for wireless communication systems.

Kiarash (Kia) Amiri received his M.S. degree in Electrical and Computer Engineering from Rice University, Houston, Texas, in 2007. He is currently a Ph.D. candidate in Electrical and Computer Engineering at Rice University, where he is a member of the Center for Multimedia Communication (CMC) lab, and his research focus is in the area of wireless communication. During the summer and fall of 2007, Kia was part of the Advanced Systems Technology Group at Xilinx, San Jose, CA. Since then, he has been collaborating with researchers and engineers at Xilinx and Nokia to develop state-of-the-art implementations that boost the speed and reliability of wireless services. His algorithms and ideas have been filed as patents and published in various conferences and journals. Kia is also the recipient of a Rice Graduate Student Fellowship.

Joseph R. Cavallaro received the B.S. degree from the University of Pennsylvania, Philadelphia, PA, in 1981, the M.S. degree from Princeton University, Princeton, NJ, in 1982, and the Ph.D. degree from Cornell University, Ithaca, NY, in 1988, all in electrical engineering. From 1981 to 1983, he was with AT&T Bell Laboratories, Holmdel, NJ. In 1988, he joined the faculty of Rice University, Houston, TX, where he is currently a Professor of Electrical and Computer Engineering. His research interests include computer arithmetic, VLSI design and microlithography, and DSP and VLSI architectures for applications in wireless communications. During the 1996–1997 academic year, he served at the USA National Science Foundation as Director of the Prototyping Tools and Methodology Program. He was a Nokia Foundation Fellow and a Visiting Professor at the University of Oulu, Finland, in 2005 and continues his affiliation there as an Adjunct Professor. He is currently the Associate Director of the Center for Multimedia Communication at Rice University. He is a Senior Member of the IEEE. He was Co-chair of the 2004 Signal Processing for Communications Symposium at the IEEE Global Communications Conference and General Co-chair of the 2004 IEEE 15th International Conference on Application-Specific Systems, Architectures and Processors (ASAP).

Tai Ly is a Senior Field Application Engineer at Synfora, Inc. with more than 17 years of EDA experience in both synthesis and verification. He holds a Ph.D. in Electrical Engineering from the University of Alberta, Canada.

Introduction

Continuously changing and evolving wireless specifications pose a major challenge to designers, who must implement highly complex wireless algorithms in hardware as rapidly as possible while still maintaining area and power efficiency. Many applications, such as digital video, need new high data rate wireless communication algorithms. Computationally intensive algorithms are used to remove high levels of multiuser interference, especially in systems with multiple transmit and multiple receive antennas (MIMO). Time-varying wireless channel environments can also dramatically degrade transmission performance, which further requires powerful MIMO detection and channel decoding algorithms for different fading conditions at the mobile handset. In these environments, the amount of parallel computation in a given user application or kernel often far exceeds the functional units available in a general-purpose DSP processor. Further, the area and power constraints of mobile handsets make a software-only solution difficult to realize. Designing a complex application-specific DSP hardware accelerator can be very time-consuming with the traditional hand-coded HDL design flow. In addition, low power consumption is an important design consideration, since "talk" time and standby time on a single charge are differentiating features for mobile devices. The hardware designer's most important task is therefore to turn highly complex algorithms into hardware as quickly as possible without giving up area and power efficiency.

High level synthesis (HLS) promises to be one of the solutions for coping with the significant increase in demand for wireless products. HLS technology allows application-specific accelerators to be created at a higher level of abstraction, enabling fast implementation of complex algorithms. The design methodology in this paper is based on HLS: the Synfora PICO HLS tool was used to produce efficient RTL directly from a sequential un-timed C algorithm. We demonstrate this methodology with two complex wireless design examples: 1) a multiple-input multiple-output (MIMO) signal detector, and 2) a low-density parity-check (LDPC) code decoder.

Wireless Application-Specific Engine Design Challenges

Wireless application engines in next generation wireless systems require high performance at very low power and small area. In third-generation (3G) wireless systems, the complexity of digital signal processing algorithms has begun to exceed the processing capabilities of general purpose digital signal processors (DSPs). With the inclusion of advanced error correction codes (such as low-density parity-check codes and Turbo codes) and multiple-input multiple-output (MIMO) technologies in fourth-generation (4G) wireless systems, the algorithm complexity has far exceeded the processing capabilities of general purpose DSPs. Given the area and power constraints of mobile handsets, designers cannot simply implement computation-intensive wireless algorithms on embedded DSPs. On the other hand, highly parallel processors, such as the IBM Cell processor, can support high speed DSP processing, but at the cost of high power consumption.

Typically, dedicated hardware achieves the best power and area efficiency compared to programmable processors. However, custom design has its cost: hand design of wireless application-specific engines or accelerators is very expensive in terms of design time and non-recurring engineering cost. Wireless systems are also growing in complexity because more and more features are included in the new standards. In our view, automatic creation of wireless application accelerators from a high level algorithmic description is the key to success. It enables the designer to explore multiple algorithms and implementation alternatives rapidly to achieve the best quality of result (QoR).

High Level Synthesis Methodology

High level synthesis allows the design of algorithms on silicon to be completed at a higher level of abstraction, such as C/C++. It not only provides significant time and cost savings in the design and verification of hardware, but can also contribute significantly to power reduction at the system and architecture level. For example, it can automate power management techniques such as multi-level clock gating, which typically have to be applied manually at the RTL level. Thus, high level synthesis helps designers implement complex algorithms quickly and efficiently.

PICO Overview

The high level synthesis tool used in this paper is PICO C-Synthesis [9, 10], which creates application accelerators from un-timed C for complex processing hardware in the video, audio, imaging, wireless, and encryption domains. Figure 1 shows the overall design flow for creating application accelerators using PICO. The user provides a C description of the algorithm along with functional test inputs and design constraints such as the target throughput, clock frequency, and technology library. The PICO system automatically generates synthesizable RTL, customized test benches, and SystemC models at various levels of accuracy, as well as synthesis and simulation scripts. PICO is based on an advanced parallelizing compiler that finds and exploits parallelism at all levels in the C code. For complex designs such as the LDPC decoder, PICO provides multi-level hierarchical design capability, and it offers block-level clock gating to minimize power at the architecture level. The quality of the generated RTL is competitive with manual design, and the RTL is guaranteed to be functionally equivalent to the algorithmic C input description. The generated RTL can then be taken through standard simulation, synthesis, and place and route tools and integrated into the SoC through automatically configured scripts.

Figure 1. System level design flow using PICO

PICO Hardware Architecture

Figure 2 shows the general structure of the hardware generated by PICO from a high level C procedure. This architecture template is called a pipeline of processing arrays (PPA). Using this template, the PICO compiler maps each loop in the top level C procedure to a hardware block, or processing array (PA). PAs communicate with each other via FIFOs, memories, or raw signals. A timing controller schedules the pipeline and preserves the sequential semantics of the original C procedure. The host interface and the task frame memory are used to integrate the PPA hardware into a system using memory mapped IO.

Figure 2. The PPA architecture template

PICO Power-Saving Feature

HLS tools also offer specific power-saving features, designed to solve the problems of power optimization. In any design, there are huge opportunities for power reduction at both the system and the architecture levels. HLS can make a significant contribution to power reduction at the architecture level, specifically by offering the following: 1) Ease of architecture and micro-architecture exploration, 2) Ease of frequency and voltage exploration, and 3) Use of high level power reduction techniques such as multi-level clock gating, which are time-consuming and error-prone when done manually at the RTL level. Power-saving opportunities at the RTL and gate level are limited and have a much smaller impact on the total power consumption.

PICO Verification Flow

Figure 3 shows a complete design and validation environment starting from sequential C code. PICO takes the original C algorithm and C test bench and uses them to drive verification and validation at multiple steps during the process of transforming sequential, un-timed C into parallel, synchronous hardware. PICO supports both standalone RTL simulation and co-simulation of the RTL with the original C test bench. The user's original test bench is used to automatically generate the test vectors and the expected responses, which are compared with the actual responses produced during simulation to verify the correctness of the RTL. An important aspect of the PICO approach is to enable finding issues in software rather than in the RTL. To this end, PICO provides "Dynamic Linting" to catch application errors such as bitwidth overflows, uninitialized variables, and out-of-bounds array accesses. These errors may be benign in the original C simulation but may lead to undefined behavior in the hardware.

Figure 3. PICO synthesis and verification flow

Application-Specific Accelerators for Wireless Communications

Processors in 4G cellular systems typically require high speed, throughput, and flexibility. However, the complex wireless algorithms in 4G systems pose a significant challenge for hardware designers. Even with modern VLIW-style DSPs, the number of functional units available in a given clock cycle is limited and prevents full parallelization of the application for maximum performance. Furthermore, the area and power constraints of mobile handsets make a software-only solution difficult to realize. Thus, for computation-intensive wireless algorithms, it is more appropriate to use dedicated hardware accelerators. However, the traditional RTL based design flow has a long design cycle and a high non-recurring engineering (NRE) cost.

Data throughput is an important metric to consider when designing a wireless receiver: for example, 3GPP LTE has a 326 Mbps downlink peak data rate and IEEE 802.16e WiMax has a 144 Mbps downlink peak data rate. In this paper, we use the PICO high level synthesis tools to implement high throughput signal processing blocks in a 4G receiver. Figure 4 shows a typical MIMO receiver structure. Two of the most complex blocks, the MIMO detector and the channel decoder, are discussed in this paper.

Figure 4. Structure of a MIMO receiver.

Design Examples: MIMO Decoders

Multiple-input multiple-output (MIMO) systems, shown in Figure 5, consist of multiple antennas on the transmitter and receiver sides, and can increase the reliability and data rate of wireless systems [12]. MIMO systems can increase the transmit data rate using the spatial multiplexing (SM) scheme, in which independent symbols are transmitted from different antennas at different time slots, thereby supporting higher data rates than single-antenna systems. The spatial multiplexing MIMO system can be modeled as:

$$y = Hx + n,$$

where $H$ is the $N \times M$ channel matrix, $x$ is the $M$-element column vector whose $i$-th element $x_i$ is the complex symbol transmitted from the $i$-th antenna, and $y$ is the received $N$-element column vector whose element $y_i$ is the perturbed symbol received at the $i$-th receive antenna. The additive white Gaussian noise vector on the receive antennas is denoted by $n$.
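To make the model concrete, the following minimal sketch builds a noiseless 2x2 instance of y = Hx + n and recovers x by exhaustively testing every candidate vector. The channel values, the QPSK constellation, and the function name `ml_detect` are illustrative assumptions, not taken from the designs in this paper.

```python
from itertools import product

def ml_detect(H, y, constellation):
    """Exhaustive search: return (best_x, best_norm, candidates_tested)."""
    M = len(H[0])                       # number of transmit antennas
    best_x, best_norm, tested = None, float("inf"), 0
    for x in product(constellation, repeat=M):
        tested += 1
        norm = 0.0
        for row, y_i in zip(H, y):      # accumulate |y_i - (Hx)_i|^2
            residual = y_i - sum(h * s for h, s in zip(row, x))
            norm += abs(residual) ** 2
        if norm < best_norm:
            best_x, best_norm = x, norm
    return best_x, best_norm, tested

# Hypothetical 2x2 channel and QPSK symbols (assumed for illustration only)
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
H = [[1.0 + 0.2j, 0.3 - 0.1j],
     [0.1 + 0.4j, 0.9 - 0.3j]]
x_true = (1 - 1j, -1 + 1j)
y = [sum(h * s for h, s in zip(row, x_true)) for row in H]  # n = 0

x_hat, norm, tested = ml_detect(H, y, QPSK)
```

With two antennas and four constellation points the search tests 4^2 = 16 candidates; the count grows as the constellation size raised to the number of transmit antennas.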

Figure 5. Multiple-input multiple-output (MIMO) system with M transmit antennas and N receive antennas.

In this section, we discuss some of the main algorithmic and architectural features of detectors for the spatial multiplexing MIMO systems shown in Figure 5. The optimum solution for MIMO systems, i.e. the maximum-likelihood (ML) detector, is obtained by minimizing the norm $\|y - Hx\|^2$ over all possible choices of $x \in \Omega^M$. This brute-force search has complexity exponential in the number of antennas: for $M$ transmit antennas and a modulation order of $w = |\Omega|$, the number of possible $x$ vectors is $w^M$. This makes an exhaustive ML detector impractical for next generation wireless systems [13]. To address this problem, sphere detection can be used as an alternative that achieves ML (or close-to-ML) error performance with much lower complexity than the exhaustive search [14]. To avoid the significant overhead of ML detection, the distance norm can be simplified [15] as follows:

$$\|y' - Rx\|^2 = \sum_{i=1}^{M} \Big| y'_i - \sum_{j=i}^{M} R_{ij} x_j \Big|^2,$$

where $H = QR$ is the QR decomposition of the channel matrix, $R$ is an upper triangular matrix, $QQ^H = I$, and $y' = Q^H y$. The above norm can be computed in $M$ iterations starting with $i = M$. When $i = M$, i.e. in the first iteration, the initial partial norm is set to zero, $T_{M+1}(s^{(M+1)}) = 0$. Using the notation of [16], at each iteration the partial Euclidean distances (PEDs) at the next level are given by:

$$T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + |e_i(s^{(i)})|^2, \qquad e_i(s^{(i)}) = y'_i - \sum_{j=i}^{M} R_{ij} s_j.$$

Equivalently, writing $PN_i$ for the partial norm and starting from $PN_{M+1} = 0$, at each iteration the partial distances

$$PD_i = \Big| y'_i - \sum_{j=i}^{M} R_{ij} x_j \Big|^2,$$

corresponding to the $i$-th level, are calculated and added to the partial norm of the respective parent node in the $(i+1)$-th level, $PN_i = PN_{i+1} + PD_i$. Finishing the iterations gives the final value of the norm. One can view this iterative algorithm as a tree traversal problem where each level of the tree represents one value of $i$, and each node has its own PN and $w$ children (see Figure 6). To reduce the search complexity, a threshold $C$ can be set to discard the nodes with $PN > C$. Because partial distances are non-negative, whenever a node $k$ with $PN_k > C$ is reached, any of its children will have $PN \geq PN_k > C$. Hence, not only the $k$-th node, but also its children and all nodes lying beneath them in the tree, can be pruned.
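The tree search with pruning can be sketched as follows. This is a minimal depth-first illustration, assuming that R and y' are already available from a QR decomposition and that levels are indexed from 0; the function name `sphere_dfs` and all numeric values are hypothetical. A node is discarded as soon as its partial norm reaches the current threshold, which starts at the given radius and shrinks each time a better leaf is found.

```python
def sphere_dfs(R, yp, constellation, radius=float("inf")):
    """Depth-first tree search; returns (best_x, best_norm, nodes_visited)."""
    M = len(R)
    best = {"pn": radius, "x": None, "visited": 0}

    def descend(i, x, pn_parent):
        # Partial norms of all children of the current node at level i
        children = []
        for s in constellation:
            x[i] = s
            e = yp[i] - sum(R[i][j] * x[j] for j in range(i, M))
            children.append((pn_parent + abs(e) ** 2, s))
        children.sort(key=lambda c: c[0])        # smallest partial norm first
        for pn, s in children:
            if pn >= best["pn"]:
                break                            # prune this child and the rest
            best["visited"] += 1
            x[i] = s
            if i == 0:                           # leaf: tighten the threshold
                best["pn"], best["x"] = pn, list(x)
            else:
                descend(i - 1, x, pn)

    descend(M - 1, [0] * M, 0.0)                 # start at the top level
    return best["x"], best["pn"], best["visited"]

# Hypothetical 2x2 example (values assumed for illustration only)
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[1.0, 0.5 + 0.2j],
     [0.0, 0.8]]
x_true = [1 + 1j, -1 - 1j]
noise = [0.05, -0.03j]
yp = [sum(R[i][j] * x_true[j] for j in range(2)) + noise[i] for i in range(2)]

x_hat, pn, visited = sphere_dfs(R, yp, QPSK)
```

The full tree for this example has 4 + 16 = 20 nodes below the root; with pruning, only a small fraction of them is expanded because the first leaf reached already sets a tight threshold.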

Figure 6. Calculating the distances using a tree. Partial norms (PNs) of dark nodes are less than the threshold. White nodes are pruned out.

There are different approaches to searching the tree, mainly classified as the depth-first search (DFS) approach and the K-best approach, where the latter is based on the breadth-first search (BFS) strategy. In DFS, the tree is traversed vertically [16, 17], while in BFS [18] the nodes are visited horizontally, i.e. level by level.

In the DFS approach, starting from the top level, one node is selected, the PNs of its children are calculated, and among those newly computed PNs, one of them, e.g. the one with the smallest PN, is chosen and becomes the parent node for the next iteration. The PNs of its children are calculated, and the same procedure continues until a leaf is reached. At this point, the value of the global threshold is updated with the PN of the recently visited leaf. Then, the search continues with another node at a higher level, and the search controller traverses the tree down to another leaf. If a node is reached with a PN larger than the radius, i.e. the global threshold, then that node, along with all nodes lying beneath it, is pruned, and the search continues with another node.

The tree traversal can also be performed in a breadth-first manner. At each level, only the best K nodes, i.e. the K nodes with the smallest $T_i$, are chosen for expansion to the next level, so the number of children at each level is Kw. Within those children, the best K candidates are again chosen and expanded to the next level. This type of detector is generally known as the K-best detector. Note that, compared to the depth-first approach, the K-best method is better suited for hardware implementation since it offers a fixed complexity. Therefore, it is critical to design K-best detectors that can support the next generation of wireless multi-antenna standards.
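The K-best selection described above can be sketched in a few lines. This is a software illustration only, assuming R and y' are given and using a full sort at each level; the function name `k_best` and the example values are hypothetical, and a hardware implementation would typically replace the sort with a dedicated selection network.

```python
def k_best(R, yp, constellation, K):
    """Breadth-first K-best search; returns (best_x, best_norm, widest_level)."""
    M = len(R)
    survivors = [(0.0, [])]          # (partial norm, symbols for levels above)
    widest = 0
    for i in range(M - 1, -1, -1):   # top level first
        children = []
        for pn, tail in survivors:
            for s in constellation:
                cand = [s] + tail    # cand[k] is the symbol of antenna i + k
                e = yp[i] - sum(R[i][i + k] * cand[k] for k in range(M - i))
                children.append((pn + abs(e) ** 2, cand))
        widest = max(widest, len(children))
        children.sort(key=lambda c: c[0])
        survivors = children[:K]     # keep only the K smallest partial norms
    return survivors[0][1], survivors[0][0], widest

# Hypothetical 2x2 example (values assumed for illustration only)
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[1.0, 0.5 + 0.2j],
     [0.0, 0.8]]
x_true = [1 + 1j, -1 - 1j]
noise = [0.05, -0.03j]
yp = [sum(R[i][j] * x_true[j] for j in range(2)) + noise[i] for i in range(2)]

x_hat, pn, widest = k_best(R, yp, QPSK, K=2)
```

The number of candidates evaluated per level never exceeds K times the constellation size, which is the fixed-complexity property that makes the scheme attractive for pipelined hardware.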

Figure 7. The architecture for a K-best detector with M levels. Each level computes the best K candidates and forwards them to the next level.

In this paper, we present the implementation results of a K-best detector using Synfora's PICO Extreme software package. The C code was written for a 4x4 16-QAM system and a K-best scheme with K = 5. The synthesis results in Table 1 show that a clock frequency of 100 MHz can be achieved, and the total area in a TSMC 65-nm technology is 0.28 mm².

Table 1. Design statistics for the K-best detector

| Item | Value |
|---|---|
| System parameter | 4x4, 16-QAM |
| Core area | 0.28 mm² |
| Maximum clock frequency | 100 MHz |
| Technology | 65 nm |
| Number of bits (precision) | 6 bit |
| K value | 5 |
| Data rate | 400 Mbps |

The detector is followed by the channel decoder in the wireless receiver chain. In the next section, we discuss the channel decoder design.

Design Examples: Low-Density Parity-Check Decoders

Low-density parity-check (LDPC) codes [11] have received tremendous attention in the coding community because of their excellent error correction capability and near-capacity performance. Some randomly constructed LDPC codes, measured in bit error rate (BER), come very close to the Shannon limit for the AWGN channel with iterative decoding and very long block sizes. The approaching fourth-generation (4G) wireless systems are projected to provide 100 Mbps to 1 Gbps speeds, which leads to orders of magnitude increases in the complexity of the wireless receiver System-on-Chip. As a core technology in wireless communications, forward error correction (FEC) coding has migrated from basic convolutional and block codes to more powerful Turbo codes and LDPC codes. LDPC codes are proposed for 4G wireless systems because of their excellent error correction performance and highly parallel decoding scheme. Many wireless standards have already adopted LDPC codes, such as DVB-S2, IEEE 802.11n, and IEEE 802.16e WiMAX. Given the data rate and power consumption constraints of wireless handsets, it is very challenging to design a high performance LDPC decoder at low area cost and with reduced development time.

LDPC decoder architectures range from fully parallel to partially parallel and fully sequential. Most of the research on LDPC decoder design so far has focused on one particular system, for which specific optimizations are made to improve the decoder performance. For example, Blanksby and Howland presented a 1 Gbps fully parallel LDPC decoder by wiring the entire parity check matrix into hardware [1]. However, this architecture requires a huge amount of hardware resources and can only support one particular LDPC code, which makes it less attractive in a wireless SoC. Brack et al. discussed LDPC decoders for the IEEE 802.16e WiMAX system [2], and Rovini et al. presented LDPC decoders for the IEEE 802.11n WLAN system [3].

As different wireless standards employ different types of LDPC codes, it is very important to design a flexible and scalable LDPC decoder that can be tailored to different wireless applications. In this paper, we explore the design space of efficient LDPC decoder implementations using the PICO high level synthesis methodology. Under the guidance of the designer, PICO can effectively exploit the parallelism of a given algorithm and then create an efficient hardware architecture for it. In this section, we show a parallel LDPC decoder implementation using PICO.

Introduction of Low-Density Parity-Check Codes

A binary LDPC code is a linear block code specified by a very sparse binary M by N parity check matrix:

$$H \cdot x^T = 0,$$

where x is a codeword and H can be viewed as a bipartite graph where each column and row in H represents a variable node and a check node, respectively. Each element of the parity check matrix is either a zero or a one, where nonzero entries are typically placed at random to achieve good performance. During the encoding process, N-K redundant bits are added to the K information bits to create a codeword length of N bits. The code rate is the ratio of the information bits to the total bits in a codeword. LDPC codes are often represented by a bi-partite graph called a Tanner graph. There are two types of nodes in a Tanner graph, variable nodes and check nodes. A variable node corresponds to a coded bit or a column of the parity check matrix, and a check node corresponds to a parity check equation or a row of the parity check matrix. There is an edge between each pair of nodes if there is a one in the corresponding parity check matrix entry. The number of nonzero elements in each row or column of a parity check matrix is called the degree of that node. An LDPC code is regular or irregular based on the node degrees. If variable or check nodes have different degrees, then the LDPC code is called irregular, otherwise, it is called regular. Generally, irregular codes have better performance than regular codes. On the other hand, irregularity of the code will result in more complex hardware architecture. As an example, Figure 8 shows a Tanner graph of a simple LDPC code, where the variable and check nodes have degrees of two and four, respectively.
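The definitions above can be checked on a small example. The sketch below uses a hypothetical 3 by 6 parity check matrix (not the code of Figure 8) to compute the syndrome of a word, the node degrees, and the Tanner graph edges; all function names are illustrative.

```python
# Hypothetical toy parity check matrix with M = 3 checks and N = 6 bits
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def syndrome(H, x):
    """Evaluate each parity check over GF(2); all zeros means x is a codeword."""
    return [sum(h & b for h, b in zip(row, x)) % 2 for row in H]

def check_degrees(H):
    """Number of ones in each row (degree of each check node)."""
    return [sum(row) for row in H]

def variable_degrees(H):
    """Number of ones in each column (degree of each variable node)."""
    return [sum(col) for col in zip(*H)]

def tanner_edges(H):
    """One (check, variable) edge per nonzero entry of H."""
    return [(m, n) for m, row in enumerate(H) for n, h in enumerate(row) if h]

codeword = [1, 0, 0, 1, 0, 1]          # satisfies all three checks
corrupted = [1, 1, 0, 1, 0, 1]         # bit 1 flipped

s_ok = syndrome(H, codeword)
s_bad = syndrome(H, corrupted)
is_regular = len(set(variable_degrees(H))) == 1 and len(set(check_degrees(H))) == 1
code_rate = (len(H[0]) - len(H)) / len(H[0])   # K / N, assuming H has full rank
```

In this toy matrix the check nodes all have degree three while the variable nodes have degrees two and one, so the code is irregular in the sense defined above.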

Figure 8. Tanner graph of an LDPC code.

Quasi-Cyclic LDPC Codes

Typically, non-zero elements in H are placed at random positions to achieve good performance. However, this randomness is unfavorable for efficient VLSI implementation, which calls for a structured design. To address this issue, block-structured quasi-cyclic LDPC (QC-LDPC) codes have recently been proposed for several new wireless standards such as IEEE 802.11n, IEEE 802.16e, DVB-S2, and DMB-T. As shown in Figure 9(a) and (b), the parity check matrix of a QC-LDPC code is constructed from a B by D seed matrix by replacing each "1" in the seed matrix with a z by z cyclically shifted identity sub-matrix, where z is an expansion factor. A corresponding Tanner graph representation of a QC-LDPC code is shown in Figure 9(c). It groups the variable nodes and the check nodes into clusters of size z, such that an edge between a variable node cluster and a check node cluster means that z variable nodes connect to z check nodes via a permutation network.

Figure 9. Parity check matrix for a QC-LDPC code.

Generally, a block-structured parity check matrix H consists of a B by D array of z by z cyclically shifted identity matrices with random shift values. Table 2 summarizes the design parameters for H in two wireless standards. Figure 10 shows the parity check matrix for a length 2304, rate 1/2 WiMax LDPC code. Each sub-matrix in Figure 10 is either a zero matrix, which is represented by an empty square box in the figure, or a cyclically shifted identity matrix whose shift value is shown in the figure.
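The expansion of a seed matrix into a full parity check matrix can be sketched as follows. The shift convention (row r of a sub-matrix has its one in column (r + shift) mod z) and the tiny seed matrix are assumptions made for illustration; the standards define their own shift tables.

```python
def expand_qc(seed_shifts, z):
    """Expand a B x D table of shift values into a (B*z) x (D*z) binary matrix.

    An entry of None stands for a z x z zero sub-matrix; an integer entry
    stands for an identity matrix cyclically shifted by that amount.
    """
    rows = []
    for block_row in seed_shifts:
        for r in range(z):
            row = []
            for shift in block_row:
                if shift is None:
                    row.extend([0] * z)
                else:
                    row.extend(1 if c == (r + shift) % z else 0 for c in range(z))
            rows.append(row)
    return rows

# Hypothetical 2 x 3 seed table with expansion factor z = 3
seed = [[0, 1, None],
        [2, None, 0]]
H = expand_qc(seed, 3)

# Code length is D * z; for D = 24 and z = 96 this gives the 2304-bit code
longest_code = 24 * 96
```

Within one block row (one layer), every column of the expanded matrix contains at most a single one, which is the property that layered decoding relies on.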

Table 2. LDPC code definitions in two wireless standards

| Parameter | 802.11n WLAN | 802.16e WiMax |
|---|---|---|
| Block-column (D) | 24 | 24 |
| Block-row (B) | 4 - 12 | 4 - 12 |
| Expansion factor (z) | 27 - 81 | 24 - 96 |
| Code length (D × z) | 648 - 1944 | 576 - 2304 |
| Code rate | 1/2, 2/3, 3/4, 5/6 | 1/2, 2/3, 3/4, 5/6 |

Figure 10. Parity check matrix for WiMax LDPC code (Rate = 1/2, Length = 2304).

LDPC Soft Decoding Algorithm

The decoder architecture proposed in this paper uses the iterative layered belief propagation (LBP) algorithm proposed in [4]. As depicted in Figure 9(b), a block-structured parity-check matrix is a B by D array of z by z sub-matrices, where each sub-matrix is either a zero matrix or a shifted identity matrix with a random shift value. Within every layer, each column has at most one 1, so there are no data dependencies between the variable node messages within a layer, and the messages flow in tandem only between adjacent layers. The block size z varies according to the code definition in the standard. To simplify the hardware implementation, the scaled min-sum algorithm [5] is adopted. This algorithm is summarized as follows. Let $Q_{mn}$ denote the variable node log likelihood ratio (LLR) message sent from variable node n to check node m, $R_{mn}$ denote the check node LLR message sent from check node m to variable node n, and $APP_n$ denote the a posteriori probability (APP) ratio for variable node n; then:

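$$
\begin{aligned}
Q_{mn} &= APP_n - R_{mn} \\
R'_{mn} &= s \times \prod_{j:\, j \neq n} \operatorname{sign}(Q_{mj}) \times \min_{j:\, j \neq n} |Q_{mj}| \\
APP'_n &= Q_{mn} + R'_{mn}
\end{aligned}
\qquad (1)
$$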

where s is a scaling factor. The APP messages are initialized with the channel reliability values of the coded bits. Hard decisions can be made after every horizontal layer based on the sign of APPn. If all parity-check equations are satisfied or the pre-determined maximum number of iterations is reached, then the decoding algorithm stops. Otherwise, the algorithm repeats for the next horizontal layer.
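The update for a single check row can be written out directly from the three steps above (subtract the old R, take the scaled sign and minimum over the other columns, add the new R). The sketch below is a straightforward illustration rather than the hardware-oriented form; the scaling factor s = 0.75 and the function name `update_row` are assumptions, and a zero-valued Q is treated as positive.

```python
def update_row(app, r_row, cols, s=0.75):
    """Apply one scaled min-sum update for check row m.

    app   : list of APP values, updated in place
    r_row : dict mapping column n to the stored R message for this row
    cols  : columns n that participate in this check
    """
    q = {n: app[n] - r_row[n] for n in cols}            # Q = APP - R
    new_r = {}
    for n in cols:
        others = [q[j] for j in cols if j != n]         # exclude column n
        sign = 1
        for v in others:
            if v < 0:
                sign = -sign
        new_r[n] = s * sign * min(abs(v) for v in others)
    for n in cols:
        app[n] = q[n] + new_r[n]                        # APP' = Q + R'
    return new_r

# Hypothetical LLR values for a degree-3 check
app = [2.0, -1.0, 4.0]
r_old = {0: 0.0, 1: 0.0, 2: 0.0}
r_new = update_row(app, r_old, [0, 1, 2])
```

In this example the weak negative value at position 1 is pulled positive by the other two columns, while the check message sent to positions 0 and 2 is negative because the only other negative input is position 1.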

LDPC Decoder Design Using PICO

In the layered decoding algorithm, a full iteration is divided into L sub-iterations, where each sub-iteration decodes the data of one layer. In the traditional layered decoder architecture, each z by z sub-matrix is treated as a block within which all z parity checks are processed simultaneously using z check node processors. The processor cores are all independent, as there is no data dependence between adjacent check rows. The parallelism is at the sub-circulant level because it is easier to treat each sub-matrix as a whole processing block. The architecture can be easily modeled in C code and then explored by the PICO compiler to generate the PPA hardware. As illustrated in Figure 11, z = 96 processor cores are generated to provide the maximum parallelism. The "pragma unroll" directive in the C procedure asks the PICO compiler to unroll this loop. Note that the loop can also be partially unrolled to save area. By changing the constraint, the designer can quickly analyze the tradeoff between area and time.
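The effect of full versus partial unrolling can be illustrated with a simple software model. The sketch below is not PICO code: it assumes that a loop over z independent check rows is split into sequential passes, each handling up to `cores` rows that would run in parallel in hardware. The function name `run_layer` and the dummy row function are hypothetical.

```python
def run_layer(check_rows, cores, row_fn):
    """Process z independent rows with a given number of parallel cores.

    Returns (results, passes), where passes is the number of sequential
    steps needed, i.e. ceil(z / cores).
    """
    z = len(check_rows)
    results = [None] * z
    passes = 0
    for base in range(0, z, cores):                  # sequential passes
        for lane in range(min(cores, z - base)):     # parallel lanes in hardware
            results[base + lane] = row_fn(check_rows[base + lane])
        passes += 1
    return results, passes

rows = list(range(96))                               # z = 96 dummy rows
full, passes_full = run_layer(rows, 96, lambda r: r * r)
quarter, passes_quarter = run_layer(rows, 24, lambda r: r * r)
```

Because the rows are independent, both configurations produce identical results; the fully unrolled version needs one pass with 96 cores, while the partially unrolled version needs four passes with 24 cores, trading time for area.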

Figure 11. Scalable check node processor cores generated by PICO

The pseudo-code for the layered LDPC decoding algorithm is summarized as follows:

    for iteration = 0 to I-1
        for layer = 0 to L-1
            for m = z*layer to z*(layer+1)-1
                for n = 0 to N-1
                    Read APPn and Rmn from memory
                    Calculate equation (1)
                    Store APP'n and R'mn to memory
                end
            end
        end
    end

To implement this algorithm in hardware, we use a block-serial decoding method [6]: the data in each layer is processed block-column by block-column. The decoder first reads the APP and R messages from memory, calculates Q, and then finds the minimum and the second minimum magnitudes for each row m over all columns n. Then, the decoder computes the new R and APP values based on the two minimum values, and writes the new R and APP values back to memory. Figure 12 depicts the PPA architecture generated by PICO. The parallelism of this architecture is at the level of the sub-matrix size z. For example, z is 96 for a WiMax LDPC code. To resolve the pipeline hazards, as suggested in [7], a scoreboard is used to keep track of memory read and write conflicts and to stall the decoder when needed. Figure 13 shows an example of a pipelining hazard in which layer i+1 tries to read data that has not yet been updated by layer i.
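The two-minimum approach can be sketched in software as follows. The sketch treats each check row as its own layer (the z = 1 case), uses an assumed scaling factor of 0.75, and stops as soon as all parity checks are satisfied; the function names and the toy code are hypothetical, and rows are assumed to have degree of at least two.

```python
def two_min_row_update(q, s=0.75):
    """New R messages for one check row from its Q messages.

    One pass finds the smallest and second smallest magnitudes and the
    product of all signs; each output then uses the second minimum if it
    belongs to the minimum position, and the first minimum otherwise.
    """
    min1 = min2 = float("inf")
    idx1 = -1
    sign_all = 1
    for k, v in enumerate(q):
        if v < 0:
            sign_all = -sign_all
        mag = abs(v)
        if mag < min1:
            min1, min2, idx1 = mag, min1, k
        elif mag < min2:
            min2 = mag
    out = []
    for k, v in enumerate(q):
        mag = min2 if k == idx1 else min1
        sign = sign_all * (-1 if v < 0 else 1)   # removes this column's own sign
        out.append(s * sign * mag)
    return out

def layered_decode(rows, llr, max_iters=10, s=0.75):
    """Layered min-sum decoding; returns (hard_bits, layers_processed)."""
    app = list(llr)                              # APP initialized with channel LLRs
    r = [[0.0] * len(cols) for cols in rows]
    layers = 0
    for _ in range(max_iters):
        for m, cols in enumerate(rows):
            q = [app[n] - r[m][k] for k, n in enumerate(cols)]
            r[m] = two_min_row_update(q, s)
            for k, n in enumerate(cols):
                app[n] = q[k] + r[m][k]
            layers += 1
            bits = [1 if a < 0 else 0 for a in app]
            if all(sum(bits[n] for n in c) % 2 == 0 for c in rows):
                return bits, layers              # all parity checks satisfied
    return bits, layers

# Hypothetical toy code: each list holds the columns of one check row
rows = [[0, 1, 3], [1, 2, 4], [0, 2, 5]]
llr = [2.0, 1.5, -0.5, 2.5, 1.0, 3.0]            # all-zero codeword, bit 2 unreliable
bits, layers = layered_decode(rows, llr)
```

For these inputs the unreliable bit is corrected after the second layer of the first iteration, because APP values updated by one layer are immediately visible to the next.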

Figure 12. Pipelined LDPC decoder.

Figure 13. Pipelining hazard

To compare the latency and area performance of the LDPC decoder, we have synthesized the PICO generated RTL using Synopsys Design Compiler on a TSMC 65nm technology. Figure 14 shows the latency and standard cell area performance of the decoder for different clock frequencies.

[Figure 14 plots: cycles per iteration (axis range 0 to 120) versus clock frequency (100 to 400 MHz), and area in square mm (axis range 0.34 to 0.44) versus clock frequency (100 to 400 MHz).]

Figure 14. Latency and area performance

VLSI Implementation Result for the LDPC Decoder

A configurable LDPC decoder that supports the IEEE 802.16e WiMax standard was described in an un-timed C procedure, and the PICO software was then used to create synthesizable RTL. The generated RTL was synthesized using Synopsys Design Compiler, and placed and routed using Cadence SoC Encounter in a TSMC 65 nm, 0.9 V, 8-metal layer CMOS technology. Table 3 summarizes the main features of this decoder.

Table 3. Design statistics for the LDPC decoder

Core area                                 1.2 mm²
Maximum clock frequency                   400 MHz
Maximum power consumption                 180 mW
Technology                                65 nm
LLR input quantization                    6 bit
Maximum codeword length                   2304 bit
Memory usage                              82,944 bit
Maximum throughput (for code rate 1/2)    415 Mbps
Maximum latency (for code rate 1/2)       2.8 µs
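As a rough consistency check on the Table 3 figures, the throughput and latency can be related by a short calculation. The sketch below assumes that throughput counts information bits (codeword length times code rate) and that one codeword is decoded at a time; neither assumption is stated in the table, and the variable names are illustrative.

```python
codeword_bits = 2304   # maximum codeword length (Table 3)
code_rate = 0.5        # rate 1/2
latency_s = 2.8e-6     # maximum latency (Table 3)
clock_hz = 400e6       # maximum clock frequency (Table 3)

# Information bits delivered per decoded codeword (assumed throughput basis).
info_bits = codeword_bits * code_rate

# Throughput implied by the rounded latency figure.
throughput_mbps = info_bits / latency_s / 1e6

# Latency implied by the reported 415 Mbps throughput, in microseconds.
implied_latency_us = info_bits / 415e6 * 1e6

# Clock cycles available per codeword at the maximum clock frequency.
cycles_per_codeword = latency_s * clock_hz

print(f"{throughput_mbps:.1f} Mbps, {implied_latency_us:.2f} us, "
      f"{cycles_per_codeword:.0f} cycles")
```

Under these assumptions the rounded latency implies about 411 Mbps, and the reported 415 Mbps implies about 2.78 µs, so the two table entries agree to within the rounding of the latency.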

Compared to the manual RTL designs [6, 7, 8], which usually took 6-8 months to finish, the C-based design using PICO technology took only 2-4 weeks to complete while achieving high performance in terms of area, power, and throughput. The area overhead is about 15% compared to the manual LDPC decoders [6, 7, 8] that we previously implemented at Rice University. Table 4 compares the power consumption of the LDPC decoder with and without clock gating (external SRAMs are not included in the analysis). A 29% reduction in “sequential internal power” and a 20% reduction in total power consumption were achieved by using the multi-level clock-gating feature of the HLS tool. These savings are in addition to those provided by register-level clock gating.

Table 4. Power estimates with and without multi-level clock gating

Power                   Leakage    Internal   Switching   Total
With clock gating       3.43 mW    46.1 mW    22.5 mW     72.0 mW
Without clock gating    3.43 mW    64.5 mW    22.5 mW     90.4 mW
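The percentage reductions quoted in the text can be reproduced from the Table 4 entries. The short calculation below uses only the table values; the variable names are illustrative.

```python
# Power components in mW, taken from Table 4.
with_cg = {"leakage": 3.43, "internal": 46.1, "switching": 22.5}
without_cg = {"leakage": 3.43, "internal": 64.5, "switching": 22.5}

total_with = sum(with_cg.values())
total_without = sum(without_cg.values())

# Relative savings from multi-level clock gating.
internal_saving = (without_cg["internal"] - with_cg["internal"]) / without_cg["internal"]
total_saving = (total_without - total_with) / total_without

print(f"internal: {internal_saving:.1%}, total: {total_saving:.1%}")
```

The component sums match the Total column to within rounding, and the savings come out at roughly 29% for internal power and 20% for total power. Only the internal component changes between the two rows.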

Conclusion and Future Work

As the demand for high-performance wireless systems rapidly increases, chip designers face the challenge of implementing complex algorithms quickly and efficiently without compromising on power consumption. High level synthesis, which can automatically create efficient hardware from an un-timed C algorithm, can provide a solution. In this paper, we have built two wireless application-specific engines using PICO to demonstrate the high level synthesis methodology. Future work includes implementing channel equalization and channel estimation algorithms using high level synthesis tools.

Acknowledgement

The first, second, and third authors would like to thank Nokia, NSN, Xilinx, and the National Science Foundation (under grants CCF-0541363, CNS-0551692, CNS-0619767, EECS-0925942 and CNS-0923479) for their support of the research.

References

[1] A. J. Blanksby and C. J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder,” IEEE Journal of Solid-State Circuits, vol. 37, no. 3, pp. 404–412, 2002.
[2] T. Brack, M. Alles, F. Kienle, and N. Wehn, “A Synthesizable IP Core for WIMAX 802.16e LDPC Code Decoding,” in IEEE 17th Int. Symp. Personal, Indoor and Mobile Radio Communications (PIMRC), 2006, pp. 1–5.
[3] M. Rovini, G. Gentile, F. Rossi, and L. Fanucci, “A Scalable Decoder Architecture for IEEE 802.11n LDPC Codes,” in GLOBECOM, 2007, pp. 3270–3274.
[4] D. E. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC codes,” in IEEE Workshop on Signal Processing Systems (SIPS), 2004, pp. 107–112.
[5] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X. Hu, “Reduced-Complexity Decoding of LDPC Codes,” IEEE Transactions on Communications, vol. 53, pp. 1232-1232, 2005.
[6] Y. Sun and J. R. Cavallaro, “A low-power 1-Gbps reconfigurable LDPC decoder design for multiple 4G wireless standards,” in IEEE International SOC Conference (SoCC), Sept. 2008, pp. 367–370.
[7] Y. Sun, M. Karkooti, and J. R. Cavallaro, “High throughput, parallel, scalable LDPC encoder/decoder architecture for OFDM systems,” in 2006 IEEE Dallas/CAS Workshop on Design, Applications, Integration and Software, Richardson, TX, Oct. 2006, pp. 39–42.
[8] Y. Sun, M. Karkooti, and J. R. Cavallaro, “VLSI Decoder Architecture for High Throughput, Variable Block-size and Multi-rate LDPC Codes,” in IEEE International Symposium on Circuits and Systems (ISCAS), May 2007.
[9] Synfora PICO Product, http://www.synfora.com.
[10] S. Aditya and V. Kathail, Algorithmic Synthesis Using PICO. Springer Netherlands, 2008, pp. 53–74.
[11] R. Gallager, “Low-density parity-check codes,” IEEE Transactions on Information Theory, vol. 8, pp. 21–28, Jan. 1962.
[12] I. E. Telatar, “Capacity of multiantenna Gaussian channels,” European Transactions on Telecommunications, vol. 10, Nov. 1999.
[13] D. Garrett, L. Davis, S. ten Brink, B. Hochwald, and G. Knagge, “Silicon complexity for maximum likelihood MIMO detection using spherical decoding,” IEEE Journal of Solid-State Circuits, vol. 39, no. 9, 2004.
[14] U. Fincke and M. Pohst, “Improved methods for calculating vectors of short length in a lattice, including a complexity analysis,” Mathematics of Computation, vol. 44, no. 170, 1985.
[15] M. O. Damen, H. E. Gamal, and G. Caire, “On maximum likelihood detection and the search for the closest lattice point,” IEEE Transactions on Information Theory, vol. 49, no. 10, pp. 2389–2402, Oct. 2003.
[16] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bolcskei, “VLSI implementation of MIMO detection using the sphere decoding algorithm,” IEEE Journal of Solid-State Circuits, vol. 40, no. 7, pp. 1566–1577, Jul. 2005.
[17] K. Amiri and J. R. Cavallaro, “FPGA implementation of dynamic threshold sphere detection for MIMO systems,” in 40th IEEE Asilomar Conf. on Signals, Systems and Computers, Nov. 2006, pp. 94–98.
[18] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-Best sphere decoding for MIMO detection,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 491–503, Mar. 2006.
