1602 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 11, NOVEMBER 2009

Backward Interpolation Architecture for Algebraic Soft-Decision Reed–Solomon Decoding

Jiangli Zhu, Student Member, IEEE, Xinmiao Zhang, Member, IEEE, and Zhongfeng Wang, Senior Member, IEEE

Abstract—Recently developed algebraic soft-decision (ASD) decoding algorithms for Reed–Solomon (RS) codes have attracted much interest due to the fact that they can achieve significant coding gain with polynomial complexity. One major step of ASD decoding is the interpolation. Available interpolation algorithms can only add interpolation points or increase interpolation multiplicities. However, backward interpolation, which eliminates interpolation points or reduces interpolation multiplicities, is indispensable to enable the reuse of interpolation results in the following two scenarios: 1) interpolation needs to be carried out on multiple test vectors, which share common entries and 2) iterative ASD decoding where interpolation points have decreasing multiplicities. Examples for these cases are the low-complexity Chase (LCC) decoding and bit-level generalized minimum distance (BGMD) decoding. With lower complexity, these algorithms can achieve similar or higher coding gain than other practical ASD algorithms. In this paper, we propose novel backward interpolation schemes and corresponding efficient implementation architectures for LCC and BGMD decoding through constructing equivalent Gröbner bases. The proposed architectures share computational units with forward interpolation architectures. Hence, the area overhead for incorporating the backward interpolation is very small. Substantial area saving or speedup can be achieved by using the backward interpolation. When the proposed architecture is applied to the LCC decoding of a (255, 239) RS code with η = 3, the area is reduced to 39% of that required by prior architectures. In terms of speed/area ratio, the proposed architecture is 48% more efficient than the best available architecture. For the BGMD decoding of the same code, the proposed architecture can achieve around 20% higher efficiency.

Index Terms—Backward interpolation, bit-level generalized minimum distance (BGMD) decoding, low-complexity Chase (LCC), Reed–Solomon (RS) codes, soft-decision decoding, VLSI architecture.

I. INTRODUCTION

Reed–Solomon (RS) codes are used as error-correcting codes in many digital communication and storage systems, such as wireless communications, deep-space probing, magnetic and optical recording, and digital television. Compared to hard-decision decoding (HDD) algorithms of RS codes, such as the Berlekamp–Massey algorithm (BMA) [1], soft-decision decoding algorithms can correct more errors by

Manuscript received March 07, 2008; revised July 29, 2008. First published April 21, 2009; current version published October 21, 2009. This work was supported by the National Science Foundation under Grant 0802159 and Grant 0708685.

J. Zhu and X. Zhang are with the Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106 USA (e-mail: [email protected]; [email protected]).

Z. Wang is with Broadcom Corporation, Irvine, CA 92617 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2008.2005575

making use of the probability information from the channel. Various soft-decision decoding algorithms have been proposed [2]–[6] for RS codes. The major issue that prevents these algorithms from being employed in practical applications is that they either achieve only relatively low coding gain or have very high complexity. More recently, algebraic soft-decision (ASD) decoding algorithms have been developed [7]–[9] for RS codes. These algorithms incorporate channel probability information into the algebraic interpolation process proposed by Sudan [10] and Guruswami and Sudan [11]. With a complexity that is polynomial with respect to the codeword length, these algorithms can achieve significant coding gain.

One major step of ASD decoding is the interpolation. In practice, the interpolation problem can be solved by the Nielson's algorithm [12], [13] and the Lee–O'Sullivan (LO) algorithm [14]. Although the LO algorithm can lead to a more efficient implementation when the maximum interpolation multiplicity is two [15], [16], the interpolation points and their multiplicities cannot be changed once the interpolation has started. On the contrary, the Nielson's algorithm is a point-by-point algorithm: interpolation points can be added and multiplicities can be increased by carrying out more interpolation iterations. Extensive studies have been carried out on reducing the complexity and facilitating the hardware implementation of the Nielson's algorithm [17]–[23]. Nevertheless, how to eliminate interpolation points or reduce interpolation multiplicities from a given interpolation result has not been addressed in any previous effort. In this paper, eliminating points or reducing multiplicities from a given interpolation result is referred to as "backward interpolation," while adding points or increasing multiplicities using the Nielson's algorithm is called "forward interpolation." The backward interpolation is of great importance because it enables the reuse of interpolation results in the following two scenarios: 1) interpolation needs to be carried out on multiple test vectors, which share common entries and 2) iterative ASD decoding where interpolation points have decreasing multiplicities.

An example for the first scenario is the low-complexity Chase (LCC) decoding [9], which interpolates over 2^η test vectors. With multiplicity one for each point in the test vectors, the LCC algorithm can achieve similar or higher coding gain than the Koetter–Vardy (KV) ASD decoding [7] with maximum multiplicity four. In order to reduce the overall computation in the interpolation of LCC decoding, intermediate interpolation results can be stored and shared among the test vectors. Nevertheless, in this case, 2^(η−1) intermediate interpolation results need to be stored at the same time, which leads to a large memory requirement. The large memory accounts for a significant proportion of the overall interpolator area. Alternatively, the test vectors in the LCC decoding can be ordered such that adjacent vectors have only one different point. In this paper, we propose




a novel backward interpolation scheme that can eliminate one point from the interpolation result of a test vector. Adding the point that is different by forward interpolation, the interpolation result for the adjacent vector can be derived directly. As a result, only one interpolation result needs to be stored at any time. The basic idea of the backward interpolation is to construct an equivalent Gröbner basis. Although eliminating points requires extra clock cycles, the original LCC decoding needs to interpolate over the same point multiple times. Hence, the extra clock cycles required by the backward interpolation are offset by the redundancy in the original LCC decoding. In addition, the required computation units for backward interpolation can be shared with those for forward interpolation. Applying our proposed scheme to the LCC decoding of a (255, 239) RS code with η = 3, the area is reduced to 39% of that required by the architecture in [21], which is the best available architecture. Despite the higher clock frequency that can be achieved by the architecture in [21], our architecture is 48% more efficient in terms of speed/area ratio. In addition, the hardware saving increases with η. Parts of this result for the LCC decoding have been presented in [24].

In this paper, we also extend our backward interpolation to the iterative bit-level generalized minimum distance (BGMD) decoding [8], which belongs to the second scenario. Using maximum multiplicity two with two decoding iterations, the BGMD algorithm can also achieve similar or higher coding gain than the KV algorithm with maximum multiplicity four. In the second iteration of the BGMD decoding with maximum multiplicity two, it is possible that a code position can change from having two interpolation points of multiplicity one to one interpolation point of multiplicity two. In this case, one interpolation point needs to be eliminated and the other needs to have its multiplicity increased by one from the interpolation result derived in the first decoding iteration. In addition, the number of polynomials involved is larger than that in the LCC decoding. These features make extending the backward interpolation to BGMD decoding nontrivial. We propose to eliminate both interpolation points through constructing an equivalent Gröbner basis; then the multiplicity of the second point is increased from zero to two. Although extra clock cycles are spent to reduce the multiplicity of the second point, these clock cycles are far fewer than those required to carry out the interpolation for the second decoding iteration from the very beginning. Accordingly, applying the proposed scheme to the BGMD decoding of a (255, 239) RS code, 46% of the interpolation iterations can be saved when the SNR is high. Moreover, the computational units required for this backward interpolation can also be shared with those for forward interpolation. Compared to the architecture in [21], our architecture requires only 67% of the area. Although the architecture in [21] can achieve higher clock frequency, our architecture can achieve around 20% higher efficiency in terms of speed/area ratio.

The structure of this paper is as follows. Section II introduces general ASD decoding, the LCC decoding, the BGMD decoding, and the Nielson's forward interpolation algorithm. The proposed backward interpolation for the LCC decoding is presented in Section III. Section IV describes the backward interpolation for the BGMD decoding. The corresponding interpolation architectures are given in the respective sections. Section V provides the hardware requirement and latency analyses. Conclusions are drawn in Section VI.

II. ASD DECODING AND THE INTERPOLATION

In this paper, we consider an (n, k) RS code constructed over finite fields of characteristic two, denoted as GF(2^q), where q is a positive integer. For primitive codes, n = 2^q − 1. The message symbols f_0, f_1, ..., f_{k−1} can be viewed as the coefficients of the message polynomial f(x) = f_0 + f_1 x + ··· + f_{k−1} x^{k−1}. The encoding of RS codes can be carried out by evaluating the message polynomial at n distinct nonzero elements of GF(2^q). Taking the fixed-ordered elements α_0, α_1, ..., α_{n−1} as the distinct evaluation elements, the codeword corresponding to f(x) is c = (f(α_0), f(α_1), ..., f(α_{n−1})). At the receiver end, given the observation r_i of the ith symbol in the received word, the ith symbol sent at the transmitter can be any field element. Hence, the interpolation points associated with the ith received symbol can include (α_i, β) for any β ∈ GF(2^q). The probability of β being sent at the transmitter, given the observation r_i, is denoted by π_{i,β}.
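As a concrete toy illustration of the evaluation-map encoding described above (this sketch is ours, not the authors'; the field GF(2^4) with primitive polynomial x^4 + x + 1, the helper names, and the choice of evaluation order are all assumptions for the example):

```python
def gf_mul(a, b, poly=0b10011, m=4):
    # carry-less multiply with reduction in GF(2^4), x^4 + x + 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= poly
    return r

def poly_eval(f, x):
    # Horner evaluation of f(x) = f[0] + f[1]*x + ... over GF(2^4)
    r = 0
    for c in reversed(f):
        r = gf_mul(r, x) ^ c
    return r

def rs_encode(msg, eval_points):
    # codeword symbol c_i = f(alpha_i): evaluate the message
    # polynomial at the fixed-ordered nonzero field elements
    return [poly_eval(msg, a) for a in eval_points]

def nonzero_elements():
    # the n = 15 nonzero elements of GF(2^4), ordered as powers of x
    elems, a = [], 1
    for _ in range(15):
        elems.append(a)
        a = gf_mul(a, 2)
    return elems
```

For instance, rs_encode([1, 0, 3], nonzero_elements()) produces a length-15 codeword of a toy (15, 3) code; the code is linear because evaluation is a linear map of the message coefficients.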

ASD decoding consists of three steps: the multiplicity assignment, the interpolation, and the factorization. The function of the multiplicity assignment step is to decide on the multiplicity, m_{i,β}, of the interpolation point (α_i, β) according to the reliability information from the channel. Different ASD algorithms have different multiplicity assignment schemes. Nevertheless, they share the same interpolation and factorization steps. Before the functions of the interpolation and factorization steps are introduced, some related definitions are given as follows.

Definition 1: A bivariate polynomial Q(x, y) is said to pass a point (α, β) with multiplicity m if Q(x + α, y + β) contains a monomial x^a y^b with degree a + b = m, and does not contain any monomial with degree less than m.

Definition 2: For nonnegative integers w_x and w_y, the (w_x, w_y)-weighted degree of a bivariate polynomial Q(x, y) = Σ_{a,b} q_{a,b} x^a y^b is the maximum of a·w_x + b·w_y, such that q_{a,b} ≠ 0.
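To make Definition 2 concrete, here is a small sketch (our own illustration; the sparse-dictionary representation of a bivariate polynomial is a choice we made for the example):

```python
def weighted_degree(Q, wx, wy):
    # Q is a bivariate polynomial stored as {(a, b): q_ab}, keeping
    # nonzero coefficients only; returns max(a*wx + b*wy)
    return max(a * wx + b * wy for (a, b), q in Q.items() if q)
```

For Q(x, y) = x^3·y + y^2 under the (1, k − 1) weighting with k = 5, weighted_degree({(3, 1): 1, (0, 2): 1}, 1, 4) returns 8, since the y^2 term dominates (0·1 + 2·4 = 8 versus 3·1 + 1·4 = 7).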

Interpolation Step: Given the set of interpolation points (α_i, β) with corresponding multiplicities m_{i,β}, compute a nontrivial bivariate polynomial Q(x, y) of minimal (1, k − 1)-weighted degree that passes each interpolation point with at least its associated multiplicity.

Factorization Step: Determine all factors of the bivariate polynomial Q(x, y) in the form of y − f(x) with the degree of f(x) less than k. Each f(x) corresponds to one possible message polynomial in the list.
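The factor condition can be checked directly: y − f(x) divides Q(x, y) exactly when substituting y = f(x) yields the zero polynomial. The sketch below (our illustration; the GF(2^4) arithmetic and data layouts are assumptions for the example) performs that substitution:

```python
def gf_mul(a, b, poly=0b10011, m=4):
    # multiplication in GF(2^4) with primitive polynomial x^4 + x + 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= poly
    return r

def poly_mul(p, q):
    # univariate polynomial product over GF(2^4) (coefficient lists)
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out

def is_y_root(Q, f):
    # True iff y - f(x) divides Q(x, y); Q is {(a, b): coeff},
    # f is a coefficient list; compute Q(x, f(x)) and test for zero
    acc = [0]
    for (a, b), c in Q.items():
        term = [0] * a + [c]          # c * x^a
        for _ in range(b):
            term = poly_mul(term, f)  # times f(x)^b
        if len(term) > len(acc):
            acc += [0] * (len(term) - len(acc))
        for i, t in enumerate(term):
            acc[i] ^= t               # characteristic-2 addition
    return all(t == 0 for t in acc)
```

For example, with f(x) = 3 + x, the polynomial Q(x, y) = y + f(x), stored as {(0, 1): 1, (0, 0): 3, (1, 0): 1}, passes the check, while adding any constant breaks it.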

The interpolation has higher complexity than the factorization. In this paper, we focus on the implementation of the interpolation step.

The performance and complexity of ASD decoding are mainly determined by the multiplicity assignment. The decoding complexity grows quickly with the multiplicities. Hence, multiplicity assignment schemes that use small multiplicities are preferred for practical applications. On the other hand, smaller multiplicities do not necessarily lead to inferior performance [8], [9]. In the following, the LCC and BGMD multiplicity assignment schemes are introduced.



Fig. 1. Performance of RS decoding algorithms.

A. The LCC Multiplicity Assignment Scheme

The multiplicity assignment step in the LCC decoding yields 2^η test vectors for a given positive integer η, and the interpolation needs to be carried out on each of them. For an (n, k) code, each test vector has n entries and each entry is an interpolation point of multiplicity one. In LCC multiplicity assignment, first the reliability of the ith code position is determined by the log-likelihood ratio (LLR) Λ_i = ln(π_{i, y_i^HD} / π_{i, y_i^2HD}). Here, y_i^HD and y_i^2HD are the most likely and second most likely symbols transmitted in the ith position, respectively. The smaller the magnitude of Λ_i, the more unreliable the ith code position. Then, each of the η most unreliable code positions is assigned two interpolation points: (α_i, y_i^HD) and (α_i, y_i^2HD), while each of the other n − η code positions is assigned only one interpolation point (α_i, y_i^HD). The test vectors in the LCC decoding are formed by choosing one interpolation point for each code position. Since there are two candidate points for each of the η unreliable code positions, the total number of test vectors is 2^η.
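The test-vector construction above can be sketched as follows (our own illustration; the variable names are assumptions). Each of the η unreliable positions contributes a binary choice between its two candidate symbols:

```python
from itertools import product

def lcc_test_vectors(hd, hd2, unreliable):
    # hd[i]: most likely symbol y_i^HD for code position i
    # hd2[i]: second most likely symbol y_i^2HD
    # unreliable: indices of the eta least reliable positions
    vectors = []
    for bits in product([0, 1], repeat=len(unreliable)):
        v = list(hd)
        for pos, bit in zip(unreliable, bits):
            if bit:
                v[pos] = hd2[pos]  # take the second-best symbol here
        vectors.append(v)
    return vectors
```

With η = 2 unreliable positions this yields 2^2 = 4 vectors, matching the 2^η count above.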

With small η, the LCC decoding can achieve very good performance. As can be observed from Fig. 1, for a (255, 239) RS code, the LCC decoding with η = 3 can achieve similar or higher coding gain than the KV decoding with maximum multiplicity four. For reference, the frame error rate (FER) of HDD using the BMA is also included in this figure. These simulation results are obtained over the additive white Gaussian noise (AWGN) channel with binary phase-shift keying (BPSK) modulation.

B. The BGMD Multiplicity Assignment Scheme

The BGMD decoding algorithm assigns multiplicities by using bit-level reliabilities. For an RS code constructed over GF(2^q), each received symbol consists of q noise-corrupted bits. The LLR of a noise-corrupted bit, b, is defined as LLR(b) = ln(P(b = 0 | channel observation) / P(b = 1 | channel observation)). Whether b is erased or not is determined according to |LLR(b)|. Assume the maximum allowable multiplicity is m_max = 2. The multiplicities in the BGMD algorithm are decided based on the number of bits erased in a received symbol r_i using the following scheme [8].

Algorithm A: BGMD Multiplicity Assignment

1) if no bit is erased in the received symbol r_i, assign multiplicity 2 to (α_i, y_i^HD), where y_i^HD is the hard decision of r_i;

2) if there is only one bit erased in r_i, assign multiplicity 1 to both (α_i, y_i^HD) and (α_i, y_i^HD'), where y_i^HD' is the finite-field element that differs from y_i^HD in only the erased bit;

3) if there is more than one bit erased in r_i, do not assign any multiplicity to the interpolation points with x-coordinate α_i.

The BGMD decoding algorithm is iterative. In the original BGMD decoding, the received bits are first sorted according to the magnitudes of their LLRs. Then, the most unreliable bits are erased. The number of erased bits increases by one in each iteration, and the multiplicity assignment is done according to Algorithm A. The sorting leads to high complexity in hardware implementations. Alternatively, thresholds can be used to determine the erasures. Any bit with |LLR(b)| lower than the threshold is erased. Using this approach, the multiplicity assignment can be implemented simply by comparators. It turns out that the BGMD decoding with m_max = 2, two iterations, and threshold erasure decision can achieve very good performance. Fig. 1 also illustrates the FER curves for BGMD decoding with one and two iterations. As can be observed, the BGMD decoding with m_max = 2 and two decoding iterations can achieve similar or higher coding gain than the KV algorithm with maximum multiplicity four. The thresholds used for the first and second iterations are 1.5 and 1.0, respectively.
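Algorithm A with threshold-based erasure can be sketched as below (our own illustration; the bit-ordering convention, names, and return layout are assumptions, not the paper's):

```python
def bgmd_multiplicities(bit_llrs, hd_symbol, threshold):
    # bit_llrs[j]: LLR of the j-th bit of one received symbol
    # hd_symbol: hard-decision symbol, with bit j stored at weight 2**j
    # returns {candidate_symbol: multiplicity} for this code position
    erased = [j for j, llr in enumerate(bit_llrs) if abs(llr) < threshold]
    if len(erased) == 0:
        return {hd_symbol: 2}                 # rule 1: multiplicity 2 to HD symbol
    if len(erased) == 1:
        other = hd_symbol ^ (1 << erased[0])  # flip only the erased bit
        return {hd_symbol: 1, other: 1}       # rule 2: multiplicity 1 to both
    return {}                                 # rule 3: no points for this position
```

With threshold 1.5 (the first-iteration value above), a symbol whose bit LLRs all exceed 1.5 in magnitude keeps multiplicity 2 on its hard decision; a single weak bit splits the assignment between the two candidate symbols.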

C. The Nielson’s Interpolation

Among the various algorithms proposed to solve the interpolation problem, the Nielson's algorithm [12], [13] and the LO algorithm [14] are suitable for hardware implementations. Although the LO algorithm can lead to a more efficient implementation of the interpolation when m_max = 2 [15], [16], it cannot take in more points or increase the multiplicities of the points after the interpolation has started. Hence, it is not suitable for applications where intermediate interpolation results need to be shared. On the contrary, the Nielson's algorithm is a point-by-point algorithm. Points can be added and multiplicities can be increased by carrying out more interpolation iterations.

The Nielson's algorithm constructs a Gröbner basis of a module induced from the ideal of the polynomials that pass through all interpolation points with their respective multiplicities. The Gröbner basis is defined next [25].

Definition 3: A set of nonzero polynomials, {Q_0(x, y), Q_1(x, y), ..., Q_v(x, y)}, contained in the module M is called a Gröbner basis of the module M, if for all Q(x, y) in the module there exists a j, such that LT(Q_j) divides LT(Q).

In Definition 3, LT(·) denotes the leading term of the polynomial, which is the monomial of the highest order according to the weighted degree. A monomial x^{a'} y^{b'} divides x^a y^b if a' ≤ a and b' ≤ b. Hence, the polynomial with the minimum weighted degree in the Gröbner basis is the polynomial with the minimum weighted degree among all polynomials in the module. Accordingly, the polynomial with the minimum weighted degree in the Gröbner basis constructed by the Nielson's algorithm is the desired interpolation output. Our backward interpolation will be used together with the Nielson's forward interpolation in the implementation of the LCC and BGMD decoding. The pseudocode of the Nielson's algorithm is listed in Algorithm B.

Algorithm B: Nielson's Interpolation

initialization: Q_j(x, y) = y^j, 0 ≤ j ≤ v

interpolation starts:

for each interpolation point (α_i, β_i) with multiplicity m_i

for a = 0 to m_i − 1 and b = 0 to m_i − 1 − a

B1: compute δ_j = coefficient of x^a y^b in Q_j(x + α_i, y + β_i), 0 ≤ j ≤ v

B2: j* = arg min_{j: δ_j ≠ 0} {wdeg Q_j(x, y)}

B3: Q_j(x, y) ⇐ δ_{j*} Q_j(x, y) + δ_j Q_{j*}(x, y), for j ≠ j* with δ_j ≠ 0

B4: Q_{j*}(x, y) ⇐ (x + α_i) Q_{j*}(x, y)

Output: Q(x, y) = min_{wdeg} {Q_j(x, y), 0 ≤ j ≤ v}

In Algorithm B, the discrepancy coefficient δ_j is the coefficient of the monomial x^a y^b in Q_j(x + α_i, y + β_i). It can be computed as

δ_j = Σ_{a' ≥ a} Σ_{b' ≥ b} C(a', a) C(b', b) q^{(j)}_{a',b'} α_i^{a'−a} β_i^{b'−b}

where q^{(j)}_{a',b'} is the coefficient of x^{a'} y^{b'} in Q_j(x, y). Algorithm B starts with a set of v + 1 bivariate candidate polynomials initialized as Q_j(x, y) = y^j, 0 ≤ j ≤ v. An interpolation point with multiplicity m adds m(m + 1)/2 interpolation constraints. The variable v can be decided by the total number of interpolation constraints and the lexicographical order of monomials according to the (1, k − 1)-weighted degree. For high-rate codes, v equals the maximum multiplicity of the interpolation points. In Algorithm B, one more interpolation constraint is satisfied in each iteration. During each iteration, discrepancy coefficients are first computed for all candidate polynomials in step B1. If all these coefficients are zero, the current interpolation constraint is already satisfied and we can move on to the next constraint. Otherwise, the polynomials are updated in steps B3–B4 to force these coefficients to become zero. In addition, such polynomial updating does not affect those coefficients that have already been forced to zero in previous iterations. Therefore, after steps B1–B4 have been carried out for each combination of a and b such that a + b < m_i, every candidate polynomial passes the point (α_i, β_i) with multiplicity m_i. Furthermore, the weighted degree of the polynomials is increased by the minimum value, one, through the B4 step in each iteration. When the interpolation over all the points is completed, the candidate polynomials form a Gröbner basis of the polynomials that satisfy all interpolation constraints. The polynomial with the minimum weighted degree in the Gröbner basis is the desired interpolation output.
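Algorithm B can be exercised with a small executable sketch (ours, not the authors' implementation; the GF(2^4) field, the sparse-dictionary polynomials, and tie-breaking by weighted degree alone are simplifying assumptions). In characteristic two the binomial coefficients in the discrepancy formula reduce, by Lucas' theorem, to a submask test: C(a', a) mod 2 = 1 iff the bits of a are a subset of the bits of a'.

```python
def gf_mul(a, b, poly=0b10011, m=4):
    # multiplication in GF(2^4), primitive polynomial x^4 + x + 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= poly
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def wdeg(Q, wy):
    # (1, wy)-weighted degree of a sparse polynomial {(a, b): coeff}
    return max(a + b * wy for (a, b) in Q)

def disc(Q, a, b, alpha, beta):
    # B1: coefficient of x^a y^b in Q(x + alpha, y + beta);
    # binomial coefficients mod 2 via Lucas' theorem (submask test)
    d = 0
    for (ap, bp), c in Q.items():
        if (ap & a) == a and (bp & b) == b:
            d ^= gf_mul(c, gf_mul(gf_pow(alpha, ap - a), gf_pow(beta, bp - b)))
    return d

def interpolate(points, mults, v, wy):
    Qs = [{(0, j): 1} for j in range(v + 1)]     # initialization: Q_j = y^j
    for (alpha, beta), m in zip(points, mults):
        for a in range(m):
            for b in range(m - a):
                ds = [disc(Q, a, b, alpha, beta) for Q in Qs]
                if not any(ds):
                    continue                      # constraint already satisfied
                # B2: minimal weighted degree among nonzero discrepancies
                js = min((j for j in range(v + 1) if ds[j]),
                         key=lambda j: wdeg(Qs[j], wy))
                f, df = dict(Qs[js]), ds[js]
                for j in range(v + 1):
                    if j != js and ds[j]:
                        # B3: Q_j <- d_js*Q_j + d_j*Q_js (char 2: + is xor)
                        nq = {k: gf_mul(df, c) for k, c in Qs[j].items()}
                        for k, c in f.items():
                            nq[k] = nq.get(k, 0) ^ gf_mul(ds[j], c)
                        Qs[j] = {k: c for k, c in nq.items() if c}
                # B4: Q_js <- (x + alpha) * Q_js
                nq = {}
                for (ap, bp), c in f.items():
                    nq[(ap + 1, bp)] = nq.get((ap + 1, bp), 0) ^ c
                    nq[(ap, bp)] = nq.get((ap, bp), 0) ^ gf_mul(alpha, c)
                Qs[js] = {k: c for k, c in nq.items() if c}
    return min(Qs, key=lambda Q: wdeg(Q, wy))     # output: minimal wdeg
```

Note that disc(Q, 0, 0, x, y) is exactly Q(x, y), so the discrepancy routine doubles as an evaluator for checking that the returned polynomial vanishes at every interpolation point.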

Various simplification techniques and architectures have been proposed for the implementation of the Nielson's interpolation. Applying the reencoding and coordinate transformation techniques [17], [18], the coordinates of the interpolation points with the largest multiplicities can be converted to zero. As a result, the interpolation over these points can be presolved by simple univariate interpolations. Accordingly, the number of bivariate interpolation iterations that need to be carried out in Algorithm B can be significantly reduced, especially for high-rate codes. A point-serial scheme [19] has been proposed to facilitate the computation of discrepancy coefficients. In this scheme, initial values for all discrepancy coefficients associated with the same interpolation point are computed. Then, these initial values are updated simultaneously with the polynomials to derive the actual discrepancy coefficients. Consequently, the latency for the discrepancy coefficient computation in later iterations can be eliminated. However, the discrepancy coefficient updating incurs area overhead, and additional clock cycles are required to compute the coefficients at the beginning of the interpolation for each point. The area requirement of the interpolation can be lowered by our reduced-complexity interpolation architecture [20]. This architecture is developed based on the observation that a significant proportion of the discrepancy coefficients are zero when the interpolation multiplicity is larger than two, and hence, the corresponding polynomial updating can be skipped. The minimum achievable clock period of the interpolation architecture is limited by the data path in the feedback loops used for discrepancy coefficient computation. Power representation of finite-field elements is employed in [21], such that the data path can be reduced to one 1-bit full adder. Accordingly, other parts of the interpolation architecture can be deeply pipelined to achieve higher clock frequency. Nevertheless, this approach entails the area overhead of the look-up tables (LUTs) that are needed to convert back and forth between the power representation and the conventional representation of field elements. In addition, when deep pipelining is applied, a large number of registers is required, and the setup and hold times of the registers are no longer negligible. Hence, when the architecture in [21] is implemented on FPGAs in [22], a critical path that is around twice as long as the theoretical limit is employed. An interpolation architecture that updates the monomials in a polynomial in parallel was proposed in [23]. Although higher parallelism can be achieved, this architecture has a large area requirement, and the hardware utilization efficiency is only about 50% [21].

Despite all these efforts, Algorithm B only allows interpolation points to be added or multiplicities to be increased by carrying out more iterations to cover the extra constraints. How to eliminate points or reduce multiplicities from a given interpolation result has not been studied in any previous literature. Such backward interpolation is indispensable to enable the reuse of interpolation results and hence reduce the overall decoding complexity when multiple interpolations need to be carried out. In the following two sections, assuming that the forward interpolation is done by using Algorithm B, novel backward interpolation schemes for the LCC and BGMD decoding are proposed. Efficient implementation architectures that can be used for both forward and backward interpolations are also presented.

Fig. 2. Two interpolation procedures of the LCC decoding with η = 3.

III. INTERPOLATION ARCHITECTURE FOR LCC DECODING

In the LCC algorithm, the interpolation needs to be carried out on each of the 2^η test vectors. Since the interpolation points (α_i, y_i^HD) for the n − η reliable code positions are common among all test vectors, the interpolation over these points can be carried out first using Algorithm B, and then the result can be shared. According to the rest of the interpolation points, the test vectors can be mapped to a full binary tree with η levels. Fig. 2(a) shows such a tree with η = 3. In this tree, each node denotes the interpolation result over the points up to the current level, and the "0" and "1" labels on the edges indicate that (α_i, y_i^HD) and (α_i, y_i^2HD), respectively, are included in the test vectors in that level. For example, the left-most path corresponds to the case that the points (α_i, y_i^HD) are included in the test vector for all the three unreliable code positions. In addition, the labels in each path from the root to a leaf are collected as the η-bit string by each leaf. The interpolation result of a node in the tree can be built upon the interpolation result of its parent node. To enable the sharing of the interpolation results, a breadth-first scheme is employed in the LCC decoding to complete the interpolation over all 2^η test vectors [9]. Accordingly, 2^(η−1) intermediate interpolation results need to be stored, which requires large memory.

Instead of being mapped to the nodes in an η-level binary tree, the test vectors can be mapped to the vertices of an η-dimensional hypercube, as shown in Fig. 2(b). Due to the property of the hypercube, the vertices can be labeled by η-bit strings such that adjacent vertices have only one bit different in their labels. Accordingly, the test vectors mapped to adjacent vertices contain different interpolation points in only one code position. In a hypercube, there always exists a path that travels through all the vertices and passes each vertex only once. For example, all the vertices in the hypercube in Fig. 2(b) can be traversed by following the arrows. Hence, going through the path, each test vector will be covered exactly once and only one interpolation result needs to be stored at any time. In order to travel from one vertex to its neighbor efficiently, we need to derive the interpolation result of a test vector directly from that of another with only one different interpolation point. This can be done by deleting a point from the interpolation result and then adding another point.
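A single-bit-difference traversal of this kind is exactly a binary reflected Gray code. The sketch below (our own illustration, not from the paper) generates such an ordering of the vertex labels and locates the one position whose point must be swapped between adjacent test vectors:

```python
def gray_sequence(eta):
    # vertex labels of the eta-cube, ordered so that consecutive
    # labels differ in exactly one bit (reflected Gray code)
    return [i ^ (i >> 1) for i in range(1 << eta)]

def swapped_position(a, b):
    # index of the single unreliable position whose interpolation
    # point changes between adjacent test vectors labeled a and b
    d = a ^ b
    assert d != 0 and d & (d - 1) == 0, "labels must differ in one bit"
    return d.bit_length() - 1
```

For η = 3, gray_sequence(3) gives [0, 1, 3, 2, 6, 7, 5, 4]; each step identifies which point to remove by backward interpolation and which to add by forward interpolation.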

At the beginning, a test vector is selected randomly as the starting vector, and the interpolation result for this vector is computed through constructing the corresponding Gröbner basis using Algorithm B. Assume and are the points that are different between the starting vector and one of its neighbors. Here, one of and is and the other is . To compute the interpolation result for the neighbor vector, needs to be deleted from the Gröbner basis derived for the starting vector, and then needs to be added. This process will be denoted by . Apparently, adding one point to a Gröbner basis can be done by following Nielson's forward interpolation in Algorithm B. Next, we propose a backward interpolation scheme that can delete one point from the Gröbner basis in the LCC decoding.

A. Backward Interpolation for LCC Decoding

In the LCC decoding, since the maximum multiplicity of the interpolation points is one, the Gröbner basis consists of two polynomials and the maximum -degree of them is one. Subsequently, the two polynomials in the Gröbner basis can be expressed as , . The backward interpolation for starts with evaluating at . Since , if , then . In this case, contains the factor and . In the case where , does not have the factor . In addition, passes and it is of -degree one. Hence, is the only root of . Accordingly, when . Assume and . According to whether the polynomials ( , 1) contain the factor , the Gröbner basis formed by ( , 1) can be divided into three categories.

1) One of has the factor . In this case, has the factor and does not. Hence, can be rewritten as

where is another bivariate polynomial.

Lemma 1: The polynomials

(1)

form a Gröbner basis of the module induced from the ideal of the polynomials that pass all the points except in the starting vector. The proof of this lemma can be found in the Appendix of this paper.

2) None of has the factor . This case happens when none of and is zero. In this case, an equivalent Gröbner basis of the polynomials that pass all points in the starting vector can be built as

(2)


ZHU et al.: BACKWARD INTERPOLATION ARCHITECTURE FOR ALGEBRAIC SOFT-DECISION REED–SOLOMON DECODING 1607

Lemma 2: contains the factor . For the proof, the interested reader is referred to the Appendix of this paper. Since one of the polynomials in (2) has the factor , they are in the same form as the polynomials in category 1. A Gröbner basis of the polynomials that pass all points except with multiplicity one can be derived by dividing from .

3) Both of have the factor . Lemma 3: This case cannot happen. The detailed proof is provided in the Appendix of this paper.

After the point has been removed by constructing a Gröbner basis of the module as discussed above, the forward interpolation in Algorithm B can be carried out for one iteration to add the point . The Gröbner basis constructed by such a backward and forward interpolation process may not be the same as the Gröbner basis that would have been constructed by interpolating over all the points using Algorithm B. However, they are Gröbner bases of the same module. It was stated in [25] that the polynomial with the minimum weighted degree in the module must appear in any Gröbner basis of this module and is unique up to a nonzero constant scalar. Hence, the minimum polynomial in the Gröbner basis constructed by forward and backward interpolation is a scaled version of the minimum polynomial in the Gröbner basis constructed by forward-only interpolation. The minimum polynomial is the desired interpolation output. Scaling the interpolation output polynomial by a nonzero constant will not affect the factors. Therefore, employing the proposed backward interpolation scheme leads to the same decoding output, and no error-rate degradation results. In summary, Algorithm C lists the pseudocode for deriving the interpolation result of a test vector from that of its neighbor vector in the LCC decoding.

Algorithm C: New Interpolation for LCC Decoding

for each in the hypercube
backward interpolation:

C1: compute for

if
C2:

C3: divide by

forward interpolation:
C4: compute for

C5:

C6:

output:

In Algorithm C, the steps in the forward interpolation are the same as those in Algorithm B. These steps are also used in the interpolation for the starting test vector. It can be observed that the C2 step is the same as the C5 step except that different scalar coefficients are used. In addition, instead of multiplying the factor in the forward interpolation, the division by this factor is included in the backward interpolation.
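To make the backward/forward round trip concrete, the following sketch works the multiplicity-one case over the prime field GF(97) instead of GF(2^8), so that ordinary modular arithmetic can stand in for the finite-field units. Each basis polynomial Q(x, y) = q0(x) + y·q1(x) is a pair of coefficient lists, the weighted degree gives y weight 2, and all names (`kotter_add`, `backward_delete`, the sample points) are our own illustration, not the paper's notation:

```python
# Backward interpolation for the multiplicity-one (LCC) case, sketched
# over GF(97).  A basis polynomial Q(x,y) = q0(x) + y*q1(x) is stored as
# a pair of coefficient lists (lowest degree first).

P, W = 97, 2                       # field modulus; weight of y

def padd(a, b):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return [(u + v) % P for u, v in zip(a, b)]

def pscale(c, a):
    return [(c * u) % P for u in a]

def pshift(a, x0):                 # multiply by (x - x0)
    return padd([0] + a, pscale(-x0 % P, a))

def pdiv_root(a, x0):              # synthetic division by (x - x0)
    qs, acc = [], 0
    for c in reversed(a):
        acc = (acc * x0 + c) % P
        qs.append(acc)
    assert qs.pop() == 0, "not divisible by (x - x0)"
    return list(reversed(qs))

def peval(a, x0):
    r = 0
    for c in reversed(a):
        r = (r * x0 + c) % P
    return r

def deg(a):
    return max((j for j, c in enumerate(a) if c % P), default=-1)

def lead(Q):                       # (weighted degree, y-degree) of leading term
    cand = []
    if deg(Q[0]) >= 0: cand.append((deg(Q[0]), 0))
    if deg(Q[1]) >= 0: cand.append((deg(Q[1]) + W, 1))
    return max(cand)

def beval(Q, x0, y0):
    return (peval(Q[0], x0) + y0 * peval(Q[1], x0)) % P

def kotter_add(basis, pt):         # one forward (Nielson-type) iteration
    x0, y0 = pt
    d = [beval(Q, x0, y0) for Q in basis]
    nz = [t for t in range(len(basis)) if d[t]]
    if not nz:
        return basis               # point already interpolated
    ts = min(nz, key=lambda t: lead(basis[t]))
    out = []
    for t, Q in enumerate(basis):
        if t == ts:
            out.append((pshift(Q[0], x0), pshift(Q[1], x0)))
        elif d[t]:
            c = (-d[t] * pow(d[ts], P - 2, P)) % P
            out.append((padd(Q[0], pscale(c, basis[ts][0])),
                        padd(Q[1], pscale(c, basis[ts][1]))))
        else:
            out.append(Q)
    return out

def backward_delete(basis, pt):    # remove one point from the basis
    x0, _ = pt
    e = [peval(Q[1], x0) for Q in basis]
    basis = list(basis)
    if 0 in e:                     # category 1: factor already present
        t = e.index(0)
    else:                          # category 2: cancel to create the factor
        t = max(range(2), key=lambda i: lead(basis[i]))
        s = 1 - t
        basis[t] = (padd(pscale(e[s], basis[t][0]), pscale(-e[t] % P, basis[s][0])),
                    padd(pscale(e[s], basis[t][1]), pscale(-e[t] % P, basis[s][1])))
    basis[t] = (pdiv_root(basis[t][0], x0), pdiv_root(basis[t][1], x0))
    return basis

def min_poly(basis):               # minimal polynomial, scaled monic
    Q = min(basis, key=lead)
    wd, yd = lead(Q)
    lc = Q[1][wd - W] if yd else Q[0][wd]
    inv = pow(lc, P - 2, P)
    trim = lambda a: a[:deg(a) + 1]
    return (trim(pscale(inv, Q[0])), trim(pscale(inv, Q[1])))

# Delete (4, 2) and add (4, 9): the minimal polynomial agrees, up to a
# scalar, with forward-only interpolation of the modified point set.
start = [([1], [0]), ([0], [1])]   # Q0 = 1, Q1 = y
basis = start
for p in [(1, 5), (2, 7), (3, 11), (4, 2)]:
    basis = kotter_add(basis, p)
basis = kotter_add(backward_delete(basis, (4, 2)), (4, 9))
ref = start
for p in [(1, 5), (2, 7), (3, 11), (4, 9)]:
    ref = kotter_add(ref, p)
assert min_poly(basis) == min_poly(ref)
```

The final assertion is exactly the scalar-equivalence argument above: both routes produce Gröbner bases of the same module, so their minimum-weighted-degree elements coincide after normalization.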

B. VLSI Architecture for the New LCC Interpolation

In this section, efficient architectures for the new interpolation procedure for the LCC decoding are presented. The critical path is designed to be no longer than one multiplier and one adder.

1) Polynomial Evaluation (PE) Architecture: The PE architecture in Fig. 3 is employed to calculate the evaluation values and at a given point . To apply Horner's rule, the coefficients are fed to the PE architecture serially with the most significant one first. In our design, and are first computed. Then, is computed as . Depending on whether the current iteration is for forward or backward interpolation, the multiplexor chooses or as the output. In addition, the output is checked by the zero detector to help decide the indices and in Algorithm C. The number of pipelining stages in this architecture is three.
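The serial, most-significant-first evaluation performed by the PE unit is ordinary Horner's rule over GF(2^8). A sketch of the recurrence follows; the field here is built from the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 for concreteness (an assumption on our part; the paper's composite-field multiplier would change only the internals of `gf_mul`):

```python
# Horner's rule over GF(2^8): acc <- acc*x + c, one coefficient per
# clock cycle, most significant coefficient first, as fed to the PE unit.

MOD = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed field polynomial)

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= MOD
        b >>= 1
    return r

def pe_eval(coeffs_msb_first, x):
    """Evaluate a polynomial at x, coefficients fed serially MSB-first."""
    acc = 0
    for c in coeffs_msb_first:   # one multiply-accumulate per cycle
        acc = gf_mul(acc, x) ^ c
    return acc
```

Evaluating both component polynomials this way and combining the results with one more multiply-accumulate yields the bivariate evaluation value selected by the multiplexor in Fig. 3.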

2) Polynomial Updating (PU) Architecture: After the evaluation values have been computed, the polynomial updating involved in the backward interpolation (C2 and C3 steps in Algorithm C) and forward interpolation (C5 and C6 steps) can be carried out by using the architecture shown in Fig. 4. In our design, the coefficients with different -degrees are updated in parallel. Fig. 4 shows the architecture for updating the coefficients with one of the -degrees. This architecture is time-multiplexed to implement the polynomial updating in both the forward and backward interpolations.

During the forward interpolation, the polynomial updating is carried out according to the C5 and C6 steps. To compute , four multiplexors would be required to choose the proper polynomials and evaluation values. However, since and , this computation can also be expressed as . Accordingly, the multiplexors can be eliminated and this computation can be implemented by block A in Fig. 4. Then, the output from block A can be passed to the output of block B intact by taking "0" as the input to the multiplier in block B. The C6 step is implemented by block C, which multiplies to the picked . In previous designs [19], [21], the updated coefficients of and are fed back to their own memory blocks, and multiplexors are required for coefficient routing. In our architecture, instead, the two updated polynomials are written back to fixed memory blocks. Although a few extra gates are required to keep track of the weighted degrees of the polynomials, employing this scheme can eliminate the multiplexors at the input of the memory. Accordingly, the critical path and gate count can be reduced. Since the longest data path in Fig. 4 consists of two pairs of multiplier and adder, registers are added to break the path. The number of pipelining stages of the PU architecture is two.


Fig. 3. PE architecture.

Fig. 4. PU architecture for the LCC decoding.

During the backward interpolation, instead of , are fed to the PU architecture. If , the computation in step C2 needs to be carried out. This computation can also be implemented by block A in Fig. 4. When , the output of block A becomes . However, multiplying by a nonzero constant will not affect the decoding result. Hence, the output of block A can be directly fed to block B, which is now configured to implement the division by through taking as the multiplier input. Since no computation needs to be done on during the backward interpolation, it is passed to the output of block C unchanged by taking "0" as the multiplier input in this block. Similar to the forward interpolation, the updated coefficients of the polynomials are written back to fixed memory blocks during the backward interpolation.

3) Computation Scheduling: Both the forward and backward interpolations consist of polynomial evaluation and polynomial updating. Note that all blocks in the PE and PU architectures take polynomial coefficients with the most significant one first. Once a coefficient of a polynomial has been updated, it can be passed to the PE architecture to compute the evaluation value of the same polynomial for the next interpolation iteration. After the last polynomial coefficient is updated, the evaluation value for the next interpolation iteration will be available after the pipelining latency. The control signal is decided based on the outputs of the PE architectures in a polynomial updating control (PUC) block. Since very simple logic is required and the last pipelining stage of the PE architecture has only five gates in the data path, the decision on can be made in the same clock cycle in which the evaluation values are selected by the multiplexors in the PE architectures. The PUC block also keeps track of the weighted degrees of the polynomials. After each polynomial updating, only one polynomial changes its weighted degree, and the degree is either increased by one due to the multiplication or decreased by one due to the division. Hence, the weighted degrees can be tracked by using up and down counters. The degree change can be decided right after the corresponding polynomial evaluation values are available. Accordingly, the degree updating can be carried out in parallel with the polynomial updating and does not take extra clock cycles.

IV. INTERPOLATION ARCHITECTURE FOR ITERATIVE BGMD DECODING

The BGMD decoding can be carried out for multiple iterations to achieve higher coding gain. Since using maximum multiplicity two leads to simpler factorization and the best complexity-performance tradeoff, we focus on iterative BGMD decoding with maximum multiplicity two in this paper. Without loss of generality, we consider two adjacent decoding iterations. The error-correcting capability of the BGMD decoding is affected by the thresholds used for erasure decision. The optimal thresholds can be derived from simulations. Compared to using a lower threshold in the second decoding iteration, the overall error-correcting capability for two iterations does not improve if a higher threshold is used in the second iteration. Hence, we consider the case that the threshold used in the second iteration is lower than that used in the first iteration. In this case, the possible changes of the interpolation points in position are:

1) and with multiplicity one in iteration one; with multiplicity two in iteration two;

2) no interpolation point in iteration one; and with multiplicity one in iteration two;

3) no interpolation point in iteration one; with multiplicity two in iteration two.

The second and third changes can be taken care of by carrying out more iterations using Nielson's forward interpolation algorithm. However, for the first change, denoted by in the remainder of this paper, needs to be eliminated and needs to have its multiplicity increased by one. In addition, there are three candidate polynomials involved in the BGMD decoding with maximum multiplicity two. These features make the backward interpolation for the BGMD decoding different from that for the LCC decoding. In the following, we present a novel backward interpolation that can eliminate both and from a given interpolation result. Then, three forward interpolation iterations can be carried out to increase the multiplicity of the point from zero to two.

A. Backward Interpolation for the BGMD Decoding

The interpolation in the BGMD decoding with maximum multiplicity two involves three polynomials ( , 1, and 2). Each of these polynomials can be expressed as

(3)

At the end of the interpolation for the first decoding iteration, each passes both and . Hence


has either all zero coefficients or two distinct roots. In the former case, contains the factor . Similar to the backward interpolation for the LCC decoding, division by can be carried out to remove the points with from the interpolation results in the BGMD decoding. In the latter case, does not contain the factor . However, assuming , polynomials divisible by can be constructed by the following lemma.

Lemma 4: contains the factor . The proof of this lemma can be found in the Appendix of this paper.

Assume . Depending on whether the polynomials ( , 1, and 2) contain the factor , the Gröbner basis formed by for BGMD decoding can be divided into four categories.

1) Two of have the factor .

In this case, and have the factor , and does not. Accordingly, and can be rewritten as

where and are two bivariate polynomials. Following a process similar to that in the proof of Lemma 1 in the Appendix for LCC decoding, it can be proved that the following set of polynomials form a Gröbner basis of the polynomials with maximum -degree two that pass all the points except and :

2) One of has the factor . In this case, one of and has the factor . Without loss of generality, assume has the factor and accordingly . Then, a Gröbner basis equivalent to the original one can be built as

(4)

From Lemma 4, contains the factor . Now, both and have the factor . Hence, the polynomials in (4) reduce to those in category 1, and a Gröbner basis of the polynomials that pass all points, except and , can be derived by dividing from and .

3) None of has the factor . In this case, none of , , and are zero. Since has the lowest weighted degree, an equivalent Gröbner basis can be constructed as

(5)

Again, from Lemma 4, and have the factor . Therefore, the polynomials in (5) also reduce to those in category 1.

4) All of have the factor . Using approaches similar to those in the proof of Lemma 3, it can be derived that this case cannot happen.

During the forward interpolation over and , the two polynomials that are picked as the minimum polynomials are different, and thus, two of are multiplied by . As can be observed from the first three categories, to remove this pair of points from the interpolation results, an equivalent Gröbner basis that has two polynomials containing the factor needs to be constructed first. Actually, the equivalent Gröbner basis for all categories can be constructed according to (5), since in category 1 and in category 2. Therefore, the backward interpolation that eliminates a pair of points and from a given interpolation result in the BGMD decoding can be done by following the pseudocode in Algorithm D.

Algorithm D: Backward Interpolation for Iterative BGMD Decoding

D1: compute for , 1, 2

D2:

D3: divide and by

;

After and have been eliminated, the addition of with multiplicity two can be completed by carrying out three forward interpolation iterations using Algorithm B.

B. VLSI Architecture for Iterative BGMD Interpolation

In this section, an efficient interpolation architecture for iterative BGMD decoding is presented. The critical path is designed to be no longer than one multiplier, one adder, and two 2:1 multiplexors.

1) Discrepancy Coefficient Computation (DCC) Architecture: The DCC architecture for the BGMD decoding is illustrated in Fig. 5. Similar to the PE architecture in Fig. 3, Horner's rule is applied and the coefficients are fed to the DCC architecture serially with the most significant one first. During the backward interpolation, is needed. This value can be derived by setting the signal to "1" and passing the lower input through the multiplexor in Fig. 5. In addition, is checked by the zero detector to decide whether contains the factor .

The discrepancy coefficients required in the forward interpolation can also be computed by the DCC architecture. Three discrepancy coefficients are needed to interpolate over a point


Fig. 5. DCC architecture for the BGMD decoding.

with multiplicity two. They are , , and . Since , it can be computed by setting to "1" and choosing the upper input of the multiplexor in Fig. 5. We can derive that . Hence, it can be computed in the same way as for the backward interpolation. Because , it can be generated by assigning to the signal . Since the computation is over finite fields of characteristic two, is either 0 or 1. Actually, it is the least significant bit of . The number of pipelining stages in the DCC architecture is three.

2) Polynomial Updating (PU) Architecture: Fig. 6 shows the PU architecture for the iterative BGMD decoding. This architecture can also be time-multiplexed to carry out the polynomial updating in both the backward and forward interpolations. According to the backward interpolation in Algorithm D, after the coefficients are computed, , , and should first be decided. Then, for , 1, and 2 are rearranged. The rearrangement can be done by the two shifters in Fig. 6. The detailed architecture of the shifter is shown in Fig. 7. According to the select signal , which is derived from , , and , the shifter can pass the inputs or shift them to the left or right. Since there is no need to differentiate and , the same select signal can be used for all three multiplexors. The polynomial rearrangement required in the forward interpolation can also be done by these shifters. In particular, the in the forward interpolation is equivalent to the in the backward interpolation. In addition, should be used instead of in the forward interpolation.

The computations in step D2 of Algorithm D for the backward interpolation are carried out by the D block in Fig. 6. Each of the divisions in the D3 step can be implemented by a B block. Since no computation is required for , the C block passes it unchanged by choosing zero as the multiplier coefficient. In the case of forward interpolation, the computations in the B3 step of Algorithm B can also be carried out by the D block. Then, the two output polynomials from the D block are passed through the B blocks intact. In addition, the multiplication by required in the B4 step is completed in the C block. The longest path in the architecture in Fig. 6 consists of two multiplier-adder pairs and two 2:1 multiplexors. Registers are added to break the path into two pipelining stages.

Fig. 6. PU architecture for the BGMD decoding.

Fig. 7. Architecture for shifter.

V. HARDWARE REQUIREMENT AND LATENCY ANALYSES

This section provides the hardware requirement and latency analyses of the proposed interpolation architectures for the LCC and BGMD decoding of a (255, 239) RS code constructed over . In addition, our architectures are compared with existing interpolation architectures. With significantly less area, the interpolation architecture in [21] can achieve more than twice the throughput of the one implemented in [23]. Hence, our architectures are mainly compared with the one in [21] and an optimized version of the one in [19]. The architectures in [21] and [19] are for interpolations with multiplicities five and six, respectively. These architectures are scaled down to multiplicities one and two before the comparisons are made. In addition, the DCC architecture used in [19] was replaced by the one proposed in this paper, and simultaneous polynomial updating and discrepancy coefficient computation are enabled. Such an optimized architecture can achieve higher throughput and requires less area than the original architecture from [19]. The purpose of these optimizations is to highlight the savings that can be achieved by the backward interpolation itself.

A. Hardware Requirement and Latency Analyses for LCC Interpolation

Table I lists the gate count and critical path of each building block in the proposed LCC interpolation architecture for a (255,


TABLE I
GATE COUNTS AND CRITICAL PATHS OF THE PROPOSED LCC INTERPOLATION ARCHITECTURE

239) RS code. All the gates in this table are two-input gates, and the Muxes refer to 1-bit 2:1 multiplexors. Employing composite field arithmetic, a multiplier can be implemented by 64 XOR gates and 48 AND gates with six XOR gates and one AND gate in the critical path. Besides the listed gates, memory is required to store the coefficients of one interpolation result. From simulations, the maximum -degree of the candidate polynomials is 16, and thus, the memory size is . In our design, the weighted degrees of the polynomials are stored in registers. Taking these registers into account, the total number of registers required in our design is 166. Each AND gate or OR gate requires 3/4 of the area of an XOR gate, each Mux or memory cell has the same area as an XOR gate, and each register occupies about three times the area of an XOR gate. Therefore, the area requirement of the proposed architecture is equivalent to that of 2696 XOR gates.

Here, we consider LCC decoding with . The interpolation in the original LCC decoding can be implemented according to the steps in Nielson's forward interpolation in Algorithm B. If only one interpolation engine is available, as in our architecture, intermediate interpolation results need to be stored at the same time. One option is to store these coefficients in a single large memory. In the original LCC decoding, the starting polynomials in an interpolation iteration are not the updated polynomials from the previous iteration. Hence, a three-port memory would be required to enable concurrent polynomial updating and polynomial evaluation from adjacent iterations if all coefficients are stored in a single memory. Therefore, this is not a feasible option, and the intermediate interpolation results need to be stored in individual memory blocks. Accordingly, a multiplexor is required to feed the coefficients from separate memories to the discrepancy coefficient computation and polynomial updating architectures. To implement the interpolation in the original LCC decoding, the optimized version of the interpolation architecture in [19] needs 870 XOR gates, 593 AND gates, 33 OR gates, 408 Muxes, and 281 registers. The critical path of this architecture lies in its polynomial updating unit, and it consists of one multiplier, two Muxes, and one inverter. A finite-field inverter over can be implemented by a 256 × 8 LUT. The delay of such an LUT usually equals that of 2-3 serially concatenated XOR gates. Replacing the delay of the inverter with that of two XOR gates, the critical path of this architecture includes 11 gates. Besides the 256 bytes of memory required for the LUT, bytes of memory are required to store intermediate interpolation results. In total, the area requirement of this architecture is equivalent to that of 6815 XOR gates. Using the power representation of finite-field elements, the architecture in [21] has four gates in the critical path. However, this architecture requires more multipliers. In addition, due to the larger number of pipelining stages and the larger number of bits in each pipelining cutset, the number of registers required is significantly increased. In total, the interpolation architecture in [21] requires 1428 XOR gates, 1114 AND gates, 174 OR gates, 326 Muxes, 697 registers, and 272 bytes of memory to implement the original LCC interpolation. These requirements are equivalent to that of 6987 XOR gates.
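The equivalent-XOR accounting used throughout this section (AND or OR gate = 3/4 of an XOR gate, 1-bit Mux or memory cell = 1 XOR gate, register = 3 XOR gates) can be reproduced directly from the quoted gate counts; for instance, the figure for the architecture of [21] above. A small sketch (the function name is our own):

```python
# Equivalent-XOR area model from the text: AND/OR = 3/4 XOR,
# 1-bit Mux or memory cell = 1 XOR, register = 3 XOR (memory counted
# per bit, 8 bits per byte).
def xor_equiv(xor, and_, or_, mux, regs, mem_bytes):
    return xor + 0.75 * (and_ + or_) + mux + 3 * regs + 8 * mem_bytes

# Gate counts quoted for the forward-only LCC architecture of [21]:
area = xor_equiv(xor=1428, and_=1114, or_=174, mux=326, regs=697,
                 mem_bytes=272)
assert area == 6987   # matches the equivalent-XOR figure in the text
```

The same model, applied to the BGMD-scale gate counts given in Section V-B, reproduces those area figures to within rounding.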

The minimum achievable clock period of an architecture is proportional to the number of gates in the critical path. Hence, the achievable throughputs of two interpolation architectures for the same code can be compared by the product of the total number of clock cycles required and the number of gates in the critical path. The total number of clock cycles can be computed as the product of the number of iterations and the average number of clock cycles required in each iteration. At the beginning of our proposed scheme, the interpolation over the points in the starting vector requires forward interpolation iterations. Each of the following requires two iterations, one for backward interpolation and one for forward interpolation. Thus, the total number of iterations is . From simulation results, the average maximum -degree of the polynomials using our proposed architecture is 9. In addition, the numbers of pipelining stages of the PE and PU architectures are three and two, respectively. Hence, the total number of pipelining stages is 5, and the average number of clock cycles required for each iteration is using our proposed architecture. In the original LCC interpolation scheme, the number of interpolation iterations for the common points is and that for the points in unreliable code positions is . Hence, the total number of iterations is . Although a different PU architecture is used in [19], it also has two pipelining stages. In addition, is also 9 in the original LCC interpolation scheme. Therefore, the optimized version of the architecture in [19] also requires on average 15 clock cycles in each iteration. Since deep pipelining is employed by the architecture in [21], , and thus, each iteration requires 19 clock cycles on average.

Table II lists some comparison results of the interpolation architectures for the LCC decoding. As can be observed from this table, in terms of speed/area ratio, our architecture is more efficient than the optimized version of the architecture in [19]. Although the architecture in [21] has a shorter critical path and hence can achieve a higher clock frequency, our architecture is still more efficient. The high efficiency of our architecture mainly comes from the proposed backward interpolation, which enables the storage of only one intermediate interpolation result. In addition, most of the computation units required for the backward interpolation can be shared with those for the forward interpolation. Hence, the area overhead for incorporating the backward interpolation is very small. Compared to the two previous architectures employing forward-only interpolation for the LCC decoding, the area requirement of our architecture is only and , respectively. When increases, the area requirement of our architecture does not change. However, the memory requirements of previous interpolation architectures increase linearly with .


TABLE II
INTERPOLATION ARCHITECTURE COMPARISON RESULTS FOR LCC DECODING

TABLE III
GATE COUNTS AND CRITICAL PATHS OF THE PROPOSED BGMD INTERPOLATION ARCHITECTURE

Therefore, the savings achieved by our architecture will further increase with .

B. Hardware Requirement and Latency Analyses for Iterative BGMD Interpolation

The gate count and critical path of each building block in the proposed interpolation architecture for the BGMD decoding of a (255, 239) RS code are listed in Table III. Besides the listed gates, memory is required to store one interpolation result. From simulations, the maximum -degree of the polynomials is 32, and thus, a memory of is required. In our design, the weighted degrees of the polynomials are stored in registers. Taking these registers into account, the total number of registers needed in our design is 371. Employing the same equivalent XOR gate estimation method described in Section V-A, the area requirement of the proposed architecture equals that of 7872 XOR gates.

Denote the average numbers of interpolation constraints in the first and second BGMD decoding iterations by and , respectively. Denote the average number of changes by . The average number of total interpolation iterations for two decoding iterations using our proposed architecture can be computed from and . Consider the three possible cases of interpolation point change. The last two cases only require increasing the multiplicities of the points. In these cases, only forward interpolation is required, and one forward interpolation iteration needs to be carried out to satisfy one constraint. In the case of , the multiplicities of two points are increased from zero to one in the first decoding iteration. Then, in the second decoding iteration, the multiplicity of one point is increased from zero to two after the multiplicities of both points are decreased from one to zero. Compared to increasing the multiplicity of a point from zero to two directly, this process requires three additional interpolation iterations: two forward and one backward. Therefore, the average number of total interpolation iterations using our backward interpolation scheme is . The average numbers of constraints and interpolation iterations for decoding a (255, 239) RS code are listed in Table IV. These data are collected from simulations over the AWGN channel with BPSK modulation. The two thresholds used for iterative BGMD decoding are 1.5 and 1.0.

TABLE IV
AVERAGE NUMBERS OF CONSTRAINTS AND INTERPOLATION ITERATIONS IN THE BGMD DECODING

The interpolation for two BGMD decoding iterations can also be carried out separately by using only forward interpolation. Apparently, the average number of interpolation iterations required in this case is . The forward-only interpolation can be carried out by the architectures in [19] and [21]. When , the optimized version of the architecture in [19] needs 2055 XOR gates, 1375 AND gates, 29 OR gates, 184 Muxes, 273 registers, and of memory. Hence, the area requirement is equivalent to that of 8535 XOR gates. The critical path of this architecture has 12 gates, and the number of pipelining stages is . In the case of , the architecture in [21] requires 3143 XOR gates, 2620 AND gates, 463 OR gates, 262 Muxes, 1211 registers, and 297 bytes of memory. The area of this architecture is equivalent to 11726 XOR gates. In addition, the critical path of this architecture has four gates and .

Some comparison results of the three architectures are listed in Table V. Compared to the optimized version of the architecture in [19] and the architecture in [21], our proposed architecture requires and less area, respectively. The decoding speed comparisons are provided for low SNR (6.5 dB) and high SNR (7.25 dB) in Table V. With the backward interpolation, the interpolation results can be shared between the two decoding iterations, and thus, the number of interpolation iterations can be significantly reduced. The iteration number is reduced by when and by when . On the other hand, the average in our backward interpolation scheme is larger than those in forward-only interpolation because the polynomials in the second decoding iteration have higher maximum -degrees. From simulations, of our proposed scheme is one more than that of forward-only interpolation. The relative hardware efficiency of the three architectures in terms of speed/area ratio can be calculated from the numbers listed in Table V. When , our architecture is and more efficient than the optimized version of the architecture in [19] and the


TABLE V
INTERPOLATION ARCHITECTURE COMPARISON RESULTS FOR THE BGMD DECODING

architecture in [21], respectively. When , our architecture is and more efficient.

The higher efficiency of our architecture mainly comes from the reduced number of interpolation iterations as a result of applying backward interpolation. Having a critical path of only four gates, the architecture in [21] can achieve the highest decoding speed among the three architectures. However, this architecture requires a large number of registers for deep pipelining. In addition, the conversion between the power representation and the conventional representation of finite-field elements incurs large area overhead. Therefore, our architecture is still much more efficient than the architecture in [21] despite its short critical path. On the other hand, the setup and hold times of registers are no longer negligible when the critical path becomes very short. Hence, the speedups achieved by real implementations of the architecture in [21] will be lower than those listed in Table V. Actually, when this architecture is implemented on FPGA devices [22], a critical path that has seven gates is employed.

VI. CONCLUSION

This paper developed novel backward interpolation schemes and corresponding implementation architectures for the LCC and BGMD decoding. The proposed backward interpolation enables the reuse of interpolation results. In addition, the computation units required for the backward interpolation can be shared with those for the forward interpolation. Hence, the area overhead for incorporating the backward interpolation is very small. As a result, the proposed interpolation architectures can achieve significantly higher efficiency than previous designs. Future work will address developing backward interpolation for more general cases, in which larger multiplicities and more candidate polynomials are involved.

APPENDIX

Lemma 1: The two polynomials in (1) form a Gröbner basis of the module induced from the ideal of the polynomials that pass all the points, except the eliminated point, in the starting vector.

Proof: To make the following proof clearer, we remove the arguments from the notations of bivariate polynomials whenever no ambiguity would occur. Assume does not pass a point with . Then, since does not pass , does not pass this point either, which contradicts the fact that is a polynomial in the Gröbner basis. Hence, passes all the points, except , in the starting vector. In addition, . Accordingly, also passes

all with . Therefore, the two polynomials in (1) belong to the module induced from the ideal of the polynomials that pass through all the points except .

Next, we prove that and form a Gröbner basis. According to Definition 3, we need to prove that the leading term of any polynomial in can be divided by that of at least one of and . Note that a monomial x^a y^b divides x^c y^d if and only if a <= c and b <= d. This proof will be completed by contradiction. Assume that (1) is not a Gröbner basis of . Then, there exists a polynomial, say , whose leading term can be divided neither by that of nor by that of . From [26], the leading terms of the polynomials generated from Nielsen's algorithm have different -degrees. In addition, the division by

does not change the -degree in the leading term. Hence, the -degrees in the leading terms of and are different. Depending on the -degree of and the evaluation value of at , there are three cases to consider.

Case 1) . Here, denotes the -degree of bivariate monomials. Similarly, will be used to denote the -degree. In this case, we can construct a polynomial

Apparently, passes all interpolation points including . Hence, is a polynomial in the module . It can be derived that since cannot be divided by and they have the same -degree. In addition, and . Hence, , and thus, does not divide . Moreover, and . It follows that cannot be divided by either, which contradicts the assumption that and form a Gröbner basis of .

Case 2) and .

In this case, is a polynomial in the module . Since is not divisible by and , it is not divisible by . In addition, and have different -degrees. Hence, is not divisible by either. This contradicts that and form a Gröbner basis of .


Case 3) and . In this case, since is nonzero, a new polynomial can be constructed as

Clearly, is nontrivial and passes all the points including . Since , the order of according to the weighted degree is either higher or lower than that of . If the order of is higher, then . Since is not divisible by , it cannot be divided by because . In addition, , and thus, cannot be divided by either. Accordingly, also has these two properties. When the order of is lower than that of , . cannot be divided by or . Hence, neither can . Therefore, in both conditions, a contradiction has been reached.

Lemma 2: in (2) contains the factor .
Proof: can also be rewritten as . From (2)

(6)

Since ( , 1) passes , . Thus, .

Substituting and in (6) by and , respectively, it can be derived that

Additionally, from (2)

(7)

Since both and are zero, contains the factor .

Lemma 3: The case that both have the factor cannot happen.
Proof: This lemma will also be proved by contradiction. Assume that both of have the factor ; then the Gröbner basis can be rewritten as

Similarly, each of and passes all points except . We construct another nontrivial polynomial

Clearly, passes the point as well as all other points, and is in the module . equals either or . Since the -degrees of the leading terms in and are one lower than those of and , respectively, cannot be divided by either or . This contradicts the fact that and form a Gröbner basis of . Therefore, this case cannot happen.

Lemma 4: contains the factor .

Proof: This polynomial can be rewritten in the form of where

(8)

for , 1, and 2. It is equivalent to prove that each of has the factor .

Substituting and into (3), it can be derived that

(9)

for , 1, and 2. Since and are elements in finite fields of characteristic two, . Hence, adding up the two equations in (9) results in

(10)

Now take as an example; from (8), (9), and (10), it can be derived that

Similarly, it can also be derived that . Hence, each of for , 1, and 2 has the factor .
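The characteristic-two step in the proof above, where adding the two equations in (9) cancels the matching terms, rests on every field element being its own additive inverse. With GF(2^m) elements represented as bit-vectors in a polynomial basis, addition is bitwise XOR:

```python
# In a field of characteristic two, a + a = 0 for every element, so
# summing two equations cancels any term that appears in both.
# GF(2^m) elements in a polynomial basis add as bitwise XOR.

def gf_add(a, b):
    """Addition (= subtraction) in GF(2^m) on bit-vector elements."""
    return a ^ b

a, b = 0b10110101, 0b01101100         # two arbitrary GF(2^8) elements
assert gf_add(a, a) == 0              # each element is its own inverse
assert gf_add(gf_add(a, b), b) == a   # the repeated term drops out
print("characteristic-2 cancellation holds")
```

This is why addition and subtraction coincide throughout the derivation: moving a term across the equals sign changes nothing.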

REFERENCES

[1] E. R. Berlekamp, Algebraic Coding Theory. New York: McGraw-Hill, 1968.

[2] G. D. Forney, Jr., “Generalized minimum distance decoding,” IEEE Trans. Inf. Theory, vol. IT-12, no. 2, pp. 125–131, Apr. 1966.

[3] D. Chase, “A class of algorithms for decoding block codes with channel measurement information,” IEEE Trans. Inf. Theory, vol. IT-18, no. 1, pp. 170–182, Jan. 1972.

[4] A. Vardy and Y. Be’ery, “Bit-level soft-decision decoding of Reed–Solomon codes,” IEEE Trans. Commun., vol. 39, no. 3, pp. 440–445, Mar. 1991.

[5] M. P. C. Fossorier and S. Lin, “Soft-decision decoding of linear block codes based on ordered statistics,” IEEE Trans. Inf. Theory, vol. 41, no. 5, pp. 1379–1396, Sep. 1995.

[6] V. Ponnampalam and B. S. Vucetic, “Soft decision decoding of Reed–Solomon codes,” in Proc. 13th Symp. Appl. Algebra, Algebraic Algorithms, Error-Correcting Codes, Honolulu, HI, Nov. 1999.


[7] R. Koetter and A. Vardy, “Algebraic soft-decision decoding of Reed–Solomon codes,” IEEE Trans. Inf. Theory, vol. 49, no. 11, pp. 2809–2825, Nov. 2003.

[8] J. Jiang and K. Narayanan, “Algebraic soft-decision decoding of Reed–Solomon codes using bit-level soft information,” IEEE Trans. Inf. Theory, vol. 54, no. 9, pp. 3907–3928, Sep. 2008.

[9] J. Bellorado and A. Kavcic, “A low-complexity method for Chase-type decoding of Reed–Solomon codes,” in Proc. IEEE Int. Symp. Inf. Theory, Seattle, WA, Jul. 2006, pp. 2037–2041.

[10] M. Sudan, “Decoding of Reed–Solomon codes beyond the error-correction bound,” J. Complexity, vol. 13, no. 1, pp. 180–193, 1997.

[11] V. Guruswami and M. Sudan, “Improved decoding of Reed–Solomon and algebraic-geometry codes,” IEEE Trans. Inf. Theory, vol. 45, no. 6, pp. 1757–1767, Sep. 1999.

[12] R. Koetter, “On algebraic decoding of algebraic-geometric and cyclic codes,” Ph.D. dissertation, Dept. Electr. Eng., Linköping Univ., Linköping, Sweden, 1996.

[13] R. R. Nielsen, “List decoding of linear block codes,” Ph.D. dissertation, Dept. Math., Tech. Univ. Denmark, Copenhagen, Denmark, 2001.

[14] K. Lee and M. E. O’Sullivan, “An interpolation algorithm using Gröbner bases for soft-decision decoding of Reed–Solomon codes,” presented at the IEEE Int. Symp. Inf. Theory, Seattle, WA, Jul. 2006.

[15] X. Zhang and J. Zhu, “Low-complexity interpolation architecture for soft-decision Reed–Solomon decoding,” in Proc. IEEE Int. Symp. Circuits Syst., New Orleans, LA, May 2007, pp. 1413–1416.

[16] J. Zhu and X. Zhang, “Efficient interpolation architecture for soft-decision Reed–Solomon decoding,” in Proc. IEEE Workshop Signal Process. Syst., Shanghai, China, Oct. 2007, pp. 663–668.

[17] W. J. Gross, F. R. Kschischang, R. Koetter, and P. Gulak, “A VLSI architecture for interpolation in soft-decision decoding of Reed–Solomon codes,” in Proc. IEEE Workshop Signal Process. Syst., San Diego, CA, Oct. 2002, pp. 39–44.

[18] R. Koetter and A. Vardy, “A complexity reducing transformation in algebraic list decoding of Reed–Solomon codes,” in Proc. IEEE Inf. Theory Workshop, Paris, France, Mar. 2003, pp. 10–13.

[19] A. Ahmed, R. Koetter, and N. Shanbhag, “VLSI architecture for soft-decision decoding of Reed–Solomon codes,” in Proc. IEEE Int. Conf. Commun., Paris, France, Jun. 2004, pp. 2584–2590.

[20] X. Zhang, “Reduced complexity interpolation architecture for soft-decision Reed–Solomon decoding,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 10, pp. 1156–1161, Oct. 2006.

[21] Z. Wang and J. Ma, “High-speed interpolation architecture for soft-decision decoding of Reed–Solomon codes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 9, pp. 937–950, Sep. 2006.

[22] Q. Chen, Z. Wang, and J. Ma, “FPGA implementation of an interpolation processor for soft-decision decoding of Reed–Solomon codes,” in Proc. IEEE Int. Symp. Circuits Syst., New Orleans, LA, May 2007, pp. 2100–2103.

[23] W. J. Gross, F. R. Kschischang, and P. Gulak, “Architecture and implementation of an interpolation processor for soft-decision Reed–Solomon decoding,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 3, pp. 309–318, Mar. 2007.

[24] J. Zhu, X. Zhang, and Z. Wang, “Novel interpolation architecture for low-complexity Chase soft-decision decoding of Reed–Solomon codes,” in Proc. IEEE Int. Symp. Circuits Syst., Seattle, WA, May 2008, pp. 3078–3081.

[25] H. O’Keeffe and P. Fitzpatrick, “Gröbner basis solutions of constrained interpolation problems,” Linear Algebra Appl., vol. 351–352, pp. 533–551, 2002.

[26] R. J. McEliece, “The Guruswami–Sudan decoding algorithm for Reed–Solomon codes,” The Interplanetary Network Progress Report, IPN PR 42-153, pp. 1–60, May 2003.

Jiangli Zhu (S’08) received the B.S. and M.S. degrees in electrical engineering from Zhejiang University, Hangzhou, China, in 2004 and 2006, respectively. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH.

His current research interests include the design of very large-scale integration (VLSI) architectures for communications and digital signal processing, with an emphasis on error-correcting coding.

Xinmiao Zhang (S’04–M’05) received the B.S. and M.S. degrees in electrical engineering from Tianjin University, Tianjin, China, in 1997 and 2000, respectively, and the Ph.D. degree in electrical engineering from the University of Minnesota, Twin Cities, Minneapolis, in 2005.

Since then, she has been with Case Western Reserve University, Cleveland, OH, where she is currently a Timothy E. and Allison L. Schroeder Assistant Professor with the Department of Electrical Engineering and Computer Science. Her current research interests include VLSI architecture design for communications, cryptosystems, and digital signal processing.

Dr. Zhang was a recipient of the Best Paper Award at the ACM Great Lakes Symposium on VLSI in 2004 and First Prize in the Student Paper Contest at the 2004 Asilomar Conference on Signals, Systems, and Computers. She is the coeditor of the book Wireless Security and Cryptography: Specifications and Implementations (CRC, 2007) and the Guest Editor for the Springer MONET journal special issue on “Next Generation Hardware Architectures for Secure Mobile Computing”. She is a member of the Circuits and Systems for Communications and VLSI Systems and Applications Technical Committees of the IEEE Circuits and Systems Society, and the Design and Implementation of Signal Processing Systems Technical Committee of the IEEE Signal Processing Society. She has served on the Technical Program Committees of the ACM Great Lakes Symposium on VLSI and the IEEE Workshops on Signal Processing Systems, and on the review committees of the IEEE International Symposium on Circuits and Systems.

Zhongfeng Wang (M’00–SM’05) received the B.S. and M.S. degrees from the Department of Automation, Tsinghua University, Beijing, China, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, in 2000.

From 1990 to 1995, he was with Beijing Hua-Hai New Technology Development Company, Beijing, as Principal Engineer and Technical Manager. He joined Morphics Technology, Inc. (now a part of Infineon Technology) in 2000 as a Member of the Technical Staff. Two years later, he moved to National Semiconductor Company as a Senior Staff Design Engineer. In 2003, he became an Assistant Professor with the School of Electrical Engineering and Computer Science (EECS), Oregon State University, Corvallis. Since 2007, he has been with Broadcom Corporation as a Senior Principal Scientist. His current research interests include low-power/high-speed VLSI design, specifically VLSI design for digital signal processing, digital communications (including error control coding), and cryptography systems.

Dr. Wang was the recipient of the Best Student Paper Award (first prize) at the 1999 IEEE Workshop on Signal Processing Systems (SiPS 1999) and the IEEE Circuits and Systems Society VLSI Transactions Best Paper Award in 2007. He has served as an Associate Editor (AE) for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS and an AE for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS. He has also served as a Technical Committee Member for numerous IEEE and ACM conferences.