61521101 Rate Matching



Efficient Implementation of Rate Matching for LTE Turbo Codes

Chixiang Ma Beijing Embedded System Key Lab Beijing University of Technology

Beijing, China [email protected]

Ping Lin Beijing Embedded System Key Lab Beijing University of Technology

Beijing, China [email protected]

Abstract—Rate matching is an important component of 3rd Generation Partnership Project (3GPP) Long Term Evolution (LTE) and has a strong impact on its error-correcting performance. Rate matching is part of error-correction coding and can affect the performance of hybrid ARQ (HARQ). The circular buffer rate matching algorithm has several advantages over earlier rate matching algorithms; one of them is that it is simple and flexible for any code rate in LTE while providing excellent performance. In this paper, an efficient method for high-throughput implementation is proposed. The method rearranges the data order in memory to improve throughput and latency. With the modified arrangement, however, the implementation of rate matching is not straightforward. With the aid of the proposed method, rate matching can be implemented with parallel processing and low latency for high data rate applications.

Keywords-3GPP; LTE; rate matching; parallel; contention-free

I. INTRODUCTION

The explosive growth in the number of cell phone users and the increasing demand for broadband wireless access have led the 3rd Generation Partnership Project (3GPP) to develop Long Term Evolution (LTE) as a replacement for the WCDMA-based air interface. The minimum requirements of LTE include packet data support with peak data rates of 300 Mbps in the downlink and 75 Mbps in the uplink, a low maximum latency of 10 ms for the MAC layer round trip delay, and flexible bandwidth scalability. These requirements led to the adoption of OFDM-based modulation and multiple access, MIMO antenna schemes, and adaptive modulation and coding with advanced channel coding, space-time coding, and hybrid ARQ protocols.

The 3GPP working group carried out extensive and rigorous evaluation of advanced channel coding candidates (turbo and LDPC codes) and space-time coding to address the high data rate requirements of LTE. Consequently, it was decided to adopt the same rate-1/3 turbo code as WCDMA, with a new contention-free internal interleaver based on a quadratic permutation polynomial (QPP) to facilitate efficient high-speed turbo decoding [1-3]. The QPP interleaver requires little parameter storage, provides excellent performance, and, most importantly, allows highly flexible parallelization due to its maximum contention-free property.

The 3GPP working group also investigated the performance of the LTE turbo codes in conjunction with different proposed rate matching algorithms. A rate matching algorithm repeats or punctures the bits of a mother codeword to generate a requested number of bits according to a desired code rate, which may differ from the mother code rate of the turbo coder. The rate matching algorithm should also facilitate enhanced hybrid ARQ (HARQ) operation by minimizing the repetition of coded bits (when possible) in subsequent retransmissions of a packet, in order to increase coding gains via incremental redundancy (IR). Because the rate matching algorithm in 3GPP HSDPA is very different from that in 3GPP LTE, rate matching was studied extensively to devise an efficient implementation for LTE.

Figure 1. Rate matching for turbo codes.

In this paper, our main objective is to propose a method to implement rate matching efficiently. Specifically, a parallel implementation is adopted to achieve the high data rates required by applications such as 3GPP LTE, whose peak data rate is 300 Mbps. Section II presents the rate matching algorithm used in LTE [1] and defines the problem to be solved. Section III discusses the memory rearrangement and an efficient implementation based on the modified memory assignment. Section IV addresses the implementation of some elementary blocks derived from Section III. Finally, concluding remarks are presented in Section V.

II. PROBLEM DEFINITION

A. Rate matching for turbo codes

The rate matching for turbo coded transport channels is defined per coded block and consists of interleaving the three information bit streams $d_k^{(0)}$, $d_k^{(1)}$, and $d_k^{(2)}$, followed by the collection of bits and the generation of a circular buffer as depicted in Fig. 1.

978-1-4244-5824-0/$26.00 ©2010 IEEE

The bit stream $d_k^{(0)}$ is interleaved according to sub-block interleaver A, which is defined in the next subsection, with an output sequence defined as $v_0^{(0)}, v_1^{(0)}, v_2^{(0)}, \ldots, v_{K-1}^{(0)}$, where $K$ is also defined in the next subsection.

The bit stream $d_k^{(1)}$ is interleaved according to sub-block interleaver B, which is defined in the next subsection, with an output sequence defined as $v_0^{(1)}, v_1^{(1)}, v_2^{(1)}, \ldots, v_{K-1}^{(1)}$.

The bit stream $d_k^{(2)}$ is interleaved according to sub-block interleaver C, which is defined in the next subsection, with an output sequence defined as $v_0^{(2)}, v_1^{(2)}, v_2^{(2)}, \ldots, v_{K-1}^{(2)}$.

The sequence of bits $e_k$ for transmission is generated according to subclause 5.1.4.1.2 in [1].

B. Subblock Interleaver

The bits input to the block interleaver are denoted by $d_0^{(i)}, d_1^{(i)}, d_2^{(i)}, \ldots, d_{K-1}^{(i)}$, where $K$ is the number of bits. The output bit sequence from the block interleaver is derived as follows:

(1) Assign $C_{subblock} = 32$ to be the number of columns of the matrix. The columns of the matrix are numbered 0, 1, 2, …, $C_{subblock}-1$ from left to right.

(2) Determine the number of rows of the matrix, $R_{subblock}$, by finding the minimum integer $R_{subblock}$ such that

$$K = R_{subblock} \times C_{subblock}$$

The rows of the rectangular matrix are numbered 0, 1, 2, …, $R_{subblock}-1$ from top to bottom.

(3) Write the input bit sequence, i.e. $y_k = d_k^{(i)}$, $k = 0, 1, \ldots, K-1$, into the ($R_{subblock} \times C_{subblock}$) matrix row by row, starting with bit $y_0$ in column 0 of row 0, as shown in Eq. (25).

For $d_k^{(0)}$ and $d_k^{(1)}$:

(4) Perform the inter-column permutation for the matrix based on the pattern $P$ shown in Table I, where $P(j)$ is the original column position of the $j$-th permuted column. After permutation of the columns, the inter-column permuted ($R_{subblock} \times C_{subblock}$) matrix is given by Eq. (26).

(5) The output of sub-block interleaver A or B is the bit sequence read out column by column from the inter-column permuted ($R_{subblock} \times C_{subblock}$) matrix. The bits after sub-block interleaving are denoted by $v_0^{(i)}, v_1^{(i)}, v_2^{(i)}, \ldots, v_{K-1}^{(i)}$, where $v_0^{(i)}$ corresponds to $y_{P(0)}$, $v_1^{(i)}$ corresponds to $y_{P(0)+C_{subblock}}$, and so on. For example, when $K$ is 128, the addresses of the data are tabulated in Table IV and Table V.

For $d_k^{(2)}$:

(4) The output of sub-block interleaver C is denoted by $v_0^{(2)}, v_1^{(2)}, v_2^{(2)}, \ldots, v_{K-1}^{(2)}$, where $v_k^{(2)} = y_{\pi(k)}$ and

$$\pi(k) = \left( P\left( \left\lfloor k / R_{subblock} \right\rfloor \right) + C_{subblock} \times (k \bmod R_{subblock}) + 1 \right) \bmod K \tag{1}$$

The permutation function $P$ is defined in Table I.

TABLE I. INTER-COLUMN PERMUTATION PATTERN

Number of columns $C_{subblock}$: 32

Inter-column permutation pattern <P(0), P(1), …, P($C_{subblock}-1$)>: <0, 16, 8, 24, 4, 20, 12, 28, 2, 18, 10, 26, 6, 22, 14, 30, 1, 17, 9, 25, 5, 21, 13, 29, 3, 19, 11, 27, 7, 23, 15, 31>
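The interleaving steps of this subsection can be sketched in a few lines of Python. This is only a reference model under stated assumptions, not the hardware method proposed in this paper: the function name `subblock_interleave` and the flag `plus_one` are hypothetical, and K is assumed to be a multiple of 32 so that the matrix is filled exactly.

```python
# Inter-column permutation pattern P from Table I (C_subblock = 32).
P = [0, 16, 8, 24, 4, 20, 12, 28, 2, 18, 10, 26, 6, 22, 14, 30,
     1, 17, 9, 25, 5, 21, 13, 29, 3, 19, 11, 27, 7, 23, 15, 31]
C_SUBBLOCK = 32


def subblock_interleave(y, plus_one=False):
    """Return the interleaved sequence v with v[k] = y[pi(k)].

    plus_one=False follows steps (4)-(5) for interleavers A and B
    (column-by-column readout of the permuted matrix).
    plus_one=True follows Eq. (1) for interleaver C.
    Assumption of this sketch: len(y) is a positive multiple of 32.
    """
    K = len(y)
    if K == 0 or K % C_SUBBLOCK != 0:
        raise ValueError("this sketch assumes K is a positive multiple of 32")
    R = K // C_SUBBLOCK          # R_subblock
    offset = 1 if plus_one else 0
    return [y[(P[k // R] + C_SUBBLOCK * (k % R) + offset) % K]
            for k in range(K)]


# K = 128 example: the first outputs walk down permuted columns 0 and 1.
print(subblock_interleave(list(range(128)))[:8])
# [0, 32, 64, 96, 16, 48, 80, 112]
```

The printed indices match the first two substream columns of Table IV read from address 0 to address 3.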

III. IMPLEMENTATION OF RATE MATCHING

One approach to overcoming the high latency of rate matching is to use parallel architectures. Parallel processing can greatly improve throughput and latency. Parallelism may be achieved by dividing the transmitter or receiver stream into substreams and processing the substreams in parallel on multiple processors. Although parallel processing improves throughput and latency, it requires a larger memory and increases hardware complexity and cost. Therefore, parallel schemes that are memory and hardware efficient are needed for practical implementation.

A. Writing Data to Rate Matching Memory

The method in Section II cannot be used directly for parallel processing, because multiple processors would contend for the same memory. The input data therefore need to be arranged in a different order so that parallel processing can be used, which benefits both throughput and latency.

The proposed method rearranges the input data in memory so that the data can be processed by eight parallel processes without contention. With the proposed method, for example, when $K$ is 128, the addresses of the data of sub-block interleaver A or B are tabulated in Table VI and Table VII.

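The address generation function of sub-block interleaver A or B can be expressed, with in_idx as the index of the input data, as follows:

$$\mathrm{wr\_col} = \left( \mathrm{pcol} \times (R_{subblock} + R_a) + \mathrm{row} \right) \bmod C_{subblock} \tag{2}$$

$$\mathrm{wr\_row} = \mathrm{row} \tag{3}$$

where

$$\mathrm{pcol} = P(\mathrm{in\_t\_idx} \bmod C_{subblock}) \tag{4}$$

$$\mathrm{row} = \left\lfloor \mathrm{in\_t\_idx} / C_{subblock} \right\rfloor \tag{5}$$

$$R_a = \begin{cases} 0 & \text{if } R_{subblock} \text{ is odd} \\ 1 & \text{if } R_{subblock} \text{ is even} \end{cases} \tag{6}$$

$$\mathrm{in\_t\_idx} = \begin{cases} \mathrm{in\_idx} & \text{if } \mathrm{in\_idx} < K \\ \mathrm{in\_idx} - K & \text{if } K \le \mathrm{in\_idx} < 2K \\ \mathrm{in\_idx} - 2K & \text{if } 2K \le \mathrm{in\_idx} < 3K \end{cases} \tag{7}$$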

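Similarly, the address generation function of sub-block interleaver C can be expressed as follows:

$$\mathrm{wr\_col} = \left( \mathrm{pcol} \times (R_{subblock} + R_a) + \mathrm{row} + 1 \right) \bmod C_{subblock} \tag{8}$$

$$\mathrm{wr\_row} = \mathrm{row} \tag{9}$$

where

$$\mathrm{pcol} = P(\mathrm{in\_idx} \bmod C_{subblock}) \tag{10}$$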

[Volume 1] 2010 2nd International Conference on Future Computer and Communication V1-705


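$$\mathrm{row} = \left\lfloor \mathrm{in\_idx} / C_{subblock} \right\rfloor \tag{11}$$

$$R_a = \begin{cases} 0 & \text{if } R_{subblock} \text{ is odd} \\ 1 & \text{if } R_{subblock} \text{ is even} \end{cases} \tag{12}$$

Here wr_col is the index of the substream, and wr_row is the intra-substream address.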
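The write-address equations can be checked against the K = 128 example with a short Python sketch. The function names `write_address` and `build_memory` are hypothetical, K is assumed to be a multiple of 32, and for interleaver C the sketch assumes that the index is taken relative to the start of its own stream. One observation behind Eq. (6): $R_{subblock} + R_a$ is always odd, so multiplying by it modulo 32 permutes the column indices and no two inputs of a row land in the same substream.

```python
# Inter-column permutation pattern P from Table I (C_subblock = 32).
P = [0, 16, 8, 24, 4, 20, 12, 28, 2, 18, 10, 26, 6, 22, 14, 30,
     1, 17, 9, 25, 5, 21, 13, 29, 3, 19, 11, 27, 7, 23, 15, 31]
C_SUBBLOCK = 32


def write_address(in_t_idx, K, interleaver_c=False):
    """Return (wr_col, wr_row) per Eqs. (2)-(6), or Eqs. (8)-(12) for C.

    Assumption: in_t_idx is the stream-relative index, 0 <= in_t_idx < K.
    """
    R = K // C_SUBBLOCK                   # R_subblock
    R_a = 0 if R % 2 == 1 else 1          # Eq. (6) / Eq. (12)
    pcol = P[in_t_idx % C_SUBBLOCK]       # Eq. (4) / Eq. (10)
    row = in_t_idx // C_SUBBLOCK          # Eq. (5) / Eq. (11)
    shift = 1 if interleaver_c else 0     # the "+ 1" term of Eq. (8)
    wr_col = (pcol * (R + R_a) + row + shift) % C_SUBBLOCK
    return wr_col, row


def build_memory(K, interleaver_c=False):
    """Fill an R x 32 memory; cell [wr_row][wr_col] stores in_t_idx."""
    R = K // C_SUBBLOCK
    memory = [[None] * C_SUBBLOCK for _ in range(R)]
    for idx in range(K):
        col, row = write_address(idx, K, interleaver_c)
        if memory[row][col] is not None:
            raise RuntimeError("two inputs mapped to the same cell")
        memory[row][col] = idx
    return memory


# Address 0 of substreams 0..7 for K = 128 (compare with Table VI).
print(build_memory(128)[0][:8])
# [0, 22, 11, 28, 5, 16, 14, 27]
```

Each memory row of the sketch corresponds to one intra-substream address, and each column to one substream.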

B. Reading Data from Rate Matching Memory

When data are read from the rate matching memory, they are also read block by block by eight parallel processes, and the block length is variable. As a result, the sub-block interleaver has to be implemented in a flexible way.

In fact, sub-block interleaver A or B can be described directly by a formula. The output of sub-block interleaver A or B is denoted by $v_0^{(i)}, v_1^{(i)}, v_2^{(i)}, \ldots, v_{K-1}^{(i)}$, where $v_k^{(i)} = y_{\pi(k)}$ and

$$\pi(k) = \left( P\left( \left\lfloor k / R_{subblock} \right\rfloor \right) + C_{subblock} \times (k \bmod R_{subblock}) \right) \bmod K \tag{13}$$

The output of sub-block interleaver C is denoted by $v_0^{(2)}, v_1^{(2)}, v_2^{(2)}, \ldots, v_{K-1}^{(2)}$, where $v_k^{(2)} = y_{\pi(k)}$ and

$$\pi(k) = \left( P\left( \left\lfloor k / R_{subblock} \right\rfloor \right) + C_{subblock} \times (k \bmod R_{subblock}) + 1 \right) \bmod K \tag{14}$$

The data must be read from the rate matching memory in output order with a simple and efficient method. The address generation function of sub-block interleaver A, B, or C can be expressed, with out_idx as the index of the output data, as follows:

$$\mathrm{rd\_col} = \left( (\mathrm{out\_t\_idx} \bmod C_{subblock}) + R_a \times \left\lfloor \mathrm{out\_t\_idx} / R_b \right\rfloor \right) \bmod C_{subblock} \tag{15}$$

$$\mathrm{rd\_row} = \mathrm{row} \tag{16}$$

where

$$\mathrm{out\_t\_idx} = \begin{cases} \mathrm{out\_idx} & \text{if } \mathrm{out\_idx} < K \\ \left\lfloor \mathrm{out\_idx} / 2 \right\rfloor & \text{otherwise} \end{cases} \tag{17}$$

$$R_a = \begin{cases} 0 & \text{if } R_{subblock} \text{ is odd} \\ 1 & \text{if } R_{subblock} \text{ is even} \end{cases} \tag{18}$$

$$R_b = \begin{cases} C_{subblock} & \text{if } R_t = 0 \\ R_t & \text{otherwise} \end{cases} \tag{19}$$

$$R_t = R_{subblock} \bmod C_{subblock} \tag{20}$$

$$\mathrm{row} = \mathrm{out\_idx} \bmod R_{subblock} \tag{21}$$

and t_idx is less than $K$ when processing sub-block interleaver A, whereas t_idx is equal to or greater than $K$ when processing sub-block interleaver B or C. Meanwhile, rd_col is the index of the substream, and rd_row is the intra-substream address.
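As a consistency check, the read addresses can be compared with the write addresses in Python. The sketch below is restricted to sub-block interleaver A (out_idx < K) and has been checked only for small cases such as K = 128 and K = 96; the function names are hypothetical, and the way Eq. (15) is coded reflects one reading of that equation rather than a definitive statement of the hardware.

```python
# Inter-column permutation pattern P from Table I.
P = [0, 16, 8, 24, 4, 20, 12, 28, 2, 18, 10, 26, 6, 22, 14, 30,
     1, 17, 9, 25, 5, 21, 13, 29, 3, 19, 11, 27, 7, 23, 15, 31]
C = 32  # C_subblock


def write_address(in_t_idx, R):
    """Eqs. (2)-(6): (wr_col, wr_row) for sub-block interleaver A or B."""
    R_a = 0 if R % 2 == 1 else 1
    row = in_t_idx // C
    return (P[in_t_idx % C] * (R + R_a) + row) % C, row


def read_address(out_idx, K, R):
    """Eqs. (15)-(21) as read in this sketch: (rd_col, rd_row)."""
    out_t_idx = out_idx if out_idx < K else out_idx // 2          # Eq. (17)
    R_a = 0 if R % 2 == 1 else 1                                  # Eq. (18)
    R_t = R % C                                                   # Eq. (20)
    R_b = C if R_t == 0 else R_t                                  # Eq. (19)
    rd_col = ((out_t_idx % C) + R_a * (out_t_idx // R_b)) % C     # Eq. (15)
    rd_row = out_idx % R                                          # Eqs. (16), (21)
    return rd_col, rd_row


def read_matches_interleaver(K):
    """True if reading stream A in output order yields y[pi(k)] of Eq. (13)."""
    R = K // C
    memory = [[None] * C for _ in range(R)]
    for i in range(K):                      # write phase
        col, row = write_address(i, R)
        memory[row][col] = i
    for k in range(K):                      # read phase
        col, row = read_address(k, K, R)
        expected = (P[k // R] + C * (k % R)) % K
        if memory[row][col] != expected:
            return False
    return True


print(read_matches_interleaver(128))  # True
```

For K = 128 the read column reduces to (k + ⌊k/4⌋) mod 32, so consecutive outputs come from neighbouring substreams while the intra-substream address cycles through 0 to 3.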

In Tables IV, V, VI, and VII, the entries are values of in_t_idx. Thus, the relationship between in_t_idx and out_t_idx is needed so that the results can be verified easily. The relationship for sub-block interleaver A or B can be described as follows:

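$$\mathrm{in\_t\_idx} = (\mathrm{out\_t\_idx} \bmod R_{subblock}) \times C_{subblock} + P\left( \left\lfloor \mathrm{out\_t\_idx} / R_{subblock} \right\rfloor \right) \tag{22}$$

Similarly, the relationship for sub-block interleaver C can be described as follows:

$$\mathrm{in\_t\_idx} = \left( (\mathrm{out\_t\_idx} \bmod R_{subblock}) \times C_{subblock} + P\left( \left\lfloor \mathrm{out\_t\_idx} / R_{subblock} \right\rfloor \right) + 1 \right) \bmod K \tag{23}$$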
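As a quick check of Eqs. (22) and (23), consider the K = 128 example, for which $R_{subblock} = 4$ and $C_{subblock} = 32$. The value out_t_idx = 5 is an arbitrary choice made only for illustration:

$$\mathrm{in\_t\_idx} = (5 \bmod 4) \times 32 + P\left( \left\lfloor 5 / 4 \right\rfloor \right) = 32 + P(1) = 32 + 16 = 48$$

This agrees with Table IV: reading substream 0 and then substream 1 from address 0 upward gives 0, 32, 64, 96, 16, 48, so the output with index 5 is input 48. For sub-block interleaver C, Eq. (23) gives $(48 + 1) \bmod 128 = 49$.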

Thus, with the modified memory assignment, rate matching can be implemented in eight parallel processes. Throughput and latency are clearly improved, at the expense of increased hardware complexity and cost.

IV. HARDWARE IMPLEMENTATION

The address computations involve many modulo operations. Modulo 32 is easy to implement: because 32 is a power of two, the result is simply the five least significant bits, regardless of overflow or underflow. However, there are also many modulo-$K$ operations, where $K$ is defined in [1]. The operand of each modulo-$K$ operation is a sum of non-negative terms and is therefore always non-negative. The operand is also less than $2K$, so the elementary modulo block can be implemented simply, as shown in Fig. 2.

In the following, the modulo operation $x \bmod K$ is defined as

$$x \bmod K = \begin{cases} x - K & \text{if } K \le x < 2K \\ x & \text{otherwise} \end{cases} \tag{24}$$

The domain of $x \bmod K$ covers the range $0 \le x \le 2K-1$, which does not include negative values since $x$ is always non-negative.

Figure 2. Structure of elementary block of mod K
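The behaviour of the elementary mod K block of Eq. (24) can be modelled in Python as a single comparison followed by a conditional subtraction. The function name `mod_k` is hypothetical, and the range check is added here only to make the assumed input domain explicit.

```python
def mod_k(x, K):
    """Eq. (24): reduce x modulo K with one compare and one subtract.

    Valid only on the restricted domain 0 <= x < 2K, which is the
    domain assumed for the operands in the text.
    """
    if not 0 <= x < 2 * K:
        raise ValueError("operand outside the assumed range 0 <= x < 2K")
    return x - K if x >= K else x


# On its domain the block agrees with the general modulo operator.
print(all(mod_k(x, 128) == x % 128 for x in range(256)))  # True
```

Because the operand never reaches 2K, no divider or iterative subtraction is needed.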



Another elementary block is the inter-column permutation pattern. This block is a bit reversal: the output of the inter-column permutation pattern is the input with the order of its five bits reversed. As a result, the inter-column permutation pattern can be implemented easily in hardware.
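The bit-reversal property can be confirmed against Table I with a short Python check. The function name `bit_reverse5` is hypothetical; in hardware the same mapping amounts to crossing five wires rather than computing anything.

```python
def bit_reverse5(j):
    """Reverse the order of the five least significant bits of j."""
    r = 0
    for _ in range(5):
        r = (r << 1) | (j & 1)   # shift the lowest bit of j into r
        j >>= 1
    return r


# Inter-column permutation pattern P from Table I.
TABLE_I = [0, 16, 8, 24, 4, 20, 12, 28, 2, 18, 10, 26, 6, 22, 14, 30,
           1, 17, 9, 25, 5, 21, 13, 29, 3, 19, 11, 27, 7, 23, 15, 31]

print([bit_reverse5(j) for j in range(32)] == TABLE_I)  # True
```

Since reversing the bits twice restores the input, the pattern is its own inverse, that is, P(P(j)) = j.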

The equation for rd_col is complex. Therefore, when writing data to the rate matching memory, we store for every column the number of data items that precede that column. rd_col can then be calculated by comparing out_t_idx with the $C_{subblock}$ stored numbers, as shown in Fig. 3.

Figure 3. Structure of elementary block of rd_col computation

TABLE II. COMPLEXITY OF THE PROPOSED RATE MATCHING FOR 3GPP LTE

Clock frequency (MHz)    100       150       200
Area (μm²)               382986    383079    383442

The LTE rate matching algorithm is implemented with the computing elements presented in Section III. The implementation targets an application-specific integrated circuit (ASIC) and is synthesized in a 0.13 μm technology at 1.2 V. The resulting area, in μm², is given in Table II.

TABLE III. LATENCY OF THE PROPOSED RATE MATCHING AND COMMON RATE MATCHING FOR 3GPP LTE

Method                          Latency (cycles)
Method with only one process    15 + K
Method with eight processes     15 + K/8
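To give a feel for the figures in Table III, the two latency expressions can be evaluated for the K = 128 example used in this paper and for a larger block. The value K = 6144 is only an illustrative assumption for a large block and is not taken from the tables above:

$$K = 128:\quad 15 + K = 143 \text{ cycles}, \qquad 15 + K/8 = 31 \text{ cycles}, \qquad 31/143 \approx 0.22$$

$$K = 6144:\quad 15 + K = 6159 \text{ cycles}, \qquad 15 + K/8 = 783 \text{ cycles}, \qquad 783/6159 \approx 0.127$$

The fixed term of 15 cycles dominates for short blocks, so the latency ratio approaches 1/8 only as K grows.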

V. CONCLUSION

In this paper, we addressed the problem of computing rate matching addresses that are linearly incremented, and presented an efficient method to implement the sub-block interleaver of rate matching in eight processes for both the input data and the output data of the rate matching memory. The main idea is to reassign the memory allocation so that parallel processing is contention-free. We use eight parallel processes because the QPP interleaver cannot be implemented in more than eight processes [4]. Because the LTE rate matching design is uniform across code rates, the proposed method can be applied to any code rate in LTE. Structures of three elementary blocks were proposed for simple hardware implementation. As tabulated in Table III, the latency of the proposed method is approximately 1/8 of that with only one process. Finally, the ASIC synthesis report of the implementation was presented, showing that high throughput is obtained at low cost.

REFERENCES

[1] 3GPP, "3GPP TS 36.212 v8.7.0: 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Evolved Universal Terrestrial Radio Access; Multiplexing and channel coding (Release 8)," 3rd Generation Partnership Project, Tech. Rep., May 2009.

[2] J. Sun and O. Y. Takeshita, "Interleavers for turbo codes using permutation polynomials over integer rings," IEEE Trans. Inform. Theory, vol. 51, no. 1, pp. 101–119, Jan. 2005.

[3] O. Y. Takeshita, "On maximum contention-free interleavers and permutation polynomials over integer rings," IEEE Trans. Inform. Theory, vol. 52, no. 3, pp. 1249–1253, Mar. 2006.

[4] C. Ma and P. Lin, "Efficient implementation of quadratic permutation polynomial interleaver in turbo codes," 1st International Conference on Wireless Communication and Signal Processing, Nov. 2009.

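$$\begin{bmatrix} y_0 & y_1 & \cdots & y_{C_{subblock}-1} \\ y_{C_{subblock}} & y_{C_{subblock}+1} & \cdots & y_{2C_{subblock}-1} \\ \vdots & \vdots & \ddots & \vdots \\ y_{(R_{subblock}-1) \times C_{subblock}} & y_{(R_{subblock}-1) \times C_{subblock}+1} & \cdots & y_{(R_{subblock} \times C_{subblock}-1)} \end{bmatrix} \tag{25}$$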



$$\begin{bmatrix} y_{P(0)} & y_{P(1)} & \cdots & y_{P(C_{subblock}-1)} \\ y_{P(0)+C_{subblock}} & y_{P(1)+C_{subblock}} & \cdots & y_{P(C_{subblock}-1)+C_{subblock}} \\ \vdots & \vdots & \ddots & \vdots \\ y_{P(0)+(R_{subblock}-1) \times C_{subblock}} & y_{P(1)+(R_{subblock}-1) \times C_{subblock}} & \cdots & y_{P(C_{subblock}-1)+(R_{subblock}-1) \times C_{subblock}} \end{bmatrix} \tag{26}$$

TABLE IV. ADDRESS OF RATE MATCHING WITH COMMON METHOD FOR THE CASE K IS 128

Substream number:    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Address 0:           0   16    8   24    4   20   12   28    2   18   10   26    6   22   14   30
Address 1:          32   48   40   56   36   52   44   60   34   50   42   58   38   54   46   62
Address 2:          64   80   72   88   68   84   76   92   66   82   74   90   70   86   78   94
Address 3:          96  112  104  120  100  116  108  124   98  114  106  122  102  118  110  126

TABLE V. ADDRESS OF RATE MATCHING WITH COMMON METHOD FOR THE CASE K IS 128

Substream number:   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31
Address 0:           1   17    9   25    5   21   13   29    3   19   11   27    7   23   15   31
Address 1:          33   49   41   57   37   53   45   61   35   51   43   59   39   55   47   63
Address 2:          65   81   73   89   69   85   77   93   67   83   75   91   71   87   79   95
Address 3:          97  113  105  121  101  117  109  125   99  115  107  123  103  119  111  127

TABLE VI. ADDRESS OF RATE MATCHING WITH PROPOSED METHOD FOR THE CASE K IS 128

Substream number:    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Address 0:           0   22   11   28    5   16   14   27    2   21    8   30    7   18   13   24
Address 1:          57   32   54   43   60   37   48   46   59   34   53   40   62   39   50   45
Address 2:          76   89   64   86   75   92   69   80   78   91   66   85   72   94   71   82
Address 3:         115  108  121   96  118  107  124  101  112  110  123   98  117  104  126  103

TABLE VII. ADDRESS OF RATE MATCHING WITH PROPOSED METHOD FOR THE CASE K IS 128

Substream number:   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31
Address 0:           1   23   10   29    4   17   15   26    3   20    9   31    6   19   12   25
Address 1:          56   33   55   42   61   36   49   47   58   35   52   41   63   38   51   44
Address 2:          77   88   65   87   74   93   68   81   79   90   67   84   73   95   70   83
Address 3:         114  109  120   97  119  106  125  100  113  111  122   99  116  105  127  102
