
Low-Complexity Order-64 Integer Cosine Transform Design and Its Application in HEVC

Zhe Chen, Qinglong Han and Wai-Kuen Cham, Senior Member, IEEE

Abstract—In the High Efficiency Video Coding (HEVC) standard, residuals are partitioned into variable-size transform units, and Integer Cosine Transforms (ICTs) of order-4, 8, 16 and 32 are adopted to optimize the coding performance. Order-64 ICT is not used due to its high complexity and limited gain in general coding applications. This paper proposes a set of Low-Complexity ICTs (LCICTs), from order-8 to order-64, which have fully factorizable structures. LCICTs achieve similar coding performance as the state-of-the-art methods on HEVC Test Model 13 while requiring lower complexity, e.g. 71% fewer multiplications than the partially factorizable ICTs in the order-64 case. Moreover, we find that order-64 ICTs achieve around 1% average bitrate reduction at high Quantization Parameters (QP).

Index Terms—ICT, DCT, Transform Coding, HEVC.

I. INTRODUCTION

TRANSFORM coding is an important technique in image and video coding. It can significantly improve the performance of lossy coding systems by decorrelating the residual signals. In transform coding, the Discrete Cosine Transform (DCT) [1] has been widely used due to its strong energy packing ability and existence of fast DCT algorithms.

One disadvantage of the DCT is that some elements in its transform kernel are irrational, which increases the computational cost and may cause mismatch between the encoder and decoder. Integer Cosine Transform (ICT) [2][3][4] was developed to solve this problem. In ICT, integer kernels are constructed to make a good approximation of the DCT. This idea has been adopted in many video coding standards, such as H.264 [5], AVS [6] and AVS2 [7].

Among existing video coding standards, High Efficiency Video Coding (HEVC) [8][9] is one of the most efficient ones. In its transform coding part, residuals are partitioned adaptively into transform units (TUs) in a quad-tree manner. For 8x8, 16x16 and 32x32 TUs, order-8, 16 and 32 ICTs [10] are applied respectively. For 4x4 TUs, either ICT or Discrete Sine Transform [11] is chosen according to the residual type. Even larger TU (64x64) and the corresponding ICT were proposed in the initial draft of the test model under consideration [12], but they were removed later due to the high complexity and limited gain [13]. In 2011, Sugito et al. proposed a set of partially factorizable ICTs (PFTs) [14]. This ICT set includes ICTs from order-4 to order-64, which achieve significant gains on some high-resolution videos. Another order-64 PFT with larger kernel elements [15] was proposed by Chen et al. in 2015. Order-64 PFTs achieve promising coding performance while requiring high complexity.

The authors are with the Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong (e-mail: [email protected]).

In fact, much work has been done on developing fully factorizable ICTs. For order-8 ICT, it is relatively easy to find solutions, and some simple ones were found in [3]. In 2012, based on the LLM DCT algorithm [16], Fong and Cham constructed an order-16 ICT with a fully factorizable structure, named LLM ICT [17]. Later, Hong et al. [18] extended the LLM DCT algorithm to order-32, and developed the corresponding ICT. LLM-based ICTs are fully factorizable, but it is difficult to generalize them to higher orders. In 2017, Fong et al. developed a set of Recursive ICTs (RICTs) [19], from order-4 to order-32, which have the potential to be generalized to arbitrary orders. RICT requires larger integer parameters and more computation than LLM-based ICTs.

In this paper, we propose a set of Low-Complexity ICTs (LCICTs), from order-8 to order-64, which have fully factorizable structures. We implemented them on HEVC Test Model 13 (HM13) [20], and evaluated their performance under different bitrate configurations. We compare the results with those of the PFTs and RICTs. From these results, we also investigate the merits of order-64 ICTs.

The remaining parts of the paper are organized as follows. Section II introduces DCT, ICT and RICT. Section III illustrates the design of LCICTs, and these ICTs are analyzed in Section IV. Experiment results are demonstrated and discussed in Section V, and the conclusions are drawn in Section VI.

II. DCT, ICT AND RICT

In video coding, the Discrete Cosine Transform (DCT) usually means DCT-II [1]. Let $x = [x_0, x_1, ..., x_{N-1}]^t$ and $y = [y_0, y_1, ..., y_{N-1}]^t$ be the input and output vectors respectively. The order-N DCT-II is defined as:

$$ y_k = \alpha_k \sum_{i=0}^{N-1} x_i \cos\left(\frac{k\pi(2i+1)}{2N}\right), \tag{1} $$

where $\alpha_0 = \sqrt{1/N}$, and $\alpha_k = \sqrt{2/N}$ when $k \ge 1$. Alternatively, it can be represented as $y = C_N x$, where $C_N$ is the order-N DCT kernel.
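As an illustration of Eq. (1), the kernel $C_N$ can be built numerically and checked for orthogonality. This is a minimal sketch using NumPy; the function name `dct_kernel` is introduced here for illustration only and is not taken from the paper or from HM.

```python
import numpy as np

def dct_kernel(n: int) -> np.ndarray:
    """Order-n DCT-II kernel C_n of Eq. (1), as an n x n matrix."""
    k = np.arange(n).reshape(-1, 1)   # row (frequency) index
    i = np.arange(n).reshape(1, -1)   # column (sample) index
    cosines = np.cos(k * np.pi * (2 * i + 1) / (2 * n))
    alpha = np.full((n, 1), np.sqrt(2.0 / n))
    alpha[0, 0] = np.sqrt(1.0 / n)    # alpha_0 differs from the other rows
    return alpha * cosines

C8 = dct_kernel(8)
x = np.arange(8, dtype=float)   # arbitrary test input
y = C8 @ x                      # forward transform y = C_N x
x_back = C8.T @ y               # C_N is orthogonal, so its transpose inverts it
```

Most entries of `C8` are irrational, which is the property that motivates the integer approximation discussed next.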

The DCT achieves nearly optimal performance in video coding, but the elements in $C_N$ can be irrational, which may cause mismatch between the encoder and decoder. To solve this problem, we can approximate $C_N$ by $K_N E_N$, where $E_N$ is an integer kernel, and $K_N$ is a diagonal matrix used for normalization. This approximation is called Integer Cosine Transform (ICT) [3]. While talking about ICT, we refer to the integer kernel $E_N$, since the normalization process is trivial.

Copyright © 2018 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TCSVT.2018.2822319

Let $v_i$ be the i-th basis vector of an order-2N ICT $E_{2N}$. It is usually required that vectors $v_0, v_2, ..., v_{2N-2}$ are symmetric while the remaining ones are anti-symmetric. Under this assumption, matrix $E_{2N}$ can be factorized as:

$$ E_{2N} = P_{2N} \begin{bmatrix} E_N & 0 \\ 0 & F_N \end{bmatrix} A_{2N}, \quad \text{where } A_{2N} = \begin{bmatrix} I_N & J_N \\ J_N & -I_N \end{bmatrix}. \tag{2} $$

Here $I_N$ is the order-N identity matrix. Matrix $J_N$ is obtained by flipping $I_N$ horizontally, and $P_{2N}$ is a permutation matrix. Note that $E_N$ is an order-N ICT, and thus can be further factorized in the same way.
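The factorization in Eq. (2) can be checked numerically on the DCT itself, whose even rows are symmetric and odd rows anti-symmetric. The sketch below is illustrative only: `dct_kernel` and `butterfly` are hypothetical helper names, and the DCT stands in for an integer kernel. Since $A_{2N}$ is symmetric and $A_{2N}A_{2N} = 2I_{2N}$, multiplying by $A_{2N}/2$ undoes the butterfly stage.

```python
import numpy as np

def dct_kernel(n: int) -> np.ndarray:
    """Order-n DCT-II kernel (Eq. (1))."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    alpha = np.full((n, 1), np.sqrt(2.0 / n))
    alpha[0, 0] = np.sqrt(1.0 / n)
    return alpha * np.cos(k * np.pi * (2 * i + 1) / (2 * n))

def butterfly(n: int) -> np.ndarray:
    """A_{2N} of Eq. (2) with N = n: [[I, J], [J, -I]]."""
    eye = np.eye(n)
    flip = np.fliplr(eye)
    return np.block([[eye, flip], [flip, -eye]])

N = 4
C2N = dct_kernel(2 * N)
A = butterfly(N)

# Removing the butterfly leaves P_{2N} [E_N 0; 0 F_N]: even rows live in the
# first N columns, odd rows in the last N columns.
G = C2N @ A / 2.0
even_rows, odd_rows = G[0::2], G[1::2]
E_N = even_rows[:, :N]   # lower-order kernel embedded in the even rows
F_N = odd_rows[:, N:]    # odd part, the block factorized further in Eq. (3)
```

For the DCT, `E_N` equals the order-N DCT kernel up to a factor of $1/\sqrt{2}$, which reflects the embedding of a low-order transform inside a high-order one.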

Based on the fast algorithm of the DCT [21], Fong et al. further require that $F_N$ can be factorized recursively [19]:

$$ F_{2N} = S_{2N} \begin{bmatrix} F'_N & 0 \\ 0 & F'_N \end{bmatrix} B_{2N}, \quad \text{where } B_{2N} = \begin{bmatrix} B_{N,0} \\ B_{N,1} \end{bmatrix}, \quad S_{2N} = \begin{bmatrix} D_{N,0} & D_{N,1}J_N \\ -J_N D_{N,1} & J_N D_{N,0} J_N \end{bmatrix}. \tag{3} $$

Here $D_{N,0}$ and $D_{N,1}$ are defined as $\mathrm{diag}(d_{2N,0}, d_{2N,2}, ..., d_{2N,2N-2})$ and $\mathrm{diag}(d_{2N,1}, d_{2N,3}, ..., d_{2N,2N-1})$ respectively, where $\{d_{2N,n}\}$ are the integer parameters. Let $i \in \{0, 1, ..., N-1\}$ and $j \in \{0, 1, ..., 2N-1\}$ be the row and column indices respectively. Matrix $B_{N,k}$ is defined as:

$$ B_{N,k}(i,j) = \begin{cases} (-1)^{ki}, & j = 2i; \\ (-1)^{k(i+1)}, & j = 2i+1; \\ 0, & \text{otherwise}. \end{cases} \tag{4} $$

Matrix $F'_N$ is required to have the same structure as $F_N$, so that it can be further factorized by applying Eq. (3).
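The matrices $B_{2N}$ and $S_{2N}$ of Eqs. (3) and (4) can be assembled directly from their definitions. The following sketch assumes NumPy and uses arbitrary small integers for $\{d_{2N,n}\}$; these values are placeholders, not parameters from the paper, and the function names are hypothetical. It also shows that both matrices have mutually orthogonal rows, with $B_{2N}B_{2N}^t = 2I$ and $S_{2N}S_{2N}^t$ diagonal with entries $d_{2N,2i}^2 + d_{2N,2i+1}^2$.

```python
import numpy as np

def b_block(n: int, k: int) -> np.ndarray:
    """B_{N,k} of Eq. (4): an N x 2N matrix with two nonzero entries per row."""
    block = np.zeros((n, 2 * n), dtype=int)
    for i in range(n):
        block[i, 2 * i] = (-1) ** (k * i)
        block[i, 2 * i + 1] = (-1) ** (k * (i + 1))
    return block

def b_matrix(n: int) -> np.ndarray:
    """B_{2N}: B_{N,0} stacked on top of B_{N,1}."""
    return np.vstack([b_block(n, 0), b_block(n, 1)])

def s_matrix(d) -> np.ndarray:
    """S_{2N} of Eq. (3) from the 2N integer parameters d_{2N,0..2N-1}."""
    d = np.asarray(d, dtype=int)
    n = d.size // 2
    d_even = np.diag(d[0::2])                 # D_{N,0}
    d_odd = np.diag(d[1::2])                  # D_{N,1}
    flip = np.fliplr(np.eye(n, dtype=int))    # J_N
    return np.block([[d_even, d_odd @ flip],
                     [-flip @ d_odd, flip @ d_even @ flip]])

B8 = b_matrix(4)
S8 = s_matrix([5, 1, 4, 2, 3, 3, 2, 4])   # placeholder parameters
gram_B = B8 @ B8.T
gram_S = S8 @ S8.T
```

The diagonal of `gram_S` is where the sums $d_{2N,2i}^2 + d_{2N,2i+1}^2$ appear, which is why such sums show up in the row-norm constraints of Section III.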

The ICT satisfying this requirement is called Recursive ICT (RICT), which is fully factorizable and has the potential to be generalized to arbitrary orders. While RICT has these advantages, its complexity can be further reduced by introducing other ICT structures. In this paper, we propose a new order-64 ICT by using lower-order ICTs with similar coding gains but lower complexity. The proposed ICTs are called Low-Complexity ICTs (LCICTs).

III. TRANSFORM DESIGN

A. Transform Structure and Related Notations

In this subsection, we introduce the LCICT structure and some related notations, which are used in the remaining parts of this paper.

According to Eq. (2), order-4, 8, 16 and 32 ICTs can be embedded into an order-64 ICT. Fig. 1 shows the resulting ICT structure, where low-order ICTs can be implemented by reusing the circuit areas of high-order ICTs [10]. We adopt this structure in the LCICT design, and follow Eq. (3) to factorize $\{F_{2N}\}$ ($N \ge 4$) into $\{S_{2N}\}$, $\{F'_N\}$ and $\{B_{2N}\}$. Unlike RICT, we do not require that $\{F'_N\}$ can be recursively factorized by applying Eq. (3). This relaxation allows us to improve RICT by introducing other ICT structures. Once the forward transforms are defined, their transposes are used as the inverse transforms.

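[Figure 1: block diagram from input to output with modules $A_{64}$, $A_{32}$, $A_{16}$, $A_8$, $F_{32}$, $F_{16}$, $F_8$, $F_4$ and the embedded $E_4$, $E_8$, $E_{16}$, $E_{32}$.]

Fig. 1. General structure of order-64 ICT. The blocks are modules performing matrix multiplications, with corresponding matrices $\{A_N\}$, $\{E_N\}$ and $\{F_N\}$. Matrices $E_4$, $E_8$, $E_{16}$ and $E_{32}$ are the order-4, 8, 16 and 32 ICTs respectively.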

In HEVC, the transform kernel is implemented as $T_{2N} = k_{2N}^{-1} E_{2N}$, where $E_{2N}$ is the integer kernel, and $k_{2N} = 2^{n/2}$ ($n \in \mathbb{Z}^+$) is a scalar for normalization. In this paper, we use $\bar{E}_N$ and $\bar{F}_N$ to denote the normalized $E_N$ and $F_N$ (i.e. with scaling factor $\sqrt{2}k_{2N}^{-1}$). Note that $T_{2N}$ is orthogonal if and only if both $\bar{E}_N$ and $\bar{F}_N$ are orthogonal.

We use transform coding gain [22] and reconstruction error [4] to evaluate the normalized transform kernels. Let $T_{2N}$ be an order-2N transform kernel, and its input signals be the samples generated from a 1-D first-order Markov source with zero mean, unit variance and adjacent element correlation $\rho$. The transform coding gain of $T_{2N}$ can be defined as:

$$ G_{tc}(T_{2N}) = -\frac{5}{N}\log_{10}\left(\prod_{i=0}^{2N-1}\sigma_i^2\right), \tag{5} $$

where $\sigma_0^2, \sigma_1^2, ..., \sigma_{2N-1}^2$ are the variances of the transformed coefficients $y_0, y_1, ..., y_{2N-1}$ respectively. This expression can be separated as $G_{tc}(T_{2N}) = -\frac{5}{N}(G_E(E_N) + G_F(F_N))$. Functions $G_E(E_N)$ and $G_F(F_N)$ are determined by $E_N$ and $F_N$ respectively, and they are defined as:

$$ G_E(E_N) = \log_{10}\left(\prod_{i=0}^{N-1}\sigma_{2i}^2\right), \quad G_F(F_N) = \log_{10}\left(\prod_{i=0}^{N-1}\sigma_{2i+1}^2\right). \tag{6} $$

In our experiments, we average the $G_{tc}$ values obtained with $\rho \in \{0.6, 0.7, 0.8, 0.9\}$, which are typical values of the adjacent element correlation of the residual signals after intra or inter prediction.
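Eq. (5) can be evaluated numerically once the input covariance is written out. The sketch below assumes the first-order Markov source has covariance $R(i,j) = \rho^{|i-j|}$, so that the coefficient variances are the diagonal of $T R T^t$. It uses the order-8 DCT as the kernel purely for illustration; all function names are hypothetical.

```python
import numpy as np

def dct_kernel(n: int) -> np.ndarray:
    """Order-n DCT-II kernel (Eq. (1))."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    alpha = np.full((n, 1), np.sqrt(2.0 / n))
    alpha[0, 0] = np.sqrt(1.0 / n)
    return alpha * np.cos(k * np.pi * (2 * i + 1) / (2 * n))

def markov_covariance(order: int, rho: float) -> np.ndarray:
    """Covariance of a unit-variance first-order Markov source."""
    idx = np.arange(order)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def coding_gain_db(kernel: np.ndarray, rho: float) -> float:
    """Eq. (5); the kernel order is 2N, so -5/N equals -10/order."""
    order = kernel.shape[0]
    variances = np.diag(kernel @ markov_covariance(order, rho) @ kernel.T)
    return float(-(10.0 / order) * np.sum(np.log10(variances)))

def averaged_gain_db(kernel: np.ndarray, rhos=(0.6, 0.7, 0.8, 0.9)) -> float:
    """Average of Eq. (5) over the correlation values used in the experiments."""
    return float(np.mean([coding_gain_db(kernel, r) for r in rhos]))

gain_dct8 = averaged_gain_db(dct_kernel(8))
gain_identity = averaged_gain_db(np.eye(8))   # no transform: 0 dB
```

An identity kernel leaves every variance at 1 and therefore yields 0 dB, which gives a simple sanity check for an implementation of Eq. (5).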

Transform coding gain models the reduction of quantization error after introducing an orthogonal transform. However, $T_{2N}$ may be non-orthogonal, resulting in additional distortion. This distortion can be modelled by the reconstruction error [4]:

$$ \sigma_r^2(T_{2N}) = \frac{1}{2N}\sum_{i=0}^{2N-1}\sum_{j=0}^{2N-1} M_{2N}(i,j)\,\rho^{|i-j|}, \tag{7} $$

where $M_{2N} = (T_{2N}^t T_{2N} - I_{2N})^2$. Similar to $G_{tc}$, we use the averaged $\sigma_r^2$ to estimate the distortion. The good coding performance of $T_{2N}$ can usually be guaranteed by a high $G_{tc}$ and a negligible $\sigma_r^2$.
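A small numerical sketch of Eq. (7) follows. It assumes that the square in $M_{2N}$ is a matrix square, which is an interpretation made here rather than something the text states explicitly; the function name is hypothetical.

```python
import numpy as np

def reconstruction_error(kernel: np.ndarray, rho: float) -> float:
    """Eq. (7) for a square kernel T of order 2N = kernel.shape[0]."""
    order = kernel.shape[0]
    deviation = kernel.T @ kernel - np.eye(order)   # T^t T - I
    m = deviation @ deviation                       # assumed: matrix square
    idx = np.arange(order)
    weights = rho ** np.abs(idx[:, None] - idx[None, :])
    return float(np.sum(m * weights) / order)

err_orthogonal = reconstruction_error(np.eye(8), 0.9)      # orthogonal kernel
err_scaled = reconstruction_error(1.01 * np.eye(8), 0.9)   # 1% norm error
```

For an orthogonal kernel the error is exactly zero. For a kernel whose rows are all scaled by $c$, the error reduces to $(c^2-1)^2$, which shows how quickly small norm deviations become negligible.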

B. Order-8 & 16 LCICT Design

The order-8 LCICT structure is represented as $E_8$ in Fig. 3. This structure is similar to the one in [3], and it has the same property as RICT: matrix $F_4$ is orthogonal if all its row vectors have unit $L_2$ norm, i.e. $\|\sqrt{2}k_8^{-1}u_i\|_2^2 = 1$ ($i = 0, 1, 2, 3$), where $u_i$ is the i-th row vector of $F_4$. In this paper, we approximate these constraints by $|1 - \|\sqrt{2}k_8^{-1}u_i\|_2^2| < \alpha$, which are equivalent to the following conditions.

a) $|1 - 2k_8^{-2}(b_0^2 + 2b_1^2)(r_0^2 + r_1^2)| < \alpha$.

b) $|1 - 4k_8^{-2}(b_0^2 + 2b_1^2)(r_2^2 + r_3^2)| < \alpha$.

Here $\{b_n\}$ and $\{r_n\}$ are the integer parameters in $F_4$, and $\alpha$ is set as 0.005 in our experiments. Since $|r_j| \ge 1$ ($j = 0, 1, 2, 3$), these conditions imply constraint $8k_8^{-2}(b_0^2 + 2b_1^2) < 1 + \alpha$.

[Figure 2: plot of $G_{tc}$ (dB) versus $\log_2(k_8)$ for the DCT, LCICT and RICT.]

Fig. 2. Comparison of coding gain between the order-8 LCICT and RICT structures. The coding gain of the DCT serves as a reference.

Given a fixed $E_4$, matrix $F_4$ is determined by finding the parameter set $\theta = [b_0, b_1, r_0, r_1, r_2, r_3]$ which minimizes $G_F(F_4)$ (see Eq. (6)) under conditions a) and b). The search is done by first collecting all the candidate $\{b_n\}$ satisfying $8k_8^{-2}(b_0^2 + 2b_1^2) < 1 + \alpha$, and then finding the optimal $\{r_n\}$ corresponding to each candidate. Note that the optimal $[r_0, r_1]$ and $[r_2, r_3]$ can be determined independently, e.g. $[r_0, r_1]$ can be determined by finding the one which minimizes $\sigma_1^2\sigma_7^2$ among all the ones satisfying condition a). There may be multiple $\theta$ with $G_F(F_4)$ values close to the minimum; in this case we choose the one with relatively small parameters.
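The feasibility part of this search can be sketched as follows. The code only enumerates integer parameters that satisfy the norm conditions with $k_8 = 2^8\sqrt{8}$; it does not evaluate $G_F(F_4)$, since that requires the signal-flow structure of Fig. 3. Restricting the search to positive integers and the function names are assumptions made for this sketch, and the pair $(b_0, b_1) = (67, 53)$ is used purely as an illustrative input.

```python
import math

ALPHA = 0.005
K8_SQUARED = (2 ** 8) ** 2 * 8   # k_8^2 for k_8 = 2^8 * sqrt(8)

def candidate_b(k_sq=K8_SQUARED, alpha=ALPHA):
    """Positive integer pairs (b0, b1) with 8 (b0^2 + 2 b1^2) / k^2 < 1 + alpha."""
    limit = (1 + alpha) * k_sq / 8
    pairs = []
    for b0 in range(1, math.isqrt(int(limit)) + 1):
        for b1 in range(1, math.isqrt(int(limit / 2)) + 1):
            if b0 * b0 + 2 * b1 * b1 < limit:
                pairs.append((b0, b1))
    return pairs

def feasible_r(w, factor, k_sq=K8_SQUARED, alpha=ALPHA):
    """Positive integer pairs (r, r') with |1 - factor * w * (r^2 + r'^2) / k^2| < alpha.

    factor is 2 for condition a) and 4 for condition b); w = b0^2 + 2 b1^2.
    """
    r_max = math.isqrt(int((1 + alpha) * k_sq / (factor * w))) + 1
    return [(r0, r1)
            for r0 in range(1, r_max + 1)
            for r1 in range(1, r_max + 1)
            if abs(1 - factor * w * (r0 * r0 + r1 * r1) / k_sq) < alpha]

w_example = 67 ** 2 + 2 * 53 ** 2            # illustrative (b0, b1) = (67, 53)
pairs_a = feasible_r(w_example, factor=2)    # candidates for [r0, r1]
pairs_b = feasible_r(w_example, factor=4)    # candidates for [r2, r3]
```

Because conditions a) and b) involve disjoint pairs of $\{r_n\}$, the two candidate lists can be searched separately, which mirrors the independence noted in the text.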

The advantage of the order-8 LCICT structure is illustrated by an experiment. In this experiment, we find the highest coding gains the order-8 LCICT and RICT structures achieve under constraints $|1 - \|\sqrt{2}k_8^{-1}u_i\|_2^2| < \alpha$, and this process is repeated with different $k_8$. Since LCICT and RICT share the same $E_4$ structure, we fix $E_4$ as the order-4 DCT in the experiment, and directly generalize the searching strategy mentioned above to find the maximum coding gains of RICT. As we can see in Fig. 2, when $k_8 \ge 2^9$, the LCICT structure achieves higher coding gains than RICT and DCT.

In our design, we set $E_4$ as the one in HEVC, and find the parameters in $F_4$ under condition $k_8 = 2^8\sqrt{8}$.

Matrix $F_8$ is constructed by Eq. (3), where $F'_4$ is designed by reusing the structure of $F_4$ (see Fig. 3). Let $u_i$ be the i-th row vector of $F_8$. We approximate the orthogonality constraint on $F_8$ by $|1 - \|\sqrt{2}k_{16}^{-1}u_i\|_2^2| < \beta$ ($i = 0, 1, ..., 7$), which are equivalent to the following conditions.

a) $n_{4,i} = |1 - 2w_{2,i}k_{16}^{-2}((r'_{2i})^2 + (r'_{1+2i})^2)\,p_{4,i}| < \beta$.

b) $n_{4,3-i} = |1 - 2w_{2,i}k_{16}^{-2}((r'_{2i})^2 + (r'_{1+2i})^2)\,p_{4,3-i}| < \beta$.

Here $i \in \{0, 1\}$, and $w_{2,i}$, $p_{4,i}$ are defined as $2^{i+1}((b'_0)^2 + 2(b'_1)^2)$ and $d_{8,2i}^2 + d_{8,1+2i}^2$ respectively. Threshold $\beta$ is set as 0.01 in our experiments.

Given the order-8 LCICT ($E_8$), we set $[b'_0, b'_1] = [7, 5]$ ($\approx \frac{1}{10}[b_0, b_1]$), and find the remaining parameters $\{r'_n\}$ and $\{d_{8,n}\}$ which minimize $G_F(F_8)$ under conditions a) and b). This problem can be divided into two independent sub-problems, and each sub-problem is to find the parameter set $\theta_{2,i} = [r'_{2i}, r'_{1+2i}, d_{8,2i}, d_{8,1+2i}, d_{8,6-2i}, d_{8,7-2i}]$ ($i = 0, 1$) which minimizes $f = \sigma_{1+2i}^2\sigma_{7-2i}^2\sigma_{9+2i}^2\sigma_{15-2i}^2$ under conditions $n_{4,i} < \beta$ and $n_{4,3-i} < \beta$. Each $\theta_{2,i}$ is determined in a similar way as the $\theta$ in $F_4$.

We determine $F_8$ under condition $k_{16} = 2^{10}\sqrt{16}$. The resulting integer parameters are smaller than those in RICT while keeping a similar coding gain.

C. Order-32 & 64 LCICT Design

We follow Eq. (3) to construct $F_{16}$. In RICT, matrix $F'_8$ is recursively factorized by using Eq. (3), resulting in relatively high computational complexity. We adopt a more efficient structure [17] for $F'_8$, which is shown in Fig. 3. Let $q_j = c_{2j}^2 + c_{1+2j}^2$ ($j = 0, 1, 2, 3$). Matrix $F_{16}$ is orthogonal if $q_0 = q_1 = q_2 = q_3$ and $\|\sqrt{2}k_{32}^{-1}u_i\|_2^2 = 1$ ($i = 0, 1, ..., 15$), where $u_i$ is the i-th row vector of $F_{16}$. These constraints implicitly require large parameters or a sacrifice of the coding gain, so we approximate them by the following conditions.

a) $\max_{i,j}|q_i - q_j| < \frac{\tau}{4}\sum_i q_i$.

b) $|1 - \|\sqrt{2}k_{32}^{-1}u_i\|_2^2| < \beta$ ($i = 0, 1, ..., 15$).

Threshold $\tau$ is set as 0.005 in our experiments.

Given the order-16 LCICT ($E_{16}$), we find the parameters $\{c_n\}$, $\{s_n\}$ and $\{d_{16,n}\}$ which minimize $G_F(F_{16})$ among the candidates satisfying conditions a) and b). These candidates are generated by the following steps.

1) Collect the candidates of $\{c_n\}$ satisfying condition a). To reduce the search space, we constrain $\{c_n\}$ to be close to the scaled DCT parameters, i.e. $|c_i - s\phi_i| \le 2$, where $\phi_i$ is the corresponding DCT parameter [16], and $s$ is a scaling factor taken from $\{1, 2, ..., 100\}$.

2) For each candidate $\{c_n\}$, set $[s_0, s_1, s_2]$ as $[2, 1, 2]$ and $[4, 2, 5]$ (as suggested in [17]), and determine the $\{d_{16,n}\}$ corresponding to each $\{s_n\}$ by finding the one which minimizes $G_F(F_{16})$ under condition b).

In step 2), each $\{d_{16,2i}, d_{16,1+2i}\}$ ($i = 0, 1, ..., 7$) can be found independently by minimizing $\sigma_{1+2i}^2\sigma_{31-2i}^2$ under condition $|1 - \|\sqrt{2}k_{32}^{-1}u_i\|_2^2| < \beta$. This process can be simply done by an exhaustive search. We determine $F_{16}$ under condition $k_{32} = 2^{13}\sqrt{32}$, and the resulting order-32 LCICT is shown in Fig. 3. Note that intermediate divisions (represented by $t_3$) are required to guarantee a 32-bit dynamic range.
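The two tests used in step 1) can be sketched as below: the window $|c_i - s\phi_i| \le 2$ around a scaled DCT parameter and condition a) on the pair sums $q_j$. The function names and the sample parameter vectors are hypothetical and chosen only to exercise the checks; they are not values from the paper.

```python
import math

def integer_window(phi: float, s: int, radius: int = 2):
    """All integers c with |c - s * phi| <= radius."""
    centre = s * phi
    return list(range(math.ceil(centre - radius), math.floor(centre + radius) + 1))

def pair_sums(c):
    """q_j = c_{2j}^2 + c_{1+2j}^2 for consecutive parameter pairs."""
    return [c[2 * j] ** 2 + c[2 * j + 1] ** 2 for j in range(len(c) // 2)]

def satisfies_condition_a(c, tau: float = 0.005) -> bool:
    """Condition a): max_{i,j} |q_i - q_j| < (tau / 4) * sum_i q_i."""
    q = pair_sums(c)
    spread = max(q) - min(q)   # equals the largest pairwise difference
    return spread < tau / 4 * sum(q)

# Hypothetical parameter vectors: the second and third differ in one entry by 1.
balanced = [3, 4, 5, 0, 4, 3, 0, 5]            # every q_j equals 25
small_unbalanced = [3, 4, 5, 1, 4, 3, 0, 5]    # q = 25, 26, 25, 25
large_unbalanced = [30, 40, 50, 1, 40, 30, 0, 50]
```

The same absolute imbalance is rejected for small parameters but tolerated for large ones, which illustrates the trade-off between parameter size and orthogonality that the relaxed condition a) is meant to manage.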

Matrix $F_{32}$ is constructed by Eq. (3), where $F'_{16}$ is designed by reusing matrix $F'_8$:

$$ F'_{16} = S'_{16}\begin{bmatrix} F'_8 & 0 \\ 0 & F'_8 \end{bmatrix} B_{16}. \tag{8} $$

Here $S'_{16}$ is determined by parameters $\{d'_{16,n}\}$ (see Fig. 3). Let $u_i$ be the i-th row vector of $F_{32}$. We approximate the orthogonality constraint on $F_{32}$ by $|1 - \|\sqrt{2}k_{64}^{-1}u_i\|_2^2| < \beta$ ($i = 0, 1, ..., 31$), which are equivalent to the following conditions.

a) $n_{16,i} = |1 - 8w_{8,i}k_{64}^{-2}((d'_{16,2i})^2 + (d'_{16,1+2i})^2)\,p_{16,i}| < \beta$.



[Figure 3: signal flow diagram with modules $E_4$, $E_8$, $E_{16}$, $F'_4$ and $F'_8$; the multiplier values are listed in the figure.]

Fig. 3. Signal flow diagram of the order-32 LCICT (or $E_{32}$). The inputs and outputs of each ICT ($E_4$, $E_8$, $E_{16}$ and $E_{32}$) are represented by $\{x_n\}$ and $\{y_n\}$ respectively, and $\{z_n\}$ denotes the intermediate results. Dashed branches denote subtractions. The values of parameters (multipliers) are listed on the right, where $2^n$ and $1/2^n$ denote that the corresponding multiplications are implemented as bitwise shifts.

TABLE I
COMPARISON OF NUMBERS OF ADDITIONS AND MULTIPLICATIONS

Order    LCICT          SPFT           RICT           LLMT
         +      ×       +      ×       +      ×       +      ×
8        26     20      28     22      28     22      26     18
16       78     64      100    86      82     66      72     46
32       202    152     372    342     228    182     186    122
64       514    390     1428   1366    580    474     N/A    N/A

b) $n_{16,15-i} = |1 - 8w_{8,i}k_{64}^{-2}((d'_{16,2i})^2 + (d'_{16,1+2i})^2)\,p_{16,15-i}| < \beta$.

Here $i \in \{0, 1, ..., 7\}$, symbol $w_{8,i}$ denotes the squared $L_2$ norm of the i-th row vector of $F'_8$, and $p_{16,i} = d_{32,2i}^2 + d_{32,1+2i}^2$.

Given the order-32 LCICT ($E_{32}$), we find the parameters $\{d'_{16,n}\}$ and $\{d_{32,n}\}$ which minimize $G_F(F_{32})$ under conditions a) and b). This problem can be divided into several independent sub-problems, and each sub-problem is to find the parameter set $\theta_{8,i} = [d'_{16,2i}, d'_{16,1+2i}, d_{32,2i}, d_{32,1+2i}, d_{32,30-2i}, d_{32,31-2i}]$ ($i = 0, 1, ..., 7$) which minimizes $f = \sigma_{1+2i}^2\sigma_{31-2i}^2\sigma_{33+2i}^2\sigma_{63-2i}^2$ under conditions $n_{16,i} < \beta$ and $n_{16,15-i} < \beta$. Each $\theta_{8,i}$ is determined in a similar way as the $\theta$ in $F_4$. We determine $F_{32}$ under condition $k_{64} = 2^{18}\sqrt{64}$, and the resulting order-64 LCICT is shown in Fig. 4.

IV. TRANSFORM ANALYSIS

In this section, we estimate the complexity, coding gains and reconstruction errors of LCICTs. The estimation results are compared with those of Sugito’s PFTs (SPFTs) [14], Chen’s order-64 PFT (CPFT) [15], RICTs [19] and Hong’s LLM-based ICTs (LLMTs) [18].

The complexity of a transform can be measured by the numbers of additions and multiplications, which are directly related to the computation time and hardware complexity. The estimation results are shown in Table I. CPFT has the same operation numbers as the order-64 SPFT, and thus is not shown here. We extend the RICTs in [19] by using Eq. (3) and (8) to obtain an order-64 RICT, where parameters $\{d'_{16,n}\}$ and $\{d_{32,n}\}$ are determined in the same way as the ones in LCICT. These parameters are shown in Table III.

TABLE II
COMPARISON OF TRANSFORM CODING GAIN ($G_{tc}$) AND RECONSTRUCTION ERROR ($\sigma_r^2$)

Metric                        Order   LCICT   SPFT    CPFT    RICT    LLMT
$G_{tc}$                      8       3.569   3.563   N/A     3.563   3.563
                              16      3.827   3.827   N/A     3.829   3.829
                              32      3.969   3.970   N/A     3.966   3.971
                              64      4.045   4.045   4.048   4.044   N/A
$\sigma_r^2$ ($10^{-4}$)      8       0.018   0.016   N/A     0.001   0.075
                              16      0.222   0.027   N/A     0.004   0.420
                              32      0.228   0.039   N/A     0.164   0.499
                              64      0.272   0.150   0.018   0.143   N/A

LCICTs require fewer operations than SPFTs and RICTs, especially in the order-64 case, where LCICT requires 71% fewer multiplications than SPFT. Moreover, LCICTs require at most 8 bits (including the sign bit) to represent the integer parameters, while RICTs require up to 14 bits. LLMTs have the lowest complexity, but it is difficult to generalize their structures to order-64. All the transform sets here except CPFT have high circuit reusability, since they adopt the ICT structure depicted in Fig. 1.
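The relative savings quoted above follow directly from the operation counts in Table I. The short sketch below reproduces the arithmetic for the order-64 case; the dictionary layout and function name are illustrative only.

```python
# (additions, multiplications) for the order-64 transforms, copied from Table I.
ORDER64_OPS = {
    "LCICT": (514, 390),
    "SPFT": (1428, 1366),
    "RICT": (580, 474),
}

def saving_percent(ours: int, reference: int) -> float:
    """Percentage by which `ours` is smaller than `reference`."""
    return 100.0 * (1.0 - ours / reference)

lcict_add, lcict_mul = ORDER64_OPS["LCICT"]
mul_saving_vs_spft = saving_percent(lcict_mul, ORDER64_OPS["SPFT"][1])
add_saving_vs_spft = saving_percent(lcict_add, ORDER64_OPS["SPFT"][0])
mul_saving_vs_rict = saving_percent(lcict_mul, ORDER64_OPS["RICT"][1])
```

With these counts the multiplication saving against SPFT rounds to 71%, the addition saving to 64%, and the multiplication saving against the order-64 RICT to 18%.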

While having the advantages mentioned above, LCICTs achieve similar coding gains as other ICTs, as shown in Table II.



[Figure 4: signal flow diagram with modules $A_{64}$, $E_{32}$, $F'_{16}$ and $F'_8$; the multiplier values are listed in the figure.]

Fig. 4. Signal flow diagram of the order-64 LCICT. The inputs and outputs are represented by $\{x_n\}$ and $\{y_n\}$ respectively. Dashed branches denote subtractions. The values of parameters (multipliers) are listed on the right, where $1/2^n$ denotes that the multiplication is implemented as bitwise shifts.

TABLE III
PARAMETERS IN $F_{32}$ IN THE ORDER-64 RICT.

n               0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
$d'_{16,n}$     42   2    20   3    20   5    23   8    17   8    15   9    24   18   11   10
$d_{32,n}$      32   1    90   7    89   12   70   12   90   20   102  28   58   19   84   33
$d_{32,n+16}$   83   36   54   28   92   53   77   50   58   41   71   55   70   57   24   21

All the transforms here have very small reconstruction errors ($\sigma_r^2 < 10^{-4}$).

V. EXPERIMENT RESULTS

We implemented LCICTs, SPFTs [14], CPFT [15] and RICTs [19] respectively on HM13. Here RICTs include the order-64 RICT introduced in Section IV. The inverse transforms are set as the transpose of the corresponding forward ones. These ICTs were evaluated at:

• different coding modes: All Intra (AI), Random Access (RA) and Low Delay B (LDB);
• different QP ranges: normal QP (22, 27, 32, 37), high QP (36, 42, 47, 51) and low QP (1, 5, 9, 13).

The test videos are divided into the following classes: 1600p, 1080p, WVGA, WQVGA, 720p and screen content, which are labelled as Class A-F respectively. According to the common test conditions [23], 720p and 1600p videos were not tested at RA and LDB modes respectively. We evaluated the coding performance with BD-rates [23], [24], where HM13 served as the anchor. The evaluation results are shown in Tables IV and V, where negative numbers denote percentages of bitrate reduction. Due to the page limit, we show the averaged BD-rates of chroma components instead of individual ones.

TABLE IV
BD-RATES AT LOW DELAY B MODE

BD-rate (%)            LCICT           SPFT            CPFT            RICT
                       Y      U & V    Y      U & V    Y      U & V    Y      U & V
Normal QP   Class B    -0.8   -1.9     -0.8   -1.7     -0.8   -1.7     -0.8   -1.8
            Class C    -0.6   -0.5     -0.6   -0.5     -0.7   -0.5     -0.6   -0.5
            Class D    -0.1   -0.2     -0.1   -0.2     -0.2   -0.2     -0.1   0.1
            Class E    -0.8   -2.8     -0.6   -3.0     -0.7   -2.8     -0.8   -2.6
            Class F    -0.8   -0.6     -0.8   -0.9     -0.8   -0.8     -0.7   -0.6
            Overall    -0.6   -1.2     -0.6   -1.2     -0.6   -1.2     -0.6   -1.0
High QP     Class B    -1.8   -9.2     -1.8   -9.2     -1.7   -9.1     -1.6   -9.4
            Class C    -0.9   -0.6     -0.7   -0.9     -0.7   -1.9     -0.7   -0.9
            Class D    -0.1   -3.0     -0.1   -0.6     0.0    -1.3     0.0    -1.5
            Class E    -1.2   -9.3     -1.1   -8.6     -1.0   -8.6     -1.2   -9.6
            Class F    -1.3   -3.8     -1.2   -4.1     -1.2   -4.5     -1.2   -4.7
            Overall    -1.1   -5.2     -1.0   -4.7     -1.0   -5.1     -0.9   -5.2
Low QP      Overall    0.1    -0.1     0.0    -0.1     -0.1   -0.7     0.1    -0.1
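For readers unfamiliar with the BD-rate metric [24], a minimal sketch is given below. It assumes the classic variant that fits a cubic polynomial to the logarithm of the bitrate as a function of PSNR and averages the difference over the overlapping PSNR range; the exact procedure used with HM13 may differ. The rate and PSNR points are synthetic and serve only to exercise the function, whose name is hypothetical.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test) -> float:
    """Average bitrate difference (%) of the test curve relative to the anchor."""
    psnr_a = np.asarray(psnr_anchor, dtype=float)
    psnr_t = np.asarray(psnr_test, dtype=float)
    low = max(psnr_a.min(), psnr_t.min())
    high = min(psnr_a.max(), psnr_t.max())
    # Shift PSNR by `low` to keep the cubic fit well conditioned.
    fit_a = np.polyfit(psnr_a - low, np.log10(rate_anchor), 3)
    fit_t = np.polyfit(psnr_t - low, np.log10(rate_test), 3)
    int_a = np.polyint(fit_a)
    int_t = np.polyint(fit_t)
    span = high - low
    area_a = np.polyval(int_a, span) - np.polyval(int_a, 0.0)
    area_t = np.polyval(int_t, span) - np.polyval(int_t, 0.0)
    avg_log_diff = (area_t - area_a) / span
    return float((10.0 ** avg_log_diff - 1.0) * 100.0)

# Synthetic rate-distortion points (four QPs), not measured data.
psnr = [32.0, 35.0, 38.0, 41.0]
anchor_rate = [1000.0, 2000.0, 4000.0, 8000.0]
test_rate = [0.9 * r for r in anchor_rate]   # 10% fewer bits at equal PSNR

same_curve = bd_rate(anchor_rate, psnr, anchor_rate, psnr)
ten_percent_saving = bd_rate(anchor_rate, psnr, test_rate, psnr)
```

A test curve that needs 10% fewer bits at every PSNR yields a BD-rate of about -10%, matching the sign convention used in the tables, where negative numbers denote bitrate reduction.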

Order-64 ICTs achieve notable improvements on high-resolution videos at normal QP. Moreover, we find that order-64 ICTs achieve around 1% average bitrate reduction on the luma component at high QP, which is mainly due to the significant improvements on high-resolution videos and screen contents. The 64x64 TU leads to more significant improvements on chroma components by enabling order-32 transforms, and this phenomenon is consistent with the conclusion that the order-32 transform usually plays a more important role than the order-64 one [13]. Order-64 ICTs provide little gain at low QP. This is reasonable since transforms are mainly used to reduce the quantization errors, which are very small here.

TABLE V
BD-RATES AT ALL INTRA AND RANDOM ACCESS MODES

All Intra (AI)

BD-rate (%)            LCICT           SPFT            CPFT            RICT
                       Y      U & V    Y      U & V    Y      U & V    Y      U & V
Normal QP   Class A    -0.8   -1.6     -0.9   -1.7     -1.1   -1.7     -0.8   -1.8
            Class B    -0.5   -1.5     -0.6   -1.5     -0.7   -1.6     -0.6   -1.5
            Class C    -0.1   -0.2     -0.1   -0.3     -0.1   -0.3     -0.1   -0.3
            Class D    -0.1   -0.1     -0.1   -0.1     -0.1   -0.1     -0.1   -0.1
            Class E    -0.6   -1.4     -0.7   -1.5     -0.7   -1.6     -0.7   -1.5
            Class F    -0.1   -0.2     -0.1   -0.2     -0.1   -0.2     -0.1   -0.3
            Overall    -0.4   -0.8     -0.4   -0.9     -0.5   -0.9     -0.4   -0.9
High QP     Class A    -0.7   -13.6    -0.6   -13.6    -0.7   -13.7    -0.6   -13.6
            Class B    -1.6   -9.3     -1.6   -9.3     -1.6   -9.5     -1.6   -9.4
            Class C    -0.6   -2.1     -0.5   -2.0     -0.5   -2.3     -0.5   -2.2
            Class D    -0.2   -0.5     -0.2   -0.7     -0.1   -0.5     -0.1   -0.7
            Class E    -1.0   -6.2     -1.0   -6.3     -1.0   -6.5     -0.9   -6.3
            Class F    -2.3   -4.7     -2.1   -4.8     -2.1   -4.6     -2.3   -4.4
            Overall    -1.1   -6.2     -1.0   -6.2     -1.0   -6.3     -1.0   -6.2
Low QP      Overall    0.1    0.1      0.0    0.0      0.0    0.0      0.0    0.0

Random Access (RA)

BD-rate (%)            LCICT           SPFT            CPFT            RICT
                       Y      U & V    Y      U & V    Y      U & V    Y      U & V
Normal QP   Class A    -1.2   -0.9     -1.3   -0.8     -1.4   -0.7     -1.3   -1.0
            Class B    -0.6   -1.2     -0.6   -1.2     -0.7   -1.2     -0.6   -1.2
            Class C    -0.3   -0.2     -0.3   -0.2     -0.4   -0.2     -0.3   -0.1
            Class D    0.0    0.2      0.0    -0.2     0.0    -0.1     -0.1   0.2
            Class E    N/A    N/A      N/A    N/A      N/A    N/A      N/A    N/A
            Class F    -0.5   -0.5     -0.5   -0.5     -0.5   -0.5     -0.5   -0.5
            Overall    -0.5   -0.5     -0.6   -0.6     -0.6   -0.6     -0.6   -0.6
High QP     Class A    -0.5   -10.1    -0.4   -9.6     -0.5   -10.3    -0.4   -10.2
            Class B    -1.5   -8.0     -1.4   -7.8     -1.4   -8.4     -1.4   -7.8
            Class C    -0.6   -1.0     -0.6   -1.2     -0.7   -1.2     -0.7   -1.1
            Class D    0.2    -1.6     0.1    -1.8     0.2    -1.4     0.3    -0.7
            Class E    N/A    N/A      N/A    N/A      N/A    N/A      N/A    N/A
            Class F    -1.1   -5.0     -1.0   -4.2     -0.9   -4.7     -0.8   -4.8
            Overall    -0.8   -5.3     -0.7   -5.1     -0.7   -5.3     -0.6   -5.1
Low QP      Overall    0.1    -0.1     0.0    -0.2     0.0    -0.2     0.0    -0.2

LCICTs achieve coding performance comparable to that of the other ICTs, especially at high QP, where order-64 ICTs lead to significant improvements. At the same time, LCICTs have the lowest complexity among all the ICTs here, e.g. 71% fewer multiplications than the PFTs in the order-64 case.

The 64x64 TU increases the total encoding time by around 13%, which is a significant number. This time may be reduced by simplifying the processing of large TUs, and one possible solution is to use fast TU decision techniques like [25].

VI. CONCLUSIONS

In this paper, we propose a set of Low-Complexity ICTs (LCICTs), from order-8 to order-64, which have fully factorizable structures. Experiment results on HM13 show that LCICTs achieve similar coding performance as the state-of-the-art methods while having lower complexity. We also find that order-64 ICTs can significantly improve the coding performance under low-bitrate configurations. Consequently, these ICTs can benefit low-bitrate applications such as video conferencing and video surveillance.

In our current implementation, the 64x64 TU increases the encoding time by around 13%. We will explore ways to accelerate the processing of large TUs in the future.

REFERENCES

[1] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Trans. Comput., vol. 100, no. 1, pp. 90–93, Jan. 1974.

[2] W.-K. Cham, “Family of order-4 four-level orthogonal transforms,” IEE Electronics Letters, vol. 21, no. 19, pp. 869–871, 1983.

[3] W.-K. Cham, “Development of integer cosine transforms by the principle of dyadic symmetry,” IEE Proceedings I - Communications, Speech and Vision, vol. 136, pp. 276–282, 1989.

[4] J. Dong, K. N. Ngan, C.-K. Fong, and W.-K. Cham, “2-D order-16 integer transforms for HD video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 10, pp. 1462–1474, Oct. 2009.

[5] ITU-T Rec. H.264 and ISO/IEC 14496-10:2009: Advanced video coding, ITU-T and ISO/IEC, 2010.

[6] Information technology - Advanced coding of audio and video, Part 2: Video, GB/T 20090.2-2006, AVS Workgroup of China, 2006.

[7] Information technology - High efficiency media coding, Part 2: Video, GB/T 33475.2-2016, AVS Workgroup of China, 2016.

[8] ITU-T Rec. H.265 and ISO/IEC 23008-2: High efficiency video coding, ITU-T and ISO/IEC, 2013.

[9] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.

[10] M. Budagavi, A. Fuldseth, G. Bjøntegaard, V. Sze, and M. Sadafale, “Core transform design in the High Efficiency Video Coding (HEVC) standard,” IEEE J. Sel. Topics Signal Process., vol. 7, no. 6, pp. 1029–1041, Dec. 2013.

[11] A. Saxena and F. C. Fernandes, “DCT/DST-based transform coding for intra prediction in image/video coding,” IEEE Trans. Image Process., vol. 22, no. 10, pp. 3974–3981, Oct. 2013.

[12] F. Bossen, Test model under consideration, document JCTVC-A205, JCT-VC, Apr. 2010.

[13] M. Zhou, Coding efficiency test on large block size transforms in HEVC, document JCTVC-B028, JCT-VC, Jul. 2010.

[14] Y. Sugito, A. Ichigaya, S. Sakaida, K. Sugimoto, A. Minezawa, and S. Sekiguchi, A study on addition of 64x64 transform to HM 3.0, document JCTVC-F192, JCT-VC, Jul. 2011.

[15] J. Chen, Y. Chen, M. Karczewicz, X. Li, H. Liu, L. Zhang, and X. Zhao, “Coding tools investigation for next generation video coding based on HEVC,” in Proc. SPIE, vol. 9599, 2015, pp. 95 991B–1–95 991B–9.

[16] C. Loeffler, A. Ligtenberg, and G. S. Moschytz, “Practical fast 1-D DCT algorithms with 11 multiplications,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., vol. 2, pp. 988–991, May 1989.

[17] C.-K. Fong and W.-K. Cham, “LLM integer cosine transform and its fast algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 6, pp. 844–854, Jun. 2012.

[18] Y. M. Hong, I.-K. Kim, T. Lee, M.-S. Cheon, E. Alshina, W.-J. Han, and J.-H. Park, “New fast DCT algorithms based on Loeffler’s factorization,” in Proc. SPIE, vol. 8499, 2012, pp. 84 990U–1–84 990U–8.

[19] C.-K. Fong, Q. Han, and W.-K. Cham, “Recursive integer cosine transform for HEVC and future video coding standards,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 2, pp. 326–336, Feb. 2017.

[20] I.-K. Kim, K. McCann, K. Sugimoto, B. Bross, W.-J. Han, and G. J. Sullivan, High Efficiency Video Coding (HEVC) Test Model 13 (HM13) Encoder Description, document JCTVC-O1002, JCT-VC, Nov. 2013.

[21] Z. Wang, “Fast algorithms for the discrete W transform and for the discrete Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 4, pp. 803–816, Aug. 1984.

[22] N. S. Jayant and P. Noll, Digital Coding of Waveforms, Principles and Applications to Speech and Video. Englewood Cliffs NJ, USA: Prentice-Hall, 1984.

[23] F. Bossen, Common test conditions and software reference configurations, document JCTVC-K1100, JCT-VC, Oct. 2012.

[24] G. Bjøntegaard, Calculation of average PSNR Differences between RD-curves, document VCEG-M33, ITU-T SG16/Q6, Apr. 2001.

[25] L. Shen, Z. Zhang, X. Zhang, P. An, and Z. Liu, “Fast TU size decision algorithm for HEVC encoders using Bayesian theorem detection,” Signal Process.: Image Commun., vol. 32, pp. 121–128, 2015.




Zhe Chen received the Bachelor’s degree in electronic engineering from the Chinese University of Hong Kong in 2014, and he is currently pursuing the Ph.D. degree in the Department of Electronic Engineering, the Chinese University of Hong Kong.

His current research interests include video coding, video inpainting and dynamic texture synthesis.

Qinglong Han (S’14) received the Bachelor’s degree in electronic engineering from the University of Electronic Science and Technology of China in 2011, and Ph.D. degree in electronic engineering from the Chinese University of Hong Kong in 2017.

His current research interests include video coding, video streaming, and computer vision.

Wai-Kuen Cham (S’77-M’79-SM’91) graduated from the Chinese University of Hong Kong in 1979 in electronics. He received the M.Sc. and Ph.D. degrees from Loughborough University of Technology, U.K., in 1980 and 1983, respectively.

Since May 1985, he has been with the Department of Electronic Engineering, the Chinese University of Hong Kong. His research interests include image processing and video coding.


