
High performance intra algorithm and parallel hardware architecture for the next generation video coding

Doctor’s Course in Electrical and Electronic Engineering

Graduate School of Engineering, University of Tokushima

Wen Shi

Jan. 2018

Abstract

The worldwide consumer electronics market is driving the development of display technology and video coding technology. High definition (HD) and ultra-high definition (UHD) video content is growing and expanding. In the meantime, multimedia terminals are adapting to display, capture, and play back higher resolution video with better quality. However, higher resolution and higher quality video overloads content-delivery networks because of the large amount of video data. Video coding technology has been researched and developed since 1989 to compress this huge amount of video data to a much smaller size at higher speed, so that it can be handled by low-cost applications for broadcasting, video conferencing, online video games and surveillance systems. H.264/AVC is the previous international video coding standard, which provides high coding efficiency while introducing a heavy computational load and high power consumption. As video content grows in both resolution and quality, and high-quality, high-resolution (e.g. 2K, 4K) content becomes mainstream, a successor to H.264/AVC called High Efficiency Video Coding (HEVC) was developed for next generation video coding. In 2013, the Joint Collaborative Team on Video Coding (JCT-VC) standardized HEVC to achieve higher video coding efficiency. As the next generation video coding standard, HEVC supports higher resolution video coding and achieves about 50% bit-rate reduction at the same visual quality compared with Advanced Video Coding (H.264/AVC).

In HEVC, rate-distortion optimization (RDO) based hybrid encoding is an effective tool that reduces spatial and temporal redundancy by trying a wide range of unit types and prediction modes. The RDO process is therefore computationally intensive. Intra coding reduces data redundancy between neighboring blocks, which leads to high data dependency and high power consumption. The problem becomes critical in HEVC because the quadtree-structured coding unit sizes range from 4 to 64. In recent years, real-time transmission and mobile devices have become pressing requirements for video-related entertainment applications. It is desirable to develop optimization techniques for video encoders because of short battery life and limited hardware resources. The targets of this research are to reduce the computational complexity, increase coding performance and realize hardware parallelization, so that video data can be compressed efficiently at higher speed and with lower power consumption in real-time hardware implementations.

Three novel schemes are proposed to realize these targets. Firstly, an edge detector based fast level decision algorithm for intra prediction of HEVC is presented to reduce redundant calculation and encoding time. In the intra prediction of HEVC, prediction unit sizes from 4×4 to 64×64 are employed, defined respectively as levels 1 to 5, to achieve higher coding efficiency. Nevertheless, this comes at the expense of high computational complexity. The proposal efficiently decides the level on the basis of the Roberts cross edge detector. The proposed algorithm exploits the high correlation between regional texture and prediction unit partitioning. It is mainly composed of a bottom-up level decision method and an efficient decision flow based on an authentic image feature. Furthermore, chrominance information is also employed to decide the prediction unit partitioning. Secondly, an adaptive downsampling signal based intra prediction for parallel intra coding of HEVC is proposed to improve coding efficiency and reduce data dependency. A downsampling signal is used to generate prediction samples instead of neighboring pixels. It reduces spatial redundancy and removes the data dependency in intra encoding for the coding tree unit (CTU) structure. Meanwhile, a fast training method is designed to derive the downsampling signal adaptively. Thirdly, a hardware implementation oriented fast intra coding scheme based on downsampling information for HEVC is presented to realize parallel hardware implementation for real-time applications. The scheme consists of two parts, a preprocessing stage and a fast intra coding stage. Three downsampling information based fast decision algorithms are proposed in the fast intra coding stage. Moreover, a parallelized architecture of the fast intra coding scheme is presented. The preprocessing (downsampling) stage can be executed in parallel with the intra coding stage.

The experimental results for proposal 1 demonstrate that it greatly reduces computational complexity compared with the original HEVC. The average time saving is approximately 37%, while the increase in bit rate and the decrease in PSNR are negligible. For proposal 2, experimental results show that the proposed fast parallelized scheme achieves 4.17% bit saving on average while reducing computational complexity by 27.26%. For proposal 3, the proposed architecture makes full use of the parallel execution of the two stages to improve throughput and break up data dependency. Experimental results demonstrate that the proposed algorithms achieve on average a 60.4% reduction in encoding time with negligible coding efficiency loss, compared with the original HEVC.

Contents

Chapter 1 Introduction
1.1 Introduction to HEVC
1.2 Requirements for implementing HEVC
1.2.1 Computational complexity
1.2.2 Data dependency
1.2.3 Compression efficiency
1.3 Main Contributions
1.3.1 Edge detector based fast level decision algorithm for intra prediction of HEVC
1.3.2 Adaptive downsampling signal based intra prediction for parallel intra coding of high efficiency video coding
1.3.3 Hardware implementation oriented fast intra coding based on downsampling information for HEVC
1.4 Thesis Outline

Chapter 2 Overview of HEVC
2.0.1 Coding Design And Feature Highlights
2.0.1.1 Encoder structure of HEVC
2.0.1.2 Coding tree units and coding tree block structure
2.0.1.3 Coding tree unit
2.0.1.4 Coding unit
2.0.1.5 Prediction unit
2.0.1.6 Transform unit
2.0.2 Intra prediction
2.0.3 Inter prediction
2.0.4 Transform and quantization
2.0.5 In-loop filtering
2.0.6 Sample adaptive offset filter
2.0.7 Entropy coding
2.0.8 Summary

Chapter 3 Edge detector based fast level decision algorithm for intra prediction of HEVC
3.1 Background and previous works
3.2 Analysis of Intra Coding in HEVC
3.3 Proposed algorithm
3.3.1 Bottom-up level-decision method
3.3.2 Authentic-feature-based level decision method
3.3.3 Integrated fast level decision algorithm
3.4 Experimental results
3.5 Summary

Chapter 4 Adaptive downsampling signal based intra prediction for parallel intra coding of HEVC
4.2 Analysis of intra coding in HEVC
4.3 Proposed parallel intra coding scheme
4.3.1 Downsampling approach for reconstructing samples
4.3.2 Downsampling Prediction Modes
4.3.3 Training method for obtaining downsampling QP
4.3.4 A parallel HEVC intra encoding architecture
4.5 Summary

Chapter 5 Hardware implementation oriented fast intra coding based on downsampling information for HEVC
5.2 Analysis of intra coding in HEVC
5.3 Proposed downsampling information based fast intra coding algorithms
5.3.1 Downsampling information based preprocessing stage
5.3.2 Fast PU depth decision algorithm
5.3.3 Fast TU depth decision algorithm
5.3.4 Fast prediction mode decision algorithm
5.4 Top-level design for VLSI architecture
5.6 Summary

Chapter 6 Conclusions

Acknowledgement

Bibliography

Appendix
A Publication Paper
B International Conference

List of Figures

2.1 Slices and slice segments
2.2 CU, PU and TU size
2.3 The intra prediction modes
2.4 Motion estimation
3.1 Recursively splitting of CUs in HEVC
3.2 Intra prediction modes in HEVC
3.3 Proportions of depths for different class sequences
3.4 PU partitioning of BQMall sequence
3.5 Roberts-cross masks for gradient calculation
3.6 Chroma-assisting decision
3.7 Authentic-image-feature-based level decision procedure
3.8 Flowchart of the proposed algorithm
3.9 RD curves of ParkScene sequence
3.10 RD curves of BQMall sequence
3.11 RD curves of RaceHorses sequence
4.1 Intra coding flow for HEVC
4.2 High encoding latency caused by CTU-level data dependency: (a) CTU encoding order and (b) CTU pipeline scheduling
4.3 High encoding latency caused by PU-level data dependency: (a) PU encoding order and (b) PU pipeline scheduling
4.4 Spatially referencing samples of HEVC
4.5 Overview of the proposed intra coding scheme
4.6 Generate prediction samples using downsampling signal
4.7 Flowchart of training TDQP
4.8 Top level hardware architecture for HEVC
4.9 RD curves of Beauty 2160p in M2
4.10 RD curves of HoneyBee 2160p in M2
4.11 RD curves of Beauty 1080p in M2
5.1 Hierarchically quadtree structure in HEVC
5.2 Optimal mode and partitioning decision flow
5.3 The proposed fast intra coding scheme
5.4 Concordance rates of downsampling and original encoded information
5.5 Overview of downsampling method
5.6 Top level architecture of proposed fast intra coding
5.7 Timing diagram with PS and FICS

List of Tables

3.1 Number of intra modes for each PU size
3.2 Experimental results for threshold value
3.3 Performance of the proposed algorithm and da Silva’s algorithm
4.1 Result of TDQP and FDQP
4.2 Experimental conditions
4.3 Compared Experimental Results for Proposed Scheme and HEVC
5.1 Number of mode candidate list for HEVC and proposal in worst case
5.2 Performance comparison between FICS and FICS without FTDD
5.3 Performance comparison between FICS and FICS without FPMD
5.4 Performance comparison of compression and encoding time for PS and FICS
5.5 Comparison of coding architecture and performance


Chapter 1 Introduction

1.1 Introduction to HEVC

As the video content market has grown in both resolution and quality in recent years, HDTV and 4K content have become mainstream for more and more applications. It is worth noting that higher resolution and higher quality video overloads content-delivery networks because of the large amount of video data. To solve this problem and achieve higher video coding efficiency, the Joint Collaborative Team on Video Coding (JCT-VC), which was organized by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG), standardized HEVC on January 25, 2013 [1]. HEVC aims to encode video with twice the compression efficiency of H.264/AVC under the same encoding conditions. The new standard is expected to be used as the next generation video standard for video delivery over limited bandwidth and for super-high-vision broadcasting.

The first edition of the HEVC standard was finalized in January 2013, resulting in an aligned text published by both ITU-T and ISO/IEC. Further work has been added to the standard to support several additional applications, including scalable video coding, multiview video coding, 3D video coding, and extended-range uses with enhanced precision and color format support. In ITU-T, HEVC is ITU-T Recommendation H.265, and in ISO/IEC it is MPEG-H Part 2 (ISO/IEC 23008-2). ITU-T produced H.261 and H.263, ISO/IEC produced MPEG-1 and MPEG-4, and the two organizations jointly produced the H.262/MPEG-2 Video and H.264/MPEG-4 Advanced Video Coding (AVC) standards. ITU-T and ISO/IEC are therefore well known for their video coding standards. The two jointly produced standards have played an important role and have found their way into a wide variety of products in our daily lives.

HEVC/H.265 [2] has highly efficient compression methods, which allow it to compress video much more efficiently than older standards and provide more flexibility for applications in complex network environments. Unlike previous standards, HEVC uses frame partitioning and hybrid prediction as well as residual encoding with a quadtree-based structure. In frame partitioning, the source frame is divided into a sequence of largest coding units (LCUs) of size 64×64. The coding unit (CU) is the basic unit of region partitioning used for intra and inter prediction. It is square and ranges in size from 8×8 luma samples up to 64×64. A CU can be split recursively into four equally sized units. The prediction unit (PU) is the basic unit used in the prediction processes. To facilitate partitioning that matches the boundaries of real objects in the picture, the PU is not restricted to being square in inter prediction, but in intra prediction it must be square. Each CU may contain one or more PUs. Following the quadtree structure, the transform units (TUs) for the DCT form a quadtree with three levels, named the residual quadtree transform (RQT). It helps to achieve the optimal tradeoff between energy consumption and adaptability. In addition to the quadtree-based prediction and residual encoding, other advanced coding tools such as directional intra prediction with 35 modes, the residual quadtree transform, sample adaptive offset (SAO) and in-loop filtering are also employed in HEVC. Owing to these tools, HEVC achieved about 40% bit-rate saving compared to H.264 High Profile as of February 2012.

HEVC supports two coding scenarios that reflect the computational complexity acceptable to different applications. The Low Complexity (LC) configuration is intended for low-delay applications. The High Efficiency (HE) configuration is intended to achieve a high compression ratio for applications with ample hardware resources. The HEVC encoder also defines three kinds of configuration for experiments and different applications. In the intra-only configuration, each frame is encoded as an IDR picture; no temporal reference pictures are used, and the QP may not change within a picture. In the low-delay configuration, only the first frame is encoded as an IDR picture, and the successive frames are encoded as P and B pictures (GPB); there is no picture reordering between decoder processing and output, and the bit rate fluctuates. In the random-access configuration, the structural delay of the processing units is no larger than an 8-picture group of pictures (GOP), hierarchical B pictures are used dynamically with four levels, and an intra picture is inserted cyclically once per second. In total, HEVC supports six coding configurations: Low Complexity Intra Only, High Efficiency Intra Only, Low Complexity Low Delay, High Efficiency Low Delay, Low Complexity Random Access and High Efficiency Random Access. Computational complexity and coding efficiency gradually increase across the six configurations in the order listed.

1.2 Requirements for implementing HEVC

The integrated intra coding process is divided into two main steps: rough mode decision (RMD) and rate-distortion optimization (RDO). In the RMD process, reconstructed pixels of neighboring PUs are used to build prediction pixels with the 35 intra prediction modes. The Hadamard transform absolute difference (HAD) costs of all supported prediction modes are calculated to create a list of the candidates with the smallest HAD costs. Meanwhile, the most probable modes (MPMs) are derived from neighboring PUs and added to the mode candidates. In the RDO process, the optimal PU partitioning and prediction mode are decided by the RD cost, which is derived from the residual between reconstructed samples and original samples. Moreover, the intra coding process is performed recursively to generate the optimal encoding result. The recursive intra coding of HEVC improves coding efficiency; however, three issues remain.
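The RMD step described above can be illustrated with a minimal Python sketch. It is not the reference encoder's implementation: the function names (`hadamard`, `had_cost`, `rough_mode_decision`) are hypothetical, the prediction blocks for each mode are assumed to be supplied by the caller, and the cost is a plain sum of absolute Hadamard-transformed differences without any mode-bit term.

```python
import numpy as np

def hadamard(n):
    """Build an n x n Hadamard matrix (n is assumed to be a power of two)."""
    h = np.array([[1]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def had_cost(original, prediction):
    """Sum of absolute Hadamard-transformed differences of a square block."""
    diff = original.astype(np.int64) - prediction.astype(np.int64)
    h = hadamard(diff.shape[0])
    return int(np.abs(h @ diff @ h.T).sum())

def rough_mode_decision(original, predictions, num_candidates, mpms=()):
    """Keep the modes with the lowest HAD cost, then append missing MPMs.

    `predictions` maps a mode index to its predicted block (hypothetical
    interface; building the 35 predictions is outside this sketch).
    """
    costs = {mode: had_cost(original, pred) for mode, pred in predictions.items()}
    candidates = sorted(costs, key=costs.get)[:num_candidates]
    for mode in mpms:
        if mode not in candidates:
            candidates.append(mode)
    return candidates
```

The candidate list returned here is what the RDO step would then evaluate with the full RD cost.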

1.2.1 Computational complexity

H.264/AVC [3] defines macroblock sizes from 4 × 4 to 16 × 16 and a total of 9 optional prediction modes. HEVC, in contrast, employs much larger block sizes, from 4 × 4 up to 64 × 64, and many more intra prediction modes. These two features give intra coding high efficiency but also introduce significantly higher complexity for intra prediction. To find the optimal CU partitioning and prediction mode, HEVC traverses all combinations of CU, PU and TU, performing a series of computations to carry out the RDO process. For the intra prediction process, a 64 × 64 CTU performs approximately 2.65 × more predictions than H.264/AVC.
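To give a feel for the size of this search space, the following Python sketch counts how many PUs tile one CTU at each size and how many mode evaluations an exhaustive search would need. It assumes, purely for illustration, that all 35 modes are tested at every PU size; it is an upper-bound style estimate and does not reproduce the 2.65 × figure quoted above. The function names are hypothetical.

```python
def pu_count_per_level(ctu_size=64, min_pu_size=4):
    """Return {pu_size: number of PUs of that size needed to tile one CTU}."""
    counts = {}
    size = ctu_size
    while size >= min_pu_size:
        counts[size] = (ctu_size // size) ** 2
        size //= 2
    return counts

def exhaustive_mode_evaluations(modes_per_pu=35, ctu_size=64, min_pu_size=4):
    """Mode evaluations if every mode is tried for every PU at every level."""
    counts = pu_count_per_level(ctu_size, min_pu_size)
    return sum(n * modes_per_pu for n in counts.values())

print(pu_count_per_level())           # {64: 1, 32: 4, 16: 16, 8: 64, 4: 256}
print(exhaustive_mode_evaluations())  # 341 PUs x 35 modes = 11935
```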


1.2.2 Data dependency

The encoding order for CTUs and PUs in HEVC is as follows. HEVC encodes CTUs in raster scan order, and all PUs in a CTU are encoded in Z-scan order, which provides more available neighboring reference samples in most cases. Since reconstructed pixels in neighboring CTUs and PUs are used in HEVC, the intra encoding process is restricted by data dependency and suffers extremely high latency. Take the CTU as an example of the hardware implementation problems that data dependency causes for an intra encoder. The order of intra encoding is restricted by CTU-level data dependency: each CTU must wait until both its left and its above-right neighboring CTUs have been encoded. This constraint introduces latency between the current CTU and adjacent CTUs. Furthermore, the maximum parallelism is substantially limited, and encoding cannot start from an arbitrary position in the frame. Only the encoding direction from top left to bottom right is permitted in HEVC. Therefore, pipelined hardware implementations and parallel architectures are difficult to design. Both CTU-level and PU-level data dependency lead to low throughput and high hardware overhead.
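The effect of the CTU-level constraint can be sketched with a small scheduling model in Python. It assumes, for illustration only, that every CTU takes one time step, that unlimited processing units are available, and that the last CTU of a row (which has no above-right neighbor) waits for the CTU directly above it instead. The function name is hypothetical.

```python
def earliest_start_steps(width, height):
    """Earliest start step of each CTU when it must wait for its left and
    above-right neighbours (one step per CTU, unlimited processing units)."""
    start = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            deps = []
            if x > 0:
                deps.append(start[y][x - 1])
            if y > 0:
                # Above-right neighbour; fall back to "above" in the last column.
                deps.append(start[y - 1][min(x + 1, width - 1)])
            start[y][x] = max(deps) + 1 if deps else 0
    return start

for row in earliest_start_steps(4, 3):
    print(row)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
```

In this 4 × 3 example the frame needs 8 steps instead of 12, and at most two CTUs are ever processed at the same time, which illustrates how the dependency caps the achievable parallelism.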

1.2.3 Compression efficiency

In the spatial domain, pixels to be predicted tend, in most instances, to be highly correlated with neighboring pixels. As mentioned above, HEVC intra coding reduces the redundancy between pixels that are close to each other in the same frame. It provides 35 intra prediction modes, which use the spatially neighboring samples to predict the samples of the current PU. However, the correlation between spatially neighboring pixels decays as the PU size increases. In particular, for large PU sizes (64 × 64 or 32 × 32), the available reference samples are too far away from some pixels of the current PU, which degrades the prediction performance.

1.3 Main Contributions

In this thesis, algorithms and an optimized hardware architecture are proposed to reduce the heavy computational burden of RDO-based intra prediction and to realize a real-time intra encoder. The main contributions are listed as follows.


1.3.1 Edge detector based fast level decision algorithm for intra prediction of HEVC

High efficiency video coding (HEVC) achieves significantly higher coding efficiency than previous video coding standards. In the intra prediction of HEVC, prediction unit sizes from 4×4 to 64×64 are employed, defined respectively as levels 1 to 5, to achieve higher coding efficiency. Nevertheless, this comes at the expense of high computational complexity. This work proposes a fast algorithm for the intra prediction of HEVC, which efficiently decides the level on the basis of the Roberts cross edge detector. The proposed algorithm exploits the high correlation between regional texture and prediction unit partitioning. It is mainly composed of a bottom-up level decision method and an efficient decision flow based on an authentic image feature. Furthermore, chrominance information is also employed to decide the prediction unit partitioning. The experimental results demonstrate that the proposed algorithm greatly reduces computational complexity compared with the original HEVC. The average time saving is approximately 37%, while the increase in bit rate and the decrease in PSNR are negligible.
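As a rough illustration of the edge-detection idea, the Python sketch below computes the Roberts cross gradient of a block and flags the block as textured when the mean gradient magnitude exceeds a threshold. The mean-based rule, the threshold value and the function names are assumptions made for this sketch; the bottom-up level decision and the chroma-assisted decision of the proposed algorithm are described in Chapter 3 and are not reproduced here.

```python
import numpy as np

def roberts_cross_magnitude(block):
    """Approximate gradient magnitude |Gx| + |Gy| using the Roberts cross
    masks [[1, 0], [0, -1]] and [[0, 1], [-1, 0]]."""
    b = block.astype(np.int64)
    gx = b[:-1, :-1] - b[1:, 1:]
    gy = b[:-1, 1:] - b[1:, :-1]
    return np.abs(gx) + np.abs(gy)

def is_textured(block, threshold):
    """Hypothetical decision rule: a block whose mean gradient exceeds the
    threshold is treated as textured (a candidate for smaller PUs)."""
    return float(roberts_cross_magnitude(block).mean()) > threshold

flat = np.full((8, 8), 128)
edge = np.hstack([np.zeros((8, 4)), np.full((8, 4), 100)])
print(is_textured(flat, 10), is_textured(edge, 10))  # False True
```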

1.3.2 Adaptive downsampling signal based intra prediction for parallel

intra coding of high efficiency video coding

Intra coding uses neighboring reference pixels to construct the prediction samples and reduce spatial redundancy. In high efficiency video coding (HEVC), the improvement in coding efficiency is achieved by enhancing traditional intra prediction. However, the enhanced intra coding method is a real hindrance to parallel hardware implementation. In this work, an efficient parallel scheme is proposed for intra coding of HEVC. A downsampling signal is used to generate prediction samples instead of neighboring pixels. It reduces spatial redundancy and removes the data dependency in intra encoding for the coding tree unit (CTU) structure. Meanwhile, a fast training method is designed to derive the downsampling signal adaptively. Experimental results show that the proposed fast parallelized scheme achieves 4.17% bit saving on average while reducing computational complexity by 27.26%.
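The principle of predicting a block from its own downsampled signal, rather than from neighboring pixels, can be sketched in Python as follows. The 2 × 2 averaging filter, the nearest-neighbor upsampling and the function names are assumptions chosen for brevity; the actual downsampling approach, prediction modes and downsampling QP training are described in Chapter 4, and quantization of the downsampled signal is omitted here.

```python
import numpy as np

def downsample(block, factor=2):
    """Average each factor x factor group of samples (assumed filter)."""
    h, w = block.shape
    return block.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def predict_from_downsampled(small, factor=2):
    """Nearest-neighbour upsampling of the downsampled signal (assumed)."""
    return np.kron(small, np.ones((factor, factor)))

def residual(block, factor=2):
    """Residual left after predicting the block from its own downsampled
    signal; no samples from neighbouring blocks are needed."""
    prediction = predict_from_downsampled(downsample(block, factor), factor)
    return block - prediction

block = np.arange(16, dtype=float).reshape(4, 4)
print(downsample(block))  # [[ 2.5  4.5] [10.5 12.5]]
```

Because the prediction depends only on data belonging to the current block, blocks can in principle be processed independently, which is the property the parallel scheme relies on.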


1.3.3 Hardware implementation oriented fast intra coding based on downsampling information for HEVC

High efficiency video coding (HEVC) aims at achieving higher coding performance, especially for ultra high definition applications. Many novel features are applied to the intra coding of HEVC. However, these features bring massive computation, and hardware implementation is difficult for a real-time intra encoder that supports high resolution applications. This work proposes a downsampling information based intra coding scheme that consists of two parts, a preprocessing stage and a fast intra coding stage. Three downsampling information based fast decision algorithms are proposed in the fast intra coding stage. Moreover, a parallelized architecture of the fast intra coding scheme is proposed. The preprocessing (downsampling) stage can be executed in parallel with the intra coding stage. The proposed architecture makes full use of this feature to improve throughput and break up data dependency. Experimental results demonstrate that the proposed algorithms achieve on average a 60.4% reduction in encoding time with negligible coding efficiency loss, compared with the original HEVC.
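The benefit of running the preprocessing stage (PS) in parallel with the fast intra coding stage (FICS) can be illustrated with a simple two-stage pipeline model in Python. The processing granularity (one entry per unit of work), the timing values and the function names are hypothetical; the actual timing diagram of the architecture is given in Chapter 5.

```python
def sequential_time(ps_times, fics_times):
    """Total time when both stages run one after the other for every unit."""
    return sum(ps_times) + sum(fics_times)

def pipelined_time(ps_times, fics_times):
    """Total time when PS of the next unit overlaps FICS of the current one."""
    ps_done = 0
    fics_done = 0
    for ps, fics in zip(ps_times, fics_times):
        ps_done += ps                               # PS runs back to back
        fics_done = max(fics_done, ps_done) + fics  # FICS waits for its PS result
    return fics_done

ps = [2, 2, 2]    # hypothetical PS time per unit
fics = [5, 5, 5]  # hypothetical FICS time per unit
print(sequential_time(ps, fics), pipelined_time(ps, fics))  # 21 17
```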

1.4 Thesis Outline

Chapter 2 provides an overview of the HEVC encoder along with a brief explanation of its coding tools and encoding process. Chapter 3 discusses the edge detector based fast level decision algorithm for intra prediction of HEVC. Chapter 4 explains the adaptive downsampling signal based intra prediction for parallel intra coding of HEVC in detail. Chapter 5 explains the hardware implementation oriented fast intra coding based on downsampling information for HEVC in detail. Chapter 6 presents the conclusions and discusses possibilities for future research.


Chapter 2 Overview of HEVC

2.0.1 Coding Design And Feature Highlights

2.0.1.1 Encoder structure of HEVC

A slice is a data structure that can be decoded independently from other slices of the same picture, as shown in Figure 2.1, in terms of entropy coding, signal prediction and residual signal reconstruction. A slice can be either the entire picture or a region of a picture, which is not necessarily rectangular. A slice segment consists of a sequence of coding tree units (CTUs). An independent slice segment is a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment. A dependent slice segment is a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order. In Figure 2.1, the picture is divided into two slices. Each slice consists of a sequence of one or more slice segments, starting with an independent slice segment and containing all subsequent dependent slice segments that precede the next independent slice segment within the same access unit.

Figure 2.1 Slices and slice segments


2.0.1.2 Coding tree units and coding tree block structure

The HEVC standard has adopted a highly flexible and efficient block partitioning

structure by introducing four different block concepts: CTU, CU, PU, and TU, as

shown in Figure 2.2, which are defined to have clearly separated roles. The terms

coding tree block (CTB), coding block (CB), prediction block (PB), and transform block (TB) are

also defined to specify the 2-D sample array of one color component associated with

the CTU, CU, PU, and TU, respectively. Thus, a CTU consists of one luma CTB,

two chroma CTBs, and associated syntax elements. A similar relationship is valid

for CU, PU, and TU.

Figure 2.2 CU, PU and TU size

The coding tree approach in HEVC can bring additional coding efficiency benefits by incorporating PU and TU quad tree concepts for video compression. Leaf nodes of a tree can be merged or combined in a general quad tree structured video coding scheme. After the final quad tree is formed, motion information is transmitted at the leaf nodes of the tree. L-shaped or rectangular motion partitions are possible through merging and combination of nodes. However, to make such shapes, the merge process has to be applied to smaller blocks after further splitting has occurred. In the HEVC block partitioning structure, such cases are taken care of by the PU. Instead of splitting one depth further for merging and combination, predefined partition modes such as PART_2N×2N, PART_2N×N, and PART_N×2N are tested and the optimal partition mode is selected at the leaf nodes of the tree. It is worth mentioning that PUs can still share motion information through the merge mode in HEVC. A general quad tree structure without the PU concept has also been investigated by removing the symmetric rectangular partition modes (PART_2N×N and PART_N×2N).

Another aspect is the full utilization of depth information for entropy coding. For example, entropy coding in HEVC relies heavily on the depth information of a quad tree. For syntax elements such as inter_pred_idc, split_transform_flag, cbf_luma, cbf_cb and cbf_cr, depth-dependent context derivation is heavily used for coding efficiency. It has been demonstrated that this can break the dependency on neighboring blocks and reduce line buffer requirements in hardware implementations, because information about the CTU above does not need to be stored.

2.0.1.3 Coding tree unit

A slice contains an integer number of CTUs; the CTU is analogous to the macroblock in H.264/AVC. Inside a slice, the CTUs are processed in raster scan order. In the main profile, the minimum and maximum sizes of the CTU are specified by syntax elements in the sequence parameter set (SPS) among the sizes of 8×8, 16×16, 32×32, and 64×64. Owing to this flexibility of the CTU, HEVC can adapt to various application needs such as encoder/decoder pipeline delay constraints or on-chip memory requirements in a hardware design. In addition, the support of large sizes up to 64×64 allows the coding structure to match the characteristics of high definition video content better than previous standards; this was one of the main sources of the coding efficiency improvements seen with HEVC.

2.0.1.4 Coding unit

The CTU is further partitioned into multiple CUs to adapt to various local

characteristics. A quad tree denoted as the coding tree is used to partition the CTU

into multiple CUs. 1) Recursive Partitioning from CTU: Let CTU size be 2N×2N


where N is one of the values of 32, 16, or 8. The CTU can be a single CU or can be

split into four smaller units of equal sizes of N×N, which are nodes of a coding tree.

If the units are leaf nodes of the coding tree, the units become CUs. Otherwise,

each unit can be split again into four smaller units, provided that the split size is equal to or larger

than the minimum CU size specified in the sequence parameter set (SPS). This

representation results in a recursive structure specified by a coding tree. Numbers

on the tree represent whether the CU is further split. This flexible and recursive

representation of picture in CTUs and CUs provides several major benefits.

The first benefit comes from the support of CU sizes greater than the conventional

16×16 size. When the region is homogeneous, a large CU can represent the region

by using a smaller number of symbols than is the case using several small blocks.

Supporting arbitrary sizes of CTU enables the codec to be readily optimized for

various contents, applications, and devices. Compared to the use of fixed size mac-

roblock, support of various sizes of CTU is one of the strong points of HEVC in terms

of coding efficiency and adaptability for contents and applications. This ability is

especially useful for low-resolution video services.

By choosing an appropriate size of CTU and maximum hierarchical depth, the

hierarchical block partitioning structure can be optimized to the target application.

Figure 4.4 shows examples of various CTU sizes and CU sizes suitable for different

resolutions and types of content. For example, for an application using 1080p content

that is known to include only simple global motion activities, a CTU size of 64 and

depth of 2 may be an appropriate choice. For more general 1080p content, which

may also include complex motion activities of small regions, a CTU size of 64 and

maximum depth of 4 would be preferable.
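The recursive partitioning of a CTU into CUs described in this subsection can be sketched in a few lines of Python. This is only an illustrative sketch, not the HM implementation: the function name `partition_ctu` and the `should_split` callback, which stands in for the encoder's split decision, are assumptions introduced here.

```python
def partition_ctu(x, y, size, min_cu_size, should_split):
    """Return the leaf CUs of a coding tree as (x, y, size) tuples.

    should_split(x, y, size) is a hypothetical callback standing in for
    the encoder's split decision (for example an RD cost comparison).
    """
    half = size // 2
    # A unit may be split only if the resulting units are not smaller
    # than the minimum CU size signalled in the SPS.
    if half >= min_cu_size and should_split(x, y, size):
        leaves = []
        for dy in (0, half):
            for dx in (0, half):
                leaves += partition_ctu(x + dx, y + dy, half,
                                        min_cu_size, should_split)
        return leaves
    return [(x, y, size)]


# Example: keep splitting any unit that starts inside the top-left 16x16 corner.
cus = partition_ctu(0, 0, 64, 8, lambda x, y, s: x < 16 and y < 16)
```

In this example the 64×64 CTU ends up as four 8×8, three 16×16 and three 32×32 CUs, which together still cover the whole CTU.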

2.0.1.5 Prediction unit

One or more PUs are specified for each CU. Inside one PU, the same prediction

process is applied and the relevant information is transmitted to the decoder on

a PU basis. A CU can be split into one, two or four PUs according to the PU

splitting type. HEVC defines two splitting shapes for the intra coded CU and eight

splitting shapes for inter coded CU. Unlike the CU, the PU may only be split once.


1) PU Splitting Type: Similar to prior standards, each CU in the HEVC can be

classified into three categories: skipped CU, inter coded CU, and intra coded CU.

An inter coded CU uses a motion compensation scheme for the prediction of the

current block, while an intra coded CU uses neighboring reconstructed samples for

the prediction. A skipped CU is a special form of inter coded CU where both

the motion vector difference and the residual energy are equal to zero. Figure 4.5

describes the splitting types of the PU in the HEVC standard.

2.0.1.6 Transform unit

Similar to the PU, one or more TUs are specified for the CU. HEVC allows a

residual block to be split into multiple units recursively to form another quad tree

which is analogous to the coding tree for the CU. The TU is a basic representative

block having residual or transform coefficients for applying the integer transform and

quantization. For each TU, one integer transform having the same size as the TU

is applied to obtain residual coefficients. These coefficients are transmitted to the

decoder after quantization on a TU basis. 1) Residual Quad tree: After obtaining

the residual block by prediction process based on PU splitting type, it is split into

multiple TUs according to a quad tree structure. For each TU, an integer transform

is applied. The tree is called transform tree or residual quad tree (RQT) since the

residual block is partitioned by a quad tree structure and a transform is applied for

each leaf node of the quad tree. Transform tree partitioning is shown in figure 4.6.

2.0.2 Intra prediction

In H.265/HEVC, intra prediction of the luma component supports five PU sizes: 4x4,

8x8, 16x16, 32x32 and 64x64, and each PU size has 35 prediction modes, comprising the

planar mode, DC mode and 33 angular modes. It is noted that the bottom left pixels

are also used as reference pixels, which in some cases can improve the coding efficiency

significantly.

All HEVC intra prediction modes are identified by a prediction mode number as

follows: planar, DC and angular modes. The prediction directions of the 33 angular

modes are shown in Figure 2.3. Mode numbers 2-17 are horizontal

modes, and mode numbers 18-34 are vertical modes. Planar


mode corresponds to plane mode in H.264/AVC and is suited to smooth

areas. The prediction pixel value Px,y is generated by averaging the horizontal

and vertical prediction values. This method makes the predicted pixels change

smoothly and improves the subjective video quality. DC mode is suitable for

large flat areas. The current prediction value is generated by the average of the

left and above reference pixels. There are eight different prediction directions in

H.264/AVC. However, in order to adapt to the different texture of the video content,

H.265/HEVC specifies 33 angular prediction modes.
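As a small illustration of the DC mode described above, the following Python sketch fills a block with the rounded average of the left and above reference pixels. It is a simplified sketch under stated assumptions: the function name `dc_predict` and the list-based sample representation are introduced here, and the boundary smoothing that HEVC applies to some DC-predicted blocks is omitted.

```python
def dc_predict(above, left):
    """Fill an N x N block with the rounded mean of the reference samples.

    above and left each hold the N reconstructed neighbours of the block.
    N is assumed to be a power of two. Boundary smoothing is omitted.
    """
    n = len(above)
    if len(left) != n or n == 0 or n & (n - 1) != 0:
        raise ValueError("above and left must have the same power-of-two length")
    shift = n.bit_length()  # equals log2(N) + 1, i.e. a division by 2N
    dc = (sum(above) + sum(left) + n) >> shift  # "+ n" rounds to nearest
    return [[dc] * n for _ in range(n)]


block = dc_predict([100] * 4, [120] * 4)  # every sample becomes 110
```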

Figure 2.3 The intra prediction modes

2.0.3 Inter prediction

In H.265/HEVC, the prediction block (PB) is the basic processing unit in inter

prediction, and the prediction unit contains the prediction information. As shown in

Figure 2.4, the principle of motion compensation is that reference blocks are used to

predict the current block. The displacement between the reference block

and the current block is called the motion vector (MV), and the difference between them


is called the motion distortion. The MV and motion distortion are used to determine

the best prediction mode based on the rate-distortion (R-D) model. Similar to H.264/AVC,

B-frame or P-frame prediction is used for motion compensation, and in the final

standard, bi-prediction is used to achieve a trade-off between encoding efficiency

and encoding complexity. Furthermore, bi-prediction requires constant memory

access, which is considered to be one of the main factors of computational complexity,

especially for hardware design.

Figure 2.4 Motion estimation

2.0.4 Transform and quantization

HEVC specifies two-dimensional transforms of various sizes (4x4, 8x8, 16x16, and

32x32) that are finite precision approximations to the discrete cosine transform (DCT).

Multiple transform sizes improve compression performance, but also increase the

implementation complexity. The N transform coefficients vi of an N-point 1D-DCT

applied to the input samples ui can be expressed as


$$v_i = \sum_{j=0}^{N-1} u_j c_{ij} \qquad (2.1)$$

where i=0,...N-1. Elements cij of the DCT transform matrix C are defined as

$$c_{ij} = \frac{P}{\sqrt{N}} \cos\left[\frac{\pi}{N}\left(j + \frac{1}{2}\right) i\right] \qquad (2.2)$$

where i, j = 0, ..., N-1 and where P is equal to 1 and √2 for i = 0 and i > 0, respec-

tively. The basis vectors ci of the DCT are defined as ci = [ci0, ..., ci(N-1)]T where

i = 0, ..., N-1. The DCT is desirable for compression efficiency because it yields transform co-

efficients that are uncorrelated and provides good energy compaction.

Furthermore, the DCT is desirable for simplifying

the quantization and de-quantization processes and helps to reduce implementation

costs, as the same multipliers can be reused for various transform sizes. For a block

of pixels with slowly changing grey values, most of the energy after the DCT is concentrated in

the low-frequency coefficients in the upper left corner. Conversely, if the pixel

block contains more detailed texture, more energy is distributed in the high-

frequency area. In fact, most images contain mainly low-frequency components. Because

the human eye is relatively insensitive to high-frequency image detail,

the high-energy low-frequency coefficients can be quantized finely,

and the low-energy high-frequency coefficients can be quantized coarsely.
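The energy compaction property can be checked numerically with a direct floating-point implementation of Eqs. (2.1) and (2.2). The sketch below is for illustration only: the function names `dct_matrix` and `dct_1d` are assumptions, and HEVC itself uses scaled integer approximations of this matrix rather than floating-point arithmetic.

```python
import math


def dct_matrix(n):
    """Build the N x N DCT matrix C with elements c_ij from Eq. (2.2)."""
    rows = []
    for i in range(n):
        p = 1.0 if i == 0 else math.sqrt(2.0)
        rows.append([p / math.sqrt(n) * math.cos(math.pi / n * (j + 0.5) * i)
                     for j in range(n)])
    return rows


def dct_1d(u):
    """Apply Eq. (2.1): v_i = sum over j of u_j * c_ij."""
    c = dct_matrix(len(u))
    return [sum(uj * cij for uj, cij in zip(u, row)) for row in c]


# A perfectly flat input keeps all of its energy in the first (DC) coefficient.
flat = dct_1d([10.0, 10.0, 10.0, 10.0])
```

For the flat input, only the first coefficient is non-zero (up to floating-point error), which is the extreme case of the low-frequency concentration described above.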

Quantization consists of division by a quantization step size (Qstep) and subse-

quent rounding while inverse quantization consists of multiplication by the quanti-

zation step size. In HEVC, quantization parameter (QP) is used to get Qstep, and

QP can take 52 values from 0 to 51. The relationship between QP and Qstep is

defined as follows:

$$Q_{step}(QP) = \left(2^{1/6}\right)^{QP-4} \qquad (2.3)$$
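A direct consequence of Eq. (2.3) is that Qstep equals 1 at QP = 4 and doubles every time QP increases by 6. The following floating-point sketch illustrates this together with the division and multiplication steps described above. The function names are assumptions made for illustration; a real encoder uses integer scaling tables instead of floating-point arithmetic.

```python
def qstep(qp):
    """Quantization step size from Eq. (2.3): Qstep = (2^(1/6))^(QP - 4)."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must lie in 0..51")
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coeff, qp):
    """Divide by Qstep and round to an integer level (Python's round())."""
    return int(round(coeff / qstep(qp)))


def dequantize(level, qp):
    """Inverse quantization: multiply the level by Qstep."""
    return level * qstep(qp)
```

For example, `qstep(4)` is 1.0 and `qstep(10)` is 2.0, so `quantize(100, 10)` gives level 50 and `dequantize(50, 10)` returns 100.0.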

The scaling operation of the integer DCT needs to be completed at the same time as the H.265/HEVC

quantization process. In order to avoid floating point arithmetic, both the numerator

and the denominator of quantizer formula (2.3) are scaled up by a certain factor and then

rounded to integers to retain the accuracy of the operation. In HEVC, the encoder can signal whether


or not to use quantization matrices enabling frequency dependent scaling. Human

visual system based quantization can achieve better quality than frequency inde-

pendent quantization. In HEVC, three options can be configured for the operation

of the quantizer: flat quantization, default weighting matrix and custom weighting

matrix. The quantization step may need to be changed within a picture for rate

control and perceptual quantization purposes. This is updated by a QP delta in the

slice segment header. The applicable QP for a CU is derived from the QP applied

in the previous CU in decoding order, and the delta QP is transmitted in coding

units with non-zero transform coefficients.

2.0.5 In-loop filtering

Deblocking filter and sample adaptive offset (SAO) filter are included in HEVC.

The deblocking filter aims to reduce the visibility of blocking artifacts and is applied

to samples located at block boundaries. The SAO filter aims to improve the accuracy

of the reconstruction of the original signal amplitudes and is applied adaptively to

all samples, by conditionally adding an offset value to each sample based on values

in look-up tables defined by the encoder.

A deblocking filter process is performed for each CU in the same order as the

decoding process. First vertical edges are filtered then horizontal edges are filtered.

Filtering is applied to the 8x8 block boundaries that are determined to require filtering,

for both luma and chroma components. The deblocking filter process has three stages:

boundary decision, filter on/off decision and strong/weak filter decision. TU bound-

aries and PU boundaries are involved in the deblocking filter. In boundary decision

stage, the boundary strength (Bs) is calculated to reflect how strong a filtering pro-

cess may be needed for the boundary. A value of 2 for Bs indicates strong filtering, 1

means weak filtering and 0 means no deblocking filtering. The filter on/off decision

is made using 4 lines grouped as a unit, to reduce computational complexity. If

filtering is turned on, a decision is made between strong and weak filtering. The

strong deblocking filter is applied to smooth flat areas.


2.0.6 Sample adaptive offset filter

SAO is applied to the reconstructed signal after the deblocking filter by using

offsets specified for each CTB by the encoder. The SAO reduces sample distortion

by first classifying the samples in the region into multiple categories with a selected

classifier and adding a specific offset to each sample depending on its category. The

classifier index and the offsets for each region are signaled in the bit-stream. SAO

operation includes edge offset (EO), which uses edge properties for pixel classifica-

tion in SAO types 1 to 4, and band offset (BO), which uses pixel intensity for pixel

classification in SAO type 5.

2.0.7 Entropy coding

A single entropy coding scheme is used in all configurations of HEVC: context

adaptive binary arithmetic coding (CABAC). Entropy coding is a lossless compres-

sion scheme that uses statistical properties to compress data, and it is performed

at the last stage of video encoding, after the video signal has been reduced to a series

of syntax elements. CABAC adopts efficient arithmetic coding technology, exploits

the statistical properties of the video stream, and improves the coding efficiency

significantly. Entropy coding processing has three stages: binarization, context

modeling and binary arithmetic coding. In general, a binarization scheme defines

a unique mapping of syntax element values to sequences of binary symbols (bins), which

can be interpreted in terms of a binary code tree. After each non-binary

syntax element value is decomposed into a sequence of bins, further processing of each bin value

in CABAC depends on the associated coding mode decision. The probability models in

CABAC are adaptive: for events that are likely to have a large impact on the

coding performance, an elaborate context model is set up, whereas for events that are likely

to have little impact on the coding performance, a simple context model is set up. For

binarized syntax elements, every bin is processed by arithmetic coding accord-

ing to the probability model parameters to produce the final bit stream. Binary

arithmetic coding has two modes: the regular coding mode and the bypass

coding mode. The regular mode uses an adaptive probability model, and

the bypass coding mode assumes equal probability for both bin values.


2.0.8 Summary

H.265/HEVC offers higher compression efficiency than

H.264/AVC in both objective and subjective tests. Moreover, the bit rate reduction,

based on objective evaluation of CTC test sequences, indicates an overall performance

improvement of about 50% over H.264/AVC. HEVC yields a substantial improve-

ment in compression capability beyond that of H.264/AVC for video streaming ap-

plications, and the coding performance gains of HEVC over H.264/AVC generally

increase with increasing video resolution up to at least 4K resolutions. For the next

generation of video coding, the features of parallel processing, high compression

capability, and low computational complexity are very important.


Chapter 3 Edge detector based fast level

decision algorithm for intra prediction

of HEVC

3.1 Background and previous works

HEVC is a new international standard developed by the Joint Collaborative

Team on Video Coding (JCT-VC) that aims to achieve 50% bit rate reduction

relative to H.264. In particular, it can support 4K and 8K ultra high-definition

(UHD) video, with resolutions of up to 8,192×4,320. Intra prediction, which is of great

significance for video compression in H.264, continues to play an important role in

HEVC. Many new features have been proposed to improve the intra coding efficiency

of HEVC. The two main features are the size of the prediction unit, which can be

defined from 4×4 up to 64×64, compared with from 4×4 to 16×16 for H.264, and

the 35 intra prediction modes. This results in significantly higher complexity for

intra prediction.

There have already been some related works aiming to reduce the complexity

of HEVC. A fast intra prediction algorithm based on pixel gradient statistics was

proposed by Chen et al. [4]. In their paper, pixel gradient statistics are extracted

using the Sobel operator to exclude unreasonable modes and unit sizes. The method

reduces the encoding time by about 28%. Meanwhile, da Silva et al. developed a

gradient-based fast intra prediction algorithm using five edge-strength filters. In

their paper, the edge strength results determined the corresponding intra prediction

mode set. The number of available intra prediction modes was reduced to 9, com-

pared with 35 in HEVC, and a reduction in the processing time of almost 32% was

achieved. However, da Silva et al. only utilized edge direction information to reduce

the number of modes considered unreasonable [5], and in some cases the list of re-

maining modes did not contain the optimal mode. In this paper, edge information

is treated as a highly sensitive indicator of regional complexity and is

used to exclude unreasonable unit sizes. This enables our proposed method to save


much more time with almost the same bit rate and PSNR as conventional HEVC.

3.2 Analysis of Intra Coding in HEVC

In contrast to the previous video coding standards, HEVC employs a flexible

quadtree coding block-partitioning structure, which consists of three kinds of basic

units: coding units (CUs), prediction units (PUs) and transform units (TUs). The

CU is the basic unit of partitioning, whose size can range from 8×8 to the largest

coding unit (LCU). A picture is divided into slices and each slice is composed of a

sequence of LCUs whose maximum size can be 64×64. Each LCU is recursively split

into CUs that constitute the quadtree structure. Figure 3.1 shows the recursive

splitting of CUs in HEVC. The PU is the basic unit for intra and inter prediction and

only PU sizes of 2N×2N and N×N are supported in the intra prediction of HEVC.

TUs are used for transform and quantization and the TU size can be different from

the PU size. The maximum PU size is 64×64 and the maximum TU size is 32×32, while their minimum

sizes are 4×4.

Figure 3.1 Recursive splitting of CUs in HEVC


Figure 3.2 Intra prediction modes in HEVC

In HEVC, 35 intra prediction modes are employed to improve the coding effi-

ciency [6]. As illustrated in Figure 3.2, the 35 intra prediction modes consist of

33 angular modes, DC mode and planar mode. The specific number of prediction

modes varies according to the PU size, as shown in Table 3.1. To find the optimal

CU partitioning and prediction mode, HEVC must examine all the combinations

of CU, PU and TU by performing a series of computations to carry out the rate-

distortion optimization (RDO) process. In HEVC there are two major steps in

achieving the above targets for intra prediction. The first step is to calculate the

Hadamard transform absolute difference (HAD) costs of all the supported prediction

modes to create a list of candidates with the minimum values of HAD. The number

of candidates for different PU sizes is shown in Table 3.1. In the second step, the

optimum PU size is derived by computing the rate-distortion (RD) costs, and the

final prediction mode and CU partitioning are determined. The above process has

a tremendous computational workload and is time-consuming. In fact, for an LCU


Figure 3.3 Proportions of depths for different class sequences

Table 3.1 Number of intra modes for each PU size

PU size    Number of intra prediction modes    Number of mode candidates
4×4        18                                  8
8×8        35                                  8
16×16      35                                  3
32×32      35                                  3
64×64      4                                   3

with the size of 64×64, 7,552 HAD cost calculations and 2,623 to 4,923 RD cost calculations

have to be performed. Thus, if the CU partition can be determined in advance, a

great deal of computation time will be saved.
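The figures of 7,552 HAD costs and 2,623 RD costs can be reproduced from Table 3.1 by counting how many PUs of each size tile a 64×64 LCU. The short sketch below does this; the names `TABLE_3_1` and `cost_counts` are introduced here for illustration. The upper bound of 4,923 RD costs involves additional candidates that Table 3.1 does not list, so it is not reproduced.

```python
# (PU size, number of intra modes, number of mode candidates), from Table 3.1.
TABLE_3_1 = [(4, 18, 8), (8, 35, 8), (16, 35, 3), (32, 35, 3), (64, 4, 3)]


def cost_counts(lcu_size=64):
    """Count HAD and minimum RD cost evaluations for a full search of one LCU."""
    had = rd = 0
    for pu_size, modes, candidates in TABLE_3_1:
        pus = (lcu_size // pu_size) ** 2  # number of PUs of this size in the LCU
        had += pus * modes
        rd += pus * candidates
    return had, rd


print(cost_counts())  # (7552, 2623)
```

More than half of the HAD evaluations (4,608 of 7,552) come from the 256 PUs of size 4×4, which is why deciding the partition level in advance pays off.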

The computation process has a lot of redundancy which can be reduced. As

the depths of CUs in HEVC are 0, 1, 2 and 3, the encoder recursively considers all possible

combinations at each depth in order from depth 0 to depth 3. Finally, it retains

only one optimal combination. This means that there is a lot of redundancy in

the process. From [7], the proportions of depths for different class sequences are

shown in Figure 3.3. The average percentage of depth 0 is approximately 60%.

Therefore, finding an effective way to reduce the number of CU sizes evaluated and to

terminate the RDO process early is very meaningful for saving time.


3.3 Proposed algorithm

The CU partitioning of a frame tends to have a high correlation with the regional

texture according to experimental results. A homogeneous region is likely to utilize

a larger CU size or low-level CUs, and a complex region will be split into smaller

CUs with a high probability [8],[9]. An example of this is illustrated in Figure 3.4,

where the CUs are partitioned after the full RDO process. HEVC utilizes low-level CUs in the

top left corner and high-level CUs on the man’s face. Therefore, there is a close

relationship between texture complexity and CU partitioning.

Figure 3.4 PU partitioning of BQMall sequence

Edge detection can obtain gradient information, which mainly contains two com-

ponents, the texture complexity and the spatial direction [10]. In this paper, the

texture complexity is utilized to determine the PU size for intra prediction. First,

the concept of the Roberts-cross edge detector is introduced to obtain the magni-

tude and orientation, which are used for the analysis of the texture. The calculation

utilizes the two convolution masks of 2×2 units shown in Figure 3.5. The gradient

vector of the unit Gx,y is obtained as follows:


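$${G_x}_{i,j} = p_{i,j} - p_{i+1,j+1} \qquad (3.1)$$

$${G_y}_{i,j} = p_{i+1,j} - p_{i,j+1} \qquad (3.2)$$

The magnitude of the gradient can be defined as follows:

$$\nabla I_{x,y} = G_{x,y} = \sqrt{G_x^2 + G_y^2} \qquad (3.3)$$

The angle of orientation can also be defined as follows:

$$\Theta_{x,y} = \frac{G_x^2}{G_y^2} \qquad (3.4)$$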

Figure 3.5 Roberts-cross masks for gradient calculation

The texture complexity of the original LCU can be explored using the above

Roberts-cross edge detector.
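The gradient magnitude of Eqs. (3.1) to (3.3) can be computed with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `roberts_cross` and the list-of-lists sample layout are introduced here, and the code is not taken from the encoder used in the experiments.

```python
import math


def roberts_cross(p):
    """Return gradient magnitudes for a 2-D list of luma samples p.

    p[i][j] is indexed as in Eqs. (3.1) and (3.2); the output has one
    fewer row and one fewer column than the input.
    """
    rows, cols = len(p), len(p[0])
    mag = []
    for i in range(rows - 1):
        row = []
        for j in range(cols - 1):
            gx = p[i][j] - p[i + 1][j + 1]            # Eq. (3.1)
            gy = p[i + 1][j] - p[i][j + 1]            # Eq. (3.2)
            row.append(math.sqrt(gx * gx + gy * gy))  # Eq. (3.3)
        mag.append(row)
    return mag


flat = roberts_cross([[50] * 4 for _ in range(4)])  # homogeneous block
edge = roberts_cross([[0, 0, 255, 255]] * 4)        # block with one sharp edge
```

The homogeneous block gives a magnitude of zero everywhere, while the second block gives a large magnitude only at the positions that straddle the edge. This is the contrast that the level decision relies on.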

3.3.1 Bottom-up level-decision method

It is well known that humans are less sensitive to changes in hue (chrominance)

than to changes in brightness (luminance), and coding technology has made use

of this feature. In this paper, pixel luminance samples of the original source are

utilized to calculate the gradient magnitude. On the basis of bottom-up theory, the

calculation starts from the smallest unit size (4×4). If the gradient magnitude

is tiny, the region is regarded as homogeneous and the four units can be merged into a larger unit (8×8),

which is then evaluated at the next level. On the other hand, if the derived


gradient magnitude is not negligible, which means the region is complex, the four

units will not be merged and are assigned to the high level.

In the meantime, to assist in the merging decision, chrominance sample infor-

mation is introduced. By way of illustration, we will explain the chroma-assisting

decision. As shown in Figure 3.6, a search for the maximum and minimum chromi-

nance of the samples from all 64 pixels in PU1, PU2, PU3 and PU4 (size 4×4, level

4) is carried out to decide whether or not the image chrominance changes sharply.

If any of the four differences between the maximum and minimum is large, the

four PUs will be assigned a size of 4×4; otherwise, the four PUs may

be merged into a larger PU (8×8), depending on the above gradient magnitude.

Figure 3.6 Chroma-assisting decision

As described in Algorithm 1, the calculation starts from a unit size of 2×2. Then,

the chrominance of a level 4 sample is calculated and the sum of sharply varying

chroma units (SCU) is evaluated. If the value of SCU is not 0, the decision process

will terminate and the level will be set as 4. Otherwise, the luminance decision

process will start and the sum of homogeneous units (SHU) will be evaluated. If

SHU is 4, the four PUs will be merged into a larger PU of size 8×8 and the level

is set as 3. It is noteworthy that the four PUs involved in the calculation are those


continuously merged from the smallest unit size (4×4); not all the PUs have a size

of 8×8. Therefore, redundant calculations are avoided and the encoding time is

reduced. The same process is performed at unit sizes of 16×16 and 32×32. In this

way, a fast bottom-up level decision is achieved.

Algorithm 1 Bottom-up Level Decision

function LevelDecision
    for every 4 continuous 2×2 units in LCU do
        edge detection calculation
    end for
    for every 4 continuous 4×4 units in LCU do
        chrominance decision
        if SCU != 0 then
            Level ← 4
        else
            luminance decision
            if SHU != 4 then
                Level ← 4
            else
                Level ← 3
            end if
        end if
    end for
end function
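The decision made for one group of four 4×4 units can be sketched in Python as follows. The sketch follows the description in the text (merge only when SCU is 0 and SHU is 4), but several details are assumptions introduced here: the function name `decide_level`, the two threshold parameters, and the rule that a unit counts as homogeneous when its largest gradient magnitude does not exceed the threshold.

```python
def decide_level(chroma_units, gradient_units, chroma_th, grad_th):
    """Decide whether four neighbouring 4x4 units merge into one 8x8 unit.

    chroma_units:   four lists of chroma samples, one per 4x4 unit
    gradient_units: four lists of gradient magnitudes, one per 4x4 unit
    chroma_th, grad_th: hypothetical thresholds (not specified in the text)
    Returns 4 (keep the 4x4 units) or 3 (merge into one 8x8 unit).
    """
    # Chrominance decision: SCU counts units whose chroma varies sharply.
    scu = sum(1 for c in chroma_units if max(c) - min(c) > chroma_th)
    if scu != 0:
        return 4
    # Luminance decision: SHU counts homogeneous units (tiny gradients).
    shu = sum(1 for g in gradient_units if max(g) <= grad_th)
    return 3 if shu == 4 else 4


smooth_chroma = [[128] * 16 for _ in range(4)]
smooth_grad = [[0.0] * 9 for _ in range(4)]
level = decide_level(smooth_chroma, smooth_grad, chroma_th=10, grad_th=2.0)
```

With completely smooth input, `level` is 3, meaning that the four units are merged. The same function could be reused at the 16×16 and 32×32 stages by feeding it the statistics of the already merged units.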

3.3.2 Authentic-feature-based level decision method

The second part of the proposed algorithm is based on an authentic image fea-

ture, as we observed from statistical results that the PU size is mainly 16×16 and

8×8. Therefore, the same calculation as above is started from level 2, which has the

higher probability with a PU size of 16×16. As shown in Figure 3.7, the program is


executed in two directions: to level 0 with a PU size of 64×64, and to level 4 with

a PU size of 4×4. Here, we first perform the calculation in the direction of

level 0, and then complete the level decision with a top-down design starting from

level 2.

Figure 3.7 Authentic-image-feature-based level decision procedure

3.3.3 Integrated fast level decision algorithm

In general there are five levels from 4×4 to 64×64 in HEVC. The proposed

algorithm can rapidly decide the level and terminate the CU compression process

early. Firstly, a rough gradient calculation for a frame with unit size 64×64 is

carried out to obtain the complexity of the frame. A threshold (TH) is used to

choose the bottom-up level decision method or authentic-image-feature-based level

decision method for the frame. In this study, a series of experiments are conducted

with the sequence RaceHorses (416×240) and the results are shown in Table 3.2.

The quantization parameter is set to 32. The increase in bit rate (∆BR), decrease

in PSNR (∆PSNR) and time saving in encoding (TS) are considered in selecting

the threshold value. Finally, the threshold value is set to 128 in this paper. In

the rough gradient calculation, if the value for the merged 4×4 unit is above the

threshold value, the authentic-image-feature-based level decision method will be

utilized. Otherwise, the bottom-up level decision method will be utilized. Then the

bottom-up level decision or authentic-image-feature-based level decision is carried


Figure 3.8 Flowchart of the proposed algorithm

Table 3.2 Experimental results for threshold value

Threshold value    ∆BR(%)    ∆PSNR(dB)    TS(%)
122                1.72      -0.0503      22.40
124                1.69      -0.0523      25.50
126                1.47      -0.0575      25.61
128                1.31      -0.0508      27.04
130                1.12      -0.0479      25.35
132                1.12      -0.0569      25.24
134                1.04      -0.0491      23.69

out to decide the level and terminate the CU compression. A flowchart of the

proposed algorithm is shown in Figure 3.8.

3.4 Experimental results

The proposed edge-detector-based fast level-decision algorithm is integrated with

the reference software of the HEVC test model (HM) 12.1 [11]. The simulation

platform is Intel(R) Core(TM) 2 Quad CPU Q8400 @ 2.66GHz with 4 cores and


2.00 GB RAM. Class A, B, C, D and E sequences are employed for performance

comparison. As the algorithm is mainly applied to intra prediction, we set the period

of I frames to 1 to ensure that all the frames are intra encoded. The simulation

conditions are defined in [12] and the quantization parameters are set to 22, 27,

32 and 37. The performances of da Silva’s algorithm compared with HM and our

proposed algorithm compared to the HM are shown in Table 3.3. In addition, the

reduction in complexity of the proposed algorithm is derived from the time saving

of encoding, which is defined as follows:

$$TS = \frac{T_{HM} - T_{PA}}{T_{HM}} \times 100\% \qquad (3.5)$$

where THM denotes the coding time of HM, and TPA denotes the time used by

the proposed algorithm. ∆PSNR is the difference in PSNR between the proposed

algorithm and the original HM. ∆BR denotes the percentage increase in the bit

rate of the proposed algorithm compared with the original HM.
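As a quick illustration of Eq. (3.5), the time saving can be computed as follows. The function name `time_saving` and the timing values in the example are hypothetical and are not measurements from the experiments.

```python
def time_saving(t_hm, t_pa):
    """Eq. (3.5): percentage of encoding time saved relative to HM."""
    if t_hm <= 0:
        raise ValueError("t_hm must be positive")
    return (t_hm - t_pa) / t_hm * 100.0


# Hypothetical timings: HM needs 100 s, the modified encoder needs 62.84 s.
ts = time_saving(100.0, 62.84)  # about 37.16 (%)
```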

As shown in Table 3.3, the comparison between the proposed algorithm

and da Silva’s algorithm gives clear results for the bit rate (∆BR), ∆PSNR and time

saving (TS). On average, the proposed algorithm achieves a time saving of 37.16% for

intra encoding, while the average increase in the bit rate is 1.46% and the decrease

in PSNR is only 0.0635 dB, which is negligible. The RD performance is shown in

Figures 3.9 to 3.11. From the RD curves for the proposed algorithm and HM, it is

clear that our proposed algorithm achieves almost the same PSNR value for different

bit rates as HM.

According to Table 3.3, in high-resolution video sequences, da Silva’s algorithm

cannot save as much time as the proposed algorithm, while its advantages in terms of

PSNR loss and the increase in bit rate are less evident. In

the sequences PeopleOnStreet and BQTerrace, both da Silva’s algorithm

and our algorithm perform worse than on the other sequences. The two

sequences share the feature that they contain a lot of shade, which degrades the

accuracy of edge detection. This causes the PU partitioning to stop at a low

level and terminate early, so that both algorithms miss the optimal partition.


Table 3.3 Performance of the proposed algorithm and da Silva’s algorithm

                                               da Silva’s algorithm            Proposed algorithm
Class (resolution)     Sequence                ∆BR(%)  ∆PSNR(dB)  TS(%)        ∆BR(%)  ∆PSNR(dB)  TS(%)
Class A (2560×1600)    SteamLocomotiveTrain    0.49    -0.0060    12.44        0.27    -0.0186    27.72
                       PeopleOnStreet          1.98    -0.0490    20.91        1.58    -0.0581    35.99
Class B (1920×1080)    Kimono                  0.81    -0.0140    31.73        0.18    -0.0180    40.48
                       BQTerrace               2.26    -0.0280    17.45        1.89    -0.0540    37.40
                       ParkScene               0.03    -0.0300    8.74         0.39    -0.0730    41.73
Class C (832×480)      BQMall                  NA      NA         NA           1.11    -0.0718    40.53
                       BasketballDrill         NA      NA         NA           1.84    -0.0615    37.49
Class D (416×240)      RaceHorses              NA      NA         NA           1.31    -0.0508    27.04
                       BlowingBubbles          NA      NA         NA           1.61    -0.0769    25.55
Class E (1280×720)     Vidyo1                  NA      NA         NA           3.08    -0.1141    46.48
                       Vidyo4                  NA      NA         NA           2.84    -0.1013    48.31
Average                                        1.11    -0.0254    18.25        1.46    -0.0635    37.16


Figure 3.9 RD curves of ParkScene sequence

3.5 Summary

We proposed a fast level decision algorithm for the intra prediction of HEVC

using an edge detector. By using the result of gradient detection, the number of

RDO processes required is considerably reduced and the encoding is accelerated.

The proposed algorithm achieved a large reduction in computational complexity com-

pared with the original HEVC decision algorithm. Experimental results showed that

our proposed algorithm reduced the coding time by about 37.16%, while the corre-

sponding increase in the bit rate was only 1.46% and the PSNR loss was

0.0635 dB. Compared with da Silva’s algorithm, our proposed algorithm achieves

greater time saving and a similar performance in terms of the increase in bit rate

and decrease in PSNR. In future, our main work will be to develop novel intra predic-

tion modes and to optimize the decision of the prediction mode in combination with the

fast-level-decision algorithm to further increase coding efficiency.


Figure 3.10 RD curves of BQMall sequence

Figure 3.11 RD curves of RaceHorses sequence


Chapter 4 Adaptive downsampling signal based

intra prediction for parallel intra

coding of HEVC


Dynamic force from the worldwide consumer electronics market is driving display

technology and video coding technology. In 2013, the Joint Collaborative Team

on Video Coding (JCT-VC) standardized HEVC to achieve higher video coding

efficiency [13-15]. HEVC employs a flexible hierarchical structure of CTU which

is partitioned into coding units (CUs) recursively, and supports larger CU sizes

ranging from 8 × 8 to 64 × 64 [16]. HEVC also defines two other functional types

of unit, prediction unit (PU) and transform unit (TU). PU is the basic unit for

prediction and TU is defined for transform and quantization. Intra coding reduces

spatial redundancy between neighboring PUs in the same frame and improves coding

efficiency.

Several related works have improved intra coding efficiency by exploiting the correlation between spatial reference samples and the samples of the current PU. A lossless intra coding algorithm based on pixel-wise spatial interleave prediction (PWSIP) is proposed for H.264/AVC in [17]. Interleave prediction, whose reference pixels are derived from neighboring reconstructed pixels, is used to generate the prediction pixels. Experimental results show a 4.13% increase in compression efficiency, but encoding time rises by 31.59%. S. Kanimozhi et al. proposed an integrated scheme using PWSIP and context adaptive interpolation (CAI) [18]. Horizontal, vertical and mean modes, as well as 4 intra prediction modes of H.264/AVC, are employed to increase coding performance. Their algorithm improves compression efficiency by about 4%. A locally adaptive downsampling video coding scheme is proposed in [19]. This adaptive downsampling/upsampling scheme aims at better video quality at low bit rates in terms of both objective measures and visual quality. Different regions of a video frame are adaptively encoded with downsampling ratios and quantization step sizes chosen according to local content. The appropriate downsampling rates are derived through theoretical analysis. Compared with the Test Model 5 (TM5) MPEG-2 encoder, the PSNR improvement can be up to 1.3 dB at low bit rates. Although the scheme performs impressively at low bit rates, its compression rate is worse than regular coding at high bit rates and low QPs. In [20], a multiple line-based algorithm is proposed for HEVC. Li et al. use farther reference lines of relatively higher quality to obtain better prediction. Compared with HEVC, the proposed fast search method achieves 2.0% bit saving on average, while increasing the encoding time by 112%. A new block-based method for HEVC intra coding is proposed in [21]. The pixels in each prediction block are divided into two parts: half of the pixels are coded with a constrained quantization algorithm, whereas the other half are reconstructed by linear interpolation along a prediction direction using the neighboring reference pixels and the first half of coded pixels. A competition mechanism between this new method and the original HEVC intra coding chooses the optimal mode for each prediction block. Experimental results show about 2% BD bitrate reduction for both luma and chroma with respect to the original HEVC intra coding, but the encoder complexity increases by 130%.

Recently, learning-based image super-resolution using a convolutional neural network (CNN) has been applied to improve intra coding efficiency in [22]. Compared with traditional high-efficiency intra coding methods, the learning-based approach improves compression efficiency further, at the cost of high computational complexity and unsolved data dependency. A CNN-based block up-sampling scheme for intra frame coding is proposed, in which a down-sampled block is compressed by normal intra coding and then up-sampled to its original resolution. The network is both compact and efficient thanks to a new CNN structure that features deconvolution of feature maps, multi-scale fusion, and residue learning. The coding parameters of down-sampled blocks are also studied thoroughly for rate-distortion optimization. Experimental results show that the CNN-based scheme achieves a significant bit rate saving (approximately 5.5%). However, multi-scale feature extraction, deconvolution, multi-scale reconstruction and residue learning in the CNN-based up-sampling lead to a large amount of calculation and increased encoding time.

The aforementioned works achieve outstanding coding performance, but they significantly increase the computational complexity and data dependency of intra coding. It is therefore difficult to design a parallel hardware architecture or a pipelined hardware implementation for them. In this chapter, an adaptive downsampling signal based intra coding scheme is proposed, aiming at a higher-efficiency intra prediction scheme for HEVC that also reduces computational complexity and supports parallelization. The rest of this chapter is organized as follows. Section 4.2 introduces and analyzes intra coding in HEVC. The proposed methods, including the downsampling approach, the downsampling prediction modes, the training method for obtaining the downsampling QP and the parallel intra encoding architecture, are presented in Section 4.3. Section 4.4 shows the experimental results, followed by the summary in Section 4.5.

4.2 Analysis of intra coding in HEVC

The integrated intra coding process is illustrated in Figure 4.1 and explained in detail as follows. The process is divided into two main steps, rough mode decision (RMD) and rate-distortion optimization (RDO). In the RMD process, reconstructed pixels of neighboring PUs are used to build prediction pixels with the 35 intra prediction modes. The Hadamard transform absolute difference (HAD) costs of all supported prediction modes are calculated, and the modes with the smallest HAD costs form a candidate list. Meanwhile, the most probable modes (MPMs) are derived from neighboring PUs and added to the mode candidates. In the RDO process, the optimal PU partitioning and prediction mode are decided by the RD cost, which is derived from the residual between reconstructed samples and original samples. Moreover, the intra coding process is performed recursively to generate the optimal encoding result.
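The RMD step described above can be sketched as follows. This is a minimal illustration, not the HM implementation: the helper names (`had_cost_4x4`, `rough_mode_decision`), the unnormalized 4 × 4 Hadamard cost, and the default of three candidates are assumptions made for the example, and the per-mode prediction blocks are supplied by the caller.

```python
# 4x4 Hadamard matrix used for the transform-domain cost.
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def had_cost_4x4(orig, pred):
    """Sum of absolute Hadamard-transformed differences (unnormalized)."""
    diff = [[orig[y][x] - pred[y][x] for x in range(4)] for y in range(4)]
    h4_t = [list(row) for row in zip(*H4)]
    coeffs = matmul4(matmul4(H4, diff), h4_t)
    return sum(abs(c) for row in coeffs for c in row)


def rough_mode_decision(orig, predictions, mpms, num_candidates=3):
    """Keep the modes with the smallest HAD cost, then append missing MPMs.

    predictions: dict mapping a mode index to its 4x4 prediction block.
    mpms: most probable modes derived from neighboring PUs (given here).
    num_candidates: assumed list length, a free parameter in this sketch.
    """
    costs = {mode: had_cost_4x4(orig, pred)
             for mode, pred in predictions.items()}
    candidates = sorted(costs, key=lambda m: (costs[m], m))[:num_candidates]
    for mode in mpms:
        if mode not in candidates:
            candidates.append(mode)
    return candidates


if __name__ == "__main__":
    block = [[10] * 4 for _ in range(4)]
    preds = {0: [[10] * 4 for _ in range(4)],
             1: [[11] * 4 for _ in range(4)],
             26: [[14] * 4 for _ in range(4)]}
    print(rough_mode_decision(block, preds, mpms=[26, 0], num_candidates=2))
```

Only the modes in the returned list would then go through the full RDO process, which is what keeps RMD cheap relative to evaluating the RD cost of every mode.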

Figure 4.1 Intra coding flow for HEVC.

Figure 4.2 High encoding latency caused by CTU-level data dependency: (a) CTU encoding order and (b) CTU pipeline scheduling.

Figure 4.3 High encoding latency caused by PU-level data dependency: (a) PU encoding order and (b) PU pipeline scheduling.

Figure 4.4 Spatially referencing samples of HEVC.

The recursive intra coding of HEVC improves coding efficiency; however, three issues still make implementation difficult:

a) computational complexity

H.264/AVC defines intra block sizes from 4 × 4 to 16 × 16 and a total of 9 optional prediction modes, whereas HEVC employs much larger blocks, from 4 × 4 up to 64 × 64, and many more intra prediction modes. These two features yield high intra coding efficiency, but they also introduce significantly higher complexity for intra prediction. To find the optimal CU partitioning and prediction mode, HEVC traverses all combinations of CU, PU and TU, performing a series of computations to carry out the RDO process. For the intra prediction process, a 64 × 64 CTU requires approximately 2.65 × more predictions than in H.264 [23].

b) data dependency

Figures 4.2 and 4.3 illustrate the encoding order of CTUs and PUs in HEVC. HEVC encodes CTUs in raster scan order, and all PUs in a CTU are encoded in Z-scan order, which provides more available neighboring reference samples in most cases. Since reconstructed pixels of neighboring CTUs and PUs are used in HEVC, the intra encoding process is restricted by data dependency and suffers from extremely high latency. Take the CTU as an example of the hardware implementation problems that data dependency causes for an intra encoder. The order of intra encoding is restricted by CTU-level data dependency: each CTU must wait until both its left and above-right neighboring CTUs have been encoded. This constraint introduces latency between the current CTU and adjacent CTUs. Furthermore, the maximum parallelism is substantially limited, and encoding cannot start from an arbitrary position in the frame. Only the encoding direction from top left to bottom right is permitted in HEVC. Therefore, a pipelined hardware implementation and a parallel architecture are difficult to design. Both CTU-level and PU-level data dependency lead to low throughput and high hardware overhead.

c) coding efficiency

In the spatial domain, pixels to be predicted tend, in most instances, to be highly correlated with neighboring pixels. As mentioned above, HEVC intra coding reduces the redundancy between pixels that are close to each other in the same frame [24]. The 35 intra prediction modes use the spatially neighboring samples to predict the samples of the current PU, as shown in Figure 4.4. However, the correlation between spatially local pixels decays as the PU size increases. In particular, for large PU sizes (64 × 64 or 32 × 32), the available reference samples are too far away from some pixels of the current PU, which degrades the prediction performance.
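The CTU-level dependency described in item b) can be made concrete with a small scheduling sketch. It assumes an idealized encoder with unlimited parallel units and an identical encoding time for every CTU; the function name `ctu_start_times` and the unit time are illustrative only. Each CTU starts as soon as its left and above-right neighbors (the above neighbor in the last column) have finished.

```python
def ctu_start_times(rows, cols, ctu_time=1):
    """Earliest start time of each CTU under left / above-right dependency."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(start[r][c - 1])          # left neighbor
            if r > 0:
                # above-right neighbor; fall back to above in the last column
                deps.append(start[r - 1][min(c + 1, cols - 1)])
            start[r][c] = max(deps) + ctu_time if deps else 0
    return start


if __name__ == "__main__":
    for row in ctu_start_times(rows=3, cols=6):
        print(row)
```

In this model every CTU row starts two CTU periods after the row above it, so even with unlimited hardware the number of CTUs that can be processed at the same time stays small. This is the limited parallelism that the text attributes to CTU-level data dependency.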

4.3 Proposed parallel intra coding scheme

In this chapter, we propose a parallel intra coding scheme that consists of the reconstruction of downsampling reference samples, downsampling prediction modes and a training method for the optimal downsampling QP. An overview of the proposal is illustrated in Figure 4.5. In this work, reference samples are reconstructed by the downsampling approach instead of being taken from neighboring CUs as in the original intra coding of HEVC. The data dependency caused by encoding CUs in Z-scan order is removed by using the reference samples of the downsampling approach. Moreover, a parallel implementation of the intra encoder becomes possible. The proposed downsampling prediction modes derive intra prediction samples from the reconstructed samples of the downsampling approach. Owing to the correlation between the original signal and the downsampling signal, the downsampling prediction modes reduce more spatial redundancy and increase coding efficiency compared with original HEVC. Furthermore, we propose a training method to generate the optimal downsampling QP, which tries to balance the reconstructed samples of the downsampling approach against the intra prediction samples. The final encoded bitstream is optimized in coding performance with the optimal downsampling QP. The parallel intra encoding architecture is also explained in detail in the following sections.

Figure 4.5 Overview of the proposed intra coding scheme.

4.3.1 Downsampling approach for reconstructing samples

In our proposal, the downsampling signal is used to generate prediction pixels. The current CTU is downsampled into 4 sub-CTUs (S0, S1, S2 and S3) of the same size. A 4-tap downscaling filter is devised, and each CTU applies the downsampling filter with a simple coefficient vector (DFcoeff). The DFcoeff vector is [1,0,0,0], [0,1,0,0], [0,0,1,0] or [0,0,0,1], one for each value of j. P is the vector of original pixels for the i-th 8×8 block in one frame. The downsampling pixels (Pdown) are obtained as follows:

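\[ P_{down}(i, j) = \vec{P} \cdot \overrightarrow{DF_{coeff}} \gg 2 \tag{4.1} \]

\[ P_{down}(i, j) =
\begin{bmatrix} P(4i+0) \\ P(4i+1) \\ P(4i+2) \\ P(4i+3) \end{bmatrix}
\cdot
\begin{bmatrix} DF_{coeff}(4i+0, j) \\ DF_{coeff}(4i+1, j) \\ DF_{coeff}(4i+2, j) \\ DF_{coeff}(4i+3, j) \end{bmatrix}
\gg 2 \tag{4.2} \]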

Then, the downsampling signal S0 is encoded by the original HEVC intra coding tool to derive the downsampling reconstruction pixels. Finally, the downsampling signal is used to generate prediction samples, as shown in Figure 4.6, and also serves as reference samples in intra prediction. The spatial correlation between downsampling samples and original samples is considered to be strong.

Figure 4.6 Generate prediction samples using downsampling signal.
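The split of a CTU into the four sub-CTUs S0 to S3 can be sketched as below. The sketch rests on two assumptions that the text does not state: the four pixels P(4i+0) to P(4i+3) of each group are taken to form a 2 × 2 neighborhood, and the right shift in Eq. (4.1) is treated as compensation for a fixed-point coefficient scale and is therefore omitted. Under these assumptions the one-hot DFcoeff vector simply selects the j-th pixel of each group. The function name `downsample_ctu` is illustrative.

```python
def downsample_ctu(ctu):
    """Split a 2-D block into four sub-blocks S0..S3 of a quarter of its area.

    S_j collects the j-th pixel of every 2x2 group, which is what a one-hot
    4-tap filter [1,0,0,0], [0,1,0,0], ... selects (assumed pixel layout).
    """
    height, width = len(ctu), len(ctu[0])
    offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (row, column) inside a group
    subs = []
    for dy, dx in offsets:
        subs.append([[ctu[y + dy][x + dx] for x in range(0, width, 2)]
                     for y in range(0, height, 2)])
    return subs


if __name__ == "__main__":
    block = [[4 * y + x for x in range(4)] for y in range(4)]
    for name, sub in zip(("S0", "S1", "S2", "S3"), downsample_ctu(block)):
        print(name, sub)
```

Because each sub-CTU samples the whole CTU area, S0 can be coded first and its reconstruction then provides reference samples that lie inside the current block rather than only along its border.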

4.3.2 Downsampling Prediction Modes

In HEVC, planar mode, DC mode, and directional modes are employed to generate prediction samples from the spatial correlation between PUs. These intra prediction modes operate on reference samples reconstructed from neighboring PUs. In the RDO process of HEVC, RD costs are calculated recursively for the 35 intra prediction modes, which leads to high power consumption because of the computational complexity. Meanwhile, the data dependency caused by the intra prediction modes severely restricts parallel hardware implementation. Therefore, because planar mode, DC mode, and directional modes are not suitable for parallel hardware design, all 35 intra prediction modes are replaced in our proposal with two downsampling prediction modes. Since the 35 intra prediction modes are not available, MPMs are also disabled in our proposal. The two efficient downsampling prediction modes are defined as downsampling mode A (DMA) and downsampling mode B (DMB). The downsampling predictor can be formulated as:

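\[ p_{i,j} =
\begin{bmatrix} D_{i,j-1} \\ D_{i-1,j} \\ D_{i,j} \\ D_{i+1,j} \\ D_{i,j+1} \end{bmatrix}
\cdot
\begin{bmatrix} R_{i,j-1} \\ R_{i-1,j} \\ R_{i,j} \\ R_{i+1,j} \\ R_{i,j+1} \end{bmatrix}
\gg 2 \tag{4.3} \]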

p_{i,j} denotes the predicted value of each sample, where i and j are the column and row coordinates. D_{i,j} is a downsampling reference sample generated by the downsampling approach. R_{i,j} is the weighting parameter for the downsampling reference sample and is calculated from D_{i,j}. DMA and DMB use different values of R_{i,j}.
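The predictor of Eq. (4.3) is a five-tap cross-shaped weighted sum followed by a right shift of 2. The sketch below illustrates that structure only. The text states that the weights are calculated from the reference samples and differ between DMA and DMB but does not give their values, so the fixed weight vector used here (summing to 4 so that the shift normalizes the result) and the border clamping are hypothetical.

```python
def predict_sample(ref, i, j, weights):
    """Five-tap cross predictor: dot product of D and R, then >> 2.

    ref: 2-D list of downsampling reference samples, indexed ref[row][column].
    i, j: column and row of the sample, following the text's convention.
    weights: (R[i,j-1], R[i-1,j], R[i,j], R[i+1,j], R[i,j+1]); hypothetical.
    """
    height, width = len(ref), len(ref[0])

    def at(col, row):
        # Clamp coordinates at the block border (an assumption of this sketch).
        return ref[min(max(row, 0), height - 1)][min(max(col, 0), width - 1)]

    taps = (at(i, j - 1), at(i - 1, j), at(i, j), at(i + 1, j), at(i, j + 1))
    return sum(d * r for d, r in zip(taps, weights)) >> 2


if __name__ == "__main__":
    reference = [[10, 20, 30],
                 [40, 50, 60],
                 [70, 80, 90]]
    example_weights = (0, 1, 2, 1, 0)  # illustrative values only
    print(predict_sample(reference, 1, 1, example_weights))
```

Every predicted sample depends only on the downsampling reference samples, not on previously reconstructed neighboring PUs, which is the property that allows all samples to be predicted independently.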

4.3.3 Training method for obtaining downsampling QP

The final encoding result clearly depends on the downsampling QP and the current QP (CQP). As shown in Figure 4.7, the optimal downsampling QP, which gives the best coding efficiency, can in theory be obtained by a training method. The downsampling QP obtained by this theoretical method is defined as TDQP. Every frame of the video source is used in the training process. All candidates of the downsampling QP are employed to encode the downsampling signal and generate downsampling reconstruction samples. The RDO process is performed with the downsampling reconstruction samples generated with an assumed downsampling QP (ADQP). The ADQP is adjusted according to the bitrate and PSNR results. When the trend of the bitrate and PSNR results is upward, the ADQP is raised. Conversely, the ADQP is lowered when coding efficiency becomes worse. TDQP is thus obtained by a training approach based on coding performance. The optimal combination of downsampling QP and CQP is determined by the final compression performance.

Figure 4.7 Flowchart of training TDQP.
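One way to read the training loop above is as a local search over the ADQP, sketched below. This is an interpretation, not the thesis procedure: `coding_cost` is a hypothetical callback that encodes with a given downsampling QP and returns a single cost combining bitrate and PSNR (lower is better), the search starts at the CQP, and the QP range 0 to 51 is the usual HEVC range for 8-bit video.

```python
def train_tdqp(coding_cost, cqp, qp_min=0, qp_max=51):
    """Move the assumed downsampling QP (ADQP) while the coding cost improves."""
    adqp = cqp
    best = coding_cost(adqp)
    # Pick the direction in which the first step improves the cost.
    if adqp < qp_max and coding_cost(adqp + 1) < best:
        step = 1
    else:
        step = -1
    while qp_min <= adqp + step <= qp_max:
        cost = coding_cost(adqp + step)
        if cost >= best:
            break  # coding efficiency no longer improves
        adqp += step
        best = cost
    return adqp


if __name__ == "__main__":
    # Toy cost with its minimum at QP 34, standing in for a real encoder run.
    print(train_tdqp(lambda qp: (qp - 34) ** 2, cqp=27))
```

Each call to `coding_cost` stands for a complete encoding pass, which is why the text goes on to replace this search with a faster closed-form estimate.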

However, performing a complete coding process for each round of training is computationally expensive, so a fast training method is presented. The downsampling QP obtained by the fast method is denoted FDQP. The initial frame of the video source is divided into four parts of the same size. Offline training is performed on these four parts with the above-mentioned training method. The FDQP, the CQP, the RD cost of the downsampling signal and the entropy signal are used to derive the calculation formula, which is defined as follows:

Cqp = QPC ·N2 + µ ·RDS + ν ·RE (4.4)

where QPC denotes the current QP and N2 is a weighting factor for the current QP. µ and ν are parameters for the RD cost of the downsampling signal (RDS) and the intra coding entropy signal (RE). Cqp is the optimal downsampling QP. Following this approach, the weighting factor and parameters are derived in the offline training process. In online intra coding, Cqp is calculated as the optimal downsampling QP from the result of offline training. Table 4.1 presents the TDQP and FDQP results for different sequences. The results indicate that the proposed fast training method is a valid and reasonable replacement for the theoretical method.

Table 4.1 Result of TDQP and FDQP.

Sequence     CQP   TDQP   FDQP
Beauty       27    34     34
HoneyBee     27    34     34
ShakeNDry    27    32     32
Kimono       27    30     30
ParkScene    27    31     30
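Evaluating Eq. (4.4) online is a single weighted sum, as the sketch below shows. The values of N2, µ and ν come from offline training and are not given in the text, so the numbers in the usage example are invented purely to exercise the function. Rounding to an integer and clipping to the range 0 to 51 are also assumptions of this sketch.

```python
def fast_downsampling_qp(cqp, rd_s, r_e, n2, mu, nu, qp_min=0, qp_max=51):
    """Evaluate Cqp = QPC * N2 + mu * RDS + nu * RE, then round and clip."""
    c_qp = cqp * n2 + mu * rd_s + nu * r_e
    return min(max(round(c_qp), qp_min), qp_max)


if __name__ == "__main__":
    # Hypothetical trained parameters and cost values, for illustration only.
    print(fast_downsampling_qp(cqp=27, rd_s=1000.0, r_e=500.0,
                               n2=1.0, mu=0.005, nu=0.004))
```

Compared with the exhaustive training loop, this costs one multiply-accumulate per term for each decision, since the DSRD cost and the entropy cost are already produced by the encoder.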

4.3.4 A parallel HEVC intra encoding architecture

The proposed parallel architecture is illustrated in Figure 4.8. Five prediction engines work in parallel, and the universal predictor for each block size (64 × 64, 32 × 32, 16 × 16, 8 × 8 and 4 × 4) further improves the throughput. The parallel intra encoding architecture is explained in detail as follows. The original video source is input to the encoder. First, the input signal is pooled to derive the downsampling signal, and the downsampling pixel buffer stores the pixel values of the downsampling signal. Second, the downsampling signal goes through the normal RDO process (DSRDO) to generate reference samples, whose values are written to the downsampling reconstruction pixel buffer. During this RDO process, the resulting RD costs are stored in the DSRD cost buffer for the training method and for calculating the optimal downsampling QP. Third, the reconstructed reference samples are passed to the 5 prediction engines. The prediction engines can generate prediction pixels independently, and the parallel execution of multiple predictors reduces data dependency. Therefore, the traditional recursive RDO process of HEVC is replaced by the parallel DSRDO and ERDO processes in our proposal. Moreover, in the proposed parallel architecture, the transform and quantization in DSRDO and ERDO are implemented in exactly the same way, which simplifies the hardware design. The RD costs of the parallel RDO process are stored in the ERD cost buffer. The values in the DSRD cost buffer and the ERD cost buffer are used by the training method to obtain the optimal downsampling QP, and the weighting factors and parameters in equation (4.4) are derived in the offline process. Finally, the encoded bitstream is output with the optimal downsampling QP, which is calculated from the current QP.

Figure 4.8 Top level hardware architecture for HEVC.
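The independence of the five prediction engines can be illustrated in software with a toy model. This is not a description of the hardware: each "engine" below only computes a sum of absolute differences per block between the original samples and a shared, read-only reference, the reference is used directly as the prediction, and the names `engine_cost` and `run_engines` are made up for the example. Python threads illustrate the independence of the tasks rather than real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZES = (64, 32, 16, 8, 4)  # one engine per block size


def engine_cost(block_size, original, reference):
    """Toy engine: SAD of every block_size x block_size block."""
    height, width = len(original), len(original[0])
    costs = []
    for y0 in range(0, height, block_size):
        for x0 in range(0, width, block_size):
            sad = sum(abs(original[y][x] - reference[y][x])
                      for y in range(y0, min(y0 + block_size, height))
                      for x in range(x0, min(x0 + block_size, width)))
            costs.append(sad)
    return block_size, costs


def run_engines(original, reference):
    """Run all engines concurrently; they share only read-only inputs."""
    with ThreadPoolExecutor(max_workers=len(BLOCK_SIZES)) as pool:
        futures = [pool.submit(engine_cost, size, original, reference)
                   for size in BLOCK_SIZES]
        return dict(future.result() for future in futures)


if __name__ == "__main__":
    orig = [[5] * 64 for _ in range(64)]
    ref = [[4] * 64 for _ in range(64)]
    costs = run_engines(orig, ref)
    print({size: len(values) for size, values in costs.items()})
```

No engine waits for another engine's output, which mirrors the claim that replacing neighbor-based references with the downsampling reconstruction removes the recursion between block sizes.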


4.4 Experimental results

The proposed scheme is integrated into the HEVC test model (HM) 16.11 reference software [25]. The simulation platform is an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4 cores, RAM 8.00 GB and Windows 10 Home Edition 64-bit. Since the proposed method applies to intra coding, the experiments are conducted with the all-intra configuration [26]. To evaluate performance adequately, 100 frames of each sequence are tested. Because our proposal targets intra encoding of high-resolution sources, only high-resolution test sequences are used, with quantization parameters (QP) of 22, 27, 32 and 37 [27]. The sequences Beauty, HoneyBee and ShakeNDry are tested at 3840 × 2160 and 1920 × 1080 [28]. The sequences Kimono and ParkScene have a resolution of 1920 × 1080. Test conditions and experimental settings are listed in Table 4.2.

Table 4.2 Experimental conditions.

Reference software   HM 16.11
OS                   Windows 10 Home Edition 64-bit
CPU                  Intel Core i7-4770 3.40GHz
RAM                  16.00 GB
Configuration        encoder intra main
Frames               100
QP                   22, 27, 32, 37
Compiler software    Microsoft Visual Studio Community 2015

To evaluate the compression performance of the parallel intra encoding, two configurations are defined: the proposed efficient intra coding without adaptive downsampling QP (M1) and the proposed efficient intra coding with adaptive downsampling QP (M2). Table 4.3 summarizes the simulation results in terms of bitrate (∆BR), PSNR (∆PSNR) and time saving (∆T) for M1 and M2. The compression ratio, video quality and computational reduction are evaluated as follows:

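\[ \Delta BR = \frac{BR_{prop.} - BR_{HEVC}}{BR_{HEVC}} \times 100\% \tag{4.5} \]

\[ \Delta PSNR = PSNR_{prop.} - PSNR_{HEVC} \times 100\% \tag{4.6} \]

\[ \Delta T = \frac{T_{prop.} - T_{HEVC}}{T_{HEVC}} \times 100\% \tag{4.7} \]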

where BR_{HEVC} and BR_{prop.} are the bitrates of the output bitstreams of original HEVC and the proposed method, PSNR_{HEVC} and PSNR_{prop.} denote the PSNR values of original HEVC and the proposal, and T_{HEVC} and T_{prop.} are the encoding times of the HM 16.11 encoder and the proposal, which are used to evaluate the proposed speed-up method.
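The three measures can be computed as in the sketch below. Two conventions are assumptions made here so that the outputs match the units and signs reported in Table 4.3: ∆PSNR is returned as a plain difference in dB, and the time measure is returned as a positive saving when the proposal is faster. The function names and the numbers in the usage example are illustrative and are not measured data.

```python
def delta_bitrate(br_prop, br_hevc):
    """Relative bitrate change in percent; negative means bits are saved."""
    return (br_prop - br_hevc) / br_hevc * 100.0


def delta_psnr(psnr_prop, psnr_hevc):
    """PSNR difference in dB; negative means quality loss."""
    return psnr_prop - psnr_hevc


def time_saving(t_prop, t_hevc):
    """Encoding time saving in percent; positive means the proposal is faster."""
    return (t_hevc - t_prop) / t_hevc * 100.0


if __name__ == "__main__":
    # Made-up inputs chosen only to show the sign conventions.
    print(round(delta_bitrate(95.83, 100.0), 2))
    print(round(delta_psnr(39.7436, 40.0), 4))
    print(round(time_saving(72.74, 100.0), 2))
```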

On average, the proposed algorithm achieves a time saving of 27.26% for intra encoding, while the average bitrate saving is 4.17% and the decrease in PSNR is only 0.2564 dB, which is negligible. The RD performance is shown in Figures 4.9, 4.10 and 4.11. The RD curves of the proposed algorithm and HEVC show that the proposed algorithm achieves higher coding efficiency than HEVC. Specifically, the proposed scheme performs well in high-resolution video compression, and the coding performance improves as the resolution increases, as the sequence Beauty shows in Figures 4.9 and 4.11. We consider that higher-resolution videos contain substantial spatial redundancy and that the method using the downsampling signal reduces the redundancy between PUs. Meanwhile, comparing the RD curves in Figures 4.9 and 4.10, we find that the proposed method performs better for the sequence Beauty than for HoneyBee. The sequence Beauty is a portrait with a dark background that is full of digital noise. The sequence HoneyBee shows bees and flowers with less digital noise. Noise in a frame frequently lacks spatial correlation with the samples of neighboring PUs. Our proposal uses a downsampling signal that is generated from both noise and normal pixels. Therefore, the proposed method performs well for video signals in which noise is present. We do not contend, on the basis of the above experimental results, that the proposed algorithm is always significantly better than HEVC. When some video sequences are encoded with a low QP (for example 22 and 27), the reference samples generated by the downsampling approach improve prediction accuracy only marginally because of the transform and quantization method in HEVC. However, the proposed algorithm saves bit rate in the majority of cases. Moreover, it reduces computational complexity and improves parallel performance for all sequences.


Figure 4.9 RD curves of Beauty 2160p in M2.

Figure 4.10 RD curves of HoneyBee 2160p in M2.


Figure 4.11 RD curves of Beauty 1080p in M2.


Table 4.3 Compared Experimental Results for Proposed Scheme and HEVC

                          M1                             M2
Resolution  Sequence      ∆BR(%)  ∆PSNR(dB)  ∆T(%)       ∆BR(%)  ∆PSNR(dB)  ∆T(%)
3840x2160   Beauty        -2.50   -0.1270    24.35       -4.73   -0.0934    26.95
3840x2160   HoneyBee      -7.54   -0.5450    24.08       -10.67  -0.4323    25.11
3840x2160   ShakeNDry     -3.21   -0.4363    24.84       -5.32   -0.3383    25.61
1920x1080   Beauty        0.72    -0.1510    24.73       -1.38   -0.1135    26.17
1920x1080   HoneyBee      -6.47   -0.6182    22.13       -9.32   -0.4059    25.06
1920x1080   ShakeNDry     -1.83   -0.5860    27.21       -0.325  -0.2109    27.80
1920x1080   Kimono        -2.38   -0.1809    30.94       -4.35   -0.1206    31.23
1920x1080   ParkScene     4.81    -0.2453    26.94       2.71    -0.3363    30.18
            Average       -2.3    -0.3612    25.65       -4.17   -0.2564    27.26


4.5 Summary

A novel parallel HEVC intra coding scheme based on an adaptive downsampling signal is presented for hardware implementation. It uses downsampling pixels to reconstruct reference samples and generates prediction pixels with the proposed downsampling prediction modes. The proposed parallel hardware architecture reduces data dependency and improves data throughput. The experimental results show that our proposal achieves higher coding efficiency than original HEVC.

Chapter 5 Hardware implementation oriented fast intra coding based on downsampling information for HEVC

5.1 Background and previous works


Over the past 10 years, the worldwide consumer electronics market has proven to be a dynamic force driving display technology and video coding technology, and the demand for high-resolution, high-quality video has grown steadily. The assortment of differentiated standard definition (SD) and high definition (HD) devices has shrunk significantly, with most of the better features and performance attributes moving into full high definition (FHD) and 4K ultra high definition (UHD) products [29]. More and more multimedia terminals are being updated and adapted to higher-resolution video. It is worth noting that higher-resolution and higher-quality video overloads the network during content delivery because of the large amount of video data. To solve this problem and achieve higher video coding efficiency, the Joint Collaborative Team on Video Coding (JCT-VC), which is organized by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG), standardized HEVC on January 25, 2013 [30-33].

Known as the next generation video coding standard, HEVC supports higher-resolution video coding than previous video coding standards. It extends high-resolution support to 4K and 8K ultra high definition (UHD) and responds to the increasing demand for higher-resolution and higher-quality video. More importantly, HEVC achieves about 50% bit-rate reduction compared with advanced video coding (H.264/AVC). An evaluation of the H.265/HEVC Main profile and the H.264/AVC High profile finds that HEVC has substantial advantages in coding efficiency [34]. The bit-rate results show that HEVC with the Main profile reduces the overall bit rate by 50.1% compared with the H.264/AVC Baseline profile and by 40.7% compared with the H.264/AVC High profile. Owing to its higher coding efficiency, HEVC is expected to progressively replace H.264/AVC in applications and to become one of the main video coding standards in the future.

HEVC employs a flexible hierarchical structure based on the coding tree unit (CTU), which succeeds the macroblock (MB) based block-shaped region structure of earlier standards. The input video is divided into CTUs, and the CTU is the root of the quadtree structure. A CTU is partitioned into coding units (CUs) recursively. The CU is the basic processing unit, and each CU consists of one luma coding block (CB) and two chroma CBs. Instead of the maximum MB size of 16 × 16 in H.264/AVC, HEVC supports larger sizes of the fundamental processing unit: CU sizes range from 8 × 8 to 64 × 64. HEVC also defines two other functional types of unit, the prediction unit (PU) and the transform unit (TU). The PU is the basic unit for prediction and allows sizes from 4 × 4 to 64 × 64. The TU is defined for transform and quantization. In HEVC, the maximum TU size is 32 × 32 for luma and 16 × 16 for chroma. The minimum TU size is 4 × 4 for both luma and chroma [35].

Historically, intra coding has been a significant coding technology since H.263 [36]. It is intended to reduce spatial redundancy between neighboring PUs in the same frame and increase coding efficiency. In HEVC, intra coding is still one of the sources of high coding efficiency [17]. Many novel features are employed to improve the coding efficiency of intra coding in HEVC. As mentioned above, the maximum PU size extends to 64 × 64. HEVC constructs reference samples from a wide spatial region located at the bottom left, left, top left, top and top right. Another improvement of HEVC intra coding is its 35 prediction modes. In contrast with the 9 prediction modes in H.264/AVC, HEVC employs 33 angular prediction modes, DC mode and planar mode to make much more effective use of spatial correlation. These novel features substantially improve coding efficiency with excellent visual quality; however, they are also the most serious causes of the increase in computational complexity, up to 502% compared with H.264/AVC [37]. The enormous computational complexity burdens not only software operation but also hardware implementation. In hardware, the complex intra coding of HEVC demands unacceptable hardware resources for encoding high resolution (HR) and high frame rate (HFR) video. The unbalanced workload in the intra encoding process makes a flexible pipelined scheme difficult to implement and results in low throughput. Moreover, CTU-level data dependency obstructs higher hardware efficiency through parallelism. Therefore, it is certainly worth developing low-complexity algorithms that suit a pipelined scheme and a parallelized architecture for a faster HEVC encoder.

To date, a number of fast algorithms for intra coding have been proposed to reduce redundant computation in original HEVC. These algorithms attempt to save encoding time while keeping the visual quality loss acceptable. They can be roughly classified into two categories of strategies.

The first strategy relies on texture features of the spatial region to reduce spatial redundancy. In [8], Jiang et al. proposed a gradient-based fast mode decision algorithm. By making use of gradient information, the redundant candidate prediction modes are reduced. Compared with HEVC, the fast intra mode decision scheme provides 20% time saving on average. To achieve higher time saving, a fast mode decision scheme for intra prediction is proposed in [38]. In this scheme, the texture complexity of CUs is calculated and a rational threshold is set to decide the CU size. The proposed scheme reduces encoding time by about 40.9%. Both [8] and [38] focused on prediction mode decision and ignored the redundancy of CU sizes in the intra coding process. Zhang et al. tried to reduce redundant computation in two ways. Four orientation feature filters are designed to extract gradient intensity and texture direction in [9]. The gradient intensity is used to skip impossible PU sizes and the texture direction is used to exclude redundant prediction modes. The proposed algorithm saves about 56.7% of encoding time with a 4.78% bit-rate increase. Song et al. intended to restrain compression performance loss while reducing computational complexity efficiently. They employed an adaptive discretization total variation threshold-based CU size determination algorithm [39] to make a fast decision on CU size. Meanwhile, analyzing pixel values based on orientation gradients helps to reduce candidate modes. By reducing redundant CU sizes and prediction modes, their proposal incurred a maximum bit-rate increase of 2.17% and saved up to 57.21% of coding time on average. Another method is proposed in [7] to restrain the decrease in coding efficiency. Na et al. establish an edge map based on the results of edge detection and balance the trade-off between computational complexity and coding efficiency by setting the number of candidate modes. Compared with full-mode HEVC, the proposed algorithm reduces the encoding time by 56.8% at the cost of a 2.5% bit-rate increase. Furthermore, a three-stage pipelined architecture [40] is designed based on dominant direction strength for real-time realization of an HEVC encoder. In [40], a parameter named dominant direction strength is defined to evaluate CU homogeneity and perform a fast mode selection. Experimental results show that the proposed method achieves 45.8% time saving on average and that the architecture operates at a maximum clock rate of 235 MHz, supporting video up to level 6.2.

The second strategy uses information from neighboring CUs to reduce encoding time. A fast CU size and mode decision algorithm is proposed in [41]. The algorithm exploits correlations between spatially nearby CUs and reduces redundant CU sizes and modes based on these correlations. Another method for deriving spatial correlation is proposed by Zhao et al. in [42]. They analyzed the rate distortion costs between the parent CU and part of its child CUs to decide the CU depth, and they exploited the correlation of intra-prediction modes between neighboring PUs to speed up the mode decision. As a result, their proposal provides about 50% time saving on average.

Some other work combines the first strategy with the second to reduce redundancy more efficiently. In [43], Huang et al. proposed a fast CU depth decision algorithm. The algorithm decides the CU partitioning based on both the texture complexity of the CU and the correlation between the current CU and neighbouring CUs. The proposed scheme provides 39.3% encoding time savings at a coding performance cost of only 0.6%. Shen et al. reduced computational complexity further (45% on average) with a fast CU size decision algorithm [44]. They proposed adaptive thresholds for the texture homogeneity of the current CU and a bypass strategy for intra prediction that uses the texture property and the coding information of neighboring coded CUs. To exploit spatial correlation further, the PU mode and RD cost correlations between different depth levels are obtained in [45] and utilized to remove low-probability candidate modes and skip redundant CU sizes. Experimental results demonstrate that the proposed algorithm achieves about 50.99% computation reduction.


The above-mentioned fast algorithms and schemes reduce the computational complexity with acceptable coding efficiency loss. However, most of the proposals are not hardware oriented and are difficult to implement in hardware. Their implementation is obstructed by two main factors: throughput burden and data dependency. Throughput is a major challenge for a real-time intra encoder. As is well known, the base throughput of the 4K UHD@60fps video format is four times that of 1080p@60fps. Special measures are therefore necessary to increase throughput for real-time hardware implementation. Pipelined architecture and parallelized design are two common methods to improve throughput efficiently. However, the CTU-based data dependency in HEVC intra coding makes it difficult to balance the pipeline workload and to map the process onto a highly parallel architecture. Although it is a significant obstacle in VLSI architecture design, the CTU-level data dependency problem is not solved by the above-mentioned fast algorithms. The works in [46-47] implemented real-time intra encoders supporting FHD video. They reduced computational complexity and improved the throughput with parallelized VLSI architectures. However, the coding performance of both is unsatisfactory. Moreover, the maximum throughput is insufficient to support 4K applications. Another work [48] presented a hardware architecture based on a computationally scalable algorithm. The scalability increases the maximum throughput, which supports intra encoding up to 2160p@30fps; nevertheless, the maximum coding performance loss reaches 8.91%. Other hardware implementations [49-50], reported for mobile devices, meet the demand of real-time encoding for 4K applications at the sacrifice of coding efficiency. The proposal in [49-50] employed a new feature detection approach which considers pixel orientation, similarity in the spatial and temporal domains, and partitioning block types. Statistical analysis is performed with the feature detection approach to analyze pixel activities and determine the mode candidates. It reduced computational complexity by 68.5% with a maximum peak signal-to-noise ratio (PSNR) degradation of 0.16 dB.

In this paper, a downsampling process is introduced as a pre-processing step before the intra coding process. The downsampling information derived in the pre-processing stage is utilized to reduce the computational complexity and achieve a fast intra coding process with excellent coding performance. Moreover, the proposal is designed to be hardware-friendly and parallelized to achieve real-time encoding for UHD video. The pipeline efficiency and the throughput of the proposal are substantially improved compared with the original intra coding of HEVC.

5.2 Analysis of intra coding in HEVC

In HEVC, a flexible hierarchical structure based on the CTU is employed instead of the macroblock units adopted in previous video coding standards. To perform the residual intra coding process, a CU is divided into a quadtree of PUs and TUs. Figure 5.1 illustrates the quadtree-structure-based intra prediction order and the 5 sizes of PU (64×64, 32×32, 16×16, 8×8 and 4×4), where a block in deep color is partitioned further and a block in light color is not split at all. The 5 PU sizes are defined as different depths, ranging from 0 to 4. TUs are specified to contain and compute the coefficients for the spatial block transform and quantization. A TU is a square block with sizes from 4×4 up to 32×32. Another improvement of intra prediction is the number of available intra prediction modes. In order to exploit the spatial correlation more efficiently, HEVC raises the number of available modes to 35, instead of the 9 intra prediction modes in AVC.

Figure 5.1 Hierarchical quadtree structure in HEVC


The integrated process of intra mode and partitioning decision in HEVC is illustrated in Figure 5.2. The process is divided into two main stages, rough mode decision (RMD) and rate-distortion optimization (RDO). The RMD stage is a pre-processing stage which selects candidates from all 35 intra prediction modes. Based on the candidates, the RDO stage decides the optimal prediction mode and partitioning.

Figure 5.2 Optimal mode and partitioning decision flow

The intra coding process starts with depth 0 and the depth increases step by step. Depth 0 matches the largest CU size, 64×64, in a CTU. Firstly, the original pixels of the current PU are derived, and the reconstructed pixels of neighbouring PUs are utilized to build prediction pixels with the 35 intra prediction modes. Secondly, RMD is performed to decide the prediction mode candidates, as shown in Figure 5.2 (a). In the RMD stage, the residuals between the original pixels and the prediction pixels are calculated and passed to the Hadamard transform. Then, a SATD-based Hadamard cost is evaluated with the following function:

$$J_{RMD} = SATD + \lambda_{RMD} \cdot R_{prediction} \tag{5.1}$$

where SATD is the sum of the absolute values of the Hadamard-transformed residual coefficients, $\lambda_{RMD}$ is the Lagrange multiplier, which is related to the quantization parameter (QP), and $R_{prediction}$ denotes the number of bits needed to encode the prediction mode information. A prediction mode candidate list is created from the Hadamard cost for each depth, and the number of mode candidates is specified as {3, 3, 3, 8, 8} for depths 0 to 4, respectively. Thirdly, the most probable modes (MPMs) are derived from neighbouring PUs and added to the mode candidates. Fourthly, the best prediction mode is decided by the RD cost. The RD cost is derived from the residual between the reconstructed samples and the original samples. Moreover, the reconstructed samples are utilized as reference samples when the next neighbouring PU is predicted. The detailed RD cost function is defined as:

$$J_{RDO} = SSE + \lambda_{RDO} \cdot R_{mode} \tag{5.2}$$

where SSE denotes the sum of squared errors between the original pixels and the reconstructed pixels, $\lambda_{RDO}$ is the Lagrange multiplier which determines the trade-off between rate and distortion, and $R_{mode}$ specifies the total number of bits needed to encode the PU. The quadtree-based RDO process is implemented recursively and finally determines the optimal prediction mode and partitioning. The RMD and RDO processes search through all the combinations of CU, PU and TU with all 35 intra prediction modes. For a largest CU of size 64×64, the Hadamard cost is calculated 7,327 times and the RD cost at least 3,646 times, which consumes much computation time [51] and makes real-time hardware implementation difficult. Therefore, reducing redundant partitionings and prediction modes is important, and a great deal of computation can be saved by optimizing the RDO process. However, reducing the PU sizes, TU sizes and prediction mode candidates improperly degrades coding efficiency.

The proposed downsampling information based intra coding greatly reduces the number of PU sizes, TU sizes and mode candidates. However, the proposed implementation still cannot completely avoid missing the optimal partitioning and prediction mode, and therefore cannot entirely eliminate the performance decrease.
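To make the RMD step of Eq. (5.1) concrete, the following minimal Python sketch ranks prediction modes for a single 4×4 block by their Hadamard cost and keeps the cheapest ones. It is an illustration under stated assumptions, not the HM implementation: the function names, the un-normalized 4×4 Hadamard matrix, the per-mode bit counts and the λ value are all hypothetical.

```python
# Sketch of rough mode decision (RMD) for one 4x4 block, following Eq. (5.1):
#   J_RMD = SATD + lambda_RMD * R_prediction
# Assumptions: un-normalized 4x4 Hadamard transform; mode bit counts and lambda
# are supplied by the caller (hypothetical values in the usage example).

H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]


def hadamard_4x4(block):
    """Return H4 * block * H4 (H4 is symmetric, so H4 equals its transpose)."""
    tmp = [[sum(H4[r][k] * block[k][c] for k in range(4)) for c in range(4)]
           for r in range(4)]
    return [[sum(tmp[r][k] * H4[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def satd_4x4(orig, pred):
    """Sum of absolute Hadamard-transformed residual coefficients."""
    residual = [[orig[r][c] - pred[r][c] for c in range(4)] for r in range(4)]
    return sum(abs(v) for row in hadamard_4x4(residual) for v in row)


def rmd_candidates(orig, predictions, mode_bits, lam, num_candidates):
    """Rank modes by J_RMD and keep the num_candidates cheapest ones."""
    costs = {mode: satd_4x4(orig, pred) + lam * mode_bits[mode]
             for mode, pred in predictions.items()}
    return sorted(costs, key=costs.get)[:num_candidates]


def flat(value):
    """Helper for the example: a 4x4 block filled with one value."""
    return [[value] * 4 for _ in range(4)]


# Hypothetical example: three candidate predictions of a flat block.
example_predictions = {0: flat(10), 1: flat(9), 2: flat(12)}
example_bits = {0: 3, 1: 1, 2: 1}
best_two = rmd_candidates(flat(10), example_predictions, example_bits,
                          lam=2, num_candidates=2)
```

In the encoder described above, the number of candidates kept would be 3 or 8 depending on the depth, and the MPMs would then be appended before the RDO stage.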

5.3 Proposed downsampling information based fast intra coding algorithms

Figure 5.3 The proposed fast intra coding scheme

As analyzed above, the intra encoder of HEVC checks the possible partitionings and available prediction modes exhaustively to obtain the lowest R-D cost. It derives the optimal intra prediction and achieves higher coding efficiency at the cost of enormous computation in RMD and RDO. To reduce redundant computation and enable real-time hardware implementation, we propose downsampling information based intra coding, which contains several fast algorithms. The proposed fast intra coding scheme is outlined in Figure 5.3. First, a preprocessing stage is performed on the downsampled source. The stage derives prediction information and encoding characteristics with a downsampling filter. Then, the downsampling prediction information is utilized to make fast decisions on PU depth, TU depth and prediction mode. The fast PU depth decision (FPDD) skips and early-terminates redundant PU depths. The fast TU depth decision (FTDD) optimizes the residual quadtree (RQT) process and contributes to the practicality of the real-time hardware proposal. The fast prediction mode decision (FPMD) reduces redundant mode candidates and accelerates the RDO process. Finally, the best CU, PU and TU partitioning and prediction mode are determined quickly. Further details of the proposed preprocessing stage and fast algorithms are described in the following subsections.

5.3.1 Downsampling information based preprocessing stage

A downsampling frame is derived directly from its original frame, so the downsampling samples are strongly associated with the original samples, particularly at high resolution. Therefore, we presume that the optimal PU partitioning and prediction mode of intra encoding with downsampling samples are strongly related to those with the original samples. The experimental results show that, in most instances, the correlation between the downsampling frame and the original frame in optimal PU partitioning and prediction mode is strong. The correlation is estimated by the following functions:

$$\mu_{DSD} = \frac{1}{N}\sum_{i=1}^{N} C_i, \qquad C_i = \begin{cases} 0, & \mathrm{PUDepth} \neq \mathrm{DSDepth} \\ 1, & \mathrm{PUDepth} = \mathrm{DSDepth} \end{cases} \tag{5.3}$$

$$\mu_{DSD\pm1} = \frac{1}{N}\sum_{i=1}^{N} E_i \tag{5.4}$$

$$E_i = \begin{cases} 0, & \mathrm{PUDepth} \in \mathrm{DSDepth} \pm 1 \\ 1, & \mathrm{PUDepth} \notin \mathrm{DSDepth} \pm 1 \end{cases} \tag{5.5}$$

$$\mu_{DPM} = \frac{1}{N}\sum_{i=1}^{N} P_i, \qquad P_i = \begin{cases} 0, & \mathrm{PUMode} \neq \mathrm{DSMode} \\ 1, & \mathrm{PUMode} = \mathrm{DSMode} \end{cases} \tag{5.6}$$

where $C_i$ indicates whether the optimal PU depth is the same for the downsampling samples (DSDepth) and the original samples (PUDepth) for the i-th 8×8 block in one frame. N represents the total number of 8×8 blocks in one frame for the different sequences. $\mu_{DSD}$ denotes the proportion of blocks whose PU partitioning in the downsampling frame is the same as that in the original frame. In (5.5), $E_i$ indicates whether the optimal PU depth with the original samples falls outside depth±1 of the downsampling samples. $\mu_{DSD\pm1}$ is the corresponding proportion for the extended PU partitioning range (depth±1) of the downsampling frame relative to the optimal PU partitioning in the original frame. In (5.6), $P_i$ indicates whether the optimal prediction mode is the same for the downsampling samples (DSMode) and the original samples (PUMode). $\mu_{DPM}$ denotes the proportion of blocks whose encoded prediction mode in the downsampling frame is the same as that in the original frame. The results for $\mu_{DSD}$, $\mu_{DSD\pm1}$ and $\mu_{DPM}$ are plotted in Figure 5.4 as DS Depth, DS Depth±1 and DS Prediction Mode, respectively. Several Class A and B sequences, namely Traffic, PeopleOnStreet, SteamLocomotiveTrain, NebutaFestival, BQTerrace, Kimono, Cactus and ParkScene, are utilized to estimate the correlation. From the histograms, it is clear that the PU partitioning in the original frames is similar to that in the downsampling frames with high probability, and that the optimal prediction mode with downsampling samples is consistent with that with original samples. Therefore, we consider that the downsampling information can be utilized to predict the optimal PU partitioning and prediction mode and to reduce redundant computation in original HEVC.
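As an illustration of how the proportions in (5.3) to (5.6) could be computed from per-block decisions, consider the following minimal Python sketch. The function name and the input lists are hypothetical, and "PUDepth ∈ DSDepth ± 1" is read here as |PUDepth − DSDepth| ≤ 1, which is an assumption about the notation.

```python
# Sketch: per-frame proportions of Eqs. (5.3)-(5.6) from lists holding one
# entry per 8x8 block. Assumption: "within DSDepth +/- 1" means
# abs(PUDepth - DSDepth) <= 1.

def concordance_rates(pu_depth, ds_depth, pu_mode, ds_mode):
    """Return (mu_DSD, mu_DSD+-1, mu_DPM) for one frame."""
    n = len(pu_depth)
    # C_i = 1 when both encodings choose the same PU depth, Eq. (5.3).
    c = [1 if p == d else 0 for p, d in zip(pu_depth, ds_depth)]
    # E_i = 1 when the original depth lies outside DSDepth +/- 1, as printed
    # in Eq. (5.5); 1 - mean(E) is then the share of blocks inside the band.
    e = [0 if abs(p - d) <= 1 else 1 for p, d in zip(pu_depth, ds_depth)]
    # P_i = 1 when both encodings choose the same prediction mode, Eq. (5.6).
    m = [1 if a == b else 0 for a, b in zip(pu_mode, ds_mode)]
    return sum(c) / n, sum(e) / n, sum(m) / n


# Hypothetical decisions for four 8x8 blocks.
rates = concordance_rates(pu_depth=[1, 2, 3, 0], ds_depth=[1, 1, 1, 0],
                          pu_mode=[10, 26, 1, 0], ds_mode=[10, 26, 2, 1])
```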

The downsampling information based preprocessing stage is proposed on this basis. In this stage, we first downsample the original frames with a 4-tap downsampling filter. Each CU applies the downsampling filter with a simple coefficient vector ($DF_{coeff}$). The $DF_{coeff}$ vector is [1,0,0,0], [0,1,0,0], [0,0,1,0] or [0,0,0,1], selected by the value of j. $\vec{P}$ is the vector of original pixels for the i-th 8×8 block in one frame. The downsampling pixels ($P_{down}$) are obtained as follows:


Figure 5.4 Concordance rates of downsampling and original encoded information (bar chart; vertical axis 0% to 90%; sequences Tra., Peo., Ste., Neb., BQT., Kim., Cac., Par.; series DS Depth, DS Depth±1 and DS Prediction Mode)

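$$P_{down}(i, j) = \vec{P} \cdot \overrightarrow{DF_{coeff}} \tag{5.7}$$

$$P_{down}(i, j) = \begin{bmatrix} P(4i+0) \\ P(4i+1) \\ P(4i+2) \\ P(4i+3) \end{bmatrix} \cdot \begin{bmatrix} DF_{coeff}(4i+0, j) \\ DF_{coeff}(4i+1, j) \\ DF_{coeff}(4i+2, j) \\ DF_{coeff}(4i+3, j) \end{bmatrix} \tag{5.8}$$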

As illustrated in Figure 5.5, each PU in the original frames is downsampled into 4 sub-PUs (S0, S1, S2 and S3) of the same size. Then, the downsampling frames formed from S0 are encoded by the original HEVC intra coding tool to derive the PU partitioning and prediction mode. The other 3 sub-PUs (S1, S2 and S3) assist with the RQT partitioning determination in the proposed fast PU depth decision algorithm, which is described in detail in Section 5.3.2. Moreover, the stage obtains the downsampling information at an affordable computation cost, as shown in the experimental results later.
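The selection in (5.7) and (5.8) can be sketched as follows. This is a minimal Python illustration under one explicit assumption that the text does not state: the four pixels P(4i+0) to P(4i+3) of a group are taken to be a 2×2 neighbourhood in raster order, so that each one-hot $DF_{coeff}$ vector picks one phase of the block and yields one sub-PU. The function name is hypothetical.

```python
# Sketch of the one-hot "4-tap" downsampling of Eqs. (5.7)-(5.8).
# Assumption (not stated in the text): each group of four pixels is a 2x2
# neighbourhood in raster order, so sub-PU S_j collects phase j of the block.

DF_COEFF = [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]


def downsample_to_sub_pus(block):
    """Split a 2N x 2N block into four N x N sub-PUs [S0, S1, S2, S3]."""
    half = len(block) // 2
    sub_pus = []
    for coeff in DF_COEFF:
        sub = []
        for r in range(half):
            row = []
            for c in range(half):
                group = [block[2 * r][2 * c], block[2 * r][2 * c + 1],
                         block[2 * r + 1][2 * c], block[2 * r + 1][2 * c + 1]]
                # Dot product with a one-hot vector selects a single pixel.
                row.append(sum(p * w for p, w in zip(group, coeff)))
            sub.append(row)
        sub_pus.append(sub)
    return sub_pus


# Hypothetical 4x4 block with pixel values 0..15 in raster order.
example_block = [[4 * r + c for c in range(4)] for r in range(4)]
s0, s1, s2, s3 = downsample_to_sub_pus(example_block)
```

Because the coefficients are one-hot, the operation needs no multipliers in practice: it is a pure sample selection, which fits the hardware-friendly intent of the stage.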

5.3.2 Fast PU depth decision algorithm

The PU depth of intra coding in HEVC ranges from 0 to 4, corresponding to PU sizes from 64×64 down to 4×4. The various available PU sizes lead to expensive computation in the RDO process, as mentioned above. Meanwhile, the strong correlation between the optimal PU depth in the downsampling frame and in the original frame was demonstrated in Section 5.3.1. Therefore, we propose to utilize the optimal PU depths obtained with the downsampling samples to make a rapid decision on the PU depths of the original frame.

Figure 5.5 Overview of downsampling method (the original frame is downsampled into sub-PUs S0, S1, S2 and S3)

As shown in Figure 5.4, the optimal PU depth in the downsampling frames is not always the same as that in the original frames. In other words, in some cases the optimal PU partitioning differs from the result in the downsampling frames. The concordance rate of DS Depth, which denotes the PU partitioning similarity between the downsampling and original frames, is not as high as expected. Therefore, the downsampling information cannot be used directly to determine the PU partitioning in intra coding of HEVC.

In our proposal, reference confidence (RC) is defined to decide the PU partitioning for intra coding, using the structural similarity (SSIM) index. SSIM is a method for measuring the similarity between images and videos [52]. It captures the perceived quality of matched samples and can be viewed as a quality measure of the current image, with the reference image regarded as being of perfect quality. In this paper, SSIM is utilized to obtain the reference confidence components (RCC) with the following function:

$$RCC_k = [l(x, y_k)]^{\alpha} \cdot [c(x, y_k)]^{\beta} \cdot [s(x, y_k)]^{\gamma} \tag{5.9}$$

where $RCC_k$ denotes the similarity between the S0 sub-PU and the neighbouring sub-PU Sk in Figure 5.5. x is the index of the 8×8 S0 sub-PUs and $y_k$ is the index of the 8×8 neighbouring sub-PUs Sk. The RCC is based on the computation of three terms, namely the luminance term, the contrast term and the structural term. The overall RCC is a multiplicative combination of the three terms, as in Eq. (5.9). $l(x, y_k)$, $c(x, y_k)$ and $s(x, y_k)$ represent the luminance term, the contrast term and the structural term, respectively. The three terms are formulated as follows:

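$$l(x, y_k) = \frac{2\mu_x\mu_{y_k} + C_1}{\mu_x^2 + \mu_{y_k}^2 + C_1} \tag{5.10}$$

$$c(x, y_k) = \frac{2\sigma_x\sigma_{y_k} + C_2}{\sigma_x^2 + \sigma_{y_k}^2 + C_2} \tag{5.11}$$

$$s(x, y_k) = \frac{\sigma_{xy_k} + C_3}{\sigma_x\sigma_{y_k} + C_3} \tag{5.12}$$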

where $\mu_x$, $\mu_{y_k}$, $\sigma_x$, $\sigma_{y_k}$ and $\sigma_{xy_k}$ are the local means, standard deviations and cross-covariance of the S0 sub-PU and the neighbouring sub-PU Sk. The exponents α, β and γ adjust the relative importance of the three terms (luminance, contrast and structure). In this paper we weight the three terms equally and set α, β and γ to 1. $C_1$, $C_2$ and $C_3$ are three constants that stabilize the division when the denominator is weak, that is, when either $\mu_x^2 + \mu_{y_k}^2$ or $\sigma_x^2 + \sigma_{y_k}^2$ is very close to zero. $C_1$ and $C_2$ are calculated as $C_1 = (\nu_1 L)^2$ and $C_2 = (\nu_2 L)^2$, where $\nu_1$ and $\nu_2$ are small constants far less than 1. In this work, $\nu_1$ is set to 0.01 and $\nu_2$ is set to 0.03. L is the dynamic range of the pixel values and is set to 255 in our proposal. $C_3$ is set to $C_3 = C_2/2$ to simplify the expression. Finally, the RCC simplifies to

$$RCC_k = \frac{(2\mu_x\mu_{y_k} + C_1)(2\sigma_{xy_k} + C_2)}{(\mu_x^2 + \mu_{y_k}^2 + C_1)(\sigma_x^2 + \sigma_{y_k}^2 + C_2)} \tag{5.13}$$
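The simplified expression (5.13) can be evaluated directly from two sub-PUs. The following Python sketch is a minimal illustration, assuming that each sub-PU is passed as a flat list of pixel values and that population statistics (division by the number of pixels) are used for the variances and the covariance; neither detail is specified in the text, and the function names are hypothetical.

```python
# Sketch of the reference confidence component of Eq. (5.13).
# Assumptions: sub-PUs are flat lists of pixel values; variances and the
# cross-covariance are population statistics (divided by n).

def rcc(x, y, dynamic_range=255, v1=0.01, v2=0.03):
    """SSIM-style similarity between sub-PU x (S0) and sub-PU y (Sk)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (v1 * dynamic_range) ** 2
    c2 = (v2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def is_high_reference_confidence(s0, neighbours, threshold=0.85):
    """HRC when RCC_1, RCC_2 and RCC_3 all exceed the threshold."""
    return all(rcc(s0, sk) > threshold for sk in neighbours)


# Hypothetical 2x2 sub-PUs flattened to lists.
textured = [10, 20, 30, 40]
dark_flat = [100, 100, 100, 100]
bright_flat = [200, 200, 200, 200]
```

Identical sub-PUs give an RCC of 1, while two flat sub-PUs with different brightness are penalized by the luminance term alone.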


A smaller RCC value means that the similarity between S0 and Sk is lower. Conversely, a large RCC value indicates that S0 is similar to Sk. High reference confidence (HRC) is declared when the values of $RCC_1$, $RCC_2$ and $RCC_3$ all exceed 0.85 (the threshold $\theta_{RCC}$ for RCC); otherwise the case is specified as low reference confidence (LRC). In the HRC condition, the 4 sub-PUs are similar, which implies that the downsampling samples are structurally similar to the original samples. Therefore, the partitioning result of the HRC condition can be utilized to determine the partitioning in intra coding. In our proposal, when the RC is HRC, the optimal PU depth is set equal to the depth at the same position in the downsampling coding result. Depth skip and early termination are applied to exclude the other 4 redundant PU depths from the RDO process. On the other hand, in the LRC condition, three or more PU depths are evaluated in RDO to avoid missing the optimal partitioning. If the RC is LRC, every PU depth that is no more than the downsampling depth + 2 is evaluated in the RDO process. For instance, if the downsampling depth at the same position is 1 and the RC is LRC, PU depths 0, 1, 2 and 3 are evaluated to decide the optimal PU partitioning.
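The depth selection rule described above can be summarized in a few lines. This Python sketch only restates the rule from the text; the function name and the clamping to the maximum depth of 4 are assumptions made for illustration.

```python
# Sketch of the PU depth candidates evaluated in RDO under HRC and LRC.
# Assumption: candidate depths are clamped to the valid range 0..4.

MAX_PU_DEPTH = 4


def candidate_pu_depths(ds_depth, high_confidence):
    """Return the PU depths that RDO still has to evaluate."""
    if high_confidence:
        # HRC: trust the downsampling result and skip all other depths.
        return [ds_depth]
    # LRC: evaluate every depth up to the downsampling depth + 2.
    return list(range(0, min(ds_depth + 2, MAX_PU_DEPTH) + 1))
```

For the example in the text, `candidate_pu_depths(1, False)` yields depths 0 to 3, while any HRC block is reduced to a single depth.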

In summary, the proposed fast PU depth decision algorithm is presented in Algorithm 1. Firstly, the downsampling pixels are derived as ArrayDS, as described in Section 5.3.1, and the RDO process of the preprocessing stage is performed with ArrayDS to output the DSDepth result. In addition, RCC is calculated from ArrayDS and compared with $\theta_{RCC}$. Then, the original pixels are imported to perform the RDO process, which is skipped or terminated according to the results of RCC and DSDepth. Finally, the optimal PU depth is derived.

5.3.3 Fast TU depth decision algorithm

HEVC employs a recursive RQT process to determine the TU depth. The original RQT process allows a residual block to be further divided into TUs recursively, forming another quadtree. The quadtree-based recursive structure increases intra coding efficiency; however, it requires a huge amount of computation to search for the optimal partitioning and prediction mode, which obstructs real-time hardware implementation of HEVC intra coding.


Algorithm 1 Optimal PU depth derivation process

function DSDep(DSPel)
    ArrayDS ← DSPel
    Process RDO with ArrayDS[j]
    for all i such that 8×8 PU ∈ CTU do
        DSDepth ← PUDepth(i)
    end for
    return DSDepth
end function

function PUDepthDecision(OrgPel)
    ArrayOrg ← OrgPel
    for all k ∈ DFcoeff vectors do
        ArrayDS[k] ← ArrayOrg · DFcoeff
        while k ≠ 0 do
            Calculate RCCk with ArrayDS[0]&[k]
        end while
    end for
    Decide HRC or LRC with θRCC
    Process RDO with ArrayOrg
    if HRC then
        while Depth < DSDep(ArrayDS[0]) do
            skip the Depth
        end while
        while Depth ≥ DSDep(ArrayDS[0]) do
            Terminate process early
        end while
    else LRC
        while Depth ≥ DSDep(ArrayDS[0])+2 do
            Terminate process early
        end while
    end if
    return PUDepth
end function


The fast TU depth decision is proposed to optimize the RQT process and reduce redundant TU depths in RDO. The recursive procedure for transform and quantization is replaced by a simplified process in this paper. In the simplified process, the original pixels are input as ArrayOrg and the RQT process is performed with ArrayOrg. When the TU depth is deeper than the current PU depth, the RQT process is terminated early. Conversely, a TU depth is skipped when it has not reached the current PU size. For instance, when the current PU size is 32×32, the PU is transformed and quantized only with the size of 32×32, instead of going through the entire process with the sizes 32×32, 16×16, 8×8 and 4×4. Owing to the reduction of TU sizes, the downsampling information based fast intra coding reduces the computational complexity and saves encoding time further. According to the experimental results, the fast TU depth decision algorithm saves 16.8% of the encoding time of the proposed fast intra coding on average, at the cost of a negligible BD-rate increase and PSNR loss.

Algorithm 2 Optimal TU depth derivation process

function TUDepthDecision(OrgPel)
    ArrayOrg ← OrgPel
    Process RQT with ArrayOrg
    while Depth < PUDepthDecision(OrgPel) do
        skip the Depth
    end while
    while Depth ≥ PUDepthDecision(OrgPel) do
        Terminate process early
    end while
    return TUDepth
end function

5.3.4 Fast prediction mode decision algorithm

HEVC employs 35 intra prediction modes to create a mode candidate list for the RDO process. The size of the initial list is {3, 3, 3, 8, 8} for PU depths {0, 1, 2, 3, 4}, so more prediction modes are evaluated in the RDO process for smaller PU sizes. Moreover, HEVC supplements the mode candidate list with the MPMs, which add at most 3 prediction modes not already contained in the list [53], as shown in Table 5.1. Undoubtedly, the mode candidate list of HEVC contains substantial redundancy. In our proposal, we reduce the size of the initial list to {1, 1, 2, 3, 3}, and supplement it with the downsampling prediction mode (DPM).

As discussed in Section 5.3.1, the downsampling prediction mode matches the optimal prediction mode with a certain probability. Therefore, we consider that the DPM should be included in the mode candidate list to avoid missing the optimal prediction mode. The size of the intra prediction mode candidate list in the worst case is listed in Table 5.1. Compared with original HEVC, we remove redundant prediction mode candidates and cut down the computational complexity substantially, particularly for PU sizes 4×4 and 8×8. Experimental results demonstrate that the fast prediction mode decision algorithm reduces redundant prediction modes further and that the supplemented downsampling prediction mode effectively avoids missing the optimal prediction mode.

Table 5.1 Number of mode candidates for HEVC and the proposal in the worst case

PU size   Mode candidates in HEVC   Mode candidates in proposal
64×64     [3] + 3                   [1 + 1] + 3
32×32     [3] + 3                   [1 + 1] + 3
16×16     [3] + 3                   [2 + 1] + 3
8×8       [8] + 3                   [3 + 1] + 3
4×4       [8] + 3                   [3 + 1] + 3
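The worst-case counts in Table 5.1 follow from how the candidate list is assembled: the reduced RMD list, then the DPM, then up to three MPMs, with duplicates dropped. A minimal Python sketch of that assembly is given below; the function name, the ordering of the supplements and the example mode numbers are assumptions made for illustration.

```python
# Sketch of the proposed mode candidate list: reduced RMD list + DPM + MPMs.
# Assumption: supplements are appended in the order DPM, then MPMs, and a
# mode already in the list is not added again.

PROPOSED_RMD_COUNT = {64: 1, 32: 1, 16: 2, 8: 3, 4: 3}


def build_candidate_list(rmd_ranked_modes, pu_size, dpm, mpms):
    """Return the prediction modes passed to RDO for one PU."""
    candidates = list(rmd_ranked_modes[:PROPOSED_RMD_COUNT[pu_size]])
    for mode in [dpm] + list(mpms):
        if mode not in candidates:
            candidates.append(mode)
    return candidates


# Hypothetical worst case: RMD modes, DPM and MPMs are all distinct.
worst_8x8 = build_candidate_list([10, 11, 12, 13], 8, dpm=26, mpms=[0, 1, 2])
worst_64x64 = build_candidate_list([10, 11, 12, 13], 64, dpm=26, mpms=[0, 1, 2])
# Hypothetical case where the DPM coincides with the best RMD mode.
overlap_64x64 = build_candidate_list([10, 11, 12, 13], 64, dpm=10, mpms=[0, 1, 2])
```

In the worst case the list holds 7 modes for 8×8 and 4×4 PUs and 5 modes for 64×64 PUs, which matches the last column of Table 5.1.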

5.4 Top-level design for VLSI architecture

As mentioned above, data dependency, throughput burden and additional complex hardware structures severely restrict real-time hardware implementation of HEVC intra coding. The proposed architecture aims to solve these three problems and to support all CU, PU and TU partitionings and prediction modes.


As illustrated in Figure 5.6, the pipelined architecture consists of two stages, the preprocessing stage (PS) and the fast intra coding stage (FICS). The two stages operate in parallel, and the data in the preprocessing stage is entirely independent of that in the fast intra coding stage. To preprocess and derive the downsampling partitioning, prediction mode and RC, the PS imports the original pixels and source information. Downsampling samples are obtained from the original samples and stored in the downsampling samples buffer. Then RMD and RDO are performed to derive the downsampling depth and prediction mode. In addition, RC is calculated and stored in a buffer. Meanwhile, the original pixel source is imported into the original samples buffer in FICS. The fast RQT process is performed to determine the optimal partitioning and prediction mode from the downsampling information and the original samples. Finally, entropy coding is performed and the coded bitstream is output. As discussed in Section 5.1, several obstacles need to be overcome for hardware implementation. The parallelized design of PS and FICS addresses the problem of CTU-based data dependency in HEVC intra coding. Both PS and FICS can be designed as pipelined structures, which improve throughput. Furthermore, PS and FICS can utilize similar prediction elements (PEs) to predict the current PU and calculate the RDO cost.

Figure 5.6 Top level architecture of proposed fast intra coding


Owing to the proposed fast PU depth decision, fast TU depth decision and fast prediction mode decision algorithms, the number of cycles in the worst case ($C_{FICS}$) is calculated in (5.14)-(5.15). The calculation shows that about 2361 cycles are necessary to encode a CTU. As our proposal aims at a software solution for 4K@60fps real-time hardware implementation, the maximum operating frequency is estimated as about 287 MHz by (5.16). Therefore, the proposed downsampling information based fast intra coding and parallelized architecture can meet the requirement of real-time processing.

Figure 5.7 Timing diagram with PS and FICS (along the time axis, the preprocessing stage runs DS+RC+RMD followed by RDO, 910 cycles, for CTU 0, CTU 1, and so on; the fast intra coding stage runs FRDO, 2361 cycles, for CTU 0, CTU 1, and so on)

The pipelined timing schedule of PS and FICS is illustrated in Figure 5.7. PS and FICS operate in parallel and independently, which reduces the CTU-level data dependency of intra coding in HEVC. In PS, source downsampling (DS), RC calculation and RMD are performed first for CTU 0. Then, the RDO process is executed and the intra encoding of CTU 0 is completed. Meanwhile, the DS, RC calculation and RMD of CTU 1 can start. The RDO process of PS takes 910 cycles, as calculated in (5.17)-(5.18). The PS is designed as a pipelined architecture to improve the throughput. In FICS, the fast RDO process (FRDO) takes 2361 cycles to encode the source and output the coded bitstream.

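$$C_{FICS} = \frac{64}{4}\times\frac{64}{4}\times 7 + \frac{64}{8}\times\frac{64}{8}\times 7 + \frac{64}{16}\times\frac{64}{16}\times 6 \tag{5.14}$$

$${} + \frac{64}{32}\times\frac{64}{32}\times 5 + \frac{64}{64}\times\frac{64}{64}\times 5 = 2361 \ \text{(Cycles/CTU)} \tag{5.15}$$

$$F_{max} = 2361 \times \frac{3840}{64} \times \frac{2160}{64} \times 60 \approx 286.86 \ \text{(MHz)} \tag{5.16}$$

$$C_{PS} = \frac{32}{4}\times\frac{32}{4}\times 11 + \frac{32}{8}\times\frac{32}{8}\times 11 + \frac{32}{16}\times\frac{32}{16}\times 6 \tag{5.17}$$

$${} + \frac{32}{32}\times\frac{32}{32}\times 6 = 910 \ \text{(Cycles/CTU)} \tag{5.18}$$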
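The cycle budgets above are sums over PU sizes of (number of PUs in the CTU) × (worst-case mode candidates per PU), and the frequency follows from the number of CTUs per second. The short Python sketch below reproduces the arithmetic; reading the per-size multipliers as the worst-case candidate counts of Table 5.1 at one cycle per candidate is an interpretation of the formulas, and the function names are hypothetical.

```python
# Sketch reproducing the arithmetic of Eqs. (5.14)-(5.18).
# Assumption: each multiplier is the worst-case number of mode candidates
# for that PU size, costing one cycle per candidate.

def cycles_per_ctu(ctu_size, candidates_per_pu_size):
    """Sum over PU sizes of (PUs per CTU) * (candidates per PU)."""
    return sum((ctu_size // pu_size) ** 2 * count
               for pu_size, count in candidates_per_pu_size.items())


def required_frequency_hz(cycles, width, height, fps, ctu_size=64):
    """Cycles per second needed to process every CTU of every frame."""
    return cycles * (width / ctu_size) * (height / ctu_size) * fps


# FICS on 64x64 CTUs with the proposed worst-case candidate counts.
c_fics = cycles_per_ctu(64, {4: 7, 8: 7, 16: 6, 32: 5, 64: 5})
# PS on the downsampled 32x32 CTUs with the original HEVC candidate counts.
c_ps = cycles_per_ctu(32, {4: 11, 8: 11, 16: 6, 32: 6})
f_max_hz = required_frequency_hz(c_fics, 3840, 2160, 60)
```

With these inputs the FICS budget is 2361 cycles per CTU, the PS budget is 910 cycles per CTU, and the required clock for 3840×2160 at 60 fps comes to roughly 286.86 MHz, in line with the figure of about 287 MHz quoted in the text.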


5.5 Experimental results

Several experiments are carried out to evaluate the proposed downsampling information based fast intra coding, which is integrated into the HEVC test model (HM) 12.1 reference software [54]. The simulation platform is an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4 cores, 8.00 GB of RAM and Windows 10 Home Edition 64-bit. Because our proposal aims at a software solution for real-time intra encoding of high-resolution sources, Class 4K (3840×2160), Class A (2560×1600) and Class B (1920×1080) sequences are employed to evaluate the BD-rate and BD-PSNR [55] performance and the computational complexity reduction. The test sequences are encoded with quantization parameters (QP) 22, 27, 32 and 37 under the common test conditions defined in [56]. As the proposed fast algorithms are mainly applied to intra coding, we set the period of I frames to 1 so that all the tested frames are intra encoded.

The proposed fast intra coding consists of three algorithms, FPDD, FTDD and FPMD, which achieve time savings at the cost of a BD-rate increase. FTDD reduces redundant TU depths in the RQT process and provides a 16.8% time reduction with a 0.4% BD-rate increase on average, as listed in Table 5.2. The comparison between FICS and FICS without FTDD is expressed in terms of time saving ($\Delta TS_{TU}$), BD-rate ($\Delta BR_{TU}$) and BD-PSNR ($\Delta PSNR_{TU}$), which are defined as follows:

$$\Delta BR_{TU} = BR_{FICS} - BR_{TU} \tag{5.19}$$

$$\Delta PSNR_{TU} = PSNR_{FICS} - PSNR_{TU} \tag{5.20}$$

$$\Delta TS_{TU} = TS_{FICS} - TS_{TU} \tag{5.21}$$

$$TS_{TU} = \frac{1}{4}\sum_{i=1}^{4} \frac{T_{HM}(QP_i) - T_{TU}(QP_i)}{T_{HM}(QP_i)} \times 100\% \tag{5.22}$$

$$TS_{FICS} = \frac{1}{4}\sum_{i=1}^{4} \frac{T_{HM}(QP_i) - T_{FICS}(QP_i)}{T_{HM}(QP_i)} \times 100\% \tag{5.23}$$

where $BR_{FICS}$ and $BR_{TU}$ denote the BD-rate performance of FICS and of FICS without FTDD. $PSNR_{FICS}$ and $PSNR_{TU}$ indicate the BD-PSNR loss of FICS and of FICS without FTDD. $TS_{FICS}$ and $TS_{TU}$ are the encoding time reductions of FICS and of FICS without FTDD. $T_{HM}(QP_i)$ indicates the encoding time of HM 12.1 with $QP_i$. $T_{FICS}(QP_i)$ and $T_{TU}(QP_i)$ denote the encoding times of FICS and of FICS without FTDD under the QP values 22, 27, 32 and 37. FPMD reduces redundant prediction modes and supplements the downsampling prediction mode to avoid missing the optimal prediction mode. As shown in Table 5.3, FPMD contributes a 1.85% encoding time reduction and a 0.03% BD-rate decrease compared with FICS without FPMD. The comparison between FICS and FICS without FPMD is expressed in terms of time saving ($\Delta TS_{PM}$), BD-rate ($\Delta BR_{PM}$) and BD-PSNR ($\Delta PSNR_{PM}$), which are defined as follows:

$$\Delta BR_{PM} = BR_{FICS} - BR_{PM} \tag{5.24}$$

$$\Delta PSNR_{PM} = PSNR_{FICS} - PSNR_{PM} \tag{5.25}$$

$$\Delta TS_{PM} = TS_{FICS} - TS_{PM} \tag{5.26}$$

$$TS_{PM} = \frac{1}{4}\sum_{i=1}^{4} \frac{T_{HM}(QP_i) - T_{PM}(QP_i)}{T_{HM}(QP_i)} \times 100\% \tag{5.27}$$

where $BR_{PM}$ and $PSNR_{PM}$ denote the BD-rate performance and the BD-PSNR loss of FICS without FPMD. $TS_{PM}$ is the encoding time reduction of FICS without FPMD, where $T_{PM}$ denotes the encoding time of FICS without FPMD. FPDD reduces the computational complexity with the downsampling prediction depth. Its contribution to the time saving and BD-rate increase can be derived by analyzing Tables 5.2, 5.3 and 5.4.

Table 5.4 shows the simulation results in terms of BD-rate ($BR_{FICS}$), BD-PSNR ($PSNR_{FICS}$), time consumption of the preprocessing stage (PTC) and time saving of the fast intra coding ($TS_{FICS}$). PTC and $TS_{FICS}$ are measured relative to the encoding time of the original HM 12.1 to evaluate the computational complexity of PS and FICS. PTC is defined as follows:

$$PTC = \frac{1}{4}\sum_{i=1}^{4} \frac{T_{PS}(QP_i)}{T_{HM}(QP_i)} \times 100\% \tag{5.28}$$

where $T_{HM}$ indicates the encoding time of HM 12.1 with $QP_i$, and $T_{FICS}$ and $T_{PS}$ denote the encoding times of FICS and PS. As shown in Table 5.4, the proposed fast intra coding achieves a 60.4% encoding time reduction with a 1.26% BD-rate increment on average. PS consumes approximately 19.2% of the encoding time of original HEVC; however, PS and FICS run in parallel in the hardware architecture. Moreover, because PS and FICS are designed to use similar PEs for the prediction calculation, highly efficient hardware utilization can be achieved.
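The time-saving and preprocessing-time metrics defined above are averages of per-QP ratios against the HM encoding time. A minimal Python sketch of the two formulas follows; the function names and the encoding times in the example are hypothetical and are not measurements from this work.

```python
# Sketch of the metrics in Eqs. (5.22), (5.23), (5.27) and (5.28).
# All timing values in the example are hypothetical placeholders.

def time_saving(t_hm, t_test):
    """Average relative time reduction over the tested QPs, in percent."""
    ratios = [(ref - test) / ref for ref, test in zip(t_hm, t_test)]
    return sum(ratios) / len(ratios) * 100.0


def preprocessing_time_consumption(t_hm, t_ps):
    """Average PS time relative to HM over the tested QPs, in percent."""
    ratios = [ps / ref for ref, ps in zip(t_hm, t_ps)]
    return sum(ratios) / len(ratios) * 100.0


# Hypothetical encoding times (seconds) for QP 22, 27, 32 and 37.
example_t_hm = [100.0, 80.0, 60.0, 40.0]
example_t_fics = [40.0, 32.0, 24.0, 16.0]
example_t_ps = [20.0, 16.0, 12.0, 8.0]
example_ts = time_saving(example_t_hm, example_t_fics)
example_ptc = preprocessing_time_consumption(example_t_hm, example_t_ps)
```

Averaging the ratios per QP, rather than dividing total times, gives each QP the same weight even though low-QP encodes take longer.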

For further evaluation of the proposed fast intra coding, we compare our work with several related works which achieved real-time intra coding implementation. Table 3 shows the overall specifications of our proposal and two other works. Zhu et al. propose to estimate the RD cost from image texture in [23]. Their work is implemented for HDTV 1080p@44fps real-time encoding. The overall time saving in [23] is 1.3% better than that of our proposal; however, the BD-rate increases by 4.53%. Moreover, their implementation removes the 64×64 PU size. In [24], Huang et al. utilize a source-signal-based fast RMD algorithm to parallelize the hardware implementation, which is able to encode HDTV 1080p@60fps in real time. The BD-rate increase is 4.30%, which is better than [23]; however, the encoding time saving is only about 41.6%. Both works utilize source information to accelerate the intra coding process. However, their computational complexity reduction and BD-rate performance are not good enough for HDTV applications. The biggest advantage of our proposal is that the fast intra coding is based on downsampling information and is able to encode 4K (2160p) video in real time with a maximum operating clock frequency of less than 290 MHz, as demonstrated theoretically.

Table 5.2 Performance comparison between FICS and FICS without FTDD

Class                 Sequence              Bit depth  ΔBR_TU (%)  ΔPSNR_TU (dB)  ΔTS_TU (%)
Class 4K (3840×2160)  Bosphorus             10bit      0.52        -0.0152        15.4
Class 4K (3840×2160)  HoneyBee              10bit      0.11        -0.0044        15.9
Class 4K (3840×2160)  HoneyBee              8bit       0.15        -0.0049        15.6
Class 4K (3840×2160)  Jockey                8bit       0.34        -0.0053        11.0
Class 4K (3840×2160)  ReadySetGo            8bit       0.62        -0.0216        17.1
Class 4K (3840×2160)  ShakeNDry             10bit      0.17        -0.0030        16.0
Class 4K (3840×2160)  ShakeNDry             8bit       0.16        -0.0029        15.1
Class A (2560×1600)   Traffic               8bit       0.71        -0.0385        20.5
Class A (2560×1600)   PeopleOnStreet        8bit       0.60        -0.0344        19.9
Class A (2560×1600)   SteamLocomotiveTrain  10bit      0.21        -0.0125        21.7
Class A (2560×1600)   NebutaFestival        10bit      0.08        -0.0055        22.4
Class B (1920×1080)   Bosphorus             8bit       0.64        -0.0281        13.3
Class B (1920×1080)   HoneyBee              8bit       0.57        -0.0227        15.1
Class B (1920×1080)   Jockey                8bit       0.38        -0.0139        12.3
Class B (1920×1080)   ShakeNDry             8bit       0.10        -0.0051        14.9
Class B (1920×1080)   BQTerrace             8bit       0.66        -0.0435        19.0
Class B (1920×1080)   Kimono                8bit       0.31        -0.0108        16.8
Class B (1920×1080)   Cactus                8bit       0.80        -0.0303        19.7
Class B (1920×1080)   ParkScene             8bit       0.43        -0.0186        18.4
Total average                                          0.40        -0.0169        16.8


Table 5.3 Performance comparison between FICS and FICS without FPMD

Class                 Sequence              Bit depth  ΔBR_PM (%)  ΔPSNR_PM (dB)  ΔTS_PM (%)
Class 4K (3840×2160)  Bosphorus             10bit      0.05        -0.0011        3.9
Class 4K (3840×2160)  HoneyBee              10bit      0.03        -0.0001        1.8
Class 4K (3840×2160)  HoneyBee              8bit       0.03        0.0001         1.8
Class 4K (3840×2160)  Jockey                8bit       -0.20       0.0027         1.1
Class 4K (3840×2160)  ReadySetGo            8bit       -0.02       -0.0006        2.1
Class 4K (3840×2160)  ShakeNDry             10bit      -0.04       -0.0012        1.6
Class 4K (3840×2160)  ShakeNDry             8bit       0.05        -0.0013        1.0
Class A (2560×1600)   Traffic               8bit       -0.03       0.0012         2.6
Class A (2560×1600)   PeopleOnStreet        8bit       0.01        -0.0006        3.0
Class A (2560×1600)   SteamLocomotiveTrain  10bit      0.00        0.0001         2.6
Class A (2560×1600)   NebutaFestival        10bit      0.00        0.0000         3.0
Class B (1920×1080)   Bosphorus             8bit       -0.08       0.0034         0.6
Class B (1920×1080)   HoneyBee              8bit       0.06        -0.0020        1.1
Class B (1920×1080)   Jockey                8bit       -0.04       0.0011         0.9
Class B (1920×1080)   ShakeNDry             8bit       -0.09       0.0043         0.9
Class B (1920×1080)   BQTerrace             8bit       0.03        -0.0020        1.5
Class B (1920×1080)   Kimono                8bit       -0.11       0.0037         1.8
Class B (1920×1080)   Cactus                8bit       -0.01       0.0001         2.1
Class B (1920×1080)   ParkScene             8bit       -0.03       0.0012         1.7
Total average                                          -0.03       0.0008         1.85

Table 5.4 Performance comparison of compression and encoding time for PS and FICS

Class                 Sequence              Bit depth  BR_FICS (%)  PSNR_FICS (dB)  PTC (%)  TS_FICS (%)
Class 4K (3840×2160)  Bosphorus             10bit      2.12         -0.0611         18.2     74.2
Class 4K (3840×2160)  HoneyBee              10bit      0.91         -0.0198         19.2     62.9
Class 4K (3840×2160)  HoneyBee              8bit       0.92         -0.0200         19.2     63.1
Class 4K (3840×2160)  Jockey                8bit       2.09         -0.0309         18.6     72.4
Class 4K (3840×2160)  ReadySetGo            8bit       1.67         -0.0571         19.3     55.2
Class 4K (3840×2160)  ShakeNDry             10bit      0.86         -0.0170         18.8     68.5
Class 4K (3840×2160)  ShakeNDry             8bit       0.82         -0.0163         18.7     69.0
Class A (2560×1600)   Traffic               8bit       1.37         -0.0743         19.4     50.6
Class A (2560×1600)   PeopleOnStreet        8bit       1.20         -0.0683         19.6     48.6
Class A (2560×1600)   SteamLocomotiveTrain  10bit      0.61         -0.0372         19.5     53.6
Class A (2560×1600)   NebutaFestival        10bit      0.29         -0.0214         19.8     46.7
Class B (1920×1080)   Bosphorus             8bit       1.98         -0.0867         19.2     63.2
Class B (1920×1080)   HoneyBee              8bit       1.66         -0.0651         19.8     58.6
Class B (1920×1080)   Jockey                8bit       1.49         -0.0556         19.0     65.8
Class B (1920×1080)   ShakeNDry             8bit       0.79         -0.0400         19.2     65.4
Class B (1920×1080)   BQTerrace             8bit       1.28         -0.0833         19.7     54.9
Class B (1920×1080)   Kimono                8bit       1.02         -0.0362         19.2     66.2
Class B (1920×1080)   Cactus                8bit       1.80         -0.0681         19.8     52.8
Class B (1920×1080)   ParkScene             8bit       1.03         -0.0445         19.4     55.3
Total average                                          1.26         -0.0475         19.2     60.4


Table 5.5 Comparison of coding architecture and performance

Specification                     Proposal  Zhu et al. [46]  Huang et al. [47]
BD-BR (%)                         1.26      4.53             4.30
BD-PSNR (dB)                      -0.0475   -0.2000          NA
Time saving (%)                   60.4      61.7             41.6
Frame rate (fps)                  60        44               60
Max resolution                    2160p     1080p            1080p
Max throughput (Mpixels/s)        498       91               124
Max frequency (MHz) (estimated)   286.87    357.00           294.00
Block size                        All       4, 8, 16, 32     All
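The throughput row of Table 5.5 follows directly from resolution and frame rate. The short Python check below reproduces the three figures and, as a rough budget, derives how many pixels per clock cycle a 4K@60fps encoder would have to sustain if the estimated maximum frequency of 286.87 MHz were taken as the operating clock. The function names are hypothetical, and the pixels-per-cycle figure is an illustrative derivation rather than a number reported in this work.

```python
def throughput_mpixels_per_s(width, height, fps):
    """Pixel rate in megapixels per second for a given picture size and frame rate."""
    return width * height * fps / 1e6


def pixels_per_cycle(width, height, fps, clock_mhz):
    """Average number of pixels that must be processed in each clock cycle."""
    return throughput_mpixels_per_s(width, height, fps) / clock_mhz


proposal = throughput_mpixels_per_s(3840, 2160, 60)   # 2160p at 60 fps
zhu = throughput_mpixels_per_s(1920, 1080, 44)        # 1080p at 44 fps
huang = throughput_mpixels_per_s(1920, 1080, 60)      # 1080p at 60 fps

print(f"{proposal:.1f} {zhu:.1f} {huang:.1f} Mpixels/s")
print(f"{pixels_per_cycle(3840, 2160, 60, 286.87):.2f} pixels/cycle")
```

Rounded to integers, the three rates are 498, 91 and 124 Mpixels/s, which matches the table. The 4K@60fps case requires roughly 1.73 pixels per cycle at 286.87 MHz.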

5.6 Summary

This chapter presents a downsampling-information-based fast intra coding scheme for HEVC that reduces the computational complexity of the intra encoding process. The proposed scheme derives downsampling information in the preprocessing stage. Downsampling depth and reference confidence are employed to determine PU partitioning quickly, and the downsampling mode is used to avoid missing the optimal prediction mode. Moreover, recursive TU searching in the RQT process is optimized. The experimental results show that, compared with HM 12.1, our proposal reduces encoding time by 60.4% on average while increasing BD-rate by about 1.26%. In addition, the feasibility of a real-time 4K@60fps intra encoder implementation is demonstrated theoretically, with a theoretical maximum operating frequency of 286.87 MHz. As future work, the integrated hardware architecture will be designed and implemented, and a high-resolution-oriented intra encoder will be developed.


Chapter 6 Conclusions

The dissertation aims to halve the computational complexity of video encoders while maintaining compression performance in terms of both video quality and encoded bits, so that video data can be compressed efficiently within half the encoding time or with lower power consumption.

Firstly, a fast algorithm for the intra prediction of HEVC is proposed, which efficiently decides the level on the basis of the Roberts cross edge detector. The proposed algorithm exploits the high correlation between regional texture and prediction unit partitioning. It is mainly composed of a bottom-up level decision method and an efficient decision flow based on an authentic image feature. Furthermore, chrominance information is also employed to decide the prediction unit partitioning. The experimental results demonstrate that the proposed algorithm greatly reduces computational complexity compared with the original HEVC. The average time saving is approximately 37%, while the increase in bit rate and the decrease in PSNR are negligible.
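For readers unfamiliar with the operator, the following is a minimal sketch of the Roberts cross gradient on a block of luma samples. It shows only the generic edge-strength feature; the thresholds and the bottom-up decision flow of the proposed algorithm are not reproduced. The function names are hypothetical, and the magnitude is approximated as |gx| + |gy|, a common simplification assumed here rather than taken from this work.

```python
def roberts_cross_magnitude(block):
    """Return the (H-1) x (W-1) map of |gx| + |gy| for a 2-D list of samples."""
    height, width = len(block), len(block[0])
    magnitudes = []
    for y in range(height - 1):
        row = []
        for x in range(width - 1):
            # Differences along the two diagonals of each 2x2 neighbourhood.
            gx = block[y][x] - block[y + 1][x + 1]
            gy = block[y][x + 1] - block[y + 1][x]
            row.append(abs(gx) + abs(gy))
        magnitudes.append(row)
    return magnitudes


def mean_edge_strength(block):
    """Average gradient magnitude of a block, usable as a simple texture measure."""
    magnitudes = roberts_cross_magnitude(block)
    total = sum(sum(row) for row in magnitudes)
    return total / (len(magnitudes) * len(magnitudes[0]))


flat = [[128] * 4 for _ in range(4)]            # homogeneous block
edge = [[0, 0, 255, 255] for _ in range(4)]     # block with a vertical edge

print(mean_edge_strength(flat))
print(mean_edge_strength(edge))
```

A homogeneous block yields zero edge strength, while a block containing a sharp edge yields a large value. This is the kind of texture cue that can be correlated with prediction unit partitioning.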

Secondly, an efficient parallel scheme is proposed for intra coding of HEVC. A downsampled signal is used to generate prediction samples instead of neighboring pixels. This reduces spatial redundancy and removes the data dependency in intra encoding for the coding tree unit (CTU) structure. Meanwhile, a fast training method is designed to derive the downsampled signal adaptively. Experimental results show that the proposed fast parallelized scheme achieves 4.17% bit saving on average while reducing computational complexity by 27.26%.

Thirdly, the dissertation proposes a downsampling-information-based intra coding scheme that consists of two parts: a preprocessing stage and a fast intra coding stage. Three downsampling-information-based fast decision algorithms are proposed in the fast intra coding stage. Moreover, a parallelized architecture for the fast intra coding scheme is proposed. The preprocessing stage, which produces the downsampling information, can be executed in parallel with the intra coding stage. The proposed architecture makes full use of this feature to improve throughput and break data dependencies. Experimental results demonstrate that the proposed algorithms achieve a 60.4% reduction in encoding time on average with negligible coding efficiency loss, compared with the original HEVC.


Acknowledgement

Time passes quickly, and I am finishing three years of study and life at Tokushima University in pursuit of my Ph.D. degree. Finally, it is time to graduate and to acknowledge those who have helped me in my study and life.

First of all, I would like to express my gratitude to my advisors, Associate Professor Song Tian and Professor Shimamoto Takashi, for giving me continuous inspiration, support and criticism throughout my work. They always provided me with valuable insight and made sure that I was not lost in my research direction. Without their support, it would have been impossible for me to finish this work.

I have had the pleasure of studying in the course of electrical and electronic systems at Tokushima University. I would like to extend my appreciation to Professor Hashizume Masaki, Professor Nishio Yoshifumi, Associate Professor Yotsuyanagi Hiroyuki, Assistant Professor Uwate Yoko, and the other technical staff in our course for their support and friendship. I have had the pleasure of collaborating with numerous exceptionally talented graduate students over the last few years. I would like to thank all my colleagues in our lab.

Finally, I would like to express my deepest love and gratitude to my parents, Jingyan Shi and Hong Liu, for their unconditional support and understanding over the years.

Department of Electrical and Electronic Engineering,

College of Systems Innovation Engineering,

Graduate School of Advanced Technology and Science,

The University of Tokushima, Japan.

Wen Shi


Bibliography

[1] B. Bross, W.J. Han, G.J. Sullivan, J.R. Ohm and T. Wiegand: Working draft 11 of high efficiency video coding, JCTVC-L1003, Geneva, Switzerland, 2013.

[2] G.J. Sullivan, J.R. Ohm, W.J. Han, and T. Wiegand: Overview of the high

efficiency video coding (HEVC) standard, IEEE Trans. Circuits Syst. Video

Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.

[3] T.Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra: Overview of the

H.264/AVC video coding standard, IEEE Trans. Circuits Syst. Video Technol.,

vol. 13, no. 7, pp. 560-576. 2003.

[4] G. Chen, Z. Pei, L. Sun and Z. Liu: Fast intra prediction for HEVC based

on pixel gradient statistics and mode refinement, IEEE Signal and Information

Processing (ChinaSIP), pp. 514-517, 2013.

[5] T. L. da Silva, L. V. Agostini and L. A. da S. Cruz: Fast HEVC intra prediction

mode decision based on EDGE direction information, IEEE Signal Processing

Conference (EUSIPCO), pp. 1214-1218, 2012.

[6] JCT-VC: High efficiency video coding (HEVC) test model 8 (HM 8) encoder

description, JCTVC-J1002, 2012.

[7] S. Na, W. Lee and K. Yoo: Edge-based fast mode decision algorithm for intra

prediction in HEVC, Consumer Electronics (ICCE), 2014 IEEE International

Conference, pp. 11-14, 2014.

[8] W. Jiang, H. Ma and Y. Chen: Gradient based fast mode decision algorithm

for intra prediction in HEVC, 2nd International Conference on IEEE Consumer

Electronics, pp. 1836-1840, 2012.

[9] Y. Zhang, Z. Li and B. Li: Gradient-based fast decision for intra prediction in

HEVC, IEEE Visual Communications and Image Processing (VCIP), pp. 1-6,

2012.

[10] W. Ma and B. S. Manjunath: A texture thesaurus for browsing large aerial pho-

tographs, Artificial Intelligence Techniques for Emerging Information Systems

Applications, Vol. 49, pp. 633-648, 1998.


[11] https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/

[12] F. Bossen: Common test conditions and software reference configurations,

JCTVC-K1100, JCTVC of ISO/IEC and ITU-T, Shanghai, China, 2012.

[13] ITU: Recommendation ITU-R BT.2020-2 - Parameter Values for Ultrahigh Def-

inition Television Systems for Production and International Program Exchange,

Document ITU-R WP6C Contribution 399, Geneva, Oct. 2015.

[14] High Efficiency Video Coding (HEVC) Text Specification Draft 10 (for FDIS

and Consent), JCTVC-L1003, ITU-T/ISO/IEC Joint Collaborative Team on

Video Coding (JCT-VC), Jan. 2013.

[15] High efficiency coding and media delivery in heterogeneous environments Part

2: High efficiency video coding, ISO/IEC 23008-2. May, 2015.

[16] J. Lainema et al.: Intra Coding of the HEVC Standard, IEEE Trans. Circuits

Syst. Video Technol., Vol. 22, No. 12, Dec. 2012, pp.1792-1801.

[17] S. Li et al.: Improving Lossless Intra Coding of H.264/AVC by Pixel-Wise

Spatial Interleave Prediction, IEEE Trans. Circuits Syst. Video Technol., Vol.

21, No. 12, Dec. 2011, pp.1924-1928.

[18] S. Kanimozhi et al.: Efficient Lossless Compression Using H.264/AVC Intra

Coding and PWSIP Prediction, Int. Congr. Information Syst. and Computing

(ICISC), Vol. 3, Jan. 2013, pp.406-410.

[19] VA. Nguyen et al.: Adaptive Downsampling / Upsampling for Better Video

Compression at Low Bit Rate, IEEE Int. Symp. on Circuits and Syst. (ISCAS),

May 2008, pp. 1624–1627.

[20] J. Li et al.: Efficient Multiple Line-Based Intra Prediction for HEVC, IEEE

Trans. Circuits Syst. Video Technol., Vol. PP, No. 99, Nov. 2016, pp.1-1.

[21] C. Chen et al.: A New Block-Based Method for HEVC Intra Coding, IEEE

Trans. Circuits Syst. Video Technol., vol. 27, no. 8, April 2016, pp. 1727–1736.

[22] Y. Li et al.: Convolutional Neural Network-Based Block Up-sampling for Intra

Frame Coding, IEEE Trans. Circuits Syst. Video Technol., vol. PP, no. 99, July

2017, pp. 1–13.


[23] S. Park et al.: Report on the evaluation of HM versus JM, Document JCTVC-

D181, Daegu, Korea, Jan. 2011.

[24] V. Sze et al.: High Efficiency Video Coding (HEVC): Algorithms and Archi-

tectures, Switzerland: Springer International Publishing, 2014, pp. 220-221.

[25] HEVC test model 16.11, Accessed, 2016.

https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.11/

[26] F. Bossen,“Common Test Conditions and Software Reference Configurations,”

Document JCTVC-F900, Torino, IT, July 2011.

[27] G. Bjontegaard: Calculation of Average PSNR Differences between RD-Curves,

Document VCEG-M33, Austin, TX, USA, Apr. 2001.

[28] Ultra Video Group, 4K Test Sequences, Accessed, 2015.

http://ultravideo.cs.tut.fi/testsequences

[29] Tarr, G.: 6M 4K Ultra HDTVs Shipped In North America In 2015. [Online].

Available:

http://hdguru.com/ihs-6m-4k-ultra-hdtvs-shipped-in-north-america-in-2015/

[30] Bross, B., Ohm J., Sullivan G.J., Han W.J., Wiegand T.: High efficiency video

coding text specification draft 10. JCTVC-L1003 (2013)

[31] ITU-T Study Group 16: High efficiency video coding. Final draft approval

(2013)

[32] ISO/IEC 23008-2: High efficiency coding and media delivery in heterogeneous

environments Part 2: High efficiency video coding. Final standard approval

(2013)

[33] Ohm, J.R., Sullivan, G.J., Schwarz, H., Tan, T.K., Wiegand, T.: Compari-

son of the coding efficiency of video coding standards including high efficiency

video coding (HEVC). IEEE Transactions on Circuits and Systems for Video

Technology, 22(12), 1668-1683 (2012)

[34] 3GPP organizational partners: Evaluation of High Efficiency Video Coding

(HEVC) for 3GPP services. 3GPP TR 26.906 (2014)

[35] Sze, V., Budagavi, M., Sullivan, G.J.: High Efficiency Video Coding (HEVC):


Algorithms and Architectures. Springer International Publishing, Switzerland,

pp 220-221 (2014)

[36] Song, T., Song, H., Fujita, G., Onoye, T., Shirakawa, I.: A Codec of H.263

Advanced Intra Coding mode and it’s Architecture. IEICE technical report.

100(384), 45-50 (2000)

[37] Correa, G., Assuncao, P., Agostini, L., Cruz, L.A.S.: Performance and computational complexity assessment of high-efficiency video encoders. IEEE Transactions on Circuits and Systems for Video Technology, 22(12), 1899–1909 (2012)

[38] Lei, H., Yang, Z.: Fast intra prediction mode decision for high efficiency video

coding. In: 2nd International Symposium on Computer, Communication, Con-

trol and Automation (2013)

[39] Song, Y., Zeng, Y., Li, X., Cai, B., Yang, G.: Fast CU size decision and

mode decision algorithm for intra prediction in HEVC. Multimedia Tools and

Applications. 1-17 (2016)

[40] Ramezanpour, M., Zargari, F.: Fast HEVC I-frame coding based on strength

of dominant direction of CUs. Journal of Real-Time Image Processing, Special

issue paper, 1-10 (2016)

[41] Shen, L., Zhang, Z., An, P.: Fast CU size decision and mode decision algorithm

for HEVC intra coding. IEEE Transactions on Consumer Electronics, 59(1),

207-213 (2013)

[42] Zhao, L., Fan, X., Ma, S., Zhao, D.: Fast intra-encoding algorithm for High

Efficiency Video Coding. Signal Processing: Image Communication, 29(9), 935-

944 (2014)

[43] Huang, X., Jia, H., Wei, K., Liu, J., Zhu, C., Lv, Z., Xie, D.: Fast algorithm of coding unit depth decision for HEVC intra coding. In: IEEE Visual Communications and Image Processing Conference, 458–461 (2014)

[44] Shen, L, Zhang, Z., Liu, Z.: Effective CU size decision for HEVC intracoding.

IEEE Transactions on Image Processing, 23(10), 4232-4241 (2014)

[45] Shang, X., Wang, G., Fan, T., Li, Y., Zuo, Y.: Low-complexity intra-coding

scheme for HEVC, Circuits, Systems, and Signal Processing, 1-19 (2016)


[46] Zhu, J., Liu, Z., Wang, D., Han, Q.: HDTV1080p HEVC Intra encoder with source texture based CU/PU mode pre-decision. In: 19th Asia and South Pacific Design Automation Conference (ASP-DAC), 367-372 (2014)

[47] Huang, X., Jia, H., Cai, B., Zhu, C., Liu, J., Yang, M., Xie, D., Gao, W.:

Fast algorithms and VLSI architecture design for HEVC intra-mode decision.

Journal of Real-Time Image Processing, Special issue paper, 1-18 (2015)

[48] Pastuszak, G., Abramowski, A.: Algorithm and architecture design of the

H.265/HEVC intra encoder. IEEE Transactions on Circuits and Systems for

Video Technology, 26(1), 210–222 (2016)

[49] Ju, C.C., Liu, T.M., Lee, K.B., Chang, Y.C.: 18.6 A 0.5nJ/pixel 4K

H.265/HEVC codec LSI for multi-format smartphone applications. In: IEEE

International Solid-State Circuits Conference, 1-3 (2015)

[50] Ju, C.C., Liu, T.M., Lee, K.B., Chang, Y.C.: 18.6 A 0.5nJ/pixel 4K

H.265/HEVC codec LSI for multi-format smartphone applications. IEEE Jour-

nal of Solid-State Circuits, 51(1), 56-67 (2016)

[51] Ozcan, E., Kalali, E., Adibelli, Y., Hamzaoglu, I.: A computation and en-

ergy reduction technique for HEVC intra mode decision. IEEE Transactions on

Consumer Electronics, 60(4), 745-753 (2014)

[52] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assess-

ment: From error visibility to structural similarity. IEEE Transactions on Image

Processing, 13(4), 600-612 (2004)

[53] Zhao, L., Zhang, L., Ma, S., Zhao, D.: Fast mode decision algorithm for intra

prediction in HEVC. In: IEEE Visual Communications and Image Processing

Conference, 1-4 (2011)

[54] HEVC test model 12.1 [Online]. Available:

https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-12.1/

[55] Bjontegaard, G.: Improvements of the bd-psnr model. ITU-T Q.6/SG16 Doc-

ument, VCEG-AI11 (2008)

[56] Bossen, F.: Common test condition and software reference configurations.

JCTVC-L1100 (2013)


Appendix

A Publication Paper

1. Wen Shi, Xiantao Jiang, Tian Song and Takashi Shimamoto : Edge

Detector Based Fast Level Decision Algorithm for Intra Prediction of

HEVC, Journal of Signal Processing, Vol.19, No.2, pp.67–73, 2015.

2. Xiantao Jiang, Tian Song, Takashi Shimamoto, Wen Shi and Lisheng

Wang : Spatio-Temporal Prediction Based Algorithm for Parallel Im-

provement of HEVC, IEICE Transactions on Fundamentals of Electron-

ics, Communications and Computer Sciences, Vol.E98-A, No.11, pp.2229–

2237, 2015.


Wang : High Efficiency CU Depth Prediction Algorithm for High Res-

olution Applications of HEVC, IEICE Transactions on Fundamentals of

Electronics, Communications and Computer Sciences, Vol.E98-A, No.12,

pp.2528–2536, 2015.

4. Wen Shi, Tian Song, Takafumi Katayama, Xiantao Jiang and Takashi

Shimamoto : Hardware Implementation-Oriented Fast Intra-Coding Based

on Downsampling Information for HEVC, Journal of Real-Time Image

Processing, pp.1–15, 2017.

5. Xiantao Jiang, Xiaofeng Wang, Tian Song, Wen Shi and Takafumi

Katayama : An Efficient Complexity Reduction Algorithm for CU Size

Decision in HEVC, International Journal of Innovative Computing, In-

formation and Control , Vol.10, No.1, pp.1–10, 2017.


6. Takafumi Katayama, Tian Song, Wen Shi, Gen Fujita, Xiantao Jiang,

Takashi Shimamoto : Hardware Oriented Low-Complexity Intra Coding

Algorithm for SHVC, IEICE Transactions on Fundamentals of Electron-

ics, Communications and Computer Sciences, Vol.E100-A, No.12, pp.-,

2017.

7. Takafumi Katayama, Tian Song, Wen Shi, Xiantao Jiang, Takashi

Shimamoto : Fast Edge Detection and Early Depth Decision for Intra

Coding of 3D-HEVC, International Journal of Advances in Computer

and Electronics Engineering, Vol.2, No.7, pp.11–20, 2017.


B International Conference

1. Wen Shi, Xiantao Jiang, Tian Song and Takashi Shimamoto : Edge

Detector Based Fast Level Decision Algorithm for Intra Prediction of

HEVC, Proceedings of International Workshop on Nonlinear Circuits,

Communications and Signal Processing (NCSP’14), pp.129–132, Hon-

olulu, Mar. 2014.

2. Yutaro Tanida, Wen Shi, Tian Song and Takashi Shimamoto : Com-

plexity Reduction Algorithm for Intra Prediction of HEVC, The 29th

International Technical Conference on Circuits/Systems, Computers and

Communications (ITC-CSCC2014), pp.221–224, Phuket, THAILAND,

Jul. 2014.


Wang : Temporal Prediction Improvement for Parallel Processing of

HEVC, Proceedings of IEEE Asia Pacific Conference on Circuits Sys-

tems(APCCAS2014), pp.515–518, Okinawa, Nov. 2014.

4. Wen Shi, Xiantao Jiang, Tian Song and Takashi Shimamoto : Edge In-

formation Based Fast Selection Algorithm for Intra Prediction of HEVC,

Proceedings of IEEE Asia Pacific Conference on Circuits Systems (APC-

CAS2014), pp.17–20, Okinawa, Nov. 2014.

5. Wen Shi, Xiantao Jiang, Tian Song, Jenq-Shiou Leu and Takashi

Shimamoto : Efficient Intra Coding for HEVC Based on Spatial Lo-

cality, Proceedings of International Forum on Advanced Technologies

(IFAT2015), pp.168–170, Tokushima, Mar. 2015.

6. Xiantao Jiang, Tian Song, Wen Shi, Lisheng Wang and Takashi Shi-

mamoto : Merge Prediction Algorithm for Adaptive Parallel Improve-

ment of High Efficiency Video Coding, Proceedings of IEEE International

Conference on Consumer Electronics(ICCE-Taiwan 2015), pp.310–311,

Taipei, Jun. 2015.


7. Wen Shi, Xiantao Jiang, Tian Song and Takashi Shimamoto : Spatial

Locality Based Supplemental Modes for Intra Prediction of HEVC, Pro-

ceedings of IEEE International Conference on Consumer Electronics(ICCE-

Taiwan 2015), pp.298–299, Taipei, Jun. 2015.

8. Takuya Hamada, Yutaro Tanida, Wen Shi, Tian Song, Jenq-Shiou Leu

and Takashi Shimamoto : Original Pixel Based Parallel Algorithm for

Intra Prediction of HEVC, The 30th International Technical Conference

on Circuits/Systems, Computers and Communications (ITC-CSCC2015),

pp.400–401, Seoul, Jun. 2015.

9. Wen Shi, Xiantao Jiang, Tian Song, Jenq-Shiou Leu and Takashi Shi-

mamoto : Spatial Locality Based Parallel Scheme for Intra Coding of

HEVC, Tenth International Conference on Innovative Computing, Infor-

mation and Control (ICICIC2015), p.204, Dalian, Aug. 2015.

10. Wen Shi, Xiantao Jiang, Tian Song and Takashi Shimamoto : Seg-

mental Downsampling Intra Coding Based on Spatial Locality for HEVC,

Proceedings of IEEE International Conference on Consumer Electronics

Berlin(ICCE-Berlin 2015), pp.12–16, Berlin, Sep. 2015.

11. Wen Shi, Tian Song and Takashi Shimamoto : High Efficiency Intra

Coding Extension for HEVC/H.265, IEEE CASS Shikoku and Malaysia

Chapters Joint Seminar, Tokushima, Oct. 2015.

12. Xiantao Jiang, Tian Song, Wen Shi, Lisheng Wang and Takashi Shi-

mamoto : High efficiency CU depth decision algorithm for high resolution

application of HEVC, Proceedings of IEEE International Technical Con-

ference TENCON 2015, pp.1–4, Macau, China, Nov. 2015.


13. Takafumi Katayama, Wen Shi, Tian Song and Takashi Shimamoto

: Low-Complexity Intra Coding Algorithm in Enhancement Layer for

SHVC, Proceedings of 2016 IEEE International Conference on Consumer

Electronics (ICCE), pp.457–460, Las Vegas, Jan. 2016.

14. Wen Shi, Xiantao Jiang, Tian Song, Jenq-Shiou Leu and Takashi

Shimamoto : Downsampled Information Based Low Complexity Intra

Coding for HEVC, Proceedings of 2nd International Forum on Advanced

Technologies (IFAT2016), No.P1-17, pp.1–3, Tokushima, Mar. 2016.

15. Takafumi Katayama, Wen Shi, Tian Song, Jenq-Shiou Leu and Takashi

Shimamoto : Fast CU Size Decision for Intra Coding Algorithm in SHVC,

Proceedings of 2nd International Forum on Advanced Technologies (IFAT

2016), No.P1-18, pp.1–3, Tokushima, Mar. 2016.

16. Takafumi Katayama, Tian Song, Wen Shi, Takashi Shimamoto and

Jenq-Shiou Leu : Reference Frame Selection Algorithm of HEVC En-

coder for Low Power Video Device, Proceedings of 2nd International

Conference on Intelligent Green Building and Smart Grid (IGBSG 2016),

pp.34–39, Praha, Jun. 2016.

17. Yoshiki Ito, Wen Shi, Tian Song and Takashi Shimamoto : An Adap-

tive Search Range Selection Algorithm for HEVC, Proceedings of In-

ternational Technical Conference on Circuits/Systems, Computers and

Communications(CSCC2016), pp.211–214, Okinawa, Jul. 2016.


18. Ryo Kuroda, Wen Shi, Tian Song and Takashi Shimamoto : Hard-

ware Oriented Early CU Splitting Algorithm by Coding Unit Feature

Analysis for HEVC, Proceedings of International Technical Conference on

Circuits/Systems, Computers and Communications (CSCC2016), pp.217–

220, Okinawa, Jul. 2016.

19. Takafumi Katayama, Wen Shi, Tian Song and Takashi Shimamoto

: Early Depth Determination Algorithm for Enhancement Layer Intra

Coding of SHVC, Proceedings of IEEE International Technical Confer-

ence TENCON 2016, 3083-3086, Singapore, Nov. 2016.

20. Shota Yusa, Takafumi Katayama, Wen Shi, Tian Song and Takashi

Shimamoto : Fast CU Depth Decision Algorithm Using Depth-Map for

3D-HEVC, Proceedings of International Technical Conference on Cir-

cuits/Systems, Computers and Communications(ITC-CSCC2017), 473-

474, Busan, Jul. 2017.

21. Koki Tamura, Takafumi Katayama, Wen Shi, Tian Song and Takashi

Shimamoto : Coding Efficiency Improvement Algorithm for Inter-Layer

Reference Prediction in SHVC, Proceedings of International Technical

Conference on Circuits/Systems, Computers and Communications (ITC-

CSCC2017), 479-480, Busan, Jul. 2017.
