High performance intra algorithm and parallel
hardware architecture for the next generation
video coding
Doctor’s Course in Electrical and Electronic Engineering
Graduate School of Engineering, University of Tokushima
Wen Shi
Jan. 2018
Abstract
Demand from the worldwide consumer electronics market is driving both display
technology and video coding technology. High definition (HD) and ultra-high
definition (UHD) video content is growing and expanding. Meanwhile, multimedia
terminals are adapting to display, capture, and play back higher-resolution video
with better quality. However, higher-resolution and higher-quality video causes
network overload in content delivery because of the large amounts of video data.
Video coding technology has been researched and developed since 1989 to compress
this huge amount of video data into a much smaller size at faster speed, so that
it can be handled by low-cost applications for broadcasting, video conferencing,
online video games and surveillance systems. H.264/AVC, the preceding
international video coding standard, provides high coding efficiency at the cost
of a heavy computational load and high power consumption. As video content grows
in both resolution and quality, and high-quality, high-resolution (e.g. 2K, 4K)
content becomes the mainstream, a successor to H.264/AVC called High Efficiency
Video Coding (HEVC) was developed for next-generation video coding. In 2013, the
Joint Collaborative Team on Video Coding (JCT-VC) standardized HEVC to achieve
higher video coding efficiency. As the next-generation video coding standard,
HEVC supports higher-resolution video coding and achieves about 50% bit-rate
reduction at the same visual quality compared with Advanced Video Coding
(H.264/AVC).
In HEVC, rate-distortion optimization (RDO) based hybrid encoding is an effective
tool that reduces spatial and temporal redundancy by trying a wide range of unit
types and prediction modes. As a result, the RDO process is also computationally
intensive. Intra coding reduces the data redundancy between neighboring blocks,
which leads to high data dependency and high power consumption. The problem
becomes critical in HEVC because the quadtree-structured coding unit sizes now
range from 4 to 64. In recent years, real-time transmission and mobile devices
have become pressing requirements for video-related entertainment applications.
It is desirable to develop optimization techniques for video encoders
because of short battery life and limited hardware resources. The targets of this
research are to reduce computational complexity, increase coding performance and
realize hardware parallelization, so that video data can be compressed efficiently
at higher speed and with lower power consumption in a real-time hardware
implementation.
Three novel schemes are proposed to achieve these targets. Firstly, an edge
detector based fast level decision algorithm for intra prediction of HEVC is
presented to reduce redundant calculation and encoding time. In the intra
prediction of HEVC, prediction unit sizes from 4×4 to 64×64, defined as levels 1
to 5 respectively, are employed to achieve higher coding efficiency. However, this
comes at the expense of high computational complexity. The proposal efficiently
decides the level on the basis of the Roberts cross edge detector. The proposed
algorithm exploits the high correlation between regional texture and prediction
unit partitioning. It is mainly composed of a bottom-up level decision method and
an efficient decision flow based on an authentic image feature. Furthermore,
chrominance information is also used to decide the prediction unit partitioning.
Secondly, an adaptive downsampling signal based intra prediction for parallel
intra coding of HEVC is proposed to improve coding efficiency and reduce data
dependency. A downsampling signal, instead of neighboring pixels, is used to
generate prediction samples. This reduces spatial redundancy and removes the data
dependency in intra encoding for the coding tree unit (CTU) structure. In
addition, a fast training method is designed to derive the downsampling signal
adaptively. Thirdly, a hardware implementation oriented fast intra coding scheme
based on downsampling information for HEVC is presented to realize parallel
hardware implementation for real-time applications. The scheme consists of two
parts, a preprocessing stage and a fast intra coding stage. Three downsampling
information based fast decision algorithms are proposed for the fast intra coding
stage. Moreover, a parallelized architecture of the fast intra coding scheme is
presented. The downsampling preprocessing stage can be executed in parallel with
the intra coding stage.
Experimental results for the first proposal show that it greatly reduces
computational complexity compared with the original HEVC.
The average time saving is approximately 37%, while the increase in bit rate and
decrease in PSNR are negligible. For the second proposal, experimental results
show that the proposed fast parallelized scheme achieves a 4.17% bit saving on
average while reducing computational complexity by 27.26%. For the third
proposal, the proposed architecture makes full use of the parallelism between the
preprocessing stage and the intra coding stage to improve throughput and break up
data dependency. Experimental results demonstrate that the proposed algorithms
achieve on average a 60.4% reduction in encoding time with negligible coding
efficiency loss, compared with the original HEVC.
Contents
Chapter 1 Introduction
  1.1 Introduction to HEVC
  1.2 Requirements for implementing HEVC
    1.2.1 Computational complexity
    1.2.2 Data dependency
    1.2.3 Compression efficiency
  1.3 Main Contributions
    1.3.1 Edge detector based fast level decision algorithm for intra prediction of HEVC
    1.3.2 Adaptive downsampling signal based intra prediction for parallel intra coding of high efficiency video coding
    1.3.3 Hardware implementation oriented fast intra coding based on downsampling information for HEVC
  1.4 Thesis Outline
Chapter 2 Overview of HEVC
    2.0.1 Coding Design And Feature Highlights
      2.0.1.1 Encoder structure of HEVC
      2.0.1.2 Coding tree units and coding tree block structure
      2.0.1.3 Coding tree unit
      2.0.1.4 Coding unit
      2.0.1.5 Prediction unit
      2.0.1.6 Transform unit
    2.0.2 Intra prediction
    2.0.3 Inter prediction
    2.0.4 Transform and quantization
    2.0.5 In-loop filtering
    2.0.6 Sample adaptive offset filter
    2.0.7 Entropy coding
    2.0.8 Summary
Chapter 3 Edge detector based fast level decision algorithm for intra prediction of HEVC
  3.1 Background and previous works
  3.2 Analysis of Intra Coding in HEVC
  3.3 Proposed algorithm
    3.3.1 Bottom-up level-decision method
    3.3.2 Authentic-feature-based level decision method
    3.3.3 Integrated fast level decision algorithm
  3.4 Experimental results
  3.5 Summary
Chapter 4 Adaptive downsampling signal based intra prediction for parallel intra coding of HEVC
  4.1 Background and previous works
  4.2 Analysis of intra coding in HEVC
  4.3 Proposed parallel intra coding scheme
    4.3.1 Downsampling approach for reconstructing samples
    4.3.2 Downsampling Prediction Modes
    4.3.3 Training method for obtaining downsampling QP
    4.3.4 A parallel HEVC intra encoding architecture
  4.4 Experimental results
  4.5 Summary
Chapter 5 Hardware implementation oriented fast intra coding based on downsampling information for HEVC
  5.1 Background and previous works
  5.2 Analysis of intra coding in HEVC
  5.3 Proposed downsampling information based fast intra coding algorithms
    5.3.1 Downsampling information based preprocessing stage
    5.3.2 Fast PU depth decision algorithm
    5.3.3 Fast TU depth decision algorithm
    5.3.4 Fast prediction mode decision algorithm
  5.4 Top-level design for VLSI architecture
  5.5 Experimental results
  5.6 Summary
Chapter 6 Conclusions
Acknowledgement
Bibliography
Appendix
  A Publication Paper
  B International Conference
List of Figures
2.1 Slices and slice segments
2.2 CU, PU and TU size
2.3 The intra prediction modes
2.4 Motion estimation
3.1 Recursively splitting of CUs in HEVC
3.2 Intra prediction modes in HEVC
3.3 Proportions of depths for different class sequences
3.4 PU partitioning of BQMall sequence
3.5 Roberts-cross masks for gradient calculation
3.6 Chroma-assisting decision
3.7 Authentic-image-feature-based level decision procedure
3.8 Flowchart of the proposed algorithm
3.9 RD curves of ParkScene sequence
3.10 RD curves of BQMall sequence
3.11 RD curves of RaceHorses sequence
4.1 Intra coding flow for HEVC
4.2 High encoding latency caused by CTU-level data dependency: (a) CTU encoding order and (b) CTU pipeline scheduling
4.3 High encoding latency caused by PU-level data dependency: (a) PU encoding order and (b) PU pipeline scheduling
4.4 Spatially referencing samples of HEVC
4.5 Overview of the proposed intra coding scheme
4.6 Generate prediction samples using downsampling signal
4.7 Flowchart of training TDQP
4.8 Top level hardware architecture for HEVC
4.9 RD curves of Beauty 2160p in M2
4.10 RD curves of HoneyBee 2160p in M2
4.11 RD curves of Beauty 1080p in M2
5.1 Hierarchically quadtree structure in HEVC
5.2 Optimal mode and partitioning decision flow
5.3 The proposed fast intra coding scheme
5.4 Concordance rates of downsampling and original encoded information
5.5 Overview of downsampling method
5.6 Top level architecture of proposed fast intra coding
5.7 Timing diagram with PS and FICS
List of Tables
3.1 Number of intra modes for each PU size
3.2 Experimental results for threshold value
3.3 Performance of the proposed algorithm and da Silva’s algorithm
4.1 Result of TDQP and FDQP
4.2 Experiential conditions
4.3 Compared Experimental Results for Proposed Scheme and HEVC
5.1 Number of mode candidate list for HEVC and proposal in worst case
5.2 Performance comparison between FICS and FICS without FTDD
5.3 Performance comparison between FICS and FICS without FPMD
5.4 Performance comparison of compression and encoding time for PS and FICS
5.5 Comparison of coding architecture and performance
Chapter 1 Introduction
1.1 Introduction to HEVC
As the video content market has grown in both resolution and quality in recent
years, HDTV and 4K content has become the mainstream for more and more
applications. It is worth noting that higher-resolution and higher-quality video
overloads the network in content delivery because of the large amounts of video
data. To solve this problem and achieve higher video coding efficiency, the Joint
Collaborative Team on Video Coding (JCT-VC), which was organized by the ISO/IEC
Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group
(VCEG), standardized HEVC on January 25, 2013 [1]. HEVC aims to encode video with
twice the compression efficiency of H.264/AVC under the same encoding conditions.
The new standard is expected to be used as the next-generation video standard for
video delivery over limited bandwidth and for super-high-vision broadcasting.
The first edition of the HEVC standard was finalized in January 2013, resulting
in an aligned text that is published by both ITU-T and ISO/IEC. Further work has
since been added to the standard to support several additional applications,
including scalable video coding, multiview video coding, 3D video coding, and
extended-range uses with enhanced precision and color format support. In ITU-T,
HEVC is Recommendation H.265, and in ISO/IEC it is MPEG-H Part 2 (ISO/IEC
23008-2). The ITU-T produced H.261 and H.263, ISO/IEC produced MPEG-1 and MPEG-4
Visual, and the two organizations jointly produced the H.262/MPEG-2 Video and
H.264/MPEG-4 Advanced Video Coding (AVC) standards. ITU-T and ISO/IEC are
therefore well known for their video coding standards. The two jointly produced
standards have played an important role and have found their way into a wide
variety of products in our daily lives. HEVC/H.265 [2] has highly efficient
compression methods, which allow it to compress video much more efficiently than
older standards and provide more flexibility for applications in complex network
environments. Unlike previous standards,
HEVC uses frame partitioning and hybrid prediction as well as residual encoding
with a quadtree-based structure. In frame partitioning, the source frame is
divided into a sequence of largest coding units (LCUs) of 64×64 samples. The
coding unit (CU) is the basic unit of region partitioning used for intra and
inter prediction. It is always square, and its size ranges from 8×8 luma samples
up to 64×64. A CU can be split recursively into four equally sized units. The
prediction unit (PU) is the basic unit used in the prediction processes. To allow
partitioning that matches the boundaries of real objects in the picture, the PU
is not restricted to being square in inter prediction, but in intra prediction it
must be square. Each CU may contain one or more PUs. The quadtree structure is
also adopted for the residual: the transform units (TUs) used for the DCT form a
quadtree of three levels, named the residual quadtree transform (RQT). This helps
to achieve the optimal tradeoff between energy consumption and adaptability. In
addition to the quadtree-based prediction and residual encoding, other advanced
coding tools such as directional intra prediction with 35 modes, the residual
quadtree transform, sample adaptive offset (SAO) and in-loop filtering are also
employed in HEVC. Owing to these tools, HEVC had achieved about 40% bit-rate
saving compared with the H.264 High Profile as of February 2012.
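As a rough illustration of the recursive CU splitting described above, the following Python sketch enumerates the leaf CUs of one 64×64 LCU. The function names and the split rule `split_top_left` are hypothetical; a real encoder would decide each split from rate-distortion costs, which are not modeled here.

```python
def split_cu(x, y, size, should_split, min_size=8):
    """Return the leaf CUs (x, y, size) of the quadtree rooted at (x, y)."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves = []
        # Visit the four quadrants: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(split_cu(x + dx, y + dy, half, should_split, min_size))
        return leaves
    return [(x, y, size)]


def split_top_left(x, y, size):
    """Hypothetical split rule (illustration only): always split the top-left CU."""
    return x == 0 and y == 0


leaves = split_cu(0, 0, 64, split_top_left)
for leaf in leaves:
    print(leaf)
```

Whatever split rule is used, the leaf CUs always tile the LCU exactly, which is what makes the quadtree a convenient description of the partitioning.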
HEVC supports two coding scenarios that reflect the computational complexity
constraints of different applications. The Low Complexity (LC) configuration is
intended for low-delay applications. The High Efficiency (HE) configuration is
intended to achieve a high compression ratio for applications with ample hardware
resources. The HEVC encoder also provides three kinds of configuration for
experiments and different applications. In the intra-only configuration, each
frame is encoded as an IDR picture; no temporal reference pictures are used, and
the QP may not be changed within a picture. In the low-delay configuration, only
the first frame is encoded as an IDR picture and the successive frames are
encoded as generalized P and B pictures (GPB); there is no picture reordering
between decoder processing and output, and the bit rate may fluctuate. In the
random-access configuration, the structural delay of processing units is no
larger than an 8-picture group of pictures (GOP), hierarchical B pictures are
used dynamically with four levels, and an intra picture is inserted cyclically per
second. In total, HEVC supports six coding configurations: Low Complexity
Intra Only, High Efficiency Intra Only, Low Complexity Low Delay, High Efficiency
Low Delay, Low Complexity Random Access and High Efficiency Random Access.
Computational complexity and coding efficiency increase gradually across the six
configurations in this order.
1.2 Requirements for implementing HEVC
The complete intra coding process is divided into two main steps, rough mode
decision (RMD) and rate-distortion optimization (RDO). In the RMD process,
reconstructed pixels of neighboring PUs are used to build prediction pixels with
the 35 intra prediction modes. The Hadamard transformed absolute difference (HAD)
costs of all the supported prediction modes are calculated, and the modes with
the smallest HAD costs form a candidate list. In addition, the most probable
modes (MPMs) are derived from neighboring PUs and added to the mode candidates.
In the RDO process, the optimal PU partitioning and prediction mode are decided
by the RD cost. The RD cost is derived from the residual between the
reconstructed samples and the original samples. Moreover, the intra coding
process is performed recursively to generate the optimal encoding result. The
recursive intra coding of HEVC improves coding efficiency; however, three issues
remain.
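The RMD step described above can be sketched as follows. This is a minimal illustration for a 4×4 block, assuming that the prediction blocks for each mode are already available; the mode labels, the toy prediction blocks and the function names are hypothetical, and any cost normalization or mode-bit term used by a real encoder is left out.

```python
# 4x4 Hadamard matrix (symmetric, so H equals its transpose).
H4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def had_cost(original, prediction):
    """Sum of absolute Hadamard-transformed differences of a 4x4 block."""
    diff = [[o - p for o, p in zip(orow, prow)]
            for orow, prow in zip(original, prediction)]
    transformed = matmul(matmul(H4, diff), H4)
    return sum(abs(v) for row in transformed for v in row)


def rough_mode_decision(original, predictions, num_candidates, mpms=()):
    """Keep the modes with the lowest HAD cost, then append any missing MPMs."""
    ranked = sorted(predictions, key=lambda mode: had_cost(original, predictions[mode]))
    candidates = ranked[:num_candidates]
    for mode in mpms:
        if mode not in candidates:
            candidates.append(mode)
    return candidates


# Toy data: a block with vertical structure and three made-up predictions.
original = [[10, 20, 30, 40] for _ in range(4)]
predictions = {
    "vertical": [[10, 20, 30, 40] for _ in range(4)],
    "horizontal": [[10, 10, 10, 10] for _ in range(4)],
    "dc": [[25, 25, 25, 25] for _ in range(4)],
}
print(rough_mode_decision(original, predictions, num_candidates=1, mpms=("dc",)))
```

Only the modes in the returned candidate list would then go through the much more expensive RDO step.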
1.2.1 Computational complexity
H.264/AVC [3] defines intra block sizes from 4 × 4 to 16 × 16 and a total of 9
optional prediction modes. In contrast, HEVC employs much larger block sizes,
from 4 × 4 up to 64 × 64, and many more intra prediction modes. These two
features yield high intra coding efficiency, but they also introduce
significantly higher complexity for intra prediction. To find the optimal CU
partitioning and prediction mode, HEVC traverses all the combinations of CU, PU
and TU, performing a series of computations to carry out the RDO process. For
the intra prediction process, a 64 × 64 CTU performs approximately
2.65 × more predictions than in H.264.
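To give a feel for the size of the search space, the following sketch counts the PUs in one 64 × 64 CTU when every PU size from 64 × 64 down to 4 × 4 is tried. It assumes, purely for illustration, an exhaustive search in which all 35 modes are evaluated at every PU size; it is not intended to reproduce the 2.65 × figure above, whose exact derivation is not given in this section.

```python
MODES_PER_PU = 35               # number of intra prediction modes stated above
CTU_SIZE = 64
PU_SIZES = (64, 32, 16, 8, 4)   # levels 5 down to 1


def pus_per_ctu(pu_size, ctu_size=CTU_SIZE):
    """Number of PUs of a given size needed to cover one CTU."""
    return (ctu_size // pu_size) ** 2


total_pus = sum(pus_per_ctu(size) for size in PU_SIZES)
total_mode_tests = total_pus * MODES_PER_PU

for size in PU_SIZES:
    print(f"{size:>2}x{size:<2}: {pus_per_ctu(size)} PUs")
print("PUs per CTU over all levels:", total_pus)
print("mode evaluations per CTU (assumed exhaustive):", total_mode_tests)
```

Under these assumptions a single CTU involves 341 PUs and 11,935 mode evaluations, which is why skipping whole levels early is attractive.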
1.2.2 Data dependency
The encoding order for CTUs and PUs in HEVC is as follows. HEVC encodes CTUs in
raster scan order, and all PUs in a CTU are encoded in Z-scan order, which
provides more available neighboring reference samples in most cases. Since
reconstructed pixels in neighboring CTUs and PUs are used in HEVC, the intra
encoding process is restricted by data dependency and suffers from extremely high
latency. Take the CTU as an example of the hardware implementation problems that
data dependency causes for an intra encoder. The order of intra encoding is
restricted by CTU-level data dependency: each CTU must wait until encoding of
both its left and above-right neighboring CTUs has finished. This constraint
introduces latency between the current CTU and its adjacent CTUs. Furthermore,
the maximum parallelism is substantially limited, and encoding cannot start from
an arbitrary position in the frame. Only the encoding direction from top left to
bottom right is permitted in HEVC. Therefore, pipelined hardware implementations
and parallel architectures are difficult to design. Both CTU-level and PU-level
data dependency lead to low throughput and high hardware overhead.
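The CTU-level constraint above can be made concrete with a small scheduling sketch. It assumes, for illustration only, that every CTU takes one equal time step and that an unlimited number of processing units is available; in the last column, where no above-right CTU exists, the above CTU is used as the dependency instead. The function name and the 4 × 3 grid are hypothetical.

```python
from collections import Counter


def earliest_start(width, height):
    """Earliest time step at which each CTU of a width x height grid can start."""
    start = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            deps = []
            if x > 0:
                deps.append(start[y][x - 1])                      # left CTU
            if y > 0:
                # above-right CTU, or the above CTU in the last column
                deps.append(start[y - 1][min(x + 1, width - 1)])
            start[y][x] = max(deps) + 1 if deps else 0
    return start


schedule = earliest_start(4, 3)
total_steps = max(max(row) for row in schedule) + 1
peak_parallelism = max(Counter(t for row in schedule for t in row).values())

for row in schedule:
    print(row)
print("time steps:", total_steps, "peak parallel CTUs:", peak_parallelism)
```

Even with unlimited hardware, the 12 CTUs of this small grid need 8 time steps and at most 2 CTUs are ever processed at once, which illustrates how the dependency caps parallelism.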
1.2.3 Compression efficiency
In the spatial domain, pixels to be predicted tend, in most instances, to have a
high correlation with neighboring pixels. As mentioned above, HEVC intra coding
reduces the redundancy between pixels that are close to each other in the same
frame. 35 intra prediction modes are provided, which use the spatially
neighboring samples to predict the samples of the current PU. However, the
correlation between spatially local pixels decays as the PU size increases. In
particular, for large PU sizes (64 × 64 or 32 × 32), the available reference
samples are too far away from some pixels of the current PU, which degrades
prediction performance.
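The growth of this distance with PU size can be illustrated geometrically. The sketch below assumes that the reference samples lie in the row directly above and the column directly to the left of the PU, and measures how far each pixel is from the nearest of them; it says nothing about the actual correlation, which depends on the content. The function names are hypothetical.

```python
def reference_distance(x, y):
    """Distance in samples from pixel (x, y) to the nearest reference sample.

    (0, 0) is the top-left pixel of the PU; references are assumed to be the
    row just above and the column just left of the PU.
    """
    return min(x, y) + 1


def distance_stats(n):
    """Maximum and mean reference distance over an n x n PU."""
    dists = [reference_distance(x, y) for y in range(n) for x in range(n)]
    return max(dists), sum(dists) / len(dists)


for n in (4, 8, 16, 32, 64):
    worst, mean = distance_stats(n)
    print(f"{n:>2}x{n:<2} PU: worst distance {worst}, mean distance {mean:.2f}")
```

The worst-case distance equals the PU width, so the bottom-right pixel of a 64 × 64 PU is 64 samples away from any reference sample.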
1.3 Main Contributions
In this thesis, algorithms and an optimized hardware architecture are proposed to
reduce the heavy computational burden of RDO-based intra prediction and to
realize a real-time intra encoder. The main contributions are as follows.
1.3.1 Edge detector based fast level decision algorithm for intra prediction of HEVC
High efficiency video coding (HEVC) achieves significantly higher coding
efficiency than previous video coding standards. In the intra prediction of HEVC,
prediction unit sizes from 4×4 to 64×64, defined as levels 1 to 5 respectively,
are employed to achieve higher coding efficiency. However, this comes at the
expense of high computational complexity. This work proposes a fast algorithm for
the intra prediction of HEVC that efficiently decides the level on the basis of
the Roberts cross edge detector. The proposed algorithm exploits the high
correlation between regional texture and prediction unit partitioning. It is
mainly composed of a bottom-up level decision method and an efficient decision
flow based on an authentic image feature. Furthermore, chrominance information is
also used to decide the prediction unit partitioning. Experimental results show
that the proposed algorithm greatly reduces computational complexity compared
with the original HEVC. The average time saving is approximately 37%, while the
increase in bit rate and decrease in PSNR are negligible.
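The Roberts cross operator itself is simple to state. The sketch below computes the gradient energy of a block with the two 2×2 Roberts cross masks and applies a threshold to it. The decision rule `looks_smooth` and the threshold value are hypothetical placeholders for illustration; they are not the level decision method of Chapter 3.

```python
def roberts_cross_energy(block):
    """Sum of |Gx| + |Gy| over a 2-D block using the Roberts cross masks.

    Gx uses the mask [[1, 0], [0, -1]] and Gy uses [[0, 1], [-1, 0]].
    """
    total = 0
    for y in range(len(block) - 1):
        for x in range(len(block[0]) - 1):
            gx = block[y][x] - block[y + 1][x + 1]
            gy = block[y][x + 1] - block[y + 1][x]
            total += abs(gx) + abs(gy)
    return total


def looks_smooth(block, threshold):
    """Hypothetical rule: a block is smooth if its mean gradient is below threshold."""
    positions = (len(block) - 1) * (len(block[0]) - 1)
    return roberts_cross_energy(block) / positions < threshold


flat = [[128] * 4 for _ in range(4)]
edge = [[0, 0, 100, 100] for _ in range(4)]

print(roberts_cross_energy(flat), looks_smooth(flat, threshold=8))
print(roberts_cross_energy(edge), looks_smooth(edge, threshold=8))
```

A smooth region would suggest that a large PU is sufficient, whereas a region with strong edges would suggest further splitting; how the thesis combines this with the bottom-up flow and the chrominance information is described in Chapter 3.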
1.3.2 Adaptive downsampling signal based intra prediction for parallel
intra coding of high efficiency video coding
Intra coding uses neighboring reference pixels to construct the prediction
samples and reduce spatial redundancy. In high efficiency video coding (HEVC),
coding efficiency is improved by enhancing traditional intra prediction. However,
the enhanced intra coding method is a real hindrance to parallel hardware
implementation. In this work, an efficient parallel scheme is proposed for intra
coding of HEVC. A downsampling signal, instead of neighboring pixels, is used to
generate prediction samples. This reduces spatial redundancy and removes the data
dependency in intra encoding for the coding tree unit (CTU) structure. In
addition, a fast training method is designed to derive the downsampling signal
adaptively. Experimental results show that the proposed fast parallelized scheme
achieves a 4.17% bit saving on average while reducing computational complexity by
27.26%.
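The general idea of predicting a block from its own downsampled signal, rather than from neighboring reconstructed pixels, can be sketched as follows. The 2×2 averaging filter and the nearest-neighbor upsampling used here are assumptions chosen for simplicity; the actual downsampling approach, prediction modes and downsampling QP training of the proposal are described in Chapter 4 and are not reproduced here.

```python
def downsample_2x(block):
    """Average each 2x2 neighbourhood with integer rounding (assumed filter)."""
    n = len(block)
    return [[(block[y][x] + block[y][x + 1]
              + block[y + 1][x] + block[y + 1][x + 1] + 2) // 4
             for x in range(0, n, 2)]
            for y in range(0, n, 2)]


def upsample_2x(small):
    """Nearest-neighbour upsampling back to full size (assumed interpolation)."""
    out = []
    for row in small:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def residual(block, prediction):
    """Sample-wise difference between the block and its prediction."""
    return [[b - p for b, p in zip(brow, prow)]
            for brow, prow in zip(block, prediction)]


block = [
    [10, 12, 50, 52],
    [14, 16, 54, 56],
    [90, 92, 30, 32],
    [94, 96, 34, 36],
]
prediction = upsample_2x(downsample_2x(block))
for row in residual(block, prediction):
    print(row)
```

Because the prediction depends only on the block's own downsampled signal, no block has to wait for its neighbors to be reconstructed, which is the property that enables parallel intra coding.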
1.3.3 Hardware implementation oriented fast intra coding based on downsampling information for HEVC
High efficiency video coding (HEVC) aims at achieving higher coding performance,
especially for ultra-high definition applications. Many novel features are
applied to the intra coding of HEVC. However, these features require massive
computation, and hardware implementation is difficult for a real-time intra
encoder that supports high-resolution applications. This work proposes a
downsampling information based intra coding scheme which consists of two parts, a
preprocessing stage and a fast intra coding stage. Three downsampling information
based fast decision algorithms are proposed for the fast intra coding stage.
Moreover, a parallelized architecture of the fast intra coding scheme is
proposed. The downsampling preprocessing stage can be executed in parallel with
the intra coding stage. The proposed architecture makes full use of this feature
to improve throughput and break up data dependency. Experimental results
demonstrate that the proposed algorithms achieve on average a 60.4% reduction in
encoding time with negligible coding efficiency loss, compared with the original
HEVC.
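The benefit of running the two stages in parallel can be illustrated with a generic two-stage pipeline model. The sketch assumes that both stages work on the same unit of data (here called a CTU) and that each stage has a fixed time per unit; the granularity, the stage times and the function names are hypothetical and are not measurements from the proposed architecture.

```python
def sequential_time(num_ctus, ps_time, fics_time):
    """Total time when each CTU finishes both stages before the next one starts."""
    return num_ctus * (ps_time + fics_time)


def pipelined_time(num_ctus, ps_time, fics_time):
    """Total time when preprocessing of CTU i+1 overlaps intra coding of CTU i."""
    if num_ctus == 0:
        return 0
    # The first CTU passes through both stages; afterwards the slower stage
    # determines how often a CTU leaves the pipeline.
    return ps_time + fics_time + (num_ctus - 1) * max(ps_time, fics_time)


n, ps, fics = 100, 3, 5   # hypothetical CTU count and per-CTU stage times
print("sequential:", sequential_time(n, ps, fics))
print("pipelined: ", pipelined_time(n, ps, fics))
```

In this model the preprocessing cost is almost entirely hidden behind the intra coding stage as long as preprocessing is the faster of the two.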
1.4 Thesis Outline
Chapter 2 provides an overview of the HEVC encoder along with a brief explanation
of its coding tools and encoding process. Chapter 3 discusses the edge detector
based fast level decision algorithm for intra prediction of HEVC. Chapter 4
explains in detail the adaptive downsampling signal based intra prediction for
parallel intra coding of HEVC. Chapter 5 explains in detail the hardware
implementation oriented fast intra coding based on downsampling information for
HEVC. Chapter 6 presents the conclusions and discusses possibilities for future
research.
Chapter 2 Overview of HEVC
2.0.1 Coding Design And Feature Highlights
2.0.1.1 Encoder structure of HEVC
A slice is a data structure that can be decoded independently from other slices of the
same picture, as shown in Figure 2.1, in terms of entropy coding, signal prediction
and residual signal reconstruction. A slice can either be the entire picture or a
region of a picture, which is not necessarily rectangular. A slice segment consists of
a sequence of coding tree units (CTUs). An independent slice segment is a slice
segment for which the values of the syntax elements of the slice segment header are
not inferred from the values for a preceding slice segment. A dependent slice segment
is a slice segment for which the values of some syntax elements of the slice segment
header are inferred from the values for the preceding independent slice segment in
decoding order. A slice consists of a sequence of one or more slice segments, starting
with an independent slice segment and containing all subsequent dependent slice
segments that precede the next independent slice segment within the same access
unit. In Figure 2.1, the picture is divided into two slices.
Figure 2.1 Slices and slice segments
2.0.1.2 Coding tree units and coding tree block structure
The HEVC standard has adopted a highly flexible and efficient block partitioning
structure by introducing four different block concepts: CTU, CU, PU, and TU, as
shown in Figure 2.2, which are defined to have clearly separated roles. The terms
coding tree block (CTB), coding block (CB), prediction block (PB), and transform block (TB) are
also defined to specify the 2-D sample array of one color component associated with
the CTU, CU, PU, and TU, respectively. Thus, a CTU consists of one luma CTB,
two chroma CTBs, and associated syntax elements. A similar relationship is valid
for CU, PU, and TU.
Figure 2.2 CU, PU and TU size
The coding tree approach in HEVC can bring additional coding efficiency benefits
by incorporating PU and TU quad tree concepts for video compression. Leaf nodes
of a tree can be merged or combined in a general quad tree structured video coding
scheme. After the final quad tree is formed, motion information is transmitted at the
leaf nodes of the tree. L-shaped or rectangular-shaped motion partition is possible
through merging and combination of nodes. However, in order to make such shapes,
the merge process should be followed using smaller blocks after further splitting
has occurred. In the HEVC block partitioning structure, such cases are taken care
of by the PU. Instead of splitting one depth more for merging and combination,
predefined partition modes such as PART_2N×2N, PART_2N×N, and PART_N×2N
are tested and the optimal partition mode is selected at the leaf nodes of
the tree. It is worth mentioning that PUs can still share motion information
through the merge mode in HEVC. A general quad tree structure without
the PU concept has also been investigated by removing the symmetric rectangular
partition modes (PART_2N×N and PART_N×2N).
Another aspect is the full utilization of depth information for entropy coding.
Entropy coding in HEVC relies heavily on the depth information
of the quad tree. For syntax elements such as inter_pred_idc, split_transform_flag,
cbf_luma, cbf_cb and cbf_cr, depth dependent context derivation is heavily
used for coding efficiency. It has been demonstrated that this can break the dependency
on neighboring blocks and reduce line buffer requirements in hardware
implementations, because information of the above CTU does not need to be stored.
2.0.1.3 Coding tree unit
A slice contains an integer number of CTUs; the CTU is analogous to
the macroblock in H.264/AVC. Inside a slice, a raster scan method is used for
processing the CTU. In the main profile, the minimum and the maximum sizes of
the CTU are specified by the syntax elements in the sequence parameter set (SPS)
among the sizes of 8×8, 16×16, 32×32, and 64×64. Due to this flexibility of the
CTU, HEVC provides a way to adapt according to various application needs such
as encoder/decoder pipeline delay constraints or on-chip memory requirements in
a hardware design. In addition, the support of large sizes up to 64×64 allows the
coding structure to match the characteristics of the high definition video content
better than previous standards; this was one of the main sources of the coding
efficiency improvements seen with HEVC.
2.0.1.4 Coding unit
The CTU is further partitioned into multiple CUs to adapt to various local
characteristics. A quad tree denoted as the coding tree is used to partition the CTU
into multiple CUs. 1) Recursive Partitioning from CTU: Let CTU size be 2N×2N
where N is one of the values of 32, 16, or 8. The CTU can be a single CU or can be
split into four smaller units of equal size N×N, which are nodes of a coding tree.
If the units are leaf nodes of the coding tree, the units become CUs. Otherwise,
each unit can be split again into four smaller units as long as the split size is equal to
or larger than the minimum CU size specified in the sequence parameter set (SPS). This
representation results in a recursive structure specified by a coding tree. Numbers
on the tree represent whether the CU is further split. This flexible and recursive
representation of a picture in CTUs and CUs provides several major benefits.
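The recursive partitioning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `partition_ctu`, the `(x, y, size)` tuple layout and the `should_split` callback are hypothetical, and the split rule used in the example is arbitrary. A real encoder decides each split by rate-distortion cost.

```python
from typing import Callable, List, Tuple

CU = Tuple[int, int, int]  # (x, y, size) of a leaf CU, in luma samples


def partition_ctu(x: int, y: int, size: int, min_cu_size: int,
                  should_split: Callable[[int, int, int], bool]) -> List[CU]:
    """Recursively split a square unit into leaf CUs along a quad tree."""
    half = size // 2
    # A unit may only be split if the four sub-units are not smaller
    # than the minimum CU size signalled in the SPS.
    if half >= min_cu_size and should_split(x, y, size):
        leaves: List[CU] = []
        for dy in (0, half):
            for dx in (0, half):
                leaves += partition_ctu(x + dx, y + dy, half,
                                        min_cu_size, should_split)
        return leaves
    return [(x, y, size)]  # leaf node of the coding tree: this unit is a CU


# Hypothetical split rule for illustration: always refine the top-left corner.
def split_top_left(x: int, y: int, size: int) -> bool:
    return x == 0 and y == 0


cus = partition_ctu(0, 0, 64, 8, split_top_left)
for cu in cus:
    print(cu)
```

With a 64×64 CTU and a minimum CU size of 8×8, this rule yields four 8×8 CUs, three 16×16 CUs and three 32×32 CUs, which together cover the CTU exactly once.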
The first benefit comes from the support of CU sizes greater than the conventional
16×16 size. When a region is homogeneous, a large CU can represent the region
with fewer symbols than several small blocks would need.
Supporting arbitrary sizes of CTU enables the codec to be readily optimized for
various contents, applications, and devices. Compared with the use of fixed-size
macroblocks, support of various CTU sizes is one of the strong points of HEVC in terms
of coding efficiency and adaptability to contents and applications. This ability is
especially useful for low-resolution video services.
By choosing an appropriate size of CTU and maximum hierarchical depth, the
hierarchical block partitioning structure can be optimized to the target application.
Figure 4.4 shows examples of various CTU sizes and CU sizes suitable for different
resolutions and types of content. For example, for an application using 1080p content
that is known to include only simple global motion activities, a CTU size of 64 and
depth of 2 may be an appropriate choice. For more general 1080p content, which
may also include complex motion activities of small regions, a CTU size of 64 and
maximum depth of 4 would be preferable.
2.0.1.5 Prediction unit
One or more PUs are specified for each CU. Inside one PU, the same prediction
process is applied and the relevant information is transmitted to the decoder on
a PU basis. A CU can be split into one, two or four PUs according to the PU
splitting type. HEVC defines two splitting shapes for an intra coded CU and eight
splitting shapes for an inter coded CU. Unlike the CU, the PU may only be split once.
1) PU Splitting Type: Similar to prior standards, each CU in the HEVC can be
classified into three categories: skipped CU, inter coded CU, and intra coded CU.
An inter coded CU uses a motion compensation scheme for the prediction of the
current block, while an intra coded CU uses neighboring reconstructed samples for
the prediction. A skipped CU is a special form of inter coded CU where both
the motion vector difference and the residual energy are equal to zero. Figure 4.5
describes the splitting types of the PU in the HEVC standard.
2.0.1.6 Transform unit
Similar to the PU, one or more TUs are specified for the CU. HEVC allows a
residual block to be split into multiple units recursively to form another quad tree
which is analogous to the coding tree for the CU. The TU is a basic representative
block having residual or transform coefficients for applying the integer transform and
quantization. For each TU, one integer transform having the same size as the TU
is applied to obtain residual coefficients. These coefficients are transmitted to the
decoder after quantization on a TU basis. 1) Residual Quad tree: After obtaining
the residual block by prediction process based on PU splitting type, it is split into
multiple TUs according to a quad tree structure. For each TU, an integer transform
is applied. The tree is called transform tree or residual quad tree (RQT) since the
residual block is partitioned by a quad tree structure and a transform is applied for
each leaf node of the quad tree. Transform tree partitioning is shown in figure 4.6.
2.0.2 Intra prediction
In H.265/HEVC, intra prediction of the luma component supports five PU sizes: 4×4,
8×8, 16×16, 32×32 and 64×64, and each PU has 35 prediction modes, which comprise
planar mode, DC mode and 33 angular modes. Note that the bottom-left pixels
are also used as reference pixels, which in some cases can improve the coding efficiency
significantly.
All HEVC intra prediction modes are identified by a prediction mode number and are
of three kinds: planar, DC and angular. The prediction directions of the 33 angular
modes are shown in Figure 2.3. Mode numbers 2-17 are horizontal
modes, and mode numbers 18-34 are vertical modes. Planar
mode corresponds to plane mode in H.264/AVC and is suited to smooth
areas. The prediction pixel value Px,y is the average of a horizontal
and a vertical prediction value. This makes the predicted pixel values
change smoothly and improves the subjective video quality. DC mode is suitable for
large flat areas. Its prediction value is the average of the
left and above reference pixels. There are eight different prediction directions in
H.264/AVC. However, in order to adapt to the different textures of video content,
H.265/HEVC specifies 33 angular prediction modes.
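As a minimal sketch of the two non-angular modes, the Python functions below compute DC and planar predictions for an N×N block, with N a power of two. The function and parameter names are illustrative assumptions. The planar expression follows the sample interpolation commonly given for HEVC, and the boundary smoothing that the standard applies to some DC blocks, as well as reference sample filtering, are deliberately left out.

```python
from typing import List


def predict_dc(above: List[int], left: List[int]) -> List[List[int]]:
    """DC mode: every sample is the rounded mean of the above and left references."""
    n = len(above)
    dc = (sum(above) + sum(left) + n) // (2 * n)
    return [[dc] * n for _ in range(n)]


def predict_planar(above: List[int], left: List[int],
                   top_right: int, bottom_left: int) -> List[List[int]]:
    """Planar mode: average of a horizontal and a vertical linear interpolation."""
    n = len(above)
    shift = n.bit_length()  # equals log2(n) + 1 when n is a power of two
    pred = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            horizontal = (n - 1 - x) * left[y] + (x + 1) * top_right
            vertical = (n - 1 - y) * above[x] + (y + 1) * bottom_left
            pred[y][x] = (horizontal + vertical + n) >> shift
    return pred


# A block whose references grow towards the bottom-left reference sample.
block = predict_planar([0, 0, 0, 0], [0, 0, 0, 0], top_right=0, bottom_left=80)
```

The `bottom_left` argument is the bottom-left reference pixel mentioned above: it anchors the vertical interpolation, so the predicted values rise smoothly from the top row to the bottom row.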
Figure 2.3 The intra prediction modes
2.0.3 Inter prediction
In H.265/HEVC, the prediction block (PB) is the basic processing unit in inter
prediction, and the prediction unit contains the prediction information. As shown in
Figure 2.4, the principle of motion compensation is that reference blocks are used to
predict the current block. The displacement between the reference block
and the current block is called the motion vector (MV), and the difference between them
is named motion distortion. The MV and motion distortion are used to determine
the best prediction mode based on the rate-distortion (R-D) model. Similar to H.264/AVC,
B-frame or P-frame prediction is used for motion compensation, and in the final
standard, bi-prediction is used to achieve a trade-off between encoding efficiency
and encoding complexity. Furthermore, bi-prediction needs to access memory constantly,
which is considered to be one of the main factors of computational complexity,
especially for hardware design.
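To make the notions of motion vector and distortion concrete, the following sketch performs an integer-pel full search with the sum of absolute differences (SAD) as the distortion measure. Both the full-search strategy and the choice of SAD are assumptions made for illustration, and all names are hypothetical. Practical HEVC encoders use faster search patterns, fractional-sample interpolation and an R-D cost that also includes the MV rate.

```python
from typing import List, Tuple

Frame = List[List[int]]


def sad(cur: Frame, ref: Frame, cx: int, cy: int, rx: int, ry: int, n: int) -> int:
    """Sum of absolute differences between an n x n current and reference block."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def full_search(cur: Frame, ref: Frame, cx: int, cy: int, n: int,
                search_range: int) -> Tuple[Tuple[int, int], int]:
    """Return the motion vector (dx, dy) with the smallest SAD, and that SAD."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), sad(cur, ref, cx, cy, cx, cy, n)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if 0 <= rx <= width - n and 0 <= ry <= height - n:
                cost = sad(cur, ref, cx, cy, rx, ry, n)
                if cost < best_cost:
                    best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for j in range(2):
    for i in range(2):
        ref[2 + j][3 + i] = 100   # bright patch in the reference frame
        cur[4 + j][4 + i] = 100   # the same patch, displaced, in the current frame
mv, cost = full_search(cur, ref, cx=4, cy=4, n=2, search_range=2)
```

Here the MV is measured from the current block to its best match in the reference frame, so the patch that moved one sample right and two samples down is found at `(-1, -2)` with zero distortion.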
Figure 2.4 Motion estimation
2.0.4 Transform and quantization
HEVC specifies two-dimensional transforms of various sizes (4×4, 8×8, 16×16, and
32×32) that are finite precision approximations to the discrete cosine transform (DCT).
Multiple transform sizes improve compression performance, but also increase the
implementation complexity. The N transform coefficients vi of an N-point 1D-DCT
applied to the input samples ui can be expressed as
$$v_i = \sum_{j=0}^{N-1} u_j c_{ij} \tag{2.1}$$
where i=0,...N-1. Elements cij of the DCT transform matrix C are defined as
$$c_{ij} = \frac{P}{\sqrt{N}} \cos\left[\frac{\pi}{N}\left(j + \frac{1}{2}\right) i\right] \tag{2.2}$$
where i, j = 0, ..., N-1 and where P is equal to 1 and √2 for i = 0 and i > 0,
respectively. The basis vectors ci of the DCT are defined as ci = [ci0, ..., ci(N-1)]^T where
i = 0, ..., N-1. The DCT is desirable for compression efficiency because it yields
transform coefficients that are uncorrelated and provides good energy compaction.
Furthermore, the DCT is desirable for simplifying
the quantization and de-quantization process and is useful for reducing implementation
costs, as the same multipliers can be reused for various transform sizes. For a pixel
block with slowly changing gray values, most of the energy after the DCT is concentrated in
the low frequency coefficients in the upper left corner. On the contrary, if the pixel
block contains more detailed texture, more energy is distributed in the high
frequency area. In fact, most images contain more low frequency components. Exploiting
the fact that the human eye is relatively insensitive to high frequency image detail,
the high-energy low frequency coefficients can be quantized finely,
and the low-energy high frequency coefficients can be quantized coarsely.
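Equations (2.1) and (2.2) can be checked numerically with a short floating-point sketch. The function names are illustrative, and the code implements the real-valued DCT of the equations rather than the finite precision integer approximation that HEVC actually specifies.

```python
import math
from typing import List


def dct_matrix(n: int) -> List[List[float]]:
    """Matrix C whose elements c_ij follow equation (2.2)."""
    c = []
    for i in range(n):
        p = 1.0 if i == 0 else math.sqrt(2.0)
        c.append([p / math.sqrt(n) * math.cos(math.pi / n * (j + 0.5) * i)
                  for j in range(n)])
    return c


def dct_1d(u: List[float]) -> List[float]:
    """Equation (2.1): v_i = sum over j of u_j * c_ij."""
    c = dct_matrix(len(u))
    return [sum(u_j * c_ij for u_j, c_ij in zip(u, row)) for row in c]


flat = dct_1d([10.0, 10.0, 10.0, 10.0])   # constant block
ramp = dct_1d([10.0, 11.0, 12.0, 13.0])   # slowly varying block

print([round(v, 3) for v in flat])
print([round(v, 3) for v in ramp])
```

For the constant input only the first (DC) coefficient is non-zero, and for the slowly varying ramp nearly all of the energy remains in that first coefficient, which is the energy compaction property described above.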
Quantization consists of division by a quantization step size (Qstep) and subse-
quent rounding while inverse quantization consists of multiplication by the quanti-
zation step size. In HEVC, quantization parameter (QP) is used to get Qstep, and
QP can take 52 values from 0 to 51. The relationship between QP and Qstep is
defined as follows:
$$Q_{step}(QP) = \left(2^{1/6}\right)^{QP-4} \tag{2.3}$$
In H.265/HEVC, the scaling operation of the integer DCT is carried out together with
the quantization process. In order to avoid floating point arithmetic, both the numerator
and the denominator of the quantizer formula (2.3) are scaled up by a certain factor and
then rounded to integers to retain the accuracy of the operation. In HEVC, the encoder can signal whether
or not to use quantization matrices enabling frequency dependent scaling. Human
visual system based quantization can achieve better quality than frequency independent
quantization. In HEVC, three options can be configured for the operation
of the quantizer: flat quantization, default weighting matrix and custom weighting
matrix. The quantization step may need to be changed within a picture for rate
control and perceptual quantization purposes. The QP is updated by a QP delta in the
slice segment header. The applicable QP for a CU is derived from the QP applied
in the previous CU in decoding order, and the delta QP is transmitted in coding
units with non-zero transform coefficients.
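The relationship in equation (2.3) and the divide-and-round description of quantization can be illustrated with a floating-point sketch. The function names are hypothetical, and the code ignores the integer scaling and quantization matrices discussed above, so it shows the principle rather than the arithmetic a conforming codec performs.

```python
def qstep(qp: int) -> float:
    """Quantization step size from equation (2.3): Qstep = (2^(1/6))^(QP - 4)."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coeff: float, qp: int) -> int:
    """Division by Qstep followed by rounding half away from zero."""
    sign = -1 if coeff < 0 else 1
    return sign * int(abs(coeff) / qstep(qp) + 0.5)


def dequantize(level: int, qp: int) -> float:
    """Inverse quantization: multiplication by Qstep."""
    return level * qstep(qp)


for qp in (4, 10, 16, 22, 28):
    print(qp, qstep(qp))
```

The step size equals 1 at QP = 4 and doubles for every increase of 6 in QP, so a coefficient of 100 quantized at QP = 28 (step size 16) becomes level 6 and is reconstructed as 96.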
2.0.5 In-loop filtering
A deblocking filter and a sample adaptive offset (SAO) filter are included in HEVC.
The deblocking filter aims to reduce the visibility of blocking artifacts and is applied
to samples located at block boundaries. The SAO filter aims to improve the accuracy
of the reconstruction of the original signal amplitudes and is applied adaptively to
all samples, by conditionally adding an offset value to each sample based on values
in look-up tables defined by the encoder.
A deblocking filter process is performed for each CU in the same order as the
decoding process. Vertical edges are filtered first, then horizontal edges.
Filtering is applied to the 8×8 block boundaries that are determined to be filtered,
for both luma and chroma components. The deblocking filter process has three stages:
boundary decision, filter on/off decision and strong/weak filter decision. TU boundaries
and PU boundaries are involved in the deblocking filter. In the boundary decision
stage, the boundary strength (Bs) is calculated to reflect how strong a filtering process
may be needed for the boundary. A value of 2 for Bs indicates strong filtering, 1
means weak filtering and 0 means no deblocking filtering. The filter on/off decision
is made using 4 lines grouped as a unit, to reduce computational complexity. If
filtering is turned on, a decision is made between strong and weak filtering. The
strong deblocking filter is applied to smooth flat areas.
2.0.6 Sample adaptive offset filter
SAO is applied to the reconstructed signal after the deblocking filter by using
offsets specified for each CTB by the encoder. SAO reduces sample distortion
by first classifying the samples in a region into multiple categories with a selected
classifier and then adding a specific offset to each sample depending on its category. The
classifier index and the offsets for each region are signaled in the bitstream. The SAO
operation includes edge offset (EO), which uses edge properties for pixel classification
in SAO types 1 to 4, and band offset (BO), which uses pixel intensity for pixel
classification in SAO type 5.
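The two classifiers can be sketched as follows. The category rules are not spelled out in the text above: the edge categories (local minimum, two edge shapes, local maximum) and the division of the intensity range into 32 equal bands are taken from the usual description of SAO in HEVC, and the function names are illustrative assumptions.

```python
def edge_category(a: int, c: int, b: int) -> int:
    """Edge-offset category of sample c, given its two neighbours a and b
    along the selected direction. Category 0 means no offset is applied."""
    if c < a and c < b:
        return 1                      # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2                      # edge with c on the lower side
    if (c > a and c == b) or (c == a and c > b):
        return 3                      # edge with c on the upper side
    if c > a and c > b:
        return 4                      # local maximum
    return 0


def band_index(sample: int, bit_depth: int = 8) -> int:
    """Band-offset class: the intensity range is divided into 32 equal bands."""
    return sample >> (bit_depth - 5)


print(edge_category(10, 5, 10), band_index(64))
```

The four EO types in the text select the direction (horizontal, vertical or one of the two diagonals) along which the neighbours `a` and `b` are taken. The category computed here then selects which of the signaled offsets is added to the sample.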
2.0.7 Entropy coding
A single entropy coding scheme is used in all configurations of HEVC: context
adaptive binary arithmetic coding (CABAC). Entropy coding is a lossless compression
scheme that uses statistical properties to compress data, and it is performed
at the last stage of video encoding, after the video signal has been reduced to a series
of syntax elements. CABAC adopts efficient arithmetic coding technology, takes
the statistical properties of the video stream into account, and improves the coding
efficiency significantly. Entropy coding has three stages: binarization, context
modeling and binary arithmetic coding. In general, a binarization scheme defines
a unique mapping of syntax element values to sequences of binary symbols (bins), which
can be interpreted in terms of a binary code tree. After each non-binary
syntax element value has been decomposed into a sequence of bins, further processing
of each bin value in CABAC depends on the associated coding mode decision. The
probability models in CABAC are adaptive: for events that have a high probability
of affecting coding performance, a refined context model is set up; conversely, for
events that have a low probability of affecting coding performance, a simple context
model is set up. Every bin of a binarized syntax element is processed by arithmetic
coding according to the parameters of its probability model, which yields the final
video stream. Binary arithmetic coding has two modes: regular coding mode and bypass
coding mode. The regular mode uses an adaptive probability model, and
the bypass mode codes bins with equal probability.
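As an illustration of the binarization stage only, the sketch below implements three common binarization schemes of the kind used by CABAC: unary, truncated unary and k-th order Exp-Golomb. Which scheme is applied to which syntax element is defined by the standard and is not reproduced here, and the function names are illustrative assumptions.

```python
from typing import List


def unary(value: int) -> List[int]:
    """Unary code: `value` ones followed by a terminating zero."""
    return [1] * value + [0]


def truncated_unary(value: int, c_max: int) -> List[int]:
    """Like unary, but the terminating zero is dropped when value == c_max."""
    return [1] * value if value == c_max else unary(value)


def exp_golomb(value: int, k: int) -> List[int]:
    """k-th order Exp-Golomb code: a unary prefix followed by a binary suffix."""
    bins: List[int] = []
    while value >= (1 << k):
        bins.append(1)          # prefix bin: move on to the next, wider range
        value -= 1 << k
        k += 1
    bins.append(0)              # end of prefix
    while k > 0:                # suffix: remaining value in k bits, MSB first
        k -= 1
        bins.append((value >> k) & 1)
    return bins


for v in range(4):
    print(v, unary(v), truncated_unary(v, 3), exp_golomb(v, 0))
```

Each bin string is a path through a binary code tree, so no code word is a prefix of another and the decoder can recover the syntax element value unambiguously.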
2.0.8 Summary
The compression efficiency of H.265/HEVC is higher than that of
H.264/AVC in both objective and subjective tests. Moreover, the bit rate reduction,
based on objective evaluation of CTC test sequences, indicates an overall performance
improvement of about 50% over H.264/AVC. HEVC yields a substantial improvement
in compression capability beyond that of H.264/AVC for video streaming applications,
and the coding performance gains of HEVC over H.264/AVC generally
increase with increasing video resolution up to at least 4K resolutions. For the next
generation of video coding, the features of parallel processing, high compression
capability, and low computational complexity are very important.
Chapter 3 Edge detector based fast level
decision algorithm for intra prediction
of HEVC
3.1 Background and previous works
HEVC is a new international standard developed by the Joint Collaborative
Team on Video Coding (JCT-VC) that aims to achieve 50% bit rate reduction
relative to H.264. In particular, it can support 4K and 8K ultra high-definition
(UHD) video, with a resolution of 8,192×4,320. Intra prediction, which is of great
significance for video compression in H.264, continues to play an important role in
HEVC. Many new features have been proposed to improve the intra coding efficiency
of HEVC. The two main features are the size of the prediction unit, which can be
defined from 4×4 up to 64×64, compared with from 4×4 to 16×16 for H.264, and
the 35 intra prediction modes. This results in significantly higher complexity for
intra prediction.
There have already been some related works aiming to reduce the complexity
of HEVC. A fast intra prediction algorithm based on pixel gradient statistics was
proposed by Chen et al. [4]. In their paper, pixel gradient statistics are extracted
using the Sobel operator to exclude unreasonable modes and unit sizes. The method
reduces the encoding time by about 28%. Meanwhile, da Silva et al. developed a
gradient based fast intra prediction algorithm using five edge strength filters. In
their paper, the edge strength results decided the corresponding intra prediction
mode set. The number of available intra prediction modes was reduced to 9, com-
pared with 35 in HEVC, and a reduction in the processing time of almost 32% was
achieved. However, da Silva et al. only utilized edge direction information to reduce
the number of modes considered unreasonable [5], and in some cases the list of re-
maining modes did not contain the optimal mode. In this paper, edge information
is considered as a highly sensitive parameter of region complexity conditions and is
used for reducing unreasonable unit sizes. This enables our proposed method to save
much more time with almost the same bit rate and PSNR as conventional HEVC.
3.2 Analysis of Intra Coding in HEVC
In contrast to the previous video coding standards, HEVC employs a flexible
quadtree coding block-partitioning structure, which consists of three kinds of basic
units: coding units (CUs), prediction units (PUs) and transform units (TUs). The
CU is the fundamental unit of partitioning, whose size can range from 8×8 to that of
the largest coding unit (LCU). A picture is divided into slices and each slice is composed of a
sequence of LCUs whose maximum size can be 64×64. Each LCU is recursively split
into CUs that constitute the quadtree structure. Figure 3.1 shows the recursive
splitting of CUs in HEVC. The PU is the basic unit for intra and inter prediction and
only PU sizes of 2N×2N and N×N are supported in the intra prediction of HEVC.
TUs are used for transform and quantization and the TU size can be different from
the PU size. The maximum PU size is 64×64 and the maximum TU size is 32×32,
the largest transform size specified by HEVC, while the minimum size of both is 4×4.
Figure 3.1 Recursive splitting of CUs in HEVC
Figure 3.2 Intra prediction modes in HEVC
In HEVC, 35 intra prediction modes are employed to improve the coding effi-
ciency [6]. As illustrated in Figure 3.2, the 35 intra prediction modes consist of
33 angular modes, DC mode and planar mode. The specific number of prediction
modes varies according to the PU size, as shown in Table 3.1. To find the optimal
CU partitioning and prediction mode, HEVC must examine all the combinations
of CU, PU and TU by performing a series of computations to carry out the rate-
distortion optimization (RDO) process. In HEVC there are two major steps in
achieving the above targets for intra prediction. The first step is to calculate the
Hadamard transform absolute difference (HAD) costs of all the supported prediction
modes to create a list of candidates with the minimum values of HAD. The number
of candidates for different PU sizes is shown in Table 3.1. In the second step, the
optimum PU size is derived by computing the rate-distortion (RD) costs, and the
final prediction mode and CU partitioning are determined. The above process has
a tremendous computational workload and is time-consuming. In fact, for an LCU
with the size of 64×64, 7,552 HAD costs and between 2,623 and 4,923 RD costs
have to be calculated. Thus, if the CU partition can be determined in advance, a
great deal of computation time will be saved.
Figure 3.3 Proportions of depths for different class sequences
Table 3.1 Number of intra modes for each PU size
Size of PU    Number of intra prediction modes    Number of mode candidates
4×4           18                                  8
8×8           35                                  8
16×16         35                                  3
32×32         35                                  3
64×64         4                                   3
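The HAD count and the lower bound of the RD count quoted for a 64×64 LCU are consistent with simply multiplying the entries of Table 3.1 by the number of PUs of each size that fit into the LCU. The short script below reproduces that arithmetic. The variable names are illustrative, and the upper bound of 4,923 RD costs depends on additional candidates that Table 3.1 does not list, so it is not derived here.

```python
# (PU size, number of intra prediction modes, number of mode candidates),
# copied from Table 3.1.
TABLE_3_1 = [(4, 18, 8), (8, 35, 8), (16, 35, 3), (32, 35, 3), (64, 4, 3)]
LCU_SIZE = 64

had_costs = 0
rd_costs_min = 0
for pu_size, num_modes, num_candidates in TABLE_3_1:
    pus_per_lcu = (LCU_SIZE // pu_size) ** 2      # e.g. 256 PUs of size 4x4
    had_costs += pus_per_lcu * num_modes          # one HAD cost per mode per PU
    rd_costs_min += pus_per_lcu * num_candidates  # one RD cost per candidate per PU

print(had_costs)     # 7552
print(rd_costs_min)  # 2623
```

Most of the workload comes from the smallest PUs: the 256 PUs of size 4×4 alone account for 4,608 of the 7,552 HAD costs, which is why deciding the partition before the full search saves so much time.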
The computation process has a lot of redundancy which can be reduced. As
the depths of CUs in HEVC are 0, 1, 2 and 3, the encoder recursively considers all
possible combinations at each depth in order from depth 0 to depth 3. Finally it retains
only one optimal combination. This means that there is a lot of redundancy in
the process. From [7], the proportions of depths for different class sequences are
shown in Figure 3.3. The average percentage of depth 0 is approximately 60%.
Therefore, finding an effective way to reduce the number of CU sizes evaluated and
to terminate the RDO process early is very meaningful for saving time.
3.3 Proposed algorithm
The CU partitioning of a frame tends to have a high correlation with the regional
texture according to experimental results. A homogeneous region is likely to utilize
a larger CU size or low-level CUs, and a complex region will be split into smaller
CUs with a high probability [8],[9]. An example of this is illustrated in Figure 3.4,
in which the CUs are partitioned after the full RDO process. HEVC utilizes low-level CUs in the
top left corner and high-level CUs on the man’s face. Therefore, there is a close
relationship between texture complexity and CU partitioning.
Figure 3.4 PU partitioning of BQMall sequence
Edge detection can obtain gradient information, which mainly contains two com-
ponents, the texture complexity and the spatial direction [10]. In this paper, the
texture complexity is utilized to determine the PU size for intra prediction. First,
the concept of the Roberts-cross edge detector is introduced to obtain the magni-
tude and orientation which is used for the analysis of the texture. The calculation
utilizes the two convolution masks of 2×2 units shown in Figure 3.5. The gradient
vector of the unit Gx,y is obtained as follows:
$$Gx_{i,j} = p_{i,j} - p_{i+1,j+1} \tag{3.1}$$
$$Gy_{i,j} = p_{i+1,j} - p_{i,j+1} \tag{3.2}$$
The magnitude of the gradient can be defined as follows:
$$\nabla I_{x,y} = G_{x,y} = \sqrt{Gx^2 + Gy^2} \tag{3.3}$$
The angle of orientation can also be defined as follows:
$$\Theta_{x,y} = \frac{Gx^2}{Gy^2} \tag{3.4}$$
Figure 3.5 Roberts-cross masks for gradient calculation
The texture complexity of the original LCU can be explored using the above
Roberts-cross edge detector.
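A minimal sketch of equations (3.1) to (3.3) is given below. The function `roberts_magnitude` follows the equations directly. Averaging the magnitudes over a block in `mean_magnitude` is an assumption made here to obtain a single complexity figure per unit and is not necessarily how the thesis aggregates them.

```python
import math
from typing import List


def roberts_magnitude(p: List[List[int]], i: int, j: int) -> float:
    """Gradient magnitude at position (i, j) from equations (3.1) to (3.3)."""
    gx = p[i][j] - p[i + 1][j + 1]        # equation (3.1)
    gy = p[i + 1][j] - p[i][j + 1]        # equation (3.2)
    return math.sqrt(gx * gx + gy * gy)   # equation (3.3)


def mean_magnitude(p: List[List[int]]) -> float:
    """Average gradient magnitude over all 2x2 windows of a block (assumed measure)."""
    rows, cols = len(p) - 1, len(p[0]) - 1
    total = sum(roberts_magnitude(p, i, j)
                for i in range(rows) for j in range(cols))
    return total / (rows * cols)


flat = [[50] * 4 for _ in range(4)]               # homogeneous block
edge = [[0, 0, 100, 100] for _ in range(4)]       # block with a vertical edge

print(mean_magnitude(flat), mean_magnitude(edge))
```

The homogeneous block gives a magnitude of zero, while the block containing an edge gives a clearly larger value, which is the property the level decision relies on.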
3.3.1 Bottom-up level-decision method
It is well known that humans are less sensitive to changes in hue (chrominance)
than to changes in brightness (luminance), and coding technology has made use
of this feature. In this paper, pixel luminance samples of the original source are
utilized to calculate the gradient magnitude. On the basis of bottom-up theory, the
calculation starts from the smallest unit size (4×4). If the gradient magnitude
is tiny, the region is homogeneous and the four units can be merged into a larger unit (8×8),
which is then decided as the next level. On the other hand, if the derived
gradient magnitude is not negligible, which means that the region is complex, the four
units are not merged and remain at the higher level.
In the meantime, to assist in the merging decision, chrominance sample infor-
mation is introduced. By way of illustration, we will explain the chroma-assisting
decision. As shown in Figure 3.6, a search for the maximum and minimum chromi-
nance of the samples from all 64 pixels in PU1, PU2, PU3 and PU4 (size 4×4, level
4) is carried out to decide whether or not the image chrominance changes sharply.
If any of the four differences between the maximum and minimum is large, the
four PUs are assigned size 4×4; otherwise, the four PUs may be
assigned to a larger PU (8×8), depending on the gradient magnitude described above.
Figure 3.6 Chroma-assisting decision
As described in Algorithm 1, the calculation starts from a unit size of 2×2. Then,
the chrominance of a level 4 sample is calculated and the sum of sharply varying
chroma units (SCU) is evaluated. If the value of SCU is not 0, the decision process
will terminate and the level will be set as 4. Otherwise, the luminance decision
process will start and the sum of homogeneous units (SHU) will be evaluated. If
SHU is 4, the four PUs will be merged into a larger PU of size 8×8 and the level
is set as 3. It is noteworthy that the four PUs involved in the calculation are those
continuously merged from the smallest unit size (4×4); not all the PUs have a size
of 8×8. Therefore, redundant calculations are avoided and the encoding time is
reduced. The same process is performed at unit sizes of 16×16 and 32×32. In this
way, a fast bottom-up level decision is achieved.
Algorithm 1 Bottom-up Level Decision
function LevelDecision
    for every 4 continuous 2×2 units in LCU do
        edge detection calculation
    end for
    for every 4 continuous 4×4 units in LCU do
        chrominance decision
        if SCU != 0 then
            Level ← 4
        else
            luminance decision
            if SHU != 4 then
                Level ← 4
            else
                Level ← 3
            end if
        end if
    end for
end function
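The decision between level 4 and level 3 described in the text can be sketched as follows. This is only an illustration under stated assumptions: the thresholds `chroma_th` and `grad_th`, and the rule that a PU counts as homogeneous when its largest gradient magnitude is below `grad_th`, are hypothetical and are not taken from the thesis.

```python
def decide_level(chroma_pus, gradient_pus, chroma_th=16, grad_th=8.0):
    """Choose level 4 (keep four 4x4 PUs) or level 3 (merge into one 8x8 PU).

    chroma_pus   : four flat lists of chroma samples, one list per 4x4 PU
    gradient_pus : four flat lists of luma gradient magnitudes, one per PU
    chroma_th, grad_th : hypothetical thresholds (not from the thesis)
    """
    # SCU: number of PUs whose chroma range (max - min) is large.
    scu = sum(1 for pu in chroma_pus if max(pu) - min(pu) > chroma_th)
    if scu != 0:
        return 4  # chroma changes sharply, stop early at level 4

    # SHU: number of PUs whose gradient magnitudes are all small.
    shu = sum(1 for pu in gradient_pus if max(pu) <= grad_th)
    return 3 if shu == 4 else 4

smooth_chroma = [[128] * 16 for _ in range(4)]
smooth_grad = [[1.0] * 16 for _ in range(4)]
sharp_chroma = smooth_chroma[:3] + [[0] * 8 + [200] * 8]
busy_grad = smooth_grad[:3] + [[50.0] * 16]
```

With these inputs, `decide_level(smooth_chroma, smooth_grad)` merges to level 3, while a single sharply varying chroma PU or a single complex luma PU keeps the block at level 4. The same test would be repeated for 8×8, 16×16 and 32×32 units to continue the bottom-up merge.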
3.3.2 Authentic-feature-based level decision method
The second part of the proposed algorithm is based on an authentic image fea-
ture, as we observed from statistical results that the PU size is mainly 16×16 and
8×8. Therefore, the same calculation as above is started from level 2, which has the
higher probability with a PU size of 16×16. As shown in Figure 3.7, the program is
executed in two directions: toward level 0 with a PU size of 64×64, and toward level 4 with
a PU size of 4×4. Here, we first perform the calculation in the direction
of level 0, and then complete the level decision with a top-down design starting from
level 2.
Figure 3.7 Authentic-image-feature-based level decision procedure
3.3.3 Integrated fast level decision algorithm
In general there are five levels from 4×4 to 64×64 in HEVC. The proposed
algorithm can rapidly decide the level and terminate the CU compression process
early. Firstly, a rough gradient calculation for a frame with unit size 64×64 is
carried out to obtain the complexity of the frame. A threshold (TH) is used to
choose the bottom-up level decision method or authentic-image-feature-based level
decision method for the frame. In this study, a series of experiments are conducted
with the sequence RaceHorses (416×240) and the results are shown in Table 3.2.
The quantization parameter is set to 32. The increase in bit rate (∆BR), the decrease
in PSNR (∆PSNR) and the time saving in encoding (TS) are considered in selecting
the threshold value. Finally, the threshold value is set to 128 in this paper, which gives the largest time saving in Table 3.2. In
the rough gradient calculation, if the value for the merged 4×4 unit is above the
threshold value, the authentic-image-feature-based level decision method will be
utilized. Otherwise, the bottom-up level decision method will be utilized. Then the
bottom-up level decision or authentic-image-feature-based level decision is carried
out to decide the level and terminate the CU compression. A flowchart of the
proposed algorithm is shown in Figure 3.8.
Figure 3.8 Flowchart of the proposed algorithm
Table 3.2 Experimental results for threshold value
Threshold value ∆BR(%) ∆PSNR(dB) TS(%)
122 1.72 -0.0503 22.40
124 1.69 -0.0523 25.50
126 1.47 -0.0575 25.61
128 1.31 -0.0508 27.04
130 1.12 -0.0479 25.35
132 1.12 -0.0569 25.24
134 1.04 -0.0491 23.69
3.4 Experimental results
The proposed edge-detector-based fast level-decision algorithm is integrated with
the reference software of the HEVC test model (HM) 12.1 [11]. The simulation
platform is Intel(R) Core(TM) 2 Quad CPU Q8400 @ 2.66GHz with 4 cores and
2.00 GB RAM. Class A, B, C, D and E sequences are employed for performance
comparison. As the algorithm is mainly applied to intra prediction, we set the period
of I frames to 1 to ensure that all the frames are intra encoded. The simulation
conditions are defined in [12] and the quantization parameters are set to 22, 27,
32 and 37. The performance of da Silva’s algorithm and that of our
proposed algorithm, each compared with HM, are shown in Table 3.3. In addition, the
reduction in complexity achieved by the proposed algorithm is measured by the time saving
in encoding, which is defined as follows:
$$TS = \frac{T_{HM} - T_{PA}}{T_{HM}} \times 100\% \tag{3.5}$$
where $T_{HM}$ denotes the coding time of HM, and $T_{PA}$ denotes the coding time of
the proposed algorithm. ∆PSNR is the difference in PSNR between the proposed
algorithm and the original HM. ∆BR denotes the percentage increase in the bit
rate of the proposed algorithm compared with the original HM.
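The three metrics can be written as small helper functions. Eq. (3.5) is taken directly from the text; the formula used for ∆BR is an assumed form of "percentage increase relative to HM", since the thesis gives it only in words. All function names and the sample inputs are hypothetical.

```python
def time_saving(t_hm, t_pa):
    """Eq. (3.5): percentage of the HM encoding time that is saved."""
    return (t_hm - t_pa) / t_hm * 100.0

def delta_bitrate(br_hm, br_pa):
    """Assumed form of dBR: percentage bit-rate increase relative to HM."""
    return (br_pa - br_hm) / br_hm * 100.0

def delta_psnr(psnr_hm, psnr_pa):
    """PSNR difference in dB; a negative value means a quality loss."""
    return psnr_pa - psnr_hm

# Made-up inputs, only to show the sign conventions.
ts = time_saving(200.0, 150.0)        # encoder needs 150 s instead of 200 s
dbr = delta_bitrate(1000.0, 1020.0)   # bit rate grows from 1000 to 1020 kbps
dpsnr = delta_psnr(38.0, 37.5)        # PSNR drops by half a dB
```

A positive TS and a small positive ∆BR together with a small negative ∆PSNR therefore correspond to the behaviour reported for the proposed algorithm.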
As shown in Table 3.3, the proposed algorithm and da Silva’s algorithm are
compared in terms of the bit-rate increase (∆BR), ∆PSNR and time
saving (TS). On average, the proposed algorithm achieves a time saving of 37.16% for
intra encoding, while the average increase in the bit rate is 1.46% and the decrease
in PSNR is only 0.0635 dB, which is negligible. The RD performance is shown in
Figures 3.9-3.11. From the RD curves for the proposed algorithm and HM, it is
clear that our proposed algorithm achieves almost the same PSNR as HM
at each bit rate.
According to Table 3.3, for high-resolution video sequences, da Silva’s algorithm
saves less time than the proposed algorithm, while its advantages in terms of
PSNR loss and bit-rate increase are less evident. For
the sequences PeopleOnStreet and BQTerrace, both da Silva’s algorithm
and our algorithm perform worse than for the other sequences. Both
sequences contain a lot of shade, which reduces the
accuracy of edge detection. As a result, the PU partition tends to settle at a low
level and terminate early, so that both algorithms miss the optimal partition.
Table 3.3 Performance of the proposed algorithm and da Silva’s algorithm
                                          da Silva’s algorithm          Proposed algorithm
Class  Resolution  Sequence               ∆BR(%)  ∆PSNR(dB)  TS(%)      ∆BR(%)  ∆PSNR(dB)  TS(%)
A      2560×1600   SteamLocomotiveTrain   0.49    -0.0060    12.44      0.27    -0.0186    27.72
A      2560×1600   PeopleOnStreet         1.98    -0.0490    20.91      1.58    -0.0581    35.99
B      1920×1080   Kimono                 0.81    -0.0140    31.73      0.18    -0.0180    40.48
B      1920×1080   BQTerrace              2.26    -0.0280    17.45      1.89    -0.0540    37.40
B      1920×1080   ParkScene              0.03    -0.0300    8.74       0.39    -0.0730    41.73
C      832×480     BQMall                 NA      NA         NA         1.11    -0.0718    40.53
C      832×480     BasketballDrill        NA      NA         NA         1.84    -0.0615    37.49
D      416×240     RaceHorses             NA      NA         NA         1.31    -0.0508    27.04
D      416×240     BlowingBubbles         NA      NA         NA         1.61    -0.0769    25.55
E      1280×720    Vidyo1                 NA      NA         NA         3.08    -0.1141    46.48
E      1280×720    Vidyo4                 NA      NA         NA         2.84    -0.1013    48.31
Average                                   1.11    -0.0254    18.25      1.46    -0.0635    37.16
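The averages in the last row of Table 3.3 can be re-derived from the per-sequence values. The short check below copies the numbers from the table; the variable and function names are illustrative. Note that the da Silva average covers only the five Class A and B sequences for which results are available, whereas the average of the proposed algorithm covers all eleven sequences.

```python
# (dBR %, dPSNR dB, TS %) per sequence, copied from Table 3.3.
da_silva = [
    (0.49, -0.0060, 12.44), (1.98, -0.0490, 20.91), (0.81, -0.0140, 31.73),
    (2.26, -0.0280, 17.45), (0.03, -0.0300, 8.74),
]
proposed = [
    (0.27, -0.0186, 27.72), (1.58, -0.0581, 35.99), (0.18, -0.0180, 40.48),
    (1.89, -0.0540, 37.40), (0.39, -0.0730, 41.73), (1.11, -0.0718, 40.53),
    (1.84, -0.0615, 37.49), (1.31, -0.0508, 27.04), (1.61, -0.0769, 25.55),
    (3.08, -0.1141, 46.48), (2.84, -0.1013, 48.31),
]

def column_means(rows):
    """Arithmetic mean of each of the three columns."""
    n = len(rows)
    return tuple(sum(r[k] for r in rows) / n for k in range(3))

ds_br, ds_psnr, ds_ts = column_means(da_silva)
pa_br, pa_psnr, pa_ts = column_means(proposed)
print(round(ds_br, 2), round(ds_psnr, 4), round(ds_ts, 2))  # 1.11 -0.0254 18.25
print(round(pa_br, 2), round(pa_psnr, 4), round(pa_ts, 2))  # 1.46 -0.0635 37.16
```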
Figure 3.9 RD curves of ParkScene sequence
3.5 Summary
We proposed a fast level decision algorithm for the intra prediction of HEVC
using an edge detector. By using the result of gradient detection, the time required
for the RDO processes is considerably reduced and the coding efficiency is increased.
The proposed algorithm achieved a large reduction in computational complexity
compared with the original HEVC decision algorithm. Experimental results showed that
our proposed algorithm reduced the coding time by about 37.16%, while the
corresponding increase in the bit rate was only 1.46% and the PSNR loss was
0.0635 dB. Compared with da Silva’s algorithm, our proposed algorithm achieves
a greater time saving and a similar performance in terms of the increase in bit rate
and decrease in PSNR. In the future, our main work will be to develop novel intra
prediction modes and to optimize the prediction mode decision in combination with the
fast level decision algorithm to further increase coding efficiency.
Figure 3.10 RD curves of BQMall sequence
Figure 3.11 RD curves of RaceHorses sequence
Chapter 4 Adaptive downsampling signal based intra prediction for parallel intra coding of HEVC
4.1 Background and previous works
Dynamic force from the worldwide consumer electronics market is driving display
technology and video coding technology. In 2013, the Joint Collaborative Team
on Video Coding (JCT-VC) standardized HEVC to achieve higher video coding
efficiency [13-15]. HEVC employs a flexible hierarchical structure in which a CTU
is partitioned into coding units (CUs) recursively, and supports larger CU sizes
ranging from 8 × 8 to 64 × 64 [16]. HEVC also defines two other functional types
of unit, the prediction unit (PU) and the transform unit (TU). The PU is the basic unit for
prediction and the TU is defined for transform and quantization. Intra coding reduces
spatial redundancy between neighboring PUs in the same frame and improves coding
efficiency.
Several related works have aimed to improve intra coding efficiency by exploiting
the correlation between spatial reference samples and the samples of the current PU. A
lossless intra coding algorithm based on pixel-wise spatial interleave prediction
(PWSIP) is proposed for H.264/AVC in [17]. Interleave prediction, whose reference
pixels are derived from neighboring reconstructed pixels, is used to generate the
prediction pixels. The experimental results show a 4.13% increase in compression
efficiency, but the encoding time increases by 31.59%. S. Kanimozhi et al. have proposed
an integrated scheme using PWSIP and context adaptive interpolation (CAI) [18].
Horizontal, vertical and mean modes, together with 4 intra prediction modes of
H.264/AVC, are employed to increase coding performance. Their algorithm
improves compression efficiency by about 4%. A locally adaptive
downsampling/upsampling video coding scheme has been proposed in [19] to achieve
better video quality at low bit rates in terms of both objective measurement and visual
quality. Different regions of a video frame are adaptively encoded with downsampling
ratios and quantization step sizes chosen according to the local content. The appropriate
downsampling ratios have been derived through theoretical analysis. Compared with the
Test Model 5 (TM5) MPEG-2 encoder, the PSNR improvement can be up to 1.3 dB
at low bit rates. Although the performance of that scheme is impressive at
low bit rates, its compression performance is worse than that of regular coding at high
bit rates and low QPs. In [20], a multiple line-based algorithm is proposed for HEVC.
Li et al. use further reference lines of relatively higher quality to perform better
prediction. Compared with HEVC, their fast search method achieves a 2.0% bit
saving on average, while increasing the encoding time by 112%. A new block-based
method for HEVC intra coding has been proposed in [21]. The pixels in
each prediction block are divided into two parts: half of the pixels are coded with a
constrained quantization algorithm, whereas the other half are reconstructed by linear
interpolation along a prediction direction using the neighboring reference pixels and the
first, already coded half. A competition mechanism is employed between this new method and
the original HEVC intra coding to choose the optimal mode for each prediction
block. Experimental results show that a BD bit rate reduction of about 2% is
achieved for both luma and chroma with respect to the original HEVC intra
coding; however, the encoder complexity increases by 130%.
Recently, learning-based image super-resolution using a convolutional neural
network (CNN) has been applied to improve intra coding efficiency in [22].
Compared with traditional high-efficiency intra coding methods, the
learning-based approach improves compression efficiency further, at the cost of high
computational complexity and unsolved data dependency. A CNN-based block
up-sampling scheme for intra frame coding is proposed, in which a down-sampled
block is compressed by normal intra coding and then up-sampled to its original
resolution. The network is both compact and efficient owing to a new CNN
structure that features deconvolution of feature maps, multi-scale fusion, and residue
learning. The coding parameters of down-sampled blocks are also studied thoroughly for
rate-distortion optimization. Experimental results show that the CNN-based scheme
achieves a significant bit-rate saving (approximately 5.5%). However, multi-scale
feature extraction, deconvolution, multi-scale reconstruction and residue learning in
CNN-based up-sampling lead to a large amount of computation and increased encoding
time.
These aforementioned works achieve outstanding coding performance;
however, they significantly increase the computational complexity and data dependency of
intra coding. Therefore, it is difficult to design a parallel hardware architecture
and a pipelined hardware implementation for them. In this paper, an adaptive
downsampling signal based intra coding scheme is proposed, aiming at a higher-efficiency
intra prediction scheme for HEVC that reduces computational complexity
and supports parallelization. The rest of this paper is organized as follows.
Section 4.2 introduces and analyzes intra coding in HEVC. The proposed meth-
ods including downsampling approach, downsampling prediction modes, training
method for obtaining downsampling QP and parallel intra encoding architecture
are presented in Section 4.3. Section 4.4 shows the experimental result, which is
followed by the conclusion in Section 4.5.
4.2 Analysis of intra coding in HEVC
The integrated intra coding process is illustrated in Figure 4.1 and explained in
detail as follows. The process is divided into two main steps, rough mode decision
(RMD) and rate-distortion optimization (RDO). In the RMD process, reconstructed
pixels of neighboring PUs are used to build prediction pixels with the 35 intra
prediction modes. The Hadamard transform absolute difference (HAD) costs of all the
supported prediction modes are calculated to create a list of candidates with the
smallest HAD costs. Meanwhile, the most probable modes (MPMs) are derived from
neighboring PUs and added to the mode candidates. In the RDO process, the optimal PU
partitioning and prediction mode are decided by the RD cost, which is derived by
calculating the residual between the reconstructed samples and the original samples.
Moreover, the intra coding process is performed recursively to generate the optimal encoding
results.
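The RMD step can be sketched for a single 4×4 block as follows. This is a simplified illustration, not the HM implementation: the HAD cost is left unnormalized, the predicted blocks are supplied by the caller instead of being generated from neighboring pixels, and the parameter `num_candidates` and all function names are hypothetical.

```python
# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]

def matmul4(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def had_cost(orig, pred):
    """Sum of absolute Hadamard-transformed differences of a 4x4 block."""
    diff = [[orig[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    coeffs = matmul4(matmul4(H4, diff), H4)
    return sum(abs(v) for row in coeffs for v in row)

def rough_mode_decision(orig, predictions, mpms, num_candidates=3):
    """Return the mode list that would be handed to the RDO step.

    predictions : dict mapping a mode index to its 4x4 predicted block
    mpms        : most probable modes taken from neighbouring PUs
    """
    ranked = sorted(predictions, key=lambda m: had_cost(orig, predictions[m]))
    candidates = ranked[:num_candidates]
    for mode in mpms:  # supplement the list with MPMs that are missing
        if mode in predictions and mode not in candidates:
            candidates.append(mode)
    return candidates

orig_block = [[10] * 4 for _ in range(4)]
preds = {0: [[10] * 4 for _ in range(4)], 1: [[12] * 4 for _ in range(4)],
         2: [[20] * 4 for _ in range(4)], 3: [[11] * 4 for _ in range(4)]}
```

With these toy inputs, `rough_mode_decision(orig_block, preds, mpms=[2], num_candidates=2)` keeps the two cheapest modes and then appends the MPM, so only three of the four modes would proceed to the more expensive RDO step.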
The recursive intra coding of HEVC improves coding efficiency, however, there
are still three issues of implementation difficulty:
Figure 4.1 Intra coding flow for HEVC.
Figure 4.2 High encoding latency caused by CTU-level data dependency: (a)
CTU encoding order and (b) CTU pipeline scheduling.
Figure 4.3 High encoding latency caused by PU-level data dependency: (a) PU
encoding order and (b) PU pipeline scheduling.
Figure 4.4 Spatially referencing samples of HEVC.
a) computational complexity
H.264/AVC defines block sizes from 4 × 4 to 16 × 16 and a total of 9
optional prediction modes, whereas HEVC employs much larger block sizes,
from 4 × 4 up to 64 × 64, and many more intra prediction modes. These
two features result in high efficiency for intra coding but also introduce significantly
higher complexity for intra prediction. To find the optimal CU partitioning
and prediction mode, HEVC traverses all the combinations of CU,
PU and TU, performing a series of computations to carry out the RDO process.
For the intra prediction process, a 64 × 64 CTU performs approximately 2.65×
more predictions than in H.264 [23].
b) data dependency
Figures 4.2 and 4.3 illustrate the encoding order for CTUs and PUs in HEVC.
HEVC encodes CTUs in raster scan order, and all PUs in a CTU are encoded in
Z-scan order, which provides more available neighboring reference samples in most cases.
Since reconstructed pixels of neighboring CTUs and PUs are used in HEVC, the intra
encoding process is restricted by data dependency and suffers from extremely high latency.
We take the CTU as an example to explain the hardware implementation problems that
data dependency causes for an intra encoder. The order of intra encoding is restricted by
CTU-level data dependency, which requires each CTU to wait until the encoding of both
its left and above-right neighboring CTUs is finished. This constraint leads
to latency between the current CTU and the adjacent CTUs. Furthermore, the maximum
parallelism is substantially limited and encoding cannot start from an arbitrary position
in the frame. Only the encoding direction from top left to bottom right is
permitted in HEVC. Therefore, a pipelined hardware implementation and a parallel
architecture are difficult to design. Both CTU-level and PU-level data dependency
lead to low throughput and high hardware overhead.
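The effect of the CTU-level dependency on latency can be made concrete with a small scheduling model. The sketch assumes that every CTU takes the same time `ctu_time`, that an unlimited number of parallel encoders is available, and that a CTU in the last column waits for the CTU directly above it because no above-right neighbor exists. These are modelling assumptions for illustration, not statements from the thesis.

```python
def ctu_start_times(rows, cols, ctu_time=1):
    """Earliest start time of each CTU when it must wait for its left and
    above-right neighbours (the above neighbour in the last column)."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ready = 0
            if c > 0:  # left neighbour must be finished
                ready = max(ready, start[r][c - 1] + ctu_time)
            if r > 0:  # above-right (or above) neighbour must be finished
                dep_col = min(c + 1, cols - 1)
                ready = max(ready, start[r - 1][dep_col] + ctu_time)
            start[r][c] = ready
    return start

def total_latency(rows, cols, ctu_time=1):
    """Time at which the bottom-right CTU of the frame is finished."""
    return ctu_start_times(rows, cols, ctu_time)[-1][-1] + ctu_time

schedule = ctu_start_times(3, 4)
```

For a frame of 3 × 4 CTUs the model gives start times that grow by one per column and by two per row, so the frame needs 8 CTU periods even with unlimited hardware, compared with 12 periods for purely sequential encoding. Each additional CTU row adds two periods that no amount of parallel hardware can remove under this dependency.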
c) coding efficiency
In the spatial domain, pixels to be predicted tend to have a high
correlation with their neighboring pixels in most instances. As mentioned above, HEVC
intra coding reduces the redundancy between pixels that are close to each other in the
same frame [24]. 35 intra prediction modes are provided, which use the spatially
neighboring samples to predict the samples of the current PU, as shown in Figure 4.4.
However, the correlation between spatially neighboring pixels decays as the PU size
increases. In particular, for large PU sizes (64 × 64 or 32 × 32), the available reference
samples are too far away from some pixels of the current PU, which degrades the
prediction performance.
4.3 Proposed parallel intra coding scheme
In this paper, we propose a parallel intra coding scheme which consists of the
reconstruction of downsampling reference samples, downsampling prediction modes
and a training method for the optimal downsampling QP. An overview of the proposal
is illustrated in Figure 4.5. In this work, the reference samples are reconstructed
by the downsampling approach instead of being taken from neighboring CUs as in the
original intra coding of HEVC. The data dependency caused by encoding CUs in Z-scan
order is removed by using the reference samples of the downsampling approach. Moreover,
a parallel implementation of the intra encoder becomes possible in our proposal. The
proposed downsampling prediction modes derive the intra prediction samples from the
reconstructed samples of the downsampling approach. Owing to the correlation between
the original signal and the downsampling signal, the downsampling prediction modes
reduce spatial redundancy further and increase coding efficiency compared with the
original HEVC. Furthermore, we propose a training method to generate the optimal
downsampling QP, which tries to balance the reconstructed samples of the downsampling
approach and the intra prediction samples. The final encoded bitstream is optimized in
coding performance with the optimal downsampling QP. The parallel intra encoding
architecture is also explained in detail in the following sections.
Figure 4.5 Overview of the proposed intra coding scheme.
4.3.1 Downsampling approach for reconstructing samples
In our proposal, the downsampling signal is used to generate prediction pixels. The
current CTU is downsampled into 4 sub-CTUs (S0, S1, S2 and S3) of the same
size. A 4-tap downscaling filter is devised. Each CTU applies the downsampling filter
with a simple coefficient vector (DFcoeff). The DFcoeff vector is [1,0,0,0], [0,1,0,0], [0,0,1,0] or
[0,0,0,1], corresponding to the value of j. $\vec{P}$ is the vector of original
pixels for the i-th 8×8 block in one frame. The downsampling pixels (Pdown) are
obtained as follows:
$$P_{down}(i,j) = \vec{P} \cdot \overrightarrow{DF_{coeff}} \gg 2 \tag{4.1}$$
$$P_{down}(i,j) = \begin{bmatrix} P(4i+0) \\ P(4i+1) \\ P(4i+2) \\ P(4i+3) \end{bmatrix} \cdot \begin{bmatrix} DF_{coeff}(4i+0,j) \\ DF_{coeff}(4i+1,j) \\ DF_{coeff}(4i+2,j) \\ DF_{coeff}(4i+3,j) \end{bmatrix} \gg 2 \tag{4.2}$$
Then, the downsampling signal S0 is encoded by the original HEVC intra coding
tool to derive the downsampling reconstruction pixels. Finally, the downsampling signal
is used to generate prediction samples as shown in Figure 4.6 and is also used as
reference samples in intra prediction. The spatial correlation between the downsampling
samples and the original samples is considered to be strong.
Figure 4.6 Generate prediction samples using downsampling signal.
4.3.2 Downsampling Prediction Modes
In HEVC, planar mode, DC mode, and directional modes are employed to generate
prediction samples using the spatial correlation between PUs. The intra prediction modes
are performed with reference samples reconstructed from neighboring PUs. RD
costs are calculated recursively with 35 intra prediction modes in the RDO process of
HEVC, which leads to amount of power consumption because of computational com-
plexity. Meanwhile, the data dependency caused by intra prediction modes severely
restricts parallel hardware implementation. Therefore, in our proposal, all of the
35 intra prediction modes are replaced with two downsampling prediction modes,
whereas, planar mode, DC prediction mode, and directional modes are not suitable
for parallel hardware design. Since 35 intra prediction modes are not available,
MPMs are also forbidden in our proposal. Two efficient downsampling prediction
modes are defined as downsampling mode A (DMA) and downsampling mode B
(DMB). The downsampling predictor can be formulated as:
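$$p_{i,j} = \left( \begin{bmatrix} D_{i,j-1} \\ D_{i-1,j} \\ D_{i,j} \\ D_{i+1,j} \\ D_{i,j+1} \end{bmatrix} \cdot \begin{bmatrix} R_{i,j-1} \\ R_{i-1,j} \\ R_{i,j} \\ R_{i+1,j} \\ R_{i,j+1} \end{bmatrix} \right) \gg 2 \qquad (4.3)$$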
$p_{i,j}$ denotes the predicted value of each sample, where $i$ and $j$ are the column and row coordinates. $D_{i,j}$ denotes the downsampling reference samples, which are generated by the downsampling approach. $R_{i,j}$ is the weighting parameter for the downsampling reference samples and is calculated from $D_{i,j}$. DMA and DMB differ in the values assigned to $R_{i,j}$.
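The five-tap form of equation (4.3) can be illustrated with a minimal Python sketch. It reads the equation literally: each downsampling reference sample is multiplied by its weight, the centre and its four direct neighbours are summed, and the sum is shifted right by 2. The function name `downsampling_predict`, the use of NumPy, the edge replication at block borders and the weight values in the test are assumptions for illustration only; the text does not specify how $R_{i,j}$ is derived for DMA and DMB.

```python
import numpy as np

def downsampling_predict(D, R):
    """Five-tap cross predictor following the shape of equation (4.3).

    D: 2-D integer array of downsampling reference samples.
    R: 2-D integer array of weights, indexed like D (values are hypothetical).
    """
    D = np.asarray(D, dtype=np.int64)
    R = np.asarray(R, dtype=np.int64)
    # Weighted reference samples; borders are replicated (an assumption,
    # the source does not describe border handling).
    W = np.pad(D * R, 1, mode="edge")
    h, w = D.shape
    centre = W[1:h + 1, 1:w + 1]
    up = W[0:h, 1:w + 1]
    down = W[2:h + 2, 1:w + 1]
    left = W[1:h + 1, 0:w]
    right = W[1:h + 1, 2:w + 2]
    # Sum of the five weighted taps, then the >> 2 of equation (4.3).
    return (centre + up + down + left + right) >> 2
```

With all weights set to 1 the predictor is simply the five-sample sum divided by four, which makes the behaviour easy to check by hand; any realistic choice of weights would have to come from the DMA and DMB definitions.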
4.3.3 Training method for obtaining downsampling QP
It is obvious that the final encoding result depends on the downsampling QP and the current QP (CQP). As shown in Figure 4.7, the optimal downsampling QP, which gives the best coding efficiency, can in theory be obtained by a training method. The downsampling QP obtained by this theoretical method is defined as TDQP. Every frame of the video source is used in the training process. All candidate downsampling QPs are employed to encode the downsampling signal and generate downsampling reconstruction samples. The RDO process is then performed with the downsampling reconstruction samples generated with an assumed downsampling QP (ADQP). The ADQP is adjusted according to the bitrate and PSNR results. When the trend of the bitrate and PSNR results is upward, the ADQP is adjusted to be higher. On the contrary, the ADQP is set lower when the coding efficiency becomes worse. TDQP is thus obtained by a training approach based on coding performance, and the optimal combination of downsampling QP and CQP is determined by the final compression performance.
Figure 4.7 Flowchart of training TDQP.
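The training loop can be sketched as an exhaustive search over the candidate downsampling QPs. This is only a simplified illustration: `train_tdqp` and the `coding_cost` callback are hypothetical names, and reducing the bitrate and PSNR results to a single scalar cost is an assumption, since the text does not state how the two are combined.

```python
def train_tdqp(candidates, coding_cost):
    """Return (best_dqp, best_cost) over all candidate downsampling QPs.

    coding_cost(adqp) is a caller-supplied function that runs a full
    encode with the assumed downsampling QP and returns a scalar cost
    (lower is better). Its definition is left open here.
    """
    best_dqp, best_cost = None, float("inf")
    for adqp in candidates:
        cost = coding_cost(adqp)
        if cost < best_cost:
            best_dqp, best_cost = adqp, cost
    return best_dqp, best_cost
```

Because every candidate requires a complete encode, the cost of this search grows with the number of candidates and frames, which is the motivation for a faster method.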
However, performing a complete coding process for each round of training requires massive computation, so a fast training method is presented. The downsampling QP obtained by the fast method is denoted FDQP. The initial frame of the video source is divided into four parts of the same size, and offline training is performed on these four parts with the above-mentioned training method. The FDQP, the CQP, and the RD costs of the downsampling signal and the entropy signal are used to derive the calculation formula, which is defined as follows:
$$C_{qp} = QP_C \cdot N_2 + \mu \cdot \mathrm{RDS} + \nu \cdot \mathrm{RE} \qquad (4.4)$$
where $QP_C$ denotes the current QP and $N_2$ is a weighting factor for the current QP. $\mu$ and $\nu$ are parameters for the RD cost of the downsampling signal (RDS) and of the intra coding entropy signal (RE). $C_{qp}$ is the optimal downsampling QP. Following this approach, the weighting factor and parameters are derived in the offline training process. In online intra coding, $C_{qp}$ is calculated as the optimal downsampling QP from the result of offline training, up to the final frame of the video source. Table 4.1 presents TDQP and FDQP for different sequences. The results indicate that the proposed fast training method is a valid and reasonable replacement for the theoretical method.

Table 4.1 Result of TDQP and FDQP.

| Sequence | CQP | TDQP | FDQP |
|---|---|---|---|
| Beauty | 27 | 34 | 34 |
| HoneyBee | 27 | 34 | 34 |
| ShakeNDry | 27 | 32 | 32 |
| Kimono | 27 | 30 | 30 |
| ParkScene | 27 | 31 | 30 |
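As an illustration of how equation (4.4) might be used, the sketch below fits the weighting factor and the two parameters by ordinary least squares and then evaluates the formula online. The least-squares fit, the function names, the rounding and the clipping to the range 0 to 51 are all assumptions; the text only states that the values are derived in offline training.

```python
import numpy as np

def fit_dqp_model(qp_c, rds, re, dqp):
    """Least-squares estimate of (N2, mu, nu) in C_qp = QP_C*N2 + mu*RDS + nu*RE.

    All arguments are equal-length sequences collected during offline
    training (the fitting method itself is an assumption).
    """
    A = np.column_stack([qp_c, rds, re]).astype(float)
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(dqp, dtype=float), rcond=None)
    return tuple(float(c) for c in coeffs)

def predict_dqp(qp_c, rds, re, n2, mu, nu):
    """Evaluate equation (4.4) and map the result to an integer QP."""
    c_qp = qp_c * n2 + mu * rds + nu * re
    # Rounding and clipping to 0..51 are assumptions for this sketch.
    return int(round(min(max(c_qp, 0.0), 51.0)))
```

The fit needs at least three training samples with linearly independent inputs; with only the four quarter-frame parts of one frame, the estimate would be sensitive to noise in the measured RD costs.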
4.3.4 A parallel HEVC intra encoding architecture
As illustrated in Figure 4.8, a parallel architecture design is proposed. Five prediction engines work in parallel, and universal predictors with the same block sizes (64 × 64, 32 × 32, 16 × 16, 8 × 8 and 4 × 4) improve the throughput further. The parallel intra encoding architecture operates as follows. The original video source is input for encoding. Firstly, the input signal is pooled to derive the downsampling signal, and the downsampling pixel buffer stores the pixel values of the downsampling signal. Secondly, the downsampling signal is processed by a normal RDO process (DSRDO) to generate reference samples, whose values are written to the downsampling reconstruction pixel buffer. During this RDO process, the RD cost results are stored in the DSRD cost buffer for the training method and for calculating the optimal downsampling QP. Thirdly, the reconstructed reference samples are fed to the five prediction engines. The prediction engines generate prediction pixels independently, and the parallel execution of multiple predictors reduces data dependency. Therefore, the traditionally recursive RDO process in HEVC is replaced by the parallel DSRDO and ERDO processes in our proposal. Moreover, in the proposed parallel architecture, the implementation of transform and quantization in DSRDO and ERDO is exactly the same, which simplifies the hardware design. The RD cost results of the parallel RDO process are stored in the ERD cost buffer. The values in the DSRD cost buffer and the ERD cost buffer are used by the training method to obtain the optimal downsampling QP, and the weighting factor and parameters in equation (4.4) are derived in the offline process. Finally, the encoded bitstream is output with the optimal downsampling QP, which is calculated from the current QP.

Figure 4.8 Top level hardware architecture for HEVC.
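The dataflow can be mimicked in software as a rough analogy. In the sketch below a thread pool stands in for the five hardware prediction engines, 2 × 2 average pooling stands in for the pooling stage, and a per-block SAD stands in for the RD cost; all three choices, and every function name, are assumptions made for illustration and are not taken from the architecture itself.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

BLOCK_SIZES = (64, 32, 16, 8, 4)

def pool2x2(frame):
    """Average-pool a frame by 2 in each dimension (assumed pooling factor)."""
    h, w = frame.shape
    f = frame[:h - h % 2, :w - w % 2].astype(np.int64)
    return (f[0::2, 0::2] + f[0::2, 1::2] + f[1::2, 0::2] + f[1::2, 1::2] + 2) >> 2

def block_sad(orig, pred, size):
    """SAD of every size x size block; needs no data from other block sizes."""
    h, w = orig.shape
    h, w = h - h % size, w - w % size
    diff = np.abs(orig[:h, :w].astype(np.int64) - pred[:h, :w].astype(np.int64))
    return diff.reshape(h // size, size, w // size, size).sum(axis=(1, 3))

def evaluate_all_sizes(orig, pred):
    """Run one independent cost evaluation per block size concurrently."""
    with ThreadPoolExecutor(max_workers=len(BLOCK_SIZES)) as pool:
        futures = {s: pool.submit(block_sad, orig, pred, s) for s in BLOCK_SIZES}
        return {s: f.result() for s, f in futures.items()}
```

The point of the analogy is that each block size reads only the shared prediction input, so the five evaluations have no ordering constraint among themselves, unlike the recursive RDO of the reference encoder.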
4.4 Experimental results
The proposed scheme is integrated into the HEVC test model (HM) 16.11 reference software [25]. The simulation platform is an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4 cores, 8.00 GB of RAM and Windows 10 Home Edition 64-bit. Since the proposed method is applied to intra coding, the experiment is conducted with the all-intra configuration [26]. The number of test frames of each sequence is 100, so that performance is evaluated adequately. Because our proposal targets intra encoding of high-resolution sources, only high-resolution test sequences are employed, and performance is compared at quantization parameters (QP) of 22, 27, 32 and 37 [27]. The sequences Beauty, HoneyBee and ShakeNDry are tested at 3840 × 2160 and 1920 × 1080 [28]. The sequences Kimono and ParkScene have a resolution of 1920 × 1080. The test conditions and experimental settings are listed in Table 4.2.

Table 4.2 Experimental conditions.

| Item | Setting |
|---|---|
| Reference software | HM 16.11 |
| OS | Windows 10 Home Edition 64-bit |
| CPU | Intel Core i7-4770 3.40GHz |
| RAM | 16.00 GB |
| Configuration | encoder intra main |
| Frames | 100 |
| QP | 22, 27, 32, 37 |
| Compiler software | Microsoft Visual Studio Community 2015 |
In order to evaluate the compression performance of the parallel intra encoding, two configurations are defined: the proposed efficient intra coding without adaptive downsampling QP (M1) and the proposed efficient intra coding with adaptive downsampling QP (M2). Table 4.3 summarizes the simulation results in terms of bitrate (∆BR), PSNR (∆PSNR) and time saving (∆T) for M1 and M2. The compression ratio, video quality and computational reduction are evaluated as follows:
$$\Delta BR = \frac{BR_{prop.} - BR_{HEVC}}{BR_{HEVC}} \times 100\% \qquad (4.5)$$

$$\Delta PSNR = PSNR_{prop.} - PSNR_{HEVC} \qquad (4.6)$$

$$\Delta T = \frac{T_{prop.} - T_{HEVC}}{T_{HEVC}} \times 100\% \qquad (4.7)$$
where $BR_{prop.}$ and $BR_{HEVC}$ are the bitrates of the output bitstreams of the proposed method and the original HEVC, $PSNR_{prop.}$ and $PSNR_{HEVC}$ are the PSNR values of the proposal and normal HEVC, and $T_{prop.}$ and $T_{HEVC}$ are the encoding times of the proposal and the HM 16.11 encoder, which are used to evaluate the proposed speed-up method.
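A small Python sketch of the three measures follows. The function names and the input values used in the comments are hypothetical. ∆PSNR is computed as a plain difference in dB, as it is reported in Table 4.3. Note that with the sign convention written in (4.7) a faster encoder gives a negative ∆T, whereas the time savings in Table 4.3 are listed as positive values, so the table appears to use the opposite sign.

```python
def delta_br(br_prop, br_hevc):
    """Relative bitrate change in percent, as in equation (4.5)."""
    return (br_prop - br_hevc) / br_hevc * 100.0

def delta_psnr(psnr_prop, psnr_hevc):
    """PSNR difference in dB."""
    return psnr_prop - psnr_hevc

def delta_t(t_prop, t_hevc):
    """Relative encoding time change in percent, as written in (4.7)."""
    return (t_prop - t_hevc) / t_hevc * 100.0

# Hypothetical inputs: a 5% smaller bitstream, 0.25 dB lower PSNR and
# a 25% shorter encoding time relative to the reference encoder.
example = (delta_br(95.0, 100.0), delta_psnr(39.75, 40.0), delta_t(75.0, 100.0))
```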
On average, the proposed algorithm achieves a time saving of 27.26% for intra encoding, while the average bitrate saving is 4.17% and the decrease in PSNR is only 0.2564 dB, which is negligible. The RD performance is shown in Figures 4.9, 4.10 and 4.11. From the RD curves of the proposed algorithm and HEVC, it is clear that the proposed algorithm achieves higher coding efficiency than HEVC. Specifically, the proposed scheme performs well in high-resolution video compression, and the higher the resolution, the better the coding performance, as the sequence Beauty shows in Figures 4.9 and 4.11. We consider that higher-resolution videos contain substantial spatial redundancy and that the method using the downsampling signal reduces the redundancy between PUs. Meanwhile, comparing the RD curves in Figures 4.9 and 4.10, we find that the proposed method performs better for the sequence Beauty than for HoneyBee. The sequence Beauty is a portrait against a dark background that is full of digital noise, whereas the sequence HoneyBee shows bees and flowers with less digital noise. Noise in a frame frequently lacks spatial correlation with the samples of neighboring PUs. Our proposal uses a downsampling signal that is generated from both noisy and normal pixels. Therefore, the proposed method performs well for video signals in which noise is present. We do not contend, from the above experimental results, that the proposed algorithm is always significantly better than HEVC. When some video sequences are encoded with low QP (for example 22 and 27), the reference samples generated by the downsampling approach improve prediction accuracy only marginally because of the transform and quantization method in HEVC. However, the proposed algorithm saves bitrate in the majority of cases. Moreover, it reduces computational complexity and improves parallel performance for all sequences.
Figure 4.9 RD curves of Beauty 2160p in M2.
Figure 4.10 RD curves of HoneyBee 2160p in M2.
Figure 4.11 RD curves of Beauty 1080p in M2.
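Table 4.3 Compared Experimental Results for Proposed Scheme and HEVC

| Resolution | Test Sequence | M1 ∆BR (%) | M1 ∆PSNR (dB) | M1 ∆T (%) | M2 ∆BR (%) | M2 ∆PSNR (dB) | M2 ∆T (%) |
|---|---|---|---|---|---|---|---|
| 3840x2160 | Beauty | -2.50 | -0.1270 | 24.35 | -4.73 | -0.0934 | 26.95 |
| 3840x2160 | HoneyBee | -7.54 | -0.5450 | 24.08 | -10.67 | -0.4323 | 25.11 |
| 3840x2160 | ShakeNDry | -3.21 | -0.4363 | 24.84 | -5.32 | -0.3383 | 25.61 |
| 1920x1080 | Beauty | 0.72 | -0.1510 | 24.73 | -1.38 | -0.1135 | 26.17 |
| 1920x1080 | HoneyBee | -6.47 | -0.6182 | 22.13 | -9.32 | -0.4059 | 25.06 |
| 1920x1080 | ShakeNDry | -1.83 | -0.5860 | 27.21 | -0.325 | -0.2109 | 27.80 |
| 1920x1080 | Kimono | -2.38 | -0.1809 | 30.94 | -4.35 | -0.1206 | 31.23 |
| 1920x1080 | ParkScene | 4.81 | -0.2453 | 26.94 | 2.71 | -0.3363 | 30.18 |
| | Average | -2.3 | -0.3612 | 25.65 | -4.17 | -0.2564 | 27.26 |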
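The Average row of Table 4.3 can be reproduced from the per-sequence rows with a few lines of Python. The values are copied from the table; the dictionary layout and the function name `column_means` are only an illustrative choice.

```python
# Rows in table order: Beauty, HoneyBee, ShakeNDry (3840x2160), then
# Beauty, HoneyBee, ShakeNDry, Kimono, ParkScene (1920x1080).
M1 = {
    "dBR": [-2.50, -7.54, -3.21, 0.72, -6.47, -1.83, -2.38, 4.81],
    "dPSNR": [-0.1270, -0.5450, -0.4363, -0.1510, -0.6182, -0.5860, -0.1809, -0.2453],
    "dT": [24.35, 24.08, 24.84, 24.73, 22.13, 27.21, 30.94, 26.94],
}
M2 = {
    "dBR": [-4.73, -10.67, -5.32, -1.38, -9.32, -0.325, -4.35, 2.71],
    "dPSNR": [-0.0934, -0.4323, -0.3383, -0.1135, -0.4059, -0.2109, -0.1206, -0.3363],
    "dT": [26.95, 25.11, 25.61, 26.17, 25.06, 27.80, 31.23, 30.18],
}

def column_means(results):
    """Arithmetic mean of every column of one configuration."""
    return {name: sum(vals) / len(vals) for name, vals in results.items()}

for label, results in (("M1", M1), ("M2", M2)):
    means = column_means(results)
    print(label, {k: round(v, 4) for k, v in means.items()})
```

The computed means agree with the Average row to the precision printed in the table, which also confirms that the averages are unweighted across the two resolutions.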
4.5 Summary
A novel parallel HEVC intra coding scheme based on an adaptive downsampling signal is presented for hardware implementation. It utilizes downsampling pixels to reconstruct reference samples, and generates prediction pixels with the proposed downsampling prediction modes. The proposed parallel hardware architecture reduces data dependency and improves data throughput. The experimental results show that our proposal achieves higher coding efficiency than the original HEVC.
Chapter 5 Hardware implementation oriented fast intra coding based on downsampling information for HEVC
5.1 Background and previous works
Over the past 10 years, the worldwide consumer electronics market has proven to be a dynamic force driving display technology and video coding technology, and the demand for high-resolution, high-quality video grows day by day. The assortment of differentiated standard definition (SD) and high definition (HD) devices has diminished significantly, with most of the better features and performance attributes moving into full high definition (FHD) and 4K ultra high definition (UHD) products [29]. More and more multimedia terminals are being updated and adapted to higher resolution video. It is worth noting that higher resolution and higher quality video overloads the network in content delivery because of the large amounts of video data. To solve this problem and achieve higher video coding efficiency, the Joint Collaborative Team on Video Coding (JCT-VC), which was organized by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG), standardized HEVC on January 25, 2013 [30-33].
Known as the next generation video coding standard, HEVC supports higher resolution video coding than the previous video coding standards. It extends high resolution support to 4K and 8K ultra high definition (UHD) and responds to the increasing demand for higher resolution and higher quality video. More importantly, HEVC achieves about 50% bit-rate reduction compared with advanced video coding (H.264/AVC). An evaluation of H.265/HEVC Main profile and H.264/AVC High profile shows that HEVC has substantial advantages in coding efficiency [34]. The bit-rate results show that HEVC Main profile reduces the overall bit-rate by 50.1% compared with H.264/AVC Baseline profile and by 40.7% compared with H.264/AVC High profile. Due to its higher coding efficiency, HEVC is expected to progressively replace H.264/AVC in applications and to develop into one of the main video coding standards in the future.
HEVC employs a flexible hierarchical structure based on the coding tree unit (CTU), which succeeds the macroblock (MB) based block-shaped region structure of earlier standards. The input video is divided into CTUs, and each CTU is the root of a quadtree structure. A CTU is partitioned into coding units (CUs) recursively. The CU is the basic processing unit, and each CU consists of one luma coding block (CB) and two chroma CBs. Instead of the maximum MB size of 16 × 16 in H.264/AVC, HEVC supports larger sizes of the fundamental processing unit: CU sizes range from 8 × 8 to 64 × 64. HEVC also defines two other functional types of unit, the prediction unit (PU) and the transform unit (TU). The PU is the basic unit for prediction and allows sizes from 4 × 4 to 64 × 64. The TU is defined for transform and quantization. In HEVC, the maximum TU size is 32 × 32 for luma and 16 × 16 for chroma. The minimum TU size is 4 × 4 for both luma and chroma [35].
Historically, intra coding has been a significant coding technology since H.263 [36]. It is intended to reduce the spatial redundancy between neighboring PUs in the same frame and thereby increase coding efficiency. In HEVC, intra coding is still one of the contributors to high-efficiency coding [17]. Many novel features are employed to improve the coding efficiency of intra coding in HEVC. As mentioned above, the maximum PU size is extended to 64 × 64. HEVC constructs reference samples from a wide spatial region at the bottom left, left, top left, top and top right. Another improvement in HEVC intra coding is its 35 prediction modes. In contrast with the 9 prediction modes in H.264/AVC, HEVC employs 33 angular prediction modes, DC mode and planar mode to make far more effective use of spatial correlation. These novel features substantially improve coding efficiency with excellent visual quality; however, they are also the most serious causes of the increase in computational complexity, up to 502% compared with H.264/AVC [37]. The enormous computational complexity burdens not only software operation but also hardware implementation. In hardware implementation, the complex intra coding in HEVC demands unacceptable hardware resources for encoding high resolution (HR) and high frame rate (HFR) video. The unbalanced workload in the intra encoding process makes it difficult to implement a flexible pipelined scheme and results in low throughput. Moreover, CTU level data dependency obstructs higher hardware efficiency through parallelism. Therefore, it is certainly worth developing low complexity algorithms that suit pipelined schemes and parallelized architectures for a faster HEVC encoder.
To date, a number of fast algorithms for intra coding have been proposed to reduce redundant computation in the original HEVC. These algorithms attempt to save encoding time while keeping the visual quality loss acceptable. They can be roughly classified into two categories of strategies.
The first strategy relies on the texture features of the spatial region to reduce spatial redundancy. In [8], Jiang et al. proposed a gradient based fast mode decision algorithm. By making use of the gradient information, the redundant candidates of prediction modes are reduced. Compared with HEVC, the fast intra mode decision scheme provides 20% time saving on average. To achieve higher time saving, a fast mode decision scheme for intra prediction is proposed in [38]. In the scheme, the texture complexity of CUs is calculated and a rational threshold is set to decide the CU size. The proposed scheme reduces encoding time by about 40.9%. Both [8] and [38] focused on prediction mode decision and ignored the redundancy of CU sizes in the intra coding process. Zhang et al. tried to reduce redundant computation in two ways. Four orientation feature filters are designed to extract gradient intensity and texture direction in [9]. The gradient intensity is utilized to skip impossible PU sizes, and the texture direction is derived to exclude redundant prediction modes. The proposed algorithm saves about 56.7% of encoding time with a 4.78% bit-rate increase. Song et al. intended to restrain compression performance loss while reducing computational complexity efficiently. They employed an adaptive discretization total variation threshold-based CU size determination algorithm [39] to make a fast decision on CU size. Meanwhile, analyzing pixel values based on the orientation gradient helps to reduce candidate modes. By reducing redundant CU sizes and prediction modes, their proposal incurred a maximum bit-rate increase of 2.17% and saved up to 57.21% of coding time on average. Another method is proposed in [7] to restrain the decrease in coding efficiency. Na et al. established an edge map based on the results of edge detection and balanced the trade-off between computational complexity and coding efficiency by setting the number of candidate modes. Compared with full-mode HEVC, the proposed algorithm reduces the encoding time by 56.8% at the cost of a 2.5% bit-rate increase. Furthermore, a three-stage pipelined architecture [40] is designed based on dominant direction strength for real-time realization of an HEVC encoder. In [40], a parameter named dominant direction strength is defined to evaluate CU homogeneity and perform a fast mode selection. Experimental results show that the proposed method achieves 45.8% time saving on average and that the architecture operates at a maximum clock rate of 235 MHz while supporting video up to level 6.2.
The second strategy utilizes information from neighboring CUs to reduce encoding time. A fast CU size and mode decision algorithm is proposed in [41]. The algorithm exploits the correlations between spatially nearby CUs and reduces redundant CU sizes and modes based on these correlations. Another method for deriving spatial correlation is proposed by Zhao et al. in [42]. They analyzed the rate distortion costs of the parent CU and part of its child CUs to decide the CU depth, and exploited the correlation of intra prediction modes between neighboring PUs to speed up the mode decision. As a result, their proposal provides about 50% time savings on average.
Some other work combines the first strategy with the second to reduce redundancy more efficiently. In [43], Huang et al. proposed a fast CU depth decision algorithm. The algorithm decides the CU partitioning based on both the CU texture complexity and the correlation between the current CU and neighbouring CUs. The proposed scheme provides 39.3% encoder time savings with only 0.6% coding performance cost. Shen et al. reduced computational complexity further (45% on average) with a fast CU size decision algorithm [44]. They proposed adaptive thresholds for the texture homogeneity of the current CU and a bypass strategy for intra prediction that uses the texture property and the coding information of neighboring coded CUs. To exploit spatial correlation further, the PU mode and RD cost correlations between different depth levels are obtained and utilized in [45] to reduce low-probability candidate modes and skip redundant CU sizes. Experimental results demonstrate that the proposed algorithm achieves about 50.99% computation reduction.
The above-mentioned fast algorithms and schemes reduce the computational complexity with acceptable coding efficiency loss. However, most of the proposals are not hardware oriented and are difficult to implement in hardware. Their implementation is obstructed by two main factors, throughput burden and data dependency. Throughput burden is a big challenge for a real-time intra encoder. As is well known, the base throughput of the 4K UHD@60fps video format is four times that of 1080p@60fps. Special measures are necessary to increase throughput for real-time hardware implementation. Pipelined architecture and parallelized design are two common methods of improving throughput efficiently. However, the CTU based data dependency in HEVC intra coding makes it difficult to balance the pipeline workload allocation and to operate in a highly parallel architecture. As a significant obstruction in VLSI architecture design, the CTU level data dependency problem is not solved by the above-mentioned fast algorithms. The works in [46-47] implemented real-time intra encoders supporting FHD video. They reduced computational complexity and improved the throughput with parallelized VLSI architectures. However, the coding performance of both is unsatisfactory. Moreover, the maximum throughput is insufficient to support 4K applications. Another work [48] presented a hardware architecture based on a computationally scalable algorithm. The scalability increases the maximum throughput, which supports intra encoding up to 2160p@30fps video; nevertheless, the maximum coding performance loss is a disappointing 8.91%. Other hardware implementations [49-50], which have been reported for mobile devices, meet the demand of real-time encoding for 4K applications at the sacrifice of coding efficiency. The proposal in [49-50] employed a new feature detection approach which considers pixel orientation, similarity in the spatial and temporal domains, and partitioning block types. Statistical analysis is performed with the feature detection approach to analyze pixel activities and determine the mode candidates. It reduced computational complexity by 68.5% with a maximum peak signal-to-noise ratio (PSNR) degradation of 0.16 dB.
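The throughput ratio between 4K UHD@60fps and 1080p@60fps quoted above can be checked with a short calculation. The sketch counts luma samples per second only; the function name is hypothetical, and chroma is ignored because it scales both formats by the same factor.

```python
def luma_samples_per_second(width, height, fps):
    """Number of luma samples an encoder must process each second."""
    return width * height * fps

uhd = luma_samples_per_second(3840, 2160, 60)
fhd = luma_samples_per_second(1920, 1080, 60)
print(uhd, fhd, uhd / fhd)
```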
In this paper, a downsampling process is introduced as a pre-processing stage before the intra coding process. The downsampling information derived from the pre-processing stage can be utilized to reduce the computational complexity and achieve a fast intra coding process with excellent coding performance. Moreover, the proposal is designed to be hardware-friendly and parallelized to achieve real-time encoding for UHD video. Its pipeline efficiency and throughput are substantially improved compared with the original intra coding of HEVC.
5.2 Analysis of intra coding in HEVC
In HEVC, a flexible hierarchical structure of CTUs is employed instead of the macroblock units adopted in previous video coding standards. To perform the residual intra coding process, a CU is divided into a quadtree of PUs and TUs. Figure 5.1 illustrates the quadtree structure based intra prediction order and the 5 PU sizes (64×64, 32×32, 16×16, 8×8 and 4×4), where a block in deep color is partitioned further and a block in light color is not split at all. The 5 PU sizes are defined as different depths, ranging from 0 to 4. TUs are specified to contain and compute the coefficients for spatial block transform and quantization. A TU is a square block with sizes from 4×4 up to 32×32. Another improvement in intra prediction is the number of available intra prediction modes. In order to exploit the spatial correlation more efficiently, HEVC raises the number of available modes to 35, instead of the 9 intra prediction modes in AVC.
Figure 5.1 Hierarchically quadtree structure in HEVC
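The relation between depth, PU size and the number of PUs that tile one 64×64 CTU can be tabulated with a short sketch. The function name `pu_layout` is hypothetical; the numbers follow directly from halving the block side at each quadtree level.

```python
CTU_SIZE = 64

def pu_layout(max_depth=4):
    """Return (depth, pu_size, pu_count) for a fully split 64x64 CTU."""
    layout = []
    for depth in range(max_depth + 1):
        size = CTU_SIZE >> depth   # the side length halves at each depth
        count = 4 ** depth         # every split produces four children
        layout.append((depth, size, count))
    return layout

for depth, size, count in pu_layout():
    print(f"depth {depth}: {size}x{size}, {count} PU(s)")
```

At every depth the PUs cover the same 4096 samples, so a full search revisits each sample once per depth and per tested mode, which is where the recursive cost of the quadtree comes from.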
The integrated process of intra mode and partitioning decision in HEVC is illustrated in Figure 5.2. The process is divided into two main stages, rough mode decision (RMD) and rate-distortion optimization (RDO). The RMD stage is a pre-processing stage which selects candidates from all 35 intra prediction modes. Based on these candidates, the RDO stage decides the optimal prediction mode and partitioning.
Figure 5.2 Optimal mode and partitioning decision flow
The intra coding process starts with depth 0, and the depth increases step by step. Depth 0 matches the largest CU size in a CTU, 64×64. Firstly, the original pixels of the current PU are derived, and the reconstructed pixels of neighbouring PUs are utilized to build prediction pixels with the 35 intra prediction modes. Secondly, RMD is performed to decide the prediction mode candidates, as shown in Figure 5.2 (a). In the RMD stage, the residuals between the original pixels and the prediction pixels are calculated and passed to the Hadamard transform. Then, a SATD based Hadamard cost is evaluated with the following function:
$$J_{RMD} = SATD + \lambda_{RMD} \cdot R_{prediction} \tag{5.1}$$
where SATD is the sum of the absolute values of the Hadamard-transformed residual coefficients, $\lambda_{RMD}$ represents the Lagrange multiplier, which is related to the quantization parameter (QP), and $R_{prediction}$ denotes the number of bits needed to encode the prediction mode information. A prediction mode candidate list is created from the Hadamard cost for each depth, and the number of mode candidates is {3, 3, 3, 8, 8} for depths 0 to 4, respectively. Third, the most probable modes (MPMs) are derived from neighbouring PUs and added to the mode candidates. Fourth, the best prediction mode is decided by the RD cost, which is derived from the residual between the reconstructed samples and the original samples. Moreover, the reconstructed samples are used as reference samples when the next neighbouring PU is predicted. The detailed RD cost function is defined as:
$$J_{RDO} = SSE + \lambda_{RDO} \cdot R_{mode} \tag{5.2}$$
where SSE denotes the sum of squared errors between the original pixels and the reconstructed pixels, $\lambda_{RDO}$ is the Lagrange multiplier that determines the tradeoff between rate and distortion, and $R_{mode}$ specifies the total number of bits needed to encode the PU. The quadtree-based RDO process is implemented recursively and finally determines the optimal prediction mode and partitioning. The RMD and RDO processes search through all combinations of CU, PU and TU with all 35 intra prediction modes. For a largest CU of size 64×64, the Hadamard cost is calculated 7,327 times and the RD cost at least 3,646 times, which consumes much computational time [51] and makes real-time hardware implementation difficult. Therefore, reducing redundant partitioning and prediction modes is important, and a great deal of computation can be saved by optimizing the RDO process. However, improperly reducing the PU sizes, TU sizes and prediction mode candidates degrades coding efficiency. The proposed downsampling information based intra coding greatly reduces the PU sizes, TU sizes and mode candidates. Nevertheless, it cannot completely avoid missing the optimal partitioning and prediction mode, so some performance decrease remains.
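The two cost functions (5.1) and (5.2) can be sketched in a few lines of Python. This is a minimal illustration and not the HM implementation: the function names (`hadamard_1d`, `satd`, `sse`, `rmd_cost`, `rdo_cost`) are hypothetical, blocks are assumed to be square lists of lists whose side is a power of two, and the SATD is left unscaled, whereas a real encoder may normalize it.

```python
def hadamard_1d(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    v = list(values)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for k in range(start, start + h):
                a, b = v[k], v[k + h]
                v[k], v[k + h] = a + b, a - b
        h *= 2
    return v


def satd(original, prediction):
    """Sum of absolute values of the 2-D Hadamard-transformed residual."""
    residual = [[o - p for o, p in zip(ro, rp)]
                for ro, rp in zip(original, prediction)]
    rows = [hadamard_1d(r) for r in residual]    # transform every row
    cols = [hadamard_1d(c) for c in zip(*rows)]  # then every column
    return sum(abs(x) for c in cols for x in c)


def sse(original, reconstructed):
    """Sum of squared errors between two blocks."""
    return sum((o - r) ** 2
               for ro, rr in zip(original, reconstructed)
               for o, r in zip(ro, rr))


def rmd_cost(original, prediction, lambda_rmd, mode_bits):
    """J_RMD = SATD + lambda_RMD * R_prediction, as in (5.1)."""
    return satd(original, prediction) + lambda_rmd * mode_bits


def rdo_cost(original, reconstructed, lambda_rdo, total_bits):
    """J_RDO = SSE + lambda_RDO * R_mode, as in (5.2)."""
    return sse(original, reconstructed) + lambda_rdo * total_bits
```

For a 4×4 block whose residual is 2 everywhere, only the DC coefficient is non-zero, so `satd` returns 32 and `sse` returns 64.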
5.3 Proposed downsampling information based fast intra coding algorithms
Figure 5.3 The proposed fast intra coding scheme
As analyzed above, the intra encoder of HEVC exhaustively checks the possible partitionings and available prediction modes to obtain the lowest R-D cost. It derives the optimal intra prediction and achieves higher coding efficiency at the cost of enormous computation in RMD and RDO. To reduce redundant computation and enable real-time hardware implementation, we propose downsampling information based intra coding, which contains several fast algorithms. The proposed fast intra coding scheme is outlined in Figure 5.3. First, a preprocessing stage operates on the downsampled source. This stage applies a downsampling filter and derives prediction information and encoding characteristics from the downsampled samples. Then, the downsampling prediction information is used to make fast decisions on PU depth, TU depth and prediction mode. The fast PU depth decision (FPDD) skips and early-terminates redundant PU depths. The fast TU depth decision (FTDD) simplifies the residual quadtree (RQT) process and makes a real-time hardware implementation more practical. The fast prediction mode decision (FPMD) reduces redundant mode candidates and accelerates the RDO process. Finally, the best CU, PU and TU partitioning and prediction mode are determined quickly. The details of the proposed preprocessing stage and fast algorithms are described in the following subsections.
5.3.1 Downsampling information based preprocessing stage
A downsampled frame is derived directly from the original frame, so the downsampled samples are strongly associated with the original samples, particularly at high resolutions. Therefore, we presume that the optimal PU partitioning and prediction mode obtained by intra encoding the downsampled samples are strongly related to those obtained with the original samples. The experimental results show that, in most instances, this correlation between the downsampled frame and the original frame in optimal PU partitioning and prediction mode is strong. The correlation is estimated by the following functions:
$$\mu_{DSD} = \frac{1}{N}\sum_{i=1}^{N} C_i, \quad C_i = \begin{cases} 0, & (PU_{Depth} \neq DS_{Depth}) \\ 1, & (PU_{Depth} = DS_{Depth}) \end{cases} \tag{5.3}$$
$$\mu_{DSD\pm1} = \frac{1}{N}\sum_{i=1}^{N} E_i \tag{5.4}$$
$$E_i = \begin{cases} 0, & (PU_{Depth} \in DS_{Depth} \pm 1) \\ 1, & (PU_{Depth} \notin DS_{Depth} \pm 1) \end{cases} \tag{5.5}$$
$$\mu_{DPM} = \frac{1}{N}\sum_{i=1}^{N} P_i, \quad P_i = \begin{cases} 0, & (PU_{Mode} \neq DS_{Mode}) \\ 1, & (PU_{Mode} = DS_{Mode}) \end{cases} \tag{5.6}$$
where $C_i$ indicates whether the optimal PU depth is the same for the downsampled samples ($DS_{Depth}$) and the original samples ($PU_{Depth}$) for the i-th 8×8 block in one frame. N represents the total number of 8×8 blocks in one frame, which differs between sequences. $\mu_{DSD}$ denotes the proportion of blocks whose PU partitioning in the downsampled frame is the same as that in the original frame. In (5.5), $E_i$ indicates whether the optimal PU depth of the original samples lies outside the range $DS_{Depth} \pm 1$ of the downsampled samples. $\mu_{DSD\pm1}$ is the corresponding proportion when the PU partitioning of the downsampled frame is extended by ±1 depth. In (5.6), $P_i$ indicates whether the optimal prediction mode is the same for the downsampled samples ($DS_{Mode}$) and the original samples ($PU_{Mode}$). $\mu_{DPM}$ denotes the proportion of blocks whose prediction mode in the downsampled frame matches that in the original frame. The results of $\mu_{DSD}$, $\mu_{DSD\pm1}$ and $\mu_{DPM}$ are plotted as DS Depth, DS Depth±1 and DS Prediction Mode, respectively, in Figure 5.4. Several Class A and Class B sequences, namely Traffic, PeopleOnStreet, SteamLocomotiveTrain, NebutaFestival, BQTerrace, Kimono, Cactus and ParkScene, are used to estimate the correlation. The histograms show that the PU partitioning in the original frames matches that in the downsampled frames with high probability, and that the optimal prediction mode with downsampled samples is frequently consistent with that with original samples. Therefore, we consider that the downsampling information can be used to predict the optimal PU partitioning and prediction mode and to reduce redundant computation in original HEVC.
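As a sketch of how the concordance rates plotted in Figure 5.4 could be computed from per-block results, the following Python function takes four equal-length lists with one entry per 8×8 block. The function and argument names are hypothetical. Two assumptions are made: the range DSDepth±1 is taken to include DSDepth itself, and the ±1 rate counts the blocks inside that range, that is, the complement of the mean of the indicator written in (5.5).

```python
def concordance_rates(pu_depth, ds_depth, pu_mode, ds_mode):
    """Return (same-depth rate, within-one-depth rate, same-mode rate).

    Each argument holds one value per 8x8 block of a frame:
    pu_* come from encoding the original samples,
    ds_* come from encoding the downsampled samples.
    """
    n = len(pu_depth)
    same_depth = sum(p == d for p, d in zip(pu_depth, ds_depth)) / n
    # Assumption: "within DSDepth +/- 1" includes an exact match.
    within_one = sum(abs(p - d) <= 1 for p, d in zip(pu_depth, ds_depth)) / n
    same_mode = sum(p == d for p, d in zip(pu_mode, ds_mode)) / n
    return same_depth, within_one, same_mode
```

With depths `[1, 2, 3, 0]` against `[1, 3, 1, 0]`, two of four blocks match exactly and three of four lie within one depth level.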
Based on this observation, a downsampling information based preprocessing stage is proposed. In this stage, we first downsample the original frames with a 4-tap downsampling filter. Each CU is filtered with a simple coefficient vector ($DF_{coeff}$), which is [1,0,0,0], [0,1,0,0], [0,0,1,0] or [0,0,0,1] according to the value of j. $\vec{P}$ is the vector of original pixels for the i-th 8×8 block in one frame. The downsampled pixels ($P_{down}$) are obtained as follows:
$$P_{down}(i, j) = \vec{P} \cdot \overrightarrow{DF_{coeff}} \tag{5.7}$$
$$P_{down}(i, j) = \begin{bmatrix} P(4i+0) \\ P(4i+1) \\ P(4i+2) \\ P(4i+3) \end{bmatrix} \cdot \begin{bmatrix} DF_{coeff}(4i+0, j) \\ DF_{coeff}(4i+1, j) \\ DF_{coeff}(4i+2, j) \\ DF_{coeff}(4i+3, j) \end{bmatrix} \tag{5.8}$$
Figure 5.4 Concordance rates of downsampling and original encoded information (bars per sequence for DS Depth, DS Depth±1 and DS Prediction Mode; sequences Tra., Peo., Ste., Neb., BQT., Kim., Cac. and Par.; vertical axis from 0% to 90%)
As illustrated in Figure 5.5, each PU in the original frames is downsampled into four sub-PUs (S0, S1, S2 and S3) of the same size. Then, the downsampled frames formed by S0 are encoded with the original HEVC intra coding tools to derive the PU partitioning and prediction mode. The other three sub-PUs (S1, S2 and S3) are used to assist the RQT partitioning determination in the proposed fast PU depth decision algorithm, which is described in detail in Section 5.3.2. Moreover, the stage obtains the downsampling information at an acceptable computational cost, as the experimental results show later.
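The downsampling step can be illustrated with a short Python sketch. It is written under one assumption that the text does not state explicitly: each group of four pixels is a 2×2 neighbourhood ordered top-left, top-right, bottom-left, bottom-right, so that the four one-hot coefficient vectors select the four sampling phases. The names `DF_COEFF`, `downsample_pixel` and `split_into_sub_pus` are hypothetical.

```python
# One-hot coefficient vectors; index j selects sub-PU S_j.
DF_COEFF = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]


def downsample_pixel(group, j):
    """Dot product of a 4-pixel group with the j-th coefficient vector, as in (5.8)."""
    return sum(p * c for p, c in zip(group, DF_COEFF[j]))


def split_into_sub_pus(block):
    """Split a 2-D block into four equally sized sub-blocks [S0, S1, S2, S3].

    Assumption: the four pixels of a group form a 2x2 neighbourhood,
    so S_j keeps one pixel of every neighbourhood.
    """
    offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (row offset, column offset)
    return [[row[dx::2] for row in block[dy::2]] for dy, dx in offsets]
```

Because every coefficient vector is one-hot, the filter needs no multiplications in practice: each sub-PU is simply a strided selection of the original pixels.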
5.3.2 Fast PU depth decision algorithm
The PU depth of intra coding in HEVC ranges from 0 to 4, which corresponds to PU sizes from 64×64 down to 4×4. The various PU sizes lead to expensive computation in the RDO process, as mentioned above. Meanwhile, the strong correlation between the optimal PU depth in the downsampled frame and that in the original frame was demonstrated in Section 5.3.1. Therefore, we propose to use the optimal PU depths of the downsampled samples to decide the PU depths of the original frame quickly.
Figure 5.5 Overview of downsampling method (the original frame is downsampled into four sub-PUs S0, S1, S2 and S3)
As shown in Figure 5.4, the optimal PU depth in the downsampled frames is not always the same as that in the original frames. In other words, in some cases the optimal PU partitioning differs from the result obtained with the downsampled frames. The concordance rate of DS Depth, which denotes the PU partitioning similarity between the downsampled and original frames, is not as high as expected. Therefore, the downsampling information cannot be used directly to determine the PU partitioning in intra coding of HEVC.
In our proposal, a reference confidence (RC) is defined to decide the PU partitioning for intra coding, using the structural similarity index measurement (SSIM). SSIM is a method for measuring the similarity between images and videos [52]. It can be viewed as a quality measure of the current image relative to a reference image that is regarded as being of perfect quality. In this work, SSIM is used to obtain the reference confidence components (RCC) with the following function:
$$RCC_k = [l(x, y_k)]^{\alpha} \cdot [c(x, y_k)]^{\beta} \cdot [s(x, y_k)]^{\gamma} \tag{5.9}$$
where $RCC_k$ denotes the similarity between the S0 sub-PU and the neighbouring sub-PU $S_k$ in Figure 5.5. x is the index of the 8×8 S0 sub-PUs and $y_k$ is the index of the 8×8 neighbouring sub-PUs $S_k$. The RCC is based on the computation of three terms, namely the luminance term, the contrast term and the structural term. The overall RCC is a multiplicative combination of the three terms, as in Eq. (5.9). $l(x, y_k)$, $c(x, y_k)$ and $s(x, y_k)$ represent the luminance term, the contrast term and the structural term, respectively. The three terms are formulated as follows:
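$$l(x, y_k) = \frac{2\mu_x \mu_{y_k} + C_1}{\mu_x^2 + \mu_{y_k}^2 + C_1} \tag{5.10}$$
$$c(x, y_k) = \frac{2\sigma_x \sigma_{y_k} + C_2}{\sigma_x^2 + \sigma_{y_k}^2 + C_2} \tag{5.11}$$
$$s(x, y_k) = \frac{\sigma_{xy_k} + C_3}{\sigma_x \sigma_{y_k} + C_3} \tag{5.12}$$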
where $\mu_x$, $\mu_{y_k}$, $\sigma_x$, $\sigma_{y_k}$ and $\sigma_{xy_k}$ are the local means, standard deviations and cross-covariance of the S0 sub-PU and the neighbouring sub-PU $S_k$. The exponents $\alpha$, $\beta$ and $\gamma$ adjust the relative importance of the three terms (luminance term, contrast term and structural term). In this work we give the three terms the same weight and set $\alpha$, $\beta$ and $\gamma$ to 1. $C_1$, $C_2$ and $C_3$ are constants that stabilize the division when a denominator is weak, that is, when either $\mu_x^2 + \mu_{y_k}^2$ or $\sigma_x^2 + \sigma_{y_k}^2$ is very close to zero. $C_1$ and $C_2$ are calculated as $C_1 = (\nu_1 L)^2$ and $C_2 = (\nu_2 L)^2$, where $\nu_1$ and $\nu_2$ are small constants far less than 1. In this work, $\nu_1$ is set to 0.01 and $\nu_2$ is set to 0.03. L is the dynamic range of the pixel values and is set to 255 in our proposal. $C_3$ is set to $C_3 = C_2/2$ to simplify the expression. Finally, the RCC simplifies to
$$RCC_k = \frac{(2\mu_x \mu_{y_k} + C_1)(2\sigma_{xy_k} + C_2)}{(\mu_x^2 + \mu_{y_k}^2 + C_1)(\sigma_x^2 + \sigma_{y_k}^2 + C_2)} \tag{5.13}$$
A smaller RCC value means that the similarity between S0 and Sk is lower, whereas a large RCC value indicates that S0 is similar to Sk. High reference confidence (HRC) is declared when RCC1, RCC2 and RCC3 are all greater than 0.85 (the threshold $\theta_{RCC}$); otherwise the case is classified as low reference confidence (LRC). In the HRC condition, the four sub-PUs are similar, which implies that the downsampled samples are structurally similar to the original samples. Therefore, the partitioning result of the downsampled frame can be used to determine the partitioning in intra coding. In our proposal, when the RC is HRC, the optimal PU depth is set equal to the depth at the same position in the downsampling coding result. Depth skipping and early termination exclude the other four redundant PU depths from the RDO process. In the LRC condition, on the other hand, at least three PU depths are evaluated in RDO to avoid missing the optimal partitioning: every PU depth that is no more than the downsampling depth + 2 is processed in RDO. For instance, if the downsampling depth at the same position is 1 and the RC is LRC, PU depths 0, 1, 2 and 3 are evaluated to decide the optimal PU partitioning.
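The reference confidence test and the resulting PU depth candidates can be sketched in Python as follows. This is an illustration under stated assumptions, not the hardware implementation: sub-PUs are passed as flat lists of pixel values, the statistics are plain population means, variances and covariance over the whole sub-PU (no local window or weighting), and the names `rcc` and `candidate_pu_depths` are hypothetical.

```python
def rcc(x, y, dynamic_range=255, nu1=0.01, nu2=0.03):
    """Simplified reference confidence component, following (5.13)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (nu1 * dynamic_range) ** 2
    c2 = (nu2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def candidate_pu_depths(s0, neighbours, ds_depth, theta=0.85, max_depth=4):
    """PU depths that RDO still has to evaluate for one block.

    s0         -- pixels of sub-PU S0
    neighbours -- pixels of sub-PUs S1, S2 and S3
    ds_depth   -- optimal PU depth found on the downsampled frame
    """
    high_confidence = all(rcc(s0, sk) > theta for sk in neighbours)
    if high_confidence:
        # HRC: trust the downsampled result and evaluate only that depth.
        return [ds_depth]
    # LRC: evaluate every depth up to ds_depth + 2 (capped at the deepest level).
    return list(range(0, min(ds_depth + 2, max_depth) + 1))
```

With a downsampling depth of 1 and low confidence, the function returns depths 0 to 3, which matches the example given in the text.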
In summary, the proposed fast PU depth decision algorithm is presented in Algorithm 1. First, the downsampled pixels are derived as ArrayDS, as described in Section 5.3.1, and the RDO process of the preprocessing stage is performed with ArrayDS to output DSDepth. In addition, the RCC values are calculated from ArrayDS and compared with $\theta_{RCC}$ to decide between HRC and LRC. Then, the original pixels are imported to perform the RDO process, which is skipped or terminated early according to the RC result and DSDepth. Finally, the optimal PU depth is derived.
5.3.3 Fast TU depth decision algorithm
HEVC employs a recursive RQT process to determine the TU depth. The original RQT process allows a residual block to be further divided into TUs recursively, which forms another quadtree. This quadtree-based recursive structure increases intra coding efficiency; however, it adds a large computational complexity to the search for the optimal partitioning and prediction mode, which obstructs real-time hardware implementation of HEVC intra coding.
Algorithm 1 Optimal PU depth derivation process
function DSDep(DSPel)
    ArrayDS ← DSPel
    Process RDO with ArrayDS[j]
    for all i such that 8×8 PU ∈ CTU do
        DSDepth ← PUDepth(i)
    end for
    return DSDepth
end function

function PUDepthDecision(OrgPel)
    ArrayOrg ← OrgPel
    for all k ∈ DFcoeff vectors do
        ArrayDS[k] ← ArrayOrg · DFcoeff
        if k ≠ 0 then
            Calculate RCCk with ArrayDS[0] and ArrayDS[k]
        end if
    end for
    Decide HRC or LRC with θRCC
    Process RDO with ArrayOrg
    if HRC then
        while Depth < DSDep(ArrayDS[0]) do
            Skip the Depth
        end while
        while Depth > DSDep(ArrayDS[0]) do
            Terminate process early
        end while
    else (LRC)
        while Depth > DSDep(ArrayDS[0]) + 2 do
            Terminate process early
        end while
    end if
    return PUDepth
end function
The fast TU depth decision is proposed to simplify the RQT process and reduce redundant TU depths in RDO. The recursive procedure for quantization and transform (QT) is replaced by a simplified process. In the simplified process, the original pixels are input as ArrayOrg and the RQT process is performed with ArrayOrg. When the TU depth is deeper than the current PU depth, the RQT process is terminated early. Conversely, a TU depth is skipped when it is shallower than the current PU depth. For instance, when the current PU size is 32×32, the PU is transformed and quantized only with the size 32×32, instead of running the entire process with TU sizes of 32×32, 16×16, 8×8 and 4×4. With this reduction of TU sizes, the downsampling information based fast intra coding further reduces computational complexity and saves encoding time. According to the experimental results, the fast TU depth decision algorithm saves 16.8% of the encoding time on average for the proposed fast intra coding, at the cost of a slight BD-rate increase and PSNR loss.
Algorithm 2 Optimal TU depth derivation process
function TUDepthDecision(OrgPel)
    ArrayOrg ← OrgPel
    Process RQT with ArrayOrg
    while Depth < PUDepthDecision(OrgPel) do
        Skip the Depth
    end while
    while Depth > PUDepthDecision(OrgPel) do
        Terminate process early
    end while
    return TUDepth
end function
5.3.4 Fast prediction mode decision algorithm
HEVC evaluates 35 intra prediction modes to create a mode candidate list for the RDO process. The initial list contains {3, 3, 3, 8, 8} candidates for PU depths {0, 1, 2, 3, 4}, so more prediction modes are evaluated in the RDO process for smaller PU sizes. Moreover, HEVC supplements the mode candidate list with the MPMs, which add at most three prediction modes that are not yet in the list [53], as shown in Table 5.1. As a result, the mode candidate list of HEVC contains substantial redundancy. In our proposal, we reduce the size of the initial list to {1, 1, 2, 3, 3} and supplement it with the downsampling prediction mode (DPM).
As discussed in Section 5.3.1, the downsampling prediction mode matches the optimal prediction mode with a certain probability. Therefore, we include the DPM in the mode candidate list to avoid missing the optimal prediction mode. The worst-case size of the intra prediction mode candidate list is given in Table 5.1. Compared with original HEVC, the proposal removes redundant prediction mode candidates and cuts computational complexity substantially, particularly for PU sizes 4×4 and 8×8. The experimental results demonstrate that the fast prediction mode decision algorithm further reduces redundant prediction modes and that the supplemented downsampling prediction mode effectively avoids missing the optimal prediction mode.
Table 5.1 Number of mode candidates for HEVC and the proposal in the worst case

| PU size | Mode candidates in HEVC | Mode candidates in proposal |
|---|---|---|
| 64×64 | [3] + 3 | [1 + 1] + 3 |
| 32×32 | [3] + 3 | [1 + 1] + 3 |
| 16×16 | [3] + 3 | [2 + 1] + 3 |
| 8×8 | [8] + 3 | [3 + 1] + 3 |
| 4×4 | [8] + 3 | [3 + 1] + 3 |
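The construction of the reduced candidate list can be sketched in Python. The sketch assumes that the RMD costs are available as a dictionary from mode index to Hadamard cost, and that the DPM is appended before the MPMs; the text does not fix this order, and the name `build_mode_candidates` is hypothetical.

```python
# Initial list sizes of the proposal per PU size ({1, 1, 2, 3, 3} for depths 0..4).
INITIAL_CANDIDATES = {64: 1, 32: 1, 16: 2, 8: 3, 4: 3}


def build_mode_candidates(rmd_costs, pu_size, dpm, mpms):
    """Return the modes passed on to RDO for one PU.

    rmd_costs -- dict mapping intra mode index to its RMD (Hadamard) cost
    pu_size   -- PU width in pixels (64, 32, 16, 8 or 4)
    dpm       -- downsampling prediction mode for this position
    mpms      -- most probable modes derived from neighbouring PUs
    """
    ranked = sorted(rmd_costs, key=rmd_costs.get)  # cheapest mode first
    candidates = ranked[:INITIAL_CANDIDATES[pu_size]]
    # Supplement with the DPM and at most three MPMs, skipping duplicates.
    for mode in [dpm, *mpms[:3]]:
        if mode not in candidates:
            candidates.append(mode)
    return candidates
```

In the worst case no supplemented mode is already in the list, so an 8×8 PU ends up with 3 + 1 + 3 = 7 candidates, the value given for the proposal in Table 5.1.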
5.4 Top-level design for VLSI architecture
As mentioned above, data dependency, throughput burden and additional complex hardware structures severely restrict real-time hardware implementation of HEVC intra coding. The proposed architecture aims to solve these three problems and to support all CU, PU and TU partitionings and prediction modes.
As illustrated in Figure 5.6, the pipelined architecture consists of two stages, the preprocessing stage (PS) and the fast intra coding stage (FICS). The two stages operate in parallel, and the data in the preprocessing stage are entirely independent of those in the fast intra coding stage. To derive the downsampling partitioning, prediction mode and RC, the PS imports the original pixels and source information. Downsampled samples are obtained from the original samples and stored in the downsampling samples buffer. Then RMD and RDO are performed to derive the downsampling depth and prediction mode. In addition, the RC is calculated and stored in a buffer. Meanwhile, the original pixel source is imported into the original samples buffer in FICS. The fast RQT process is performed to determine the optimal partitioning and prediction mode from the downsampling information and the original samples. Finally, entropy coding is performed and the coded bitstream is output. As discussed in Section 5.2, several obstacles need to be overcome for hardware implementation. The parallelized design of PS and FICS solves the problem of CTU-based data dependency in intra coding of HEVC. Both PS and FICS can be designed as pipelined structures, which improves throughput. Furthermore, PS and FICS can use similar prediction elements (PEs) to predict the current PU and calculate the RDO cost.
Figure 5.6 Top level architecture of proposed fast intra coding
Owing to the proposed fast PU depth decision, fast TU depth decision and fast prediction mode decision algorithms, the worst-case number of cycles ($C_{FICS}$) is calculated in (5.14) and (5.15). The calculation shows that about 2361 cycles are necessary to encode a CTU. As our proposal aims at a software solution for 4K@60fps real-time hardware implementation, the maximum operating frequency is estimated as about 287 MHz by (5.16). Therefore, the proposed downsampling information based fast intra coding and parallelized architecture can meet the requirement of real-time processing.
Figure 5.7 Timing diagram with PS and FICS (along the time axis, the preprocessing stage runs DS+RC+RMD and then RDO, 910 cycles, for CTU 0, CTU 1, ...; the fast intra coding stage runs FRDO, 2361 cycles, for CTU 0, CTU 1, ...)
The pipelined timing schedule of PS and FICS is illustrated in Figure 5.7. PS and FICS operate in parallel and independently, which reduces the CTU-level data dependency of intra coding in HEVC. In PS, source downsampling (DS), RC calculation and RMD are performed first for CTU 0. Then, the RDO process is executed and intra encoding of CTU 0 is completed. Meanwhile, DS, RC calculation and RMD of CTU 1 can start. The RDO process of PS takes 910 cycles, as calculated in (5.17) and (5.18). The PS is designed as a pipelined architecture to improve throughput. In FICS, the fast RDO process (FRDO) takes 2361 cycles to encode the source and output the coded bitstream.
$$C_{FICS} = \frac{64}{4} \times \frac{64}{4} \times 7 + \frac{64}{8} \times \frac{64}{8} \times 7 + \frac{64}{16} \times \frac{64}{16} \times 6 \tag{5.14}$$
$${} + \frac{64}{32} \times \frac{64}{32} \times 5 + \frac{64}{64} \times \frac{64}{64} \times 5 = 2361 \ (\text{Cycles/CTU}) \tag{5.15}$$
$$F_{max} = 2361 \times \frac{3840}{64} \times \frac{2160}{64} \times 60 = 286.87 \ (\text{MHz}) \tag{5.16}$$
$$C_{PS} = \frac{32}{4} \times \frac{32}{4} \times 11 + \frac{32}{8} \times \frac{32}{8} \times 11 + \frac{32}{16} \times \frac{32}{16} \times 6 \tag{5.17}$$
$${} + \frac{32}{32} \times \frac{32}{32} \times 6 = 910 \ (\text{Cycles/CTU}) \tag{5.18}$$
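The arithmetic of (5.14) to (5.18) can be checked with a short Python sketch. The per-PU cycle factors are copied from the equations, and they coincide with the worst-case candidate counts of Table 5.1; treating them as cycles per PU is a reading of the formulas rather than a statement from the text. The names `cycles_per_ctu`, `FICS_CYCLES` and `PS_CYCLES` are hypothetical.

```python
def cycles_per_ctu(ctu_size, cycles_per_pu):
    """Sum over PU sizes of (number of PUs in the CTU) x (cycles per PU)."""
    return sum((ctu_size // size) ** 2 * cycles
               for size, cycles in cycles_per_pu.items())


# PU size -> per-PU factor taken from (5.14)-(5.15) and (5.17)-(5.18).
FICS_CYCLES = {4: 7, 8: 7, 16: 6, 32: 5, 64: 5}
PS_CYCLES = {4: 11, 8: 11, 16: 6, 32: 6}

c_fics = cycles_per_ctu(64, FICS_CYCLES)  # 64x64 CTU on original samples
c_ps = cycles_per_ctu(32, PS_CYCLES)      # 32x32 block on downsampled samples

# Required clock for 3840x2160 at 60 frames per second, in Hz, as in (5.16).
f_max_hz = c_fics * (3840 / 64) * (2160 / 64) * 60

print(c_fics, c_ps, f_max_hz / 1e6)
```

The script reproduces 2361 and 910 cycles per CTU and a required clock of about 287 MHz.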
5.5 Experimental results
Several experiments are carried out to evaluate the proposed downsampling information based fast intra coding, which is integrated into the HEVC test model (HM) 12.1 reference software [54]. The simulation platform is an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4 cores, 8.00 GB of RAM and Windows 10 Home Edition 64-bit. Because our proposal aims at a software solution for real-time intra encoding of high-resolution sources, Class 4K (3840×2160), Class A (2560×1600) and Class B (1920×1080) sequences are used to evaluate BD-rate and BD-PSNR [55] performance and the reduction in computational complexity. The test sequences are encoded with quantization parameters (QP) 22, 27, 32 and 37 under the common test conditions defined in [56]. As the proposed fast algorithms apply mainly to intra coding, we set the I-frame period to 1 so that all tested frames are intra encoded.
The proposed fast intra coding consists of three algorithms, FPDD, FTDD and FPMD, which save encoding time at the cost of a BD-rate increase. FTDD reduces redundant TU depths in the RQT process and provides a 16.8% time reduction with a 0.4% BD-rate increase on average, as listed in Table 5.2. The comparison between FICS and FICS without FTDD is expressed in terms of time saving ($\Delta TS_{TU}$), BD-rate ($\Delta BR_{TU}$) and BD-PSNR ($\Delta PSNR_{TU}$), which are defined as follows:
$$\Delta BR_{TU} = BR_{FICS} - BR_{TU} \tag{5.19}$$
$$\Delta PSNR_{TU} = PSNR_{FICS} - PSNR_{TU} \tag{5.20}$$
$$\Delta TS_{TU} = TS_{FICS} - TS_{TU} \tag{5.21}$$
$$TS_{TU} = \frac{1}{4}\sum_{i=1}^{4} \frac{T_{HM}(QP_i) - T_{TU}(QP_i)}{T_{HM}(QP_i)} \times 100\% \tag{5.22}$$
$$TS_{FICS} = \frac{1}{4}\sum_{i=1}^{4} \frac{T_{HM}(QP_i) - T_{FICS}(QP_i)}{T_{HM}(QP_i)} \times 100\% \tag{5.23}$$
where $BR_{FICS}$ and $BR_{TU}$ denote the BD-rate performance of FICS and FICS without FTDD, $PSNR_{FICS}$ and $PSNR_{TU}$ indicate the BD-PSNR loss of FICS and FICS without FTDD, and $TS_{FICS}$ and $TS_{TU}$ are the encoding time reductions of FICS and FICS without FTDD. $T_{HM}(QP_i)$ indicates the encoding time of HM 12.1 with $QP_i$. $T_{FICS}(QP_i)$ and $T_{TU}(QP_i)$ denote the encoding times of FICS and FICS without FTDD under QP values 22, 27, 32 and 37.
FPMD reduces redundant prediction modes and supplements the downsampling prediction mode to avoid missing the optimal prediction mode. As shown in Table 5.3, FPMD contributes a 1.85% encoding time reduction and a 0.03% BD-rate decrease compared with FICS without FPMD. The comparison between FICS and FICS without FPMD is expressed in terms of time saving ($\Delta TS_{PM}$), BD-rate ($\Delta BR_{PM}$) and BD-PSNR ($\Delta PSNR_{PM}$), which are defined as follows:
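$$\Delta BR_{PM} = BR_{FICS} - BR_{PM} \tag{5.24}$$
$$\Delta PSNR_{PM} = PSNR_{FICS} - PSNR_{PM} \tag{5.25}$$
$$\Delta TS_{PM} = TS_{FICS} - TS_{PM} \tag{5.26}$$
$$TS_{PM} = \frac{1}{4}\sum_{i=1}^{4} \frac{T_{HM}(QP_i) - T_{PM}(QP_i)}{T_{HM}(QP_i)} \times 100\% \tag{5.27}$$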
where $BR_{PM}$ and $PSNR_{PM}$ denote the BD-rate performance and the BD-PSNR loss of FICS without FPMD, $TS_{PM}$ is the encoding time reduction of FICS without FPMD, and $T_{PM}$ denotes the encoding time of FICS without FPMD. FPDD reduces the computational complexity by using the downsampling prediction depth. Its contribution to time saving and BD-rate increase can be derived by analyzing Tables 5.2, 5.3 and 5.4.
Table 5.4 shows the simulation results in terms of BD-rate ($BR_{FICS}$), BD-PSNR ($PSNR_{FICS}$), time consumption of the preprocessing stage (PTC) and time saving of the fast intra coding ($TS_{FICS}$). PTC and $TS_{FICS}$ are measured relative to the encoding time of the original HM 12.1 to evaluate the computational complexity of PS and FICS. PTC is defined as follows:
$$PTC = \frac{1}{4}\sum_{i=1}^{4} \frac{T_{PS}(QP_i)}{T_{HM}(QP_i)} \times 100\% \tag{5.28}$$
where $T_{HM}$ indicates the encoding time of HM 12.1 with $QP_i$, and $T_{FICS}$ and $T_{PS}$ denote the encoding times of FICS and PS. As shown in Table 5.4, the proposed fast intra coding achieves a 60.4% encoding time reduction with a 1.26% BD-rate increment on average. PS consumes approximately 19.2% of the encoding time of original HEVC; however, PS and FICS run in parallel in the hardware architecture. Moreover, because PS and FICS are designed to use similar PEs for the prediction calculation, highly efficient hardware utilization can be achieved.
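The time metrics in (5.22), (5.23) and (5.28) are plain averages over the four QP points and can be sketched as follows. The function names are hypothetical, and the timings used in the usage lines are made-up numbers chosen for easy checking, not measurements from this work.

```python
def time_saving(t_hm, t_fast):
    """Average encoding-time reduction in percent, as in (5.22) and (5.23).

    t_hm   -- HM encoding times, one per tested QP
    t_fast -- encoding times of the fast encoder at the same QPs
    """
    ratios = [(h - f) / h for h, f in zip(t_hm, t_fast)]
    return sum(ratios) / len(ratios) * 100.0


def preprocessing_time_consumption(t_hm, t_ps):
    """Average preprocessing time relative to HM in percent, as in (5.28)."""
    ratios = [p / h for h, p in zip(t_hm, t_ps)]
    return sum(ratios) / len(ratios) * 100.0


# Hypothetical timings in seconds for QP 22, 27, 32 and 37.
hm_times = [100.0, 80.0, 60.0, 40.0]
fast_times = [40.0, 32.0, 24.0, 16.0]
ps_times = [20.0, 16.0, 12.0, 8.0]
ts = time_saving(hm_times, fast_times)                     # about 60 percent
ptc = preprocessing_time_consumption(hm_times, ps_times)   # about 20 percent
```

Averaging the per-QP ratios, rather than dividing total times, gives every QP point the same weight even though low-QP encodes take longer.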
For further evaluation of the proposed fast intra coding, we compare our work
with several related works that achieved real-time intra coding implementations.
Table 5.5 shows the overall specifications of our proposal and two other works. Zhu
et al. propose to estimate the RD cost from image texture in [23]. Their work
is implemented for HDTV 1080p@44fps real-time encoding. The overall time
saving in [23] is 1.3 percentage points better than that of our proposal; however,
the BD-rate increases by 4.53%. Moreover, their implementation does not support
the 64×64 PU size. In [24], Huang et al. utilize a source-signal-based fast RMD
algorithm to parallelize the hardware implementation, which is able to encode
HDTV 1080p@60fps in real time. The BD-rate increase is 4.30%, which is better than
that of [23]; however, the encoding time saving is only about 41.6%. Both works
utilize information from the source signal to accelerate the intra coding process.
However, the computational complexity reduction
and BD-rate performance are not yet sufficient for HDTV applications. The biggest
advantage of our proposal is that the fast intra coding is based on downsampling
information and is designed to encode 4K (2160p) video in real time with a maximum
operating clock frequency below 290 MHz, which is demonstrated theoretically.
Table 5.2 Performance comparison between FICS and FICS without FTDD

| Class | Sequence | Bit depth | ∆BR_TU (%) | ∆PSNR_TU (dB) | ∆TS_TU (%) |
|---|---|---|---|---|---|
| Class 4K (3840×2160) | Bosphorus | 10bit | 0.52 | -0.0152 | 15.4 |
| | HoneyBee | 10bit | 0.11 | -0.0044 | 15.9 |
| | HoneyBee | 8bit | 0.15 | -0.0049 | 15.6 |
| | Jockey | 8bit | 0.34 | -0.0053 | 11.0 |
| | ReadySetGo | 8bit | 0.62 | -0.0216 | 17.1 |
| | ShakeNDry | 10bit | 0.17 | -0.0030 | 16.0 |
| | ShakeNDry | 8bit | 0.16 | -0.0029 | 15.1 |
| Class A (2560×1600) | Traffic | 8bit | 0.71 | -0.0385 | 20.5 |
| | PeopleOnStreet | 8bit | 0.60 | -0.0344 | 19.9 |
| | SteamLocomotiveTrain | 10bit | 0.21 | -0.0125 | 21.7 |
| | NebutaFestival | 10bit | 0.08 | -0.0055 | 22.4 |
| Class B (1920×1080) | Bosphorus | 8bit | 0.64 | -0.0281 | 13.3 |
| | HoneyBee | 8bit | 0.57 | -0.0227 | 15.1 |
| | Jockey | 8bit | 0.38 | -0.0139 | 12.3 |
| | ShakeNDry | 8bit | 0.10 | -0.0051 | 14.9 |
| | BQTerrace | 8bit | 0.66 | -0.0435 | 19.0 |
| | Kimono | 8bit | 0.31 | -0.0108 | 16.8 |
| | Cactus | 8bit | 0.80 | -0.0303 | 19.7 |
| | ParkScene | 8bit | 0.43 | -0.0186 | 18.4 |
| Total average | | | 0.40 | -0.0169 | 16.8 |
Table 5.3 Performance comparison between FICS and FICS without FPMD

| Class | Sequence | Bit depth | ∆BR_PM (%) | ∆PSNR_PM (dB) | ∆TS_PM (%) |
|---|---|---|---|---|---|
| Class 4K (3840×2160) | Bosphorus | 10bit | 0.05 | -0.0011 | 3.9 |
| | HoneyBee | 10bit | 0.03 | -0.0001 | 1.8 |
| | HoneyBee | 8bit | 0.03 | 0.0001 | 1.8 |
| | Jockey | 8bit | -0.20 | 0.0027 | 1.1 |
| | ReadySetGo | 8bit | -0.02 | -0.0006 | 2.1 |
| | ShakeNDry | 10bit | -0.04 | -0.0012 | 1.6 |
| | ShakeNDry | 8bit | 0.05 | -0.0013 | 1.0 |
| Class A (2560×1600) | Traffic | 8bit | -0.03 | 0.0012 | 2.6 |
| | PeopleOnStreet | 8bit | 0.01 | -0.0006 | 3.0 |
| | SteamLocomotiveTrain | 10bit | 0.00 | 0.0001 | 2.6 |
| | NebutaFestival | 10bit | 0.00 | 0.0000 | 3.0 |
| Class B (1920×1080) | Bosphorus | 8bit | -0.08 | 0.0034 | 0.6 |
| | HoneyBee | 8bit | 0.06 | -0.0020 | 1.1 |
| | Jockey | 8bit | -0.04 | 0.0011 | 0.9 |
| | ShakeNDry | 8bit | -0.09 | 0.0043 | 0.9 |
| | BQTerrace | 8bit | 0.03 | -0.0020 | 1.5 |
| | Kimono | 8bit | -0.11 | 0.0037 | 1.8 |
| | Cactus | 8bit | -0.01 | 0.0001 | 2.1 |
| | ParkScene | 8bit | -0.03 | 0.0012 | 1.7 |
| Total average | | | -0.03 | 0.0008 | 1.85 |
Table 5.4 Performance comparison of compression and encoding time for PS and FICS

| Class | Sequence | Bit depth | BR_FICS (%) | PSNR_FICS (dB) | PTC (%) | TS_FICS (%) |
|---|---|---|---|---|---|---|
| Class 4K (3840×2160) | Bosphorus | 10bit | 2.12 | -0.0611 | 18.2 | 74.2 |
| | HoneyBee | 10bit | 0.91 | -0.0198 | 19.2 | 62.9 |
| | HoneyBee | 8bit | 0.92 | -0.0200 | 19.2 | 63.1 |
| | Jockey | 8bit | 2.09 | -0.0309 | 18.6 | 72.4 |
| | ReadySetGo | 8bit | 1.67 | -0.0571 | 19.3 | 55.2 |
| | ShakeNDry | 10bit | 0.86 | -0.0170 | 18.8 | 68.5 |
| | ShakeNDry | 8bit | 0.82 | -0.0163 | 18.7 | 69.0 |
| Class A (2560×1600) | Traffic | 8bit | 1.37 | -0.0743 | 19.4 | 50.6 |
| | PeopleOnStreet | 8bit | 1.20 | -0.0683 | 19.6 | 48.6 |
| | SteamLocomotiveTrain | 10bit | 0.61 | -0.0372 | 19.5 | 53.6 |
| | NebutaFestival | 10bit | 0.29 | -0.0214 | 19.8 | 46.7 |
| Class B (1920×1080) | Bosphorus | 8bit | 1.98 | -0.0867 | 19.2 | 63.2 |
| | HoneyBee | 8bit | 1.66 | -0.0651 | 19.8 | 58.6 |
| | Jockey | 8bit | 1.49 | -0.0556 | 19.0 | 65.8 |
| | ShakeNDry | 8bit | 0.79 | -0.0400 | 19.2 | 65.4 |
| | BQTerrace | 8bit | 1.28 | -0.0833 | 19.7 | 54.9 |
| | Kimono | 8bit | 1.02 | -0.0362 | 19.2 | 66.2 |
| | Cactus | 8bit | 1.80 | -0.0681 | 19.8 | 52.8 |
| | ParkScene | 8bit | 1.03 | -0.0445 | 19.4 | 55.3 |
| Total average | | | 1.26 | -0.0475 | 19.2 | 60.4 |
Table 5.5 Comparison of coding architecture and performance

| Specification | Proposal | [23] | [24] |
|---|---|---|---|
| BD-BR (%) | 1.26 | 4.53 | 4.30 |
| BD-PSNR (dB) | -0.0475 | -0.2000 | NA |
| Time saving (%) | 60.4 | 61.7 | 41.6 |
| Frame rate (FPS) | 60 | 44 | 60 |
| Max resolution | 2160p | 1080p | 1080p |
| Max throughput (MegaPixels/s) | 498 | 91 | 124 |
| Max frequency (MHz) | 286.87 (estimated) | 357.00 | 294.00 |
| Block size | ALL | 4, 8, 16, 32 | ALL |
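The throughput figures in Table 5.5 are consistent with a simple luma pixel count, width × height × frame rate. The short Python sketch below reproduces them under that assumption. The text does not state how the throughput was computed, so the formula, the function name and the pixel dimensions assumed for "2160p" and "1080p" are illustrative only.

```python
def throughput_megapixels_per_s(width: int, height: int, fps: int) -> float:
    """Luma pixels processed per second, in units of 10^6 pixels (assumed definition)."""
    return width * height * fps / 1e6


# (width, height, frame rate) assumed for each column of Table 5.5.
designs = {
    "Proposal": (3840, 2160, 60),
    "[23]": (1920, 1080, 44),
    "[24]": (1920, 1080, 60),
}

for name, (w, h, fps) in designs.items():
    print(f"{name}: {throughput_megapixels_per_s(w, h, fps):.1f} MegaPixels/s")
```

Rounded to whole numbers, these values match the 498, 91 and 124 MegaPixels/s listed in the table.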
5.6 Summary
This chapter presents a downsampling-information-based fast intra coding scheme for
HEVC that reduces the computational complexity of the intra encoding process. The
proposed fast intra coding derives the downsampling information in the preprocessing
stage. The downsampling depth and the reference confidence are employed to quickly
determine the PU partitioning. The downsampling mode is utilized to avoid missing the
optimal prediction mode. Moreover, the recursive TU search in the RQT process is
optimized. The experimental results show that our proposal reduces the encoding time
by 60.4% on average and increases the BD-rate by about 1.26%, compared with HM 12.1.
In addition, the feasibility of a real-time 4K@60fps intra encoder implementation is
demonstrated theoretically. The theoretical maximum operating frequency is derived as
286.87 MHz. As future work, the integrated hardware architecture will be designed and
implemented, and a high-resolution-oriented intra encoder will be developed.
Chapter 6 Conclusions
The dissertation aims to reduce the computational complexity of video encoders
by half while maintaining the compression performance in terms of both video
quality and encoded bits, so that video data can be compressed efficiently within
half the encoding time or at lower power consumption.
Firstly, a fast algorithm for the intra prediction of HEVC is proposed, which
efficiently decides the level on the basis of the Roberts cross edge detector. The
proposed algorithm utilizes the high correlation between regional texture and
prediction unit partitioning. It is mainly composed of a bottom-up level decision
method and an efficient decision flow based on an authentic image feature. Furthermore,
chrominance information is also employed to decide the prediction unit partitioning.
The experimental results demonstrate that the proposed algorithm greatly reduces
the computational complexity compared with the original HEVC. The average time
saving is approximately 37%, while the increase in bit rate and the decrease in PSNR
are negligible.
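For readers unfamiliar with the operator, the following Python sketch shows the standard Roberts cross gradient applied to a small block of luma samples. It only illustrates the edge detector itself. The level decision thresholds and the bottom-up decision flow of the proposed algorithm are not reproduced here, and the function names and the sum-of-absolute-values magnitude are assumptions made for illustration.

```python
from typing import List


def roberts_cross_magnitude(block: List[List[int]]) -> List[List[int]]:
    """Approximate gradient magnitude |Gx| + |Gy| using the 2x2 Roberts cross kernels."""
    height, width = len(block), len(block[0])
    out = [[0] * (width - 1) for _ in range(height - 1)]
    for y in range(height - 1):
        for x in range(width - 1):
            gx = block[y][x] - block[y + 1][x + 1]      # main-diagonal difference
            gy = block[y][x + 1] - block[y + 1][x]      # anti-diagonal difference
            out[y][x] = abs(gx) + abs(gy)
    return out


def mean_edge_strength(block: List[List[int]]) -> float:
    """Average Roberts cross response over the block (hypothetical texture measure)."""
    mag = roberts_cross_magnitude(block)
    return sum(sum(row) for row in mag) / (len(mag) * len(mag[0]))


flat_block = [[128] * 4 for _ in range(4)]          # homogeneous region
edge_block = [[0, 0, 255, 255] for _ in range(4)]   # vertical edge in the middle

print(mean_edge_strength(flat_block))
print(mean_edge_strength(edge_block))
```

A homogeneous block yields zero response while a block containing an edge yields a large one, which is the kind of regional texture cue that can be correlated with prediction unit partitioning.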
Secondly, an efficient parallel scheme is proposed for the intra coding of HEVC.
A downsampled signal is applied to generate prediction samples instead of neighboring
pixels. It reduces spatial redundancy and removes the data dependency in intra
encoding within the coding tree unit (CTU) structure. Meanwhile, a fast training
method is designed to derive the downsampled signal adaptively. Experimental results
show that the proposed fast parallelized scheme achieves a 4.17% bit saving on average
while reducing the computational complexity by 27.26%.
Thirdly, a downsampling-information-based intra coding scheme is proposed, which
consists of two parts: a preprocessing stage and a fast intra coding stage.
Three downsampling-information-based fast decision algorithms are proposed for the
fast intra coding stage. Moreover, a parallelized architecture for the fast intra coding
scheme is proposed. The downsampling preprocessing stage can be executed in parallel
with the intra coding stage. The proposed architecture makes full use of this
feature to improve throughput and break up data dependencies. Experimental results
demonstrate that the proposed algorithms achieve an average 60.4% reduction in
encoding time with negligible coding efficiency loss, compared with the original HEVC.
Acknowledgement
Time has passed quickly, and I am now completing three years of study and life at
Tokushima University in pursuit of my Ph.D. degree. Finally, it is time to graduate and
to acknowledge those who have helped me in my study and life.
First of all, I would like to express my gratitude to my advisors, Associate
Professor Song Tian and Professor Shimamoto Takashi, for giving me continuous
inspiration, support and criticism throughout my work. They always provided me
with valuable insight and made sure that I did not lose my way in my research.
Without their support, it would have been impossible for me to finish this work.
I have had the pleasure of studying in the course of electrical and electronic
systems at Tokushima University. I would like to extend my appreciation to Professor
Hashizume Masaki, Professor Nishio Yoshifumi, Associate Professor Yotsuyanagi
Hiroyuki, Assistant Professor Uwate Yoko, and the other technical staff in our
course for their support and friendship. I have had the pleasure of collaborating
with numerous exceptionally talented graduate students over the last few
years. I would like to thank all my colleagues in our lab.
Finally, I would like to express my deepest love and gratitude to my parents,
Jingyan Shi and Hong Liu, for their unconditional support and understanding over
the years.
Department of Electrical and Electronic Engineering,
College of Systems Innovation Engineering,
Graduate School of Advanced Technology and Science,
The University of Tokushima, Japan.
Wen Shi
Bibliography
[1] B. Bross, W.J. Han, G. J. Sullivan, J.R. Ohm and T. Wiegand: Working draft 11
of high efficiency video coding, JCTVC-L1003, Geneva, Switzerland, 2013.
[2] G.J. Sullivan, J.R. Ohm, W.J. Han, and T. Wiegand: Overview of the high
efficiency video coding (HEVC) standard, IEEE Trans. Circuits Syst. Video
Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[3] T.Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra: Overview of the
H.264/AVC video coding standard, IEEE Trans. Circuits Syst. Video Technol.,
vol. 13, no. 7, pp. 560-576. 2003.
[4] G. Chen, Z. Pei, L. Sun and Z. Liu: Fast intra prediction for HEVC based
on pixel gradient statistics and mode refinement, IEEE Signal and Information
Processing (ChinaSIP), pp. 514-517, 2013.
[5] T. L. da Silva, L. V. Agostini and L. A. da S. Cruz: Fast HEVC intra prediction
mode decision based on EDGE direction information, IEEE Signal Processing
Conference (EUSIPCO), pp. 1214-1218, 2012.
[6] JCT-VC: High efficiency video coding (HEVC) test model 8 (HM 8) encoder
description, JCTVC-J1002, 2012.
[7] S. Na, W. Lee and K. Yoo: Edge-based fast mode decision algorithm for intra
prediction in HEVC, Consumer Electronics (ICCE), 2014 IEEE International
Conference, pp. 11-14, 2014.
[8] W. Jiang, H. Ma and Y. Chen: Gradient based fast mode decision algorithm
for intra prediction in HEVC, 2nd International Conference on IEEE Consumer
Electronics, pp. 1836-1840, 2012.
[9] Y. Zhang, Z. Li and B. Li: Gradient-based fast decision for intra prediction in
HEVC, IEEE Visual Communications and Image Processing (VCIP), pp. 1-6,
2012.
[10] W. Ma and B. S. Manjunath: A texture thesaurus for browsing large aerial pho-
tographs, Artificial Intelligence Techniques for Emerging Information Systems
Applications, Vol. 49, pp. 633-648, 1998.
[11] https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/
[12] F. Bossen: Common test conditions and software reference configurations,
JCTVC-K1100, JCTVC of ISO/IEC and ITU-T, Shanghai, China, 2012.
[13] ITU: Recommendation ITU-R BT.2020-2 - Parameter Values for Ultrahigh Def-
inition Television Systems for Production and International Program Exchange,
Document ITU-R WP6C Contribution 399, Geneva, Oct. 2015.
[14] High Efficiency Video Coding (HEVC) Text Specification Draft 10 (for FDIS
and Consent), JCTVC-L1003, ITU-T/ISO/IEC Joint Collaborative Team on
Video Coding (JCT-VC), Jan. 2013.
[15] High efficiency coding and media delivery in heterogeneous environments Part
2: High efficiency video coding, ISO/IEC 23008-2. May, 2015.
[16] J. Lainema et al.: Intra Coding of the HEVC Standard, IEEE Trans. Circuits
Syst. Video Technol., Vol. 22, No. 12, Dec. 2012, pp.1792-1801.
[17] S. Li et al.: Improving Lossless Intra Coding of H.264/AVC by Pixel-Wise
Spatial Interleave Prediction, IEEE Trans. Circuits Syst. Video Technol., Vol.
21, No. 12, Dec. 2011, pp.1924-1928.
[18] S. Kanimozhi et al.: Efficient Lossless Compression Using H.264/AVC Intra
Coding and PWSIP Prediction, Int. Congr. Information Syst. and Computing
(ICISC), Vol. 3, Jan. 2013, pp.406-410.
[19] VA. Nguyen et al.: Adaptive Downsampling / Upsampling for Better Video
Compression at Low Bit Rate, IEEE Int. Symp. on Circuits and Syst. (ISCAS),
May 2008, pp. 1624-1627.
[20] J. Li et al.: Efficient Multiple Line-Based Intra Prediction for HEVC, IEEE
Trans. Circuits Syst. Video Technol., Vol. PP, No. 99, Nov. 2016, pp.1-1.
[21] C. Chen et al.: A New Block-Based Method for HEVC Intra Coding, IEEE
Trans. Circuits Syst. Video Technol., vol. 27, no. 8, April 2016, pp. 1727-1736.
[22] Y. Li et al.: Convolutional Neural Network-Based Block Up-sampling for Intra
Frame Coding, IEEE Trans. Circuits Syst. Video Technol., vol. PP, no. 99, July
2017, pp. 1-13.
[23] S. Park et al.: Report on the evaluation of HM versus JM, Document JCTVC-
D181, Daegu, Korea, Jan. 2011.
[24] V. Sze et al.: High Efficiency Video Coding (HEVC): Algorithms and Archi-
tectures, Switzerland: Springer International Publishing, 2014, pp. 220-221.
[25] HEVC test model 16.11, Accessed, 2016.
https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.11/
[26] F. Bossen: Common Test Conditions and Software Reference Configurations,
Document JCTVC-F900, Torino, IT, July 2011.
[27] G. Bjontegaard: Calculation of Average PSNR Differences between RD-Curves,
Document VCEG-M33, Austin, TX, USA, Apr. 2001.
[28] Ultra Video Group, 4K Test Sequences, Accessed, 2015.
http://ultravideo.cs.tut.fi/testsequences
[29] Tarr, G.: 6M 4K Ultra HDTVs Shipped In North America In 2015. [Online].
Available:
http://hdguru.com/ihs-6m-4k-ultra-hdtvs-shipped-in-north-america-in-2015/
[30] Bross, B., Ohm J., Sullivan G.J., Han W.J., Wiegand T.: High efficiency video
coding text specification draft 10. JCTVC-L1003 (2013)
[31] ITU-T Study Group 16: High efficiency video coding. Final draft approval
(2013)
[32] ISO/IEC 23008-2: High efficiency coding and media delivery in heterogeneous
environments Part 2: High efficiency video coding. Final standard approval
(2013)
[33] Ohm, J.R., Sullivan, G.J., Schwarz, H., Tan, T.K., Wiegand, T.: Compari-
son of the coding efficiency of video coding standards including high efficiency
video coding (HEVC). IEEE Transactions on Circuits and Systems for Video
Technology, 22(12), 1668-1683 (2012)
[34] 3GPP organizational partners: Evaluation of High Efficiency Video Coding
(HEVC) for 3GPP services. 3GPP TR 26.906 (2014)
[35] Sze, V., Budagavi, M., Sullivan, G.J.: High Efficiency Video Coding (HEVC):
Algorithms and Architectures. Springer International Publishing, Switzerland,
pp 220-221 (2014)
[36] Song, T., Song, H., Fujita, G., Onoye, T., Shirakawa, I.: A Codec of H.263
Advanced Intra Coding mode and it’s Architecture. IEICE technical report.
100(384), 45-50 (2000)
[37] Correa, G., Assuncao, P., Agostini, L., Cruz, L.A.S.: Performance and
computational complexity assessment of high-efficiency video encoders. IEEE
Transactions on Circuits and Systems for Video Technology, 22(12), 1899-1909 (2012)
[38] Lei, H., Yang, Z.: Fast intra prediction mode decision for high efficiency video
coding. In: 2nd International Symposium on Computer, Communication, Con-
trol and Automation (2013)
[39] Song, Y., Zeng, Y., Li, X., Cai, B., Yang, G.: Fast CU size decision and
mode decision algorithm for intra prediction in HEVC. Multimedia Tools and
Applications. 1-17 (2016)
[40] Ramezanpour, M., Zargari, F.: Fast HEVC I-frame coding based on strength
of dominant direction of CUs. Journal of Real-Time Image Processing, Special
issue paper, 1-10 (2016)
[41] Shen, L., Zhang, Z., An, P.: Fast CU size decision and mode decision algorithm
for HEVC intra coding. IEEE Transactions on Consumer Electronics, 59(1),
207-213 (2013)
[42] Zhao, L., Fan, X., Ma, S., Zhao, D.: Fast intra-encoding algorithm for High
Efficiency Video Coding. Signal Processing: Image Communication, 29(9), 935-
944 (2014)
[43] Huang, X., Jia, H., Wei, K., Liu, J., Zhu, C., Lv, Z., Xie, D.: Fast algorithm of
coding unit depth decision for HEVC intra coding. In: IEEE Visual Communications
and Image Processing Conference, 458-461 (2014)
[44] Shen, L, Zhang, Z., Liu, Z.: Effective CU size decision for HEVC intracoding.
IEEE Transactions on Image Processing, 23(10), 4232-4241 (2014)
[45] Shang, X., Wang, G., Fan, T., Li, Y., Zuo, Y.: Low-complexity intra-coding
scheme for HEVC, Circuits, Systems, and Signal Processing, 1-19 (2016)
[46] Zhu, J., Liu, Z., Wang, D., Han, Q.: HDTV1080p HEVC Intra encoder with
source texture based CU/PUmode pre-decision. In: 19th Asia and South Pacific
Design Automation Conference (ASP-DAC), 367-372 (2014)
[47] Huang, X., Jia, H., Cai, B., Zhu, C., Liu, J., Yang, M., Xie, D., Gao, W.:
Fast algorithms and VLSI architecture design for HEVC intra-mode decision.
Journal of Real-Time Image Processing, Special issue paper, 1-18 (2015)
[48] Pastuszak, G., Abramowski, A.: Algorithm and architecture design of the
H.265/HEVC intra encoder. IEEE Transactions on Circuits and Systems for
Video Technology, 26(1), 210-222 (2016)
[49] Ju, C.C., Liu, T.M., Lee, K.B., Chang, Y.C.: 18.6 A 0.5nJ/pixel 4K
H.265/HEVC codec LSI for multi-format smartphone applications. In: IEEE
International Solid-State Circuits Conference, 1-3 (2015)
[50] Ju, C.C., Liu, T.M., Lee, K.B., Chang, Y.C.: 18.6 A 0.5nJ/pixel 4K
H.265/HEVC codec LSI for multi-format smartphone applications. IEEE Jour-
nal of Solid-State Circuits, 51(1), 56-67 (2016)
[51] Ozcan, E., Kalali, E., Adibelli, Y., Hamzaoglu, I.: A computation and en-
ergy reduction technique for HEVC intra mode decision. IEEE Transactions on
Consumer Electronics, 60(4), 745-753 (2014)
[52] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assess-
ment: From error visibility to structural similarity. IEEE Transactions on Image
Processing, 13(4), 600-612 (2004)
[53] Zhao, L., Zhang, L., Ma, S., Zhao, D.: Fast mode decision algorithm for intra
prediction in HEVC. In: IEEE Visual Communications and Image Processing
Conference, 1-4 (2011)
[54] HEVC test model 12.1 [Online]. Available:
https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-12.1/
[55] Bjontegaard, G.: Improvements of the bd-psnr model. ITU-T Q.6/SG16 Doc-
ument, VCEG-AI11 (2008)
[56] Bossen, F.: Common test condition and software reference configurations.
JCTVC-L1100 (2013)
Appendix
A Publication Paper
1. Wen Shi, Xiantao Jiang, Tian Song and Takashi Shimamoto : Edge
Detector Based Fast Level Decision Algorithm for Intra Prediction of
HEVC, Journal of Signal Processing, Vol.19, No.2, pp.67–73, 2015.
2. Xiantao Jiang, Tian Song, Takashi Shimamoto, Wen Shi and Lisheng
Wang : Spatio-Temporal Prediction Based Algorithm for Parallel Im-
provement of HEVC, IEICE Transactions on Fundamentals of Electron-
ics, Communications and Computer Sciences, Vol.E98-A, No.11, pp.2229–
2237, 2015.
3. Xiantao Jiang, Tian Song, Takashi Shimamoto, Wen Shi and Lisheng
Wang : High Efficiency CU Depth Prediction Algorithm for High Res-
olution Applications of HEVC, IEICE Transactions on Fundamentals of
Electronics, Communications and Computer Sciences, Vol.E98-A, No.12,
pp.2528–2536, 2015.
4. Wen Shi, Tian Song, Takafumi Katayama, Xiantao Jiang and Takashi
Shimamoto : Hardware Implementation-Oriented Fast Intra-Coding Based
on Downsampling Information for HEVC, Journal of Real-Time Image
Processing, pp.1–15, 2017.
5. Xiantao Jiang, Xiaofeng Wang, Tian Song, Wen Shi and Takafumi
Katayama : An Efficient Complexity Reduction Algorithm for CU Size
Decision in HEVC, International Journal of Innovative Computing, In-
formation and Control , Vol.10, No.1, pp.1–10, 2017.
6. Takafumi Katayama, Tian Song, Wen Shi, Gen Fujita, Xiantao Jiang,
Takashi Shimamoto : Hardware Oriented Low-Complexity Intra Coding
Algorithm for SHVC, IEICE Transactions on Fundamentals of Electron-
ics, Communications and Computer Sciences, Vol.E100-A, No.12, pp.-,
2017.
7. Takafumi Katayama, Tian Song, Wen Shi, Xiantao Jiang, Takashi
Shimamoto : Fast Edge Detection and Early Depth Decision for Intra
Coding of 3D-HEVC, International Journal of Advances in Computer
and Electronics Engineering, Vol.2, No.7, pp.11–20, 2017.
B International Conference
1. Wen Shi, Xiantao Jiang, Tian Song and Takashi Shimamoto : Edge
Detector Based Fast Level Decision Algorithm for Intra Prediction of
HEVC, Proceedings of International Workshop on Nonlinear Circuits,
Communications and Signal Processing (NCSP’14), pp.129–132, Hon-
olulu, Mar. 2014.
2. Yutaro Tanida, Wen Shi, Tian Song and Takashi Shimamoto : Com-
plexity Reduction Algorithm for Intra Prediction of HEVC, The 29th
International Technical Conference on Circuits/Systems, Computers and
Communications (ITC-CSCC2014), pp.221–224, Phuket, THAILAND,
Jul. 2014.
3. Xiantao Jiang, Tian Song, Takashi Shimamoto, Wen Shi and Lisheng
Wang : Temporal Prediction Improvement for Parallel Processing of
HEVC, Proceedings of IEEE Asia Pacific Conference on Circuits Sys-
tems(APCCAS2014), pp.515–518, Okinawa, Nov. 2014.
4. Wen Shi, Xiantao Jiang, Tian Song and Takashi Shimamoto : Edge In-
formation Based Fast Selection Algorithm for Intra Prediction of HEVC,
Proceedings of IEEE Asia Pacific Conference on Circuits Systems (APC-
CAS2014), pp.17–20, Okinawa, Nov. 2014.
5. Wen Shi, Xiantao Jiang, Tian Song, Jenq-Shiou Leu and Takashi
Shimamoto : Efficient Intra Coding for HEVC Based on Spatial Lo-
cality, Proceedings of International Forum on Advanced Technologies
(IFAT2015), pp.168–170, Tokushima, Mar. 2015.
6. Xiantao Jiang, Tian Song, Wen Shi, Lisheng Wang and Takashi Shi-
mamoto : Merge Prediction Algorithm for Adaptive Parallel Improve-
ment of High Efficiency Video Coding, Proceedings of IEEE International
Conference on Consumer Electronics(ICCE-Taiwan 2015), pp.310–311,
Taipei, Jun. 2015.
7. Wen Shi, Xiantao Jiang, Tian Song and Takashi Shimamoto : Spatial
Locality Based Supplemental Modes for Intra Prediction of HEVC, Pro-
ceedings of IEEE International Conference on Consumer Electronics(ICCE-
Taiwan 2015), pp.298–299, Taipei, Jun. 2015.
8. Takuya Hamada, Yutaro Tanida, Wen Shi, Tian Song, Jenq-Shiou Leu
and Takashi Shimamoto : Original Pixel Based Parallel Algorithm for
Intra Prediction of HEVC, The 30th International Technical Conference
on Circuits/Systems, Computers and Communications (ITC-CSCC2015),
pp.400–401, Seoul, Jun. 2015.
9. Wen Shi, Xiantao Jiang, Tian Song, Jenq-Shiou Leu and Takashi Shi-
mamoto : Spatial Locality Based Parallel Scheme for Intra Coding of
HEVC, Tenth International Conference on Innovative Computing, Infor-
mation and Control (ICICIC2015), p.204, Dalian, Aug. 2015.
10. Wen Shi, Xiantao Jiang, Tian Song and Takashi Shimamoto : Seg-
mental Downsampling Intra Coding Based on Spatial Locality for HEVC,
Proceedings of IEEE International Conference on Consumer Electronics
Berlin(ICCE-Berlin 2015), pp.12–16, Berlin, Sep. 2015.
11. Wen Shi, Tian Song and Takashi Shimamoto : High Efficiency Intra
Coding Extension for HEVC/H.265, IEEE CASS Shikoku and Malaysia
Chapters Joint Seminar, Tokushima, Oct. 2015.
12. Xiantao Jiang, Tian Song, Wen Shi, Lisheng Wang and Takashi Shi-
mamoto : High efficiency CU depth decision algorithm for high resolution
application of HEVC, Proceedings of IEEE International Technical Con-
ference TENCON 2015, pp.1–4, Macau, China, Nov. 2015.
13. Takafumi Katayama, Wen Shi, Tian Song and Takashi Shimamoto
: Low-Complexity Intra Coding Algorithm in Enhancement Layer for
SHVC, Proceedings of 2016 IEEE International Conference on Consumer
Electronics (ICCE), pp.457–460, Las Vegas, Jan. 2016.
14. Wen Shi, Xiantao Jiang, Tian Song, Jenq-Shiou Leu and Takashi
Shimamoto : Downsampled Information Based Low Complexity Intra
Coding for HEVC, Proceedings of 2nd International Forum on Advanced
Technologies (IFAT2016), No.P1-17, pp.1–3, Tokushima, Mar. 2016.
15. Takafumi Katayama, Wen Shi, Tian Song, Jenq-Shiou Leu and Takashi
Shimamoto : Fast CU Size Decision for Intra Coding Algorithm in SHVC,
Proceedings of 2nd International Forum on Advanced Technologies (IFAT
2016), No.P1-18, pp.1–3, Tokushima, Mar. 2016.
16. Takafumi Katayama, Tian Song, Wen Shi, Takashi Shimamoto and
Jenq-Shiou Leu : Reference Frame Selection Algorithm of HEVC En-
coder for Low Power Video Device, Proceedings of 2nd International
Conference on Intelligent Green Building and Smart Grid (IGBSG 2016),
pp.34–39, Praha, Jun. 2016.
17. Yoshiki Ito, Wen Shi, Tian Song and Takashi Shimamoto : An Adap-
tive Search Range Selection Algorithm for HEVC, Proceedings of In-
ternational Technical Conference on Circuits/Systems, Computers and
Communications(CSCC2016), pp.211–214, Okinawa, Jul. 2016.
18. Ryo Kuroda, Wen Shi, Tian Song and Takashi Shimamoto : Hard-
ware Oriented Early CU Splitting Algorithm by Coding Unit Feature
Analysis for HEVC, Proceedings of International Technical Conference on
Circuits/Systems, Computers and Communications (CSCC2016), pp.217–
220, Okinawa, Jul. 2016.
19. Takafumi Katayama, Wen Shi, Tian Song and Takashi Shimamoto
: Early Depth Determination Algorithm for Enhancement Layer Intra
Coding of SHVC, Proceedings of IEEE International Technical Confer-
ence TENCON 2016, 3083-3086, Singapore, Nov. 2016.
20. Shota Yusa, Takafumi Katayama, Wen Shi, Tian Song and Takashi
Shimamoto : Fast CU Depth Decision Algorithm Using Depth-Map for
3D-HEVC, Proceedings of International Technical Conference on Cir-
cuits/Systems, Computers and Communications(ITC-CSCC2017), 473-
474, Busan, Jul. 2017.
21. Koki Tamura, Takafumi Katayama, Wen Shi, Tian Song and Takashi
Shimamoto : Coding Efficiency Improvement Algorithm for Inter-Layer
Reference Prediction in SHVC, Proceedings of International Technical
Conference on Circuits/Systems, Computers and Communications (ITC-
CSCC2017), 479-480, Busan, Jul. 2017.