UNIVERSITÀ DEGLI STUDI DI PADOVA
Sede Amministrativa: Università degli Studi di Padova
Dipartimento di Ingegneria dell’Informazione
DOTTORATO DI RICERCA IN:
INGEGNERIA ELETTRONICA E DELLE TELECOMUNICAZIONI
CICLO XIX
Source and Joint Source-Channel
Coding for Video Transmission
over Lossy Networks
Coordinatore: Ch.mo Prof. Silvano Pupolin
Supervisore: Ch.mo Prof. Gian Antonio Mian
Dottorando: Simone Milani
31 Dicembre 2006
UNIVERSITÀ DI PADOVA FACOLTÀ DI INGEGNERIA
Source and Joint Source-Channel
Coding for Video Transmission
over Lossy Networks
Ph.D. THESIS
Author: Simone Milani
Coordinator: Ch.mo Prof. Silvano Pupolin
Supervisor: Ch.mo Prof. Gian Antonio Mian
2006
CORSO DI DOTTORATO IN INGEGNERIA ELETTRONICA E DELLE TELECOMUNICAZIONI – XIX CICLO
Abstract
In recent years, the IT world has shown increasing interest in the transmission of video
sequences over a heterogeneous set of networks for a wide variety of applications. This
interest has driven the development of ever more efficient video coding standards characterized
by increasing compression efficiency. However, the recent emergence of media-rich video
applications over wireless channels has widened the set of requirements that a video coding
scheme must satisfy. The current tendency is to provide multimedia services to each terminal
without constraining its mobility or autonomy, while ensuring a certain Quality of Service
(QoS). Therefore, some of the most important requirements are low power consumption and low
complexity at the mobile/sensor video encoding unit, high compression efficiency due to the
limited available bandwidth, and robustness to the packet/frame drops caused by wireless
channel impairments. Many solutions have recently been proposed to address these problems,
which now constitute the main obstacle to providing digital video services on mobile terminals.
The research presented in this thesis concerns the analysis and the design of coding
techniques that allow video architectures to meet the three requirements mentioned above.
Nowadays, video coding architectures are a synthesis of different coding tools that have been
defined over the last 50 years. High compression gains can be obtained whenever all these
coding units are appropriately orchestrated so as to either maximize the visual quality of
the reconstructed sequence at the decoder for a given bit rate or, reciprocally, minimize the
coded bit stream for a given visual quality. The H.264/AVC coder proves quite successful in
this task, since its coding performance outperforms all previous video coding standards.
Therefore, it has been chosen as the starting point for the investigation presented in this work.
At first, the optimization of the coding gain is the focus of the investigation. High
compression ratios can be achieved through the adoption of efficient entropy coding schemes
and the optimization of internal coding parameters. This thesis investigates an enhanced
arithmetic coder that improves the compression gain of the original scheme defined by
H.264/AVC by using a different probability estimate. At the macroblock level, coefficient
statistics can be estimated more accurately, enhancing the performance of the binary
arithmetic coder.
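As a toy illustration of why a more accurate probability estimate improves compression (a hedged sketch, not the estimator actually designed in this thesis): an ideal binary arithmetic coder spends about -log2 p(b) bits per coded symbol, so any context model that tracks the true symbol statistics more closely reduces the total code length. All names below are invented for the example.

```python
import math

class ContextModel:
    """Simple count-based estimate of P(bit = 1), one instance per context."""
    def __init__(self):
        self.ones = 1   # Laplace smoothing: start from a uniform estimate
        self.total = 2

    def p_one(self):
        return self.ones / self.total

    def update(self, bit):
        self.ones += bit
        self.total += 1

def ideal_code_length(bits, ctx):
    """Bits an ideal arithmetic coder would spend with this adaptive estimator."""
    length = 0.0
    for b in bits:
        p = ctx.p_one() if b else 1.0 - ctx.p_one()
        length += -math.log2(p)   # ideal arithmetic-coding cost of this symbol
        ctx.update(b)             # adapt the context statistics after coding
    return length

# A skewed binary source: as the estimate adapts, the average cost per bit
# approaches the source entropy (about 0.47 bits/bit for a 90/10 split).
data = [1] * 90 + [0] * 10
bits_per_symbol = ideal_code_length(data, ContextModel()) / len(data)
print(round(bits_per_symbol, 3))
```

A fixed p = 0.5 model would spend exactly 1 bit per symbol on the same data, so the adaptive estimate alone recovers most of the available gain.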
An efficient statistical model proves to be effective at the frame level as well. In the
following, we present a rate model that allows a good estimate of the coefficient distribution.
This model has been used in a rate control algorithm, providing higher visual quality and a
tighter control of the coded bit rate.
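The frame-level idea can be sketched as follows (an illustrative toy, not the thesis algorithm: the linear rho-domain approximation, where the produced rate is roughly proportional to the fraction of non-zero quantized coefficients, with a slope theta estimated from previously coded frames; all names and numbers here are invented).

```python
import random

def rho(coeffs, qstep):
    """Fraction of transform coefficients quantized to zero at step qstep."""
    return sum(abs(c) < qstep for c in coeffs) / len(coeffs)

def predict_rate(coeffs, qstep, theta):
    """Linear rho-domain model: predicted bits ~= theta * (1 - rho)."""
    return theta * (1.0 - rho(coeffs, qstep))

def choose_qstep(coeffs, target_bits, theta, candidates):
    """Smallest quantization step whose predicted rate fits the bit budget."""
    for q in sorted(candidates):
        if predict_rate(coeffs, q, theta) <= target_bits:
            return q
    return max(candidates)

random.seed(0)
# Laplacian-like residual coefficients (scale b = 4), typical of DCT residuals
coeffs = [random.expovariate(1 / 4.0) * random.choice((-1, 1))
          for _ in range(10_000)]
q = choose_qstep(coeffs, target_bits=3000, theta=6000.0,
                 candidates=[1, 2, 4, 8, 16])
```

Because the model is monotone in the quantization step, the controller can pick the finest step that still respects the budget without a full trial encode.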
The thesis then focuses on enabling reliable transmission of video content to the end user.
Many different approaches have been studied in the literature, and we focused on two solutions
that have recently gained popularity. The first solution is based on the inclusion of redundant
information in the packet stream. Our research focuses on matrix-based cross-packet coding
techniques, and we introduce some optimization techniques that permit both controlling the
bit rate and maximizing the performance of the FEC channel coder. On the other hand, a DSC
coding architecture has been considered, and we focused our efforts on obtaining a compression
gain that makes its performance comparable with that of its hybrid, non-robust counterpart.
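The cross-packet protection idea can be illustrated with a minimal sketch of RFC 2733-style XOR parity: source packets form a group (one dimension of the coding matrix) and a single parity packet protects it, so any one loss per group is recoverable. Packet contents and helper names are invented for the example, not taken from the thesis implementation.

```python
def xor_bytes(packets):
    """Bitwise XOR of equal-length byte strings."""
    out = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)

def make_parity(packets):
    """One parity packet protecting a group; shorter packets are zero-padded."""
    n = max(len(p) for p in packets)
    return xor_bytes([p.ljust(n, b"\x00") for p in packets])

def recover(received, parity):
    """Rebuild the single missing packet of a group (None marks the loss)."""
    present = [p.ljust(len(parity), b"\x00") for p in received if p is not None]
    return xor_bytes(present + [parity])

group = [b"NAL-A", b"NAL-BB", b"NAL-C"]      # three source packets
parity = make_parity(group)                  # XOR parity over the group
lost = [group[0], None, group[2]]            # packet 1 dropped by the channel
restored = recover(lost, parity)             # XOR of survivors and parity
```

Arranging the packets in a matrix and sending parity along both rows and columns, as in the matrix-based schemes studied here, trades extra redundancy for tolerance of burstier loss patterns.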
The analysis and the implementation of all these techniques were carried out taking into
account the required computational complexity, preferring techniques that require a limited
number of operations in order to meet all three of the demands presented above.
Sommario
Over the last decade, there has been growing interest in the transmission of multimedia
content over heterogeneous networks for different applications. This has driven the
development of video coding standards with an ever-increasing capability of reducing signal
redundancy. However, the introduction of video communication applications over wireless
networks has widened the set of requirements that coding schemes must satisfy. The goal is
to provide the user with a wide range of multimedia services without limiting mobility while
guaranteeing a certain quality (Quality of Service, QoS). Three fundamental requirements that
modern coding systems must satisfy are: a high coding gain, the ability to create a bit stream
that is robust against errors and losses, and the possibility of implementing these techniques
on devices with limited computational resources and battery life. Different solutions to these
problems have been proposed in the literature; solving them is a fundamental step toward the
diffusion of video services on mobile terminals.

The work presented in this thesis concerns the analysis and design of coding techniques
that satisfy these three requirements. Today's architectures synthesize numerous coding
techniques studied over the last 50 years. Promising coding gains can be obtained when the
different coding units are optimized to maximize the perceptual quality of the reconstructed
sequence at a given bit rate or, conversely, to minimize the bit rate for a given visual
quality. In this respect, the H.264/AVC coder achieves excellent performance, considerably
improving on the compression ratios of previous coding standards. It has therefore been taken
as the starting point of the research presented here.

The first problem addressed is coding efficiency. The coding gain can be improved through
efficient entropy coding schemes and algorithms that optimize the internal parameters of the
coder. This thesis first analyzes an improvement of the arithmetic coder defined in the
H.264/AVC standard. The probability estimator for the binary symbols has been modified to
improve the performance of the arithmetic coder by adopting a more accurate probability
model. At the same time, a second model has been used at the frame level to model the bit
rate produced by the coder itself. By analyzing the number of bits produced as a function of
the percentage of zeros and the energy of the quantized signal, it is possible to design a
rate control algorithm that guarantees a higher quality of the reconstructed sequence and a
more precise control of the produced bit rate.

The thesis then focuses on the need for a reliable transmission of the video content.
Different approaches have been studied in the literature, and the work presented here focuses
on two of them. The first solution is based on the transmission of redundancy packets along
with the RTP packets produced by the source coder, in order to make the video stream more
robust. An efficient solution is the adoption of a cross-packet FEC code based on inserting
the source packets into a matrix structure that optimizes their sizes and interleaves the
information. The thesis presents some techniques for optimizing the matrix size that can be
used in an algorithm for jointly controlling the rate produced by both the source coder and
the channel coder.

The second approach studied is based on the principles of Distributed Source Coding (DSC).
The research has mainly focused on the design of efficient entropy coding techniques in order
to obtain a good coding gain with respect to traditional coders.

In the analysis and design of all these techniques, the computational complexity required
by each type of application has been taken into account, choosing those solutions that
require a limited amount of computation.
“The single biggest problem in communication
is the illusion that it has taken place.”
George Bernard Shaw
“ASBOKQTJEL”
Postcard from J. E. Littlewood to A. S. Besicovich
announcing A. S. B.’s election to fellowship at Trinity.
Acknowledgments
This thesis is the result of three years of work, during which I have been accompanied and
supported by many people.
The first person that should be named is my supervisor and master Gian Antonio Mian. It
is very difficult to express how much I have learned from him and how beneficial he was for
my work, since he always gave me precious advice and was always willing to discuss my
research.
Special thanks must go to my colleagues of the Digital Signal and Image Processing
Laboratory of the University of Padova, who made my work environment stimulating and
friendly during the last three years. Among them, Prof. Giancarlo Calvagno must be thanked
gratefully for the support he gave me in the last period of my Ph.D. course. I also thank the
past and current Ph.D. students (in random order Andrea, Daniele, Ottavio, Lorenzo, Stefano,
Mino, Matteo), who have shared with me the life of the laboratory and have been precious
mates for discussion and analysis. I also had the pleasure to supervise and work with
several students who did their graduation work in our projects (Andrea, Nicola, Joe, Raffaele,
Stefano, Simone) and were beneficial for my investigation.
I also include the other Ph.D. students of the department of via Gradenigo: Matteo, Massimo,
Giamba, Vale, Federico, Filippo, Antonio, Daniele, Tommaso, Pietro, Nicola, Anna,
Elena, Elena A., and the others that I am sure I forgot.
I must remember all the STMicroelectronics people who followed and supported my research
work (in random order Luca Celetto, Daniele Bagni, Fabrizio Rovati, Andrea Vitali,
Daniele Alfonso). Among them, I must also mention Gianluca Gennari, who was a precious
interlocutor.
Special thanks go to Prof. Kannan Ramchandran, who gave me the opportunity to study
and carry on my research on Distributed Source Coding at the University of California,
Berkeley. There I had the opportunity of working and studying in a stimulating environment
that improved both my professional expertise and my personal growth. I also want to thank
the colleagues of the BASICS lab and Wireless Foundation Center, Vinod, Ben, Dan, Chuohao,
Alexandros, Paolo, Animesh, for the always nice talks I had with them. I also want to thank
June for the discussions and the arguments we had. Although it was not always easy to
understand each other, we were always able to work it out.
I also want to thank all the people that I lived with in the International House of Berkeley,
CA, USA, during the period August 2005 - June 2006. Their friendship was both supportive in
moments of sorrow and cheerful in moments of joy and fun. Special thanks must go to the
“Italian Community” (i.e., in random order David, Devis, Lorenzo, Davide, Luca, Laura,
Sara, Bianca, Alberto), who were my companions in many adventures, and to Helena, Cristine,
Pedro, Michelle, Shobi, Tricia, Josephine, Melike, Alessandro, Pietro, Fatma, with whom I
spent wonderful moments. I must also remember Melike, Sergej, Kate, Scott, Mickey, Sanjay,
Albert, Arlene, Kim, Shani, and all the 6th floor people. I also want to thank Victoria,
George, Edgar, Elisa, Karin, and all the Capoeira Narahari group, together with Basma, Yui,
and the other guys of the dance class. But of course there are lots of other people in
Berkeley and all around the world (since we were a multiethnic community) that made my stay
in the Bay Area a wonderful experience. There I really appreciated the richness that comes
from meeting people of different cultures.
I must remember those people that I met around the world while carrying out this work.
Among them, Najat must be named, since we kept on talking a lot after EUSIPCO 2004, sharing
professional knowledge and personal interests.
I cannot forget the support received during these years from the many friends I have
outside the University. Above all I must thank my life-long friend Luca, who has supported
me throughout all these years. I must also thank Alberto, Betta, Enrico, Chiara, Alessia, Luisa,
Marta, Nicole, Stefano. I also thank Antonio for our nice discussions about working inside and
outside the university. I must remember all the friends that I have in Camposampiero (PD),
Italy, who were able to make my life happier beyond work and study.

Finally, I owe a great deal of thanks to my parents, who taught me the discipline and
perseverance necessary to achieve any important result, and to my brother, who waited patiently
whenever I was using the internet connection to carry on my research.
Contents
1 Introduction 1
1.1 The convergence of multimedia and mobile communications . . . . . . . . 1
1.2 Features of next generation coding schemes . . . . . . . . 2
1.2.1 Compression gain . . . . . . . . 2
1.2.2 Computational complexity . . . . . . . . 3
1.2.3 Robustness to data corruption and losses . . . . . . . . 4
1.3 Main purpose and outline of the thesis . . . . . . . . 4
2 Video Source Coding and the H.264/AVC video coding standard 7
2.1 Introduction . . . . . . . . 7
2.2 A holistic overview of the building blocks . . . . . . . . 9
2.2.1 Spatial Prediction . . . . . . . . 13
2.2.2 Motion Compensation . . . . . . . . 14
2.2.3 Transformation and quantization . . . . . . . . 16
2.2.4 Entropy coding . . . . . . . . 18
2.2.5 Deblocking Filter . . . . . . . . 20
2.3 Summary . . . . . . . . 20
3 Probability-Propagation Based Arithmetic Coding 21
3.1 Introduction . . . . . . . . 21
3.2 The Context Adaptive Binary Arithmetic Coder (CABAC) . . . . . . . . 25
3.2.1 Binarization and context modelling for the absolute values of non-zero coefficients . . . . . . . . 27
3.3 Modeling the contexts using a graph . . . . . . . . 28
3.4 A Sum-Product based arithmetic coder . . . . . . . . 32
3.4.1 Probability modelling through DAGs . . . . . . . . 33
3.4.2 Estimation of the bit probability . . . . . . . . 34
3.4.3 Context initialization . . . . . . . . 36
3.4.4 Statistics update . . . . . . . . 37
3.4.5 Reduction of the number of contexts . . . . . . . . 37
3.5 Experimental results . . . . . . . . 38
3.6 Summary . . . . . . . . 41
4 Rate control algorithms for H.264 45
4.1 Introduction . . . . . . . . 45
4.2 Rate distortion modeling based on “zeros” . . . . . . . . 47
4.3 Parametric models for H.264 coefficients estimated through activity . . . . . . . . 50
4.3.1 Storing the coefficients histograms . . . . . . . . 50
4.3.2 Approximating the coefficients distribution via a parametric model . . . . . . . . 51
4.4 Signal analysis in the (ρ,Eq)-domain . . . . . . . . 53
4.5 A (ρ,Eq)-based rate control algorithm . . . . . . . . 55
4.5.1 Bit rate control at GOP level . . . . . . . . 55
4.5.2 Bit rate control at frame level . . . . . . . . 56
4.5.3 Bit rate control at macroblock level . . . . . . . . 60
4.6 Experimental results . . . . . . . . 62
4.7 Summary . . . . . . . . 67
5 Joint Source-Channel Video Coding Using H.264/AVC and FEC Codes 69
5.1 Introduction . . . . . . . . 69
5.2 On dealing with channel errors and losses in video transmission . . . . . . . . 71
5.2.1 Error concealment at the decoder . . . . . . . . 72
5.2.2 Error concealment at the encoder . . . . . . . . 73
5.3 Channel coding techniques based on FEC codes . . . . . . . . 76
5.4 Adapting the matrix size to the input data . . . . . . . . 79
5.4.1 Adapting matrix size according to the packet lengths . . . . . . . . 79
5.4.2 Adapting matrix size according to the video content . . . . . . . . 80
5.5 Joint source-channel rate control . . . . . . . . 86
5.6 Experimental results . . . . . . . . 88
5.6.1 Results with a fixed matrix . . . . . . . . 89
5.6.2 Results with an adaptive matrix . . . . . . . . 93
5.6.3 Results with a joint source-channel rate control algorithm . . . . . . . . 93
5.7 Summary . . . . . . . . 93
6 Achieving H.264-like compression efficiency with Distributed Video Coding 97
6.1 Introduction . . . . . . . . 97
6.2 Distributed Video Coding . . . . . . . . 99
6.3 A simple example of coding with side information . . . . . . . . 101
6.4 A quick glance at the original PRISM architecture . . . . . . . . 104
6.5 Structure of the implemented coder . . . . . . . . 105
6.6 The generation of syndromes . . . . . . . . 106
6.7 Entropy coding of syndromes . . . . . . . . 108
6.7.1 Entropy coding of syndromes . . . . . . . . 108
6.7.2 Experimental results . . . . . . . . 113
6.7.3 Evaluation of compression gain with no quality equalization . . . . . . . . 113
6.7.4 Evaluation of compression gain with Intra refresh . . . . . . . . 114
6.7.5 Evaluation of compression gain with rate control . . . . . . . . 116
6.8 Summary . . . . . . . . 117
7 Conclusions 119
7.1 Summary . . . . . . . . 119
7.2 Future Research . . . . . . . . 122
A Relation between Eq and ρ 123
A.1 Derivation of probability distribution for syndromes . . . . . . . . 126
Bibliography 129
List of Figures
2.1 A block-based scheme of the H.264/AVC coder. . . . . . . . 11
2.2 Relation between Video Coding Layer (VCL), Network Adaptation Layer (NAL), and transmission networks. . . . . . . . 12
2.3 The 4× 4 Intra predictors. . . . . . . . 13
2.4 Motion Vector computation and its prediction. . . . . . . . 14
2.5 Macroblock partitioning for Motion Compensation in the H.264/AVC standard. . . . . . . . 15
2.6 Different coding and display order for GOPs. . . . . . . . 16
2.7 Comparison between CAVLC, CABAC, and UVLC (a fixed VLC code defined in the H.26L drafts). . . . . . . . 19
3.1 A simple example of arithmetic coding. . . . . . . . 22
3.2 Scheme of the CABAC coding engine. . . . . . . . 24
3.3 Structure of the Finite State Machine related to the CABAC coder. . . . . . . . 26
3.4 Scheme of contexts for the absolute values of coefficients. . . . . . . . 27
3.5 Directed Acyclic Graph that models the statistical dependencies between the coefficients in a transform block. . . . . . . . 28
3.6 Dependencies between the coefficients in a macroblock. . . . . . . . 29
3.7 Distinction between bit planes coded using the DAG probability model and bit planes coded using the traditional CABAC scheme. . . . . . . . 34
3.8 Structure of the modified Finite State Machine in the new arithmetic coder. . . . . . . . 37
3.9 Coding results for different QCIF sequences at 30 frame/s. . . . . . . . 40
3.10 Results for different QCIF sequences at 30 frame/s. . . . . . . . 42
3.11 Results for different CIF sequences at 30 frame/s. . . . . . . . 43
4.1 Distortion vs. Rate for coded Intra, Inter and B frames. . . . . . . . 48
4.2 Plots of bit rate vs. ρ for the coded sequence foreman. . . . . . . . 49
4.3 Histogram of coefficient frequencies from the coded sequence carphone. . . . . . . . 50
4.4 Eq vs. ρ for the sequence carphone. . . . . . . . 54
4.5 Bits/Frame and PSNR/Frame plot of 240 QCIF frames for the sequence salesman. . . . . . . . 62
4.6 Distortion-Rate plot of 120 CIF frames for the sequence salesman. . . . . . . . 63
4.7 Distortion-Rate plot for different QCIF sequences at 30 frame/s. . . . . . . . 64
4.8 PSNR and Rate plots of 180 QCIF frames for the sequence foreman. . . . . . . . 66
5.1 A pictorial example of Multiple Description Coding. . . . . . . . 74
5.2 General scheme for the coding matrix in the RFC 2733 approach with and without byte padding. . . . . . . . 77
5.3 Experimental results for different sequences showing the relative quality loss δE(PSNR)/E(PSNR) and the parameter N3dB vs. the activity act. . . . . . . . 83
5.4 Experimental results for different sequences showing the relative quality loss δE(PSNR)/E(PSNR) and the parameter N3dB vs. the percentage ρ. . . . . . . . 85
5.5 Results for different sequences with loss probability 0.03. . . . . . . . 90
5.6 Results of FEC-NoPadding for foreman QCIF with different FEC redundancy and loss probability 0.03. . . . . . . . 91
5.7 Results of FEC-NoPadding with different rows and columns (loss probability 0.06). . . . . . . . 92
5.8 Comparison between adaptive and fixed methods with loss probability 0.06. . . . . . . . 94
5.9 Comparison between adaptive and fixed methods with loss probability 0.06. . . . . . . . 95
6.1 Two different coding scenarios for the example in Section 6.3. . . . . . . . 102
6.2 Example of Wyner-Ziv decoding with sources in {0, 1}^3. . . . . . . . 102
6.3 A pictorial representation of innovation and correlated info for blocks. . . . . . . . 104
6.4 CRC coding mask. . . . . . . . 104
6.5 Block diagram for the presented DSC-based coder. . . . . . . . 106
6.6 Partitioning of the quantized values (the lattice of integers Z) into three sublattices. . . . . . . . 108
6.7 Comparison between the actual pmf of syndromes and the presented model. . . . . . . . 109
6.8 Difference between the entropies of DFD and DSC syndromes. . . . . . . . 110
6.9 Comparison between the probabilities of non-null DSC syndromes and non-null H.264 coefficients. . . . . . . . 110
6.10 Coding performance of the original CABAC on H.264 coefficients and DSC syndromes. . . . . . . . 111
6.11 Example of quad-tree coding using CBP variables. . . . . . . . 112
6.12 PSNR vs. Bit rate for the first frame in the GOP. . . . . . . . 113
6.13 PSNR vs. Bit rate for a whole GOP. . . . . . . . 115
6.14 PSNR vs. Bit rate with Intra refresh enabled. . . . . . . . 116
6.15 PSNR vs. Bit rate with rate control enabled. . . . . . . . 117
List of Tables
2.1 Timeline and coding applications for different video coding standards. . . . . . 8
3.1 Sequence of states for the considered example. . . . . . . . 23
4.1 Configuration parameters for the H.264 encoder. . . . . . . . 62
4.2 Results for the sequence salesman. . . . . . . . 63
4.3 Comparison between the (ρ,Eq)-based algorithm and the JM7.6 algorithm. . . . . . . . 65
4.4 PSNR/Rate for VBR tests on different sequences. . . . . . . . 66
5.1 Comparison between ρ-adaptive and fixed rate control methods for the sequence news. . . . . . . . 96
5.2 Comparison between ρ-adaptive and fixed rate control methods for the sequence foreman. . . . . . . . 96
6.1 Comparing the average bit rate needed to code the position of zeros and ones in the H.264 coder and CBP blocks for the DSC coder. . . . . . . . 113
Chapter 1
Introduction
1.1 The convergence of multimedia and mobile communications
Over the last decade, the IT world has witnessed the joint development and spread of both
multimedia technologies and wireless transmission systems.
The commercial success of digital audio/video applications and the attractive business
opportunities created by the gradual penetration of multimedia communications have rushed
both industry and academia toward the design of digital architectures with advanced
multimedia functionality. This tendency has led to the creation of ever more efficient
audio/video coding systems, while the availability of faster processors and increased amounts
of memory has made possible the adoption of complex coding algorithms with increased
compression capability.
At the same time, we have also witnessed an unprecedented spread of wireless
communications. The need to connect distant users at any place and at any time has promoted
the investigation of more efficient modulation schemes and transmission protocols, allowing
consumers to exchange an enhanced and varied set of data.
In recent years, these two research fields have started to converge, since the need to
provide ubiquitous access to multimedia services over a heterogeneous interconnection of
networks has posed new challenges to the existing coding schemes. The goals of second
generation cellular networks, i.e. supporting integrated voice and data, were extended in
third generation cellular networks to provide the user with a wider set of multimedia
services, spanning from video communication to the enjoyment of video-on-demand content. As
a consequence, wireless terminals with advanced multimedia functionalities have been
progressively gaining importance, as the production of multimedia content for both business
and personal use has become an essential element in everyday communications.
This convergence of mobile and multimedia communication has raised new problems related
to the heterogeneity of the scenarios and the time-varying nature of the channels involved.
Since source coding technology has achieved a sufficient degree of maturity in these
application contexts, next-generation architectures have to deal with the interoperable
exchange of multimedia information and with efficient transmission over networks that may be
affected by information losses and data corruption.
A flexible infrastructure for the exchange of multimedia content is required, since distinct
users in a heterogeneous scenario are willing to communicate and interact with different media,
such as audio, video, and text. Most users have common concerns (efficient management of
content, protection of content, and privacy issues), and new solutions are required to manage
the access and delivery of these different content types in an integrated and harmonized
way, entirely transparent to the different users. The challenge is made particularly difficult
by the fact that communicating terminals may have different transmission capabilities, and the
transmission system must be able to adapt the sent data to the network's capabilities. These
needs have led to the definition of the emerging MPEG-21 Multimedia Framework standard, which
“aims to enable transparent and augmented use of multimedia resources across a wide range
of networks and devices” [10].
On the other hand, the capability of providing reliable video transmission is the most
relevant issue for the spread of multimedia mobile services. Radio channels present
non-stationary characteristics that result in losses or alterations of the received data. The
missing or corrupted information leads to a mismatch between receiver and transmitter that
may reduce the quality perceived by the end user and, in some cases, preclude the correct
decoding of the following information. The receiver can mitigate the drawbacks of data losses
by estimating the lost information up to a given uncertainty. However, when the amount of
lost information is substantial or the non-stationary characteristics of the signal do not
allow good recovery performance, it is necessary to adopt a more efficient coding strategy or
a more robust coding scheme at the encoder. Moreover, the intrinsic mobility of wireless
systems leads to a frequently-changing network topology that makes resource allocation
difficult. For example, a variable number of users sharing a common resource, like bandwidth,
changes the amount of resources assigned to each one, and this variability must be taken into
account by the coding schemes.
In the following section, the different issues that characterize the choices and the design
of new coding schemes are identified.
1.2 Features of next generation coding schemes
As the emerging scenarios in the multimedia world demand interoperability and reliability
of communications, Information Technology professionals have looked for new solutions that
can efficiently cope with the new requirements. In this investigation, their concern has
mainly focused on those features of video coding architectures that have a direct influence
on the resulting coding performance.
1.2.1 Compression gain
In order to obtain an efficient delivery of multimedia information, devices should be endowed
with advanced compression algorithms that allow the receiving terminal to reconstruct the
coded information with the highest possible fidelity.
Usually the size of the coded data is constrained by different external factors. One of them is
the available storage space when sequences are filed into multimedia archives or
stored on physical media such as DVDs. Another constraining factor is the available
bandwidth in real-time video communications, which limits the amount of data that can be
sent per time unit. In this case the arrival time of the transmitted data affects the final service quality
experienced by the end user, and therefore the adopted coding algorithm must ensure that
the produced bit rate does not overwhelm the transmission capacity.
Note that a significant compression ratio can only be obtained by adopting a “lossy” coding,
i.e. a coding scheme that tolerates some distortion of the original video sequence while
greatly reducing the amount of transmitted information. The video coding
world has faced this problem by designing new standards characterized by continuously-
improving compression efficiency. The performance of the first MPEG video coders has now
been surpassed by the high coding gains obtained by the latest standards, such as H.264/AVC
and the upcoming MPEG-21/SVC [10], which have made video communication possible on
third generation cellular networks.
The main goal of the standardization process is defining a “common language” that enables
different terminals from different vendors to exchange video data. Although the standardization
process strictly specifies the syntax of the coded video stream, there are no limitations on how
the stream can be generated, i.e. on the encoding process. Each designer is free to implement
the encoder according to his or her specific targets and depending on the physical device: the
goal of any coding strategy is to ensure the highest visual quality given the imposed constraints on
the size of the coded video stream and the available hardware resources. This has led to
the creation of a wide range of different control algorithms that tune the coding parameters
according to the number of bits to transmit and to the relevance of the coded information with
respect to the resulting visual quality.
1.2.2 Computational complexity
A second issue is the computational complexity, which is still an important discriminating
element because of its implications for the autonomy of mobile devices. Despite the availability
of more and more powerful processors, the limited power supply that characterizes most mobile
video devices prevents the adoption of architectures that require a large amount of computation.
Therefore, the video coding literature has presented several low-complexity solutions
that permit a light implementation of complex video coding architectures on battery-supplied
systems. In addition, during the last years new coding paradigms have emerged, inspired by
an “uplink” broadcast model (where a multitude of light encoders send different coded video
streams to a complex decoder) [93, 95, 94, 26] in place of the traditional “downlink” broadcast
model (where a complex encoder serves a multitude of light decoders). In this way the
problem of computational complexity on the mobile terminal is efficiently addressed by
adopting a hybrid system in which the coding adopted for the uplink transmission differs
from the coding adopted for the downlink transmission, so that the most computationally-
expensive tasks can be performed by the network, whose available power supply and bearable
computational load have higher bounds.
1.2.3 Robustness to data corruption and losses
A third element that is worthy of consideration in the design of a video coder is the robust-
ness of the scheme to errors and losses, i.e. its capability of avoiding error propagation when
the coded bit stream has been corrupted by errors and losses. All the current video coding
paradigms, which are part of popular standards like MPEG [47, 44] and H.26x [45, 87], fail to
address this requirement, as most of their compression gain is achieved through the adoption of
inter-frame motion prediction. Since each frame is coded taking one of the previous frames as
a reference, whenever part or all of that reference is missing because of channel errors,
the decoding process stalls until the state of the decoder is refreshed, i.e. non-predicted
data are sent at a great cost in bandwidth [62]. A possible alternative is to estimate the lost
information and replace the missing data with its approximation in the decoding process [22],
introducing an additional noise which results in a quality degradation of the image. Other cod-
ing schemes protect the coded stream by applying channel codes [33, 34, 106, 79] or by coding multiple
correlated streams that allow the decoder to estimate a lost stream from the ones that were
correctly received [32, 13, 134, 108]. Most of these techniques prove to be highly effective
whenever the protection level is matched to the channel conditions and the characteristics of
the coded signal. As a consequence, the recent literature has been characterized by various
proposals of joint source-channel coding algorithms that try to find the protection strategy
that allows the receiver to recover most of the information lost across the transmission channel.
In this context, R. Puri and K. Ramchandran have recently faced the interesting question of
whether it is possible to design an efficient video coding paradigm that simultaneously attains
the compression efficiency of motion-compensated coding and robustness with a low encoding complexity. Their
investigation has led to the design of PRISM [93], a new video coding scheme founded on
the principles of distributed source coding ([113, 130]) that allows the prediction of the current
block of data from a set of different possible references. In case some of them are missing,
a correct decoding is still possible from those references that were correctly decoded. This
coding solution (called "syndrome encoding") classifies the current input data and codes it ac-
cording to the classification. The receiver is able to decode the information using an arbitrary
reference that belongs to the same class. In this way, the motion search is transferred to the
decoder, and it is possible to meet both the requirement of a light encoder and the need for a
robust bit stream.
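The syndrome-encoding idea can be illustrated with a toy scalar example. This is a hypothetical sketch of the coset principle, not the actual PRISM codec: the encoder transmits only the coset index x mod M instead of x itself, and the decoder recovers x from any reference y (side information) that lies within M/2 of x.

```python
# Toy scalar illustration of syndrome (coset) encoding, the principle behind
# PRISM. Hypothetical sketch, not the actual PRISM codec.

M = 8  # coset spacing; must exceed twice the worst-case reference mismatch

def encode(x):
    """Encoder: transmit only the syndrome (coset index) of the sample."""
    return x % M

def decode(syndrome, y):
    """Decoder: pick the coset member nearest to the reference y."""
    base = y - ((y - syndrome) % M)        # coset member at or just below y
    return min((base, base + M), key=lambda c: abs(c - y))

x = 133
s = encode(x)                   # only log2(M) = 3 bits are transmitted
for y in (130, 131, 135, 136):  # any reference within M/2 of x decodes correctly
    assert decode(s, y) == x
```

Because any reference in the same class decodes the sample correctly, the search for a good reference (the motion search) can be moved entirely to the decoder.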
The following section will describe how these and other topics are dealt with in this thesis.
1.3 Main purpose and outline of the thesis
The focus of this thesis is the design of efficient coding algorithms for video transmission over
wireless channels. These strategies aim both at increasing the coding gain and at reducing the
quality degradation in case of information losses. In all these techniques, special attention
was paid to the computational complexity, which was kept as low as possible in order to make
these solutions applicable to mobile devices with a limited power supply.
Chapter 2 presents a brief overview of the H.264/AVC standard which is the starting point
of our investigation. The purpose of the Chapter is not to provide a detailed description of the
standard H.264/AVC, but to define the conventions and the notation that will be used throughout
the whole thesis.
Chapter 3 describes a novel arithmetic coding engine based on statistical graphical models.
It is possible to improve the performance of the H.264/AVC arithmetic coder by modifying the con-
text structure of its arithmetic coding engine, the Context-Adaptive Binary Arithmetic Coder
(CABAC). In this case, probabilities are modelled through a set of Directed Acyclic Graphs
(DAGs), which allows a more accurate estimate of the probabilities of binary digits. Experi-
mental results show that it is possible to reduce the size of the coded bit stream by approxi-
mately 10%.
Chapter 4 presents an efficient rate control algorithm that obtains a high objective
quality in the reconstructed sequence while keeping the coded bit stream within the avail-
able bandwidth. The algorithm is based on modelling the number of bits produced by the
H.264/AVC coder in the (ρ, Eq) domain, where ρ is the percentage of null quantized DCT co-
efficients and Eq is the energy of the quantized signal. The resulting strategy proves to be very
effective with respect to other solutions since it allows an accurate estimate of the produced bit
rate. Moreover, this result is improved by the adoption of an effective frame-skipping technique that
avoids coding frames whenever their absence does not significantly affect the smoothness of the
reconstructed sequence and the transmission buffer is saturated by the previous frames.
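The flavor of ρ-domain rate control can be sketched as follows. This is a hedged stand-in, not the thesis's own (ρ, Eq) model: it uses the classic linear ρ-domain law in which the bit count is roughly proportional to the number of nonzero quantized coefficients, with a proportionality constant θ assumed to be estimated from previously coded frames.

```python
# A minimal sketch of rho-domain rate control (stand-in model, not the
# thesis's (rho, Eq) model): bits ~ theta * (# nonzero quantized coeffs).

import numpy as np

def predicted_bits(coeffs, q, theta):
    """Rate predicted from the coefficients surviving quantization step q."""
    return theta * np.count_nonzero(np.abs(coeffs) >= q / 2)

def pick_q(coeffs, budget, theta, q_grid=range(1, 53)):
    """Smallest quantization step whose predicted rate fits the bit budget."""
    for q in q_grid:
        if predicted_bits(coeffs, q, theta) <= budget:
            return q
    return q_grid[-1]

rng = np.random.default_rng(0)
coeffs = rng.laplace(scale=6.0, size=10_000)  # DCT residuals are roughly Laplacian
theta = 4.5       # bits per nonzero coefficient, estimated from past frames
q = pick_q(coeffs, budget=9_000, theta=theta)
assert predicted_bits(coeffs, q, theta) <= 9_000
```

The appeal of the ρ-domain approach is that the percentage of null coefficients is a nearly linear predictor of the bit rate, so the quantizer can be chosen with a simple search instead of trial encodings.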
Chapter 5 copes with the problem of enabling robust transmission through a faulty chan-
nel. After an overview of the existing solutions, the chapter describes an implementation of a
cross-packet FEC channel coder. The coding is performed by arranging RTP packets colum-
nwise in a matrix and computing the redundant information along the rows. This approach
raises several issues in terms of optimizing the matrix dimensions, since the numbers of rows
and columns lead to different performances according to the channel characteristics and the
coded sequence. The chapter proposes an optimization strategy that is based on the percentage
of null quantized DCT coefficients. This criterion is used to design a novel joint source-channel
rate control algorithm that varies the protection level according to the characteristics of the in-
put sequence.
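The cross-packet construction can be sketched with a single parity packet per group. Plain XOR is used here as a minimal stand-in (it corrects one erasure per group); the chapter considers stronger row codes.

```python
# Minimal sketch of cross-packet FEC: K equal-length packets are the matrix
# columns, and a parity packet is computed along the rows (byte-wise XOR
# across packets, i.e. one parity byte per matrix row).

def xor_parity(packets):
    """Byte-wise XOR across equal-length packets."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)

packets = [b"slice-00-payload", b"slice-01-payload", b"slice-02-payload"]
parity = xor_parity(packets)   # one redundant packet per group of K packets

# Any single lost packet is the XOR of the survivors and the parity packet:
recovered = xor_parity([packets[0], packets[2], parity])
assert recovered == packets[1]
```

The trade-off optimized in the chapter is visible even here: more columns per group lower the redundancy overhead but weaken the protection against burst losses.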
Chapter 6 faces the problem of robust transmission considering a different approach. In-
stead of adding redundancy to the coded bit stream, it is possible to create a robust packet
stream through the principle of Distributed Source Coding (DSC). Among all the proposed DSC
solutions, we have considered the scheme proposed by Puri and Ramchandran in [93]. The re-
search presented in Chapter 6 focuses on the entropy coding unit, since most DSC coders obtain
an inferior compression gain with respect to their non-robust hybrid counterparts. A quad-
tree based arithmetic coder is presented, which makes it possible to improve the coding results of
previous coders and to compare favorably with the original H.264/AVC standard.
Finally, we draw our conclusions in Chapter 7, which gives a brief summary of the results
obtained by this investigation and some guidelines for future research.
The material in Chapter 2 is mainly introduced as a review of the H.264/AVC standard. The
remaining chapters represent the original contribution of the author and of the supervisor to the
field. Most of the material covered in this thesis has been published in [75, 4, 81, 76, 80, 108,
78, 82].
Chapter 2
Video Source Coding and the H.264/AVC video coding standard
“If you wish to converse with me, define your terms”
Voltaire
This chapter provides a short introduction to the structure of the H.264/AVC coder, underlining the features of interest for the algorithms presented in the following chapters. The aim is to provide a background of conventions related to the syntax elements and the functional units defined by the standard. The first section gives a general introduction to the purposes and the guidelines that inspired the standardization process. Then, the H.264/AVC coder is decomposed into its building blocks, providing more details for those parts that most directly affect the resulting performance in terms of compression gain. In addition, some conventions about the use of terms related to the H.264/AVC syntax are introduced.
2.1 Introduction
During the last two decades, different video coding standards have been developed to ensure
an efficient handling of visual information along the entire chain that covers the production,
distribution, and reception of video content. Their design was mainly inspired by the need to
shrink as much as possible the overwhelming amount of data produced by a video source,
since the transmission capacity or the storage space is limited. Each standard defines the syntax
and semantics of the bit stream as well as the processing that the decoder needs to perform
when decoding the bit stream back into video. Therefore, manufacturers of video decoders
can only compete in areas like post-processing, optimization of coding parameters, cost, and
hardware requirements, while the implementation of the encoder is completely free as long as
the produced bit stream can be correctly decoded.
This standardization policy has played a crucial role, being the leading factor in the spread
of digital video communication and affecting the way we create, communicate, and consume
audio-visual information. In fact, the decoder-oriented standardization allows the interoper-
ability among products developed by different manufacturers, ensuring to the content creators
that their content runs everywhere and that they do not have to create and manage multiple
copies to match the products of different manufacturers. At the same time, manufacturers
are free to resort to different implementation schemes in order to find the right performance-
cost trade-off matching the requirements of the target applications and the characteristics of
the terminal on which the coder is implemented.
Worldwide, two working groups dominate the video coding standardization processes, namely, the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). VCEG has traditionally focused on low bit rate video coding applications, where there is a need for high compression rates and error resilience tools. MPEG gathers a larger community targeting higher bit rates for entertainment-quality broadcasting applications. Both organizations have produced very successful standards in their respective domains (see Table 2.1), and in 2001 they joined to form the Joint Video Team (JVT) with the purpose of designing an efficient video coder able to satisfy the novel requirements created by the transmission of video content over wireless networks [91].

Name of standard   Organization      Year of release   Title of standard
H.261              ITU-T             1990              Video Codec for Audiovisual Services at p × 64 kbit/s
MPEG-1             ISO/IEC           1991              Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s
MPEG-2 / H.262     ISO/IEC, ITU-T    1994              Generic coding of moving pictures and associated audio information
H.263              ITU-T             1995              Video coding for low bit rate communication
MPEG-4             ISO/IEC           1999              Coding of audio-visual objects
H.264 / AVC        ITU-T, ISO/IEC    2003              Advanced Video Coding

Table 2.1: Timeline and coding applications for different video coding standards.

The main goals of this standardization effort were improved compression efficiency and a network-friendly video representation for interactive (i) and non-interactive (ni) applications such as:
(i) conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks,
modems, etc., or mixtures of these;
(ni) video-on-demand or multimedia streaming services over ISDN, cable modem, DSL, LAN,
wireless networks, etc.;
(ni) broadcast over cable, satellite, cable modem, DSL, terrestrial channels, etc.;
(ni) storage on optical and magnetic devices, DVD, etc.;
(ni) multimedia messaging services (MMS) over ISDN, DSL, Ethernet, LAN, wireless and
mobile networks, etc.
The result of this collaboration is the video coding standard H.264/MPEG-4 AVC, which
reached a first complete definition by the end of 2002 and was completed in February 2005
[29]. The coder structure reflects the traditional scheme of hybrid video source coders with
some additional features that improve its coding performance [127]. In fact, it is possible to
consider H.264/AVC as a “collection” of different coding tools that can obtain a high com-
pression gain when orchestrated in an appropriate manner. The following sections will give
an overview of these features, providing evidence of their influence on the final coding perfor-
mance.
2.2 A holistic overview of the building blocks
The structure of the H.264/AVC coder can be seen as a comprehensive synergy of coding
solutions designed over the last 50 years. In fact, many of the included features were already
present in some of the previous coders. However, the standardization process that has led
to the definition of this coding scheme has redesigned some of these techniques in order to
combine them adequately in a general architecture. In addition to these, some new elements
were introduced, providing the final coder with a wide set of tools that can be rearranged in
many different ways.
The input signal is a digitized video sequence, i.e. an ordered sequence of digital pictures
taken at fixed, equally-spaced time intervals by a digital video camera. Each picture (called a
frame) can be seen as a grid of picture elements (pixels or pels) that represent the local infor-
mation of the picture, much as a tile is a fraction of a mosaic. The density of pixels
per square inch can vary, but it is usually on the order of hundreds of picture elements. Each pic-
ture element carries color information that can be represented by a set of three integers which
vary according to the adopted color space representation. Since the Human Visual Sys-
tem (HVS) is much more sensitive to the luminance than to the chrominances, the Red, Green
and Blue (RGB) color components acquired by the digital sensors are first transformed into
the Luminance and Chrominance (YUV) color space, with the chrominance components spatially
sub-sampled. Since an extensive description of color representation and chrominance
sampling is beyond the scope of this work, more information can be found in [30, 16]. The
current state of the art for the H.264/AVC coder allows different color space representations
and different types of sub-sampling for the input signal, although in this work we will always
consider video signals in the YUV format with 4:2:0 sampling (i.e. the chrominances are sam-
pled at half the frequency in both the horizontal and the vertical direction with respect to the
luminance component). Therefore, each picture can be considered as made of three separate
matrices of integer values. The first contains the luminance component Y (also called Luma),
while the others contain the chrominance components U and V (also called Chromas). Note
that each of the Chroma matrices contains a quarter of the pixels of the Luma matrix because of
the sub-sampling. Most of the conventions about the acquisition of the frames were inherited
from previous coding standards ([47, 45, 44]).
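The 4:2:0 sub-sampling described above can be sketched as follows. Averaging over 2 × 2 blocks is one common choice of sub-sampling filter, assumed here for illustration.

```python
# Sketch of 4:2:0 chroma sub-sampling: each Chroma plane keeps one sample
# per 2x2 block of pixels, so it holds a quarter of the Luma samples.

import numpy as np

def subsample_420(plane):
    """Replace every 2x2 block of a chroma plane by its mean."""
    h, w = plane.shape
    return plane.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

Y = np.zeros((144, 176))   # QCIF-sized luma plane
U = np.zeros((144, 176))   # chroma plane before sub-sampling
U420 = subsample_420(U)
assert U420.shape == (72, 88)     # half resolution in both directions
assert U420.size * 4 == Y.size    # a quarter of the Luma samples
```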
Like many of the previous video coding architectures, the basic processing unit in the
H.264/AVC standard is the Macroblock (MB), i.e. a square of 16 by 16 pixels from the Luma
component associated with two squares of 8 by 8 pixels from the two Chroma components.
Each macroblock is processed as shown in the scheme reported in Fig. 2.1, which comprises the
following set of processing elements:
• Motion Estimation. Each Macroblock is at first partitioned into smaller blocks that can
have heterogeneous sizes, so that Block Matching Motion Estimation (BMME) is applied
to each sub-block, i.e. the coder searches for an equally-sized block among the previously-
coded frames available in a reference frame buffer in order to accurately predict
the current one. The prediction is fully described by a Motion Vector (MV), which is
differentially coded with respect to neighboring MVs in the transmitted bit stream.
• Intra Prediction. One of the main innovations brought by the standard is the adoption of a
block-based spatial prediction that estimates the current block from the neighbor-
ing reconstructed pixels. The Intra prediction compensates for the adoption of a sub-optimal
transform size and allows the H.264/AVC coder to obtain a better coding efficiency with
respect to previous standards.
• Transform. This module reversibly transforms the pixels from the spatial domain into
the frequency domain, where the information appears less correlated, so that a more
compact representation of the current block can be achieved. The transform adopted by
the H.264/AVC standard is an approximation of the 4 × 4 DCT [35, 36], which proves to
be an efficient coding solution when it is matched with an efficient prediction mechanism
like the Intra spatial block-based prediction [14].
• Quantization. The quantization phase shrinks the set of possible reconstruc-
tion levels for the transform coefficients in order to reduce the number of representation
symbols, so that the size of the coded bit stream is smaller. In fact, small variations
in coefficient values are not perceptible by the human visual system, and therefore it
is possible to slightly distort the transform coefficients without affecting the perceived
visual quality. However, whenever high compression gains are needed (i.e. when the
bit rate is constrained by external factors such as the available bandwidth or storage space),
an objectionable visual degradation of the reconstructed images becomes evident, so that
some additional measures must be taken (such as increasing the strength of the
deblocking filter).
• Deblocking Filter. The quantization of the transform coefficients and the block-based
transform performed on the residual signal cause the appearance of unpleasant visual ar-
tifacts, especially at low bit rates. These artifacts usually result in an additional high-
frequency noise that makes the reconstructed image appear as if composed of different
tiled-up blocks (blockiness). This high-frequency noise can be significantly reduced
by an adaptive low-pass filter which tunes its strength according to the values
of various coding parameters and syntax elements [64]. In addition, since the deblocking
filter is included in the prediction loop, it allows a better motion-compensated prediction,
improving the compression efficiency.
• Entropy Coder. This block converts the syntax elements produced by the coder into
variable-length binary strings that can be formatted and packetized in different ways
according to the coding parameters. The H.264/AVC standard defines two different entropy cod-
ing algorithms: the Context-Adaptive Variable Length Coder (CAVLC) [127] and the
Context-Adaptive Binary Arithmetic Coder (CABAC) [73].
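The Transform step listed above relies on the 4 × 4 integer approximation of the DCT. Its core matrix is exact in integer arithmetic; in the standard, the remaining scaling factors are folded into the quantization stage, which this sketch omits.

```python
# Sketch of the 4x4 integer core transform of H.264/AVC (scaling folded
# into quantization and omitted here).

import numpy as np

C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

def forward_4x4(block):
    """Core forward transform Y = C X C^T; exact in integer arithmetic."""
    return C @ block @ C.T

# A flat block concentrates all of its energy in the DC coefficient:
flat = np.full((4, 4), 7)
Y = forward_4x4(flat)
assert Y[0, 0] == 16 * 7 and np.count_nonzero(Y) == 1
```

Because every entry of C is a small integer, the transform can be computed with additions and shifts only, which is one reason it replaced the floating-point DCT of earlier standards.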
The structure can be roughly decomposed into a DPCM coder, which can perform either
a temporal or a spatial prediction, followed by a transform coder and an entropy coder (see
Fig. 2.1).
Figure 2.1: A block-based scheme of the H.264/AVC coder: the input block enters a prediction loop (Motion Estimation/Compensation and Intra-frame Prediction with inverse quantization, inverse transform, and De-blocking Filter), followed by Transform, Quantization, and Entropy Coding under the Coder Control, producing the output bit stream together with motion and control data.
As for temporally-predicted macroblocks, the encoder takes advantage of the temporal
correlation existing among subsequent frames and estimates pixel blocks of the current frame
from pixel blocks of the previous ones (see section 2.2.2). This estimate is then
refined by the transform coding unit, which processes the residual signal.
On the other hand, the spatially-predicted macroblocks are characterized by blocks which
are predicted from the previously coded pixels in the same frame (see 2.2.1). Note that
in this case the coded MB does not depend on the previous frames and the decoding can be
done independently (i.e. we can refer to them as independently-coded or intra macroblocks).
As a consequence, decoding a randomly-chosen temporally-predicted frame implies
decoding all the previous pictures until a spatially-predicted frame is found.1 In addition, the
loss of one frame precludes the correct decoding of all the following temporally-predicted pic-
tures. Therefore, spatially-predicted frames must be coded at regular intervals in order both to
allow pseudo-random access to each frame and to avoid error propagation in case of frame
losses. The periodicity of independently-coded frames depends on the application and affects
1We define as spatially-predicted frames, or Intra frames, the coded pictures that are made only of Intra macroblocks. At the same time, we call temporally-predicted, or Inter, frames those pictures that can be made of both temporally-predicted and spatially-predicted macroblocks.
both the characteristics of the coded bit stream and the quality of the reconstructed sequence
whenever the transmission channel is corrupted by errors and losses. More information on this
subject will be given in section 2.2.2.
Each reconstructed image is then processed using a deblocking filter in order to remove
visual artifacts and improve the performance of the temporal prediction. Further details will be
given in the next sections, focusing the attention on the parts of the standard that concern this in-
vestigation. We refer the reader to [87, 105] for a complete description.
Figure 2.2: Relation between the Video Coding Layer (VCL), the Network Adaptation Layer (NAL), and the transmission networks.
The output of the scheme reported in Fig. 2.1
is a binary stream that needs to be packetized
and organized in an appropriate way ac-
cording to the coding order of macroblocks
and the coded information. This operation is
performed by the Network Adaptation Layer
(NAL), which defines a flexible interface in-
tended to adapt the coded bit stream to the
transmission network (as depicted in
Fig. 2.2).
According to the H.264/AVC specifica-
tion, each video packet carries the informa-
tion related to one slice, a set of macroblocks
belonging to the same frame. In the previous
standards, slices were made of sequences of
macroblocks processed in raster scan order (see [47, 41, 44]). For example, many coding set-
tings included a row of macroblocks in one slice. More recently, many issues concerning
error resilience and packetization have suggested new strategies to design the pattern of mac-
roblocks forming a slice, such as choosing the MBs randomly across the frame, selecting all the
macroblocks inside a specific area, or interlacing rows of macroblocks. The investigation of the
possibilities offered by Flexible Macroblock Ordering (FMO) is beyond the scope of this
work, and all the experiments were carried out considering slices with adjacent macroblocks
in raster scan order. Slices were made considering either a fixed number of MBs per slice or a fixed
number of bytes (i.e. including macroblocks until the number of bits reached a fixed threshold).
For further details about FMO policies, it is possible to refer to [87, 5, 6, 7].
The information related to each slice can be packetized into one or more RTP packets ac-
cording to whether the Data Partitioning (DP) option is enabled or not. This distinction makes
it possible to include different syntax elements in different packets that can be transmit-
ted or protected according to different criteria. Whenever the video sequence is transmitted
over a network affected by losses, the Data Partitioning option allows the decoder to improve
the performance of the error concealment algorithm, since part of the coded information can
still be correctly received. However, in our work we created only one packet per slice without en-
abling the Data Partitioning mode, since the investigation of the possible benefits produced by
its adoption is beyond the scope of the present thesis (for further information see [87]).
The following sections will provide a more detailed insight into some of the basic building blocks.
2.2.1 Spatial Prediction
One of the main innovations introduced by H.264/AVC within the scenario of hybrid video coders is the adoption of a block-based spatial prediction. In fact, the independently-coded macroblocks (called Intra) defined by previous video coders were created by applying transform coding to the input signal directly, without performing any kind of prediction. However, the adoption of a DCT transform with a lower dimension (see subsection 2.2.3) required some additional processing in order to compensate for the lower coding gain associated with unpredicted signals (see [14]). The prediction of the original signal before transform coding was a feasible solution, provided that the frame could still be independently decoded. This requirement led to the adoption of the spatial prediction unit reported in Fig. 2.1 (Intra-frame Prediction), which creates an estimate of each transform block without resorting to the previous frames but considering the values of the neighboring pixels. Unlike most of the previous coding standards, which adopted spatial prediction on a pixel basis [42, 122], H.264/AVC defines a block-oriented spatial prediction stage, where each block can be estimated in two different ways. One possible prediction is performed on the whole macroblock considering the pixels of the upper and left MBs. The estimate is computed by choosing one among a set of 4 predictors, and it is performed whenever the current MB is coded in the Intra16x16 mode. In order to obtain a finer prediction, the standard defines a set of 9 possible spatial predictors (depicted in Fig. 2.3) that approximate the pixels of the current 4 × 4 block using the values of the upper and left pixels. This coding mode (called Intra4x4) implies coding the adopted predictor for each 4 × 4 block.

Figure 2.3: The 4 × 4 Intra predictors: 0 (vertical), 1 (horizontal), 2 (DC), 3 (diagonal down-left), 4 (diagonal down-right), 5 (vertical-right), 6 (horizontal-down), 7 (vertical-left), 8 (horizontal-up).

Since an extensive description of the Intra prediction process in H.264/AVC is not the main topic of this work, [128, 87, 14] can be consulted for further details.
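Three of the nine predictors of Fig. 2.3 can be sketched as follows (the six directional modes are interpolations of the same neighboring samples and are omitted here).

```python
# Sketch of the vertical, horizontal, and DC 4x4 Intra predictors of
# Fig. 2.3; the six directional modes are omitted.

import numpy as np

def predict_4x4(mode, above, left):
    """above: the 4 reconstructed pixels over the block; left: the 4 beside it."""
    if mode == 0:                       # 0 (vertical): copy the row above
        return np.tile(above, (4, 1))
    if mode == 1:                       # 1 (horizontal): copy the left column
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == 2:                       # 2 (DC): rounded mean of the 8 neighbors
        return np.full((4, 4), (above.sum() + left.sum() + 4) // 8)
    raise NotImplementedError("directional modes omitted in this sketch")

above = np.array([10, 20, 30, 40])
left = np.array([12, 14, 16, 18])
assert (predict_4x4(0, above, left)[3] == above).all()
assert predict_4x4(2, above, left)[0, 0] == 20
```

The encoder picks the mode whose prediction leaves the smallest residual, and only the residual and the mode index are transmitted.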
2.2.2 Motion Compensation
Despite the fact that the spatial prediction permits improving the coding gain of the 4 × 4 DCT alone, it is possible to obtain further compression through temporal prediction. The efficiency of this technique was known to the previous coding standards too (see [47, 44, 45]), and different techniques have been applied to take advantage of it. The most widely used method implies the estimation of Motion Vectors (MV). Motion Vector coding is based on a vector-oriented model of motion derived from classical mechanics ([18, 85]), where the movement of objects is simplified as a sequence of small local translations on a plane, which are the projections on the image plane of the real three-dimensional movements. Although the model proves to be inefficient in modelling rotations, deformations, or movements along the optical axis of the camera, it is broadly adopted for its intrinsic simplicity. The original image is divided into blocks of pre-determined size, which are predicted using an equally-dimensioned block of pixels taken from the previous images. The identification of the prediction block is typically performed through a Block Matching (BM) algorithm, which finds the block that minimizes a given distortion function among a set of possible candidates. The candidate set could include all the possible blocks from the previously coded frames, but practical approaches confine the search to those blocks that lie within a limited window (see Fig. 2.4(a)). Each block can be identified by specifying a Motion Vector, i.e. the difference between the Cartesian coordinates of the prediction block and the current one. Experimental results and the physical characteristics of real objects in a scene indicate that neighboring motion vectors are correlated.
(a) Translational model for BMME (b) MV median predictor applied to heterogeneous neighboring blocks
Figure 2.4: Motion Vector computation and its prediction.
Therefore, it is possible to take advantage of this correlation both in the Motion Estimation (ME) process and in the coding of MV values. The H.264/AVC coder identifies a predictor for each motion vector, which corresponds to the median of the neighboring ones, as depicted in Fig. 2.4(b). The median predictor identifies the center of the search window, and its value is used in a DPCM coding of the current MV.
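The median prediction and the DPCM coding of a motion vector can be sketched as follows. This is a minimal illustration: the function names and the choice of neighbors A (left), B (top), C (top-right) are ours, and the standard's special cases for unavailable neighbors and heterogeneous partitions are omitted.

```python
def mv_predictor(mv_a, mv_b, mv_c):
    """Component-wise median of the left (A), top (B) and top-right (C)
    neighboring motion vectors."""
    med = lambda x, y, z: sorted((x, y, z))[1]
    return (med(mv_a[0], mv_b[0], mv_c[0]),
            med(mv_a[1], mv_b[1], mv_c[1]))

def mv_residual(mv, mv_a, mv_b, mv_c):
    """DPCM coding of the current MV: only the difference from the
    median predictor is sent to the entropy coder."""
    px, py = mv_predictor(mv_a, mv_b, mv_c)
    return (mv[0] - px, mv[1] - py)
```

The decoder applies the same median rule to the already-decoded neighboring MVs and adds the received residual back.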
In H.264/AVC, the MV-based prediction scheme is enhanced by allowing a flexible partitioning of the MB to be predicted. Fig. 2.5 reports the possible partitioning structures that can be applied to a macroblock: 16x16, 16x8, 8x16, and 8x8, where each 8x8 partition can in turn be coded as 8x8, 8x4, 4x8, or 4x4 sub-blocks.
2.2. A holistic overview of the building blocks 15
(a) Possible block-partitioning for a macroblock (b) Partitioning of a single frame from the sequence foreman
Figure 2.5: Macroblock partitioning for Motion Compensation in the H.264/AVC standard.
This flexibility makes a more accurate temporal prediction possible, since the macroblock partitioning can be fitted to the shape of moving objects in the scene, minimizing the energy of the residual signal that is to be coded. In order to improve the coding gain, motion compensation is performed on frames interpolated at quarter-pel resolution, where the expanded frames are obtained using a 6-tap FIR filter followed by a 4-tap one, as reported in [124].
There are two types of temporally-predicted frames/slices: the P-type frames/slices and the B-type frames/slices. P-type frames/slices are predicted considering only the previous frames in the display order, and for each predicted block only a single motion vector is specified (classified as forward MV). B-frames are characterized by a bi-directional temporal prediction (specified by a forward and a backward MV), which allows a better estimate of the current block since the prediction process takes into account both the previous and the following frames in the display order. As a consequence, the display order does not correspond to the coding order, as
Fig. 2.6(b) shows. Usually, bi-directional temporal prediction requires a higher computational
cost, and therefore, the lowest complexity profiles for the H.264/AVC coder do not include
this coding option. Moreover, the standard imposes additional limitations on the coding types of macroblocks in each frame. Intra slices can include Intra macroblocks only, while P slices can be made of both Intra and motion-compensated macroblocks whose prediction is characterized by a single motion vector (forward). As for B slices, no Intra macroblocks are present since all the macroblocks are temporally predicted, and the coder can specify one or two MVs for each motion-compensated block, permitting the estimation of the prediction block either from the previous frame or from both the previous and the following frames in display order (see Figure 2.6 and [87]). As mentioned before, the coding type of each frame is pre-determined by the encoder according to a certain structure. In fact, the sequence of pictures is divided into Groups Of Pictures (GOP), each of which generally includes all the frames between two subsequent Intra pictures. The structure and the length of each GOP are determined by the availability of computational resources, the characteristics of the application, and the state of the channel.
In fact, bi-directional prediction requires a double ME that turns out to be quite expensive for a device with reduced hardware resources or limited power supply. At the same time, bidirectional prediction requires a computational time that could be prohibitive for interactive applications, where the coded pictures must be ready for transmission at fixed equally-spaced time instants.
(a) Coding and display order for GOPs with structure IBBP (b) GOP structures IBBP and IPPP
Figure 2.6: Different coding and display order for GOPs.
Finally, in the presence of losses the reconstructed sequence is distorted, since the loss of one or more frames is compensated by estimating the lost information with an error concealment algorithm. As a drawback, the decoder frame buffer differs from the encoder's, precluding a correct reconstruction of the coded sequence until a refresh is performed with some Intra data. Therefore, the ordering of coding types and the length of the GOP can vary considerably, and in the present work two GOP structures are considered: the IBBP structure (two B-frames either between an I- and a P-frame or between two P-frames), and the IPPP structure (an I-frame followed by only P-frames). These two structures are depicted in Fig. 2.6(b). The baseline profile of the H.264/AVC coder, which is the standard configuration with the lowest computational complexity, only includes the IPPP structure.
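The frame reordering implied by bi-directional prediction can be illustrated by deriving the coding order from the display order: each B-frame must wait until its future anchor (I- or P-) frame has been coded. This is a simplified sketch, assuming every B-frame references only the nearest surrounding anchor frames; the function name is ours.

```python
def coding_order(gop):
    """Reorder a GOP given in display order (e.g. "IBBPBBP") so that
    every B-frame is coded after both of its reference frames."""
    order = []
    pending_b = []
    for i, t in enumerate(gop):
        if t in "IP":
            order.append(i)          # anchor frame: code it immediately
            order.extend(pending_b)  # then the B-frames that reference it
            pending_b = []
        else:
            pending_b.append(i)      # B-frame: wait for its future anchor
    order.extend(pending_b)
    return order
```

For an IBBP GOP such as "IBBPBBP", the anchors are coded first and the B-frames follow their future reference, while for an IPPP GOP the coding order coincides with the display order.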
It is worth mentioning, as a specific coding type, the Direct mode, where no residual information is sent and the motion vectors for the current block are estimated from the MVs of the blocks at the corresponding positions in the previous and the following frames. In this way, it is possible to minimize the amount of bits that have to be coded for temporally-predicted macroblocks. Further details are available in [105, 87].
2.2.3 Transformation and quantization
One of the basic features that distinguishes the H.264/AVC coder from the previous ones is the adopted transform. Since the development of the earliest standards for image compression, such as JPEG at the end of the eighties [122], transform coding through the Discrete Cosine Transform (DCT) has represented the usual approach for both image and video coding [47, 44, 87]. The DCT is able to compact the energy of the transform block in the frame to be coded into a smaller number of frequency coefficients, and most of the previous transform-based video coders adopt a Discrete Cosine Transform performed on 8 × 8 blocks, as it provides a good trade-off between computational complexity and compression efficiency. Recently, new paradigms have been proposed as good substitutes for the DCT, like the wavelets used in the JPEG2000 standard [43, 118].
At the beginning of the standardization process of H.264/AVC, the technical literature had presented several implementations of the 8 × 8 DCT in fixed-point arithmetic designed for low-complexity devices. However, the designers of H.264/AVC looked for a reduced-complexity transform that was implementable with a few additions and shift registers. The solution was found by simplifying the structure of the 4 × 4 DCT, which can be easily approximated through a multiplierless transform followed by a rescaling. In fact, the 4 × 4 DCT can be rewritten as follows
$$Y = A X A^{T} = \begin{pmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{pmatrix} X \begin{pmatrix} a & b & a & c \\ a & c & -a & -b \\ a & -c & -a & b \\ a & -b & a & -c \end{pmatrix}, \qquad (2.1)$$
where $X$ is the input block, $Y$ is its transformed version, and the transform matrix $A$ is fully described by the factors $a = \tfrac{1}{2}$, $b = \sqrt{\tfrac{1}{2}}\cos\left(\tfrac{\pi}{8}\right)$ and $c = \sqrt{\tfrac{1}{2}}\cos\left(\tfrac{3\pi}{8}\right) = b\,d$. The transform
matrix $A$ can be written as
$$A = \begin{pmatrix} a & 0 & 0 & 0 \\ 0 & b & 0 & 0 \\ 0 & 0 & a & 0 \\ 0 & 0 & 0 & b \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & d & -d & -1 \\ 1 & -1 & -1 & 1 \\ d & -1 & 1 & -d \end{pmatrix}, \qquad (2.2)$$
which makes it possible to rewrite (2.1) as follows
$$Y = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & d & -d & -1 \\ 1 & -1 & -1 & 1 \\ d & -1 & 1 & -d \end{pmatrix} X \begin{pmatrix} 1 & 1 & 1 & d \\ 1 & d & -1 & -1 \\ 1 & -d & -1 & 1 \\ 1 & -1 & 1 & -d \end{pmatrix} \otimes \begin{pmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{pmatrix} \qquad (2.3)$$
where $\otimes$ denotes a coefficient-by-coefficient multiplication (see [35, 36]). In the H.264/AVC standard, the parameter $d$, whose value is $\sqrt{2} - 1 \simeq 0.414$, is approximated by $1/2$, allowing an implementation of the transform with additions and shift registers only.
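With $d = 1/2$, the second and fourth rows of the core matrix are commonly scaled by 2 so that all entries become the integers $\{\pm 1, \pm 2\}$, and the extra factors are folded into the rescaling/quantization stage. The following sketch (our own illustration; the function name is an assumption, and the rescaling by $a^2$, $ab$, $b^2$ is omitted) computes the 2-D transform with additions and shifts only:

```python
def forward_core_transform(X):
    """2-D 4x4 H.264-style core transform Y = C·X·C^T with
    C = [[1,1,1,1],[2,1,-1,-2],[1,-1,-1,1],[1,-2,2,-1]],
    computed as two 1-D butterfly passes using only adds and shifts.
    The rescaling matrix is left to the quantization step."""
    def t1d(v):
        # butterfly stage on a 4-sample vector
        s0, s1 = v[0] + v[3], v[1] + v[2]
        d0, d1 = v[0] - v[3], v[1] - v[2]
        return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]
    rows = [t1d(r) for r in X]                 # X · C^T, row by row
    out_cols = [t1d(list(c)) for c in zip(*rows)]  # C · (X · C^T), column by column
    return [list(r) for r in zip(*out_cols)]
```

Applying it to a constant block concentrates all the energy in the DC coefficient, as expected from a DCT-like transform.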
In addition to the presented 4 × 4 transform, the standard includes an additional 4 × 4 Hadamard transform, which is applied to the DC coefficients of the 4 × 4 blocks in the Intra16x16 coding mode. The Hadamard transform is also applied to the prediction error blocks before computing the cost function, since it allows the Rate-Distortion (RD) optimization algorithm to discriminate between low-pass and high-pass residual errors².
²The Hadamard transform permits distinguishing blocks with different features that have the same distortion value (see [87]).
An additional 2 × 2 transform is also applied to the DC coefficients of Chroma blocks, as is done in the Intra16x16 case.
Recent developments of the standard have led to the adoption of a higher-dimension transform (sized 8 × 8 pixels), which was introduced in order to efficiently code high-definition video formats (i.e. HDTV). The adopted 8 × 8 transform is derived from a DCT, and it is
implemented without any multiplication. Since most of the research work presented in this thesis concerns the processing of 4 × 4 blocks, a detailed description of the 8 × 8 DCT of the H.264/AVC standard is left to [87].
Since the transform block is directly followed by a quantization, the rescaling matrix can be included in the quantization step by specifying different quantization steps according to the position of the coefficient to be quantized. In the first definitions of H.264/AVC, the quantization steps depended on the spatial frequency of the coefficients and on the Quantization Parameter (QP), a coding parameter that can be specified at macroblock level and can be related to the quantization step through the equation
$$\Delta = K_{(i,j)} \cdot 2^{QP/6}, \qquad i, j = 0, \ldots, 3 \qquad (2.4)$$
where $K_{(i,j)}$ is a scaling factor (see [87]) that depends on the position $(i, j)$ of the coefficient in the block and includes the factors of the rescaling matrix.³ In a more recent definition of the standard, the factors $K_{(i,j)}$ can be arbitrarily specified at the encoder by a quantization matrix, which allows a coarser or finer representation of the coded coefficients according to the adopted coding strategy. The investigation of the optimal quantization matrix is beyond the scope of this work and will not be treated. The quantized coefficients are scanned according to a zig-zag order and run-length coded, i.e. each non-null quantized coefficient is mapped into a couple (r, l), where l equals the value of the quantized coefficient itself while r specifies the number of null coefficients that occur between the current non-zero coefficient and the previous one in the scanning order. In this work, non-null coefficients will be called levels while the null coefficients will be called zeros. The couples (r, l) are then sent to the entropy coder together with other syntax elements such as the motion vectors, the partitioning structure of the current macroblock, the macroblock coding mode, and the adopted prediction modes in the case of Intra macroblocks. For a more detailed description of the run-length coding procedure, see [127, 105, 87].
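The zig-zag scan and the (r, l) mapping can be sketched as follows. This is a simplified illustration: the scan table is the usual 4 × 4 frame-mode order, and the handling of the trailing zeros after the last level is omitted.

```python
# Zig-zag scan order for a 4x4 block, expressed as raster-order indices.
ZIGZAG_4x4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]

def run_length(block16):
    """Map 16 quantized coefficients (raster order) into (run, level)
    couples: `level` is a non-null coefficient, `run` the number of
    zeros separating it from the previous non-null one in scan order."""
    pairs, run = [], 0
    for idx in ZIGZAG_4x4:
        c = block16[idx]
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs
```

For a typical block, the non-null levels cluster at the low frequencies, so the runs stay short at the beginning of the scan and the tail of the block produces no couples at all.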
2.2.4 Entropy coding
The final step of the encoding process for a macroblock implies the conversion of all the syntax elements into a set of binary strings that can be sent to the Network Adaptation Unit in order to be transmitted. This conversion is efficient whenever the length of the binary strings assigned to the syntax elements is matched to their probabilities, i.e. the most probable values are represented with short strings while the least probable values are coded using long strings. This task is entrusted to two different coding algorithms: the Context-Adaptive Variable Length Coder (CAVLC) [127] and the Context-Adaptive Binary Arithmetic Coder (CABAC) [73]. Since the CABAC algorithm will be widely described in chapter 3, the following paragraph aims at giving a short insight into the CAVLC algorithm.
³Equation 2.4 is derived from the tables in [103], considering that the quantization step doubles every 6 QP values.
In most of the previous video coders, the entropy coding algorithm maps the produced syntax elements into variable-length binary strings (called codewords) according to a fixed table (called coding table). This approach proves to be efficient in terms of computational
cost, but most of the time the compression gain is limited, since the algorithm is not able to adapt the length of the codewords to the changing statistics of the input data. Therefore, since the beginning of source coding it was evident that the efforts of researchers should focus on adaptive approaches. First attempts were made considering adaptive Huffman codes, where the probabilities associated with the nodes of the coding tree were updated according to the input statistics (see [19]). Then, different adaptive codes were proposed according to the characteristics of the probability distribution of the coded source. Among these we can mention the CAVLC algorithm, which represents an efficient way of coding the quantized coefficients, choosing the coding table adaptively among a set of possible ones in order to match the signal statistics.
The CAVLC algorithm adopts a fixed Variable Length Code (VLC) in order to specify all the syntax elements which are not related to residual information. As for each block of quantized DCT coefficients, the coder first specifies the number of coefficients different from zero and whether there are coefficients equal to ±1 at the end of the scan. Then, according to the number of coefficients, the couples (r, l) are coded by writing all the l values first and the r values afterwards. The adopted coding table is chosen from a set of possible ones, and it depends on the number of non-zero coefficients and on the value of the previously coded information (i.e. the previous l or r values, the number of non-zero coefficients in the neighboring blocks). Since the statistics of the source can be roughly approximated with a geometric variable, the coding tables define an Exp-Golomb code (see [87]), which proves to be optimal for that kind of source. The performance of the CAVLC algorithm compares very well with that of the arithmetic coder, since the difference is only a 10% increment of the coded bit stream. Fig. 2.7 reports the coding results of the CAVLC algorithm compared with the CABAC algorithm and UVLC, the variable length coding algorithm based on a fixed coding table that was adopted in the early definitions of the standard.
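A zeroth-order Exp-Golomb codeword of the kind mentioned above can be generated as follows (a minimal sketch for non-negative integers; the function name is ours):

```python
def exp_golomb(n):
    """Zeroth-order Exp-Golomb codeword for a non-negative integer n:
    the binary representation of n+1, prefixed by as many zeros as
    there are bits after its leading one."""
    b = bin(n + 1)[2:]               # binary string of n + 1
    return "0" * (len(b) - 1) + b

# n = 0..4 → '1', '010', '011', '00100', '00101'
```

Codeword lengths grow logarithmically with n, which matches the roughly geometric distribution of the coded values.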
Figure 2.7: Comparison between CAVLC, CABAC, and UVLC (a fixed VLC code defined in the H.26L drafts).
2.2.5 Deblocking Filter
All the images that have been coded by a block-based transform coding algorithm present some visual artifacts, related to the fact that each block is processed independently from the neighboring ones. In fact, the loss of part of the information (caused by the quantization) may lead to reconstructing the coded blocks in different ways, even though they are similar in the original signal. These artifacts usually appear as a high-frequency noise that is added to the image and makes it appear as if made of separate tiles, in a sort of mosaic-like effect. This distortion (called blocketization) can be mitigated by filtering the reconstructed image along the edges of each block with a low-pass filter, which is adaptively tuned in order to attenuate high frequencies more or less strongly whenever they carry a significant amount of distortion.
The H.264/AVC standard defines a very efficient deblocking filter, which is applied along the vertical and horizontal edges of every 4 × 4 block. For each edge, the strength of the filter (i.e. the number of involved pixels) depends on the coding type of the two neighboring blocks, the quantization parameters, the presence of non-zero coefficients, and, in the case of motion-compensated blocks, the values of their corresponding motion vectors. Since a full description of the deblocking routine is not the main subject of this thesis, further details can be found in [87, 105, 64].
2.3 Summary
This chapter has presented a short overview of the video coding standard H.264/AVC, starting
from its general scheme and describing some of its building blocks. This coding scheme can
be summarized, for the sake of brevity, as a general DPCM coder followed by a transform
coding block, and an entropy coder. The DPCM prediction can be performed either spatially or
temporally, according to the coding type that characterizes the current frame. The residual signal is then transformed through a multiplierless Discrete Cosine Transform, which is followed by a rescaling/quantization phase. The quantized coefficients, also known as levels, are coded according to a run-length coding algorithm and sent to the entropy coder, which converts them into binary variable-length strings together with the macroblock coding type, the macroblock partitioning information, the prediction modes, and the predicted motion vectors. Specific attention has been paid to those blocks that will be involved in the coding algorithms presented in the following chapters, i.e. the transform block, the quantization, and the CAVLC entropy coder. Further details about the Arithmetic Coding engine will be given in chapter 3, where an improvement of the original algorithm is described.
The basic aim of the chapter is not to provide an accurate overview of the H.264/AVC
coding standard, since it is not the main topic of this work, but to introduce the set of tools
which will be tuned by the algorithms presented in the following chapters. Further details on
the H.264/AVC standard can be found in [105, 87, 71, 64, 127, 72].
Chapter 3
Probability-Propagation Based Arithmetic Coding
“In nature we never see anything isolated, but everything in connection with something else
which is before it, beside it, under it and over it.”
Johann Wolfgang von Goethe
The previous chapters have described the main issues of this work and the starting point of our research, the video coding standard H.264/AVC. This chapter concerns the first of the requirements that an efficient video coding architecture for mobile applications must satisfy, i.e. a good compression efficiency. The whole chapter is focused on improving the compression gain of the arithmetic coder specified by the standard by modelling the probability of bits through a graph. The proposed model takes advantage of the statistical dependence among neighboring coefficients in order to improve the probability estimate. Enhancing the number and the structure of contexts, the proposed solution permits improving the compression gain without increasing the number of required operations.
3.1 Introduction
Arithmetic coding has been known in the form we use it today since the late 70's. The first investigations about this topic appeared in the 1960s thanks to Abramson and Elias, although the proposed solution was far from the first “arithmetic coder”. Only in 1976 did the work of Pasco and Rissanen lead to the design of the first arithmetic coding engine (1979-1980) similar to the one used nowadays (see [9]). Although the idea of arithmetic coding is quite simple, we had to wait until the 80's to witness the first practical implementations (Q-coder and MQ-coder) [83, 129]. Nowadays most of the arithmetic coding engines are a re-elaboration of the MQ-coder and are used in a varied set of applications.
The key idea is mapping strings of values into separate intervals of “real” numbers, allowing the identification of the string by specifying one value in the final interval. As an example, let us consider the string of binary symbols $b = [01101]$ and the corresponding probabilities for the symbol “0” $p_0 = [p_{0,i}]_{i=1,\ldots,5} = [1/3, 1/4, 3/8, 1/8, 1/4]$ at every instant. At the $i$-th
iteration the coding interval $I_i = [l_i, l_i + W_i)$, where $l_i$ is its lower bound and $W_i$ its width, can be partitioned into two sub-intervals $I_i^0$ and $I_i^1$, associated respectively with the probability of a “0” symbol and a “1” symbol. In the adopted notation, the subscript index refers to the index of the coded bit and the superscript index refers to the associated binary value. The width of each sub-interval can be computed as follows
$$W_i^1 = W_i \cdot (1 - p_{0,i}) \qquad W_i^0 = W_i \cdot p_{0,i},$$
while the lower bounds are
$$l_i^1 = W_i^0 \qquad l_i^0 = 0,$$
where we are assuming that the lower interval is always associated with the symbol “0”.
The intervals can be univocally identified by their widths $W_i^s$ and their lower bounds $l_i^s$, $s = 0, 1$ and $i = 0, \ldots, 4$. According to the coded binary symbol, the coding interval is shrunk each time into one of its partitions
$$I_{i+1} \leftarrow \begin{cases} I_i^0 & \text{if } b_i = 0 \\ I_i^1 & \text{if } b_i = 1 \end{cases} \quad \text{i.e.} \quad l_{i+1} \leftarrow \begin{cases} l_i^0 & \text{if } b_i = 0 \\ l_i^1 & \text{if } b_i = 1 \end{cases} \quad \text{and} \quad W_{i+1} \leftarrow \begin{cases} W_i^0 & \text{if } b_i = 0 \\ W_i^1 & \text{if } b_i = 1 \end{cases}$$
In the following step, the interval $I_{i+1} = [l_{i+1}, l_{i+1} + W_{i+1})$ is further partitioned into two sub-intervals according to the probability distribution of the following symbol to code. In the considered
[Figure: the sequence of nested coding intervals $I_0 = [0, 1)$, $I_1 = [0, 1/3)$, $I_2 = [1/12, 1/3)$, $I_3 = [11/96, 1/3)$, $I_4 = [11/96, 109/768)$.]
Figure 3.1: Example of arithmetic coding: the string [01101] is coded considering the vector of probabilities $p_0 = [1/3, 1/4, 3/8, 1/8, 1/4]$ for the symbol “0”.
example, at each iteration the state of the arithmetic coder can be identified by the couple $(l_i, W_i)$, while the coded string can be specified by a real number $r$ internal to the final interval $[l_4, l_4 + W_4)$. The sequence of states for the arithmetic coder is shown in Fig. 3.1 and reported in Table 3.1.
Iteration   symbol $b_i$   $p_{0,i}$   $[l_i, l_i + W_i)$   state $(l_i, W_i)$
0   “0”   1/3   [0, 1)   (0, 1)
1   “1”   1/4   [0, 1/3)   (0, 1/3)
2   “1”   3/8   [1/12, 1/3)   (1/12, 1/4)
3   “0”   1/8   [11/96, 1/3)   (11/96, 7/32)
4   “1”   1/4   [11/96, 109/768)   (11/96, 7/256)
Table 3.1: Sequence of states for the considered example.
As the number of coded bits increases, the width of the coding interval shrinks at each iteration, requiring a finer resolution to specify the internal real number. In actual implementations,
the resolution is chosen in the design of the coding architecture, and the length of the string that can be coded is derived from it. Whenever the width of the coding interval has reached the smallest allowed size, a rescaling is performed [58]. At the decoder, given the number $r$, it is possible to reconstruct the coded bit stream by repeating the partitioning procedure carried out at the encoder and verifying in which interval the number $r$ lies.¹ Moreover, although during encoding the algorithm generates one codeword for a whole sequence of data, it is possible to implement it with a sequential algorithm that outputs bits whenever possible.
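The interval-update rule above can be followed with exact rational arithmetic. This is a sketch (the function name is ours): it tracks only the couple (l, W) and omits the rescaling and bit-output logic of a real implementation; the first steps reproduce the intervals $I_1 = [0, 1/3)$ and $I_2 = [1/12, 1/3)$ of the example.

```python
from fractions import Fraction as F

def ac_intervals(bits, p0):
    """Track the arithmetic-coding interval [l, l+W) while coding `bits`,
    where p0[i] is the probability of "0" at step i and the lower
    sub-interval is always assigned to the symbol "0"."""
    l, W = F(0), F(1)
    states = [(l, W)]
    for b, p in zip(bits, p0):
        if b == 0:
            W = W * p           # keep the lower ("0") sub-interval
        else:
            l = l + W * p       # skip past the "0" sub-interval
            W = W * (1 - p)
        states.append((l, W))
    return states

states = ac_intervals([0, 1, 1, 0, 1],
                      [F(1, 3), F(1, 4), F(3, 8), F(1, 8), F(1, 4)])
```

By construction, the final width equals the product of the probabilities of the coded symbols, which is why any number inside the final interval needs roughly $-\log_2 W$ bits: the essence of the coding gain.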
With respect to the Huffman code, Arithmetic Coding (AC) codes each symbol with a fractional number of bits, leading to higher efficiency. Indeed, it can be proven to almost reach the best compression ratio possible, i.e. the entropy rate of the source being coded.
Although AC is still very young with respect to other fields of Information Theory, it is already a mature and widely-used coding solution. The attractive coding gains that can be achieved have led to its introduction in most coding standards for video and image compression (see the specifications of the standards JBIG, JBIG2 [42], JPEG [122], JPEG2000 [118, 43], H.263 [45]). However, its complexity still remains an important issue, since it can still be considered computationally prohibitive for devices with a very limited power supply. As a consequence, the most recent video coding standards define two separate entropy coding algorithms: one is based on arithmetic coding while the other is a simpler coder that requires a limited computational load. Among these, we can include the H.264/AVC [127] video coder, which has standardized the non-arithmetic algorithm CAVLC (Context Adaptive Variable Length Coder) [87] and the arithmetic engine CABAC (Context Adaptive Binary Arithmetic Coder) [72]. The efficiency of the latter is one of the main improvements that allow the H.264/AVC coder to outperform all existing standards, reducing the size of the coded bit stream by up to 50%, especially in comparison to MPEG-2 [47]. Its performance is mainly due to a precise context modelling and an efficient binarization strategy, as will be shown in the following sections.
In the CABAC architecture, each syntax element is converted into a variable-length binary string, and each string is coded via a binary arithmetic coder according to the probability of the bit value. The probability is given by the probability distribution function (pdf) associated with a context, which depends on the coded syntax elements and on the position of the binary digit in the string.
¹The procedure is reminiscent of the approximation of a real number through successive partitions of the real interval [0, 1).
Figure 3.2: Scheme of the CABAC coding engine.
Since adaptive algorithms have already proved to be extremely effective for other classes
of codes (see [19]), the same strategy was adopted for the arithmetic architecture. After coding
each binary digit, the pdf is updated in order to adapt the coder statistics to the input signal.
Adopting the same conventions as in [73], in the rest of the chapter we will also refer to the binary elements of strings with the name “bin”, in order to avoid any misunderstanding with the actual bits that are written in the coded bit stream. This choice was motivated by many previous works, which have shown that it is possible to implement a multilevel arithmetic coder by applying a binary arithmetic coder to the bins obtained by binarizing the input symbols. This allows a great simplification in designing the coding architecture, and, by varying the contexts, it is possible to process heterogeneous syntax elements without changing the coder structure.
At the same time, the binarization improves the coding performance while the contexts refine their probability estimates (statistics can change). At the beginning of each slice, contexts are reset in order to make the decoding operation independent from other slices, and therefore the probability values are not yet matched to the input data. This mismatch may reduce the compression performance, mostly whenever the slice size is small. The binarization block (see Fig. 3.2) performs an initial “entropy coding” of the input symbols during the transient period of the probability estimate, improving the final performance.
Finally, the binarization allows an accurate and efficient design of the context structure. An excessive number of contexts can potentially model the probability mass function (pmf) of each syntax element very accurately, provided that the estimated binary probabilities have converged. This convergence can be difficult whenever the input data are diluted across an excessive number of contexts, since this precludes a fast convergence of the estimates. At the same time, a great number of contexts increases the hardware requirements. A prior binarization of the coded symbols helps in detecting the optimal context structure, since those contexts that are related to the least probable bins can be collapsed into a smaller set (see [73]).
Even if the context structure of the CABAC coding engine is well-designed, the probability model for the transform DCT coefficients can be improved. In fact, the context set does not take into account the fact that the amplitudes of coefficients at neighboring frequencies are statistically dependent, as are those of coefficients of neighboring blocks at the same frequency.
This statistical dependence can be represented through a proper probability mass function, which can be schematized through a graphical model [133]. Through this model, it is possible to modify the structure of the CABAC coder, estimating the probability of each bin through a Sum-Product approach [51, 68]. Then, the estimated probability value is used to select the state of the binary coder implemented in the CABAC coding engine.
The chapter is structured as follows. Section 3.2 gives a brief overview of the CABAC coder. Section 3.3 presents the adopted graphical model and how it was implemented. Section 3.4 reports the details of the modified arithmetic coder. Section 3.5 reports the experimental results obtained on a set of different video sequences.
3.2 The Context Adaptive Binary Arithmetic Coder (CABAC)
The H.264/AVC standard includes two different entropy coding algorithms. The first one is a Context-Adaptive Variable Length Code (CAVLC) that uses a fixed VLC table for all the syntax elements except for the transform coefficients, which are coded by adaptively choosing a VLC table among a set of different possible coding tables. The second entropy coder defined within the H.264/AVC standard specification [87] is a Context-Adaptive Binary Arithmetic Coder (CABAC) [73], schematized in Fig. 3.2, which allows a bit stream size typically 10% smaller with respect to CAVLC (see Section 2.2.4).
The encoding process can be specified in three different stages:
1. binarization;
2. context modeling;
3. binary arithmetic coding.
In the first step, a given non-binary valued syntax element is uniquely mapped to a variable-length sequence of bins (called a bin-string). The only exception is the coding of a binary value: in this case no conversion is needed, and the binarization step is bypassed (see Fig. 3.2). In this way, the input symbols for the arithmetic coder are always binary values, independently of the characteristics of the syntax elements. For each binary element, one or two subsequent
steps may follow depending on the coding mode. In the so-called regular coding mode, prior
to the actual arithmetic coding process, the given binary digit (bin) enters the context modeling
stage. According to the syntax element it belongs to, a context and its related probability model
are selected and the bin probability is computed. Then the bin value, along with its associated
probability, is sent to the binary arithmetic coding engine, which will map them into an interval.
After the coding operation, the encoder updates the probability model for the current context.
In the following paragraphs we will focus on the coding operations related to the transform coefficients.
The coding of DCT data is characterized by the following distinct features:
• a one-bit symbol notifies the occurrence of nonzero transform coefficients in the current block, following the reverse scanning order;
• in case the coefficient is different from zero, an additional couple of bits codes the sign of the coefficient and the flag that indicates whether it is the last coefficient different from zero or not;
• then the non-zero levels are coded, assigning a context to each one of them according to the number of previously transmitted nonzero levels within the reverse scanning path.
The binary coding engine can be modeled via a Finite State Machine (FSM) with 64 states,
where each state identifies the probability of the least probable symbol (LPS), i.e. the least
probable binary value. A memory table maps each state into a width value for the LPS interval
according to the width of the whole interval. In the same way, the transition from one state to
another is driven by the correspondence between the coded bit value and the most probable one
(MPS). Fig. 3.3 reports a section of the whole FSM with the corresponding state transitions.
Each state transition also depends on the state value of the current context, and the admitted

Figure 3.3: Structure of the Finite State Machine related to the CABAC coder. The dashed line reports the state transitions for the standard CABAC coding engine. The solid lines refer to the DAG-based version of CABAC.

transitions are specified through a matrix. After a varying number of coding steps, the length
of the coding interval is rescaled in order to keep it greater than a quarter of the full resolution
(which is equal to 2^10 = 1024). Each rescaling operation increases the number of bits to be
written in the bit stream, and therefore the number of rescalings must be limited in order to
keep the bit stream as small as possible. To this purpose, the binary strings that represent the
non-null coefficients are coded vertically instead of horizontally (bit plane by bit plane). In this
way the interval shrinking is mitigated, since the binary strings associated with the absolute values
of levels present a high occurrence of “1”s and most of the time the coding interval is mapped
into the MPS sub-interval, i.e. the larger one.
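The state machine can be sketched in a few lines of Python. The LPS probabilities follow the geometric law of the H.264/AVC design (eq. (3.14) below); the transition rule here is a deliberately simplified stand-in, since the real CABAC uses fixed look-up tables rather than a constant LPS step.

```python
# LPS probability attached to each of the 64 states, following the
# geometric law p(state) = 0.5 * alpha**state used in H.264/AVC.
ALPHA = (0.01875 / 0.5) ** (1.0 / 63.0)
P_LPS = [0.5 * ALPHA ** s for s in range(64)]

def next_state(state, coded_is_mps):
    """Simplified transition: one step towards 63 on an MPS, a larger
    step back towards 0 on an LPS.  The step size of 2 is illustrative;
    the standard uses a table where the LPS step varies between 1 and 3."""
    if coded_is_mps:
        return min(state + 1, 62)   # the last state is reserved in CABAC
    return max(state - 2, 0)
```

Higher states thus correspond to more skewed distributions, and coding an LPS pushes the estimate back towards equiprobability faster than an MPS sharpens it.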
Note that the convergence speed of the probability estimate is limited by the allowed state
transitions, and the coder must process a certain amount of data before obtaining a reliable pmf
for each context. This problem can be solved by initializing the coding contexts in different
ways according to the characteristics of the data that have to be processed. For example, since
quantized coefficients are partially binarized using a unary code, the initial probability mass
functions associated with the contexts for the coefficients have a mean greater than 0.5. The
following subsection will give a more detailed insight on how the absolute values of DCT
coefficients are converted into binary strings and how contexts are assigned to each bin.
3.2. The Context Adaptive Binary Arithmetic Coder (CABAC) 27
3.2.1 Binarization and context modelling for the absolute values of non-zero coefficients
Figure 3.4: Scheme of contexts for the absolute values of coefficients.
Analyzing the CABAC routines that code the residual data, it is possible to identify two differ-
ent phases: the coding of the positions of non-zero coefficients (called significance map) and
the coding of their values. Our investigation focused on the latter phase of the process, and
since the signs of non-null coefficients are coded using a non-adaptive uniform pmf, the next
section will focus on the binarization and the context modelling for the absolute values of
coefficients, as it proves to be the most interesting aspect.
The occurrence of non-null coefficients with an absolute value equal to 1 (called ones) is
frequent,² and therefore the CABAC coder specifies a separate significance map for coeffi-
cients that are equal to 1. A binary value signals whether the current non-zero coefficient is
a one or not, and 4 possible contexts can be associated with it. For the first three coefficients in the
reverse scanning order that occur before the first absolute value greater than one, the context
modeller associates three separate contexts. The remaining bins are coded using a fourth binary
context, which eventually models the occurrence probability of ones at the lowest frequencies.
Each of the remaining absolute values is then binarized using a unary/Exp-Golomb code,
which corresponds to a unary code eventually followed by an Exp-Golomb code, and the coder
assigns to each of them a single separate context up to a fixed limit that depends on the
block type (e.g. 5 for inter blocks) in order to keep the number of contexts as low as possible.
The remaining values are assigned to the final binary contexts, and each absolute value is then
coded vertically using the binary context that has been assigned. Considering that the adopted
binarization is a unary code most of the time, each context models the average length of the
DCT coefficients assigned to it. In this way, the needs for both a good modelization
and low complexity are met, but this choice is paid for with a lower compression gain, since the
modelling of the probability may turn out to be inadequate. Note that the context assignment does not
take into account the spatial frequency of the coefficients but only their ordering in the scanning
process, and sometimes this causes statistically heterogeneous data to be mapped into the same
probability model.
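The binarization itself can be sketched as follows, based on the description given later in this chapter: a “greater than one” flag, then a unary string of x − 2 ones closed by a 0, switching to an Exp-Golomb suffix past the cutoff. The cutoff value and the standard leading-zero form of the order-0 Exp-Golomb code are assumptions of this sketch; the normative UEG0 binarization of the standard differs in some details.

```python
def exp_golomb0(v):
    """Order-0 Exp-Golomb code of a non-negative integer (standard form:
    leading zeros, then the binary representation of v + 1)."""
    u = bin(v + 1)[2:]
    return "0" * (len(u) - 1) + u

def binarize_abs_level(x, cutoff=13):
    """Binarize an absolute level x >= 1 as described in the text:
    a 'greater than one' flag, then a unary string of x - 2 ones closed
    by a 0, switching to Exp-Golomb past the cutoff (assumed here)."""
    assert x >= 1
    if x == 1:
        return "0"                      # flag: not greater than one
    v = x - 2
    if v < cutoff:
        return "1" + "1" * v + "0"      # flag + unary part
    return "1" + "1" * cutoff + exp_golomb0(v - cutoff)
```

Note how small levels map to short, “1”-heavy strings, which is exactly the bias the vertical coding of contexts exploits.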
²Transform coefficients have a symmetric pdf monotonically decreasing for positive values (see Section 4.3).

Despite its limitations, the CABAC architecture proves to be quite efficient with respect
to the previous coding engines, since it significantly contributes to improving the coding perfor-
mance of the H.264/AVC coder with respect to the previous standards. However, the performance
can be significantly improved considering that each coefficient is statistically dependent on the
neighboring ones. This dependence can be used to refine the statistical model of the original
CABAC architecture.
3.3 Modeling the contexts using a graph
The basic idea that lies beneath transform coding is to reduce the correlation among different
samples of the signal to be coded. Adopting the Karhunen-Loève (KL) transform, for a Gaus-
sian source each transform coefficient is independent of the other ones, since it is computed by
projecting the original signal onto a different vector of an ad-hoc orthogonal basis. In
addition, it is possible to reproduce the original signal once the transform coefficients (and the
relative basis) are known. In this way the signal can be efficiently coded by specifying its
KL coefficients, since the intrinsic redundancy of the original data has been removed. Un-
fortunately, the KL transform must be adapted to the input signal and sent to the decoder in
order to convert the decoded coefficients into the original signal domain. In addition, the estimate
of the optimal KL transform is a computationally-expensive task and needs to be computed
frequently in order to match the varying statistics of the input signal. In most practical ap-
proaches, transform coders resort to the Discrete Cosine Transform (DCT), since it has good
decorrelating properties for correlated signals and can be efficiently implemented with
integer arithmetic. Previous works have shown that the compression performance of the DCT de-
pends on the transform size, and most video coding standards adopt an 8 × 8 DCT since
it provides a good trade-off between compression gain and low computational complexity. As
mentioned in Chapter 2, H.264/AVC resorts to a sub-optimal 4 × 4 transform obtained
from a DCT (see [35, 36]) that does not require multiplications and whose lower performance
can be compensated by an efficient prediction mechanism (see [14]).
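The multiplication-free property comes from the small integer entries of the core transform matrix. The sketch below applies the well-known 4 × 4 integer core transform of H.264/AVC as Y = C X Cᵀ; an actual implementation would replace the matrix products with additions, subtractions and shifts.

```python
# Integer core transform used by H.264/AVC in place of a true 4x4 DCT.
C = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_transform(x):
    """Y = C X C^T: the multiplication-free 4x4 core transform
    (scaling is folded into quantization and is omitted here)."""
    ct = [[C[j][i] for j in range(4)] for i in range(4)]
    return matmul(matmul(C, x), ct)
```

For a constant residual block all the energy collapses into the DC coefficient, as one expects from a DCT-like basis.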
The choice of a smaller transform size has some drawbacks on the corresponding signal
basis that is used to decompose each 4 × 4 block, since the coding efficiency of the 4 × 4 DCT
is lower with respect to its 8 × 8 and 16 × 16 versions. In this way, the statistics of the

Figure 3.5: Directed Acyclic Graph that models the statistical dependencies between the coefficients in a transform block: (a) dependencies among coefficients; (b) DAG model over the nodes X0, . . . , X15.
current coefficients partly depend on the statistics of the neighboring coefficients whenever
the Manhattan distance between them is lower than two. The statistical dependence between
coefficients at greater Manhattan distances is much lower, and therefore its influence on the
resulting pdf is less significant. A similar relation was found for other types of transforms (see
[8, 65]), and it can be related to the intrinsic statistical dependence between the energy levels
of the signal at neighboring bands. Therefore, we can model this relation with the Directed
Acyclic Graph (DAG) G reported in Fig. 3.5(b). The DAG G can be specified by a
pair of sets G = (V, E), where V denotes the set of nodes (corresponding in this case to
the coefficients) and E is the set of edges (corresponding to the conditioned probabilities).
The model connects each coefficient with the coefficients lying on its left and above, since
we can assume that the correlation is horizontally and vertically oriented. In addition to the

Figure 3.6: Dependencies between the coefficients in a macroblock.

dependence among coefficients of the same block, there is also a statistical dependence among
coefficients belonging to neighboring transform blocks. This relation is partially used by the
CAVLC coder when coding the number of coefficients different from zero in each block. The
number of non-zero levels in the current block is predicted by averaging the number of non-zero
levels in the upper one and the left one [98]. Once again this can be justified by the transform
size: since the blocks are small, some features of the image are correlated for neighboring
blocks (e.g. the value of the DC coefficient or the coefficients at certain frequencies). Note
that for predicted frames this relation also depends on the different performance of prediction
on distinct blocks. For example, in case the motion estimation finds a good estimate for a block
and bad ones for its neighbors, the resulting coefficients are very poorly correlated, since in the
first block the residual signal is nearly AWGN while the residual signal of the neighboring block
is highly correlated with the original image. As a consequence, statistical dependence arises
whenever the temporal predictions performed on adjacent blocks are correlated. In this case,
the same model of Figure 3.5(b) can be applied considering groups of 16 4 × 4 blocks, and we
can associate a separate DAG to each frequency in the transformed signal (see Fig. 3.6). The
choice of considering a 4 × 4 group of 4 × 4 blocks was made in order to capture the statistical
dependencies within a macroblock.
According to the statistical dependences modelled by the graph in Figure 3.5(b), the joint
probability mass function (pmf) for a block of coefficients (or for a grid of coefficients at the
same position in different transform blocks) can be factorized into conditional pmfs as follows
\[
\begin{aligned}
p(\mathbf{x}) ={}& p(x_0)\cdot p(x_1,x_4/x_0)\cdot p(x_2,x_5,x_8/x_1,x_4)\cdot p(x_3,x_6,x_9,x_{12}/x_2,x_5,x_8)\\
&\cdot p(x_7,x_{10},x_{13}/x_3,x_6,x_9,x_{12})\cdot p(x_{11},x_{14}/x_7,x_{10},x_{13})\cdot p(x_{15}/x_{11},x_{14})\\
={}& p(x_0)\cdot\prod_{s\in V,\, s\neq 0} p(x_s/x_{\pi_s})
\end{aligned}
\tag{3.1}
\]
where x is the vector that reports the value of each coefficient x_i, i ∈ V. The set π_s contains
the coefficients adjacent to s
\[
\pi_s = \{t \in V : (t, s) \in E\} = \{x_{s,A}, x_{s,B}\}
\tag{3.2}
\]
where x_{s,A} and x_{s,B} are respectively the upper and the left levels for the current coefficient.
In case one or both of the adjacent coefficients are not available, we assume x_{s,A} and x_{s,B}
undefined.
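The parent set π_s of eq. (3.2) is easy to enumerate for the 4 × 4 grid of Fig. 3.5(b); the little helper below (an illustrative sketch, not thesis code) returns the available predecessors of each node.

```python
def parents(s):
    """Parent set pi_s for node s of the 4x4 DAG of Fig. 3.5(b):
    the coefficient above (s - 4) and to the left (s - 1), when they
    exist.  Border nodes get fewer parents; the root gets none."""
    row, col = divmod(s, 4)
    pi = []
    if row > 0:
        pi.append(s - 4)   # upper neighbour x_{s,A}
    if col > 0:
        pi.append(s - 1)   # left neighbour x_{s,B}
    return pi
```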
The factorization is possible since each pair of variables that have a common parent is
conditionally independent with respect to the parent itself. As a consequence, it is trivial to
verify that all the nodes lying on each diagonal are conditionally independent given the nodes
on the previous diagonal. The probabilistic relations expressed by the DAG G can also be
applied to the bit planes that are obtained by slicing horizontally the binary representation of the
block of coefficients.
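The horizontal slicing into bit planes can be sketched directly from the decomposition of eq. (3.3) below (illustrative code, assuming 16-bit non-negative levels):

```python
def bit_plane(block, k):
    """Extract the k-th bit plane b^k from a list of 16 absolute values,
    following the decomposition x_s = sum_k b^k_s 2^k."""
    return [(x >> k) & 1 for x in block]

def reconstruct(planes):
    """Invert the decomposition: sum the planes weighted by 2^k."""
    return [sum(b[s] << k for k, b in enumerate(planes))
            for s in range(len(planes[0]))]
```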
It can be seen that the bits b^k_s of the k-th bit plane, where
\[
x_s = \sum_{k=0}^{15} b^k_s\, 2^k,
\tag{3.3}
\]
are related according to the equation
\[
p(\mathbf{b}^k) = p(b^k_0)\cdot\prod_{s\in V,\, s\neq 0} p(b^k_s / b^k_{\pi_s}),
\tag{3.4}
\]
with
\[
p(b^k_s / b^k_{s,u}) = \sum_{\substack{x_s = 0:\\ \text{the $k$-th bit of $x_s$ is $b^k_s$}}}^{2^{15}}\
\sum_{\substack{x_{s,u} = 0:\\ \text{the $k$-th bit of $x_{s,u}$ is $b^k_{s,u}$}}}^{2^{15}}
p(x_s / x_{s,u})\, p(x_{s,u}),
\qquad u \in \pi_s,\ k = 0,\dots,15
\tag{3.5}
\]
(see [68]).³ In this way, we obtain more than one DAG that can be modelled using an Ising

³Note that in the previous equations we assume that the maximum number of bit planes is 16. The transform reported in Section 2.2.3 is applied to 4 × 4 residual blocks of 9-bit samples (8 bits for the original sample plus one bit because of prediction), and the maximum amplification performed by the 4 × 4 transform is 36 (corresponding to 5.17 bits). Therefore, 14.17 bits suffice to represent a transform coefficient, which is rounded up to 16 bits in the arithmetic of the H.264/AVC coder. For further details, see [35, 36].
model. This probability structure was first introduced by Lenz and Ising in the early 1920s in
the field of ferromagnetism [86]. The model has been widely applied to describe cooperative
phenomena, and more recently, it has been intensively adopted in statistical image processing
for different applications (see [116, 84]). Omitting the index of the bit plane k, it is possible to
rewrite the pmf reported in eq. (3.4) as
\[
\begin{aligned}
p(\mathbf{b}) &= p(b_0)\cdot\prod_{s=1}^{15} p(b_s/\pi_s)\\
&= \exp\log p(b_0)\cdot \exp\sum_{s=1}^{15}\log p(b_s/\pi_s)\\
&= \exp\bigl\{\theta^0_1\, b_0 + \theta^0_0\,(1-b_0)\bigr\}\cdot
\exp\Biggl\{\sum_{s=1}^{15}\ \sum_{i,j,z=0}^{1}\theta^{sAB}_{ijz}\,\psi^{sAB}_{ijz}(b_s, b_{s,A}, b_{s,B})\Biggr\}
\end{aligned}
\tag{3.6}
\]
where
\[
\theta^0_i = \log p(b_0 = i),
\qquad
\theta^{sAB}_{ijz} = \log p(b_s = i / b_{s,A} = j,\, b_{s,B} = z)
\tag{3.7}
\]
and the sufficient statistics are
\[
\begin{aligned}
\psi^a_1(b_0) &= b_0\\
\psi^a_0(b_0) &= (1-b_0)\\
\psi^{sAB}_{000}(b_s, b_{s,A}, b_{s,B}) &= (1-b_s)(1-b_{s,A})(1-b_{s,B})\\
\psi^{sAB}_{001}(b_s, b_{s,A}, b_{s,B}) &= (1-b_s)(1-b_{s,A})\, b_{s,B}\\
\psi^{sAB}_{010}(b_s, b_{s,A}, b_{s,B}) &= (1-b_s)\, b_{s,A}\,(1-b_{s,B})\\
\psi^{sAB}_{011}(b_s, b_{s,A}, b_{s,B}) &= (1-b_s)\, b_{s,A}\, b_{s,B}\\
\psi^{sAB}_{100}(b_s, b_{s,A}, b_{s,B}) &= b_s\,(1-b_{s,A})(1-b_{s,B})\\
\psi^{sAB}_{101}(b_s, b_{s,A}, b_{s,B}) &= b_s\,(1-b_{s,A})\, b_{s,B}\\
\psi^{sAB}_{110}(b_s, b_{s,A}, b_{s,B}) &= b_s\, b_{s,A}\,(1-b_{s,B})\\
\psi^{sAB}_{111}(b_s, b_{s,A}, b_{s,B}) &= b_s\, b_{s,A}\, b_{s,B}
\end{aligned}
\tag{3.8}
\]
(see [120, 121]).
Given a set of observations {x^1, x^2, . . . , x^M}, where x^j = [x^j_s]_{s∈V} ∈ Z^n, ∀j = 1, . . . , M,
with n = |V|, it is possible to extract the sets
\[
\{\mathbf{b}^{1,k}, \mathbf{b}^{2,k}, \dots, \mathbf{b}^{M,k}\}
\ \text{that include the vectors}\
\mathbf{b}^{j,k} = \bigl[b^{j,k}_s\bigr]_{s\in V}
\tag{3.9}
\]
where b^{j,k}_s is the k-th bit of x^j_s.
Omitting the bit plane index k, it is easy to check that the log-ML estimates of the moments
μ^{sAB}_{ijz} and μ^a_i
\[
\begin{aligned}
\mu^{sAB}_{ijz} &= \arg\max_{\mu^{sAB}_{ijz}} \frac{1}{M}\sum_{k=0}^{M-1}\log p\bigl(\mathbf{b}^k/\mu^{sAB}_{ijz}\bigr)\\
\mu^a_i &= \arg\max_{\mu^a_i} \frac{1}{M}\sum_{k=0}^{M-1}\log p\bigl(\mathbf{b}^k/\mu^a_i\bigr)
\end{aligned}
\qquad i, j, z = 0, 1
\tag{3.10}
\]
are
\[
\begin{aligned}
\mu^{sAB}_{ijz} &= E\bigl[\psi^{sAB}_{ijz}\bigr] = p(b_s = i / b_{s,A} = j,\, b_{s,B} = z)
= \frac{1}{M}\sum_{k=0}^{M-1}\psi^{sAB}_{ijz}\bigl(b^k_s, b^k_{s,A}, b^k_{s,B}\bigr)\\
\mu^a_i &= E[\psi^a_i] = p(b_0 = i) = \frac{1}{M}\sum_{k=0}^{M-1}\psi^a_i\bigl(b^k_0\bigr)
\end{aligned}
\qquad i, j, z = 0, 1
\tag{3.11}
\]
(see [51]).
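Eq. (3.11) amounts to counting: each moment is the empirical frequency of its indicator ψ over the observed triples. A minimal Python sketch (illustrative, not thesis code):

```python
from itertools import product

def psi(i, j, z, bs, ba, bb):
    """Sufficient statistic of eq. (3.8): indicator that the triple
    (b_s, b_{s,A}, b_{s,B}) equals (i, j, z)."""
    return int(bs == i and ba == j and bb == z)

def estimate_moments(samples):
    """ML estimate of eq. (3.11): empirical average of the indicators
    over M observed triples (b_s, b_{s,A}, b_{s,B})."""
    m = len(samples)
    return {(i, j, z): sum(psi(i, j, z, *t) for t in samples) / m
            for i, j, z in product((0, 1), repeat=3)}
```

Since the indicators partition the sample space, the estimated moments sum to one, which is exactly the normalizing condition stated next.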
Note that in this case the normalizing conditions are
\[
\sum_{i=0}^{1}\mu^{sAB}_{ijz} = 1,
\qquad
\sum_{i=0}^{1}\mu^a_i = 1.
\tag{3.12}
\]
The application of the Ising model to each bit plane of coefficient blocks proves to be an effi-
cient solution, since it simplifies the equations for the log-ML estimate. However, the CABAC
coding engine codes each coefficient vertically, since the high occurrence of “1”s improves the
coding performance. Therefore, it is possible to apply the DAG model in a different way. In
fact, as the CABAC coder associates only one binary context per coefficient, it is possible to do
the same with the DAGs, using only one graph to model the statistical relation between the co-
efficients. Remember that since the coded binary strings belong to a unary code, each context
models the average value of each coefficient, which corresponds to the average value of each
non-zero level. Therefore, the Ising model represents the relation between the average values
of coefficients placed in different spatial positions.
The following section will show how the binary model can be used in the arithmetic coder.
3.4 A Sum-Product based arithmetic coder
The previous section has proposed a probability model that can be used to characterize the
probability of the different bit planes for a block of transformed coefficients. Therefore, an
interesting application to investigate is its inclusion into a binary arithmetic coder.
In the CABAC coder, the probability of each binary value is associated with the state of
a Finite State Machine, and the transition from one state to another is fixed by a transition
matrix. One of the disadvantages of this model is that the probability is correctly estimated
only after coding a certain amount of data, since the convergence speed of the probability is limited
by the transitions allowed from each state. In addition, the statistics estimate does not take into
account either the position of the DCT coefficient in the transform block or the values of the
neighboring pixels, but it performs a simple estimation of the probability for each bit.
Graphical models allow a better probability estimation using a Sum-Product algorithm
along the edges of the DAG structure. In the following subsections, the whole encoding process
is presented.
3.4.1 Probability modelling through DAGs
At first, the encoder creates a 4 × 4 matrix of coefficients either belonging to the current block
or positioned at the same frequencies in different neighboring blocks. In this work, the first
approach will be denoted as DAGB (DAG on a Block), while the second will be called DAGMB
(DAG on a MacroBlock). In both cases, each coefficient is considered as a node in a graph
structure that models the statistical relations with its neighbors and allows the estimate of a
probability model.
In our first approach (see [78]), we associate a distinct binary DAG with each bit plane in
order to model the statistical dependence among the bits. No specific binarization is applied in
this case, and the coded binary strings consist in the binary representations of the coefficients. The
most significant bit planes are not coded using the DAG model. In fact, the bins to code are
few and sparse, since the number of high-energy coefficients is low. Therefore, there is no need
to apply the DAG model to these bits, because it would provide only a small improvement. In
the implemented algorithm, only the 5 least significant bit planes were coded using the DAG
scheme, while the remaining bits were coded using only one probability model per bit level,
since increasing the number of DAG-modelled bit planes does not improve the performance.
This distinction can be schematized as in Fig. 3.7. This model is able to estimate the statistics
for each bit plane but proves to be computationally demanding, since the estimate of the bit prob-
ability must be repeated for each bit plane. At the same time, the storage of the conditional
probabilities requires a great amount of memory, since their values may change according to
the significance of the bits. Therefore, a better implementation can be obtained by considering the
binarization performed on the transform coefficients.
The CABAC implementation defined by the H.264/AVC standard converts the absolute
values x_k of the transform coefficients into VLC strings using a unary code followed by an Exp-
Golomb code in case x_k − 2 > 12. At first the encoder signals whether x_k is greater than one
or not. In case it is, the binarization unit specifies a binary string of x_k − 2 digits equal to 1
followed by an ending digit equal to 0. In case the value of x_k − 2 is greater than 12, the first
13 "1" digits are followed by an Exp-Golomb code, as specified in [73]. Therefore, for
a given binarized coefficient, the probability that a binary digit is equal to one turns out to be deeply
related to the expected value of that coefficient (at least for levels lower than 13). In
this way it is possible to reduce the multiple DAGs in Figure 3.7 into only one graph. Note
that in this case the graph models the relation between the probabilities P(b_s = 1) at different
spatial frequencies. In this way, the moments of the Ising model characterize the statistical
Figure 3.7: Distinction between bit planes coded using the DAG probability model and bit planes coded using the traditional CABAC scheme. In the depicted example, the 3 least significant bit planes were coded using the DAG model, while the upper bits were coded using one context per bit plane.
dependence between the average absolute values of coefficients placed at different positions in
the graph.
Coding operations can be divided into two steps: the estimation of the probability for the
current bit and its arithmetic coding.
3.4.2 Estimation of the bit probability
Given the bit plane b made of the bits b_i, with i = 0, 1, . . . , 15, we associate to b the proba-
bility mass function p(b), which can be factorized as reported in eq. (3.1). The sum-product
algorithm is run from position (0, 0), and scans the bits according to the zig-zag path defined
by the H.264 standard. The zig-zag ordering was chosen since it is the scanning order of the
DCT coefficients, and it examines the low-frequency coefficients first.⁴
After zig-zag scanning, the sequence of bits can be represented by a vector b = [b_0, b_1, . . . , b_15],
and the sum-product algorithm (see [68, 121]) is run following this ordering. For each node b_s,
s = 0, . . . , 15, the algorithm stores the probability value p(b_s), which is computed as
\[
\begin{aligned}
p(b_s = i) &= \sum_{j,z=0}^{1}\exp\bigl(\theta^{sAB}_{ijz}\bigr)\cdot p(b_{s,A} = j)\, p(b_{s,B} = z)\\
p(b_0 = i) &= \exp\bigl(\theta^0_i\bigr)
\end{aligned}
\tag{3.13}
\]
where s = 1, . . . , 15. In case the bit b_s lies on the borders of the graph, only one predecessor
will affect its value.
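The recursion of eq. (3.13) can be sketched as follows. For simplicity this sketch shares one conditional table among all nodes, runs in raster order (the footnote notes any causal path works), and replaces a missing parent with a degenerate marginal fixed at 0 — all assumptions of the sketch, since the text only states that one predecessor is used at the borders.

```python
def parents_4x4(s):
    """Upper and left predecessors of node s in the 4x4 grid."""
    row, col = divmod(s, 4)
    return ([s - 4] if row > 0 else []) + ([s - 1] if col > 0 else [])

def propagate(cond, p_root):
    """One sweep of eq. (3.13) over the 16 nodes in a causal order.

    cond[i][(j, z)] = p(b_s = i | b_A = j, b_B = z), shared by all
    nodes here (the real scheme keeps per-context tables)."""
    p = {0: p_root}                       # p[s] = marginal [p0, p1] of node s
    for s in range(1, 16):
        pa = parents_4x4(s)
        pA = p[pa[0]] if len(pa) > 0 else [1.0, 0.0]   # degenerate if absent
        pB = p[pa[1]] if len(pa) > 1 else [1.0, 0.0]
        p[s] = [sum(cond[i][(j, z)] * pA[j] * pB[z]
                    for j in (0, 1) for z in (0, 1)) for i in (0, 1)]
    return p
```

Because every conditional table and every parent marginal is normalized, each propagated marginal sums to one by construction.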
In our implementation we first used a floating point estimate for the probability p(b_t) related

⁴The algorithm could be run using an arbitrary causal path, e.g. a raster scan.
to the t-th bit of the bit plane, which was used to initialize the context before coding the current bit.
In the on-line estimation of the probabilities, eq. (3.13) is modified using a recursive equation
(as reported in the following section).
A computationally-efficient implementation requires the adoption of fixed point arithmetic
in the DAG modelization. Following the same approximations of the CABAC algorithm,
it was possible to implement the whole architecture in fixed point arithmetic, associating the
conditional probabilities with a set of binary contexts. At the beginning of each slice, contexts
are initialized according to the corresponding conditional probabilities, which have been es-
timated from a training sequence. Throughout the coding process, the probability estimation
and propagation is performed using the same FSM structure adopted by the CABAC coder (and
reported in [73]). Experimental results prove that the approximation is feasible in this case too,
since the coding performance is not significantly affected.
Probability estimation significantly affects the final performance of the arithmetic coder. As
reported in the first sections of this chapter, the estimated probability value is associated
with the width of the coding interval. In the CABAC algorithm, a rescaling is needed whenever
the width of the coding interval is lower than one fourth of the full resolution, and it can be
described as follows
while (range < QUARTER)
{
    if (low >= HALF)
    {
        write_one_bit(1);      /* interval lies in the upper half */
        low -= HALF;
    }
    else if (low < QUARTER)
    {
        write_one_bit(0);      /* interval lies in the lower half */
    }
    else
    {
        Ebits_to_follow++;     /* interval straddles the midpoint:
                                  bit to be written later */
        low -= QUARTER;
    }
    low <<= 1;                 /* expand the interval by a factor of 2 */
    range <<= 1;
}
where low is the lower bound of the coding interval and range its width. Ebits_to_follow
reports the number of bits that have to be written in addition to the current one whenever
write_one_bit() is called. It is possible to notice that the number of bits written in the
bit stream is strictly dependent on the number of rescaling operations, i.e. on the number of
processed binary symbols before an expansion of the coding interval is done. Whenever the
bit source presents a highly biased probability distribution (like the output of a unary code), it
is possible to code a high number of binary symbols before a rescaling operation, in case the
estimated pmf is close enough to the real one. Hence, the shrinking of the coding interval is
limited, since most of the time the coded binary digit equals the MPS and its corresponding
sub-interval is large. Therefore, a good probability estimate “delays” the rescaling (i.e. adding
bits to the stream), improving the compression efficiency.
The following section will show how this probability estimate makes it possible to improve
the performance with respect to the simple context update and initialization of CABAC. This
estimation consists of two crucial phases: the context initialization and the probability update
during the coding process.
3.4.3 Context initialization
In the binary arithmetic coder, the FSM is set to a state where the dimension of the
interval is proportional to the estimated probability. In the original version of CABAC, the
state depends on the previous state and on the binary symbol that has just been coded. In fact,
in case the coded bit equals the most probable bit for the current context, the state index is
increased by one. In case the bit equals the least probable one, the state index is decreased by
an integer that may vary in the range [1, 3] according to its current value (Figure 3.3 reports a
graphical representation of the algorithm). In the modified version, the probability is computed
independently of the state probability of the previous iteration. In order to reuse the quan-
tized intervals of CABAC, we need a rule to map a probability value to the inner state of the FSM.
In the standardization process of H.264/AVC, the FSM states [73] were computed considering the
probability values
\[
p(\mathit{state}) = p_0\,\alpha^{\mathit{state}}
\tag{3.14}
\]
where p_0 = 0.5, state = 0, . . . , 63, and α = (0.01875/0.5)^{1/63}, in order to include the p(state)
values in the interval [0.01875, 0.5]. According to eq. (3.14), it is possible to map the probabil-
ity value p(b_t) to the state state_init
\[
\mathit{state\_init} = \bigl\lfloor \log_{\alpha}\bigl(p(b_t)/0.5\bigr) \bigr\rfloor
\tag{3.15}
\]
as Fig. 3.8 depicts. This initialization of the contexts proves to be really effective in terms of
coding performance.
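A numeric sketch of this mapping, under the reading of eq. (3.15) as the inversion of eq. (3.14) with clamping to the valid state range (an assumption of this sketch):

```python
import math

P0 = 0.5
ALPHA = (0.01875 / 0.5) ** (1.0 / 63.0)

def state_probability(state):
    """Eq. (3.14): LPS probability attached to an FSM state."""
    return P0 * ALPHA ** state

def initial_state(p):
    """Invert eq. (3.14) to map an estimated probability p <= 0.5 to an
    FSM state, clamped to the valid range [0, 63] (clamping assumed)."""
    state = int(math.floor(math.log(p / P0, ALPHA)))
    return max(0, min(63, state))
```

Smaller probabilities map to higher state indices, matching the geometric spacing of the quantized intervals.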
A wrong initialization requires a certain number of adapting iterations before the state of the
arithmetic coder fits the statistics of the coded data. During this transient period, probabilities
are mapped into intervals of inappropriate widths, which may be too wide or too narrow.
As a drawback, the number of rescalings may increase during the encoding process, leading to an
excessive number of bits written in the final bit stream.
This effect is particularly evident in the statistics update of the binary contexts, which must
closely track the statistics of the input data. The following section provides a detailed description of the
statistics update process.
Figure 3.8: Structure of the modified Finite State Machine in the new arithmetic coder.
3.4.4 Statistics update
In the floating point approach, the update operation was performed via the moment estimate
\[
\begin{aligned}
\mu^0_i &\leftarrow \alpha\cdot\mu^0_i + (1-\alpha)\cdot\psi^0_i(b_0)\\
\mu^{sAB}_{ijz} &\leftarrow \alpha\cdot\mu^{sAB}_{ijz} + (1-\alpha)\cdot\psi^{sAB}_{ijz}(b_s, b_{s,A}, b_{s,B})
\end{aligned}
\qquad i, j, z = 0, 1,\quad s = 1, \dots, 15
\tag{3.16}
\]
where b_s is the actual value of the coded bit. Equation (3.16) implements the same MAP estimate
of eq. (3.11) using a recursive average. After updating the moment, the context
state is reinitialized using eq. (3.15) in order to translate a floating point probability value into
the arithmetic of the binary coder.
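The recursive average of eq. (3.16) is a one-line exponential forgetting update; a small illustrative sketch (the forgetting factor 0.95 is a hypothetical value, not taken from the thesis):

```python
def update_moment(mu, indicator, alpha=0.95):
    """Recursive average of eq. (3.16): the moment drifts towards the
    latest observed sufficient statistic with forgetting factor alpha."""
    return alpha * mu + (1.0 - alpha) * indicator

# On a run of identical observations the moment converges to their value.
mu = 0.5
for b in [1] * 20:
    mu = update_moment(mu, b)
```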
In the fixed point implementation, the context update corresponds to the one performed
by the standard CABAC encoder. Each conditional probability is updated using the FSM ap-
proximation defined in the CABAC. In this case, the DAG-based probability estimate changes the
initial state of the context, but it does not affect its evolution while coding the current coefficient.
The computational complexity is significantly reduced and equals the one required by the origi-
nal CABAC algorithm in terms of coding and context updating operations. However, in our
case the number of adopted contexts is larger, requiring more memory to store their
binary pmfs.
3.4.5 Reduction of the number of contexts
The approach described so far involves a significant number of conditional probabilities (i.e. cod-
ing contexts) to be modelled. As a drawback, a high number of coding contexts dilutes the
number of statistical samples that are used to estimate the binary pdf functions, and the re-
quired memory area increases. Therefore, in order to make this approach feasible, the number
of coding contexts is reduced by adopting some approximations.
For each coded coefficient, the DAG-based model chooses among four different contexts
depending on the values of the upper and left pixels. Therefore, in case we are modelling
the probabilities taking advantage of the dependencies among coefficients in the same block
(block-based DAG), the number of required contexts is 4 · 16 = 64, where we are using
different contexts for different spatial frequencies. In the other case, the number increases up
to 4 · 16 · 16 = 1024, since we must take into consideration which 4 × 4 block of the macroblock
we are considering. These numbers can be easily reduced using some approximations.
A first approximation regards the position of the current coefficients in the transform block.
An accurate model must vary the conditional probability according to the frequency of
the coefficients. It is possible to collapse different contexts into one for coefficients at differ-
ent frequencies that present the same behavior. This fact is already exploited in the original
CABAC algorithm, since the context assignment is performed according to the order of non-
zero coefficients in the zig-zag scan. Considering a reverse scanning order (from non-zero
levels at high frequencies to non-zero levels at low frequencies), the context modelling unit of
CABAC assigns a different context to the first consecutive levels equal to one and the first 4
consecutive levels greater than one. The following levels are assigned to the last context. In the
same way, it is possible to adopt this simplification for the DAG-based model, assigning four
contexts to each non-zero coefficient in the reverse scanning order up to a last context. The
total number is reduced to 21 contexts for the block-based DAG and to 21 · 16 = 336 for the
macroblock-based DAG.
Moreover, in the MB-based DAG it is possible to assume that the statistical dependence among
neighboring blocks is independent of the position of the current block in the MB. Therefore,
the MB-based DAG can reduce its number of contexts down to 21, as in the case of the
block-based DAG.
The coding performance can be increased by differentiating the contexts according to the energy of
the residual signal. In this case, it is possible to adopt different sets of contexts according to
the maximum number of bits that need to be coded for a coefficient in the current block.
3.5 Experimental results
The evaluation of this probability estimate has been performed by coding different sequences us-
ing the CABAC scheme previously described. In our experimental tests, we con-
sidered the following statistical dependences: the ones that link the coefficients within the
same block (DAGB), and the ones that link the coefficients of neighboring blocks (DAGMB).
Both schemes are used to code a set of heterogeneous sequences after a training phase that
is intended to initialize the conditional probability values. In the training phase, we used
the sequence mobile since it presents many features that can be found in other sequences.
Tests were done on sequences with different resolutions in order to evaluate the effectiveness of
these methods at different spatial resolutions. In our approach we avoided any rate-distortion
optimization in order to prevent the optimization algorithm from affecting the resulting perfor-
mance. At the same time, no partitioning of the motion-compensated macroblocks is applied,
since motion estimation can produce different residual signals with different energies within
the same macroblock. The coding performance of the DAGMB-based CABAC can be affected
by this choice since blocks result less correlated, as the prediction efficiency can differ
(i.e. a block that represents a portion of background is more easily predictable than one that
reports a new element in the scene or an object moving with non-translational motion). In our
first implementation the algorithm is applied to P frames only, without adopting any binariza-
tion (i.e. using the simple binary representation of the absolute values of the coefficients) [78]. In
the second architecture presented here, we adopted the same binarization used in the CABAC
algorithm, since it allowed the assignment of only one context per coefficient without resorting
to different DAGs on different planes. Both algorithms were implemented on the reference
software JM 9.5 and are compared with the standard CABAC algorithm as it is defined in the
H.264/AVC specification [87].
At first, we evaluated the capability of estimating the binary pdf for each coded bit in the
CABAC engine. The performance of each algorithm was evaluated considering the average
symmetric Kullback-Leibler distortion between the binary distributions p and q

D(p \| q) = \sum_{b=0}^{1} p(b) \log\left( \frac{p(b)}{q(b)} \right) + \sum_{b=0}^{1} q(b) \log\left( \frac{q(b)}{p(b)} \right). \qquad (3.17)
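As a quick illustration, the symmetrized divergence of eq. (3.17) can be evaluated directly for two binary pmfs. The sketch below is ours, written for this discussion; the function name is illustrative, not part of the thesis software.

```python
import math

def symmetric_kl(p, q):
    """Symmetrized Kullback-Leibler divergence between two binary pmfs
    p = (p(0), p(1)) and q = (q(0), q(1)), as in eq. (3.17)."""
    return sum(p[b] * math.log(p[b] / q[b]) for b in range(2)) + \
           sum(q[b] * math.log(q[b] / p[b]) for b in range(2))

# Identical distributions give zero divergence.
print(symmetric_kl((0.5, 0.5), (0.5, 0.5)))  # 0.0
```

The measure is symmetric by construction, so it can rank estimators without deciding which of the two distributions is the "true" one.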
In this work, the symmetrized divergence D(p ‖ q) is taken as a measure of the effectiveness
of the estimating algorithm. In our simulations, we compared the binary pdf assigned to each
context with the binary distribution estimated for the current frame. The average Kullback-
Leibler divergence was computed for different sequences with different quantization parame-
ters. Figure 3.9 shows how the DAGMB and the DAGB approaches are able to provide a better
estimate of the binary statistical distribution for each context. It is possible to notice that the
DAG-based estimators reduce the divergence to one third of the divergence obtained with the
original CABAC estimator. Note also that for sequences with a lot of details and
non-translational movement (like mobile and table) the original CABAC estimator
presents a higher divergence under strong quantization. This is mainly due to the fact that
the coefficient statistics vary rapidly and the CABAC estimator cannot adapt quickly to
changes. On the contrary, the DAG-based model is able to tune the context structure properly,
providing a precise estimate of the binary pmfs. The DAGB approach proves less effective than
the macroblock-based approach since the statistical dependence of coefficients is lower within
the same block, depending on the coded sequence. This phenomenon is more evident for se-
quences with a lot of small details, which enhance the difference in the performance of the two
algorithms (see results for mobile in Fig. 3.9(b) and in Fig. 3.10(e)). It is also possible to no-
tice that the mismatch between the performances of the different algorithms depends also on the
quantization parameter QP. Strong quantization increases the introduced distortion, which al-
ters the statistical distribution of coefficients since the performance of motion compensation
varies more across blocks. The performance of the DAG-based estimator is most af-
fected in those sequences that achieve a high compression gain thanks to motion compensation,
like foreman. Fig. 3.9(a) shows that the performance of the estimators is more affected by
the quantization parameter than for the other sequences.
Finally, we compared the performance of the different algorithms in terms of PSNR vs. rate.
We considered fixed-point implementations of the DAGB and DAGMB algorithms that take
[Figure 3.9 plots: average Kullback-Leibler divergence vs. QP, curves CABAC-DAGMB, CABAC, and CABAC-DAGB; panels (a) foreman, (b) mobile, (c) container, (d) table.]
Figure 3.9: Coding results for different QCIF sequences at 30 frame/s.
advantage of multiple FSMs to update the contexts. This implementation adopts the same
arithmetic and approximations that the original CABAC adopts for the probability, but the
context structure is changed. The computational complexity in terms of arithmetic operations
is the same as that of the CABAC algorithm, although the modified approach requires an
increased number of contexts (i.e. a wider memory to store the binary pmfs).
Although the two DAG-oriented algorithms perform differently in terms of final average
divergence, the compression gain obtained by their application in the CABAC structure does
not vary significantly. Figures 3.10 and 3.11 report the results obtained for different video
sequences. The adopted GOP structure is IPPP, and the DAG models were adopted for Inter
macroblocks, which are coded imposing one motion vector. It is possible to notice that the best
performance provides a bit stream reduction of about 10% (for the container sequence), while
the compression gain can be lower or slightly lower according to the specific sequence. It is
possible to notice that the performance of both DAG-based algorithms is quite similar, since the
reduction of the number of coding contexts does not allow a finer estimation of the probability.
However, the adoption of the DAG model allows the PSNR to be increased by 0.5 dB at
low bit rates and by up to 1 dB at high bit rates. For sequences at different resolutions (see
Fig. 3.11), the compression gain is lower, since the larger amount of data available for the
estimation of context probabilities allows the original CABAC coder to improve the accuracy
of the statistical modelling operated by the context structure. In this case, the ratio between the
average Kullback-Leibler divergence for the original CABAC and for the DAG-based algorithms
is lower, and therefore the difference in performance is less evident.
3.6 Summary
This chapter describes the CABAC arithmetic coding architecture and how it is possible to im-
prove its performance by changing the probability estimation algorithm. The improvement comes
from modelling the probability of the absolute value of each coefficient using a directed graph.
The statistics of each coefficient are thus estimated considering the conditioning of its neighbors
in the graph. Two dependence structures are considered. The first one includes all the coef-
ficients in a 4 × 4 transform block, while the other considers coefficients at the same spatial
frequency belonging to neighboring 4 × 4 blocks. However, the number of possible values
that each coefficient may assume makes the adoption of a model based on integer values
prohibitive, and therefore the transform coefficients are converted into binary strings. In this way,
the statistical dependence to be estimated involves binary variables and it is possible to model
it using an Ising model. Whenever a binary symbol is coded, a Probability-Propagation algo-
rithm is run on the corresponding graphical model, estimating a probability value that is used
to initialize the CABAC contexts. Experimental results show that the adoption of the graphical
models improves the coding performance of the CABAC algorithm by estimating the
binary probabilities in a more accurate way and by avoiding the transient periods that are required
by context updating in the original coder. In fact, this limits the interval shrinking at
each coding step, reducing the number of bits written in the coded stream. The graph structure
proves more effective when it is used to model the relation between coefficients of different
[Figure 3.10 plots: PSNR (dB) vs. rate (kbit/s), curves CABAC-DAGMB, CABAC-DAGB, and CABAC; panels (a) foreman, (b) news, (c) container, (d) mother, (e) mobile, (f) table.]
Figure 3.10: Results for different QCIF sequences at 30 frame/s.
[Figure 3.11 plots: PSNR (dB) vs. rate (kbit/s), curves CABAC-DAGMB, CABAC-DAGB, and CABAC; panels (a) foreman, (b) mobile, (c) table, (d) football.]
Figure 3.11: Results for different CIF sequences at 30 frame/s.
blocks, due to the higher correlation. The obtained bit stream reduction is approximately equal
to 10%; equivalently, the obtained quality increment for a given bit rate varies between 0.5
and 1 dB. Future work will include the modelling of the probability as a mixture of DAGs
depending on a set of parameters.
Chapter 4
Rate control algorithms for H.264
“We must cut our coat according to our cloth,and adapt ourselves to changing circumstances”
W. R. Inge
“Change is inevitable - except from a vending machine.”
Robert C. Gallagher
This chapter deals with the problem of controlling the bit rate produced by the H.264/AVC coder. Given a certain available bandwidth, the rate control unit has to fit the coded bit rate within the transmission constraints while maximizing the quality of the sequence reconstructed at the decoder. This goal can be achieved by appropriately modifying the coding parameters (like the quantization step, the coding mode, etc.), which must be tuned according to the input statistics. In the case of time-varying channels, the control algorithm must be flexible enough to adapt the parameters of the video coder to the modified channel conditions. The chapter presents a control approach which is based on an accurate modeling of the bit rate. This characterization is possible by analyzing the produced bit rate as a function of the percentage of quantized null coefficients and the energy of the quantized residual signal. The proposed approach provides an effective control at a low computational cost.
4.1 Introduction
During the last decades the communication world has shown an increasing interest in the
transmission of video sequences over a heterogeneous set of networks for a wide variety of
different applications. The main aim is to provide multimedia services to each terminal with-
out constraining its mobility or autonomy, while granting a certain Quality of Service (QoS).
According to these requirements, wireless communications prove to be the most suitable way
to distribute multimedia content and allow video communication in all environments. How-
ever, the characteristics of radio channels have also brought the need for rate control algorithms
that allow controlling the encoding parameters in both a flexible and efficient way. In fact, we
can identify some basic features that a rate control must satisfy to be suitable for wireless video
transmission [3]. A first requirement is low computational complexity, since some video ter-
minals might have limited hardware resources or a finite power supply. The time-varying nature
of wireless channels also implies that rate control algorithms must be flexible and must quickly
adapt the coding parameters to the changing transmission capacity. Finally, the algorithm must
show good compression efficiency, allowing good visual quality in the reconstructed sequence
at the decoder despite a limited available bandwidth. Focusing on these essential demands, we
investigated a flexible low-complexity rate control algorithm that maximizes the video quality
of the coded sequence while respecting the bandwidth constraints.
In the literature it is possible to find different solutions to this problem, and for all of them
the key issue is to model the statistics of the coded data. The papers [37] and [38] present
an efficient model for the bit rate produced by a transform video coder, which is based on the
percentage ρ of null quantized transform coefficients (called zeros). These algorithms prove to
be very efficient since they present a simple structure and a sufficient accuracy in controlling
the produced bit rate. However, their implementations require either a lot of memory accesses
in order to store the statistics or an accurate probabilistic model for the transform coefficients.
Since the latter solution allows for a reduction of the memory requirements, different solutions
were proposed to provide a coefficient model that is both simple and sufficiently accurate.
Most of the solutions presented in the literature are based on Laplacian and gen-
eralized Gaussian models (see [11, 57]). Since the generalized Gaussian model presents some
issues in terms of computational complexity, many applications prefer the Laplacian probability
density function (pdf), which proves to be both simple and sufficiently accurate for some cod-
ing standards. The first application of pdf modeling based on the percentage of zeros was
presented in [39].
In order to obtain high compression performance, we focused our investigation on one
of the most efficient video coding architectures introduced in the last years, the
video coding standard H.264/AVC [105]. Thanks to an improved motion estimation technique
[124], an efficient entropy coder [73], the adoption of spatial prediction [128, 14], and an
improved deblocking filter [71], H.264/AVC provides a higher compression gain with respect
to the previous coding architectures and places itself among the top candidate video coders
for video communications over mobile channels. However, experimental results show that the
Laplacian model of [39] is inaccurate in modeling the statistics of the H.264/AVC coefficients. In
[53], Kamaci et al. propose a better solution, using a Cauchy probability density function to
estimate the rate and distortion in a rate control algorithm. The Cauchy distribution proves to
be effective in estimating the coefficient probabilities, but its application in finding the optimal
quantization parameter still requires a high computational complexity that makes it unsuitable
for low-end devices. A simpler model based on a Laplacian+impulsive pdf can be found in
[75]. This solution proves to be effective at low bit rates, and its implementation requires a
minimal amount of computational complexity. In our investigation we tried to find a solution
that works well for different target bit rates and adapts quickly to frequent variations of the
available bandwidth without requiring great computational complexity. The solution was found
by introducing into the modelization of the bit rate an additional parameter Eq that approximates
the energy of the quantized signal. The modelling of the bit rate in the joint domain (ρ, Eq)
proves to be very efficient in controlling the size of the produced bit stream with respect to the
algorithm adopted by the Joint Video Team [59].
In Section 4.2 the “zeros” parameterization is presented. The size of the coded image can
be linearly related to the percentage of null quantized transform coefficients (called ρ as in
[37]). Both temporally and spatially predicted pictures provide strong experimental evidence
for the accuracy of this model. Section 4.3 describes how a parametric model can replace
the storage of the transform coefficient histogram and take advantage of H.264 internal
parameters in order to estimate the parameters of the coefficient probability density function.
Section 4.5 describes a rate control algorithm based on ρ-modeling. The algorithm relates the
quantization step to the target percentage of “zeros” through a parametric function estimated
from previously encoded data. The quantization step is carefully modified while coding the
different macroblocks in order to fit the bit allocation constraints and to provide maximal and
uniform video quality.
Finally, Section 4.6 reports experimental results that compare the “zeros”-based rate control
with the rate control algorithm implemented in the JM7.6 coder. The experimental data show
that ρ-modeling provides better performances both in terms of video quality and in terms of
required computation.
4.2 Rate distortion modeling based on “zeros”
In every rate control algorithm, the key issue is to map the produced bit rate and the distortion
of the reconstructed sequence to the encoding parameters in an optimal way. Given a constraint
on the available bandwidth, the rate control algorithm must find the set of parameter values
that maximizes the visual quality of the reconstructed sequence. The results provided for this
constrained maximization problem are mainly due to the adopted optimization algorithm and
the adopted Rate-Distortion model [90]. In most applications, the choice of the optimiza-
tion algorithm is mainly influenced by the amount of calculations that is required. Whenever
the computational complexity or the encoding delay are not issues, it is possible to adopt very
efficient optimization routines that allow for the estimation of the optimal set of encoding param-
eters [15, 54]. Unfortunately, the computational resources of many devices or the constraints
imposed on the encoding time by some applications restrict the choice of the optimization algo-
rithm to those solutions that need a limited complexity and a small amount of memory.
In these cases, the main difference is given by the capability of the Rate-Distortion model of
characterizing the statistics of the encoded signal.
Most rate control algorithms are based on hyperbolic R-D models, where bit rate and
distortion are functions of the quantization step (e.g. [112, 41, 60, 59, 61, 56, 96]). Fig. 4.1
shows that for different coding types (I, P, and B) and images the rate produced by the H.264
coder is a non-linear function of the quantization step. This model has been adopted in many
control techniques, providing a simple approximation of the R-D function and a practical tool
to control the quantization parameters. However, this approach can be inefficient in some
cases. Whenever there is a low spatial correlation (e.g. the encoder is processing a picture with
varying characteristics) or the motion compensation is not equally efficient all over the frame,
the estimated R-D model is not suitable for all the regions.
In [37, 38, 39] Z. He et al. present a better solution to parameterize the number of bits
produced by a video encoder. In these papers, the size (in bits) and the distortion of a coded
image are functions of the percentage of null quantized transform coefficients (called “zeros”).
[Figure 4.1 plot: distortion vs. rate (bits) for frames 0 (I), 3 (P), and 2 (B), together with the hyperbolic model.]
Figure 4.1: Distortion vs. Rate for coded Intra, Inter and B frames. The plot was obtained coding the frames 0, 3 and 2 of the foreman sequence and varying QP from 13 to 35.
Experimental results (see [37, 38, 39]) show that the “zeros” parameterization suits a great
number of images better than previous models, since the influence of the input image characteristics
is lower. Moreover, this model can be successfully implemented on different transform-based
coding standards. As a matter of fact, its application to the emerging video coding standard
H.264 is an interesting topic of investigation.
As reported in Chapter 2, the H.264 encoder (sketched in Fig. 2.1) implements a
hybrid transform video coder with motion compensation or spatial prediction. Each 4 × 4
block of the current frame is predicted, and the residual error is transformed and quantized (see
Section 2.2.3).
After the transform operation, we can store the frequency of each coefficient in a histogram
p_x(a). The percentage of “zeros” in the current frame, ρ, can be computed through the equation

\rho(\Delta) = \sum_{|a| < \Delta} p_x(a), \qquad (4.1)

where ∆ is the quantization step chosen for the picture and p_x(a) represents the percentage of
DCT coefficients x equal to a for the current frame. Note that ∆ in H.264/AVC depends on the
position of the coefficient, the quantization parameter QP, the macroblock coding type, and the
quantization matrix¹ (see Section 2.2.3). However, here we omitted the indexes in the notation
for the sake of simplicity. Although ∆ may vary within the same block, the coefficients can be stored
in a common histogram by rescaling the coefficients before quantization.
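Eq. (4.1) amounts to summing the histogram bins whose magnitude falls below the quantization step. A minimal Python sketch (the `zeros_fraction` helper and the toy histogram are ours, purely illustrative):

```python
def zeros_fraction(hist, delta):
    """Percentage of "zeros" per eq. (4.1): fraction of transform
    coefficients whose magnitude falls below the quantization step delta.
    `hist` maps coefficient value a -> relative frequency p_x(a)."""
    return sum(p for a, p in hist.items() if abs(a) < delta)

# Toy histogram with hypothetical frequencies, normalized to 1.
hist = {-2: 0.05, -1: 0.15, 0: 0.6, 1: 0.15, 2: 0.05}
print(zeros_fraction(hist, 1))  # 0.6  (only a = 0 satisfies |a| < 1)
print(zeros_fraction(hist, 2))  # ≈ 0.9
```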
Figure 4.2 shows plots of the bit rate vs. ρ as QP varies between 15 and 45 for I, P, and B
frames of the foreman sequence. From the graph, it is apparent that the picture rate R(ρ) is well
¹In our approach, we do not consider the adoption of a quantization matrix for Rate-Distortion optimization. The variations of ∆ within the same transform block are related only to the fact that the matrix is not orthonormal and the coefficients need to be rescaled (see Section 2.2.3).
represented by a linear function of ρ, expressed through the equation

R(\rho) = \mu \rho + q, \qquad (4.2)

where q is the number of overhead bits that code all the information that is not related to the
DCT coefficients, while µ is the ratio between the percentage of bits that code the transform
coefficients and ρ. Similar results are obtained for different kinds of pictures, independently of
[Figure 4.2 plot: rate (kbit) vs. ρ for frame 0 (I), frame 3 (P), and frame 2 (B).]
Figure 4.2: Plots of bit rate vs. ρ for the coded sequence foreman (GOP IP. . . P, 15 frames), as QP varies between 15 and 45; the plots refer to frame 0 (Intra coded), frame 3 (Inter coded), and frame 2 (B-predicted Inter).
the nature of the prediction (spatial, temporal or temporal bi-directional). As a matter of fact, it
is possible to use the “zeros” parameterization to carefully model the number of bits produced
by the H.264 encoder. In order to find the “zeros” percentage for a coded picture, we can avoid
using eq. (4.1), as ρ is directly available from the H.264 encoder syntax. Since CAVLC context
modeling is based on the percentage of quantized DCT coefficients different from zero, the
number of “zeros” is computed at the macroblock level by the coding routine itself.
For rate control purposes, the coder needs to relate the target percentage of null quantized
coefficients ρT to the quantization step ∆. One possible solution is computing ρ through equa-
tion (4.1) for every quantization step and choosing the quantization parameter QP that produces
a ρ value closest to ρT (a bisection-like technique would help to reduce the computations). A
second solution approximates the coefficient histogram through a parametric model that makes
the estimation of the target value ∆T faster than the iterative approach.
Since coefficient statistics can differ for pictures of different types, three separate coefficient
statistics, for I, P, and B pictures respectively, are to be kept.
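The iterative solution mentioned above can be sketched as a bisection over QP. The code below is illustrative only: it assumes a hypothetical monotone `rho_of_qp` mapping supplied by the caller, which is not part of the thesis software.

```python
def find_qp(rho_of_qp, rho_target, qp_min=0, qp_max=51):
    """Bisection-like search for the QP whose "zeros" percentage is
    closest to rho_target; rho_of_qp is assumed non-decreasing in QP."""
    lo, hi = qp_min, qp_max
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if rho_of_qp(mid) < rho_target:
            lo = mid
        else:
            hi = mid
    # Pick whichever endpoint lands closer to the target.
    return min((lo, hi), key=lambda qp: abs(rho_of_qp(qp) - rho_target))

# Hypothetical monotone rho(QP) model, for illustration only.
model = lambda qp: 0.5 + 0.01 * qp
print(find_qp(model, 0.8))  # 30
```

Since rho is non-decreasing in QP, the search needs only about log2(52) ≈ 6 evaluations of rho, which is the computational saving the text alludes to.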
4.3 Parametric models for H.264 coefficients estimated through activity
[Figure 4.3 plots: coefficient histograms and their Laplacian components for (a) frame 0, type I; (b) frame 18, type P; (c) frame 6, type B; (d) frame 20, type B; all with QP=30.]
Figure 4.3: Histogram of coefficient frequencies from the coded sequence carphone (360 frames coded with GOP IBBP, 60 frames, at 30 frame/s, coded with QP=30=const).
In order to correlate the “zeros” percentage with the quantization step, equation (4.1) needs
the knowledge of the coefficient distribution, which can be provided either by the storage of
coefficient histograms or by a parametric model.
4.3.1 Storing the coefficients histograms
At first, the frequencies of the coefficients were stored in three different histograms, as the
quantization step depends on the position of the transform coefficients in the 4 × 4 block. Since
the transform described in eq. (2.3) is simply performed with additions and register shifts, the
resulting values are integer numbers. The storage of their frequencies requires a great memory
area, since each coefficient can be represented with approximately 14.7 bits (see [35, 36]), and
in our first approach we kept three different memory vectors. Moreover, given a coefficient
distribution and a QP, the computation of the “zeros” percentage from the data of each histogram
requires a great amount of additions.
In a second solution, all the information was stored in one single histogram. Rescaling each
coefficient before counting it in the histogram allows keeping a single histogram and reducing
the memory area and the computational effort by three times. However, the requirements are
still demanding, since the possible coefficient values are 2^{14.17}/4 = 2^{12.17} ≃ 4698, where 4 is
the smallest rescaling factor.²
4.3.2 Approximating the coefficients distribution via a parametric model
A parametric model allows for a further reduction of the memory area (see [11]). In a first
implementation, the coefficient histogram was approximated with a generalized Gaussian function

p_x(a) = \gamma e^{-(\beta |a|)^{\alpha}} \qquad (4.3)

where

\beta = \frac{1}{\sigma_x} \left[ \frac{\Gamma(3/\alpha)}{\Gamma(1/\alpha)} \right]^{\frac{1}{2}}, \qquad \gamma = \frac{\alpha \beta}{2\, \Gamma(1/\alpha)},

and Γ(·) denotes the gamma function.
Equation (4.1) turns into the integral

\rho(\Delta) = \int_{-\Delta}^{+\Delta} p_x(a)\, da. \qquad (4.4)
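The integral of eq. (4.4) can be checked numerically for the generalized Gaussian of eq. (4.3). The sketch below is ours, assuming the common (β|a|)^α convention for the exponent, and uses simple midpoint integration rather than any thesis code.

```python
import math

def gg_rho(delta, alpha, sigma, n=10000):
    """rho(delta) of eq. (4.4) for the generalized Gaussian of eq. (4.3),
    evaluated by midpoint integration over [0, delta] and doubled
    (the integrand is even)."""
    beta = (1.0 / sigma) * math.sqrt(math.gamma(3.0 / alpha) / math.gamma(1.0 / alpha))
    gamma_ = alpha * beta / (2.0 * math.gamma(1.0 / alpha))
    h = delta / n
    return 2.0 * sum(gamma_ * math.exp(-(beta * (i + 0.5) * h) ** alpha) * h
                     for i in range(n))

# alpha = 2 recovers a Gaussian; about 95% of the mass lies within 2 sigma.
print(round(gg_rho(2.0, alpha=2.0, sigma=1.0), 3))  # ≈ 0.954
```

Setting alpha = 1 recovers the Laplacian case, which is the special case exploited later by the Laplacian+impulsive model.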
This model provides a good estimate of the coefficient statistics for I and P frames and can
be fully defined by the parameters α and σ_x. The value of α is computed from m_{|x|} = E[|x|] and
σ_x^2 = E[(x − m_x)^2] according to the equation

\alpha = F^{-1}\left( \frac{m_{|x|}}{\sigma_x} \right) \qquad (4.5)

where F(·) is defined as

F(\alpha) = \frac{\Gamma(2/\alpha)}{\sqrt{\Gamma(1/\alpha)\, \Gamma(3/\alpha)}}. \qquad (4.6)

Since F(·) is a monotone increasing function, it is possible to store its values in a table for a
given set of α values, and use them in order to approximate the inverse F^{-1}(·). Equation (4.5)
can be implemented through a non-uniform quantizer Q_F[·] that outputs an \hat{\alpha} = Q_F[\alpha]
value.
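The table-based inverse can be sketched as follows; the grid range and step are our illustrative choices, not the thesis values.

```python
import math

def F(alpha):
    """F(alpha) of eq. (4.6): the ratio m_|x| / sigma_x for a
    generalized Gaussian with shape parameter alpha."""
    return math.gamma(2.0 / alpha) / math.sqrt(math.gamma(1.0 / alpha) * math.gamma(3.0 / alpha))

# Tabulate F once on a grid of alpha values (the role of Q_F).
ALPHAS = [0.3 + 0.01 * i for i in range(270)]      # alpha in [0.3, 3.0)
TABLE = [(F(a), a) for a in ALPHAS]

def estimate_alpha(m_abs, sigma):
    """Approximate alpha = F^{-1}(m_|x| / sigma_x) by the nearest table entry."""
    r = m_abs / sigma
    return min(TABLE, key=lambda fa: abs(fa[0] - r))[1]

# For a Laplacian (alpha = 1), m_|x| / sigma_x = 1/sqrt(2).
print(round(estimate_alpha(1.0 / math.sqrt(2), 1.0), 2))  # 1.0
```

Because F is monotone, nearest-value lookup in the table is equivalent to a non-uniform quantization of the ratio m_{|x|}/σ_x, which is why no gamma-function evaluation is needed at coding time.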
Thanks to this parametric model, the memory requirements are reduced. On the other hand,
a lot of calculations are needed to compute the statistical parameters m_{|x|}, σ_x, and ρ(∆) via
eq. (4.4). In order to find a faster and less demanding solution, we focused on estimating m_{|x|}
²Further information about the range of coefficient values can be found in footnote 3 at page 30 and in [35, 36].
and σ_x directly from some parameters of the H.264 syntax. One of these is the activity act(m)
of the m-th macroblock, which can be expressed as

act(m) = \sum_{x,y=0}^{15} |err_m(x, y)| = \sum_{x,y=0}^{15} |I_m(x, y) - \hat{I}_m(x, y)|, \qquad (4.7)

where I_m(x, y) is the original pixel of the m-th macroblock at position (x, y), \hat{I}_m(x, y) is its
prediction, and err_m(x, y) is the residual prediction error. In many video coders and rate
control algorithms (e.g. [60], [59]), the activity is used as a measure of the coefficient standard
deviation σ_x (see Appendix A), since its computation does not imply any multiplication and it
can be directly extracted from the encoding process. This replacement is supported by the great
correlation between act(m) and σ_x. In addition, it is possible to estimate m_{|x|} and σ_x through
a second order polynomial expressed as

\sigma_x(\overline{act}) = s_0(\rho) + s_1(\rho)\, \overline{act} + s_2(\rho) \left( \overline{act} \right)^2
m_{|x|}(\overline{act}) = m_0(\rho) + m_1(\rho)\, \overline{act} + m_2(\rho) \left( \overline{act} \right)^2 \qquad (4.8)

with

\overline{act} = \frac{1}{N_{MB}} \sum_{m=0}^{N_{MB}-1} act(m) \qquad (4.9)

where N_{MB} is the number of macroblocks of the picture. The coefficients s_i(ρ) and m_i(ρ) are
tabulated for different values of ρ.
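Eqs. (4.7) and (4.9) reduce to sums of absolute differences; a toy Python sketch (helper names and the tiny blocks are ours, purely illustrative):

```python
def mb_activity(orig, pred):
    """Activity of one macroblock, eq. (4.7): the sum of absolute
    prediction errors |I(x, y) - Ihat(x, y)| over all pixels."""
    return sum(abs(o - p) for row_o, row_p in zip(orig, pred)
               for o, p in zip(row_o, row_p))

def avg_activity(activities):
    """Average activity over the picture, eq. (4.9)."""
    return sum(activities) / len(activities)

# Toy 2x2 "macroblocks" for illustration (real MBs are 16x16).
a1 = mb_activity([[10, 12], [8, 9]], [[9, 12], [10, 9]])  # |1|+0+|-2|+0 = 3
print(a1, avg_activity([a1, 5]))  # 3 4.0
```

As the text notes, only additions and absolute values are involved, which is why the activity can stand in for σ_x at essentially zero extra cost.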
However, this approximation does not fit the number of coded bits for B frames, especially
at low bit rates, and it is necessary to find a new model both simple and sufficiently accu-
rate in matching the coefficient statistics. To this purpose, we adopted a “Laplacian+impulsive”
distribution, described by the equation

p_x(a) = \alpha' \delta(a) + (1 - \alpha') \frac{1}{\gamma'} e^{-\frac{2}{\gamma'} |a|}, \qquad (4.10)

where δ(a) is the Dirac impulse function. This solution is best suited for B frames, but Fig. 4.3
shows that it can be used for I and P frames as well. The whole pdf is identified by the two
parameters α' and γ', which can be expressed as functions of the average activity \overline{act} through the
equations

\gamma'(\overline{act}) = \gamma'_0(\rho) + \gamma'_1(\rho)\, \overline{act} + \gamma'_2(\rho) \left( \overline{act} \right)^2
\alpha'(\overline{act}) = \alpha'_0(\rho) + \alpha'_1(\rho) \log\left( \overline{act} \right) + \alpha'_2(\rho) \log\log\left( \overline{act} \right), \qquad (4.11)

where the coefficients α'_i(ρ), γ'_i(ρ), i = 0, 1, 2, are stored for a set of ρ values. These values were
computed for ρ varying in the range [0.79, 0.99] with decimal resolution 0.01. As [75] reports,
this model proves to be the most efficient, since it both matches the statistical data and allows
an easy estimate of the quantization step ∆ associated with a given target ρ value. According to
equations (4.10) and (4.4), ρ can be expressed as a function of α', γ' and ∆
\rho = \int_{-\Delta}^{+\Delta} p_x(a)\, da = 1 - (1 - \alpha')\, e^{-\frac{2}{\gamma'} \Delta} \qquad (4.12)

where α' and γ' are estimated via eqs. (4.8) and (4.11). The inverse function that relates the
quantization step ∆ to ρ is

\Delta = -\frac{\gamma'}{2} \ln\left( \frac{1 - \rho}{1 - \alpha'} \right). \qquad (4.13)

In this way it is possible to estimate a target average quantization step ∆ for a target ρ value
in a simple way. This solution proves to be quite efficient at low bit rates (see [75]), but the
required memory area increases in case the algorithm has to support a wide range of target
bit rate values. In fact, at different bit rates the parameter ρ changes, and as a consequence,
we have to store many additional coefficient tables. In order to reduce the memory area, we
designed the QP-estimation algorithm described in the next Section.
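Eqs. (4.12) and (4.13) are straightforward to implement, and they invert each other exactly. The sketch below uses illustrative parameter values for α' and γ', not values fitted to any sequence.

```python
import math

def rho_from_delta(delta, alpha_p, gamma_p):
    """Percentage of zeros, eq. (4.12), for the Laplacian+impulsive model."""
    return 1.0 - (1.0 - alpha_p) * math.exp(-2.0 * delta / gamma_p)

def delta_from_rho(rho, alpha_p, gamma_p):
    """Inverse relation, eq. (4.13): quantization step for a target rho."""
    return -(gamma_p / 2.0) * math.log((1.0 - rho) / (1.0 - alpha_p))

# Round trip: the two relations invert each other.
d = delta_from_rho(0.90, alpha_p=0.6, gamma_p=8.0)
print(round(rho_from_delta(d, 0.6, 8.0), 6))  # 0.9
```

Note that at delta = 0 the model already yields rho = α', the mass of the impulsive component, which is what makes it accurate at low bit rates where many residuals are exactly zero.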
4.4 Signal analysis in the (ρ, E_q)-domain
According to the experimental results, it is possible to relate the percentage of null quantized
DCT coefficients to the activity of the signal. In fact, the prediction residual of an image with
high average activity presents a great number of coefficients different from zero. Therefore, for
a given quantization step value the percentage of null quantized coefficients is lower than in an
image that is efficiently predicted and therefore presents a low activity value. In our work, we
tried to characterize the relation between the three parameters ρ, activity, and QP.
The parametric models of the previous subsection show that there is a nearly inverse-
logarithmic relation between the percentage of “zeros” and the variance of the signal for a
given quantization step. Since a rate control algorithm has to estimate a quantization parameter
value for a given target number of bits, we investigated which relation occurs between ρ and
QP once the activity level is known. To this purpose, we analyzed the relation between ρ and
the parameter

\hat{E}_q = \frac{\overline{act}}{\Delta} \qquad (4.14)
that gives the average activity level normalized to the quantization step ∆. In the Appendix
it is shown that \hat{E}_q is an approximation of the average energy E_q of the quantized signal,
i.e. of the quantized DCT coefficients. Moreover, in the Appendix it is shown that the parametric
model of eq. (4.10) suggests that there is a quadratic relation between \hat{E}_q and ρ. This fact
is well confirmed by the experimental results reported in Fig. 4.4, and as a consequence, the
parameter \hat{E}_q can be expressed via the second order polynomial

\hat{E}_q = c_{i,0} + c_{i,1}(1 - \rho) + c_{i,2}(1 - \rho)^2 \qquad (4.15)

with i = I, P, B.
Equation (4.15) provides a computationally simple but accurate relation that is used
in the rate control algorithm described in the following section. In fact, the quadratic model
54 Chapter 4. Rate control algorithms for H.264
[Figure 4.4: eight panels, each plotting act/∆ versus 1 − ρ: (a) Frame 0 (I), (b) Frame 3 (P), (c) Frame 1 (B), (d) Frame 30 (I), (e) Frame 33 (P), (f) Frame 31 (P), (g) Frame 120 (I), (h) Frame 123 (P).]

Figure 4.4: E_q vs. ρ for the sequence carphone coded with constant QP ∈ [15, 51] (GOP IBBP, 15 frames, 30 frame/s, QCIF resolution). Here act denotes the parameter act of eq. (4.14).
allows the coder to relate a target percentage of “zeros” to a target QP value with a low
computational effort, since the activity is computed while estimating the best predictor (either
spatial or temporal) and the target percentage of zeros is given by the rate constraints (see
eq. (4.2)).
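The quadratic model of eq. (4.15) is cheap to evaluate. A minimal sketch (the function name is ours; `c` holds the coefficient set c_{i,0}, c_{i,1}, c_{i,2} of the current frame type):

```python
def eq_from_rho(rho, c):
    """Evaluate the quadratic model of eq. (4.15):
    E_q = c0 + c1*(1 - rho) + c2*(1 - rho)**2,
    where c = (c0, c1, c2) is the coefficient set for the
    current frame type (I, P or B)."""
    x = 1.0 - rho
    return c[0] + c[1] * x + c[2] * x * x
```

Three multiplies and two additions per frame, which is what makes the (ρ, E_q) parameterization attractive compared with storing full coefficient-statistics tables.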
4.5 A (ρ, E_q)-based rate control algorithm
In the previous section we described how the number of “zeros” can model the bit rate
produced by an H.264 encoder. In this section we show how this parameterization can be used
to control the amount of coded bits in order to transmit a video sequence over a channel of a
given capacity.
The algorithm adopts a feedback scheme where the control is performed in different steps
operating at different levels. The first control is performed at the beginning of each GOP and
allocates G_{k,0} bits for the k-th group of pictures. The second level deals with the bit rate and
the coding parameters of a single picture. Finally, the quantization parameter is corrected at
macroblock level in order to fit the global constraints. In the following subsections each control
level is described in detail.
4.5.1 Bit rate control at GOP level
Given the target bit rate R_b and the frame rate F_r of the input video sequence, the video encoder
sets the overall number of bits

G = R_b · N / F_r    (4.16)

to code the whole GOP, where N is the number of frames in each group of pictures. This value
has to be corrected during the coding process because of bit allocation errors and variations of
the available bandwidth.
Bit rate allocation errors are corrected through the equation

G_{k,0} = δG_{k−1} + G − (B_c − B_s/8)    (4.17)

where G_{k,n} represents the available bits before coding the n-th frame of the k-th GOP and G
is defined in (4.16). δG_{k−1} is the difference between target and effective bit usage after coding
the (k − 1)-th GOP, and the parameters B_c and B_s refer to the buffer level and the
buffer dimension, respectively. The GOP level rate control tries to keep B_c as close as possible to B_s/8 in
order to avoid underflows.
The second type of correction is carried out whenever the transmission bit rate changes,
and it is given by the equation

G_{k,n} ← G_{k,n} + (R′_b − R_b) · (N − n)/F_r    (4.18)

where R′_b is the new available bit rate. These two operations make it possible to adapt the coded
bit stream to channel variations, avoiding transmission delays. However, whenever the available
bandwidth is reduced too much, the algorithm starts skipping some B-coded pictures in order
to allocate more bits for those frames that are used as references for motion compensation.
In order to deal with fast time-varying channels, the algorithm also considers a “micro-
GOP”, i.e. a group made of an I- or P-type picture and the following B-type pictures. The
number of available bits for a micro-GOP is

G^{micro}_j = G^{micro}_{j−1} + R′_b (1 + number_of_B)/F_r    (4.19)

where number_of_B is the number of consecutive B-type pictures before the following I or
P frame, and j is the index of the micro-GOP in the current GOP. After the coding of each
frame, G^{micro}_j is updated according to the rules presented in the following paragraphs.
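The GOP-level bookkeeping of eqs. (4.16)–(4.19) can be sketched as follows. This is an illustrative sketch, not the thesis implementation; the function names are ours.

```python
def gop_bits(rb, fr, n_frames):
    """Total bit budget for a GOP, eq. (4.16): G = Rb * N / Fr."""
    return rb * n_frames / fr

def gop_start_budget(delta_g_prev, g, buf_level, buf_size):
    """Budget at the start of the k-th GOP, eq. (4.17): carry over the
    previous GOP's allocation error and steer the buffer toward Bs/8."""
    return delta_g_prev + g - (buf_level - buf_size / 8.0)

def on_bandwidth_change(g_kn, rb_new, rb_old, n_total, n_coded, fr):
    """Correction of eq. (4.18) applied when the channel rate changes
    from Rb to Rb' with (N - n) frames still to code."""
    return g_kn + (rb_new - rb_old) * (n_total - n_coded) / fr

def micro_gop_bits(g_prev, rb, number_of_b, fr):
    """Micro-GOP budget, eq. (4.19): bits for one reference frame plus
    the consecutive B pictures that follow it."""
    return g_prev + rb * (1 + number_of_b) / fr
```

For example, at 128 kbit/s and 30 frame/s a 15-frame GOP receives 64000 bits, and a drop to 102 kbit/s with 10 frames left reduces the remaining budget proportionally.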
4.5.2 Bit rate control at frame level
After computing the available bits for the current group of pictures, the rate control algorithm
has to distribute them among all the frames in the GOP. For this purpose, the algorithm has
to estimate the target bit rate T_n and the target average quantization parameter QP_{T,n} for the
current frame.
The target number of bits for each picture is computed according to the frame type and
to bit rate allocation errors. Bit rate allocation errors may have occurred while coding the
previous pictures, and they are related to an unforeseen behavior of the H.264 encoder. As a
consequence, the control routine has to correct the target bit rate in order to keep the produced
bit rate within the bandwidth constraints.
Moreover, the assignment of the target rate has to take into account the coding type of the
whole picture and its parameters. For example, an I-type picture requires a greater number of
bits since its video quality affects the coding performance of the following frames. In addition,
spatial prediction is less effective than temporal prediction, and as a consequence the
residual information to code for an Intra image requires more bits than for other types
of frames even if the quantization parameter is the same.
As for P-type frames, they need to be coded with a lower distortion than the one affecting
B-type frames since they are used as references for temporal prediction.
Finally, we must stress that the coding performance of a video encoder is deeply related
to the image characteristics and their variations in time. The bit rate produced for a given quan-
tization parameter QP depends on the statistics of the transform coefficients. As a consequence,
the rate control algorithm requires a complexity parameter X_t (t = I, P, B), which is defined
in the following paragraphs, to characterize the complexity of the current frame and adapt the
choice of coding parameters to the actual picture statistics.
The bit rate control for the n-th frame in the k-th GOP is divided into four steps. First,
the target bit rate is computed according to the available number of bits in the GOP, the coding
type and the characteristics of the previous images. Second, the algorithm estimates whether
it is worth coding the current picture or skipping it. In the first case, the algorithm goes on to the
third step and computes the average QP value for the current frame. Then, the current picture is
coded, and the parameters of the image statistics are updated according to the coding results. If the
current frame is skipped, the algorithm starts processing the following picture. In the following
paragraphs each step is presented in detail.
B.1 Computation of the target bit rate

Before coding the n-th frame in the current GOP, the algorithm estimates the target bit rate
T_n as a convex combination of the target bit rate T̂_n at GOP level and the target bit rate T̃_n at
micro-GOP level:

T_n = β T̂_n + (1 − β) T̃_n,   n = 0, …, N − 1    (4.20)

where

T̂_n = K_i · G_{k,n} / (K_I · n_I + K_P · n_P + K_B · n_B),   i = I, P, B    (4.21)

with

K_I = K_{I,P} · K_{P,B},   K_P = K_{P,B},   K_B = 1    (4.22)

and

T̃_n = R_b/F_r − γ T̄_n.    (4.23)

In (4.21) n_i is the number of remaining i-type frames in the GOP, in (4.22) K_{i,j} is the com-
plexity ratio between an i-type coded frame and a j-type one (i, j = I, P, B), and γ, β are
constants (γ = 0.25 and β = 0.9 in our experiments). A more efficient implementation could
adaptively change the value of β in order to modify the influence of T̂_n and T̃_n according to
the channel behavior.

Equation (4.21) shares the available bits among the different frames of the GOP, while
(4.23) distributes the bits within the current micro-GOP. In fact, a purely GOP-based bit allocation
proves not to be effective whenever the channel bandwidth varies frequently. In this case, the
rate control algorithm performs a wrong estimation of the target bit rate because of an obsolete
value of R_b, and the following frames can be affected by bit starvation.

The quantity T̄_n in (4.23) is computed through the equation

T̄_n = T̄_{n−1} + δB_1 + δB_2 − R_b/F_r    (4.24)

where

δB_1 = K_{{I,P}} · G^{micro}_j / (K_{{I,P}} · n^{micro}_{{I,P}} + n^{micro}_B)    (4.25)

and

δB_2 = (T_n − B_s/8) · K_i / (K_{{I,P}} · n^{micro}_{{I,P}} + K_B · n^{micro}_B).    (4.26)

The parameter n^{micro}_i (i = I, P, B) is the number of remaining i-type frames in the micro-
GOP.
All these parameters are updated after the coding of the current frame, as described in
the following paragraphs.
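The target-rate combination of eqs. (4.20)–(4.23) can be sketched as follows. This is an illustrative sketch under our notation (T̂ for the GOP-level share, T̃ for the micro-GOP share); the function names are ours, and the δB_1/δB_2 corrections of eqs. (4.25)–(4.26) are omitted for brevity.

```python
def gop_level_target(k_i, g_kn, k_I, k_P, k_B, n_I, n_P, n_B):
    """GOP-level share of eq. (4.21): weight the remaining GOP budget
    G_{k,n} by the complexity weight K_i of the current frame type,
    normalized over the remaining I, P and B frames."""
    return k_i * g_kn / (k_I * n_I + k_P * n_P + k_B * n_B)

def micro_level_target(rb, fr, t_bar, gamma=0.25):
    """Micro-GOP-level target of eq. (4.23), with gamma = 0.25 as in
    the thesis experiments."""
    return rb / fr - gamma * t_bar

def frame_target(t_gop, t_micro, beta=0.9):
    """Convex combination of eq. (4.20), with beta = 0.9 as in the
    thesis experiments."""
    return beta * t_gop + (1.0 - beta) * t_micro
```

With K_{I,P} = 2 and K_{P,B} = 1.5, eq. (4.22) gives the weights K_I = 3, K_P = 1.5, K_B = 1, so a P frame in a GOP with 1 I, 4 P and 8 B frames left receives 1.5/17 of the remaining budget.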
B.2 Frame skipping control
After computing the target number of bits T_n for the current frame, the rate control algorithm
decides whether to skip the current frame or not. In fact, whenever a picture is skipped,
T_n bits are saved for the following frames. Frame skipping therefore permits dealing with
bit rate allocation errors and scene changes in an efficient way, since the algorithm skips those
frames that are not used as references for temporal prediction whenever the remaining frames
in the GOP suffer from bit starvation. In this way, we avoid an excessive distortion of the reference
pictures, which would decrease the motion estimation efficiency.
In the proposed algorithm, a frame is skipped whenever the inequality

T_n ≤ R_b/(8 F_r)    (4.27)

holds.
In addition, whenever the current picture is a B-type frame, the following condition is tested:

G_{k,n} ≤ (N_P + N_B) R_b/(8 F_r).    (4.28)

This test allows the rate control to check whether the coder has already used a greater number
of bits than expected. In this case, the current B-type frame is skipped.
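The two tests of eqs. (4.27)–(4.28) amount to a simple predicate. A sketch (the function name is ours):

```python
def skip_frame(t_n, rb, fr, g_kn=None, n_P=0, n_B=0, is_b_frame=False):
    """Frame-skipping tests of eqs. (4.27)-(4.28): skip when the target
    falls below one eighth of the average per-frame budget Rb/Fr, or
    (for B frames) when the remaining GOP budget is nearly exhausted."""
    if t_n <= rb / (8.0 * fr):          # eq. (4.27)
        return True
    if is_b_frame and g_kn is not None:  # eq. (4.28)
        return g_kn <= (n_P + n_B) * rb / (8.0 * fr)
    return False
```

At 128 kbit/s and 30 frame/s the per-frame threshold of eq. (4.27) is about 533 bits, so only frames with a nearly empty budget are dropped.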
B.3 Computation of QP_{T,n}

As the H.264 encoder is driven by the quantization parameter QP, the algorithm has to
compute the average QP value for the n-th frame from T_n. According to the R-D model presented
in Section 4.3, the bit rate T_n has to be mapped to a target average percentage of “zeros” ρ_{T,n}
through the equation

ρ_{T,n} = (T_n − q)/μ    (4.29)

where μ and q are estimated from previously coded pictures (e.g. the (n − 1)-th frame). The
parameters μ and q are the slope and the intercept of equation (4.2).
From equation (4.15), the parameter E_q can be computed from ρ_{T,n} for a given set of
coefficients c_{i,t}, i = I, P, B and t = 0, 1, 2. Therefore, the target percentage of “zeros” ρ_{T,n} is
related to a target quantized signal energy value E_{q,T} via (4.15), where the set of coefficients
varies according to the coding type of the current frame. Then, according to eq. (4.14), the
algorithm estimates the target average quantization step ∆_{n,T} as

∆_{n,T} = act_{n,pred}/E_{q,T}    (4.30)

where act_{n,pred} is the predicted average activity for the current frame. In our approach act_{n,pred}
is equal to the average activity of the previous frame of the same coding type. Nevertheless,
more efficient prediction schemes can be implemented.
Finally, the target average quantization step ∆_{n,T} is converted into an average target quan-
tization parameter QP_{T,n} as described in [75].
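The chain of Section B.3 can be sketched end to end. This is an illustrative sketch under two stated assumptions: the inversion of eq. (4.14) as ∆ = act/E_q, and a simplified ∆ ≈ 2^{QP/6} proportionality for the final QP conversion (the thesis converts ∆ to QP as in [75]; the proportionality constant is omitted here). The function name is ours.

```python
import math

def target_qp(t_n, mu, q, coeffs, act_pred):
    """Target-bits -> QP chain of Section B.3:
    eq. (4.29) target zero fraction, eq. (4.15) target quantized
    energy, eq. (4.30) target quantization step, then a simplified
    inversion of the exponential QP->step law of eq. (2.4)."""
    rho_t = (t_n - q) / mu                                 # eq. (4.29)
    x = 1.0 - rho_t
    eq_t = coeffs[0] + coeffs[1] * x + coeffs[2] * x * x   # eq. (4.15)
    delta_t = act_pred / eq_t                              # eq. (4.30), with Eq = act/Delta
    # Simplified inversion of Delta ~ 2**(QP/6) (assumption: unit
    # proportionality constant), clipped to the legal H.264 range.
    qp = round(6.0 * math.log2(delta_t))
    return max(0, min(51, qp))
```

With μ = −40000, q = 44000 and a target of 4000 bits, the chain yields ρ_T = 1, so E_{q,T} collapses to the constant coefficient and the step follows directly from the predicted activity.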
B.4 Parameters update
The rate control algorithm requires estimating the quantization parameter QP correspond-
ing to a given target percentage of “zeros”. In order to provide an accurate control over a
wide range of bit rates, an adaptive approach is adopted, which requires a reduced computational load
and avoids the storage of many coefficient tables as in [75]. For this purpose, an
LMS-based technique proved to be satisfactory.
After coding the n-th frame, the coefficients c_{i,t} of eq. (4.15) are updated in the following
way. First, the estimation error of E_q is found through the equation

e_{E_q} = E_q − E_{q,T}    (4.31)

with

E_{q,T} = Σ_{t=0}^{2} c_{i,t} (1 − ρ_n)^t    (4.32)

where ρ_n is the actual percentage of “zeros” of the current frame.
Then, the appropriate set of coefficients is updated as

c_{i,t} ← c_{i,t} + κ e_{E_q} ρ_n^t    (4.33)

where κ is the adaptation gain of the estimator. We kept a low κ value (κ = 0.01), resetting the c_{i,t}
values whenever the relative bit allocation errors are greater than a threshold. The initial values
of c_{i,t} are computed from a training set of sequences coded with constant QP.
In addition, the algorithm updates the slope μ and the intercept q of eq. (4.2), setting

μ ← (h_n − S_n)/(1 − ρ_n)    (4.34)

q ← 0.9 q + 0.1 (h_n − μ)    (4.35)

where S_n is the total number of bits produced and h_n is the number of header bits for the n-th
frame in the current GOP.
As for the bit rate related parameters, the available number of bits is updated according to

G_{k,n+1} = G_{k,n} − S_n    (4.36)

G^{micro}_{j+1} = G^{micro}_j − S_n.    (4.37)

In this way, the target bit rate for the following picture is modified, compensating previous bit
rate allocation errors.
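The post-frame update of eqs. (4.31)–(4.35) can be sketched as a single routine. This is an illustrative sketch; the function name is ours, and the reset of the coefficients on large allocation errors mentioned in the text is omitted.

```python
def update_model(c, kappa, eq_actual, rho_n, s_n, h_n, q):
    """Post-frame parameter update (sketch): LMS step on the quadratic
    coefficients of eqs. (4.31)-(4.33) and refresh of the linear
    rate model slope/intercept, eqs. (4.34)-(4.35)."""
    x = 1.0 - rho_n
    eq_pred = c[0] + c[1] * x + c[2] * x * x          # eq. (4.32)
    err = eq_actual - eq_pred                         # eq. (4.31)
    c = [c[t] + kappa * err * rho_n ** t for t in range(3)]  # eq. (4.33)
    mu = (h_n - s_n) / (1.0 - rho_n)                  # eq. (4.34)
    q = 0.9 * q + 0.1 * (h_n - mu)                    # eq. (4.35)
    return c, mu, q
```

Note that eq. (4.34) makes μ negative (more zeros means fewer bits), which is consistent with eq. (4.29): a smaller target T_n maps to a larger target ρ_{T,n}.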
The ratios K_{I,P} and K_{P,B} are set to

K_{I,P} = X̄_I/X̄_P,   K_{P,B} = X̄_P/X̄_B    (4.38)

and characterize the relations between the complexities X_i, i = I, P, B, for frames of different
type.
In order to avoid sudden changes in the complexity ratios, X̄_i is found through the averag-
ing filter

X̄_i ← ω X̄_i + (1 − ω) X_i,   i = I, P, B    (4.39)

where the input X_i is the complexity

X_i = 2^{QP_n/6} · S_n.    (4.40)

The parameter QP_n is the average QP value of the whole picture, and S_n is defined in (4.34).
The variables X_i, X̄_i and the corresponding complexity ratios K_{I,P}, K_{P,B} allow the coding
process of the H.264 encoder to adapt to the input video data.
In previous coding standards, several rate control algorithms defined a complexity propor-
tional to the quantization parameter and the coded bits, as in the expression

X_i = QP_n · S_n.    (4.41)

However, since in H.264 the relation between QP and ∆ is not linear but expo-
nential, equation (4.41) has to be changed into eq. (4.40).
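The complexity measure and its smoothing, eqs. (4.39)–(4.40), can be sketched as follows. The thesis does not state the value of the filter weight ω, so the default below is our assumption; the function name is also ours.

```python
def update_complexity(x_bar, qp_n, s_n, omega=0.5):
    """Complexity of eq. (4.40), X = 2**(QP/6) * S, smoothed by the
    averaging filter of eq. (4.39). `omega` is an assumed value:
    the thesis does not specify the filter weight."""
    x_new = 2.0 ** (qp_n / 6.0) * s_n             # eq. (4.40)
    return omega * x_bar + (1.0 - omega) * x_new  # eq. (4.39)
```

The exponential factor 2^{QP/6} mirrors the H.264 step-size law, so frames coded at higher QP with the same bit count are recognized as more complex.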
4.5.3 Bit rate control at macroblock level

At macroblock level the quantization parameter is corrected according to the number of remain-
ing bits and the percentage of “zeros”. This grants a good control over both picture quality and
coded bits, while keeping the bit rate within the given constraints and smoothing visual distortion
across different macroblocks. The proposed algorithm uses the same macroblock level control
reported in [75].
After coding the m-th MB of the n-th frame, the percentage of null quantized coefficients
in the previously coded m macroblocks is ρ^P_m and the number of bits used to code the picture
is B^P_m. According to the given target, B^R_m = T_n − B^P_m bits are left to code the remaining
macroblocks; the percentage of “zeros” required to fit the constraints is equal to

ρ^R_m = 1 − (B^R_m/μ) · N_MB/(N_MB − m)    (4.42)

where N_MB is the total number of macroblocks in each frame.
This leads to estimating the ratio k = ρ^R_m/ρ^P_m, which affects the quantization parameter
QP_{m+1} of the following macroblock according to the equation

QP_{m+1} = QP_{T,n} + 3   if 1 + 3δκ ≤ k < +∞
           QP_{T,n} + 2   if 1 + 2δκ ≤ k < 1 + 3δκ
           QP_{T,n} + 1   if 1 + δκ ≤ k < 1 + 2δκ
           QP_{T,n}       if 1 − δκ ≤ k < 1 + δκ
           QP_{T,n} − 1   if 1 − 2δκ ≤ k < 1 − δκ
           QP_{T,n} − 2   if 1 − 3δκ ≤ k < 1 − 2δκ
           QP_{T,n} − 3   if −∞ ≤ k < 1 − 3δκ
    (4.43)
with δκ specified in the following paragraph.
In [37] a linear law was used to correct the QP_n value, since in the H.263 encoder
the relation between QP and the quantization step ∆ can be expressed as ∆_{H.263} = 2 QP.
In the H.264 encoder this relation is given by the exponential relation (2.4). Therefore,
δκ can be estimated by

δκ = (0.67/C) · 2^{QP_{T,n}/6}.    (4.44)

In order to fit the targeted bit rates, the constant C has been set to 500. A reduced value of δκ
allows the encoder to react more quickly to changes in ρ^P_m.
Note that δκ is monotonically increasing with the quantization parameter since, according
to equation (2.4), the variation of ∆ is more relevant for higher values of the quantization
parameter. At high QP, frequent QP corrections would cause strong variations in bit rate and
coding quality across different macroblocks, and they are avoided by using a greater δκ.
In order to obtain a sufficient statistic for ρ^P_m, QP_m remains equal to QP_{T,n} until
B^P_m ≥ 0.1 · T_n.
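The macroblock-level correction of eqs. (4.43)–(4.44) reduces to a dead-zone ladder around the frame-level target. A sketch (the function names are ours):

```python
def delta_kappa(qp_target, c=500.0):
    """Dead-zone width of eq. (4.44): (0.67 / C) * 2**(QP_{T,n}/6),
    with C = 500 as in the thesis."""
    return (0.67 / c) * 2.0 ** (qp_target / 6.0)

def mb_qp(qp_target, k, dk):
    """QP offset ladder of eq. (4.43): k = rho_R/rho_P selects an
    offset in [-3, +3] around the frame-level target QP."""
    if k >= 1 + 3 * dk:
        return qp_target + 3
    if k >= 1 + 2 * dk:
        return qp_target + 2
    if k >= 1 + dk:
        return qp_target + 1
    if k >= 1 - dk:
        return qp_target
    if k >= 1 - 2 * dk:
        return qp_target - 1
    if k >= 1 - 3 * dk:
        return qp_target - 2
    return qp_target - 3
```

Because δκ grows exponentially with QP, the dead zone widens at high QP and the per-macroblock QP changes less often, exactly as the text describes.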
We adopted the RD-optimization performed by the JVT encoder in order to compare our
results with the rate control algorithm included in version JM 7.6 of the encoder. There-
fore, in our approach the rate control chooses the quantization parameter while the macroblock
coding mode is selected by minimizing the cost function

J(mode, QP_m) = D(mode, QP_m) + λ R(mode, QP_m)    (4.45)

where mode = 0, …, 10 is the macroblock mode, QP_m is the quantization parameter chosen
for the current macroblock, D(mode, QP_m) is the coding distortion, and R(mode, QP_m) is the bit
rate. The Lagrange multiplier λ is set to

λ = λ_0 · 2^{QP/6}    (4.46)

with λ_0 = 0.85 for I or P slices and λ_0 = 3.4 for B slices (see [114, 125, 126]).
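The multiplier of eq. (4.46) is a one-liner. A sketch (the function name is ours):

```python
def lagrange_lambda(qp, slice_type):
    """RD-cost multiplier of eq. (4.46): lambda = lambda0 * 2**(QP/6),
    with lambda0 = 0.85 for I/P slices and 3.4 for B slices."""
    lam0 = 3.4 if slice_type == "B" else 0.85
    return lam0 * 2.0 ** (qp / 6.0)
```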
4.6 Experimental results

In order to evaluate the performance of the “zeros” algorithm, we coded different sequences at various
bit rates using two different rate controls. The first one is the proposed “zeros”-based rate
control, while the second one is the algorithm implemented in the Joint Model 7.6 of H.264 by
the Joint Video Team (denoted here with the label JVT [59, 67]).
The configuration parameters of the H.264 video coder are reported in Table 4.1.

    Parameter                  Value
    GOP structure              IBBP
    GOP length                 15 and 60
    Coding algorithm           CABAC
    Search window width        16
    MV resolution              1/4 pixel
    Hadamard                   enabled
    Num. of reference frames   1
    RD optimization            enabled
    SP pictures                not used
    Slice mode                 not used

Table 4.1: Configuration parameters for the H.264 encoder.

For each
coded sequence we computed the bit rate and the PSNR. In addition, we calculated the stan-
dard deviation of the PSNR (σ_PSNR) in order to evaluate how strongly the distortion varies
among different frames. In fact, strong PSNR variations affect the resulting video quality since
the displayed sequence looks unnatural and visually unpleasant. A video sequence with large
PSNR variations may be worse than a sequence with a lower average video quality but
limited quality variations. At first, we coded different sequences at different bit rates. The
[Figure 4.5: (a) Bits vs. Frame; (b) PSNR (Y) (dB) vs. Frame; solid = Zeros, dotted = JVT.]

Figure 4.5: Bits/Frame and PSNR/Frame plots of 240 QCIF frames for the sequence salesman (GOP IBBP 60 frames) at 30 frame/s.
results are reported in Figs. 4.5, 4.7 and in Table 4.2 for QCIF sequences, while Fig. 4.6 reports
the results for a CIF sequence. The reported data show that the “zeros”-based approach provides
a better quality (as measured by the PSNR) with respect to the JVT algorithm. Fig. 4.5(a)
shows the number of bits allocated for each frame of the sequence salesman coded at 128
kbit/s, and Fig. 4.5(b) shows the corresponding PSNR value of the luma component. The plots
of Fig. 4.5(b) underline that the video quality of the “zeros” algorithm varies less even if
the allocated number of bits is approximately the same.

              “zeros”          JVT              “zeros”             JVT
    Target    Rate     err.    Rate     err.    PSNR ± σ_PSNR       PSNR ± σ_PSNR
    64.00     63.41    -0.92   63.92    -0.12   34.01 ± 1.99        33.83 ± 2.26
    80.00     79.65    -0.44   79.90    -0.12   35.46 ± 1.51        34.81 ± 3.02
    96.00     95.79    -0.22   95.81    -0.19   36.52 ± 2.18        35.73 ± 4.23
    112.00    111.53   -0.42   111.66   -0.30   37.38 ± 3.58        36.44 ± 5.13
    128.00    127.82   -0.14   127.54   -0.36   38.51 ± 1.82        37.08 ± 6.17
    144.00    143.60   -0.28   143.59   -0.29   39.29 ± 2.49        37.68 ± 6.91
    160.00    159.09   -0.57   159.48   -0.32   40.22 ± 3.14        38.27 ± 8.25
    176.00    175.41   -0.34   175.37   -0.36   40.76 ± 3.05        38.72 ± 8.81
    192.00    191.63   -0.20   191.34   -0.34   41.84 ± 2.76        39.14 ± 10.55
    208.00    207.09   -0.44   207.28   -0.35   42.54 ± 2.88        39.73 ± 12.19
    224.00    223.08   -0.41   223.20   -0.36   43.10 ± 3.36        40.22 ± 13.56
    240.00    239.10   -0.38   239.18   -0.34   43.74 ± 3.77        40.61 ± 15.69
    256.00    254.59   -0.55   255.01   -0.39   44.34 ± 4.43        41.00 ± 16.97

Table 4.2: Results for the sequence salesman. [PSNR] = [σ_PSNR] = dB, [Rate] = [Target] = kbit/s, [err] = (%).
[Figure 4.6: PSNR (Y) (dB) vs. Bit rate (kbit/s); solid = Zeros, dotted = JVT.]

Figure 4.6: Distortion-Rate plot of 120 CIF frames for the sequence salesman (GOP IBBP 60 frames) at 30 frame/s; the superimposed vertical bars denote ±σ_PSNR.
In addition, the data in Table 4.2 and in Fig. 4.7 show that the perceptual quality vari-
ation (measured by σ_PSNR) is smaller for the proposed algorithm. In fact, the graphs of
Fig. 4.7 show the experimental distortion-rate curves with superimposed vertical bars that denote
±σ_PSNR. The results were obtained coding the sequences salesman, foreman, news, and
container. The figures confirm that the proposed algorithm produces both a greater PSNR
value and a lower σ_PSNR at all bit rates, i.e. both a higher and a smoother quality. This fact proved
[Figure 4.7: four distortion-rate panels, PSNR (Y) (dB) vs. Bit rate (kbit/s), solid = Zeros, dotted = JVT: (a) 360 frames from sequence foreman (QCIF, GOP IBBP 15 frames); (b) 240 frames from sequence salesman (GOP IBBP 60 frames); (c) 240 frames from sequence news (GOP IBBP 60 frames); (d) 240 frames from sequence container (GOP IBBP 60 frames).]

Figure 4.7: Distortion-Rate plots for different QCIF sequences at 30 frame/s; the superimposed vertical bars denote ±σ_PSNR.
    Target (kbit/s) /   Seq.        JVT algorithm                  (ρ, E_q) algorithm
    GOP length /                    Bit rate    PSNR (dB)          Bit rate    PSNR (dB)
    Format                          (kbit/s)    ± σ_PSNR           (kbit/s)    ± σ_PSNR

    64/60/QCIF          foreman     65.78       33.28 ± 2.24       63.93       33.56 ± 1.26
                        news        63.72       35.50 ± 2.39       63.89       36.44 ± 2.03
                        container   63.68       38.17 ± 0.68       63.55       38.39 ± 0.63
                        silent      66.02       34.93 ± 0.63       63.97       34.87 ± 0.87
                        table       66.95       32.43 ± 2.36       63.85       32.62 ± 2.85
                        salesman    65.52       33.93 ± 2.26       63.92       34.01 ± 1.99

    96/60/QCIF          foreman     95.99       34.79 ± 1.79       95.66       35.35 ± 1.27
                        news        96.19       37.49 ± 1.97       95.39       38.96 ± 2.46
                        container   95.51       39.64 ± 1.73       95.55       39.87 ± 1.03
                        silent      98.08       36.57 ± 0.70       95.93       37.66 ± 0.34
                        table       99.75       34.42 ± 2.29       95.70       35.12 ± 2.37
                        salesman    98.08       36.01 ± 2.36       95.81       36.52 ± 2.18

    128/60/QCIF         foreman     130.81      36.05 ± 1.81       127.67      36.58 ± 1.21
                        news        128.43      39.19 ± 2.92       128.20      40.50 ± 1.63
                        container   127.59      40.79 ± 3.08       127.58      41.24 ± 2.03
                        silent      130.49      38.05 ± 1.68       127.94      39.25 ± 0.80
                        table       132.19      35.75 ± 2.22       127.40      36.84 ± 2.19
                        salesman    132.04      37.49 ± 3.17       128.08      38.51 ± 1.82

    96/15/QCIF          foreman     96.37       34.94 ± 2.84       96.70       35.38 ± 1.38
                        mobile      96.20       27.69 ± 1.77       96.67       28.52 ± 0.67
                        salesman    95.79       36.63 ± 2.30       96.64       37.08 ± 1.83
                        silent      96.87       36.57 ± 1.60       97.00       37.07 ± 0.90

    128/15/QCIF         foreman     128.25      36.05 ± 2.81       129.54      36.84 ± 1.51
                        mobile      127.99      28.88 ± 1.62       129.00      30.01 ± 1.40
                        salesman    127.66      37.91 ± 2.41       128.93      38.94 ± 2.02
                        silent      128.93      38.00 ± 1.97       129.22      39.34 ± 1.01

    192/60/CIF          foreman     202.15      34.39 ± 1.14       192.128     34.62 ± 1.40
                        news        205.56      37.24 ± 1.22       191.46      37.96 ± 1.88
                        salesman    209.29      34.52 ± 0.83       192.26      34.79 ± 0.92
                        table       201.99      30.83 ± 1.39       191.53      31.06 ± 1.22

    256/60/CIF          foreman     267.65      35.53 ± 1.11       255.74      35.86 ± 1.08
                        news        270.92      38.54 ± 1.30       260.73      39.45 ± 0.90
                        salesman    277.08      35.43 ± 0.87       256.38      35.79 ± 0.73
                        table       266.47      31.93 ± 1.24       255.71      32.44 ± 1.13

Table 4.3: Comparison between the (ρ, E_q)-based algorithm and the JM7.6 algorithm.
to be independent of the complexity of the sequence. We performed the same analysis on CIF
sequences in order to evaluate the performance of the algorithm with larger pictures.
Experimental results for the sequence salesman are reported in Figure 4.6 and confirm the
previous results. More results are reported in Table 4.3.
[Figure 4.8: (a) Rate vs. Frame number; (b) PSNR (Y) (dB) vs. Frame number; solid = Zeros, dotted = JVT; annotations mark the 128 kbit/s, 102 kbit/s and 154 kbit/s segments.]

Figure 4.8: PSNR and Rate plots of 180 QCIF frames for the sequence foreman (GOP IBBP 15 frames) at 30 frame/s. The bit rate decreases to 102 kbit/s at the 90th frame and increases to 154 kbit/s at the 130th frame.
The “zeros”-domain algorithm also exhibits good performance in case of varying band-
width. In fact, Figure 4.8 shows the results obtained when transmitting the foreman sequence
over a channel with varying capacity. The algorithm has to adapt to changes in the channel bit rate,
which drops from 128 kbit/s to 102 kbit/s in the first transition and increases to 154 kbit/s in
the second one. The performance of both algorithms is reported, showing both the PSNR
and the number of coded bits for each frame. The performance of the “zeros” algorithm does not
appear to be seriously affected by changes in channel capacity. In fact, the algorithm provides
a smoother quality across consecutive frames than the JVT algorithm, avoiding peaks in the
number of bits per frame. In this way, transmission jitter is limited, and as a consequence
it is possible to avoid frequent freezing of the displayed frames whenever the decoder has not
enough buffered data and has to wait for the complete reception of the next frame to be decoded.
Moreover, we tested the algorithm using the same conditions as the VBR tests³ in [61], with RD
optimization enabled. The proposed algorithm proved to be more effective in terms of both
visual quality and rate control accuracy. The results are reported in Table 4.4.

    Sequence   foreman QCIF    carphone QCIF   news CIF
    ρ, E_q     39.68/152.40    41.51/152.40    43.67/305.23
    JVT        39.36/156.79    41.31/157.68    42.62/314.37

Table 4.4: PSNR/Rate for VBR tests on different sequences.

³Different sequences are coded at 10 frame/s (100 frames with GOP IPPP). The bit rate is 128 kbit/s until the 60th frame and is then incremented to 192 kbit/s for QCIF frames, while the initial target rate is 256 kbit/s and is incremented to 384 kbit/s for CIF sequences.
4.7 Summary
In this chapter we analyzed the application of a rate distortion model based on the percentage
ρ of null quantized transform coefficients (“zeros”) to the video coding standard H.264/AVC.
The bit rate proves to be a linear function of ρ for frames of different coding types, and
this relation can be efficiently exploited to control the produced bit rate. However, in this
model the probability density function of the transform coefficients plays an essential role.
Experiments show that it is possible to reduce the computational cost of storing the
coefficient statistics by parameterizing the percentage of zeros via the energy of the quantized
signal E_q. In fact, it is possible to find a quadratic relation between E_q and ρ that makes
the analysis of the produced bit rate very easy. Modeling the signal in the joint domain (ρ, E_q)
permits the design of a low-cost rate control algorithm which provides good performance at both
high and low bit rates. The results were also obtained adopting an enhanced skipping strat-
egy that avoids sending a useless amount of information and prevents an unnecessary waste of
bandwidth. This choice increases the coding performance with respect to the algorithm imple-
mented in the JM7.6 software, both in terms of average PSNR and in terms of its variance. The
resulting visual quality proves to be higher and smoother for the proposed algorithm, while the
JM7.6 algorithm proves unable to quickly adapt its coding parameters to the statistics
of the input signal. In addition, the proposed algorithm proves to be flexible in the presence of
bandwidth variations, adapting its parameter settings quickly. Experimental
results show that bit rate oscillations of around 25% do not significantly affect the performance of
the proposed algorithm, while the technique used in the reference software is not able to quickly
react to changes and its buffer suffers from overflows or underflows according to the variations
in the available bandwidth. Finally, the computational complexity is significantly reduced
since no pre-analysis is required and the coding of a single macroblock is performed only once.
Chapter 5

Joint Source-Channel Video Coding Using H.264/AVC and FEC Codes

“Nothing hurts a new truth more than an old error”
Johann Wolfgang von Goethe

One of the most challenging drawbacks of video transmission over mobile channels is the perceptual degradation of the reconstructed video sequence at the decoder. In fact, the high percentage of lost packets, as well as the intensive use of prediction to obtain a high compression ratio, affects the visual quality of the reconstructed sequence. As a matter of fact, it is necessary to introduce some redundant data in order to increase the robustness of the coded bit stream. A possible solution consists in filling a matrix structure with RTP packets and applying an FEC code to its rows. However, the matrix size and the chosen FEC type affect the performance of the coding system. This chapter discusses a rate allocation algorithm that distributes the available number of bits between the H.264/AVC coder and the channel coder in order to maximize the perceptual quality of the decoded sequence.
5.1 Introduction
As anticipated in Chapter 1, one of the technical challenges posed by video
transmission over wireless networks is granting a certain QoS to the end user. In fact, the ever-
changing nature of radio channels and the varying topology of wireless networks modify the
transmission conditions more often than in wired communication. As a drawback,
the transmission of video contents becomes a challenging task, since varying channels require
flexible algorithms that adapt the coding parameters to the different transmission conditions.
At the same time, the greatest difficulty is due to the fact that mobile networks cannot grant
a reliable transmission because of errors and losses, which are the very Achilles’ heel of video
transmission.
Losses and errors may be produced by different causes. A first cause is the time-varying
characteristics of the transmission environment, where the transmitted information is often
corrupted by bursty bit error patterns. One of the classical techniques used to make
the bit stream less vulnerable to errors is to increase the redundancy of the sent information,
providing the decoder with some “extra” data. The additional amount of information allows
the recovery of the lost data as long as their amount remains below a given threshold. These
techniques are called Forward Error Correction (FEC) schemes since no interaction is needed
between the encoder and the decoder in the recovery process, and their effectiveness is limited
by the capability of designing a protection scheme that suits the channel conditions at all
times. Closed-loop error control techniques like Automatic Repeat reQuest (ARQ) provide a
more efficient protection against errors since the decoder interacts with the encoder, sharing its
knowledge of the channel conditions and allowing the encoder to tune the allocated redundancy
in an appropriate way. Unfortunately, many applications cannot resort to ARQ schemes since
they need a reliable feedback channel and introduce an excessive transmission delay,
which is prohibitive for interactive communications.
In addition to the problems related to the radio link, we must take into consideration the
amount of coded data and the network conditions too. Video sources produce a huge amount of
information per time unit with respect to other kinds of sources. Hence, video communications
strongly affect the network conditions, since one or more uncontrolled users sending
video packets across the network may seriously limit the transmission capacity available to the
others. Usually, network management adopts a set of different policies in order to prevent a
single user from monopolizing the network resources. These solutions monitor the entering traffic,
and whenever the network is overloaded, they take appropriate measures, such as packet
dropping¹ or queuing. At the receiver, the dropping of a packet is perceived as a loss.
Finally, transmission delays must be mentioned as well. In fact, each data packet requires
a certain amount of time to reach its destination, depending on the average transmission ca-
pacity, the number of crossed links and the overall waiting times in the queues. Whenever
the delay statistics presents a limited variance (jitter), the delay is compensated by buffering
the decoded information at the receiver and displaying the reconstructed sequence after an
appropriate initial time interval (playout delay). Unfortunately, highly-varying delay statistics
make the estimate of the initial waiting time difficult. Moreover, interactive applications re-
quire limited delays, since the decoded frames must be displayed at fixed instants which are
pre-determined according to the negotiated QoS. Whenever a frame arrives too late, it can be
discarded and regarded as lost.
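As an illustration of the playout-deadline rule just described, the following sketch buffers frames for an initial playout delay and discards any frame that arrives after its fixed display instant. All names and the chosen delay values are illustrative assumptions, not part of any standard.

```python
# Sketch of the playout-deadline rule: frames are displayed at fixed instants
# determined by an initial playout delay; a frame arriving after its deadline
# is discarded and treated as lost. Names and numbers are illustrative.

def playout_schedule(arrivals, frame_period, playout_delay):
    """arrivals[i]: arrival time of frame i (seconds); returns per-frame fate."""
    fate = []
    for i, t_arrival in enumerate(arrivals):
        deadline = playout_delay + i * frame_period  # fixed display instant
        fate.append("display" if t_arrival <= deadline else "late/lost")
    return fate

# Frame period 40 ms (25 fps), initial playout delay 100 ms.
arrivals = [0.02, 0.06, 0.25, 0.15]          # frame 2 suffers a delay spike
print(playout_schedule(arrivals, 0.040, 0.100))
# → ['display', 'display', 'late/lost', 'display']
```

A larger playout delay tolerates more jitter at the price of a longer start-up wait, which is the trade-off discussed above for interactive applications.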
Although this chapter focuses on dealing with losses in the packet video stream, errors
and corruption can be efficiently addressed too. In the data stream produced by a video coder,
syntax elements are coded into a sequence of variable-length binary strings. The corruption of
one bit is critical for the whole decoding process, since all the remaining symbols are wrongly
decoded until a synchronization marker is found. In these cases, the resulting bit stream
can be correctly decoded until an error occurs, and therefore the impact of the loss on the
decoding process may vary according to where the error takes place and where the decoder
detects it. Hence, the decoder may carry on decoding erroneous data as long as it finds a feasible
bit stream, introducing a distorted frame in the process which affects the decoding of the
remaining sequence.
¹ A packet that is non-conforming with respect to the allowed amount of traffic is discarded. Dropping of packets can take place whenever the user does not respect the conditions specified in the traffic contract, independently from the actual presence of a network congestion.
According to these premises, errors and losses are a peculiar characteristic of wireless
communications and must be appropriately addressed, since they can seriously affect the quality
perceived by the end-user. The following section will present an overview of the different tech-
niques adopted to cope with errors and losses in the video packet stream. Then, an
efficient FEC approach that permits reducing the quality degradation of the coded bit stream
whenever packets are lost will be presented. Its performance is greatly improved
whenever the strength of the protection is varied according to the characteristics of the video
content. Hence, we will present an optimization strategy which improves the quality of the
reconstructed sequence for a given coding rate. This technique is included in a joint source-
channel rate control which adaptively partitions the available bandwidth between the channel
coder and the video source coder. The reported experimental results show that the solution
can significantly improve the performance of a non-adaptive approach.
5.2 On dealing with channel errors and losses in video transmission
The transmission rates offered by current communications providers are inadequate to trans-
mit the uncompressed multimedia contents produced by each user. This limit requires the
adoption of efficient coding algorithms with a good compression ratio that significantly reduce
the amount of transmitted data. Nowadays, all the approaches that have been proposed in-
clude a DPCM loop along the temporal dimension. At every instant, video information can be
predicted from the previously decoded data that constitute the state of the decoder (in-
terframe coding). Nevertheless, the price to be paid for the high coding gain of Inter prediction
is an extreme vulnerability to transmission errors.
The loss of part of the information prevents a correct reconstruction of the encoder state
at the decoder and has a considerable impact on the quality of the following frames. In most
video communications, the state of the encoder/decoder is given by those frames that have
been encoded/decoded and are included in the frame buffer as available references for motion
compensation. In case one of them is either missing, because the connection was temporarily
lost during the transmission, or corrupted, because the coded stream was altered by channel
noise, the decoder must take appropriate measures to replace the loss.
A possible solution is to estimate an approximated frame that replaces the original one at the
decoder. The mismatch between the original frame and its approximated version introduces an
additional distortion in the reconstructed sequence, which is reduced the closer the estimated
frame is to the missing one. To this purpose, during the last years the technical literature has
proposed many error concealment algorithms [22, 7], which have adopted more and more
sophisticated estimation techniques.
On the other side of the connection, the encoder can optimize the packet stream in order
to maximize the estimated quality of the reconstructed sequence given the loss statistics and
the available bandwidth. The optimization can be performed by coding the video source in an
appropriate way, including a certain amount of non-predicted information in the coded stream
in order to stop the error propagation following a frame loss. Moreover, it is possible to add some
redundant information in the bit stream that allows the decoder to estimate the lost data in case of losses.
Finally, the technical literature has also presented some ARQ protocols, which allow the
decoder to ask for the retransmission of part of the lost information [28]. However, these tech-
niques imply the existence of a control channel that connects the receiver with the transmitter
and allows the decoder to provide the encoder with an error report. In a wireless network sce-
nario, the reliability and the timeliness required by the feedback channel can not be granted
(they could be granted at cell level, but not over long paths), and therefore ARQ
techniques will not be considered.
The following sections will present an overview of error concealment techniques performed
both at the decoder and at the encoder, paying particular attention to the latter since our
investigation is focused on them.
5.2.1 Error concealment at the decoder
The previous section has given a short overview of the possible different errors that may affect
the received bit stream. According to the nature of the corruption, different results can be
obtained. Transmission glitches range from single bit errors to bursts, or even the temporary
loss of connection, causing a wide range of different conditions.
In case of bit errors, the corruption of one bit in the stream causes an incorrect decoding
of the corrupted symbol, which propagates the error to the following values until a resynchro-
nization point is reached [22, 24]. The first step that the decoder must take in order to decode
a corrupted bit stream is to detect syntax errors, discarding the rest of the corrupted data unit
and recovering the synchronization with the encoder. Locating the first corrupted bit in
the stream is made possible by checking the values of the decoded elements and detecting
violations of the coder syntax. The difficulty of this task is strictly dependent on the entropy cod-
ing algorithm that is adopted. In the H.264/AVC standard, the error concealment of
a CAVLC stream is much easier than that of a CABAC stream, since in the latter syntax
errors are detected far beyond the point where they actually occurred (see [22]). At every syntax
exception, the decoding process is interrupted, given the impossibility of recovering the remaining
information in the current slice. Due to the data structure introduced by the H.264 standard,
every slice is independent from the others, and therefore the resynchronization with the data
flow can happen at every new slice.
In case of losses, no error location is needed because a whole packet is lost and the decoder
does not need to be resynchronized. Since in most cases error detection and correction
are delegated to the lower layers of the communication stack, in this work we will consider only
packet losses, even if the presented algorithms can be efficiently applied to a corrupted bit
stream.
Usually, the process of video decoding takes place at the highest layers of the protocol
stack, and in most protocols the lowest layers block the corrupted packets and signal to
the highest layers that they are lost. However, the spread of multimedia communication
over packet networks has highlighted the possibility of decoding a multimedia packet at the
highest layers even if it contains bit errors, since in many cases corrupted packets can still pro-
vide significant information to the video decoder. Therefore, some
transmission protocols have been proposed that make corrupted packets available to the high-
est layers whenever errors do not affect some crucial parts of the packets, like headers. One of
these is UDP-Lite, an extension of UDP that allows the sender to specify whether to
compute the checksum on the whole packet or only on the header. In this way the packet is kept
even if it is corrupted in the payload, as long as the destination and the length are correct. Error
concealment algorithms can detect the errors and, in some cases, correct them by checking the
compatibility of the decoded information with the syntax of the coding standard [22, 24, 27].
After detecting a loss, concealment methods are necessary to reconstruct the missing parts
of a damaged image. Note that in these approaches a feedback transmission to the encoder is
avoided, since it implies longer delays in the displaying of the pictures. On the contrary, most
of the adopted algorithms perform an image post-processing at the decoder, taking advantage
of the intrinsic correlation that can be found in a video sequence ([21, 20]). Some techniques
are based on the interpolation of lost pixels from the neighboring information, while
others recover the lost syntax elements from the neighboring ones. For example, it is
possible to estimate a lost motion vector by predicting its value from the neighboring ones, thanks
to the correlation existing among spatially-adjacent motion vectors [23]. The efficiency of
each solution varies according to the characteristics of the video sequence, and it is strictly
dependent on how the video stream is coded. Hence, the encoder can optimize the coding
variables in such a way as to enhance the error concealment performance at the decoder.
5.2.2 Error concealment at the encoder
The previous section has quickly glanced at the techniques employed by the decoder to mitigate
the effects of transmission errors on the decoding process. The performance of these techniques
is deeply affected by the coding choices adopted at the encoder, i.e. the performance of the
error concealment is greatly improved whenever the coder control takes into consideration the
channel conditions while tuning its parameters.
Intra refresh of coded video information
In the literature, one of the first algorithms addressing the problem of producing a robust video
stream is based on periodically including non-predicted information in the bit stream in order
to block error propagation (Intra refresh). This mechanism was inherited from DPCM cod-
ing [50], and it can be properly tuned according to the video content and the channel statistics
(see [62]). The video encoder may force the RD-Optimization algorithm so that the crucial
parts of an image are coded with Intra coding, introducing in this way non-predicted informa-
tion in the coded video stream. As a drawback, the increment of Intra-coded macroblocks
produces an increment in the number of coded bits or, in case the bit rate is constrained by the
available bandwidth, a decrease in the quality of the reconstructed sequence with respect to its
counterpart in an error-free environment. The identification of the parts to be refreshed depends
on the error model for the channel and on the required robustness. One possible strategy is to
randomly intra-code the macroblocks of the sequence so that after a certain number of frames
the whole image has been refreshed (see [24]). Another strategy, which proves to be extremely
effective against bit errors, is to identify which macroblocks are either crucial in the
decoding process or more likely to be lost, and refresh them (see [62]). In this way, while
the former approach refreshes all the macroblocks the same number of times, the latter selects
more often those macroblocks whose loss has a stronger effect on the visual quality of the
reconstructed sequence provided to the end user. With the adoption of the flexible macroblock
ordering (FMO) structure of the H.264/AVC standard, Intra refresh can be performed in a very
efficient way by intra-coding a whole slice of non-neighboring macroblocks. Increasing the
percentage of Intra macroblocks in the coded sequence increases the overall bit rate, since the
coding gain of Intra coding is much lower than that of temporal prediction.
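The random refresh strategy described above can be sketched as follows: macroblocks are visited in a random order so that, after a chosen number of frames, every macroblock of the picture has been intra-coded once. This is a minimal illustration under assumed names, not the actual scheduling of any specific encoder.

```python
import random

def intra_refresh_schedule(num_mbs, refresh_period, seed=0):
    """For each frame of one refresh period, return the set of macroblock
    indices to force to Intra mode; after `refresh_period` frames the whole
    picture has been refreshed."""
    rng = random.Random(seed)
    mbs = list(range(num_mbs))
    rng.shuffle(mbs)                               # random refresh order
    per_frame = -(-num_mbs // refresh_period)      # ceil(num_mbs / period)
    return [set(mbs[i:i + per_frame]) for i in range(0, num_mbs, per_frame)]

# QCIF picture: 99 macroblocks, fully refreshed over 10 frames.
schedule = intra_refresh_schedule(num_mbs=99, refresh_period=10)
print(len(schedule), set().union(*schedule) == set(range(99)))   # → 10 True
```

Shortening the refresh period increases robustness at the price of more Intra macroblocks per frame, i.e. a higher bit rate, which is the trade-off discussed above.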
Multiple Description coding algorithms
Recently, other techniques have been studied in order to allow a reliable transmission even
in the presence of a high loss rate (10 − 20%) and with bursts of consecutive losses. Among
the possible solutions, Multiple Description Coding (MD Coding or MDC) has been broadly
investigated during the last years. According to a concise definition of MDC given by V.
Goyal [32], Multiple Description Coding aims at “representing a single information source
with several chunks of data (‘descriptions’) so that the source can be approximated from any
subset of chunks”. Note that chunks are perfectly equivalent in the reconstruction process, viz.
the performance of the error concealment algorithm at the decoder does not depend on which
pieces of data correctly arrive but on their number. In fact, the more chunks the decoder gets,
the higher the quality obtained from the decoding process, independently from which pieces
of information arrived, and this is the main distinctive element that differentiates MDC from
scalable coding. In scalable coding, different chunks (or packets) of data are used to represent
the source, but packets are also hierarchically ordered, and the loss of the most important one
precludes the decoding of all the others. In a multiple-channel environment, the transmission
system must adaptively select the transmission channel according to the importance of the
transmitted data, sending the packets with the most relevant information across the most reliable
transmission path. MDC schemes permit avoiding this step since all the chunks of data are
equally relevant. The key idea behind multiple description dates back to the late 70’s, when the
Figure 5.1: A pictorial example of Multiple Description Coding. Three descriptions of the same video source are transmitted across three independent channels (red, green, and blue).
problem was to allow DPCM speech transmission over faulty channels, i.e. channels that were
not working in certain periods. The aim was to find a more efficient solution than replicating or
splitting the transmitted information on more than one channel (see [25]). A solution proposed
by Jayant was based on the separation of odd and even samples in a speech coding method
([49]). The original sequence of samples was split into two sub-sequences, the odd ones and
the even ones, which were coded by two separate DPCM coders. Then, the output signals
were merged together and transmitted over separate channels. Assuming that the loss patterns on
different channels are uncorrelated, whenever a sample of one description is lost (e.g. an odd
sample), it is possible to reconstruct the original sequence at half the sample rate. In addition,
the lost information can be estimated by interpolating the previous and the following samples,
thanks to the high correlation of adjacent speech samples. In this way, the state of the DPCM
decoder is recovered and the decoding of the subsequence can go on with some additional
distortion included. This solution has recently been applied to video coding, where speech
samples are replaced by frames at different instants.
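Jayant's odd/even separation and the interpolation-based concealment can be sketched as below; the function names are ours and the scheme is reduced to its bare essentials (interior samples only, DPCM coding omitted).

```python
def split_descriptions(samples):
    """Split a sample sequence into two descriptions (even/odd indices)."""
    return samples[0::2], samples[1::2]

def conceal_odd_loss(even):
    """Reconstruct when the odd description is lost: keep the even samples
    and estimate each interior odd one by interpolating its even neighbours
    (the trailing odd sample has no following neighbour and stays missing)."""
    out = []
    for i, s in enumerate(even):
        out.append(s)
        if i + 1 < len(even):                      # interior odd sample
            out.append((s + even[i + 1]) / 2.0)    # linear interpolation
    return out

even, odd = split_descriptions([10, 12, 14, 16, 18, 20])
print(conceal_odd_loss(even))   # → [10, 12.0, 14, 16.0, 18]
```

The interpolated values coincide with the lost ones only for slowly-varying signals; in general they introduce the additional distortion mentioned above.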
From these initial techniques, different MDC schemes have been proposed. For example,
some approaches are based on quincunx sampling ([13]), others adopt multiple states for
motion compensation in order to replace a frame with a coarser version in case it is
lost [134], others include a correlating transform in the coding process that increases the re-
dundant information in the stream [88, 123, 89, 31], and others use different quantizers with
translated characteristics [119, 12]. All these techniques take advantage of the correlation
existing or created between syntax elements that are either temporally or spatially close. Some
schemes allow tuning the allocated redundancy, while for other approaches the redundancy is
intrinsically determined by the MDC scheme and can not be controlled.
Although some promising results have been obtained, multiple description still proves to be
young with respect to the current technological resources [117], since the required bandwidth
is high in most cases and for many efficient algorithms the allocated redundancy can not
be tuned.
Distributed Source Coding
A possible alternative to the previous techniques is provided by Distributed Source Coding
(DSC), which dates back to the pioneering works of Slepian and Wolf (1973), and Wyner and
Ziv (1976), but has lain dormant for more than a quarter century, perhaps due to a lack of ap-
plication focus. Recent applications, like video-over-wireless, multimedia cellular telephony
and wireless video surveillance camera systems, have aroused a new interest in distributed
coding thanks to its built-in robustness to the drift caused by prediction-error mismatch between
encoder and decoder following a channel loss. As a consequence, during the last years several
DSC-based video coding architectures have appeared in the literature, aimed at providing robust
alternatives to the traditional coding standards, like MPEG-x and H.26x. More details about
this topic are to be found in Chapter 6.
Automatic Repeat reQuest techniques
All the previous techniques are characterized by the fact that the video encoder is completely un-
aware of which parts of the information have been correctly received and which have been lost or
corrupted. Whenever a feedback channel is available, the quality of the reconstructed sequence
at the decoder can be greatly improved by designing a coding scheme that allows the decoder
to communicate with the encoder [27]. The feedback channel is used to signal to the encoder
which parts of the video sequence have been correctly received and which parts have been lost
[28, 27].
Usually, the information sent across the feedback channel consists of positive acknowl-
edgements (ACKs) or negative acknowledgements (NACKs), which state whether a loss has
occurred or not. The feedback message is not part of the video coder syntax, but it is handled
at a lower layer of the protocol stack, where control information is exchanged. However, some
standards were defined in order to provide video encoders with a more detailed knowledge of
the missing parts. In relation to the H.263 video coder, ITU-T Recommendation H.245 [46] allows re-
porting the spatial and temporal location of macroblocks that could not be decoded successfully
and had to be concealed. According to the decoder reports, the sender keeps on retransmitting
the lost part until it is correctly received. In case of errors, the delay introduced by this method
can be significant, especially for real-time and interactive applications [27]. The video encoder can
take more appropriate measures to compensate the loss, such as retransmitting the whole or part
of the lost information (possibly at a lower quality), selecting as MC references those frames in
the buffer that have been correctly decoded, or coding with Intra mode those parts that could
be corrupted by the loss in order to block the propagation of the error. These measures
allow a reduction in the overall amount of transmitted data and in the average delay that elapses
between the first transmission and the instant when a correct decoding is possible.
The following section will present another alternative to all the previous coding techniques,
based on the adoption of FEC codes to generate some additional redundancy packets in the bit
stream that allow the decoder to recover the lost information in case of losses.
5.3 Channel coding techniques based on FEC codes
All the previous techniques aim at reducing the probability of receiving a corrupted bit stream
by increasing the redundant information sent across the channel. This is obtained either by ex-
ploiting the intrinsic redundancy, which is present both in the video signal and between the
syntax elements of the video coder, or by retransmitting the lost information at different quality lev-
els. In this section we will present another approach that creates some additional redundant
information using Forward Error Correction (FEC) codes.
FEC codes constitute the first class of codes that have been efficiently applied in com-
munications, thanks to their error correcting performance that does not imply signaling to the
transmitter the correct reception of the transmitted data. Given a source of binary symbols with
bit rate R_b, the FEC coder converts this stream into a new one with bit rate R_b · (1 + r), where
r ≥ 0 is the additional redundancy. For example, a simple inefficient code is the repetition
code, which replicates each input binary symbol 1 + r times in the output stream [68]. It is possi-
ble to obtain good correcting performance without wasting the available bandwidth (like
repetition codes do) by processing blocks of symbols from the input source. In case of a block
code, s-length strings of bits are mapped into n-length strings of bits (n = s + k), adding k
redundant bits. In this case the redundancy is equal to r = k/s. The cardinality of the domain
is 2^s, but the cardinality of the codomain is 2^n, where 2^n − 2^s corrupted codewords are included
(here a corrupted codeword is intended as a codeword that is not included in the image set of
the map). One of the first examples of block code is the parity-check code [68], but in the literature
several others have been proposed. The lengths of the input and output strings of a block
code characterize the error correcting and detecting performance of the channel code itself.
Whenever the codewords are equally distributed in the codomain set (i.e. they are equally dis-
tant), it is possible to correct ⌊(k − 1)/2⌋ errors and detect k corrupted bits. The arithmetic of
binary block codes is based on the binary Galois field (GF(2)), but they can be extended to wider
Galois fields (GF(2^q)), and all the previous properties still hold for non-binary symbols.
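As a toy illustration of the two GF(2) codes mentioned above, the sketch below encodes with a rate-1/3 repetition code (decoded by majority vote, correcting one error per symbol) and with a single parity-check block code (which can only detect an odd number of errors). All function names are illustrative.

```python
def repetition_encode(bits, n=3):
    """Repetition code: each bit is sent n times (redundancy r = n - 1)."""
    return [b for bit in bits for b in [bit] * n]

def repetition_decode(coded, n=3):
    """Majority vote per block: corrects up to (n - 1) // 2 errors per symbol."""
    return [int(sum(coded[i:i + n]) > n // 2) for i in range(0, len(coded), n)]

def parity_encode(bits):
    """(s + 1, s) single parity-check block code (k = 1, r = 1/s)."""
    return bits + [sum(bits) % 2]

def parity_check(word):
    """True if the received word satisfies the parity check (no error detected)."""
    return sum(word) % 2 == 0

coded = repetition_encode([1, 0])
coded[1] = 0                                   # flip one bit of the first block
print(repetition_decode(coded))                # → [1, 0]  (error corrected)
print(parity_check(parity_encode([1, 0, 1])))  # → True (clean word passes)
```

The repetition code triples the bandwidth to correct one error per symbol, while the parity-check code spends a single redundant bit but can only detect, not correct: this is exactly the rate/protection trade-off discussed above.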
Although these techniques were introduced at the lowest layers of the protocol stack to cope with
errors on communication channels, they have recently been reused at the higher layers in order
to recover the lost information. As for multimedia signals, different proposals have aimed at
applying block codes on distinct video packets. One of these cross-packet strategies implies the
inclusion of video packets in a matrix in a column-wise order, applying the chosen
block codes along the rows [33, 34, 106, 63]. For a given channel, the performance of the code
strictly depends on the filling strategy and the size of the matrix. Assuming that n is the total
number of columns and L is the number of rows, the first s columns (source code columns)
are filled with video source packets, while the remaining k = n − s columns (channel code
columns) are computed according to the adopted channel code. The symbols in the channel
code columns are included in the redundant packets that allow the decoder to reconstruct the
lost information in case of errors. In the following paragraphs, a brief overview of different
coding methods will be given. The simplest cross-packet method that was adopted is described
[Figure 5.2 graphics: each panel shows the coding matrix, with the matrix height on the vertical axis and the codeword length on the horizontal one; the first columns contain the source coding bytes (Packet 1, ..., with padded cells in (a)) and the last columns the channel coding bytes (FEC packets). Panel (a): with zero-padding; panel (b): without zero-padding.]
Figure 5.2: General scheme for the coding matrix in the RFC2733 approach with and without byte padding.
in RFC2733 [106] and introduces one parity-check packet of L bytes for each matrix of
source packets (i.e. k = 1). The source packets are included in the matrix one per column,
permitting a correct decoding of video information only whenever no more than k packets over
n = s + k are lost or corrupted. The scheme was later extended by adopting more complex FEC coding
solutions, like the Reed-Solomon (RS) codes, which allow the receiver to recover more packets
per matrix and are widely used in many transmission systems.² This scheme generalizes the
² Another important class of codes is nowadays used in this type of protection, the Digital Fountain Raptor codes [66, 74, 111]. However, in this work they will not be considered, since we focused on tuning the protection level and the matrix size.
RFC2733 scheme and provides an improved recovery efficiency, since an RS(n, s) code allows
the channel decoder to recover up to n − s lost columns of the same matrix. The scheme is
depicted in Fig. 5.2(a). The characteristics of the code can be varied according to the desired
performance and complexity, and many works compare the performance of
different coding solutions. Here only RS codes are considered, since the investigation of the
optimal code is beyond the scope of this work.
The size of the matrix depends on the longest video packet, since its length determines the
number of rows. The columns of the shorter packets are padded with dummy symbols equal
to zero. As will be shown later, this filling strategy causes a waste of bandwidth whenever
the variance of the packet lengths is high, and implies a different channel coding rate depending
on the row.
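A minimal sketch of the zero-padded, column-wise filling with a single XOR parity column, i.e. the k = 1 case of RFC2733 (real implementations use RS codes over GF(2^q) and RTP packetization, which are omitted here; all names are illustrative):

```python
def build_matrix(packets):
    """Column-wise fill, one packet per column, zero-padding the shorter ones,
    plus one XOR parity column (RFC2733-style, k = 1)."""
    height = max(len(p) for p in packets)          # rows = longest packet
    cols = [list(p) + [0] * (height - len(p)) for p in packets]
    parity = [0] * height
    for col in cols:
        parity = [a ^ b for a, b in zip(parity, col)]
    return cols + [parity]

def recover_lost(matrix, lost_idx):
    """Recover one lost column by XOR-ing all the others, row by row."""
    height = len(matrix[0])
    rec = [0] * height
    for j, col in enumerate(matrix):
        if j != lost_idx:
            rec = [a ^ b for a, b in zip(rec, col)]
    return rec

m = build_matrix([b"\x10\x20\x30", b"\x01\x02", b"\xff"])
lost = m[1]                          # pretend packet 1 is lost
print(recover_lost(m, 1) == lost)    # → True
```

Note how the two shorter packets are padded up to the three rows imposed by the longest one: those padded cells are the bandwidth waste mentioned above.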
After computing the values of the channel code columns, the information in the matrix
is sent across the network, and the redundancy bytes are packetized in a column-wise order
according to the size of the PDU. An extra payload (12 bytes) is added to these packets in order
to make them compliant with the RTP format (see [106] for further details). At the receiver,
the same matrix is recreated by filling the cells with the incoming data. Whenever a packet is
missing, all the cells that contained its bytes are labeled as missing, and the decoder scans each
row checking the number of missing bytes. In case the number of missing bytes in the j-th
row x[j,1...n] is lower than k, it is possible to reconstruct the lost information by solving the
linear system in the Galois field GF(M)
\[
\begin{bmatrix}
\alpha_1^1 & \cdots & \alpha_{n_{err}}^1 \\
\alpha_1^2 & \cdots & \alpha_{n_{err}}^2 \\
\vdots & \ddots & \vdots \\
\alpha_1^{n_{err}} & \cdots & \alpha_{n_{err}}^{n_{err}}
\end{bmatrix}
\begin{bmatrix}
x[j, l_1] \\ x[j, l_2] \\ \vdots \\ x[j, l_{n_{err}}]
\end{bmatrix}
=
\begin{bmatrix}
-S(\alpha) \\ -S(\alpha^2) \\ \vdots \\ -S(\alpha^{n_{err}})
\end{bmatrix}
\tag{5.1}
\]
where α_i ∈ GF(M), i = 1, . . . , n_err, and l_i are the indexes of the lost bytes in the row. The
function S(a) is the syndrome computed on a from the received row x[j,1...n], where the missing
bytes are replaced by zeros.
In this approach, the matrix size plays a crucial role both in terms of allocated redundancy and
playout delay. Therefore, its dimensions must be properly optimized. To reduce the percentage
of FEC bytes sent across the network, it is possible to include more than one packet per column,
wrapping the exceeding bytes onto the following columns (see [33, 34] and Fig. 5.2(b)). In this
way the matrix height does not depend on the length of the longest packet, but can be tuned
in different ways.
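The wrapped filling without zero-padding can be sketched as follows: the packet bytes are concatenated and poured into columns of a freely chosen height, so that only the tail of the last column may need dummy bytes. Names and sizes are illustrative assumptions.

```python
def wrapped_fill(packets, height):
    """Fill columns of fixed `height` with the concatenated packet bytes,
    wrapping a packet onto the next column when it exceeds the height; only
    the tail of the last column may need dummy padding bytes."""
    stream = [b for p in packets for b in p]
    pad = (-len(stream)) % height                 # dummy bytes in last column
    stream += [0] * pad
    cols = [stream[i:i + height] for i in range(0, len(stream), height)]
    return cols, pad

cols, pad = wrapped_fill([b"\x01\x02\x03\x04\x05", b"\x06\x07"], height=3)
print(len(cols), pad)   # → 3 2  (three source columns, two padded bytes)
```

Compared with the one-packet-per-column strategy, the padding no longer grows with the variance of the packet lengths, at the cost of a lost packet now marking cells in more than one column as missing.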
The matrix has a double function. On the one hand, it packs the video source infor-
mation in an appropriate manner in order to provide all the data with the same level of protection,
limiting the number of cells padded with dummy bytes. On the other hand,
the matrix can be properly dimensioned in order to work like an interleaver, which scrambles
the packets before computing the FEC bytes, allowing a better recovery from erasures
whenever the network is affected by bursts of losses. As a drawback, we need to include more
packets, increasing the playout delay. In case of real-time applications, we may reduce the
number of matrix columns. This adaptive approach allows the coding scheme to obtain better
performance, as Figure 5.7 shows.
Note that these schemes introduce an additional delay in the decoding process since, in case
of losses, the channel decoder has to wait for the matrix to be filled before recovering the lost
information. The current frame is displayed after a time delay that depends on the size of the
matrix. Hence the need for an efficient algorithm that shapes the matrix dimensions in
order to keep the delay limited and control the included redundancy.
5.4 Adapting the matrix size to the input data
The previous section has presented an efficient approach that enables the decoder to reconstruct
the transmitted information in case of losses by including some redundant information in the trans-
mitted bit stream. However, the proposed approach performs quite differently according to the
size of the coding matrix with respect to the input data, since a wrong dimensioning may lead to
overprotecting some bytes while underprotecting others. As a consequence, the performance
of the channel coding scheme decreases because the correcting capability is weakened and the
allocated redundancy wastes the available bandwidth. Therefore, matrix dimensions must be
appropriately tuned in order to maximize the recovery performance.
In this work we considered two adaptations. The first one is based on the length of the
packets. The second one is based on the characteristics of video information included in the
corresponding packets.
5.4.1 Adapting matrix size according to the packet lengths
Packet lengths significantly affect the recovery performance in relation to the size of the
matrix. In case the matrix height is too small, the longest packets may be wrapped and in-
serted in more than one column. Whenever they are lost, two cells in the same row may be
marked as missing, decreasing the number of packets that can be recovered. On the other hand,
increasing the height of the matrix with a given number of source code columns avoids the
problems related to packet wrapping but increases the number of included video packets. As a
consequence, the recovery time after the loss of a packet is delayed, since the decoder needs
to wait for the matrix to be completely filled. Finally, the number of channel code columns must be
properly varied in order to match the channel characteristics.
In this work we adopt the code RS(255, C), with C varying according to the number of
desired channel code columns. This choice was suggested by the consideration that Reed-
Solomon codes have been widely studied and implemented in the transmission of digital video
signals, and the market offers a wide range of chipsets that efficiently perform the computa-
tion in real time. The number of source code columns s varies independently of C in order to
match the input data. On the other hand, for a given coding rate r, the value of C is chosen
according to the equation

\[
C = \lfloor s \cdot r + 0.5 \rfloor, \quad \text{which implies} \quad k = 255 - C = 255 - \lfloor s \cdot r + 0.5 \rfloor. \qquad (5.2)
\]

A first criterion we followed in order to dimension the coding matrix is based on the
length of the input video packets. Under the assumption of including more than one packet,
80 Chapter 5. Joint Source-Channel Video Coding Using H.264/AVC and FEC Codes
the matrix height is tuned on the average length of the packets in the coded stream. This
choice proves to be efficient for large matrices that include nearly one GOP, like the approaches
[33, 34] designed for mobile messaging applications (MBMS) over third generation mobile
channels.3 In this case, the target application does not impose tight constraints on the time
delay. Focusing on real-time applications, a smaller matrix is needed, since a playout delay of
one GOP is too long. This implies an accurate dimensioning, since a reduction of the matrix
may lead to a dramatic decrement of the recovery performance. The following paragraph
provides details on the adopted matrix dimensioning algorithm.
In order to constrain the decoding delay, the matrix shaping algorithm limits the number
of packets that can be included in the matrix. Then, the number of rows and columns is varied
in order to suit the characteristics of the packets that enter the matrix. Since the correcting
capability of the matrix is significantly affected by long packets, the number of rows L must
be greater than or equal to the length of the longest packet, Lmax = max_i L_i. However, in case
the lengths of the packets show a high variance, a lot of dummy bytes could be inserted in the
last source code column. Therefore, the algorithm first sets L equal to Lmax and checks
the number of non-dummy bytes present in the last column. In case the number of
non-dummy bytes is lower than half of the matrix height, the matrix height is increased by one
byte at a time until there are no more bytes in the last column. In this way, the constraint imposed by the
length Lmax is respected, and the number of dummy bytes is minimized, as the performance
of the algorithm shows with respect to its non-adapted version (see Fig. ??). Since our target
application includes low delay multimedia communications, like videophoning and streaming,
the number of packets included in a matrix is equal to the number of packets that code a single
frame (i.e. the number of slices).
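The height-tuning step above can be sketched as follows. This is a hedged reconstruction of the procedure, not the thesis code: the function name and the contiguous column-wise packing model are assumptions, while the half-full test and the one-byte-at-a-time growth follow the text.

```python
import math

def tune_matrix_height(packet_lengths):
    """Shape the matrix height as described above: start from the length
    of the longest packet and, if the last source column is less than
    half full, grow the height one byte at a time until the spill-over
    column disappears (its bytes are absorbed by the other columns)."""
    total = sum(packet_lengths)
    height = max(packet_lengths)             # constraint: L >= Lmax
    cols = math.ceil(total / height)         # source columns, packets
                                             # packed contiguously
    last_fill = total - (cols - 1) * height  # non-dummy bytes, last col.
    if last_fill < height / 2:
        while math.ceil(total / height) == cols:
            height += 1                      # one byte at a time
    return height
```

For instance, packets of lengths [100, 10] would leave only 10 useful bytes in a 100-row second column, so the height grows to 110 and a single full column remains.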
5.4.2 Adapting matrix size according to the video content
So far the discussion about matrix dimensioning has never taken into account the video content
that is carried in the video packets. Following Shannon's separation principle [110], the opti-
mization of the matrix size has been performed only according to the statistics of the packet lengths,
ignoring the characteristics of the coded video signal. However, varying the protection according
to the video content avoids allocating unnecessary redundancy and improves the overall per-
formance. This approach makes it possible to increase the channel coding rate for those parts of
the video stream that are crucial in the decoding process and reduce the additional redundancy
for those parts that are irrelevant. The following paragraphs show how it is possible to
classify the packets produced by a source coder and choose an appropriate protection level for
each of them.
As has been shown before, the significance of frame losses for a hybrid video coding
architecture is tightly connected with the significance of the frame in the motion compensation
process, e.g. the number of frames that can be correlated to it through motion compensation.
A corruption or a loss of its visual information has a major impact on the overall quality of the
reconstructed sequence until a refresh of the frame buffer is performed (by coding an Intra
frame). Several works in the literature have studied distortion propagation and have shown its
3 The work was carried out within 3GPP.
effects whenever different types of frames are lost. In the studied approach, B frames are not
used as references for motion estimation, since they usually display a lower visual quality and
motion compensation may be negatively affected. Therefore, the redundancy is varied only on
I and P frames.
On the other hand, error concealment at the decoder must be considered. Error concealment
estimates the lost information from the neighboring one [24, 22]. More specifically, the
correlation that exists among adjacent syntax elements partially allows the estimation of the lost
data, as in the case of motion vectors. However, the correlation may vary according to the
input sequence, and error concealment performs quite badly whenever the correlation among
neighboring syntax elements is low. An efficient tuning of the channel code must reduce the
protection level in case the lost information can be accurately reconstructed and increase it
whenever the coded information is hardly predictable. Hence the need for a parameter
that is able to characterize the importance of each packet in the overall decoding process.
The first parameter to be considered is the activity of the residual signal. Computing the
activity of the current frame, it is possible to understand whether the displayed picture can
be easily predicted or not from the other frames. A low activity value indicates that
motion estimation has performed quite well, and the current picture can be well represented
by partitions of the previous frames. In case the activity is high, the level of "innovation",
i.e. the amount of unpredictable visual information, rises, and it is possible to deduce that
the current frame contains some elements that cannot be efficiently motion compensated. The
occurrence of high activity values is usually related to the presence of complex (i.e. non-
translational) motion, the shooting of new objects, and scene changes. All these elements could be
crucial for motion estimation, since the intrinsic correlation that exists in a video sequence
makes them highly probable candidate references for the following motion compensations, and their loss may
significantly affect the following frames. Hence, frame activity is deeply correlated with
the relevance of the picture in the motion estimation, and it can be used to adapt the FEC code
for each frame in order to increase the correcting performance.
The propagation of the error deriving from the loss of the k-th packet can be simply mod-
elled as follows:

\[
\sigma^2_{corr}(n, k) = \sigma^2_{corr}(0, k)\, f_{ptf}(n) \qquad (5.3)
\]

where \sigma^2_{corr}(n, k) is the distortion resulting on the (k+n)-th frame at the decoder and f_{ptf}(n)
is the power transfer function that models the propagation of the distortion through the GOP
(see [28]). According to [17], the power transfer function (p.t.f.) can be well approximated by
the following equation:

\[
f_{ptf}(n) = \frac{1}{1 + \eta n} \qquad (5.4)
\]
where the parameter η characterizes how fast the prediction loop compensates the distortion
introduced by the loss. The p.t.f. describes the distortion leakage in the prediction loop (see
[17]) that is produced by spatial filtering during the encoding. Spatial filtering can either be in-
troduced by an explicit loop filter, like in H.261, or implicitly as a side effect of fractional-pel
motion compensation4 and deblocking filtering, like in H.264/AVC. Other prediction tech-
niques like overlapped block motion compensation (OBMC) may also contribute to the overall
quality increment, but they will not be considered since they are not included in the H.264/AVC
standard. Since an accurate derivation of the effect produced by each individual technique is
a hard task, the overall effect can be described by a separable average loop filter (see [17]).
In this approach, the main concern regards the propagation of the distortion due to the lost in-
formation, and therefore we will focus on the overall effects of this generalized filter at frame
level.

4 Blocks are interpolated using 6-tap and 4-tap low-pass FIR filters.
Simulations were run in order to evaluate the relation between the activity value of the video
content included in each packet and the impact of its loss on the quality of the reconstructed
sequence at the decoder. The first analysis concerns the average decrement of quality through
the whole GOP, i.e. the amount of distortion introduced in the sequence, as a function of the
activity value associated with the lost packet. Results are reported in Figures 5.3(a), 5.3(b), and
5.3(c), and show that there is a linear relation between the average relative quality loss and the
activity value itself. In a second step, we evaluated the propagation of the error through the
whole GOP. Figures 5.3(d), 5.3(e), and 5.3(f) report the dependence between the activity and the
parameter N_3dB, which represents the number of frames after which the distortion is lower than
3 dB, and is computed through the equation

\[
N_{3dB} = \frac{\sigma^2_{corr}(0, k)/2 - 1}{\eta} \qquad (5.5)
\]

derived from eq. (5.4).
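Under the model of eqs. (5.3)–(5.5), the propagation and the recovery time can be computed directly. The sketch below simply restates the formulas; the parameter values are illustrative, and the 3 dB threshold is taken as a linear power of 2, which is the reading under which eq. (5.5) follows from eq. (5.4):

```python
def ptf(n, eta):
    """Power transfer function of eq. (5.4): distortion leakage of the
    prediction loop; eta is set by the (implicit) loop filtering."""
    return 1.0 / (1.0 + eta * n)

def propagated_distortion(sigma2_0, n, eta):
    """Eq. (5.3): distortion on the (k+n)-th frame for a loss at frame k
    that produced an initial distortion sigma2_0."""
    return sigma2_0 * ptf(n, eta)

def n_3db(sigma2_0, eta):
    """Eq. (5.5): number of frames after which the propagated distortion
    falls below 3 dB (a linear power of 2):
    sigma2_0 / (1 + eta * n) = 2  ->  n = (sigma2_0 / 2 - 1) / eta."""
    return (sigma2_0 / 2.0 - 1.0) / eta
```

By construction, evaluating eq. (5.3) at n = N_3dB returns a distortion of exactly 2, which is the consistency the derivation of eq. (5.5) relies on.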
The reported results show that there is a linear relation between the activity value of a
packet and the overall quality decrement produced by its loss. This linear relation is also found
for the parameter N_3dB, proving that a high activity value identifies those frames whose loss
produces a significant distortion on the whole sequence. More precisely, Figures 5.3(d) and
5.3(f) show that the slope of the linear relation depends on the characteristics of the coded
signal, since the recovery time increases more quickly with the activity for sequences that
contain a lot of motion.
Therefore, it is possible to design an optimization algorithm that adapts the amount of
redundancy introduced in the video stream according to the significance of the packets.
While coding the video signal, the algorithm estimates the average value of the activity
avg_act_i and its variance var_act_i, i = I, P, B, for each frame type i. After H.264 has
coded the n-th picture, the matrix-based channel coder adopts an RS(K + C_n, K) code with K
depending on the lengths of the H.264 RTP packets5 and C_n equal to

\[
C_n = C + \begin{cases}
 2 & \text{if } act \in avg\_act_i + \sqrt{var\_act_i} \cdot [1, +\infty) \\
 1 & \text{if } act \in avg\_act_i + \sqrt{var\_act_i} \cdot [0.5, 1) \\
 0 & \text{if } act \in avg\_act_i + \sqrt{var\_act_i} \cdot [-0.5, 0.5) \\
-1 & \text{if } act \in avg\_act_i + \sqrt{var\_act_i} \cdot [-1, -0.5) \\
-2 & \text{otherwise}
\end{cases} \qquad (5.6)
\]
5 The number of rows and the number of source code columns are tailored according to the adaptive algorithm previously described in Section 5.4.1.
[Figure omitted: six scatter plots. Panels (a)–(c) show δE(PSNR)/E(PSNR) vs. act for foreman, mobile, and table; panels (d)–(f) show N_3dB vs. act for the same sequences.]

Figure 5.3: Experimental results for different sequences showing the relative quality loss δE(PSNR)/E(PSNR) and the parameter N_3dB vs. the activity act. Results were obtained coding the first 15 frames of each sequence (GOP IPPP and QP = 15 + 2k with k = 0, ..., 10) and enabling error concealment at the decoder.
provided that the final value of C_n is non-negative. C is the average number of redundancy
bytes for the coder, while act is the activity value of the current frame. Since the loss of a
B frame does not affect the quality of the reconstructed sequence as losses of I and P pictures do, we
decrease the final value C_n by one unit whenever the current frame is coded as B. In the
end, if C_{n-1} > C, C_n is set to C − 1 in order to reduce the overall final redundancy of
the bit stream. Fig. ?? compares this algorithm with the previous ones. The reported graphs
show an increment in the visual quality of the reconstructed sequence, since each frame is
protected depending on how important it is in the decoding process. In addition, the activity-
based optimization is able to increase the coding performance of the scheme. In fact, the
optimized RFC2733-like scheme is able to overcome the performance of the scheme optimized
only on frame size (see Fig. ??). However, the best result was obtained combining both the
optimization of the matrix size and the optimization of the adopted FEC code.
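The adaptation rule of eq. (5.6), together with the B-frame decrement and the C_{n-1} > C correction described above, can be sketched as a small helper. The function is hypothetical (not the thesis code), and the order in which the two corrections are applied is our reading of the text:

```python
import math

def channel_columns(act, avg_act, var_act, C, frame_type, prev_Cn):
    """Activity-driven FEC adaptation of eq. (5.6).  C is the average
    number of redundancy columns; avg_act and var_act are the running
    per-frame-type statistics of the activity."""
    d = (act - avg_act) / math.sqrt(var_act)  # activity in std deviations
    if d >= 1.0:
        offset = 2
    elif d >= 0.5:
        offset = 1
    elif d >= -0.5:
        offset = 0
    elif d >= -1.0:
        offset = -1
    else:
        offset = -2
    Cn = C + offset
    if frame_type == 'B':      # B frames are never used as references
        Cn -= 1
    if prev_Cn > C:            # compensate extra redundancy spent on the
        Cn = C - 1             # previous frame
    return max(Cn, 0)          # the final value must be non-negative
```

For example, a P frame whose activity sits two standard deviations above the average receives two extra redundancy columns, while a very quiet B frame can drop to no redundancy at all.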
The previous paragraph has shown that the activity value proves to be an efficient param-
eter to characterize the relevance of a frame in the decoding process. However, a parameter
that characterizes both the source and the channel coder is useful, since it allows an external
controller to allocate the available bandwidth between the two in an optimal way. Chapter 4
showed how it is possible to use the percentage of "zeros" to model the bit rate produced by
the H.264/AVC coder. On the other hand, it is possible to notice that the percentage ρ proves
to be a good substitute for the activity in the FEC scheme too. A low percentage of zeros is
related to complex texture information (and, therefore, a high activity level), since it implies
the presence of many high frequency coefficients. Moreover, quantized DCT coefficients cod-
ify the residual information of the frame, i.e. the innovation that cannot be approximated from
the previous pictures. Therefore, a high occurrence of zeros is deeply connected with a rather
simple residual signal that can be more easily estimated by a concealment algorithm. This fact
can be highlighted by relating the percentage ρ of zeros in a packet with the distortion produced
by its loss. Figures 5.4(a), 5.4(b) and 5.4(c) report the relative distortion produced in a GOP by
the loss of a packet as a function of its percentage of null quantized coefficients. It is possible
to notice that the higher the percentage of zeros, the easier the task of concealment. Re-
sults were obtained erasing one packet from the coded stream and evaluating the average PSNR
obtained using the error concealment algorithm described in [22]. In Figures 5.4(d), 5.4(e) and
5.4(f), the parameter N_3dB is reported as a function of ρ. It is possible to notice that a low
percentage of zeros is associated with a lower capacity of recovering the quality after the loss
of a packet. Therefore, it is possible to adopt the percentage ρ of zeros in place of the activity.
Since an increase in the complexity of the residual signal is characterized by a reduction of
the percentage of zeros for a given QP, increasing the recovery capability whenever the
percentage of zeros decreases enhances the probability of restoring some of the most important
information in the sequence. In this investigation, we adapt the code according to the
equation

\[
C_n = C + \begin{cases}
-2 & \text{if } \rho \in avg\_\rho_i + [0.04, +\infty) \\
-1 & \text{if } \rho \in avg\_\rho_i + [0.02, 0.04) \\
 0 & \text{if } \rho \in avg\_\rho_i + [-0.02, 0.02) \\
 1 & \text{if } \rho \in avg\_\rho_i + [-0.04, -0.02) \\
 2 & \text{otherwise.}
\end{cases} \qquad (5.7)
\]
[Figure omitted: six scatter plots. Panels (a)–(c) show δE(PSNR)/E(PSNR) vs. ρ for foreman, mobile, and table; panels (d)–(f) show N_3dB vs. ρ for the same sequences.]

Figure 5.4: Experimental results for different sequences showing the relative quality loss δE(PSNR)/E(PSNR) and the parameter N_3dB vs. the percentage ρ. Results were obtained coding the first 15 frames of each sequence (GOP IPPP and QP = 15 + 2k with k = 0, ..., 10) and enabling error concealment at the decoder.
where avg_ρ_i is the average percentage of zeros for i-type frames (i = I, P, B). Fig. ?? reports the results for the sequence foreman. The graphs show that the ρ-based algorithm has
approximately the same performance as the activity-based algorithm.
It is possible to notice that the optimization algorithm based on ρ performs better than the
activity-based one, since ρ is able to better characterize the frame. Considering the
performance of the ρ parameterization, we included this optimization in a joint source-channel
rate control strategy that makes it possible to maximize the quality of the reconstructed video sequence at
the decoder.
5.5 Joint source-channel rate control
Experimental results reported in Fig. ?? show that the adoption of a matrix-based FEC coder
is an efficient solution for protecting the data transmitted over an unreliable channel. However,
the reported plot provides evidence of the need to optimize the matrix dimensions and the
adopted code according to the input video signal, the channel characteristics, and the available
bandwidth. In fact, Fig. 5.7 shows that a blind application of the matrix-based FEC scheme
may reduce the performance of the scheme in case the matrix size and the protection level are
not adequate.
The previous results have shown that the quality of the reconstructed sequence at the
decoder is strictly dependent on the characteristics of the lost frames, i.e. the loss of a frame
with a high activity value or a low ρ value affects the decoding process more deeply, since the
error concealment task is more difficult. The adaptive algorithms of the previous section cope
with this problem by tuning the additional redundancy according to either the activity or the
percentage of "zeros". However, in a real transmission the number of redundant bits, as well
as the bit stream produced by the source coder, is constrained by the transmission capacity.
Hence, a joint source-channel rate control algorithm, which partitions the available bandwidth
between the source coder and the channel coder, is needed. In case the rate assigned to the
channel code is reduced in order to provide the source coder with more bits, the small amount
of redundant packets does not allow the recovery of the lost information, and the decoder can
rely on error concealment only. On the other hand, an excessive number of redundant
packets leads both to a waste of the transmission capacity and to a decrement of the quality of
the reconstructed sequence. In this case the assigned protection is overestimated and many FEC
packets are not used, while the H.264/AVC coder has to code the input data introducing a higher
distortion, since the available number of bits is small. This chicken-and-egg problem can be solved
only by a joint rate allocation strategy that accurately tunes both coders. Previous results have
shown that the percentage of zeros ρ can efficiently model the bit rate produced by a generic
transform-based coder. However, the previous section has also shown that the percentage ρ
is correlated with the significance of the video information in the decoding process. As a
consequence, it is possible to merge both techniques to design an effective joint strategy.
Given the target overall bit rate R_b, the frame rate F_r, and the number N of frames in a
GOP, the algorithm assigns to the i-th frame a number of bits T_i computed according to the following
equation

\[
T_i = \frac{G_{i,j}}{K_{I,P} \cdot K_{P,B} \cdot N_I + K_{P,B} \cdot N_P + N_B} \qquad (5.8)
\]

where G_{i,j} is the number of bits remaining in the j-th GOP after coding the i-th frame and
N_t, t = I, P, B, is the number of not-yet-coded t-type frames that still remain in the current
GOP. Note that equation (5.8) is similar to eq. (4.21) in Section 4.5.2. K_{t1,t2} is the complex-
ity ratio computed as described in eq. (4.38) of Section 4.5.2. However, in this case
the parameters X̄_i, i = I, P, B, are the averages of the frame complexities X_i (see eq. (4.39) in
Section 4.5.2), which are now modified in order to include the statistical information from the
channel coder. Hence, equation (4.40) remains valid, but the parameter S_i is now the sum
of the number S^S_i of bits coded by the H.264/AVC coder and the number S^C_i of bits added by
the matrix-based channel coder.
The available number of bits T_i is partitioned into two target amounts such that

\[
T_i = T^S_i + T^C_i \quad \text{with} \quad T^S_i = \frac{T_i}{1 + r} \qquad (5.9)
\]

where r is the coding rate. The target T^S_i is the number of bits available to the H.264/AVC coder
to code the current frame, while T^C_i is the number of bits that can be used to add redundancy
information to protect the stream. Given the 5 possible channel coding rates of equation (5.7),
the joint control algorithm selects the channel code rate that works best in keeping T^S_i + T^C_i
as close as possible to T_i.
Given a certain rate value r, the corresponding number of channel code columns C_n is

\[
C_n = \lfloor s \cdot r + 0.5 \rfloor. \qquad (5.10)
\]

According to eq. (5.7), it is possible to relate the difference δC = C_n − C to the difference
δρ = ρ − avg_ρ_i, which identifies an interval of possible ρ values [ρ_min, ρ_max] through the
conditions

\[
[\rho_{min}, \rho_{max}] = \begin{cases}
[avg\_\rho_i + 0.04, +\infty) & \text{if } \delta C \le -2 \\
[avg\_\rho_i + 0.02, avg\_\rho_i + 0.04) & \text{if } \delta C = -1 \\
[avg\_\rho_i - 0.02, avg\_\rho_i + 0.02) & \text{if } \delta C = 0 \\
[avg\_\rho_i - 0.04, avg\_\rho_i - 0.02) & \text{if } \delta C = 1 \\
(-\infty, avg\_\rho_i - 0.04) & \text{if } \delta C \ge 2
\end{cases} \qquad (5.11)
\]
The target value ρ_{T,i} is obtained from the equation

\[
\rho_{T,i} = \frac{T_i - q}{(1 + r)\,\mu} \qquad (5.12)
\]

where µ and q were first presented in eq. (4.2) in Section 4.2. In case ρ_{T,i} ∈ [ρ_min, ρ_max], the
target value for ρ is found, and the H.264/AVC coder has to tune its coding parameters in order
to match the percentage ρ_{T,i}. The procedure is the same as described in Section 4.5.2, and assigns
an average QP value to the current frame according to the target percentage of zeros ρ. In case
the estimated target percentage of zeros does not lie in the interval [ρ_min, ρ_max], the joint rate
control algorithm takes into consideration another redundancy ratio.
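The selection step can be sketched as a search over the five redundancy offsets: for each candidate, derive r from eq. (5.10), the target ρ from eq. (5.12), and accept the candidate whose interval of eq. (5.11) contains it. The helper below is a hypothetical sketch; the search order, the rounding-free inversion of eq. (5.10), and the fallback are assumptions not specified by the text:

```python
def select_channel_rate(T_i, s, C, avg_rho, mu, q):
    """Pick the number of channel code columns for the current frame.
    For each candidate offset of eq. (5.7), invert eq. (5.10) to get the
    coding rate r, derive the target rho from eq. (5.12), and accept the
    candidate whose rho interval of eq. (5.11) contains it.
    mu, q: rho-domain rate model parameters (eq. (4.2))."""
    intervals = {                         # eq. (5.11): delta_C -> bounds
        -2: (avg_rho + 0.04, float('inf')),
        -1: (avg_rho + 0.02, avg_rho + 0.04),
         0: (avg_rho - 0.02, avg_rho + 0.02),
         1: (avg_rho - 0.04, avg_rho - 0.02),
         2: (float('-inf'), avg_rho - 0.04),
    }
    for dC in (0, -1, 1, -2, 2):          # smallest change tried first
        Cn = max(C + dC, 0)
        r = Cn / s                        # eq. (5.10) inverted, rounding
                                          # ignored
        T_src = T_i / (1.0 + r)           # eq. (5.9): source share
        rho_t = (T_i - q) / ((1.0 + r) * mu)   # eq. (5.12)
        lo, hi = intervals[dC]
        if lo <= rho_t < hi:
            return Cn, T_src, rho_t
    return C, T_i / (1.0 + C / s), None   # no match: keep the average C
```

The returned T_src is the budget handed to the H.264/AVC coder, which then tunes its QP to hit the accepted target percentage of zeros.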
After coding the first frame, the number of channel code bits T^C_i can be remodelled accord-
ing to the actual percentage ρ obtained for the frame. In this way it is possible to
increase the protection of those parts that present some important changes in the video sequence.
Given S^S_i, the actual number of bits produced by the H.264 coder, and S^C_i, the number of
bits coded by the channel coder, the available number of bits for the current GOP is updated
through the equation

\[
G_{i+1,j} = G_{i,j} - S^S_i - S^C_i. \qquad (5.13)
\]

The parameters of the H.264 coder for a given target T^S_i are chosen according to the algo-
rithm reported in [75].
5.6 Experimental results
In order to evaluate the efficiency of the different coding solutions, we simulated the transmis-
sion of the RTP packets over a mobile IP network, selecting the 3GPP framework described
in [33] and using different channel error models. In our simulations, we varied the length
of the channel code and the error generation process in order to test the solutions under different
conditions. In case some RTP packet is still missing, the adopted decoder performs error
concealment, interpolating the lost part of the image from the neighboring pixels [22].
In our first investigation, we simulate a random loss of the transmitted RTP packets using
different types of parameter settings. For QCIF sequences, we coded different streams using
the GOP structure IPPP. In order to avoid a significant propagation of the distortion deriving
from the loss of a packet, we used GOPs of 15 frames, where the information of each frame
is carried by a fixed number of packets.6 The H.264/AVC standard defines different slice
partitioning modes (see Section 2.2 and the papers [87, 5, 6, 7]). The most popular modes
chosen to ease the error concealment task place in each slice either an equal number of
macroblocks or an equal number of bytes. The adoption of FMO algorithms allows a wide
variety of different configurations, but their adoption is still quite a novelty, since this slice
partitioning mode was not present in the previous coding standards. On the contrary, using
a fixed number either of macroblocks or of bytes has already been adopted in some
previous architectures. In our approach, each slice is made of a fixed number of MBs, so that
each packet loss corresponds to the loss of a fixed amount of visual information in a frame. In
fact, using a fixed number of bytes implies a variable number of macroblocks in a slice, which
affects the resulting distortion in case the current packet is lost. In this way, for each lost
packet the amount of corrupted visual information is the same. For QCIF sequences each slice
contains 11 macroblocks (i.e. an entire row), while for CIF sequences slices are made of 22
macroblocks.
At first, loss patterns were generated adopting an equal loss probability for each RTP
packet. In a second step, we compared the performance of different FEC coding systems on an
actual radio channel. In our simulation we considered a packet-switched transmission on an
AWGN radio channel with Eb/Nr = 4 dB. The length of the frame was 200 bytes, and the
adopted transmission scheme was a QPSK modulator with a convolutional code (rate = 1/2
and memory = 5). The measured BER is 0.2 · 10^−3.

6 Note that in H.264/AVC each slice is contained in an RTP packet.
In the following subsections, results for different algorithms and configurations are re-
ported.
5.6.1 Results with a fixed matrix
At first, we evaluated the correcting performance resulting from the adoption of a fixed matrix
structure. The performance is significantly affected by the insertion criterion and by the dimen-
sions of the matrix with respect to the average length of the included packets. In a first approach,
we evaluated the different performances obtained by filling one packet per column (padding
the remaining matrix cells with dummy 0 bytes) and by completely filling each column with
packets (except for the last column). In the first approach, which is schema-
tized by Fig. 5.2(a), the number of rows L depends on the length Lmax of the longest packet.
In the second case (see Fig. 5.2(b) for a graphical example), the number of rows is computed
according to the average length L̄ of the included RTP packets. In the reported results, the pa-
rameter L is set to 3L̄, since in this way the probability of packet wrapping, i.e. the probability
that a single packet is included in more than one column, is low enough. In fact, packet wrap-
ping could badly affect the recovery performance, since the loss of one packet could result in
the cancellation of more than one byte in a row.
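The recovery capability of a fixed matrix under independent losses can be estimated with a short Monte-Carlo sketch: with one equal-length packet per column, a matrix protected by an RS(255, 239) code is fully recoverable exactly when at most n − k = 16 of its 255 packets are lost. The simulation below is illustrative, not the simulator used for the reported figures:

```python
import random

def matrix_recovered(n_source=239, n_fec=16, p_loss=0.03, rng=random):
    """One trial: with one equal-length packet per column, every row has
    the same erasure pattern, so the matrix is fully recoverable iff at
    most n_fec of the n_source + n_fec packets are lost."""
    lost = sum(1 for _ in range(n_source + n_fec) if rng.random() < p_loss)
    return lost <= n_fec

def recovery_rate(trials=20000, seed=0, **kw):
    """Monte-Carlo estimate of the fraction of fully recoverable matrices."""
    rng = random.Random(seed)
    return sum(matrix_recovered(rng=rng, **kw) for _ in range(trials)) / trials
```

With p_loss = 0.03 the mean number of losses per matrix is about 7.7, well below the 16 correctable erasures, so almost every matrix is recovered; pushing p_loss towards 0.1 quickly erodes this margin.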
Figure 5.5 reports the average PSNR obtained for different sequences corrupted with 10
independent loss patterns. The 255 columns of the matrix include 239 columns for the source
information and 16 for the redundant packets, which are generated using an RS(255, 239) code
in both cases. The bit stream was corrupted generating different independent loss patterns with
cancellation probability 0.03. Note that the FEC-Padding solution implies a significant waste
of bandwidth, since most of the allocated redundancy is not actually used to correct errors.
In fact, the FEC-NoPadding solution is able to obtain the same recovery performance with
a significantly lower redundancy. This fact is utterly evident in Fig. 5.5(b), which reports the
experimental results for the sequence news, where many parts of the displayed image are static
or slowly moving. The lengths of the RTP packets therefore vary widely, and a lot of cells in the
FEC-Padding matrix are filled with dummy zeros. Another drawback concerns the relation
between the PSNR vs. rate plots of the FEC-Padding solution and of the solution that relies
only on error concealment at the decoder (without FEC packets). Since the sequence is mostly
static, error concealment can reconstruct most of the lost images with a small amount of
distortion, and as a result, the performance of the transmission without any additional FEC
packets is better than the performance of the FEC-Padding solution in terms of rate-distortion.
This further justifies the need for adapting the amount of FEC information included in the
stream according to the characteristics of the coded sequence.
In a second set of simulations, we tested the sensitivity of the FEC-NoPadding approach
in case the loss probability is underestimated and the amount of FEC packets included in the
bit stream may prove insufficient to recover the lost information. Results in Fig. 5.6 report the
average PSNR values obtained for different sequences varying the number of FEC columns in
the matrix. Note that the ratio between channel columns and source columns does not correspond
90 Chapter 5. Joint Source-Channel Video Coding Using H.264/AVC and FEC Codes
[Figure 5.5: plots of PSNR (dB) vs. rate (kbit/s); curves: Losses with FEC-Padding, Losses with FEC-No Padding, Losses without FEC, No Losses. Panels: (a) foreman QCIF, (b) news QCIF, (c) mobile QCIF, (d) table QCIF, (e) foreman CIF, (f) mobile CIF.]
Figure 5.5: Results for different sequences (coded at 30 frame/s with GOP IPPP and fixed QP) where the bit stream is affected by a loss probability of 0.03.
5.6. Experimental results 91
[Figure 5.6: plots of PSNR (dB) vs. FEC rate for QP = 15, 19, 23, 27, 31; (a) results with rate given by the ratio k/n, (b) results with rate given by the percentage of FEC bytes.]
Figure 5.6: Results for the sequence foreman QCIF (coded at 30 frame/s with GOP IPPP and fixed QP) where the bit stream is affected by a loss probability of 0.03 and the number of redundant columns is varied. The graphs report the performance of the FEC-NoPadding algorithm in terms of average PSNR vs. the channel code rate, which is measured both as the ratio k/n and as the percentage of FEC bytes transmitted in the stream.
to the actual ratio that results from the transmitted RTP packet stream. This is mainly due to the
additional overhead derived from the extra headers that must be added to the FEC
bytes to create new RTP packets, and to the final padding bytes, which can be relevant for the last
matrix (the amount of redundancy can be reduced by coding longer sequences). It is possible to
notice that the full-recovery point⁷ is obtained when the ratio k/n equals the loss probability.
Instead, the FEC matrix requires a higher percentage of FEC bytes to provide the transmitted
bit stream with enough FEC packets to make it robust to losses.
In the end, the influence of the number of rows with respect to the average length of
RTP packets is considered. Figure 5.7 reports the average PSNR and the average relative
loss δE(PSNR)/E(PSNR) for different configurations of the matrix. The reported experimental
results show that whenever the number of rows is too small the performance of the matrix
significantly decreases, even though the code rate (i.e. the number of channel code columns with
respect to the number of source code columns) would allow a perfect reconstruction of the lost
information. On the other hand, whenever the height of the matrix is large enough, it is possible
to adopt a code with a lower correcting capacity with respect to the channel loss; e.g., Fig. 5.7(a)
shows that with L > 4L̄ (where L is the number of rows and L̄ the average packet length) the
code rate k/n = 0.04 is enough to recover the whole sequence
from losses. The variance of the packet lengths plays a significant role too. Considering Fig-
ures 5.7(c) and 5.7(d), it is possible to notice that, for low QPs, L > 2L̄ suffices for a perfect
recovery since the lengths of the RTP packets vary less. With strong quantization,
the number of skipped macroblocks increases and the lengths of packets start varying signifi-
cantly. Therefore, it is necessary to adopt a higher scaling factor in order to reduce the effects
of wrapping and to increase the influence of interleaving.
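The wrapping effect can be quantified with a short sketch. The lengths and scaling factors below are illustrative; only the relation that a packet spans ⌈length/rows⌉ columns comes from the columnwise matrix construction described above:

```python
import math

def columns_spanned(packet_len, rows):
    """A packet written columnwise occupies ceil(len/rows) adjacent columns,
    so its loss erases that many bytes in each matrix row it crosses."""
    return math.ceil(packet_len / rows)

avg_len = 300                     # hypothetical average RTP packet length
long_pkt = 2 * avg_len            # an occasional packet twice the average
for scale in (1, 2, 4, 6):
    rows = scale * avg_len        # matrix height as a multiple of avg_len
    print(f"L = {scale}*avg -> losing the long packet erases "
          f"{columns_spanned(long_pkt, rows)} byte(s) per row")
```

Since an RS code along the rows corrects at most as many erasures per row as it has redundancy symbols, a matrix that is too short lets a single packet loss exceed the correcting capacity regardless of the code rate.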
The configurations that have been considered so far imply filling the matrix with a great
⁷ Here, the full-recovery point is the point where all the information is recovered by the FEC scheme and the sequence reconstructed at the decoder equals the one reconstructed at the coder.
[Figure 5.7: contour plots of δE(PSNR)/E(PSNR) over the code rate (x-axis) and num. rows / avg. len. packets (y-axis), with the "No losses" region marked: (a) foreman QP=20, (b) foreman QP=30, (c) news QP=20, (d) news QP=30.]
Figure 5.7: Results of FEC-NoPadding with different rows and columns (loss probability 0.03) for the sequences foreman and news QCIF (coded at 30 frame/s with GOP IPPP and fixed QP). The performance is evaluated reporting a contour plot of the relative quality loss δE(PSNR)/E(PSNR). The number of rows is characterized with respect to the average length of the RTP packets by a scaling factor, and the code rate is given by the ratio k/n.
number of packets. Although this setting proves to be quite efficient for non-interactive video
transmission, there are significant drawbacks regarding videophoning applications. As
mentioned in Section 5.3, the recovery of a lost packet is possible only after the complete filling
of the matrix. This introduces variable jitter in displaying the reconstructed images, which
can be compensated by appropriately delaying the playout of the sequence. Since such a delay
cannot be tolerated in the interaction of two remote users, we need to modify the parameter
setting reducing the matrix dimensions. However, smaller matrices imply a decrease in the
efficiency of the coding scheme, and therefore, it is necessary to set the number of rows, source
columns and channel columns in an appropriate way. In the following section, adaptive methods
will be tested.
5.6.2 Results with an adaptive matrix
The previous section has shown how the matrix size is deeply correlated with the performance of
the FEC scheme. In this section, we present experimental results related to the adaptive al-
gorithms reported in Section 5.4. Figure 5.8 reports the average PSNR vs. the produced bit
rate for three adaptive solutions and their non-adaptive counterpart. The first adaptive solution
(referenced with the label FEC-Adaptive) tailors the matrix size according to the packet length
(see Section 5.4.1). In this way, no critical packet wrapping is allowed, i.e. the loss of a single
packet corresponds to the cancellation of one byte for some matrix rows. The second adaptive
solution improves the performance of the previous one by increasing the number of redundancy
columns according to the average activity value of the visual information included in the ma-
trix, as Section 5.4.2 reports. In this way, the algorithm identifies the significant frames in the
decoding process and protects them with additional FEC packets, while it reduces the redundant
information for those frames that can be easily interpolated from the neighboring ones. The
third adaptive algorithm uses the percentage of zeros, instead of the activity, to increase the amount
of additional redundancy. This behavior allows an accurate control over the bit rate, as will
be shown later. Experimental results (see Fig. 5.9) show that the ρ-adaptive approach proves
to be significantly better in terms of rate-distortion, since it saves FEC bytes from frames that
can be easily estimated by error concealment to improve the protection level of critical frames
that cannot be easily estimated. Note that the allocated redundant bytes result higher for those
sequences that present rapid movements and a high activity level.
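The ρ-driven allocation can be sketched as follows; the linear mapping and the base and maximum column counts are invented for illustration and are not the values used in the thesis:

```python
def fec_columns(rho, base_cols=1, max_extra=3):
    """Map the fraction rho of null quantized DCT coefficients in a frame to
    a number of channel code columns: mostly-zero frames are easy to conceal
    and get the minimum protection, busy frames get extra columns."""
    extra = round(max_extra * (1.0 - rho))
    return base_cols + extra

for rho in (0.95, 0.80, 0.50, 0.20):
    print(f"rho = {rho:.2f} -> {fec_columns(rho)} channel column(s)")
```

The key property is monotonicity: frames with a low fraction of null coefficients (rapid movement, high activity) receive more redundancy, which matches the behavior observed in the experiments.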
In the following, we tested the joint source-channel rate control algorithm that is reported
in Section 5.5.
5.6.3 Results with a joint source-channel rate control algorithm
The final set of simulations concerns the results obtained with the rate control algorithm de-
scribed in Section 5.5. The simulation benchmark was the same described above, and the loss
patterns are generated from the simulation of an AWGN radio channel.
It can be appreciated that the ρ-adaptive algorithm is able to change the coding rate ac-
cording to the signal characteristics, providing a higher quality with respect to the fixed rate
(see Tables 5.1 and 5.2). In fact, the algorithm is able to partition the available bandwidth in
an appropriate manner, increasing the code rate whenever the input signal is not correlated with
the previous data.
5.7 Summary
The chapter presented an effective joint source-channel coding scheme for video transmission
over RTP channels, which is based on cross-packet matrix-based FEC coding. The RTP packets
produced by the H.264/AVC video coder are included into a matrix columnwise, and redun-
dant data are generated by applying a Reed-Solomon code along the matrix rows. The additional
[Figure 5.8: plots of PSNR (dB) vs. rate (kbit/s); curves: Losses with FEC-Fixed, Losses with FEC-Adaptive. Panels: (a) foreman QCIF, (b) news QCIF, (c) mobile QCIF, (d) table QCIF, (e) foreman CIF, (f) mobile CIF.]
Figure 5.8: Comparison between the length-adaptive method presented in Section 5.4.1 and the fixed method for different sequences (coded at 30 frame/s with GOP IPPP and fixed QP) where the bit stream is affected by a loss probability of 0.06. The code is RS(10, 9) for QCIF sequences and RS(20, 18) for CIF sequences.
[Figure 5.9: plots of PSNR (dB) vs. rate (kbit/s); curves: Losses with FEC-ρ, Losses with FEC-Act, Losses with FEC-Adaptive. Panels: (a) foreman QCIF, (b) news QCIF, (c) mobile QCIF, (d) table QCIF, (e) foreman CIF, (f) mobile CIF.]
Figure 5.9: Comparison between adaptive methods presented in Section 5.4 for different sequences (coded at 30 frame/s with GOP IPPP and fixed QP) where the bit stream is affected by a loss probability of 0.06. The code is RS(10, 9) for QCIF sequences and RS(20, 18) for CIF sequences.
Target Bit Rate (kbit/s) | Actual Bit Rate (kbit/s) | Effective Channel Code Rate (r/s) | Lost RTP packets (%) | Final Lost RTP packets (%) | Average PSNR (dB)
356 | 355.67/357.29 | 0.40/0.57 | 13.52/13.68 | 5.93/2.62 | 29.19/32.00
400 | 400.44/400.24 | 0.60/0.57 | 13.51/13.57 | 3.07/2.48 | 28.87/32.64
450 | 471.35/450.19 | 0.41/0.58 | 13.44/13.95 | 6.88/3.36 | 29.04/33.12
500 | 511.05/507.89 | 0.54/0.56 | 14.34/14.19 | 3.48/3.73 | 29.60/33.41
550 | 552.26/562.14 | 0.62/0.57 | 14.29/14.76 | 3.33/4.85 | 28.98/32.22

Table 5.1: Comparison between ρ-adaptive and fixed rate control methods for the sequence news. The values on the left report the results for the fixed channel rate method. The values on the right report the results for the ρ-adaptive joint source-channel rate control. The default channel code rate is r/s = 0.22.
Target Bit Rate (kbit/s) | Actual Bit Rate (kbit/s) | Effective Channel Code Rate (r/s) | Lost RTP packets (%) | Final Lost RTP packets (%) | Average PSNR (dB)
356 | 359.03/360.94 | 0.40/0.57 | 12.91/13.10 | 7.04/0.34 | 22.29/28.44
400 | 404.24/404.54 | 0.60/0.77 | 13.59/13.09 | 5.11/0.45 | 25.56/39.22
450 | 453.07/453.92 | 0.61/0.77 | 13.51/13.19 | 3.07/0.29 | 27.05/30.51
500 | 502.13/503.57 | 0.62/0.77 | 13.59/13.19 | 2.48/0.42 | 22.24/32.04
550 | 551.93/553.35 | 0.62/0.78 | 13.77/13.43 | 2.70/0.50 | 27.20/30.78

Table 5.2: Comparison between ρ-adaptive and fixed rate control methods for the sequence foreman. The values on the left report the results for the fixed channel rate method. The values on the right report the results for the ρ-adaptive joint source-channel rate control. The default channel code rate is r/s = 0.44.
information is then packed and transmitted across the channel together with the video source
packets. In case some video RTP packets are lost and the decoder receives enough redundant
packets, it is possible to recover the missing information. However, the proposed scheme ob-
tains different performance according to the size of the matrix and the protection level applied
to each frame. Experimental results show that the matrix dimensions must suit both the lengths
of the packets and the video content they carry. The chapter proposes an optimization algorithm
that either increases or reduces the number of channel code columns in the matrix according
to the percentage of null quantized DCT coefficients in the coded information. At the same time,
the height of the matrix is adjusted according to the length of the longest packet and the overall
number of coded bytes. These optimizations can be included in a joint source-channel coding
rate control that partitions the available bandwidth between the source coder and the channel
coder in order to maximize the quality of the reconstructed sequence at the decoder. Exper-
imental results show a significant improvement in terms of visual quality for a given bit rate
with respect to the non-adaptive counterpart.
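The packing-and-recovery cycle summarized above can be sketched in a few lines. For brevity, a single XOR parity column stands in for the Reed-Solomon code along the rows (it has the single-erasure correcting power of an RS(n, n-1) code); the packet contents and the matrix height are invented:

```python
def xor_bytes(columns):
    """Byte-wise XOR across equal-length columns."""
    out = bytearray(len(columns[0]))
    for col in columns:
        for i, b in enumerate(col):
            out[i] ^= b
    return bytes(out)

ROWS = 8                                          # matrix height in bytes
packets = [b"frame-A", b"fB", b"frameCCC", b"p4"]
cols = [p.ljust(ROWS, b"\x00") for p in packets]  # columnwise, zero-padded
fec = xor_bytes(cols)                             # the redundancy column

lost = 2                                          # simulate one lost RTP packet
received = [c for i, c in enumerate(cols) if i != lost]
recovered = xor_bytes(received + [fec])           # XOR of survivors + parity
assert recovered == cols[lost]
print("lost packet recovered from the parity column")
```

An actual RS(n, k) row code generalizes this to n - k redundancy columns and up to n - k lost packets per matrix, which is what makes the code-rate/loss-probability trade-off of Figs. 5.6 and 5.7 possible.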
Chapter 6
Achieving H.264-like compression efficiency with Distributed Video Coding
“I shall try to correct errors when shown to be errors,
and I shall adopt new views so fast as they shall appear to be true views”
Abraham Lincoln
Previous chapters have discussed different source and channel coding methods focused on traditional hybrid video coders. In this chapter, a new type of video coding architecture, which allows a robust transmission of coded images, is presented. This scheme can be included in the emerging class of Distributed Source Coding (DSC) based video coders. Although these coders enable low-complexity encoding, they have been unable to reach a compression efficiency comparable with that of motion-compensated predictive coding based video codecs, such as H.264/AVC, due to insufficient accuracy in video data modeling. The DSC-based approach described in this chapter is intended to achieve H.264-like compression efficiency. The success of H.264/AVC highlights the importance of accurately modeling highly non-stationary video data through fine-granularity motion estimation. This motivates us to deviate from the popular method of approaching the Wyner-Ziv bound with sophisticated capacity-achieving channel codes, which require long block lengths and high decoding complexity, and instead focus on the investigation of efficient models for video data. Such a DSC-based, compression-centric encoder is an important step towards building a robust DSC-based video coding framework.
6.1 Introduction
A recent innovation in the communication world is the massive introduction of multimedia
services over wireless networks, which was mainly inspired by the aim of providing video and
audio applications almost anywhere and anytime [10]. More and more Internet and mobile
communication providers offer a wide variety of multimedia-related services that span from
video communication to the fruition of video-on-demand contents on mobile devices. This
accomplishment was possible thanks to the recent development of mobile communication and
the technological advances in the digital coding of multimedia data. However, the appearance of
heterogeneous network scenarios, characterized by the interconnection of different types of
networks and devices, and the massive spreading of mobile communications, affected by
a higher percentage of losses and errors with respect to traditional wired communications,
have modified the needs and the guidelines followed in the design of compression algorithms. As
a matter of fact, the capability of providing reliable video communication in a heterogeneous
scenario is the most relevant issue in the widespread diffusion of multimedia mobile
services, and the recent literature reports a wide number of different proposals that try to cope
with the problems of transmitting a video sequence across a network affected by losses (see
Chapter 5).
As it was anticipated in Chapter 1, the requirements of coding algorithms for wireless video
communications can be summarized into three main topics:
• low-power and complexity at the mobile/sensor video encoding unit;
• high compression efficiency due to both bandwidth and transmission power constraints;
• robustness to packet/frame drops caused by wireless channel impairments.
Current video codecs fail to deliver on all these demands since most of them are based on
temporal prediction. Although this achieves high coding gains, it proves inefficient whenever
some of the information is lost. In this case, the state of the encoder cannot be recovered until
it is refreshed (i.e. the encoder codes a frame without any temporal prediction, called an Intra
frame). Unfortunately, the presence of frequent Intra refreshes leads to a waste of the available
bandwidth, since the amount of bits produced by Intra coding is much higher than that produced
by temporal prediction coding.
Moreover, we must mention that Motion Estimation (see Section 2.2.2) is a computation-
ally demanding task that has to be run at the encoder. Since in mobile communications the
hardware resources of communicating devices are quite heterogeneous and quite often the
transmitting device has a low computational capacity, it is convenient to choose coding schemes
that require limited computational power and complexity from the terminal devices. A possible
solution is to adopt two different coding architectures for the uplink transmission and for the
downlink transmission. In the uplink communication (from the transmitter to the network),
the encoding paradigm must require a low complexity at the encoder, shifting the computa-
tional load to the decoder. In the downlink transmission, the coding scheme must keep the
computationally-demanding tasks at the encoder side, demanding of the terminals only the imple-
mentation of a light decoder (such as decoders for traditional hybrid video coding standards).
In this way, the encoding/decoding load is mainly sustained by network hardware, which has
to transcode the uplink bit stream coming from the transmitter into a new bit stream which is
compliant with the video coding standard adopted in the downlink transmission.
During the last years, novel coding solutions that cope efficiently with these problems have
been found, and most of them are based on the Distributed Source Coding (DSC) theory.
One of them is the PRISM coder, which aims at satisfying all of the previous requirements
by implementing “a modified side-information paradigm where there is inherent uncertainty
in the state of nature characterizing the side information” [93]. This coding architecture char-
acterizes the side information as a class of possible predictor values, and in the reconstruction
of the transmitted information the decoder can use any of them provided that it is available,
i.e. it is received correctly. However, it is also possible to adopt the PRISM coding paradigm
even when there are no information losses or corruption (such as in video storage applications).
In this case, the side information can be identified by a Motion Vector (see Section 2.2.2).
One weak point of these solutions is the compression ratio, since DSC coding solutions
show a lower coding gain with respect to their hybrid counterparts. Therefore, the investigation
of effective entropy coding algorithms is a key element in improving the efficiency of this
coding solution.
6.2 Distributed Video Coding
The theory of Distributed Source Coding dates back to two major theoretical results: the
Slepian-Wolf (1973) and Wyner-Ziv (1976) theorems [113, 130]. However, although the the-
oretical basis was defined in the 70's, only in the last years have DSC-based applications
for video transmission appeared. This novel coding paradigm relies on the cod-
ing of two or more dependent random sequences in an independent way, i.e. associating a
separate independent encoder to each of them. In this context, the term “distributed” refers
to the encoding operation mode and not to its location. An independent bit stream is sent
from each encoder to a single decoder, which performs a joint decoding of all the received bit
streams exploiting the statistical dependencies between them. Being aware of this, the different
encoders can take advantage of the mutual correlation between source sequences to reduce the
overall bit stream size. Assuming that two sources X and Y have to be transmitted with rates
RX (RX ≥ H(X)) and RY (RY ≥ H(Y)) respectively, the statistical relation between X
and Y allows a sensible reduction of the coded bit stream, since the lower bounds of the coded bit
rates decrease (RX ≥ H(X|Y) with RY ≥ H(Y), or RY ≥ H(Y|X) with RX ≥ H(X))
[113]. While Distributed Source Coding can still achieve the compression gains allowed by
joint source coding (R = RX + RY ≥ H(X,Y)), Wyner-Ziv coding [130] focuses on the
rate point (RY = H(Y), RX = H(X|Y)), assuming that the source Y is fully encoded and
transmitted to the decoder while the source X is coded taking into consideration the existing
correlation. Although the encoder does not know the other source, the decoder can use Y to
decode X.
Since a detailed description of Distributed Source Coding is beyond the scope of this work,
further information can be found in [113, 130, 40].
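The rate saving can be made concrete with a toy joint distribution; the probabilities below are invented purely to show the entropy bookkeeping:

```python
import math

# Hypothetical joint pmf of two correlated binary sources X and Y.
p = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def H(dist):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

px = {x: sum(q for (a, _), q in p.items() if a == x) for x in (0, 1)}
py = {y: sum(q for (_, b), q in p.items() if b == y) for y in (0, 1)}

Hxy = H(p)              # joint entropy H(X,Y)
Hx, Hy = H(px), H(py)   # marginal entropies (both 1 bit here)
Hx_given_y = Hxy - Hy   # chain rule: H(X|Y) = H(X,Y) - H(Y)

print(f"separate coding : RX + RY >= {Hx + Hy:.3f} bits")
print(f"Wyner-Ziv point : RY = {Hy:.3f}, RX >= H(X|Y) = {Hx_given_y:.3f} bits")
```

The stronger the correlation, the smaller H(X|Y) becomes with respect to H(X), and the larger the fraction of the rate for X that can be saved at the (RY = H(Y), RX = H(X|Y)) corner point.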
Based on this independent-encoding/joint-decoding configuration, a new video coding pa-
radigm, called Distributed Video Coding (DVC), has emerged. In this case, the statistical de-
pendence that is exploited is the correlation among temporally-close frames, which many video
coding standards have already used in Motion Compensation (see Section 2.2.2). However, this
encoding technique allows the decoding of the current frame without using a specific reference
as Motion Compensation requires. Any frame that suits the correlation characteristics that were
used to code the current frame is good enough to allow an error-free decoding of the transmitted
video data, and therefore, the search for a suitable reference has to be performed at the de-
coder through a Motion Estimation algorithm. Note that in this case both the requirements
of robustness to errors and low complexity at the encoder are met. Since the decoder can use
any sufficiently-correlated reference, in presence of errors the loss of part of the information
does not preclude a correct decoding as long as a suitable reference can be found in the frame
buffer. In addition, Motion Estimation, which is one of the most computationally-expensive
tasks, is performed at the decoder (i.e. at the network side), reducing the hardware requirements
of the encoder.
Different DSC-oriented coding schemes have been presented during the last years.
In 2002, Jagmohan, Sehgal and Ahuja [48] used coset codes for predictive encoding in or-
der to reduce the consequences of the predictive mismatch without a large increment in terms of
bit rate. In the same year, Aaron, Zhang and Girod [2] showed results on video coding using
an Intra-encoding/Inter-decoding scheme through a Turbo decoding scheme. In 2002, an ap-
proach known as PRISM (Power-efficient, Robust, hIgh-compression, Syndrome-based
Multimedia coding) was proposed by Puri and Ramchandran [93] for multimedia transmissions
on wireless networks using syndromes. The major goal of this solution is to join the traditional
intraframe coding error robustness with the traditional interframe compression efficiency.
In 2003, Zhu, Aaron and Girod proposed an approach to Wyner-Ziv based low-
complexity coding that aims at compressing video signals for large camera arrays [135]. In
this solution, multiple correlated views of a scene are independently encoded with a pixel-
domain Wyner-Ziv coder but are jointly decoded at a central node. The same article shows
a comparison between the pixel-domain Wyner-Ziv coder and an independent encoding and de-
coding of each view employing the JPEG-2000 wavelet image coding standard. The results
demonstrate that at lower bit rates the solution presented by Zhu et al. achieves higher PSNR
than JPEG-2000 with a lower encoder complexity. For more details, the reader should consult
[135]. In 2004, Aaron, Rane, Setton and Girod [1] proposed an architecture similar to the one
in [2]; the key difference with respect to [2] is the additional use of transform coding (DCT
transform) at the encoder. The results obtained show that the new coding solution leads to a
better coding efficiency when compared with the solution in [2] (at the cost of a higher encoder
complexity associated with the DCT transform).
In the same year, a further Wyner-Ziv low-complexity video coding solution by
Aaron, Rane and Girod was proposed in [131]. This solution is based on an Intra-encoding/Inter-
decoding system, and in addition to the bit stream resulting from the current frame encoding
process, the encoder also transmits supplementary information about the current frame to help
the decoder in the motion estimation task. In 2004, Rane, Aaron and Girod presented
another approach [115] aimed at making a traditionally encoded bit stream more error-resilient
when it is transmitted over an error-prone channel with no protection against channel transmis-
sion errors, for example by means of channel coding.
In this scenario, a common element links all these strategies [109, 26, 132]. All these works
utilize capacity-achieving channel codes to approach the Wyner-Ziv bound. These solutions
require both a high decoding complexity and a long block length, which must be applied to a
very large area of the video frame¹. This conflicts with the highly non-stationary nature of
video data.
¹ Typically, bit plane encoding is over an entire frame.
In [70], it was shown that a distributed video coding approach has the potential of achieving
high compression efficiency by modeling video data with motion search and without sophisti-
cated channel codes. In light of the success of H.264, in this work we design and implement a
DSC-based video coder that adopts some key primitives underlying the H.264 standard, such as
a more sophisticated motion search (and thus a more accurate correlation estimation) and an in-loop
deblocking filter [87]. Since the quantized encoding unit itself, instead of the DFD, is
used to generate the encoded bit stream, a new arithmetic coder is designed and implemented
to suit the new statistics of the encoding coefficients. This approach allows us to achieve H.264-
like compression efficiency without having to use sophisticated channel codes that entail a high
decoding complexity.
The benefits of adopting such a baseline compression-centric distributed video coder are
two-fold. Firstly, the architecture can be efficiently extended to a system that is robust to
channel losses. The DSC-based encoder sends information about the source, the amount of
which depends on the statistical correlation between the source and the side information (predic-
tor). When channel loss alters this statistical correlation, only the amount of source information
needed to successfully decode changes. Therefore, an incremental amount of source informa-
tion can easily be sent to ensure successful decoding when channel noise weakens the corre-
lation between the source and the predictor. In an MC-based² system, on the other hand, the
compressed data (residual signal) depend on both the source and the predictor deterministi-
cally. Therefore, a channel loss that alters the reconstructed predictor will require coding for
the unpredicted residual signal. Secondly, for both the MC-based and DSC-based systems, there
is a complexity-performance tradeoff. When encoding complexity becomes a constraint, the
lower the encoder complexity has to be, the lower the compression efficiency is. However, for a
DSC-based system with lowered compression efficiency, we obtain a bit stream with increased
robustness [93, 70].
In this work, we focus on a compression-centric DSC-based video coder that is an impor-
tant building block for a video coding system robust to channel losses.
6.3 A simple example of coding with side information
Previous sections have presented a bird's-eye view of different DVC-based video coders that
take advantage of the Wyner-Ziv theorem to independently code different sources (i.e. frames)
assuming that a certain correlation structure exists among them. Each frame is coded under the
assumption that the decoder has in its buffer another frame that is correlated enough with the
current one. Therefore, the coder has to specify only the non-correlated information in order
to permit a correct decoding. To see how we can achieve this, it is instructive to examine the
following example, which was first presented in [92]. Let X and Y be two correlated pieces of
information that are generated by two separate sources (or related to two different frames in a
DVC setting) and are to be transmitted to a common receiver. Assuming that Y has already
been sent, the information X can be efficiently transmitted considering its correlation with Y.
In this way, the redundancy existing between different parts of the overall information is reduced
² It indicates Motion Compensation based coders.
[Figure 6.1: block diagrams of the two scenarios: (a) predictive coding (Y available both at the encoder and at the decoder), where the encoder sends the difference D; (b) Wyner-Ziv coding (Y only available at the decoder), where the encoder sends the syndrome S.]
Figure 6.1: The two different scenarios considered in the example. Side information can be available to the encoder or not.
leading to a smaller bit stream size. From these premises, two different coding settings can be
derived. The first setting (see Fig. 6.1(a)) assumes that Y is known both at the encoder and
at the decoder. Therefore, it is possible to transmit X by coding its difference with Y. In
the second setting (see Fig. 6.1(b)), the information Y is known only at the decoder, but the
encoder knows the characteristics of the correlation existing between X and Y. This can be
used to reduce the coded bit stream, provided that the information is coded in such a way that
the decoder can recover it using the available information Y, which is called side information.
In the following paragraph, a simple example is provided.
Let X and Y be two binary vectors belonging to the space {0, 1}³, which includes 2³ = 8 different values. However, X and Y are correlated in such a way that the Hamming distance dH between them is at most 1. For example, given Y = [1 1 0], X can assume the values {[1 1 0], [0 1 0], [1 0 0], [1 1 1]}. In the first scenario, both encoder and decoder know the value of Y, and therefore the encoder only needs to transmit the difference³ D = X ⊕ Y. The difference D can assume 4 different values, and therefore it can be coded with 2 bits, assuming that its 4 values are equally probable. The decoder can combine the transmitted difference D with the symbol Y to reconstruct X = Y ⊕ D. We can relate this example to traditional hybrid video coders by considering the values X and Y as pixels or blocks belonging to different frames which are temporally correlated. In this way, the considered example is analogous to the predictive coding paradigm reported in Section 2.2.2.
Figure 6.2: Example of Wyner-Ziv decoding using sources in {0, 1}³ with Hamming distance dH ≤ 1: (a) the cosets adopted in the example; (b) example of decoding.
In the second scenario, it is possible to partition the space {0, 1}³ into 4 separate sets where the Hamming distance between every pair of values is greater than two. Although the encoder is completely unaware of the value assumed by the side information Y, it can transmit to the
³ The symbol ⊕ denotes a bitwise XOR operation.
decoder the set in which the information X lies. Then the decoder can choose, among the possible values included in the signalled set, the one that is closest to the side information Y. Since the distance between values in the same set is greater than two, there is only one value at minimum distance. Note that the encoder does not need to know the value of Y, but only the maximum Hamming distance that could exist between the two sources. It is also worth mentioning that the decoding of X can be done correctly even if the value of Y is different. As an example, let X be the string [0 1 0] with Y = [1 1 0] and the space {0, 1}³ partitioned as Fig. 6.2(a) depicts. In this case, the encoder signals to the decoder that the value of X belongs to the 3rd set, and given Y the decoder can reconstruct X as Fig. 6.2(b) depicts. However, a correct reconstruction is possible even if Y assumes the values {[0 1 0], [0 1 1], [0 0 0]}, since the decoding is not strictly dependent on a specific reference, as it is in the case of predictive coding. Since the number of separate partitions is 4, we need only two bits to code the set where X lies, assuming that all the partitions are equally probable. In this case, the amount of transmitted information is exactly the same as in the previous case.
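The second scenario can be sketched in a few lines, using the repetition code {000, 111} as the channel code C (function names are illustrative):

```python
CODE = [0b000, 0b111]                 # channel code C: two words at Hamming distance 3

def hamming(a, b):
    return bin(a ^ b).count("1")

# Partition {0,1}^3 into 4 cosets C + e, indexed by the 2-bit syndrome.
COSETS = {e: [c ^ e for c in CODE] for e in (0b000, 0b001, 0b010, 0b100)}

def encode(x):
    # transmit only the index of the coset containing x (2 bits instead of 3)
    return next(s for s, members in COSETS.items() if x in members)

def decode(s, y):
    # pick the member of coset s closest to the side information y
    return min(COSETS[s], key=lambda v: hamming(v, y))

X, Y = 0b010, 0b110                   # d_H(X, Y) = 1, as the correlation model requires
assert decode(encode(X), Y) == X
```

Because the two members of every coset are at distance 3, any Y within distance 1 of X uniquely selects X, matching the argument above.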
For the second scenario just described, different analogies can be derived. One of the most popular ones concerns channel coding theory and associates each partition with a coset Ci of codewords obtained by perturbing the binary words of a maximum-distance channel code C with a specific error vector e such that⁴ wH(e) ≤ 1 (see [68]). The error is associated with the difference existing between separate sources, while the coset index i can be associated with a set of possible correct codewords in the space {0, 1}³. Following the same conventions used for channel codes, the coset index will be called syndrome throughout the rest of this chapter. Since the channel decoding process can be seen as a vector quantization in the domain {0, 1}³, in Section 6.6 the term syndrome will be extended to identify a specific sub-lattice structure, related to a quantization characteristic, that allows a correct reconstruction of the transmitted value by quantizing the side information. In the rest of the chapter, this analogy will be used many times, using the word syndrome to characterize the transmitted information and the word coset in relation to quantizers with shifted characteristics.
The previous paragraphs have shown that the coding efficiency of the second scenario is comparable with that of predictive coding. This is possible whenever the variance of the prediction error (the Hamming distance in the example) is comparable with the distance of codewords within the same coset. The amount of information that needs to be transmitted (i.e. the syndrome) strictly depends on the volume of the quotient group inferred from the code C on the space {0, 1}³, and therefore the scheme proves to be efficient under the assumption that the code C accurately suits the characteristics of the correlation between X and Y. To this purpose, different approaches were proposed to tailor the shape of the quotient group in order to match the correlation between different sources.⁵
The following paragraphs will show how these principles can be applied to the robust transmission of video contents.
⁴ The symbol wH(e) denotes the Hamming weight of vector e.
⁵ Some of them resort to trellis coded quantization, allowing the adoption of quotient groups with spherical geometry instead of square geometry (see [77]). Others adopt simple probabilistic models for the sake of simplicity (see [52]).
6.4 A quick glance at the original PRISM architecture
The starting point of this investigation is the PRISM coder [93, 95, 94], a DVC coding architecture that tries to "marry the high compression efficiency of the predictive-coding mode with the robustness and the low encoding complexity features of the intra coding mode" [93]. The coding paradigm described in the previous section satisfies these requirements: indeed, it proves to be robust to channel losses, since the transmitted information can be correctly decoded once there is some side information at the decoder that is correlated enough with the transmitted data. In the following, it will be shown how these principles can be applied to video.
As mentioned in Section 2.2.2, video sequences present a strong correlation between pixels belonging to temporally-close frames. This fact is widely used in traditional hybrid video coding architectures to achieve high compression gains, and it can be used in a DSC approach too. In the case of video signals, it is possible to associate the information X of the previous example with the current block to be coded, and its side information Y with another block that belongs to a different, temporally-close frame which has already been correctly decoded. The correlated block Y can be estimated by performing a block-matching motion estimation (BMME) algorithm, and the result of this search permits decomposing the original block into the superposition of some correlated data and some innovation (see Fig. 6.3). Depending on which of the two coding paradigms presented in Section 6.3 is adopted, this search can be done in different places. Since the predictive solution (Fig. 6.1(a)) implies that the reference block is available to both coder and decoder, BMME is performed at the encoder, and the coordinates of the estimated block Y have to be transmitted together with the residual difference in order to allow a precise reconstruction of the signal. A correct decoding is possible only if coder and decoder are perfectly aligned, i.e. the reference frames are the same and the coded information arrives without losses. In the Wyner-Ziv solution, the encoder does not need to perform a BMME in order to find a correlated block in the previous frames, but it requires knowledge of the correlation existing between the current block X and its prediction Y.
Figure 6.3: A pictorial representation of innovation and correlated info for blocks.

Figure 6.4: CRC coding mask (the mask partitions the bits of the coefficient array into CRC bits and Intra bits).
Starting from these premises, in 2002 R. Puri and K. Ramchandran designed PRISM [93],
a robust DSC-based video coder that relies on this second solution. The PRISM architecture
processes the video signal in the transform domain. For each block of quantized transform coefficients (appropriately shifted in order to have only positive values), the coder estimates which part of the information can be correlated with blocks in the previous frames and which cannot. Typically, the correlated information consists of the most significant bits of the binary representation of each coefficient, and their estimation is performed using a classifier [93] which, according to the mean square error between the current transformed block and its counterpart in the previous frame (i.e. the block placed at the same coordinates), identifies a bit mask. The bit mask selects those bits that could be correlated with a possible prediction block, and the coder computes a 16-bit CRC on them (see Fig. 6.4). The remaining bits are intra coded, i.e. they are coded independently of other blocks, and they will be called syndromes. The CRC and the bit mask (together with the intra-coded bits) are sent to the decoder, which looks for a block Y with the same CRC given the received mask. This block will be the side information used in the decoding process. From this point, the decoding process is the same adopted by the coder described in Section 6.5: the least-significant intra-coded bits (syndromes) are used to identify the correct coefficients given the estimated predictor block Y. Since this chapter is not intended to give an exhaustive description of the PRISM architecture, further details can be found in the papers [93, 94, 95].
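The decoder-side search just described can be sketched as follows. This is a simplification of PRISM's actual classifier/CRC machinery: here the low 16 bits of CRC-32 stand in for PRISM's 16-bit CRC, and all names are illustrative.

```python
import zlib

def crc16(data: bytes) -> int:
    # stand-in for PRISM's 16-bit CRC: the low 16 bits of CRC-32
    return zlib.crc32(data) & 0xFFFF

def find_side_info(candidates, mask, crc_target):
    # Try each candidate block produced by the decoder-side motion search:
    # keep only the bits selected by the mask and compare their CRC with
    # the transmitted one.
    for block in candidates:
        masked = bytes(b & m for b, m in zip(block, mask))
        if crc16(masked) == crc_target:
            return block
    return None  # no correlated block found: decoding fails

# toy usage: the true block is recovered from the CRC of its masked bits
true_block = bytes([200, 150, 100])
mask = bytes([0xF0] * 3)   # "correlated" bits: the 4 MSBs of each sample
target = crc16(bytes(b & m for b, m in zip(true_block, mask)))
assert find_side_info([true_block], mask, target) == true_block
```

In the real coder the candidate list comes from the motion search window, and the matching block then drives syndrome decoding.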
Section 6.3 has presented a simple example where the Wyner-Ziv coder obtains the same compression efficiency as its predictive counterpart. However, in most practical implementations DVC is not able to match the compression performance offered by its predictive counterparts. Therefore, the investigation of innovative and efficient entropy coding mechanisms is a stimulating research topic.
6.5 Structure of the implemented coder
One of the big issues that concern DVC is its compression performance. Distributed Video Coding is nowadays seen as an efficient solution to transmit video content over unreliable channels, but its possibilities in terms of compression performance are still at an early stage. Different approaches focus on achieving high coding gains with DVC in order to design a coding architecture that embraces both robustness and compression efficiency. Following this tendency, we have investigated an implementation of an efficient DSC architecture, focusing on the entropy coding of quantizer-related syndromes.
In order to obtain good compression results, we have designed our implementation using the building blocks of the H.264/AVC coding architecture. The structure of the H.264/AVC coder can be seen as a comprehensive synergism of coding solutions designed over the last 50 years. Many of the included features were already present in some of the previous hybrid coders, but were redefined in order to suit the general architecture. In addition, some new elements were introduced, providing the final coder with a wide set of tools that can be rearranged in many different ways. Experimental results prove that this orchestration of many different coding strategies is a winning solution, as the H.264/AVC architecture outperforms all of the previous coding standards, MPEG-4 [44] and H.263 [45] included. Therefore, the implementation of a DSC coding scheme on its basic structure (depicted in Fig. 6.5) is an interesting investigation field, considering that some new features of the coder pose new challenges. Instead of obtaining the DFD through motion compensation, motion search is used to determine
Figure 6.5: Encoder block diagram. The key differences between the presented DSC-based encoder and a H.264 encoder are the syndrome generator and the re-designed entropy coder.
how much of the current quantized information needs to be encoded, i.e. the correlation structure between the source and the side information. From this estimate, the encoder generates a piece of information, called syndrome, which allows the identification of the subspace where both the quantized information and its prediction lie.
The key differences between the presented DSC-based encoder and a H.264 encoder are: (1) an additional syndrome generator and (2) a modified entropy coding algorithm that better suits the probability distribution of syndrome values, in place of the original H.264 entropy coders, i.e. the Context-Adaptive Variable-Length Coder (CAVLC) and the Context-Adaptive Binary Arithmetic Coder (CABAC), which were designed according to the statistics of the quantized transformed DFD. We now describe these two modules in more detail.
6.6 The generation of syndromes
One of the issues raised by the implementation concerns the syndrome generation, which is strictly linked with the transform and quantization block. Traditional hybrid coders process the Displaced Frame Difference (DFD) between the current block and the reference provided by the Motion Estimation unit (see Section 2.2.2), transforming the residual error and quantizing the resulting coefficients. The H.264/AVC standard adopts a 4 × 4 multiplication-free transform, mapping the block of the residual signal x into a block X of transform coefficients (see Section 2.2.3).
The dynamic range of the coefficients is then reduced using a dead-zone quantizer, which can be characterized by the equation

    Xq(i, j) = ⌊ (X(i, j) + O(i, j, QP, mb_type)) / Δ(i, j, QP, mb_type) ⌋,   (6.1)

where the quantization step Δ(i, j, QP, mb_type) and the offset O(i, j, QP, mb_type) depend on the coefficient position (i, j) in the block, the Quantization Parameter QP, and the macroblock coding type mb_type. The standard allows specifying a quantization matrix that may vary the relation between different quantization parameters according to Rate-Distortion optimization criteria. In the current JVT implementation, the offset O(i, j, QP, mb_type) is (1/3)·Δ(i, j, QP, mb_type) for Intra blocks and (1/6)·Δ(i, j, QP, mb_type) for Inter blocks. For the sake of simplicity, in the following paragraphs we will refer to Δ(i, j, QP, mb_type) and O(i, j, QP, mb_type) as Δ and O respectively.
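A minimal sketch of this dead-zone rule for a single coefficient, assuming the JVT offsets quoted above (names are illustrative):

```python
import math

def deadzone_quantize(X, delta, intra=True):
    # Eq. (6.1) for one coefficient: floor((X + O) / delta),
    # with O = delta/3 for Intra blocks and delta/6 for Inter blocks.
    O = delta / 3 if intra else delta / 6
    return math.floor((X + O) / delta)

assert deadzone_quantize(28, 10, intra=True) == 3   # (28 + 3.33) / 10 -> 3
assert deadzone_quantize(28, 10, intra=False) == 2  # (28 + 1.67) / 10 -> 2
```

The larger Intra offset widens the chance of rounding up, i.e. it narrows the dead zone around zero.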
The adopted DSC scheme, on its part, transforms and quantizes the original signal x into the coefficients Xq, using the ME unit to find how much of the quantized information needs to be encoded, i.e. the correlation structure between the source and the side information. In our implementation, the quantization rule was changed in order to match the characteristics of the input signal and avoid an excessive mismatch between the quality obtained by the H.264/AVC and DSC coders for the same QP. Indeed, the adopted quantization offset O is set to

    O = Δ/3                            if QP < 12,
    O = (Δ/6) · (2 − 2^((QP−11)/40))   if QP ≥ 12,   (6.2)
allowing a coarser quantization for high values of the QP parameter. In these cases the quantization rule shifts towards a truncation rule, which reduces the occurrence of small coefficients, avoiding the coding of unnecessary information that does not significantly affect the resulting distortion. Side information is found in the previous frames through ME, computing in the transform domain the number n(i, j) of least significant bits that cannot be inferred from the predicted block, according to the equation

    n = 2 + ⌊ log2( |(Xq · Δ) − Xp| / Δ ) ⌋   if d > Δ,
    n = 0                                     otherwise,   (6.3)
with d = min{|(Xq · Δ) − Xp|, |X − Xp|}. The parameter Δ is the quantization step for that coefficient, Xq is the quantized coefficient from the current block,⁶ and Xp is the corresponding unquantized transform coefficient from the predicted block. From the value of n, a syndrome Z is generated, corresponding to the n least significant bits of Xq according to

    Z = Xq & (2ⁿ − 1),   (6.4)

where & denotes a bitwise AND operation (note that in equations (6.3) and (6.4) we omitted the indexes). Considering the lattice Λ that includes all of the quantized real values, the symbol Z identifies the sub-lattice ΛZ, where the binary representations of all values have the same least significant bits (see Fig. 6.6). Therefore the symbol can also be thought of as a sub-lattice index, also called syndrome. This coding strategy corresponds to the multilevel coset framework reported in [69]. In the following, we will represent syndromes with the notation S = 2ⁿ + Z in order to signal both the number of bits and the syndrome value.
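Equations (6.2)-(6.4) can be combined into a short encoder-side sketch. A base-2 logarithm is assumed in Eq. (6.3), inputs are scalars, and all names are illustrative:

```python
import math

def offset(delta, qp):
    # quantization offset of Eq. (6.2)
    if qp < 12:
        return delta / 3
    return (delta / 6) * (2 - 2 ** ((qp - 11) / 40))

def encode_syndrome(X, Xp, delta, qp):
    # dead-zone quantization of the ORIGINAL coefficient (not a residual)
    Xq = math.floor((X + offset(delta, qp)) / delta)
    # Eq. (6.3): how many LSBs the prediction cannot supply
    d = min(abs(Xq * delta - Xp), abs(X - Xp))
    if d <= delta:
        return 0                      # null syndrome, S = 0
    n = 2 + math.floor(math.log2(abs(Xq * delta - Xp) / delta))
    Z = Xq & ((1 << n) - 1)           # Eq. (6.4): n least significant bits
    return (1 << n) + Z               # S = 2^n + Z

assert encode_syndrome(100, 98, delta=10, qp=20) == 0    # prediction close enough
assert encode_syndrome(100, 20, delta=10, qp=20) == 42   # n = 5, Z = 10
```

Note how n grows with the prediction mismatch, so the sub-lattice spacing 2ⁿ·Δ always exceeds twice the distance between Xq·Δ and Xp.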
Given S and the reference Xp from motion compensation, the decoder can reconstruct the original quantized value Xq by selecting the point in the sub-lattice ΛZ which is closest to the reference Xp. Each syndrome conveys both the number of coded bits n and their values. Since these syndromes are not equally likely, they can be entropy coded to achieve higher compression efficiency. Here, we present a quad-tree based arithmetic coder that is tailored to
⁶ In our DSC implementation, M = 2¹⁴/Δ is added to each coefficient value in order to make it positive, where the 2¹⁴ factor depends on the amplification of the 4 × 4 transform (6 bits) on the input signal (8 bits). For further details, see [35].
Figure 6.6: Partitioning of the integer lattice into 3 levels. The parameter Δ identifies the quantization step, X is the source, Xq is the quantized codeword and Xp is the side information (omitting the spatial coordinates (i, j)). The number of levels in the partition tree depends on the correlation between Xq · Δ and Xp given X.
the distribution of the syndromes.
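Decoding is then a closest-point search in the sub-lattice ΛZ, sketched here for a single coefficient and a non-null syndrome in the S = 2ⁿ + Z representation (names are illustrative):

```python
def decode_syndrome(S, Xp, delta):
    # Split S = 2^n + Z into the bit count n and the n transmitted LSBs Z.
    n = S.bit_length() - 1
    Z = S - (1 << n)
    # Sub-lattice candidates share the same n LSBs: Z, Z + 2^n, Z + 2*2^n, ...
    # Pick the candidate closest to the prediction Xp (in quantizer units).
    step = 1 << n
    k = round((Xp / delta - Z) / step)
    return Z + k * step               # the recovered quantized value Xq

# example: S = 42 carries n = 5, Z = 10; with Xp/delta = 2 the closest
# candidate among ..., -22, 10, 42, ... is 10
assert decode_syndrome(42, Xp=20, delta=10) == 10
```

Reconstruction is correct as long as the true Xq is the sub-lattice point nearest to Xp, which is exactly what the choice of n in Eq. (6.3) guarantees.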
6.7 Entropy coding of syndromes
6.7.1 Entropy coding of syndromes
The compression gain of H.264/AVC is partially due to an effective entropy coding algorithm, the Context-Adaptive Binary Arithmetic Coder (CABAC [73]). Its structure relies on an efficient symbol binarization and on accurate context modeling that well suits the statistics of the input data. At first, the syntax elements produced by the video coder are converted into variable-length binary strings, and for each binary digit the modeling block assigns a context that is associated with a binary probability mass function (p.m.f.). Then, both the binary digit and the associated p.m.f. are sent to a binary arithmetic coder that maps them into an interval through a Finite State Machine (FSM) and updates the binary context. Unfortunately, the CABAC coder was designed and optimized for compressing quantized and transformed DFD. Some modifications were necessary in order to make it suitable for compressing syndromes.
Modeling syndrome distribution
Although each syndrome is actually represented by the least significant bits of a transform coefficient, its probability distribution may turn out quite different from that of a transform coefficient. In the literature, several works have proposed different probabilistic models for transform coefficients according to the characteristics of the adopted transform and its dimension. Most of the solutions adopted for video coding standards prior to H.264/AVC are based on Laplacian and generalized-Gaussian models (see [11, 57]). In [53], Kamaci et al. propose a better solution using a Cauchy probability distribution function to estimate the rate and distortion in a rate control algorithm, while [75] resorts to a Laplacian+impulsive distribution which proves to be a sufficiently-accurate low-cost approximation of the generalized-Gaussian distribution. After quantization this model can be easily approximated by a symmetric geometric pmf or a symmetric piecewise geometric pmf, which we use to simplify the analysis of the syndrome distribution.
We divide the coded symbols S into two categories: (1) null (zero) coefficients (second case in Equation 6.3), i.e. S = 0, and (2) non-null coefficients, i.e. S = 2ⁿ + Z (first case in Equation 6.3). We first analyze the distribution of these coded coefficients.
From Equation 6.3, the probability distribution of the symbol S can be approximated as

    p(S) ≃ KS · pe^(2^(n−2)) · (1 − pe^(2^(n−2))) · cosh((2^(n−1) − Z)·log(pr)) / cosh(2^(n−1)·log(pr)),   (6.5)

where pe and pr are constants characterizing two different geometric distributions, and KS is a normalizing constant (see Appendix A.1). Note that for pr → 1, i.e. log(pr) → 0, the term cosh((2^(n−1) − Z)·log(pr)) / cosh(2^(n−1)·log(pr)) is close to 1, and Equation (6.5) can be simplified as

    p(S) ≃ KS · pe^(2^(n−2)) · (1 − pe^(2^(n−2))).   (6.6)
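The model (6.5) can be evaluated directly; this sketch simply checks the limiting behaviour that yields (6.6) (parameter values are illustrative, not fitted ones):

```python
import math

def syndrome_pmf(n, Z, pe, pr, Ks=1.0):
    # Eq. (6.5): model probability of the syndrome S = 2^n + Z
    a = pe ** (2 ** (n - 2))
    num = math.cosh((2 ** (n - 1) - Z) * math.log(pr))
    den = math.cosh(2 ** (n - 1) * math.log(pr))
    return Ks * a * (1 - a) * (num / den)

# As pr -> 1 the cosh ratio tends to 1 and (6.5) collapses to (6.6):
p_full = syndrome_pmf(3, 2, pe=0.5, pr=0.999999)
p_simplified = 0.5 ** 2 * (1 - 0.5 ** 2)   # KS * pe^(2^(n-2)) * (1 - pe^(2^(n-2)))
assert abs(p_full - p_simplified) < 1e-3
```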
Experimental results prove that the model fits the syndrome statistics quite well (see Fig. 6.7). The fitting was made considering a different pr for n = 2 syndromes, since the pdf of transform coefficients for the 4 × 4 transform is better fitted by using two different values for pr. A generalized-Gaussian distribution with exponent lower than 1 can be simplified using a Laplacian model with an additive peak component, which can be well represented by an impulsive
term [75] or, more precisely, by another Laplacian component with a lower variance.

Figure 6.7: Comparison between the probability mass functions of syndromes (solid line) and the model in eq. (6.5) (dashed line). The results were computed from the sequence foreman with QP = 28; the x-axis reports the syndrome value while the y-axis reports its probability. Each plot refers to a different position (0 to 5) in the scanning order of the 4 × 4 transform block.

The reported graphs show that the statistics of syndromes are much more irregular than the statistics
of H.264/AVC coefficients; since the whole distribution is less biased towards zero even though the probability of having a null syndrome is higher, an efficient coding of the syndrome information becomes harder. Fig. 6.8 reports the difference between the entropy of syndromes and the entropy of DFD coefficients for typical values of pe and pr.

Figure 6.8: Difference between the entropy of syndromes and the entropy of DFD for different pe and pr values.

Figure 6.9: Probability of non-null syndrome/coefficient for the DSC coder and H.264/AVC (from the sequence foreman, frame 1, QP = 30).

In addition, the occurrence of null syndromes must be considered. In H.264/AVC, the high percentage of null quantized coefficients (called zeros, as in [39, 75]) in a transform block is efficiently exploited by a run-length coding algorithm. The quantized transform coefficients are scanned according to a zig-zag order, and then the number of "zeros" that lie between two non-null coefficients (the run) is coded. In the structure of the CABAC coder, run-length coding is replaced by coding the positions of non-zero coefficients: a binary context, associated with each position in the scanning order, models the probability of having a null coefficient at that position. Experimental results show that, in DFD-based video coders, transform blocks have a low-pass characteristic, since the probability of non-null quantized coefficients is higher at low frequencies. On the contrary, the DSC syndromes show a more irregular distribution of null values. According to the results reported in Fig. 6.9, the probability of a null syndrome is more evenly distributed across all frequencies, and a low-pass characteristic is less evident. As a consequence, a zig-zag scan of syndromes followed by a run-length coding strategy proves to be quite inefficient, as does coding the position of each single non-null syndrome, since both distributions are less biased towards zero. Fig. 6.10 reports the results of the two coders (H.264-based PRISM and the original H.264) on different sequences. The graphs show a 2 dB loss that is due to an excessive waste of bit rate to code the DSC syndromes.
Quad-tree based entropy coding of syndromes
Experimental results show that, in a transform block, null syndromes occur in neighboring positions, while non-null syndromes appear to be more sparse. From this result, the entropy coding block can take advantage of the null syndrome positions by adopting a quad-tree [55, 107] based solution. Adopting a hierarchical quad-tree partitioning of the 4 × 4 syndromes into sub-blocks allows an efficient coding of the syndromes. The top-level variable CBP-bit indicates
if there is any non-zero syndrome value in the 4 × 4 block. CBP-block then indicates which of the 4 sub-blocks contain non-zero syndrome values.

Figure 6.10: Coding performances of the original CABAC algorithm on the H.264 coefficients and the DSC syndromes (the CABAC was used to code the DSC syndromes as-is). The input signal has QCIF format at 30 frame/s, GOP IPPP, 15 frames: (a) news; (b) mobile.

Finally, each of these indicated sub-blocks has a variable CBP-subblock that indicates where and what the non-zero values
are (see Figure 6.11 for an example). At this level, the quad-tree coder characterizes which syndromes are different from zero, which ones are coded using two bits (called d1s or d1-syndromes), and which ones are coded with a higher number of bits. These variables are then sent to the binary arithmetic coder. Note that their names recall the CBP structure present in the H.264/AVC coder, which specifies which 8 × 8 block has non-zero syndromes. However, the CBP-like variables introduced here go further and pack more information than the original CBP. At first, the coder signals whether there are non-zero syndromes in the block. If some syndromes are not null, the 4 × 4 block is divided into four 2 × 2 sub-blocks (see Fig. 6.11), and the encoder generates the first quad-tree parameter CBP_block, equal to

    CBP_block = c0 + c1·2 + c2·4 + c3·8,

where ci = 1 if there are non-zero syndromes in the i-th sub-block and ci = 0 if all the syndromes of the i-th sub-block are null, for i = 0, …, 3.   (6.7)
Figure 6.11: Example of quad-tree coding using CBP variables.
In the next step of hierarchical quad-tree coding, for each sub-block that contains some syndromes different from zero, the encoder computes the parameter

    CBP_subblock = c0 + c1·6 + c2·36 + c3·216,

where, for i = 0, …, 3:
    ci = 0 if zi = 0;
    ci = 1 if zi is a d1 syndrome and zi & 3 = 0;
    ci = 2 if zi is a d1 syndrome and zi & 3 = 1;
    ci = 3 if zi is a d1 syndrome and zi & 3 = 2;
    ci = 4 if zi is a d1 syndrome and zi & 3 = 3;
    ci = 5 otherwise.   (6.8)
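Equations (6.7) and (6.8) can be sketched directly. Here a sub-block is a list of four syndrome values in the S = 2ⁿ + Z representation, so d1 syndromes (two coded bits, n = 2) are assumed to be the values 4-7 and z & 3 recovers their two bits; all names are illustrative:

```python
def is_d1(z):
    # d1 syndromes: n = 2, i.e. S = 2^2 + Z with Z in {0,...,3}
    return 4 <= z < 8

def cbp_block(subblocks):
    # Eq. (6.7): one bit per 2x2 sub-block, set if any syndrome is non-zero
    return sum(1 << i for i, sb in enumerate(subblocks) if any(sb))

def cbp_subblock(sb):
    # Eq. (6.8): one base-6 digit per syndrome position in the sub-block
    def digit(z):
        if z == 0:
            return 0
        if is_d1(z):
            return 1 + (z & 3)        # the two coded bits select digits 1..4
        return 5                      # syndrome coded with more than two bits
    return sum(digit(z) * 6 ** i for i, z in enumerate(sb))

assert cbp_block([[0, 0, 0, 0], [7, 0, 0, 0], [0, 0, 0, 0], [9, 0, 0, 0]]) == 10
assert cbp_subblock([7, 0, 0, 0]) == 4      # d1 with bits 11 -> digit 4
assert cbp_subblock([0, 9, 0, 0]) == 30     # non-d1 -> digit 5, weight 6
```

The base-6 digits pack the zero/d1/other distinction and the two d1 bits into a single symbol per position.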
It is possible to compare the average bit rate needed to code the position of zeros and d1-syndromes using the traditional CABAC scheme with the one needed using the quad-tree scheme. The comparison is reported in Table 6.1 for different sequences, without applying the arithmetic coding; the syndromes were obtained varying the quantization parameter in the range [15, 39]. It is possible to notice that the algorithm used by H.264 works well for low-motion sequences, but it turns out to be highly inefficient whenever there are many details and the number of coefficients increases.
Then, all of the CBP parameters are coded into a variable-length binary string using a Huffman coding table, and each bit is successively sent to the binary arithmetic coder. The remaining syndromes are coded separately, specifying the number of coded bit planes and their values for each syndrome. However, the experimental results show that the number of non-d1 syndromes in a sub-block is very rarely bigger than one, and whenever there are more non-d1 syndromes the number of coded bit planes is the same in most cases. Therefore, it is possible to specify a single number of coded bit planes for all the non-d1 syndromes in the sub-block, equal to the biggest one among all the non-d1 syndromes in the sub-block. In this way, we waste some bit planes whenever there are non-d1 syndromes with different bit-plane numbers, but we are able to reduce the amount of information sent
through the network, since the number of bit planes needs to be specified only once per sub-block.
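The shared bit-plane count described above reduces, per sub-block, to taking the maximum n among the non-d1 syndromes. Assuming the S = 2ⁿ + Z representation (so non-d1 syndromes are the values S ≥ 8), a hypothetical helper:

```python
def shared_bitplanes(subblock):
    # One bit-plane count for all non-d1 syndromes in the sub-block:
    # the largest n among them (n = floor(log2(S)), since S = 2^n + Z with Z < 2^n).
    ns = [s.bit_length() - 1 for s in subblock if s >= 8]
    return max(ns, default=0)

assert shared_bitplanes([0, 5, 9, 20]) == 4   # 20 -> n = 4 dominates 9 -> n = 3
assert shared_bitplanes([0, 4, 7, 0]) == 0    # only null/d1 syndromes present
```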
Sequence     Quad-tree    Run-length of CABAC
'foreman'    9.92         10.18
'mobile'     14.34        15.33
'news'       5.53         5.25

Table 6.1: Comparison of the average bit rate (from the binarization unit) needed to code the position of zeros and ones in the H.264 coder and the CBP blocks for the DSC coder (frame 1, QP = 28).
6.7.2 Experimental results
6.7.3 Evaluation of compression gain with no quality equalization
The effectiveness of the designed entropy coder was evaluated by comparing the performance of DSC coding with that provided by the H.264/AVC coder using the same set of R-D optimization
parameters7. Different video sequences were coded using different quantization parameters
centering0 5 10 15 20 25
28
30
32
34
36
38
40
42
44
46
Rate (kbit)
PS
NR
(dB
)
Proposed schemeH.264
(a) ‘foreman’ (training sequence)
0 0.5 1 1.5 2 2.5 3 3.528
30
32
34
36
38
40
42
44
46
48
Rate (kbit)
PS
NR
(dB
)
Proposed schemeH.264
(b) ‘news’ (test sequence)
Figure 6.12: PSNR vs. Bit rate for the first frame in the GOP (QP∈ [15 , 39]).
with GOP IPPP of 15 frames at30 frame/s under different test conditions. At first, we compared
the two entropy coders using a common reference for motion estimation, i.e. we coded the first
P frame of each sequence with the same temporal reference. Inthis case the reference block
used by H.264/AVC to compute the DFD and the reference block used by the DSC coder
are the same, and the different performance depend on the coding of residual information.
The implemented DSC coder proves to be very effective as it compares well with H.264/AVC
providing even higher quality for some sequences at medium bit rates (see Fig. 6.12). This
coding performance is mainly due to the adoption of a quad-tree entropy coder, which proves to
be more efficient than the traditional schemes based on run-length coding that are adopted in the
^7 Here both systems use only the 4×4 transform, with Lagrangian R-D optimization disabled and without cancelling unnecessary coefficients at high frequencies.
114 Chapter 6. Achieving H.264-like compression efficiency with Distributed Video Coding
previous DFD-based coding architectures. For the sequence 'foreman' we were able to obtain the same bit rate as H.264/AVC up to 40 dB of quality, while the sequence 'news' was coded more efficiently at medium bit rates thanks to the little motion that characterizes this sequence and increases the percentage of null syndromes in the DSC blocks. As explained in the previous section, the number of null syndromes per frame is higher with respect to H.264/AVC, and therefore the DSC coder avoids coding a lot of blocks by using the hierarchical CBP structures.
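The hierarchical CBP signalling described above can be sketched as a recursive significance map: a quadrant that contains only null syndromes costs a single bit, so largely empty blocks collapse quickly. The snippet below is a minimal illustration under that idea, not the thesis implementation; the function name and the bit-string output format are assumptions.

```python
import numpy as np

def quadtree_code(mask):
    """Recursively emit significance bits for a square 0/1 mask:
    '0' when the whole quadrant contains only null syndromes,
    otherwise '1' followed by the codes of its four sub-quadrants."""
    n = mask.shape[0]
    if not mask.any():
        return "0"
    if n == 1:
        return "1"
    h = n // 2
    return "1" + "".join(
        quadtree_code(mask[r:r + h, c:c + h])
        for r in (0, h) for c in (0, h))

# A mostly-null 8x8 significance map with a single non-zero syndrome:
mask = np.zeros((8, 8), dtype=int)
mask[1, 2] = 1
bits = quadtree_code(mask)  # 13 bits instead of 64
```

An all-zero block costs one bit, which matches the intuition that sequences with little motion (many null syndromes) benefit most from this structure.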
Unfortunately, this efficiency decreases whenever we code a whole GOP of frames. In this case, the efficiency of motion compensation for the following frames is reduced, since the distortion introduced by the DSC coder in the sequence is higher with respect to the one introduced by H.264. The DSC coder quantizes the transform coefficients of the original signal, while H.264/AVC quantizes the transform coefficients of the prediction error. Therefore, the DSC coded sequences are more affected by the distortion drift that is related to any prediction-based coding scheme, since the references used by the DSC coder in the motion compensation have a lower quality that precludes an efficient prediction and increases the number of bits for each syndrome. Despite this performance loss, Figure 6.13 shows that the proposed coder is still able to closely match the compression efficiency of H.264/AVC for the rest of the GOP as well, with a slight decrement in performance with respect to the common-reference case, especially at lower rates (below 40 dB). In addition, we must recall that every Rate-Distortion optimization strategy and coefficient cancellation was disabled in the proposed scheme in order to evaluate the performance of the entropy coding itself. It is possible to improve the compression gain by optimizing the adopted quantization step, the coding mode, and the erasure of unnecessary syndromes. Experimental results show that, by enabling a random Intra refresh for the macroblocks in the sequence, the performance of the DSC coder gets closer to that of H.264/AVC.
6.7.4 Evaluation of compression gain with Intra refresh
The quality degradation that affects the DSC-coded sequences can be significantly mitigated by forcing a certain number of macroblocks to be coded in Intra mode. Figure 6.14 shows the coding results for the sequences 'foreman' and 'news' enabling a random Intra refresh of macroblocks in the sequence. Hybrid coders frequently resort to it when transmitting over an error-prone channel, since the partial refresh of the decoder state makes it possible to stop the propagation of distortion in case some information gets lost. In this case, random Intra refresh performs a sort of "quality equalization" between the reference frame buffer of the H.264/AVC decoder and the one used by the DSC decoder. The distortion of the reference frames used by the DSC decoder is closer to the distortion of the frames in the buffer of the H.264/AVC decoder, mitigating the effects of error propagation. As a result, the performance of the DSC coder gets closer to that of H.264, recovering part of the performance gap shown in Figure 6.12.
[Figure: PSNR (dB) vs. rate (kbit/s) plots comparing the proposed scheme with H.264/AVC for (a) 'foreman' QCIF (training sequence), (b) 'news' QCIF (training sequence), (c) 'salesman' QCIF (test sequence), (d) 'sean' QCIF (test sequence), (e) 'foreman' CIF (test sequence), and (f) 'news' CIF (test sequence).]
Figure 6.13: PSNR vs. bit rate for a whole GOP (IPPP, QP ∈ [15, 39]).
[Figure: PSNR (dB) vs. rate (kbit/s) plots comparing the proposed scheme with H.264/AVC for (a) 'foreman' QCIF and (b) 'news' QCIF.]
Figure 6.14: PSNR vs. bit rate with Intra refresh enabled (11 macroblocks) (QP ∈ [15, 39]).
6.7.5 Evaluation of compression gain with rate control
Figure 6.12 shows that, for a given QP, the DSC-based coder achieves a lower quality at a
reduced bit rate. This mismatch can be equalized by implementing a rate control algorithm
that tunes the quantization parameter QP both at the macroblock and at the frame level in
order to keep the coded bit rate close to a target value. We adopted a modified version of the
algorithm proposed in Chapter 4, where the use of the percentage of “zeros” is replaced with
the percentage of null syndromes.
For the n-th frame, the algorithm allocates $T_n$ bits, where $T_n$ is computed as
$$T_n = \frac{G}{K_{I,DSC} \cdot n_I + n_{DSC}}. \qquad (6.9)$$
The parameter $G$ represents the number of bits that are left for the current GOP, and $n_t$, where $t = I$ or $DSC$, is the number of $t$-type frames in the GOP that still remain to be coded. As explained in Section 4.5.2, the ratio $K_{I,DSC}$ characterizes the complexity relation between Intra frames (I) and DSC-coded frames (DSC), and it is equal to
$$K_{I,DSC} = \frac{X_I}{X_{DSC}}, \quad \text{where } X_t = 2^{QP_t/6} R_t, \; t = I, DSC. \qquad (6.10)$$
These parameters are updated in the same way as their counterparts for the H.264/AVC coder (see Equation 4.39). The quantization parameter $QP_t$ is the one used to quantize the last $t$-type frame, while $R_t$ is the corresponding number of bits. Experimental results show that also for syndrome coding there is a linear relation between the number of coded bits $R$ and the percentage $\rho$ of null syndromes, and therefore the target bit rate $T_n$ can be related to a target percentage $\rho_n$ through the equation
$$\rho_n = \frac{T_n - q}{m}. \qquad (6.11)$$
The target percentage $\rho_n$ of null syndromes makes it possible to identify an average quantization step $QP_n$, which has to be corrected at the macroblock level in order to match the bandwidth constraints (see Section 4.5.3 for more details). The parameters $m$ and $q$ are estimated from previously-coded frames (see Equations 4.34 and 4.35). The same rate control
[Figure: PSNR (dB) vs. rate (kbit/s) plots comparing the proposed scheme with H.264/AVC for (a) 'sean' QCIF and (b) 'news' QCIF.]
Figure 6.15: PSNR vs. bit rate with rate control enabled (target bit rates {80, 96, 112, 128, 144, 160, 176, 192} kbit/s).
algorithm was adopted for both the proposed DSC coder and the H.264/AVC architecture (in this case DSC frames are replaced by P frames), changing only the complexity ratio $K_{I,P}$. In the DSC coder the complexity $K_{I,P}$ is divided by a constant $c = 1.4$ in order to equalize the quality mismatch between DSC frames and P frames in H.264/AVC. The scaling of the complexity ratio reduces the number of bits allocated to Intra frames and increases the bit rate for DSC frames, improving their quality. The adopted choices proved to be effective, as the compression results shown in the plots of Fig. 6.15 report that the designed DSC-based architecture is able to improve on the performance of H.264/AVC (using the same rate control algorithm).
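The frame-level allocation of equations (6.9)–(6.11) can be sketched as follows. This is only a sketch of the computation, not the actual rate controller: the function names are hypothetical, and the linear model is assumed to be $R = m\rho + q$ with $m$, $q$ estimated from previously-coded frames.

```python
def allocate_frame_bits(G, n_I, n_DSC, QP_I, R_I, QP_DSC, R_DSC):
    """Bit budget for the next frame, eq. (6.9): the remaining GOP bits G
    are split according to the Intra/DSC complexity ratio K = X_I / X_DSC,
    with X_t = 2**(QP_t / 6) * R_t as in eq. (6.10)."""
    X_I = 2 ** (QP_I / 6.0) * R_I
    X_DSC = 2 ** (QP_DSC / 6.0) * R_DSC
    K = X_I / X_DSC
    return G / (K * n_I + n_DSC)

def target_null_fraction(T_n, m, q):
    """Invert the linear rate model R = m*rho + q, eq. (6.11), to map the
    bit budget T_n to a target percentage of null syndromes."""
    return (T_n - q) / m
```

With equal per-frame complexities ($K = 1$) the budget degenerates to an even split of $G$ over the remaining frames, which is a useful sanity check.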
6.8 Summary
This chapter has presented the implementation of a DSC coder that reuses the building blocks of the H.264/AVC coder. The enhanced features of H.264/AVC allow improving the coding performance of the DSC coder itself, but pose, at the same time, new challenges in terms of entropy coding. The adoption of a DCT transform on smaller blocks increases the variance of the coefficients and makes the coding of syndromes harder, since the usual low-pass structure of each block to be coded is altered. A hierarchical quad-tree approach copes with this problem efficiently, since it allows obtaining good compression results with respect to H.264/AVC without the use of sophisticated channel codes. The coding performance is also improved by the adoption of rate control strategies and random Intra refresh of macroblocks. Such a compression-centric DSC-based encoder is an important building block for a robust version of DSC-based video coders.
Chapter 7
Conclusions
Previous chapters have presented the basic needs that are required by multimedia communications on wireless networks. They mainly concern the compression gain, the computational requirements, and the robustness to errors and losses. In this thesis, we presented some possible improvements that lead to a more efficient implementation of video communication applications in terms of both compression gain and robustness. In this chapter, we summarize the key results of this work and discuss some future research directions.
7.1 Summary
A recent innovation in the communication world is the massive introduction of multimedia services over wireless networks, which was mainly inspired by the aim of providing video and audio applications almost anywhere and anytime. More and more Internet and mobile communication providers offer a wide variety of multimedia-related services that span from video communication to the fruition of video-on-demand contents on mobile devices. This accomplishment was possible thanks to the recent development of mobile communication and the technological advances in the digital coding of multimedia data. However, the appearance of heterogeneous network scenarios, characterized by the interconnection of different types of networks and devices, and the massive spreading of mobile communications, affected by a higher percentage of losses and errors with respect to traditional wired communications, has modified the needs and the guidelines followed in the design of compression algorithms. As a matter of fact, the capability of providing reliable video is the most relevant issue in the spread and diffusion of multimedia mobile services, and the recent literature reports a wide number of different proposals that try to cope with the problems of transmitting a video sequence across a network affected by losses.
In this thesis, three main issues, which characterize the choices and the design of new coding schemes, are addressed.
The first issue is the compression gain, which has to guarantee both compliance with bandwidth constraints and a high visual quality in the reconstructed sequence at the decoder. This problem can be addressed in two ways: designing efficient entropy coding schemes and implementing efficient rate optimization algorithms. The standard H.264/AVC has proven to achieve the highest compression gain of the last ten years among hybrid video coders, and since its definition was oriented towards wireless applications, it has been adopted in this research as the basic coding architecture.
The coding performance of the H.264/AVC standard is the result of an efficient orchestration of different coding techniques that span from predictive coding to arithmetic entropy coding. However, the compression performance of the standard can be mainly ascribed to some of them, such as an enhanced macroblock partitioning in the Motion Compensation, the adoption of an efficient spatial prediction, the introduction of an adaptive deblocking filter in the prediction loop, and finally, the implementation of an efficient adaptive arithmetic coding engine, called Context Adaptive Binary Arithmetic Coder (CABAC). Syntax elements are converted into binary strings, and a context is assigned to each binary digit. The couples (symbol, context) are then processed by a binary arithmetic coder which codes the binary digit according to the probability model identified by the context and updates the statistics. Our results have shown that it is possible to improve this estimate by modifying the original probability model. In the original CABAC coder, the residual information is coded by first mapping the positions of non-null quantized DCT coefficients, and coding their values in a second step. The context modeller assigns to each non-zero coefficient a context according to its order in the zig-zag scanning. This context modelization does not take into consideration the position of the coefficient in the block and its neighbors. Experimental results show that there is a statistical dependence between DCT coefficients both at neighboring positions within the same block and at the same positions in neighboring blocks. Therefore, it is possible to take advantage of this dependence to improve the estimate by associating contexts with conditional probabilities in place of absolute probabilities. The absolute values of the DCT coefficients are sliced into bit planes, and for each bit plane a Directed Acyclic Graph is adopted to represent the statistical relations among neighboring bits. Each edge in the model is associated with a conditional probability between adjacent bits, and it is used to propagate binary probabilities through the graph. The CABAC, endowed with this new probability estimate, produces a smaller bit stream with respect to its original definition (about 10% smaller).
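As a toy illustration of propagating binary probabilities along the edges of such a graph, the step below computes the marginal probability of a bit being 1 from its parent's marginal and the two conditional probabilities attached to the connecting edge. A single-parent chain is a simplification of the actual DAG model, and the names are hypothetical.

```python
def propagate(p_parent, p_cond):
    """One propagation step along a graph edge.

    p_parent : P(parent bit = 1)
    p_cond   : (P(bit = 1 | parent = 0), P(bit = 1 | parent = 1))

    Returns the marginal P(bit = 1) by total probability."""
    p10, p11 = p_cond
    return p10 * (1.0 - p_parent) + p11 * p_parent
```

The resulting marginal can then drive the arithmetic coder in place of an absolute (context-only) probability estimate.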
Compression gain also concerns the design of efficient rate optimization and control algorithms, and in the literature different rate control strategies have been published that are able to keep the produced bit rate within the bandwidth constraints while allowing a good perceptual quality in the reconstructed sequence at the same time. Computational complexity introduces a further differentiating criterion among different rate control algorithms. A good control of the bit stream size and a high perceptual quality at the decoder can be obtained with a highly computationally-expensive rate control algorithm. On the other hand, the adoption of computationally lighter solutions is paid for with a reduced coding performance in terms of both lower visual quality and coarser accuracy in meeting the bandwidth constraints. The investigation of Chapter 4 is mainly focused on finding an efficient trade-off between the two opposite solutions. Since the efficiency of each rate control algorithm is deeply affected by the adopted rate model, an efficient solution was found by He et al. in [38]. The proposed model states the linear relation that exists between the produced bit rate and the percentage $\rho$ of null quantized DCT coefficients. However, it is necessary to map the estimated $\rho$ values to a quantization step. This thesis introduces a rate model in the joint domain $(\rho, E_q)$, which permits an accurate low-cost estimate of the target quantization step by relating the percentage $\rho$ to the energy of the quantized signal $E_q$. This model is then adopted in a low-cost rate control algorithm implementing a proportional control together with an efficient frame skipping scheme. Less significant frames, like B frames, are skipped whenever the transmission buffer is close to overflow, saving their bits for the improvement of the visual quality in the following frames. Experimental results show that the proposed technique compares well with other techniques proposed by the JVT committee itself.
Although the adoption of efficient entropy coding algorithms and rate control strategies allows the receiver to experience a better visual quality at a given bandwidth, these coding efforts can turn out to be completely useless in case the channel is affected by errors and losses. Different solutions have been proposed to cope with this problem, and their efficiency often depends on the target application for which they are conceived. An efficient solution consists in introducing some redundant information in the coded packet stream. A recent approach includes the RTP packets produced by the video source coder into a matrix columnwise and applies a cross-packet FEC code along the rows. Redundant data are then packetized and sent to the receiver, which can reconstruct the lost information provided that enough redundant packets are received. This approach proves to be very effective whenever the matrix size is well tuned to the packet lengths and the information they carry. Experimental results show that the performance of this scheme can be significantly improved whenever the matrix size is chosen according to the packet lengths and the percentage of null quantized DCT coefficients. Chapter 5 proposes a novel joint source-channel rate control algorithm based on the percentage of zeros, which adapts the protection level to the characteristics of each frame. The performance of the algorithm proves to be significantly better than that of its non-adaptive counterpart.
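The matrix arrangement can be sketched as below, with a single XOR parity byte per row standing in for the Reed-Solomon codes of the actual scheme (so only one lost packet per row set is recoverable); the zero-padding of short packets and the function names are assumptions.

```python
from functools import reduce

def cross_packet_parity(packets, pkt_len):
    """Place each (zero-padded) packet in a matrix column and compute one
    parity byte per row by XOR-ing across the columns."""
    cols = [p.ljust(pkt_len, b"\x00") for p in packets]
    return bytes(reduce(lambda a, b: a ^ b, (c[r] for c in cols), 0)
                 for r in range(pkt_len))

def recover_lost(received, parity, pkt_len):
    """XOR of the parity column with all surviving columns rebuilds the
    single missing packet column."""
    return cross_packet_parity(received + [parity], pkt_len)

packets = [b"RTP1", b"RTP-two", b"x"]          # illustrative payloads
parity = cross_packet_parity(packets, 8)        # redundant packet to transmit
```

Padding short packets to the matrix height illustrates why tuning the matrix size to the packet lengths matters: every padding byte is protected redundancy spent on no information.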
A possible alternative to the introduction of redundant data is adopting a robust source coding scheme based on Distributed Source Coding (DSC) principles. In the literature, several approaches have been proposed during the last years, mainly focused on complex channel coding/decoding schemes. A simpler solution was proposed in 2002 by Puri and Ramchandran [93], who designed a Distributed Video Coding scheme that compared well with the efficiency of traditional hybrid techniques but was able to produce a robust bit stream. The PRISM coder processes the signal in the transform domain and assigns to each transform block a sort of "signature" that identifies the most significant bits of the transform coefficients. The remaining least significant bits are Intra coded, and both the signature and the Intra coded bits are sent to the receiver. The decoder searches for a block in its frame buffer with the same signature and, through the Intra-coded bits, reconstructs the coded transform block. Note that both the requirements of robustness and of reduced computational complexity at the encoder are satisfied. The motion estimation is performed at the decoder, shifting the computational complexity to the network side. Moreover, decoding is possible independently of the adopted reference block, allowing a correct reconstruction of the coded sequence even when the reference buffer at the decoder is different from the one at the encoder. Therefore, the investigation of effective algorithms that allow a high compression gain is an interesting research topic. Chapter 6 presents new results obtained by investigating efficient entropy coding algorithms for the PRISM syndromes. The investigation has shown that PRISM syndromes present rather peculiar statistics that make the coding solutions adopted for DSC coefficients ineffective and require ad-hoc coding schemes. In our work we proposed an original model for the syndrome statistics that proved to be quite accurate when matched with experimental results. Moreover, the investigation has led to the design of a novel arithmetic coding scheme based on quad-tree coding that improves the performance of the PRISM coder. Adopting the same prediction mechanism as H.264/AVC (based on Motion Vectors), it is possible to obtain coding results comparable to those of H.264/AVC with the same computational effort.
7.2 Future Research
Several important areas of research stem from the developments discussed in this thesis. As for the DAG-based arithmetic coder, it is possible to improve the proposed scheme working in two directions, i.e. the reduction of the computational complexity and the extension of the DAG model to other syntax elements. The computational complexity can be reduced by grouping the DAGs into clusters. On the other hand, the spatial correlation allows the DAG modeller to characterize the probability of other syntax elements, like motion vectors.
Another important investigation field regards the improvement of the matrix-based FEC scheme of Chapter 5. So far only Reed-Solomon codes have been considered, while the literature has proposed more efficient schemes. An interesting issue arises from the adoption of Turbo Codes. Redundant packets can be computed both along the columns and along the rows, creating two different sets of redundancy bytes. It is possible to shape the additional redundancy including packets from both codes, implementing a sort of Turbo code at the RTP level. Since Turbo codes can significantly increase the recovery performance, the investigation of this possibility is an interesting research field.
Finally, the novelty of DSC schemes opens a wide variety of research topics. Among the most important ones, two issues turn out to be decisive for the performance of these schemes. The first is achieving a high compression gain, which can be obtained through effective entropy coding solutions and rate-distortion optimization algorithms. On the other hand, the final performance is deeply affected by the classification algorithm. Therefore, the design of new strategies that are able to characterize the coded information in a transmission environment affected by errors is a stimulating research field that offers many possible solutions to investigate.
Appendix A
Relation between $E_q$ and $\rho$
In the rate control algorithm, the energy of the quantized signal is approximated using the parameter
$$\bar{E}_q = \frac{act}{\Delta} \qquad (A.1)$$
which can be expressed as follows
$$\frac{act}{\Delta} = \frac{\sum_{m=0}^{N_{MB}-1} \sum_{x,y=0}^{15} |err_m(x,y)|}{N_{MB} \cdot \Delta}. \qquad (A.2)$$
For a large $N_{MB}$, we can relate the average activity to the average energy
$$E = \frac{1}{N_{MB}} \sum_{m=0}^{N_{MB}-1} \sum_{x,y=0}^{15} |err_m(x,y)|^2 \qquad (A.3)$$
via the relation
$$\frac{1}{N} \sum_{n=0}^{N-1} |x_n| = \xi_x \sqrt{\frac{1}{N} \sum_{n=0}^{N-1} |x_n|^2} \qquad (A.4)$$
where $\xi_x = \mathrm{E}[|x|]/\sqrt{\mathrm{E}[|x|^2]}$ is a shape factor depending on the p.d.f. of the zero-mean random variable $x$ (e.g. $\xi_x = \sqrt{2/\pi}$ for a Gaussian variable, $\xi_x = 1$ for a Laplacian variable).
This leads to the approximation
$$\bar{E}_q = \frac{act}{\Delta} = \frac{\sum_{m=0}^{N_{MB}-1}\sum_{x,y=0}^{15}|err_m(x,y)|}{N_{MB}\cdot\Delta} = \frac{\xi_x}{\Delta}\sqrt{\frac{\sum_{m=0}^{N_{MB}-1}\sum_{x,y=0}^{15}|err_m(x,y)|^2}{N_{MB}}} = \frac{\xi_x}{\Delta}\sqrt{\frac{\sum_{m=0}^{N_{MB}-1}\sum_{x,y=0}^{15}|Err_m(x,y)|^2}{N_{MB}}} = \xi_x\sqrt{\frac{\sum_{m=0}^{N_{MB}-1}\sum_{x,y=0}^{15}\left(Err_m(x,y)/\Delta\right)^2}{N_{MB}}} \simeq \xi_x\sqrt{\mathrm{E}\!\left[\left(Err_m(x,y)/\Delta\right)^2\right]} = \xi_x\sqrt{E_q}, \qquad (A.5)$$
where the activity $act_m$ is computed as expressed in equation (4.7) and $Err(x,y)$, $x,y = 0,\dots,15$, is the signal $err(x,y)$ after the transformation. In fact, the average activity value $act$ computed on the original residual signal (eq. A.1) is linearly proportional to the square root of the energy of its transformed version.
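The shape-factor relation (A.4) can be checked numerically: for zero-mean Gaussian samples, the ratio between the mean absolute value and the root-mean-square value approaches $\sqrt{2/\pi}$. A small Monte-Carlo sketch (sample size and seed are arbitrary):

```python
import math
import random

random.seed(7)
N = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(N)]

mean_abs = sum(abs(x) for x in xs) / N        # (1/N) sum |x_n|
rms = math.sqrt(sum(x * x for x in xs) / N)   # sqrt((1/N) sum |x_n|^2)
xi_gauss = math.sqrt(2.0 / math.pi)           # shape factor for a Gaussian

ratio = mean_abs / rms                        # should be close to xi_gauss
```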
According to the probability density function reported in eq. (4.10), the percentage of quantized DCT coefficients different from zero is equal to
$$\theta = 1 - \rho = 2\int_{\Delta}^{+\infty} p_x(a)\,da = \frac{e^{-\frac{2}{\gamma'}\Delta}}{1+\alpha'}. \qquad (A.6)$$
The energy of the quantized signal is
$$E_q = 2\int_{\Delta}^{+\infty} [Q(a)]^2\, p_x(a)\,da, \qquad (A.7)$$
where $Q(a)$ is the quantization index of the coefficient $a$, and it depends on $1-\rho$ as in the following equation
$$E_q = \int_{-\infty}^{+\infty} [Q(a)]^2\, p_x(a)\,da = 2\sum_{i=1}^{+\infty} i^2 \int_{\Delta\cdot i}^{\Delta\cdot(i+1)} p_x(a)\,da = 2\sum_{i=1}^{+\infty} i^2 \int_{\Delta i}^{\Delta(i+1)} \frac{2\, e^{-\frac{2}{\gamma'}a}}{(1+\alpha')\cdot\gamma'}\,da = 2\sum_{i=1}^{+\infty} i^2\, e^{-\frac{2}{\gamma'}\Delta i} \cdot \frac{1 - e^{-\frac{2}{\gamma'}\Delta}}{1+\alpha'}. \qquad (A.8)$$
Let
$$\varsigma = 1 - e^{-\frac{2}{\gamma'}\Delta}, \qquad (A.9)$$
then the series in equation (A.8) converges to the value
$$\sum_{i=1}^{+\infty} i^2\, e^{-\frac{2}{\gamma'}\Delta i}\left(1 - e^{-\frac{2}{\gamma'}\Delta}\right) = \sum_{i=1}^{+\infty} i^2 (1-\varsigma)^i\, \varsigma = \frac{\varsigma^2 - 3\varsigma + 2}{\varsigma^2}. \qquad (A.10)$$
The energy of the quantized signal can be expressed as
$$E_q = \frac{2(\varsigma^2 - 3\varsigma + 2)}{(1+\alpha')\cdot\varsigma^2} = \frac{2(1-\varsigma)(2-\varsigma)}{(1+\alpha')\cdot\varsigma^2}, \qquad (A.11)$$
and therefore the approximation of equation (A.1) can be written as
$$\bar{E}_q(\varsigma) = \xi_x\sqrt{E_q(\varsigma)} = \xi_x\sqrt{\frac{2(\varsigma^2 - 3\varsigma + 2)}{(1+\alpha')\cdot\varsigma^2}}. \qquad (A.12)$$
The first derivative of equation (A.12) shows that $\sqrt{E_q(\varsigma)}$ has only one stationary point $\varsigma = 2/3$ in the range $[0,1]$. In addition, the second derivative shows that the function is convex on the whole interval $[0,1]$. Therefore, the Taylor expansion of the function $\bar{E}_q$ can be well approximated in $[0,1]$ by a second-degree polynomial. The experimental results reported in Fig. 4.4 show that this approximation is sufficiently accurate. Therefore, taking into account that eq. (A.6) implies
$$\varsigma = 1 - e^{-\frac{2}{\gamma'}\Delta} = 1 - (1+\alpha')\theta, \qquad (A.13)$$
we can write
$$\bar{E}_q = \xi_x\sqrt{E_q(\varsigma)} = a_0 + a_1\varsigma + a_2\varsigma^2, \qquad (A.14)$$
which is equivalent to
$$\bar{E}_q = a_0 + a_1\left(1 - (1+\alpha')\theta\right) + a_2\left(1 - (1+\alpha')\theta\right)^2 = c_0 + c_1\theta + c_2\theta^2, \qquad (A.15)$$
with
$$c_0 = a_0 + a_1 + a_2, \qquad c_1 = (-a_1 - 2a_2)(1+\alpha'), \qquad c_2 = a_2(1+\alpha')^2. \qquad (A.16)$$
Since $\theta$ is the complementary value of the percentage $\rho$, equation (A.15) can be written as in (A.2).
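The quality of the second-degree polynomial approximation can be probed numerically, e.g. by least-squares fitting a parabola to $\xi_x\sqrt{E_q(\varsigma)}$ of eq. (A.12) over a sub-range that avoids the singularity at $\varsigma \to 0$. The value of $\alpha'$ below is an arbitrary assumption, and the Gaussian shape factor is used only as an example.

```python
import numpy as np

alpha_p = 0.1                        # hypothetical value of the parameter alpha'
xi = np.sqrt(2.0 / np.pi)            # Gaussian shape factor, as an example

def E_bar(s):
    # xi_x * sqrt(E_q(varsigma)) from eq. (A.12)
    return xi * np.sqrt(2.0 * (s**2 - 3.0 * s + 2.0) / ((1.0 + alpha_p) * s**2))

s = np.linspace(0.3, 1.0, 200)       # sub-range avoiding the pole at 0
coeffs = np.polyfit(s, E_bar(s), 2)  # least-squares parabola
max_err = np.max(np.abs(np.polyval(coeffs, s) - E_bar(s)))
```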
A.1 Derivation of probability distribution for syndromes
The probability distribution for non-zero syndromes can be approximated as follows. According to the first case of equation (6.3), the number of bits that must be included in the syndrome is
$$n = 2 + \left\lfloor \log_2\!\left(\frac{|X_q\cdot\Delta - X_p|}{\Delta}\right)\right\rfloor = 2 + \left\lfloor \log_2\!\left(\left|X_q - \frac{X_p}{\Delta}\right|\right)\right\rfloor \simeq 2 + \left\lfloor \log_2\!\left(|X_q - X_{p,q}|\right)\right\rfloor = 2 + \left\lfloor \log_2\!\left(|E|\right)\right\rfloor, \qquad (A.17)$$
where $X_p$ is the side information (reference block), $X_{p,q}$ is the quantized version of $X_p$, and $E = X_q - X_{p,q}$. Assuming that both $X_q$ and the difference $E$ can be approximated by independent symmetrical geometric variables, the probability mass functions of $X_q$ and $E$ can be respectively expressed as
$$p_r(X_q) = \frac{1-p_r}{1+p_r}\, p_r^{|X_q - M|}, \qquad p_e(E) = \frac{1-p_e}{1+p_e}\, p_e^{|E|}, \qquad (A.18)$$
where we assume that the coefficients $X_q$ are shifted in such a way that, omitting the tails of the distribution, they can be included in the set $[0, 2M]$. The parameters $p_r$ and $p_e$ completely characterize the two probability distributions. In our implementation, they are estimated from the experimental data using log-linear fitting.
Pay attention to the fact that the coefficients are shifted in order to be always positive, that is to say that, after the transform operation, we add $M = 2^{15}$ to each coefficient. This value was computed considering that the amplification of the $4\times 4$ transform, which is equal to $36$ (worst case), can be represented with $6$ bits, and the residual signal can be represented with $1+8$ bits. In the following analysis, we will omit considering the tails of the p.m.f. since they have a small influence on the final probability value and the effective transform coefficients are included in the range $[-M, M]$. Therefore, the previous p.m.f.s in Equation (A.18) are centered around $M$.
Let the syndrome $Z$ be coded with $n$ bits (i.e. $2^{n-2} \le |E| = |X_q - X_{p,q}| < 2^{n-1}$). In the following, the couple $(E, n)$ will also be referenced with the symbol $S = Z + 2^n$. We can write the joint p.d.f. of the syndrome $Z$ and the number of bits $n$ as
$$p(Z, n) = \sum_{X_q=0}^{2M-1} \sum_{X_{p,q}=0}^{2M-1} p_r(X_q)\, p_e(X_q - X_{p,q}) \cdot \mathbf{1}\!\left(2^{n-2} < |X_q - X_{p,q}| < 2^{n-1}\right) \cdot \mathbf{1}\!\left(Z = X_q \,\&\, (2^n - 1)\right), \qquad (A.19)$$
where $\mathbf{1}(\cdot)$ is the indicator function.
Let $k_T = M/2^n$ ($k_T$ is an integer since $M$ is a power of 2); then the sum can be written as
$$\begin{aligned} p(S) &= \sum_{k=0}^{k_T-1} \sum_{E=2^{n-2}}^{2^{n-1}-1} p_r(k\cdot 2^n + Z)\, p_e(E) \cdot \left[\mathbf{1}(k\cdot 2^n + Z \ge E) + 1\right] \\ &\quad + \sum_{k=k_T}^{2k_T-1} \sum_{E=2^{n-2}}^{2^{n-1}-1} p_r(k\cdot 2^n + Z)\, p_e(E) \cdot \left[\mathbf{1}(k\cdot 2^n + Z < 2M - E) + 1\right] \\ &\simeq \sum_{k=0}^{k_T-1} \sum_{E=2^{n-2}}^{2^{n-1}-1} 2\, p_r(k\cdot 2^n + Z)\, p_e(E) + \sum_{k=k_T}^{2k_T-1} \sum_{E=2^{n-2}}^{2^{n-1}-1} 2\, p_r(k\cdot 2^n + Z)\, p_e(E), \end{aligned} \qquad (A.20)$$
since typically $\log_2 M \gg n$, thus $p(\mathbf{1}(k\cdot 2^n + Z \ge E) = 1) \simeq 1$ and $p(\mathbf{1}(k\cdot 2^n + Z < 2M - E) = 1) \simeq 1$.
This can then be rewritten as
$$p(S) = \frac{1-p_r}{1+p_r}\,\frac{1-p_e}{1+p_e} \cdot \left[ \sum_{k=0}^{k_T-1} p_r^{k_T\cdot 2^n}\, p_r^{-k\cdot 2^n}\, p_r^{-Z} \sum_{E=2^{n-2}}^{2^{n-1}-1} 2\, p_e^{E} + \sum_{k=k_T}^{2k_T-1} p_r^{k\cdot 2^n}\, p_r^{-k_T\cdot 2^n}\, p_r^{Z} \sum_{E=2^{n-2}}^{2^{n-1}-1} 2\, p_e^{E} \right], \qquad (A.21)$$
which can be further simplified into
$$p(S) = \frac{1-p_r}{1+p_r}\,\frac{p_e^{2^{n-2}}\left(1 - p_e^{2^{n-2}}\right)}{1+p_e} \left\{ \frac{p_r^{M} - 1}{1 - p_r^{-2^n}}\, p_r^{-Z} + \frac{1 - p_r^{M}}{1 - p_r^{2^n}}\, p_r^{Z} \right\} \simeq K_S \cdot p_e^{2^{n-2}}\left(1 - p_e^{2^{n-2}}\right) \frac{\cosh\!\left((2^{n-1} - Z)\cdot\log(p_r)\right)}{\cosh\!\left(2^{n-1}\cdot\log(p_r)\right)}, \qquad (A.22)$$
where $K_S$ is a normalizing constant. Note that for $p_r \to 1$, i.e. $\log(p_r) \to 0$, the term $\frac{\cosh((2^{n-1}-Z)\log(p_r))}{\cosh(2^{n-1}\log(p_r))}$ is close to $1$, and Equation (A.22) can be simplified as
$$p(S) \simeq K_S\, p_e^{2^{n-2}}\left(1 - p_e^{2^{n-2}}\right). \qquad (A.23)$$
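The $p_e$-dependent factor of eq. (A.23) can be verified directly: for a symmetric geometric error $E$, the probability that $|E|$ falls in the $n$-bit bin $[2^{n-2}, 2^{n-1})$ admits a closed form proportional to $p_e^{2^{n-2}}(1 - p_e^{2^{n-2}})$. A sketch (function names assumed):

```python
def prob_n_bits(pe, n):
    """P(2^(n-2) <= |E| < 2^(n-1)) for a symmetric geometric E with
    p.m.f. (1-pe)/(1+pe) * pe^|E|: direct summation over both tails."""
    lo, hi = 2 ** (n - 2), 2 ** (n - 1)
    c = (1.0 - pe) / (1.0 + pe)
    return sum(2.0 * c * pe ** e for e in range(lo, hi))

def prob_n_bits_closed(pe, n):
    """Same quantity in closed form, 2*pe^(2^(n-2))*(1 - pe^(2^(n-2)))/(1+pe),
    i.e. proportional to the p_e factor of eq. (A.23)."""
    lo = 2 ** (n - 2)
    return 2.0 * pe ** lo * (1.0 - pe ** lo) / (1.0 + pe)
```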
Bibliography
[1] A. Aaron, S. Rane, E. Setton, and B. Girod. Transform-Domain Wyner-Ziv Codec for
Video. In Proceedings of SPIE Visual Communications and Image Processing Confer-
ence, San Jose, California, USA, January 2004.
[2] A. Aaron, R. Zhang, and B. Girod. Wyner-ziv coding for motion video. InProceed-
ings of Asilomar Conference on Signals, Systems and Computers 2002, Pacific Grove,
California, USA, November 2002.
[3] M. E. Al-Mualla, C. N. Canagarajah, and D. R. Bull.Video Coding for Mobile Commu-
nications. Academic Press, An Imprint of Elsevier Science, 2002.
[4] D. Alfonso, D. Bagni, L. Celetto, and S. Milani. Constantbit-rate control efficiency
with fast motion estimation in H.264/AVC video coding standard. InProc. of the 12th
European Signal Processing Conference (EUSIPCO 2004), pages 1271–1274, Wien,
Austria, September 6–10 2004.
[5] P. Baccichet.Solutions for the protection and reconstruction of H.264/AVC video signals
for real time transmission over lossy networks. PhD thesis, Istituto di Elettronica e di
Ingegneria dell’Informazione e delle Telecomunicazioni (I.E.I.I.T.), University of Milan,
Milan, Italy, 2006.
[6] P. Baccichet and A. Chimienti. Forward Selective Protection Exploiting Redundant
Slices and FMO In H.264/AVC. InProc. of the IEEE International Conference on
Image Processing - ICIP, submitted, Atlanta, GA,USA, October 2006.
[7] P. Baccichet, S. Rane, and B. Girod. Systematic Lossy Error Protection based on
H.264/AVC Redundant Slices and Flexible Macroblock Ordering. InProc. of the IEEE
Packet Video Workshop, Hangzou, China, April 2006.
[8] I. Bauermann and E. Steinbach. Further Lossless compression of JPEG-images. In
Picture Coding Symposium, PCS 2004, San Francisco, California, USA, December15–
17 2004.
[9] Eric Bodden, Malte Clasen, and Joachim Kneis. Arithmetic Coding revealed. InProsem-
inar Datenkompression 2001. RWTH Aachen University, 2002. German version avail-
able: Proseminar Datenkompression, Arithmetische Kodierung.
130 Bibliography
[10] J. Bormans, J. Gelissen, and A. Perkis. MPEG-21: The21st century multimedia frame-
work. IEEE Signal Processing Mag., 20(2):53–62, March 2003.
[11] G. Calvagno, C. Ghirardi, G.A. Mian, and R. Rinaldo. Modeling of subband data for
buffer control.IEEE Trans. Circuits Syst. Video Technol., 7(2):402–408, April 1997.
[12] O. Campana and R. Contiero. An H.264/AVC video coder based on Multiple Descrip-
tion Scalar Quantizer. InProc. of40th Asilomar Conference on Signals, Systems, and
Computers, Pacific Grove, CA, USA, October 29 – November 1 2006.
[13] O. Campana and S. Milani. A Multiple Description CodingScheme For The H.264/AVC
Coder. InProc. of the International Conference on Telecommunication and Computer
Networks IADAT-tcn2004, pages 191–195, San Sebastian, Spain, December 2004.
[14] L. Cappellari and G.A. Mian. Analysis of joint predictive-transform coding. InProc.
of the Sixth Baiona Workshop on Signal Processing in Communications, Baiona, Spain,
September 8–10, 2003.
[15] J.-J. Chen and D. W. Lin. Optimal bit allocation for coding of video signal over ATM
networks.IEEE J. Select. Areas Commun., 15(6):1002–1015, August 1997.
[16] M. D. Fairchild. Color Appearence Model. Wiley, 2005.
[17] N. Färber, K. Stuhlmuller, and B. Girod. Analysis of error propagation in hybrid video
coding with application to error resilience. InProc. of International Conference on
Image Processing, ICIP 1999, pages 550–554, Thessaloniki, Greece, October 1999.
[18] G. Galilei. Dialogo sopra i due massimi sistemi Tolemaico e Copernicano (Dialog on
the Two Chief Systems of the World), 1632.
[19] R. G. Gallager. Variations on a theme by Huffman.IEEE Trans. Inform. Theory,
24(6):668–664, December 1978.
[20] G. Gennari, D. Bagni, A. Borneo, and L. Pezzoni. Slice header reconstruction for
H.264/AVC robust decoders. InInternational Workshop on MultiMedia Signal Pro-
cessing (MMSP 2005), Shanghai, Cina, November 2005.
[21] G. Gennari and L. Celetto. An H.264 robust decoder for wireless environments. In
STMicroelectronics Journal, Wien, Austria, September 6–10 2004.
[22] G. Gennari, G. A. Mian, and L. Celetto. A robust H.264 decoder with error concealment
capabilities. InProc. of the 12th European Signal Processing Conference (EUSIPCO
2004), pages 649–652, Wien, Austria, September 6–10 2004.
[23] G. Gennari, G.A. Mian, D. Bagni, and L. Celetto. A robust H.264/AVC decoder capable of error concealment and slice header reconstruction. In Proc. of Wireless Reconfigurable Terminals and Platforms (Wirtep 2006), Rome, Italy, April 10–12 2006.
[24] Gianluca Gennari. Decodificatore H.264 robusto nei confronti degli errori di trasmissione (An H.264 decoder robust to transmission errors). Master's thesis, Department of Information Engineering, University of Padova,
Padova, Italy, July 2003.
[25] A. Gersho. The channel splitting problem and modulo-PCM coding. Technical report,
Bell Labs Memo for Record (not archived), October 1979.
[26] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero. Distributed Video Coding.
Proc. of the IEEE, Special Issue on Video Coding and Delivery, 93(1):71–83, January
2005. Invited Paper.
[27] B. Girod and N. Färber. Wireless Video. In M.-T. Sun and A. R. Reibman, editors,
Compressed Video Over Networks, chapter 12. Marcel Dekker Inc., September 2001.
[28] B. Girod and N. Färber. Feedback-based error control for mobile video transmission.
Proc. of the IEEE, 87(10):1707–1723, October 1999.
[29] C. Gomila. The H.264/MPEG-4 AVC Video Coding Standard. EURASIP Newsletter,
15(2):19–34, June 2004.
[30] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Pearson Education, 2002.
[31] V. K. Goyal and J. Kovacevic. Generalized Multiple Description Coding with Pairwise Correlating Transform. IEEE Trans. Inform. Theory, 47(6):2199–2224, September 2001.
[32] Vivek K. Goyal. Multiple Description Coding: Compression Meets The Network. IEEE Signal Processing Mag., 18(5):74–93, September 2001.
[33] 3GPP TSG-SA4 Siemens Group. Matrix approach vs. packet approach for MBMS application layer FEC. In 3GPP TSG-SA4 Meeting TSG-SA4 # 30, Malaga, Spain, February 23–27 2004.
[34] 3GPP TSG-SA4 Siemens Group. Simulation results of MBMS application layer FEC with RS-codes. In 3GPP TSG-SA4 Meeting TSG-SA4 # 30, Malaga, Spain, February 23–27 2004.
[35] A. Hallapuro, M. Karczewicz, and H. Malvar. Low complexity transform and quantization - part I: Basic implementation. In Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), 2nd Meeting, Geneva, CH, January 29 – February 1 2002. files: jvtb038.doc, jvtb038.xls, jvtb038r1.doc, jvtb038r2.doc.
[36] A. Hallapuro, M. Karczewicz, and H. Malvar. Low complexity transform and quantization - part II: Extension to full range. In Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), 2nd Meeting, Geneva, CH, January 29 – February 1 2002. files: jvtb039.doc, jvtb039.xls, jvtb039r1.doc, jvtb039r2.doc.
[37] Z. He, Y. K. Kim, and Sanjit K. Mitra. Low-delay rate control for DCT video coding via ρ-domain source modeling. IEEE Trans. Circuits Syst. Video Technol., 11(8):928–940, August 2001.
[38] Z. He and S. K. Mitra. A Unified Rate-Distortion Analysis Framework for Transform Coding. IEEE Trans. Circuits Syst. Video Technol., 11(12):1221–1236, December 2001.
[39] Z. He and S. K. Mitra. Optimum bit allocation and accurate rate control for video coding
via ρ-domain source modeling. IEEE Trans. Circuits Syst. Video Technol., 12(10):840–
848, October 2002.
[40] P. Ishwar, V. M. Prabhakaran, and K. Ramchandran. Towards a Theory for Video Coding
Using Distributed Compression Principles. In Proc. of the International Conference on
Image Processing (ICIP), 2003.
[41] ISO/IEC. Coded representation of picture and audio information-MPEG-2 test model 5.
In ISO/IEC AVC-491, April 1993.
[42] ISO/IEC JTC 1/SC 29/WG 1 (ITU-T SG 8). Information Technology - Coded Represen-
tation Of Picture And Audio Information - Lossy/Lossless Coding Of Bi-Level Images
(JBIG). Final Committee Draft, 1999.
[43] ISO/IEC JTC 1/SC 29/WG 1 (ITU-T SG8). JPEG 2000 Part I Final Committee Draft
Version 1.0. For Review, March 16th, 2000.
[44] ISO/IEC JTC1. Coding of Audio-Visual Objects - Part 2: Visual. ISO/IEC 14 496-2 (MPEG-4 Visual version 1), Apr. 1999; Amendment 1 (version 2), Feb. 2000; Amendment 4 (streaming profile), Jan. 2001.
[45] ITU-T. Video Coding for Low Bitrate Communications, Version 1. ITU-T Recommen-
dation H.263, 1995.
[46] ITU-T. Control Protocol for Multimedia Communication. ITU-T Recommendation
H.245, 1996.
[47] ITU-T and ISO/IEC JTC1. Generic Coding of Moving Pictures and Associated Audio
Information-Part 2: Video. ITU-T Recommendation H.262-ISO/IEC 13 818-2 (MPEG-
2), 1994.
[48] A. Jagmohan, A. Sehgal, and N. Ahuja. Predictive Encoding Using Coset Codes. In
Proceedings of IEEE International Conference on Image Processing, volume 2, pages
29–32, Rochester, New York, USA, September 2002.
[49] N.S. Jayant. Subsampling of a DPCM speech channel to provide two ’self-contained’ half-rate channels. Bell Syst. Tech. J., 60(4):501–509, April 1981.
[50] N.S. Jayant and P. Noll. Digital Coding of Waveforms - Principles and Applications to Speech and Video. Prentice-Hall, 1984.
[51] M. I. Jordan and Y. Weiss. Graphical models: probabilistic inference. In M. A. Arbib,
editor, Handbook of Neural Networks and Brain Theory, 2nd edition. MIT Press, 2002.
[52] G. D. Forney Jr., M. D. Trott, and S.-Y. Chung. Sphere-Bound-Achieving Coset Codes
and Multilevel Coset Codes. IEEE Trans. Inform. Theory, 46(3):820–850, May 2000.
[53] N. Kamaci, Y. Altunbasak, and R. M. Mersereau. Frame bit allocation for the H.264/AVC video coder via Cauchy-density-based rate and distortion models. IEEE Trans. Circuits Syst. Video Technol., 15(8):994–1006, August 2005.
[54] G. Keesman, I. Shah, and R. Klein-Gunnewiek. Bit rate control for MPEG encoders. Signal Processing: Image Communication, 6(6):545–560, 1995.
[55] A. Klinger and C. R. Dyer. Experiments on picture representation using regular decom-
position. CGIP, 5:68–105, 1976.
[56] L. P. Kondi and A. K. Katsaggelos. An operational rate-distortion optimal single-pass
SNR scalable video coder. IEEE Trans. Image Processing, 10(11):1613–1620, November 2001.
[57] E. Y. Lam and J. W. Goodman. A mathematical analysis of the DCT coefficient distri-
butions for images. IEEE Trans. Image Processing, 9(10):1661–1666, October 2000.
[58] G. G. Langdon. An Introduction to Arithmetic Coding. IBM J. Res. Develop., 28(2):135–
149, March 1984.
[59] Z. Li, W. Gao, F. Pan, S. Ma, K. P. Lim, G. Feng, X. Lin, S. Rahardja, H. Lu, and Y. Lu. Adaptive rate control with HRD consideration. In Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), 8th Meeting, Geneva, CH, October 20–26 2003. files: JVT-H014.doc, JVT-H014-FixedQP_r1.xls, JVT-H014-FixedQP.xls.
[60] Z. Li, F. Pan, G. Feng, K. Lim, X. Lin, and S. Rahardja. Improved rate control al-
gorithm. In Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC
JTC1/SC29/WG11 and ITU-T SG16 Q.6), 5th Meeting, Geneva, CH, October 9–17 2002. files: JVT-E069.doc, JVT-E069.zip, JVT-E069_software.zip.
[61] Z. Li, F. Pan, K. P. Lim, G. Feng, X. Lin, and S. Rahardja. Adaptive rate control
for JVT. In Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC
JTC1/SC29/WG11 and ITU-T SG16 Q.6), 6th Meeting, Awaji, Japan, December 5–15,
2002.
[62] J.Y. Liao and J.D. Villasenor. Adaptive intra block update for robust transmission of
H.263. IEEE Trans. Circuits Syst. Video Technol., 10(1):30–35, February 2000.
[63] G. Liebl, T. Stockhammer, M. Wagner, J. Pandel, G. Baese, M. Nguyen, and F. Burkert.
An RTP Payload Format for Erasure-Resilient Transmission of Progressive Multimedia
Streams, October 2004. Internet Draft.
[64] P. List, A. Joch, J. Lainema, G. Bjøntegaard, and M. Karczewicz. Adaptive deblocking
filter. IEEE Trans. Circuits Syst. Video Technol., 13(7):614–619, July 2003.
[65] J. Liu and P. Moulin. Information-Theoretic Analysis of Interscale and Intrascale
Dependencies Between Image Wavelet Coefficients. IEEE Trans. Image Processing,
10(11):1647–1658, November 2001.
[66] M. Luby. LT codes. www.inference.phy.cam.ac.uk/mackay/dfountain/LT.pdf.
[67] S. Ma, Z. Li, and F. Wu. Proposed draft of adaptive rate control. In Joint Video
Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T
SG16 Q.6), 8th Meeting, Geneva, CH, October 20–26 2003. files: JVT-H017.doc, JVT-H017r1.doc, JVT-H017r2.doc, JVT-H017r3.doc.
[68] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge
University Press, 2003.
[69] A. Majumdar, J. Chou, and K. Ramchandran. Robust Distributed Video Compression
based on Multilevel Coset Codes. In Proc. of the Asilomar Conference on Signals,
Systems, and Computers, Nov. 2003.
[70] A. Majumdar, R. Puri, P. Ishwar, and K. Ramchandran. Complexity/performance trade-
offs for robust distributed video coding. In Proc. of the International Conference on Image
Processing (ICIP), 2005.
[71] H. S. Malvar, A. Hallapuro, and M. Karczewicz. Low-complexity transform and quantization in H.264/AVC. IEEE Trans. Circuits Syst. Video Technol., 13(7):598–603, July
2003.
[72] D. Marpe, H. Schwarz, and T. Wiegand. Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard. IEEE Trans. Circuits Syst.
Video Technol., 13(7):620–636, July 2003.
[73] D. Marpe, H. Schwarz, and T. Wiegand. Context-based adaptive binary arithmetic cod-
ing in the H.264/AVC video compression standard. IEEE Trans. Circuits Syst. Video
Technol., 13(7):620–636, July 2003.
[74] Michael Luby. LT Codes. In FOCS, 2002.
[75] S. Milani, L. Celetto, and G.A. Mian. A rate control algorithm for the H.264 encoder. In
Proc. of the Sixth Baiona Workshop on Signal Processing in Communications, Baiona,
Spain, September 8–10, 2003.
[76] S. Milani, L. Celetto, and G.A. Mian. An Accurate Low-Complexity Rate Control Algo-
rithm Based on (ρ,Eq)-Domain. IEEE Trans. Circuits Syst. Video Technol., submitted.
[77] S. Milani and G. A. Mian. A Practical Algorithm for Distributed Source Coding Based
on Continuous-Valued Syndromes. In Proc. of the 14th European Signal Processing Conference (EUSIPCO 2006), Firenze, Italy, September 4–8 2006.
[78] S. Milani and G. A. Mian. An improved context adaptive binary arithmetic coder for
the H.264/AVC standard. In Proc. of the 14th European Signal Processing Conference (EUSIPCO 2006), Firenze, Italy, September 4–8 2006.
[79] S. Milani, G. A. Mian, and L. Celetto. Joint optimization of source-channel video coding
using the H.264 encoder and FEC codes. In Proc. of the 13th European Signal Processing
Conference (EUSIPCO 2005), Antalya, Turkey, September 2005.
[80] S. Milani, G. A. Mian, and L. Celetto. Joint Optimization of Source-Channel Video
Coding Using the H.264 Encoder and FEC Codes. In Proc. of the 13th European Signal
Processing Conference (EUSIPCO 2005), Antalya, Turkey, September 2005.
[81] S. Milani, G.A. Mian, D. Alfonso, and L. Celetto. A (ρ,Eq)-Domain Based Low-Cost Rate-Control Algorithm for the H.264 Video Coder. In Proc. of the Seventh International Symposium on Wireless Personal Multimedia Communications (WPMC2004), pages
137–142, Abano Terme (PD), Italy, September 2004.
[82] S. Milani, J. Wang, and K. Ramchandran. Achieving H.264-like compression efficiency
with distributed video coding. In Proceedings of SPIE Visual Communications and Image Processing Conference, San Jose, California, USA, January 2007. To be published.
[83] Alistair Moffat, Radford M. Neal, and Ian H. Witten. Arithmetic coding revisited. ACM
Trans. Inf. Syst., 16(3):256–294, 1998.
[84] D. Mumford. Empirical statistics and stochastic models for visual signals. In S. Haykin,
J. C. Príncipe, T. J. Sejnowski, and J. McWhirter, editors, New Directions in Statistical
Signal Processing, chapter 1. MIT Press, 2005.
[85] I. Newton. Philosophiae Naturalis Principia Mathematica (mathematical principles of
natural philosophy), July 1687.
[86] M. Niss. History of the Lenz-Ising Model 1920-1950: From Ferromagnetic to Cooper-
ative Phenomena.Archive for History of Exact Sciences, 59(3):267–318, March 2005.
[87] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Joint final commit-
tee draft (JFCD) of joint video specification (ITU-T Rec. H.264 | ISO/IEC 14496-
10 AVC). In Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC
JTC1/SC29/WG11 and ITU-T SG16 Q.6), 4th Meeting, Klagenfurt, Austria, July 2002.
ftp://ftp.imtc-files.org/jvt-experts/2002_07_Klagenfurt/JVT-D157.zip.
[88] M. T. Orchard, Y. Wang, and A. R. Reibman. Redundancy Rate-Distortion Analysis
of Multiple Description Coding Using Pairwise Correlating Transform. In Proc. of the
IEEE International Conference on Image Processing, ICIP 1997, Santa Barbara, CA,
USA, October 1997.
[89] M. T. Orchard, Y. Wang, and A. R. Reibman. Optimal Pairwise Correlating Transform
for Multiple Description Coding. In Proc. of the IEEE International Conference on
Image Processing, ICIP 1998, Chicago, IL, USA, October 1998.
[90] A. Ortega and K. Ramchandran. Rate-Distortion Methods for Image and Video Compression. IEEE Signal Processing Mag., pages 23–50, November 1998.
[91] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer,
and T. Wedi. Video coding with H.264/AVC: Tools, Performance, and Complexity.
IEEE Circuits Syst. Mag., 4(1):7–28, First Quarter 2004.
[92] S. S. Pradhan and K. Ramchandran. Distributed Source Coding Using Syndromes (DIS-
CUS): Design and Construction. In Proc. of the Data Compression Conference (DCC
1999), Snowbird, UT, USA, March 1999.
[93] R. Puri and K. Ramchandran. PRISM: A new robust video coding architecture based on distributed compression principles. In Proc. of the 40th Allerton Conference on Communication, Control and Computing, pages 402–408, Allerton, IL, USA, October 2002.
[94] R. Puri and K. Ramchandran. PRISM: A “reversed” multimedia coding paradigm. In Proc. of IEEE International Conference on Image Processing (ICIP), Barcelona, Spain, September 2003.
[95] R. Puri and K. Ramchandran. PRISM: An uplink-friendly multimedia coding paradigm. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hong Kong, April 2003.
[96] J. Ribas-Corbera and S. Lei. Rate control in DCT video coding for low-delay communications. IEEE Trans. Circuits Syst. Video Technol., 9(1):172–185, February 1999.
[97] I. E. Richardson. H.264/MPEG-4 Part 10 White Paper: Overview of H.264.
http://www.rgu.ac.uk/files/h264_overview.pdf, October 2002.
[98] I. E. Richardson. H.264/MPEG-4 Part 10 White Paper: Variable length Coding.
http://www.rgu.ac.uk/files/h264_vlc.pdf, October 2002.
[99] I. E. Richardson. H.264/MPEG-4 Part 10 White Paper: Context-Based Adaptive Arith-
metic Coding (CABAC). http://www.rgu.ac.uk/files/h264_cabac.pdf, October 2003.
[100] I. E. Richardson. H.264/MPEG-4 Part 10 White Paper: Prediction of Inter Macroblocks
in P-slices. http://www.rgu.ac.uk/files/h264_interpred.pdf, April 2003.
[101] I. E. Richardson. H.264/MPEG-4 Part 10 White Paper: Prediction of Intra Macroblocks.
http://www.rgu.ac.uk/files/h264_intrapred.pdf, April 2003.
[102] I. E. Richardson. H.264/MPEG-4 Part 10 White Paper: Reconstruction filter.
http://www.rgu.ac.uk/files/h264_loopfilter.pdf, April 2003.
[103] I. E. Richardson. H.264/MPEG-4 Part 10 White Paper: Transform and quantization.
http://www.rgu.ac.uk/files/h264_transform.pdf, March 2003.
[104] I. E. Richardson. H.264/MPEG-4 Part 10 White Paper: Frame and picture management.
http://www.rgu.ac.uk/files/avc_picmanagement_draft1.pdf, January 2004.
[105] I. E. G. Richardson. H.264 and MPEG-4 Video Compression. John Wiley and Sons,
September 2003.
[106] J. Rosenberg and H. Schulzrinne. An RTP Payload Format for Generic Forward Error Correction (RFC 2733). In Network Working Group, December 1999.
[107] A. Rosenfeld. Quadtrees and pyramids for pattern recognition and image processing. In
Proc. of the 5th ICIPR, pages 569–572, Miami, FL, USA, 1982.
[108] S. Milani, G.A. Mian, and L. Celetto. A ρ-domain based joint optimization of source-channel video coding. In Proc. of Wireless Reconfigurable Terminals and Platforms (Wirtep 2006), Rome, Italy, April 10–12 2006.
[109] A. Sehgal, A. Jagmohan, and N. Ahuja. Scalable video coding using Wyner-Ziv codes.
In Proc. of the Picture Coding Symposium 2004, San Francisco, CA, USA, December
15–17, 2004.
[110] C.E. Shannon. A Mathematical Theory of Communication. The Bell System Technical
Journal, July 1948.
[111] A. Shokrollahi. Raptor codes, June 2003.
[112] T. Sikora. The MPEG-4 video standard verification model. IEEE Trans. Circuits Syst.
Video Technol., 7(1):19–31, February 1997.
[113] D. Slepian and J. K. Wolf. Noiseless coding of correlated information sources. IEEE
Trans. on Information Theory, 19:471–480, Jul. 1973.
[114] G. J. Sullivan and T. Wiegand. Rate-Distortion Optimization for Video Compression.
IEEE Signal Processing Mag., pages 74–90, November 1998.
[115] A. Aaron, S. Rane, and B. Girod. Systematic Lossy Forward Error Protection for Error-Resilient Digital Video Broadcasting. In Proceedings of SPIE Visual Communications and Image Processing Conference, San Jose, California, USA, January 2004.
[116] K. Tanaka, J. Inoue, and D. M. Titterington. Probabilistic image processing by means
of the Bethe approximation for the Q-Ising model.Journal of Physics A: Mathematical
and General, 36(43):11023–11035, 2003.
[117] D. Taubman. Private communication, 2004.
[118] D. Taubman and M. W. Marcellin. JPEG2000 Image Compression: Fundamentals,
Standards and Practice. Kluwer, Boston, MA, USA, 2002.
[119] V. A. Vaishampayan. Design of Multiple Description Scalar Quantizers. IEEE Trans. Inform. Theory, 39(3):821–834, May 1993.
[120] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational methods. In Michael A. Arbib, editor, New Directions in Statistical Signal Processing. MIT Press, 2003, 2005.
[121] M. J. Wainwright and M. I. Jordan. A variational principle for graphical models. In
S. Haykin, J. Principe, T. Sejnowski, and J. McWhirter, editors, New Directions in
Statistical Signal Processing, chapter 11. MIT Press, 2005.
[122] G.K. Wallace. The JPEG Still Picture Compression Standard. Communications of the
ACM, 34(4):30–44, April 1991.
[123] Y. Wang, M. T. Orchard, V. A. Vaishampayan, and A. R. Reibman. Multiple Description Coding using Pairwise Correlating Transforms. IEEE Trans. Image Processing, 10(3):351–366, March 2001.
[124] T. Wedi and H. G. Musmann. Motion- and aliasing-compensated prediction for hybrid
video coding. IEEE Trans. Circuits Syst. Video Technol., 13(7):577–586, July 2003.
[125] T. Wiegand and B. Girod. Parameter Selection in Lagrangian Hybrid Video Coder Con-
trol. In Proc. of International Conference on Image Processing, ICIP 2001, Thessa-
loniki, Greece, October 2001.
[126] T. Wiegand, H. Schwarz, A. Joch, and F. Kossentini. Rate-constrained coder control
and comparison of video coding standards. IEEE Trans. Circuits Syst. Video Technol.,
13(7):688–695, July 2003.
[127] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol., 13(7):560–576, July
2003.
[128] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol., 13(7):560–576, July
2003.
[129] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Commun. ACM, 30(6):520–540, 1987.
[130] A. D. Wyner and J. Ziv. The rate distortion function for source coding with side information at the decoder. IEEE Trans. on Information Theory, 22:1–10, Jan. 1976.
[131] A. Aaron, S. Rane, and B. Girod. Wyner-Ziv Video Coding with Hash-Based Motion Compensation at the Receiver. In Proceedings of IEEE International Conference on Image Processing, volume 5, pages 3097–3100, Singapore, October 2004.
[132] Q. Xu and Z. Xiong. Layered Wyner-Ziv video coding. In Proc. of VCIP’04, Jan. 2004.
[133] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Constructing free energy approximations and
generalized belief propagation algorithms. IEEE Trans. Inform. Theory, 51(7):2282–2312,
July 2005.
[134] N. Zandonà, S. Milani, and A. De Giusti. Motion-Compensated Multiple Description
Video Coding for the H.264/AVC Standard. In Proc. of IADAT International Conference
on Multimedia, Image Processing and Computer Vision, pages 290–294, Madrid, Spain,
March 2005.
[135] X. Zhu, A. Aaron, and B. Girod. Distributed Compression for Large Camera Arrays. In
Proc. of the IEEE Workshop on Statistical Signal Processing, pages 30–33, St. Louis,
Missouri, USA, September 2003.