
UNIVERSITÀ DEGLI STUDI DI PADOVA

Sede Amministrativa: Università degli Studi di Padova

Dipartimento di Ingegneria dell’Informazione

DOTTORATO DI RICERCA IN:

INGEGNERIA ELETTRONICA E DELLE TELECOMUNICAZIONI

CICLO XIX

Source and Joint Source-Channel

Coding for Video Transmission

over Lossy Networks

Coordinatore: Ch.mo Prof. Silvano Pupolin

Supervisore: Ch.mo Prof. Gian Antonio Mian

Dottorando: Simone Milani

31 Dicembre 2006



UNIVERSITÀ DI PADOVA FACOLTÀ DI INGEGNERIA

Source and Joint Source-Channel

Coding for Video Transmission

over Lossy Networks

Ph.D. THESIS

Author: Simone Milani

Coordinator: Ch.mo Prof. Silvano Pupolin

Supervisor: Ch.mo Prof. Gian Antonio Mian

2006

CORSO DI DOTTORATO IN INGEGNERIA ELETTRONICA E DELLE TELECOMUNICAZIONI – XIX CICLO


Abstract

In recent years, the IT world has shown growing interest in the transmission of video sequences over heterogeneous networks for a wide variety of applications. This interest has driven the development of ever more efficient video coding standards characterized by increasing compression efficiency. However, the recent emergence of media-rich video applications over wireless channels has widened the set of requirements that a video coding scheme must satisfy. The current tendency is to provide multimedia services to each terminal without constraining its mobility or autonomy, while ensuring a certain Quality of Service (QoS). Among the most important requirements are low power consumption and low complexity at the mobile/sensor video encoding unit, high compression efficiency due to the limited available bandwidth, and robustness to the packet/frame drops caused by wireless channel impairments. Many solutions have recently been proposed to address this problem, which is now the main obstacle to providing digital video services on mobile terminals.

The research presented in this thesis concerns the analysis and design of coding techniques that allow video architectures to meet the three requirements mentioned above. Present-day video coding architectures are a synthesis of coding tools defined over the last 50 years. High compression gains can be obtained whenever all these coding units are appropriately orchestrated either to maximize the visual quality of the reconstructed sequence at the decoder for a given bit rate or, conversely, to minimize the size of the coded bit stream for a given visual quality. The H.264/AVC coder proves quite successful at this task, since its coding performance outperforms all previous video coding standards. It has therefore been chosen as the starting point for the investigation presented in this work.

The optimization of the coding gain is the first focus of the investigation. High compression ratios can be achieved through the adoption of efficient entropy coding schemes and the optimization of internal coding parameters. This thesis investigates an enhanced arithmetic coder that improves the compression gain of the original scheme defined by H.264/AVC by using a different probability estimate. At the macroblock level, coefficient statistics can be estimated more accurately, enhancing the performance of the binary arithmetic coder.
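To see why the probability estimate matters, consider a toy comparison (a count-based Krichevsky-Trofimov estimate, not the estimator proposed in this thesis) of the ideal code length achieved by a static model and by an adaptive one on a skewed stream of binary decisions, such as "this coefficient is non-zero" flags:

```python
import math

def ideal_code_length(bits, predictor):
    """Sum of -log2 p(bit) under a sequential probability predictor."""
    total, state = 0.0, predictor()
    for b in bits:
        p1 = state.predict()            # estimated P(bit = 1)
        p = p1 if b == 1 else 1.0 - p1
        total += -math.log2(p)
        state.update(b)
    return total

class Fixed:
    """Static model: always assumes P(1) = 0.5."""
    def predict(self): return 0.5
    def update(self, b): pass

class KT:
    """Krichevsky-Trofimov count-based estimate: (n1 + 0.5) / (n + 1)."""
    def __init__(self): self.n0 = self.n1 = 0
    def predict(self): return (self.n1 + 0.5) / (self.n0 + self.n1 + 1.0)
    def update(self, b):
        if b == 1: self.n1 += 1
        else: self.n0 += 1

# A skewed binary stream: mostly zeros, as for high-frequency coefficients.
bits = [1 if i % 8 == 0 else 0 for i in range(400)]
fixed_bits = ideal_code_length(bits, Fixed)
kt_bits = ideal_code_length(bits, KT)
print(f"fixed: {fixed_bits:.1f} bits, adaptive: {kt_bits:.1f} bits")
```

The static model spends exactly one bit per symbol, while the adaptive estimate converges to the true symbol statistics; any arithmetic coder driven by the better estimate approaches the lower ideal code length.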

An efficient statistical model also proves effective at the frame level. We then present a rate model that provides a good estimate of the coefficient distribution. This model has been used in a rate control algorithm that delivers higher visual quality and tighter control of the coded bit rate.
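A minimal sketch of this kind of model is the well-known linear ρ-domain approximation R ≈ θ·(1 - ρ), where ρ is the fraction of coefficients quantized to zero (the model developed in this thesis refines this idea by also using the energy Eq of the quantized signal; the data and θ below are purely illustrative):

```python
import random

def rho(coeffs, q):
    """Fraction of coefficients quantized to zero with dead-zone step q."""
    return sum(1 for c in coeffs if abs(c) < q) / len(coeffs)

def estimated_rate(coeffs, q, theta):
    """rho-domain linear model: R(q) ~= theta * (1 - rho(q)) bits/coefficient."""
    return theta * (1.0 - rho(coeffs, q))

def pick_step(coeffs, target_bpc, theta, steps):
    """Choose the smallest quantization step whose predicted rate fits the budget."""
    for q in sorted(steps):
        if estimated_rate(coeffs, q, theta) <= target_bpc:
            return q
    return max(steps)

random.seed(0)
# Laplacian-like residual coefficients (illustrative, not real DCT data).
coeffs = [random.expovariate(0.5) * random.choice([-1, 1]) for _ in range(5000)]
q = pick_step(coeffs, target_bpc=1.0, theta=4.0, steps=range(1, 41))
print(q, estimated_rate(coeffs, q, theta=4.0))
```

Since ρ grows monotonically with the quantization step, inverting the model to hit a target bit budget reduces to a simple monotonic search, which is what makes ρ-domain rate control attractive at the frame level.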

The thesis then focuses on the reliable transmission of video content to the end user. Many different approaches have been studied in the literature, and we concentrate on two solutions that have recently come into vogue. The first is based on the inclusion of redundant information in the packet stream. Our research focuses on matrix-based cross-packet coding techniques, for which we introduce optimization techniques that both control the bit rate and maximize the performance of the FEC channel coder. The second is a Distributed Source Coding (DSC) architecture, for which we concentrate on obtaining a compression gain that makes its performance comparable with that of its non-robust hybrid counterpart.
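As a minimal illustration of the first family of techniques, a single XOR parity packet protecting a group of equal-length source packets (in the spirit of RFC 2733; the matrix schemes studied in this thesis extend this to two dimensions with optimized row and column sizes) can recover any one lost packet:

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def build_parity(packets):
    """One RFC 2733-style parity packet: byte-wise XOR of the source packets
    (assumed padded to equal length beforehand)."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return parity

def recover(received, parity):
    """Recover a single lost packet (the None entry) by XORing the parity
    packet with all the packets that did arrive."""
    missing = parity
    for p in received:
        if p is not None:
            missing = xor_bytes(missing, p)
    return missing

packets = [bytes([i] * 8) for i in range(1, 5)]  # four equal-length payloads
parity = build_parity(packets)
lost = list(packets)
lost[2] = None                                   # packet 3 dropped by the channel
print(recover(lost, parity) == packets[2])       # True
```

A matrix arrangement generalizes this: parity computed along rows and columns interleaves the protection, so bursts of consecutive losses can still be repaired at the cost of extra redundancy.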

The analysis and implementation of all these techniques were carried out with the required computational complexity in mind, preferring techniques that involve a limited number of operations in order to meet all three demands presented above.


Sommario

Over the last decade, interest has grown in the transmission of multimedia content for a variety of applications over heterogeneous networks. This has led to the development of video coding standards characterized by a steadily increasing ability to reduce signal redundancy. However, the introduction of video communication applications over wireless networks has enlarged the set of requirements that coding schemes must satisfy. The goal is to offer the user a wide range of multimedia services without limiting mobility, while guaranteeing a certain quality (Quality of Service, QoS). Three of the fundamental requirements that modern coding systems must satisfy are: a high coding gain, the ability to create a bit stream that is robust to errors or losses, and the possibility of implementing these techniques on devices with limited computational resources and battery life. Different solutions to these problems have been proposed in the literature, and solving them is a fundamental step towards the diffusion of video services on mobile terminals.

The work presented in this thesis concerns the analysis and design of coding techniques that satisfy these three requirements. Today's architectures synthesize numerous coding techniques studied over the last 50 years. Promising coding gains can be obtained when the different coding units are optimized to maximize the perceptual quality of the reconstructed sequence at a given bit rate or, conversely, to minimize the bit rate at a given visual quality. In this respect, the H.264/AVC coder achieves excellent performance, considerably improving on the compression ratios of previous coding standards. It has therefore been taken as the starting point of the research presented here.

The first problem addressed is coding efficiency. The coding gain can be improved through efficient entropy coding schemes and algorithms that optimize the coder's internal parameters. This thesis first analyzes an improvement of the arithmetic coder defined in the H.264/AVC standard. The probability estimator for the binary symbols has been modified, adopting a more accurate probability model in order to improve the performance of the arithmetic coder. At the same time, a second model has been used at the frame level to model the bit rate produced by the coder itself. By analyzing the number of bits produced as a function of the percentage of zeros and the energy of the quantized signal, it is possible to design a rate control algorithm that guarantees higher quality of the reconstructed sequence and more precise control of the produced bit rate.

The thesis then focuses on the need for reliable transmission of the video content. Different approaches have been studied in the literature, and the work presented here focuses on two of them. The first solution is based on the transmission of redundancy packets together with the RTP packets produced by the source coder, in order to make the video stream more robust. An efficient solution is the adoption of a cross-packet FEC code based on inserting the source packets into a matrix structure whose dimensions are optimized and which interleaves the information. The thesis presents techniques for optimizing the matrix dimensions that can be used in an algorithm for joint control of the rate produced by both the source coder and the channel coder.

The second approach studied is based on the principles of Distributed Source Coding (DSC). The research has mainly been devoted to the design of efficient entropy coding techniques, in order to obtain a good coding gain with respect to traditional coders.
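The coset principle behind DSC can be illustrated with a toy 3-bit example (a sketch of the general idea, not of the syndrome coder developed later in this work): the encoder transmits only the index of the coset containing its block, and the decoder selects the coset member closest to its side information.

```python
# Partition the 3-bit words into cosets of the repetition code {000, 111}:
# each coset {c, c ^ 0b111} has its two members at Hamming distance 3, so a
# 2-bit coset index plus side information within distance 1 identifies x.
def hamming(a, b):
    return bin(a ^ b).count("1")

def syndrome(x):
    """2-bit coset index: x and x ^ 0b111 share the same label."""
    return min(x, x ^ 0b111)  # illustrative labeling of the 4 cosets

def decode(s, side_info):
    """Pick the coset member closest to the decoder's side information."""
    return min((s, s ^ 0b111), key=lambda c: hamming(c, side_info))

x = 0b110          # source block at the encoder
y = 0b100          # correlated side information at the decoder (1 bit flipped)
s = syndrome(x)    # only the 2-bit coset index is transmitted
print(decode(s, y) == x)   # True
```

Two bits are sent instead of three, yet decoding is exact whenever the side information differs from the source in at most one bit position; this is the compression mechanism that DSC-based video coders exploit between a block and its predictor.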

In the analysis and design of these techniques, the computational complexity required by each type of application has been taken into account, choosing those solutions that require a limited amount of computation.


"The single biggest problem in communication
is the illusion that it has taken place."

George Bernard Shaw

"ASBOKQTJEL"

Postcard from J. E. Littlewood to A. S. Besicovitch
announcing A. S. B.'s election to fellowship at Trinity.


Acknowledgments

This thesis is the result of three years of work during which I have been accompanied and supported by many people.

The first person to be named is my supervisor and master Gian Antonio Mian. It is very difficult to express how much I have learned from him and how beneficial he was to my work: he always gave me precious advice and was always willing to discuss my research.

Special thanks go to my colleagues of the Digital Signal and Image Processing Laboratory of the University of Padova, who made my work environment stimulating and friendly during the last three years. Among them, Prof. Giancarlo Calvagno must be thanked gratefully for the support he gave me in the last period of my Ph.D. course. I also thank the past and current Ph.D. students (in random order Andrea, Daniele, Ottavio, Lorenzo, Stefano, Mino, Matteo), who have shared the life of the laboratory with me and have been precious mates for discussion and analysis. I also had the pleasure of supervising and working with several students who did their graduation work in our projects (Andrea, Nicola, Joe, Raffaele, Stefano, Simone) and contributed to my investigation.

I also include the other Ph.D. students of the via Gradenigo department: Matteo, Massimo, Giamba, Vale, Federico, Filippo, Antonio, Daniele, Tommaso, Pietro, Nicola, Anna, Elena, Elena A., and the others that I am sure I forgot.

I must remember all the STMicroelectronics people who followed and supported my research work (in random order Luca Celetto, Daniele Bagni, Fabrizio Rovati, Andrea Vitali, Daniele Alfonso). Among them, I must also mention Gianluca Gennari, who was a precious interlocutor.

Special thanks go to Prof. Kannan Ramchandran, who gave me the opportunity to study and carry on my research on Distributed Source Coding at the University of California, Berkeley. There I had the opportunity to work and study in a stimulating environment that fostered both my professional expertise and my personal growth. I also want to thank the colleagues of the BASICS lab and Wireless Foundation Center (Vinod, Ben, Dan, Chuohao, Alexandros, Paolo, Animesh) for the always nice talks I had with them. I also want to thank June for the discussions and the arguments we had: although it was not always easy to understand each other, we were always able to work it out.

I also want to thank all the people I lived with in the International House in Berkeley, CA, USA, during the period August 2005 to June 2006. Their friendship was both supportive in moments of sorrow and cheerful in moments of joy and fun. Special thanks go to the "Italian Community" (in random order David, Devis, Lorenzo, Davide, Luca, Laura, Sara, Bianca, Alberto), who were my companions in many adventures, and to Helena, Cristine, Pedro, Michelle, Shobi, Tricia, Josephine, Melike, Alessandro, Pietro, Fatma, with whom I spent wonderful moments. I must also remember Melike, Sergej, Kate, Scott, Mickey, Sanjay, Albert, Arlene, Kim, Shani, and all the 6th floor people. I also want to thank Victoria, George, Edgar, Elisa, Karin, and all the Capoeira Narahari group, together with Basma, Yui, and the other guys from the dance class. But of course there are many other people in Berkeley and all around the world (since we were a multiethnic community) who made my stay in the Bay Area a wonderful experience. There I really appreciated the richness that comes from meeting people of different cultures.

I must remember the people I met around the world while carrying out this work. Among them, Najat must be named, since we kept on talking a lot after EUSIPCO 2004, sharing professional knowledge and personal interests.

I cannot forget the support received during these years from the many friends I have outside the University. Above all I must thank my lifelong friend Luca, who has supported me throughout all these years. I must also thank Alberto, Betta, Enrico, Chiara, Alessia, Luisa, Marta, Nicole, Stefano. I also thank Antonio for our nice discussions about working inside and outside the university. I must remember all the friends I have in Camposampiero (PD), Italy, who were able to make my life happier beyond work and study.

Finally, I owe a great deal of thanks to my parents, who taught me the discipline and perseverance necessary to achieve any important result, and to my brother, who waited patiently whenever I was using the internet connection to carry on my research.


Contents

1 Introduction
1.1 The convergence of multimedia and mobile communications
1.2 Features of next generation coding schemes
1.2.1 Compression gain
1.2.2 Computational complexity
1.2.3 Robustness to data corruption and losses
1.3 Main purpose and outline of the thesis

2 Video Source Coding and the H.264/AVC video coding standard
2.1 Introduction
2.2 A holistic overview of the building blocks
2.2.1 Spatial Prediction
2.2.2 Motion Compensation
2.2.3 Transformation and quantization
2.2.4 Entropy coding
2.2.5 Deblocking Filter
2.3 Summary

3 Probability-Propagation Based Arithmetic Coding
3.1 Introduction
3.2 The Context Adaptive Binary Arithmetic Coder (CABAC)
3.2.1 Binarization and context modelling for the absolute values of non-zero coefficients
3.3 Modeling the contexts using a graph
3.4 A Sum-Product based arithmetic coder
3.4.1 Probability modelling through DAGs
3.4.2 Estimation of the bit probability
3.4.3 Context initialization
3.4.4 Statistics update
3.4.5 Reduction of the number of contexts
3.5 Experimental results
3.6 Summary

4 Rate control algorithms for H.264
4.1 Introduction
4.2 Rate distortion modeling based on "zeros"
4.3 Parametric models for H.264 coefficients estimated through activity
4.3.1 Storing the coefficients histograms
4.3.2 Approximating the coefficients distribution via a parametric model
4.4 Signal analysis in the (ρ, Eq)-domain
4.5 A (ρ, Eq)-based rate control algorithm
4.5.1 Bit rate control at GOP level
4.5.2 Bit rate control at frame level
4.5.3 Bit rate control at macroblock level
4.6 Experimental results
4.7 Summary

5 Joint Source-Channel Video Coding Using H.264/AVC and FEC Codes
5.1 Introduction
5.2 On dealing with channel errors and losses in video transmission
5.2.1 Error concealment at the decoder
5.2.2 Error concealment at the encoder
5.3 Channel coding techniques based on FEC codes
5.4 Adapting the matrix size to the input data
5.4.1 Adapting matrix size according to the packet lengths
5.4.2 Adapting matrix size according to the video content
5.5 Joint source-channel rate control
5.6 Experimental results
5.6.1 Results with a fixed matrix
5.6.2 Results with an adaptive matrix
5.6.3 Results with a joint source-channel rate control algorithm
5.7 Summary

6 Achieving H.264-like compression efficiency with Distributed Video Coding
6.1 Introduction
6.2 Distributed Video Coding
6.3 A simple example of coding with side information
6.4 A quick glance at the original PRISM architecture
6.5 Structure of the implemented coder
6.6 The generation of syndromes
6.7 Entropy coding of syndromes
6.7.1 Entropy coding of syndromes
6.7.2 Experimental results
6.7.3 Evaluation of compression gain with no quality equalization
6.7.4 Evaluation of compression gain with Intra refresh
6.7.5 Evaluation of compression gain with rate control
6.8 Summary

7 Conclusions
7.1 Summary
7.2 Future Research

A Relation between Eq and ρ
A.1 Derivation of probability distribution for syndromes

Bibliography


List of Figures

2.1 A block-based scheme of the H.264/AVC coder.
2.2 Relation between Video Coding Layer (VCL), Network Adaptation Layer (NAL), and transmission networks.
2.3 The 4×4 Intra predictors.
2.4 Motion Vector computation and its prediction.
2.5 Macroblock partitioning for Motion Compensation in the H.264/AVC standard.
2.6 Different coding and display order for GOPs.
2.7 Comparison between CAVLC, CABAC, and UVLC (a fixed VLC code defined in the H.26L drafts).

3.1 A simple example of arithmetic coding.
3.2 Scheme of the CABAC coding engine.
3.3 Structure of the Finite State Machine related to the CABAC coder.
3.4 Scheme of contexts for the absolute values of coefficients.
3.5 Directed Acyclic Graph that models the statistical dependencies between the coefficients in a transform block.
3.6 Dependencies between the coefficients in a macroblock.
3.7 Distinction between bit planes coded using the DAG probability model and bit planes coded using the traditional CABAC scheme.
3.8 Structure of the modified Finite State Machine in the new arithmetic coder.
3.9 Coding results for different QCIF sequences at 30 frame/s.
3.10 Results for different QCIF sequences at 30 frame/s.
3.11 Results for different CIF sequences at 30 frame/s.

4.1 Distortion vs. Rate for coded Intra, Inter and B frames.
4.2 Plots of bit rate vs. ρ for the coded sequence foreman.
4.3 Histogram of coefficient frequencies from the coded sequence carphone.
4.4 Eq vs. ρ for the sequence carphone.
4.5 Bits/Frame and PSNR/Frame plot of 240 QCIF frames for the sequence salesman.
4.6 Distortion-Rate plot of 120 CIF frames for the sequence salesman.
4.7 Distortion-Rate plot for different QCIF sequences at 30 frame/s.
4.8 PSNR and Rate plots of 180 QCIF frames for the sequence foreman.

5.1 A pictorial example of Multiple Description Coding.
5.2 General scheme for the coding matrix in the RFC 2733 approach with and without byte padding.
5.3 Experimental results for different sequences showing the relative quality loss δE(PSNR)/E(PSNR) and the parameter N3dB vs. the activity act.
5.4 Experimental results for different sequences showing the relative quality loss δE(PSNR)/E(PSNR) and the parameter N3dB vs. the percentage ρ.
5.5 Results for different sequences with loss probability 0.03.
5.6 Results of FEC-NoPadding for foreman QCIF with different FEC redundancy and loss probability 0.03.
5.7 Results of FEC-NoPadding with different rows and columns (loss probability 0.06).
5.8 Comparison between adaptive and fixed methods with loss probability 0.06.
5.9 Comparison between adaptive and fixed methods with loss probability 0.06.

6.1 Two different coding scenarios for the example in Section 6.3.
6.2 Example of Wyner-Ziv decoding with sources in {0, 1}^3.
6.3 A pictorial representation of innovation and correlated info for blocks.
6.4 CRC coding mask.
6.5 Block diagram for the presented DSC-based coder.
6.6 Partitioning of the quantized values (the lattice of integers Z) into three sublattices.
6.7 Comparison between the actual pmf of syndromes and the presented model.
6.8 Difference between the entropies of DFD and DSC syndromes.
6.9 Comparison between the probabilities of non-null DSC syndromes and non-null H.264 coefficients.
6.10 Coding performance of the original CABAC on H.264 coefficients and DSC syndromes.
6.11 Example of quad-tree coding using CBP variables.
6.12 PSNR vs. Bit rate for the first frame in the GOP.
6.13 PSNR vs. Bit rate for a whole GOP.
6.14 PSNR vs. Bit rate with Intra refresh enabled.
6.15 PSNR vs. Bit rate with rate control enabled.


List of Tables

2.1 Timeline and coding applications for different video coding standards.

3.1 Sequence of states for the considered example.

4.1 Configuration parameters for the H.264 encoder.
4.2 Results for the sequence salesman.
4.3 Comparison between the (ρ, Eq)-based algorithm and the JM7.6 algorithm.
4.4 PSNR/Rate for VBR tests on different sequences.

5.1 Comparison between ρ-adaptive and fixed rate control methods for the sequence news.
5.2 Comparison between ρ-adaptive and fixed rate control methods for the sequence foreman.

6.1 Comparison of the average bit rate needed to code the position of zeros and ones in the H.264 coder and CBP blocks for the DSC coder.


Chapter 1

Introduction

1.1 The convergence of multimedia and mobile communications

Over the last decade, the IT world has witnessed the joint development and spread of both multimedia technologies and wireless transmission systems.

The commercial success of digital audio/video applications, together with the attractive business opportunities created by the gradual penetration of multimedia communications, has pushed both industry and academia towards the design of digital architectures with advanced multimedia functionality. This tendency has led to the creation of ever more efficient audio/video coding systems, while the availability of faster processors and increased amounts of memory has made possible the adoption of complex coding algorithms with increased compression capability.

At the same time, we have also witnessed an unprecedented spread of wireless communications. The need to connect distant users at any place and any time has promoted the investigation of more efficient modulation schemes and transmission protocols, allowing consumers to exchange an enhanced and varied set of data.

In recent years, these two research fields have started to converge, since the need to provide ubiquitous access to multimedia services over a heterogeneous interconnection of networks has posed new challenges to the existing coding schemes. The goals of second-generation cellular networks, i.e. supporting integrated voice and data, were extended in third-generation cellular networks to provide the user with a wider set of multimedia services, spanning from video communication to the fruition of video-on-demand contents. As a consequence, wireless terminals with advanced multimedia functionalities have been progressively gaining importance, as the production of multimedia contents for both business and personal use has become an essential element in everyday communications.

This convergence of mobile and multimedia communication has raised new problems related to the heterogeneity of the scenarios and the time-varying nature of the channels involved. Since source coding technology has achieved a sufficient degree of maturity in these application contexts, next-generation architectures have to deal with the interoperable exchange of multimedia information and its efficient transmission over networks that may be affected by information losses and data corruption.



A flexible infrastructure for the exchange of multimedia contents is required, since distinct users in a heterogeneous scenario are willing to communicate and interact with different media, such as audio, video, and text. Most users have common concerns (efficient management of contents, protection of contents, and privacy issues), and new solutions are required to manage the access and delivery of these different content types in an integrated and harmonized way, entirely transparent to the different users. The challenge is made even more difficult by the fact that communicating terminals may have different transmission capabilities, and the transmission system must be able to adapt the sent data to the possibilities of the network. These needs have led to the definition of the emerging MPEG-21 Multimedia Framework standard, which “aims to enable transparent and augmented use of multimedia resources across a wide range of networks and devices” [10].

On the other hand, the capability of providing reliable video transmission is the most relevant issue in the spread and diffusion of mobile multimedia services. Radio channels present non-stationary characteristics that result in losses or alterations of the received data. The missing or corrupted information leads to a mismatch between receiver and transmitter that may reduce the quality perceived by the end user and, in some cases, preclude the correct decoding of the following information. The receiver can mitigate the drawbacks of data losses by estimating the lost information up to a given uncertainty. However, when the amount of lost information is substantial or the non-stationary characteristics of the signal do not allow good recovery performance, it is necessary to adopt a more efficient coding strategy or a more robust coding scheme at the encoder. Moreover, the intrinsic mobility of wireless systems leads to a frequently-changing network topology that makes resource allocation difficult. For example, a variable number of users sharing a common resource, such as bandwidth, changes the amount of resources assigned to each one, and this variability must be taken into account by the coding schemes.

In the following section, the different issues that characterize the choices and the design of new coding schemes are identified.

1.2 Features of next generation coding schemes

As the emerging scenarios in the multimedia world demand interoperability and reliability of communications, Information Technology professionals have looked for new solutions that could efficiently cope with the new requirements. In this investigation, their concern was mainly focused on a few features of video coding architectures that have a direct influence on the resulting coding performance.

1.2.1 Compression gain

In order to obtain an efficient delivery of multimedia information, devices should be endowed with advanced compression algorithms that allow the receiving terminal to reconstruct the coded information with the highest possible fidelity.

Usually the size of coded data is constrained by different external factors. One of them can be the available storage space, in the case of sequences that are filed into multimedia archives or


stored on physical media such as DVDs. Another constraining factor is the available bandwidth in real-time video communications, which limits the amount of data that can be sent per time unit. In this case the arrival time of the sent data affects the final service quality experienced by the end user, and therefore the adopted coding algorithm must make sure that the produced bit rate does not overwhelm the transmission capacity.

Note that a significant compression ratio can only be obtained by adopting “lossy” coding, i.e. a coding scheme that represents the coded visual information by tolerating a distortion of the original video sequence while greatly reducing the amount of sent information. The video coding world has been facing this problem by designing new standards characterized by continuously-improving compression efficiency. The performance of the first MPEG video coders has now been surpassed by the high coding gains obtained by the latest standards such as H.264/AVC and the upcoming MPEG-21/SVC [10], which have made video communication possible on third-generation cellular networks.

The main goal of the standardization process is to define a “common language” that enables different terminals from different vendors to exchange video data. Although the standardization process has strictly specified the syntax of the coded video stream, there are no limitations on how the stream can be generated, i.e. on the encoding process. Each designer is free to implement the encoder according to his/her specific targets and the physical device: the aim of the coding strategy is to ensure the highest visual quality given the imposed constraints on the size of the coded video stream and the hardware resources available. This has led to the creation of a wide range of different control algorithms that tune the coding parameters according to the number of bits to transmit and to the relevance of the coded information with respect to the resulting visual quality.

1.2.2 Computational complexity

A second issue is the computational complexity, which is still an important discriminating element because of its implications on the autonomy of mobile devices. Despite the availability of more and more powerful processors, the limited power supply that characterizes most mobile video devices prevents the adoption of architectures that require a high amount of computation. Therefore, the video coding literature has presented several low-complexity solutions that permit a light implementation of complex video coding architectures on battery-supplied systems. In addition, in recent years new coding paradigms have emerged, inspired by an “uplink” broadcast model (where a multitude of light encoders sends different coded video streams to a complex decoder) [93, 95, 94, 26] in place of the traditional “downlink” broadcast model (where there is a complex encoder and a multitude of light decoders). In this way the problem of computational complexity on the mobile terminal can be efficiently addressed by adopting a hybrid system where the coding adopted for the uplink transmission differs from the coding adopted for the downlink transmission, so that the most computationally-expensive tasks can be performed by the network, whose available power supply and bearable computational load have higher bounds.


1.2.3 Robustness to data corruption and losses

A third element worthy of consideration in the design of a video coder is the robustness of the scheme to errors and losses, i.e. its capability of avoiding error propagation when the coded bit stream has been corrupted. All the current video coding paradigms, which are part of popular standards like MPEG [47, 44] and H.26x [45, 87], fail to address this requirement, as most of their compression gain is achieved through inter-frame motion prediction. Since each frame is coded taking one of the previous ones as a reference, if part or all of the reference is missing because of channel errors, the decoding process is stopped until the state of the decoder is refreshed, i.e. non-predicted data are sent with a great waste of bandwidth [62]. A possible alternative is to estimate the lost information and replace the missing data with its approximation in the decoding process [22], introducing an additional noise which results in a quality degradation of the image. Other coding schemes protect the coded stream by applying channel codes [33, 34, 106, 79] or code multiple correlated streams that allow the decoder to estimate the lost stream from the ones that were correctly received [32, 13, 134, 108]. Most of these techniques prove to be greatly effective whenever the protection level is matched to the channel conditions and the characteristics of the coded signal. As a consequence, the recent literature has been characterized by different proposals of joint source-channel coding algorithms that try to find the protection strategy that allows the receiver to recover most of the information lost across the transmission channel.

In this context, R. Puri and K. Ramchandran have recently faced the interesting question of whether it is possible to design an efficient video coding paradigm that simultaneously attains the compression efficiency of motion-compensated prediction and robustness with a low encoding complexity. Their investigation has led to the design of PRISM [93], a new video coding scheme founded on the principles of distributed source coding [113, 130] that allows the prediction of the current block of data from a set of different possible references. In case some of them are missing, correct decoding is still possible from those references that were correctly decoded. This coding solution (called “syndrome encoding”) classifies the current input data and codes it according to the classification. The receiver is able to decode the information using an arbitrary reference that belongs to the same class. In this way, the motion search is transferred to the decoder, and it is possible to meet both the requirement of a light encoder and the need for a robust bit stream.

The following section will describe how these and other topics are dealt with in this thesis.

1.3 Main purpose and outline of the thesis

The focus of this thesis is the design of efficient coding algorithms for video transmission over wireless channels. These strategies aim both at increasing the coding gain and at reducing the quality degradation in case of information losses. In all these techniques special attention was paid to the computational complexity, which was kept as low as possible in order to make these solutions applicable to mobile devices with a limited power supply.

Chapter 2 presents a brief overview of the H.264/AVC standard, which is the starting point of our investigation. The purpose of the chapter is not to provide a detailed description of the


standard H.264/AVC, but to define the conventions and the notation that will be used throughout

the whole thesis.

Chapter 3 describes a novel arithmetic coding engine based on statistical graphical models. It is possible to improve the performance of the H.264/AVC arithmetic coder by modifying the context structure of its arithmetic coding engine, the Context-Adaptive Binary Arithmetic Coder (CABAC). In this case, probabilities are modelled through a set of Directed Acyclic Graphs (DAGs), which allows a more accurate estimate of the probabilities of binary digits. Experimental results show that it is possible to reduce the size of the coded bit stream by approximately 10%.
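To see why richer context models pay off, consider a small sketch (illustrative only, not the DAG machinery of Chapter 3): conditioning each binary digit on a previously coded "parent" bit, the simplest directed-graph dependency, lowers the ideal code length whenever the bits are correlated.

```python
# Illustrative sketch: the ideal (entropy) code length of a bit stream
# under a context-free model versus a model that conditions each bit on
# one parent bit. With correlated bits the conditional model assigns
# sharper probabilities and therefore a shorter ideal code length.

from math import log2

def entropy_bits(pairs, conditional):
    """Ideal code length in bits for a stream of (parent_bit, bit) pairs."""
    total = 0.0
    for parent, bit in pairs:
        # Empirical P(bit = 1), optionally restricted to the same parent context.
        ones = sum(b for p, b in pairs if (p == parent or not conditional))
        count = sum(1 for p, b in pairs if (p == parent or not conditional))
        p1 = ones / count
        total += -log2(p1 if bit else 1 - p1)
    return total

# Bits strongly correlated with their parent bit:
stream = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]
assert entropy_bits(stream, conditional=True) < entropy_bits(stream, conditional=False)
```

Here the context-free model sees an uninformative 50/50 bit and needs 8 ideal bits, while the conditional model exploits the parent dependency and needs about 6.5.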

Chapter 4 presents an efficient rate control algorithm that obtains a high objective quality in the reconstructed sequence while keeping the coded bit stream within the available bandwidth. The algorithm is based on modelling the number of bits produced by the H.264/AVC coder in the (ρ, Eq) domain, where ρ is the percentage of null quantized DCT coefficients and Eq is the energy of the quantized signal. The resulting strategy proves to be very effective with respect to other solutions, since it allows an accurate estimate of the produced bit rate. Moreover, this result is improved by the adoption of an effective skipping technique that avoids coding frames whenever their absence does not significantly affect the smoothness of the reconstructed sequence and the transmission buffer is saturated by the previous frames.
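The role of ρ can be sketched in miniature. The snippet below is an illustrative assumption, not the thesis algorithm: it uses only the classic linear ρ-domain relation R ≈ θ(1 − ρ), with θ calibrated from previously coded frames, to pick a quantization step whose predicted rate fits a bit budget.

```python
# Minimal sketch of a rho-domain rate model: the bit count of a coded
# frame is nearly linear in (1 - rho), the fraction of non-zero
# quantized DCT coefficients. theta (bits per unit of non-zero
# fraction) is assumed to be estimated from previously coded frames.

def rho(coeffs, qstep):
    """Fraction of coefficients quantized to zero at quantization step qstep."""
    zeros = sum(1 for c in coeffs if abs(c) < qstep)
    return zeros / len(coeffs)

def estimate_bits(coeffs, qstep, theta):
    """Predicted frame size in bits: R ~= theta * (1 - rho)."""
    return theta * (1.0 - rho(coeffs, qstep))

# Example: pick the smallest qstep whose predicted rate fits the budget.
coeffs = [12, -3, 0.4, 7, -0.2, 1.5, 0.1, -9, 2.2, 0.05]   # toy DCT coefficients
theta = 6000.0          # model slope calibrated on past frames (assumed)
budget = 2500.0         # available bits for this frame
qstep = next(q for q in (0.5, 1, 2, 4, 8, 16)
             if estimate_bits(coeffs, q, theta) <= budget)
```

Because ρ is monotone in the quantization step, the rate-to-step mapping can be inverted directly, which is what makes ρ-based rate control both accurate and cheap.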

Chapter 5 copes with the problem of achieving robust transmission through a faulty channel. After an overview of the existing solutions, the chapter describes an implementation of a cross-packet FEC channel coder. The coding is performed by including RTP packets column-wise into a matrix and computing the redundant information along the rows. This approach presents several issues in terms of optimizing the matrix dimensions, since the numbers of rows and columns lead to different performances according to the channel characteristics and the coded sequence. The chapter proposes an optimization strategy based on the percentage of null quantized DCT coefficients. This criterion is used to design a novel joint source-channel rate control algorithm that varies the protection level according to the characteristics of the input sequence.
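A minimal sketch of the cross-packet idea follows, assuming plain XOR parity along the rows, i.e. a (K+1, K) code that repairs any single loss; the actual scheme generalizes this with stronger row codes and an optimized matrix size.

```python
# Hedged sketch of cross-packet FEC: K equal-length packets are placed
# column-wise in a matrix and one parity packet is computed along the
# rows. Here the row code is plain XOR, so any single lost packet can
# be rebuilt by XOR-ing everything that arrived.

def add_parity(packets):
    """Return the packets plus one XOR-parity packet computed row by row."""
    parity = bytearray(len(packets[0]))          # all packets assumed equal length
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return packets + [bytes(parity)]

def recover(received, lost_index):
    """Rebuild the single missing packet by XOR-ing all received ones."""
    length = len(next(p for p in received if p is not None))
    rebuilt = bytearray(length)
    for j, p in enumerate(received):
        if j != lost_index:
            for i, b in enumerate(p):
                rebuilt[i] ^= b
    return bytes(rebuilt)

block = add_parity([b"RTP-pkt-1", b"RTP-pkt-2", b"RTP-pkt-3"])   # toy payloads
lost = 1                                          # pretend packet 1 was dropped
received = [p if j != lost else None for j, p in enumerate(block)]
assert recover(received, lost) == b"RTP-pkt-2"
```

The matrix-dimension trade-off discussed above appears even here: more columns per parity packet lower the overhead but reduce the number of losses each parity row can repair.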

Chapter 6 faces the problem of robust transmission from a different angle. Instead of adding redundancy to the coded bit stream, it is possible to create a robust packet stream through the principle of Distributed Source Coding (DSC). Among all the proposed DSC solutions, we have considered the scheme proposed by Puri and Ramchandran in [93]. The research presented in Chapter 6 focuses on the entropy coding unit, since most DSC coders obtain an inferior compression gain with respect to their non-robust hybrid counterparts. A quad-tree based arithmetic coder is presented, which makes it possible to improve the coding results of previous coders and compares efficiently with the original H.264/AVC standard.

Finally, we draw our conclusions in Chapter 7, which gives a brief summary of the results obtained by this investigation and some guidelines for future research.

The material in Chapter 2 is mainly introduced as a review of the H.264/AVC standard. The remaining chapters represent the original contribution of the author and of the supervisor to the field. Most of the material covered in this thesis has been published in [75, 4, 81, 76, 80, 108, 78, 82].


Chapter 2

Video Source Coding and theH.264/AVC video coding standard

“If you wish to converse with me, define your terms”

Voltaire

The chapter provides a short introduction to the structure of the H.264/AVC coder, highlighting the features of interest for the algorithms presented in the following chapters. The aim is to provide a background of conventions related to the syntax elements and the functional units defined by the standard. The first section gives a general introduction about the purposes and the guidelines that inspired the standardization process. Then, the H.264/AVC coder is decomposed into its building blocks, providing more details for those parts that most directly affect the resulting performance in terms of compression gain. In addition, some conventions about the use of terms related to the H.264/AVC syntax are introduced.

2.1 Introduction

Over the last two decades, different video coding standards have been developed to ensure an efficient handling of visual information along the entire chain covering the production, distribution, and reception of video content. Their design was mainly inspired by the need to shrink as much as possible the overwhelming amount of data produced by a video source, since the transmission capacity or the storage space is limited. Each standard defines the syntax and semantics of the bit stream as well as the processing that the decoder needs to perform when decoding the bit stream back into video. Therefore, manufacturers of video decoders can only compete in areas like post-processing, optimization of coding parameters, cost, and hardware requirements, while the implementation of the encoder is completely free as long as the produced bit stream can be correctly decoded.

This standardization policy has played a crucial role as the leading factor in the spread of digital video communication, affecting the way we create, communicate, and consume audio-visual information. In fact, the decoder-oriented standardization allows interoperability among products developed by different manufacturers, ensuring to content creators that their content runs everywhere and that they do not have to create and manage multiple


copies to match the products of different manufacturers. At the same time, manufacturers are free to resort to different implementation schemes in order to find the right performance-cost trade-off matching the requirements of the target applications and the characteristics of the terminal on which the coder is implemented.

  Name      Organization    Year   Title of standard
  H.261     ITU-T           1990   Video Codec for Audiovisual Services at
                                   p × 64 kbit/s
  MPEG-1    ISO/IEC         1991   Coding of moving pictures and associated audio
                                   for digital storage media at up to about 1.5 Mbit/s
  MPEG-2 /  ISO/IEC /       1994   Generic coding of moving pictures and associated
  H.262     ITU-T                  audio information
  H.263     ITU-T           1995   Video coding for low bit rate communication
  MPEG-4    ISO/IEC         1999   Coding of audio-visual objects
  H.264 /   ITU-T /         2003   Advanced Video Coding
  AVC       ISO/IEC

Table 2.1: Timeline and coding applications for different video coding standards.

Worldwide, two working groups dominate

the video coding standardization processes, namely the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). VCEG has traditionally focused on low bit rate video coding applications, where there is a need for high compression rates and error resilience tools. MPEG groups a larger community targeting higher bit rates for entertainment-quality broadcasting applications. Both organizations have produced very successful standards in their respective domains, and in 2001 they joined to form the Joint Video Team (JVT) with the purpose of designing an efficient video coder able to satisfy the novel requirements created by the transmission of video contents over wireless networks [91].

The main goals of this standardization effort were improved compression efficiency and a network-friendly video representation for interactive (i) and non-interactive (ni) applications such as:

(i) conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks, modems, etc., or mixtures of these;

(ni) video-on-demand or multimedia streaming services over ISDN, cable modem, DSL, LAN, wireless networks, etc.;

(ni) broadcast over cable, satellite, cable modem, DSL, terrestrial channels, etc.;

(ni) storage on optical and magnetic devices, DVD, etc.;

(ni) multimedia messaging services (MMS) over ISDN, DSL, Ethernet, LAN, wireless and mobile networks, etc.

The result of this collaboration is the video coding standard H.264/MPEG-4 AVC, which reached a first complete definition by the end of 2002 and was completed in February 2005


[29]. The coder structure reflects the traditional scheme of hybrid video source coders, with some additional features that improve its coding performance [127]. In fact, it is possible to consider H.264/AVC as a “collection” of different coding tools that can obtain a high compression gain when orchestrated in an appropriate manner. The following sections will give an overview of these features, providing evidence for their influence on the final coding performance.

2.2 A holistic overview of the building blocks

The structure of the H.264/AVC coder can be seen as a comprehensive synergism of coding solutions designed over the last 50 years. In fact, many of the included features were already present in some of the previous coders. However, the standardization process that has led to the definition of this coding scheme has redesigned some of these techniques in order to adequately combine them in a general architecture. In addition, some new elements were introduced, providing the final coder with a wide set of tools that can be rearranged in many different ways.

The input signal is a digitized video sequence, i.e. an ordered sequence of digital pictures taken at fixed, equally-spaced time intervals by a digital video camera. Each picture (called frame) can be seen as a grid of picture elements (pixels or pels) that represent the local information of the picture, similarly to the way a tile is a fraction of a mosaic. The density of pixels per square inch can vary, but usually it is around hundreds of picture elements. Each picture element carries a color information that can be represented by a set of three integers varying according to the adopted color space representation. Since the Human Visual System (HVS) is much more sensitive to the luminance than to the chrominances, the Red, Green and Blue (RGB) color components acquired by the digital sensors are first transformed into the Luminance and Chrominance (YUV) color space, with the chrominance components spatially sub-sampled. Since an extensive description of color representation and chrominance sampling is beyond the scope of this work, more information can be found in [30, 16]. The current state of the art for the H.264/AVC coder allows different color space representations and different types of sub-sampling for the input signal, although in this work we will always consider video signals in the YUV format with 4:2:0 sampling (i.e. the chrominances are sampled at half the frequency along both the vertical and the horizontal direction with respect to the luminance component). Therefore, each picture can be considered as made of three separate matrices of integer values. The first contains the luminance component Y (also called Luma), while the others contain the chrominance components U and V (also called Chromas). Note that each Chroma matrix contains a quarter of the pixels of the Luma matrix because of the sub-sampling. Most of the conventions about the acquisition of the frames were inherited from previous coding standards [47, 45, 44].
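The conversion and sub-sampling just described can be sketched as follows (the BT.601 full-range coefficients are an assumption for illustration; the standard admits other color spaces and sampling formats):

```python
# Sketch of the RGB -> YUV conversion and 4:2:0 chroma sub-sampling:
# Y keeps full resolution while U and V are averaged over 2x2 pixel
# blocks, so each Chroma plane holds a quarter of the Luma samples.
# ITU-R BT.601 full-range coefficients are assumed here.

def rgb_to_yuv(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.169 * r - 0.331 * g + 0.500 * b + 128
    v = 0.500 * r - 0.419 * g - 0.081 * b + 128
    return y, u, v

def subsample_420(plane):
    """Average each 2x2 block of a chroma plane (even dimensions assumed)."""
    h, w = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1] +
              plane[y + 1][x] + plane[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

# A 2x2 chroma block collapses to a single sample: 1/4 of the pixels.
u_plane = [[100, 104], [96, 100]]
assert subsample_420(u_plane) == [[100.0]]
```

This is exactly why a 16 × 16 Luma macroblock carries only two 8 × 8 Chroma blocks in the 4:2:0 format discussed below.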

Like many of the previous video coding architectures, the basic processing unit in the H.264/AVC standard is the Macroblock (MB), i.e. a square of 16 by 16 pixels from the Luma component associated with two squares of 8 by 8 pixels from the two Chroma components. Each macroblock is processed as shown in the scheme reported in Fig. 2.1, which comprises the


following set of processing elements:

• Motion Estimation. Each macroblock is first partitioned into smaller blocks that can have heterogeneous sizes, and Block Matching Motion Estimation (BMME) is applied to each sub-block, i.e. the coder searches for an equally-sized block among the previously-coded frames available in a reference frame buffer in order to accurately predict the current one. The prediction is fully described by a Motion Vector (MV), which is differentially coded with respect to neighboring MVs in the transmitted bit stream.

• Intra Prediction. One of the main innovations brought by the standard is the adoption of a block-based spatial prediction that estimates the current block from the neighboring reconstructed pixels. The Intra prediction compensates for the adoption of a sub-optimal transform size and allows the H.264/AVC coder to obtain a better coding efficiency with respect to previous standards.

• Transform. This module reversibly transforms the pixels from the spatial domain into the frequency domain, where the information appears less correlated, so that a more compact representation of the current block can be achieved. The transform adopted by the H.264/AVC standard is an approximation of the 4 × 4 DCT [35, 36], which proves to be an efficient coding solution when matched with an efficient prediction mechanism like the Intra spatial block-based prediction [14].

• Quantization. The quantization phase is intended to shrink the set of possible reconstruction levels for the transform coefficients in order to reduce the number of representation symbols, so that the coded bit stream results smaller. In fact, small variations in coefficient values are not perceptible by the human visual system, and therefore it is possible to slightly distort the transform coefficients without affecting the perceived visual quality. However, whenever high compression gains are needed (i.e. when the bit rate is constrained by external factors such as the available bandwidth or storage space), an objectionable visual degradation of the reconstructed images becomes evident, so that additional measures must be taken (such as increasing the strength of the deblocking filter).

• Deblocking Filter. The quantization of the transform coefficients and the block-based transform performed on the residual signal cause the appearance of unpleasant visual artifacts, especially at low bit rates. These artifacts usually result in an additional high-frequency noise that makes the reconstructed image appear as if composed of different tiled-up blocks (blocketization). This high-frequency noise can be significantly reduced by an adaptive low-pass filter, which is able to tune its strength according to the values of various coding parameters and syntax elements [64]. In addition, since the deblocking filter is included in the prediction loop, it allows a better motion-compensated prediction, improving the compression efficiency.

• Entropy Coder. This block converts the syntax elements produced by the coder into variable-length binary strings that can be formatted and packetized in different ways


according to the coding parameters. H.264/AVC defines two different entropy coding algorithms: the Context-Adaptive Variable Length Coder (CAVLC) [127] and the Context-Adaptive Binary Arithmetic Coder (CABAC) [73].
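As an aside on the Transform block above, the 4 × 4 core transform can be written directly as Y = C·X·Cᵀ with the integer matrix C below, computable with only additions and shifts; the per-coefficient scaling that the standard folds into the quantization stage is omitted in this sketch.

```python
# Direct (non-optimized) sketch of the H.264/AVC 4x4 forward core
# transform, an integer approximation of the DCT: Y = C * X * C^T.
# The normalization normally absorbed into quantization is omitted.

C = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_core_transform(block):
    ct = [[C[j][i] for j in range(4)] for i in range(4)]   # C transposed
    return matmul(matmul(C, block), ct)

# A flat block concentrates all of its energy in the DC coefficient:
flat = [[10] * 4 for _ in range(4)]
out = forward_core_transform(flat)
assert out[0][0] == 160
assert all(out[i][j] == 0 for i in range(4) for j in range(4) if (i, j) != (0, 0))
```

Because C contains only the values ±1 and ±2, the whole transform needs no multiplications, which is one reason the standard can afford applying it to every 4 × 4 residual block.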

The structure can be roughly decomposed into a DPCM coder, which can perform either a temporal or a spatial prediction, followed by a transform coder and an entropy coder (see Fig. 2.1).

Figure 2.1: A block-based scheme of the H.264/AVC coder. (The diagram shows the input block passing through the transform and quantization stages and the entropy coding unit, driven by the coder control; the embedded decoder loop performs inverse quantization and inverse transform, intra-frame prediction, motion estimation and compensation with the associated motion data, and deblocking filtering to produce the output frame and the bit stream.)

As for temporally-predicted macroblocks, the encoder takes advantage of the temporal correlation existing among subsequent frames and estimates pixel blocks of the current frame from pixel blocks of the previous ones (see section 2.2.2). This estimate is then refined by the transform coding unit, which processes the residual signal.

On the other hand, spatially-predicted macroblocks are characterized by blocks that are predicted from the previously coded pixels in the same frame (see 2.2.1). Note that in this case the coded MB does not depend on the previous frames and can be decoded independently (i.e. we can refer to them as independently-coded or intra macroblocks).

As a consequence, the decoding of a randomly-chosen temporally-predicted frame implies the decoding of all the previous pictures until a spatially-predicted frame is found.1 In addition, the loss of one frame precludes the correct decoding of all the following temporally-predicted pictures. Therefore, spatially-predicted frames must be coded at regular intervals in order to both

allow pseudo-random access to each frame and avoid error propagation in case of frame

losses. The periodicity of independently-coded frames depends on the application and affects

1 We define as spatially-predicted frames or Intra frames the coded pictures that are made only of Intra macroblocks. At the same time, we call temporally-predicted or Inter frames those pictures that can be made of both temporally-predicted and spatially-predicted macroblocks.


Chapter 2. Video Source Coding and the H.264/AVC video coding standard

both the characteristics of the coded bit stream and the quality of the reconstructed sequence

whenever the transmission channel is corrupted by errors and losses. More information on this

subject will be given in section 2.2.2.

Each reconstructed image is then processed using a deblocking filter in order to remove

visual artifacts and improve the performance of the temporal prediction. Further details will be

given in the next sections, focusing on the parts of the standard that concern our investigation. We refer the reader to [87, 105] for a complete description.

Figure 2.2: Relation between Video Coding Layer (VCL), Network Adaptation Layer (NAL), and transmission networks.

The output of the scheme reported in Fig. 2.2

is a binary stream that needs to be packetized

and organized in an appropriate way according to the coding order of macroblocks

and the coded information. This operation is

performed by the Network Adaptation Layer

(NAL), which defines a flexible interface in-

tended to adapt the coded bit stream to the

transmission network (as depicted in Fig. 2.2).

According to the H.264/AVC specifica-

tion, each video packet carries the informa-

tion related to one slice, a set of macroblocks

belonging to the same frame. In the previous

standards, slices were made of sequences of

macroblocks processed in raster scan order (see [47, 41, 44]). For example, many coding settings included a row of macroblocks in one slice. More recently, several issues concerning error resilience and packetization have suggested new strategies to design the pattern of macroblocks forming a slice, such as choosing the MBs randomly across the frame, selecting all the macroblocks inside a specific area, or interlacing rows of macroblocks. The investigation of the

possibilities offered by a Flexible Macroblock Ordering (FMO) is beyond the scope of this

work, and all the experiments were carried out considering slices with adjacent macroblocks

in raster scan order. Slices were made considering a fixed number of MBs per slice or a fixed

number of bytes (i.e. including macroblocks until the number of bits reached a fixed threshold).

For further details about FMO policies, it is possible to refer to [87, 5, 6, 7].

The information related to each slice can be packetized into one or more RTP packets ac-

cording to whether the Data Partitioning (DP) option is enabled or not. This distinction makes

possible the inclusion of different syntax elements into different packets that can be transmit-

ted or protected according to different criteria. Whenever the video sequence is transmitted

over a network affected by losses, the Data Partitioning option allows the decoder to improve

the performance of the error concealment algorithm since part of the coded information can

be correctly received. However, in our work we created only one packet per slice without en-

abling the Data Partitioning mode since the investigation of the possible benefits produced by

its adoption is beyond the scope of the present thesis (for further information see [87]).

The following sections will provide a more detailed insight into some basic building blocks.



2.2.1 Spatial Prediction

One of the main innovations introduced by H.264/AVC within the scenario of hybrid video

coders is the adoption of a block-based spatial prediction. In fact, the independently-coded macroblocks (called Intra) defined by previous video coders were created by applying transform

coding to the input signal directly without performing any kind of prediction. However, the

adoption of a DCT transform with a lower dimension (see subsection 2.2.3) required some

additional processing in order to compensate for the lower coding gain associated with unpredicted signals (see [14]). The prediction of the original signal before transform coding was a feasible

solution, provided that the frame could be independently decoded. This requirement led to the adoption of the spatial prediction unit reported in Fig. 2.1 (Intra-frame Prediction), which

creates an estimate of each transform block without resorting to the previous frames but consid-

ering the values of the neighboring pixels. Unlike most of the previous coding standards which

adopted spatial prediction on a pixel basis [42, 122], H.264/AVC defines a block-oriented spa-

tial prediction stage, where each block can be estimated in two different ways. One possible

prediction can be performed on the whole macroblock considering the pixels of the upper and

the left MBs. The estimate is computed choosing one among a set of 4 predictors, and it is

performed whenever the current MB is coded in the Intra16x16 mode. In order to obtain a

[Figure: the nine 4 × 4 Intra prediction modes — 0 (vertical), 1 (horizontal), 2 (DC), 3 (diagonal down-left), 4 (diagonal down-right), 5 (vertical-right), 6 (horizontal-down), 7 (vertical-left), 8 (horizontal-up) — each predicting the block from the neighboring samples A–M.]

Figure 2.3: The 4 × 4 Intra predictors.

finer prediction the standard defines a set of 9 possible spatial predictors (depicted in Fig. 2.3)

that approximate the pixels of the current 4 × 4 block using the values of the upper and left pixels. This coding mode (called Intra4x4) implies the coding of the adopted predictor for each 4 × 4 block.

Since an extensive description of the Intra prediction process in H.264/AVC is not the main

topic of this work, [128, 87, 14] can be consulted for further details.
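As an illustrative sketch (not the normative process), the three simplest 4 × 4 predictors can be written as follows; the six remaining directional modes interpolate the same neighboring samples with small fixed filters. Function names and sample values are illustrative.

```python
import numpy as np

def intra4x4_predict(mode, top, left):
    """Sketch of the three simplest 4x4 Intra predictors.

    top  -- the four reconstructed pixels above the block (samples A..D)
    left -- the four reconstructed pixels to its left (samples I..L)
    """
    if mode == 0:                      # vertical: copy the row above
        return np.tile(top, (4, 1))
    if mode == 1:                      # horizontal: copy the left column
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == 2:                      # DC: rounded mean of all neighbors
        dc = (top.sum() + left.sum() + 4) // 8
        return np.full((4, 4), dc)
    raise NotImplementedError("directional modes 3..8 omitted in this sketch")

top = np.array([10, 20, 30, 40])
left = np.array([50, 60, 70, 80])
print(intra4x4_predict(0, top, left)[0])   # → [10 20 30 40]
```

The encoder would compute all candidate predictions, subtract each from the original block, and keep the mode whose residual costs fewest bits.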



2.2.2 Motion Compensation

Despite the fact that spatial prediction permits improving the coding gain of the 4 × 4 DCT

alone, it is possible to obtain further compression through temporal prediction. The efficiency of this technique was known to the previous coding standards too (see [47, 44, 45]), and different techniques have been applied to take advantage of it. The most widely used method implies the estimation of Motion Vectors (MV). Motion Vector coding is based on a vector-oriented

model of motion derived from classical mechanics ([18, 85]), where the movement of objects

is simplified as a sequence of small local translations on a plane, which are the projection on the

image plane of the real three-dimensional movements. Although the model proves to be inefficient in modelling rotations, deformations or movements along the optical axis of the camera,

it is broadly adopted for its intrinsic simplicity. The original image is divided into blocks of

pre-determined size which are predicted using an equally-dimensioned block of pixels taken

from the previous images. The identification of the prediction block is typically performed

through a Block Matching (BM) algorithm, which finds the block that minimizes a given dis-

tortion function among a set of possible candidates. The candidate set could include all the

possible blocks from the previously coded frames, but practical approaches confine the search

to those blocks that lie within a limited window (see Fig. 2.4(a)). Each block can be identi-

fied by specifying a Motion Vector, i.e. the difference between the Cartesian coordinates of the
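The Block Matching search described above can be sketched as an exhaustive SAD minimization over a limited window; the function name, block size, and search range below are illustrative, not the encoder's actual fast-search strategy.

```python
import numpy as np

def block_matching(cur, ref, by, bx, bsize=8, search=4):
    """Full-search block matching: find the motion vector (dy, dx) that
    minimises the Sum of Absolute Differences (SAD) between the current
    block and a candidate block in the reference frame, inside a
    (2*search+1)^2 window centred on the co-located position."""
    block = cur[by:by+bsize, bx:bx+bsize].astype(int)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > ref.shape[0] or x + bsize > ref.shape[1]:
                continue                      # candidate falls outside the frame
            sad = np.abs(block - ref[y:y+bsize, x:x+bsize].astype(int)).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv, best

# Toy check: a block shifted by (1, 2) between two frames is found exactly.
ref = np.arange(256).reshape(16, 16)
cur = np.zeros_like(ref)
cur[4:12, 4:12] = ref[5:13, 6:14]            # block moved by dy=1, dx=2
mv, sad = block_matching(cur, ref, 4, 4)
print(mv, sad)                               # → (1, 2) 0
```

Real encoders replace the exhaustive scan with fast search patterns, but the cost function and window structure are the same.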

prediction block and the current one. Experimental resultsand the physical characteristics of

(a) Translational model for BMME (b) MV median predictor applied to heterogeneous neighboring blocks

Figure 2.4: Motion Vector computation and its prediction.

real objects in a scene indicate that neighboring motion vectors are correlated. Therefore, it

is possible to take advantage of this correlation both in the Motion Estimation (ME) process and in the coding of MV values. The H.264/AVC coder identifies a predictor for each motion

vector, which corresponds to the median of the neighboring ones as depicted in Fig. 2.4(b). The

median predictor identifies the center of the search window and its value is used in a DPCM

coding of the current MV.
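The median prediction and the DPCM residual that is actually coded can be sketched as follows (simplified: the standard adds special cases for unavailable neighbors and for some partition shapes).

```python
def mv_predictor(mv_left, mv_up, mv_upright):
    """Component-wise median of the left, upper and upper-right
    neighbouring motion vectors (simplified sketch)."""
    def med3(a, b, c):
        return sorted((a, b, c))[1]
    return (med3(mv_left[0], mv_up[0], mv_upright[0]),
            med3(mv_left[1], mv_up[1], mv_upright[1]))

def mv_residual(mv, pred):
    """DPCM: only the difference between the MV and its predictor is coded."""
    return (mv[0] - pred[0], mv[1] - pred[1])

pred = mv_predictor((2, -1), (4, 0), (3, 5))
print(pred, mv_residual((3, 1), pred))       # → (3, 0) (0, 1)
```

Because neighboring vectors are correlated, the residual is usually near zero and therefore cheap to entropy-code.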

In H.264/AVC, the MV-based prediction scheme is enhanced by allowing a flexible partitioning of the MB to be predicted. Fig. 2.5 reports the possible partitioning structures that can be applied to a macroblock.

[Figure: (a) the possible block-partitionings for a macroblock — 16x16, 16x8, 8x16, 8x8, with 8x8 sub-partitions 8x4, 4x8, 4x4; (b) the partitioning of a single frame from the sequence foreman.]

Figure 2.5: Macroblock partitioning for Motion Compensation in the H.264/AVC standard.

This flexibility makes possible a more accurate temporal prediction since the macroblock partitioning can be fitted to the shape of moving objects in the scene,

minimizing the energy of the residual signal that is to be coded. In order to improve the coding

gain, motion compensation is performed from interpolated frames with quarter-pel resolution,

where the expanded frames are obtained using a 6-tap followed by a 4-tap FIR filter as

reported in [124].

There are two types of temporally-predicted frames/slices: the P-type frames/slices and B-type frames/slices. P-type frames/slices are predicted considering only the previous frames in the display order, and for each predicted block only a single motion vector is specified (classified as forward MV). B-frames are characterized by a bi-directional temporal prediction (specified by a forward and a backward MV), which allows a better estimate of the current block

since the prediction process takes into account both the previous and the following frames in the

display order. As a consequence, the display order does not correspond to the coding order as

Fig. 2.6(b) shows. Usually, bi-directional temporal prediction requires a higher computational

cost, and therefore, the lowest complexity profiles for the H.264/AVC coder do not include

this coding option. Moreover, the standard imposes additional limitations on the coding types

of macroblocks in each frame. Intra slices can include Intra macroblocks only, while P slices

can be made of both Intra and motion-compensated macroblocks where prediction is charac-

terized by a single motion vector (forward). As for B-slices, no Intra macroblocks are present

since all the macroblocks are temporally-predicted, and the coder can specify one or two MVs

for each motion-compensated block, permitting the estimation of the prediction block either

from the previous or from both the previous and the following frames in display order (see Figure 2.6 and [87]). As said before, the coding type of each frame is pre-determined by the encoder according to a certain structure. In fact, the sequence of pictures is divided into

Group Of Pictures (GOP), which generally includes all the frames between two subsequent

Intra pictures. The structure and the length of each GOP is determined by the availability of

computational resources, the characteristics of the application, and the state of the channel.

In fact, bi-directional prediction requires a double ME that results quite expensive for a device with reduced hardware resources or limited power supply. At the same time, bidirectional



[Figure: (a) coding and display order for a GOP with structure IBBP; (b) the GOP structures IBBP and IPPP.]

Figure 2.6: Different coding and display order for GOPs.

prediction requires a computational time that could be prohibitive for interactive applications,

where the coded pictures must be ready for transmission at fixed equally-spaced time instants.

Finally, in the presence of losses the reconstructed sequence is distorted since the loss of one

or more frames is compensated by estimating the lost information with an error concealment

algorithm. As a drawback, the decoder has a different frame buffer with respect to the en-

coder, precluding a correct reconstruction of the coded sequence until a refresh is performed

with some Intra data. Therefore, the ordering of coding types and the length of the GOP can vary widely, and in the present work two GOP structures are considered: the IBBP structure (two

B-frames either between an I- and a P-frame or between two P-frames), and the IPPP structure

(an I-frame followed by only P frames). These two structures are depicted in Fig. 2.6(b). The

baseline profile of the H.264/AVC coder, which is the standard configuration with the lowest

computational complexity, only includes the IPPP sequence.

It is worth mentioning as a specific coding type the Direct mode, where no residual

information is sent and the motion vectors for the current block are estimated from the MVs of

the blocks at the corresponding positions in the previous and the following frames. In this way,

it is possible to minimize the amount of bits that have to be coded for temporally-predicted

macroblocks. Further details are available in [105, 87].
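The reordering implied by B-frames can be sketched as follows: each anchor (I or P) must be coded before the B-frames that precede it in display order, since they are predicted from it. The function and the frame-type layout below are illustrative.

```python
def ibbp_coding_order(n_frames, n_b=2):
    """Coding order for an I(BB)P GOP: every anchor frame (I or P) is
    emitted before the B-frames that precede it in display order."""
    display = ['I'] + (['B'] * n_b + ['P']) * ((n_frames - 1) // (n_b + 1))
    order, pending_b = [], []
    for idx, ftype in enumerate(display):
        if ftype == 'B':
            pending_b.append(idx)       # hold B-frames until their anchor
        else:
            order.append(idx)           # anchor first, then the held Bs
            order.extend(pending_b)
            pending_b = []
    return display, order

display, order = ibbp_coding_order(7)
print(display)   # → ['I', 'B', 'B', 'P', 'B', 'B', 'P']
print(order)     # → [0, 3, 1, 2, 6, 4, 5]
```

This is the structural reason why, with B-frames, the coding order differs from the display order and the decoder needs a small reordering buffer.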

2.2.3 Transformation and quantization

One of the basic features that distinguishes the H.264/AVC coder from the previous ones is the

adopted transform. Since the development of the earliest standards for image compression, such as JPEG at the end of the eighties [122], transform coding through the Discrete Cosine Transform

(DCT) has represented the usual approach for both image and video coding [47, 44, 87]. DCT

is able to compact the energy of each transform block in the frame to be coded into a smaller num-

ber of frequency coefficients, and most of the previous transform-based video coders adopt

a Discrete Cosine Transform performed on 8 × 8 blocks as it provides a good trade-off be-

tween computational complexity and compression efficiency. Recently, new paradigms have



been proposed as good substitutes for the DCT, like the wavelets used in the JPEG2000 standard

[43, 118].

At the beginning of the standardization process of H.264/AVC, the technical literature had presented several implementations of the 8 × 8 DCT in fixed-point arithmetic designed for low-complexity devices. However, the designers of H.264/AVC looked for a reduced-complexity transform which was implementable with a few additions and shift registers. The solution was found by simplifying the structure of the 4 × 4 DCT, which can be easily approximated through

a multiplierless transform followed by a rescaling. In fact, the 4 × 4 DCT can be rewritten as

follows

Y = AXA^T =
\begin{pmatrix}
a & a & a & a \\
b & c & -c & -b \\
a & -a & -a & a \\
c & -b & b & -c
\end{pmatrix}
X
\begin{pmatrix}
a & b & a & c \\
a & c & -a & -b \\
a & -c & -a & b \\
a & -b & a & -c
\end{pmatrix}, \qquad (2.1)

where X is the input block, Y is its transformed version, and the transform matrix A is fully described by the factors a = 1/2, b = \sqrt{1/2}\,\cos(\pi/8) and c = \sqrt{1/2}\,\cos(3\pi/8) = bd. The transform matrix A can be written as

A =
\begin{pmatrix}
a & 0 & 0 & 0 \\
0 & b & 0 & 0 \\
0 & 0 & a & 0 \\
0 & 0 & 0 & b
\end{pmatrix}
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & d & -d & -1 \\
1 & -1 & -1 & 1 \\
d & -1 & 1 & -d
\end{pmatrix}, \qquad (2.2)

which makes it possible to rewrite (2.1) as follows

Y = \left[
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & d & -d & -1 \\
1 & -1 & -1 & 1 \\
d & -1 & 1 & -d
\end{pmatrix}
X
\begin{pmatrix}
1 & 1 & 1 & d \\
1 & d & -1 & -1 \\
1 & -d & -1 & 1 \\
1 & -1 & 1 & -d
\end{pmatrix}
\right]
\otimes
\begin{pmatrix}
a^2 & ab & a^2 & ab \\
ab & b^2 & ab & b^2 \\
a^2 & ab & a^2 & ab \\
ab & b^2 & ab & b^2
\end{pmatrix} \qquad (2.3)

where ⊗ denotes a coefficient-by-coefficient multiplication (see [35, 36]). In the H.264/AVC standard, the parameter d, whose value is \sqrt{2} - 1 ≃ 0.414, is approximated to 1/2, allowing an implementation of the transform with additions and shift registers only.
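With d approximated by 1/2, scaling the second and fourth basis rows by 2 keeps every entry integer and yields the core matrix of the 4 × 4 integer transform. The sketch below applies only the core product; the per-coefficient rescaling of Eq. (2.3) is assumed folded into the quantizer, as the text describes.

```python
import numpy as np

# Core matrix: the matrix of Eq. (2.2)-(2.3) with d = 1/2 and its
# second/fourth rows multiplied by 2 so that all entries are integers
# (implementable with additions and one-bit shifts).
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_core_transform(X):
    """Multiplierless 4x4 core transform Y = Cf X Cf^T; the scaling
    matrix E of Eq. (2.3) is absorbed by the quantization step."""
    return Cf @ X @ Cf.T

X = np.arange(16).reshape(4, 4)
Y = forward_core_transform(X)
print(Y[0, 0])   # → 120, i.e. the sum of the block (DC term)
```

A flat block produces a single non-zero coefficient: all the energy is compacted into the DC position, which is exactly the behaviour the transform is designed for.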

In addition to the presented 4 × 4 transform, the standard includes an additional 4 × 4 Hadamard transform, which is applied to the DC coefficients of the 4 × 4 blocks for the

Intra16x16 coding mode. The Hadamard transform is also applied on the prediction error

blocks before computing the cost function, since it allows the Rate-Distortion (RD) optimiza-

tion algorithm to discriminate between low-pass and high-pass residual errors2. An additional

2 × 2 transform is also applied on the DC coefficients of Chroma blocks as it is done in the

Intra16x16 case.

Recent developments of the standard have led to the adoption of a higher-dimension transform (sized 8 × 8 pixels), which was introduced in order to efficiently code high-definition

2 The Hadamard transform permits distinguishing blocks with different features that have the same distortion value (see [87]).



video formats (i.e. HDTV). The adopted 8 × 8 transform is derived from a DCT, and it is

implemented without any multiplication. Since most of the research work presented in this

thesis concerns the processing of 4 × 4 blocks, a detailed description of the 8 × 8 DCT of the H.264/AVC standard is left to [87].

Since the transform block is directly followed by a quantization, the rescaling matrix can

be included in the quantization step by specifying different quantization steps according to

the position of the coefficient to be quantized. In the first definitions of H.264/AVC, the quan-

tization steps depended on the spatial frequency of the coefficients and on the Quantization

Parameter (QP), a coding parameter that can be specified at macroblock level and can be re-

lated to the quantization step through the equation

\Delta = K(i,j) \cdot 2^{QP/6}, \qquad i, j = 0, \ldots, 3 \qquad (2.4)

where K(i,j) is a scaling factor (see [87]) that depends on the position (i, j) of the coefficient in the block and includes the factors of the rescaling matrix.3 In a more recent definition of the

standard, the factors K(i,j) can be arbitrarily specified at the encoder by a quantization matrix,

which allows a coarser or finer representation of the coded coefficients according to the adopted

coding strategy. The investigation of the optimal quantization matrix is beyond the scope of

this work and will not be treated. The quantized coefficients are scanned according to a zig-zag order and run-length coded, i.e. each non-null quantized coefficient will be mapped into a couple (r, l), where l equals the value of the quantized coefficient itself while r specifies the number of null coefficients that occur between the current non-zero coefficient and the previous one in the scanning order. In this work, non-null coefficients will be called levels while the null coefficients will be called zeros. The couples (r, l) are then sent to the entropy coder

together with other syntax elements such as the motion vectors, the partitioning structure for

the current macroblock, the macroblock coding mode, and the adopted prediction modes in

case of Intra macroblocks. For a more detailed description of the run-length coding procedure,

see [127, 105, 87].
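The behaviour of Eq. (2.4) and of the (r, l) mapping can be sketched as follows. The value of K is a placeholder here (the real K(i,j) depends on the coefficient position, see [87]); the run-length function is a simplified illustration of the mapping just described.

```python
def quant_step(qp, k=0.625):
    """Quantization step of Eq. (2.4): Delta = K * 2^(QP/6).
    k is an illustrative placeholder for K(i,j)."""
    return k * 2 ** (qp / 6)

def run_length(scanned):
    """Map a zig-zag-scanned coefficient list into (r, l) couples:
    l is the non-zero value (a 'level'), r counts the zeros that occur
    between it and the previous non-zero coefficient."""
    pairs, run = [], 0
    for c in scanned:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs

print(quant_step(28) / quant_step(22))        # ≈ 2: the step doubles every 6 QP
print(run_length([7, 0, 0, -2, 1, 0, 0, 0]))  # → [(0, 7), (2, -2), (0, 1)]
```

Trailing zeros after the last level produce no couple; they are signalled implicitly by the block-level syntax.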

2.2.4 Entropy coding

The final step of the encoding process for a macroblock implies the conversion of all the syntax

elements into a set of binary strings that can be sent to the Network Adaptation Unit in order to

be transmitted. This conversion proves to be efficient whenever the length of the binary strings

assigned to the syntax elements is matched with their probabilities, i.e. the most probable values

are represented with short strings while the least probable values are coded using long strings.

This task is delegated to two different coding algorithms: the Context-Adaptive Variable Length

Coder (CAVLC) [127] and the Context-Adaptive Binary Arithmetic Coder (CABAC) [73].

Since the CABAC algorithm will be widely described in chapter 3, the following paragraph

aims at giving a short insight into the CAVLC algorithm.

In most of the previous video coders, the entropy coding algorithm maps the produced

syntax elements into variable-length binary strings (called codewords) according to a

3 Equation (2.4) is derived from the tables in [103], considering that the quantization step doubles every 6 QP values.



fixed table (called coding table). This approach proves to be efficient in terms of computational

cost, but most of the times, the compression gain is limited since the algorithm is not able

to adapt the length of the codewords to the changing statistics of the input data. Therefore,

since the beginning of source coding it was evident that the efforts of researchers should focus on adaptive approaches. First attempts were made considering adaptive Huffman codes

where the probabilities associated with the nodes of the coding tree were updated according to

the input statistics (see [19]). Then, different adaptive codes were proposed according to the

characteristics of the probability distribution of the coded source. Among these we can mention

the CAVLC algorithm, which represents an efficient way of coding the quantized coefficients by adaptively choosing the coding table among a set of possible ones in order to match the signal

statistics.

The CAVLC algorithm adopts a fixed Variable Length Code (VLC) in order to specify

all the syntax elements which are not related to residual information. As for each block of

quantized DCT coefficients, the coder first specifies the number of coefficients different from

zero and whether there are coefficients equal to ±1 at the end of the scan. Then, according to the

number of coefficients, couples (r, l) are coded by writing all the l values first and the r values

in the following. The adopted coding table is chosen from a set of possible ones, and it depends

on the number of non-zero coefficients and on the value of the previously coded information

(i.e. the previous l or r values, the number of non-zero coefficients in the neighboring blocks).

Since the statistics of the source can be roughly approximated with a geometric variable, the

coding tables define an Exp-Golomb code (see [87]), which proves to be optimal for that kind

of source. The performance of the CAVLC algorithm compares very well with the performance

of the arithmetic coder, since the difference is only a 10% increment of the coded bit stream.

Fig. 2.7 reports the coding results of the CAVLC algorithm compared with the CABAC algorithm and

UVLC, the variable length coding algorithm based on a fixed coding table that was adopted in

the early definitions of the standard.

Figure 2.7: Comparison between CAVLC, CABAC, and UVLC (a fixed VLC code defined in the H.26L drafts).
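A zeroth-order Exp-Golomb codeword for an unsigned integer n consists of M zero bits followed by the (M+1)-bit binary representation of n+1; short codewords go to small (probable) values, matching a roughly geometric source. A minimal sketch:

```python
def exp_golomb(n):
    """Zeroth-order Exp-Golomb codeword for an unsigned integer n:
    M leading zeros, then the binary representation of n+1
    (which always starts with a '1' and has M+1 bits)."""
    code = bin(n + 1)[2:]              # binary of n+1, leading '1' included
    return '0' * (len(code) - 1) + code

for n in range(5):
    print(n, exp_golomb(n))            # 0→1, 1→010, 2→011, 3→00100, 4→00101
```

The prefix of zeros tells the decoder how many information bits follow, so the code is instantaneously decodable without a table.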



2.2.5 Deblocking Filter

All the images that have been coded by a block-based transform coding algorithm present

some visual artifacts related to the fact that each block is processed independently from the

neighboring ones. In fact, the loss of part of the information (caused by the quantization) may

lead to reconstructing the coded blocks in different ways even though they are similar in the original signal. These artifacts usually appear as high-frequency noise that is added to the image

and makes it appear as if made by separate tiles in a sort of mosaic-like effect. This distortion

(called blocketization) can be mitigated by filtering the reconstructed image along the edges of each block with a low-pass filter, which is adaptively tuned in order to attenuate high frequencies more or less strongly whenever they carry a significant amount of distortion.

The H.264/AVC standard defines a very efficient deblocking filter, which is applied along the vertical and horizontal edges of every 4 × 4 block. For each edge, the strength of the

filter (i.e. the number of involved pixels) depends on the coding type of the two neighbor-

ing blocks, the quantization parameters, the presence of non-zero coefficients, and in case of

motion-compensated blocks, the values of their corresponding motion vectors. Since a full

description of the deblocking routine is not the main subject of this thesis, further details can

be found in [87, 105, 64].
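A simplified sketch of a boundary-strength decision consistent with the dependencies just listed (coding type, coded coefficients, motion) is given below; the exact normative rules and thresholds are in [87], so this function is illustrative, not the standard's rule set.

```python
def boundary_strength(intra_p, intra_q, mb_edge, nz_p, nz_q,
                      mv_p, mv_q, ref_p, ref_q):
    """Illustrative boundary-strength (bS) decision for the edge between
    two 4x4 blocks p and q: 0 = no filtering .. 4 = strongest filtering.
    MVs are in quarter-pel units; 4 quarter-pels = one integer pel."""
    if intra_p or intra_q:
        return 4 if mb_edge else 3     # intra content: strong filtering
    if nz_p or nz_q:
        return 2                       # residual coded on either side
    if ref_p != ref_q or any(abs(a - b) >= 4 for a, b in zip(mv_p, mv_q)):
        return 1                       # motion discontinuity across the edge
    return 0                           # smooth motion field: leave the edge

print(boundary_strength(True, False, True, 0, 0, (0, 0), (0, 0), 0, 0))  # → 4
```

The filter then maps the resulting bS, together with the QP-dependent thresholds, to the number of pixels actually modified on each side of the edge.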

2.3 Summary

This chapter has presented a short overview of the video coding standard H.264/AVC, starting

from its general scheme and describing some of its building blocks. This coding scheme can

be summarized, for the sake of brevity, as a general DPCM coder followed by a transform

coding block, and an entropy coder. The DPCM prediction can be performed either spatially or

temporally according to the coding type that characterizes the current frame. The residual sig-

nal is then transformed through a multiplierless Discrete Cosine Transform, which is followed

by a rescaling/quantization phase. The quantized coefficients, also known as levels, are coded

according to a run-length coding algorithm and sent to the entropy coder, which converts them

into binary variable length strings together with the macroblock coding type, the macroblock

partitioning information, the prediction modes, and the predicted motion vectors. Specific attention has been paid to those blocks that will be involved in the coding algorithms that

will be presented in the following chapters, i.e. the transform block, the quantization and the

CAVLC entropy coder. Further details about the Arithmetic Coding engine will be given in

chapter 3, where an improvement of the original algorithm is described.

The basic aim of the chapter is not to provide an accurate overview of the H.264/AVC

coding standard, since it is not the main topic of this work, but to introduce the set of tools

which will be tuned by the algorithms presented in the following chapters. Further details on

the H.264/AVC standard can be found in [105, 87, 71, 64, 127, 72].


Chapter 3

Probability-Propagation Based Arithmetic Coding

“In nature we never see anything isolated, but everything in connection with something else

which is before it, beside it, under it and over it.”

Johann Wolfgang von Goethe

The previous chapters have described the main issues of this work and the starting point of our research, the video coding standard H.264/AVC. This chapter concerns the first of the requirements that an efficient video coding architecture for mobile applications must satisfy, i.e. a good compression efficiency. The whole chapter is focused on improving the compression gain of the arithmetic coder specified by the standard by modelling the probability of bits through a graph. The proposed model takes advantage of the statistical dependence among neighboring coefficients in order to improve the probability estimate. Enhancing the number and the structure of contexts, the proposed solution permits improving the compression gain without increasing the number of required operations.

3.1 Introduction

Arithmetic coding is known in the form we use it today from the late 70's. The first investigations about this topic appeared in the 1960s thanks to Abramson and Elias, although the proposed solution was far from the first “arithmetic coder”. Only in 1976 the work of Pasco

and Rissanen led to the design of the first arithmetic coding engine (1979-1980) that is similar

to the one used nowadays (see [9]). Although the idea of arithmetic coding is quite simple, we had to wait until the 80's to witness the first practical implementations (Q-coder and MQ-coder) [83, 129]. Nowadays most of the arithmetic coding engines are a re-elaboration of

the MQ-coder and are used in a varied set of applications.

The key idea is mapping strings of values into separate intervals of “real” numbers allowing

the identification of the string by specifying one value in the final interval. As an example, let

us consider the string of binary symbols b = [01101] and the corresponding probabilities for the symbol “0”, p_0 = [p_{0,i}]_{i=1,...,5} = [1/3, 1/4, 3/8, 1/8, 1/4], at every instant. At the i-th



iteration the coding interval I_i = [l_i, l_i + W_i), where l_i is its lower bound and W_i its width, can be partitioned into two sub-intervals I^0_i and I^1_i, associated respectively with the probability of a

“0” symbol and a “1” symbol. In the adopted notation the subscript index refers to the index of

the coded bit and the exponent index refers to the associated binary value. The width of each

sub-interval can be computed as follows

W^1_i = W_i \cdot (1 - p_{0,i}), \qquad W^0_i = W_i \cdot p_{0,i},

while the lower bounds are

l^1_i = l_i + W^0_i, \qquad l^0_i = l_i,

where we are assuming that the lower interval is always associated with the symbol “0”.

The intervals can be univocally identified by their widths W^s_i and their lower bounds l^s_i, with s = 0, 1 and i = 0, . . . , 4. According to the coded binary symbol, the coding interval is

shrunk each time into one of its partitions:

I_{i+1} \leftarrow \begin{cases} I^0_i & \text{if } b_i = 0 \\ I^1_i & \text{if } b_i = 1 \end{cases}
\quad\text{i.e.}\quad
l_{i+1} \leftarrow \begin{cases} l^0_i & \text{if } b_i = 0 \\ l^1_i & \text{if } b_i = 1 \end{cases}
\quad\text{and}\quad
W_{i+1} \leftarrow \begin{cases} W^0_i & \text{if } b_i = 0 \\ W^1_i & \text{if } b_i = 1 \end{cases}

In the following step, the interval I_{i+1} = [l_{i+1}, l_{i+1} + W_{i+1}) is further partitioned into two sub-intervals according to the probability distribution of the following symbol to code. In the considered

[Figure: the nested coding intervals of the example — I_0 = [0, 1), I_1 = [0, 1/3), I_2 = [1/12, 1/3), I_3 = [11/96, 1/3), I_4 = [11/96, 109/768) — each iteration splitting the current interval into the sub-intervals I^0_i and I^1_i.]

Figure 3.1: Example of arithmetic coding: the string [01101] is coded considering the vector of probabilities p_0 = [1/3, 1/4, 3/8, 1/8, 1/4] for the symbol “0”.

example, at each iteration the state of the arithmetic coder can be identified by the couple (l_i, W_i), while the coded string can be specified by a real number r internal to the final interval [l_4, l_4 + W_4). The sequence of states for the arithmetic coder is shown in Fig. 3.1 and reported in Table 3.1.

Iteration   symbol b_i   p_{0,i}   [l_i, l_i + W_i)     state (l_i, W_i)
0           "0"          1/3       [0, 1)               (0, 1)
1           "1"          1/4       [0, 1/3)             (0, 1/3)
2           "1"          3/8       [1/12, 1/3)          (1/12, 1/4)
3           "0"          1/8       [11/96, 1/3)         (11/96, 7/32)
4           "1"          1/4       [11/96, 109/768)     (11/96, 7/256)

Table 3.1: Sequence of states for the considered example.

As the number of coded bits increases, the width of the coding interval shrinks at each iteration, requiring a finer resolution to specify the internal real number. In actual implementations,

the resolution is chosen in the design of the coding architecture and the length of the string that can be coded is derived from it. Whenever the width of the coding interval reaches the smallest allowed size, a rescaling is performed [58]. At the decoder, given the number r, it is possible to reconstruct the coded bit stream by repeating the partitioning procedure carried out at the encoder and verifying in which interval the number r lies.1 Moreover, although during encoding the algorithm generates one codeword for a whole sequence of data, it is possible to implement it with a sequential algorithm that outputs bits whenever possible.
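The recursion above can be sketched in a few lines of exact rational arithmetic (real implementations use rescaled integer intervals instead); the function names are illustrative, not part of any standard API.

```python
from fractions import Fraction as F

def encode(bits, p0):
    """Shrink [0, 1) once per bit: "0" takes the lower sub-interval of
    width W * p0, "1" the upper one of width W * (1 - p0)."""
    l, W = F(0), F(1)
    for b, p in zip(bits, p0):
        if b == 0:
            W = W * p
        else:
            l, W = l + W * p, W * (1 - p)
    return l, W  # any real number r in [l, l + W) identifies the string

def decode(r, p0):
    """Mirror the encoder: test in which sub-interval r lies."""
    l, W, bits = F(0), F(1), []
    for p in p0:
        if r < l + W * p:
            bits.append(0)
            W = W * p
        else:
            bits.append(1)
            l, W = l + W * p, W * (1 - p)
    return bits
```

For instance, encoding b = [0, 1, 1, 0, 1] with the probabilities of the example and then decoding the midpoint of the resulting interval returns the original string; the final width is the product of the probabilities of the chosen sub-intervals.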

With respect to the Huffman code, Arithmetic Coding (AC) codes each symbol with a fractional number of bits, leading to higher efficiency. Indeed, it can be proven to almost reach the best compression ratio possible, i.e. the entropy rate of the source being coded.

Although AC is still very young with respect to other fields of Information Theory, it is already a mature and widely-used coding solution. The attractive coding gains that can be achieved have led to its introduction in most coding standards for video and image compression (see the specifications of the standards JBIG, JBIG2 [42], JPEG [122], JPEG2000 [118, 43], H.263 [45]). However, its complexity still remains an important issue, since it can be considered computationally prohibitive for devices with a very limited power supply. As a consequence,

the most recent video coding standards define two separate entropy coding algorithms: one is based on arithmetic coding while the other is a simpler coder that requires a limited computational load. Among these, we can include the H.264/AVC [127] video coder, which has standardized the non-arithmetic algorithm CAVLC (Context Adaptive Variable Length Coder) [87] and the arithmetic engine CABAC (Context Adaptive Binary Arithmetic Coder) [72]. The efficiency of the latter is one of the main improvements that allow the H.264/AVC coder to outperform all existing standards, reducing the size of the coded bit stream by up to 50%, especially in comparison to MPEG-2 [47]. Its performance is mainly due to a precise context modelling and an efficient binarization strategy, as will be shown in the following sections.

In the CABAC architecture, each syntax element is converted into a variable-length binary string, and each string is coded via a binary arithmetic coder according to the probability of the bit value. The probability is given by the probability distribution function (pdf) associated with a context, which depends on the coded syntax elements and on the position of the binary digit in the string.

1 The procedure recalls the approximation of a real number through successive partitions of the real interval [0, 1).


Figure 3.2: Scheme of the CABAC coding engine.

Since adaptive algorithms have already proved to be extremely effective for other classes of codes (see [19]), the same strategy was adopted for the arithmetic architecture. After coding each binary digit, the pdf is updated in order to adapt the coder statistics to the input signal.

Adopting the same conventions that were adopted in [73], in the rest of the chapter we will also refer to the binary elements of strings with the name "bin" in order to avoid any misunderstanding with the actual bits that are written in the coded bit stream. This choice was motivated by many previous works, which have shown that it is possible to implement a multilevel arithmetic coder by applying a binary arithmetic coder to the bins obtained by binarizing the input symbols. This allows a great simplification in designing the coding architecture, and, by varying the contexts, it is possible to process heterogeneous syntax elements without changing the coder structure.

At the same time, the binarization improves the coding performance while contexts refine their probability estimates (statistics can change). At the beginning of each slice, contexts are reset in order to make the decoding operation independent from other slices, and therefore the probability values are not yet matched to the input data. This mismatch may reduce the compression performance, mostly whenever the slice size is small. The binarization block (see Fig. 3.2) performs an initial "entropy coding" of the input symbols during the transient period of the probability estimate, improving the final performance.

Finally, the binarization allows an accurate and efficient design of the context structure. A large number of contexts can potentially model the probability mass function (pmf) of each syntax element very accurately, provided that the estimated binary probabilities have converged. This convergence can prove difficult whenever the input data are diluted over an excessive number of contexts, since this precludes a fast convergence of the estimates. At the same time, a great number of contexts increases the hardware requirements. A prior binarization of the coded symbols helps in detecting the optimal context structure, since those contexts that are related to the least probable bins can be collapsed into a smaller set (see [73]).

Even if the context structure of the CABAC coding engine is well-designed, the probability model for the transform DCT coefficients can be improved. In fact, the context set does not take into account the fact that the amplitudes of coefficients at neighboring frequencies are statistically dependent, and the same holds for coefficients of neighboring blocks at the same frequency.


This statistical dependence can be represented through a proper probability mass function, which can be schematized through a graphical model [133]. Through this model, it is possible to modify the structure of the CABAC coder by estimating the probability of each bin through a Sum-Product approach [51, 68]. The estimated probability value is then used to select the state of the binary coder implemented in the CABAC coding engine.

The chapter is structured as follows. Section 3.2 gives a brief overview of the CABAC coder. Section 3.3 presents the adopted graphical model and how it was implemented. Section 3.4 reports the details of the modified arithmetic coder. Section 3.5 reports the experimental results obtained on a set of different video sequences.

3.2 The Context Adaptive Binary Arithmetic Coder (CABAC)

The H.264/AVC standard includes two different entropy coding algorithms. The first one is a Context-Adaptive Variable Length Code (CAVLC) that uses a fixed VLC table for all the syntax elements except for the transform coefficients, which are coded by adaptively choosing a VLC table among a set of different possible coding tables. The second entropy coder defined within the H.264/AVC standard specification [87] is the Context-Adaptive Binary Arithmetic Coder (CABAC) [73], schematized in Fig. 3.2, which allows a bit stream size typically 10% smaller with respect to CAVLC (see Section 2.2.4).

The encoding process can be specified in three different stages:

1. binarization;

2. context modeling;

3. binary arithmetic coding.

In the first step, a given non-binary valued syntax element is uniquely mapped to a variable-length sequence of bins (called a bin-string). The only exception is the coding of a binary value: in this case no conversion is needed, and the binarization step is bypassed (see Fig. 3.2). In this way, the input symbols for the arithmetic coder are always binary values, independently of the characteristics of the syntax elements. For each binary element, one or two subsequent steps may follow depending on the coding mode. In the so-called regular coding mode, prior to the actual arithmetic coding process, the given binary digit (bin) enters the context modeling stage. According to the syntax element it belongs to, a context and its related probability model are selected and the bin probability is computed. Then the bin value, along with its associated probability, is sent to the binary arithmetic coding engine, which will map them into an interval. After the coding operation, the encoder updates the probability model for the current context. In the following paragraphs we will focus on the operations related to the coding of the transform coefficients.

The coding of DCT data is characterized by the following distinct features:

• a one-bit symbol notifies the occurrence of nonzero transform coefficients in the current

block following the reverse scanning order;


• in case the coefficient is different from zero, an additional couple of bits codes the sign of the coefficient and the flag that indicates whether it is the last coefficient different from zero or not;

• the non-zero levels are then coded, assigning a context to each one of them according to the number of previously transmitted nonzero levels within the reverse scanning path.

The binary coding engine can be modeled via a Finite State Machine (FSM) with 64 states, where each state identifies the probability of the least probable symbol (LPS), i.e. the less probable binary value. A memory table maps each state into a width value for the LPS interval according to the width of the whole interval. In the same way, the transition from one state to another is driven by the correspondence between the coded bit value and the most probable one (MPS). Fig. 3.3 reports a section of the whole FSM with the corresponding state transitions. Each state transition also depends on the state value of the current context, and the admitted

Figure 3.3: Structure of the Finite State Machine related to the CABAC coder. The dashed line reports the state transitions for the standard CABAC coding engine. The solid lines refer to the DAG-based version of CABAC.

transitions are specified through a matrix. After a varying number of coding steps, the length of the coding interval is rescaled in order to keep it greater than a quarter of the full resolution (which is equal to 2^10 = 1024). Each rescaling operation increases the number of bits to be written in the bit stream, and therefore the number of rescalings must be limited in order to keep the bit stream as small as possible. To this purpose, the binary strings that represent the non-null coefficients are coded vertically instead of horizontally (bit plane by bit plane). In this way the interval shrinking is mitigated, since the binary strings associated to the absolute values of levels present a high occurrence of "1"s and most of the times the coding interval is mapped into the MPS sub-interval, i.e. the larger one.
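The 64-state probability model can be illustrated with a short sketch. The geometric spacing of the LPS probabilities between 0.5 and roughly 0.01875 follows the published CABAC design; the transition rule below is a simplified stand-in, since the standard defines its transitions through numeric tables.

```python
# Simplified sketch of a CABAC-like probability state machine.
# The 64 LPS probabilities are geometrically spaced between 0.5 and
# ~0.01875, as in the published CABAC design; the MPS/LPS transition
# rule here is an illustrative simplification, not the standard table.
ALPHA = (0.01875 / 0.5) ** (1.0 / 63)
P_LPS = [0.5 * ALPHA ** sigma for sigma in range(64)]

def next_state(sigma, coded_mps):
    """Move towards a lower LPS probability on an MPS, back up on an LPS."""
    if coded_mps:
        return min(sigma + 1, 62)   # state 62 is the last adaptive state
    return max(sigma - 3, 0)        # illustrative penalty on an LPS
```

The slow per-step state change is exactly why, as noted above, the estimate needs a certain amount of data before it becomes reliable.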

Note that the convergence speed of the probability estimateis limited by the allowed state

transitions and the coder must process a certain amount of data before obtaining a reliable pmf

for each context. This problem can be solved by initializingthe coding contexts in different

ways according to the characteristics of the data that have to be processed. For example, since

quantized coefficients are partially binarized using a unary code, the initial probability mass

functions associated with the contexts for the coefficientshave a mean greater than 0.5. The

following subsection will give a more detailed insight on how the absolute values of DCT

coefficients are converted into binary strings and how contexts are assigned to each bin.


3.2.1 Binarization and context modelling for the absolute values of non-zero coefficients

Figure 3.4: Scheme of contexts for the absolute values of coefficients.

Analyzing the CABAC routines that code the residual data, it is possible to identify two different phases: the coding of the positions of non-zero coefficients (called the significance map) and the coding of their values. Our investigation was focused on the latter phase of the process, and since the signs of non-null coefficients are coded using a non-adaptive uniform pmf, the next section will focus on the binarization and the context modelling for the absolute values of coefficients, as this proves to be the most interesting aspect.

The occurrence of non-null coefficients with an absolute value equal to 1 (called ones) is frequent,2 and therefore the CABAC coder specifies a separate significance map for coefficients that are equal to 1. A binary value signals whether the current non-zero coefficient is a one or not, and 4 possible contexts can be associated with it. For the first three coefficients in the reverse scanning order that occur before the first absolute value greater than one, the context modeller associates three separate contexts. The remaining bins are coded using a fourth binary context, which eventually models the occurrence probability of ones at the lowest frequencies.

Each of the remaining absolute values is then binarized using a unary/Exp-Golomb code, which corresponds to a unary code eventually followed by an Exp-Golomb code, and the coder assigns to each of them a single separate context up to a fixed limit which depends on the block type (e.g. 5 for inter blocks) in order to keep the number of contexts as low as possible. The remaining values are assigned to the final binary context, and each absolute value is then coded vertically using the binary context that has been assigned. Considering that the adopted binarization is a unary code most of the time, each context models the average length of the DCT coefficients that are assigned to it. In this way, the needs for both a good modelization and low complexity are met, but this choice is paid for with a lower compression gain, since the modelling of the probability may prove inadequate. Note that the context assignment does not take into account the spatial frequency of the coefficients but only their ordering in the scanning process, and sometimes this causes statistically-heterogeneous data to be mapped into the same probability model.
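A literal reading of the context-selection rule for the "is this level a one?" bin can be sketched as follows; the function name and the 0-based indexing are illustrative, not the standard's exact derivation.

```python
def greater_than_one_context(num_ones_before, seen_greater_than_one):
    """Context for the 'is this non-zero level a one?' bin.

    Illustrative mapping: three separate contexts for the first three
    levels met in reverse scanning order before any level > 1, and a
    fourth context once a level greater than one has been seen.
    """
    if seen_greater_than_one:
        return 3                       # fourth context
    return min(num_ones_before, 2)     # contexts 0, 1, 2
```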

Despite its limitations, the CABAC architecture proves to be quite efficient with respect to the previous coding engines, since it significantly contributes to improving the coding performance of the H.264/AVC coder with respect to the previous standards. However, the performance can be significantly improved by considering that each coefficient is statistically dependent on the

2 Transform coefficients have a symmetric pdf, monotonically decreasing for positive values (see Section 4.3).


neighboring ones. This dependence can be used to refine the statistical model of the original

CABAC architecture.

3.3 Modeling the contexts using a graph

The basic idea that lies beneath transform coding is to reduce the correlation among different samples of the signal to be coded. Adopting the Karhunen-Loève (KL) transform, for a Gaussian source each transform coefficient is independent from the other ones, since it is computed by projecting the original signal on a different vector from an ad-hoc built orthogonal basis. In addition, it is possible to reproduce the original signal once the transform coefficients (and the relative basis) are known. In this way the signal can be efficiently coded by specifying its KL coefficients, since the intrinsic redundancy of the original data has been removed. Unfortunately, the KL transform must be adapted to the input signal and sent to the decoder in order to convert the decoded coefficients into the original signal domain. In addition, the estimate of the optimal KL transform is a computationally-expensive task and needs to be computed frequently in order to match the varying statistics of the input signal. In most practical approaches, transform coders resort to the Discrete Cosine Transform (DCT), since it has good decorrelating properties for correlated signals and can be efficiently implemented with integer arithmetic. Previous works have shown that the compression performance of the DCT depends on the transform size, and most video coding standards adopt an 8×8-sized DCT since it provides a good trade-off between compression gain and low computational complexity. As mentioned in Chapter 2, H.264/AVC resorts to a sub-optimal 4×4 transform obtained from a DCT (see [35, 36]) that does not require multiplications and whose lower performance can be compensated by an efficient prediction mechanism (see [14]).

The choice of a smaller transform size has some drawbacks on the corresponding signal basis that is used to decompose each 4×4 block, since the coding efficiency of the 4×4 DCT is lower with respect to its 8×8 and 16×16 versions. In this way, the statistics of the

Figure 3.5: Directed Acyclic Graph that models the statistical dependencies between the coefficients in a transform block: (a) dependencies among coefficients; (b) DAG model.

current coefficients partly depend on the statistics of the neighboring coefficients whenever the Manhattan distance between them is lower than two. The statistical dependence between


coefficients at greater Manhattan distances is much lower, and therefore its influence on the resulting pdf is less significant. A similar relation was found for other types of transforms (see [8, 65]), and it can be related to the intrinsic statistical dependence between the energy levels of the signal at neighboring bands. Therefore, we can model this relation with the Directed Acyclic Graph (DAG) G that is reported in Fig. 3.5(b). The DAG G can be specified by a couple of sets G = (V, E), where V denotes the set of the nodes (corresponding in this case to the coefficients) and E is the set of edges (corresponding to the conditioned probabilities).

The model connects each coefficient with the coefficients lying on its left and above, since we can assume that the correlation is horizontally and vertically oriented. In addition to the

Figure 3.6: Dependencies between the coefficients in a macroblock.

dependence among coefficients of the same block, there is also a statistical dependence among coefficients belonging to neighboring transform blocks. This relation is partially used by the CAVLC coder when coding the number of coefficients different from zero in each block. The number of non-zero levels in the current block is predicted by averaging the number of non-zero levels in the upper one and the left one [98]. Once again this can be justified by the transform size: since the blocks are small, some features of the image are correlated for neighboring blocks (e.g. the value of the DC coefficient or the coefficients at certain frequencies). Note that for predicted frames this relation also depends on the different performance of prediction on distinct blocks. For example, in case the motion estimation finds a good estimate for a block and bad ones for its neighbors, the resulting coefficients are very poorly correlated, since in the first block the residual signal is nearly AWGN while the residual signal of the neighboring block is highly correlated with the original image. As a consequence, statistical dependence arises whenever the temporal predictions performed on adjacent blocks are correlated. In this case, the same model of Figure 3.5(b) can be applied considering a group of 16 4×4 blocks, and we can associate a separate DAG to each frequency in the transformed signal (see Fig. 3.6). The choice of considering a 4×4 group of 4×4 blocks was made in order to consider the statistical dependencies within a macroblock.
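The parent set π_s of the DAG in Fig. 3.5(b), i.e. the left and upper neighbors of each node in a 4×4 grid indexed in raster order, can be sketched as follows; the helper name is illustrative.

```python
def parents(s, width=4):
    """Return the parent set pi_s for node s (raster-scan index 0..15)
    in the DAG of Fig. 3.5(b): the left neighbor and the upper neighbor,
    when they exist inside the 4x4 grid."""
    row, col = divmod(s, width)
    pi = []
    if col > 0:
        pi.append(s - 1)        # left neighbor (x_{s,B})
    if row > 0:
        pi.append(s - width)    # upper neighbor (x_{s,A})
    return pi
```

Node 0 has no parents, border nodes have one, and all interior nodes have two, which matches the factorization used below.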

According to the statistical dependences modelled by the graph in Figure 3.5(b), the joint probability mass function (pmf) for a block of coefficients (or for a grid of coefficients at the


same position in different transform blocks) can be factorized into conditional pmfs as follows

p(x) = p(x_0) · p(x_1, x_4 / x_0) · p(x_2, x_5, x_8 / x_1, x_4) · p(x_3, x_6, x_9, x_12 / x_2, x_5, x_8) · p(x_7, x_10, x_13 / x_3, x_6, x_9, x_12) · p(x_11, x_14 / x_7, x_10, x_13) · p(x_15 / x_11, x_14)
     = p(x_0) · ∏_{s ∈ V, s ≠ 0} p(x_s / x_{π_s})        (3.1)

where x is the vector that reports the value of each coefficient x_i, i ∈ V. The set π_s contains the coefficients adjacent to s,

π_s = {t ∈ V : (t, s) ∈ E} = {x_{s,A}, x_{s,B}}        (3.2)

where x_{s,A}, x_{s,B} are respectively the upper and the left levels for the current coefficient. In case one or both of the adjacent coefficients are not available, we assume x_{s,A} and x_{s,B} undefined.

The factorization is possible since each pair of variables that have a common parent is conditionally independent given the parent itself. As a consequence, it is trivial to verify that all the nodes lying on each diagonal are conditionally independent given the nodes on the previous diagonal. The probabilistic relations expressed by the DAG G can also be applied to the bit planes that are obtained by slicing horizontally the binary representation of the block of coefficients.

It can be seen that the bits b^k_s of the k-th bit plane, where

x_s = Σ_{k=0}^{15} b^k_s · 2^k        (3.3)

are related according to the equation

p(b^k) = p(b^k_0) · ∏_{s ∈ V, s ≠ 0} p(b^k_s / b^k_{π_s}),        (3.4)

with

p(b^k_s / b^k_{s,u}) = Σ_{x_s = 0, ..., 2^15 : the k-th bit of x_s is b^k_s}  Σ_{x_{s,u} = 0, ..., 2^15 : the k-th bit of x_{s,u} is b^k_{s,u}}  p(x_s / x_{s,u}) · p(x_{s,u}),    u ∈ π_s and k = 0, ..., 15        (3.5)

(see [68]).3 In this way, we obtain more than one DAG, each of which can be modelled using an Ising

3 Note that in the previous equations we assume that the maximum number of bit planes is 16. The transform reported in Section 2.2.3 is applied on 4×4 residual blocks of 9-bit samples (8 bits for the original sample plus one bit because of prediction), and the maximum amplification performed by the 4×4 transform is 36 (corresponding to 5.17 bits). Therefore, 14.17 bits suffice for representing a transform coefficient, which are rounded up to 16 in the arithmetic of the H.264/AVC coder. For further details, see [35, 36].


model. This probability structure was first introduced by Lenz and Ising in the early 1920s in the field of ferromagnetism [86]. The model has been widely applied to describe cooperative phenomena and, more recently, it has been intensively adopted in statistical image processing for different applications (see [116, 84]). Omitting the index of the bit plane k, it is possible to rewrite the pmf reported in eq. (3.4) as

rewrite the pmf reported in eq. (3.4) as

p(b) = p(b0) ·15∏

s=1

p(bs/πs)

= exp log p(b0) · exp∑15

s=1 log p(bs/πs)

= exp{

θ01 · b0 + θ0

0 · (1− b0)}

·

exp

15∑

s=1

1∑

i,j,z=0

θsABijz · ψsAB

ijz (bs, bs,A, bs,B)

(3.6)

where

θ^0_i = log p(b_0 = i)
θ^{sAB}_{ijz} = log p(b_s = i / b_{s,A} = j, b_{s,B} = z)        (3.7)

and the sufficient statistics are

ψ^a_1(b_0) = b_0
ψ^a_0(b_0) = 1 − b_0
ψ^{sAB}_{000}(b_s, b_{s,A}, b_{s,B}) = (1 − b_s)(1 − b_{s,A})(1 − b_{s,B})
ψ^{sAB}_{001}(b_s, b_{s,A}, b_{s,B}) = (1 − b_s)(1 − b_{s,A}) b_{s,B}
ψ^{sAB}_{010}(b_s, b_{s,A}, b_{s,B}) = (1 − b_s) b_{s,A} (1 − b_{s,B})
ψ^{sAB}_{011}(b_s, b_{s,A}, b_{s,B}) = (1 − b_s) b_{s,A} b_{s,B}
ψ^{sAB}_{100}(b_s, b_{s,A}, b_{s,B}) = b_s (1 − b_{s,A})(1 − b_{s,B})
ψ^{sAB}_{101}(b_s, b_{s,A}, b_{s,B}) = b_s (1 − b_{s,A}) b_{s,B}
ψ^{sAB}_{110}(b_s, b_{s,A}, b_{s,B}) = b_s b_{s,A} (1 − b_{s,B})
ψ^{sAB}_{111}(b_s, b_{s,A}, b_{s,B}) = b_s b_{s,A} b_{s,B}        (3.8)

(see [120, 121]).

Given a set of observations {x^1, x^2, ..., x^M}, where x^j = [x^j_s]_{s ∈ V} ∈ Z^n for all j = 1, ..., M with n = |V|, it is possible to extract the sets {b^{1,k}, b^{2,k}, ..., b^{M,k}} that include the vectors

b^{j,k} = [b^{j,k}_s]_{s ∈ V}        (3.9)

where b^{j,k}_s is the k-th bit of x^j_s.


Omitting the bit plane index k, it is easy to check that the log-ML estimates of the moments μ^{sAB}_{ijz} and μ^a_i,

μ^{sAB}_{ijz} = arg max_{μ^{sAB}_{ijz}} (1/M) Σ_{m=1}^{M} log p(b^m / μ^{sAB}_{ijz})

μ^a_i = arg max_{μ^a_i} (1/M) Σ_{m=1}^{M} log p(b^m / μ^a_i),    i, j, z = 0, 1,        (3.10)

are

μ^{sAB}_{ijz} = E[ψ^{sAB}_{ijz}] = p(b_s = i / b_{s,A} = j, b_{s,B} = z) = (1/M) Σ_{m=1}^{M} ψ^{sAB}_{ijz}(b^m_s, b^m_{s,A}, b^m_{s,B})

μ^a_i = E[ψ^a_i] = p(b_s = i) = (1/M) Σ_{m=1}^{M} ψ^a_i(b^m_s),    i, j, z = 0, 1        (3.11)

(see [51]).
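Since the sufficient statistics of eq. (3.8) are indicator functions, the moment estimates of eq. (3.11) are just relative frequencies over the observed bit vectors. A minimal sketch, with hypothetical helper names and parents indexed as upper/left neighbors in raster order:

```python
def estimate_moments(bit_vectors, s, a_idx, b_idx):
    """Empirical moments for node s from M observed bit planes.

    bit_vectors: list of M lists of 16 bits (one 4x4 bit plane each).
    s, a_idx, b_idx: raster indices of the node and of its upper (A)
    and left (B) parents. Returns the 8-entry table mu[(i, j, z)] of
    joint relative frequencies; normalizing over i, as in eq. (3.12),
    yields the conditional estimates of eq. (3.11).
    """
    M = len(bit_vectors)
    mu = {(i, j, z): 0.0 for i in (0, 1) for j in (0, 1) for z in (0, 1)}
    for b in bit_vectors:
        mu[(b[s], b[a_idx], b[b_idx])] += 1.0 / M
    return mu
```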

Note that in this case the normalizing conditions are

Σ_{i=0}^{1} μ^{sAB}_{ijz} = 1        Σ_{i=0}^{1} μ^a_i = 1.        (3.12)

The application of the Ising model to each bit plane of the coefficient blocks proves to be an efficient solution, since it simplifies the equations for the log-ML estimate. However, the CABAC coding engine codes each coefficient vertically, since the high occurrence of "1"s improves the coding performance. Therefore, it is possible to apply the DAG model in a different way. In fact, as the CABAC coder associates only one binary context per coefficient, it is possible to do the same with the DAGs, using only one graph to model the statistical relation between the coefficients. Remember that, since the coded binary strings belong to a unary code, each context models the average value of each coefficient, which corresponds to the average value for each non-zero level. Therefore, the Ising model represents the relation between the average values of coefficients placed in different spatial positions.

The following section will show how the binary model can be used in the arithmetic coder.

3.4 A Sum-Product based arithmetic coder

The previous section has proposed a probability model that can be used to characterize the

probability of the different bit planes for a block of transformed coefficients. Therefore, an

interesting application to investigate is its inclusion into a binary arithmetic coder.

In the CABAC coder, the probability of each binary value is associated with the state of

a Finite State Machine, and the transition from one state to another is fixed by a transition


matrix. One of the disadvantages of this model is that the probability is correctly estimated only after coding a certain amount of data, since the convergence speed of the probability is limited by the transitions allowed from each state. In addition, the statistics estimate does not take into account either the position of the DCT coefficient in the transform block or the values of the neighboring pixels, but it performs a simple estimation of the probability for each bit.

Graphical models allow a better probability estimation using a Sum-Product algorithm

along the edges of the DAG structure. In the following subsections, the whole encoding process

is presented.

3.4.1 Probability modelling through DAGs

At first, the encoder creates a 4×4 matrix of coefficients either belonging to the current block or positioned at the same frequencies in different neighboring blocks. In this work, the first approach will be denoted as DAGB (DAG on a Block), while the second will be called DAGMB (DAG on a MacroBlock). In both cases, each coefficient is considered as a node in a graph structure that models the statistical relations with its neighbors and allows the estimate of a probability model.

In our first approach (see [78]), we associate a distinct binary DAG with each bit plane in order to model the statistical dependence among the bits. No specific binarization is applied in this case, and the coded binary strings consist in the binary representations of the coefficients. The most significant bit planes are not coded using the DAG model. In fact, the bins to code are few and sparse, since the number of high-energy coefficients is low. Therefore, there is no need to apply the DAG model to these bits because it would provide only a small improvement. In the implemented algorithm, only the 5 least significant bit planes were coded using the DAG scheme, while the remaining bits were coded using only one probability model per bit level, since increasing the number of DAG-modelled bit planes does not improve the performance. This distinction can be schematized as in Fig. 3.7. This model is able to estimate the statistics for each bit plane but proves to be computationally demanding, since the estimate of the bit probability must be repeated for each bit plane. At the same time, the storage of the conditional probabilities requires a great amount of memory, since their values may change according to the significance of the bits. Therefore, a better implementation can be obtained by considering the binarization performed on the transform coefficients.

The CABAC implementation defined by the H.264/AVC standard converts the absolute values x_k of the transform coefficients into VLC strings using a unary code followed by an Exp-Golomb code in case x_k − 2 > 12. At first the encoder signals whether x_k is greater than one or not. In case it is, the binarization unit specifies a binary string of x_k − 2 digits equal to 1, followed by an ending digit equal to 0. In case the value of x_k − 2 is greater than 12, the first 13 "1" digits are followed by an Exp-Golomb code, as specified in [73]. Therefore, for a given binarized coefficient, the probability that a binary digit is equal to one is deeply related to the expected value of that coefficient (at least for levels lower than 13). In this way it is possible to reduce the multiple DAGs in Figure 3.7 into only one graph. Note that in this case the graph models the relation between the probabilities P(b_s = 1) at different spatial frequencies. In this way, the moments of the Ising model characterize the statistical

Page 56: Source and Joint Source-Channel Coding for Video ... · Coding for Video Transmission over Lossy Networks Coordinatore: Ch.mo Prof. Silvano Pupolin ... l’introduzione di applicazioni

34 Chapter 3. Probability-Propagation Based Arithmetic Coding

Figure 3.7: Distinction between bit planes coded using the DAG probability model and bitplanes coded using the traditional CABAC scheme. In the depicted example, the 3 leastsignificant bit plane were coded using the DAG model, while the upper bits were coded usingone context per bit plane.

dependence between the average absolute values of coefficients placed at different positions in

the graph.
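The binarization just described can be sketched in Python; the exact Exp-Golomb offset applied after the 13 leading ones is our assumption here, not the normative H.264/AVC rule:

```python
def exp_golomb0(v):
    """Order-0 Exp-Golomb codeword for v >= 0: (len-1) zeros, then binary(v+1)."""
    code = bin(v + 1)[2:]
    return [0] * (len(code) - 1) + [int(c) for c in code]

def binarize_level(x):
    """Sketch of the unary + Exp-Golomb binarization described above for a
    level x >= 1: a flag for x > 1, then x-2 ones with a terminating 0;
    once 13 ones have been emitted, an Exp-Golomb suffix follows
    (the suffix argument x - 2 - 13 is an assumption)."""
    bins = [1 if x > 1 else 0]
    if x <= 1:
        return bins
    if x - 2 <= 12:
        return bins + [1] * (x - 2) + [0]
    return bins + [1] * 13 + exp_golomb0(x - 2 - 13)
```

For instance, a level of 3 maps to the string 1, 1, 0: the "greater than one" flag, one unary digit, and the terminating zero.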

Coding operations can be divided into two steps: the estimation of the probability for the

current bit and its arithmetic coding.

3.4.2 Estimation of the bit probability

Given the bit plane b made of the bits b_i, with i = 0, 1, . . . , 15, we associate with b the probability mass function p(b), which can be factorized as reported in eq. (3.1). The sum-product algorithm is run from position (0, 0) and scans the bits according to the zig-zag path defined by the H.264 standard. The zig-zag ordering was chosen since it is the scanning order of the DCT coefficients and examines the low-frequency coefficients first.4

After zig-zag scanning, the sequence of bits can be represented by a vector b = [b_0, b_1, . . . , b_15], and the sum-product algorithm (see [68, 121]) is run following this ordering. For each node b_i, i = 0, . . . , 15, the algorithm stores the probability value p(b_i), which is computed as

p(b_s = i) = \sum_{j,z=0}^{1} \exp\left(\theta^{sAB}_{ijz}\right) \, p(b_{s,A} = j)\, p(b_{s,B} = z), \qquad p(b_0 = i) = \exp\left(\theta^{0}_{i}\right)    (3.13)

where s = 1, . . . , 15. If the bit b_s lies on the border of the graph, only one predecessor affects its value.
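A minimal numeric sketch of eq. (3.13), with hypothetical θ values and the normalization made explicit:

```python
import math

def node_probability(theta, p_a, p_b):
    """Belief for bit b_s as in eq. (3.13): combine the predecessors'
    pmfs p_a, p_b through the Ising parameters theta[i][j][z], then
    normalize the result to a pmf. The theta values are hypothetical."""
    belief = [0.0, 0.0]
    for i in (0, 1):
        for j in (0, 1):
            for z in (0, 1):
                belief[i] += math.exp(theta[i][j][z]) * p_a[j] * p_b[z]
    total = belief[0] + belief[1]
    return [belief[0] / total, belief[1] / total]
```

With all θ equal to zero, the predecessors carry no information and the belief degenerates to the uniform pmf, as expected.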

In our implementation we first used a floating-point estimate of the probability p(b_t) related

4 The algorithm could be run along an arbitrary causal path, e.g. a raster scan.


to the t-th bit of the bit plane, which was used to initialize the context before coding the current bit. In the on-line estimation of the probabilities, eq. (3.13) is modified into a recursive form (as reported in the following section).

A computationally-efficient implementation requires the adoption of fixed-point arithmetic in the DAG modelization. Following the same approximations as the CABAC algorithm, it was possible to implement the whole architecture in fixed-point arithmetic, associating the conditional probabilities with a set of binary contexts. At the beginning of each slice, contexts are initialized according to the corresponding conditional probabilities, which have been estimated from a training sequence. Throughout the coding process, probability estimation and propagation are performed using the same FSM structure adopted by the CABAC coder (and reported in [73]). Experimental results prove that the approximation is feasible in this case too, since the coding performance is not significantly affected.

Probability estimation significantly affects the final performance of the arithmetic coder. As reported in the first sections of this chapter, the estimated probability value is associated with the width of the coding interval. In the CABAC algorithm, a rescaling is needed whenever the width of the coding interval is lower than one fourth of the full resolution; it can be described as follows

while (range < QUARTER)
{
    if (low >= HALF)
    {
        write_one_bit(1);
        low -= HALF;
    }
    else if (low < QUARTER)
    {
        write_one_bit(0);
    }
    else
    {
        Ebits_to_follow++; /* bits to be written later */
        low -= QUARTER;
    }
    low <<= 1;
    range <<= 1;
}

where low is the lower bound of the coding interval and range its width. Ebits_to_follow counts the bits that have to be written in addition to the current one whenever write_one_bit() is called. It is possible to notice that the number of bits written in the bit stream strictly depends on the number of rescaling operations, i.e. on the number of binary symbols processed before an expansion of the coding interval is performed. Whenever the bit source presents a highly biased probability distribution (like the output of a unary code), it


is possible to code a large number of binary symbols before a rescaling operation, provided the estimated pmf is close enough to the real one. In that case the shrinking of the coding interval is limited, since most of the time the coded binary digit equals the MPS and its corresponding sub-interval is large. Therefore, a good probability estimate "delays" the rescaling (i.e. the addition of bits to the stream), improving the compression efficiency.
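The rescaling loop above can be mirrored in a runnable sketch that also makes the deferred straddle bits explicit (the variable names are ours, not the reference software's):

```python
def renormalize(low, rng, quarter, bits, follow=0):
    """Expand the coding interval until its width is at least `quarter`
    of the full resolution, as in the loop above. `bits` collects the
    output; `follow` counts pending straddle bits, flushed here as the
    complement of the next decided bit."""
    half = 2 * quarter
    while rng < quarter:
        if low >= half:
            bits.append(1); bits.extend([0] * follow); follow = 0
            low -= half
        elif low < quarter:
            bits.append(0); bits.extend([1] * follow); follow = 0
        else:
            follow += 1            # interval straddles the midpoint: defer
            low -= quarter
        low <<= 1
        rng <<= 1
    return low, rng, follow
```

The larger rng stays between calls (i.e. the better the probability estimate), the fewer iterations run, which is exactly the "delayed rescaling" effect described above.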

The following section will show how this probability estimate makes it possible to improve the performance with respect to the simple context update and initialization of CABAC. The estimation consists of two crucial phases: the context initialization and the probability update during the coding process.

3.4.3 Context initialization

In the binary arithmetic coder, the FSM is set to a state where the dimension of the interval is proportional to the estimated probability. In the original version of CABAC, the state depends on the previous state and on the binary symbol that has just been coded: if the coded bit equals the most probable bit for the current context, the state index is increased by one; if the bit equals the least probable one, the state index is decreased by an integer that may vary in the range [1, 3] according to its current value (Figure 3.3 reports a graphical representation of the algorithm). In the modified version, the probability is computed independently of the state probability of the previous iteration. In order to reuse the quantized intervals of CABAC, we need a rule that maps a probability value to the inner state of the FSM. In the standardization process of H.264/AVC, the FSM states [73] were computed considering the probability values

p(state) = p_0 \, \alpha^{state}    (3.14)

where p_0 = 0.5, state = 0, . . . , 63, and \alpha = (0.01875/0.5)^{1/63}, so that the p(state) values lie in the interval [0.01875, 0.5]. According to eq. (3.14), it is possible to map the probability value p(b_t) to the state state_init

state_init = \left\lfloor \frac{\log_2 \left( p(b_t)/0.5 \right)}{\log_2 \alpha} \right\rfloor    (3.15)

as Fig. 3.8 depicts. This initialization of the contexts proves to be really effective in terms of

coding performance.
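Equations (3.14)-(3.15) can be sketched numerically; the clamping of p into the representable range [0.01875, 0.5] is our assumption to keep the state index valid:

```python
import math

P0 = 0.5
ALPHA = (0.01875 / 0.5) ** (1.0 / 63)   # so p(63) = 0.01875, as in eq. (3.14)

def state_init(p):
    """Map an estimated probability p to an FSM state index by inverting
    p(state) = P0 * ALPHA**state (sketch of eq. (3.15)); the clamp keeps
    the result within 0..63."""
    p = min(max(p, 0.01875), 0.5)
    return int(math.log(p / P0) / math.log(ALPHA))
```

A probability of 0.5 maps to state 0, and smaller probabilities map monotonically to higher state indices.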

A wrong initialization requires a certain number of adaptation iterations before the state of the arithmetic coder fits the statistics of the coded data. During this transient period, probabilities are mapped into intervals of inappropriate widths, which may turn out too wide or too narrow. As a drawback, the number of rescalings during encoding may increase, leading to an excessive number of bits written in the final bit stream.

This fact is particularly evident in the statistics updating of the binary contexts, which must closely match the statistics of the input data. The following section provides a detailed description of the statistics update process.


Figure 3.8: Structure of the modified Finite State Machine in the new arithmetic coder.

3.4.4 Statistics update

In the floating-point approach, the update operation was performed via the moment estimate

\mu^{0}_{i} \leftarrow \alpha \cdot \mu^{0}_{i} + (1 - \alpha) \cdot \psi^{0}_{i}(x_0)
\mu^{sAB}_{ijz} \leftarrow \alpha \cdot \mu^{sAB}_{ijz} + (1 - \alpha) \cdot \psi^{sAB}_{ijz}(x_s, x_{s,A}, x_{s,B})
given i, j, z = 0, 1 and s = 1, . . . , 15    (3.16)

where x_s is the actual value of the coded bit b_s. Equation (3.16) implements the MAP estimate of eq. (3.11) using a recursive average. After updating the moment, the context state is reinitialized using eq. (3.15) in order to translate a floating-point probability value into the arithmetic of the binary coder.
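The recursive average of eq. (3.16) amounts to an exponentially weighted update of each moment; a sketch with a hypothetical forgetting factor:

```python
class ContextMoment:
    """One conditional moment mu, updated as in eq. (3.16):
    mu <- alpha * mu + (1 - alpha) * psi, where psi is the 0/1 indicator
    observed for the coded bit (alpha = 0.95 is a hypothetical value)."""
    def __init__(self, mu0=0.5, alpha=0.95):
        self.mu = mu0
        self.alpha = alpha

    def update(self, indicator):
        self.mu = self.alpha * self.mu + (1.0 - self.alpha) * indicator
        return self.mu
```

After a long run of identical bits the moment converges toward that bit value, while alpha controls how quickly old observations are forgotten.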

In the fixed-point implementation, the context update coincides with the one performed by the standard CABAC encoder: each conditional probability is updated using the FSM approximation defined in CABAC. In this case, the DAG-based probability estimate changes the initial state of the context, but it does not affect its evolution while coding the current coefficient. The computational complexity is significantly reduced and equals that of the original CABAC algorithm in terms of coding and context-updating operations. However, in our case the number of adopted contexts is larger, requiring more memory to store the corresponding binary pmfs.

3.4.5 Reduction of the number of contexts

The approach described so far involves a significant number of conditional probabilities (i.e. coding contexts) to be modelled. As a drawback, a high number of coding contexts dilutes the statistical samples that are used to estimate the binary pdfs, and the required memory area increases. Therefore, in order to make this approach feasible, the number of coding contexts is reduced by adopting some approximations.

For each coded coefficient, the DAG-based model chooses among four different contexts


depending on the values of the upper and left pixels. Therefore, when we model the probabilities taking advantage of the dependencies among coefficients in the same block (block-based DAG), the number of required contexts is 4 · 16 = 64, since we use different contexts for different spatial frequencies. In the other case, the number increases up to 4 · 16 · 16 = 1024, since we must also take into account which 4×4 block of the macroblock we are considering. These numbers can be easily reduced using some approximations.

A first approximation regards the position of the current coefficient in the transform block. An accurate model must vary the conditional probability according to the frequency of the coefficient, but different contexts can be collapsed into one for coefficients at different frequencies that exhibit the same behavior. This fact is already exploited in the original CABAC algorithm, since the context assignment is performed according to the order of non-zero coefficients in the zig-zag scan. Considering a reverse scanning order (from non-zero levels at high frequencies to non-zero levels at low frequencies), the context modelling unit of CABAC assigns a different context to the first consecutive levels equal to one and to the first 4 consecutive levels greater than one; the following levels share the last context. In the same way, it is possible to adopt this simplification for the DAG-based model, assigning contexts to each non-zero coefficient in the reverse scanning order up to a last, shared context. The total number is thus reduced to 21 for the block-based DAG and to 21 · 16 = 336 for the macroblock-based DAG.

Moreover, in the MB-based DAG it is possible to assume that the statistical dependence among neighboring blocks is independent of the position of the current block in the MB. Therefore, the MB-based DAG can reduce its number of contexts down to 21, as in the case of the block-based DAG.

The coding performance can be further increased by differentiating the contexts according to the energy of the residual signal. In this case, it is possible to adopt different sets of contexts according to the maximum number of bits that need to be coded for a coefficient in the current block.

3.5 Experimental results

The evaluation of this probability estimate has been performed by coding different sequences using the CABAC scheme previously described. In our experimental tests, we considered the following statistical dependences: those that link the coefficients within the same block (DAGB), and those that link the coefficients of neighboring blocks (DAGMB). Both schemes are used to code a set of heterogeneous sequences after a training phase intended to initialize the conditional probability values. In the training phase, we used the sequence mobile, since it presents many features that can be found in other sequences. Tests were done on sequences with different resolutions in order to evaluate the effectiveness of these methods at different spatial resolutions. In our approach we avoided any rate-distortion optimization in order to prevent the optimization algorithm from affecting the resulting performance. At the same time, no partitioning of the motion-compensated macroblocks is applied, since motion estimation can produce residual signals with different energies within the same macroblock. The coding performance of the DAGMB-based CABAC can be affected


by this choice, since blocks turn out less correlated when the prediction efficiency differs (i.e. a block that represents a portion of background is more easily predictable than one that contains a new element in the scene or an object moving with non-translational motion). In our first implementation the algorithm is applied to P frames only, without adopting any binarization (i.e. using the simple binary representation of the absolute values of the coefficients) [78]. In the second architecture presented here, we adopted the same binarization used in the CABAC algorithm, since it allows the assignment of only one context per coefficient without resorting to different DAGs on different planes. Both algorithms were implemented on the reference software JM 9.5 and are compared with the standard CABAC algorithm as defined in the H.264/AVC specification [87].

At first, we evaluated the capability of estimating the binary pdf for each coded bit in the CABAC engine. The performance of each algorithm was evaluated considering the average symmetric Kullback-Leibler divergence between binary distributions p and q

D(p \| q) = \sum_{b=0}^{1} p(b) \log\left(\frac{p(b)}{q(b)}\right) + \sum_{b=0}^{1} q(b) \log\left(\frac{q(b)}{p(b)}\right).    (3.17)
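Eq. (3.17) is straightforward to compute for binary pmfs; a small sketch:

```python
import math

def sym_kl(p, q):
    """Symmetric Kullback-Leibler divergence of eq. (3.17) between two
    binary pmfs p = [p(0), p(1)] and q = [q(0), q(1)] (both nonzero)."""
    return sum(p[b] * math.log(p[b] / q[b]) +
               q[b] * math.log(q[b] / p[b]) for b in (0, 1))
```

The symmetrization makes the measure independent of which distribution plays the role of the reference, which is convenient when comparing an estimated pmf against an empirical one.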

In this work, the symmetrized divergence D(p ‖ q) is taken as a measure of the effectiveness of the estimating algorithm. In our simulations, we compared the binary pdf assigned to each context with the binary distribution estimated for the current frame. The average Kullback-Leibler divergence was computed for different sequences with different quantization parameters. Figure 3.9 shows how the DAGMB and the DAGB approaches are able to provide a better estimate of the binary statistical distribution for each context. It is possible to notice that the DAG-based estimators reduce the divergence to one third of the divergence obtained with the original CABAC estimator. Note also that for sequences with many details and non-translational motion (like mobile and table) the original CABAC estimator presents a higher divergence under strong quantization. This is mainly due to the fact that the coefficient statistics are highly varying and the CABAC estimator cannot adapt quickly to changes. On the contrary, the DAG-based model is able to tune the context structure properly, providing a precise estimate of the binary pmfs. The DAGB approach turns out less effective than the macroblock-based approach, since the statistical dependence of the coefficients is lower within the same block, depending on the coded sequence. This phenomenon is more evident for sequences with many small details, which enhance the difference in the performance of the two algorithms (see results for mobile in Fig. 3.9(b) and in Fig. 3.10(e)). It is also possible to notice that the mismatch between the performances of the different algorithms depends on the quantization parameter QP as well. Strong quantization increases the introduced distortion, which alters the statistical distribution of the coefficients, since the performance of motion compensation varies more across blocks. The performance of the DAG-based estimator is most affected in those sequences that obtain a high compression gain thanks to motion compensation, like foreman: Fig. 3.9(a) shows that the performance of the estimators is more affected by the quantization parameter than for the other sequences.

Finally, we compared the performance of different algorithms in terms of PSNR vs. rate.

We considered fixed-point implementations of the DAGB and DAGMB algorithms that take


Figure 3.9: Coding results for different QCIF sequences at 30 frame/s. Panels: (a) foreman, (b) mobile, (c) container, (d) table; each plots the average Kullback-Leibler divergence vs. QP for CABAC-DAGMB, CABAC, and CABAC-DAGB.


advantage of multiple FSMs to update the contexts. This implementation adopts the same arithmetic and approximations that the original CABAC adopts for the probability, but the context structure is changed. The computational complexity in terms of arithmetic operations is the same as that of the CABAC algorithm, although the modified approach requires an increased number of contexts (i.e. more memory to store the binary pmfs).

Although the two DAG-oriented algorithms perform differently in terms of final average divergence, the compression gain obtained by their application in the CABAC structure does not vary significantly. Figures 3.10 and 3.11 report the results obtained for different video sequences. The adopted GOP structure is IPPP, and the DAG models were adopted for Inter macroblocks coded with a single motion vector. It is possible to notice that the best performance provides a bit stream reduction of about 10% (for the container sequence), while the compression gain can be lower for the other sequences. The performance of the two DAG-based algorithms is quite similar, since the reduction of the number of coding contexts does not allow a finer estimation of the probability.

However, the adoption of the DAG model makes it possible to increase the PSNR by 0.5 dB at low bit rates and by up to 1 dB at high bit rates. For sequences at higher resolutions (see Fig. 3.11) the compression gain is lower, since the larger amount of data available for the estimation of the context probabilities allows the original CABAC coder to improve the accuracy of the statistical modelling operated by the context structure. In this case, the ratio between the average Kullback-Leibler divergence of the original CABAC and that of the DAG-based algorithms is lower and, therefore, the difference in performance is less evident.

3.6 Summary

This chapter described the CABAC arithmetic coding architecture and how its performance can be improved by changing the probability estimation algorithm. The improvement comes from modelling the probability of the absolute value of each coefficient using a directed graph. The statistics of each coefficient are thus estimated considering the conditioning of its neighbors in the graph. Two dependence structures are considered: the first one includes all the coefficients in a 4 × 4 transform block, while the other considers coefficients at the same spatial frequency belonging to neighboring 4 × 4 blocks. However, the number of possible values that each coefficient may assume makes the adoption of a model based on integer values prohibitive, and therefore the transform coefficients are converted into binary strings. In this way, the statistical dependence to be estimated involves binary variables and it is possible to model it using an Ising model. Whenever a binary symbol is coded, a Probability-Propagation algorithm is run on the corresponding graphical model, estimating a probability value that is used to initialize the CABAC contexts. Experimental results show that the adoption of the graphical models improves the coding performance of the CABAC algorithm by estimating the binary probabilities more accurately and by avoiding the transient periods required by context updating in the original coder. In fact, this limits the interval shrinking at each coding step, reducing the number of bits written in the coded stream. The graph structure is more effective when used to model the relation between coefficients of different


Figure 3.10: Results for different QCIF sequences at 30 frame/s. Panels: (a) foreman, (b) news, (c) container, (d) mother, (e) mobile, (f) table; each plots PSNR (dB) vs. rate (kbit/s) for CABAC-DAGMB, CABAC-DAGB, and CABAC.


Figure 3.11: Results for different CIF sequences at 30 frame/s. Panels: (a) foreman, (b) mobile, (c) table, (d) football; each plots PSNR (dB) vs. rate (kbit/s) for CABAC-DAGMB, CABAC-DAGB, and CABAC.


blocks, thanks to their higher correlation. The obtained bit stream reduction is approximately equal to 10%; equivalently, the obtained quality increment for a given bit rate varies between 0.5 and 1 dB. Future work will include modelling the probability as a mixture of DAGs depending on a set of parameters.


Chapter 4

Rate control algorithms for H.264

“We must cut our coat according to our cloth, and adapt ourselves to changing circumstances”

W. R. Inge

“Change is inevitable - except from a vending machine.”

Robert C. Gallagher

This chapter deals with the problem of controlling the bit rate produced by the H.264/AVC coder. Given a certain available bandwidth, the rate control unit has to fit the coded bit rate within the transmission constraints while maximizing the quality of the sequence reconstructed at the decoder. This goal can be achieved by appropriately modifying the coding parameters (like the quantization step, the coding mode, etc.), which must be tuned according to the input statistics. In case of time-varying channels, the control algorithm must be flexible enough to adapt the parameters of the video coder to the modified channel conditions. The chapter presents a control approach based on an accurate modeling of the bit rate. This characterization is possible by analyzing the produced bit rate as a function of the percentage of quantized null coefficients and the energy of the quantized residual signal. The proposed approach provides an effective control at a low computational cost.

4.1 Introduction

During the last decades the communication world has shown an increasing interest in the transmission of video sequences over a heterogeneous set of networks for a wide variety of different applications. The main aim is to provide multimedia services to each terminal without constraining its mobility or autonomy while granting a certain Quality of Service (QoS). According to these requirements, wireless communications prove to be the most suitable way to distribute multimedia content and allow video communication in all environments. However, the characteristics of radio channels have also brought the need for rate control algorithms that allow controlling the encoding parameters in both a flexible and efficient way. In fact, we can identify some basic features that a rate control must satisfy to be suitable for wireless video transmission [3]. A first requirement is low computational complexity, since some video terminals might have limited hardware resources or a finite power supply. The time-varying nature of wireless channels also implies that rate control algorithms must be flexible and must quickly


adapt the coding parameters to the changing transmission capacity. Finally, the algorithm must show good compression efficiency, allowing good visual quality of the reconstructed sequence at the decoder despite a limited available bandwidth. Focusing on these essential demands, we investigated a flexible low-complexity rate control algorithm that maximizes the video quality of the coded sequence while respecting the bandwidth constraints.

In the literature it is possible to find different solutions to this problem, and for all of them the key issue is to model the statistics of the coded data. The papers [37] and [38] present an efficient model for the bit rate produced by a transform video coder, based on the percentage ρ of null quantized transform coefficients (called zeros). These algorithms prove to be very efficient, since they combine a simple structure with sufficient accuracy in controlling the produced bit rate. However, their implementations require either many memory accesses to store the statistics or an accurate probabilistic model for the transform coefficients. Since the latter solution allows for a reduction of the memory requirements, different solutions were proposed to provide a coefficient model that is both simple and sufficiently accurate. Most of the solutions presented in the literature are based on Laplacian and generalized Gaussian models (see [11, 57]). Since the generalized Gaussian model presents some issues in terms of computational complexity, many applications prefer the Laplacian probability density function (pdf), which proves to be both simple and sufficiently accurate for some coding standards. The first application of pdf modeling based on the percentage of zeros was presented in [39].
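Under a Laplacian coefficient model, the percentage of zeros has a closed form, which is what makes the parametric approach cheap; a sketch (the dead-zone half-width is an illustrative assumption, not the exact quantizer of any particular codec):

```python
import math

def rho_from_laplacian(lam, q_step):
    """Fraction of quantized zeros predicted by a Laplacian pdf
    f(x) = (lam/2) * exp(-lam * |x|) when coefficients with |x| < q_step
    quantize to zero: rho = P(|X| < q_step) = 1 - exp(-lam * q_step)."""
    return 1.0 - math.exp(-lam * q_step)
```

The monotone relation between the quantization step and ρ is what lets a controller pick the step that hits a target percentage of zeros.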

In order to obtain high compression performance, we have focused our investigation on one of the most efficient video coding architectures introduced in recent years, the video coding standard H.264/AVC [105]. Thanks to an improved motion estimation technique [124], an efficient entropy coder [73], the adoption of spatial prediction [128, 14], and an improved deblocking filter [71], H.264/AVC provides a higher compression gain than previous coding architectures and places itself among the top candidate video coders for video communications over mobile channels. However, experimental results show that the Laplacian model of [39] is inaccurate in modeling the statistics of the H.264/AVC coefficients. In [53], Kamaci et al. propose a better solution, using a Cauchy probability density function to estimate the rate and distortion in a rate control algorithm. The Cauchy distribution proves to be effective in estimating the coefficient probabilities, but its application in finding the optimal quantization parameter still requires a high computational complexity that makes it unsuitable for low-end devices. A simpler model based on a Laplacian+impulsive pdf can be found in [75]. This solution proves to be effective at low bit rates, and its implementation requires a minimal amount of computation. In our investigation we tried to find a solution that works well for different target bit rates and adapts quickly to frequent variations of the available bandwidth without requiring great computational complexity. The solution was found by introducing into the bit rate model an additional parameter Eq that approximates the energy of the quantized signal. Modelling the bit rate in the joint domain (ρ, Eq) proves to be very efficient in controlling the size of the produced bit stream with respect to the algorithm adopted by the Joint Video Team [59].
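The ρ-domain idea of [37, 38] that underlies this approach can be sketched as follows; the dead-zone zero test and the slope θ are illustrative assumptions, not the papers' exact formulation:

```python
def rho_domain_rate(coeffs, q_step, theta):
    """Estimate the coded bit rate as a linear function of the fraction of
    non-zero quantized coefficients: R ~ theta * (1 - rho), where rho is
    the percentage of "zeros" produced by a dead-zone quantizer sketch
    and theta is a sequence-dependent slope learned from coded frames."""
    zeros = sum(1 for c in coeffs if abs(c) < q_step)
    rho = zeros / len(coeffs)
    return theta * (1.0 - rho)
```

Because ρ grows monotonically with the quantization step, inverting this relation gives a cheap way to select the step that meets a rate budget.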

In Section 4.2 the "zeros" parameterization is presented. The size of the coded image can be linearly related to the percentage of null quantized transform coefficients (called ρ as in


[37]). Both temporally and spatially predicted pictures provide strong experimental evidence

for the accuracy of this model. Section 4.3 describes how a parametric model can replace

the storage of the transform coefficients histogram and to take advantage of H.264 internal

parameters in order to estimate the parameters of the coefficients probability density function.

Section 4.5 describes a rate control algorithm based onρ-modeling. The algorithm relates the

quantization step with the target percentage of “zeros” through a parametric function estimated

from previously encoded data. The quantization step is carefully modified while coding the

different macroblocks in order to fit the bit allocation constraints and to provide maximal and

uniform video quality.

Finally, Section 4.6 reports experimental results that compare the “zeros”-based rate control

with the rate control algorithm implemented with the JM7.6 coder. The experimental data show

that ρ-modeling provides better performances both in terms of video quality and in terms of

required computation.

4.2 Rate distortion modeling based on “zeros”

In every rate control algorithm, the key issue is to map the produced bit rate and the distortion of the reconstructed sequence to the encoding parameters in an optimal way. Given a constraint on the available bandwidth, the rate control algorithm must find the set of parameter values that maximizes the visual quality of the reconstructed sequence. The results obtained for this constrained maximization problem depend mainly on the adopted optimization algorithm and the adopted Rate-Distortion model [90]. In most applications, the choice of the optimization algorithm is mainly influenced by the amount of calculation it requires. Whenever computational complexity or encoding delay are not issues, it is possible to adopt very efficient optimization routines that allow the estimation of the optimal set of encoding parameters [15, 54]. Unfortunately, the computational resources of many devices, or the constraints imposed on the encoding time by some applications, restrict the choice of the optimization algorithm to those solutions that need limited complexity and a small amount of memory. In these cases, the main difference lies in the capability of the Rate-Distortion model to characterize the statistics of the encoded signal.

Most rate control algorithms are based on hyperbolic R-D models, where bit rate and distortion are functions of the quantization step (e.g. [112, 41, 60, 59, 61, 56, 96]). Fig. 4.1 shows that, for different coding types (I, P, and B) and images, the rate produced by the H.264 coder is a non-linear function of the quantization step. This model has been adopted in many control techniques, providing a simple approximation of the R-D function and a practical tool to control the quantization parameters. However, this approach can be inefficient in some cases. Whenever there is low spatial correlation (e.g. the encoder is processing a picture with varying characteristics) or the motion compensation is not equally efficient over the whole frame, the estimated R-D model is not suitable for all the regions.

In [37, 38, 39] Z. He et al. present a better solution to parameterize the number of bits produced by a video encoder. In these papers, the size (in bits) and the distortion of a coded image are functions of the percentage of null quantized transform coefficients (called "zeros").

Figure 4.1: Distortion vs. Rate for coded Intra, Inter and B frames. The plot was obtained coding frames 0 (I), 3 (P) and 2 (B) of the foreman sequence and varying QP from 13 to 35; the hyperbolic model is superimposed on the measured points.

Experimental results (see [37, 38, 39]) show that the "zeros" parameterization suits a great number of images better than previous models, since the influence of the input image characteristics is lower. Moreover, this model can be successfully implemented on different transform-based coding standards. As a matter of fact, its application to the emerging video coding standard H.264 is an interesting topic of investigation.

As reported in Chapter 2, the H.264 encoder (sketched in Fig. 2.1) implements a hybrid transform video coder with motion compensation or spatial prediction. Each 4 × 4 block of the current frame is predicted, and the residual error is transformed and quantized (see Section 2.2.3).

After the transform operation, we can store the frequency of each coefficient in a histogram p_x(a). The percentage of "zeros" in the current frame, ρ, can be computed through the equation

$$\rho(\Delta) = \sum_{|a|<\Delta} p_x(a), \qquad (4.1)$$

where ∆ is the quantization step chosen for the picture and p_x(a) represents the fraction of DCT coefficients x equal to a for the current frame. Note that ∆ in H.264/AVC depends on the position of the coefficient, the quantization parameter QP, the macroblock coding type, and the quantization matrix¹ (see Section 2.2.3). However, here we omit the indexes in the notation for the sake of simplicity. Although ∆ may vary within the same block, the coefficients can be stored in a common histogram by rescaling them before quantization.
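As a concrete illustration of eq. (4.1), the "zeros" fraction can be computed directly from an empirical coefficient histogram. The sketch below is illustrative only (the actual encoder obtains ρ from the CAVLC coding routine, as noted later), and the toy coefficient list is invented:

```python
from collections import Counter

def rho_from_histogram(coeffs, delta):
    """Eq. (4.1): fraction of transform coefficients that quantize to zero,
    i.e. whose magnitude falls below the quantization step delta."""
    hist = Counter(coeffs)                      # empirical p_x(a), unnormalized
    zeros = sum(cnt for a, cnt in hist.items() if abs(a) < delta)
    return zeros / len(coeffs)

# Toy residual coefficients, concentrated around zero as in a prediction error.
coeffs = [0, 0, 1, -1, 0, 2, -3, 0, 0, 5]
print(rho_from_histogram(coeffs, delta=2))      # 0.7: seven values with |a| < 2
```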

Figure 4.2 shows plots of the bit rate vs. ρ as QP varies between 15 and 45 for I, P, and B frames of the foreman sequence. From the graph, it is apparent that the picture rate R(ρ) is well represented by a linear function of ρ expressed through the equation

$$R(\rho) = \mu\rho + q, \qquad (4.2)$$

where q is the number of overhead bits that code all the information not related to the DCT coefficients, while µ is the ratio between the percentage of bits that code the transform coefficients and ρ. Similar results are obtained for different kinds of pictures, independently of

¹In our approach, we do not consider the adoption of a quantization matrix for Rate-Distortion optimization. The variations of ∆ within the same transform block are related only to the fact that the matrix is not orthonormal and the coefficients need to be rescaled (see Section 2.2.3).

Figure 4.2: Plots of bit rate vs. ρ for the coded sequence foreman (GOP IP…P, 15 frames), as QP varies between 15 and 45; the plots refer to frame 0 (Intra coded), frame 3 (Inter coded), and frame 2 (B-predicted Inter).

the nature of the prediction (spatial, temporal, or temporal bi-directional). As a matter of fact, it is possible to use the "zeros" parameterization to carefully model the number of bits produced by the H.264 encoder. In order to find the "zeros" percentage for a coded picture, we can avoid using eq. (4.1), as ρ is directly available from the H.264 encoder syntax. Since CAVLC context modeling is based on the percentage of quantized DCT coefficients different from zero, the number of "zeros" is computed at macroblock level by the coding routine itself.

For rate control purposes, the coder needs to relate the target percentage of null quantized coefficients ρ_T to the quantization step ∆. One possible solution is computing ρ through equation (4.1) for every quantization step and choosing the quantization parameter QP that produces the ρ value closest to ρ_T (a bisection-like technique would help to reduce the computations). A second solution approximates the coefficient histogram through a parametric model that makes the estimation of the target value ∆_T faster than the iterative approach.

Since coefficient statistics can differ for pictures of different types, three separate coefficient statistics, for I, P, and B pictures respectively, must be kept.

4.3 Parametric models for H.264 coefficients estimated through activity

Figure 4.3: Histograms of coefficient frequencies for the coded sequence carphone (360 frames coded with GOP IBBP, 60 frames, at 30 frame/s, with QP=30=const): (a) frame 0, type I; (b) frame 18, type P; (c) frame 6, type B; (d) frame 20, type B. Each plot shows the coefficient histogram together with its Laplacian component.

In order to correlate the “zeros” percentage with the quantization step, equation (4.1) needs

the knowledge of the coefficient distribution, which can be provided either by the storage of

coefficient histograms or by a parametric model.

4.3.1 Storing the coefficients histograms

At first, the frequencies of the coefficients were stored in three different histograms, as the quantization step depends on the position of the transform coefficients in the 4 × 4 block. Since the transform described in eq. (2.3) is performed simply with additions and register shifts, the resulting values are integer numbers. The storage of their frequencies requires a large memory area, since each coefficient can be represented with approximately 14.17 bits (see [35, 36]), and in our first approach we kept three different memory vectors. Moreover, given a coefficient distribution and a QP, the computation of the "zeros" percentage from the data of each histogram requires a great number of additions.

In a second solution, all the information was stored in one single histogram. Rescaling each coefficient before counting it in the histogram allows keeping a single histogram and reducing the memory area and the computational effort by a factor of three. However, the requirements are still demanding, since the number of possible coefficient values is 2^{14.17}/4 = 2^{12.17} ≃ 4698, where 4 is the smallest rescaling factor.²

4.3.2 Approximating the coefficients distribution via a parametric model

A parametric model allows for a further reduction of the memory area (see [11]). In a first implementation, the coefficient histogram was approximated with a generalized Gaussian function

$$p_x(a) = \gamma e^{-\beta|a|^{\alpha}} \qquad (4.3)$$

where

$$\beta = \frac{1}{\sigma_x}\left[\frac{\Gamma(3/\alpha)}{\Gamma(1/\alpha)}\right]^{\frac{1}{2}}, \qquad \gamma = \frac{\alpha\beta}{2\,\Gamma(1/\alpha)},$$

and Γ(·) denotes the gamma function.

Equation (4.1) then turns into the integral

$$\rho(\Delta) = \int_{-\Delta}^{+\Delta} p_x(a)\,da. \qquad (4.4)$$

This model provides a good estimate of the coefficient statistics for I and P frames and is fully defined by the parameters α and σ_x. The value of α is computed from m_{|x|} = E[|x|] and σ_x² = E[(x − m_x)²] according to the equation

$$\alpha = F^{-1}\left(\frac{m_{|x|}}{\sigma_x}\right) \qquad (4.5)$$

where F(·) is defined as

$$F(\alpha) = \frac{\Gamma(2/\alpha)}{\sqrt{\Gamma(1/\alpha)\,\Gamma(3/\alpha)}}. \qquad (4.6)$$

Since F(·) is a monotonically increasing function, it is possible to store its values in a table for a given set of α values and use them to approximate the inverse F^{-1}(·). Equation (4.5) can then be implemented through a non-uniform quantizer Q_F[·] that outputs a value α̂ = Q_F[α].
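The table-based inversion of F(·) can be sketched as follows; the α grid and its granularity are illustrative choices, not those of the thesis. As a sanity check, a Laplacian pdf (α = 1) has m_{|x|}/σ_x = 1/√2:

```python
import math

def F(alpha):
    """Eq. (4.6): the ratio m_|x|/sigma_x of a generalized Gaussian."""
    g = math.gamma
    return g(2.0 / alpha) / math.sqrt(g(1.0 / alpha) * g(3.0 / alpha))

# Tabulate F over a grid of alpha values and invert by nearest match,
# mimicking the non-uniform quantizer Q_F[.] described above.
ALPHAS = [0.1 * k for k in range(5, 31)]        # 0.5, 0.6, ..., 3.0
TABLE = [(F(a), a) for a in ALPHAS]

def F_inv(ratio):
    return min(TABLE, key=lambda fa: abs(fa[0] - ratio))[1]

# For a Laplacian pdf (alpha = 1) the ratio m_|x|/sigma_x is 1/sqrt(2).
print(F_inv(1.0 / math.sqrt(2)))
```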

Thanks to this parametric model, the memory requirements are reduced. On the other hand, many calculations are needed to compute the statistical parameters m_{|x|}, σ_x, and ρ(∆) via eq. (4.4). In order to find a faster and less demanding solution, we focused on estimating m_{|x|} and σ_x directly from some parameters of the H.264 syntax. One of these is the activity act(m) of the m-th macroblock, which can be expressed as

$$act(m) = \sum_{x,y=0}^{15} |err_m(x,y)| = \sum_{x,y=0}^{15} \left|I_m(x,y) - \hat{I}_m(x,y)\right|, \qquad (4.7)$$

where I_m(x,y) is the original pixel of the m-th macroblock at position (x,y), Î_m(x,y) is its prediction, and err_m(x,y) is the residual prediction error. In many video coders and rate control algorithms (e.g. [60], [59]), the activity is used as a measure of the coefficient standard deviation σ_x (see Appendix A), since its computation does not imply any multiplication and it can be extracted directly from the encoding process. This replacement is supported by the strong correlation between act(m) and σ_x. In addition, it is possible to estimate m_{|x|} and σ_x through a second-order polynomial expressed as

$$\sigma_x(\overline{act}) = s_0(\rho) + s_1(\rho)\,\overline{act} + s_2(\rho)\,\overline{act}^2, \qquad m_{|x|}(\overline{act}) = m_0(\rho) + m_1(\rho)\,\overline{act} + m_2(\rho)\,\overline{act}^2 \qquad (4.8)$$

with

$$\overline{act} = \frac{1}{N_{MB}} \sum_{m=0}^{N_{MB}-1} act(m) \qquad (4.9)$$

where N_{MB} is the number of macroblocks in the picture. The coefficients s_i(ρ) and m_i(ρ) are tabulated for different values of ρ.

²Further information about the range of coefficient values can be found in footnote 3 at page 30 and in [35, 36].
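For illustration, the sum of absolute differences of eq. (4.7) over one 16×16 macroblock can be computed as below; the flat toy blocks are invented for the example:

```python
def activity(orig, pred):
    """Eq. (4.7): sum of absolute prediction errors over a 16x16 macroblock."""
    return sum(abs(o - p)
               for row_o, row_p in zip(orig, pred)
               for o, p in zip(row_o, row_p))

# Two flat 16x16 blocks differing by 2 at every pixel: act = 2 * 256 = 512.
orig = [[100] * 16 for _ in range(16)]
pred = [[98] * 16 for _ in range(16)]
print(activity(orig, pred))   # 512
```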

However, this approximation does not fit the number of coded bits for B frames, especially at low bit rates, and it is necessary to find a new model that is both simple and sufficiently accurate in matching the coefficient statistics. To this purpose, we adopted a "Laplacian+impulsive" distribution, described by the equation

$$p_x(a) = \alpha'\delta(a) + (1 - \alpha')\,\frac{1}{\gamma'}\,e^{-\frac{2}{\gamma'}|a|}, \qquad (4.10)$$

where δ(a) is the Dirac impulse function. This solution is best suited for B frames, but Fig. 4.3 shows that it can be used for I and P frames as well. The whole pdf is identified by the two parameters α' and γ', which can be expressed as functions of the average activity through the equations

$$\gamma'(\overline{act}) = \gamma'_0(\rho) + \gamma'_1(\rho)\,\overline{act} + \gamma'_2(\rho)\,\overline{act}^2, \qquad \alpha'(\overline{act}) = \alpha'_0(\rho) + \alpha'_1(\rho)\log\left(\overline{act}\right) + \alpha'_2(\rho)\log\log\left(\overline{act}\right), \qquad (4.11)$$

where the coefficients α'_i(ρ), γ'_i(ρ), i = 0, 1, 2, are stored for a set of ρ values. These values were computed for ρ varying in the range [0.79, 0.99] with resolution 0.01. As [75] reports, this model turns out to be the most efficient, since it both matches the statistical data and allows an easy estimate of the quantization step ∆ associated with a given target ρ value. According to equations (4.10) and (4.4), ρ can be expressed as a function of α', γ', and ∆:

$$\rho = \int_{-\Delta}^{+\Delta} p_x(a)\,da = 1 - \left(1 - \alpha'\right) e^{-\frac{2}{\gamma'}\Delta} \qquad (4.12)$$

where α' and γ' are estimated via eq. (4.8) and (4.11). The inverse function that relates the quantization step ∆ to ρ is

$$\Delta = -\frac{\gamma'}{2}\,\ln\left(\frac{1-\rho}{1-\alpha'}\right). \qquad (4.13)$$

In this way it is possible to estimate a target average quantization step ∆ for a target ρ value in a simple way. This solution proves to be quite efficient at low bit rates (see [75]), but the required memory area increases if the algorithm has to support a wide range of target bit rates. In fact, at different bit rates the parameter ρ changes and, as a consequence, many additional coefficient tables have to be stored. In order to reduce the memory area, we designed the QP-estimation algorithm described in the next section.
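Equations (4.12) and (4.13) form an invertible pair, which can be checked with a short round trip; the numeric values of α' and γ' below are arbitrary illustrative choices, not tabulated thesis values:

```python
import math

def rho_of_delta(delta, alpha_p, gamma_p):
    """Eq. (4.12): zeros fraction under the Laplacian+impulsive pdf."""
    return 1.0 - (1.0 - alpha_p) * math.exp(-2.0 * delta / gamma_p)

def delta_of_rho(rho_t, alpha_p, gamma_p):
    """Eq. (4.13): closed-form quantization step for a target rho."""
    return -(gamma_p / 2.0) * math.log((1.0 - rho_t) / (1.0 - alpha_p))

# Round trip with illustrative parameter values.
a, g = 0.5, 40.0
d = delta_of_rho(0.9, a, g)
print(d, rho_of_delta(d, a, g))   # the second value recovers rho = 0.9
```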

4.4 Signal analysis in the (ρ, E_q)-domain

According to the experimental results, it is possible to relate the percentage of null quantized DCT coefficients to the activity of the signal. In fact, the prediction residual of an image with high average activity presents a great number of non-zero coefficients. Therefore, for a given quantization step the percentage of null quantized coefficients is lower than in an image that is efficiently predicted and therefore presents a low activity value. In our work, we tried to characterize the relation between the three parameters ρ, activity, and QP.

The parametric models of the previous subsection show that there is a nearly inverse-logarithmic relation between the percentage of "zeros" and the variance of the signal for a given quantization step. Since a rate control algorithm has to estimate a quantization parameter value for a given target number of bits, we investigated the relation between ρ and QP once the activity level is known. To this purpose, we analyzed the relation between ρ and the parameter

$$E_q = \frac{\overline{act}}{\Delta} \qquad (4.14)$$

which gives the average activity level normalized to the quantization step ∆. In the Appendix it is shown that E_q is an approximation of the average energy of the quantized signal, i.e. of the quantized DCT coefficients. Moreover, in the Appendix it is shown that the parametric model of eq. (4.10) suggests a quadratic relation between E_q and ρ. This fact is well confirmed by the experimental results reported in Fig. 4.4, and as a consequence the parameter E_q can be expressed via the second-order polynomial

$$E_q = c_{i,0} + c_{i,1}(1-\rho) + c_{i,2}(1-\rho)^2 \qquad (4.15)$$

with i = I, P, B.

Equation (4.15) provides a computationally simple but accurate relation that is used in the rate control algorithm described in the following section. In fact, the quadratic model

Figure 4.4: E_q vs. ρ for the sequence carphone coded with constant QP ∈ [15, 51] (GOP IBBP, 15 frames, at 30 frame/s, QCIF resolution): (a) frame 0 (I); (b) frame 3 (P); (c) frame 1 (B); (d) frame 30 (I); (e) frame 33 (P); (f) frame 31 (P); (g) frame 120 (I); (h) frame 123 (P). Each panel plots act̄/∆ against 1 − ρ; here act̄ denotes the parameter act of eq. (4.14).

allows the coder to relate a target percentage of "zeros" to a target QP value with little computational effort, since the activity is computed while estimating the best predictor (either spatial or temporal) and the target percentage of zeros is given by the rate constraints (see eq. (4.2)).
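In practice, the coefficients c_{i,t} of eq. (4.15) can be obtained, for each frame type, by a least-squares fit of observed (ρ, E_q) pairs. A sketch on synthetic data follows; the numeric model coefficients are invented for illustration:

```python
import numpy as np

# Synthetic (rho, Eq) observations generated from an assumed quadratic model.
rho = np.array([0.99, 0.95, 0.90, 0.85, 0.80])
Eq = 0.5 + 10.0 * (1.0 - rho) + 200.0 * (1.0 - rho) ** 2   # eq. (4.15) form

# Least-squares fit of Eq against (1 - rho), as eq. (4.15) prescribes.
c2, c1, c0 = np.polyfit(1.0 - rho, Eq, 2)
print(round(c0, 6), round(c1, 6), round(c2, 6))   # recovers 0.5, 10.0, 200.0
```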

4.5 A (ρ, E_q)-based rate control algorithm

The previous section described how the number of "zeros" can model the bit rate produced by an H.264 encoder. In this section we show how this parameterization can be used to control the number of coded bits in order to transmit a video sequence over a channel of a given capacity.

The algorithm adopts a feedback scheme where the control is performed in different steps operating at different levels. The first control is performed at the beginning of each GOP and allocates G_{k,0} bits for the k-th group of pictures. The second level deals with the bit rate and the coding parameters of a single picture. Finally, the quantization parameter is corrected at macroblock level in order to fit the global constraints. In the following subsections each control level is described in detail.

4.5.1 Bit rate control at GOP level

Given the target bit rate R_b and the frame rate F_r of the input video sequence, the video encoder sets the overall number of bits

$$G = \frac{R_b N}{F_r} \qquad (4.16)$$

to code the whole GOP, where N is the number of frames in each group of pictures. This value has to be corrected during the coding process because of bit allocation errors and variations of the available bandwidth.

Bit rate allocation errors are corrected through the equation

$$G_{k,0} = \delta G_{k-1} + G - \left(B_c - \frac{B_s}{8}\right) \qquad (4.17)$$

where G_{k,n} represents the available bits before coding the n-th frame of the k-th GOP and G is defined in (4.16). δG_{k-1} is the difference between target and effective bit usage after coding the (k−1)-th GOP, and the parameters B_c and B_s refer to the buffer level and the buffer size, respectively. The GOP-level rate control tries to keep B_c as close as possible to B_s/8 in order to avoid underflows.

The second type of correction is carried out whenever the transmission bit rate changes, and it is given by the equation

$$G_{k,n} \leftarrow G_{k,n} + \left(R'_b - R_b\right)\frac{N-n}{F_r} \qquad (4.18)$$

where R'_b is the new available bit rate. These two operations make it possible to adapt the coded bit stream to channel variations, avoiding transmission delays. However, whenever the available bandwidth is reduced too much, the algorithm starts skipping some B-coded pictures in order to allocate more bits to the frames used as references for motion compensation.
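The GOP-level bookkeeping of eqs. (4.16)–(4.18) reduces to a few lines; the numeric rates and buffer values below are illustrative, not taken from the experiments:

```python
def gop_budget(Rb, Fr, N, dG_prev, Bc, Bs):
    """Eqs. (4.16)-(4.17): bit budget for a new GOP, corrected for the
    previous GOP's allocation error dG_prev and the buffer state."""
    G = Rb * N / Fr                        # eq. (4.16)
    return dG_prev + G - (Bc - Bs / 8.0)   # eq. (4.17)

def on_bandwidth_change(Gkn, Rb_new, Rb_old, N, n, Fr):
    """Eq. (4.18): rescale the remaining budget when the channel rate changes."""
    return Gkn + (Rb_new - Rb_old) * (N - n) / Fr

# 256 kbit/s, 30 frame/s, 15-frame GOP; buffer exactly at the Bs/8 target.
G0 = gop_budget(Rb=256000, Fr=30, N=15, dG_prev=0, Bc=32000, Bs=256000)
print(G0)   # 128000.0
# Bandwidth halves after 5 frames: the remaining budget shrinks accordingly.
print(on_bandwidth_change(G0, Rb_new=128000, Rb_old=256000, N=15, n=5, Fr=30))
```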

In order to deal with fast time-varying channels, the algorithm also considers a "micro-GOP", i.e. a group made of an I- or P-type picture and the following B-type pictures. The number of available bits for a micro-GOP is

$$G^{micro}_j = G^{micro}_{j-1} + R'_b\,\frac{1 + \mathrm{number\_of\_B}}{F_r} \qquad (4.19)$$

where number_of_B is the number of consecutive B-type pictures before the following I or P frame, and j is the index of the micro-GOP within the current GOP. After the coding of each frame, G^{micro}_j is updated according to the rules presented in the following paragraphs.

4.5.2 Bit rate control at frame level

After computing the available bits for the current group of pictures, the rate control algorithm has to distribute them among all the frames in the GOP. For this purpose, the algorithm has to estimate the target bit rate T_n and the target average quantization parameter QP_{T,n} for the current frame.

The target number of bits for each picture is computed according to the frame type and to bit rate allocation errors. Bit rate allocation errors may have occurred while coding the previous pictures, and they are related to unforeseen behavior of the H.264 encoder. As a consequence, the control routine has to correct the target bit rate in order to keep the produced bit rate within the bandwidth constraints.

Moreover, the assignment of the target rate has to take into account the coding type of the whole picture and its parameters. For example, an I-type picture requires a greater number of bits, since its video quality affects the coding performance of the following frames. In addition, spatial prediction performs worse than temporal prediction, and as a consequence the residual information to code for an Intra image requires more bits than for other frame types, even if the quantization parameter is the same.

As for P-type frames, they need to be coded with a lower distortion than B-type frames, since they are used as references for temporal prediction.

Finally, we must stress that the coding performance of a video encoder is deeply related to the image characteristics and their variations in time. The bit rate produced for a given quantization parameter QP depends on the statistics of the transform coefficients. As a consequence, the rate control algorithm requires a complexity parameter X_t (t = I, P, B), defined in the following paragraphs, to characterize the complexity of the current frame and adapt the choice of coding parameters to the actual picture statistics.

The bit rate control for the n-th frame in the k-th GOP is divided into four steps. First, the target bit rate is computed according to the available number of bits in the GOP, the coding type, and the characteristics of the previous images. Second, the algorithm estimates whether it is worth coding the current picture or skipping it. In the first case, the algorithm proceeds to the third step and computes the average QP value for the current frame. Then the current picture is coded, and the parameters of the image statistics are updated according to the coding results. If the current frame is skipped, the algorithm starts processing the following picture. In the following paragraphs each step is presented in detail.

B.1 Computation of the target bit rate

Before coding the n-th frame of the current GOP, the algorithm estimates the target bit rate T_n as a convex combination of the target bit rate T̂_n at GOP level and the target bit rate T̃_n at micro-GOP level:

$$T_n = \beta\,\hat{T}_n + (1-\beta)\,\tilde{T}_n \qquad n = 0, \ldots, N-1 \qquad (4.20)$$

where

$$\hat{T}_n = K_i\,\frac{G_{k,n}}{K_I \cdot n_I + K_P \cdot n_P + K_B \cdot n_B} \qquad i = I, P, B \qquad (4.21)$$

with

$$K_I = K_{I,P} \cdot K_{P,B} \qquad K_P = K_{P,B} \qquad K_B = 1 \qquad (4.22)$$

and

$$\tilde{T}_n = \frac{R_b}{F_r} - \gamma\,\overline{T}_n. \qquad (4.23)$$

In (4.21) n_i is the number of remaining i-type frames in the GOP, in (4.22) K_{i,j} is the complexity ratio between an i-type coded frame and a j-type one (i, j = I, P, B), and γ, β are constants (γ = 0.25 and β = 0.9 in our experiments). A more efficient implementation could adaptively change the value of β in order to modify the influence of T̂_n and T̃_n according to the channel behavior.

Equation (4.21) shares the available bits among the different frames of the GOP, while (4.23) distributes the bits within the current micro-GOP. In fact, a purely GOP-based bit allocation turns out to be ineffective whenever the channel bandwidth varies frequently. In this case, the rate control algorithm performs a wrong estimation of the target bit rate because of an obsolete value of R_b, and the following frames can suffer from bit starvation.

The quantity T̄_n in (4.23) is computed through the equation

$$\overline{T}_n = \overline{T}_{n-1} + \delta B_1 + \delta B_2 - \frac{R_b}{F_r} \qquad (4.24)$$

where

$$\delta B_1 = K_{\{I,P\}} \cdot \frac{G^{micro}_j}{K_{\{I,P\}} \cdot n^{micro}_{\{I,P\}} + n^{micro}_B} \qquad (4.25)$$

and

$$\delta B_2 = \left(\overline{T}_n - \frac{B_s}{8}\right) \frac{K_i}{K_{\{I,P\}} \cdot n^{micro}_{\{I,P\}} + K_B \cdot n^{micro}_B}. \qquad (4.26)$$

The parameter n^{micro}_i (i = I, P, B) is the number of remaining i-type frames in the micro-GOP.
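The blend of eqs. (4.20)–(4.22) can be sketched as follows; the complexity ratios and bit counts are invented for illustration, and the micro-GOP term T̃_n of eq. (4.23) is passed in precomputed:

```python
def frame_target(Gkn, frame_type, nI, nP, nB, KIP, KPB, T_micro, beta=0.9):
    """Eqs. (4.20)-(4.22): blend the GOP-level share with the micro-GOP target."""
    K = {"I": KIP * KPB, "P": KPB, "B": 1.0}           # eq. (4.22)
    denom = K["I"] * nI + K["P"] * nP + K["B"] * nB
    T_gop = K[frame_type] * Gkn / denom                # eq. (4.21)
    return beta * T_gop + (1.0 - beta) * T_micro       # eq. (4.20)

# Assumed ratios: an I frame ~4x as complex as a P frame, a P frame ~2x a B frame.
T = frame_target(120000, "P", nI=1, nP=4, nB=10, KIP=4.0, KPB=2.0, T_micro=8000.0)
print(round(T, 1))   # 9107.7
```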

All these parameters are updated after the coding of the current frame, as described in the following paragraphs.

B.2 Frame skipping control

After computing the target number of bits T_n for the current frame, the rate control algorithm decides whether to skip the current frame or not. In fact, whenever a picture is skipped, T_n bits are saved for the following frames. Therefore, frame skipping permits dealing with bit rate allocation errors and scene changes in an efficient way, since the algorithm skips frames that are not used as references for temporal prediction whenever the remaining frames in the GOP suffer from bit starvation. In this way, we avoid an excessive distortion of the reference pictures, which would decrease the motion estimation efficiency.

In the proposed algorithm, a frame is skipped whenever the inequality

$$T_n \le \frac{R_b}{8 F_r} \qquad (4.27)$$

holds.

In addition, whenever the current picture is a B-type frame, the following condition is tested:

$$G_{k,n} \le \frac{(N_P + N_B)\,R_b}{8 F_r}. \qquad (4.28)$$

This test allows the rate control to check whether the coder has already used a greater number of bits than expected. In this case, the current B-type frame is skipped.
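The two skip tests of eqs. (4.27)–(4.28) can be sketched as a single predicate; all numbers are illustrative:

```python
def skip_frame(Tn, Gkn, Rb, Fr, is_b_frame, NP, NB):
    """Eqs. (4.27)-(4.28): skip when the target is below Rb/(8 Fr), or, for a
    B frame, when the remaining GOP budget has fallen behind schedule."""
    if Tn <= Rb / (8.0 * Fr):                               # eq. (4.27)
        return True
    if is_b_frame and Gkn <= (NP + NB) * Rb / (8.0 * Fr):   # eq. (4.28)
        return True
    return False

# At 256 kbit/s and 30 frame/s the skip threshold of eq. (4.27) is ~1067 bits.
print(skip_frame(Tn=900.0, Gkn=50000.0, Rb=256000, Fr=30,
                 is_b_frame=False, NP=4, NB=10))   # True
```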

B.3 Computation of QP_{n,T}

Since the H.264 encoder is driven by the quantization parameter QP, the algorithm has to compute the average QP value for the n-th frame from T_n. According to the R-D model presented in Section 4.3, the bit rate T_n is mapped to a target average percentage of "zeros" ρ_{T,n} through the equation

$$\rho_{T,n} = \frac{T_n - q}{\mu} \qquad (4.29)$$

where µ and q are the slope and the intercept of equation (4.2), estimated from previously coded pictures (e.g. the (n−1)-th frame).

From equation (4.15), the parameter E_q can be computed from ρ_{T,n} for a given set of coefficients c_{i,t}, i = I, P, B and t = 0, 1, 2. Therefore, the target percentage of "zeros" ρ_{T,n} is related to a target quantized-signal energy value E_{q,T} via (4.15), where the set of coefficients varies according to the coding type of the current frame. Then, according to eq. (4.14), the algorithm estimates the target average quantization step ∆_{n,T} as

$$\Delta_{n,T} = \frac{\overline{act}_{n,pred}}{E_{q,T}} \qquad (4.30)$$

where act̄_{n,pred} is the predicted average activity for the current frame. In our approach act̄_{n,pred} is equal to the average activity of the previous frame of the same coding type; nevertheless, more efficient prediction schemes can be implemented.

Finally, the target average quantization step ∆_{n,T} is converted into an average target quantization parameter QP_{n,T} as described in [75].
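Chaining eqs. (4.29), (4.15), and (4.30) gives a compact QP estimator. The final conversion below uses the well-known H.264 rule that the quantization step roughly doubles every 6 QP units (the exact mapping of [75] may differ), and every numeric parameter is invented for illustration:

```python
import math

def qp_from_target_bits(T, mu, q, c, act_pred):
    """Target bits -> rho (4.29) -> Eq (4.15) -> delta (4.30) -> QP."""
    rho_t = (T - q) / mu                      # eq. (4.29); mu < 0 in R = mu*rho + q
    x = 1.0 - rho_t
    Eq_t = c[0] + c[1] * x + c[2] * x * x     # eq. (4.15)
    delta_t = act_pred / Eq_t                 # eq. (4.30), from Eq = act/delta
    qp = round(4 + 6 * math.log2(delta_t))    # approximate H.264 QP-to-step law
    return max(0, min(51, qp))

# Illustrative numbers: 20 kbit target, assumed linear R(rho) and quadratic Eq(rho).
print(qp_from_target_bits(T=20000, mu=-100000, q=100000,
                          c=(0.5, 10.0, 200.0), act_pred=200.0))
```

Note how a smaller target budget yields a larger quantization step and hence a higher QP, as the control loop requires.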

B.4 Parameters update

The rate control algorithm requires estimating the quantization parameter QP corresponding to a given target percentage of "zeros". In order to provide an accurate control over a wide range of bit rates, an adaptive approach is adopted, which requires a reduced computational load and avoids the storage of many coefficient tables as in [75]. For this purpose, an LMS-based technique proved to be satisfactory.

After coding the n-th frame, the coefficients c_{i,t} of eq. (4.15) are updated in the following way. First, the estimation error on E_q is found through the equation

$$e_{E_q} = E_q - \hat{E}_{q,T} \qquad (4.31)$$

with

$$\hat{E}_{q,T} = \sum_{t=0}^{2} c_{i,t}\,(1-\rho_n)^t \qquad (4.32)$$

where ρ_n is the actual percentage of "zeros" of the current frame.

Then, the appropriate set of coefficients is updated:

$$c_{i,t} \leftarrow c_{i,t} + \kappa\, e_{E_q}\, \rho_n^t \qquad (4.33)$$

where κ is the adaptation gain of the estimator. We kept a low κ value (κ = 0.01), resetting the c_{i,t} values whenever the relative bit allocation errors are greater than a threshold. The initial values of c_{i,t} are computed from a training set of sequences coded with constant QP.
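The LMS update of eqs. (4.31)–(4.33) in sketch form; the coefficients and the observed energy value are invented, and the update regressor ρ_n^t follows eq. (4.33) as written:

```python
def lms_update(c, kappa, Eq_actual, rho_n):
    """Eqs. (4.31)-(4.33): adapt the quadratic coefficients of eq. (4.15)."""
    Eq_pred = sum(c[t] * (1.0 - rho_n) ** t for t in range(3))   # eq. (4.32)
    e = Eq_actual - Eq_pred                                      # eq. (4.31)
    return [c[t] + kappa * e * rho_n ** t for t in range(3)]     # eq. (4.33)

c = [0.5, 10.0, 200.0]
c_new = lms_update(c, kappa=0.01, Eq_actual=11.0, rho_n=0.9)
print(c_new)   # each coefficient nudged toward the observed quantized energy
```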

In addition, the algorithm updates the slopeµ and the intersectq of eq. (4.2) setting

µ← hn − Sn

1− ρn(4.34)

q ← 0.9 q + 0.1 (hn − µ) (4.35)

whereSn is the total number of bits produced andhn is the number of header bits for then-th

frame in the current GOP.

As for the bit-rate-related parameters, the available number of bits is updated according to

    G_{k,n+1} = G_{k,n} − S_n    (4.36)
    G^{micro}_{j+1} = G^{micro}_j − S_n    (4.37)

In this way, the target bit rate for the following picture is modified to compensate previous bit rate allocation errors.
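The per-frame parameter and budget updates of eqs. (4.34)–(4.37) can be sketched as below; this is an illustrative reading of the equations as extracted, and all names are ours.

```python
def update_model_and_budgets(q, h_n, s_n, rho_n, g_gop, g_micro):
    """Per-frame update of the linear rate model and bit budgets,
    sketching eqs. (4.34)-(4.37); all names are ours.

    h_n: header bits of frame n, s_n: total bits of frame n,
    rho_n: fraction of "zeros", g_gop / g_micro: remaining budgets.
    """
    mu = (h_n - s_n) / (1.0 - rho_n)     # eq. (4.34): slope of the rate model
    q = 0.9 * q + 0.1 * (h_n - mu)       # eq. (4.35): smoothed intercept
    g_gop -= s_n                         # eq. (4.36): GOP-level budget
    g_micro -= s_n                       # eq. (4.37): micro-GOP budget
    return mu, q, g_gop, g_micro
```

Note that with h_n < S_n the slope µ comes out negative, consistent with a rate model in which the produced bits decrease as the fraction of "zeros" ρ grows.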


The ratios K_{I,P} and K_{P,B} are set to

    K_{I,P} = X_I / X_P,    K_{P,B} = X_P / X_B    (4.38)

and characterize the relations between the complexities X_i, i = I, P, B, for frames of different type.

In order to avoid sudden changes in the complexity ratios, X̄_i is found through the averaging filter

    X̄_i ← ω X̄_i + (1 − ω) X_i,    i = I, P, B    (4.39)

where the input X_i is the complexity

    X_i = 2^{QP_n/6} · S_n.    (4.40)

The parameter QP_n is the average QP value of the whole picture, and S_n is defined in (4.34).

The variables X_i, X̄_i and their corresponding complexity ratios K_{I,P}, K_{P,B} allow the coding process of the H.264 encoder to adapt to the input video data.

In previous coding standards, several rate control algorithms defined a complexity proportional to the quantization parameter and the coded bits, as in the expression

    X_i = QP_n · S_n.    (4.41)

Since in H.264 the relation between QP and ∆ is not linear but exponential, equation (4.41) has to be changed into eq. (4.40).
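Eqs. (4.38)–(4.40) can be sketched as follows; the smoothing weight ω is an assumption here (the thesis does not fix its value in this excerpt), and all names are ours.

```python
def update_complexity(x_avg, qp_n, s_n, omega=0.9):
    """Smoothed complexity for one frame type, eqs. (4.39)-(4.40).

    x_avg -- running average complexity X-bar for this frame type
    qp_n  -- average QP of the frame; s_n -- bits produced by the frame
    omega -- assumed smoothing weight (not fixed in this excerpt)
    """
    x_inst = 2 ** (qp_n / 6.0) * s_n       # eq. (4.40): Delta ~ 2^(QP/6) in H.264
    return omega * x_avg + (1.0 - omega) * x_inst

def complexity_ratios(x_i, x_p, x_b):
    """Eq. (4.38): complexity ratios K_{I,P} and K_{P,B}."""
    return x_i / x_p, x_p / x_b
```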

4.5.3 Bit rate control at macroblock level

At macroblock level the quantization parameter is corrected according to the number of remaining bits and the percentage of "zeros". This grants good control over both picture quality and coded bits, while keeping the bit rate within the given constraints and smoothing visual distortion across different macroblocks. The proposed algorithm uses the same macroblock-level control reported in [75].

After coding the m-th MB of the n-th frame, the percentage of null quantized coefficients in the previously coded m macroblocks is ρ^P_m and the number of bits used to code the picture is B^P_m. According to the given target, B^R_m = T_n − B^P_m bits are left to code the remaining macroblocks; the percentage of "zeros" required to fit the constraints is equal to

    ρ^R_m = 1 − (B^R_m / µ) · N_MB / (N_MB − m)    (4.42)

where N_MB is the total number of macroblocks in each frame.
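As a sketch, eq. (4.42) maps the remaining bit budget into a target fraction of "zeros" for the macroblocks still to be coded; names are ours, and µ is the slope of the linear rate model of eq. (4.2).

```python
def target_rho_remaining(b_remaining, mu, n_mb, m):
    """Eq. (4.42): fraction of "zeros" needed so that the n_mb - m
    macroblocks still to be coded fit into b_remaining bits.

    b_remaining -- bits left for the frame (B^R_m)
    mu          -- slope of the linear rate model of eq. (4.2)
    n_mb, m     -- total and already-coded macroblock counts
    """
    return 1.0 - (b_remaining / mu) * (n_mb / float(n_mb - m))
```

The fewer bits remain, the larger the required percentage of "zeros", i.e. the coarser the quantization must become.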

This leads to the estimate of the ratio k = ρ^R_m / ρ^P_m, which affects the quantization parameter


QP_{m+1} of the following macroblock according to the equation

    QP_{m+1} = QP_{T,n} + 3   if 1 + 3δ_κ ≤ k
               QP_{T,n} + 2   if 1 + 2δ_κ ≤ k < 1 + 3δ_κ
               QP_{T,n} + 1   if 1 + δ_κ ≤ k < 1 + 2δ_κ
               QP_{T,n}       if 1 − δ_κ ≤ k < 1 + δ_κ
               QP_{T,n} − 1   if 1 − 2δ_κ ≤ k < 1 − δ_κ
               QP_{T,n} − 2   if 1 − 3δ_κ ≤ k < 1 − 2δ_κ
               QP_{T,n} − 3   if k < 1 − 3δ_κ
    (4.43)

with δ_κ specified in the following paragraph.

In [37] a linear law was used to correct the QP_n value, since in the H.263 encoder the relation between QP and the quantization step ∆ can be expressed as ∆_{H.263} = 2 QP. In the H.264 encoder this relation is given by the exponential relation (2.4). Therefore, δ_κ can be estimated by

    δ_κ = (0.67 / C) · 2^{QP_{T,n}/6}.    (4.44)

In order to fit the targeted bit rates, the constant C has been set to 500. A reduced value of δ_κ allows the encoder to react more quickly to changes in ρ^P_m.
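The QP correction ladder of eqs. (4.43)–(4.44) can be sketched directly; this is an illustration with our own names, following the thresholds as written above.

```python
def correct_qp(qp_target, k, c=500.0):
    """Per-macroblock QP correction, eqs. (4.43)-(4.44); names are ours.

    qp_target -- frame-level target QP (QP_{T,n})
    k         -- ratio rho^R_m / rho^P_m of required vs. observed "zeros"
    c         -- the constant C = 500 of eq. (4.44)
    """
    delta_k = 0.67 / c * 2 ** (qp_target / 6.0)   # eq. (4.44)
    # Eq. (4.43): step ladder, clipped to +/-3 around the target QP
    if k >= 1 + 3 * delta_k:
        off = 3
    elif k >= 1 + 2 * delta_k:
        off = 2
    elif k >= 1 + delta_k:
        off = 1
    elif k >= 1 - delta_k:
        off = 0
    elif k >= 1 - 2 * delta_k:
        off = -1
    elif k >= 1 - 3 * delta_k:
        off = -2
    else:
        off = -3
    return qp_target + off
```

A k above 1 means more "zeros" (coarser quantization) are needed than observed so far, so the QP is raised; k below 1 lowers it, never by more than 3.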

Note that δ_κ increases monotonically with the quantization parameter since, according to equation (2.4), the variation of ∆ is more relevant for higher values of the quantization parameter. This could lead to strong variations in bit rate and coding quality across different macroblocks, so frequent changes are avoided by using a greater δ_κ.

In order to achieve a sufficient statistic for ρ^P_m, QP_m remains equal to QP_{T,n} until B^P_m ≥ 0.1 · T_n.

We adopted the RD optimization performed by the JVT encoder in order to compare our results with the rate control algorithm included in version JM 7.6 of the encoder. Therefore, in our approach the rate control chooses the quantization parameter, while the macroblock coding mode is selected by minimizing the cost function

    J(mode, QP_m) = D(mode, QP_m) + λ R(mode, QP_m)    (4.45)

where mode = 0, …, 10 is the macroblock mode, QP_m is the quantization parameter chosen for the current macroblock, D(mode, QP_m) is the coding distortion, and R(mode, QP_m) is the bit rate. The Lagrange multiplier λ is set to

    λ = λ_0 2^{QP/6}    (4.46)


with λ_0 = 0.85 for I- or P-slices and λ_0 = 3.4 for B-slices (see [114, 125, 126]).
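The Lagrangian mode decision of eqs. (4.45)–(4.46) can be sketched as below; the candidate distortion/rate pairs are assumed to be available from the encoder, and all names are ours.

```python
def best_mode(costs, qp, lambda0=0.85):
    """Lagrangian mode decision, eqs. (4.45)-(4.46); names are ours.

    costs   -- dict: mode -> (distortion, rate) for each candidate mode
    qp      -- quantization parameter chosen by the rate control
    lambda0 -- 0.85 for I/P-slices, 3.4 for B-slices (per the thesis)
    """
    lam = lambda0 * 2 ** (qp / 6.0)                 # eq. (4.46)
    # Eq. (4.45): pick the mode minimizing J = D + lambda * R
    return min(costs, key=lambda m: costs[m][0] + lam * costs[m][1])
```

At low QP (small λ) the distortion term dominates and low-distortion modes win; at high QP the rate term dominates and cheap modes win.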

4.6 Experimental results

In order to evaluate the performance of the "zeros" algorithm, we coded different sequences at various bit rates using two different rate controls. The first is the proposed "zeros"-based rate control, while the second is the algorithm implemented in the Joint Model 7.6 of H.264 by the Joint Video Team (denoted here with the label JVT [59, 67]).

The configuration parameters of the H.264 video coder are reported in Table 4.1.

    Parameter                  Value
    GOP structure              IBBP
    GOP length                 15 and 60
    Coding algorithm           CABAC
    Search window width        16
    MV resolution              1/4 pixel
    Hadamard transform         enabled
    Num. of reference frames   1
    RD optimization            enabled
    SP pictures                not used
    Slice mode                 not used

Table 4.1: Configuration parameters for the H.264 encoder.

For each

coded sequence we computed the bit rate and the PSNR. In addition, we calculated the standard deviation of this parameter (σ_PSNR) in order to evaluate how strongly the distortion varied among different frames. Strong PSNR variations affect the resulting video quality, since the displayed sequence looks unnatural and visually unpleasant: a video sequence with large PSNR variations may be worse than a sequence with a lower average video quality but limited quality variations. At first, we coded different sequences at different bit rates. The results are reported in Figs. 4.5 and 4.7 and in Table 4.2 for QCIF sequences, while Fig. 4.6 reports the results for a CIF sequence.

[Figure 4.5: Bits/frame and PSNR/frame plots of 240 QCIF frames for the sequence salesman (GOP IBBP, 60 frames) at 30 frame/s: (a) bits vs. frame number; (b) PSNR (Y, dB) vs. frame number.]

              "zeros"           JVT               "zeros"          JVT
    Target    Rate     err.     Rate     err.     PSNR ± σ_PSNR    PSNR ± σ_PSNR
     64.00     63.41   -0.92     63.92   -0.12    34.01 ± 1.99     33.83 ± 2.26
     80.00     79.65   -0.44     79.90   -0.12    35.46 ± 1.51     34.81 ± 3.02
     96.00     95.79   -0.22     95.81   -0.19    36.52 ± 2.18     35.73 ± 4.23
    112.00    111.53   -0.42    111.66   -0.30    37.38 ± 3.58     36.44 ± 5.13
    128.00    127.82   -0.14    127.54   -0.36    38.51 ± 1.82     37.08 ± 6.17
    144.00    143.60   -0.28    143.59   -0.29    39.29 ± 2.49     37.68 ± 6.91
    160.00    159.09   -0.57    159.48   -0.32    40.22 ± 3.14     38.27 ± 8.25
    176.00    175.41   -0.34    175.37   -0.36    40.76 ± 3.05     38.72 ± 8.81
    192.00    191.63   -0.20    191.34   -0.34    41.84 ± 2.76     39.14 ± 10.55
    208.00    207.09   -0.44    207.28   -0.35    42.54 ± 2.88     39.73 ± 12.19
    224.00    223.08   -0.41    223.20   -0.36    43.10 ± 3.36     40.22 ± 13.56
    240.00    239.10   -0.38    239.18   -0.34    43.74 ± 3.77     40.61 ± 15.69
    256.00    254.59   -0.55    255.01   -0.39    44.34 ± 4.43     41.00 ± 16.97

Table 4.2: Results for the sequence salesman. [PSNR] = [σ_PSNR] = dB, [Rate] = [Target] = kbit/s, [err] = %.

The reported data show that the "zeros"-based approach provides better quality (as measured by the PSNR) than the JVT algorithm. Fig. 4.5(a)

shows the number of bits allocated to each frame of the sequence salesman coded at 128 kbit/s, and Fig. 4.5(b) shows the corresponding PSNR value of the luma component. The plots of Fig. 4.5(b) highlight that the video quality of the "zeros" algorithm varies less even though the allocated number of bits is approximately the same.

[Figure 4.6: Distortion-rate plot of 120 CIF frames for the sequence salesman (GOP IBBP, 60 frames) at 30 frame/s; the superimposed vertical bars denote ±σ_PSNR.]

In addition, the data in Table 4.2 and in Fig. 4.7 show that the perceptual quality variation (measured by σ_PSNR) is smaller for the proposed algorithm. The graphs of Fig. 4.7 show the experimental distortion-rate curves with superimposed vertical bars denoting ±σ_PSNR. The results were obtained by coding the sequences salesman, foreman, news, and container. The figures confirm that the proposed algorithm produces both a greater PSNR value and a lower σ_PSNR at all bit rates, i.e. both higher and smoother quality. This fact proved


[Figure 4.7: Distortion-rate plots for different QCIF sequences at 30 frame/s; the superimposed vertical bars denote ±σ_PSNR. (a) 360 frames from the sequence foreman (QCIF, GOP IBBP, 15 frames); (b) 240 frames from the sequence salesman (GOP IBBP, 60 frames); (c) 240 frames from the sequence news (GOP IBBP, 60 frames); (d) 240 frames from the sequence container (GOP IBBP, 60 frames).]


    Target (kbit/s)/              JVT algorithm              (ρ, E_q) algorithm
    GOP length/Format  Seq.       Bit rate  PSNR (dB)        Bit rate  PSNR (dB)
                                  (kbit/s)  ± σ_PSNR         (kbit/s)  ± σ_PSNR
    64/60/QCIF         foreman     65.78    33.28 ± 2.24      63.93    33.56 ± 1.26
                       news        63.72    35.50 ± 2.39      63.89    36.44 ± 2.03
                       container   63.68    38.17 ± 0.68      63.55    38.39 ± 0.63
                       silent      66.02    34.93 ± 0.63      63.97    34.87 ± 0.87
                       table       66.95    32.43 ± 2.36      63.85    32.62 ± 2.85
                       salesman    65.52    33.93 ± 2.26      63.92    34.01 ± 1.99
    96/60/QCIF         foreman     95.99    34.79 ± 1.79      95.66    35.35 ± 1.27
                       news        96.19    37.49 ± 1.97      95.39    38.96 ± 2.46
                       container   95.51    39.64 ± 1.73      95.55    39.87 ± 1.03
                       silent      98.08    36.57 ± 0.70      95.93    37.66 ± 0.34
                       table       99.75    34.42 ± 2.29      95.70    35.12 ± 2.37
                       salesman    98.08    36.01 ± 2.36      95.81    36.52 ± 2.18
    128/60/QCIF        foreman    130.81    36.05 ± 1.81     127.67    36.58 ± 1.21
                       news       128.43    39.19 ± 2.92     128.20    40.50 ± 1.63
                       container  127.59    40.79 ± 3.08     127.58    41.24 ± 2.03
                       silent     130.49    38.05 ± 1.68     127.94    39.25 ± 0.80
                       table      132.19    35.75 ± 2.22     127.40    36.84 ± 2.19
                       salesman   132.04    37.49 ± 3.17     128.08    38.51 ± 1.82
    96/15/QCIF         foreman     96.37    34.94 ± 2.84      96.70    35.38 ± 1.38
                       mobile      96.20    27.69 ± 1.77      96.67    28.52 ± 0.67
                       salesman    95.79    36.63 ± 2.30      96.64    37.08 ± 1.83
                       silent      96.87    36.57 ± 1.60      97.00    37.07 ± 0.90
    128/15/QCIF        foreman    128.25    36.05 ± 2.81     129.54    36.84 ± 1.51
                       mobile     127.99    28.88 ± 1.62     129.00    30.01 ± 1.40
                       salesman   127.66    37.91 ± 2.41     128.93    38.94 ± 2.02
                       silent     128.93    38.00 ± 1.97     129.22    39.34 ± 1.01
    192/60/CIF         foreman    202.15    34.39 ± 1.14     192.128   34.62 ± 1.40
                       news       205.56    37.24 ± 1.22     191.46    37.96 ± 1.88
                       salesman   209.29    34.52 ± 0.83     192.26    34.79 ± 0.92
                       table      201.99    30.83 ± 1.39     191.53    31.06 ± 1.22
    256/60/CIF         foreman    267.65    35.53 ± 1.11     255.74    35.86 ± 1.08
                       news       270.92    38.54 ± 1.30     260.73    39.45 ± 0.90
                       salesman   277.08    35.43 ± 0.87     256.38    35.79 ± 0.73
                       table      266.47    31.93 ± 1.24     255.71    32.44 ± 1.13

Table 4.3: Comparison between the (ρ, E_q)-based algorithm and the JM7.6 algorithm.


to be independent of the complexity of the sequence. We performed the same analysis on CIF sequences in order to evaluate the performance of the algorithm with larger pictures. Experimental results for the sequence salesman are reported in Figure 4.6 and confirm the previous results. More results are reported in Table 4.3.

[Figure 4.8: Rate and PSNR plots of 180 QCIF frames for the sequence foreman (GOP IBBP, 15 frames) at 30 frame/s: (a) bits per frame vs. frame number; (b) PSNR (Y, dB) vs. frame number. The bit rate decreases to 102 kbit/s at the 90th frame and increases to 154 kbit/s at the 130th frame.]

The "zeros"-domain algorithm also exhibits good performance in the case of varying bandwidth. Figure 4.8 shows the results obtained when transmitting the foreman sequence over a channel of varying capacity. The algorithms have to adapt to changes in the channel bit rate, which drops from 128 kbit/s to 102 kbit/s in the first transition and increases to 154 kbit/s in the second one. The performance of both algorithms is reported showing both the PSNR and the number of coded bits for each frame. The performance of the "zeros" algorithm does not appear to be seriously affected by changes in channel capacity: the algorithm provides smoother quality between consecutive frames than the JVT algorithm, avoiding peaks in the number of bits per frame. In this way, transmission jitter is limited and, as a consequence, it is possible to avoid frequent freezing of the displayed frames whenever the decoder does not have enough buffered data and has to wait for the complete reception of the next frame before decoding it.

Moreover, we tested the algorithm using the same conditions of the VBR tests³ in [61] with RD optimization enabled. The proposed algorithm proved to be more effective in terms of both visual quality and rate control accuracy. Results are reported in Table 4.4.

    Sequence   foreman QCIF    carphone QCIF   news CIF
    ρ, E_q     39.68/152.40    41.51/152.40    43.67/305.23
    JVT        39.36/156.79    41.31/157.68    42.62/314.37

Table 4.4: PSNR/Rate for VBR tests on different sequences.

³Different sequences are coded at 10 frame/s (100 frames with GOP IPPP). The bit rate is 128 kbit/s until the 60th frame and is then incremented to 192 kbit/s for QCIF frames, while for CIF sequences the initial target rate is 256 kbit/s, incremented to 384 kbit/s.

4.7 Summary

In this chapter we analyzed the application of a rate-distortion model based on the percentage ρ of null quantized transform coefficients ("zeros") to the video coding standard H.264/AVC. The bit rate proves to be a linear function of ρ for frames of different coding types, and this relation can be efficiently exploited to control the produced bit rate. In this model, however, the probability density function of the transform coefficients plays an essential role. Experiments show that it is possible to reduce the computational cost of storing the coefficient statistics by parameterizing the percentage of zeros via the energy E_q of the quantized signal. In fact, a quadratic relation between E_q and ρ can be found that makes the analysis of the produced bit rate very easy. Modeling the signal in the joint domain (ρ, E_q) permits the design of a low-cost rate control algorithm which provides good performance at both high and low bit rates. The results were also obtained by adopting an enhanced skipping strategy that avoids sending a useless amount of information and prevents an unnecessary waste of bandwidth. This choice increases the coding performance with respect to the algorithm implemented in the JM7.6 software, both in terms of average PSNR and in terms of its variance. The resulting visual quality proves to be higher and smoother for the proposed algorithm, while the JM7.6 algorithm proves unable to quickly adapt its coding parameters to the statistics of the input signal. In addition, the proposed algorithm proves to be flexible in the presence of bandwidth variations, adapting its parameter setting quickly. Experimental results show that bit rate oscillations of around 25% do not significantly affect the performance of the proposed algorithm, while the technique used in the reference software is not able to react quickly to changes and its buffer suffers from overflows or underflows according to the variations in the available bandwidth. Finally, the computational complexity is significantly reduced since no pre-analysis is required and each macroblock is coded only once.


Chapter 5

Joint Source-Channel Video Coding Using H.264/AVC and FEC Codes

"Nothing hurts a new truth more than an old error."
— Johann Wolfgang von Goethe

One of the most challenging drawbacks of video transmission over mobile channels is the perceptual degradation of the reconstructed video sequence at the decoder. The high percentage of lost packets, as well as the intensive use of prediction to obtain a high compression ratio, affects the visual quality of the reconstructed sequence. It is therefore necessary to introduce some redundant data in order to increase the robustness of the coded bit stream. A possible solution fills a matrix structure with RTP packets and applies an FEC code to its rows. However, the matrix size and the chosen FEC type affect the performance of the coding system. This chapter discusses a rate allocation algorithm that distributes the available number of bits between the H.264/AVC coder and the channel coder in order to maximize the perceptual quality of the decoded sequence.
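To fix ideas, a row-wise FEC over a packet matrix can be sketched in its simplest form with a single XOR parity packet per row, which recovers at most one lost packet per row. This is only a minimal stand-in for the row FEC discussed in this chapter (stronger codes such as Reed-Solomon allow recovering more losses per row), and all names are ours.

```python
def xor_parity(row_packets):
    """Build the XOR parity packet for one matrix row.

    row_packets -- equal-length byte strings (RTP payloads of one row).
    The parity lets the receiver rebuild a single lost packet of the row.
    """
    parity = bytearray(len(row_packets[0]))
    for pkt in row_packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(row_packets, parity, lost_index):
    """Rebuild the single packet at lost_index from the survivors + parity."""
    survivors = [p for j, p in enumerate(row_packets) if j != lost_index]
    # XOR of all surviving packets and the parity equals the lost packet
    return xor_parity(survivors + [parity])
```

In the matrix scheme described above, consecutive RTP packets fill the matrix and the FEC is computed across rows, so a burst of consecutive losses is spread over different rows and each loss can be repaired independently.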

5.1 Introduction

As anticipated in Chapter 1, one of the technical challenges posed by video transmission over wireless networks is granting a certain QoS to the end user. The ever-changing nature of radio channels and the varying topology of wireless networks modify the transmission conditions more often than in wired communication. As a drawback, the transmission of video content becomes a challenging task, since varying channels require flexible algorithms that adapt the coding parameters to the different transmission conditions. At the same time, the greatest difficulty is due to the fact that mobile networks cannot grant a reliable transmission because of errors and losses, which are the very Achilles' heel of video transmission.
transmission.

Losses and errors may be produced by different causes. A first cause is the time-varying characteristics of the transmission environment, where the transmitted information is often corrupted by bursty bit error patterns. One of the classical techniques used to make the bit stream less vulnerable to errors is to increase the redundancy of the sent information


by providing the decoder with some "extra" data. The additional amount of information allows the recovery of the lost data, provided their amount stays below a given threshold. These techniques are called Forward Error Correction (FEC) schemes, since no interaction is needed between the encoder and the decoder in the recovery process, and their effectiveness is limited by the capability of designing a protection scheme that suits the channel conditions at all times. Closed-loop error control techniques like Automatic Repeat reQuest (ARQ) provide a more efficient protection against errors, since the decoder interacts with the encoder, sharing its knowledge of the channel conditions and allowing the encoder to tune the allocated redundancy appropriately. Unfortunately, many applications cannot resort to ARQ schemes, since these need a reliable feedback channel and introduce an excessive transmission delay, which is prohibitive for interactive communications.

In addition to the problems related to the radio link, we must also take into consideration the amount of coded data and the network conditions. Video sources produce a huge amount of information per time unit with respect to other kinds of sources. Hence, video communications strongly affect the network conditions, since one or more uncontrolled users sending video packets across the network may seriously limit the transmission capacity available to the others. Usually, network management adopts a set of different policies in order to prevent a single user from jeopardizing network resources. These solutions monitor the entering traffic and, whenever the network is overloaded, take appropriate measures, such as packet dropping¹ or queuing. At the receiver, the dropping of a packet is perceived as a loss.

Finally, transmission delays must be mentioned as well. Each data packet requires a certain amount of time to reach its destination, depending on the average transmission capacity, the number of crossed links, and the overall waiting times in the queues. Whenever the delay statistics present a limited variance (jitter), the delay is compensated by buffering the decoded information at the receiver and displaying the reconstructed sequence after an appropriate initial time interval (playout delay). Unfortunately, a highly varying delay statistics makes it difficult to estimate the initial waiting time. Moreover, interactive applications require limited delays, since the decoded frames must be displayed at fixed instants pre-determined according to the negotiated QoS. Whenever a frame arrives too late, it can be discarded and regarded as lost.

Although this chapter focuses on dealing with losses in the packet video stream, errors and corruption can be efficiently addressed too. In the data stream produced by a video coder, syntax elements are coded into a sequence of variable-length binary strings. The corruption of one bit is crucial for the whole decoding process, since all the remaining symbols are wrongly decoded until a synchronization bit marker is found. In these cases, the resulting bit stream can be correctly decoded until an error occurs, and therefore the impact of the loss on the decoding process may vary according to where the error takes place and where the decoder detects it. Hence, the decoder may carry on decoding erroneous data as long as it finds a feasible bit stream, introducing a distorted frame in the process, which affects the decoding of the remaining sequence.

¹A packet that is non-conforming with respect to the allowed amount of traffic is discarded. Dropping of packets can take place whenever the user does not respect the conditions specified in the traffic contract, independently of the real presence of network congestion.


According to these premises, errors and losses are a peculiar characteristic of wireless communication and must be appropriately addressed, since they can seriously affect the quality perceived by the end user. The following section presents an overview of different techniques adopted to cope with errors and losses in the video packet stream. Then, an efficient FEC approach that reduces the quality degradation of the coded bit stream whenever packets are lost will be presented. Its performance is greatly improved when the strength of the protection is varied according to the characteristics of the video content. Hence, we will present an optimization strategy which improves the quality of the reconstructed sequence for a given coding rate. This technique is included in a joint source-channel rate control which adaptively partitions the available bandwidth between the channel coder and the video source coder. The reported experimental results show that this solution can significantly improve the performance of a non-adaptive approach.

5.2 On dealing with channel errors and losses in video transmission

The transmission rates offered by current communications providers are inadequate to transmit the uncompressed multimedia content produced by each user. This limit requires the adoption of efficient coding algorithms with a good compression ratio that significantly reduce the amount of transmitted data. Nowadays, all the approaches that have been proposed include a DPCM loop along the temporal dimension. At every instant, video information can be predicted from the previously decoded data that constitute the state of the decoder (interframe coding). Nevertheless, the price to be paid for the high coding gain of Inter prediction is an extreme vulnerability to transmission errors.

The loss of part of the information prevents a correct reconstruction of the encoder state at the decoder and has a considerable impact on the quality of the following frames. In most video communications, the state of the encoder/decoder is given by those frames that have been encoded/decoded and are included in the frame buffer as available references for motion compensation. In case one of them is either missing, because the connection was temporarily lost during the transmission, or corrupted, because the coded stream was altered by channel noise, the decoder must take appropriate measures to replace the loss.

A possible solution is to estimate an approximated frame that replaces the original one at the decoder. The mismatch between the original frame and its approximated version introduces an additional distortion in the reconstructed sequence, which decreases as the estimated frame gets closer to the missing one. To this purpose, in recent years the technical literature has proposed many error concealment algorithms [22, 7], which have adopted increasingly sophisticated estimation techniques.

On the other side of the connection, the encoder can optimize the packet stream in order to maximize the estimated quality of the reconstructed sequence given the loss statistics and the available bandwidth. The optimization can be performed by coding the video source in an appropriate way, including a certain amount of non-predicted information in the coded stream in order to stop the error propagating from a frame loss. Moreover, it is possible to add some


redundant information to the bit stream that allows the estimation of the lost data in case of losses.

Finally, the technical literature has also presented some ARQ protocols, which allow the decoder to ask for the retransmission of part of the lost information [28]. However, these techniques imply the existence of a control channel that connects the receiver with the transmitter and allows the decoder to provide the encoder with an error report. In a wireless network scenario, the reliability and timeliness required by the feedback channel cannot be granted (they could be granted at cell level, but not over long paths), and therefore ARQ techniques will not be considered.

The following sections present an overview of error concealment techniques performed both at the decoder and at the encoder, with particular attention to the latter, since our investigation focuses on them.

5.2.1 Error concealment at the decoder

The previous section gave a short overview of the different errors that may affect the received bit stream. Depending on the nature of the corruption, different results can be obtained. Transmission glitches range from single bit errors to bursts, or even the temporary loss of connection, causing a wide range of different conditions.

In case of bit errors, the corruption of one bit in the stream causes an incorrect decoding of the corrupted symbol, which propagates the error into the following values until a resynchronization point is reached [22, 24]. The first step that the decoder must take in order to decode a corrupted bit stream is to detect syntax errors, discarding the rest of the corrupted data unit and recovering the synchronization with the encoder. The location of the first corrupted bit in the stream is made possible by checking the values of the decoded elements and detecting violations of the coder syntax. The difficulty of this task is strictly dependent on the adopted entropy coding algorithm. In the case of the H.264/AVC standard, the error concealment of a CAVLC stream is much easier than that of a CABAC stream, since syntax errors are detected far beyond the point where they actually occurred (see [22]). At every syntax exception, the decoding process is interrupted, given the impossibility of recovering the remaining information in the current slice. Due to the data structure introduced by the H.264 standard, every slice is independent of the others, and therefore resynchronization with the data flow can happen at every new slice.

In case of losses, no error location is needed because a whole packet is lost and the decoder does not need to be resynchronized. Since in most cases error detection and correction is delegated to the lower levels of the communication stack, in this work we will consider only packet losses, even though the presented algorithms can be efficiently applied to a corrupted bit stream as well.

Usually, the process of video decoding takes place at the highest levels of the protocol stack, and in most protocols the lowest levels block the corrupted packets and signal to the highest levels that they are lost. However, the spread of multimedia communication over packet networks has highlighted the possibility of decoding a multimedia packet at the highest levels even if it contains bit errors. In many cases corrupted information can still provide significant information to the video decoder. Therefore, some


5.2. On dealing with channel errors and losses in video transmission 73

transmission protocols have been proposed that make corrupted packets available to the highest levels whenever errors do not affect some crucial parts of the packets, like headers. One of these is UDP-Lite, an extension of UDP that allows the sender to specify whether to compute the checksum on the whole packet or only on the header. In this way the packet is kept even if its payload is corrupted, as long as the destination and the length fields are correct. Error concealment algorithms can detect the errors and, in some cases, correct them by checking the compatibility of the decoded information with the syntax of the coding standard [22, 24, 27].

After detecting a loss, concealment methods are necessary to reconstruct the missing parts of a damaged image. Note that these approaches avoid a feedback transmission to the encoder, since it would imply longer delays in displaying the pictures. On the contrary, most of the adopted algorithms perform an image post-processing at the decoder, taking advantage of the intrinsic correlation that can be found in a video sequence ([21, 20]). Some techniques are based on the interpolation of lost pixels from the neighboring information, while others recover the lost syntax elements from the neighboring ones. For example, it is possible to estimate a lost motion vector by predicting its value from the neighboring ones, thanks to the correlation existing among spatially-adjacent motion vectors [23]. The efficiency of each solution varies according to the characteristics of the video sequence, and it is strictly dependent on how the video stream is coded. Hence, the encoder can optimize the coding variables so as to enhance the error concealment performance at the decoder.

5.2.2 Error concealment at the encoder

The previous section has quickly glanced at the techniques employed by the decoder to mitigate the effects of transmission errors on the decoding process. The performance of these techniques is deeply affected by the coding choices adopted at the encoder: the performance of error concealment is greatly improved whenever the coder control takes the channel conditions into consideration while tuning its parameters.

Intra refresh of coded video information

In the literature, one of the first algorithms that addresses the problem of producing a robust video stream is based on periodically including non-predicted information in the bit stream in order to block error propagation (Intra refresh). This mechanism was inherited from DPCM coding [50], and it can be properly tuned according to the video content and the channel statistics (see [62]). The video encoder may force the RD-optimization algorithm so that the crucial parts of an image are coded with Intra coding, thereby introducing non-predicted information in the coded video stream. As a drawback, the increment in Intra-coded macroblocks produces an increment in the number of coded bits or, in case the bit rate is constrained by the available bandwidth, a decrease in the quality of the reconstructed sequence with respect to its counterpart in an error-free environment. The identification of the parts to be refreshed depends on the error model for the channel and on the required robustness. One possible strategy is to randomly intra-code the macroblocks of the sequence so that after a certain number of frames the whole image has been refreshed (see [24]). Another strategy, which proves to be extremely effective for bit errors, is to identify which macroblocks are either crucial in the


74 Chapter 5. Joint Source-Channel Video Coding Using H.264/AVC and FEC Codes

decoding process or more likely to be lost, and refresh them (see [62]). In this way, while the former approach refreshes all the macroblocks the same number of times, the latter selects more often those macroblocks whose loss has a stronger effect on the visual quality of the reconstructed sequence provided to the end user. With the adoption of the flexible macroblock ordering (FMO) structure of the H.264/AVC standard, Intra refresh can be performed in a very efficient way by intra-coding a whole slice of non-neighboring macroblocks. Increasing the percentage of Intra macroblocks in the coded sequence increases the overall bit rate, since the coding gain of Intra coding is much lower than that of temporal prediction.
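The random refresh strategy mentioned above can be sketched as follows (a minimal illustration, not the thesis implementation; the function name and refresh period are hypothetical):

```python
import random

def intra_refresh_schedule(num_mbs, frames_per_refresh, seed=0):
    """Yield, per frame, the set of macroblock indices to force to Intra.

    Every macroblock is refreshed exactly once every `frames_per_refresh`
    frames, in a pseudo-random order (illustrative sketch, not from the
    H.264/AVC specification).
    """
    rng = random.Random(seed)
    per_frame = -(-num_mbs // frames_per_refresh)  # ceil division
    while True:
        order = list(range(num_mbs))
        rng.shuffle(order)          # new random order for each refresh cycle
        for i in range(0, num_mbs, per_frame):
            yield set(order[i:i + per_frame])

# Example: 99 macroblocks (a QCIF frame), full refresh every 10 frames.
sched = intra_refresh_schedule(99, 10)
frames = [next(sched) for _ in range(10)]
assert set().union(*frames) == set(range(99))  # whole image refreshed
```

After `frames_per_refresh` frames, every macroblock has been intra-coded once, which bounds the temporal extent of error propagation at the cost of the extra Intra bits discussed above.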

Multiple Description coding algorithms

Recently, other techniques have been studied in order to allow a reliable transmission even in the presence of a high loss rate (10-20%) and of bursts of consecutive losses. Among the possible solutions, Multiple Description Coding (MD Coding or MDC) has been broadly investigated in recent years. According to a clever definition of MDC given by V. Goyal [32], Multiple Description Coding aims at "representing a single information source with several chunks of data ('descriptions') so that the source can be approximated from any subset of chunks". Note that the chunks are perfectly equivalent in the reconstruction process, viz. the performance of the error concealment algorithm at the decoder does not depend on which pieces of data correctly arrive but only on their number. In fact, the more chunks the decoder gets, the higher the quality obtained from the decoding process, independently of which pieces of information arrived; this is the main element that differentiates MDC from scalable coding. In scalable coding, different chunks (or packets) of data are used to represent the source, but the packets are also hierarchically ordered, and the loss of the most important one precludes the decoding of all the others. In a multiple-channel environment, the transmission system must then adaptively select the transmission channel according to the importance of the transmitted data, sending the packets with the most relevant information across the most reliable transmission path. MDC schemes permit avoiding this step, since all the chunks of data are equally relevant. The key idea behind multiple description dates back to the late 70's, when the

Figure 5.1: A pictorial example of Multiple Description Coding. Three descriptions of the same video source are transmitted across three independent channels (red, green, and blue).

problem was to allow DPCM speech transmission over faulty channels, i.e. channels that were not working in certain periods. The aim was to find a more efficient solution than replicating or splitting the transmitted information over more than one channel (see [25]). A solution proposed by Jayant was based on the separation of odd and even samples in a speech coding method ([49]). The original sequence of samples was split into two sub-sequences, the odd ones and


the even ones, which were coded by two separate DPCM coders. Then, the output signals were merged together and transmitted over separate channels. Assuming that the loss patterns on different channels are uncorrelated, whenever a sample of one description is lost (e.g. an odd sample), it is possible to reconstruct the original sequence at half the sample rate. In addition, the lost information can be estimated by interpolating the previous and the following samples, thanks to the high correlation of adjacent speech samples. In this way, the state of the DPCM decoder is recovered and the decoding of the subsequence can go on, with some additional distortion. This solution has recently been applied to video coding, where speech samples are replaced by frames at different instants.
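Jayant's odd/even separation can be illustrated with a short sketch (the helper names are ours, the DPCM coding stage is omitted, and simple linear interpolation is assumed for the lost description):

```python
def mdc_split(samples):
    """Two descriptions: even-indexed and odd-indexed samples."""
    return samples[0::2], samples[1::2]

def mdc_reconstruct(even, odd, odd_lost=False):
    """Merge the two descriptions; if the odd one is lost, estimate each
    missing sample by averaging its surviving even-indexed neighbours."""
    n = len(even) + len(odd)
    out = [0.0] * n
    out[0::2] = even
    if not odd_lost:
        out[1::2] = odd
    else:
        for i in range(1, n, 2):          # interpolate each missing sample
            nxt = out[i + 1] if i + 1 < n else out[i - 1]
            out[i] = 0.5 * (out[i - 1] + nxt)
    return out

x = [0, 1, 2, 3, 4, 5, 6, 7]
even, odd = mdc_split(x)
assert mdc_reconstruct(even, odd) == x    # both descriptions received
# With the odd description lost, the interior samples are still
# recovered exactly because this toy signal is linear.
```

The quality of the concealment depends, as the text notes, on the correlation between adjacent samples: the smoother the signal, the closer the interpolated values come to the lost ones.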

From these initial techniques, different MDC schemes have been proposed. For example, some approaches are based on quincunx sampling ([13]), others adopt multiple states for motion compensation in order to replace a lost frame with a coarser version [134], others include a correlating transform in the coding process that increases the redundant information in the stream [88, 123, 89, 31], and others use different quantizers with translated characteristics [119, 12]. All these techniques take advantage of the correlation existing or created between syntax elements that are either temporally or spatially close. Some schemes allow tuning the allocated redundancy, while for other approaches the redundancy is intrinsically determined by the MDC scheme and cannot be controlled.

Although some promising results have been obtained, multiple description still proves to be young with respect to the current technological resources [117], since the required bandwidth is high in most cases and for many efficient algorithms the allocated redundancy cannot be tuned.

Distributed Source Coding

A possible alternative to the previous techniques is provided by Distributed Source Coding (DSC), which dates back to the pioneering works of Slepian and Wolf (1973) and Wyner and Ziv (1976), but has lain dormant for more than a quarter century, perhaps due to a lack of application focus. Recent applications, like video-over-wireless, multimedia cellular telephony, and wireless video surveillance camera systems, have aroused a new interest in distributed coding thanks to its built-in robustness to the drift caused by the prediction-error mismatch between encoder and decoder following a channel loss. As a consequence, during the last years several DSC-based video coding architectures have appeared in the literature, aimed at providing robust alternatives to the traditional coding standards, like MPEG-x and H.26x. More details about this topic can be found in Chapter 6.

Automatic Repeat reQuest techniques

All the previous techniques are characterized by the fact that the video encoder is completely unaware of which parts of the information have been correctly received and which have been lost or corrupted. Whenever a feedback channel is available, the quality of the reconstructed sequence at the decoder can be greatly improved by designing a coding scheme that allows the decoder to communicate with the encoder [27]. The feedback channel is used to signal to the encoder


which parts of the video sequence have been correctly received and which parts have been lost [28, 27].

Usually, the information sent across the feedback channel consists of positive acknowledgements (ACKs) or negative acknowledgements (NACKs), which state whether a loss has occurred or not. The feedback message is not part of the video coder syntax, but is handled at a lower layer in the protocol stack, where control information is exchanged. However, some standards were defined in order to provide video encoders with a more detailed knowledge of the missing parts. For the H.263 video coder, ITU-T Recommendation H.245 [46] allows reporting the spatial and temporal location of macroblocks that could not be decoded successfully and had to be concealed. According to the decoder reports, the sender keeps on retransmitting the lost part until it is correctly received. In case of errors, the delay introduced by this method can be significant, especially for real-time and interactive applications [27]. The video encoder can take more appropriate measures to compensate for the loss, such as retransmitting the whole or part of the lost information (possibly at a lower quality), selecting as MC references those frames in the buffer that have been correctly decoded, and coding with Intra mode those parts that could be corrupted by the loss, in order to block the propagation of the error. These measures allow a reduction in the overall amount of transmitted data and in the average delay that elapses between the first transmission and the instant when a correct decoding is possible.

The following section presents another alternative to all the previous coding techniques, based on the adoption of FEC codes to generate additional redundancy packets in the bit stream that allow the decoder to recover the lost information in case of losses.

5.3 Channel coding techniques based on FEC codes

All the previous techniques aim at reducing the probability of receiving a corrupted bit stream by increasing the redundant information sent across the channel. This is obtained either by exploiting the intrinsic redundancy, which is present both in the video signal and between the syntax elements of the video coder, or by retransmitting the lost information at different quality levels. In this section we present another approach, which creates additional redundant information using Forward Error Correction (FEC) codes.

FEC codes constitute the first class of codes that have been efficiently applied in communications, thanks to an error correcting performance that does not require signaling the correct reception of the transmitted data back to the transmitter. Given a source of binary symbols with bit rate R_b, the FEC coder converts this stream into a new one with bit rate R_b · (1 + r), where r ≥ 0 is the additional redundancy. For example, a simple, inefficient code is the repetition code, which replicates each input binary symbol 1 + r times in the output stream [68]. It is possible to obtain good correcting performance without wasting the available bandwidth (as repetition codes do) by processing blocks of symbols from the input source. In case of a block code, s-length strings of bits are mapped into n-length strings of bits (n = s + k) by adding k redundant bits. In this case the redundancy is equal to r = k/s. The cardinality of the domain is 2^s, but the cardinality of the codomain is 2^n, where 2^n − 2^s corrupted codewords are included (here a corrupted codeword is intended as a codeword that is not included in the image set of


the map). One of the first examples of a block code is the parity-check code [68], but several others have been proposed in the literature. The lengths of the input and output strings of a block code characterize the error correcting and detecting performance of the channel code itself. Whenever the codewords are evenly distributed in the codomain set (i.e. they are equally distant), it is possible to correct ⌊(k − 1)/2⌋ errors and detect k corrupted bits. The arithmetic of binary block codes is based on the binary Galois field GF(2), but it can be extended to wider Galois fields GF(2^q), and all the previous properties still hold for non-binary symbols.
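As a concrete illustration of the R_b · (1 + r) rate expansion, the repetition code mentioned above can be sketched with 1 + r = 3 copies per bit (r = 2); majority voting then corrects a single error per group (an illustrative sketch only, not a practical code):

```python
def repetition_encode(bits, copies=3):
    """Repetition code: each input bit is sent `copies` times, so the
    output rate is (1 + r) times the input rate with r = copies - 1."""
    return [b for b in bits for _ in range(copies)]

def repetition_decode(coded, copies=3):
    """Majority vote over each group of `copies` received bits."""
    return [int(sum(coded[i:i + copies]) > copies // 2)
            for i in range(0, len(coded), copies)]

msg = [1, 0, 1, 1]
tx = repetition_encode(msg)            # 3x the bandwidth: r = 2
tx[4] ^= 1                             # the channel flips one bit
assert repetition_decode(tx) == msg    # one error per group is corrected
```

The waste is evident: the same single-error correction per symbol can be obtained far more cheaply by the block codes (e.g. Reed-Solomon) discussed next, which spread k redundant symbols over s source symbols.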

Although these techniques were introduced at the lowest layers of the protocol stack to cope with errors on communication channels, they have recently been reused at the higher layers in order to recover the lost information. As for multimedia signals, different proposals have aimed at applying block codes across distinct video packets. One of these cross-packet strategies consists in placing the video packets in a matrix in column-wise order and applying the chosen block code along the rows [33, 34, 106, 63]. For a given channel, the performance of the code strictly depends on the filling strategy and on the size of the matrix. Assuming that n is the total number of columns and L is the number of rows, the first s columns (source code columns) are filled with video source packets, while the remaining k = n − s columns (channel code columns) are computed according to the adopted channel code. The symbols in the channel code columns are included in the redundant packets that allow the decoder to reconstruct the lost information in case of errors. In the following paragraphs, a brief overview of the different coding methods will be given. The simplest cross-packet method that was adopted is described

[Figure 5.2: General scheme for the coding matrix in the RFC2733 approach, (a) with zero-padding and (b) without zero-padding. The rows span the matrix height and the columns span the codeword length; source coding bytes occupy the first columns and channel coding bytes (the FEC packets) occupy the last ones.]

in RFC2733 [106] and introduces one parity-check packet of L bytes for each matrix of source packets (i.e. k = 1). The source packets are included in the matrix one per column, permitting a correct decoding of the video information only whenever up to k packets over n = s + k are lost or corrupted. The scheme was later extended by adopting more complex FEC coding solutions, like the Reed-Solomon (RS) codes, which allow the receiver to recover more packets per matrix and are widely used in many transmission systems.² This scheme generalizes the

² Another important class of codes is nowadays used in this type of protection, the Digital Fountain Raptor codes [66, 74, 111]. However, in this work they will not be considered, since we focus on tuning the protection level and the matrix size.


RFC2733 scheme and provides an improved recovery efficiency, since an RS(n, s) code allows the channel decoder to recover up to n − s lost columns of the same matrix. The scheme is depicted in Fig. 5.2(a). The characteristics of the code can be varied according to the desired performance and complexity, and many works compare the performance of the different coding solutions. Here only RS codes are considered, since the investigation of the optimal code is beyond the scope of this work.

The size of the matrix depends on the longest video packet, since its length determines the number of rows. The columns of the shortest packets are padded with dummy symbols equal to zero. As will be shown later, this filling strategy causes a waste of bandwidth whenever the variance of the packet lengths is high, and implies a different channel coding rate depending on the row.
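The column-wise filling with zero-padding can be sketched for the simplest case, k = 1, where the single channel code column is the XOR parity of the source columns (a sketch of the parity-check idea behind RFC2733, not a full RS(n, s) implementation; the function names are ours):

```python
def build_fec_matrix(packets):
    """Place each packet in its own column, pad short columns with zero
    bytes up to the longest packet, and append one XOR parity column
    (the k = 1 parity-check case; RS(n, s) codes generalize this)."""
    height = max(len(p) for p in packets)           # rows = longest packet
    cols = [list(p) + [0] * (height - len(p)) for p in packets]
    parity = [0] * height
    for col in cols:
        parity = [a ^ b for a, b in zip(parity, col)]
    return cols + [parity]

def recover_column(matrix, lost_idx):
    """The XOR of all surviving columns rebuilds the single lost one."""
    height = len(matrix[0])
    rec = [0] * height
    for i, col in enumerate(matrix):
        if i != lost_idx:
            rec = [a ^ b for a, b in zip(rec, col)]
    return rec

pkts = [b"abcd", b"ef", b"ghi"]
m = build_fec_matrix(pkts)
assert recover_column(m, 1) == m[1]   # parity recovers exactly one loss
```

The zero-padding is visible in the short columns: those padded bytes consume protection capacity without carrying data, which is the bandwidth waste discussed above.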

After computing the values of the channel code columns, the information in the matrix is sent across the network, and the redundancy bytes are packetized in column-wise order according to the size of the PDU. An extra payload (12 bytes) is added to these packets in order to make them compliant with the RTP format (see [106] for further details). At the receiver, the same matrix is recreated by filling the cells with the incoming data. Whenever a packet is missing, all the cells that contained its bytes are labeled as missing, and the decoder scans each

row, checking the number of missing bytes. In case the missing bytes in the j-th row x[j,1...n] do not exceed k, it is possible to reconstruct the lost information by solving the linear system in the Galois field GF(M)

$$
\begin{bmatrix}
\alpha_1^1 & \cdots & \alpha_{n_{err}}^1 \\
\alpha_1^2 & \cdots & \alpha_{n_{err}}^2 \\
\vdots & \ddots & \vdots \\
\alpha_1^{n_{err}} & \cdots & \alpha_{n_{err}}^{n_{err}}
\end{bmatrix}
\begin{bmatrix}
x[j, l_1] \\
x[j, l_2] \\
\vdots \\
x[j, l_{n_{err}}]
\end{bmatrix}
=
\begin{bmatrix}
-S(\alpha) \\
-S(\alpha^2) \\
\vdots \\
-S(\alpha^{n_{err}})
\end{bmatrix}
\qquad (5.1)
$$

where α_i ∈ GF(M), i = 1, . . . , n_err, and l_i are the indexes of the lost bytes in the row. The function S(a) is the syndrome computed on a from the received row x[j,1...n], where the missing bytes are replaced by zeros.
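The per-row erasure count performed at the receiver can be sketched as follows (illustrative only: it counts the missing cells of each row, leaving the actual GF(M) solution of (5.1) to an RS decoder; it also shows how packet wrapping, as in Fig. 5.2(b), can place two erasures in the same row):

```python
def erasures_per_row(packet_lengths, height, lost):
    """Map each packet byte to its matrix cell (column-wise filling with
    wrapping) and count, per row, how many cells a set of lost packets
    erases.  A row is solvable iff its count is <= k for an RS(n, s)
    code with k = n - s channel code columns."""
    counts = [0] * height
    pos = 0                          # linear byte position in the matrix
    for idx, length in enumerate(packet_lengths):
        for _ in range(length):
            row = pos % height       # column-wise order: row advances first
            if idx in lost:
                counts[row] += 1
            pos += 1
    return counts

# A 5-row matrix: packet 0 (7 bytes) wraps into a second column, so its
# loss alone erases two cells in rows 0 and 1.
assert erasures_per_row([7, 5], height=5, lost={0}) == [2, 2, 1, 1, 1]
```

The example makes the wrapping hazard of Section 5.4.1 concrete: a single lost packet can cost two erasures in the same row, halving the number of further losses that row can tolerate.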

In this approach, the matrix size plays a crucial role both in terms of allocated redundancy and of playout delay; therefore, its dimensions must be properly optimized. To reduce the percentage of FEC bytes sent across the network, it is possible to include more than one packet per column, wrapping the exceeding bytes onto the following columns (see [33, 34] and Fig. 5.2(b)). In this way the matrix height does not depend on the length of the longest packet, but can be tuned in different manners.

The matrix has a double function. On the one hand, it packs the video source information in an appropriate manner in order to provide all the data with the same level of protection, limiting the number of cells padded with dummy bytes. On the other hand, the matrix can be properly dimensioned in order to work like an interleaver, which scrambles the packets before computing the FEC bytes, allowing a better recovery from erasures whenever the network is affected by bursts of losses. As a drawback, including more packets increases the playout delay. In case of real-time applications, we may reduce the number of matrix columns. This adaptive approach allows the coding scheme to obtain better performance, as Figure 5.7 shows.


Note that these schemes introduce an additional delay in the decoding process since, in case of losses, the channel decoder has to wait for the matrix to be filled before recovering the lost information. The current frame is displayed after a time delay that depends on the size of the matrix. Hence the need for efficient algorithms that shape the matrix dimensions in order to keep a limited delay and control the included redundancy.

5.4 Adapting the matrix size to the input data

The previous section has presented an efficient approach that enables the decoder to reconstruct the transmitted information in case of losses, by including some redundant information in the transmitted bit stream. However, the proposed approach performs quite differently according to the size of the coding matrix with respect to the input data, since a wrong dimensioning may lead to overprotecting some bytes while underprotecting others. As a consequence, the performance of the channel coding scheme decreases, because the correcting capability is weakened and the allocated redundancy wastes the available bandwidth. Therefore, the matrix dimensions must be appropriately tuned in order to maximize the recovery performance.

In this work we considered two adaptations. The first one is based on the length of the packets; the second one is based on the characteristics of the video information included in the corresponding packets.

5.4.1 Adapting matrix size according to the packet lengths

Packet lengths significantly affect the recovery performance in relation to the size of the matrix. In case the matrix height is too small, the longest packets may be wrapped and inserted in more than one column. Whenever they are lost, two cells in the same row may be marked as missing, decreasing the number of packets that can be recovered. On the other hand, increasing the height of the matrix for a given number of source code columns avoids the problems related to packet wrapping, but increases the number of included video packets. As a consequence, the recovery time after the loss of a packet is delayed, since the decoder needs to wait for a complete filling of the matrix. Finally, the number of channel code columns must be properly varied in order to match the channel characteristics.

In this work we adopt the code RS(255, C), with C varying according to the number of desired channel code columns. This choice was suggested by the consideration that Reed-Solomon codes have been widely studied and implemented in the transmission of digital video signals, and the market offers a wide range of chipsets that efficiently perform the computation in real time. The number of source code columns s varies independently of C in order to match the input data. On the other hand, for a given coding rate r, the value of C is chosen according to the equation

k = 255 − C = ⌊s · r + 0.5⌋, which implies C = 255 − ⌊s · r + 0.5⌋. (5.2)
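Equation (5.2) can be turned into a small helper (a sketch; the function name is ours):

```python
def rs_parameters(s, r):
    """For RS(255, C): derive the number of redundancy columns k and the
    code parameter C from the source column count s and the target
    coding rate r, following eq. (5.2)."""
    k = int(s * r + 0.5)       # k = floor(s*r + 0.5), rounding to nearest
    C = 255 - k
    return C, k

# Example: s = 200 source columns at rate r = 0.1 give k = 20 redundancy
# columns, hence the code RS(255, 235).
assert rs_parameters(200, 0.1) == (235, 20)
```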

A first criterion we have followed in order to dimension the coding matrix is based on the

length of the input video packets. Under the assumption of including more than one packet,


the matrix height is tuned on the average length of the packets in the coded stream. This choice proves to be efficient for large matrices that include nearly one GOP, like the approaches [33, 34] designed for mobile messaging applications (MBMS) over third-generation mobile channels.³ In this case, the target application does not impose tight constraints on the time delay. Focusing on real-time applications, a smaller matrix is needed, since a playout delay of one GOP is too long. This implies an accurate dimensioning, since a reduction of the matrix may lead to a dramatic decrement of the recovery performance. The following paragraph provides details of the adopted matrix dimensioning algorithm.

In order to constrain the decoding delay, the matrix shaping algorithm limits the number of packets that can be included in the matrix. Then, the number of rows and columns is varied in order to suit the characteristics of the packets that enter the matrix. Since the correcting capability of the matrix is significantly affected by long packets, the number of rows L must be greater than or equal to the length of the longest packet, L_max = max_i L_i. However, in case the packet lengths show a high variance, a lot of dummy bytes could be inserted in the last source code column. Therefore, the algorithm at first sets L equal to L_max and checks the number of non-dummy bytes present in the last column. In case the number of non-dummy bytes is lower than half of the matrix height, the matrix height is increased by one byte at a time until there are no more bytes in the last column. In this way, the constraint imposed by the length L_max is respected, and the number of dummy bytes is minimized, as the performance of the algorithm shows with respect to its non-adapted version (see Fig. ??). Since our target applications include low-delay multimedia communications, like videophoning and streaming, the number of packets included in a matrix is equal to the number of packets that code a single frame (i.e. the number of slices).
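One possible reading of this dimensioning rule can be sketched in Python (an interpretation of the algorithm above, not the thesis code; the half-height threshold and the stopping condition follow the text):

```python
def adapt_matrix_height(packet_lengths):
    """Matrix-height shaping sketch: start from the longest packet and,
    when the last source column is less than half full, grow the height
    one byte at a time until that column is no longer needed."""
    total = sum(packet_lengths)
    L = max(packet_lengths)        # height must cover the longest packet
    cols = -(-total // L)          # columns used with wrapping (ceil)
    last = total - (cols - 1) * L  # source bytes in the last column
    if last < L / 2:
        while -(-total // L) == cols:   # grow until one fewer column suffices
            L += 1
    return L

# Packets of 10, 10 and 3 bytes: with L = 10 the last column holds only
# 3 bytes (< 5), so the height grows to 12 and two columns suffice.
assert adapt_matrix_height([10, 10, 3]) == 12
```

Growing the height trades a few extra rows for one fewer, mostly-empty source column, which reduces the dummy bytes while keeping L ≥ L_max.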

5.4.2 Adapting matrix size according to the video content

So far, the discussion about the matrix dimensions has never taken into account the video content carried in the video packets. Following Shannon's separation principle [110], the optimization of the matrix size has been performed only according to the statistics of the packet lengths, ignoring the characteristics of the coded video signal. However, varying the protection according to the video content avoids allocating unnecessary redundancy and improves the overall performance. This approach makes it possible to increase the channel coding rate for those parts of the video stream that are crucial in the decoding process, and to reduce the additional redundancy for those parts that have little influence. The following paragraphs show how it is possible to classify the packets produced by a source coder and choose an appropriate protection level for each of them.

As has been shown before, the significance of frame losses for a hybrid video coding architecture is tightly connected with the significance of the frame in the motion compensation process, e.g. the number of frames that can be correlated to it through motion compensation. A corruption or a loss of its visual information has a major impact on the overall quality of the reconstructed sequence until a refresh of the frame buffer is performed (by coding an Intra frame). Several works in the literature have studied distortion propagation and have shown its

³ The work was carried out within 3GPP.


effects whenever different types of frames are lost. In the studied approach, B frames are not used as references for motion estimation, since they usually display a lower visual quality and motion compensation may be negatively affected. Therefore, the redundancy is varied only on I and P frames.

On the other hand, error concealment at the decoder must be considered. Error concealment estimates the lost information from the neighboring one [24, 22]. More specifically, the correlation that exists among adjacent syntax elements partially allows the estimation of the lost data, as in the case of motion vectors. However, the correlation may vary according to the input sequence, and error concealment performs quite badly whenever the correlation among neighboring syntax elements is low. An efficient tuning of the channel code must reduce the protection level in case the lost information can be accurately reconstructed, and increase it whenever the coded information is hardly predictable. Hence the need for a parameter that is able to characterize the importance of each packet in the overall decoding process.

The first parameter to be considered is the activity of the residual signal. Computing the activity of the current frame, it is possible to understand whether the displayed picture can be easily predicted or not with respect to the other frames. A low activity value indicates that motion estimation has performed quite well, and the current picture can be well represented by partitions of the previous frames. In case the activity is high, the level of "innovation", i.e. the amount of unpredictable visual information, rises, and it is possible to deduce that the current frame contains some elements that can not be efficiently motion compensated. The occurrence of high activity values is usually related to the presence of complex (i.e. non-translational) motion, the shooting of new objects, and scene changes. All these elements could be crucial for motion estimation since the intrinsic correlation that exists in a video sequence makes them highly probable candidate references for the following motion compensations, and their loss may significantly affect the following frames. Hence, frame activity is deeply correlated with the relevance of the picture in the motion estimation, and it can be used to adapt the FEC code for each frame in order to increase the correcting performance.

The propagation of the error deriving from the loss of the k-th packet can simply be modelled as follows

σ²_corr(n, k) = σ²_corr(0, k) · f_ptf(n)    (5.3)

where σ²_corr(n, k) is the distortion resulting on the (k+n)-th frame at the decoder and f_ptf(n) is the power transfer function that models the propagation of the distortion through the GOP (see [28]). According to [17], the power transfer function (p.t.f.) can be well approximated by the following equation

f_ptf(n) = 1 / (1 + η·n)    (5.4)

where the parameter η characterizes how fast the prediction loop compensates the distortion introduced by the loss. The p.t.f. describes the distortion leakage in the prediction loop (see [17]) that is produced by spatial filtering during the encoding. Spatial filtering can either be introduced by an explicit loop filter, as in H.261, or implicitly as a side-effect of fractional-pel motion compensation⁴ and deblocking filtering, as in H.264/AVC. Other prediction techniques like overlapped block motion compensation (OBMC) may also contribute to the overall quality increment, but they will not be considered since they are not included in the H.264/AVC standard. Since an accurate derivation of the effect produced by each individual technique is a hard task, the overall effect can be described by a separable average loop filter (see [17]). In this approach, the main concern regards the propagation of the distortion due to the lost information, and therefore we will focus on the overall effects of this generalized filter at the frame level.

⁴Blocks are interpolated using a 6-tap and a 4-tap low-pass FIR filter.

Simulations were run in order to evaluate the relation between the activity value of the video contents included in each packet and the impact of its loss on the quality of the reconstructed sequence at the decoder. The first analysis concerns the average decrement of quality through the whole GOP, i.e. the amount of distortion introduced in the sequence, as a function of the activity value associated to the lost packet. Results are reported in Figures 5.3(a), 5.3(b), and 5.3(c), and show that there is a linear relation between the average relative quality loss and the activity value itself. In a second step, we evaluated the propagation of the error through the whole GOP. Figures 5.3(d), 5.3(e), and 5.3(f) report the dependence between the activity and the parameter N_3dB, which represents the number of frames after which the distortion is lower than 3 dB and is computed through the equation

N_3dB = (σ²_corr(0, k)/2 − 1) / γ    (5.5)

derived from eq. (5.4).
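The model of eqs. (5.3)-(5.5) can be sketched numerically. The following is a minimal Python sketch with illustrative parameter values, not the thesis code; the function names are hypothetical:

```python
def f_ptf(n, eta):
    """Power transfer function of eq. (5.4): f_ptf(n) = 1 / (1 + eta * n)."""
    return 1.0 / (1.0 + eta * n)

def propagated_distortion(sigma2_0, n, eta):
    """Eq. (5.3): distortion on the (k+n)-th frame after losing packet k."""
    return sigma2_0 * f_ptf(n, eta)

def n_3db(sigma2_0, gamma):
    """Eq. (5.5): number of frames after which the loss has faded by 3 dB."""
    return (sigma2_0 / 2.0 - 1.0) / gamma

# The leakage in the prediction loop makes the distortion decay
# monotonically along the GOP (illustrative values sigma2_0=100, eta=0.1).
curve = [propagated_distortion(100.0, n, eta=0.1) for n in range(15)]
recovery = n_3db(100.0, gamma=0.05)
```

A larger η (stronger spatial filtering in the loop) makes the curve decay faster, consistent with the role the text attributes to loop filtering and fractional-pel interpolation.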

The reported results show that there is a linear relation between the activity value of a packet and the overall quality decrement produced by its loss. This linear relation is also found for the parameter N_3dB, proving that a high activity value identifies those frames whose loss produces a significant distortion on the whole sequence. More precisely, Figures 5.3(d) and 5.3(f) show that the slope of the linear relation depends on the characteristics of the coded signal, since the recovery time increases more quickly with the activity for sequences that contain a lot of motion.

Therefore, it is possible to design an optimization algorithm that adapts the amount of redundancy introduced in the video stream according to the significance of the packets.

While coding the video signal, the algorithm estimates the average value of the activity avg_act_i and its variance var_act_i, i = I, P, B, for each frame type i. After H.264 has coded the n-th picture, the matrix-based channel coder adopts an RS(K + C_n, K) code, with K depending on the lengths of the H.264 RTP packets⁵ and C_n equal to

C_n = C + {  2   if act ∈ avg_act_i + √(var_act_i) · [1, +∞)
             1   if act ∈ avg_act_i + √(var_act_i) · [0.5, 1)
             0   if act ∈ avg_act_i + √(var_act_i) · [−0.5, 0.5)
            −1   if act ∈ avg_act_i + √(var_act_i) · [−1, −0.5)
            −2   otherwise                                        (5.6)

⁵The number of rows and the number of source code columns are tailored according to the adaptive algorithm previously described in Section 5.4.1.



Figure 5.3: Experimental results for different sequences (foreman, mobile, table) showing the relative quality loss δE(PSNR)/E(PSNR) (panels (a)-(c)) and the parameter N_3dB (panels (d)-(f)) vs. the activity act. Results were obtained coding the first 15 frames of each sequence (GOP IPPP and QP = 15 + 2k with k = 0, ..., 10) and enabling error concealment at the decoder.



provided that the final value of C_n is non-negative. C is the average number of redundancy bytes for the coder, while act is the activity value for the current frame. Since the loss of a B frame does not affect the quality of the reconstructed sequence as I and P pictures do, we decreased the final value C_n by one unit whenever the current frame is coded as B. In the end, if C_{n−1} > C, C_n is set to C − 1 in order to reduce the overall final redundancy of the bit stream. Fig. ?? compares this algorithm with the previous ones. The reported graphs show an increment of the visual quality of the reconstructed sequence since each frame is protected depending on how important it is in the decoding process. In addition, the activity-based optimization is able to increase the coding performance of the scheme. In fact, the optimized RFC2733-like scheme is able to outperform the scheme optimized only on frame size (see Fig. ??). However, the best result was obtained combining both the optimization of the matrix size and the optimization of the adopted FEC code.

The previous paragraph has shown how the activity value proves to be an efficient parameter to characterize the relevance of the frame in the decoding process. However, a parameter that characterizes both the source and the channel coder proves useful since it allows an external controller to allocate the available bandwidth between these two in an optimal way. Chapter 4 showed how it is possible to use the percentage of "zeros" to model the bit rate produced by the H.264/AVC coder. On the other hand, it is possible to notice that the percentage ρ proves to be a good substitute for the activity in the FEC scheme too. A low percentage of zeros is related to complex texture information (and, therefore, a high activity level) since it implies the presence of many high frequency coefficients. Moreover, quantized DCT coefficients encode the residual information of the frame, i.e. the innovation that can not be approximated from the previous pictures. Therefore, a high occurrence of zeros is deeply connected with a rather simple residual signal that can be more easily estimated by a concealment algorithm. This fact can be highlighted by relating the percentage ρ of zeros for a packet with the distortion produced by its loss. Figures 5.4(a), 5.4(b) and 5.4(c) report the relative distortion produced in a GOP by the loss of a packet as a function of its percentage of null quantized coefficients. It is possible to notice that the higher the percentage of zeros, the easier the task of concealment. Results were obtained erasing one packet from the coded stream and evaluating the average PSNR obtained using the error concealment algorithm described in [22]. In Figures 5.4(d), 5.4(e) and 5.4(f), the parameter N_3dB is reported as a function of ρ. It is possible to notice that a low percentage of zeros is associated with a lower capacity of recovering the quality after the loss of a packet. Therefore, it is possible to adopt the percentage ρ of zeros in place of the activity. Since an increase in the complexity of the residual signal is characterized by a reduction of the percentage of zeros for a given QP, increasing the recovering capability whenever the percentage of zeros decreases enhances the probability of restoring some of the most important information in the sequence. In this investigation, we adapt the code according to the following equation

C_n = C + { −2   if ρ ∈ avg_ρ_i + [0.04, +∞)
            −1   if ρ ∈ avg_ρ_i + [0.02, 0.04)
             0   if ρ ∈ avg_ρ_i + [−0.02, 0.02)
             1   if ρ ∈ avg_ρ_i + [−0.04, −0.02)
             2   otherwise.                       (5.7)



Figure 5.4: Experimental results for different sequences (foreman, mobile, table) showing the relative quality loss δE(PSNR)/E(PSNR) (panels (a)-(c)) and the parameter N_3dB (panels (d)-(f)) vs. the percentage ρ. Results were obtained coding the first 15 frames of each sequence (GOP IPPP and QP = 15 + 2k with k = 0, ..., 10) and enabling error concealment at the decoder.



where avg_ρ_i is the average percentage of zeros for i-type frames (i = I, P, B). Fig. ?? reports the results for the sequence foreman. The graphs show that the ρ-based algorithm has approximately the same performance as the activity-based algorithm.

It is possible to notice that the optimization algorithm based on ρ performs better than the one based on the activity, since ρ better characterizes the frame. Considering the performance of the ρ parameterization, we included this optimization in a joint source-channel rate control strategy that aims to maximize the quality of the reconstructed video sequence at the decoder.
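The mapping of eq. (5.7) can be sketched in the same way as the activity-based rule: the deviation of ρ from its per-type average selects the code adjustment. A sketch with a hypothetical helper name (the non-negativity clamp is carried over from the activity-based rule as an assumption):

```python
def adapt_fec_columns_rho(rho, avg_rho, C):
    """Eq. (5.7): a high fraction of zero coefficients (simple residual,
    easy concealment) removes parity columns; a low fraction adds them."""
    d = rho - avg_rho
    if d >= 0.04:
        delta = -2
    elif d >= 0.02:
        delta = -1
    elif d >= -0.02:
        delta = 0
    elif d >= -0.04:
        delta = 1
    else:
        delta = 2
    return max(C + delta, 0)  # keep C_n non-negative, as for eq. (5.6)
```

Note the sign flip with respect to eq. (5.6): activity grows with frame complexity, while ρ shrinks with it, so the two rules add protection in the same situations.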

5.5 Joint source-channel rate control

Experimental results reported in Fig. ?? show that the adoption of a matrix-based FEC coder is an efficient solution for protecting the data transmitted over an unreliable channel. However, the reported plot provides evidence for the need of optimizing the matrix dimension and the adopted code according to the input video signal, the channel characteristics, and the available bandwidth. In fact, Fig. 5.7 shows that a blind application of the matrix-based FEC scheme may reduce the performance of the scheme in case the matrix size and the protection level are not adequate.

The previous results have shown that the quality of the reconstructed sequence at the decoder is strictly dependent on the characteristics of the lost frames, i.e. the loss of a frame with a high activity value or a low ρ value affects the decoding process more deeply, since the error concealment is more difficult. The adaptive algorithms of the previous section cope with this problem by tuning the additional redundancy according to either the activity or the percentage of "zeros". However, in a real transmission the number of redundant bits, as well as the bit stream produced by the source coder, is constrained by the transmission capacity. Hence, a joint source-channel rate control algorithm, which partitions the available bandwidth between the source coder and the channel coder, is needed. In case the rate assigned to the channel code is reduced in order to provide the source coder with more bits, the small amount of redundant packets does not allow the recovery of the lost information, and the decoder can rely on the error concealment only. On the other hand, an excessive number of redundant packets leads both to a waste of the transmission capacity and to a decrement of the quality in the reconstructed sequence. In this case the assigned protection is overestimated and many FEC packets are not used, while the H.264/AVC coder has to code the input data introducing a higher distortion since the available number of bits is small. This chicken-and-egg problem can be solved only by a joint rate allocation strategy that accurately tunes both coders. Previous results have shown that the percentage of zeros ρ can efficiently model the bit rate produced by a generic transform-based coder. However, the previous section has also shown that the percentage ρ is also correlated with the significance of the video information in the decoding process. As a consequence, it is possible to merge both techniques to design an effective joint strategy.

Given the target overall bit rate R_b, the frame rate F_r, and the number N of frames in a GOP, the algorithm assigns T_i bits to the i-th frame, computed according to the following equation

T_i = G_{i,j} / (K_{I,P} · K_{P,B} · N_I + K_{P,B} · N_P + N_B)    (5.8)

where G_{i,j} is the number of bits remaining in the j-th GOP after coding the i-th frame and N_t, t = I, P, B, is the number of not-yet-coded t-type frames that still remain in the current GOP. Note that equation (5.8) is similar to eq. (4.21) in Section 4.5.2. K_{t1,t2} is the complexity ratio that is computed as described in eq. (4.38) of Section 4.5.2. However, in this case the parameters X̄_i, i = I, P, B, are the averages of the frame complexities X_i (see eq. (4.39) in Section 4.5.2), which are now modified in order to include the statistical information from the channel coder. Hence, equation (4.40) remains valid, but the parameter S_i now is the sum of the number S^S_i of bits coded by the H.264/AVC coder and the number S^C_i of bits added by the matrix-based channel coder.
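Eq. (5.8) can be read as a weighted split of the remaining GOP budget. A minimal sketch with illustrative values (the complexity ratios K_{I,P} and K_{P,B} would come from eq. (4.38); how the per-type weighting is subsequently applied follows Section 4.5.2):

```python
def target_bits(G, K_IP, K_PB, N_I, N_P, N_B):
    """Eq. (5.8): remaining GOP bits G_{i,j} divided by the weighted count
    of not-yet-coded frames, so an I frame weighs K_IP * K_PB times a B
    frame and a P frame weighs K_PB times a B frame."""
    return G / (K_IP * K_PB * N_I + K_PB * N_P + N_B)

# Example: 120000 bits left, one I and four P frames still to code,
# I frames twice as complex as P, P twice as complex as B.
Ti = target_bits(120000, K_IP=2.0, K_PB=2.0, N_I=1, N_P=4, N_B=0)
```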

The available number of bits T_i is partitioned into two target amounts of bits such that

T_i = T^S_i + T^C_i   with   T^S_i = T_i / (1 + r)    (5.9)

where r is the channel coding rate. The target T^S_i is the number of bits available for the H.264/AVC coder to code the current frame, while T^C_i is the number of bits that can be used to add redundancy information to protect the stream. Given the 5 possible channel coding rates of equation (5.7), the joint control algorithm selects the channel code rate that performs best in order to keep T^S_i + T^C_i as close as possible to T_i.

Given a certain rate value r, the corresponding number of channel code columns C_n is

C_n = ⌊s · r + 0.5⌋.    (5.10)

According to eq. (5.7), it is possible to relate the difference δC = C_n − C to the difference δρ = ρ − avg_ρ_i, which identifies an interval of possible ρ values [ρ_min, ρ_max] through the conditions

[ρ_min, ρ_max] = { [avg_ρ_i + 0.04, +∞]              if δC ≤ −2
                   [avg_ρ_i + 0.02, avg_ρ_i + 0.04]  if δC = −1
                   [avg_ρ_i − 0.02, avg_ρ_i + 0.02]  if δC = 0
                   [avg_ρ_i − 0.04, avg_ρ_i − 0.02]  if δC = 1
                   [−∞, avg_ρ_i − 0.04]              if δC ≥ 2    (5.11)

The target value ρ_{T,i} is obtained from the equation

ρ_{T,i} = (T_i − q) / ((1 + r) · µ)    (5.12)

where µ and q were first presented in eq. (4.2) in Section 4.2. In case ρ_{T,i} ∈ [ρ_min, ρ_max], the target value for ρ is found, and the H.264/AVC coder has to tune its coding parameters in order to match the percentage ρ_{T,i}. The procedure is the same described in Section 4.5.2, and assigns an average QP value to the current frame according to the target percentage of zeros ρ. In case the estimated target percentage of zeros does not lie in the interval [ρ_min, ρ_max], the joint rate



control algorithm takes into consideration another redundancy ratio.
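The selection loop described by eqs. (5.9)-(5.12) can be sketched as follows. All parameter values are illustrative, the clamping of δC to the five cases of eq. (5.7) is an assumption of this sketch, and s denotes the number of source columns:

```python
import math

def select_joint_rate(Ti, C, s, avg_rho, mu, q, candidate_rates):
    """Try each candidate channel coding rate r: derive C_n via eq. (5.10),
    the admissible rho interval via eq. (5.11), and the target rho via
    eq. (5.12); accept the first rate whose target lies in its interval."""
    intervals = {  # eq. (5.11): delta_C -> [rho_min, rho_max]
        -2: (avg_rho + 0.04, math.inf),
        -1: (avg_rho + 0.02, avg_rho + 0.04),
         0: (avg_rho - 0.02, avg_rho + 0.02),
         1: (avg_rho - 0.04, avg_rho - 0.02),
         2: (-math.inf, avg_rho - 0.04),
    }
    for r in candidate_rates:
        Cn = math.floor(s * r + 0.5)          # eq. (5.10)
        dC = max(min(Cn - C, 2), -2)          # clamp to the 5 cases
        rho_t = (Ti - q) / ((1.0 + r) * mu)   # eq. (5.12)
        lo, hi = intervals[dC]
        if lo <= rho_t <= hi:
            Ts = Ti / (1.0 + r)               # eq. (5.9)
            return r, Cn, rho_t, Ts
    return None  # no candidate rate is consistent with its rho interval
```

The rate-model parameters µ and q are those of eq. (4.2); the negative µ used in the test below merely reflects that the bit rate decreases as ρ grows, and is an illustrative choice.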

After coding the first frame, the number of channel code bits T^C_i can be remodelled according to the actual percentage ρ that was obtained for the frame. In this way it is possible to increase the protection of those parts that present some important changes in the video sequence.

Given the actual number S^S_i of bits produced by the H.264 coder and the number S^C_i of bits coded by the channel coder, the available number of bits for the current GOP is updated through the equation

G_{i+1,j} = G_{i,j} − S^S_i − S^C_i.    (5.13)

The parameters of the H.264 coder for a given target T^S_i are chosen according to the algorithm reported in [75].

5.6 Experimental results

In order to evaluate the efficiency of the different coding solutions, we simulated the transmission of the RTP packets over the mobile IP network, selecting the 3GPP framework described in [33] and using different channel error models. In our simulations, we varied the length of the channel code and the error generation process in order to test the solutions under different conditions. In case some RTP packet is still missing after channel decoding, the adopted decoder performs error concealment, interpolating the lost part of the image from the neighboring pixels [22].

In our first investigation, we simulate a random loss of the transmitted RTP packets using different types of parameter settings. For QCIF sequences, we coded different streams using the GOP structure IPPP. In order to avoid a significant propagation of the distortion deriving from the loss of a packet, we used GOPs of 15 frames, where the information of each frame is carried by a fixed number of packets.⁶ The H.264/AVC standard defines different slice partitioning modes (see Section 2.2 and the papers [87, 5, 6, 7]). The most popular modes, chosen in order to ease the error concealment task, include in a slice either an equal number of macroblocks or an equal number of bytes. The adoption of FMO algorithms allows a wide variety of different configurations, but their adoption is still quite a novelty since this slice partitioning mode was not present in the previous coding standards. On the contrary, using a fixed number either of macroblocks or of bytes has already been adopted in some of the previous architectures. In our approach, each slice is made of a fixed number of MBs, so that each packet loss corresponds to the loss of a fixed amount of visual information in a frame. In fact, using a fixed number of bytes implies a variable number of macroblocks in a slice, which affects the resulting distortion in case the current packet is lost. In this way, for each lost packet the amount of corrupted visual information is the same. For QCIF sequences each slice contains 11 macroblocks (i.e. an entire row), while for CIF sequences slices are made of 22 macroblocks.

At first, loss patterns were generated adopting an equal loss probability for each RTP packet. In a second step, we compared the performance of different FEC coding systems on an actual radio channel. In our simulations we considered a packet-switched transmission over an AWGN radio channel with Eb/Nr = 4 dB. The length of the frame was 200 bytes, and the adopted transmission scheme was a QPSK modulator with a convolutional code (rate = 1/2, memory = 5). The measured BER is 0.2·10⁻³.

⁶Note that in H.264/AVC each slice is contained in an RTP packet.

In the following subsections, results for different algorithms and configurations are reported.

5.6.1 Results with a fixed matrix

At first, we evaluated the correcting performance coming from the adoption of a fixed matrix structure. The performance is significantly affected by the insertion criterion and by the dimensions of the matrix with respect to the average length of the included packets. In a first approach, we evaluated the different performances obtained by filling one packet per column (padding the remaining matrix cells with dummy 0 bytes) and by filling each column completely with packets (except for the last column). In the first approach, which is schematized in Fig. 5.2(a), the number of rows L depends on the length L_max of the longest packet. In the second case (see Fig. 5.2(b) for a graphical example), the number of rows is computed according to the average length L̄ of the included RTP packets. In the reported results, the parameter L is set to 3L̄ since in this way the probability of packet wrapping, i.e. the probability that a single packet is included in more than one column, is low enough. In fact, packet wrapping could badly affect the recovering performance since the loss of one packet could result in the cancellation of more than one byte in a row.
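The two filling strategies can be sketched as follows, with packets modelled as byte lists. This is a simplified model of Figs. 5.2(a) and 5.2(b), not the thesis code; the RS parity columns are omitted:

```python
def build_fec_matrix(packets, pad_per_column=True, scale=3):
    """FEC-Padding (one packet per column, rows = longest packet) versus
    FEC-NoPadding (packets written back to back, rows = scale * average
    packet length, so a packet rarely wraps into a second column)."""
    if pad_per_column:
        rows = max(len(p) for p in packets)
        cols = [list(p) + [0] * (rows - len(p)) for p in packets]
    else:
        avg = sum(len(p) for p in packets) / len(packets)
        rows = max(1, round(scale * avg))
        flat = [b for p in packets for b in p]
        cols = [flat[i:i + rows] for i in range(0, len(flat), rows)]
        cols[-1] += [0] * (rows - len(cols[-1]))  # pad only the tail column
    return rows, cols
```

With packets of widely varying length, the padded variant fills many cells with dummy zeros (wasted parity), while the compact variant uses far fewer columns at the cost of occasional wrapping, which the scale factor keeps rare.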

Figure 5.5 reports the average PSNR obtained for different sequences corrupted with 10 independent loss patterns. The 255 columns of the matrix include 239 columns for the source information and 16 for the redundant packets, which are generated using an RS(255,239) code in both cases. The bit stream was corrupted generating different independent loss patterns with cancellation probability 0.03. Note that the FEC-Padding solution implies a significant waste of bandwidth since most of the allocated redundancy is not actually used to correct errors. In fact, the FEC-NoPadding solution is able to obtain the same recovering performance with a significantly lower redundancy. This fact is particularly evident in Fig. 5.5(b), which reports the experimental results for the sequence news, where many parts of the displayed image are static or slowly moving. Therefore, the lengths of the RTP packets vary widely, and a lot of cells in the FEC-Padding matrix are filled with dummy zeros. Another drawback concerns the relation between the PSNR vs. rate plots of the FEC-Padding solution and of the solution that relies only on error concealment at the decoder (without FEC packets). The static content allows the error concealment to reconstruct most of the lost images with a small amount of distortion, and as a result, the performance of the transmission without any additional FEC packets is better than the performance of the FEC-Padding solution in rate-distortion terms. This further justifies the need for adapting the amount of FEC information included in the stream according to the characteristics of the coded sequence.

In a second set of simulations, we tested the sensitivity of the FEC-NoPadding approach in case the loss probability is underestimated and the amount of FEC packets included in the bit stream may be insufficient to recover the lost information. Results in Fig. 5.6 report the average PSNR values obtained for different sequences varying the number of FEC columns in the matrix. Note that the ratio between channel columns and source columns does not correspond



Figure 5.5: Results for different sequences (foreman QCIF, news QCIF, mobile QCIF, table QCIF, foreman CIF, mobile CIF; coded at 30 frame/s with GOP IPPP and fixed QP) where the bit stream is affected by a loss probability of 0.03. Each panel reports PSNR (dB) vs. rate (kbit/s) for losses with FEC-Padding, losses with FEC-NoPadding, losses without FEC, and no losses.



Figure 5.6: Results for the sequence foreman QCIF (coded at 30 frame/s with GOP IPPP and fixed QP) where the bit stream is affected by a loss probability of 0.03 and the number of redundant columns is varied (QP = 15, 19, 23, 27, 31). The graphs report the performance of the FEC-NoPadding algorithm in terms of average PSNR vs. the channel code rate, which is measured both as the ratio k/n (panel (a)) and as the percentage of FEC bytes transmitted in the stream (panel (b)).

to the actual ratio that results from the transmitted RTP packet stream. This is mainly due to the additional overhead derived from the extra header that must be added to the FEC bytes to create new RTP packets, and to the final padding bytes, which can be relevant for the last matrix (the amount of redundancy can be reduced coding longer sequences). It is possible to notice that the full-recovery point⁷ is obtained when the ratio k/n equals the loss probability. Instead, the FEC matrix requires a higher percentage of FEC bytes to provide the transmitted bit stream with enough FEC packets to make it robust to losses.
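The full-recovery behaviour can be checked with a small Monte Carlo sketch: each of the n columns is an RTP packet lost independently, and an RS(n, k) erasure code recovers every row as long as at most n − k columns are missing. This is an illustrative simulation, not the thesis simulator, and it ignores the RTP header overhead discussed above:

```python
import random

def recoverable(num_lost, n=255, k=239):
    """A matrix row survives iff the erased symbols do not exceed n - k."""
    return num_lost <= n - k

def full_recovery_rate(n=255, k=239, p_loss=0.03, trials=2000, seed=7):
    """Fraction of matrices fully recovered under i.i.d. column losses."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        lost = sum(1 for _ in range(n) if rng.random() < p_loss)
        ok += recoverable(lost, n, k)
    return ok / trials
```

With p_loss = 0.03 the mean number of lost columns (about 7.7) is well below n − k = 16, so nearly every matrix is fully recovered, matching the behaviour observed around the full-recovery point.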

In the end, the influence of the number of rows with respect to the average length of the RTP packets is considered. Figure 5.7 reports the average PSNR and the average relative loss δE(PSNR)/E(PSNR) for different configurations of the matrix. The reported experimental results show that whenever the number of rows is too small, the performance of the matrix significantly decreases, even though the code rate (i.e. the number of channel code columns with respect to the number of source code columns) could allow a perfect reconstruction of the lost information. On the other hand, whenever the height of the matrix is large enough, it is possible to adopt a code with a lower correcting capacity with respect to the channel loss; e.g. Fig. 5.7(a) shows that with L > 4L̄ the code rate k/n = 0.4 is enough to recover the whole sequence from losses. The variance of the packet lengths plays a significant role too. Considering Figures 5.7(c) and 5.7(d), it is possible to notice that, for low QPs, L > 2L̄ suffices for a perfect recovery since the lengths of the RTP packets vary less. With strong quantization, the number of skipped macroblocks increases and the lengths of the packets start varying significantly. Therefore, it is necessary to adopt a higher scaling factor in order to reduce the effects of wrapping and to increase the influence of interleaving.

The configurations that have been considered so far imply filling the matrix with a great

⁷Here, the full-recovery point is the point where all the information is recovered by the FEC scheme and the sequence reconstructed at the decoder equals the one reconstructed at the coder.


Chapter 5. Joint Source-Channel Video Coding Using H.264/AVC and FEC Codes

[Figure 5.7 — four contour plots of the relative quality loss δE(PSNR)/E(PSNR) (levels from 0.01 to 0.11, plus a "No losses" region) as a function of the code rate and of the scaling factor (num. rows / avg. len. packets, from 1.5 to 6): (a) foreman QP=20, (b) foreman QP=30, (c) news QP=20, (d) news QP=30.]

Figure 5.7: Results of FEC-NoPadding with different numbers of rows and columns (loss probability 0.03) for the sequences foreman and news QCIF (coded at 30 frame/s with GOP IPPP and fixed QP). The performance is evaluated by reporting a contour plot of the relative quality loss δE(PSNR)/E(PSNR). The number of rows is characterized with respect to the average length of the RTP packets by a scaling factor, and the code rate is given by the ratio k/n.

number of packets. Although this setting proves to be quite efficient for non-interactive video transmission, there are significant drawbacks for videophoning applications. As mentioned in Section 5.3, the recovery of a lost packet is possible only after the matrix has been completely filled. This introduces a variable jitter in the display of the reconstructed images, which can be compensated by appropriately delaying the playout of the sequence. Since such a delay cannot be tolerated in the interaction between two remote users, we need to modify the parameter setting by reducing the matrix dimensions. However, smaller matrices imply a decrease in the efficiency of the coding scheme, and therefore it is necessary to set the number of rows, source columns and channel columns in an appropriate way. In the following section, adaptive methods will be tested.


5.6.2 Results with an adaptive matrix

The previous section has shown how the matrix size is deeply correlated with the performance of the FEC scheme. In this section, we present experimental results for the adaptive algorithms reported in Section 5.4. Figure 5.8 reports the average PSNR vs. the produced bit rate for three adaptive solutions and their non-adaptive counterpart. The first adaptive solution (referenced with the label FEC-Adaptive) tailors the matrix size according to the packet length (see Section 5.4.1). In this way, no critical packet wrapping is allowed, i.e. the loss of a single packet corresponds to the cancellation of one byte for some matrix rows. The second adaptive solution improves the performance of the previous one by increasing the number of redundancy columns according to the average activity value of the visual information included in the matrix, as Section 5.4.2 reports. In this way, the algorithm identifies the frames that are significant in the decoding process and protects them with additional FEC packets, while it reduces the redundant information for those frames that can be easily interpolated from the neighboring ones. The third adaptive algorithm uses the percentage of zeros, instead of the activity, to increase the amount of additional redundancy. This behavior allows an accurate control over the bit rate, as will be shown later. Experimental results (see Fig. 5.9) show that the ρ-adaptive approach is significantly better in terms of Rate-Distortion, since it saves FEC bytes on frames that can be easily estimated by error concealment in order to improve the protection level of critical frames that cannot be easily estimated. Note that the allocated redundant bytes are higher for those sequences that present rapid movements and a high activity level.
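As a sketch of the ρ-driven allocation just described, the following function maps the fraction of null quantized DCT coefficients to a number of redundancy columns; the function name, the linear mapping and the clamping are illustrative assumptions, not the thesis's actual rule from Section 5.4.

```python
def redundancy_columns(rho, k_src, r_default):
    """Pick the number of FEC (channel) columns for the current matrix.

    rho       : fraction of null quantized DCT coefficients in the packed frames
                (high rho -> easily concealed content -> less protection needed)
    k_src     : number of source columns in the matrix
    r_default : default number of redundancy columns
    """
    # Illustrative linear mapping: scale the default redundancy by 2*(1 - rho),
    # so low-rho (hard to conceal) frames get extra protection while high-rho
    # frames give some of it back.
    r = round(r_default * 2 * (1.0 - rho))
    # Clamp so the code stays meaningful: at least one parity column,
    # never more parity columns than source columns.
    return max(1, min(r, k_src))
```

For instance, with 10 source columns and a default of 2 redundancy columns, a highly predictable frame (ρ = 0.9) would get a single parity column, while a poorly predictable one (ρ = 0.3) would get three.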

In the following, we test the joint source-channel rate control algorithm reported in Section 5.5.

5.6.3 Results with a joint source-channel rate control algorithm

The final set of simulations concerns the results obtained with the joint source-channel rate control algorithm described in Section 5.5. The simulation benchmark was the same as in the previous tests, and the loss patterns were generated by simulating an AWGN radio channel.

It can be appreciated that the ρ-adaptive algorithm is able to change the coding rate according to the signal characteristics, providing a higher quality with respect to the fixed-rate solution (see Tables 5.1 and 5.2). In fact, the algorithm is able to partition the available bandwidth in an appropriate manner, increasing the channel code rate whenever the input signal is not correlated with the previous data.

5.7 Summary

This chapter presented an effective joint source-channel coding scheme for video transmission over RTP channels, based on cross-packet matrix-based FEC coding. The RTP packets produced by the H.264/AVC video coder are placed into a matrix columnwise, and redundant data are generated by applying a Reed-Solomon code along the matrix rows. The additional


[Figure 5.8 — six plots of PSNR (dB) vs. rate (kbit/s) comparing "Losses with FEC-Fixed" and "Losses with FEC-Adaptive": (a) foreman QCIF, (b) news QCIF, (c) mobile QCIF, (d) table QCIF, (e) foreman CIF, (f) mobile CIF.]

Figure 5.8: Comparison between the length-adaptive method presented in Section 5.4.1 and the fixed method for different sequences (coded at 30 frame/s with GOP IPPP and fixed QP) where the bit stream is affected by a loss probability of 0.06. The code is RS(10, 9) for QCIF sequences and RS(20, 18) for CIF sequences.


[Figure 5.9 — six plots of PSNR (dB) vs. rate (kbit/s) comparing "Losses with FEC-ρ", "Losses with FEC-Act" and "Losses with FEC-Adaptive": (a) foreman QCIF, (b) news QCIF, (c) mobile QCIF, (d) table QCIF, (e) foreman CIF, (f) mobile CIF.]

Figure 5.9: Comparison between the adaptive methods presented in Section 5.4 for different sequences (coded at 30 frame/s with GOP IPPP and fixed QP) where the bit stream is affected by a loss probability of 0.06. The code is RS(10, 9) for QCIF sequences and RS(20, 18) for CIF sequences.


Target BitRate (kbit/s) | Actual BitRate (kbit/s) | Effective Channel Code Rate (r/s) | Lost RTP packets (%) | Final Lost RTP packets (%) | Average PSNR (dB)
356 | 355.67/357.29 | 0.40/0.57 | 13.52/13.68 | 5.93/2.62 | 29.19/32.00
400 | 400.44/400.24 | 0.60/0.57 | 13.51/13.57 | 3.07/2.48 | 28.87/32.64
450 | 471.35/450.19 | 0.41/0.58 | 13.44/13.95 | 6.88/3.36 | 29.04/33.12
500 | 511.05/507.89 | 0.54/0.56 | 14.34/14.19 | 3.48/3.73 | 29.60/33.41
550 | 552.26/562.14 | 0.62/0.57 | 14.29/14.76 | 3.33/4.85 | 28.98/32.22

Table 5.1: Comparison between the ρ-adaptive and fixed rate control methods for the sequence news. The values on the left report the results for the fixed channel rate method; the values on the right report the results for the ρ-adaptive joint source-channel rate control. The default channel code rate is r/s = 0.22.

Target BitRate (kbit/s) | Actual BitRate (kbit/s) | Effective Channel Code Rate (r/s) | Lost RTP packets (%) | Final Lost RTP packets (%) | Average PSNR (dB)
356 | 359.03/360.94 | 0.40/0.57 | 12.91/13.10 | 7.04/0.34 | 22.29/28.44
400 | 404.24/404.54 | 0.60/0.77 | 13.59/13.09 | 5.11/0.45 | 25.56/39.22
450 | 453.07/453.92 | 0.61/0.77 | 13.51/13.19 | 3.07/0.29 | 27.05/30.51
500 | 502.13/503.57 | 0.62/0.77 | 13.59/13.19 | 2.48/0.42 | 22.24/32.04
550 | 551.93/553.35 | 0.62/0.78 | 13.77/13.43 | 2.70/0.50 | 27.20/30.78

Table 5.2: Comparison between the ρ-adaptive and fixed rate control methods for the sequence foreman. The values on the left report the results for the fixed channel rate method; the values on the right report the results for the ρ-adaptive joint source-channel rate control. The default channel code rate is r/s = 0.44.

information is then packed and transmitted across the channel together with the video source packets. In case some video RTP packets are lost and the decoder receives enough redundant packets, it is possible to recover the missing information. However, the proposed scheme achieves different performance according to the size of the matrix and the protection level applied to each frame. Experimental results show that the matrix dimension must suit both the lengths of the packets and the video content they carry. The chapter proposes an optimization algorithm that either increases or reduces the number of channel code columns in the matrix according to the percentage of null quantized DCT coefficients in the coded information. At the same time, the height of the matrix is adjusted according to the length of the longest packet and the overall number of coded bytes. These optimizations can be included in a joint source-channel coding rate control that partitions the available bandwidth between the source coder and the channel coder in order to maximize the quality of the reconstructed sequence at the decoder. Experimental results show a significant improvement in terms of visual quality for a given bit rate with respect to the non-adaptive counterpart.
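The cross-packet scheme summarized above can be sketched in a few lines. In this sketch a single XOR parity column stands in for the Reed-Solomon code (which generalizes the same row-wise recovery to as many lost columns as there are redundancy columns); the function names are illustrative, not the thesis's implementation.

```python
from functools import reduce

def build_fec_matrix(packets, n_rows):
    """Pack byte-string packets columnwise into an n_rows-tall matrix and
    compute one XOR parity byte per row (stand-in for a Reed-Solomon code)."""
    data = b"".join(packets)
    # Split the byte stream into columns of n_rows bytes, zero-padding the last.
    cols = [data[i:i + n_rows].ljust(n_rows, b"\x00")
            for i in range(0, len(data), n_rows)]
    # Row-wise parity: XOR of the r-th byte of every column.
    parity = bytes(reduce(lambda a, b: a ^ b, (c[r] for c in cols))
                   for r in range(n_rows))
    return cols, parity

def recover_column(cols, parity, lost):
    """Rebuild one lost column row-by-row by XOR-ing the parity byte with the
    surviving columns; an RS code extends this to several lost columns."""
    return bytes(reduce(lambda a, b: a ^ b,
                        (c[r] for i, c in enumerate(cols) if i != lost),
                        parity[r])
                 for r in range(len(parity)))
```

The columnwise fill mirrors the text: a lost RTP packet erases bytes in several rows, but only one byte per row, which is exactly what the row-wise code can correct.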


Chapter 6

Achieving H.264-like compression efficiency with Distributed Video Coding

“I shall try to correct errors when shown to be errors,

and I shall adopt new views so fast as they shall appear to be true views”

Abraham Lincoln

Previous chapters have discussed different source and channel coding methods focused on traditional hybrid video coders. In this chapter, a new type of video coding architecture, which allows a robust transmission of coded images, is presented. This scheme can be included in the emerging class of Distributed Source Coding (DSC) based video coders. Although these coders enable low-complexity encoding, they have been unable to reach a compression efficiency comparable with that of video codecs based on motion-compensated predictive coding, such as H.264/AVC, due to insufficient accuracy in video data modeling. The DSC-based approach described in this chapter is intended to achieve H.264-like compression efficiency. The success of H.264/AVC highlights the importance of accurately modeling highly non-stationary video data through fine-granularity motion estimation. This motivates us to deviate from the popular method of approaching the Wyner-Ziv bound with sophisticated capacity-achieving channel codes, which require long block lengths and high decoding complexity, and instead focus on the investigation of efficient models for video data. Such a DSC-based, compression-centric encoder is an important step towards building a robust DSC-based video coding framework.

6.1 Introduction

A recent innovation in the communication world is the massive introduction of multimedia services over wireless networks, mainly inspired by the aim of providing video and audio applications almost anywhere and anytime [10]. More and more Internet and mobile communication providers offer a wide variety of multimedia-related services that span from video communication to the consumption of video-on-demand content on mobile devices. This accomplishment was possible thanks to the recent development of mobile communication and the technological advances in the digital coding of multimedia data. However, the emergence of


heterogeneous network scenarios, characterized by the interconnection of different types of networks and devices, and the massive spreading of mobile communications, affected by a higher percentage of losses and errors with respect to traditional wired communications, has modified the needs and the guidelines followed in the design of compression algorithms. As a matter of fact, the capability of providing reliable video communication in a heterogeneous scenario is the most relevant issue in the widespread diffusion of multimedia mobile services, and the recent literature reports a wide number of different proposals that try to cope with the problems of transmitting a video sequence across a network affected by losses (see Chapter 5).

As anticipated in Chapter 1, the requirements of coding algorithms for wireless video communications can be summarized into three main topics:

• low-power and complexity at the mobile/sensor video encoding unit;

• high compression efficiency due to both bandwidth and transmission power constraints;

• robustness to packet/frame drops caused by wireless channel impairments.

Current video codecs fail to deliver on all these demands since most of them are based on temporal prediction. Although this achieves high coding gains, it turns out to be inefficient whenever some of the information is lost. In this case, the state of the encoder cannot be recovered until it is refreshed (i.e. the encoder codes a frame without any temporal prediction, called an Intra frame). Unfortunately, frequent Intra refreshes lead to a waste of the available bandwidth, since the amount of bits produced by Intra coding is much higher than that produced by temporal prediction coding.

Moreover, we must mention that Motion Estimation (see Section 2.2.2) is a computationally demanding task that has to be run at the encoder. Since in mobile communications the hardware resources of communicating devices are quite heterogeneous, and quite often the transmitting device has a low computational capacity, it is convenient to choose coding schemes that require limited computational power and complexity of the terminal devices. A possible solution is to adopt two different coding architectures for the uplink transmission and for the downlink transmission. In the uplink communication (from the transmitter to the network), the encoding paradigm must require a low complexity at the encoder, shifting the computational load to the decoder. In the downlink transmission, the coding scheme must keep the computationally-demanding tasks at the encoder side, requiring of the terminals only the implementation of a light decoder (such as decoders for traditional hybrid video coding standards). In this way, the encoding/decoding load is mainly sustained by network hardware, which has to transcode the uplink bit stream coming from the transmitter into a new bit stream compliant with the video coding standard adopted in the downlink transmission.

During the last years, novel coding solutions that cope efficiently with these problems have been found, and most of them are based on the Distributed Source Coding (DSC) theory. One of them is the PRISM coder, which aims at satisfying all of the previous requirements by implementing "a modified side-information paradigm where there is inherent uncertainty in the state of nature characterizing the side information" [93]. This coding architecture characterizes the side information as a class of possible predictor values, and in the reconstruction


of the transmitted information the decoder can use any of them, provided that it is available, i.e. it is received correctly. However, it is also possible to adopt the PRISM coding paradigm even when there are no information losses or corruptions (such as in video storage applications). In this case, the side information can be identified by a Motion Vector (see Section 2.2.2).

One weak point of these solutions is the compression ratio, since DSC coding solutions show a lower coding gain with respect to their hybrid counterparts. Therefore, the investigation of effective entropy coding algorithms is a key element in improving the efficiency of this coding solution.

6.2 Distributed Video Coding

The theory of Distributed Source Coding dates back to two major theoretical results: the Slepian-Wolf (1973) and Wyner-Ziv (1976) theorems [113, 130]. However, although the theoretical basis was defined in the 1970s, DSC-based applications for video transmission have appeared only in recent years. This novel coding paradigm relies on the coding of two or more dependent random sequences in an independent way, i.e. associating a separate independent encoder to each of them. In this context, the term "distributed" refers to the encoding operation mode and not to its location. An independent bit stream is sent from each encoder to a single decoder, which performs a joint decoding of all the received bit streams exploiting the statistical dependencies between them. Being aware of this, the different encoders can take advantage of the mutual correlation between source sequences to reduce the overall bit stream size. Assuming that two sources X and Y have to be transmitted with rates RX (RX ≥ H(X)) and RY (RY ≥ H(Y)) respectively, the statistical relation between X and Y allows a considerable reduction of the coded bit stream, since the lower bounds of the coded bit rates decrease (RX ≥ H(X|Y) with RY ≥ H(Y), or RY ≥ H(Y|X) with RX ≥ H(X)) [113]. Although Distributed Source Coding can still achieve the compression gains allowed by joint source coding (R = RX + RY ≥ H(X,Y)), Wyner-Ziv coding [130] focuses on the rate point (RY = H(Y), RX = H(X|Y)), assuming that the source Y is fully encoded and transmitted to the decoder while the source X is coded taking into consideration the existing correlation. Although the encoder does not know the other source, the decoder can use Y to decode X.
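In compact form, the Slepian-Wolf achievable rate region restated from the bounds above (same symbols as in the text):

```latex
% Slepian--Wolf region for separate encoding, joint decoding
\begin{aligned}
  R_X        &\ge H(X \mid Y),\\
  R_Y        &\ge H(Y \mid X),\\
  R_X + R_Y  &\ge H(X, Y).
\end{aligned}
\qquad
\text{Wyner--Ziv corner point: } (R_Y, R_X) = \bigl(H(Y),\, H(X \mid Y)\bigr).
```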

A detailed description of Distributed Source Coding is beyond the scope of this work; further information can be found in [113, 130, 40].

Based on this independent-encoding/joint-decoding configuration, a new video coding paradigm, called Distributed Video Coding (DVC), has emerged. In this case, the statistical dependence that is exploited is the correlation among temporally-close frames, which many video coding standards have already used in Motion Compensation (see Section 2.2.2). However, this encoding technique allows the decoding of the current frame without using a specific reference, as Motion Compensation requires. Any frame that suits the correlation characteristics used to code the current frame is good enough to allow an error-free decoding of the transmitted video data, and therefore the search for a suitable reference has to be performed at the decoder through a Motion Estimation algorithm. Note that in this case both the requirements


of robustness to errors and low complexity at the encoder are met. Since the decoder can use any sufficiently-correlated reference, in the presence of errors the loss of part of the information does not preclude a correct decoding as long as a suitable reference can be found in the frame buffer. In addition, Motion Estimation, which is one of the most computationally-expensive tasks, is performed at the decoder (i.e. at the network side), reducing the hardware requirements of the encoder.

Different DSC-oriented coding schemes have been presented during the last years. In 2002, Jagmohan, Sehgal and Ahuja [48] used coset codes for predictive encoding in order to reduce the consequences of the predictive mismatch without a large increment in terms of bit rate. In the same year, Aaron, Zhang and Girod [2] showed results on video coding using an Intra-encoding/Inter-decoding scheme through a Turbo decoding scheme. Also in 2002, an approach well known as PRISM (Power-efficient, Robust, hIgh-compression, Syndrome-based Multimedia coding) was proposed by Puri and Ramchandran [93] for multimedia transmissions on wireless networks using syndromes. The major goal of this solution is to join the traditional intraframe coding error robustness with the traditional interframe compression efficiency.

In 2003, Zhu, Aaron and Girod proposed an approach to Wyner-Ziv based low-complexity coding that aims at compressing video signals for large camera arrays [135]. In this solution, multiple correlated views of a scene are independently encoded with a pixel-domain Wyner-Ziv coder but are jointly decoded at a central node. The same article shows a comparison between the pixel-domain Wyner-Ziv coder and an independent encoding and decoding of each view employing the JPEG-2000 wavelet image coding standard. The results demonstrate that at lower bit rates the solution presented by Zhu et al. achieves higher PSNR than JPEG-2000 with a lower encoder complexity. For more details, the reader should consult [135]. In 2004, Aaron, Rane, Setton and Girod [1] proposed an architecture similar to the one in [2]; the key difference with respect to [2] is the additional use of transform coding (the DCT transform) at the encoder. The results obtained show that the new coding solution leads to a better coding efficiency when compared with the solution in [2] (at the cost of a higher encoder complexity associated with the DCT transform).

In the same year, another Wyner-Ziv low-complexity video coding solution by Aaron, Rane and Girod was proposed in [131]. This solution is based on an Intra-encoding/Inter-decoding system; in addition to the bit stream resulting from the encoding of the current frame, the encoder also transmits supplementary information about the current frame to help the decoder in the motion estimation task. Also in 2004, Rane, Aaron and Girod presented another approach [115] aimed at making a traditionally encoded bit stream more error-resilient when it is transmitted over an error-prone channel with no protection against channel transmission errors, for example by means of channel coding.

In this scenario, a common element links all these strategies [109, 26, 132]. All these works utilize capacity-achieving channel codes to approach the Wyner-Ziv bound. These solutions require both a high decoding complexity and a long block length, which has to be applied to a very large area of the video frame1. This conflicts with the highly non-stationary nature of video data.

1 Typically, bit-plane encoding is performed over an entire frame.


In [70], it was shown that a distributed video coding approach has the potential of achieving high compression efficiency by modeling video data with motion search and without sophisticated channel codes. In light of the success of H.264, in this work we design and implement a DSC-based video coder that adopts some key primitives underlying the H.264 standard, such as a more sophisticated motion search (and thus a more accurate correlation estimation) and an in-loop deblocking filter [87]. Since a quantized encoding unit itself, instead of the DFD, is used to generate the encoded bit stream, a new arithmetic coder is designed and implemented to suit the new statistics of the encoding coefficients. This approach allows us to achieve H.264-like compression efficiency without having to use sophisticated channel codes that entail a high decoding complexity.

The benefits of adopting such a baseline compression-centric distributed video coder are two-fold. Firstly, the architecture can be efficiently extended to a system that is robust to channel losses. The DSC-based encoder sends information about the source, the amount of which depends on the statistical correlation between the source and the side information (predictor). When channel loss alters this statistical correlation, only the amount of source information needed to successfully decode changes. Therefore, an incremental amount of source information can easily be sent to ensure successful decoding when channel noise weakens the correlation between the source and the predictor. In an MC-based2 system, on the other hand, the compressed data (residual signal) depends on both the source and the predictor deterministically. Therefore, a channel loss that alters the reconstructed predictor will require coding for the unpredicted residual signal. Secondly, for both the MC-based and DSC-based systems, there is a complexity-performance tradeoff. When encoding complexity becomes a constraint, the lower the encoder complexity has to be, the lower the compression efficiency is. However, for a DSC-based system with lowered compression efficiency, we obtain a bit stream with increased robustness [93, 70].

In this work, we focus on a compression-centric DSC-based video coder that is an important building block for a video coding system robust to channel losses.

6.3 A simple example of coding with side information

Previous sections have presented a bird's eye view of different DVC-based video coders that take advantage of the Wyner-Ziv theorem to independently code different sources (i.e. frames), assuming that a certain correlation structure exists among them. Each frame is coded under the assumption that the decoder has in its buffer another frame that is correlated enough with the current one. Therefore, the coder has to specify only the non-correlated information in order to permit a correct decoding. To see how we can achieve this, it is instructive to examine the following example, first presented in [92]. Let X and Y be two correlated pieces of information that are generated by two separate sources (or related to two different frames in a DVC setting) and are to be transmitted to a common receiver. Assuming that Y has already been sent, the information X can be efficiently transmitted considering its correlation with Y. In this way, the redundancy existing between different parts of the overall information is reduced,

2 "MC-based" indicates Motion Compensation based coders.


[Figure 6.1 — two block diagrams: (a) Predictive Coding (Y available both at the encoder and at the decoder); (b) Wyner-Ziv Coding (Y only available at the decoder).]

Figure 6.1: The two different scenarios considered in the example. Side information may or may not be available at the encoder.

leading to a smaller bit stream size. From these premises, two different coding settings can be derived. The first setting (see Fig. 6.1(a)) assumes that Y is known both at the encoder and at the decoder. Therefore, it is possible to transmit X by coding its difference with Y. In the second setting (see Fig. 6.1(b)), the information Y is known only at the decoder, but the encoder knows the characteristics of the correlation existing between X and Y. This can be used to reduce the coded bit stream, provided that the information is coded in such a way that the decoder can recover it using the available information Y, which is called side information. In the following paragraphs, a simple example is provided.

Let X and Y be two binary vectors belonging to the space {0, 1}^3, which includes 8 different values. However, X and Y are correlated in such a way that the Hamming distance dH between them is at most 1. For example, given Y = [1 1 0], X can assume the values {[1 1 0], [0 1 0], [1 0 0], [1 1 1]}. In the first scenario, both the encoder and the decoder know the value of Y, and therefore the encoder only needs to transmit the difference3 D = X ⊕ Y. The difference D can assume 4 different values, and therefore it can be coded with 2 bits, assuming that its 4 values are equally probable. The decoder can combine the transmitted difference D with the symbol Y to reconstruct X = Y ⊕ D. We can relate this example to traditional hybrid video coders by considering the values X and Y as pixels or blocks belonging to different frames which are temporally correlated. In this way, the considered example is analogous to the predictive coding paradigm reported in Section 2.2.2.

[Figure 6.2 — (a) the cosets adopted in the example; (b) example of decoding.]

Figure 6.2: Example of Wyner-Ziv decoding using sources in {0, 1}^3 with Hamming distance dH ≤ 1.

In the second scenario, it is possible to partition the space {0, 1}^3 into 4 separate sets where the Hamming distance between every pair of values is greater than two. Although the encoder is completely unaware of the value assumed by the side information Y, it can transmit to the

3 The symbol ⊕ denotes a bitwise XOR operation.


decoder in which set the information X lies. Then the decoder can choose, among the possible values included in the signalled set, the one that is the closest to the side information Y. Since the distance between values in the same set is greater than two, there is only one value at minimum distance from the side information. Note that the encoder does not need to know the value of Y, but only the maximum Hamming distance that could exist between the two sources. It is also worth mentioning that the decoding of X can be done correctly even if Y assumes a different value. As an example, let X be the string [0 1 0] with Y = [1 1 0] and the space {0, 1}^3 partitioned as Fig. 6.2(a) depicts. In this case, the encoder signals to the decoder that the value of X belongs to the 3rd set, and given Y the decoder can reconstruct X as Fig. 6.2(b) depicts. However, a correct reconstruction is possible even if Y assumes the values {[0 1 0], [0 1 1], [0 0 0]}, since the decoding is not strictly dependent on a specific reference as in the case of predictive coding. Since the number of separate partitions is 4, we need only two bits to code the set where X lies, assuming that all the partitions are equally probable. In this case, the amount of transmitted information is exactly the same as in the previous case.
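As a hedged illustration of the two-bit scheme just described (a sketch, not the thesis implementation), the partition of {0, 1}^3 can be built from the cosets of the length-3 repetition code:

```python
# Sketch of the {0,1}^3 Wyner-Ziv example: the space is split into 4
# cosets of the repetition code {000, 111}; the encoder sends only the
# 2-bit coset index, and the decoder picks the coset member closest
# (in Hamming distance) to the side information Y.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Cosets: XOR every codeword with one weight-<=1 error pattern.
CODE = [(0, 0, 0), (1, 1, 1)]
ERRORS = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
COSETS = [[tuple(c ^ e for c, e in zip(cw, err)) for cw in CODE]
          for err in ERRORS]

def encode(x):
    """Return the 2-bit index of the coset containing x."""
    return next(i for i, coset in enumerate(COSETS) if x in coset)

def decode(index, y):
    """Pick the member of the signalled coset closest to the side info y."""
    return min(COSETS[index], key=lambda v: hamming(v, y))

x, y = (0, 1, 0), (1, 1, 0)           # dH(x, y) = 1
assert decode(encode(x), y) == x      # correct reconstruction with 2 bits
```

The decoder recovers X from any side information within Hamming distance 1, mirroring the decoding of Fig. 6.2.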

For the second scenario just described, different analogies can be derived. One of the most popular ones concerns channel coding theory and associates each partition with a coset C_i of codewords obtained by perturbing the binary words of a maximum-distance channel code C with a specific error vector e such that⁴ w_H(e) ≤ 1 (see [68]). The error is associated with the difference existing between separate sources, while the coset index i can be associated with a set of possible correct codewords in the space {0, 1}^3. Following the same conventions used for channel codes, the coset index will be called syndrome throughout the rest of this chapter. Assuming that the channel decoding process can be seen as a vector quantization in the domain {0, 1}^3, in Section 6.6 the term syndrome will be extended to identify a specific sub-lattice structure, related to a quantization characteristic, that allows a correct reconstruction of the transmitted value by quantizing the side information. In the rest of the chapter, this analogy will be used many times, using the word syndrome to characterize the transmitted information and the word coset in relation to quantizers with shifted characteristics.

Previous paragraphs have shown that the coding efficiency of the second scenario is comparable with the coding efficiency of predictive coding. This is possible whenever the variance of the prediction error (the Hamming distance in the example) is comparable with the distance of codewords within the same coset. The amount of information that needs to be transmitted (i.e. the syndrome) strictly depends on the volume of the quotient group inferred from the code C on the space {0, 1}^3, and therefore the scheme proves to be efficient under the assumption that the code C accurately suits the characteristics of the correlation between X and Y. To this purpose, different approaches were proposed to tailor the shape of the quotient group in order to match the correlation between different sources.⁵

The following paragraphs will show how these principles can be applied to the robust transmission of video content.

⁴The symbol w_H(e) denotes the Hamming weight of the vector e.
⁵Some of them resort to trellis-coded quantization, allowing the adoption of quotient groups with spherical geometry instead of square geometry (see [77]). Others adopt simple probabilistic models for the sake of simplicity (see [52]).


Chapter 6. Achieving H.264-like compression efficiency with Distributed Video Coding

6.4 A quick glance at the original PRISM architecture

The starting point of this investigation is the PRISM coder [93, 95, 94], a DVC coding architecture that tries to "marry the high compression efficiency of the predictive-coding mode with the robustness and the low encoding complexity features of the intra coding mode" [93]. It is possible to notice that the coding paradigm described in the previous section satisfies these requirements. Indeed, it proves to be robust to channel losses, since the transmitted information can be correctly decoded once there is some side information at the decoder that is correlated enough with the transmitted data. In the following, it will be shown how it is possible to apply these principles to video.

As was mentioned in Section 2.2.2, video sequences present a strong correlation between pixels belonging to temporally-close frames. This fact is widely used in traditional hybrid video coding architectures to achieve high compression gains, and it can be used in a DSC approach too. In the case of video signals, it is possible to associate the information X of the previous example with the current block to be coded, and its side information Y with another block that belongs to a different temporally-close frame which has already been correctly decoded. The correlated block Y can be estimated by performing a block-matching motion estimation (BMME) algorithm, and the result of this search permits decomposing the original block into the superposition of some correlated data and some innovation (see Fig. 6.3). Depending on the two different coding paradigms presented in Section 6.3, this search can be done in different places. Since the predictive solution (Fig. 6.1(a)) implies that the reference block is available to both coder and decoder, BMME is performed at the encoder, and the coordinates of the estimated block Y have to be transmitted together with the residual difference in order to allow a precise reconstruction of the signal. A correct decoding is possible only in the case that coder and decoder are perfectly aligned, i.e. the reference frames are the same and the coded information arrives without losses. In the Wyner-Ziv solution, the encoder does not need to perform a BMME in order to find a correlated block in the previous pixels, but requires the knowledge of the correlation existing between the current block X and its prediction Y.

Figure 6.3: A pictorial representation of innovation and correlated information for blocks.

[Figure: a bit mask applied to the coefficient array separates the CRC bits from the Intra bits.]

Figure 6.4: CRC coding mask.

Starting from these premises, in 2002 R. Puri and K. Ramchandran designed PRISM [93],

a robust DSC-based video coder that relies on this second solution. The PRISM architecture


processes the video signal in the transform domain. For each block of quantized transform coefficients (appropriately shifted in order to have only positive values), the coder estimates which part of the information can be correlated with blocks in the previous frames and which cannot. Typically, the correlated information is the most significant bits of the binary representation of each coefficient, and their estimation is performed using a classifier [93] which, according to the mean square error between the current transformed block and its counterpart in the previous frame (i.e. the block placed at the same coordinates), identifies a bit mask. The bit mask selects those bits that could be correlated with a possible prediction block, and the coder computes a 16-bit CRC on them (see Fig. 6.4). The remaining bits are intra coded, i.e. they are coded independently from other blocks, and they will be called syndromes. The CRC and the bit mask (together with the intra-coded bits) are sent to the decoder, which looks for a block Y with the same CRC given the received mask. This block will be the side information to be used in the decoding process. From this point, the decoding process is the same as that adopted by the coder described in Section 6.5. The least-significant intra-coded bits (syndromes) are used to identify the correct coefficients given the estimated predictor block Y. Since this chapter is not intended to give an exhaustive description of the PRISM architecture, further details can be found in the papers [93, 94, 95].
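The decoder-side search can be sketched as follows. This is an illustrative sketch only: the CRC polynomial, the mask layout and the candidate blocks are assumptions, not the actual PRISM choices.

```python
# Sketch of a PRISM-style decoder search: among candidate predictor
# blocks, keep the one whose masked bits reproduce the 16-bit CRC sent
# by the encoder. The CRC-CCITT polynomial 0x1021 is an assumption.

def crc16(bits):
    """Bitwise CRC-16 with the CCITT polynomial 0x1021 (illustrative)."""
    reg = 0
    for b in bits:
        msb = (reg >> 15) & 1
        reg = ((reg << 1) & 0xFFFF) ^ (0x1021 if msb ^ b else 0)
    return reg

def masked_bits(block, mask):
    """Collect, coefficient by coefficient, the bits selected by the mask."""
    return [(coeff >> p) & 1
            for coeff, m in zip(block, mask)
            for p in range(16) if (m >> p) & 1]

def find_side_information(crc, mask, candidates):
    """Return the first candidate block whose masked bits match the CRC."""
    for cand in candidates:
        if crc16(masked_bits(cand, mask)) == crc:
            return cand
    return None

block = [12, 7, 0, 3]                # current block (quantized, positive)
mask = [0xFFFC] * 4                  # MSBs assumed correlated with the past
crc = crc16(masked_bits(block, mask))
# [13, 6, 1, 2] differs from block only in the 2 unmasked LSBs of each
# coefficient, so its masked bits (hence its CRC) coincide with block's.
print(find_side_information(crc, mask, [[13, 6, 1, 2], [40, 0, 0, 0]]))  # -> [13, 6, 1, 2]
```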

Section 6.3 has presented a simple example where the Wyner-Ziv coder obtains the same compression efficiency as its predictive counterpart. However, in most of its practical implementations, DVC is not able to match the compression performance offered by its predictive counterparts. Therefore, the investigation of innovative and well-performing entropy coding mechanisms is a stimulating research topic.

6.5 Structure of the implemented coder

One of the big issues that concern DVC is its compression performance. Distributed Video Coding is nowadays seen as an efficient solution to transmit video content over unreliable channels, but its possibilities in terms of compression performance are still at an early stage. Different approaches are focused on achieving high coding gains with DVC in order to design a coding architecture that embraces both robustness and compression efficiency. Following this tendency, we have investigated an implementation of an efficient DSC architecture focusing on the entropy coding of quantizer-related syndromes.

In order to obtain good compression results, we have designed our implementation using the building blocks of the H.264/AVC coding architecture. The structure of the H.264/AVC coder can be seen as a comprehensive synergism of coding solutions designed in the last 50 years. Many features that are included were already present in some of the previous hybrid coders, but were redefined in order to suit them to the general architecture. In addition, some new elements were introduced, providing the final coder with a wide set of tools that can be rearranged in many different ways. Experimental results prove that this orchestration of many different coding strategies is a winning solution, as the H.264/AVC architecture outperforms all of the previous coding standards, MPEG-4 [44] and H.263 [45] included. Therefore, the implementation of a DSC coding scheme on its basic structure (depicted in Fig. 6.5) is an interesting investigation field, considering that some new features of the coder pose new challenges.

Instead of obtaining the DFD through motion compensation, motion search is used to determine how much of the current quantized information needs to be encoded, i.e. the correlation structure between the source and the side information. From this estimate, the encoder generates a piece of information, called syndrome, which allows the identification of the subspace where both the quantized information and its prediction lie.

Figure 6.5: Encoder block diagram. The key differences between the presented DSC-based encoder and a H.264 encoder are the syndrome generator and the re-designed entropy coder.

both the quantized information and its prediction lie.

The key differences between the presented DSC-based encoder and a H.264 encoder are: (1) an additional syndrome generator, and (2) a modified entropy coding algorithm that better suits the probability distribution of syndrome values, in place of the original H.264 entropy coders, i.e. the Context-Adaptive Variable-Length Coder (CAVLC) and the Context-Adaptive Binary Arithmetic Coder (CABAC), which were designed according to the statistics of the quantized transformed DFD. We now describe these two modules in more detail.

6.6 The generation of syndromes

One of the issues raised by the implementation concerns the syndrome generation, which is strictly linked with the transform and quantization block. Traditional hybrid coders process the Displaced Frame Difference (DFD) between the current block and the reference provided by the Motion Estimation unit (see Section 2.2.2), transforming the residual error and quantizing the resulting coefficients. The H.264/AVC standard adopts a 4 × 4 multiplication-free transform, mapping the block of the residual signal x into a block X of transform coefficients (see Section 2.2.3).

The coefficients' dynamic range is then reduced using a dead-zone quantizer, which can be characterized by the equation

\[
X_q(i,j) = \left\lfloor \frac{X(i,j) + O(i,j,\mathrm{QP},\mathtt{mb\_type})}{\Delta(i,j,\mathrm{QP},\mathtt{mb\_type})} \right\rfloor,
\tag{6.1}
\]

where the quantization step Δ(i, j, QP, mb_type) and the offset O(i, j, QP, mb_type) depend on the coefficient position (i, j) in the block, the Quantization Parameter QP, and the macroblock coding type mb_type. The standard allows specifying a quantization matrix that may vary the relation between different quantization parameters according to Rate-Distortion optimization criteria. In the current JVT implementation, the offset O(i, j, QP, mb_type) is (1/3)Δ(i, j, QP, mb_type) for Intra blocks and (1/6)Δ(i, j, QP, mb_type) for Inter blocks. For the sake of simplicity, in the following paragraphs we will refer to Δ(i, j, QP, mb_type) and O(i, j, QP, mb_type) as Δ and O respectively.
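A minimal sketch of the dead-zone quantizer of eq. (6.1), assuming coefficients already shifted to non-negative values as described in the text; the step size is illustrative, not taken from the standard's QP tables:

```python
# Sketch of the dead-zone quantizer of eq. (6.1) with the JVT offsets
# described above: O = delta/3 (Intra) or delta/6 (Inter). The value of
# delta is illustrative.
from math import floor

def quantize(x, delta, intra=True):
    """Dead-zone quantization of a (non-negative) transform coefficient."""
    offset = delta / 3.0 if intra else delta / 6.0
    return floor((x + offset) / delta)

delta = 16.0
print(quantize(75.0, delta, intra=True))    # -> 5 (larger offset)
print(quantize(75.0, delta, intra=False))   # -> 4 (wider dead zone)
```

A smaller offset widens the dead zone around zero, so the same coefficient can fall into a lower bin for Inter blocks.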

The adopted DSC scheme, in its counterpart, transforms and quantizes the original signal x into the coefficients X_q, using the ME unit to find how much of the quantized information needs to be encoded, i.e. the correlation structure between the source and the side information. In our implementation, the quantization rule was changed in order to match the characteristics of the input signal and avoid an excessive mismatch between the quality obtained by the H.264/AVC and DSC coders for the same QP. Indeed, the adopted quantization offset O is set to

\[
O = \begin{cases}
\dfrac{\Delta}{3} & \mathrm{QP} < 12 \\[2mm]
\dfrac{\Delta}{6}\left(2 - 2^{\frac{\mathrm{QP}-11}{40}}\right) & \mathrm{QP} \ge 12,
\end{cases}
\tag{6.2}
\]

allowing a coarser quantization for high values of the QP parameter. In these cases the quantization rule tends towards a truncation rule, which reduces the occurrence of small coefficients, avoiding the coding of unnecessary information that does not significantly affect the resulting distortion. Side information is found in the previous frames through ME and by computing, in the transform domain, the number n(i, j) of least significant bits that cannot be inferred from the predicted block, according to the equation

\[
n = \begin{cases}
\left\lfloor 2 + \log_2\!\left(\dfrac{\left|(X_q \cdot \Delta) - X_p\right|}{\Delta}\right)\right\rfloor & \text{if } d > \Delta \\[2mm]
0 & \text{otherwise,}
\end{cases}
\tag{6.3}
\]

with d = min{|(X_q Δ) − X_p|, |X − X_p|}. The parameter Δ is the quantization step for that coefficient, X_q is the quantized coefficient from the current block,⁶ and X_p is the corresponding unquantized transform coefficient from the predicted block. From the value of n, a syndrome Z is generated, corresponding to the n least significant bits of X_q according to

\[
Z = X_q \,\&\, (2^n - 1),
\tag{6.4}
\]

where & denotes a bitwise AND operation (note that in equations (6.3) and (6.4) we omitted the indexes). Considering the lattice Λ that includes all of the quantized real values, the symbol Z identifies the sub-lattice Λ_Z, where the binary representations of all values have the same least significant bits (see Fig. 6.6). Therefore the symbol can also be thought of as a sub-lattice index, also called syndrome. This coding strategy corresponds to the multilevel coset framework reported in [69]. In the following, we will represent syndromes with the notation S = 2^n + Z in order to signal both the number of bits and the syndrome value.
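Equations (6.3) and (6.4) can be sketched as follows. The values are illustrative, and the base-2 logarithm is an assumption, since n counts bits:

```python
# Sketch of syndrome generation (eqs. 6.3-6.4): compute the number n of
# least significant bits of Xq that cannot be inferred from the
# prediction Xp, then keep those bits as the syndrome Z.
from math import floor, log2

def make_syndrome(xq, x, xp, delta):
    """Return (n, Z, S) with S = 2^n + Z, or (0, 0, 0) for a null syndrome."""
    d = min(abs(xq * delta - xp), abs(x - xp))
    if d > delta:
        n = floor(2 + log2(abs(xq * delta - xp) / delta))
        z = xq & (2**n - 1)
        return n, z, 2**n + z
    return 0, 0, 0

# A prediction 5 quantization steps away needs n = 4 least significant bits.
print(make_syndrome(xq=20, x=323.0, xp=240.0, delta=16.0))  # -> (4, 4, 20)
```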

Given S and the referenceXp from motion compensation, the decoder can reconstruct

the original quantized valueXq selecting the point in the sub-latticeΛZ which is closer to

the referenceXp. Each syndrome conveys both the number of coded bitsn and their values.

Since these syndromes are not equally likely, they can be entropy coded to achieve higher

compression efficiency. Here, we present a quad-tree based arithmetic coder that is tailored to

6In our DSC implementation,M = 214/∆ is added to each coefficient value in order to make it positive, wherethe214 factor depends on the amplification of the4 × 4 transform (6 bits) on the input signal (8 bits). For furtherdetails, see [35].


Figure 6.6: Partitioning of the integer lattice into 3 levels. The parameter Δ identifies the quantization step, X is the source, X_q is the quantized codeword and X_p is the side information (omitting the spatial coordinates (i, j)). The number of levels in the partition tree depends on the correlation between X_q · Δ and X_p given X.

the distribution of the syndromes.

6.7 Entropy coding of syndromes

6.7.1 Entropy coding of syndromes

The compression gain of H.264/AVC is partially due to an effective entropy coding algorithm, the Context-Adaptive Binary Arithmetic Coder (CABAC [73]). Its structure relies on an efficient symbol binarization and on accurate context modeling that suits the statistics of the input data well. At first, the syntax elements produced by the video coder are converted into variable-length binary strings, and for each binary digit the modeling block assigns a context that is associated with a binary probability mass function (p.m.f.). Then, both the binary digit and the associated p.m.f. are sent to a binary arithmetic coder that maps them into an interval through a Finite State Machine (FSM) and updates the binary context. Unfortunately, the CABAC coder was designed and optimized for compressing the quantized and transformed DFD. Some modifications were necessary in order to make it suitable for compressing syndromes.

Modeling syndrome distribution

Although each syndrome is actually represented by the least significant bits of a transform coefficient, its probability distribution may be quite different from that of a transform coefficient. In the literature, several works have proposed different probabilistic models for transform coefficients according to the characteristics of the adopted transform and its dimension. Most of the solutions that were adopted for video coding standards prior to H.264/AVC are based on Laplacian and generalized-Gaussian models (see [11, 57]). In [53], Kamaci et al. propose a better solution using a Cauchy probability distribution function to estimate the rate and distortion in a rate control algorithm, while [75] resorts to a Laplacian+impulsive distribution which proves to be a sufficiently-accurate low-cost approximation of the generalized-Gaussian distribution. After quantization, this model can be easily approximated by a symmetric geometric pmf or a symmetric piecewise geometric pmf, which we use to simplify the analysis of the syndrome distribution.

We divide the coded symbols S into two categories: (1) null (zero) coefficients (second case in Equation 6.3), i.e. S = 0, and (2) non-null coefficients, i.e. S = 2^n + Z (first case in Equation 6.3). We first analyze the distribution of these coded coefficients.

From Equation 6.3, the probability distribution of the symbol S can be approximated as

\[
p(S) \simeq K_S \, p_e^{2^{n-2}} \left(1 - p_e^{2^{n-2}}\right) \frac{\cosh\left((2^{n-1} - Z)\log p_r\right)}{\cosh\left(2^{n-1}\log p_r\right)},
\tag{6.5}
\]

where p_e and p_r are constants characterizing two different geometric distributions, and K_S is a normalizing constant (see Appendix A.1). Note that for p_r → 1, i.e. log(p_r) → 0, the term cosh((2^{n−1} − Z) log p_r)/cosh(2^{n−1} log p_r) is close to 1, and Equation (6.5) can be simplified as

\[
p(S) \simeq K_S \, p_e^{2^{n-2}} \left(1 - p_e^{2^{n-2}}\right).
\tag{6.6}
\]
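A numerical sketch of the model in eq. (6.5), with illustrative constants p_e and p_r and the normalization K_S omitted:

```python
# Sketch of the syndrome model of eq. (6.5); pe and pr are illustrative
# and K_S is omitted (it can be recovered by normalizing over the support).
from math import cosh, log

def p_model(s, pe, pr):
    """Unnormalized p(S) for a non-null syndrome S = 2^n + Z with n >= 2."""
    n = s.bit_length() - 1
    z = s - 2**n
    base = pe ** 2**(n - 2) * (1 - pe ** 2**(n - 2))
    return base * cosh((2**(n - 1) - z) * log(pr)) / cosh(2**(n - 1) * log(pr))

# For pr close to 1 the cosh ratio tends to 1 and eq. (6.6) is recovered.
s, pe = 21, 0.8                          # n = 4, Z = 5
approx = pe ** 4 * (1 - pe ** 4)         # eq. (6.6) without K_S
print(abs(p_model(s, pe, 1.0001) - approx) < 1e-3)  # -> True
```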

Experimental results prove that the model fits the syndrome statistics quite well (see Fig. 6.7). The fitting was made considering a different p_r for the n = 2 syndromes, since the pdf of transform coefficients for the 4 × 4 transform is fitted better by using two different values for p_r. A generalized-Gaussian distribution with exponent lower than 1 can be simplified using a Laplacian model with an additive peak component, which can be well represented by an impulsive term [75] or, more precisely, by another Laplacian component with a lower variance.

Figure 6.7: Comparison between the probability mass functions of syndromes (solid line) and the model in eq. (6.5) (dashed line). The results were computed from the sequence foreman with QP = 28. The x-axis reports the syndrome value while the y-axis reports its probability. Plots (a)–(f) refer to positions 0–5 in the scanning order of the 4 × 4 transform block.

The reported graphs show that the statistics of syndromes is much more irregular than the statistics of H.264/AVC coefficients, and since the whole distribution is less biased towards zero although


the probability of having a null syndrome is higher, an efficient coding of the syndrome information becomes harder. Fig. 6.8 reports the difference between the entropy of syndromes and the entropy of DFD coefficients for typical values of p_e and p_r. In addition, the occurrence of null syndromes must be considered.

Figure 6.8: Difference between the entropy of syndromes and the entropy of DFD for different p_e and p_r values.

Figure 6.9: Probability of a non-null syndrome/coefficient for the DSC coder (H(s) = 5.89) and H.264/AVC (H(s) = 5.10), from the sequence foreman (frame 1, QP = 30).

In H.264/AVC, the high percentage of null quantized coeffi-

cients (called zeros as in [39, 75]) in a transform block is efficiently exploited by a run-length coding algorithm. The quantized transform coefficients are scanned according to a zig-zag order, and then the number of "zeros" that lie between two non-null coefficients (called the run) is coded. In the structure of the CABAC coder, run-length coding is replaced by coding the position of non-zero coefficients: each position in the scanning order is associated with a binary context, which models the probability of having a null coefficient at that position. Experimental results show that, in DFD-based video coders, transform blocks have a low-pass characteristic, since the probability of non-null quantized coefficients is higher at low frequencies. On the contrary, the DSC syndromes show a more irregular distribution of null values. According to the results reported in Fig. 6.9, the probability of a null syndrome is more equally distributed over all the frequencies, and the low-pass characteristic is less evident. As a consequence, neither a zig-zag scan of syndromes followed by a run-length coding strategy nor coding the position of each single non-null syndrome proves to be efficient, since both distributions are less biased towards zero. Fig. 6.10 reports the results of the two coders (H.264-based PRISM and the original H.264) on different sequences. The graphs show a 2 dB loss that is due to an excessive waste of bit rate to code the DSC syndromes.

Quad-tree based entropy coding of syndromes

Experimental results show that, in a transform block, null syndromes occur in neighboring positions, while non-null syndromes appear to be more sparse. From this result, the entropy coding block can take advantage of the positions of null syndromes by adopting a quad-tree-based [55, 107] solution. Adopting a hierarchical quad-tree partitioning of the 4 × 4 syndromes into sub-blocks allows an efficient coding of the syndromes. The top-level variable CBP-bit indicates if there is any non-zero syndrome value in the 4 × 4 block. CBP-block then indicates which


Figure 6.10: Coding performance of the original CABAC algorithm on the H.264 coefficients and the DSC syndromes ((a) news, (b) mobile). The CABAC was used to code the DSC syndromes as-is. The input signal has QCIF format at 30 frame/s, GOP IPPP of 15 frames.

of the 4 sub-blocks contain non-zero syndrome values. Finally, each of these indicated sub-blocks has a variable CBP-subblock that indicates where and what the non-zero values are (see Figure 6.11 for an example). At this level, the quad-tree coder characterizes which syndromes are different from zero, which ones are coded using two bits (called d1s or d1-syndromes), and which ones are coded with a higher number of bits. These variables are then sent to the binary arithmetic coder. Note that their names recall the CBP structure that is present in the H.264/AVC coder and specifies which 8 × 8 blocks have non-zero coefficients. However, the CBP-like variables that were introduced push things further and pack more information with respect to the original CBP. At first, the coder signals whether there are non-zero syndromes in the block. In case some syndromes are not null, the 4 × 4 block is divided into four 2 × 2 sub-blocks (see Fig. 6.11), and the encoder generates the first quad-tree parameter CBP_block equal to

\[
\mathrm{CBP\_block} = c_0 + c_1 \cdot 2 + c_2 \cdot 4 + c_3 \cdot 8, \qquad
c_i = \begin{cases}
1 & \text{if there are non-zero syndromes in the } i\text{-th sub-block} \\
0 & \text{if all the syndromes of the } i\text{-th sub-block are null,}
\end{cases}
\quad i = 0, \dots, 3.
\tag{6.7}
\]


Figure 6.11: Example of quad-tree coding using CBP variables.

In the next step of hierarchical quad-tree coding, for each sub-block that contains some syndromes different from zero, the encoder computes the parameter

\[
\mathrm{CBP\_subblock} = c_0 + c_1 \cdot 6 + c_2 \cdot 36 + c_3 \cdot 216, \qquad
c_i = \begin{cases}
0 & \text{if } z_i = 0 \\
1 & \text{if } z_i \text{ is a d1 syndrome and } z_i \,\&\, 3 = 0 \\
2 & \text{if } z_i \text{ is a d1 syndrome and } z_i \,\&\, 3 = 1 \\
3 & \text{if } z_i \text{ is a d1 syndrome and } z_i \,\&\, 3 = 2 \\
4 & \text{if } z_i \text{ is a d1 syndrome and } z_i \,\&\, 3 = 3 \\
5 & \text{otherwise,}
\end{cases}
\quad i = 0, \dots, 3.
\tag{6.8}
\]
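Equations (6.7) and (6.8) can be sketched as follows; the interpretation of a d1 syndrome as one with n = 2 (i.e. S = 4 + Z, two coded bits) is an assumption based on the text:

```python
# Sketch of the hierarchical CBP computation (eqs. 6.7-6.8). Syndromes
# use the S = 2^n + Z convention; S = 0 is a null syndrome, and a "d1"
# syndrome is assumed to be one with n = 2, i.e. 4 <= S < 8.

def cbp_block(subblocks):
    """Eq. (6.7): one flag bit per 2x2 sub-block, packed in base 2."""
    return sum((1 if any(subblocks[i]) else 0) * 2**i for i in range(4))

def cbp_subblock(z):
    """Eq. (6.8): one base-6 digit per syndrome of a 2x2 sub-block."""
    def digit(s):
        if s == 0:
            return 0
        if 4 <= s < 8:            # d1 syndrome (assumption: n = 2)
            return 1 + (s & 3)
        return 5                  # longer syndromes, coded separately
    return sum(digit(z[i]) * 6**i for i in range(4))

# Example: a 4x4 block scanned as four 2x2 sub-blocks of syndromes.
subs = [[0, 0, 0, 0], [5, 0, 0, 0], [4, 4, 0, 0], [0, 9, 0, 0]]
print(cbp_block(subs))                   # -> 14 (binary 1110)
print([cbp_subblock(z) for z in subs])   # -> [0, 2, 7, 30]
```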

It is possible to compare the average bit rate needed to code the position of zeros and d1-syndromes using the traditional CABAC scheme with the one needed using the quad-tree scheme. The comparison is reported in Table 6.1 for different sequences, without applying the arithmetic coding; the syndromes were obtained by varying the quantization parameter in the range [15, 39]. It is possible to notice that the algorithm used by H.264 works well for low-motion sequences, but it proves highly inefficient whenever there are a lot of details and the number of coefficients increases.

Then, all of the CBP parameters are coded into a variable-length binary string using a Huffman coding table, and each bit is successively sent to the binary arithmetic coder. The remaining syndromes are coded separately, specifying the number of coded bit planes and their values for each syndrome. However, it is possible to notice from the experimental results that the number of non-d1 syndromes in a sub-block is very rarely bigger than one, and whenever there are more non-d1 syndromes, the number of coded bit planes is the same in most of the cases. Therefore, it is possible to specify the same number of coded bit planes for all the non-d1 syndromes in the sub-block, equal to the biggest one among all the non-d1 syndromes in the sub-block. In this way, we waste some bit planes whenever there are non-d1 syndromes with a different number of bit planes, but we are able to reduce the amount of information sent through the network, since the number of bit planes needs to be specified only once per sub-block.

Sequence     Quad-tree   Run-length of CABAC
'foreman'       9.92          10.18
'mobile'       14.34          15.33
'news'          5.53           5.25

Table 6.1: Comparison of the average bit rate (from the binarization unit) needed to code the position of zeros and ones in the H.264 coder and the CBP blocks for the DSC coder (frame 1, QP = 28).

6.7.2 Experimental results

6.7.3 Evaluation of compression gain with no quality equalization

The effectiveness of the designed entropy coder was evaluated by comparing the performance of DSC coding with that provided by the H.264/AVC coder using the same set of R-D optimization parameters.⁷ Different video sequences were coded using different quantization parameters

Figure 6.12: PSNR vs. bit rate for the first frame in the GOP (QP ∈ [15, 39]); (a) 'foreman' (training sequence), (b) 'news' (test sequence).

with a GOP IPPP of 15 frames at 30 frame/s under different test conditions. At first, we compared

the two entropy coders using a common reference for motion estimation, i.e. we coded the first P frame of each sequence with the same temporal reference. In this case, the reference block used by H.264/AVC to compute the DFD and the reference block used by the DSC coder are the same, and the performance differences depend on the coding of the residual information. The implemented DSC coder proves to be very effective, as it compares well with H.264/AVC, providing even higher quality for some sequences at medium bit rates (see Fig. 6.12). This coding performance is mainly due to the adoption of a quad-tree entropy coder, which proves to be more efficient than the traditional schemes based on run-length coding that are adopted in the

⁷Here both systems use only the 4 × 4 transform, with Lagrangian R-D optimization disabled and without cancelling unnecessary coefficients at high frequencies.


previous DFD-based coding architectures. For the sequence foreman we were able to obtain the same bit rate as H.264/AVC up to 40 dB of quality, while the sequence news was more efficiently coded at medium bit rates thanks to the little motion that characterizes this sequence, which increases the percentage of null syndromes in the DSC blocks. As explained in the previous section, the number of null syndromes per frame is higher than in H.264/AVC, and therefore the DSC coder avoids coding many blocks thanks to the hierarchical CBP structures.

Unfortunately, this efficiency decreases when coding a whole GOP of frames. In this case, the efficiency of motion compensation for the following frames is reduced, since the distortion introduced by the DSC coder in the sequence is higher than the one introduced by H.264. The DSC coder quantizes the transform coefficients of the original signal, while H.264/AVC quantizes the transform coefficients of the prediction error. Therefore, DSC-coded sequences are more affected by the distortion drift inherent to any prediction-based coding scheme, since the references used by the DSC coder in the motion compensation have a lower quality, which precludes an efficient prediction and increases the number of bits for each syndrome. Despite this performance decrease, Figure 6.13 shows that the proposed coder is still able to closely match the compression efficiency of H.264/AVC for the rest of the GOP as well, with a slight loss with respect to the common-reference case, especially at lower qualities (below 40 dB). In addition, we must remember that every rate-distortion optimization strategy and coefficient cancellation was disabled in the proposed scheme in order to evaluate the performance of the entropy coding itself. It is possible to improve the compression gain by optimizing the adopted quantization step, the coding mode, and the erasure of unnecessary syndromes. Experimental results show that, by enabling a random Intra refresh for the macroblocks in the sequence, the performance of the DSC coder gets closer to that of H.264/AVC.

6.7.4 Evaluation of compression gain with Intra refresh

The quality degradation that affects the DSC-coded sequences can be significantly mitigated by forcing a certain number of macroblocks to be coded in Intra mode. Figure 6.14 shows the coding results for the sequences foreman and news enabling a random Intra refresh of macroblocks in the sequence. Hybrid coders frequently resort to it when transmitting over an error-prone channel, since the partial refresh of the decoder state makes it possible to stop the propagation of distortion in case some information gets lost. In this case, random Intra refresh performs a sort of "quality equalization" between the reference frame buffer of the H.264/AVC decoder and the one used by the DSC decoder. The distortion of the reference frames used by the DSC decoder is closer to the distortion of the frames in the buffer of the H.264/AVC decoder, mitigating the effects of error propagation. As a result, the performance of the DSC coder gets closer to that of H.264, recovering part of the performance gap shown in Figure 6.12.


[Figure 6.13 shows six rate-distortion plots, PSNR (dB) vs. rate (kbit/s), comparing the proposed scheme with H.264/AVC: (a) 'foreman' QCIF (training sequence); (b) 'news' QCIF (training sequence); (c) 'salesman' QCIF (test sequence); (d) 'sean' QCIF (test sequence); (e) 'foreman' CIF (test sequence); (f) 'news' CIF (test sequence).]

Figure 6.13: PSNR vs. bit rate for a whole GOP (IPPP, QP ∈ [15, 39]).


[Figure 6.14 shows two rate-distortion plots, PSNR (dB) vs. rate (kbit/s), comparing the proposed scheme with H.264/AVC: (a) 'foreman' QCIF; (b) 'news' QCIF.]

Figure 6.14: PSNR vs. bit rate with Intra refresh enabled (11 macroblocks) (QP ∈ [15, 39]).

6.7.5 Evaluation of compression gain with rate control

Figure 6.12 shows that, for a given QP, the DSC-based coder achieves a lower quality at a reduced bit rate. This mismatch can be equalized by implementing a rate control algorithm that tunes the quantization parameter QP both at the macroblock and at the frame level in order to keep the coded bit rate close to a target value. We adopted a modified version of the algorithm proposed in Chapter 4, where the percentage of "zeros" is replaced with the percentage of null syndromes.

For the n-th frame, the algorithm allocates T_n bits, where T_n is computed as

    T_n = G / (K_{I,DSC} · n_I + n_{DSC}).    (6.9)

The parameter G represents the number of bits that are left for the current GOP, and n_t, where t = I or DSC, is the number of t-type frames in the GOP that still remain to be coded. As explained in Section 4.5.2, the ratio K_{I,DSC} characterizes the complexity relation between Intra frames (I) and DSC-coded frames (DSC), and it is equal to

    K_{I,DSC} = X_I / X_{DSC},   where  X_t = 2^{QP_t/6} R_t,  t = I, DSC.    (6.10)

These parameters are updated in the same way as their counterparts in the H.264/AVC coder (see Equation 4.39). The quantization parameter QP_t is the one used to quantize the last t-type frame, while R_t is the related number of bits. Experimental results show that also for syndrome coding there is a linear relation between the number of coded bits R and the percentage ρ of null syndromes; therefore, the target bit rate T_n can be related to a target percentage ρ_n through the equation

    ρ_n = (T_n − q) / m.    (6.11)

The target percentage ρ_n of null syndromes makes it possible to identify an average quantization step QP_n, which has to be corrected at the macroblock level in order to match the bandwidth constraints (see Section 4.5.3 for more details). The parameters m and q are estimated from previously-coded frames (see Equations 4.34 and 4.35). The same rate control

[Figure 6.15 shows two rate-distortion plots, PSNR (dB) vs. rate (kbit/s), comparing the proposed scheme with H.264/AVC: (a) 'sean' QCIF; (b) 'news' QCIF.]

Figure 6.15: PSNR vs. bit rate with rate control enabled (target bit rates {80, 96, 112, 128, 144, 160, 176, 192} kbit/s).

algorithm was adopted for both the proposed DSC coder and the H.264/AVC architecture (in this case DSC frames are replaced by P frames), changing only the complexity ratio K_{I,P}. In the DSC coder the complexity K_{I,P} is divided by a constant c = 1.4 in order to equalize the quality mismatch between DSC frames and P frames in H.264/AVC. The scaling of the complexity ratio reduces the number of bits allocated to Intra frames and increases the bit rate for DSC frames, improving their quality. The adopted choices proved to be effective: the plots in Fig. 6.15 show that the designed DSC-based architecture is able to improve on the performance of H.264/AVC (using the same rate control algorithm).
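The allocation rule of equations (6.9)–(6.11) can be sketched as follows (a simplified illustration under assumed example values for K_{I,DSC} and for the linear-model parameters m and q, not the thesis implementation):

```python
# Sketch of the frame-level bit allocation of eqs. (6.9)-(6.11).
# K_I_DSC and the linear R-rho model parameters (m, q) are assumed
# example values, not numbers from the thesis.

def allocate_bits(G, K_I_DSC, n_I, n_DSC):
    """Target bits T_n for the next frame, eq. (6.9)."""
    return G / (K_I_DSC * n_I + n_DSC)

def target_rho(T_n, m, q):
    """Target percentage of null syndromes from the model R = m*rho + q."""
    return (T_n - q) / m

# Example GOP state: 1 Intra + 14 DSC frames left, 150 kbit GOP budget.
G, K_I_DSC = 150_000, 4.0
T_n = allocate_bits(G, K_I_DSC, n_I=1, n_DSC=14)
# m is negative: the more null syndromes, the fewer bits are produced.
rho_n = target_rho(T_n, m=-20_000, q=22_000)
print(round(T_n), round(rho_n, 3))
```

The resulting ρ_n is then mapped to an average quantization step, which macroblock-level control refines, as described in the text.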

6.8 Summary

This chapter has presented the implementation of a DSC coder that reuses the building blocks of the H.264/AVC coder. The enhanced features of H.264/AVC allow improving the coding performance of the DSC coder itself but, at the same time, pose new challenges in terms of entropy coding. The adoption of a DCT transform on smaller blocks increases the variance of the coefficients and makes the coding of syndromes harder, since the usual low-pass structure of each block to be coded is altered. A hierarchical quad-tree approach copes with this problem efficiently, since it allows obtaining good compression results with respect to H.264/AVC without the use of sophisticated channel codes. The coding performance is also improved by the adoption of rate control strategies and random Intra refresh of macroblocks. Such a compression-centric DSC-based encoder is an important building block for a robust version of DSC-based video coders.


Chapter 7

Conclusions

Previous chapters have presented the basic needs of multimedia communications over wireless networks. They concern mainly the compression gain, the computational requirements, and the robustness to errors and losses. In this thesis, we presented some possible improvements that lead to a more efficient implementation of video communication applications in terms of both compression gain and robustness. In this chapter, we summarize the key results of this work and discuss some future research directions.

7.1 Summary

A recent innovation in the communication world is the massive introduction of multimedia services over wireless networks, mainly inspired by the aim of providing video and audio applications almost anywhere and anytime. More and more Internet and mobile communication providers offer a wide variety of multimedia-related services that span from video communication to the fruition of video-on-demand contents on mobile devices. This accomplishment was possible thanks to the recent development of mobile communication and the technological advances in the digital coding of multimedia data. However, the appearance of heterogeneous network scenarios, characterized by the interconnection of different types of networks and devices, and the massive spread of mobile communications, affected by a higher percentage of losses and errors with respect to traditional wired communications, have modified the needs and the guidelines followed in the design of compression algorithms. As a matter of fact, the capability of providing reliable video is the most relevant issue in the spread and diffusion of multimedia mobile services, and the recent literature reports a wide number of different proposals that try to cope with the problems of transmitting a video sequence across a network affected by losses.

In this thesis, three main issues, which characterize the choices and the design of new coding schemes, are addressed.

The first issue is the compression gain, which has to grant both the respect of bandwidth constraints and a high visual quality in the reconstructed sequence at the decoder. This problem can be addressed in two ways: designing efficient entropy coding schemes and implementing efficient rate optimization algorithms. The H.264/AVC standard has proven to achieve the highest compression gain of the last ten years among hybrid video coders, and since its definition was oriented towards wireless applications, it has been adopted in this research as the basic coding architecture.

The coding performance of the H.264/AVC standard is the result of an efficient orchestration of different coding techniques that span from predictive coding to arithmetic entropy coding. However, the compression performance of the standard can be mainly ascribed to some of them, such as the enhanced macroblock partitioning in the motion compensation, the adoption of an efficient spatial prediction, the introduction of an adaptive deblocking filter in the prediction loop, and finally, the implementation of an efficient adaptive arithmetic coding engine, called Context Adaptive Binary Arithmetic Coder (CABAC). Syntax elements are converted into binary strings, and a context is assigned to each binary digit. The pairs (symbol, context) are then processed by a binary arithmetic coder, which codes the binary digit according to the probability model identified by the context and updates the statistics. Our results have shown that it is possible to improve this estimate by modifying the original probability model. In the original CABAC coder, the residual information is coded by mapping first the positions of the non-null quantized DCT coefficients and coding their values in a second step. The context modeller assigns to each non-zero coefficient a context according to its order in the zig-zag scanning. This context modeling does not take into consideration the position of the coefficient in the block and its neighbors. Experimental results show that there is a statistical dependence between DCT coefficients both at neighboring positions within the same block and at the same positions in neighboring blocks. Therefore, it is possible to take advantage of this dependence to improve the estimate by associating contexts to conditional probabilities in place of absolute probabilities. The absolute values of the DCT coefficients are sliced into bit planes, and for each bit plane a Directed Acyclic Graph is adopted to represent the statistical relations among neighboring bits. Each edge in the model is associated with a conditional probability between adjacent bits, and it is used to propagate binary probabilities through the graph. The CABAC, endowed with this new probability estimate, produces a bit stream about 10% smaller than with its original definition.
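The benefit of conditioning can be checked on synthetic data (a toy illustration of the principle, not the CABAC/DAG implementation): for correlated neighboring bits, the conditional entropy H(X|Y) is lower than the absolute entropy H(X), and that margin is what conditional-probability contexts exploit.

```python
# Toy check: conditioning a bit on a correlated neighbor lowers entropy,
# which is why conditional-probability contexts beat absolute ones.
from math import log2

def H(p):
    """Binary entropy in bits of a Bernoulli(p) variable."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Assumed joint distribution of (neighbor Y, current bit X): correlated.
p_joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

p_x1 = p_joint[(0, 1)] + p_joint[(1, 1)]           # P(X = 1)
H_x = H(p_x1)                                       # absolute model
# Conditional entropy H(X|Y) = sum_y P(Y=y) * H(P(X=1 | Y=y))
H_x_given_y = 0.0
for y in (0, 1):
    p_y = p_joint[(y, 0)] + p_joint[(y, 1)]
    H_x_given_y += p_y * H(p_joint[(y, 1)] / p_y)

print(round(H_x, 3), round(H_x_given_y, 3))
```

Here the absolute model needs a full bit per symbol, while the conditional model needs roughly half, mirroring the rate saving obtained by conditioning contexts on neighboring coefficients.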

Compression gain also concerns the design of efficient rate optimization and control algorithms. In the literature, different rate control strategies have been published which are able to keep the produced bit rate within the bandwidth constraints while allowing a good perceptual quality in the reconstructed sequence. Computational complexity introduces a further differentiating criterion among rate control algorithms. A good control of the bit stream size and a high perceptual quality at the decoder can be obtained with a highly computationally-expensive rate control algorithm. On the other hand, the adoption of computationally lighter solutions is paid with a reduced coding performance in terms of both low visual quality and coarse accuracy in the respect of bandwidth constraints. The investigation of Chapter 4 is mainly focused on finding an efficient trade-off between the two opposite solutions. Since the efficiency of each rate control algorithm is deeply affected by the adopted rate model, an efficient solution was found by He et al. in [38]. The proposed model states the linear relation that exists between the produced bit rate and the percentage ρ of null quantized DCT coefficients. However, it is necessary to map the estimated ρ values to a quantization step. This thesis introduces a rate model in the joint domain (ρ, E_q), which permits an accurate low-cost estimate of the target quantization step by relating the percentage ρ to the energy of the quantized signal E_q. This model is then adopted in a low-cost rate control algorithm implementing a proportional control together with an efficient frame skipping scheme. Less significant frames, like B frames, are skipped whenever the transmission buffer is close to an overflow, saving their bits for the improvement of the visual quality of the following frames. Experimental results show that the proposed technique compares well with other techniques proposed by the same JVT committee.

Although the adoption of efficient entropy coding algorithms and rate control strategies allows the receiver to experience a better visual quality at a given bandwidth, these coding efforts can turn out to be completely useless in case the channel is affected by errors and losses. Different solutions have been proposed to cope with this problem, and their efficiency often depends on the target application for which they are conceived. An efficient solution consists in introducing some redundant information in the coded packet stream. A recent approach includes the RTP packets produced by the video source coder into a matrix columnwise and applies a cross-packet FEC code along the rows. Redundant data are then packetized and sent to the receiver, which can reconstruct the lost information in case enough redundant packets are received. This approach proves to be very effective whenever the matrix size is well tuned to the packet lengths and the information they carry. Experimental results show that the performance of this scheme can be significantly improved whenever the matrix size is chosen according to the packet lengths and the percentage of null quantized DCT coefficients. Chapter 5 proposes a novel joint source-channel rate control algorithm based on the percentage of zeros, which adapts the protection level to the characteristics of each frame. The performance of the algorithm proves to be significantly better than that of its non-adaptive counterpart.
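The cross-packet arrangement can be sketched as follows (a toy version using a single XOR parity packet as a stand-in for the Reed-Solomon code actually used, so only one lost packet is recoverable here):

```python
# Toy cross-packet FEC: source packets fill the matrix columns; one
# redundancy packet is computed across rows (XOR stands in for the
# Reed-Solomon code used in the actual scheme).

def pad(pkt, length):
    """Zero-pad a packet to the matrix column height."""
    return pkt + bytes(length - len(pkt))

def xor_parity(packets):
    """One redundancy packet covering all given packets."""
    length = max(len(p) for p in packets)
    parity = bytearray(length)
    for p in packets:
        for i, b in enumerate(pad(p, length)):
            parity[i] ^= b
    return bytes(parity)

packets = [b"frame-slice-A", b"frame-slice-BB", b"frame-slice-CCC"]
parity = xor_parity(packets)

# Simulate the loss of packet 1 and recover it from the others + parity:
# XORing the surviving packets with the parity cancels them out.
received = [packets[0], None, packets[2]]
recovered = xor_parity([p for p in received if p is not None] + [parity])
print(recovered.rstrip(b"\x00") == packets[1])
```

Tuning the matrix size to the packet lengths, as the text describes, amounts to choosing how many packets share one set of redundancy rows.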

A possible alternative to the introduction of redundant data is adopting a robust source coding scheme based on Distributed Source Coding (DSC) principles. In the literature, several approaches have been proposed during the last years, mainly focused on complex channel coding/decoding schemes. A simpler solution was proposed in 2002 by Puri and Ramchandran [93], who designed a Distributed Video Coding scheme that compared with the efficiency of traditional hybrid techniques but was able to produce a robust bit stream. The PRISM coder processes the signal in the transform domain and assigns to each transform block a sort of "signature" that identifies the most significant bits of the transform coefficients. The remaining least significant bits are Intra coded, and both signature and Intra-coded bits are sent to the receiver. The decoder searches for a block in its frame buffer with the same signature and, through the Intra-coded bits, reconstructs the coded transform block. Note that both the requirements of robustness and of reduced computational complexity at the encoder are satisfied. The motion estimation is performed at the decoder, shifting the computational complexity to the network side. Moreover, decoding is possible independently of the adopted reference block, allowing a correct reconstruction of the coded sequence even when the reference buffer at the decoder is different from the one at the encoder. Therefore, the investigation of effective algorithms that allow a high compression gain is an interesting research topic. Chapter 6 presents new results obtained investigating efficient entropy coding algorithms for the PRISM syndromes. The investigation has shown that PRISM syndromes present rather peculiar statistics that make the coding solutions adopted for DSC coefficients ineffective and require ad-hoc coding schemes. In our work we proposed an original model for the syndrome statistics that proved to be quite accurate when matched with experimental results. Moreover, the investigation has led to the design of a novel arithmetic coder scheme based on quad-tree coding that improves the performance of the PRISM coder. Adopting the same prediction mechanism as H.264/AVC (based on motion vectors), it is possible to obtain coding results comparable to those of H.264/AVC with the same computational effort.
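The syndrome mechanism can be illustrated with a scalar toy example (a standard coset-coding illustration, not the actual PRISM codec): the encoder sends only the n least significant bits of a quantized coefficient, and the decoder recovers it by picking, among all values sharing those bits, the one closest to its side information.

```python
# Scalar coset-coding toy: send only the n LSBs of X_q; the decoder
# picks the candidate sharing those LSBs that is closest to its side
# information. Works when |X_q - side| < 2^(n-1).

def encode(Xq, n):
    return Xq & ((1 << n) - 1)           # syndrome: n least significant bits

def decode(syndrome, n, side_info):
    step = 1 << n
    # Largest value <= side_info that is congruent to `syndrome` mod 2^n,
    # then choose the nearer of the two surrounding candidates.
    base = (side_info - syndrome) // step * step + syndrome
    candidates = (base, base + step)
    return min(candidates, key=lambda c: abs(c - side_info))

Xq = 173            # quantized coefficient at the encoder
side = 169          # decoder's (noisy) predictor of Xq
n = 4               # sufficient since |Xq - side| = 4 < 2^(n-1) = 8
syn = encode(Xq, n)
print(syn, decode(syn, n, side))
```

This mirrors the bit-count rule n = 2 + ⌊log₂|E|⌋ derived in Appendix A: the larger the mismatch E between coefficient and side information, the more syndrome bits must be sent.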

7.2 Future Research

Several important areas of research stem from the developments discussed in this thesis. As for the DAG-based arithmetic coder, it is possible to improve the proposed scheme working in two directions, i.e., the reduction of the computational complexity and the extension of the DAG model to other syntax elements. The computational complexity can be reduced by grouping the DAGs into different clusters. On the other hand, the spatial correlation allows the DAG modeller to characterize the probability of other syntax elements, like motion vectors.

Another important investigation field regards the improvement of the matrix-based FEC scheme of Chapter 5. So far only Reed-Solomon codes have been considered, while the literature has proposed more efficient schemes. An interesting issue is raised by the adoption of Turbo codes. Redundant packets can be computed both along the columns and along the rows, creating two different sets of redundancy bytes. It is possible to shape the additional redundancy including packets from both codes, implementing a sort of Turbo code at the RTP level. Since Turbo codes can significantly increase the recovery performance, the investigation of this possibility is an interesting research field.

Finally, the novelty of DSC schemes opens a wide variety of research topics. Among the most important ones, two issues turn out to be determinant for the performance of these schemes. The first is achieving a high compression gain, which can be obtained through effective entropy coding solutions and rate-distortion optimization algorithms. On the other hand, the final performance is deeply affected by the classification algorithm. Therefore, the design of new strategies that are able to characterize the coded information in a transmission environment affected by errors is a stimulating research field that offers many possible solutions to investigate.


Appendix A

Relation between E_q and ρ

In the rate control algorithm, the energy of the quantized signal is approximated using the parameter

    Ê_q = act̄ / Δ,    (A.1)

which can be expressed as follows:

    act̄ / Δ = [ Σ_{m=0}^{N_MB−1} Σ_{x,y=0}^{15} |err_m(x,y)| ] / (N_MB · Δ).    (A.2)

For a large N_MB, we can relate the average activity to the average energy

    E_q = (1/N_MB) Σ_{m=0}^{N_MB−1} Σ_{x,y=0}^{15} |err_m(x,y)|²    (A.3)

via the relation

    (1/N) Σ_{n=0}^{N−1} |x_n| = ξ_x √( (1/N) Σ_{n=0}^{N−1} |x_n|² ),    (A.4)

where ξ_x = E[|x|] / √(E[|x|²]) is a shape factor depending on the p.d.f. of the zero-mean random variable x (e.g. ξ_x = √(2/π) for a Gaussian variable, ξ_x = 1/√2 for a Laplacian variable).

This leads to the approximation

    Ê_q = act̄ / Δ
        = Σ_{m} Σ_{x,y} |err_m(x,y)| / (N_MB · Δ)
        = (ξ_x/Δ) √( Σ_{m} Σ_{x,y} |err_m(x,y)|² / N_MB )
        = (ξ_x/Δ) √( Σ_{m} Σ_{x,y} |Err_m(x,y)|² / N_MB )
        = ξ_x √( Σ_{m} Σ_{x,y} (Err_m(x,y)/Δ)² / N_MB )
        ≃ ξ_x √( E[(Err_m(x,y)/Δ)²] )
        = ξ_x √(E_q),    (A.5)

where the activity act_m is computed as expressed in equation (4.7) and Err(x,y), x,y = 0,…,15, is the signal err(x,y) after the transformation. In fact, the average activity value act̄ computed on the original residual signal (eq. A.1) is linearly proportional to the square root of the energy of its transformed version.

According to the probability density function reported in eq. (4.10), the percentage of quantized DCT coefficients different from zero is equal to

    θ = 1 − ρ = 2 ∫_Δ^{+∞} p_x(a) da = e^{−(2/γ′)Δ} / (1 + α′).    (A.6)

The energy of the quantized signal is

    E_q = 2 ∫_Δ^{+∞} [Q(a)]² p_x(a) da,    (A.7)

where Q(a) is the quantization index of the coefficient a, and it depends on 1 − ρ as in the following equation:

    E_q = ∫_{−∞}^{+∞} [Q(a)]² p_x(a) da
        = 2 Σ_{i=1}^{+∞} i² ∫_{Δ·i}^{Δ·(i+1)} p_x(a) da
        = 2 Σ_{i=1}^{+∞} i² ∫_{Δi}^{Δ(i+1)} [ 2 e^{−(2/γ′)a} / ((1 + α′) γ′) ] da
        = 2 Σ_{i=1}^{+∞} i² e^{−(2/γ′)Δi} · (1 − e^{−(2/γ′)Δ}) / (1 + α′).    (A.8)

Let

    ς = 1 − e^{−(2/γ′)Δ};    (A.9)

then the series in equation (A.8) converges to the value

    Σ_{i=1}^{+∞} i² e^{−(2/γ′)Δi} (1 − e^{−(2/γ′)Δ}) = Σ_{i=1}^{+∞} i² (1 − ς)^i ς = (ς² − 3ς + 2) / ς².    (A.10)

The energy of the quantized signal can be expressed as

    E_q = 2 (ς² − 3ς + 2) / ((1 + α′) ς²) = 2 (1 − ς)(2 − ς) / ((1 + α′) ς²),    (A.11)

and therefore the approximation of equation (A.1) can be written as

    Ê_q(ς) = ξ_x √(E_q(ς)) = ξ_x √( 2 (ς² − 3ς + 2) / ((1 + α′) ς²) ).    (A.12)

The first derivative of equation (A.12) shows that √(E_q(ς)) has only one stationary point ς = 2/3 in the range [0, 1]. In addition, the second derivative shows that the function is convex over the whole interval [0, 1]. Therefore, the Taylor expansion of the function Ê_q can be well approximated in [0, 1] by a second-degree polynomial. The experimental results reported in Fig. 4.4 show that this approximation is sufficiently accurate. Therefore, taking into account that eq. (A.6) implies

    ς = 1 − e^{−(2/γ′)Δ} = 1 − (1 + α′)θ,    (A.13)

we can write

    Ê_q = ξ_x √(E_q(ς)) ≃ a_0 + a_1 ς + a_2 ς²,    (A.14)

which is equivalent to

    Ê_q = a_0 + a_1 (1 − (1 + α′)θ) + a_2 (1 − (1 + α′)θ)²
        = c_0 + c_1 θ + c_2 θ²,    (A.15)

with

    c_0 = a_0 + a_1 + a_2,
    c_1 = (−a_1 − 2a_2)(1 + α′),
    c_2 = a_2 (1 + α′)².    (A.16)

Since θ is the complement of the percentage ρ, equation (A.15) can be written as in (A.2).
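The coefficient mapping in (A.15)–(A.16) can be checked numerically (a sanity check with arbitrary example values for a_0, a_1, a_2, and α′):

```python
# Check eqs. (A.15)-(A.16): substituting s = 1 - (1+alpha)*theta into
# a0 + a1*s + a2*s^2 yields c0 + c1*theta + c2*theta^2 with the stated c's.

a0, a1, a2, alpha = 0.7, -1.3, 2.1, 0.25   # arbitrary example values

c0 = a0 + a1 + a2
c1 = (-a1 - 2 * a2) * (1 + alpha)
c2 = a2 * (1 + alpha) ** 2

for theta in (0.0, 0.1, 0.37, 0.9):
    s = 1 - (1 + alpha) * theta
    lhs = a0 + a1 * s + a2 * s * s          # quadratic in s, eq. (A.14)
    rhs = c0 + c1 * theta + c2 * theta * theta   # quadratic in theta, (A.15)
    assert abs(lhs - rhs) < 1e-12
print("eq. (A.16) coefficients verified")
```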

Page 148: Source and Joint Source-Channel Coding for Video ... · Coding for Video Transmission over Lossy Networks Coordinatore: Ch.mo Prof. Silvano Pupolin ... l’introduzione di applicazioni

126 Appendix A. Relation betweenEq andρ

A.1 Derivation of probability distribution for syndromes

The probability distribution for non-zero syndromes can be approximated as follows. According to the first case of equation (6.3), the number of bits that must be included in the syndrome is

    n = 2 + ⌊log₂( |X_q·Δ − X_p| / Δ )⌋
      = 2 + ⌊log₂( |X_q − X_p/Δ| )⌋
      ≃ 2 + ⌊log₂( |X_q − X_{p,q}| )⌋
      = 2 + ⌊log₂ |E|⌋,    (A.17)

where X_p is the side information (reference block), X_{p,q} is the quantized version of X_p, and E = X_q − X_{p,q}. Assuming that both X_q and the difference E can be approximated by independent symmetric geometric variables, the probability mass functions of X_q and E can be respectively expressed as

    p_r(X_q) = [(1 − p_r)/(1 + p_r)] p_r^{|X_q − M|},    p_e(E) = [(1 − p_e)/(1 + p_e)] p_e^{|E|},    (A.18)

where we assume that the coefficients X_q are shifted in such a way that, omitting the tails of the distribution, they can be included in the set [0, 2M]. The parameters p_r and p_e completely characterize the two probability distributions. In our implementation, they are estimated from the experimental data using log-linear fitting.
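The two-sided geometric model in (A.18) can be sanity-checked numerically: the factor (1 − p)/(1 + p) normalizes the p.m.f. over all integers (the check below ignores the truncated tails, as the text does):

```python
# Check that p(k) = (1-p)/(1+p) * p^{|k|}, k in Z, sums to one -- the
# normalization used for both p_r(X_q) and p_e(E) in eq. (A.18).

def two_sided_geometric_mass(p, K=2000):
    """Total probability mass over k in [-K, K] (tails are negligible)."""
    c = (1 - p) / (1 + p)
    return sum(c * p ** abs(k) for k in range(-K, K + 1))

for p in (0.5, 0.8, 0.95):
    assert abs(two_sided_geometric_mass(p) - 1.0) < 1e-9
print("eq. (A.18) normalization verified")
```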

Note that the coefficients are shifted in order to be always positive, that is to say that, after the transform operation, we add M = 2^15 to each coefficient. This value was computed considering that the amplification of the 4×4 transform, which is equal to 36 in the worst case, can be represented with 6 bits, and the residual signal can be represented with 1+8 bits. In the following analysis, we will omit the tails of the p.m.f.s, since they have a small influence on the final probability value and the effective transform coefficients are included in the range [−M, M]. Therefore, the previous p.m.f.s in Equation (A.18) are centered around M.

Let the syndrome Z be coded with n bits (i.e. 2^{n−2} ≤ |E| = |X_q − X_{p,q}| < 2^{n−1}). In the following, the pair (E, n) will also be referenced with the symbol S = Z + 2^n. We can write the joint probability of the syndrome Z and the number of bits n as

    p(Z, n) = Σ_{X_q=0}^{2M−1} Σ_{X_{p,q}=0}^{2M−1} p_r(X_q) p_e(X_q − X_{p,q}) · 1(2^{n−2} ≤ |X_q − X_{p,q}| < 2^{n−1}) · 1(Z = X_q & (2^n − 1)),    (A.19)

where 1(·) is the indicator function.

Let k_T = M/2^n (k_T is an integer since M is a power of 2); then the sum can be written as

    p(S) = Σ_{k=0}^{k_T−1} Σ_{E=2^{n−2}}^{2^{n−1}−1} p_r(k·2^n + Z) p_e(E) · [1(k·2^n + Z ≥ E) + 1]
         + Σ_{k=k_T}^{2k_T−1} Σ_{E=2^{n−2}}^{2^{n−1}−1} p_r(k·2^n + Z) p_e(E) · [1(k·2^n + Z < 2M − E) + 1]
         ≃ Σ_{k=0}^{k_T−1} Σ_{E=2^{n−2}}^{2^{n−1}−1} 2 p_r(k·2^n + Z) p_e(E) + Σ_{k=k_T}^{2k_T−1} Σ_{E=2^{n−2}}^{2^{n−1}−1} 2 p_r(k·2^n + Z) p_e(E),    (A.20)

since typically log₂ M ≫ n, thus p(1(k·2^n + Z ≥ E) = 1) ≃ 1 and p(1(k·2^n + Z < 2M − E) = 1) ≃ 1.

This can then be rewritten as

    p(S) = [(1 − p_r)/(1 + p_r)] [(1 − p_e)/(1 + p_e)] · { Σ_{k=0}^{k_T−1} p_r^{k_T·2^n} p_r^{−k·2^n} p_r^{−Z} Σ_{E=2^{n−2}}^{2^{n−1}−1} 2 p_e^{E}
         + Σ_{k=k_T}^{2k_T−1} p_r^{k·2^n} p_r^{−k_T·2^n} p_r^{Z} Σ_{E=2^{n−2}}^{2^{n−1}−1} 2 p_e^{E} },    (A.21)

which can be further simplified into

    p(S) = [(1 − p_r)/(1 + p_r)] · [ p_e^{2^{n−2}} (1 − p_e^{2^{n−2}}) / (1 + p_e) ] · { [(p_r^M − 1)/(1 − p_r^{−2^n})] p_r^{−Z} + [(1 − p_r^M)/(1 − p_r^{2^n})] p_r^{Z} }
         ≃ K_S · p_e^{2^{n−2}} · (1 − p_e^{2^{n−2}}) · cosh((2^{n−1} − Z)·log(p_r)) / cosh(2^{n−1}·log(p_r)),    (A.22)

where K_S is a normalizing constant. Note that for p_r → 1, i.e. log(p_r) → 0, the term cosh((2^{n−1} − Z) log(p_r)) / cosh(2^{n−1} log(p_r)) is close to 1, and Equation (A.22) can be simplified as

    p(S) ≃ K_S p_e^{2^{n−2}} (1 − p_e^{2^{n−2}}).    (A.23)
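The step from (A.22) to (A.23) can be checked numerically: as p_r → 1 the cosh ratio approaches 1 for every syndrome value Z (a quick sanity check, not part of the thesis):

```python
# Check that cosh((2^{n-1} - Z) log p_r) / cosh(2^{n-1} log p_r) -> 1
# as p_r -> 1, which justifies the step from eq. (A.22) to eq. (A.23).
from math import cosh, log

def ratio(p_r, n, Z):
    h = 2 ** (n - 1)
    return cosh((h - Z) * log(p_r)) / cosh(h * log(p_r))

n = 6
for p_r in (0.99, 0.999, 0.9999):
    # Worst-case deviation from 1 over all syndromes coded with n bits.
    worst = max(abs(ratio(p_r, n, Z) - 1.0) for Z in range(2 ** n))
    print(p_r, worst)
```

The printed deviations shrink as p_r approaches 1, confirming that the Z-dependence of (A.22) can be dropped in that regime.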

