source and channel coding for audiovisual communication...

Thesis for the degree of Doctor of Philosophy

Source and Channel Coding

for Audiovisual Communication Systems

Moo Young Kim

Sound and Image Processing LaboratoryDepartment of Signals, Sensors, and Systems

Kungliga Tekniska Hogskolan

Stockholm 2004

Kim, Moo YoungSource and Channel Coding for Audiovisual Communication Systems

Copyright c©2004 Moo Young Kim except whereotherwise stated. All rights reserved.

ISBN 91-628-6198-0TRITA-S3-SIP 2004:3ISSN 1652-4500ISRN KTH/S3/SIP-04:03-SE

Sound and Image Processing LaboratoryDepartment of Signals, Sensors, and SystemsKungliga Tekniska HogskolanSE-100 44 Stockholm, SwedenTelephone + 46 (0)8-790 7790

To my family

AbstractTopics in source and channel coding for audiovisual communication systemsare studied. The goal of source coding is to represent a source with thelowest possible rate to achieve a particular distortion, or with the lowestpossible distortion at a given rate. Channel coding adds redundancy toquantized source information to recover channel errors. This thesis consistsof four topics.

Firstly, based on high-rate theory, we propose Karhunen-Loeve trans-form (KLT)-based classified vector quantization (VQ) to efficiently utilizeoptimal VQ advantages over scalar quantization (SQ). Compared with code-excited linear predictive (CELP) speech coding, KLT-based classified VQprovides not only a higher SNR and perceptual quality, but also lower com-putational complexity. Further improvement is obtained by companding.

Secondly, we compare various transmitter-based packet-loss recoverytechniques from a rate-distortion viewpoint for real-time audiovisual com-munication systems over the Internet. We conclude that, in most circum-stances, multiple description coding (MDC) is the best packet-loss recov-ery technique. If channel conditions are informed, channel-optimized MDCyields better performance.

Compared with resolution-constrained quantization (RCQ), entropy-constrained quantization (ECQ) produces a smaller number of distortionoutliers but is more sensitive to channel errors. We apply a generalizedr-th power distortion measure to design a new RCQ algorithm that hasless distortion outliers and is more robust against source mismatch thanconventional RCQ methods.

Finally, designing quantizers to effectively remove irrelevancy as well asredundancy is considered. Taking into account the just noticeable differ-ence (JND) of human perception, we design a new RCQ method that hasimproved performance in terms of mean distortion and distortion outliers.Based on high-rate theory, optimal centroid density and its correspondingmean distortion are also accurately predicted.

The latter two quantization methods can be combined with practicalsource coding systems such as KLT-based classified VQ and with jointsource-channel coding paradigms such as MDC.Keywords: Information theory, high-rate theory, source coding, channelcoding, quantization, multimedia communication, error correction, multipledescription, just noticeable difference, perceptual quality, source mismatch.

iii

List of Papers

The thesis is based on the following papers:

[A] Moo Young Kim and W. Bastiaan Kleijn, “KLT-based ClassifiedVQ for the Speech Signal,” in Proc. IEEE Int. Conf. Acoustics,in Speech, Signal Processing (ICASSP), 2002, pp. 645-648.

[B] Moo Young Kim and W. Bastiaan Kleijn, “KLT-Based ClassifiedVQ with Companding,” in Proc. IEEE Workshop on SpeechCoding, 2002, pp. 8-10.

[C] Moo Young Kim and W. Bastiaan Kleijn, “Rate-DistortionComparisons between FEC and MDC based on Gilbert ChannelModel,” in Proc. IEEE Int. Conf. on Networks (ICON), 2003,pp. 495 - 500.

[D] Moo Young Kim and W. Bastiaan Kleijn, “Comparison ofTransmitter-Based Packet-Loss Recovery Techniques for VoiceTransmission,” in Proc. INTERSPEECH2004-ICSLP, 2004, pp.641-644.

[E] Moo Young Kim and W. Bastiaan Kleijn, “KLT-based Adap-tive Classified VQ of the Speech Signal,” IEEE Transactions onSpeech and Audio Processing, vol. 12, no. 5, pp. 277-289, May2004.

[F] Moo Young Kim and W. Bastiaan Kleijn, “Comparative Rate-Distortion Performance of Multiple Description Coding for Real-Time Audiovisual Communication over the Internet,” submittedto IEEE Transactions on Communications, 2004.

[G] Moo Young Kim and W. Bastiaan Kleijn, “Effect of Cell-SizeVariation of Resolution-Constrained Quantization,” to be sub-mitted to IEEE Transactions on Speech and Audio Processing.

[H] Moo Young Kim and W. Bastiaan Kleijn, “Resolution-Constrained Quantization with Perceptual Distortion Measure,”to be submitted to IEEE Signal Processing Letters.

v

Contents

Abstract iii

List of Papers v

Contents vii

Acknowledgements ix

Acronyms xi

I Introduction 11 Source Coding and Distortion Measures . . . . . . . . . . . 32 High-Rate Theory . . . . . . . . . . . . . . . . . . . . . . . 5

2.1 Quantization with Various Distortion Measures . . . 82.2 Advantages of Vector Quantization over Scalar Quan-

tization . . . . . . . . . . . . . . . . . . . . . . . . . 92.3 Distortion Distribution . . . . . . . . . . . . . . . . . 112.4 Robust Quantization against Source Mismatch . . . 112.5 Comparison between Resolution-Constrained and

Entropy-Constrained Quantization . . . . . . . . . . 132.6 Comparison between R-D Theory and High-Rate

Theory . . . . . . . . . . . . . . . . . . . . . . . . . 143 Quantizer Design . . . . . . . . . . . . . . . . . . . . . . . . 15

3.1 Generalized Lloyd Algorithm . . . . . . . . . . . . . 153.2 Entropy-Constrained Vector Quantization . . . . . . 173.3 Lattice Quantization . . . . . . . . . . . . . . . . . . 173.4 Random Coding based on High-Rate Theory . . . . 18

4 Channel Coding . . . . . . . . . . . . . . . . . . . . . . . . . 194.1 Packet-Loss Recovery Techniques . . . . . . . . . . . 204.2 Forward Error Correction (FEC) . . . . . . . . . . . 224.3 Multiple Description Coding (MDC) . . . . . . . . . 25

vii

4.4 Rate-Distortion Optimized Packet Loss RecoveryTechniques . . . . . . . . . . . . . . . . . . . . . . . 30

5 Applications: Audiovisual Coding . . . . . . . . . . . . . . . 315.1 Speech and Audio Coding Standards . . . . . . . . . 315.2 Image and Video Coding Standards . . . . . . . . . 32

6 Summary of Contributions . . . . . . . . . . . . . . . . . . . 33References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

viii

Acknowledgements

The work presented in this thesis has been performed at the Sound andImage Processing (SIP) Lab. of the Department of Signals, Sensors andSystems (S3) of KTH (Royal Institute of Technology) from January 2001 toNovember 2004. During this period, I have been indebted to many peoplefor their help. It is my great pleasure to thank all the people who havesupported me during my study.

First of all, I am very grateful to my supervisor, Professor W. BastiaanKleijn, for his excellent academic guidance and endless encouragement tomy research. I am impressed by his broad and deep knowledge and hisenthusiasm to share his creativity. I also appreciate his patient proof readingof my works and for his trust in my independent research.

I would like to thank my good friends at SIP for sharing a nice at-mosphere that has always filled me with energy. I have enjoyed workingtogether with Volodya Grancharov, Barbara Resch, Anders Ekman, andKahye Song. Jan Plasberg deserves my gratitude especially for his sense ofhumor and kindness. I would like to thank Harald Pobloth and David Zhaowho have helped everyone in SIP from computer problems to importantsoftware tips. I would like to express my gratitude to Dr. Jonas Lindblom,Professor Arne Leijon, Elisabet Molin, and Martin Dahlquist for valuablediscussions and interaction. Dr. Jonas Samuelsson is a rare man who cando works quickly with high quality. Renat Vafin and Yusuke Hiwasaki arespecially thanked for staying in the office together until midnight and forsharing valuable ideas for both professional and private life. I am very muchgrateful to my previous office mate Mattias Nilsson especially for his hospi-tality during my first year of study and life in Stockholm. Particular thanksgo to my office mate Sriram Srinivasan for his bright comments on my workand his tireless proof reading. I wish to thank our secretary Dora Soderbergfor her kindness whenever I was in trouble. I cannot forget our energeticlunch-time activity - fooseball which gave us many laughs.

This work was financially supported by KTH, Vinnova (Swedish Agencyfor Innovation Systems), Global IP Sound AB, and Samsung Advanced In-stitute of Technology (SAIT). I would like to express my gratitude to Pro-fessor Mikael Skoglund, Professor Gunnar Karlsson, Professor Gerald Q.

ix

Maguire Jr., Ian Marsh, Tomas Skollermo, Gyorgy Dan, and other col-leagues for their positive and friendly discussions during our projects on‘Audiovisual Mobile Internetworking’ and ‘Affordable Wireless Services andInfrastructures’. I am grateful to my previous supervisor Professor JaihieKim of Yonsei University for his mental support and inspiring advice. Dr.Sang Ryong Kim, Professor Dae Hee Youn, Professor Hong Goo Kang, Pro-fessor Nam Soo Kim, Professor Hong Kook Kim, Dr. Yong Duk Cho, Mr.Jeongsu Kim, and Dr. Doh Suk Kim are acknowledged for encouraging meto continue on this academic path.

Thanks to all the members of the Korean student and scholar associationin Sweden (KOSAS) and the Korean church in Sweden. I am also luckyenough to have the support of many good friends in Korea. It was great tomeet my best friend, Young Jay Paik, in Stockholm.

Most of all, I would like to thank my dear wife, Hyun Soo Kwon, andmy lovely daughters, So Yee Kim and So Yon Kim for their endless love.Their support and understanding has been the most essential part for meto complete this thesis. I also would like to share my pleasure with mywife’s parents, my sister’s family, and my uncles’ and aunts’ family. Finally,I sincerely dedicate this thesis to my parents, Eung Tae Kim and ByoungJoo Youn, who I respect most.

Moo Young KimStockholm, November 2004

x

Acronyms

3GPP : 3rd Generation Partnership Project

3GPP2 : 3GPP Project 2

AbS : analysis-by-synthesis

ADPCM : adaptive differential PCM

AFS : adaptive full-rate speech

AHS : adaptive half-rate speech

AMR : adaptive multi-rate

AMR-WB : adaptive multi-rate wideband

ARQ : automatic repeat request

AVC : advanced video coding

BCH : Bose, Chaudhuri, and Hocquenghem

B-pictures : bidirectionally-predicted pictures

BQ-RCQ : RCQ with the biquadratic distortion measure

BS : base station

C/I : carrier-to-interference

CCITT : International Telegraph and Telephone ConsultativeCommittee

CD : compact disc

cdf : cumulative distribution function

CD-MDC : central-distortion optimized MDC

CELP : code-excited linear predictive coding

xi

CS-ACELP : conjugated structured algebraic CELP

DCT : discrete-cosine transform

DFT : discrete-Fourier transform

DVD : digital video disc

DVQ : direct vector quantization

ECQ : entropy-constrained quantization

ECSQ : entropy-constrained scalar quantization

ECVQ : entropy-constrained vector quantization

EGC : El Gamal and Cover

EM : expectation maximization

ETSI : European Telecommunications Standards Institute

EVRC : enhanced variable-rate codec

FEC : forward error correction

FLOPS : floating-point operations

GLA : generalized Lloyd algorithm

GMM : Gaussian mixture model

GOP : group of pictures

GSM : global system for mobile communication

HDTV : high definition television

IEC : International Engineering Consortium

IETF : Internet Engineer Task Force

iid : independent and identically distributed

iLBC : Internet low-bit-rate codec

I-pictures : intra-pictures

ISL : inherent SNR limit

ISO : International Organization for Standardization

ITU : International Telecommunication Union

ITU-T : ITU Telecommunication Standards Sector

xii

JND : just noticeable difference

JPEG : Joint Photographic Experts Group

JVT : joint video team

KLT : Karhunen-Loeve transform

KLT-CVQ : KLT-based adaptive classified VQ

KLT-CVQC : KLT-based adaptive classified VQ with companding

LBG : Linde, Buzo, and Gray

LC : layered coding

LD-CELP : low-delay CELP

LDPC : low-density parity check

LPC : linear predictive coding

LQ : lattice quantization

LSD : log-spectral distortion

LSF : line spectral frequencies

MDC : multiple description coding

MDC-SC : MDC for a single-channel

MDC-TIC : MDC with two independent channels

MDLQ : multiple description lattice quantizer

MDSQ : multiple description scalar quantizer

MDTC : multiple description transform coding

MI-FEC : media independent FEC

MLT : modulated lapped transform

MS : mobile station

MS-FEC : media specific FEC

MTU : maximum transfer unit

optMDC : channel optimized MDC

PCM : pulse code modulation

PCT : pairwise correlation transform

xiii

pdf : probability density function

PER-Q : quantizer based on the perceptual-error criterion

pmf : probability mass function

P-pictures : predicted pictures

PSNR : peak signal-to-noise ratio

PSTN : public switched telephone networks

Q-RCQ : RCQ with the squared-error criterion

RCQ : resolution-constrained quantization

RMS : root mean square

RRD : redundancy-rate distortion

RS : Reed Solomon

RS-FEC : Reed-Solomon based FEC

RTCP : real-time control protocol

SD : spectral distortion

SDC : single description coding

SD-MDC : side-distortion optimized MDC

SMV : selectable mode vocoder

SNR : signal-to-noise ratio

SQ : scalar quantization

SQUARE-Q : quantizer based on the squared-error criterion

SVD : singular-value decomposition

SVS : Servetto, Vaishampayan, and Sloane

TCP : transmission control protocol

UDP : user datagram protocol

ULP : uneven level protection

UQ : uniform quantization

VoIP : voice over IP

VQ : vector quantization

xiv

XOR : eXclusive-OR

ZIR : zero-input response

ZSR : zero-state response

xv

Part I

Introduction

Introduction

Source and channel coding is an essential part of modern life. Audiovisualcommunication and storage systems such as the public switched telephonenetworks (PSTN) based communication systems, mobile phones, voice overIP (VoIP), digital television, compact disc (CD), digital video disc (DVD),and MP3 players, require source and channel coding as fundamental blocks.Fig. 1 shows general building blocks for source and channel coding systems.

sourceencodingsource

sourcedecodingplayout

Channel

channelencoding

channeldecoding

Figure 1: Building blocks of source and channel coding.

The goal of source coding is to represent a source with a good com-promise between rate and quality. For example, for an average bit-rateconstraint, source coding attempts to optimize quality. Decreasing the rateresults in decreased quality, but increasing the rate requires higher cost ofsource coding systems. The rate-distortion tradeoff was first defined as rate-distortion theory by Shannon in 1948 [1,59,60,62]. The minimum achievablerate R at a distortion D is the rate-distortion function R(D). The Shannonlower bound is a lower bound for the rate-distortion function. If we con-sider a k-dimensional vector Xk and a reconstruction error W k, then therate-distortion function R(D) and the Shannon lower bound RSLB(D) are

2 Introduction

given by

R(D) ≥ RSLB(D)

=1

kh(Xk) − 1

ksup

{fW k (wk):

∫

fW k (wk)d(wk)dwk≤D}

h(W k) (1)

where h(Xk) and h(W k) are the differential entropies of Xk and W k, respec-tively, and d(wk) and fW k(wk) are the reconstruction error and its prob-ability, respectively. For a Gaussian source with a squared error criterion,the bound is tight and identical to the rate-distortion function [59,60,62]:

R(D) = RSLB(D) =1

2log(

σ2

D) (2)

where σ2 is a variance of the source.Bennett [2] analyzed the rate-distortion tradeoff using high-rate the-

ory. While rate-distortion theory characterizes rates achievable for infinitedimensionality, high-rate theory assumes a high number of reconstructionlevels, which corresponds to low distortion. The former is considered tobe source coding theory or information theory, while the latter is consid-ered to be quantization theory or high-rate theory [7]. General optimalityconditions and design issues for source coding were developed by Lloyd [3],Max [4], Linde et al. [5], and Chou et al. [6].

The original source descriptions often contain redundancy and irrele-vancy. Redundancy is information that is superfluous and inessential forrepresenting the source. It can be eliminated without loss of source repre-sentation. Irrelevancy is related to the imperceptible part of informationthat can be removed without perceptual quality degradation. Perceptualquality for both human vision and hearing is studied in [131–138]. In papersA, C, and D, we use information theory and high-rate theory to proposenew methods to efficiently remove redundancy and irrelevancy.

In channel coding, k information bits are mapped into n coded bits.Since n ≥ k, channel coding introduces n − k redundant bits. The proba-bility of a channel error, λi, is defined for each information bit i, and themaximal probability of error is

λ(n) = maxi∈{1,2,...,k}

λi. (3)

The rate of an (n, k) code is R = k/n, and a rate R is said to be achievableif the maximum probability of channel error λ(n) approaches zero as n →∞. The channel capacity C is the supremum of all achievable rates, andShannon proved that information bits can be reliably sent over a channelat all rates R below the channel capacity C [1]. Conversely, any (n, k) codewith λ(n) → 0 must have R ≤ C.

1 Source Coding and Distortion Measures 3

From the source and channel separation theorem, it follows that if op-timal source codes for given sources and optimal channel codes for givenchannel conditions are designed separately and independently, the com-bination results in optimal performance. However, if delay is of criticalimportance such as in real-time communication systems, source and chan-nel coding should be optimized jointly. In paper B, we compare differenttransmitter-based packet-loss recovery techniques for real-time audiovisualcommunications over the Internet.

This introduction is organized as follows. In section 1, various distor-tion measures for source coding are introduced. In section 2, based onhigh-rate theory, we first address the optimal centroid density and its corre-sponding mean distortion with various distortion measures. We also discussresolution versus entropy constrained quantization, high-rate theory versusrate-distortion theory, scalar versus vector quantization, distortion distri-bution, and robust quantization against source mismatch. In section 3,practical quantizer design issues are discussed such as the generalized Lloydalgorithm (GLA), entropy-constrained vector quantization (ECVQ), latticequantization (LQ), and random coding theory. In section 4, various packet-loss recovery techniques are introduced, especially forward error correction(FEC) and multiple description coding (MDC). Audiovisual coding stan-dards for speech, audio, image and video are introduced in section 5. Fi-nally, in section 6, we present a summary of the contributions of this thesis.

1 Source Coding and Distortion Measures

Source coding attempts to encode a source description into an index withrate and distortion tradeoff. The choice of an appropriate distortion cri-terion is a fundamental component in the design and analysis of a codingsystem.

Let fXk(xk) be the probability density function (pdf) that a k-dimensional random variable Xk takes the k-dimensional real-value xk. LetQ be a mapping from k-dimensional Euclidean space Rk to a reproductionalphabet having N elements ck

1 , ..., ckN . The most common distortion mea-

sure is a nonnegative function ρ of the difference between an input sourcexk = {x1, ..., xk} ∈ Rk and its reconstruction xk = {x1, ..., xk} ∈ Rk:

d(xk, xk) = ρ(xk − xk) (4)

which is called a difference-distortion measure [21].

Because of its mathematical convenience, one popular difference-

4 Introduction

distortion measure is the mean r-th power distortion measure:

d(xk, xk) = ‖xk − xk‖r

≡ (1

k

k∑

m=1

(xm − xm)2)r2 . (5)

The r-th power distortion measure for r=2 is the mean squared distortionmeasure:

d(xk, xk) =1

k

k∑

m=1

(xm − xm)2. (6)

The weighted squared-error measure is

d(xk, xk) =1

k

k∑

m=1

wm(xm − xm)2 (7)

where {wm} are the weights, and can be generalized into

d(xk, xk) = (xk − xk)B(xk − xk)T (8)

where B is a k × k sensitivity matrix.

Difference-distortion measures are mathematically simple to calculate.However, perceptually relevant distortion measures may not be difference-distortion measures. In [10–13], the mean squared error criterion is not agood measure for image coding in terms of subjective quality, and more per-ceptually relevant distortion measures are developed. For speech coding, thelog-spectral distortion (LSD) measure for quantizing linear predictive coding(LPC) coefficients in a perceptually relevant domain has the form [14–16,33]

d(xk(ω), xk(ω)) = (20 log10 |xk(ω)| − 20 log10 |xk(ω)|)2 (9)

where xk(ω) and xk(ω) are the frequency response of LPC parameters andtheir quantized version, respectively.

Another non-difference distortion measure where the weight matrix B isdependent on xk or xk is:

d(xk, xk) = (xk − xk)B(xk)(xk − xk)T (10)

where B(xk) is a function of xk. Since the perceptual distortion measurein (9) requires heavy computational complexity, it is often modified into theform of (10) [15,16,33].

2 High-Rate Theory 5

2 High-Rate Theory

It is possible to analyze and predict the performance of quantizers of any di-mensionality by using high-rate theory [7,58–62]. High-rate theory was firstdeveloped by Bennett [2] for scalar quantization (SQ), which was extendedto vector quantization (VQ) by Zador [19] and Gersho [20]. VQ maps ablock of k samples xk into an index, while SQ is a special case of VQ withk = 1.

In this section, we present the performance of block source codes un-der the assumption that the rate is high and the distortion measure is ther-th power error criterion. High-rate theory assumes that the pdf doesnot change significantly within a quantization cell. With the r-th powerdistortion measure, an i-th Voronoi region or an i-th Voronoi cell in a k-dimensional vector space is given by

Vi = {xk ∈ Rk : ‖xk − cki ‖r ≤ ‖xk − ck

j ‖r, ∀j > i,

‖xk − cki ‖r < ‖xk − ck

j ‖r, ∀j < i} (11)

where cki and ck

j are i-th and j-th centroids, respectively.

Quantization can be divided into two classes:

• resolution-constrained quantization (RCQ) where the number of cen-troids, N , is fixed;

• entropy-constrained quantization (ECQ) where the index entropy,H(I), is fixed.

Let DRC and DEC be the mean distortion over all cells for RCQ andthe mean distortion for ECQ, respectively. Bennett [2] modelled a nonuni-form quantizer with a companding quantizer that consists of a compressorfollowed by a uniform quantizer, and the inverse of the compressor at theoutput of the uniform quantizer. His companding formula shows that themean squared error of SQ (k=1 and r=2) with many centroids and fixedrate is approximately given by

DRC ≈ 1

12N2(

∫

x∈R

fX(x)

gc(x)2dx) (12)

where N and gc(x) are the number of reconstruction points and the centroiddensity, respectively. Panter and Dite developed a mean squared error ofSQ [7]

DRC ≈ 1

12N2(

∫

x∈R

fX(x)13 dx)3. (13)

6 Introduction

Algazi found a Bennett-like bound for the r-th power distortion mea-sure [17]. Bucklew extended Bennett’s original idea of companding to mul-tidimensional cases [22, 24]. Bucklew and Wise rigorously derived the ex-tended Bennett integral for VQ [23, 25]. Gish and Pierce found that theasymptotically best centroid density for entropy-constrained scalar quantiz-ers is uniform [18]. They also showed that the mean squared distortion ofoptimal ECQ for the scalar case is

DEC ≈ 1

122−2(H(I)−h(X)) (14)

where h(·) is a differential entropy.Zador found the mean distortion of VQ for an r-th power distortion

measure [19]:

DRC = A(k, r, V )N− rk (

∫

xk∈Rk

fXk(xk)k

k+r dxk)k+r

k (15)

and

DEC = B(k, r, V )2−rk (H(I)−h(Xk)) (16)

where A(k, r, V ) and B(k, r, V ) are the quantization coefficients for the poly-tope of a Voronoi region, V . Zador also showed that

k

k + rκ−β

k ≤ B(k, r, V ) ≤ A(k, r, V ) ≤ Γ(1 + β)κ−βk (17)

where β = r/k, Γ(·) is the gamma function, and κk is the volume of thek-dimensional unit sphere. Asymptotically, the upper and lower boundsin (17) agree as k approaches infinity.

Based on Zador’s work, Gersho [20] conjectured that any optimal high-rate quantization has Voronoi regions that are congruent to some polytopeV :

C(k, r, Vi) = A(k, r, V ) = B(k, r, V )

≡ C(k, r, V ) (18)

where the dimensionless quantization coefficient C(k, r, Vi) is defined by

C(k, r, Vi) =

∫

Vi‖xk − ck

i ‖rdxk

v(Vi)k+r

k

. (19)

Using (18), the mean distortion over all cells can be derived. For theresolution-constrained case, the number of centroids, N , is fixed, and theoptimal centroid density is

gc(xk) ≈ N

fXk(xk)k

k+r

∫

xk∈Rk fXk(xk)k

k+r dxk(20)


and the associated distortion is

DRC ≈ C(k, r, V )2−rk R(

∫

xk∈Rk

fXk(xk)k

k+r dxk)k+r

k

≡ C(k, r, V )2−rk R〈fXk(xk)〉 k

k+r(21)

where R is the required rate and

〈fXk(xk)〉α = (

∫

xk∈Rk

fXk(xk)αdxk)1/α. (22)

For the entropy-constrained case, the average rate, H(I), is fixed. Theoptimal ECQ centroid density is constant:

gc(xk) ≈ gc

≈ 2−rk (H(I)−h(Xk)) (23)

and the mean distortion of ECQ is

DEC ≈ C(r, k, V )2−rk (H(I)−h(Xk)). (24)

The optimal high-rate ECQ has a uniform centroid density, and theindices of uniform quantizers are coded with lossless coding. The goal oflossless coding is to represent the index of a quantization point withoutintroducing distortion at the shortest average codeword length possible. If acell has a higher probability, the codeword length is shorter. Huffman codingprovides codes with an average length of not more than H(I) + 1 bits for aknown source pdf with relatively few symbols [8]. Arithmetic coding yieldscodes with average length not more than H(I) + 2 bits for a given sourcepdf [9]. While Huffman coding encodes each symbol separately, arithmeticcoding encodes a supersymbol (block of symbols). Thus, the average lengthper symbol of arithmetic codes is far less than that of Huffman codes. If asource pdf is not known, the pdf can be estimated with maximum-likelihood,maximum a-posteriori, or Bayesian estimates. The dynamically estimatedsource pdf can then be used with the arithmetic coding [62]. A Ziv-Lempelcode does not need to calculate the empirical pdf [60].

Since ECQ allows variable codeword length while RCQ allows fixed code-word length, ECQ requires a lower average rate than RCQ to obtain thesame distortion. However, ECQ cannot be used for fixed-rate communica-tion systems. More comparisons are described in subsection 2.5.

In the following subsections, based on high-rate theory, we discuss:

• quantization with various distortion measures;

• advantages of vector quantization over scalar quantization;

8 Introduction

• distortion distribution;

• robust quantization against source mismatch;

• comparison between RCQ and ECQ;

• and comparison between R-D theory and high-rate theory.

2.1 Quantization with Various Distortion Measures

Zador and Gersho studied the asymptotic performance of k-dimensionalVQ with an r-th power distortion measure [19, 20]. Yamada et al. ex-tended Zador and Gersho’s result to VQ with difference-distortion measuresin (4) [21].

Gardner and Rao extended the previous works to the non-differencedistortion measure [33]. They consider a distortion measure d(xk, xk) havingthe following constraints:

• d(xk, xk) ≥ 0 and equality holds if and only if xk = xk;

• d(xk, xk) is continuous and continuously differentiable.

By applying the Taylor expansion of d(xk, xk), Gardner and Rao approxi-mated the distortion measure as

d(xk, xk) ≈ 1

2(xk − xk)T B(xk)(xk − xk) (25)

where B(xk) is the sensitivity matrix. Based on this locally-quadratic non-difference distortion measure, the new centroid density and mean distortionbounds were derived for resolution-constrained quantization. In [33], thesensitivity matrix has an important application in the quantization of LPCparameters for speech coding.

In [34, 35], the Gardner and Rao distortion measure was more formallyderived and applied to image coding. The derived distortion measure wasused to find the centroid density and the mean distortion with a constrainton entropy. With a non-difference distortion measure, the optimal centroiddensity gc(x

k) for ECQ is not uniformly distributed. In [34, 35], Li et al.showed that the density gc(x

k) is higher in locations where the value ofdet(B(xk)) is higher:

gc(xk) = N

[det(B(xk))]1/2

∫

xk∈Rk [det(B(xk))]1/2dxk. (26)

In [36,37], the high-rate rate-distortion function is investigated for a classof non-difference distortion measures. For the source-dependent weighted


mean squared error criterion in (25), the rate-distortion function is approx-imated by

R(D) ≈ h(Xk) − k

2log(2πeD/k)

+1

2E[log det(B(xk))]. (27)

Note that for B(xk) = I, (27) reduces to the R(D) for the regular meansquared error case.

In [38], a rigorous derivation for the asymptotic distortion of a multidi-mensional compander based ECQ is given for a locally-quadratic distortionmeasure. The optimal compressor does not depend on the source distri-bution, but is determined by the distortion measure. If the compressorand uniform quantizer satisfy the optimality condition, the high-rate per-formance approaches the rate-distortion bound for large dimensionality.

In paper D, we consider just noticeable difference (JND)-based distortionmeasures that are not included in earlier works. Based on high-rate theory,we find bounds and approximations of the mean distortion.

2.2 Advantages of Vector Quantization over ScalarQuantization

In this subsection, for an r-th power distortion measure, the mean dis-tortion bounds of SQ and VQ given by the Bennett integral in (12) andZador-Gersho’s formula in (21) are compared for the resolution and entropyconstrained cases.

Under the resolution-constrained condition, VQ has memory, space-filling, and shape advantages over SQ . The ratio of the distortion of SQ tothat of VQ is defined as the VQ advantage factor

DRC(1, r, V1)

DRC(k, r, Vk)= FRC(k)MRC(k)SRC(k) (28)

where the space-filling advantage factor is

FRC(k) =C(1, r, V1)

C(k, r, Vk), (29)

the memory advantage factor is

MRC(k) =〈Πi=k

i=1fXki(xk

i )〉 11+r

〈fXk(xk)〉 kk+r

, (30)

and the shape advantage factor is

SRC(k) =〈fX(x)〉 1

1+r

〈Πi=ki=1fXk

i(xk

i )〉 11+r

. (31)

10 Introduction

The space-filling advantage results from the optimal shape of the Voronoiregions for each dimensionality. The maximum gain for the space-fillingadvantage is approximately 1.53 dB for r=2. As the dimensionality k goes toinfinity, the Zador upper bound for C(k, r, Vk) approaches the sphere boundand C(k, r, Vk) = (2πe)r/2 [19]. This is the minimum value of C(k, r, Vk)for the optimal vector quantizer compared with C(1, r, V1) = 1/12 for thescalar quantizer.

The memory advantage accounts for the dependencies between the sam-ples. If the vector components are independent, the memory advantagedisappears.

The shape advantage shows how well the centroid density of SQ can bematched to the optimal centroid density of VQ. The shape advantage isrelevant only for RCQ (it is always unity for ECQ). If the source density isuniform, the shape advantage vanishes. In [30], the shape advantage is di-vided into oblongitis and the remaining density advantages. The oblongitisadvantage is the ratio of the distortion of a quantizer that has a cubic cellshape over that of a quantizer which has the rectangular cell shape.

In the entropy-constrained case, the shape advantage of VQ vanishessince the optimal centroid density of ECQ is uniform, and the remainingVQ advantage factors are

DEC(1, r, V1)

DEC(k, r, Vk)= FEC(k)MEC(k) (32)

where FEC(k) = FRC(k). We note that the shape advantage still exists fora non-difference distortion measure, since the optimal centroid density isnot uniform in (26).

The space-filling advantage factor FEC(k) is identical to FRC(k), andthe memory advantage factor is

MEC(k) = 2rk (kh(Xi)−h(Xk)) (33)

where kh(Xi) and h(Xk) are determined by Πi=ki=1fXk

i(xk

i ) and fXk(xk),respectively.

In paper A, we propose a new resolution-constrained quantizer that uti-lizes the three VQ advantages more efficiently than a conventional speechcoding paradigm, code-excited linear predictive (CELP) coding.

Even though VQ has three advantages over SQ for the resolution-constrained case, VQ can be computationally complex and need hugeamount of storage as dimensionality k increases. To reduce complexity,structured quantizers, such as product, tree-structured, multi-stage, andtransform-based quantizers, have been proposed. In [27–30], loss analysesfor such quantizers based on high-rate theory are presented.


2.3 Distortion Distribution

High-rate theory provides the asymptotic average distortion for many quan-tization points. An analytical formula for the pdf of the distortion is pre-sented in [31]. The pdf of error depends on the source pdf, the centroiddensity, and the quantizer shape profile.

For scalar quantization, the error pdf is [31]

fV (v) =

∫

gc(x)≤1/2|v|fX(x)gc(x)dx (34)

where v is quantization error, x − x. Using the formula of fV (v), Lee andNeuhoff revisited the Bennett integral for scalar quantization [2, 30].

For vector quantization, if the normalized quantization error is w =N1/k||xk − xk||r, the error pdf is [31]

fW (w) = kκkwk−1

∫

fXk(xk)gc(xk)S(xk, wgc(x

k)1/k)dxk, w ≥ 0 (35)

where S(xk, ρ) is the quantizer shape profile. The shape profiles of a circle,hexagon, and several rectangles for two-dimensional cases are also givenin [31]. The Bennett integral for vector quantization [30] can be provenusing the error function fW (w). For a spherical cell shape, the error pdf isapproximated by

fW (w) = kκkwk−1

∫

cfXk (xk)k/(k+r)≤κ−1

k w−k

cfXk(xk)(2k+r)/(k+r)dxk

= kκkwk−1

∫

cfXk(xk)(2k+r)/(k+r)dxk,

for w ≤ 1/(cκk)1/kf1/(k+r)M (36)

where c = [∫

fXk(xk)k/(k+r)dxk]−1 and fM is the maximum value offXk(xk). Simulation results with 2, 3, and 4 dimensional Gaussian sourceshow that (36) is a good approximation. In [32], a rigorous derivation ofthe formula (35) is given.

2.4 Robust Quantization against Source Mismatch

Source mismatch between training and test data is unavoidable in practice.Source mismatch occurs when a quantizer that is trained for one source isapplied to another source.

An upper bound on the performance loss of scalar quantizers due to asource mismatch was derived in [41]. In rate-distortion theory, it has beenknown that Gaussian sources yield the highest distortion at a given rate fora mean squared error criterion [43]. It has also been known in rate-distortion

12 Introduction

theory that if a code is designed for a Gaussian source with the mean andvariance of the actual source and has mean distortion DG, then the meandistortion of the code applied to another source with the same mean andvariance is less than or equal to DG [43,47,48]. The code generated for theworst source is called a robust code.

Another approach to robust quantization is for a collection of sourcepdf’s with an r-th power distortion measure [40, 44, 45]. Since the sourcepdf is generally incompletely known in real applications, a class of sourcepdf’s instead of a unique pdf is defined and a quantizer is designed forthat class of sources. The designed quantizer guarantees performance forthe specified source pdf’s. Robust designs are called two-person games: adesigner attempts to minimize distortion, and an opponent tries to choosethe most difficult family of source pdf’s to maximize distortion. Let f , g,and D(f, g) be source pdf, centroid pdf, and mean distortion, respectively.A robust quantizer (f∗, g∗) for RCQ exists, if the following inequality holds:

D(f, g∗) ≤ D(f∗, g∗) ≤ D(f∗, g) (37)

where f ∈ F and F is a predetermined class of pdf’s. The left inequalitymeans that the distortion is no more than D(f ∗, g∗) for any f if the centroiddensity g∗ is selected. The left inequality means that the distortion is notless than D(f∗, g∗) if the worst pdf f∗ is chosen. Since (f∗, g∗) does notalways exist, an alternative approach is the minimax solution

D(f∗, g∗) = ming

maxf∈F

D(f, g). (38)

For the entropy-constrained case, the best centroid density has been shownto be uniform [18–20]. Thus, for the entropy-constrained case, robust quan-tization is equivalent to finding the f ∈ F that maximizes the entropy.

In [40], it is shown that, for the case that amplitudes of the source arebounded in [a, b], a scalar minimax quantizer for RCQ is uniform for [a, b].If we have very little knowledge of source, it is shown in [40] that uniformquantization is a good choice. In [44], the Bennett companding model wasused to find the minimax solution for scalar quantization of a fixed-momentconstrained pdf. In [45], robust vector quantizers were proposed for threeclasses of distributions:

• the class with fixed moments:

F1 = {f(xk);

∫

m(xk)f(xk)dxk = q} (39)

where m(xk) ≥ 0 is the moment function and q is an arbitrary con-stant;


• the class of ε-contaminated pdf’s:

F2 = {f(xk); f(xk) = (1 − ε)fk(xk) + εfu(xk)} (40)

where fk(xk) and fu(xk) are the known and unknown pdf’s, respec-tively;

• and the class of pdf’s which are characterized by a lower and an upperbound of amplitude:

F3 = {f(xk); f1(xk) ≤ f(xk) ≤ f2(x

k)} (41)

where f1(xk) and f2(x

k) are known functions, but are not alwaysdensity functions.

It is shown in [45] that the maximum norm element of the above classes ofdistributions is the solution f∗:

f∗(xk) = arg maxf(xk)∈F

〈f(xk)〉 kk+r

(42)

and the optimal centroid density is

g∗(xk) = Nf∗(xk)

kk+r

∫

xk∈Rk f∗(xk)k

k+r dxk. (43)

In [35], it is shown that when a resolution-constrained scalar quantizeris designed for a source pdf fX(x), and that quantizer is tested with anothersource fY (x), as k goes to infinity, the distortion loss due to the mismatchis given in the log-domain by the relative entropy H(fY (x)||fX(x)).

In [50–52], it is shown that when an entropy-constrained vector quan-tizer that is trained optimally for fXk(xk) is applied to a source fY k(xk), iffY k(xk)/fXk(xk) is bounded, the rate loss is given by H(fY k(xk)||fXk(xk)).Interestingly, the distortion loss of RCQ and the rate loss of ECQ are de-pendent on the relative entropy of the mismatched sources.

In papers C and D, the source mismatch problem is considered in thedesign of quantizers. We consider mean and variance mismatch of the Gaus-sian source, and we also apply the quantizers to practical situations such astraining and testing data mismatch of line spectral frequencies (LSF) in aspeech coding application.

2.5 Comparison between Resolution-Constrained andEntropy-Constrained Quantization

From high-rate theory, RCQ and ECQ are known to be optimal fixed andvariable rate coding schemes, respectively. If channels allow variable-rate

14 Introduction

transmission, ECQ is the best choice. For fixed-rate networks, such asPSTN, RCQ has been used widely [141,142,146].

ECQ has three advantages over RCQ: reduction of the rate requiredto obtain the same mean distortion, reduction of the number of distortionoutliers, and increased robustness against source pdf mismatch.

For a Gaussian source, Max showed that optimal ECQ has around 0.47bits/sample reduction compared to optimal RCQ for a scalar quantizationcase [4]. For vector quantization cases, as k increases, the required ratedifference between ECQ and RCQ is about 0.72 bits in total. Thus, therequired rate advantage of ECQ over RCQ decreases as k increases. Wealso note that optimal ECQ needs variable rate transmission which is notalways feasible in practical situations and which is more sensitive to channelerrors.

The optimal centroid density of ECQ is uniformly distributed, while thatof RCQ depends on the source pdf [3,5,39,54,56,64]. For RCQ, at a sparselypopulated region of the source pdf, the volumes of the Voronoi region arelarge. Large Voronoi regions produce distortion outliers that are often notacceptable in terms of perceptual quality. However, ECQ produces equalcells and thus the mean distortion of the individual cells is identical. Sincethe mean-distortion variance between cells is zero, ECQ produces a smallernumber of distortion outliers than RCQ for the same average distortion.

If the pdf of the test source is different from the pdf of the training data,the mean distortion of RCQ is increased at a given fixed rate transmission.However, misestimation of the source pdf for ECQ increases the averagerate, but the distortion performance is preserved.

The three advantages of ECQ result from uniform quantization with loss-less coding. In papers C and D, we design a new RCQ scheme to efficientlyexploit the advantages of ECQ.

2.6 Comparison between R-D Theory and High-RateTheory

The Shannon rate-distortion bound considers the dimensionality of sourcesamples to be asymptotically high for a given average rate, while high-rate theory assumes that the codebook size is asymptotically large for fixeddimensionality [7, 58–62]. If both dimensionality and rate are infinite, therate-distortion bound and the high-rate bound are identical. However, formoderate dimensionality and moderate rate, high-rate theory is preferred.For an independent and identically distributed (iid) Gaussian source, high-rate theory with rates higher than 2 ∼ 3 bits/symbol has known to beaccurate, but rate-distortion theory requires more than 250 dimensions [7].In papers A, C, and D, new quantization methods are proposed, and theirperformance is analyzed with high-rate theory.

3 Quantizer Design 15

The Shannon rate-distortion bound can be calculated analytically forsquared-error and absolute-error criteria with Gaussian and Laplaciansources. If the rate-distortion bound cannot be computed analytically, theBlahut algorithm can be used [57].

To calculate the high-rate bound based on the works of Zador and Ger-sho, we must know the quantization coefficient C(k, r, V ) [20,64]. If analyticcalculation of the high-rate bound is not feasible, numerical integration isutilized [63].

Since rate-distortion theory assumes an average rate for a long blocksize, it is more natural to compare the rate-distortion function with high-rate ECQ than with high-rate RCQ. For a squared-error criterion (r=2),the Shannon lower bound in (1) is given by

RSLB(D) =1

kh(Xk) − 1

2log2(2πeD) (44)

where D = D/k. The mean distortion of high-rate ECQ is

1

kH(I) =

1

kh(Xk) − 1

2log2(

DEC

C(2, k, V )). (45)

By subtracting (44) from (45), we obtain

1

kH(I) − RSLB(D) =

1

2log2(2πeC(2, k, V )). (46)

For a scalar quantizer, C(2, 1, V ) = 1/12 and thus the right hand side of (46)is about 0.25. As the dimensionality k increases, C(2, k, V ) approaches1/(2πe) and the high-rate performance of ECQ reaches the Shannon lowerbound. As k increases, the high-rate performance of RCQ also approachesthe Shannon lower bound [62].

3 Quantizer Design

Rate-distortion theory and high-rate theory lead to analytical expressionsfor quantizer performance. In this section, we introduce practical quantizerssuch as GLA, ECVQ, lattice quantization, and random codebook genera-tion.

3.1 Generalized Lloyd Algorithm

The design problem for quantizers has been considered in numerous studies.Lloyd [3] found optimality conditions for the N -level resolution-constrainedscalar quantizer and proposed an iterative algorithm for the mean squarederror criterion. The optimality conditions are that the set of Voronoi regions

16 Introduction

must be optimal for the given set of centroids, and that the set of centroidsmust be optimal for the given set of Voronoi regions. If the squared-errormeasure is used, for a given Voronoi region Vi, a centroid ci is given by

ci = arg miny∈R

E[d(x, y)|x ∈ Vi]

= E[x|x ∈ Vi]. (47)

Lloyd’s optimality conditions for r-th power distortion measures were provenby Max [4].

Linde, Buzo, and Gray (LBG) [5] generalized the resolution-constrainedLloyd algorithm to a vector quantizer for either known pdf of source orfor source sample data. The GLA is a double minimization problem andcan be solved in an iterative manner. Firstly, the optimal partition {Vi}for given centroids {ck

i } is found. Given the partition {Vi}, the optimalcentroid {ck

i } is found at the second step. The iterations are terminated ifthe improvement per iteration is below a threshold.

In [5], a splitting-based initialization method was proposed. The mean ofthe source is usually assigned as an initial codevector ck

1 , and the codebook isdoubled in size by splitting. Given N quantization points ck

1 , ..., ckN , new 2N

quantization points are generated for GLA training by splitting cki into ck

i +εk and ck

i − εk, where εk is a fixed perturbation vector. The mean distortionproduced by high-rate prediction and LBG are similar for moderate rate [5].

Since the source pdf is usually unknown, training data can be directlyapplied to find centroids and their Voronoi regions. For an unknown pdfwith a squared-error measure, instead of using (47), the centroid is estimatedby

ci =1

|Vi|∑

x∈Vi

x (48)

where |Vi| is a population of x in an i-th Voronoi region Vi. The Voronoiregion is a collection of data points instead of cell boundaries for partitioningthe data space. This algorithm is called the discrete GLA. The discrete GLAconverges in a finite number of iterations.

Since unstructured vector quantization requires high complexity andlarge memory as dimensionality and rate increase, structured vector quan-tizers have been developed such as tree-structured VQ [14], multi-stageVQ [72], pyramid VQ [73], polar VQ [74], split VQ [16], lattice VQ [20,64],etc. However, these structured vector quantizers yield nonoptimal perfor-mance in general.

3 Quantizer Design 17

3.2 Entropy-Constrained Vector Quantization

The GLA can be extended to ECVQ [6]. Based on a Lagrangian formulation,a new distortion criterion is defined as

η = d(xk, cki ) + λli (49)

where cki , λ, and li are an i-th centroid, a Lagrange multiplier, and the

length of the i-th codeword, respectively. The codeword length for an i-thVoronoi region is given by

li = − log2(|Vi|

∑Ni=1 |Vi|

). (50)

A Voronoi region for ECVQ is estimated for a given centroid cki and its

corresponding codeword length li:

Vi = {xk ∈ Rk : d(xk, cki ) + λli ≤ d(xk, ck

j ) + λlj , ∀j > i,

d(xk, cki ) + λli < d(xk, ck

j ) + λlj , ∀j < i} (51)

and its new centroid and codeword length are calculated by (48) and (50)in an iterative manner, respectively. Iteration steps to find Vi, li, and ck

i

are terminated, if the improvement per iteration is below a threshold. Theimprovement is also estimated with the new distortion measure in (49).The Lagrange multiplier λ decides the operating point of average rate anddistortion. If λ = 0, ECVQ reduces to GLA.

3.3 Lattice Quantization

Gersho conjectured that optimal high-rate quantization of a uniform sourcehas Voronoi regions that are congruent to some polytope V , which naturallyleads to lattice vector quantization [20]. Lattice vector quantization withlossless coding can be seen as an extension of the work of Gish and Pierce[18].

In lattice quantization (without considering companding), all theVoronoi regions have the same shape, size, and orientation (rotation). Thedesign of a lattice VQ can be separated into finding a generator matrixG, finding a truncation and scaling factor, and assigning an index to thecodevectors. The lattice is defined by

Λ = {GT uk;uk ∈ Zk}. (52)

To use only a finite number of centroids, the lattice must be truncated

C = Λ ∩ S (53)

18 Introduction

where S is a bounded region in Rk. The truncated lattice is then multipliedby a scaling factor to cover the source pdf. The truncation shape and scalingfactor can be estimated with repeated experiments to reduce distortion [68].

Lattice VQ has fast encoding and requires low memory, which allowsmuch larger dimensionality than GLA-based unconstrained VQ. Since thecentroids and their Voronoi regions are generated with G, almost no memoryis required to store the codebook.

In [69], by using a Voronoi code, a fast index assignment algorithmwas proposed. The Voronoi code CΛ consists of all vectors xk − bk forxk ∈ Λ ∩ (bk + Vr) where Vr is the r-times magnified Voronoi region. Thisalgorithm requires a fast coding method to find the closest point. In [70],Conway and Sloane presented a fast coding method for a variety of lattices.In [71], Sayood and Blankenau proposed a more generic coding algorithmwhich will work with all kinds of latticea but is slower.

For RCQ of non-uniform sources, a lattice VQ (uniform SQ for k=1) canbe combined with companding proposed by Bennett and Gersho [2,20]. ForECQ, since the optimal centroid density is uniform, a lattice VQ followedby lossless coding can be used.

The performance of lattice quantization depends on the quantizationcoefficient. If k and r are fixed, the quantization coefficient is determinedby the lattice structure G. Although a sphere would be the optimal latticefor quantization, we cannot partition a k-dimensional space into spheres andhence the associated lower bound on distortion is overly optimistic. Conwayand Sloane found the best-performing known lattices of infinite uniformpdfs, for example, in 2-8 dimensions: A2, A

∗3, D4, D

∗5 , E∗

6 , E∗7 , and E8 [64–

66]. Agrell and Eriksson found them for 9-10 dimensions [67].

Jeong and Gibson [68] argued that, for RCQ, the lattice should be trun-cated by a contour of constant pdf of the source. The theta function isused to truncate a lattice with the correct number of VQ points. Theyshowed that the shape of the constant pdf contours for a Gaussian sourceis a sphere, while that for a Laplacian source is a pyramid.

3.4 Random Coding based on High-Rate Theory

Zador gave a distortion upper bound for a random code [19]. This upperbound converges to the sphere bound as the dimensionality increases. Forsufficiently high dimensionality, a random codebook generated by optimalcentroid density based on high-rate theory gives similar performance asoptimal codebook design methods such as GLA [39,54–56].

To generate an RCQ codebook with random coding, the source pdffXk(xk) must be known. It can be approximated well by a Gaussian mix-ture model (GMM). A GMM is a weighted sum of multivariate Gaussian

4 Channel Coding 19

densities:

fXk(xk) =

M∑

i=1

ρifi(xk|ui, Ci) (54)

where the weights ρi, means ui, and covariance matrices Ci of each GMMcomponent are calculated with the expectation maximization (EM) algo-rithm.

The corresponding centroid density gc(xk) for RCQ is given by high-

rate theory in (20). Since the random codebook generation based on thiscentroid density is not simple, the codebook is generated for a GMM-basedpdf instead in [39,54,55]. For the GMM-based pdf, a random codevector isgenerated with two steps: (1) a mixture component i is selected based on

the probability ρi/∑M

j=1 ρj , (2) and a codeword is randomly generated for

the selected Gaussian pdf fi(xk|ui, Ci). In [54], the codebooks are generated

with the centroid density in (20) by using the rejection method [63]. It isshown that both methods produce similar distortion for a high dimensionalsource (k=10 in their spectral coding example).

The above codebook design method does not need to store codebook.Thus, complexity resulting from memory requirement is negligible even forhigh-dimensional VQ cases. In [39, 54–56], performance bounds of spectralquantization for the 10 dimensional case were given.

In papers C and D, new high-rate distortion bounds and centroid densityare found for better perceptual quality, and GMM-based random coding isa promising design alternative.

4 Channel Coding

Source coding attempts to remove all the redundancy to efficiently representa source in a compressed version, whereas channel coding adds redundancyto recover channel error. For example, in the Internet, the channel errorincludes single bit error and packet loss.

In this thesis, we focus on channel coding for packet-loss recovery. Weassume that bit errors of wired/wireless channels are corrected by bit-errorcorrecting codes, such as convolutional coding, and that packet losses arecorrected by packet-loss recovery techniques. In paper B, we comparepacket-loss recovery techniques to find the best one for varying channelconditions at a given overall rate.

The Shannon separation theorem states that source and channel codingcan be designed separately, which yet results in the optimal performance ofcommunication systems if the block length is infinite and the entropy rateof the source is less than the channel capacity in a time-invariant system [1].However, infinite block length causes infinite delay and infinite complexity,

20 Introduction

which is not feasible in reality. The communication channels may also betime-varying and band-limited especially in multi-user communication sys-tems. Thus, for optimal system design, source and channel coding must bejointly optimized [75,76].

In [77], Sayood et al. classified joint source-channel coding into fourclasses:

• joint source channel coders where source and channel coders are trulyintegrated (i.e., channel-optimized quantization [7]);

• concatenated source channel coders where source and channel codersare cascaded and fixed total rates are divided into the source and chan-nel coders according to channel condition (i.e., FEC [87] for adaptivemultirate (AMR) speech coding [154]);

• unequal error protection where more bits are allocated to more im-portant parts of source (i.e., uneven level protection [83]);

• and constrained source-channel coding where source encoders and de-coders are modified depending on channel condition (i.e., MDC [94,104,106,111]).

In the following subsections, a variety of packet-loss recovery techniquesis introduced. Especially, the second class of joint source-channel coding(FEC) and the fourth class (MDC) are more thoroughly studied for packet-loss recovery.

4.1 Packet-Loss Recovery Techniques

Packet-loss recovery techniques are divided into two classes: transmitter-based and receiver-based recovery techniques as shown in Fig. 2 [78–80].

The receiver-based error concealment techniques are of use when atransmitter-based scheme repairs most of the burst losses, and leaves asmall number of gaps to be repaired [81, 82]. Error concealment schemesproduce a replacement for a lost packet. This is possible since the perceivedfeatures of audiovisual signals change slowly over time, thus they are quitesimilar to each other during a short-time period.

Error concealment methods can be divided into three categories: in-sertion, interpolation, and regeneration. Insertion-based schemes repairlosses by inserting a fill-in packet by splicing (zero-length fill-in), silencesubstitution (zero-amplitude fill-in), noise substitution (repetition of theprevious packet with fading), and extrapolation (for example, losses duringvoiced speech are corrected using pitch-cycle waveform repetition) [142].Interpolation-based schemes fill gaps by stretching the received informationin both ends of the losses. Interpolation-based schemes, such as time-scale

4 Channel Coding 21

AutomaticRepeatreQuest(ARQ)

ForwardError

Correction(FEC)

MultipleDescription

Coding(MDC)

Media-Independent

FEC (MI-FEC)

Transmitter-based

Media-Specific

FEC (MS-FEC)

Receiver-based

Insertion Interpolation Regeneration

InterleavingLayered Coding

(LC)

Uneven Level

Protection (ULP)

Figure 2: Transmitter-based and receiver-based packet-loss recoverytechniques.

modification, have better reconstruction quality than insertion-based tech-niques with additional delay requirements. Regeneration uses knowledge ofsource coding algorithms to synthesize lost parts of audiovisual signals. Forexample, in [146], spectrum parameters, excitation signal, and gain param-eters are extrapolated instead of extrapolating the speech signal itself.

The basic transmitter-based mechanisms available to recover packet lossinclude automatic repeat request (ARQ), media independent FEC (MI-FECor generic FEC) and media specific FEC (MS-FEC), multiple descriptioncoding (MDC), interleaving, layered coding (LC), and uneven level protec-tion (ULP).

If the end-to-end delay is of secondary importance, such as in radiobroadcasts, ARQ, FEC and interleaving can be successfully applied forpacket-loss recovery. Interleaving distributes a burst error with large gapsinto multiple single errors while it does not introduce additional networkload. If the delay is of critical importance, then low-delay FEC and MDCschemes can be used. In paper B, we compare rate-distortion bounds ofFEC and MDC for real-time audiovisual communication systems.

ULP [83] is used, e.g., in combination with FEC and MDC to exploitthe unequal importance of data. UDP Lite provides an application layerwith a corrupted packet instead of discarding it at a protocol layer. It isup to the application to decide whether the damaged packet can be used ordiscarded.

It has to be noted that even though the described techniques help tosolve the packet-loss recovery problem, some of them, such as ARQ andFEC, introduce additional load to the network, which results in increased

22 Introduction

FECEncoder

channel errorD1 D2

P1

D1 D2 P1×FEC

Decoder

D1

D2P1D1 D2

Figure 3: Example of media-independent FEC [79].

probability of congestion and packet loss. The best tradeoff always dependson the application and network conditions.

The basic channel coding blocks for real-time audiovisual communicationsystems are

• FEC and MDC for the transmitter-based packet-loss recovery;

• FEC and ULP for bit error detection and correction for bit errors;

• and receiver-based error concealment.

In the following subsections, we describe basic ideas of FEC, MDC, andrate-distortion optimized packet loss recovery techniques.

4.2 Forward Error Correction (FEC)

In FEC, lost data are recovered at the receiver without further reference tothe transmitter [87]. Both the original data and the redundant informationare transmitted to the receiver. There are two kinds of redundant infor-mation: those that are either independent of or dependent on the mediastream:

• Generic FEC (media-independent FEC) [84] does not require infor-mation about the original data type (speech, audio, or video). Ingeneric FEC, original data together with some redundant data, calledparities, are transmitted to the receiver.

• In media-specific FEC (MS-FEC), if an original data packet is lost,redundant data packets, which are related to the specific media, areused to recover the loss [85]. The first transmitted data is referred toas the primary encoding and redundant transmissions as secondaryencoding. Usually, the secondary packet is produced using a lower-bandwidth encoding method than the primary encoding, which resultsin lower-quality.

In paper B, we show that MS-FEC is a subclass of MDC, and FECin the remaining section means generic FEC. In generic FEC, k original

4 Channel Coding 23

data packets and additional h redundant parity packets are transmitted.Figure 3 shows an example for k=2 and h=1. The FEC encoder producesone redundant packet (P1) for two data packets (D1 and D2). If one datapacket (D2) is lost, the receiver can recover the data packet (D2) by usingthe successfully received packets D1 and P1.

In generic FEC, redundant data, called parity, is derived from the orig-inal data using the eXclusive-OR (XOR) operation (one parity packet isgenerated for the given original packets), using Reed-Solomon codes (mul-tiple independent parities can be computed for the same set of packets), orother techniques. The Reed-Solomon codes can achieve optimal loss pro-tection, but lead to higher processing costs than schemes based on XORoperation.

The advantages and disadvantages for the generic FEC schemes are:

• the operation of generic FEC is source independent;

• the original data packet can be used by receivers that are not capableof FEC;

• FEC coding requires additional bandwidth for the efficient channelcoding;

• parity packets can be useless at a decoder, if the number of receivedpackets exceeds the number of required packets to recover packetlosses.

Linear block codes and convolution codes are subclasses of generic FEC.In the following sections, we focus on linear block codes [86,87].

Linear Block Codes

Linear block codes generate n code elements u = {u1, u2, ..., un} from koriginal message elements o = {o1, o2, ..., ok} with a k×n generating matrixG:

u = oG. (55)

If we denote the channel error term as e, the signal r received by the decoderis

r = u + e = oG + e. (56)

For each generating matrix, a parity-check matrix H can be constructedsuch that GHT =0. For error free channels, e=0 and thus rHT =0. However,if detectable channel errors occur, rHT is not zero. If the detected errorsare correctable, the syndrome vector

S = rHT = eHT (57)

24 Introduction

represents a unique error pattern sequence that is called the coset.If the minimum distance of the codewords in a vector space Vn is dmin,

the error-detection capability ε and error-correction capability t are

ε = dmin − 1

t = bdmin − 1

2c (58)

and such codes are denoted as (n, k, t) codes.One of classes of linear block codes are cyclic codes. If u =

{u0, u1, ..., un} is a cyclic codeword, single and multiple cyclic shifts of thecodeword is also a cyclic codeword. Bose, Chaudhuri, and Hocquenghem(BCH) codes and Reed-Solomon (RS) codes are examples of cyclic codes.BCH codes are generalizations of the Hamming code [86,88,89]

(n, k, t) = (2m − 1, 2m − 1 − mt, t). (59)

RS codes are maximum distance separable, and thus have the best errorcorrecting capability at the same length and dimensionality [86,90,91]

(n, k, t) = (2m − 1, 2m − 1 − 2t, t). (60)

RS codes are constructed in an extended Galois field GF (pm) where p andm are a prime number and a positive integer, respectively. RS codes canbe shortened to any desired size from larger codes, for example, by paddingzeros to the unused parts of blocks. These zeros are not sent, but at thedecoder they are inserted again to recover lost packets. The CD formatutilizes shortened RS codes.

The error patterns in RS codes are not of importance if the numberof incorrect received symbols is not larger than the maximum number ofsymbols that can be correctable. With an XOR operation, the pattern ofthe recoverable lost data is much more restricted. However, RS codes aremore complex than the XOR operation, which is an important disadvantageof RS codes. This higher complexity is mostly due to the syndrome testin (57) which is used to find a unique error pattern from the huge numberof syndrome sets.

However, for erasure channels such as packet-switched networks, theerror locations are known to the decoder and the complexity of the RS codeis not a problem. Moreover, the error correction capability t is increased to

t = dmin − 1 = n − k. (61)

Half of the information in syndromes is used to detect error locations, andthe other half is used to correct the detected errors. If error patterns areknown, all the syndrome information can be used to correct errors.

4 Channel Coding 25

Encoder 1

Encoder 2

Decoder 1

Decoder 0

Decoder 2

Channel 1

Channel 2

X

X1

X0

X2

^

^

^

D0

D1

D2

R1

R2

Figure 4: Multiple description coding: two encoders and three decoders.

Turbo codes [93] and low-density parity check (LDPC) codes [92] havea performance that approaches the Shannon limit. These codes use itera-tive decoding with soft decision with likelihood functions. Turbo codes areconcatenated forms of two or more convolution or block codes.

In this thesis, since we focus on packet-loss recovery techniques withstringent delay constraints, we mainly consider the RS codes. However,for bit-error channel, other error correcting codes such as turbo codes andLDPC codes can be deployed.

4.3 Multiple Description Coding (MDC)

Channel coding adds redundancy to encoded source streams for obtain-ing robustness against channel errors. If we simply send identical encodedstreams twice over a channel, one of descriptions is redundant for the casethat both descriptions are successfully received at a decoder. MDC dividesdata into multiple streams such that the decoding quality using any sub-set yields minimum acceptable quality, and higher quality is obtained byreceiving more descriptions [94,104,106,111].

As shown in Figure 4, for the case of two encoders and three decoders,the information to be transmitted to the receiver is divided into two descrip-tions. If both descriptions arrive at the receiver, the best possible qualityD0 can be obtained. We expect reasonable quality, D1 or D2, even if onepacket is lost.

In the following subsections, the MDC rate-distortion region is discussedto find the achievable rate-distortion sets for a particular source and distor-tion measures. Then, we introduce three major practical MDC coders.

Rate-Distortion Region of MDC

The MDC problem arose from the idea of diversity-based communicationsystems where the information is split and sent over separate paths. For

26 Introduction

two descriptions, El Gamal and Cover (EGC) [94] found an achievable rate-distortion region of (R1, R2, D0, D1, D2) for a sequence of iid random vari-ables X = {x1, x2, ..., xn}:

R1 > I(X; X1)

R2 > I(X; X2)

R1 + R2 > I(X; X1, X2, X0) + I(X1; X2) (62)

such that

D1 ≥ E[d(X; X1)]

D2 ≥ E[d(X; X2)]

D0 ≥ E[d(X; X0)] (63)

where X1, X2, and X0 are three estimates of X, and I(·; ·) is the mutualinformation.

Ozarow [95] showed that the EGC bound is tight for the iid Gaussiansource and squared-error criterion. Given the rates R1 and R2, centraldistortion, D0, and side distortions, D1 and D2, are

D0 ≥{

2−2(R1+R2)

1−(√

Π−√

∆)2, if Π > ∆

2−2(R1+R2), otherwise

D1 ≥ 2−2R1

D2 ≥ 2−2R2 (64)

where Π = (1−D1)(1−D2) and ∆ = D1D2 − 2−2(R1+R2). For other inputsources and distortion measures, the MDC bound is not completely knownand remains as an open problem.

Ahlswede [96,97] proved that the EGC bound is tight for the “no excessrate” case where R1 + R2 = R0(D0). Zhang and Berger [97] proved thatthis bound is not tight for the “excess rate” case where R1 + R2 > R0(D0).

In [98], inner and outer bounds of the achievable MDC region,RX(D0, D1, D2), were found for general memoryless sources and a squarederror measure:

R∗(σ2x, D0, D1, D2) ⊆ RX(D0, D1, D2) ⊆ R∗(Px, D0, D1, D2) (65)

where R∗(σ2x, D0, D1, D2) and R∗(Px, D0, D1, D2) are the Ozarow bound

with a variance σ2x and an entropy power defined by Px = 22h(X)/2πe of

the source, respectively [58]. These bounds form an extension of Shannonlower and upper bounds

1

2log(

σ2x

D) ≥ RX(D) ≥ 1

2log(

Px

D). (66)

4 Channel Coding 27

Equality in (65) and (66) holds for a Gaussian source where Px = σ2x. For a

high-rate case, the outer bound coincides with the optimum rate-distortionbound.

Linder, Zamir, and Zeger [99] found an MDC bound for memorylesssources and locally quadratic distortion measures at high rate. In [100],an MDC bound for wide-sense stationary Gaussian sources with infinitememory was proven to be asymptotically tight for high rate. In [100], analgorithm for the design of optimal two-channel filter banks for MDC ofGaussian sources was proposed. An achievable bound for the L-channelMDC problem was presented in [101].

In successive refinement or layered coding, one base-layer information issent to obtain basic quality of a system, and several enhancement descrip-tions are sent for better quality. An enhancement layer is sent for refiningthe previous optimal descriptions in a successive manner. If only enhance-ment layer descriptions are received, the reconstruction quality is poor. Onthe other hand, MDC yields at least minimum acceptable quality with anysubset of descriptions. Equitz and Cover [102] proved that 2-stage succes-sive refinement with coarse and fine descriptions is achievable if and only ifthe individual solutions of R-D problems can be written as a Markov chain.They showed that Gaussian sources with a squared-error measure, Lapla-cian sources with an absolute-error criterion, and arbitrary discrete sourceswith hamming distortion can be successively refinable. In [103], the L-stagesuccessive refinement problem was presented.

Practical application I: Multiple Description Scalar Quantizer(MDSQ)

In [106], a multiple description scalar quantizer (MDSQ) with a resolutionconstraint was proposed for multiple channel environments. For given ratesR1 and R2, an MDSQ is optimal if it minimizes E[(X − X0)

2] under theconstraints that E[(X − X1)

2] ≤ D1 and E[(X − X2)2] ≤ D2. The optimal

encoders and decoders are found with a Lagrange formulation:

η(A, g, λ1, λ2) = E[(X − X0)2]

+ λ1{E[(X − X1)2] − D1}

+ λ2{E[(X − X2)2] − D2} (67)

where A and g are the encoder partition and the decoder centroid density.The optimal encoder and decoder, (A∗, g∗), are determined iteratively, suchthat A∗ minimizes η(A, g∗, λ1, λ2) for all A, and such that g∗ minimizesη(A∗, g, λ1, λ2) for all g. The operation point between side-distortion opti-mized MDC (CD-MDC) and central-distortion optimized MDC (CD-MDC)is obtained by changing λ1 and λ2.

The central partition is determined by index assignment. A given indexset {i1, i2} of the side coders is mapped onto an index i0 for central coding.

28 Introduction

In [106], several heuristic techniques were proposed, since optimal index as-signment is very difficult. The number of side indices and the redundancyof the index assignment matrix should be determined to reduce the distor-tion in (67). If the matrix is full, side distortion is high and CD-MDC isobtained.

The entropy-constrained version of MDSQ was presented in [107].A high rate analysis showed that, for a memoryless Gaussian sourcewith a squared-error measure, resolution-constrained MDSQ and entropy-constrained MDSQ have a 3.07 dB and a 8.69 dB performance gap comparedwith the Ozarow R-D bound [108].

Practical application II: Multiple Description Transform Coding(MDTC)

The focus of multiple description transform coding (MDTC) is on how tominimize the side distortions, D1 and D2, for a given central distortion,D0 [109–112]. In single description coding (SDC), the minimum rate R toachieve distortion D is given by the rate-distortion function R(D). However,to reduce D1 and D2 below a certain level, the total rate for the side decod-ing must be greater than R∗. This extra bit rate is called redundancy, andbased on this, a redundancy-rate distortion (RRD) function is defined. Fora memoryless Gaussian source with a squared-error criterion, the Ozarowbound can be rearranged into the RRD function [104, 113, 114]. For thebalanced case (R1 = R2 and D1 = D2) with unit variance of a source:

D0,R1+R2≥ D0,r = 2−2r

D1,R1+R2≥

12 [1 + 2−2r − (1 − 2−2r)

√1 − 2−2ρ∗ ],

if ρ∗ ≤ r − 1 + log2(1 + 2−2r)

2−(r+ρ∗), otherwise

(68)

where r and ρ∗ are the base rate and the redundancy rate, respectively, andthe total rate is R1 + R2 = r + ρ∗.

To control the amount of correlation in MDTC, the pairwise correla-tion transform (PCT), T , was used [109–112]. Lost information can beapproximated with other descriptions that have correlation with the lostdescription. For the balanced case, two independent variables A and B aretransformed to correlated variables C and D for an arbitrary linear trans-form T :

[

CD

]

= T

[

AB

]

(69)

where

T =

[

r2 cos θ2 −r2 sin θ2

−r1 cos θ1 r1 sin θ1

]

(70)

4 Channel Coding 29

where the parameters r1 and r2 control the lengths of the basis vectors, andθ1 and θ2 control their directions. Wang et al. derived the approximateconditions for the optimality of the transform [111]:

r1 = r2 =1√

sin 2θ1

θ1 = −θ2. (71)

Thus the optimal transform is

T =

√

cot θ1

2

√

tan θ1

2

−√

cot θ1

2

√

tan θ1

2

(72)

which is different from the general orthogonal transform.Compared to MDSQ, MDTC can reach a lower redundancy range, but

MDSQ provides a higher peak signal-to-noise ratio (PSNR) in the highredundancy region. In [112], Wang et al. generalized MDTC by combiningMDTC with layered coding in [115], and this generalized MDTC resultsin an RRD performance close to the theoretical bound even for the highredundancy region.

Practical application III: Multiple Description Lattice Quantizer(MDLQ)

A multiple description lattice quantizer (MDLQ) is an extension of MDSQto lattice vector quantization. In [116, 117], Servetto, Vaishampayan, andSloane (SVS) designed a balanced MDLQ by labelling the side-coding pointsaround a central-coding point with ordered pairs. For a source vector xk,the central-coding point xk

o is determined by

xko = arg min

yk∈Λ

d(xk, yk) (73)

where Λ is a lattice. To produce two descriptions for xko , a coarse lattice Λ′

is defined as

Λ′ = cλU (74)

where c and U are a scaling factor and an orthogonal matrix, respectively.The index assignment maps the central-coding point at Λ to the two side-coding points at Λ′, l : Λ → Λ′ × Λ′. In the SVS method, the centraldistortion is fixed by the lattice Λ. The rate and the side distortion aredetermined by the scaling factor c in (74) and the index assignment.

In [117], under the assumption of high rate and a squared-error criterion,the performance of MDLQ was measured for a memoryless source with the

30 Introduction

differential entropy h(X) < ∞. For any a ∈ (0, 1) and balanced MDC, thecentral distortion D0 satisfies

limR→∞

D022R(1+a) =

1

4C(Λ)22h(X) (75)

and the side distortions, D1 and D2, satisfy

limR→∞

D122R(1−a) = C(SL)22h(X) (76)

where C(Λ) and C(SL) are the normalized second moments of a Voronoicell of the lattice Λ and of an L-dimensional sphere, respectively.

In [118,119], a weighted distortion measure is applied to the SVS methodsuch that the operating point of MDLQ can move towards the side-distortionoptimized case. In [122], translated sublattices are used instead of coarsesublattices to improve the performance for high packet loss. In [120,121], anasymmetric MDLQ is presented where each description is allowed to havedifferent distortion.

4.4 Rate-Distortion Optimized Packet Loss RecoveryTechniques

One of the transmitter-based packet-loss recovery methods is rate-distortionoptimized streaming with feedback channels. Suppose that there are Lpackets to be transmitted to a receiver. Let πl be the transmission policyfor a packet l ∈ {1, ..., L} and Π = (π1, ..., πL) be the policy vector. Forexample, in the AMR speech coder [154], the policy vector is used to allocatefixed rate to source and channel coding to minimize perceptual distortionfor varying channel conditions.

Given a policy vector Π = (π1, ..., πL), we can find an expected distortionD(Π) and an average rate R(Π) [123–125]. The expected rate R(Π) is

R(Π) =∑

l∈{1,...,L}Blρ(πl) (77)

where Bl is the number of bytes in data unit l, and ρ(πl) is the expectednumber of transmitted bytes per source byte under policy πl. The expecteddistortion D(Π) is

D(Π) = D0 −∑

l∈{1,...,L}∆Dl

∏

l′≤l

(1 − ε(πl′)) (78)

where D0 is the expected reconstruction distortion if no data unit has beenreceived, ∆Dl is the expected reduction in distortion if data unit l hasarrived on time, and ε(πl) is the probability that data unit l does not arrive

5 Applications: Audiovisual Coding 31

on time. The goal of the rate-distortion optimized streaming is to find theoptimal policy vector π that minimizes the Lagrangian

J(π) = D(π) + λR(π). (79)

An iterative descent algorithm can be used to minimize the objective func-tion J(π).

Rate-distortion optimization for video coding in error-free environmentswas reviewed in [126]. Extending this approach to error-prone environmentswas discussed in [127–130]. In these works, an effective framework for videocommunications in error-prone environments based on layered coding withprioritized transportation is presented.

In paper B, we apply a rate-distortion optimized method to MDC forfinding the best operating point at given channel conditions. Since we con-sider the case of fixed rate transmission, for fair comparison with otherpacket-loss methods, λ in (79) is set to be zero which means that only meandistortion is a constraint for optimization.

5 Applications: Audiovisual Coding

In this section, we briefly survey several international standard coding al-gorithms for speech, audio, image and video.

5.1 Speech and Audio Coding Standards

The Telecommunication Standardization Sector of the InternationalTelecommunication Union (ITU-T), which was formerly the InternationalTelegraph and Telephone Consultative Committee (CCITT), is responsiblefor studying and issuing global speech coding standard recommendations.Narrowband (200 Hz to 3.4 kHz) speech coding standards include G.711A-law or µ−law pulse code modulation (PCM) at 64 kbit/s [142], G.726adaptive differential PCM (ADPCM) at 40/32/24/16 kbit/s [143], G.728low-delay CELP at 16 kbit/s [144], G.723.1 dual-rate speech coder at 6.3/5.3kbit/s [145], and G.729 conjugate- structure algebraic CELP (CS-ACELP)at 8 kbit/s [146]. The functionality of each recommendation is extendedby annexes. Wideband (50 Hz to 7 kHz) speech coding standards includeG.722 subband PCM at 64/56/48 kbit/s [147], G.722.1 modulated lappedtransform (MLT) based coding at 32/24 kbit/s [148], and G.722.2 AMRwideband (AMR-WB) coding at 6.60-23.85 kbit/s [149].

The European Telecommunications Standards Institute (ETSI) and the3rd Generation Partnership Project (3GPP) have standardized five speechcompression algorithms: full rate at 13.0 kbit/s [150, 151], half rate at 5.6kbit/s [152], enhanced full rate at 12.2 kbit/s [153], AMR at 4.75-12.20kbit/s for source coding [154], and AMR-WB that is same as G.722.2 [149].

32 Introduction

For global system for mobile communication (GSM), AMR works at twotraffic channel modes [155]: adaptive full-rate speech (AFS) and adaptivehalf-rate speech (AHS). The sum of source and channel coding bit ratesfor AFS and AHS are 22.8 and 11.4 kbit/s. Instead of the AMR encoder,a mobile station (MS) requests a mode (different bit-rate combination ofsource and channel coding) according to the carrier-to-interference (C/I)measurements, which is approved by the base station (BS).

The 3GPP Project 2 (3GPP2) has standardized the IS-127 enhancedvariable-rate codec (EVRC) [156]. In the IS-127, a specific rate is selectedat a encoder side based on the phonetic class of the input speech. It guar-antees longer battery life for the reverse link (MS to BS) and more users percell capacity for the forward link (BS to MS). Even though EVRC has fourvariable-rate options, it was reported in [157] that EVRC works in prac-tice as a fixed full-rate codec with a silence coding mode. The selectablemode vocoder (SMV) was developed as a new speech coding standard for asubstitute for EVRC and the third generation systems [157,158].

For robust voice communication over IP, the Internet Engineer TaskForce (IETF) is standardizing the Internet low-bit-rate codec (iLBC) at13.33 kbit/s with a frame length of 30 ms and 15.20 kbit/s with a framelength of 20 ms [159]. Since iLBC does not utilize interframe correlationof parameters, the codec enables graceful quality degradation in the case ofpacket loss.

The moving picture experts group (MPEG) standardized MPEG audiocoders with three compression layers [160]. The layer 3 of the MPEG-1 standard is known as MP3, which is widely used for audio file formatover the Internet and for portable audio players. The Dolby Digital AC3is also one of the main compression schemes especially for high definitiontelevision (HDTV) and DVD in North America [161]. The basic buildingblocks of audio coding standards include transforms, psychoacoustic modelsto remove irrelevancy, and quantization [162].

5.2 Image and Video Coding Standards

The basic functions of image and video coding are composed of transform,uniform quantization, and lossless coding. For a squared error criterion anda Gaussian input, the Karhunen-Loeve transform (KLT) is optimal in thesense that the vector components are decorrelated [139]. For image andvideo coding, because of the heavy complexity of KLT, the discrete-cosinetransform (DCT) is commonly used [140].

The Joint Photographic Experts Group (JPEG) of the InternationalOrganization for Standardization (ISO) and the ITU established a jointstandard for still image compression. In 1992, the JPEG members selecteda DCT-based method with Huffman coding as the ISO/IEC internationalstandard 10928-1 or ITU-T recommendation T.81 [163]. With the increas-

6 Summary of Contributions 33

ing use of multimedia, a new image coding standard, JPEG2000, has beendevelopd [164]. JPEG2000 includes new features such as superior perfor-mance at low bit rate, progressive transmission, region-of-interest coding,robustness to bit errors, and protective security. JPEG2000 uses a wavelettransform followed by uniform scalar quantization with arithmetic coding.

Numerous international video coding standards exist, including ITU-TH.261, MPEG-1, H.262/MPEG-2 [165], and MPEG-4 (Part 2) [166]. Mostvideo coding standards use block-based DCT, motion estimation and com-pensation, quantization of DCT coefficients, and lossless coding of quantizedDCT coefficients and motion vectors. Each group of pictures (GOP, typ-ically 15 pictures) has at least one intra-picture (I-picture) and a set ofpredicted (P-pictures)/bidirectionally-predicted pictures (B-pictures) [162].

H.262/MPEG-2 is the most widely used video coding standard for digitaltelevision systems and high quality storage such as DVD. Recently, the jointvideo team (JVT) of ITU-T and ISO/International Engineering Consor-tium (IEC) developed a new coding standard: ITU-T H.264 and ISO/IECMPEG-4 (Part 10) advanced video coding (AVC) which is referred to asH.264/AVC [166]. H.264/AVC improves coding efficiency by a factor two(which means halving the bit rate for a given fidelity) over MPEG-2 withreasonable complexity and cost. H.264/AVC has important blocks suchas inter-picture prediction, low-complexity variable block-size transform,adaptive deblocking filter, context-based arithmetic coding, error-resilienceextensions, and so on [167].

6 Summary of Contributions

In this thesis, we study the basic building blocks, source and channel cod-ing, for audiovisual communication systems. Source coding has a tradeoffbetween rate and distortion. Rate-distortion theory and high-rate theoryare used to analyze practical quantizers. While rate-distortion theory as-sumes infinite dimensionality, high-rate theory assumes a large number ofreconstruction points. Since high-rate theory is accurate for moderate di-mensionality and rate, we predict quantizer performance using high-ratetheory before designing quantizers, and verify the prediction by comparingwith the performance of the designed quantizers. In paper A, we propose anew resolution-constrained quantizer that utilizes the three VQ advantagesmore efficiently than conventional speech coding paradigms. In papers Cand D, for the resolution-constrained case, new quantization methods areproposed considering perceptual quality, various distortion measures, thenumber of distortion outliers, and the source mismatch effect.

Channel coding adds redundancy to a source for recovering channel er-rors. Shannon proved that information can be sent reliably over a channelat all rates R below the channel capacity C. From the source and channel

34 Introduction

separation theorem, optimal source codes and optimal channel codes can beseparately and independently designed. However, if delay is of critical im-portance, such as in real-time communication systems, source and channelcoding should be optimized jointly. In paper B, we compare various packet-loss recovery techniques for real-time audiovisual communications over theInternet. Two promising techniques, FEC and MDC, are extensively com-pared in this paper. FEC is optimized for given channel conditions withoutconsidering the source, while MDC is a well-known joint source channelcoding scheme.

We summarize the contributions of the appended papers which consti-tute the main body of this thesis.

Paper A - KLT-Based Adaptive Classified VQ of theSpeech Signal

Compared to SQ, VQ has memory, space-filling, and shape advantages. Ifthe signal statistics are known, direct vector quantization (DVQ) accord-ing to these statistics provides the highest coding efficiency, but requiresunmanageable storage requirements if the statistics are time varying. InCELP coding, a single “compromise” codebook is trained in the excitation-domain and the space-filling and shape advantages of VQ are utilized in anon-optimal, average sense. In this paper, we propose KLT-based adaptiveclassified VQ (KLT-CVQ), where the space-filling advantage can be utilizedsince the Voronoi-region shape is not affected by the KLT. The memoryand shape advantages can be also used, since each codebook is designedbased on a narrow class of KLT-domain statistics. We further improve basicKLT-CVQ with companding. The companding utilizes the shape advantageof VQ more efficiently. Our experiments show that KLT-CVQ provides ahigher signal-to-noise ratio (SNR) than basic CELP coding, and has a com-putational complexity similar to DVQ and much lower than CELP. Withcompanding, even single-class KLT-CVQ outperforms CELP, both in termsof SNR and codebook search complexity.

Paper B - Comparative Rate-Distortion Performance ofMultiple Description Coding for Real-Time AudiovisualCommunication over the Internet

To facilitate real-time audiovisual communication through the Internet,FEC and MDC can be used as low-delay packet-loss recovery techniques.We use both a Gilbert channel model and data obtained from real IP con-nections to compare the rate-distortion performance of different variantsof FEC and MDC. Using identical overall rates with stringent delay con-straints, we find that side-distortion optimized MDC generally performs


better than Reed-Solomon based FEC, which itself, as expected, performsbetter than XOR based FEC. If the channel condition is known from feed-back, then channel-optimized MDC can be used to exploit this information,resulting in significantly improved performance. Our results confirm thattwo-independent-channel transmission is preferred to single-channel trans-mission both for FEC and MDC.

Paper C - Effect of Cell-Size Variation of Resolution-Constrained Quantization

Based on high-rate theory, resolution-constrained quantization is optimizedto reduce the mean distortion at a given fixed rate. However, not only themean distortion, but also the number of distortion outliers is an importantfactor in human perception. In this paper, to reduce the number of out-liers measured with an r-th power distortion criterion, we propose to use anαr-th power distortion criterion for codebook training. As we increase α,codevectors are placed more towards the tail of source distribution. With aproperly selected α, the number of distortion outliers is effectively reducedat a small cost in mean distortion. Experiments with a Gaussian sourceand line spectral frequencies show that in practical applications the meandistortion of the proposed quantizer is similar to that of conventional quan-tizers while having a lower percentage of outliers and a significantly reducedsensitivity to source mismatch.

Paper D - Resolution-Constrained Quantization withJND based Perceptual-Distortion Measures

When the squared error of observable signal parameters is below the justnoticeable difference (JND), it is not registered by human perception. Wemodify commonly used distortion criteria to account for this phenomenonand study the implications for quantizer design and performance. Takingthe JND into account in the design of the quantizer generally leads to im-proved performance in terms of mean distortion and the number of outliers.Moreover, the resulting quantizer exhibits better robustness against sourcemismatch.

References

[1] C. E. Shannon, “A mathematical theory of communications,” The BellSystem Technical Journal, vol. 27, pp. 379-423 623-656, 1948.

[2] W. R. Bennett, “Spectra of quantized signals,” The Bell System Tech-nical Journal, vol. 27, pp. 446-472, Jul. 1948.

36 Introduction

[3] S. P. Lloyd, “Least squares quantization in PCM,” IEEE Trans. In-form. Theory, vol. IT-28, no. 2, pp. 129-137, 1982.

[4] J. Max, “Quantizing for minimum distortion,” IEEE Trans. Inform.Theory, vol. IT-6, no. 1, pp. 7-12, 1960.

[5] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizerdesign,” IEEE Trans. Commun., vol. COM-28, no. 1, pp. 84-95, 1980.

[6] P. A. Chou, T. Lookabaugh, and R. M. Gray, “Entropy-constrainedvector quantization,” IEEE Trans. Acoust., Speech, Signal Processing,vol. 37, no. 1, pp. 31-42, 1989.

[7] R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Trans. Inform.Theory, vol. 44, no. 6, pp. 2325-2383, 1998.

[8] D. A. Huffman, “A Method for the construction of minimum-redundancy codes,” Proc. IRE, vol. 40, no. 9, pp. 1098-1101, 1952.

[9] J. J. Rissanen, “Generalized Kraft inequality and arithmetic coding,”IBM J. Res. Devel., vol. 20, no. 9, pp. 198-203, 1976.

[10] J. L. Mannos and D. J. Sakrison, “The effects of a visual fidelitycriterion of the encoding of images,” IEEE Trans. Inform. Theory, vol.IT-20, no. 4, pp. 525-536, 1974.

[11] D. J. Sakrison, “On the role of the observer and a distortion measurein image transmission,” IEEE Trans. Commun., vol. COM-25, no. 11,pp. 1251-1267, 1977.

[12] F. X. J. Lukas and Z. L. Budrikis, “Picture quality prediction basedon a visual model,” IEEE Trans. Commun., vol. COM-30, no. 7, pp.1679-1692, 1982.

[13] N. B. Nill, “A visual model weighted cosine transform for image com-pression and quality assessment,” IEEE Trans. Commun., vol. COM-33, no. 6, pp. 551-557, 1985.

[14] A. Buzo, A. H. Gray, Jr., R. M. Gray, and J. D. Markel, “Speechcoding based upon vector quantization,” IEEE Trans. Speech AudioProcessing, vol. ASSP-28, no. 5, pp. 562-574, 1980.

[15] B. S. Atal, R. V. Cox, and P. Kroon, “Spectral quantization andinterpolation for CELP coders,” in Proc. IEEE Conf. Acoust., Speech,Signal Processing, May 1989, pp. 69-72.

[16] K. K. Paliwal and B. S. Atal, “Efficient vector quantization of LPCparameters at 24 bits/frame,” IEEE Trans. Speech Audio Processing,vol. 1, no. 1, pp. 3-14, 1993.


[17] V. R. Algazi, “Useful approximations to optimum quantization,” IEEETrans. Commun., vol. COM-14, no. 3, pp. 297-301, 1966.

[18] H. Gish and J. N. Pierce, “Asymptotically efficient quantizing,” IEEETrans. Inform. Theory, vol. IT-14, no. 5, pp. 676-683, 1968.

[19] P. L. Zador, “Asymptotic quantization error of continuous signals andthe quantization dimension,” IEEE Trans. Inform. Theory, vol. IT-28,no. 2, pp. 139-149, 1982.

[20] A. Gersho, “Asymptotically optimal block quantization,” IEEE Trans.Inform. Theory, vol. IT-25, no. 4, pp. 373-380, 1979.

[21] Y. Yamada, S. Tazaki, and R. M. Gray, “Asymptotic performance ofblock quantizers with difference distortion measures,” IEEE Trans.Inform. Theory, vol. IT-26, no. 1, pp. 6-14, 1980.

[22] J. A. Bucklew, “Companding and random quantization in several di-mensions,” IEEE Trans. Inform. Theory, vol. IT-27, no. 2, pp. 207-211,1981.

[23] J. A. Bucklew and G. L. Wise, “Multidimensional asymptotic quantiza-tion theory with rth power distortion measures,” IEEE Trans. Inform.Theory, vol. IT-28, no. 2, pp. 239-247, 1982.

[24] J. A. Bucklew, “A note on optimal multidimensional companders,”IEEE Trans. Inform. Theory, vol. IT-29, no. 2, pp. 279, 1983.

[25] J. A. Bucklew, “Two results on the asymptotic performance of quan-tizers,” IEEE Trans. Inform. Theory, vol. IT-30, no. 2, pp. 341-348,1984.

[26] T. D. Lookabaugh and R. M. Gray, “High-resolution quantization the-ory and the vector quantizer advantage,” IEEE Trans. Inform. Theory,vol. IT-35, no. 5, pp. 1020-1033, 1989.

[27] D. L. Neuhoff and D. H. Lee, “On The Performance Of Tree-structuredVector Quantization,” in Proc. IEEE Int. Symposium Inform. Theory,Jun. 1991, pp. 247.

[28] D. L. Neuhoff and D. H. Lee, “On the performance of tree-structuredvector quantization,” in Proc. IEEE Conf. Acoust., Speech, SignalProcessing, Apr. 1991, pp. 2277-2280.

[29] D. H. Lee, D. L. Neuhoff, and K. K. Paliwal, “Cell-conditioned mul-tistage vector quantization,” in Proc. IEEE Conf. Acoust., Speech,Signal Processing, Apr. 1991, pp. 653-656.

38 Introduction

[30] S. Na and D. L. Neuhoff, “Bennett’s integral for vector quantizers,”IEEE Trans. Inform. Theory, vol. 41, No. 4, pp. 886-900, 1995.

[31] D. H. Lee and D. L. Neuhoff, “Asymptotic distribution of the errorsin scalar and vector quantizers,” IEEE Trans. Inform. Theory, vol. 42,No. 2, pp. 446-460, 1996.

[32] D. L. Neuhoff, “On the asymptotic distribution of the errors in vectorquantization,” IEEE Trans. Inform. Theory, vol. 42, No. 2, pp. 461-468, 1996.

[33] W. R. Gardner and B. D. Rao, “Theoretical analysis of the high-ratevector quantization of LPC parameters,” IEEE Trans. Speech AudioProcessing, vol. 3, no. 5, pp. 367-381, 1995.

[34] J. Li, N. Chaddha, and R. M. Gray, “Asymptotic performance of vectorquantizers with the perceptual distortion measure,” in Proc. IEEE Int.Symposium Inform. Theory, Jul. 1997, pp. 55.

[35] J. Li, N. Chaddha, and R. M. Gray, “Asymptotic performance of vectorquantizers with a perceptual distortion measure,” IEEE Trans. Inform.Theory, vol. 45, no. 4, pp. 1082-1091, 1999.

[36] T. Linder and R. Zamir, “High-resolution source coding for non-difference distortion measures: the rate distortion function,” in Proc.IEEE Int. Symposium Inform. Theory, Jul. 1997, pp. 187.

[37] T. Linder and R. Zamir, “High-resolution source coding for non-difference distortion measures: the rate-distortion function,” IEEETrans. Inform. Theory, vol. 45, no. 2, pp. 533-547, 1999.

[38] T. Linder, R. Zamir, and K. Zeger, “High-resolution source codingfor non-difference distortion measures: multidimensional companding,”IEEE Trans. Inform. Theory, vol. 45, no. 2, pp. 548-561, 1999.

[39] P. Hedelin and J. Skoglund, “Vector quantization based on Gaussianmixture models,” IEEE Trans. Speech Audio Processing, vol. 8, no. 4,pp. 385-401, 2000.

[40] J. M. Morris and V. D. VandeLinde, “Robust quantization of discrete-time signals with independent samples,” IEEE Trans. Commun., vol.COM-22, no. 12, pp. 1897-1902, 1974.

[41] R. M .Gray and L. D. Davisson, “Quantizer mismatch,” IEEE Trans.Commun., vol. COM-23, no. 4, pp. 439-443, 1975.

[42] D. L. Neuhoff, R. M. Gray, and L. D. Davisson, “Fixed rate universalblock source coding with a fidelity criterion,” IEEE Trans. Inform.Theory, vol. IT-21, no. 5, pp. 511-523, 1975.


[43] D. J. Sakrison, “Worst sources and robust codes for difference distor-tion measures,” IEEE Trans. Inform. Theory, vol. IT-21, no. 3, pp.301-309, 1975.

[44] W. G. Bath and V. D. VandeLinde, “Robust memoryless quantizationfor minimum signal distortion,” IEEE Trans. Inform. Theory, vol.IT-28, no. 2, pp. 296-306, 1982.

[45] D. Kazakos, “New results on robust quantization,” IEEE Trans. Com-mun., vol. COM-31, no. 8, pp. 965-974, 1983.

[46] Y. Yamada, S. Tazaki, M. Kasahara, and T. Namekawa, “Variancemismatch of vector quantizers,” IEEE Trans. Inform. Theory, vol.IT-30, no. 1, pp. 104-107, 1984.

[47] A. Lapidoth, “On the role of mismatch in rate distortion theory,” inProc. IEEE Int. Symposium Inform. Theory, Sept. 1995, pp. 193.

[48] A. Lapidoth, “On the role of mismatch in rate distortion theory,” IEEETrans. Inform. Theory, vol. IT-43, no. 1, pp. 38-47, 1997.

[49] K. K. Lin and R. M .Gray, “Vector quantization of video with twocodebooks,” in Proc. Data Compression Conf., Mar. 1999, pp. 537.

[50] R. M. Gray and T. Linder, “Relative entropy and quantizer mismatch,”in Proc. Thirty-Sixth Asilomar Conf. Signals, Systems and Computers,Nov. 2002, pp. 129-133.

[51] R. M. Gray and T. Linder, “Mismatch in high-rate entropy-constrainedvector quantization,” IEEE Trans. Inform. Theory, vol. IT-49, no. 5,pp. 1204-1217, 2003.

[52] R. M. Gray and T. Linder, “High rate mismatch in entropy constrainedquantization,” in Proc. Data Compression Conf., 2003, pp. 173-182.

[53] P. Hedelin, F. Norden, and J. Skoglund, “SD optimization of spectralcoders,” in IEEE Workshop Speech Coding, Jun. 1999, pp. 28-30.

[54] J. Samuelsson and P. Hedelin, “Recursive coding of spectrum parame-ters,” IEEE Trans. Speech Audio Processing, vol. 9, no. 5, pp. 492-503,2001.

[55] J. Lindblom and J. Samuelsson, “Bounded support Gaussian mixturemodeling of speech spectra,” IEEE Trans. Speech Audio Processing,vol. 11, no. 1, pp. 88-99, 2003.

[56] A. D. Subramaniam and B. D. Rao, “PDF optimized parametric vectorquantization of speech line spectral frequencies,” IEEE Trans. SpeechAudio Processing, vol. 11, no. 2, pp. 130-142, 2003.

40 Introduction

[57] R. E. Blahut, “Computation of channel capacity and rate-distortionfunctions,” IEEE Trans. Inform. Theory, vol. IT-18, no. 4, pp. 460-473,1972.

[58] T. Berger, Rate distortion theory. Englewood Cliffs, NJ: Prentice-Hall,1971.

[59] R. M. Gray, Source coding theory. Boston, MA: Kluwer AcademicPublishers, 1990.

[60] T. M. Cover and J. A. Thomas, Elements of information theory. NewYork, NY: John Wiley & Sons, 1991.

[61] A. Gersho and R. M. Gray, Vector quantization and signal compression.Boston, MA: Kluwer Academic Publishers, 1992.

[62] W. B. Kleijn, A Basis for Source Coding. Course notes, KTH, Stock-holm, Jan. 2004.

[63] A. Papoulis and S. U. Pillai, Probability, random variables and stochas-tic processes. 4th ed. New York, NY: McGraw-Hill, 2002.

[64] J. H. Conway and N. J. A. Sloane, Sphere packings, lattices and groups.2nd ed. New York, NY: Springer-Verlag, 1993.

[65] N. J. A. Sloane, “Tables of sphere packings and spherical codes,” IEEETrans. Inform. Theory, vol. IT-27, no. 3, pp. 327-338, 1981.

[66] J. H. Conway and N. J. A. Sloane, “Voronoi regions of lattices, sec-ond moments of polytopes, and quantization,” IEEE Trans. Inform.Theory, vol. IT-28, no. 2, pp. 211-226, 1982.

[67] E. Agrell and T. Eriksson, “Optimization of lattices for Quantization,”IEEE Trans. Inform. Theory, vol. IT-44, no. 5, pp. 1814-1828, 1998.

[68] D. G. Jeong and J. D. Gibson, “Uniform and piecewise uniform latticevector quantization for memoryless Gaussian and Laplacian sources,”IEEE Trans. Inform. Theory, vol. IT-39, no. 3, pp. 786-804, 1993.

[69] J. H. Conway and N. J. A. Sloane, “A fast encoding method for latticecodes and quantizers,” IEEE Trans. Inform. Theory, vol. IT-29, no. 6,pp. 820-824, 1983.

[70] J. H. Conway and N. J. A. Sloane, “Fast quantizing and decodingalgorithms for lattice quantizers and codes,” IEEE Trans. Inform.Theory, vol. IT-28, no. 2, pp. 227-232, 1982.


[71] K. Sayood and S. J. Blankenau, “A fast quantization algorithm forlattice quantizer design,” in Proc. IEEE Conf. Acoust., Speech, SignalProcessing, May 1989, pp. 1168-1171.

[72] B.-H. Juang and A. H. Gray, Jr., “Multiple stage vector quantiza-tion for speech coding,” in Proc. IEEE Conf. Acoust., Speech, SignalProcessing, Apr. 1982, pp. 597-600.

[73] T. R. Fischer, “A pyramid vector quantizer,” IEEE Trans. Inform.Theory, vol. IT-32, no. 4, pp. 568-583, 1986.

[74] P. F. Swaszek and T. W. Ku, “Asymptotic performance of unrestrictedpolar quantizers,” IEEE Trans. Inform. Theory, vol. IT-32, no. 2, pp.330-333, 1986.

[75] S. Vembu, S. Verdu, and Y. Steinberg, “When does the source-channelseparation theorem hold?,” in Proc. IEEE Int. Symposium Inform.Theory, 27 June-1 July 1994, pp. 198.

[76] S. Vembu, S. Verdu, and Y. Steinberg, “The source-channel separationtheorem revisited,” IEEE Trans. Inform. Theory, vol. IT-41, no. 1, pp.44-54, 1995.

[77] K. Sayood, H. H. Otu, N. Demir, “Joint source/channel coding forvariable length codes,” IEEE Trans. Commun., vol. COM-48, no. 5,pp. 787-794, 2000.

[78] G. Carle and E. W. Biersack, “Survey of error recovery techniques forIP-based audio-visual multicast applications,” IEEE Network, vol. 11,no. 6, pp. 24-36, Nov.-Dec., 1997.

[79] C. Perkins, O. Hodson, and V. Hardman, “A survey of packet lossrecovery techniques for streaming audio,” IEEE Network, vol. 12, no.5, pp. 40-48, Sept.-Oct. 1998.

[80] C. Perkins and O. Hodson, “Options for repair of streaming media,”IETF RFC 2354, Jun. 1998.

[81] P. Cuenca, L. Orozco-Barbosa, A. Garrido, F. Quiles, and T. Olivares,“A survey of error concealment schemes for MPEG-2 video communi-cations over ATM networks,” in Proc. IEEE Canadian Conf. Electricaland Computer Engineering, May 1997, pp. 118-121.

[82] B. W. Wah, X. Su, and D. Lin, “A survey of error-concealment schemesfor real-time audio and video transmissions over the Internet,” in Proc.IEEE Int. Symposium Multimedia Software Engineering, Dec. 1998, pp.17-24.

42 Introduction

[83] A. Li, F. Liu, J. Villasenor, J. Park, D. Park, and Y. Lee, “An RTPpayload format for generic FEC with uneven level protection,” InternetDraft draft-ietf-avt-ulp-04.txt, Feb. 2002.

[84] J. Rosenberg and H. Schulzrine, “An RTP payload format for genericforward error correction,” IETF RFC 2733, Dec. 1999.

[85] C. Perkins, I. Kouvelas, O. Hodson, V. Hardman, M. Handley, J.-C. Bolot, A. Vega-Garcia, and S. Fosse-Parisis, “RTP payload forredundant audio data,” IETF RFC 2198, Dec. 1997.

[86] D. G. Hoffman, D. A. Leonard, C. C. Lindner, K. T. Phelps, C. A.Rodger, and J. R. Wall, “Coding theory: the essentials”. New York,NY: Marcel Dekker, 1991.

[87] B. Sklar and F. J. Harris, “The ABCs of linear block codes,” IEEESignal Processing Magazine, vol. 21, no. 4, pp. 14-35, 2004.

[88] R. C. Bose and D. K. Ray-Chaudhuri, “On a class of error correctingbinary group codes,” Inform. Contr., vol. 3, pp. 69-79, 1960.

[89] R. C. Bose and D. K. Ray-Chaudhuri, “Future results on error cor-recting binary group codes,” Inform. Contr., vol. 3, pp. 279-290, 1960.

[90] I. S. Reed and G. Solomon, “Polynomial codes over certain finite fields,”SIAM J. Appli. Math., vol. 8, pp. 300-304, 1960.

[91] R. C. Singleton, “Maximum distance Q-nary codes,” IEEE Trans.Inform. Theory, vol. IT-10, no. 2, pp. 116-118, 1964.

[92] D. J. C. MacKay and R. M. Near, “Shannon limit performance of lowdensity parity check codes,” Electronics Letters, vol. 32, no. 18, pp.1645-1646, 1996.

[93] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limiterror-correcting coding and decoding: Turbo-codes,” in Proc. IEEEProc. Int. Conf. Commun., May 1993, pp. 1064-1070.

[94] A. A. El Gamal and T. M. Cover, “Achievable rates for multiple de-scriptions,” IEEE Trans. Inform. Theory, vol. IT-28, no.6, pp. 851-857,1982.

[95] L. Ozarow, “On a source coding problem with two channels and threereceivers,” The Bell System Technical Journal, vol. 59, no. 10, pp.1909-1921, Dec. 1980.

[96] R. Ahlswede, “The rate-distortion region for multiple descriptionswithout excess rate,” IEEE Trans. Inform. Theory, vol. 31, no. 6,pp. 721-726, Nov. 1985.


[97] Z. Zhang and T. Berger, “New results in binary multiple descriptions,”IEEE Trans. Inform. Theory, vol. 33, no. 4, pp. 502-521, July 1987.

[98] R. Zamir, “Gaussian codes and Shannon bounds for multiple descrip-tions,” IEEE Trans. Inform. Theory, vol. 45, no. 7, pp. 2629-2636,Nov. 1999.

[99] T. Linder, R. Zamir, and K. Zeger, “The multiple description rateregion for high resolution source coding,” in Proc. Data CompressionConf., 1998, pp. 149-158.

[100] P. L. Dragotti, S. D. Servetto, and M. Vetterli, “Optimal filter banksfor multiple description coding: analysis and synthesis,” IEEE Trans.Inform. Theory, vol. 48, no. 7, pp. 2036-2052, July 2002.

[101] R. Venkataramani, G. Kramer, and V. K. Goyal, “Multiple descriptioncoding with many channels,” IEEE Trans. Inform. Theory, vol. 49, no.9, pp. 2106-2114, Sept. 2003.

[102] W. H. R. Equitz and T. M. Cover, “Successive refinement of informa-tion,” IEEE Trans. Inform. Theory, vol. 37, no. 2, pp. 269-275, Mar.1991.

[103] R. Venkataramani, G. Kramer, and V. K. Goyal, “Successive refine-ment on trees: a special case of a new MD coding region,” in Proc.Data Compression Conf., 2001, pp. 293-301.

[104] V. K. Goyal, “Multiple description coding: compression meets thenetwork,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 74-93,Sept. 2001.

[105] H. Coward, “Joint Source-Channel Coding: Development of Meth-ods and Utilization in Image Communications,” Ph. D. Dissertation,Norwegian University of Science and Technology, 2001.

[106] V. A. Vaishampayan, “Design of multiple description scalar quan-tizer,” IEEE Tran. Inform. Theory, vol. 39, no. 3, pp. 821-834, May1993.

[107] V. A. Vaishampayan and J. Domaszewicz, “Design of entropy-constrained multiple-description scalar quantizers,” IEEE Trans. In-form. Theory, vol. 40, no. 1, pp. 245-250, Jan. 1994.

[108] V. A. Vaishampayan and J.-C. Batllo, “Asymptotic analysis of multi-ple description quantizers,” IEEE Trans. Inform. Theory, vol. 44, no.1, pp. 245-250, Jan. 1998.

44 Introduction

[109] Y. Wang, M. T. Orchard, and A. R. Reibman, “Multiple descriptionimage coding for noisy channels by pairing transform coefficients,” inProc. IEEE First Workshop Multimedia Signal Processing, Jun. 1997,pp. 419-424.

[110] M. T. Orchard, Y. Wang, V. Vaishampayan, and A. R. Reibman,“Redundancy rate-distortion analysis of multiple description codingusing pairwise correlating transforms,” in Proc. IEEE Int. Conf. ImageProcessing, Oct. 1997, pp. 608-611.

[111] Y. Wang, M. T. Orchard, V. Vaishampayan, and A. R. Reibman,“Multiple description coding using pairwise correlating transforms,”IEEE Trans. Image Processing, vol. 10, no. 3, pp. 351-366, Mar. 2001.

[112] Y. Wang, A. R. Reibman, M. T. Orchard, and H. Jafarhani, “Animprovement to multiple description transform coding,” IEEE Trans.Signal Processing, Vol. 50, No. 11, pp. 2843-2854, Nov. 2002.

[113] V. K. Goyal and J. Kovacevic, “Generalized multiple description cod-ing with correlating transforms,” IEEE Trans. Inform. Theory, vol. 47,no. 6, pp. 2199-2224, Sept. 2001.

[114] V. K. Goyal, “On the side distortion Lower bound in “Generalizedmultiple description coding with correlating transforms”,” submittedto IEEE Trans. Inform. Theory, Nov. 2002.

[115] A. R. Reibman, H. Jafarkhani, M. T. Orchard, and Y. Wang, “Perfor-mance of multiple description coders on a real channel,” in Proc. IEEEConf. Acoust., Speech, Signal Processing, May 1999, pp. 2415-2418.

[116] S. D. Servetto, V. A. Vaishampayan, and N. J. A. Sloane, “Multipledescription lattice vector quantization,” in Proc. Data CompressionConf., Mar. 1999, pp. 13-22.

[117] V. A. Vaishampayan, N. J. A. Sloane, and S. D. Servetto, “Multi-ple description vector quantization with lattice codebooks: design andanalysis,” IEEE Trans. Inform. Theory, vol. 47, no. 5, pp. 1718-1734,2001.

[118] J. A. Kelner, V. K. Goyal, and J. Kovacevic, “Multiple descriptionlattice vector quantization: variations and extensions,” in Proc. DataCompression Conf., Mar. 2000, pp. 480-489.

[119] V. K. Goyal, J. A. Kelner, and J. Kovacevic, “Multiple descriptionvector quantization with a coarse lattice,” IEEE Trans. Inform. The-ory, vol. 48, no. 3, pp. 781-788, 2002.


[120] S. N. Diggavi, N. J. A. Sloane, and V. A. Vaishampayan, “Designof asymmetric multiple description lattice vector quantizers,” in Proc.Data Compression Conf., 2000, pp. 490-499.

[121] S. N. Diggavi, N. J. A. Sloane, and V. A. Vaishampayan, “Asymmetricmultiple description lattice vector quantizers,” IEEE Trans. Inform.Theory, vol. 48, no. 1, pp. 174-191, 2002.

[122] D. Y. Zhao and W. B. Kleijn, “Multiple-description vector quanti-zation using translated lattices with local optimization,” to appear inIEEE Global Telecommunications Conf., 2004.

[123] P.A. Chou and Z. Miao, “Rate-distortion optimized sender-drivenstreaming over best-effort networks,” in Proc. IEEE Fourth Workshopon Multimedia Signal Processing, 2001, pp. 587-592.

[124] J. Chakareski, P. A. Chou, and B. Aazhang, “Computing rate-distortion optimized policies for streaming media to wireless clients,”in Proc. Data Compression Conf., 2002, pp. 53-62.

[125] J. Chakareski and P.A. Chou, “Application layer error correctioncoding for rate-distortion optimized streaming to wireless clients,” inProc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2002,pp. 2513-2516.

[126] G. J. Sullivan and T. Wiegand, “Rate-distortion optimization forvideo compression,” IEEE Signal Processing Magazine, vol. 15, no. 6,pp. 74-90, Nov. 1998.

[127] A. Ortega and K. Ramchandran, “Rate-distortion methods for imageand video compression,” IEEE Signal Processing Magazine, vol. 15,no. 6, pp. 23-50, Nov. 1998.

[128] M. Gallant and F. Kossentini, “Efficient scalable DCT-based videocoding at low bit rates,” in Proc. Int. Conf. Image Processing, 1999,pp. 782-786.

[129] M. Gallant and F. Kossentini, “Rate-distortion optimal jointsource/channel coding for robust and efficient low bit rate packet videocommunications,” in Proc. Int. Conf. Image Processing, 2000, pp. 355-358.

[130] M. Gallant and F. Kossentini, “Rate-distortion optimized layeredcoding with unequal error protection for robust Internet video,” IEEETrans. Circuits Syst. Video Technol., vol. 11, no. 3, pp. 357-372, Mar.2001.

46 Introduction

[131] J. D. Johnston, “Transform coding of audio signals using perceptualnoise criteria,” IEEE J. Selected Areas Comm., vol. 6, no. 2, pp. 314-323, 1988.

[132] N. Jayant, “Signal compression: technology targets and research di-rections,” IEEE J. Selected Areas Comm., vol. 10, no. 5, pp. 796-818,1992.

[133] N. Jayant, J. Johnston, and R. Safranek, “Signal compression basedon models of human perception,” Proceedings of the IEEE, vol. 81, no.10, pp. 1385-1422, 1993.

[134] T. Painter and A. Spanias, “Perceptual coding of digital audio,”Proceedings of the IEEE , vol. 88, no. 4, pp. 451-515, 2000.

[135] A. B. Watson, G. Y. Yang, J. A. Solomon, and J. Villasenor, “Visi-bility of wavelet quantization noise,” IEEE Trans. Image Processing,vol. 6, no. 8, pp. 1164-1175, 1997.

[136] D.-F. Shen and J.-H. Sung, “Application of JND visual model toSPHIT image coding and performance evaluation,” in Proc. IEEEConf. Image Processing, Jun. 2002, pp. 24-28.

[137] D. S. Kim and M. Y. Kim, “On the perceptual weighting functionfor phase quantization of speech,” in Proc. IEEE Workshop SpeechCoding, Sep. 2000, pp. 17-20.

[138] D. S. Kim, “Perceptual phase quantization of speech,” IEEE Trans.Speech Audio Processing, vol. 11, no. 4, pp. 355-364, Jul. 2003.

[139] J. J. Y. Huang and P. M. Schultheiss, “Block quantization of corre-lated Gaussian random variables,” IEEE Trans. Commun., vol. COM-11, no. 3, pp. 289-296, 1963.

[140] N. Ahmed, T. Natarajan, and K. Rao, “Dscrete cosine transform,”IEEE Trans. Comput., vol. C-23, pp. 90-93, 1974.

[141] W. B. Kleijn and K. K. Paliwal, Speech Coding and Synthesis. Ams-terdam, The Netherlands: Elsevier, 1995.

[142] ITU-T Recommendation G.711. Pulse code modulation (PCM) ofvoice frequencies, ITU-T, 1988.

[143] ITU-T Recommendation G.726. 40, 32, 24, 16 kbit/s adaptive differ-ential pulse code modulation (ADPCM), ITU-T, 1990.

[144] ITU-T Recommendation G.728. Coding of speech at 16 kbit/s usinglow-delay code excited linear prediction, ITU-T, 1992.


[145] ITU-T Recommendation G.723.1. Dual rate speech coder for multime-dia communications transmitting at 5.3 and 6.3 kbit/s, ITU-T, 1996.

[146] ITU-T Recommendation G.729. Coding of speech at 8 kbit/s us-ing conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP), ITU-T, 1996.

[147] ITU-T Recommendation G.722. 7 khz audio - coding within 64 kbit/s,ITU-T, 1988.

[148] ITU-T Recommendation G.722.1. Coding at 24 and 32 kbit/s forhands-free operation in systems with low frame loss, ITU-T, 1999.

[149] ITU-T Recommendation G.722.2. Wideband coding of speech ataround 16 kbit/s using Adaptive Multi-rate Wideband (AMR-WB),ITU-T, 2002.

[150] Technical Report GSM 06.10, European digital telecommunicationssystem (Phase2); full rate speech processing functions. ETSI, 1992.

[151] P. Kroon, E. F. Deprettere, and R. J. Sluyter, “Regular-pulseexcitation–A novel approach to effective and efficient multipulse codingof speech,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 34,no. 5, pp. 1054-1063, 1986.

[152] I. A. Gerson, and M. A. Jasiuk, “Techniques for improving the perfor-mance of CELP-type speech coders,” IEEE J. Selected Areas Comm.,vol. 10, no. 5, pp. 858-865, 1992.

[153] Technical Report GSM 06.60, Digital cellular telecommunications sys-tem: Enhanced full rate (EFR) speech transcoding. ETSI, 1997.

[154] Technical Report GSM 06.90, Digital cellular telecommunications sys-tem (phase 2+): Adaptive multirate (AMR) speech transcoding. ETSI,1998.

[155] K. Homayounfar, “Rate adaptive speech coding for universal multi-media access,” IEEE Signal Processing Magazine, vol. 20, no. 2, pp.30-39, 2003.

[156] IS-127, Enhanced variable rate codec. 3GPP2 C.S0014-0, 1997.

[157] Y. Gao, E. Shlomot, A. Benyassine, J. Thyssen, H.-Y. Su, and C.Murgia, “The SMV algorithm selected by TIA and 3GPP2 for CDMAapplications,” in Proc. IEEE Conf. Acoust., Speech, Signal Processing,May 2001, pp. 709-712.

48 Introduction

[158] S. C. Greer and A. DeJaco, “Standardization of the selectable modevocoder,” in Proc. IEEE Conf. Acoust., Speech, Signal Processing, May2001, pp. 953-956.

[159] S. V. Andersen, W. B. Kleijn, R. Hagen, J. Linden, M. N. Murthi,and J. Skoglund, “iLBC - a linear predictive coder with robustness topacket losses,” in Proc. IEEE Workshop Speech Coding, Oct. 2002, pp.23-25.

[160] P. Noll, “MPEG digital audio coding,” IEEE Signal Processing Mag-azine, vol. 14, no. 5, pp. 59-81, 1997.

[161] ATSC Doc. A/52, Digital audio compression standard (AC-3), UnitedStates Advanced Television Systems Committee (ATSC), 1995.

[162] K. Konstantinides, C.-T. Chen, T.-C. Chen, H. Cheng, and F.-C.Jeng, “Design of an MPEG-2 codec,” IEEE Signal Processing Maga-zine, vol. 19, no. 4, pp. 32-41, 2002.

[163] G. K. Wallace, “The JPEG still picture compression standard,” IEEETrans. Consumer Electronics, vol. 38, no. 1, pp. xviii-xxxiv, 1992.

[164] A. Skodras, C. Christopoulos, and T. Ebrahimi, “The JPEG 2000still image compression standard,” IEEE Signal Processing Magazine,vol. 18, no. 5, pp. 36-58, 2001.

[165] MPEG-2/H.262, Generic coding of moving pictures and associatedaudio information - Part 2: Video, ITU-T and ISO/IEC JTC 1.

[166] H.264/AVC, Draft ITU-T recommendation and final draftinternational standard of joint video specification (ITU-T Rec.H.264/ISO/IEC 14486-10 AVC), Joint Video Team (JVT) of ISO/IECMPEG and ITU-T VCEG, 2003.

[167] T. Wiegand, G. J. Sullivan, G. Bjøtegaard, and A. Luthra,, “Overviewof the H.264/AVC Video Coding Standard,” IEEE Trans. CircuitsSyst. Video Technol., vol. 13, no. 7, pp. 560-576, 2003.

source and channel coding for audiovisual communication...

Documents