TRANSCRIPT
Lecture 7: Lossy Source Coding
Lossy Source Coding Theorem for Memoryless Sources; Proof of the Coding Theorem
I-Hsiang Wang
Department of Electrical Engineering, National Taiwan University
December 2, 2015
1 / 39 I-Hsiang Wang IT Lecture 7
The Block-to-Block Source Coding Problem
[Block diagram: Source → s[1:N] → Source Encoder → b[1:K] → Source Decoder → ŝ[1:N] → Destination]
Recall: in Lecture 03, we investigated the fundamental limit of (almost) lossless block-to-block (or fixed-to-fixed) source coding.

The recovery criterion is vanishing probability of error:

  lim_{N→∞} P{ Ŝ^N ≠ S^N } = 0.

The minimum compression ratio to fulfill lossless reconstruction is the entropy rate of the source:

  R* = H({S_i}), for stationary and ergodic {S_i}.
In this lecture, we turn our focus to lossy block-to-block source coding, where the setting is the same as before, except:

The recovery criterion is reconstruction to within a given distortion D:

  lim sup_{N→∞} E[d(S^N, Ŝ^N)] ≤ D.

The minimum compression ratio to fulfill reconstruction to within a given distortion D is the rate-distortion function:

  R(D) = min_{p_{Ŝ|S} : E[d(S,Ŝ)] ≤ D} I(S; Ŝ), for a DMS {S_i}.
Why lossy source coding?
Sometimes it might be too expensive to reconstruct the source in a lossless way.

Sometimes it is impossible to reconstruct the source losslessly. For example, if the source is continuous-valued, the entropy rate of the source is usually infinite!

Lossy source coding has a wide range of applications, including quantization/digitization of continuous-valued signals, image/video/audio compression, etc.

In this lecture, we first focus on discrete memoryless sources (DMS). Then, we employ the discretization technique to extend the coding theorems from the discrete-source case to the continuous-source case. In particular, Gaussian sources will be our main focus.
Lossless vs. Lossy Source Coding
The general lossy source coding problem involves quantizing all possible source sequences s^N ∈ S^N into 2^K reconstruction sequences ŝ^N ∈ Ŝ^N, which can be represented by K bits. The goal is to design the correspondence between s^N and ŝ^N so that the distortion (quantization error) is below a prescribed level D.

Lossy source coding has a couple of notable differences from lossless source coding:

The source alphabet S and the reconstruction alphabet Ŝ could be different in general.
Performance is determined by the chosen distortion measure.
Outline

1 Lossy Source Coding Theorem for Memoryless Sources
  Lossy Source Coding Theorem
  Rate Distortion Function

2 Proof of the Coding Theorem
  Converse Proof
  Achievability
Distortion Measures

We begin with the definition of the distortion measure per symbol.

Definition 1 (Distortion Measure)
A per-symbol distortion measure is a mapping d(s, ŝ) from S × Ŝ to [0, ∞), understood as the cost of representing s by ŝ. For two length-N sequences s^N and ŝ^N, the distortion between them is defined as the average of the per-symbol distortion:

  d(s^N, ŝ^N) ≜ (1/N) Σ_{i=1}^N d(s_i, ŝ_i).

Examples: below are two widely used distortion measures:
Hamming distortion: Ŝ = S, d(s, ŝ) ≜ 1{s ≠ ŝ}.
Squared-error distortion: S = Ŝ = ℝ, d(s, ŝ) ≜ (s − ŝ)².
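The two per-symbol measures and the length-N block average above can be sketched directly in code. A minimal illustration (my own, not from the lecture):

```python
def hamming(s, s_hat):
    """Hamming distortion: 1 if the symbols differ, else 0."""
    return 0.0 if s == s_hat else 1.0

def squared_error(s, s_hat):
    """Squared-error distortion (s - s_hat)^2 for real alphabets."""
    return (s - s_hat) ** 2

def block_distortion(sN, sN_hat, d):
    """Average per-symbol distortion d(s^N, s_hat^N) over a block."""
    assert len(sN) == len(sN_hat)
    return sum(d(s, sh) for s, sh in zip(sN, sN_hat)) / len(sN)

print(block_distortion([0, 1, 1, 0], [0, 1, 0, 0], hamming))    # 0.25
print(block_distortion([1.0, 2.0], [1.5, 2.0], squared_error))  # 0.125
```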
Lossy Source Coding: Problem Setup
1 A (2^{NR}, N) source code consists of
  an encoding function (encoder) enc_N : S^N → {0,1}^K that maps each source sequence s^N to a bit sequence b^K, where K ≜ ⌊NR⌋;
  a decoding function (decoder) dec_N : {0,1}^K → Ŝ^N that maps each bit sequence b^K to a reconstructed source sequence ŝ^N.

2 The expected distortion of the code is D^(N) ≜ E[d(S^N, Ŝ^N)].

3 A rate-distortion pair (R, D) is said to be achievable if there exists a sequence of (2^{NR}, N) codes such that lim sup_{N→∞} D^(N) ≤ D.

The optimal compression rate R(D) ≜ inf{R : (R, D) is achievable}.
Rate Distortion Trade-off
D_min ≜ min_{ŝ(·)} E[d(S, ŝ(S))]

D_min denotes the minimum possible target distortion so that the rate is finite. Even if the decoder knows the entire s^N and finds a best representative ŝ^N(s^N), the expected distortion is still D_min.

D_max ≜ min_{ŝ} E[d(S, ŝ)]

[Figure: the rate-distortion curve R(D) on the (D, R) plane, decreasing from R(D_min) ≤ H(S) at D = D_min to 0 at D = D_max; the region above the curve is achievable, the region below is not.]

Let ŝ* ≜ arg min_{ŝ} E[d(S, ŝ)]. Then for target distortion D ≥ D_max, we can use a single representative ŝ^N = (ŝ*, ŝ*, ..., ŝ*) to reconstruct all s^N ∈ S^N (the rate is 0!), and

  D^(N) = E[d(S^N, ŝ^N)] = (1/N) Σ_{i=1}^N E[d(S_i, ŝ*)] = D_max ≤ D.
Hence, R(D) = 0 for all D ≥ D_max.
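The two endpoints D_min and D_max can be computed directly from their definitions for a finite-alphabet source. A minimal sketch (mine, not from the lecture), instantiated for a Ber(0.3) source with Hamming distortion:

```python
def d_min(p_s, d):
    # The decoder may pick, for each s, the best representative s_hat.
    return sum(p * min(d[s].values()) for s, p in p_s.items())

def d_max(p_s, d):
    # One single representative s_hat for every source symbol (rate 0).
    s_hats = next(iter(d.values())).keys()
    return min(sum(p_s[s] * d[s][sh] for s in p_s) for sh in s_hats)

# Ber(0.3) source with Hamming distortion d[s][s_hat]:
p_s = {0: 0.7, 1: 0.3}
d = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
print(d_min(p_s, d))  # 0.0
print(d_max(p_s, d))  # 0.3, i.e. min(p, 1 - p)
```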
Lossy Source Coding Theorem
Theorem 1 (A Lossy Source Coding Theorem for DMS)
For a discrete memoryless source {S_i | i ∈ ℕ},

  R(D) = min_{p_{Ŝ|S} : E[d(S,Ŝ)] ≤ D} I(S; Ŝ).   (1)

Interpretation:

  H(S) − H(S | Ŝ) = I(S; Ŝ)

the uncertainty of the source S, minus the uncertainty of S after learning Ŝ, equals the rate used in compressing S to Ŝ.
Properties of Rate Distortion Function
[Figure: the rate-distortion curve R(D), as on slide 10.]

A rate distortion function R(D) satisfies the following properties:
1 Nonnegative.
2 Non-increasing in D.
3 Convex in D.
4 Continuous in D.
5 R(D_min) ≤ H(S).
6 R(D) = 0 if D ≥ D_max.

These properties are all quite intuitive. Below we sketch the proof of these properties.
Monotonicity: Clear from the definition.

Convexity: The goal is to prove that, for D_1, D_2 ≥ D_min and λ ∈ (0, 1),

  R(λD_1 + (1−λ)D_2) ≤ λR(D_1) + (1−λ)R(D_2).

Let p_i(ŝ|s) ≜ arg min_{p_{Ŝ|S} : E[d(S,Ŝ)] ≤ D_i} I(S; Ŝ) be the optimizing conditional distribution that achieves distortion D_i, for i = 1, 2, and let p_λ ≜ λp_1 + (1−λ)p_2. Under p_λ(ŝ|s), the expected distortion between S and Ŝ is at most λD_1 + (1−λ)D_2, since

  E_{p(s) p_λ(ŝ|s)}[d(S, Ŝ)] = Σ_s Σ_ŝ p(s) [λp_1(ŝ|s) + (1−λ)p_2(ŝ|s)] d(s, ŝ) ≤ λD_1 + (1−λ)D_2.

The proof is complete since I(S; Ŝ) is convex in p_{Ŝ|S} for a fixed p_S:

  R(λD_1 + (1−λ)D_2) ≤ I(S; Ŝ)_{p_λ} ≤ λ I(S; Ŝ)_{p_1} + (1−λ) I(S; Ŝ)_{p_2} = λR(D_1) + (1−λ)R(D_2).
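A quick numeric sanity check of convexity (my own sketch, not from the lecture), using the closed-form Bernoulli rate-distortion function R(D) = Hb(p) − Hb(D) derived in Example 1 below:

```python
from math import log2

def Hb(q):
    """Binary entropy function in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

def R(D, p=0.3):
    """Bernoulli(p) rate-distortion function under Hamming distortion."""
    return Hb(p) - Hb(D) if D <= min(p, 1 - p) else 0.0

D1, D2, lam = 0.05, 0.25, 0.4
mid = R(lam * D1 + (1 - lam) * D2)
chord = lam * R(D1) + (1 - lam) * R(D2)
print(mid <= chord)  # True: the curve lies on or below every chord
```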
Nonnegativity: Clear from the definition.

Continuity: It is well known that convexity on an open interval implies continuity on that open interval.

[Figures: an R(D) curve with a jump in the interior of (D_min, D_max) would violate convexity; hence a discontinuity is possible only at the boundary.]

Hence, the only point where R(D) might be discontinuous is at the boundary D = D_min. The proof is technical and can be found in Gallager [2].
Example: Bernoulli Source with Hamming Distortion
Source (binary): S_i ∈ S = {0, 1}, and S_i i.i.d. ~ Ber(p) for all i.

Distortion (Hamming): d(s, ŝ) = 1{s ≠ ŝ}.

Example 1
Derive the rate distortion function of the Bernoulli-p source with Hamming distortion and show that it is given by

  R(D) = Hb(p) − Hb(D),  for 0 ≤ D ≤ min(p, 1−p),
  R(D) = 0,              for D > min(p, 1−p).

This is the first example of how to compute the rate distortion function, that is, how to solve (1) in the lossy source coding theorem.
Sol. The first step is to identify D_min and D_max.

D_min = 0 because one can choose ŝ(s) = s.
D_max = min(p, 1−p) because one can choose ŝ = 0 if p ≤ 1/2, and ŝ = 1 if p ≥ 1/2.

The next step is to lower bound I(S; Ŝ) = H(S) − H(S|Ŝ). This is equivalent to upper bounding H(S|Ŝ):

  H(S|Ŝ) = H(S ⊕ Ŝ | Ŝ) ≤ H(S ⊕ Ŝ) = Hb(q),

where we assume that S ⊕ Ŝ ~ Ber(q) for some q ∈ [0, 1].

Observe that d(S, Ŝ) ≡ S ⊕ Ŝ. Hence, E[d(S, Ŝ)] ≤ D ⟹ q ≤ D. Since D ≤ D_max ≤ 1/2, we see that Hb(q) is maximized when q = D.

Hence, I(S; Ŝ) ≥ Hb(p) − Hb(D).
Final step: show that the lower bound Hb(p) − Hb(D) can be attained. The goal is to find a probability transition matrix p(ŝ|s) such that

  Ŝ ⊥⊥ (S ⊕ Ŝ), so that H(S ⊕ Ŝ | Ŝ) = H(S ⊕ Ŝ), and P{S ⊕ Ŝ = 1} = D.

At first glance this looks hard. The difficulty can be resolved via an auxiliary reverse channel.

Consider a channel with input Ŝ, output S, and additive noise Z ~ Ber(D), Z ⊥⊥ Ŝ:

  S = Ŝ ⊕ Z ⟹ Z = S ⊕ Ŝ.

The reverse channel specifies the joint distribution p(s, ŝ) and hence p(ŝ|s)!

[Figure: binary reverse channel from Ŝ ~ Ber(α) to S ~ Ber(p) with crossover probability D. Matching the output marginal requires p = (1 − α)D + α(1 − D), i.e., α = (p − D)/(1 − 2D).]
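The reverse-channel construction can be verified numerically. A minimal check (my own sketch, not from the lecture) that Ŝ ~ Ber(α) with α = (p − D)/(1 − 2D), passed through a BSC(D), reproduces the Ber(p) source marginal at distortion exactly D:

```python
p, D = 0.3, 0.1
alpha = (p - D) / (1 - 2 * D)   # P(S_hat = 1), from matching the marginal

# Output marginal P(S = 1) = (1 - alpha)*D + alpha*(1 - D):
p_out = (1 - alpha) * D + alpha * (1 - D)

# Expected Hamming distortion P(S != S_hat) = P(Z = 1) = D by construction:
distortion = (1 - alpha) * D + alpha * D

print(round(alpha, 4))       # 0.25
print(round(p_out, 4))       # 0.3
print(round(distortion, 4))  # 0.1
```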
Example: Gaussian Source with Squared Error Distortion
Source (Gaussian): S_i ∈ S = ℝ, and S_i i.i.d. ~ N(µ, σ²) for all i.

Distortion (Squared Error): d(s, ŝ) = |s − ŝ|².

Example 2
Derive the rate distortion function of the Gaussian source with squared error distortion and show that it is given by

  R(D) = (1/2) log(σ²/D),  for 0 ≤ D ≤ σ²,
  R(D) = 0,                for D > σ².

Remark: Although the source is continuous, one can use weak typicality or the discretization method used in channel coding to extend the lossy source coding theorem from discrete memoryless sources to continuous ones.

Note: In particular, R(0) = ∞, which is quite intuitive!
Sol. First step: identify D_min and D_max.

D_min = 0 because one can choose ŝ(s) = s.
D_max = σ² because one can choose ŝ = µ, the mean of S.

Next step: lower bound I(S; Ŝ) = h(S) − h(S|Ŝ). This is equivalent to upper bounding h(S|Ŝ):

  h(S|Ŝ) = h(S − Ŝ | Ŝ) ≤ h(S − Ŝ) ≤ (1/2) log(2πe D),

where the last inequality holds since Var[S − Ŝ] ≤ E[|S − Ŝ|²] ≤ D.

Hence, I(S; Ŝ) ≥ (1/2) log(2πeσ²) − (1/2) log(2πe D) = (1/2) log(σ²/D).
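The closed form is easy to tabulate. A small sketch (mine, not from the lecture) of R(D) = (1/2) log₂(σ²/D) in bits per sample; note that each additional bit of rate reduces the achievable distortion by a factor of 4 (about 6 dB):

```python
from math import log2

def gaussian_rd(D, sigma2=1.0):
    """Rate-distortion function of a N(mu, sigma2) source, in bits."""
    if D <= 0:
        return float("inf")   # R(0) = infinity: lossless is impossible
    if D >= sigma2:
        return 0.0            # rate 0: reconstruct with the mean
    return 0.5 * log2(sigma2 / D)

print(gaussian_rd(0.25))      # 1.0
print(gaussian_rd(1.0 / 16))  # 2.0
print(gaussian_rd(2.0))       # 0.0
```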
Final step: show that the lower bound (1/2) log(σ²/D) can be attained. The goal is to find a conditional distribution p(ŝ|s) such that

  Ŝ ⊥⊥ (S − Ŝ), so that h(S − Ŝ | Ŝ) = h(S − Ŝ), and (S − Ŝ) ~ N(0, D).

Again, this can be done via an auxiliary reverse channel. Consider a channel with input Ŝ, output S, and additive noise Z ~ N(0, D), Z ⊥⊥ Ŝ:

  S = Ŝ + Z ⟹ Z = S − Ŝ.

The reverse channel specifies the joint distribution p(s, ŝ) and hence p(ŝ|s)!

[Figure: additive reverse channel with Ŝ ~ N(µ, σ² − D), Z ~ N(0, D), and output S = Ŝ + Z ~ N(µ, σ²).]
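A Monte Carlo sanity check (my own sketch, not from the lecture) of the Gaussian reverse channel: draw Ŝ ~ N(µ, σ² − D) and Z ~ N(0, D) independently, set S = Ŝ + Z, and verify that Var(S) ≈ σ² while E[(S − Ŝ)²] ≈ D:

```python
import random

random.seed(0)
mu, sigma2, D, n = 1.0, 2.0, 0.5, 200_000

s_hat = [random.gauss(mu, (sigma2 - D) ** 0.5) for _ in range(n)]
z = [random.gauss(0.0, D ** 0.5) for _ in range(n)]
s = [a + b for a, b in zip(s_hat, z)]

mean_s = sum(s) / n
var_s = sum((x - mean_s) ** 2 for x in s) / n
mse = sum((a - b) ** 2 for a, b in zip(s, s_hat)) / n

print(round(var_s, 1))  # ≈ 2.0 (= sigma2)
print(round(mse, 1))    # ≈ 0.5 (= D)
```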
Example: Source Alphabet = Reconstruction Alphabet
Source (ternary): S_i ∈ S = {0, ∗, 1}, and S_i i.i.d. ~ p_S for all i, where p_S(0) = p_S(1) = ε ≤ 1/2.

Reconstruction (binary): Ŝ = {0, 1}.

Distortion: d(s, ŝ) = 1 if s ≠ ∗ and ŝ ≠ s, and d(s, ŝ) = 0 if s = ∗ or ŝ = s.

In other words, there is a don't-care symbol ∗, and Ŝ ≠ S.

Example 3 (HW5)
Derive the rate distortion function and show that it is given by

  R(D) = 2ε(1 − Hb(D/(2ε))),  for 0 ≤ D ≤ ε,
  R(D) = 0,                   for D > ε.
Proof of the Converse of Theorem 1
We aim to show that for any sequence of (2^{NR}, N) source codes with lim sup_{N→∞} D^(N) ≤ D, the rate R must satisfy R ≥ R(D) (defined in (1)).

We begin with similar steps as in lossless source coding (cf. Lecture 03).

Pf: Note that B^K is a random variable because it is generated by another random variable, S^N.

  NR ≥ K ≥ H(B^K) ≥ I(B^K; S^N)
     (a)≥ I(S^N; Ŝ^N)
     (b)= Σ_{i=1}^N I(S_i; Ŝ^N | S^{i−1})
     (c)= Σ_{i=1}^N I(S_i; Ŝ^N, S^{i−1})
       ≥ Σ_{i=1}^N I(S_i; Ŝ_i).

(a) is due to the Markov chain S^N − B^K − Ŝ^N and the data processing inequality.
(b) is due to the chain rule. (c) is due to S_i ⊥⊥ S^{i−1} (memoryless source).

So far, we have not yet used the condition on distortion.
Further working on the inequality:

  NR ≥ Σ_{i=1}^N I(S_i; Ŝ_i)
     (d)≥ Σ_{i=1}^N R(E[d(S_i, Ŝ_i)])
        = N Σ_{i=1}^N (1/N) R(E[d(S_i, Ŝ_i)])
     (e)≥ N R(Σ_{i=1}^N (1/N) E[d(S_i, Ŝ_i)])
        = N R(E[(1/N) Σ_{i=1}^N d(S_i, Ŝ_i)])
        = N R(E[d(S^N, Ŝ^N)])
        = N R(D^(N)).

(d) is due to the definition of R(D) in (1).
(e) is due to the convexity of R(D) and Jensen's inequality.

Hence, R ≥ lim sup_{N→∞} R(D^(N)) (f)≥ R(lim sup_{N→∞} D^(N)) (g)≥ R(D).

(f) is due to the continuity of R(D).
(g) is due to lim sup_{N→∞} D^(N) ≤ D and R(D) being non-increasing.
Remarks
You might note that in the previous proof of the converse, we do not make use of lower bounds on error probability such as Fano's inequality. This is because in our formulation of the lossy source coding problem, the reconstruction criterion is placed on the expected distortion.

Instead of the criterion lim sup_{N→∞} D^(N) ≤ D, where D^(N) ≜ E[d(S^N, Ŝ^N)], we could use a stronger criterion as follows:

  P_e^(N,δ) ≜ P{d(S^N, Ŝ^N) > D + δ}, δ > 0   (Probability of Error)
  lim_{N→∞} P_e^(N,δ) = 0, ∀ δ > 0            (Reconstruction Criterion)

Under this stronger criterion, we can then give a new operational definition of the rate distortion function. It turns out Theorem 1 remains the same! (The converse is implied by our converse.)
Idea of Constructing Good Source Code
Key in source coding:
1 Find a good set of representatives (quantization codewords).
2 For each source sequence, determine which codeword to use.

Main tools we have used so far in developing achievability of coding theorems:
1 Random coding: construct the codebook randomly and show that at least one realization can achieve the desired target performance.
2 Typicality: helps give bounds in the performance analysis.

In the following, we prove the achievability part of Theorem 1 by:
1 Random coding: show existence of a good quantization codebook.
2 Typicality encoding: determine which codeword to use.
[Figure: the space of source sequences S^N is covered by reconstruction sequences Ŝ^N; the codebook is built by random coding, and each source sequence is mapped to a codeword by typicality encoding.]
Proof Program
1 Random Codebook Generation: generate a random ensemble of quantization codebooks, each of which contains 2^K codewords.

2 Analysis of Expected Distortion:
Goal: show that lim sup_{N→∞} E_{C,S^N}[d(S^N, Ŝ^N)] ≤ D, and conclude that there must exist a codebook c such that the expected distortion satisfies lim sup_{N→∞} E_{S^N}[d(S^N, Ŝ^N)] ≤ D.

Note that for a source sequence s^N, the optimal encoder chooses an index w ∈ [1 : 2^K], that is, a codeword ŝ^N(w) in the codebook, so that d(s^N, ŝ^N(w)) is minimized.

However, similar to ML decoding in channel coding, such an optimal encoder is hard to analyze. To simplify the analysis, we shall introduce a suboptimal encoder based on typicality.
Random Codebook Generation
Fix the conditional p.m.f. that attains R(D/(1+ε)):

  q_{Ŝ|S} = arg min_{p_{Ŝ|S} : E[d(S,Ŝ)] ≤ D/(1+ε)} I(S; Ŝ)   (2)

Based on the chosen q_{Ŝ|S} and the source distribution p_S, calculate p_Ŝ, the marginal distribution of the reconstruction Ŝ.

Generate 2^K codewords {ŝ^N(w) | w = 1, 2, ..., 2^K}, i.i.d. according to p(ŝ^N) = Π_{i=1}^N p_Ŝ(ŝ_i).

In other words, if we think of the quantization codebook as a 2^K × N matrix C, the elements of C will be i.i.d. according to p_Ŝ.

Remark: observe the resemblance with the channel coding achievability.
Encoding and Decoding
Encoding: unlike channel coding, the encoding process in a source coding problem is usually much more involved. We use typicality encoding (resembling typicality decoding in channel coding):

Given a source sequence s^N, find an index w ∈ [1 : 2^K] such that (s^N, ŝ^N(w)) ∈ T_ε^(N)(p_{S,Ŝ}). Recall the joint distribution p_{S,Ŝ} = p_S × q_{Ŝ|S} as defined in (2). If there is no such index, or more than one, pick one w ∈ [1 : 2^K] at random. Send out the bit sequence that represents the chosen w.

Decoding: upon receiving the bit sequence representing w, generate the reconstruction ŝ^N(w) by looking up the quantization codebook.
Analysis of Expected Distortion
Why a typicality encoder? The typical average lemma (Lemma 2, Lecture 04):

For any nonnegative function g(x) on X, if x^n ∈ T_ε^(n)(X), then

  (1 − ε) E[g(X)] ≤ (1/n) Σ_{i=1}^n g(x_i) ≤ (1 + ε) E[g(X)].

In analyzing E_{C,S^N}[d(S^N, Ŝ^N)], we can then distinguish two cases,

  E ≜ {(S^N, Ŝ^N) ∉ T_ε^(N)} and E^c ≜ {(S^N, Ŝ^N) ∈ T_ε^(N)}:

  P{E} E_{C,S^N}[d(S^N, Ŝ^N) | E] + P{E^c} E_{C,S^N}[d(S^N, Ŝ^N) | E^c]
    ≤ P{E} max_{s,ŝ} d(s, ŝ) + P{E^c} (1 + ε) · D/(1 + ε)
    ≤ P{E} max_{s,ŝ} d(s, ŝ) + D.

Hence, as long as P{E} vanishes as N → ∞, we are done.
Analysis of Expected Distortion → Analysis of P{E}

With typicality encoding, the analysis of expected distortion is made easy: we just need to control P{E}, where E ≜ {(S^N, Ŝ^N) ∉ T_ε^(N)}.

Let us look at the event E: it is the event that the reconstruction Ŝ^N is not jointly typical with S^N, which can only happen when none of the quantization codewords in the codebook is jointly typical with S^N.

Hence, E ⊆ ∩_{w=1}^{2^K} A_w^c, where A_w ≜ {(S^N, Ŝ^N(w)) ∈ T_ε^(N)}.

  ⟹ P{E} ≤ P{∩_{w=1}^{2^K} A_w^c}.

Unfortunately, the events {A_w^c | w = 1, ..., 2^K} may not be mutually independent, because they all involve a common random sequence S^N.

However, for a fixed s^N, the events A_w^c(s^N) ≜ {(s^N, Ŝ^N(w)) ∉ T_ε^(N)}, w = 1, ..., 2^K, are indeed mutually independent!
Analysis of P{E}, E ≜ {(S^N, Ŝ^N) ∉ T_ε^(N)}

Motivated by the above observation, we give an alternative upper bound:

  P{E} ≤ Σ_{s^N ∈ S^N} p(s^N) P{∩_{w=1}^{2^K} A_w^c(s^N)}
       = Σ_{s^N ∈ S^N} p(s^N) Π_{w=1}^{2^K} P{A_w^c(s^N)}
       = Σ_{s^N ∈ S^N} p(s^N) Π_{w=1}^{2^K} (1 − P{A_w(s^N)}).

Question: is there a way to lower bound

  P{A_w(s^N)} ≜ P{(s^N, Ŝ^N(w)) ∈ T_ε^(N)(p_{S,Ŝ})}?

Yes. As long as s^N ∈ T_{ε′}^(N)(p_S) for some ε′ < ε, Lemma 1 (next slide) guarantees that P{A_w(s^N)} ≥ 2^{−N(I(S;Ŝ)+δ(ε))} for sufficiently large N, where lim_{ε→0} δ(ε) = 0.
Joint Typicality Lemma
The following lemma formally states the bounds. (Proof is omitted; see Section 2.5 of El Gamal & Kim [6].)

Lemma 1 (Joint Typicality Lemma)
Consider a joint p.m.f. p_{X,Y} = p_X · p_{Y|X} = p_Y · p_{X|Y}. Then, there exists δ(ε) > 0 with lim_{ε→0} δ(ε) = 0 such that:

1 For an arbitrary sequence x^n and random Y^n ~ Π_{i=1}^n p_Y(y_i),

  P{(x^n, Y^n) ∈ T_ε^(n)(p_{X,Y})} ≤ 2^{−n(I(X;Y) − δ(ε))}.

2 For an ε′-typical sequence x^n ∈ T_{ε′}^(n)(p_X) with ε′ < ε, and random Y^n ~ Π_{i=1}^n p_Y(y_i), for sufficiently large n,

  P{(x^n, Y^n) ∈ T_ε^(n)(p_{X,Y})} ≥ 2^{−n(I(X;Y) + δ(ε))}.
Finalizing the Proof
Invoking Lemma 1, the additional condition that s^N ∈ T_{ε′}^(N)(p_S) for some ε′ < ε motivates us to split the upper bound on P{E} as follows:

  P{E} ≤ Σ_{s^N ∈ S^N} p(s^N) Π_{w=1}^{2^K} (1 − P{A_w(s^N)})
       ≤ Σ_{s^N ∉ T_{ε′}^(N)(p_S)} p(s^N) + Σ_{s^N ∈ T_{ε′}^(N)(p_S)} p(s^N) Π_{w=1}^{2^K} (1 − P{A_w(s^N)})
       ≤ P{S^N ∉ T_{ε′}^(N)(p_S)} + Σ_{s^N ∈ T_{ε′}^(N)(p_S)} p(s^N) (1 − 2^{−N(I(S;Ŝ)+δ(ε))})^{2^K}
       ≤ P{S^N ∉ T_{ε′}^(N)(p_S)} + (1 − 2^{−N(I(S;Ŝ)+δ(ε))})^{2^K}
       ≤ P{S^N ∉ T_{ε′}^(N)(p_S)} + exp(−2^K · 2^{−N(I(S;Ŝ)+δ(ε))}).

The last step is due to (1 − x)^r ≤ e^{−rx} for x ∈ [0, 1] and r ≥ 0.
We obtain a nice upper bound:

  P{E} ≤ P{S^N ∉ T_{ε′}^(N)(p_S)} + exp(−2^K · 2^{−N(I(S;Ŝ)+δ(ε))}).

The first term vanishes as N → ∞ due to the AEP. The second term vanishes as N → ∞ if

  R > I(S; Ŝ) + δ(ε) = R(D/(1+ε)) + δ(ε).

Hence, any R > R(D/(1+ε)) + δ(ε) can achieve average distortion ≤ D. Finally, by the continuity of the rate-distortion function, we take ε → 0 and complete the proof.
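The achievability argument can be seen in a toy experiment (my own sketch, not from the lecture): for the Ber(p)/Hamming example, draw a random codebook from the optimal reconstruction marginal Ber(α) and encode each block with the minimum-distortion rule (the typicality encoder in the proof is a suboptimal device used only to ease the analysis). At a rate above R(D), the empirical distortion already beats the rate-0 benchmark D_max = min(p, 1−p) at a modest blocklength, and approaches D as N grows:

```python
import random

random.seed(1)
p, D, N = 0.3, 0.1, 12
K = 7                            # rate K/N ≈ 0.58 > R(D) ≈ 0.41 bits
alpha = (p - D) / (1 - 2 * D)    # optimal reconstruction marginal Ber(alpha)

# Random codebook: 2^K codewords drawn i.i.d. from Ber(alpha)^N.
codebook = [[int(random.random() < alpha) for _ in range(N)]
            for _ in range(2 ** K)]

def encode(sN):
    # Minimum-Hamming-distortion encoding over the codebook.
    return min(codebook, key=lambda c: sum(a != b for a, b in zip(sN, c)))

trials, total = 1000, 0.0
for _ in range(trials):
    sN = [int(random.random() < p) for _ in range(N)]
    total += sum(a != b for a, b in zip(sN, encode(sN))) / N

avg = total / trials
print(avg < 0.3)  # True: beats the rate-0 code (D_max = 0.3); tends to D as N grows
```

At this short blocklength the empirical distortion sits above D = 0.1; the theorem only promises distortion approaching D as N → ∞.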