
    Introduction to information theory and coding

    Louis WEHENKEL

    Channel coding (data transmission)

    1. Intuitive introduction

    2. Discrete channel capacity

    3. Differential entropy and continuous channel capacity

    4. Error detecting and correcting codes

    5. Turbo-codes

    IT 2000-8, slide 1

    1. Intuitive introduction to channel coding

How can we communicate in a reliable fashion over noisy channels?

    Examples of noisy communication channels :

1. Phone line (twisted pair : thermal noise, distortions, cross-talk...)

    2. Satellite (cosmic rays...)

    3. Hard-disk (read or write errors, imperfect material)

Simplistic model : binary symmetric channel (p = probability of error)

    [Figure: binary symmetric channel — input bits 0 and 1 are mapped to output bits 0 and 1, each bit being flipped with probability p.]

    IT 2000-8, slide 2


Let us suppose that p = 0.1 (one error every 10 bits, on the average).

    But in order to use the channel (say, a hard-disk) we want to make sure that during the whole lifecycle of 100 disks there is no more than one error.

    E.g. : life period = 10 years, and let us suppose that we transfer 1GB every day onto the disk : that is roughly 100 × 3650 × 8·10⁹ ≈ 3·10¹⁵ bits in total, so the requested bit error probability is of the order of p_b ≈ 10⁻¹⁵.
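A back-of-the-envelope check of that number (a minimal sketch in Python; the 1 GB = 8·10⁹ bits conversion and the 100-disk, 10-year figures come from the example above):

```python
disks, years, bits_per_day = 100, 10, 8e9        # 1 GB transferred per day and per disk
total_bits = disks * years * 365 * bits_per_day
required_p = 1 / total_bits                      # at most one error over the whole lifecycle
print(f"{total_bits:.1e} bits -> p_b <= {required_p:.0e}")
# about 2.9e+15 bits in total, so p_b must be of the order of 1e-15 or smaller
```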

    Two engineering approaches :

1. Physical approach : better circuits, lower density, better cooling, increased signal power...

    2. System approach : compensate for the bad performance by using the disk in an intelligent way.

    ! " $ % '

    ) 0 2 0

    4

    Message

    5

    4

    Estimatedmessage

    ENCODERCHANNEL

    DECODER

    IT 2000-8, slide 3

Information theory (and coding) ⇒ system approach (and solutions)

    Add redundancy to the input message and exploit this redundancy when decoding the received message.

Information theory : what are the possibilities (and limitations) in terms of performance tradeoffs ? ⇒ analysis problem

    Coding theory : how to build practical error-compensation systems ? ⇒ design problem

    (Cf. analogy with Systems theory vs Control theory)

    IT 2000-8, slide 4


    Some preliminary remarks

    Analysis problem : - more or less solved for most channels (but not for networks)

Design problem : - very difficult in general; - solved in a satisfactory way for a subset of problems only.

    Cost of the system approach : - performance tradeoff, computational complexity, - loss of real-time characteristics (in some cases).

    Cost of the physical approach : - investment, power (energy).

    Tradeoff depends on application contexts

    Example : (deep space communication)

Each dB of coding gain (which allows to reduce power, size... of antennas) leads to a decrease in cost of $80,000,000 for the Deep Space Network!

    coding can make infeasible projects feasible

    IT 2000-8, slide 5

    Error handling systems

[Figure: error handling system — Error avoidance and redundancy encoding → Coded modulation → Channel or storage medium → Demodulation → Error detection → Error correction → Error concealment; a retransmission-request path feeds back from error detection, and the middle blocks (modulation, channel/medium, demodulation) are labelled as the abstract channel.]

    (Error concealment : exploits natural redundancy to interpolate missing values)

    NB:

Error detection with retransmission request is an alternative to error correction, but is not always possible.

    In some protocols, error detection merely leads to dropping packets.

    We will come back later to the discussion of error detection and retransmission versus forward error correction.

    IT 2000-8, slide 6


    Codes for detecting and/or correcting errors on the binary symmetric channel

    1. Repetition codes :

Source   Code
    0        000
    1        111

    Decoder : majority vote.

    Example of transmission :

s   0    0    1    0    1    1    0
    x   000  000  111  000  111  111  000
    b   000  001  000  000  101  000  000
    y   000  001  111  000  010  111  000

    (b : noise vector, y = x ⊕ b)

    Decoding : p_b = 3p²(1 − p) + p³ ≈ 0.03 (per source bit), and code rate : R = 1/3.

NB: to reach p_b ≈ 10⁻¹⁵ we would need about 60 repetitions, i.e. a code rate R of roughly 1/60...

    Other properties : correction of single errors, detection of double errors.
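A minimal simulation of the R3 scheme over a binary symmetric channel (the function names and the use of Python are mine; the majority-vote decoder is the one described above):

```python
import random

p = 0.1  # BSC crossover probability, as in the example

def encode_r3(bits):
    """Repeat every source bit three times."""
    return [b for b in bits for _ in range(3)]

def bsc(bits, p):
    """Flip each transmitted bit independently with probability p."""
    return [b ^ (random.random() < p) for b in bits]

def decode_r3(received):
    """Majority vote over each block of three received bits."""
    return [int(sum(received[i:i+3]) >= 2) for i in range(0, len(received), 3)]

source = [random.randint(0, 1) for _ in range(100_000)]
estimate = decode_r3(bsc(encode_r3(source), p))
ber = sum(s != e for s, e in zip(source, estimate)) / len(source)
print(f"empirical BER = {ber:.3f}, theory 3p^2(1-p)+p^3 = {3*p**2*(1-p) + p**3:.3f}")
```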

    IT 2000-8, slide 7

2. Linear block codes (Hamming (7,4))

We would like to maximize the code rate under the reliability constraint (p_b below the requested value, here 10⁻¹⁵).

    Block codes : to a block of k source bits we associate a codeword of length n ≥ k.

Example : Hamming (7,4)

s      x          s      x          s      x          s      x
    0000   0000000    0100   0100110    1000   1000101    1100   1100011
    0001   0001011    0101   0101101    1001   1001110    1101   1101000
    0010   0010111    0110   0110001    1010   1010010    1110   1110100
    0011   0011100    0111   0111010    1011   1011001    1111   1111111

This code may be written in a compact way (where s and x denote row vectors) as

    x = s · G,   with   G = [ 1 0 0 0 1 0 1
                              0 1 0 0 1 1 0
                              0 0 1 0 1 1 1
                              0 0 0 1 0 1 1 ]

    ⇒ linear block code (additions and multiplications modulo 2)
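A small sketch of this encoding, with G taken from the reconstruction above (numpy and the function name are my own choices):

```python
import numpy as np

# Generator matrix of the Hamming (7,4) code in systematic form [I4 | P]
G = np.array([[1, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 0, 1, 1, 0],
              [0, 0, 1, 0, 1, 1, 1],
              [0, 0, 0, 1, 0, 1, 1]], dtype=int)

def encode(s):
    """Encode a length-4 source word s as the codeword x = s·G (mod 2)."""
    return np.mod(np.array(s) @ G, 2)

print(encode([1, 1, 0, 0]))   # -> [1 1 0 0 0 1 1], the table entry for s = 1100
```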

    IT 2000-8, slide 8


Definition (linear code). We say that a code is linear if all linear combinations of codewords are also codewords.

    Binary codewords of length n form an n-dimensional linear space. A linear code consists of a linear subspace of this space.

    In our example : first 4 bits = source word, last 3 bits = parity control bits. E.g. : 5th bit = parity (sum mod 2) of the first 3 bits.

    Some more definitions

Hamming distance d_H(x, y) of two vectors : number of bit positions in which they differ.

    Hamming weight of a bit-vector : number of bits equal to 1.

    Minimum distance of a code : minimum distance between any two distinct codewords.

    For a linear code : minimum distance = minimum weight of the non-zero codewords.

    Minimum distance of the Hamming (7,4) code : d_min = 3.

    IT 2000-8, slide 9

    Decoding :

Let y = x ⊕ e denote the received word (e : error vector).

    Maximum likelihood decoding :

    Guess that the codeword x̂ was sent which maximizes the probability P(x | y).

    Assuming that all codewords are equiprobable, this is equivalent to maximizing P(y | x).

    If d_H(x, y) = d, then P(y | x) = p^d (1 − p)^(n − d) (here n = 7).

    Hence : P(y | x) maximal ⇔ d_H(x, y) minimal (supposing that p < 1/2).

    For the Hamming (7,4) code :

    If the Hamming weight of e is ≤ 1 : correct decoding.

    If the Hamming weight of e is 2 : correct detection.

    Otherwise, minimum distance decoding will make errors.
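A brute-force illustration of minimum-distance (maximum-likelihood) decoding, searching over the 16 codewords (a sketch only; syndrome decoding, introduced below, avoids this explicit search):

```python
import numpy as np
from itertools import product

G = np.array([[1, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 0, 1, 1, 0],
              [0, 0, 1, 0, 1, 1, 1],
              [0, 0, 0, 1, 0, 1, 1]], dtype=int)

# Enumerate the 16 codewords of the Hamming (7,4) code
codebook = {tuple(np.mod(np.array(s) @ G, 2)): s for s in product([0, 1], repeat=4)}

def decode_min_distance(y):
    """Return the source word whose codeword is closest to y in Hamming distance."""
    closest = min(codebook, key=lambda x: sum(a != b for a, b in zip(x, y)))
    return codebook[closest]

y = (1, 1, 0, 0, 0, 1, 0)            # codeword 1100011 with a single error in its last bit
print(decode_min_distance(y))         # -> (1, 1, 0, 0)
```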

    IT 2000-8, slide 10


    Pictorial illustration

Repetition code R3 :

    [Figure: the cube of 3-bit words (000, 001, 010, 011, 100, 101, 110, 111); the codewords 000 and 111 sit at opposite corners and every received word is decoded to the nearer of the two.]

maximum likelihood decoding ⇔ nearest neighbor decoding

But : there is a more efficient way to do it than searching explicitly for the nearest neighbor (syndrome decoding).

    IT 2000-8, slide 11

Let us focus on correcting single errors with the Hamming (7,4) code

    Meaning of the 7 bits of a codeword

    s1 s2 s3 s4 p1 p2 p3

    4 signal bits 3 parity check bits

If there is an error on any of the 4 first bits (signal bits) ⇒ two or three parity checks are violated.

    If there is a single error on one of the parity check bits : only this parity check is violated.

    In both cases, we can identify the erroneous bit.

    IT 2000-8, slide 12


    Alternative representation : parity circles

[Figure: three overlapping parity circles, each containing one parity bit (p1, p2, p3) and three signal bits, with s3 in the common intersection.]

    Parity within each circle must be even if no error.

If this is not the case, we should flip one single received bit so as to restore this condition (always possible).

    For instance, if the parities in the green (upper right) and blue (lower) circles are not even while the third one is ⇒ flip the single bit that belongs to those two circles but not to the third.

    IT 2000-8, slide 13

Syndrome decoding

Syndrome : difference between the received parity bits and the parity bits recomputed from the received signal bits.

The syndrome is a vector of three bits ⇒ 2³ = 8 possible syndromes.

    The syndrome contains all the information needed for optimal decoding :

    8 possible syndromes ⇒ 8 most likely error patterns (which can be precomputed).

    E.g. : given a received word, compute the syndrome (bit per bit), look up the most probable error pattern associated with that syndrome, and flip the corresponding bit to obtain the decoded word.
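A sketch of syndrome decoding for this code. The parity-check matrix H = [Pᵀ | I₃] is derived from the generator matrix reconstructed above, and the worked example at the end is mine, not the one from the original slide:

```python
import numpy as np

# Parity-check matrix H = [P^T | I3] matching G = [I4 | P] above
H = np.array([[1, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 1]], dtype=int)

# Precompute the 8 syndromes -> most likely error pattern (no error or a single flip)
syndrome_table = {(0, 0, 0): np.zeros(7, dtype=int)}
for j in range(7):
    e = np.zeros(7, dtype=int)
    e[j] = 1
    syndrome_table[tuple(np.mod(H @ e, 2))] = e

def decode(y):
    """Syndrome-decode the received 7-bit word y and return the corrected codeword."""
    y = np.array(y)
    s = tuple(np.mod(H @ y, 2))               # received parity vs recomputed parity
    return np.mod(y + syndrome_table[s], 2)   # flip the most likely erroneous bit

# Hypothetical example: codeword 0110001 hit by a single error on its 5th bit
print(decode([0, 1, 1, 0, 1, 0, 1]))          # -> [0 1 1 0 0 0 1]
```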

    IT 2000-8, slide 14


    Summary

Like the repetition code, the Hamming (7,4) code corrects single errors and detects double errors, but uses longer words (7 bits instead of 3) and has a higher rate.

If p = 0.1 : probability of error per code word ≈ 0.14 and (signal) bit error rate (BER) ≈ 0.07.

    Less good than the repetition code in terms of BER, but better in terms of code rate : R = 4/7 instead of 1/3.

There seems to be a compromise between BER and code rate.

Intuitively : to drive the BER towards zero, the code rate would have to go to zero as well.

    (this is what most people still believed not so long ago...)

    And then ?

    ...Shannon came...

    IT 2000-8, slide 15

    Second Shannon theorem

States that if R < C, then arbitrarily small p_b may be attained.

Third Shannon theorem (rate distortion : a non-zero bit error rate p_b is tolerated)

    Using irreversible compression we can further increase the code rate by a factor 1/(1 − h₂(p_b)) if we accept to reduce reliability (i.e. p_b > 0).

    Conclusion : we can operate in any region satisfying R < C / (1 − h₂(p_b)).

[Figure: achievable region in the (code rate R, bit error rate p_b) plane for the BSC with p = 0.1 : the Shannon boundary passes through R = C ≈ 0.53 as p_b → 0 (p_b axis shown down to 10⁻¹¹), separating the 'Possible' region from the 'Impossible' one.]

Conclusion : we only need two disks and a very good code to reach p_b ≤ 10⁻¹⁵.

    IT 2000-8, slide 16


    2nd Shannon theorem (channel coding)

What is a channel ?

    [Figure: input sequence → Channel → noisy version of the input, available later and/or elsewhere.]

Abstract model : an input alphabet 𝒳, an output alphabet 𝒴, and for every n the law of the output sequence given the input sequence,

    in other words, the specification of (all) the conditional probability distributions

    P(Y_1, ..., Y_n | X_1, ..., X_n), defined for all n and all input sequences.

    IT 2000-8, slide 17

    Simplifications

Causal channel : if, for all n and all N ≥ n,

    P(Y_1, ..., Y_n | X_1, ..., X_N) = P(Y_1, ..., Y_n | X_1, ..., X_n)   (1)

    (outputs up to time n do not depend on later inputs).

    Causal and memoryless channel : if, for all n,

    P(Y_n | X_1, ..., X_n, Y_1, ..., Y_{n-1}) = P(Y_n | X_n)   (2)

    Causal, memoryless and stationary channel : if, for all n, we have

    P(Y_n = y | X_n = x) = P(Y_1 = y | X_1 = x)   (3)

    this will be our working model

1 symbol enters at time n ⇒ 1 symbol comes out at time n.

    If a stationary process enters the channel ⇒ a stationary process comes out.

    If an ergodic process enters the channel ⇒ an ergodic process comes out.

    (NB: one can generalize to stationary channels of finite memory...)

    IT 2000-8, slide 18


    Information capacity of a (stationary memoryless) channel

By definition :

    C = max_{P(X)} I(X; Y)   (4)

    Remarks

This quantity relates to one single use of the channel (one symbol).

    I(X; Y) depends both on source and channel properties.

    C solely depends on channel properties.

    We will see later that this quantity coincides with the notion of operational capacity.

    NB: How would you generalize this notion to more general classes of channels ?

    IT 2000-8, slide 19

    Examples of discrete channels and values of capacity

Channel transition matrix : the matrix whose entry (i, j) is P(Y = y_j | X = x_i), with one row per input symbol and one column per output symbol.

    1. Binary channel without noise

Input and output alphabets are binary and Y = X, so I(X; Y) = H(X), which is maximal when H(X) is maximal ⇒ C = 1 Shannon.

    Achievable rate (without errors) : 1 source symbol/channel use.

    Can we do better ?

    No, unless we admit decoding errors.

    IT 2000-8, slide 20


    2. Noisy channel without overlapping outputs

E.g. : a transition matrix in which every output symbol can be produced by only one input symbol ⇒ H(X | Y) = 0 ⇒ C = max H(X) = log |𝒳|. (Achievable : the input can be recovered exactly from the output...)

3. Noisy typewriter

    Input alphabet : a, b, c, ..., z. Output alphabet : a, b, c, ..., z.

    Each input letter is received either unchanged or as the next letter of the alphabet, each with probability 1/2 : P(a|a) = P(b|a) = 1/2, P(b|b) = P(c|b) = 1/2, ...

    I(X; Y) = H(Y) − H(Y|X) = H(Y) − 1, with H(Y) maximal if the outputs are equiprobable.

    E.g. if inputs are equiprobable : H(Y) = log₂ 26 ⇒ C = log₂ 26 − 1 = log₂ 13 Shannon.

    Achievable : just use the right subset of the input alphabet (every second letter)...

    (NB: this is THE idea of channel coding)
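A small numerical check of this value, under the usual textbook model in which each letter is received either unchanged or as the next letter of the alphabet, each with probability 1/2 (this model is assumed here):

```python
import numpy as np

A = 26
# Transition matrix: each input letter goes to itself or (cyclically) to its successor
P = np.zeros((A, A))
for i in range(A):
    P[i, i] = 0.5
    P[i, (i + 1) % A] = 0.5

px = np.full(A, 1 / A)                          # uniform input distribution
py = px @ P                                      # output distribution (also uniform)
H_Y = -np.sum(py * np.log2(py))                  # = log2(26)
logP = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
H_Y_given_X = -np.sum(px[:, None] * P * logP)    # = 1 Shannon per channel use
print(H_Y - H_Y_given_X, np.log2(13))            # both ~ 3.70
```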

    IT 2000-8, slide 21

    4. Binary symmetric channel

[Figure: binary symmetric channel — inputs {0, 1}, outputs {0, 1}, crossover probability p.]

Information capacity of this channel :

    C = max_{P(X)} I(X; Y) = 1 − h₂(p), where h₂(p) = −p log₂ p − (1 − p) log₂(1 − p).

    Equal to 0 if p = 1/2, and equal to 1 if p = 0 or p = 1. Symmetric : C(p) = C(1 − p).

    Achievable ? : less trivial...
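A one-line check of this formula for a few values of p (a minimal sketch):

```python
import numpy as np

def bsc_capacity(p):
    """C(p) = 1 - h2(p), the BSC capacity in Shannon per channel use."""
    if p in (0.0, 1.0):
        return 1.0
    return 1 + p * np.log2(p) + (1 - p) * np.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  C = {bsc_capacity(p):.3f}")
# p = 0.1 gives C ~ 0.531, the value used in the hard-disk discussion
```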

    IT 2000-8, slide 22


    Main properties of the information capacity

1. C ≥ 0.

    2. C ≤ min(log |𝒳|, log |𝒴|).

Moreover, one can show that I(X; Y) is continuous and concave with respect to P(X).

    Thus every local maximum must be a global maximum (on the convex set of input probability distributions P(X)).

    Since the function I(X; Y) is upper bounded, the capacity must be finite.

One can use powerful optimisation techniques to compute the information capacity of a large class of channels with the desired accuracy.

    In general, the solution cannot be obtained analytically.
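One such technique is the Blahut-Arimoto algorithm (not named on the slide; this is an illustrative sketch, not the course's own implementation). It alternates between the posterior Q(x|y) and the input distribution r(x) and converges to the information capacity of a discrete memoryless channel:

```python
import numpy as np

def blahut_arimoto(P, n_iter=500):
    """Capacity (Shannon/channel use) of a DMC with transition matrix P[x, y] = P(y|x)."""
    nx = P.shape[0]
    r = np.full(nx, 1.0 / nx)                        # input distribution, start uniform
    for _ in range(n_iter):
        q = r[:, None] * P                           # r(x) P(y|x)
        q /= q.sum(axis=0, keepdims=True)            # posterior Q(x|y)
        # r(x) proportional to exp( sum_y P(y|x) log Q(x|y) )
        expo = np.sum(np.where(P > 0, P * np.log(np.maximum(q, 1e-300)), 0.0), axis=1)
        r = np.exp(expo)
        r /= r.sum()
    # mutual information I(X;Y) of the final input distribution, in bits
    py = r @ P
    terms = np.where(P > 0, P * np.log2(np.maximum(P, 1e-300) / np.maximum(py, 1e-300)), 0.0)
    return float(np.sum(r[:, None] * terms))

p = 0.1
P_bsc = np.array([[1 - p, p], [p, 1 - p]])
print(blahut_arimoto(P_bsc))   # ~ 0.531 = 1 - h2(0.1), as expected for the BSC
```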

    IT 2000-8, slide 23

    Communication system

    ! " $ % '

    ) 0 2 0

    4

    Message

    5

    4

    Estimatedmessage

    ENCODERCHANNEL

    DECODER

A message W (from a finite set of possible messages {1, 2, ..., M}) is encoded by the encoder into a sequence of n channel input symbols, denoted X^n(W).

    At the other end of the channel another (random) sequence of channel output symbols Y^n is received (distributed according to P(Y^n | X^n)).

    The sequence is then decoded by the decoder, who chooses an element Ŵ ; the receiver makes an error if Ŵ ≠ W.

In what follows, we will suppose that the encoder and the decoder operate in a deterministic fashion :

    - f : {1, ..., M} → 𝒳^n is the coding rule (or function);

    - g : 𝒴^n → {1, ..., M} is the decoding rule (or function).

    IT 2000-8, slide 24


    Memoryless channel specification

Input and output alphabets 𝒳 and 𝒴, and the transition probabilities P(y | x) are given.

    The channel is used without feedback : the inputs do not depend on past outputs.

    In this case : P(y_1, ..., y_n | x_1, ..., x_n) = ∏_{i=1}^{n} P(y_i | x_i).

    Definitions that will follow :

- Channel code (M, n).

    - Different kinds of error rates...

    - Communication rate.

    - Achievable rates.

    - Operational capacity.

    IT 2000-8, slide 25

(M, n) Code

An (M, n) code for a channel (𝒳, P(y|x), 𝒴) is defined by

    1. A set of indices {1, 2, ..., M} ;

    2. A coding function X^n(·) : {1, 2, ..., M} → 𝒳^n, which gives the codebook {X^n(1), ..., X^n(M)}.

    3. A decoding function

    g : 𝒴^n → {1, 2, ..., M},   (5)

    which is a deterministic mapping from all possible output strings into an input index.

    ⇒ M code words, each coded using n input symbols.

    IT 2000-8, slide 26


    Decoding error rates

1. Conditional probability of error given that index i was sent :

    λ_i = P(g(Y^n) ≠ i | X^n = X^n(i)).

    2. Maximal error probability for an (M, n) code :

    λ_max = max_{i ∈ {1, ..., M}} λ_i.

    3. The (arithmetic) average probability of error :

    P_e = (1/M) ∑_{i=1}^{M} λ_i

    = expected error rate if the message index has a uniform distribution.

    IT 2000-8, slide 27

    Optimal decoding rule

    By definition : the decoding rule which minimises expected error rate.

For a received word Y^n, choose the message Ŵ such that P(W = Ŵ | Y^n) is maximal.

    ⇒ maximises the a posteriori probability (MAP)

    ⇒ minimises, for each received word, the error probability

    ⇒ minimises the expected error rate.

    general principle in decision theory : Bayes rule

We use information Y (a random variable which is observed).

    We want to guess (decide on) a certain variable W (choose among M possibilities).

    Correct decision : W, a random variable whose joint distribution with Y is known.

    Cost of the taken decision : 0 if correct, 1 if incorrect.

    Optimal decision based on the information Y : choose Ŵ(Y) = arg max_w P(W = w | Y).

    IT 2000-8, slide 28


For our channel : P(W = w | Y^n) = P(Y^n | X^n(w)) P(W = w) / P(Y^n).

    Since P(Y^n) does not depend on the decision, this is the same as maximizing P(Y^n | X^n(w)) P(W = w).

    Discussion

P(Y^n | X^n(w)) : channel specification.

    P(W = w) : source specification.

    If the source is non-redundant : P(W = w) is independent of w ⇒ maximize P(Y^n | X^n(w)).

Maximum likelihood rule : maximize P(Y^n | X^n(w)) (equivalently, minimize − log P(Y^n | X^n(w))).

    Quasi optimal, if source is quasi non-redundant.

    E.g. if we code long source messages (cf. AEP)

    IT 2000-8, slide 29

Communication rate : denoted R

    The communication rate (denoted by R) of an (M, n) code is defined by

    R = (log₂ M) / n   Shannon/channel use

    = input entropy per channel use if the messages are uniformly distributed.

Achievable rate (a more subtle notion)

    R is said to be achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that

    1. n → ∞, and 2. λ_max → 0.

    ⇒ codes of rate R eventually become quasi perfect and remain so (when using very long codewords).

    Remark

    Definition is independent of the source distribution (cf. maximal probability of error).

Operational capacity : C_o is the supremum of all achievable rates.

    R = 0 is trivially achievable, but we need to check that it is possible to have R > 0.

    IT 2000-8, slide 30


    Second Shannon theorem

Objective : prove that the information capacity C is equal to the operational capacity C_o.

    Hypothesis : the input/output pair satisfies the AEP (stationary and ergodic : OK for stationary finite-memory channels and ergodic inputs).

Information capacity (per channel use) :

    C = lim_{n→∞} max_{P(X_1, ..., X_n)} (1/n) I(X_1, ..., X_n ; Y_1, ..., Y_n)

    (with stationary and ergodic inputs).

For memoryless channels the maximum is obtained for independent input symbols, and the above definition indeed yields : C = max_{P(X)} I(X; Y).

Basic idea : for large block lengths, every channel looks like the noisy typewriter :
    - it has a subset of inputs that produce essentially disjoint sequences at the output;
    - the rest of the proof is a matter of counting and packing...

    IT 2000-8, slide 31

Outline of the proof (Shannon's random coding ideas)

Let us fix for a moment P(X), n and M, and construct a codebook by generating M codewords of length n at random according to P(X) (i.e. M · n independent drawings).

    ⇒ if n is large, the codewords as well as the received sequences must be typical (AEP).

[Figure: random codewords and their clouds of typical output sequences — typical transitions land in the cloud of the transmitted codeword, non-typical transitions are rare.]

    IT 2000-8, slide 32


    Let us count and pack (approximately and intuitively...)

At the input : ≈ 2^{nH(X)} possible typical sequences (we choose M of them at random to construct our codebook).

    At the output : for each (typical) input, ≈ 2^{nH(Y|X)} possible typical output sequences.

    The total number of possible (typical) outputs is ≈ 2^{nH(Y)}.

    If we want that there is no overlap, we must impose that

    M · 2^{nH(Y|X)} ≤ 2^{nH(Y)}  ⇒  M ≤ 2^{n(H(Y) − H(Y|X))} = 2^{nI(X;Y)}.

    Now, let us choose the P(X) which maximizes I(X; Y) : it should be possible to find up to M ≈ 2^{nC} input sequences such that the output regions do not overlap.
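To make the counting concrete, here is the same packing bound evaluated for the binary symmetric channel with p = 0.1 and a block length of n = 1000 channel uses (the block length is my arbitrary choice for illustration):

```python
import numpy as np

p, n = 0.1, 1000

def h2(q):
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

H_Y = 1.0              # with uniform inputs, the BSC output is also uniform
H_Y_given_X = h2(p)    # ~ 0.469
I = H_Y - H_Y_given_X

print(f"typical output sequences   ~ 2^{n * H_Y:.0f}")
print(f"typical outputs per input  ~ 2^{n * H_Y_given_X:.0f}")
print(f"max number of codewords M  ~ 2^{n * I:.0f}   (rate R ~ {I:.3f} = C)")
```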

    Conclusion

Using long codewords (n → ∞) it is possible to exploit redundancy (correlation between inputs and outputs) to transmit information in a reliable way.

    IT 2000-8, slide 33

    Second Shannon theorem

    Statement in two parts :

Forwards : R < C ⇒ R is achievable (there exist codes with λ_max → 0).

    Backwards : R > C ⇒ R is not achievable.

    The proof is based on random coding and the joint AEP.

One can show that, for long codewords, a large proportion of the random codes generated by drawing ≈ 2^{nR} words (with R < C) are actually very good.

    Caveat

In practice random coding is not a feasible solution because of computational limitations, just like typical-message data compression is not a feasible approach to data compression.

IT 2000-8, slide 34