Lecture 10: Coding and Storage (fu-berlin.de)
TRANSCRIPT
Purpose of Coding
• Compression (source coding)
• Reduce transmission errors (channel coding)
Rik Sarkar. FU Berlin. Winter '11 -‐ '12
Source coding
• Compression • What is the smallest number of bits to represent the given data?
• Depends on entropy: information theory
Channel Coding
• Data is transmitted through a noisy channel; some information (bits) may get changed – wireless transmission, shouting across rooms – stored on a hard drive – written on paper…
• Idea: store some additional information so that some changed bits can be detected
Run length encoding (source coding)
• Suppose R, G, B are our colors • A picture has blue sky • Instead of storing the color of each pixel, we can say: blue for the next 800 pixels, green (grass) for the next 1200 pixels, etc.
• Used in LZW algorithms, GIF, ZIP…
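The blue-sky example above can be sketched as follows (a minimal sketch; 800 and 1200 are the pixel counts from the slide):

```python
def rle_encode(pixels):
    """Collapse runs of equal values into (value, run_length) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([p, 1])       # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]

row = ["blue"] * 800 + ["green"] * 1200
encoded = rle_encode(row)             # [("blue", 800), ("green", 1200)]
```

Instead of 2000 stored colors, we store two pairs; the scheme only pays off when the data actually contains long runs.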
Huffman Codes (Source Coding)
• Input: frequencies of occurrence of symbols
E 50
A 42
B 30
D 16
F 12
C 10
Huffman Codes
• Take the two least common items and make a virtual node with the total frequency
[Tree diagram: the two least frequent symbols, F (12) and C (10), form a virtual node of frequency 22]
Huffman Codes
• Repeat
[Tree diagram: D (16) and the 22-node form a virtual node of frequency 38]
Huffman Codes
• Repeat
[Tree diagram: B (30) and the 38-node form a virtual node of frequency 68]
Huffman Codes
• Repeat
[Tree diagram: E (50) and A (42) form a virtual node of frequency 92]
Huffman Codes
• Repeat
[Tree diagram: the 68-node and the 92-node form the root, with total frequency 160]
Huffman Codes
• Label the edges: left = 0, right = 1
• Reading root-to-leaf paths gives the codes: E = 00, D = 110; the string "ABC" encodes as 01101111
[Tree diagram with edges labelled: left = 0, right = 1]
Huffman Codes
• E = 00, D = 110, "ABC" = 01101111
• This is a prefix code: no codeword is a prefix of another, so it is easy to decode
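The construction above can be sketched in Python using a min-heap (a minimal sketch; the exact bit patterns depend on how ties and left/right children are assigned, but the code lengths for this frequency table are determined):

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman tree and return {symbol: codeword} (left edge = 0, right = 1)."""
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)                  # tie-break counter for the heap
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent nodes...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))  # ...become one virtual node
        next_id += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal (virtual) node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

freqs = {"E": 50, "A": 42, "B": 30, "D": 16, "F": 12, "C": 10}
codes = huffman_codes(freqs)
avg_bits = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
# E, A, B get 2-bit codes; D gets 3; F and C get 4 -> 2.375 bits/symbol on average
```

The average of 2.375 bits/symbol beats the 3 bits that a fixed-length code for six symbols would need.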
Why is it good?
• Gives short codes to frequent items, long codes to infrequent items
• Fewer bits on average
Huffman code is optimal
• Lemma: there is an optimal code tree in which the two symbols with smallest frequency are at the lowest level.
• Proof: otherwise we can swap their positions with deeper, more frequent symbols and save bits. Induction with this idea shows optimality.
How do we use Huffman codes?
• At the beginning of the file, we store the tree in a standard format (e.g. as ASCII)
• Then store the data using the Huffman code • Useful when the data is large • Or, have a standard Huffman code, for example based on the typical frequencies of letters in a language, and use that code everywhere.
Relation to information theory
• Entropy: measure of information content: the smallest number of bits necessary to store the data
• Entropy: measure of surprise
– Predictable => less information
• 1111111111111111….
• 001001001001001….
• "The sun rose in the east"
– Unpredictable/unusual/less frequent => more information
• 101001010111010000101001101111
• Random bit strings have higher entropy: harder to compress
• "The sun rose in the north"
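The entropy of a symbol distribution can be computed directly (a minimal sketch; note that per-symbol entropy ignores structure between symbols, so a periodic string like 001001… still shows nonzero symbol entropy even though it is predictable):

```python
from collections import Counter
from math import log2

def entropy(s):
    """Empirical Shannon entropy of a string, in bits per symbol."""
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

def entropy_dist(freqs):
    """Entropy of a frequency table {symbol: count}."""
    total = sum(freqs.values())
    return -sum((f / total) * log2(f / total) for f in freqs.values())

entropy("1111111111111111")   # 0.0: perfectly predictable, no information
H = entropy_dist({"E": 50, "A": 42, "B": 30, "D": 16, "F": 12, "C": 10})
# H is about 2.35 bits/symbol: no symbol code can average fewer bits,
# and Huffman's 2.375 bits/symbol for this table comes close to it
```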
Coding for fault tolerance and distributed storage
• Store data from one node at other nodes • If the node is turned off, we still have the data • Not everyone needs to reach node x to get x's data
• Simple duplication is expensive and inefficient • Use coding to keep storage small
Coding for Fault tolerance
• Encoding – Input: string of length k: "data" – Output: string of length n > k: "codeword"
• Decoding – Input: string of length n (may be corrupted) – Output: original data of length k
Error detection and correction
• Error detection: decide if a string is a valid codeword
• Correction: map it to a valid codeword
• Maximum likelihood decoding: find the codeword at the least Hamming distance – the one that can be reached with the fewest bit flips
• For small codewords, we can store a table of all possible cases
• NP-hard in general
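Maximum likelihood decoding over a small codebook can be sketched by brute force; the 3-bit repetition code used here is an illustrative choice, not from the slides:

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def ml_decode(received, codebook):
    """Maximum-likelihood decoding: the valid codeword needing the fewest bit flips."""
    return min(codebook, key=lambda c: hamming_distance(received, c))

codebook = ["000", "111"]        # 3-bit repetition code (illustrative)
ml_decode("010", codebook)       # "000": one flip away, vs two flips to reach "111"
```

This brute force is exactly the "table for all possible cases" idea; it scales exponentially with codeword length, which is why the general problem is NP-hard.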
Parity
• If the number of “1”s in the data is even, parity bit is 0, otherwise, it is 1 – 01101010, 11010111 – Number of 1s in codeword is always even
• Can detect 1-bit errors but cannot correct them: we do not know which bit got flipped
• Cannot detect 2-bit errors
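The even-parity scheme can be sketched as:

```python
def add_parity(bits):
    """Append an even-parity bit so the codeword has an even number of 1s."""
    return bits + [sum(bits) % 2]

def parity_ok(codeword):
    """A codeword is valid iff its total number of 1s is even."""
    return sum(codeword) % 2 == 0

w = add_parity([0, 1, 1, 0, 1, 0, 1, 0])   # slide example 01101010: four 1s -> parity bit 0
```

Two flips cancel each other out in the parity sum, which is exactly why 2-bit errors go undetected.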
Hamming Distance
• Generalize parity to correct errors • If any two valid codewords are at Hamming distance at least k, then we can detect (k−1)-bit errors and correct ⌊(k−1)/2⌋-bit errors
• Hamming code (7,4): adds 3 check bits to 4 data bits to correct any 1-bit error (the extended (8,4) version, with one extra overall parity bit, also detects any 2-bit error).
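The Hamming (7,4) code can be sketched as follows (a minimal sketch with parity bits at positions 1, 2 and 4, counted from 1; the syndrome then directly names the flipped position):

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword; parity bits sit at positions 1, 2, 4."""
    c = [0] * 8                      # index 0 unused so indices match 1-based positions
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]        # covers positions with binary bit 1 set: 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]        # covers positions with binary bit 2 set: 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]        # covers positions with binary bit 4 set: 4,5,6,7
    return c[1:]

def hamming74_correct(w):
    """Correct up to one flipped bit in a 7-bit word and return the 4 data bits."""
    c = [0] + list(w)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s4  # 0 if no error, else the 1-based error position
    if syndrome:
        c[syndrome] ^= 1
    return [c[3], c[5], c[6], c[7]]

cw = hamming74_encode([1, 0, 1, 1])
cw[4] ^= 1                           # flip one bit in transit
assert hamming74_correct(cw) == [1, 0, 1, 1]
```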
Paper
• A. G. Dimakis, V. Prabhakaran and K. Ramchandran, "Ubiquitous Access to Distributed Data in Large-Scale Sensor Networks through Decentralized Erasure Codes", IPSN '05
Coding for storage
• A few nodes create data (k nodes) • All n nodes are used for storage • Each node stores only one piece of data
• We want to recover data by asking any k storage nodes
Distributed Random Linear Code
• Each data node sends its data to m = c·ln k random other nodes
• A storage node gets multiple pieces of data c1, c2, …, but it stores only a random linear combination of them
• a1c1 + a2c2 + … • Each ai is a random number
Coding and decoding • Storage size almost the same as before • Each item c is big (say, a file of fixed length) • The coefficients a are stored too, but take little space compared to c
• We ask k nodes to send their stored data s and coefficient vector A = (a1, a2, …)
• We want to recover the original data • Each node has created a linear equation – a1c1 + a2c2 + … + akck = s – We need k linearly independent equations to solve for c1…ck
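The whole scheme can be sketched end to end (the sizes k, n, the constant c = 3, and the prime modulus P are illustrative choices; the paper works over a finite field, here taken to be GF(P)):

```python
import math
import random

P = 2_147_483_647  # prime modulus: coefficients and data live in the field GF(P)

def solve_mod(A, b, p=P):
    """Solve the k x k system A x = b over GF(p) by Gauss-Jordan; None if singular."""
    k = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(k):
        piv = next((r for r in range(col, k) if M[r][col] % p != 0), None)
        if piv is None:
            return None                       # equations not independent
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], p - 2, p)      # modular inverse via Fermat's little theorem
        M[col] = [v * inv % p for v in M[col]]
        for r in range(k):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(v - f * w) % p for v, w in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]

k, n = 4, 12                       # k data nodes, n storage nodes
data = [random.randrange(P) for _ in range(k)]
m = math.ceil(3 * math.log(k))     # each data node contacts m = c*ln k storage nodes

# Distribution: storage node j keeps coefficient coeffs[j][i] for data item i
# (0 if item i never reached node j) and stores s[j] = sum_i coeffs[j][i] * data[i].
coeffs = [[0] * k for _ in range(n)]
for i in range(k):
    for j in random.sample(range(n), m):
        coeffs[j][i] = random.randrange(1, P)
stored = [sum(a * d for a, d in zip(row, data)) % P for row in coeffs]

# Decoding: query random sets of k storage nodes until the k equations are independent
# (by the theorem on the next slide, one random set already works with high probability).
while True:
    picked = random.sample(range(n), k)
    recovered = solve_mod([coeffs[j] for j in picked], [stored[j] for j in picked])
    if recovered is not None:
        break
assert recovered == data
```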
Decoding
• Remember that each data node stored data at O(ln k) storage nodes
• We take k random nodes in the network and ask them for s and A
• Theorem: with high probability, the k equations will be linearly independent
How much communication is necessary?
• Lower bound: Ω(ln k) storage locations per data node are necessary
• Proof:
– Every storage node has to hold at least one piece of data, otherwise it contributes a zero equation
– Throw data randomly to cover all storage nodes
– Coupon collector bound: Θ(n ln n) throws are needed to hit all n storage nodes
– If k ≈ n, then about ln n throws per data node are necessary
The coupon collector bound
• There are n different types of coupons randomly inside cereal boxes
• How many boxes do you need to buy to collect all n types of coupons?
• Probability that the first box gives a new coupon is 1
• The second box gives a new coupon with probability (n−1)/n
• The third box with probability (n−2)/n
• In general, when i types have been collected, the expected number of boxes until the next new coupon is n/(n−i)
• So the expected total number of boxes is
– n(1 + 1/2 + 1/3 + 1/4 + … + 1/n) ≈ n ln n
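The bound can be checked by simulation (n = 50 and the trial count are arbitrary choices for the sketch):

```python
import random

def boxes_to_collect_all(n):
    """Buy boxes containing a uniformly random coupon type until all n types are seen."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        boxes += 1
    return boxes

n, trials = 50, 200
avg = sum(boxes_to_collect_all(n) for _ in range(trials)) / trials
expected = n * sum(1 / i for i in range(1, n + 1))   # n * H_n ≈ n ln n, about 225 for n = 50
```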
General network protocol
• Each node sends data to O(ln n) random nodes
• In a grid, the cost per send is O(√n) • Total cost is O(n^(3/2) ln n)
Storage on perimeter
• Potential users are likely to be outside the network, so it is easier for them to query perimeter nodes
Advantages
• Robust to errors – any k good pieces suffice • Fault tolerant • No centralized processing • Resilient to packet loss • Privacy: if the adversary does not know the coefficients, they cannot recover the data.
• For example, we can generate the coefficients with pseudorandom number generators
Disadvantages
• All k pieces of data always have to be collected and decoded, even if we do not want them all
• Storage at far-away nodes is inefficient