Lecture 10: Coding and Storage (fu-berlin.de)
TRANSCRIPT
Purpose of Coding
• Compression (source coding)
• Reduce transmission errors (channel coding)
Rik Sarkar. FU Berlin. Winter '11 -‐ '12
Source coding
• Compression • What is the smallest number of bits to represent the given data?
• Depends on entropy: information theory
Channel Coding
• Data is transmitted through a noisy channel; some information (bits) may get changed – wireless transmission, shouting across rooms – stored on a hard drive – written on paper…
• Idea: store some additional information so that some changed bits can be detected
Run length encoding (source coding)
• Suppose R, G, B are our colors • A picture has blue sky • Instead of storing the color of each pixel, we can say: blue for the next 800 pixels, green (grass) for the next 1200 pixels, etc.
• Used in LZW algorithms, GIF, ZIP…
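The blue-sky example above can be sketched as follows (a minimal sketch; 800 and 1200 are the pixel counts from the slide):

```python
def rle_encode(pixels):
    """Collapse runs of equal values into (value, run_length) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([p, 1])       # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]

row = ["blue"] * 800 + ["green"] * 1200
encoded = rle_encode(row)             # [("blue", 800), ("green", 1200)]
```

Instead of 2000 stored colors, we store two pairs; the scheme only pays off when the data actually contains long runs.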
Huffman Codes (Source Coding)
• Input: frequencies of occurrence of symbols
E 50
A 42
B 30
D 16
F 12
C 10
Huffman Codes
• Take the two least common items and make a virtual node with the total frequency
[Tree diagram: the two least frequent symbols, F (12) and C (10), form a virtual node of frequency 22]
Huffman Codes
• Repeat
[Tree diagram: D (16) and the 22-node form a virtual node of frequency 38]
Huffman Codes
• Repeat
[Tree diagram: B (30) and the 38-node form a virtual node of frequency 68]
Huffman Codes
• Repeat
[Tree diagram: E (50) and A (42) form a virtual node of frequency 92]
Huffman Codes
• Repeat
[Tree diagram: the 68-node and the 92-node form the root, with total frequency 160]
Huffman Codes
• Label the edges: left = 0, right = 1
• Reading root-to-leaf paths gives the codes: E = 00, D = 110; the string "ABC" encodes as 01101111
[Tree diagram with edges labelled: left = 0, right = 1]
Huffman Codes
• E = 00, D = 110, "ABC" = 01101111
• This is a prefix code: no codeword is a prefix of another, so it is easy to decode
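The construction above can be sketched in Python using a min-heap (a minimal sketch; the exact bit patterns depend on how ties and left/right children are assigned, but the code lengths for this frequency table are determined):

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman tree and return {symbol: codeword} (left edge = 0, right = 1)."""
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)                  # tie-break counter for the heap
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent nodes...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))  # ...become one virtual node
        next_id += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal (virtual) node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

freqs = {"E": 50, "A": 42, "B": 30, "D": 16, "F": 12, "C": 10}
codes = huffman_codes(freqs)
avg_bits = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
# E, A, B get 2-bit codes; D gets 3; F and C get 4 -> 2.375 bits/symbol on average
```

The average of 2.375 bits/symbol beats the 3 bits that a fixed-length code for six symbols would need.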
Why is it good?
• Gives short codes to frequent items, long codes to infrequent items
• Fewer bits on average
Huffman code is optimal
• Lemma: there is an optimal code tree in which the two symbols with smallest frequency are at the lowest level.
• Proof: otherwise we can swap their positions with deeper, more frequent symbols and save bits. Induction with this idea shows optimality.
How do we use Huffman codes?
• At the beginning of the file, we store the tree in a standard format (e.g. as ASCII)
• Then store the data using the Huffman code • Useful when the data is large • Or, have a standard Huffman code, for example based on the typical frequencies of letters in a language, and use that code everywhere.
Relation to information theory
• Entropy: measure of information content: the smallest number of bits necessary to store the data
• Entropy: measure of surprise
– Predictable => less information
• 1111111111111111….
• 001001001001001….
• "The sun rose in the east"
– Unpredictable/unusual/less frequent => more information
• 101001010111010000101001101111
• Random bit strings have higher entropy: harder to compress
• "The sun rose in the north"
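The entropy of a symbol distribution can be computed directly (a minimal sketch; note that per-symbol entropy ignores structure between symbols, so a periodic string like 001001… still shows nonzero symbol entropy even though it is predictable):

```python
from collections import Counter
from math import log2

def entropy(s):
    """Empirical Shannon entropy of a string, in bits per symbol."""
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

def entropy_dist(freqs):
    """Entropy of a frequency table {symbol: count}."""
    total = sum(freqs.values())
    return -sum((f / total) * log2(f / total) for f in freqs.values())

entropy("1111111111111111")   # 0.0: perfectly predictable, no information
H = entropy_dist({"E": 50, "A": 42, "B": 30, "D": 16, "F": 12, "C": 10})
# H is about 2.35 bits/symbol: no symbol code can average fewer bits,
# and Huffman's 2.375 bits/symbol for this table comes close to it
```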
Coding for fault tolerance and distributed storage
• Store data from one node at other nodes • If the node is turned off, we still have the data • Not everyone needs to reach node x to get x's data
• Simple duplication is expensive and inefficient • Use coding to keep storage small
Coding for Fault tolerance
• Encoding – Input: string of length k: "data" – Output: string of length n > k: "codeword"
• Decoding – Input: string of length n (may be corrupted) – Output: original data of length k
Error detection and correction
• Error detection: decide if a string is a valid codeword
• Correction: map it to a valid codeword
• Maximum likelihood decoding: find the codeword at the least Hamming distance – the one that can be reached with the fewest bit flips
• For small codewords, we can store a table of all possible cases
• NP-hard in general
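Maximum likelihood decoding over a small codebook can be sketched by brute force; the 3-bit repetition code used here is an illustrative choice, not from the slides:

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def ml_decode(received, codebook):
    """Maximum-likelihood decoding: the valid codeword needing the fewest bit flips."""
    return min(codebook, key=lambda c: hamming_distance(received, c))

codebook = ["000", "111"]        # 3-bit repetition code (illustrative)
ml_decode("010", codebook)       # "000": one flip away, vs two flips to reach "111"
```

This brute force is exactly the "table for all possible cases" idea; it scales exponentially with codeword length, which is why the general problem is NP-hard.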
Parity
• If the number of “1”s in the data is even, parity bit is 0, otherwise, it is 1 – 01101010, 11010111 – Number of 1s in codeword is always even
• Can detect 1-bit errors but cannot correct them: we do not know which bit got flipped
• Cannot detect 2-bit errors
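The even-parity scheme can be sketched as:

```python
def add_parity(bits):
    """Append an even-parity bit so the codeword has an even number of 1s."""
    return bits + [sum(bits) % 2]

def parity_ok(codeword):
    """A codeword is valid iff its total number of 1s is even."""
    return sum(codeword) % 2 == 0

w = add_parity([0, 1, 1, 0, 1, 0, 1, 0])   # slide example 01101010: four 1s -> parity bit 0
```

Two flips cancel each other out in the parity sum, which is exactly why 2-bit errors go undetected.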
Hamming Distance
• Generalize parity to correct errors • If any two valid codewords are at Hamming distance at least k, then we can detect (k−1)-bit errors and correct ⌊(k−1)/2⌋-bit errors
• Hamming code (7,4): adds 3 check bits to 4 data bits to correct any 1-bit error (the extended (8,4) version, with one extra overall parity bit, also detects any 2-bit error).
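The Hamming (7,4) code can be sketched as follows (a minimal sketch with parity bits at positions 1, 2 and 4, counted from 1; the syndrome then directly names the flipped position):

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword; parity bits sit at positions 1, 2, 4."""
    c = [0] * 8                      # index 0 unused so indices match 1-based positions
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]        # covers positions with binary bit 1 set: 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]        # covers positions with binary bit 2 set: 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]        # covers positions with binary bit 4 set: 4,5,6,7
    return c[1:]

def hamming74_correct(w):
    """Correct up to one flipped bit in a 7-bit word and return the 4 data bits."""
    c = [0] + list(w)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s4  # 0 if no error, else the 1-based error position
    if syndrome:
        c[syndrome] ^= 1
    return [c[3], c[5], c[6], c[7]]

cw = hamming74_encode([1, 0, 1, 1])
cw[4] ^= 1                           # flip one bit in transit
assert hamming74_correct(cw) == [1, 0, 1, 1]
```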
Paper
• A. G. Dimakis, V. Prabhakaran and K. Ramchandran, "Ubiquitous Access to Distributed Data in Large-Scale Sensor Networks through Decentralized Erasure Codes", IPSN '05
Coding for storage
• A few nodes create data (k nodes) • All n nodes are used for storage • Each node stores only one piece of data
• We want to recover data by asking any k storage nodes
Distributed Random Linear Code
• Each data node sends its data to m = c·ln k random other nodes
• A storage node gets multiple pieces of data c1, c2, …, but it stores only a random linear combination of them
• a1c1 + a2c2 + … • Each ai is a random number
Coding and decoding • Storage size almost the same as before • Each item c is big (say, a file of fixed length) • The coefficients a are stored too, but take little space compared to c
• We ask k nodes to send their stored data s and coefficient vector A = (a1, a2, …)
• We want to recover the original data • Each node has created a linear equation – a1c1 + a2c2 + … + akck = s – We need k linearly independent equations to solve for c1…ck
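The whole scheme can be sketched end to end (the sizes k, n, the constant c = 3, and the prime modulus P are illustrative choices; the paper works over a finite field, here taken to be GF(P)):

```python
import math
import random

P = 2_147_483_647  # prime modulus: coefficients and data live in the field GF(P)

def solve_mod(A, b, p=P):
    """Solve the k x k system A x = b over GF(p) by Gauss-Jordan; None if singular."""
    k = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(k):
        piv = next((r for r in range(col, k) if M[r][col] % p != 0), None)
        if piv is None:
            return None                       # equations not independent
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], p - 2, p)      # modular inverse via Fermat's little theorem
        M[col] = [v * inv % p for v in M[col]]
        for r in range(k):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(v - f * w) % p for v, w in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]

k, n = 4, 12                       # k data nodes, n storage nodes
data = [random.randrange(P) for _ in range(k)]
m = math.ceil(3 * math.log(k))     # each data node contacts m = c*ln k storage nodes

# Distribution: storage node j keeps coefficient coeffs[j][i] for data item i
# (0 if item i never reached node j) and stores s[j] = sum_i coeffs[j][i] * data[i].
coeffs = [[0] * k for _ in range(n)]
for i in range(k):
    for j in random.sample(range(n), m):
        coeffs[j][i] = random.randrange(1, P)
stored = [sum(a * d for a, d in zip(row, data)) % P for row in coeffs]

# Decoding: query random sets of k storage nodes until the k equations are independent
# (by the theorem on the next slide, one random set already works with high probability).
while True:
    picked = random.sample(range(n), k)
    recovered = solve_mod([coeffs[j] for j in picked], [stored[j] for j in picked])
    if recovered is not None:
        break
assert recovered == data
```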
Decoding
• Remember that each data node stored data at O(ln k) storage nodes
• We take k random nodes in the network and ask them for s and A
• Theorem: with high probability, the k equations will be linearly independent
How much communication is necessary?
• Lower bound: Ω(ln k) storage locations per data node are necessary
• Proof:
– Every storage node has to hold at least one piece of data, otherwise it contributes a zero equation
– Throw data randomly to cover all storage nodes
– Coupon collector bound: Θ(n ln n) throws are needed to hit all n storage nodes
– If k ≈ n, then about ln n throws per data node are necessary
The coupon collector bound
• There are n different types of coupons randomly inside cereal boxes
• How many boxes do you need to buy to collect all n types of coupons?
• Probability that the first box gives a new coupon is 1
• The second box gives a new coupon with probability (n−1)/n
• The third box with probability (n−2)/n
• In general, when i types have been collected, the expected number of boxes until the next new coupon is n/(n−i)
• So the expected total number of boxes is
– n(1 + 1/2 + 1/3 + 1/4 + … + 1/n) ≈ n ln n
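The bound can be checked by simulation (n = 50 and the trial count are arbitrary choices for the sketch):

```python
import random

def boxes_to_collect_all(n):
    """Buy boxes containing a uniformly random coupon type until all n types are seen."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        boxes += 1
    return boxes

n, trials = 50, 200
avg = sum(boxes_to_collect_all(n) for _ in range(trials)) / trials
expected = n * sum(1 / i for i in range(1, n + 1))   # n * H_n ≈ n ln n, about 225 for n = 50
```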
General network protocol
• Each node sends data to O(ln n) random nodes
• In a grid, the cost per send is O(√n) • Total cost is O(n^(3/2) ln n)
Storage on perimeter
• Potential users are likely to be outside the network, so it is easier for them to query perimeter nodes
Advantages
• Robust to errors – any k good pieces suffice • Fault tolerant • No centralized processing • Resilient to packet loss • Privacy: if the adversary does not know the coefficients, they cannot recover the data.
• For example, we can generate the coefficients with pseudorandom number generators
Disadvantages
• All k pieces of data always have to be collected and decoded, even if we do not want them all
• Storage at far-away nodes is inefficient