management information systems lection 06 archiving information clark university college of...

Management Information Systems

Lecture 06: Archiving information

CLARK UNIVERSITY

College of Professional and Continuing Education (COPACE)

Plan

• Coding of numeric information

• Coding of textual information

• Coding of graphical information

• Archiving of information

• Shannon-Fano coding

• Huffman coding

Basic terms

• Coding is the conversion of a message into a code, that is, into a set of symbols transmitted over a communication channel.

Coding of numeric information

• Binary encoding, used in computing, represents data as a sequence of two characters: 0 and 1.

• These characters are called binary digits, or bits for short.

Coding of numeric information

One bit can represent two values: 0 or 1 (yes or no, true or false, etc.). If the number of bits is increased to two, we can represent four different values:

00 01 10 11

Three bits can encode eight different values:

000 001 010 011 100 101 110 111

Coding binary data

The general formula is:

$N = 2^i$

where N is the number of distinct values that can be encoded and i is the number of bits in the binary code.
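
The formula can be checked with a short Python sketch. The function names `distinct_values` and `all_codes` are illustrative assumptions, not part of the lecture; the sketch counts and lists every code of a given bit length.

```python
from itertools import product


def distinct_values(bits: int) -> int:
    """Number of different values that `bits` binary digits can encode (N = 2^i)."""
    return 2 ** bits


def all_codes(bits: int) -> list[str]:
    """Every binary code of the given length, in counting order."""
    return ["".join(code) for code in product("01", repeat=bits)]


print(all_codes(2))  # the four two-bit codes
for i in (1, 2, 3, 8, 16, 24):
    print(i, distinct_values(i))
```

For any bit length, the number of codes returned by `all_codes` equals `distinct_values`, which is the content of the formula.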

Coding of binary integers

Principle: the integer is repeatedly divided by two until the quotient is zero or one. The remainders of the divisions, written from right to left and preceded by the final quotient, form the binary equivalent of the decimal number.

Example

19 : 2 = 9, remainder 1

9 : 2 = 4, remainder 1

4 : 2 = 2, remainder 0

2 : 2 = 1, remainder 0

So, $19_{10} = 10011_2$
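
The repeated-division principle can be sketched in Python as follows. The function name `to_binary` is an assumption made for illustration; the loop keeps dividing until the quotient reaches zero, which gives the same digits as stopping at a quotient of one and writing that quotient first.

```python
def to_binary(n: int) -> str:
    """Convert a non-negative decimal integer to a binary string by repeated division by 2."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, remainder = divmod(n, 2)  # quotient and remainder of division by two
        remainders.append(str(remainder))
    # The first remainder is the rightmost digit, so reverse the collected list.
    return "".join(reversed(remainders))


print(to_binary(19))  # 10011
```

The result can be compared against Python's built-in `bin(19)`, which returns `'0b10011'`.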

Coding of binary integers

• Eight bits are enough to encode the integers from 0 to 255.

• 16-bit coding is used for integers from 0 to 65535.

• 24 bits can encode more than 16.7 million different values (16777216).

Coding of textual information

• If each letter of the alphabet is assigned a certain integer, then binary code can be used to encode textual information.

• Eight bits are sufficient to encode 256 different characters.


The American National Standards Institute (ANSI) introduced the ASCII encoding system (American Standard Code for Information Interchange).


• ASCII has two encoding tables: the basic one (characters with codes 0-127) and the extended one (codes 128-255).
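
As a small illustration of mapping characters to integers and then to bits, here is a Python sketch. The function name `text_to_bits` is an assumption; it handles only the basic table (codes 0-127) and writes each code as eight bits.

```python
def text_to_bits(text: str) -> list[str]:
    """Encode basic-ASCII text and return the 8-bit binary code of each character."""
    # str.encode("ascii") raises UnicodeEncodeError for characters outside codes 0-127.
    return [format(code, "08b") for code in text.encode("ascii")]


print(ord("H"), ord("i"))  # 72 105
print(text_to_bits("Hi"))  # ['01001000', '01101001']
```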

The extended ASCII character set

Windows 1251 character set


• Several competing encodings are in use because an 8-bit code set is limited to 256 characters.

• The character set based on 16-bit character encoding is called universal: UNICODE.

• A 16-bit encoding provides unique codes for 65536 different characters.

• For a long time, the transition to this system was held back by insufficient computing resources.
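
The effect of competing 8-bit tables can be seen in a short Python sketch. It decodes the same byte with Windows 1251 and, as an assumed second table chosen only for contrast, with Latin-1; the byte value 0xE0 is likewise just an example.

```python
raw = bytes([0xE0])  # one byte from the extended range 128-255

as_cp1251 = raw.decode("cp1251")   # Cyrillic small letter a
as_latin1 = raw.decode("latin-1")  # Latin small letter a with grave accent

print(as_cp1251, as_latin1)        # two different characters for the same byte

# In a 16-bit encoding each of these characters has its own unique two-byte code.
print(as_cp1251.encode("utf-16-be").hex())  # 0430
print(as_latin1.encode("utf-16-be").hex())  # 00e0
```

The same byte is read as two different letters depending on the table, while the 16-bit codes do not collide.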

Coding of graphical information

• A graphic image is made up of tiny dots (pixels) that form a grid called a raster.

Example

• Sevenfold magnification


• Pixels with only two possible colors (black and white) can be encoded by two numbers: 0 or 1. So only 1 bit per pixel is needed.

• For grayscale illustrations, coding with 256 shades of gray is generally accepted. How many bits do we need then?
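
One way to answer the question is to look for the smallest number of bits i such that 2^i is at least the number of shades. The sketch below does this in Python; the function name `bits_needed` is an assumption for illustration.

```python
def bits_needed(levels: int) -> int:
    """Smallest number of bits i such that 2**i >= levels."""
    if levels < 1:
        raise ValueError("at least one level is required")
    # (levels - 1).bit_length() is the bit count of the largest code, levels - 1.
    return (levels - 1).bit_length()


print(bits_needed(2))    # 1 bit for black and white
print(bits_needed(256))  # 8 bits for 256 shades of gray
```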

Example


• The color image on the screen is obtained by mixing three primary colors: red (Red), green (Green) and blue (Blue).


• When color images are encoded, any color is decomposed into its basic components.

• Such a coding system is called RGB.

• If 256 values (8 bits) are used to encode each of the main color components, then the system provides 16777216 different colors.
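
The color count follows from three components with 256 values each: 256 · 256 · 256 = 2^24 = 16777216. A minimal Python sketch, assuming the common layout in which red occupies the highest byte (the names `pack_rgb` and `unpack_rgb` are hypothetical):

```python
def pack_rgb(red: int, green: int, blue: int) -> int:
    """Pack three 8-bit components into one 24-bit integer (red in the high byte)."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each component must be in the range 0-255")
    return (red << 16) | (green << 8) | blue


def unpack_rgb(color: int) -> tuple[int, int, int]:
    """Split a 24-bit integer back into its (red, green, blue) components."""
    return (color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF


print(256 ** 3)                 # 16777216 possible colors
print(pack_rgb(255, 255, 255))  # 16777215, the largest 24-bit code (white)
```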

Archiving of information

• Data archiving is the process of converting the information stored in a file into a form that reduces redundancy in its representation and thus requires less storage space.


• Archiving (packing) is the placement of source files into an archive file in a compressed format.

• Decompression (unpacking) is the process of recovering files from the archive in the exact form they had before archiving.
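
Packing and exact recovery can be tried with Python's standard `zlib` module. This is only a sketch on in-memory data, not a full archiver; the sample bytes are an assumption chosen to be highly redundant.

```python
import zlib

original = b"ABABABAB" * 100          # 800 bytes of highly redundant data

packed = zlib.compress(original)      # packing: redundancy is removed
restored = zlib.decompress(packed)    # unpacking: the exact original comes back

print(len(original), len(packed))     # the packed form is much shorter
print(restored == original)           # True
```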


The aims:

• storage in a more compact form on the disk

• reduction of the time (or cost) of transmitting information through communication channels

• simplification of transferring files from one computer to another

• protection from unauthorised access


• One of the first archiving methods was proposed in 1844 by Samuel Morse in his Morse code system.

• Frequent characters are coded by shorter sequences.


• In the 1940s, Shannon, the founder of modern information theory, and independently of him Fano developed a universal algorithm for constructing near-optimal codes. A related algorithm, which always yields an optimal prefix code, was later proposed by Huffman.

• The principle of this algorithm is the encoding of frequently occurring characters by shorter sequences of bits.


• In the 1970s, Lempel and Ziv proposed the LZ77 and LZ78 algorithms; LZW, a variant by Welch, followed in 1984.

• These algorithms find repeated sequences and replace them with numeric references into a dynamically generated dictionary.

• Most modern archivers (WinRAR, WinZip) are based on variations of the Lempel-Ziv algorithm.
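
The dictionary idea can be illustrated with a minimal LZW-style compressor in Python. The function name `lzw_compress` and the choice to start from the 256 single-byte characters are assumptions for illustration; real archivers use considerably more elaborate variants.

```python
def lzw_compress(text: str) -> list[int]:
    """Replace repeated sequences with codes from a dictionary built on the fly."""
    # Assumption: the input uses only characters with codes 0-255.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    output = []
    for char in text:
        candidate = current + char
        if candidate in dictionary:
            current = candidate                  # keep extending a known sequence
        else:
            output.append(dictionary[current])   # emit the code of the known part
            dictionary[candidate] = next_code    # remember the new sequence
            next_code += 1
            current = char
    if current:
        output.append(dictionary[current])
    return output


print(lzw_compress("ABABABA"))  # [65, 66, 256, 258]
```

Seven input characters are written as four codes, because "AB" and "ABA" are entered into the dictionary the first time they occur and reused afterwards.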


The compression coefficient is

$$K_c = \frac{V_c}{V_r} \cdot 100\%$$

where $K_c$ is the compression coefficient, $V_c$ is the volume of the compressed file, and $V_r$ is the volume of the source file.

The degree of compression depends on the archiving program, the method and the type of source file.
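
A small Python sketch of the coefficient as compressed volume divided by source volume, in percent. The function name `compression_coefficient` and the sample sizes are assumptions for illustration.

```python
def compression_coefficient(compressed_size: int, source_size: int) -> float:
    """K_c = V_c / V_r * 100%, the compressed volume as a percentage of the source volume."""
    if source_size <= 0:
        raise ValueError("the source file must not be empty")
    return compressed_size / source_size * 100


# A hypothetical 1000-byte file packed into 250 bytes.
print(compression_coefficient(250, 1000))  # 25.0
```

With this definition a smaller value means better compression, and a value close to 100% means the file barely shrank.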


• The degree of compression for graphical, text and data files is 5-40%.

• The degree of compression for executable files is 60-90%.

• The degree of compression for already archived files is 90-100%.


• A self-extracting archive is an executable module that can unpack the files it contains without a separate archiver.

• Big archive files can be divided into several volumes.

Shannon-Fano coding

1. Develop a list of probabilities or frequency counts.

2. Sort the list of symbols according to frequency.

3. Divide the list into two parts, with the total frequency counts of the left part being as close to the total of the right as possible.

4. The left part of the list is assigned the binary digit 0, and the right part is assigned the digit 1.

5. Recursively apply the steps 3 and 4 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has a code.
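
The five steps can be sketched in Python as follows. The function name `shannon_fano`, the descending sort order and the sample frequency counts are assumptions made for illustration, not data from the lecture.

```python
def shannon_fano(frequencies: dict[str, int]) -> dict[str, str]:
    """Build Shannon-Fano codes from a table of symbol frequency counts (step 1)."""
    # Step 2: sort the symbols by frequency, most frequent first.
    symbols = sorted(frequencies, key=lambda s: frequencies[s], reverse=True)
    codes = {symbol: "" for symbol in symbols}

    def split(group: list[str]) -> None:
        if len(group) < 2:
            return  # a single symbol needs no further bits
        total = sum(frequencies[s] for s in group)
        # Step 3: find the cut that makes the two totals as close as possible.
        running, best_cut, best_diff = 0, 1, None
        for cut in range(1, len(group)):
            running += frequencies[group[cut - 1]]
            diff = abs(total - 2 * running)  # |right total - left total|
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = cut, diff
        left, right = group[:best_cut], group[best_cut:]
        # Step 4: the left part gets 0, the right part gets 1.
        for s in left:
            codes[s] += "0"
        for s in right:
            codes[s] += "1"
        # Step 5: repeat for each part.
        split(left)
        split(right)

    split(symbols)
    return codes


print(shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
```

In this example the first cut separates A and B (total 22) from C, D and E (total 17), so the more frequent symbols end up with two-bit codes.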

Huffman coding

Symbol  Code

a1      0

a2      10

a3      110

a4      111

Huffman coding

• A source generates 4 different symbols, each with its own probability.

• A binary tree is generated from left to right taking the two least probable symbols and putting them together to form another equivalent symbol having a probability that equals the sum of the two symbols.

• The process is repeated until there is just one symbol.

• The tree can then be read backwards, from right to left, assigning different bits to different branches.
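
The tree construction can be sketched in Python with a priority queue. The probabilities 0.5, 0.25, 0.125 and 0.125 are an assumption (the lecture text does not state them), chosen so that the resulting code lengths are 1, 2, 3 and 3; the function name `huffman_codes` is also hypothetical.

```python
import heapq
from itertools import count


def huffman_codes(probabilities: dict[str, float]) -> dict[str, str]:
    """Build Huffman codes by repeatedly merging the two least probable symbols."""
    tie_breaker = count()  # keeps heap entries comparable when probabilities are equal
    heap = [(p, next(tie_breaker), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p_first, _, first = heapq.heappop(heap)    # least probable group
        p_second, _, second = heapq.heappop(heap)  # next least probable group
        # Reading the tree backwards: each merge prepends one bit to every code below it.
        merged = {symbol: "0" + code for symbol, code in first.items()}
        merged.update({symbol: "1" + code for symbol, code in second.items()})
        heapq.heappush(heap, (p_first + p_second, next(tie_breaker), merged))
    return heap[0][2]


codes = huffman_codes({"a1": 0.5, "a2": 0.25, "a3": 0.125, "a4": 0.125})
print(codes)  # a1 -> 0, a2 -> 10, a3 -> 110, a4 -> 111
```

Which branch receives 0 and which receives 1 is a convention and depends on tie-breaking; the code lengths are what the algorithm fixes. With the assumed probabilities the average code length is 0.5·1 + 0.25·2 + 0.125·3 + 0.125·3 = 1.75 bits per symbol, compared with 2 bits for a fixed-length code.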
