
  • ISSN: 2289-7615 Page 1

    International Journal of Information System and Engineering

    www.ftms.edu.my/journals/index.php/journals/ijise

    Vol. 1(No.1), April, 2013

    Page : 01-11 ISSN: 2289-7615

    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Malayalam Text Compression

    Sajilal Divakaran
    School of Engineering and Computing Sciences, FTMS College, Kuala Lumpur, Malaysia

    Anjali C.
    University of Kerala, Thiruvananthapuram, Kerala, India 695581

    Biji C. L.
    University of Kerala, Thiruvananthapuram, Kerala, India 695581

    Achuthsankar S. Nair
    University of Kerala, Thiruvananthapuram, Kerala, India 695581

    Abstract

    In natural language processing and analysis, a very large number of problems remain unaddressed, particularly in Malayalam computing. For instance, the informational analysis of Malayalam text has not been widely studied. Language studies of English based on the concepts of information theory are well established, as evidenced by the success of text compression methods for English. However, to the best of our knowledge, no attempt at Malayalam text compression has been reported, even though Unicode-based Malayalam content is increasing in Malayalam blogs, Wikipedia and websites. The general motivation behind compression is the optimum use of resources such as data, storage space or transmission capacity.

    The availability of a standard Unicode script and of Google's online translation service on the internet has encouraged the use of Malayalam. Statistics from Malayalam Wikipedia indicate that Malayalam content has increased steadily since 2006. Moreover, searchable archives of Malayalam publications, including eBooks and journals, are likely to grow in the coming years. This makes it worthwhile to consider Malayalam text compression for the optimum use of resources. Every language has certain hidden, statistically significant features and a certain redundancy, and exploiting these features helps us to frame a suitable text compression tool. Motivated by language studies of English based on Shannon's theory, our framework proposes an informational analysis of Malayalam text. Every language structure imposes a certain bias on the input message: some characters are more likely to occur than others, so the symbols of a language in general follow an unequal probability distribution. Every compression algorithm tries to represent the input message in a new form with fewer bits by exploiting this probability distribution. The proposed Malayalam text compressor uses a variable-length encoding technique in which the most probable Unicode characters are represented by fewer bits. We derive a theoretical limit of about 21% for Malayalam text compression. A compression tool was developed using Java/J2EE with Apache Tomcat as the web server. Since no similar work has been reported, we created a small dataset from Malayalam blogs, Wikipedia and websites to test the performance of the tool. The proposed compressor, based on variable-length coding, achieved a compression ratio of 17% in the best case. The performance of the proposed algorithm is analysed in terms of percentage of compression and compression ratio.

    Keywords: Compression, Entropy coding, Natural language processing

    I. Introduction

    Malayalam is the mother tongue of about 3 crore (30 million) people residing in Kerala, the southern state of India. Historically, Malayalam is the youngest of the four major Dravidian languages spoken in South India, and it is the official language of Kerala. It is from the traditions of Sanskrit, the Indo-Aryan language, that Malayalam draws its rich diversity of words and compound alphabets (conjuncts). Malayalam is close to Tamil in phonology, morphology and syntax; the major feature that sets the two apart is the heavy Sanskrit borrowing in Malayalam [1]. It is only from the 8th century AD that Malayalam developed a literature independent of Tamil.

    Languages are carriers of communication. Computer technology has advanced so far that people can now convey messages and share their thoughts in their own mother tongue. It is commendable that the Kerala Government has given importance to Malayalam computing in information technology. This opens a new world of opportunities for many who are not proficient in English to get in touch with the wider world. Malayalam content started appearing on the internet during the early 2000s. The statistics of Malayalam Wikipedia content show a steady rise since 2006. According to the statistics published by Malayalam Wikipedia in April 2012 [2], there are nearly 24,000 articles.


    Figure 1.1: Wiki Malayalam Content Statistics

    [Figure 1.1 plots Malayalam content (vertical axis, 0 to 30000) against year.]

    The growth rate of Malayalam articles displayed in Figure 1.1 clearly indicates a need for Malayalam compression tools in the near future. Compression is required for the effective storage of information and for its smooth transmission over a channel. Compression is employed everywhere, from images on the web, which generally follow the JPEG or GIF standards, to audio files, which follow the MP3 standard. Moreover, several file systems automatically compress files when they are stored. The possibility of compression was first studied in detail for the English language by the American mathematician Claude Elwood Shannon. Shannon's seminal paper [3] stated that sequences of English text are not formed at random but follow a statistical structure. For example, 'e' occurs more frequently than 'q'. This structure can be exploited to achieve a smaller representation of the input file. With the same assumption, as a first step we took the frequency of occurrence of Malayalam characters from a study report [4]. In addition to the characters specified in the report, space and full stop were included. The informational analysis of Malayalam text is carried out by creating a dataset from popular Malayalam blogs and websites [5, 6]. To date, no state-of-the-art work is known to perform compression of Malayalam text. Hence no benchmarks exist for comparison with the proposed work.

    II. Entropy and Compressibility

    Communication is the process of sharing ideas, thoughts, facts and information between people. Languages developed as a means of effective communication. Irrespective of the diversity in human biological traits, every communication system follows a common process of transmitting a message from one point to another. The hidden statistical nature of the communication process was first recognized by Claude Elwood Shannon, who used mathematics to unify the theory [3]. In this famous work, Shannon emphasized that languages are not framed in a random manner; a specific pattern is followed in framing language. Much of the subsequent progress in digital technology, including connecting people through social networking sites, blogs and email, draws on Shannon's idea. The information content of a message is the amount of surprise it creates in us [7]; in other words, an unusual scenario carries more information than a usual one. Shannon defined the measure of information contained in a message based on the probability of each symbol in it. Suppose there are n symbols {a_1, a_2, ..., a_n} emanating independently of each other from a source, with probabilities {p_1, p_2, ..., p_n} respectively. Then the information content of any message of size k made out of these symbols is given by

    $$I = -\sum_{i=1}^{k} \log_2 p_i \qquad (1)$$

    where the sum runs over the k symbols of the message and p_i is the probability of the i-th of them. For example, the information content of an English phrase such as "vande matharam" can be computed, using the standard probabilities of occurrence of English letters [7], as 54.79 bits. The symbol a_i, which has probability p_i, is expected to occur n*p_i times in the whole message (n here being the length of the message). Thus the total information I_T of the message is given by

    $$I_T = -\sum_{i} (n \, p_i) \log_2 p_i \qquad (2)$$

    and the average information per symbol is the information entropy H, given by

    $$H = \frac{I_T}{n} = -\frac{1}{n} \sum_{i} (n \, p_i) \log_2 p_i = -\sum_{i} p_i \log_2 p_i \qquad (3)$$
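Equations (1) and (3) can be tried out on a small example. The following is a minimal sketch, not the authors' implementation: the function names and the four-symbol toy distribution are assumptions chosen only so that the arithmetic is easy to check by hand.

```python
import math
from typing import Dict


def information_content(message: str, probs: Dict[str, float]) -> float:
    """Equation (1): sum of -log2(p) over the symbols of the message, in bits."""
    return -sum(math.log2(probs[ch]) for ch in message)


def entropy(probs: Dict[str, float]) -> float:
    """Equation (3): H = -sum(p_i * log2(p_i)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Hypothetical four-symbol source (not taken from the paper's data).
toy = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

print(entropy(toy))                      # 1.75 bits per symbol
print(information_content("abad", toy))  # 1 + 2 + 1 + 3 = 7.0 bits
```

For a uniform distribution over N symbols the same `entropy` function returns log2(N), which is the fixed-length figure the text compares against.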

    Entropy is a measure of uncertainty. The significance of information entropy is that it tells us the minimum average number of bits per symbol required to encode the message digitally. Thus entropy provides a lower bound for the best possible lossless compression strategy. Intuitively, entropy reveals the extent to which a message can be compressed. Shannon used the English language to define a measure of information [3]. The number of bits required to represent English text, if all letters and the space are considered equally probable, is log2(27) = 4.75 bits. In practice, different letters of the English alphabet have different occurrence rates. English text can be compressed to 42.55% by taking advantage of this redundancy [7]. A similar approach can be adapted to Malayalam.

    In the experimental analysis, the frequency of occurrence of Malayalam characters is taken from the report [4]. In addition, we considered the frequency of occurrence of space and full stop and computed the entropy. The total character set under consideration for our study is thus limited to 125 (Appendix Table 1). For Malayalam, the number of bits required per character if all letters, space and full stop are considered equally probable is log2(125) = 6.97 bits, and the entropy calculated from the frequencies of occurrence (Appendix Table 1) is 5.47 bits. The percentage of compressibility for Malayalam is therefore (6.97 - 5.47)/6.97, or about 21%. Thus Malayalam text can theoretically be compressed by almost 21%. In practice, algorithm overheads make the achievable compression somewhat smaller. Based on these observations we created a simple statistical Malayalam text compression tool.

    III. Malayalam Text Compressor

    The general motivation for developing a compression tool is the effective use of resources. Compression also makes file transfer easier and faster. We provide a prototype compressor and decompressor for Malayalam. The development of any compression tool has two main stages: (i) an encoding algorithm that takes a message and generates a new compressed representation with fewer bits, and (ii) a decoding algorithm that reconstructs the original message from the compressed representation [8]. Encoding forms the heart of any compression algorithm. Encoding is of two types: (i) fixed-length encoding and (ii) variable-length encoding. By taking advantage of a probabilistic model, a variable-length code gives a more compact representation. This helps to reduce the storage requirement of files. Figure 3.1 shows the schematic representation of the proposed algorithm.

    Figure 3.1: Schematic Representation of proposed Malayalam Text Compressor

    [Figure 3.1 blocks, in order: Malayalam Text, Unicode Translation, Variable Length Encoding, Compressed Data.]

    The input to the compressor is a Malayalam text file in UTF-8 encoding. The Malayalam alphabet includes vowels (svaram), consonants (vyanjanam) and chillus. In our experiments, we selected a total of 125 Malayalam characters. Since our main intention is to develop a prototype, no separate study was conducted to determine the probabilities of Malayalam characters; we used the results of the study report [4] to find the entropy of Malayalam. Similar studies of English show that the space has a frequency of occurrence slightly higher than the most frequent letter 'e', and that punctuation (here the full stop only) takes fourth place. The same is assumed in our work. The first step is to convert the Malayalam sequence to be compressed into the corresponding Unicode code points (Appendix Table 2). Unicode assigns a unique number to every character in use; the Malayalam code points fit within 16 bits. It is a standard used for storing and transmitting documents in natural languages such as Malayalam, Spanish and Chinese. Based on the probability of occurrence of each Unicode character, a variable-length encoding is performed: the most probable characters are represented by shorter codes and the least probable by longer ones. We found that the 125 characters can be represented by codes of length 1 to 6 bits (Appendix Table 2). With a fixed-length code, all 125 characters can be represented using 7 bits each. In order to provide a realistic performance analysis, we compare our results with this 7-bit representation rather than with the 16-bit Unicode representation. The output of the compression algorithm includes the variable-length codes along with an overhead of 3-bit codes. The decompression algorithm takes the compressed file along with the overhead and performs the reverse operation to recover the Unicode Malayalam characters.

    IV. Results & Discussion

    In the absence of a standard dataset, we report the results of testing our compression algorithm on five selected web resources, chosen to ensure a mix of classical and modern writing. Textual content was taken from the following web resources [9, 10, 11, 12, 13]:

    Wiki Grandasala (contains classical poems and classical articles)

    Wiki esopkathakal (contains many Malayalam Aesop fables)

    Mini-minilokaam (a popular contemporary blog containing many short stories)

    Pattepedam ramji blogspot (a popular blog containing many stories)

    Mathrubhumi News (a popular Malayalam newspaper)

    The selected files range in size from 1 KB to 1 MB. In each case we chose 5 different text passages. The results are given in Table 4.1 and Figure 4.1.

    Table 4.1 Performance Measure of Proposed Algorithm

    Normally, 7 bits are needed to represent each character of the whole character set (125 characters). In the proposed method, a variable-length code is assigned based on the probability distribution. The percentage compression [7] for the proposed algorithm is calculated as follows.

    $$\%\,\text{Compression} = \frac{\text{Bits required before compression} - \text{Bits required after compression}}{\text{Bits required before compression}} \times 100$$

    The proposed compressor provides a best-case compression of 17%, which is about 80% (17/21) of the theoretical limit of 21% dictated by entropy. In the worst case, the proposed algorithm provides a compression of 8.4%. Figures 4.1.a, 4.1.b and 4.1.c show the performance measure for the various selected input files.
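The percentage-compression measure is simple enough to express directly. This is a hedged sketch: the function name, the constant name and the 1000-character passage with its 5810-bit output are hypothetical values made up for illustration, not measurements from the paper. Only the 7-bit fixed-length baseline comes from the text.

```python
def percentage_compression(bits_before: int, bits_after: int) -> float:
    """Percentage reduction in size relative to the uncompressed bit count."""
    if bits_before <= 0:
        raise ValueError("bits_before must be positive")
    return (bits_before - bits_after) / bits_before * 100.0


FIXED_BITS_PER_CHAR = 7  # baseline from the text: 7 bits cover 125 characters

# Hypothetical passage of 1000 characters whose compressed output is 5810 bits.
n_chars = 1000
bits_before = n_chars * FIXED_BITS_PER_CHAR  # 7000 bits
bits_after = 5810                            # illustrative value, not a measured result

print(round(percentage_compression(bits_before, bits_after), 2))  # 17.0
```

Measuring against the 7-bit baseline rather than 16-bit Unicode gives a smaller but more conservative percentage, which is the comparison the text describes.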

    Figure 4.1. a. Percentage Compression of Proposed Malayalam Compressor

    [Figure 4.1.a plots % compression (vertical axis, 0 to 12) for Input Files 1 to 5; Malayalam content: classical poems.]

    It was noticed that when the input files contain numbers and other symbols in addition to the 125 selected Malayalam characters, the percentage compression drops considerably. To analyse the performance, we selected some famous literary poems written by the poets Kumaranasan and Ulloor: Nalini, Leela, Karuna, ChandalaBikshuki and Bhakthi Deepika. In the worst case, the proposed compression algorithm provides a compression of 8.4%.

    Figure 4.1.b Percentage Compression of Proposed Malayalam Compressor

    [Figure 4.1.b plots % compression (vertical axis, 10.5 to 14.5) for Input Files 1 to 5; Malayalam content: short stories.]

    When the input files have fewer symbols and numbers, the percentage of compression improves to an average of 12.7%. For this analysis we used Malayalam content from Wiki esopkathakal, Mini-minilokaam and Pattepedam ramji blogspot. When we selected passages containing only the 125 Malayalam characters considered, the percentage of compression improved further to 17%. This Malayalam content was extracted from editorials of the online Mathrubhumi news.

    Figure 4.1.c Percentage Compression of Proposed Malayalam Compressor

    [Figure 4.1.c plots % compression (vertical axis, 15.5 to 18) for Input Files 1 to 5; Malayalam content: editorials.]

    A voluminous dataset needs to be compiled to conduct further studies. However, the present results are themselves unique, as no comparable results have been reported in the literature.

    V. Conclusion

    Malayalam text compression opens a fresh area of research. A study of Malayalam text compression was carried out and a prototype developed, perhaps for the first time. We estimated the entropy of Malayalam as 5.47 bits/character. The number of bits required to represent Malayalam text, if all letters, space and full stop are considered equally probable, is log2(125) = 6.97 bits. From this it can be concluded that Malayalam text can be compressed by at most about 21%. No standard dataset is available to date for Malayalam, and one needs to be developed for testing. As a future extension, we intend to develop a benchmark dataset for testing similar work. The dataset used for this work is taken from Malayalam Wikipedia and blogs. We obtained a compression ratio of 17% in the best case. The compression can be improved further by providing an adaptive transition table rather than a static one. Future enhancements include the realization of a Huffman-based Malayalam compressor and a more realistic compressor that handles both Malayalam and English alphabets along with numerals.

    References

    C.E. Shannon, (1948). “A Mathematical Theory of Communication”. The Bell System Technical Journal, Vol. 27, pp. 379-423.

    S. Prema and Manu Joseph, (2001). “Malayalam frequency count study report”, Department of Linguistics, University of Kerala.

    K. S. Arun and Achuthsankar S. Nair, (20 2). “It's 60 years since “kpb wcy xz” became more informative than ‘I love you’”. IEEE Potentials, Vol. 29, pp. 16-19.

    Salomon, (2004). “Data Compression: The Complete Reference”, Springer, pp. 1-14.

    Grantha, Vattezhuthu, Kolezhuthu, Malayanma, Devanagiri, Brahmi and Tamil alphabets, Available: http://c-radhakrishnan.info/alphabet.htm. Accessed on 20 Dec. 2012.

    Wiki, Available : http://ml.wikipedia.org/wiki, Accessed on 13 Jan. 2013.

    Thanimalayalam, Available: http://thanimalayalam.org/, Accessed on 17 Feb. 2013

    Malayalam blogkut, Available: http://malayalam.blogkut.com/, Accessed on 3 Dec. 2012.

    Malayalam Wiki Source, Available: http://ml.wikisource.org, Accessed on 11 Dec. 2012.

    Mini-kathakal, Available: http://mini-kathakal.blogspot.in/, Accessed on 23 Dec. 2012.

    Pattepadamramji, Available: http://pattepadamramji.blogspot.in/, Accessed on 2 Jan 2013.

    Malayalam, Available: http://org/wiki/esopkathakal/, Accessed on 10 Jan. 2013.

    Mathrubhumi, Available: http://www.mathrubhumi.com/, Accessed on 21 Feb. 2013.

    Appendix

    1. Entropy of Malayalam Characters

    Alphabet   Total Frequency Count   Occurrence Probability   -log2(p)

    അ 14311 0.00820 6.9302

    ആ 6724 0.00385 8.0209

    ഇ 6539 0.00375 8.0589

    ഈ 1109 0.00064 10.6096

    ഉ 3691 0.00212 8.8817

    ഊ 102 0.00006 14.0247

    ഋ 13 0.00001 16.6096

    എ 11366 0.00651 7.2631

    ഏ 1382 0.00079 10.3059

    ഐ 959 0.00055 10.8283

    ഒ 3258 0.00187 9.0627

    ഓ 1933 0.00111 9.8152

    ഔ 115 0.00007 13.8023

    അo 464 0.00027 11.8548

    ക 53088 0.03042 5.0388

    ഖ 2192 0.00126 9.6324

    ഗ 10640 0.00610 7.3570

    ഘ 1533 0.00088 10.1503

    ങ 649 0.00037 11.4002

    ച 8780 0.00503 7.6352

    ഛ 38 0.00002 15.6096

    ജ 10606 0.00608 7.3617

    ഝ 12 0.00001 16.6096

    ഞ 723 0.00041 11.2521

    ട 29255 0.01676 5.8988

    ഠ 698 0.00040 11.2877

    ഡ 8019 0.00460 7.7642

    ഢ 55 0.00003 15.0247

    ണ 17570 0.01007 6.6338


    ത 42772 0.02451 5.3505

    ഥ 987 0.00057 10.7768

    ദ 9439 0.00541 7.5302

    ധ 5058 0.00290 8.4297

    ന 47153 0.02702 5.2098

    പ 37563 0.02153 5.5375

    ഫ 5127 0.00294 8.4100

    ബ 8925 0.00511 7.6125

    ഭ 6811 0.00390 8.0023

    മ 42978 0.02463 5.3434

    യ 51155 0.02931 5.0925

    ര 46469 0.02663 5.2308

    റ 19509 0.01118 6.4829

    ല 26390 0.01512 6.0474

    ള 17082 0.00979 6.6745

    ഴ 6582 0.00377 8.0512

    വ 45964 0.02634 5.2466

    ശ 10617 0.00608 7.3617

    ഷ 9151 0.00524 7.5762

    സ 41359 0.02370 5.3990

    ഹ 7194 0.00412 7.9231

    ക്ക 28881 0.01655 5.9170

    ന്ന 23059 0.01321 6.2422

    ത്ത 18699 0.01072 6.5436

    ട്ട 10970 0.00629 7.3127

    പ്പ 9502 0.00545 7.5195

    ച്ച 9011 0.00516 7.5984

    ങ്ങ 7922 0.00454 7.7831

    ണ്ട 7379 0.00423 7.8851

    ന്റ 6074 0.00348 8.1667

    റ്റ 5175 0.00297 8.3953

    ന്ത 5065 0.00290 8.4297

    ലല 5050 0.00289 8.4347

    ക്ഷ 3899 0.00223 8.8087

    ഞ്ഞ 3082 0.00177 9.1420

    ള്ള 2934 0.00168 9.2173

    മ്പ 2911 0.00167 9.2259

    മ്മ 2777 0.00159 9.2968

    ങ്ക 2527 0.00145 9.4297

    സ്ഥ 2103 0.00121 9.6908

    ന്ദ 2050 0.00117 9.7393

    സ്റ്റ 1901 0.00109 9.8415

    ഞ്ച 1473 0.00084 10.2173

    യ്യ 1427 0.00082 10.2521

    ദ്ധ 1351 0.00077 10.3429

    ന്ധ 1049 0.00060 10.7028

    സ്സ 1047 0.00060 10.7028

    ണ്ണ 1023 0.00059 10.7270

    ദ്ദ 879 0.00050 10.9658

    ക്ത 876 0.00050 10.9658

    ത്ഥ 477 0.00027 11.8548

    ത്സ 447 0.00026 11.9092

    ശ്ശ 366 0.00021 12.2173

    ന്മ 364 0.00021 12.2173

    ത്മ 255 0.00015 12.7028

    വ്വ 249 0.00014 12.8023

    ജ്ഞ 147 0.00008 13.6096

    ച്ഛ 129 0.00007 13.8023

    ബ്ബ 124 0.00007 13.8023

    ഗ്ഗ 96 0.00006 14.0247

    പമ 88 0.00005 14.2877

    ന്ഥ 83 0.00005 14.2877

    ഗ്ന 78 0.00004 14.6096

    ത്ഭ 74 0.00004 14.6096

    ഹ്ന 51 0.00003 15.0247

    ണ്മ 37 0.00002 15.6096

    ഗ്മ 18 0.00001 16.6096

    ഡ്ഡ 9 0.00001 16.6096

    ല് 20226 0.00001 16.6096

    ള് 9647 0.00001 16.6096

    ര് 24929 0.01159 6.4310

    ണ് 2195 0.00553 7.4985

    ന് 14390 0.01429 6.1289

    ാ 85839 0.00126 9.6324

    ാ 139814 0.00825 6.9214

    ാ 13794 0.04919 4.3455

    ാ 89800 0.08012 3.6417

    ാ 11300 0.00648 7.2698

    ാ 2423 0.00139 9.4907

    ൊ 41620 0.02385 5.3899

    ോ 21948 0.01258 6.3127

    ൈാ 3787 0.00217 8.8481

    ൊ 5609 0.00321 8.2832


    ോ 25062 0.01436 6.1218

    ൊ 1591 0.00091 10.1018

    o 46418 0.02660 5.2324

    33 0.00002 15.6096

    18909 0.01084 6.5275

    15228 0.00873 6.8398

    3773 0.00216 8.8548

    ാ 2709 0.00155 9.3335

    ാ 80 0.00005 14.2877

    space 209721 0.12018 3.0567

    periods 76897 0.04407 4.5041

    Total 1745067 1.0000

    Entropy (-Σpi log pi) 5.47

    Table 1: Entropy of Malayalam Characters

    Total number of characters = 125

    log2(125) = 6.97 bits

    % of Compressibility = (6.97 - 5.47)/6.97 = 21.58%
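The compressibility figure can be written out step by step. The block below is only a worked restatement of the numbers in Table 1; the symbols N (alphabet size), H (entropy) and C (compressibility) are labels introduced here for convenience.

```latex
\begin{align*}
N &= 125, \qquad \log_2 N \approx 6.97 \text{ bits} \\
H &= -\sum_i p_i \log_2 p_i \approx 5.47 \text{ bits} \\
C &= \frac{\log_2 N - H}{\log_2 N} \approx \frac{6.97 - 5.47}{6.97} \approx 0.215
\end{align*}
```

With these rounded inputs the ratio comes to roughly 21.5%, in line with the figure of about 21% used in the body of the paper; the second decimal place depends on how far H and log2 N are rounded.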

    2. Unicode to Bits Mapping

    Alphabet Unicode Code

    അ 3333 10001

    ആ 3334 00101

    ഇ 3335 00111

    ഈ 3336 110110

    ഉ 3337 100010

    ഊ 3338 001011

    ഋ 3339 011001

    എ 3342 10011

    ഏ 3343 110100

    ഐ 3344 111011

    ഒ 3346 100011

    ഓ 3347 101111

    ഔ 3348 001010

    അo 3330 111

    ക 3349 11

    ഖ 3350 101100

    ഗ 3351 10110

    ഘ 3352 110001

    ങ 3353 000000

    ച 3354 11111

    ഛ 3355 010100

    ജ 3356 11000

    ഝ 3357 011011

    ഞ 3358 111110

    ട 3359 1010

    ഠ 3360 111111

    ഡ 3361 00000

    ഢ 3362 010010

    ണ 3363 0101

    ത 3364 010

    ഥ 3365 111010

    ദ 3366 11011

    ധ 3367 01101

    ന 3368 101

    പ 3370 1001

    ഫ 3371 01011

    ബ 3372 11110

    ഷ 3383 11100

    സ 3384 1000

    ഹ 3385 00011

    ക്ക 3349-3405-3349 1011

    ന്ന 3368-3405-3368 1111

    ത്ത 3364-3405-3364 0100

    ട്ട 3359-3405-3359 10101

    പ്പ 3370-3405-3370 11010

    ച്ച 3354-3405-3354 11101

    ങ്ങ 3353-3405-3353 00001

    ണ്ട 3363-3405-3359 00010

    ന്റ 3368-3405-3377 01000

    റ്റ 3377-3405-3377 01010

    ന്ത 3368-3405-3364 01100

    ലല 3378-3405-3378 01110

    ക്ഷ 3349-3405-3383 01111

    ഞ്ഞ 3358-3405-3358 100100

    ള്ള 3379-3405-3379 100101

    മ്പ 3374-3405-3370 100110

    മ്മ 3374-3405-3374 100111


    ങ്ക 3353-3405-3349 101001

    സ്ഥ 3384-3405-3365 101101

    ന്ദ 3368-3405-3366 101110

    ഞ്ച 3358-3405-3354 110010

    യ്യ 3375-3405-3375 110011

    ദ്ധ 3366-3405-3367 110101

    ന്ധ 3368-3405-3367 110111

    സ്സ 3384-3405-3384 111000

    ണ്ണ 3363-3405-3363 111001

    ദ്ദ 3366-3405-3366 111100

    ക്ത 3349-3405-3364 111101

    ത്ഥ 3364-3405-3365 000001

    ത്സ 3364-3405-3384 000010

    ണ്മ 3363-3405-3374 011000

    ഗ്മ 3351-3405-3374 011010

    ഡ്ഡ 3361-3405-3361 011100

    ല് 3453 0001

    ള് 3454 11001

    ര് 3452 1110

    ണ് 3450 101011

    ന് 3451 10000

    ാ 3390 10

    ാ 3391 00

    ാ 3392 10010

    ാ 3393 01

    ാ 3394 10100

    ാ 3395 101010

    ൊ 3398 011

    ോ 3399 0000

    ൈാ 3400 100000

    ൊ 3402 01001

    ോ 3403 1101

    ൊ 3404 110000

    o 3330 111

    3331 010110

    3405-3377 0011

    3405-3375 0111

    3405-3381 100001

    ാ 3405 101000

    ാ 3415 001111

    space 32 0

    periods 46 1

    Table 2: Transition Table
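Table 2 assigns bit strings of length 1 to 6, and there are 2 + 4 + 8 + 16 + 32 + 64 = 126 such strings, just enough for 125 symbols. A code of this kind is not prefix-free (for example, space is 0 and ത is 010), so a decoder needs to know where each code ends. The sketch below assumes that the "overhead of 3 bit codes" mentioned in Section III is a 3-bit length field written before every code. That reading is an assumption, as are the names `TABLE`, `LENGTH_BITS`, `encode` and `decode`; only eight rows of Table 2 are copied, and the output is a string of '0'/'1' characters rather than packed bytes.

```python
# Excerpt of Table 2: decimal Unicode code point -> variable-length code.
TABLE = {
    32: "0",      # space
    46: "1",      # full stop
    3349: "11",   # ka
    3364: "010",  # ta
    3368: "101",  # na
    3390: "10",   # vowel sign, code point 3390
    3391: "00",   # vowel sign, code point 3391
    3393: "01",   # vowel sign, code point 3393
}
REVERSE = {code: cp for cp, code in TABLE.items()}
LENGTH_BITS = 3  # assumption: each code is preceded by its length in 3 bits


def encode(text: str) -> str:
    """Emit, for every character, a 3-bit length field followed by its code."""
    out = []
    for ch in text:
        code = TABLE[ord(ch)]  # KeyError for characters outside the excerpt
        out.append(format(len(code), "03b"))
        out.append(code)
    return "".join(out)


def decode(bits: str) -> str:
    """Read a 3-bit length, then that many code bits, until the input ends."""
    chars = []
    pos = 0
    while pos < len(bits):
        length = int(bits[pos:pos + LENGTH_BITS], 2)
        pos += LENGTH_BITS
        code = bits[pos:pos + length]
        pos += length
        chars.append(chr(REVERSE[code]))
    return "".join(chars)


sample = chr(3349) + " " + chr(3364) + "."
packed = encode(sample)
print(len(packed), len(sample) * 7)  # 19 28
print(decode(packed) == sample)      # True
```

Under this assumed layout the most frequent symbols cost 4 bits (3 for the length, 1 for the code) and the rarest cost 9, so any gain over the 7-bit baseline depends on short codes going to frequent characters.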
