
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 50, NO. 7, JULY 2004

Universal Compression of Memoryless Sources Over Unknown Alphabets

Alon Orlitsky, Narayana P. Santhanam, Student Member, IEEE, and Junan Zhang, Student Member, IEEE

Abstract—It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern—the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time.

To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.

Index Terms—Large and unknown alphabets, patterns, set and integer partitions, universal compression.

I. INTRODUCTION

SHANNON showed that every discrete source can be compressed no further than to its entropy, and that if its distribution is known, compression to essentially the entropy can be achieved. However, in most applications, the underlying distribution is not known. For example, in text compression, neither the distribution of words nor their dependencies on previous words are known, and they change with author, subject, and time. Similarly, in image compression, the distribution and interrelation of pixels are not known and vary from picture to picture.

A common assumption in these applications is that the source distribution belongs to some natural class , such as the collection of independent and identically distributed (i.i.d.), Markov, or stationary distributions, but that the precise distribution within is not known [1], [2]. The objective then is to compress the data almost as well as when the distribution is known in advance, namely, to find a universal compression scheme that performs almost optimally by approaching the entropy no matter which distribution in generates the data. The following is a brief introduction to universal compression. For an extensive overview, see [3]–[6].

Manuscript received March 29, 2003; revised March 31, 2004. This work was supported by the National Science Foundation under Grant CCR-0313367. The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Yokohama, Japan, June/July 2003.

The authors are with the Department of Electrical and Computer Engineering and the Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093 USA (e-mail: [email protected]; [email protected]; [email protected]).

Communicated by W. Szpankowski, J. C. Kieffer, and E.-h. Yang, Guest Editors.

Digital Object Identifier 10.1109/TIT.2004.830761

Let a source be distributed over a discrete support set according to a probability distribution . An encoding of is a prefix-free mapping . It can be shown that every encoding of corresponds to a probability assignment over where the number of bits allocated to is approximately . Roughly speaking, the optimal encoding that is selected based on the distribution and achieves its entropy, allocates bits to every .

The extra number of bits required to encode when is used instead of is, therefore,

The worst case redundancy of with respect to the distribution is

the largest number of extra bits allocated for any possible . The worst case redundancy of with respect to the collection is

the number of extra bits used for the worst distribution in and worst . The worst case redundancy of is

(1)

the lowest number of extra bits required in the worst case by any possible encoder .

For any pair of distributions and , is nonnegative, and therefore is always nonnegative. Note that when the redundancy is small, there is an encoding that assigns to every a probability not much smaller than that assigned to by the most favorable distribution in .

In addition to worst case redundancy, one can define the average-case redundancy of

that reflects the lowest expected number of additional bits required by an encoding that does not know the underlying distribution. The average-case redundancy is clearly always lower than the worst case redundancy. Since our primary interest is showing that low redundancy can be achieved, we concentrate on worst case redundancy.

Most universal-compression results, even those applying to distributions with memory, build on corresponding results for i.i.d. distributions. Consequently, the redundancy of , the collection of i.i.d. distributions over sequences of length drawn from an alphabet of size , was studied extensively, and a succession of papers [7]–[14] has shown that for any fixed as increases

(2)

where is the gamma function, and the term diminishes with increasing at a rate determined by .

For any fixed alphabet size , this redundancy grows logarithmically with the block length , hence, as increases to infinity, the per-symbol redundancy diminishes, implying that, asymptotically, i.i.d. distributions can be represented essentially optimally even when the underlying probability distribution is unknown.

In many applications, however, the natural alphabet which captures the structure of the data is very large, possibly even infinite. For example, the natural units that capture the structure of English are words, not letters, and clearly not bits. English consists of hundreds of thousands of words, comparable to the number of words in a typical text. The natural symbols in image coding are pixels, which may assume different values, and the natural symbols in facsimile transmission are images of the individual letters.

When, as in the above applications, the alphabet size is large, is high too, and increases to infinity as grows. Similar conclusions hold also when is allowed to grow with the block length [15].

A systematic study of universal compression over arbitrary alphabets was undertaken by Kieffer [4], who analyzed a slightly less restrictive form of redundancy, related to weak universal compression. He derived a necessary and sufficient condition for weak universality, and used it to show that even under this weaker requirement, i.i.d. distributions over infinite alphabets entail infinite redundancy. Faced with Kieffer’s impossibility results, subsequent universal compression work has typically avoided general distributions over large alphabets.

On the theoretical side, researchers constructed compression algorithms for subclasses of i.i.d. distributions that satisfy Kieffer’s condition. Elias [16], Györfi, Pali, and Van der Meulen [17], and Foster, Stine, and Wyner [18] considered monotone, namely, , i.i.d. distributions over the natural numbers, Uyematsu and Kanaya [19] studied bounded-moment distributions, and Kieffer and Yang [20] and He and Yang [21] showed that any collection satisfying Kieffer’s condition can be universally compressed by grammar codes.

Since actual distributions may not satisfy Kieffer’s condition, practical compression algorithms have typically taken a different approach, circumventing some of the problems associated with large alphabets by converting them into smaller ones. For example, common implementations of both the Lempel–Ziv and the context-tree-weighting algorithms do not operate on words. Instead, they convert words into letters, letters into bits, and then compress the resulting sequence of bits. Such compression algorithms risk losing the sequential relation between words. For example, if the probability of a word depends on the preceding four words, then it is determined by the previous 20 or more letters, namely, upwards of 160 bits, yet most programs truncate their memory at significantly fewer bits.

Motivated by language modeling for speech recognition and the discussion above, we recently took a different approach to compression of large, possibly infinite, alphabets [15], [22]. A similar approach was considered by Åberg, Shtarkov, and Smeets [23], who lower-bounded its performance when the underlying alphabet is finite and the sequence length increases to infinity; see Section V-C.

To motivate this approach, consider perhaps the simplest infinite-redundancy collection. Let , where each is the constant distribution that assigns probability to the length- sequence , and probability to any other (constant or varying) sequence. If the source that generates the sequence, namely, , is known, no bits are needed to describe the resulting sequence . Yet a universal compression scheme that knows only the class of distributions needs to describe , which, as grows, requires an unlimited number of bits. Therefore, both the worst case and average redundancy incurred by any universal compression scheme are infinite.

While disappointing in showing that even simple collections may have infinite redundancy, this example also suggests an approach around some of this redundancy. The unboundedly many bits required to convey the sequence clearly do not describe the sequence’s evolution, or pattern, but only the element it consists of. It is natural to ask if this observation holds in general, namely, whether the patterns of all sequences can be efficiently conveyed, and the infinite redundancy found by Kieffer stems only from describing the elements that occur.

The description of any string, over any alphabet, can be viewed as consisting of two parts: the symbols appearing in the string and the pattern that they form. For example, the string

"abracadabra"

can be described by conveying the pattern

"12314151231"

and the dictionary

index    1  2  3  4  5
letter   a  b  r  c  d

Together, the pattern and dictionary specify that the string “abracadabra” consists of the first letter to appear (a), followed by the second letter to appear (b), then by the third to appear (r), the first that appeared (a again), the fourth (c), the first (a), etc.
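To make the decomposition concrete, the following Python sketch (an illustration we add, not code from the paper) computes the pattern and dictionary of a string under the definition used here.

def pattern_and_dictionary(s):
    """Return the pattern (list of indices) and dictionary (index -> symbol) of s.

    The index of a symbol is one more than the number of distinct
    symbols that appear before its first occurrence.
    """
    index = {}          # symbol -> index of its first appearance
    pattern = []
    for ch in s:
        if ch not in index:
            index[ch] = len(index) + 1   # next unused index
        pattern.append(index[ch])
    dictionary = {i: ch for ch, i in index.items()}
    return pattern, dictionary

if __name__ == "__main__":
    p, d = pattern_and_dictionary("abracadabra")
    print("".join(map(str, p)))   # 12314151231
    print(d)                      # {1: 'a', 2: 'b', 3: 'r', 4: 'c', 5: 'd'}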

Of the pattern and the dictionary parts of describing a string, the former has a greater bearing on many applications [24]–[30]. For example, in language modeling, the pattern reflects the structure of the language while the dictionary reflects the spelling of words. We therefore concentrate on pattern compression.


II. RESULTS

In Section III, we formally define patterns and their redundancy. In Section IV, we derive some useful properties of patterns, including a correspondence between patterns and set partitions. We use this correspondence to show that the number of patterns is the Bell number and that the number of patterns with a given number of symbols is the Stirling number of the second kind.

We are primarily interested in universal codes for the class of all i.i.d. distributions, over all possible alphabets, finite or infinite. As mentioned earlier, for standard compression, the per-symbol redundancy increases to infinity as the alphabet size grows. Yet in Section V, we show that , the block redundancy of compressing patterns of i.i.d. distributions over potentially infinite alphabets, is bounded by

Therefore, the per-symbol redundancy of coding patterns diminishes to zero as the block length increases, irrespective of the alphabet size. The proofs use an analogy between patterns and set partitions which allows us to incorporate celebrated results of Hardy and Ramanujan on the number of partitions of an integer.

In Section VI, we consider sequential pattern compression, which is of interest in most practical applications. We first construct a sequential compression algorithm whose redundancy is at most

However, this algorithm has high computational complexity, hence, we also derive a linear-complexity sequential algorithm whose redundancy is at most

where the implied constant is less than . The proofs of the sequential-compression results are more involved than those of block compression, and further improvement of the proposed algorithms is warranted.

Note that for both block and sequential compression, the redundancy grows sublinearly with the block length, hence the per-symbol redundancy diminishes to zero. These results can be extended to distributions with memory [31]. They can also be adapted to yield an asymptotically optimal solution for the Good-Turing probability-estimation problem [32].

III. PATTERNS

We formally describe patterns and their redundancy. Let be any alphabet. For

denotes the set of symbols appearing in . The index of is

and

one more than the number of distinct symbols preceding ’s first appearance in . The pattern of is the concatenation

of all indexes. For example, if , ,, , , and , hence,

Let

denote the set of patterns of all strings in . For example, if contains two elements, then , ,

, etc. Let

denote the set of all length- patterns, and let

be the set of all patterns. For example

where is the empty string. Fig. 1 depicts a tree representation of all patterns of length at most .

It is easy to see that a string is a pattern iff it consists of positive integers such that no integer appears before the first occurrence of . For example, , , and are patterns, while , , and are not.
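This characterization translates directly into a membership test. The sketch below (our illustration, not the paper's) checks that a candidate sequence of positive integers starts at 1 and never exceeds the current maximum by more than one.

def is_pattern(indices):
    """Check whether a sequence of positive integers is a valid pattern:
    the first value must be 1, and each new value can exceed the current
    maximum by at most one (no integer appears before all smaller ones)."""
    seen_max = 0
    for i in indices:
        if not (1 <= i <= seen_max + 1):
            return False
        seen_max = max(seen_max, i)
    return True

# Examples: [1, 2, 1, 3] is a pattern; [1, 3, 2] is not (3 appears before 2).
assert is_pattern([1, 2, 1, 3])
assert not is_pattern([1, 3, 2])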

Every probability distribution over induces a distribution over patterns on , where

is the probability that a string generated according to has pattern . When is used to evaluate a specific pattern probability , the subscript can be inferred, and is hence omitted. For example, let be a uniform distribution over . Then induces on the distribution
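Since the displayed induced distribution was lost in extraction, here is a small brute-force Python sketch (our own illustration) that computes the distribution a given i.i.d. distribution induces on patterns of length n by enumerating all strings.

from itertools import product
from collections import defaultdict

def pattern(s):
    index, out = {}, []
    for ch in s:
        index.setdefault(ch, len(index) + 1)
        out.append(index[ch])
    return tuple(out)

def induced_pattern_distribution(p, n):
    """p: dict symbol -> probability (an i.i.d. single-letter distribution).
    Returns the distribution induced on patterns of length n."""
    dist = defaultdict(float)
    for s in product(p.keys(), repeat=n):
        prob = 1.0
        for ch in s:
            prob *= p[ch]
        dist[pattern(s)] += prob
    return dict(dist)

# A uniform distribution over two symbols, n = 2:
# pattern (1, 1) gets probability 1/2 and pattern (1, 2) gets 1/2.
print(induced_pattern_distribution({"a": 0.5, "b": 0.5}, 2))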


Fig. 1. A tree representation of patterns of length at most 4.

For a collection of distributions over let

denote the collection of distributions over induced by probability distributions in . From the derivations leading to (1), the worst case pattern redundancy of , i.e., the worst case redundancy of patterns generated according to an unknown distribution in is

(3)

where is any distribution over . In particular, for all

As mentioned earlier, we are mostly interested in , the pattern redundancy of , the collection of arbitrary i.i.d. distributions over length- strings. We show that the per-symbol redundancy diminishes to zero, and that diminishing per-symbol redundancy can be achieved both by block and sequential coding with a constant number of operations per symbol.

IV. PRELIMINARIES

We first establish a correspondence between patterns and set partitions. Set partitions have been studied extensively by a number of well-known researchers [33]–[35], and in this section and in Section V we use their properties to derive the asymptotics of and of the growth rate of .

A. Set Partitions and Patterns

A partition of a set is a collection of disjoint nonempty subsets of whose union is . For , let

with . Let be the set of all partitions of , and let

be the collection of all partitions of for all . For example

For , let be the number of partitions of into sets, and let

be the number of partitions of . For example, ,, , , , ,

and so on, hence,

Note that is for and otherwise. The numbers are called Stirling numbers of the second kind while the numbers , are called Bell numbers. Many results are known for both [36]. In particular, it is easy to see that for all , Bell numbers satisfy the recursion
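The displayed recursion did not survive extraction; the standard Bell-number recursion is B_{n+1} = sum_{k=0}^{n} C(n, k) B_k, which we restate here. The Python sketch below (our illustration) computes Stirling numbers of the second kind and Bell numbers and checks that recursion numerically.

from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: partitions of an n-set into k blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Element n either forms a block alone, or joins one of the k existing blocks.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

def bell(n):
    """Bell number: total number of partitions of an n-set."""
    return sum(stirling2(n, k) for k in range(n + 1))

# The first few Bell numbers: 1, 1, 2, 5, 15, 52, 203, ...
print([bell(n) for n in range(7)])
# The recursion B_{n+1} = sum_k C(n, k) B_k, checked for n = 5:
assert bell(6) == sum(comb(5, k) * bell(k) for k in range(6))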

Set partitions are equivalent to patterns. To see that, let the mapping assign to the set partition

where denotes the length of . For example, for the pattern

The following follows easily.

Lemma 1: The function is a bijection. Furthermore, for every

B. Profiles

We classify patterns and set partitions by their profile, which will be useful in evaluating the redundancy of i.i.d.-induced distributions.

The multiplicity of in is

the number of times appears in . The prevalence of a multiplicity in is


the number of symbols appearing times in . The profile of is

the vector of prevalences of in for . Similarly, the profile of is the vector

where

is the number of sets of cardinality in . The following is easily observed.

Lemma 2: For all ,

For example, the pattern has multiplicities , , and for all other . Hence, its prevalences are , and its profile is . On the other hand, we see that

with its profile . Let

be the set of profiles of all length- patterns, and let

be the set of profiles of all patterns. Clearly, and are also the set of profiles of all set partitions in and , respectively, and for all

For , let

be the collection of patterns of profile , and, equivalently, let

denote the collection of partitions whose profile is . It follows that for all
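As a small illustration of the definitions in this subsection (ours, not from the paper), the following sketch computes the multiplicities, prevalences, and profile of a pattern.

from collections import Counter

def profile(pattern):
    """Return the profile of a pattern: a dict mapping each multiplicity mu
    to its prevalence, i.e., the number of symbols appearing mu times."""
    multiplicity = Counter(pattern)              # symbol index -> number of occurrences
    prevalence = Counter(multiplicity.values())  # multiplicity -> prevalence
    return dict(prevalence)

# The pattern 1 2 1 3 has one symbol appearing twice and two appearing once,
# so its profile assigns prevalence 1 to multiplicity 2 and prevalence 2 to 1.
print(profile([1, 2, 1, 3]))   # {2: 1, 1: 2}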

C. Useful Results

In this subsection, we evaluate the size of and recall Shtarkov’s result for computing the worst case redundancy.

Number of Patterns of a Given Profile: Let

be the number of patterns of profile . We get the following.

Lemma 3: For all and

Proof: There is only one pattern of length , the empty string , hence the lemma holds.

There are many ways to derive this result. To see one, let be a profile- partition of . For ,

let be the collection of elements in sets of size . Clearly

hence can be decomposed into the sets in

ways. Each set can be further decomposed into interchangeable sets of size in

ways. These two decompositions uniquely define the partition, hence, the number of profile- partitions of is
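The displayed count at the end of the proof was lost in extraction. For a profile assigning prevalence phi_mu to multiplicity mu, with n = sum_mu mu*phi_mu, the decomposition above yields the standard set-partition count n! / prod_mu ((mu!)^phi_mu * phi_mu!); we restate it here as our reading of the lost formula. The sketch below (ours) evaluates it and cross-checks a small case by brute force.

from math import factorial
from itertools import product
from collections import Counter

def patterns_with_profile(phi):
    """Count patterns whose profile assigns prevalence phi[mu] to multiplicity mu,
    using n! / prod_mu ((mu!)**phi[mu] * phi[mu]!)."""
    n = sum(mu * cnt for mu, cnt in phi.items())
    count = factorial(n)
    for mu, cnt in phi.items():
        count //= factorial(mu) ** cnt * factorial(cnt)
    return count

def brute_force(phi, n):
    """Enumerate all length-n patterns and count those with the given profile."""
    total = 0
    for s in product(range(1, n + 1), repeat=n):
        seen_max = 0
        valid = True
        for i in s:
            if i > seen_max + 1:
                valid = False
                break
            seen_max = max(seen_max, i)
        if valid and dict(Counter(Counter(s).values())) == phi:
            total += 1
    return total

# Profile with two symbols of multiplicity 1 and one of multiplicity 2 (n = 4):
phi = {1: 2, 2: 1}
print(patterns_with_profile(phi), brute_force(phi, 4))  # both 6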

Shtarkov’s Sum: We will frequently evaluate redundancies using a result by Shtarkov [37] showing that the distribution achieving in (1) is

It follows that the redundancy of a collection of distributions over is determined by Shtarkov’s sum

(4)
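As a concrete instance of (4), the following Python sketch (our illustration, not code from the paper) evaluates the Shtarkov sum and the resulting worst-case redundancy for i.i.d. binary strings of length n, where the per-sequence maximum-likelihood probability is attained at the empirical frequency.

from math import log2
from itertools import product

def shtarkov_redundancy_binary_iid(n):
    """Worst-case redundancy (in bits) of i.i.d. binary strings of length n:
    log2 of the Shtarkov sum sum_x max_p p(x), where the max over Bernoulli
    parameters is attained at the empirical frequency k/n."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        k = sum(x)
        p = k / n
        max_likelihood = (p ** k) * ((1 - p) ** (n - k))  # 0**0 == 1 in Python
        total += max_likelihood
    return log2(total)

# Grows roughly like (1/2) log2 n, in line with (2) for alphabet size 2.
for n in (1, 2, 4, 8, 16):
    print(n, round(shtarkov_redundancy_binary_iid(n), 3))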

Approximation of Binomial Coefficients: While finite-alphabet results typically involve binomial coefficients of form for some constant , large alphabets often require the calculation of . The following lemma provides a convenient approximation.

Lemma 4: When and


and when, in addition,

Proof: Feller’s bounds on Stirling’s approximation [38] state that for every

(5)

Hence, for all

Taking derivatives, it is easy to see that for all ,

hence, for all

and

Therefore, for all

where

proving the first part of the lemma. When and, , and the second part follows.
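Equation (5) did not survive extraction; Feller's form of Stirling's approximation, restated here as our reading of the intended bound, is sqrt(2*pi*n) * (n/e)^n * e^(1/(12n+1)) < n! < sqrt(2*pi*n) * (n/e)^n * e^(1/(12n)). The sketch below (ours, not the paper's) checks it numerically.

from math import factorial, sqrt, pi, e, exp

def feller_bounds(n):
    """Feller's bounds on n!: the lower bound uses exp(1/(12n+1)),
    the upper bound uses exp(1/(12n))."""
    base = sqrt(2 * pi * n) * (n / e) ** n
    return base * exp(1 / (12 * n + 1)), base * exp(1 / (12 * n))

for n in (1, 5, 10, 20):
    lo, hi = feller_bounds(n)
    assert lo < factorial(n) < hi
    print(n, lo, factorial(n), hi)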

V. BLOCK COMPRESSION

We show that for all

namely, the redundancy of patterns of i.i.d. distributions is sublinear in the block length, implying that the per-symbol redundancy diminishes to zero. To obtain these bounds we rewrite Shtarkov’s sum (4) as

(6)

In the next subsection, we use this sum to compute and . However, for larger , exact calculation of the maximum-likelihood probabilities of patterns, namely, , seems difficult [39]. Hence, in Sections V-B and V-C, respectively, we prove upper and lower bounds on the maximum-likelihood probabilities of patterns and use these bounds to upper- and lower-bound .

A. The Redundancy of Patterns of Lengths and

We determine the redundancies and of i.i.d.-induced distributions over patterns of lengths and , respectively.

There is only one distribution on , hence,

For length , consider the collection of distributions over

induced by the set of i.i.d. distributions over strings of length . By Shtarkov’s sum

(7)

Since any constant i.i.d. distribution assigns , the maximum-likelihood probability of is , hence,

Similarly, any continuous distribution over assigns, hence,

Incorporating into (7), we obtain

Unfortunately, calculation of maximum-likelihood probabilities for longer patterns seems difficult. Therefore, instead of evaluating the sum in (6) exactly, we bound the maximum-likelihood probabilities of to obtain bounds on .

B. Upper Bound

We first describe a general bound on the redundancy of any collection of distributions, and then use it to show that


A General Upper Bound Technique: Let be a collection of distributions. The following bound on the redundancy of is easily obtained.

Lemma 5: For all

Proof: The claim is obvious when is infinite, and for finite , Shtarkov’s sum implies that

Intuitively, for finite , the lemma corresponds to first identifying the maximum-likelihood distribution of from all distributions in and then describing using this distribution. If not all distributions are candidates for maximum-likelihood distributions, the above bound can be improved as follows.

A collection of distributions dominates if for all

namely, the highest probability of any in is at least as high as that in . The next lemma then follows immediately from Shtarkov’s sum.

Lemma 6: If dominates , then

The preceding lemmas imply that the redundancy is upper-bounded by the logarithm of the size of .

Corollary 7: If dominates , then

To illustrate this bound, we bound the standard redundancy of strings distributed i.i.d. over finite alphabets.

Example 1: Consider the collection of all i.i.d. distributions over alphabets of size , which, without loss of generality, we assume to be . Clearly

where and

dominates , hence, from Corollary 7

Similarly, since

it follows that for every and
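The displayed expressions in Example 1 were lost in extraction. On our reading, the dominating class consists of the empirical (type) distributions, of which there are C(n+k-1, k-1), so Corollary 7 bounds the redundancy by the logarithm of that count. The following Python sketch, our illustration under that assumption, evaluates the bound.

from math import comb, log2

def iid_redundancy_bound(n, k):
    """Upper bound (bits) on the worst-case redundancy of i.i.d. strings of
    length n over a k-letter alphabet: log2 of the number of empirical types,
    C(n + k - 1, k - 1), via Corollary 7."""
    return log2(comb(n + k - 1, k - 1))

# For fixed k the bound grows like (k - 1) log2 n, matching the growth in (2).
for n, k in [(100, 2), (100, 10), (1000, 10)]:
    print(n, k, round(iid_redundancy_bound(n, k), 2))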

Redundancy: We show that is at most the logarithm of the number of profiles. Similar to the correspondence between patterns and set partitions, we obtain a correspondence between profiles of patterns and unordered partitions of positive integers, and use Hardy and Ramanujan’s results on the number of unordered partitions of positive integers to bound the number of profiles, and hence the redundancy.

Lemma 8: For all

Proof: Every induced i.i.d. distribution assigns the same probability to all patterns of the same profile. Hence, the probability assigned to every pattern of a given profile is at most the inverse of the number of patterns of that profile. Let consist of distributions, one for each profile. The distribution associated with any profile assigns to each pattern of that profile a probability equal to the inverse of the number of patterns of the profile. clearly dominates and the lemma follows.

To count the number of profiles in , we observe the following correspondence with unordered partitions of positive integers.

An unordered partition of a positive integer is a multiset of positive integers whose sum is . An unordered partition can be represented by the vector , where denotes the number of times occurs in the partition. For example, the partition of corresponds to the vector . Unordered partitions of a positive integer and profiles of patterns in are equivalent as follows.

Lemma 9: A vector is an unordered partition of iff.

Henceforth, we use the notation developed for profiles of patterns in Section IV-B for unordered partitions as well.

Lemma 10: (Hardy and Ramanujan [40], see also [41]) The number of unordered partitions of is
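The displayed formula in Lemma 10 was lost in extraction; the well-known Hardy and Ramanujan asymptotic is p(n) ~ exp(pi*sqrt(2n/3)) / (4*n*sqrt(3)). The sketch below (ours, not code from the paper) computes p(n) exactly by dynamic programming and compares it with that asymptotic; its logarithm grows on the order of sqrt(n), which is the growth rate the bound below exploits.

from math import exp, pi, sqrt, log2

def partition_count(n):
    """Number of unordered partitions p(n), by the standard DP over part sizes."""
    p = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            p[total] += p[total - part]
    return p[n]

def hardy_ramanujan(n):
    """Hardy-Ramanujan asymptotic: p(n) ~ exp(pi*sqrt(2n/3)) / (4*n*sqrt(3))."""
    return exp(pi * sqrt(2 * n / 3)) / (4 * n * sqrt(3))

for n in (10, 50, 100):
    exact = partition_count(n)
    print(n, exact, round(hardy_ramanujan(n)), round(log2(exact), 1))
# p(10) = 42, p(50) = 204226, p(100) = 190569292; log2 p(n) grows like c * sqrt(n).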

Lemmas 8 and 10 imply the following upper bound on the pattern redundancy of .

Theorem 11: For all

In particular, the pattern redundancy of i.i.d. strings is sublinear in the block length, and hence the per-symbol redundancy diminishes as the number of compressed symbols increases. We note that the number of integer partitions has also been used by Csiszár and Shields [42] to bound the redundancy of renewal processes.

C. Lower Bound

In the last section, we showed that the redundancy of patterns of i.i.d. strings is . We now show that it is . We provide a simple proof of this lower bound and mention a more complex approach that yields the same growth rate, but with a higher multiplicative constant.

Theorem 12: As increases

Proof: Let

be the support of a pattern with respect to a distribution over an alphabet . As noted in [23], can be partitioned into sets, each with equiprobable sequences, where is the profile of . By standard maximum-likelihood arguments, the probability of any sequence with profile is at most , hence,

(8)

From Shtarkov’s sum (4)

where follows from Lemma 3 and (8), from Feller’s bounds (5), from the arithmetic-geometric mean inequality, because each unordered partition into parts can be ordered in at most ways, and from Lemma 4. The theorem follows.

Note that the constant in the bound can be increased by taking

in the proof, yielding

Generating functions and Hayman’s theorem can be used to evaluate the exact asymptotic growth of

(9)

thereby improving the lower bound to the following.

Theorem 13: [43] (see also [15]) As increases

These lower bounds should be compared with those in Åberg, Shtarkov, and Smeets [23] who lower-bounded pattern redundancy when the number of symbols is fixed and finite and the block length increases to infinity. While it is not clear whether their proof extends to arbitrary , which may grow with , the bound they derive may still hold in general. If so, it would yield a lower bound similar to those described here. For a more complete discussion, see [44]. Note also that subsequent to the derivation of Theorem 13, Shamir et al. [45], [46] showed that the average-case pattern redundancy is lower-bounded by

for arbitrarily small .

VI. SEQUENTIAL COMPRESSION

The compression schemes considered so far operated on the whole block of symbols. In many applications the symbols arrive and must be encoded sequentially. Compression schemes for such applications are called sequential and associate with every pattern , a probability distribution over

representing the probability that the encoder assigns to the possible values of after seeing . For example

is a distribution over , namely, , , and are distributions over , while is a distribution over

. Let be a sequential encoder. For each , induces a probability distribution over given by

Some simple algorithms along the lines of the add-constant rules were analyzed in [15] and shown to have diminishing per-symbol redundancy when the number of distinct symbols is small, but a constant per-symbol redundancy in general. In this section, we describe two sequential encoders with diminishing per-symbol redundancy. The first encoder, , has worst case redundancy of at most

only slightly higher than the upper bound on . However, this encoder has high computational complexity, and in Section VI-B, we consider a sequential encoder with linear computational complexity and redundancy less than

which still grows sublinearly with though not as slowly as the block redundancy.
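For concreteness, here is a sketch of a simple add-constant sequential pattern predictor of the kind referenced above, our illustration of the general setup rather than either of the paper's two encoders: after seeing a pattern prefix, each index seen so far receives probability proportional to its count plus a constant beta, and the single new index receives probability proportional to beta.

from collections import Counter

def add_constant_predictor(prefix, beta=0.5):
    """Conditional distribution over the next pattern index given a pattern prefix.

    Each index seen so far gets probability proportional to (count + beta);
    the single "new symbol" index (current max + 1) gets probability
    proportional to beta.  This is an add-constant rule, not the paper's
    low-redundancy or low-complexity encoders.
    """
    counts = Counter(prefix)
    new_index = max(prefix, default=0) + 1
    weights = {i: counts[i] + beta for i in counts}
    weights[new_index] = beta
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}

# After the pattern prefix 1 2 1, the next index is 1, 2, or the new index 3.
print(add_constant_predictor([1, 2, 1]))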

A. A Low-Redundancy Encoder

We construct an encoder that for all , and all patterns achieves a redundancy

The encoder uses distributions that are implicit in the block coding results.

Let

denote the maximum-likelihood probability assigned to a pattern by any i.i.d. distribution in . Recall that is the number of patterns with profile , and that, as in Lemma 8, every i.i.d. distribution assigns the same probability to all patterns of the same profile. We, therefore, obtain the following upper bound on the maximum pattern probabilities.

Lemma 14: For any pattern of profile

Based on this upper bound, we can construct the following distribution over :

(10)

For , let

be the smallest power of that is at least , e.g., , , and . Note that .

For every , and patterns , let

be the set of all patterns that extend in , and let

be the probability of the set under the distribution . The encoder assigns

and for all and it assigns the conditional probability

(11)

We now bound the redundancy of .

Theorem 15: For all

Proof: Recall that

The theorem holds trivially for . For , rewrite

For all , Lemma 16 shows that

and Lemma 17 that

and the theorem follows.

Lemma 16: For all and

Proof: Observe that

where follows since for all and any i.i.d.-induced distribution

by interchanging with the summation, and because (10), together with Lemmas 10 and 14, implies that for all

Note that this inequality corresponds to the upper bound on in Section V-B.


Lemma 17: For all and all

Proof: We prove by induction on that for all and all

(12)

The lemma will follow since for every , is a power of two. The basis holds since for , , and all satisfy

To prove the step, note from (11) that for , all and satisfy

hence,

(13)

By the induction hypothesis

(14)

By definition (10) and Lemma 10

On the other hand, distinct patterns have disjoint sets , while patterns of the same profile have the same probability , hence,

and thus,

It follows that

Incorporating this inequality and (14) in (13), we get (12).

B. A Low-Complexity Encoder

The evaluation of in (11) may take super-polynomial time. We, therefore, present a linear-complexity encoder whose sequential redundancy is less than , hence its per-symbol redundancy also diminishes to zero as the block length increases, albeit at the slower rate of .

For notational convenience, let denote the prevalence of in , , and let

For , define

and let

Finally, define the sequence

The encoder assigns

and for all and , it assigns the conditional probability

(15)

where

is a normalization factor. It can be shown by induction that for all and all patterns

Theorem 18: For all

where .

Proof: The theorem holds trivially for . For , it can be shown that for all

Since for all and ,

In Lemma 20, we show that for all and


Since only terms are nonzero, and using the inequality for , we obtain

In Lemma 21, we show that for all and

Substituting , we obtain

The theorem follows by a simple evaluation of the constants involved.

Theorem 19: The number of operations required to compute all of

grows linearly with .

Proof: The conditional probabilities are calculated in (15). Note that computing requires only multiplications and comparisons. It therefore suffices to evaluate the complexity of calculating .

Let be the set of perfect cubes. For all , can be updated from in constant time. For all , can be calculated in

time because there are at most multiplicities such that . Hence, the evaluation of can be completed in time

Finally, the following two technical lemmas complete the proof of Theorem 18.

Lemma 20: For all and

Proof: Let

.

Observe that

where we note that for , the second case will never occur. Hence, for all and

(16)

Let maximize . It can be shown that

and for where

is the solution of

Hence the right-hand side of (16) simplifies to

implying

Lemma 21: For all and all patterns

Proof: As before, write as , and let

Recall that

For convenience, let , so that

takes different forms depending on whether falls into , ,

, or . We, therefore, partition into alternating nonempty segments with prevalences and which we call high and low segments, respectively. More precisely, a high segment is an interval such that

and a low segment is an interval such that

.

Observe that the first segment is always a high segment and the last is always a low segment, hence, the number of high segments equals the number of low segments. Let be the number of high (or low) segments, and be the th high segment. Hence, for , is the th low segment and is the th low segment.

For all in high segments, and, while for all in low segments,

. For , let and . Define

Consequently

.

We upper-bound by

.

(17)

Recall that

From (17)

(18)

To simplify the above expression, observe that for , and , hence,

(19)

and that

(20)

where the last inequality follows because the profile maximizing

has , , and for where


Incorporating (19) and (20) into (18), we obtain

ACKNOWLEDGMENT

We thank N. Alon, N. Jevtic, G. Shamir, W. Szpankowski, and K. Viswanathan for helpful discussions.

REFERENCES

[1] B. Fittingoff, “Universal methods of coding for the case of unknown statistics,” in Proc. 5th Symp. Information Theory, Moscow-Gorky, U.S.S.R., 1972, pp. 129–135.

[2] L. Davisson, “Universal noiseless coding,” IEEE Trans. Inform. Theory, vol. IT-19, pp. 783–795, Nov. 1973.

[3] J. Shtarkov, “Coding of discrete sources with unknown statistics,” in Topics in Information Theory (Coll. Math. Soc. J. Bolyai, no. 16), I. Csiszár and P. Elias, Eds. Amsterdam, The Netherlands: North Holland, 1977, pp. 559–574.

[4] J. Kieffer, “A unified approach to weak universal source coding,” IEEE Trans. Inform. Theory, vol. IT-24, pp. 674–682, Nov. 1978.

[5] J. Rissanen, “Universal coding, information, prediction, and estimation,” IEEE Trans. Inform. Theory, vol. IT-30, pp. 629–636, July 1984.

[6] N. Merhav and M. Feder, “Universal prediction,” IEEE Trans. Inform. Theory, vol. 44, pp. 2124–2147, Oct. 1998.

[7] R. Krichevsky and V. Trofimov, “The performance of universal coding,” IEEE Trans. Inform. Theory, vol. IT-27, pp. 199–207, Mar. 1981.

[8] T. Cover, “Universal portfolios,” Math. Finance, vol. 1, no. 1, pp. 1–29, Jan. 1991.

[9] Y. Shtarkov, T. Tjalkens, and F. Willems, “Multialphabet universal coding of memoryless sources,” Probl. Inform. Transm., vol. 31, no. 2, pp. 114–127, 1995.

[10] T. Cover and E. Ordentlich, “Universal portfolios with side information,” IEEE Trans. Inform. Theory, vol. 42, pp. 348–363, Mar. 1996.

[11] J. Rissanen, “Fisher information and stochastic complexity,” IEEE Trans. Inform. Theory, vol. 42, pp. 40–47, Jan. 1996.

[12] W. Szpankowski, “On asymptotics of certain recurrences arising in universal coding,” Probl. Inform. Transm., vol. 34, no. 2, pp. 142–146, 1998.

[13] Q. Xie and A. Barron, “Asymptotic minimax regret for data compression, gambling and prediction,” IEEE Trans. Inform. Theory, vol. 46, pp. 431–445, Mar. 2000.

[14] M. Drmota and W. Szpankowski, “The precise minimax redundancy,” in Proc. IEEE Symp. Inform. Theory, Lausanne, Switzerland, June/July 2002, p. 35.

[15] A. Orlitsky and N. Santhanam, “Performance of universal codes over infinite alphabets,” in Proc. Data Compression Conf., Snowbird, UT, Mar. 2003, pp. 402–410.

[16] P. Elias, “Universal codeword sets and representations of integers,” IEEE Trans. Inform. Theory, vol. IT-21, pp. 194–203, Mar. 1975.

[17] L. Györfi, I. Pali, and E. van der Meulen, “On universal noiseless source coding for infinite source alphabets,” Europ. Trans. Telecommun. and Related Technologies, vol. 4, pp. 125–132, 1993.

[18] D. Foster, R. Stine, and A. Wyner, “Universal codes for finite sequences of integers drawn from a monotone distribution,” IEEE Trans. Inform. Theory, vol. 48, pp. 1713–1720, June 2002.

[19] T. Uyematsu and F. Kanaya, “Asymptotic optimality of two variations of Lempel-Ziv codes for sources with countably infinite alphabet,” in Proc. IEEE Symp. Information Theory, Lausanne, Switzerland, June/July 2002, p. 122.

[20] J. Kieffer and E. Yang, “Grammar based codes: A new class of universal lossless source codes,” IEEE Trans. Inform. Theory, vol. 46, pp. 737–754, May 2000.

[21] D. He and E. Yang, “On the universality of grammar-based codes for sources with countably infinite alphabets,” in Proc. IEEE Symp. Information Theory, Yokohama, Japan, June/July 2003.

[22] N. Jevtic, A. Orlitsky, and N. Santhanam, “Universal compression of unknown alphabets,” in Proc. IEEE Symp. Information Theory, Lausanne, Switzerland, 2002, p. 320.

[23] J. Åberg, Y. Shtarkov, and B. Smeets, “Multialphabet coding with separate alphabet description,” in Proc. Compression and Complexity of Sequences, Salerno, Italy, 1997, pp. 56–65.

[24] K. Church and W. Gale, “Probability scoring for spelling correction,” Statist. and Comput., vol. 1, pp. 93–103, 1991.

[25] W. Gale, K. Church, and D. Yarowsky, “A method for disambiguating word senses,” Comput. and Humanities, vol. 26, pp. 415–419, 1993.

[26] S. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” in Proc. 34th Annu. Meeting Association for Computational Linguistics, San Francisco, CA, 1996, pp. 310–318.

[27] K. Yamanishi, “A decision-theoretic extension of stochastic complexity and its application to learning,” IEEE Trans. Inform. Theory, vol. 44, pp. 1424–1439, July 1998.

[28] V. Vovk, “A game of prediction with expert advice,” J. Comput. Syst. Sci., vol. 56, no. 2, pp. 153–173, 1998.

[29] N. Cesa-Bianchi and G. Lugosi, “Minimax regret under log loss for general classes of experts,” in Proc. 12th Annu. Conf. Computational Learning Theory, Santa Cruz, CA, 1999, pp. 12–18.

[30] F. Song and W. Croft, “A general language model for information retrieval (poster abstract),” in Proc. Conf. Research and Development in Information Retrieval, Berkeley, CA, 1999, pp. 279–280.

[31] A. Dhulipala and A. Orlitsky, “On the redundancy of HMM patterns,” in Proc. IEEE Int. Symp. Information Theory, 2004, to be published.

[32] A. Orlitsky, N. Santhanam, and J. Zhang, “Always Good Turing: Asymptotically optimal probability estimation,” Science, vol. 302, no. 5644, pp. 427–431, Oct. 2003.

[33] E. Bell, “Exponential numbers,” Amer. Math. Monthly, vol. 41, pp. 411–419, 1934.

[34] G. Rota, “The number of partitions of a set,” Amer. Math. Monthly, vol. 71, pp. 498–504, 1964.

[35] N. Sloane. Online Encyclopedia of Integer Sequences [Online]. Available: http://www.research.att.com/~njas/sequences/

[36] J. van Lint and R. Wilson, A Course in Combinatorics. Cambridge, U.K.: Cambridge Univ. Press, 1996.

[37] Y. Shtarkov, “Universal sequential coding of single messages,” Probl. Inform. Transm., vol. 23, no. 3, pp. 3–17, 1987.

[38] W. Feller, An Introduction to Probability Theory. New York: Wiley, 1968.

[39] A. Orlitsky, N. Santhanam, K. Viswanathan, and J. Zhang, “On modeling profiles instead of values,” manuscript, submitted for publication.

[40] G. Hardy and S. Ramanujan, “Asymptotic formulae in combinatory analysis,” Proc. London Math. Soc., vol. 17, no. 2, pp. 75–115, 1918.

[41] G. Hardy and E. Wright, An Introduction to the Theory of Numbers. Oxford, U.K.: Oxford Univ. Press, 1985.

[42] I. Csiszár and P. Shields, “Redundancy rates for renewal and other processes,” IEEE Trans. Inform. Theory, vol. 42, pp. 2065–2072, Nov. 1996.

[43] N. Jevtic, A. Orlitsky, and N. Santhanam, “A lower bound on compression of unknown alphabets,” manuscript, submitted for publication.

[44] A. Orlitsky and N. Santhanam, “Speaking of infinity,” manuscript, submitted for publication.

[45] G. Shamir and L. Song, “On the entropy of patterns of i.i.d. sequences,” in Proc. 41st Annu. Allerton Conf. Communication, Control, and Computing, Allerton, IL, Oct. 2003, pp. 160–170.

[46] G. Shamir, “Universal lossless compression with unknown alphabets—The average case,” IEEE Trans. Inform. Theory, submitted for publication.