how to estimate capacity dimension

Nuclear Physics B (Proc. Suppl.) 5A (1988) 125-128, North-Holland, Amsterdam

HOW TO ESTIMATE CAPACITY DIMENSION

Francis SULLIVAN and Fern HUNT

Center for Applied Mathematics, National Bureau of Standards, Gaithersburg, Maryland 20899, USA

We describe a class of robust, computationally efficient algorithms for estimating capacity dimension and related quantities for compact subsets of $\mathbb{R}^n$. Our algorithms are based on Monte Carlo quadrature, data compression, and generalized distance functions.

1. Introduction.

We are interested in estimating the (capacity) dimension of complicated point sets in $\mathbb{R}^2$ and $\mathbb{R}^3$. This is a type of computation which occurs, for example, in the study of formation of clouds and particulates. For a point set, $\mathcal{A}$, the capacity dimension, $D$, is given by the equation

$$D = \limsup_{\epsilon \to 0} \frac{\ln N(\epsilon)}{\ln(1/\epsilon)}$$

Here $N(\epsilon)$ is the infimum of the number of cubes or "boxes" of side $\epsilon$ needed to cover $\mathcal{A}$. Several things about $D$ are obvious, or nearly so. First, if $\mathcal{A}$ is a finite set then $D = 0$, because eventually each point of $\mathcal{A}$ is in a separate box, so that $N(\epsilon)$ eventually remains constant as $\epsilon$ decreases. In calculations, of course, $\mathcal{A}$ is finite, but then $\epsilon$ is not allowed to get too small. How this is done will be explained later when "saturation" is discussed. It is also not too hard to see that if $\mathcal{A} \subset \mathbb{R}^2$ and $\mathcal{A}$ has some non-zero area, then $D = 2$, because the number of boxes to cover $\mathcal{A}$ eventually grows like $(1/\epsilon)^2$. This generalizes in the natural way. A little more thought reveals that for the Cantor subset of $[0,1]$, $D = \ln 2/\ln 3 = 0.6309\ldots$ because when $\epsilon = 1/3^k$, at least $2^k$ intervals are needed to cover $\mathcal{A}$. Thus the Cantor set is a fractal because its dimension is a fraction.
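As a brief worked check of the Cantor-set value, assuming the standard middle-thirds construction (stage $k$ consists of $2^k$ intervals of length $3^{-k}$, so that $N(3^{-k}) = 2^k$), the ratio in the definition is the same for every $k$:

```latex
\[
N(3^{-k}) = 2^{k}
\quad\Longrightarrow\quad
\frac{\ln N(3^{-k})}{\ln\!\left(1/3^{-k}\right)}
  = \frac{k \ln 2}{k \ln 3}
  = \frac{\ln 2}{\ln 3}
  \approx 0.6309 .
\]
```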

In Section 2 we describe the quadrature idea which is the basis of our techniques. Section 3 contains information on the compression map which is used to reformulate all calculations as efficient 1-D methods. Saturation is discussed in Section 4 and direct methods are outlined in Section 5.

2. Quadrature methods.

An obvious way to attempt to estimate $D$ is to cover the region with a regular grid of size $\epsilon$ and then count the number of grid blocks containing a point of $\mathcal{A}$. This estimates $N(\epsilon)$. If this calculation is done for a series of $\epsilon$'s, then the slope of a straight line fit to a graph of $\ln N(\epsilon)$ vs $\ln(1/\epsilon)$ can estimate $D$. The "box counting" algorithm just described looks like a direct implementation of the definitions, but it is known to have some theoretical difficulties, and it is fairly inefficient. We'll now describe a much more efficient algorithm. For $\epsilon > 0$, $A(\epsilon)$ will denote the area or volume of an $\epsilon$-cover of $\mathcal{A}$. That is

$$A(\epsilon) = \mathrm{area}\{y : \mathrm{dist}(y, \mathcal{A}) < \epsilon\}$$

If we assume that for sets in the plane $A(\epsilon)$ can be approximated by $N(\epsilon)\epsilon^2$, then taking logarithms gives $\ln A(\epsilon) \approx \ln N(\epsilon) - 2\ln(1/\epsilon)$, so that, since $D$ is approximated by $\ln N(\epsilon)/\ln(1/\epsilon)$, it is also approximated by $2 + \ln A(\epsilon)/\ln(1/\epsilon)$.
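As a baseline for comparison, the box-counting procedure described at the start of this section might be sketched as follows. This is only an illustration under stated assumptions: the function names (`box_count`, `fit_slope`, `box_counting_dimension`) are hypothetical, the grid sizes are taken as $\epsilon_j = 2^{-j}$, and an ordinary least-squares line is used for the fit.

```python
import math
import random

def box_count(points, eps):
    """Count grid cells of side eps that contain at least one point."""
    return len({(math.floor(x / eps), math.floor(y / eps)) for x, y in points})

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def box_counting_dimension(points, levels):
    """Slope of ln N(eps) versus ln(1/eps) for eps = 2**-j, j in levels."""
    levels = list(levels)
    xs = [j * math.log(2.0) for j in levels]                      # ln(1/eps)
    ys = [math.log(box_count(points, 2.0 ** -j)) for j in levels]  # ln N(eps)
    return fit_slope(xs, ys)

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical test set: random points on the diagonal of the unit square.
    diagonal = [(t, t) for t in (rng.random() for _ in range(20000))]
    print(box_counting_dimension(diagonal, range(1, 8)))
```

Every grid size requires a fresh pass over all the points, which is one way to see why a method that handles all levels at once is attractive.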

The approximation can be made rigorous [3] but our aim here is to show that it is possible to design an algorithm for estimating $A(\epsilon)$. The value of $A(\epsilon)$ can be approximated using Monte Carlo integration or directly.

In the Monte Carlo case, a sequence of random numbers $\{x_1, x_2, \ldots, x_M\}$ is generated and for each $x_i$ we evaluate

$$f_\epsilon(x_i) = \begin{cases} 1 & \text{if } \mathrm{dist}(x_i, \mathcal{A}) < \epsilon \\ 0 & \text{otherwise} \end{cases}$$

We take as $A(\epsilon)$ the mean $\sum_i f_\epsilon(x_i)/M$. This reduces the problem to that of finding a fast way to evaluate $f_\epsilon$. Suppose first that $\mathcal{A}$ is a subset of the line. We can sort $\mathcal{A}$ quickly using a suitable algorithm [2]. A binary search of the sorted list gives the $A \in \mathcal{A}$ such that $|A - x_i| = \mathrm{dist}(x_i, \mathcal{A})$. If this is less than or equal to $\epsilon$, then $f_\epsilon = 1$. In fact, more can be said. Let $\eta = |A - x_i|$; then we know that $f_\epsilon(x_i) = 0$ for all $\epsilon < \eta$ and $f_\epsilon(x_i) = 1$ otherwise. Think now of the sequence $\epsilon_j = 2^{-j}$. One can in fact determine the $f_\epsilon$ simultaneously for all $\epsilon_j$. We say that there is a "hit" at level $j$ if for $\epsilon = \epsilon_j$, $f_\epsilon(x) = 1$. According to the definition of $\eta$ given above, there are hits at all levels $j < -\log_2(\eta)$. This is almost the same as saying that $x_i$ agrees with some $A \in \mathcal{A}$ in the first $-\log_2(\eta)$ bits of the binary expansion. Of course, counting bits is not exactly the same as computing distance because of non-terminating binary expansions. (The numbers $1000\ldots$ and $0111\ldots$ are equal, but no bits agree.) However, because of the large size of the sets $\mathcal{A}$ used in computation, no significant errors are introduced by making this assumption. Defining distances in terms of bit agreement leads to extremely fast programs for both vector processors and massively parallel bit-serial machines such as the MPP and the DAP.
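The one-dimensional Monte Carlo scheme of this section might be sketched as follows. It is a minimal illustration, not the authors' implementation: the names `nearest_distance` and `monte_carlo_cover` are hypothetical, test points are assumed uniform on $[0,1)$, and true distances are used in place of the bit-agreement shortcut described above.

```python
import bisect
import math
import random

def nearest_distance(sorted_pts, x):
    """Distance from x to the nearest element of a sorted list (binary search)."""
    i = bisect.bisect_left(sorted_pts, x)
    best = math.inf
    if i < len(sorted_pts):
        best = sorted_pts[i] - x
    if i > 0:
        best = min(best, x - sorted_pts[i - 1])
    return best

def monte_carlo_cover(points, max_level, n_samples, rng):
    """Estimate A(eps_j), eps_j = 2**-j, for j = 0..max_level in one pass.

    Each random test point x scores a hit at every level j with
    dist(x, points) < 2**-j, so one binary search serves all levels.
    """
    sorted_pts = sorted(points)
    hits = [0] * (max_level + 1)
    for _ in range(n_samples):
        eta = nearest_distance(sorted_pts, rng.random())
        for j in range(max_level + 1):
            if eta < 2.0 ** -j:
                hits[j] += 1
            else:
                break  # eta >= eps_j implies eta >= eps_k for every k > j
    return [h / n_samples for h in hits]

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical example: a single point at 0.5, so A(eps) = 2*eps for eps <= 0.5.
    print(monte_carlo_cover([0.5], 4, 20000, rng))
```

The sort is done once, and each test point then costs one binary search regardless of how many levels are tracked.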

3. Compression.

Suppose now that $\mathcal{A} \subset \mathbb{R}^2$. Points of $\mathcal{A}$ are pairs of co-ordinates $\langle B, C \rangle$. The 1-D algorithm suggests that we assume $B$ and $C$ are given in binary notation so that $B = 0.b_1 b_2 \ldots$ and $C = 0.c_1 c_2 \ldots$, where the $b_i$, $c_i$ are all 0's and 1's. We can define a single number $D = 0.d_1 d_2 \ldots$ using the formula $d_i = c_i + 2b_i$. This definition gives a unique base-four representation for each pair $\langle B, C \rangle$, i.e. we have a one-to-one map, $c$, from the unit square of $\mathbb{R}^2$ to the unit interval of $\mathbb{R}$. Since $c[\mathcal{A}]$ is just a list of real numbers, we can sort it and then generate random test points as in the one-dimensional case. We would now expect hits at all levels less than $-\log_4(\eta)$. This is not precisely correct because the problem illustrated by the case of non-terminating decimal expansions cannot be avoided, since the unit square cannot be mapped 1-1 continuously onto the unit interval. However, more analysis shows that in this case it doesn't matter. The map $c$ is measure preserving, and from this it can be proved that

$$A(\epsilon, \mathcal{A}) = L(\epsilon^2, c[\mathcal{A}])$$

where $A$ is the area determined by Lebesgue measure on the unit square and $L$ is the length determined by Lebesgue measure on the interval. Hence,

$$\dim[\mathcal{A}] = 2 \dim[c[\mathcal{A}]]$$

Therefore, all calculations can be done on the compressed 1-dimensional list. Notice that once the compression is done, one can use $\epsilon$ in powers of any base $b$ by determining integer levels:

$$l_i = -\lfloor \log_b(\eta(x_i)) \rfloor = -\left\lfloor \frac{\log(\eta(x_i))}{\log(b)} \right\rfloor$$

The obvious generalization to 3-D also works. However, the compression mapping requires the use of three bits for each base-eight digit, and so for fixed length floating point numbers, accuracy degenerates.
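The two-dimensional compression map described in this section, with base-four digits $d_i = c_i + 2b_i$, might be sketched as follows. The function name `compress` and the `bits` parameter are hypothetical; the sketch assumes coordinates in $[0,1)$ and truncates each to a fixed number of binary digits.

```python
def compress(b, c, bits=16):
    """Map a pair (b, c) in [0,1)^2 to a single number in [0,1).

    Base-four digit i of the result is d_i = c_i + 2*b_i, where b_i and
    c_i are the i-th binary digits of b and c (truncated to `bits` digits).
    """
    bi = int(b * (1 << bits))  # first `bits` binary digits of b, as an integer
    ci = int(c * (1 << bits))  # first `bits` binary digits of c, as an integer
    d = 0
    for i in range(bits - 1, -1, -1):  # most significant digit first
        digit = ((ci >> i) & 1) + 2 * ((bi >> i) & 1)
        d = (d << 2) | digit           # append one base-four digit
    return d / float(1 << (2 * bits))

if __name__ == "__main__":
    # B = 0.11 and C = 0.10 in binary give D = 0.32 in base four, i.e. 0.875.
    print(compress(0.75, 0.5))
```

With `bits=16` the result uses 32 bits and fits exactly in a double. A three-coordinate version would need three bits per output digit, which illustrates the loss of accuracy for fixed-length numbers noted above.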

4. Saturation.

As has been mentioned, we always work with a sequence of $\epsilon_j$ and we would like to stop decreasing $\epsilon$ when $N(\epsilon)$ is the same as $N$, the number of points in $\mathcal{A}$. When should this happen? Because of the definition of $D$, one should not go beyond the saturation level $j = \ln N/D$ (with the logarithm taken in the same base as the levels $\epsilon_j$). Numerical experiments with sets for which $D$ is known seem to indicate that the best estimate is obtained at the saturation level. Of course, in interesting cases $D$ is unknown and $j$ cannot be predicted. However, there is a related idea which can be used instead of fitting a series of $\epsilon$ values.

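For $A \in \mathcal{A}$ let $\delta(A)$ be the distance from $A$ to its nearest neighbor in $\mathcal{A}$ and let $\langle\delta\rangle$ denote the average value of this quantity. If $\mathcal{A}$ has $N$ points then the mean $\langle\delta\rangle$ satisfies

$$\langle\delta\rangle = \int \delta\, P(\delta, N)\, d\delta$$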

Here $P(\delta, N)$ is the probability density for $\delta(A)$ in $\mathcal{A}$. If $\langle\delta\rangle$ can be estimated then we can estimate the dimension given by:

$$D(1) = \lim_{N \to \infty} \left( -\frac{\ln N}{\ln \langle\delta\rangle} \right)$$

Obviously $\langle\delta\rangle$ behaves like $N^{-1/D(1)}$ and $D(1)$ should be close to $D$.

It turns out that $D(1)$ can be estimated by numerical quadrature because a Monte Carlo-like technique can be used to approximate $\int P(\delta, N)\, d\delta$ over sub-intervals. Here is a sketch of the algorithm [3]:

• Generate $M$ samples of $\mathcal{A}$.

• Generate $N$ additional points of $\mathcal{A}$, with $M \ll N$.

• Use the $M$ samples and $N$ points to compute the areas $A(\delta_l)$ by the sorting method. Here $\delta_l = 2^{-l}$.

• Approximate the integral of $P(\delta, N)$ over the interval $[\delta_{l+1}, \delta_l]$ by $A(\delta_l) - A(\delta_{l+1})$.

• Approximate $\int \delta\, P(\delta, N)\, d\delta$ by a quadrature formula.

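The same method can be used to approximate the means

$$\langle\delta^\gamma\rangle = \int \delta^\gamma P(\delta, N)\, d\delta$$

and the corresponding dimensions

$$D(\gamma) = \lim_{N \to \infty} \left( -\frac{\gamma \ln N}{\ln \langle\delta^\gamma\rangle} \right)$$

It is known [1] that for many sets the capacity dimension is the unique $\gamma$ such that $D(\gamma) = \gamma$.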
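The quadrature steps listed in this section might be sketched in one dimension as follows. Several choices here are assumptions rather than details from the text: the samples and the points are taken to be independent draws from the same set, each interval $[\delta_{l+1}, \delta_l]$ is weighted by its midpoint as the quadrature rule, and the names `nn_distance`, `mean_delta_gamma` and `dimension_estimate` are hypothetical.

```python
import bisect
import math
import random

def nn_distance(sorted_pts, x):
    """Distance from x to the nearest element of a sorted list."""
    i = bisect.bisect_left(sorted_pts, x)
    best = math.inf
    if i < len(sorted_pts):
        best = sorted_pts[i] - x
    if i > 0:
        best = min(best, x - sorted_pts[i - 1])
    return best

def mean_delta_gamma(samples, points, gamma, max_level):
    """Quadrature estimate of <delta^gamma> from the covers A(delta_l).

    A(delta_l) is the fraction of samples within delta_l = 2**-l of `points`;
    A(delta_l) - A(delta_{l+1}) approximates the mass of P(delta, N) on
    [delta_{l+1}, delta_l], and that interval is weighted by its midpoint.
    Mass below delta_{max_level+1} is neglected.
    """
    sorted_pts = sorted(points)
    dists = [nn_distance(sorted_pts, s) for s in samples]
    m = len(dists)
    cover = [sum(d < 2.0 ** -l for d in dists) / m for l in range(max_level + 2)]
    total = 0.0
    for l in range(max_level + 1):
        mass = cover[l] - cover[l + 1]
        midpoint = 0.75 * 2.0 ** -l  # midpoint of [2**-(l+1), 2**-l]
        total += midpoint ** gamma * mass
    return total

def dimension_estimate(samples, points, gamma=1.0, max_level=40):
    """Finite-N value of -gamma * ln N / ln <delta^gamma>."""
    mean = mean_delta_gamma(samples, points, gamma, max_level)
    return -gamma * math.log(len(points)) / math.log(mean)

if __name__ == "__main__":
    rng = random.Random(2)
    pts = [rng.random() for _ in range(20000)]  # N points filling the unit interval
    smp = [rng.random() for _ in range(500)]    # M << N samples
    print(dimension_estimate(smp, pts))
```

Because the limit $N \to \infty$ is replaced by a single finite $N$ and the bins span factors of two, the value obtained for a set that fills the interval should only be expected to land near 1, not on it.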

5. Direct methods.

Note that once $\mathcal{A}$ is sorted, the generated random numbers used to determine $A(\epsilon)$ must fall in the gaps $g_i$ between the $A_i \in \mathcal{A}$, i.e.

$$0 \leftarrow g_0 \rightarrow A_1 \leftarrow g_1 \rightarrow A_2 \leftarrow g_2 \rightarrow \cdots \; 1$$

By a "total sample" of $\mathcal{A}$ we mean the entire set $\mathcal{A}$. In computing $A(\epsilon)$ if total sampling is done, no random numbers are needed. To see this, note that the probability of generating a random $x$ with $\mathrm{dist}(x, \mathcal{A}) < \epsilon$ is given by:

$$A(\epsilon) = \sum_{g_i < 2\epsilon} g_i + 2\epsilon \sum_{g_i \ge 2\epsilon} 1$$

For a large $\mathcal{A}$, a direct computation of this sum would be inaccurate because of roundoff accumulation. In order to avoid adding up many small quantities, we use the gaps to define bins (by powers of 2, for example) via

$$\mathrm{bin}_i = -\lfloor \log_2(g_i) \rfloor$$

and

$$H(\mathrm{bin}_i) = H(\mathrm{bin}_i) + 1$$

Combining this with the above gives the approximation

$$A(\epsilon) \approx \sum_{\delta_l < 2\epsilon} \delta_l H(l) + 2\epsilon \left( \sum_{\delta_l \ge 2\epsilon} H(l) \right)$$
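The direct computation from gaps, and its binned approximation, might be sketched as follows. The function names are hypothetical, and one detail is an assumption added here: the two end gaps (from 0 to the first point and from the last point to 1) are handled separately in the exact version because they are covered from one side only, while the histogram version covers interior gaps only.

```python
import math

def cover_from_gaps(sorted_pts, eps):
    """Exact length of {x in [0,1] : dist(x, A) < eps}, computed from the gaps.

    An interior gap g is covered from both ends and contributes min(g, 2*eps);
    each end gap touches only one point and contributes min(g, eps).
    """
    total = min(sorted_pts[0], eps) + min(1.0 - sorted_pts[-1], eps)
    for left, right in zip(sorted_pts, sorted_pts[1:]):
        total += min(right - left, 2.0 * eps)
    return total

def gap_histogram(sorted_pts):
    """H[l] = number of interior gaps g with -floor(log2(g)) == l."""
    hist = {}
    for left, right in zip(sorted_pts, sorted_pts[1:]):
        g = right - left
        if g > 0.0:
            level = -math.floor(math.log2(g))
            hist[level] = hist.get(level, 0) + 1
    return hist

def cover_from_histogram(hist, eps):
    """Binned approximation to the interior-gap part of A(eps).

    Bins with delta_l = 2**-l below 2*eps contribute delta_l * H(l);
    all other bins contribute 2*eps * H(l).
    """
    total = 0.0
    for level, count in hist.items():
        delta = 2.0 ** -level
        total += (delta if delta < 2.0 * eps else 2.0 * eps) * count
    return total

if __name__ == "__main__":
    pts = [0.25, 0.5, 0.75]
    hist = gap_histogram(pts)  # built once, reused for every eps
    for eps in (0.2, 0.1, 0.05):
        print(eps, cover_from_gaps(pts, eps), cover_from_histogram(hist, eps))
```

The histogram holds integer counts, so the only floating-point additions are one term per bin, not one per gap.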

Notice that this allows for the evaluation of the whole series of $A(\epsilon_j)$ once the $H(l)$ have been determined. Another useful approximation is

$$\frac{H(l)}{N} \approx \int_{\delta_{l+1}}^{\delta_l} P(\delta, N)\, d\delta$$

Estimating the integral as a sum gives the interesting

formula

References

[1] Badii, R. and Politi, A. 1985. Statistical Description of Chaotic Attractors. J. of Statistical Physics 40.

[2] Brock, H., Brooks, B. and Sullivan, F. 1981. Diamond: A Sorting Method for Vector Machines. BIT 21.

[3] Hunt, F. and Sullivan, F. 1986. Efficient Algorithms for Computing Fractal Dimensions. Proceedings: Conference on Dimensions and Entropy of Dynamical Systems. Los Alamos National Laboratory.
