Nuclear Physics B (Proc. Suppl.) 5A (1988) 125-128, North-Holland, Amsterdam
HOW TO ESTIMATE CAPACITY DIMENSION
Francis SULLIVAN and Fern HUNT
Center for Applied Mathematics, National Bureau of Standards, Gaithersburg, Maryland 20899, USA
We describe a class of robust, computationally efficient algorithms for estimating capacity dimension and related quantities for compact subsets of $\mathbb{R}^n$. Our algorithms are based on Monte Carlo quadrature, data compression, and generalized distance functions.
1. Introduction.
We are interested in estimating the (capacity) dimension of complicated point sets in $\mathbb{R}^2$ and $\mathbb{R}^3$. This is a type of computation which occurs, for example, in the study of the formation of clouds and particulates. For a point set $\mathcal{A}$, the capacity dimension, $D$, is given by the equation
$$D = \limsup_{\epsilon \to 0} \frac{\ln N(\epsilon)}{\ln(1/\epsilon)}$$
Here $N(\epsilon)$ is the infimum of the number of cubes or "boxes" of side $\epsilon$ needed to cover $\mathcal{A}$. Several things about $D$ are obvious, or nearly so. First, if $\mathcal{A}$ is a finite set then $D = 0$, because eventually each point of $\mathcal{A}$ is in a separate box, so that $N(\epsilon)$ eventually remains constant as $\epsilon$ decreases. In calculations, of course, $\mathcal{A}$ is finite, but then $\epsilon$ is not allowed to get too small. How this is done will be explained later when "saturation" is discussed. It is also not too hard to see that if $\mathcal{A} \subset \mathbb{R}^2$ and $\mathcal{A}$ has some non-zero area, then $D = 2$, because the number of boxes needed to cover $\mathcal{A}$ eventually grows like $(1/\epsilon)^2$. This generalizes in the natural way. A little more thought reveals that for the Cantor subset of $[0,1]$, $D = \ln 2/\ln 3 = 0.6309\ldots$, because when $\epsilon = 1/3^k$, at least $2^k$ intervals are needed to cover $\mathcal{A}$. Thus the Cantor set is a fractal because its dimension is a fraction.
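The Cantor set claim can be checked numerically. The following is a minimal sketch, not part of the original paper: it represents the left endpoints of the level-n Cantor intervals as integers scaled by 3^n (so no floating-point rounding enters the box assignment) and counts the boxes of side 3^-k that contain a point. The function names `cantor_points` and `box_count` are illustrative assumptions.

```python
import math
from itertools import product

def cantor_points(n):
    """Left endpoints of the level-n Cantor intervals, as integers scaled by 3**n.

    Each point has n ternary digits, all of them 0 or 2.
    """
    return [sum(d * 3 ** (n - 1 - i) for i, d in enumerate(digits))
            for digits in product((0, 2), repeat=n)]

def box_count(points, n, k):
    """Number of boxes of side 3**-k that contain at least one point."""
    width = 3 ** (n - k)          # box width in the scaled integer units
    return len({p // width for p in points})

n = 10
pts = cantor_points(n)
for k in range(1, 8):
    N = box_count(pts, n, k)
    # ln N(eps) / ln(1/eps) with eps = 3**-k
    print(k, N, math.log(N) / math.log(3 ** k))
```

For every k shown, the count is 2^k and the ratio equals ln 2/ln 3, in agreement with the value quoted in the text.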
In Section 2 we describe the quadrature idea which is the basis of our techniques. Section 3 contains information on the compression map which is used to reformulate all calculations as efficient 1-D methods. Saturation is discussed in Section 4 and direct methods are outlined in Section 5.
2. Quadrature methods.
An obvious way to attempt to estimate $D$ is to cover the region with a regular grid of size $\epsilon$ and then count the number of grid blocks containing a point of $\mathcal{A}$. This estimates $N(\epsilon)$. If this calculation is done for a series of $\epsilon$'s, then the slope of a straight line fit to a graph of $\ln N(\epsilon)$ vs. $\ln(1/\epsilon)$ can estimate $D$. The "box counting" algorithm just described looks like a direct implementation of the definitions, but it is known to have some theoretical difficulties, and it is fairly inefficient. We'll now describe a much more efficient algorithm. For $\epsilon > 0$, $A(\epsilon)$ will
denote the area or volume of an $\epsilon$-cover of $\mathcal{A}$. That is,
$$A(\epsilon) = \mathrm{area}\{y : \mathrm{dist}(y, \mathcal{A}) < \epsilon\}$$
If we assume that for sets in the plane $A(\epsilon)$ can be approximated by $N(\epsilon)\epsilon^2$, then taking logarithms we get that, since $D$ is approximated by $\ln N(\epsilon)/\ln(1/\epsilon)$, it is also approximated by $2 + \ln A(\epsilon)/\ln(1/\epsilon)$.
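The step from $N(\epsilon)$ to $A(\epsilon)$ can be written out explicitly. This short derivation only uses the stated assumption $A(\epsilon) \approx N(\epsilon)\epsilon^2$ for sets in the plane:

```latex
\begin{align*}
\ln A(\epsilon) &\approx \ln N(\epsilon) + 2 \ln \epsilon
                 = \ln N(\epsilon) - 2 \ln(1/\epsilon), \\
\frac{\ln N(\epsilon)}{\ln(1/\epsilon)} &\approx 2 + \frac{\ln A(\epsilon)}{\ln(1/\epsilon)} .
\end{align*}
```

Since $A(\epsilon) \le 1$ for a subset of the unit square, $\ln A(\epsilon) \le 0$ and the estimate never exceeds 2, as expected.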
The approximation can be made rigorous [3], but our aim here is to show that it is possible to design an algorithm for estimating $A(\epsilon)$. The value of $A(\epsilon)$ can be approximated using Monte Carlo integration or directly.
In the Monte Carlo case, a sequence of random numbers $\{x_1, x_2, \ldots, x_M\}$ is generated and for each $x_i$ we evaluate
$$f_\epsilon(x_i) = \begin{cases} 1 & \text{if } \mathrm{dist}(x_i, \mathcal{A}) < \epsilon \\ 0 & \text{otherwise} \end{cases}$$
We take as $A(\epsilon)$ the mean $\sum_i f_\epsilon(x_i)/M$. This reduces the problem to that of finding a fast way to evaluate $f_\epsilon$. Suppose first that $\mathcal{A}$ is a subset of the line. We can sort $\mathcal{A}$ quickly using a suitable algorithm [2]. A binary search of the sorted list gives the $A \in \mathcal{A}$ such that $|A - x_i| = \mathrm{dist}(x_i, \mathcal{A})$. If this is less than or equal to $\epsilon$, then $f_\epsilon = 1$. In fact, more can be said. Let $\eta = |A - x_i|$; then we know that $f_\epsilon(x_i) = 0$ for all $\epsilon < \eta$ and $f_\epsilon(x_i) = 1$ otherwise. Think now of the sequence $\epsilon_j = 2^{-j}$. One can in fact determine the $f_\epsilon$ simultaneously for all $\epsilon_j$. We say that there is a "hit" at level $j$ if for $\epsilon = \epsilon_j$, $f_\epsilon(x) = 1$. According to the definition of $\eta$ given above, there are hits at all levels $j < -\log_2(\eta)$. This is almost the same as saying that $x_i$ agrees with some $A \in \mathcal{A}$ in the first $-\log_2(\eta)$ bits of the binary expansion. Of course, counting bits is not exactly the same as computing distance because of non-terminating binary expansions. (The numbers 1000... and 0111... are equal, but no bits agree.) However, because of the large size of the sets $\mathcal{A}$ used in computation, no significant errors are introduced by making this assumption. Defining distances in terms of bit agreement leads to extremely fast programs for both vector processors and massively parallel bit-serial machines such as the MPP and the DAP.
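The one-dimensional Monte Carlo procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses Python's standard `sorted` and `bisect` in place of the vector sort of [2], computes true distances rather than bit agreement, and the function names `nearest_distance` and `cover_lengths` are hypothetical.

```python
import bisect
import random

def nearest_distance(sorted_pts, x):
    """Distance from x to the nearest element of a sorted, non-empty list."""
    i = bisect.bisect_left(sorted_pts, x)
    # Only the neighbours on either side of the insertion point can be nearest.
    return min(abs(x - a) for a in sorted_pts[max(i - 1, 0):i + 1])

def cover_lengths(points, n_samples, max_level, rng):
    """Monte Carlo estimates of A(eps_j) for eps_j = 2**-j, j = 0..max_level."""
    pts = sorted(points)
    hits = [0] * (max_level + 1)
    for _ in range(n_samples):
        eta = nearest_distance(pts, rng.random())   # test point uniform in [0, 1)
        for j in range(max_level + 1):
            if eta < 2.0 ** -j:
                hits[j] += 1        # a "hit" at level j
            else:
                break               # eps_j decreases, so no further hits
    return [h / n_samples for h in hits]

rng = random.Random(0)
A = cover_lengths([0.5], 20000, 4, rng)
print(A)
```

A single binary search per test point yields the hits at all levels at once, which is the property the text exploits. For the one-point set {0.5} the estimates should be close to 1, 1, 1/2, 1/4, 1/8.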
3. Compression.
Suppose now that $\mathcal{A} \subset \mathbb{R}^2$. Points of $\mathcal{A}$ are pairs of co-ordinates $\langle B, C \rangle$. The 1-D algorithm suggests that we assume $B$ and $C$ are given in binary notation so that $B = b_1 b_2 \ldots$ and $C = c_1 c_2 \ldots$ where the $b_i$, $c_i$ are all 0's and 1's. We can define a single number $D = d_1 d_2 \ldots$ (not to be confused with the dimension $D$) using the formula $d_i = c_i + 2 b_i$. This definition gives a unique base-four representation for each pair $\langle B, C \rangle$, i.e. we have a one-to-one map, $c$, from the unit square of $\mathbb{R}^2$ to the unit interval of $\mathbb{R}$. Since $c[\mathcal{A}]$ is just a list of real numbers, we can sort it and then generate random test points as in the one-dimensional case. We would now expect hits at all levels less than $-\log_4(\eta)$. This is not precisely correct because the problem illustrated by the case of non-terminating decimal expansions cannot be avoided, since the unit square cannot be mapped 1-1 continuously onto the unit interval. However, more analysis shows that in this case it doesn't matter. The map $c$ is measure preserving, and from this it can be proved that
$$A(\epsilon, \mathcal{A}) = L(\epsilon^2, c[\mathcal{A}])$$
where $A$ is the area determined by Lebesgue measure on the unit square and $L$ is the length determined by Lebesgue measure on the interval. Hence,
$$\dim[\mathcal{A}] = 2 \dim[c[\mathcal{A}]]$$
Therefore, all calculations can be done on the compressed 1-dimensional list. Notice that once the compression is done, one can use $\epsilon$ in powers of any base $b$ by determining integer levels:
$$l_i = -\lfloor \log_b(\eta(x_i)) \rfloor = -\left\lfloor \frac{\log(\eta(x_i))}{\log(b)} \right\rfloor$$
The obvious generalization to 3-D also works. However, the compression mapping requires the use of three bits for each base-eight digit, and so for fixed length floating point numbers, accuracy degenerates.
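The compression map and the integer levels can be illustrated with a short sketch. It is an assumption-laden example rather than the authors' code: coordinates are truncated to a fixed number of bits (24 by default, so that the 48-bit interleaved key is still exact in a double), and the names `compress` and `level` are hypothetical.

```python
import math

def compress(b, c, bits=24):
    """Map (b, c) in [0,1)^2 to [0,1) via base-four digits d_i = c_i + 2*b_i."""
    bi = int(b * (1 << bits))        # first `bits` binary digits of b
    ci = int(c * (1 << bits))        # first `bits` binary digits of c
    key = 0
    for i in range(bits - 1, -1, -1):            # most significant bit first
        d = ((ci >> i) & 1) + 2 * ((bi >> i) & 1)
        key = key * 4 + d                        # append one base-four digit
    return key / 4 ** bits

def level(eta, base):
    """Integer level l = -floor(log_base(eta)) for a distance eta > 0."""
    return -math.floor(math.log(eta) / math.log(base))

print(compress(0.25, 0.75))   # B = .01, C = .11 in binary -> digits 1, 3 -> 7/16
print(level(0.1, 4))
```

Each base-four digit consumes one bit from each coordinate, which is why, as the text notes, precision runs out faster as the number of interleaved coordinates grows.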
4. Saturation.
As has been mentioned, we always work with a sequence of $\epsilon_j$ and we would like to stop decreasing $\epsilon$ when $N(\epsilon)$ is the same as $N$, the number of points in $\mathcal{A}$. When should this happen? Because of the definition of $D$, $N(\epsilon)$ grows like $\epsilon^{-D}$, so one should not go beyond the saturation level $j = \ln N/D$, where the logarithm is taken to the base of the sequence $\epsilon_j$ (base 2 for $\epsilon_j = 2^{-j}$). Numerical experiments with sets for which $D$ is known seem to indicate that the best estimate is obtained at the saturation level. Of course, in interesting cases $D$ is unknown and $j$ cannot be predicted. However, there is a related idea which can be used instead of fitting a series of $\epsilon$ values.
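For $A \in \mathcal{A}$ let $\delta(A)$ be the distance from $A$ to its nearest neighbor in $\mathcal{A}$ and let $\langle \delta \rangle$ denote the average value of this quantity. If $\mathcal{A}$ has $N$ points then the mean $\langle \delta \rangle$ satisfies
$$\langle \delta \rangle = \int \delta\, P(\delta, N)\, d\delta$$
Here $P(\delta, N)$ is the probability density for $\delta(A)$ in $\mathcal{A}$. If $\langle \delta \rangle$ can be estimated then we can estimate the dimension given by:
$$D(1) = \lim_{N \to \infty} \left( -\frac{\ln N}{\ln \langle \delta \rangle} \right)$$
Obviously $\langle \delta \rangle$ behaves like $N^{-1/D(1)}$ and $D(1)$ should be close to $D$.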
It turns out that $D(1)$ can be estimated by numerical quadrature because a Monte Carlo-like technique can be used to approximate $\int P(\delta, N)\, d\delta$ over sub-intervals. Here is a sketch of the algorithm [3]:
• Generate $M$ samples of $\mathcal{A}$.
• Generate $N$ additional points of $\mathcal{A}$, with $M \ll N$.
• Use the $M$ samples and $N$ points to compute the areas $A(\delta_l)$ by the sorting method. Here $\delta_l = 2^{-l}$.
• Approximate the integral of $P(\delta, N)$ over the interval $[\delta_{l+1}, \delta_l]$ by $A(\delta_l) - A(\delta_{l+1})$.
• Approximate $\int \delta\, P(\delta, N)\, d\delta$ by a quadrature formula.
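The same method can be used to approximate the means
$$\langle \delta^\gamma \rangle = \int \delta^\gamma P(\delta, N)\, d\delta$$
and the corresponding dimensions
$$D(\gamma) = \lim_{N \to \infty} \left( -\frac{\gamma \ln N}{\ln \langle \delta^\gamma \rangle} \right)$$
It is known [1] that for many sets the capacity dimension is the unique $\gamma$ such that $D(\gamma) = \gamma$.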
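The nearest-neighbor quadrature can be sketched for a one-dimensional set as follows. This is a rough illustration, not the algorithm of [3]: the choice of the geometric midpoint of each interval as the quadrature node, the finite-N evaluation of the limit, and the names `nearest_distance` and `dimension_gamma` are all assumptions made for the example.

```python
import bisect
import math
import random

def nearest_distance(sorted_pts, x):
    """Distance from x to the nearest element of a sorted, non-empty list."""
    i = bisect.bisect_left(sorted_pts, x)
    return min(abs(x - a) for a in sorted_pts[max(i - 1, 0):i + 1])

def dimension_gamma(samples, points, gamma=1.0, max_level=40):
    """Finite-N estimate of D(gamma) = -gamma * ln N / ln <delta**gamma>."""
    pts = sorted(points)
    # counts[l] = number of samples whose distance is below delta_l = 2**-l;
    # the extra trailing zero puts all finer distances into the last bin.
    counts = [0] * (max_level + 2)
    for x in samples:
        eta = nearest_distance(pts, x)
        for l in range(max_level + 1):
            if eta < 2.0 ** -l:
                counts[l] += 1
            else:
                break
    moment = 0.0
    for l in range(max_level + 1):
        # A(delta_l) - A(delta_{l+1}) approximates the mass of P on the interval.
        mass = (counts[l] - counts[l + 1]) / len(samples)
        node = 2.0 ** -(l + 0.5)     # assumed node: geometric midpoint
        moment += node ** gamma * mass
    return -gamma * math.log(len(points)) / math.log(moment)

rng = random.Random(1)
points = [rng.random() for _ in range(20000)]    # N points of the set
samples = [rng.random() for _ in range(500)]     # M << N samples of the set
print(dimension_gamma(samples, points))
```

For points drawn uniformly from the unit interval the true dimension is 1. At finite N the estimate falls somewhat below that value, because the mean distance scales like 1/(2N) rather than 1/N, and it approaches 1 only slowly as N grows.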
5. Direct methods.
Note that once $\mathcal{A}$ is sorted, the generated random numbers used to determine $A(\epsilon)$ must fall in the gaps $g_i$ between the $A_i \in \mathcal{A}$, i.e.
$$0 \leftarrow g_0 \rightarrow A_1 \leftarrow g_1 \rightarrow A_2 \leftarrow g_2 \rightarrow \cdots \; 1$$
By a "total sample" of $\mathcal{A}$ we mean the entire set $\mathcal{A}$. In computing $A(\epsilon)$, if total sampling is done, no random
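numbers are needed. To see this, note that the probability of generating a random $x$ with $\mathrm{dist}(x, \mathcal{A}) < \epsilon$ is given by:
$$A(\epsilon) = \sum_{g_i < 2\epsilon} g_i + 2\epsilon \sum_{g_i \ge 2\epsilon} 1$$
A gap shorter than $2\epsilon$ is covered entirely, while a longer gap contributes $\epsilon$ at each end.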
For a large $\mathcal{A}$, a direct computation of this sum would be inaccurate because of roundoff accumulation. In order to avoid adding up many small quantities, we use the gaps to define bins (by powers of 2, for example) via
$$\mathrm{bin}_i = -\lfloor \log_2(g_i) \rfloor$$
and
$$H(\mathrm{bin}_i) = H(\mathrm{bin}_i) + 1$$
Combining this with the above gives, with $\delta_l = 2^{-l}$, the approximation
$$A(\epsilon) \approx \sum_{\delta_l < 2\epsilon} \delta_l H(l) + 2\epsilon \sum_{\delta_l \ge 2\epsilon} H(l)$$
Notice that this allows for the evaluation of the whole series of $A(\epsilon_j)$ once the $H(l)$ have been determined. Another useful approximation is
$$H(l) \approx N \int_{\delta_{l+1}}^{\delta_l} P(\delta, N)\, d\delta$$
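The direct method can be sketched in a few lines. This is a minimal illustration under stated assumptions: the set lies in the unit interval, the boundary gaps next to 0 and 1 are treated like interior gaps (as in the gap diagram), zero-length gaps from duplicate points are skipped, and the names `gaps_of`, `direct_cover_length`, `gap_histogram` and `binned_cover_length` are hypothetical.

```python
import math

def gaps_of(points):
    """Gaps between consecutive sorted points, including the ends 0 and 1."""
    pts = [0.0] + sorted(points) + [1.0]
    return [b - a for a, b in zip(pts, pts[1:])]

def direct_cover_length(points, eps):
    """Gap sum: a gap below 2*eps counts in full, a longer one counts 2*eps."""
    return sum(g if g < 2 * eps else 2 * eps for g in gaps_of(points))

def gap_histogram(points):
    """H[l] = number of gaps with bin index l = -floor(log2(gap))."""
    H = {}
    for g in gaps_of(points):
        if g > 0:                          # skip duplicates (zero-length gaps)
            l = -math.floor(math.log2(g))
            H[l] = H.get(l, 0) + 1
    return H

def binned_cover_length(H, eps):
    """Approximate A(eps) from the histogram, using delta_l = 2**-l per bin."""
    total = 0.0
    for l, count in H.items():
        delta = 2.0 ** -l
        total += (delta if delta < 2 * eps else 2 * eps) * count
    return total

pts = [0.25, 0.5, 0.75]
H = gap_histogram(pts)
for j in range(1, 6):
    eps = 2.0 ** -j
    print(j, direct_cover_length(pts, eps), binned_cover_length(H, eps))
```

Once the histogram is built, each additional value of eps costs only a pass over the bins rather than over the gaps, which is the saving the text describes. In this sketch the binned value replaces every gap by the lower edge of its bin, so it can underestimate the gap sum by up to a factor of two on the fully covered gaps.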
Estimating the integral as a sum gives the interesting
formula
References
[1] Badii, R. and Politi, A. 1985. Statistical Description of Chaotic Attractors. J. of Statistical Physics 40.
[2] Brock, H., Brooks, B. and Sullivan, F. 1981. Diamond: A Sorting Method for Vector Machines. BIT 21.
[3] Hunt, F. and Sullivan, F. 1986. Efficient Algorithms for Computing Fractal Dimensions. Proceedings: Conference on Dimensions and Entropy of Dynamical Systems. Los Alamos National Laboratory.