Density based exploration of bivariate data



Statistics and Computing 3 (1993) 171-177

Density based exploration of bivariate data

ADRIAN BOWMAN

Statistics Department, The University, Glasgow G12 8QQ, UK

PETER FOSTER

Statistical Laboratory, Mathematics Department, The University, Manchester M13 9PL, UK

Received May 1993 and accepted May 1993

The difficulties of assessing details of the shape of a bivariate distribution, and of contrasting subgroups, from a raw scatterplot are discussed. The use of contours of a density estimate in highlighting features of distributional shape is illustrated on data on the development of aircraft technology. The estimated density height at each observation imposes an ordering on the data which can be used to select contours which contain specified proportions of the sample. This leads to a display which is reminiscent of a boxplot and which allows simple but effective comparison of different groups. Some simple properties of this technique are explored.

Interesting features of a distribution such as 'arms' and multimodality are found along the directions where the largest probability mass is located. These directions can be quantified through the modes of a density estimate based on the direction of each observation.

Keywords: Boxplot, contour, kernel density estimate, mode, circular data

1. Introduction

A scatterplot is an invaluable tool for exploring the relationship between measurements on two continuous variables. From a raw scatterplot, however, it can sometimes be difficult to identify or assess anything other than the simplest features, such as location, scale and linear correlation. Although these features are often sufficient for the intended analysis, there are problems where it is the shape of the joint distribution, and the nature of any changes in this shape across different groups, which are of principal interest. It is also wise, with any bivariate data, to explore the existence of any unusual features before embarking on an analysis based on normality. Such explorations can be difficult to carry out visually on a raw scatterplot, even when there is a large amount of data.

As an example, we shall consider a set of data on six simple characteristics of aircraft technology throughout the twentieth century, collected mainly by P. Saviotti from Jane (1978):

Total engine power (kW)
Wing span (m)
Length (m)
Maximum take-off weight (kg)
Maximum speed (km/h)
Range (km)

0960-3174 © 1993 Chapman & Hall

Saviotti and Bowman (1984) discuss a smaller set of data. In the present paper, we will consider these measurements for 709 models of aircraft produced between 1914 and 1984. The principal aim is to explore the way in which the technology has developed, as characterised by these simple variables. A log transformation was applied to each in order to remove substantial skewness. A principal components analysis, based on the correlation matrix, was carried out to explore the structure of the data graphically. The first component is a mixture of all variables and may be interpreted as measuring the size of an aircraft. The second component is a contrast between speed and wing span and may be interpreted as a measure of speed after allowing for size. These two components account for 92% of the variation in the data. The techniques to be discussed in this paper are appropriate to a scatterplot of any kind, but the example happens to involve two-dimensional patterns derived from principal components.

The first two component scores for the aircraft data are displayed in Fig. 1. The first panel displays all the data, labelled by time period, and involves too much overplotting for any details of shape to emerge or to allow informative comparisons of groups. The other panels display separate scatterplots with identical axes and these at least allow broad comparisons of location and scale, particularly when the axes cross in the middle of the plots to provide a helpful frame of reference.

Fig. 1. Plots of the first two principal component scores of the aircraft data, separated into three time periods, with common scales to aid comparison. (Panels: 1914-1984, 1914-1935, 1936-1955 and 1956-1984; horizontal axis pc1.)

The distributional shape of a sample of univariate data is adequately expressed in a histogram or stem-and-leaf plot. With bivariate data, a histogram or other form of density estimate is a three-dimensional object. Graphics software is available which would allow perspective plots of these objects to be produced, but care must sometimes be exercised in the choice of viewing angle so that interesting features are not hidden at the rear of the picture. This approach is also of very limited use in comparing different groups of data since it would be difficult to disentangle visually the superimposition of several such plots.

When comparison of groups is of principal interest, boxplots are a very effective means of summarising and contrasting univariate samples. Although the mediancentre described by Gower (1974) is a multivariate version of the median, direct multivariate analogues of quartiles are less easy to define and compute. The rangefinder boxplots of Becketti and Gould (1987) combine two marginal boxplots for bivariate data but such a summary ignores the joint nature of the distribution. Techniques such as convex hulls and minimum ellipsoids (Green, 1981) do describe distributional shape, and can be used successively to 'peel' the data, but they assume the underlying distribution to be convex. The recent approach of Goldberg and Inglewicz (1992) is also based on an assumption of convexity.

Scott (1978) gave a very effective demonstration of how the contours of an estimated density function can be used to identify bimodality, and to compare two groups of bivariate data. In the present paper the principle that interesting features of distributional shape may be characterised by regions of high density is explored further. In Section 2 the height of a density estimate is used to rank a set of bivariate data. This allows the levels of a contour plot to be chosen to partition the data into groups of specified size, in a manner reminiscent of a boxplot. Some simple properties of this procedure are outlined in Section 3. In Section 4, directions are identified along which interesting features may be found.

2. Density based display of bivariate data

A simple form of two-dimensional density estimate is available from a data set $\{(x_i, y_i)\}$ through the kernel method as

$$\hat f(x, y) = \frac{1}{n} \sum_{i=1}^{n} w(x - x_i, y - y_i; h_x, h_y, \rho)$$

Fig. 2. The first two principal component scores for the third time period (1956-1984) of the aircraft data with density contours containing 25%, 50% and 75% of the observations. The positions of Concorde (C) and the Tupolev TU144 (T), and the directions identified by the directional information, are marked

where $w(\cdot)$ is the two-dimensional normal density function with standard deviations $h_x$, $h_y$ and correlation $\rho$. The kernel method is well described by Silverman (1986). The values of the scale and correlation parameters control the amount of smoothing which is applied to the data and specific choices for these are discussed in Section 3.
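As an illustration of the estimator just defined, the following is a minimal sketch in Python, assuming NumPy is available. The function name `kernel_density` and its argument names are hypothetical and do not come from the paper; the code simply averages a bivariate normal kernel with standard deviations `hx`, `hy` and correlation `rho` over the observations.

```python
import numpy as np

def kernel_density(x, y, xi, yi, hx, hy, rho=0.0):
    """Bivariate normal kernel density estimate at the points (x, y).

    xi, yi are the observations; hx, hy, rho are the kernel standard
    deviations and correlation (hypothetical argument names).
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    y = np.atleast_1d(np.asarray(y, dtype=float))
    xi = np.asarray(xi, dtype=float)
    yi = np.asarray(yi, dtype=float)
    # Standardised differences: rows index evaluation points, columns index data.
    u = (x[:, None] - xi[None, :]) / hx
    v = (y[:, None] - yi[None, :]) / hy
    # Quadratic form and normalising constant of the bivariate normal density.
    q = (u**2 - 2.0 * rho * u * v + v**2) / (1.0 - rho**2)
    norm = 2.0 * np.pi * hx * hy * np.sqrt(1.0 - rho**2)
    # Average the kernel contributions over the n observations.
    return np.exp(-0.5 * q).sum(axis=1) / (len(xi) * norm)
```

Evaluating the function at the observations themselves, `kernel_density(xi, yi, xi, yi, hx, hy)`, gives the density heights that are used for ranking in the remainder of this section.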

Kittler (1976) used the heights of an estimated density at each observation to scan through the data in a clustering algorithm based on modes. Tukey and Tukey (1981) proposed that scatterplots could be 'sharpened' by omitting a certain proportion of the observations corresponding to the lowest density heights. These ideas can be used in the context of contouring to impose a rank on each observation according to its position in the ordered list of estimated density heights $\{\hat f(x_i, y_i)\}$. The median and quartiles of these density heights can then be used to define appropriate levels at which contours of the density estimate should be superimposed on a scatterplot of the data.

The effect of this is illustrated in Fig. 2 where the principal component scores of the aircraft data are displayed for the third time period. The contours divide the data into four groups of equal size. The first contour (labelled 25) contains the quarter of the data which corresponds to highest density heights. As we progress to lower contours (labelled 50 and 75) we pass through the other quarters of the data. In this sense the display is reminiscent of a boxplot since the data are split into four groups of equal size. However, the advantage of doing this by density height is that multimodality can also be displayed. The disjoint nature of the second contour shows the distribution to be trimodal. The addition to the contouring approach of Scott (1978) is that the levels have not been chosen in the usual way to represent equally spaced gradations in density height, but to quantify features in the shape of the distribution through the containment of specified proportions of the data.
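The choice of contour levels described above can be sketched as follows. This is an illustrative fragment only, assuming NumPy and assuming that the density heights at the observations have already been computed; the function name `contour_levels` is hypothetical. A contour enclosing the proportion p of the sample with the highest density is placed at the (1 - p) quantile of the heights.

```python
import numpy as np

def contour_levels(heights, proportions=(0.25, 0.50, 0.75)):
    """Density levels whose contours enclose the given proportions of the sample.

    heights: estimated density heights at the observations.
    Returns a dict mapping each proportion to its contour level.
    """
    heights = np.asarray(heights, dtype=float)
    # The contour containing the top proportion p of the data sits at the
    # (1 - p) quantile of the ordered density heights.
    return {p: float(np.quantile(heights, 1.0 - p)) for p in proportions}

# Small illustration with artificial heights 1, 2, ..., 100.
heights = np.arange(1, 101)
levels = contour_levels(heights)
```

The resulting levels, sorted into increasing order, could then be handed to any contouring routine applied to the density estimate evaluated on a grid.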

The application of such summaries in comparing groups is illustrated in Fig. 3 where contours containing 75% of the data are superimposed for each of the three different time periods. Here we can see clearly the development in size and speed between the two earlier time periods, with an inverted T shape for the third period indicating a wide range of sizes of aircraft with average speed and also the development of very fast aircraft of average size. The positions of Concorde, and its Soviet equivalent the Tupolev TU144, which are unusual in combining high speed with large size, are marked in Fig. 2 by the letters C and T respectively. More detailed contouring of each group shows the first two periods to be bimodal and the third period to have three modes, indicating a degree of specialisation within the range of aircraft types produced.

Fig. 3. Density contours containing 75% of the observations within each of the three time periods of the aircraft data

Notice that these displays are more than direct extensions of boxplots to two dimensions since the boxplot cannot display multimodality. Application of the technique described above in one dimension could lead to the representation of, say, the top 50% of the data by two disjoint intervals, which cannot happen with the boxplot.

3. Some simple properties of density contouring

The kernel method of density estimation requires choices to be made for the smoothing parameters $h_x$, $h_y$ and the kernel correlation parameter $\rho$. In many applications the choice of particular values for these parameters is crucial. In the present context, however, the density heights are used principally for ranking, so it is their relative rather than their absolute values which are important. An increase in the amount of smoothing applied to the data may flatten the density estimate and decrease its absolute value at a mode but the relative heights at the data points may change very little. In terms of a contour plot, the principal effect of changes in the smoothing parameters over quite a wide range is to alter the roughness of the contour curve rather than to alter its location. Features such as bimodality can, of course, be obscured by oversmoothing, but in general the proposed density contour plots are less sensitive to the choice of smoothing parameters than is the absolute density estimate.

For this reason, and for simplicity, all the contours presented in the present paper were drawn from density estimates which were smoothed by the formulae

$$h = n^{-1/6}; \quad h_x = h s_x; \quad h_y = h s_y; \quad \rho = 0;$$

where $s_x$ and $s_y$ denote the sample standard deviations of x and y (Silverman, 1986). These formulae are optimal when the underlying bivariate distribution is an uncorrelated normal distribution. Since the normal distribution is very smooth, it is likely that the effect of these formulae will be to oversmooth the data. In this sense the procedure is a conservative one which will not be oversensitive to non-normal features of the data. However, the simulation study of Bowman (1985) gives reassuring evidence that with univariate data the normal optimal formula provides a very effective choice of smoothing parameter, even when the underlying distribution contains distinctly non-normal features. In data where there is strong correlation, it would be helpful to set the kernel correlation parameter to the sample correlation of the data. However, Scott (1992) recommends $\rho = 0$ as an effective general strategy. It is particularly appropriate when the axes are principal components.
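The smoothing formulae above are simple to compute. The sketch below assumes NumPy and uses the hypothetical function name `normal_scale_bandwidths`; it also assumes that the sample standard deviations are computed with the usual n - 1 divisor, which the text does not specify.

```python
import numpy as np

def normal_scale_bandwidths(x, y):
    """Return (hx, hy, rho) from h = n**(-1/6), hx = h*sx, hy = h*sy, rho = 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    h = n ** (-1.0 / 6.0)
    # Sample standard deviations (divisor n - 1 is an assumption).
    sx = np.std(x, ddof=1)
    sy = np.std(y, ddof=1)
    return h * sx, h * sy, 0.0
```

For the 709 aircraft, for example, the common factor would be h = 709^(-1/6), roughly 0.33, if all periods were smoothed together; for a single time period n is the number of aircraft in that period.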

It is easy to see that as the smoothing parameter h tends to zero the ranking imposed on the data tends to uniformity, since the density estimate consists of identical spikes placed over each observation. More interestingly, as h tends to infinity the ranks become identical to those imposed by a fitted bivariate normal distribution. This is most easily shown with the quadratic kernel

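$$w(x, y; h, s_x, s_y) \propto \begin{cases} 1 - \dfrac{1}{h^2}\left(\dfrac{x^2}{s_x^2} + \dfrac{y^2}{s_y^2}\right), & \text{if } \dfrac{x^2}{s_x^2} + \dfrac{y^2}{s_y^2} < h^2 \\ 0, & \text{otherwise} \end{cases}$$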

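If $h$ is chosen so that $h^2$ exceeds $(x_r - x_s)^2/s_x^2 + (y_r - y_s)^2/s_y^2$ for all $r$ and $s$ then every kernel takes a positive value at every other observation. We then have

$$\hat f(x_i, y_i) \ge \hat f(x_j, y_j)$$

iff

$$\sum_k \left\{ \frac{(x_i - x_k)^2}{s_x^2} + \frac{(y_i - y_k)^2}{s_y^2} \right\} \le \sum_k \left\{ \frac{(x_j - x_k)^2}{s_x^2} + \frac{(y_j - y_k)^2}{s_y^2} \right\}$$

iff

$$\frac{x_i^2 - 2 x_i \bar x}{s_x^2} + \frac{y_i^2 - 2 y_i \bar y}{s_y^2} \le \frac{x_j^2 - 2 x_j \bar x}{s_x^2} + \frac{y_j^2 - 2 y_j \bar y}{s_y^2}$$

iff

$$\exp\left\{ -\frac{1}{2} \left[ \left( \frac{x_i - \bar x}{s_x} \right)^2 + \left( \frac{y_i - \bar y}{s_y} \right)^2 \right] \right\} \ge \exp\left\{ -\frac{1}{2} \left[ \left( \frac{x_j - \bar x}{s_x} \right)^2 + \left( \frac{y_j - \bar y}{s_y} \right)^2 \right] \right\}$$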

and so the ranking imposed on the data is the same as the ranking imposed by a normal density function with mean and variance estimated by the sample versions.

A similar argument holds by Taylor series expansion for any symmetric, unimodal kernel function. This interesting result therefore defines the end of the spectrum of effects which the smoothing parameter can produce, although an extremely large value of h is required before a normal ranking is guaranteed.
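The large-h result for the quadratic kernel can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names and the simulated data are hypothetical and serve only as an illustration. The kernel is left unnormalised, which does not affect ranks, and the comparison is with an uncorrelated normal density using the sample means and standard deviations, matching a kernel correlation of zero.

```python
import numpy as np

def scaled_sq_distances(x, y):
    """Matrix of squared distances between observations, scaled by sx and sy."""
    sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)
    return ((x[:, None] - x[None, :]) / sx) ** 2 + ((y[:, None] - y[None, :]) / sy) ** 2

def quadratic_kernel_heights(d2, h):
    """Unnormalised quadratic-kernel density heights at each observation."""
    return np.clip(1.0 - d2 / h**2, 0.0, None).mean(axis=1)

def normal_heights(x, y):
    """Unnormalised uncorrelated normal density heights from sample moments."""
    zx = (x - x.mean()) / np.std(x, ddof=1)
    zy = (y - y.mean()) / np.std(y, ddof=1)
    return np.exp(-0.5 * (zx**2 + zy**2))

# Artificial, deliberately non-normal data (an assumption for illustration).
rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = rng.exponential(size=40)

d2 = scaled_sq_distances(x, y)
# Choose h so that h**2 exceeds every scaled squared distance between observations.
h_large = np.sqrt(1.01 * d2.max())
same_ranking = np.array_equal(
    np.argsort(quadratic_kernel_heights(d2, h_large)),
    np.argsort(normal_heights(x, y)),
)
```

Under the stated condition on h the two orderings should coincide, so `same_ranking` is expected to be true; with a much smaller h the orderings will generally differ.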

Another interesting connection arises when the highest mode of the density estimate is used to provide an estimate of location. The value of $(x, y)$ which maximises

$$\frac{1}{n} \sum_i w(x - x_i, y - y_i; h s_x, h s_y, \rho)$$

is clearly the same as the value of $(x, y)$ which minimises

$$\sum_i \left\{ m - w(x - x_i, y - y_i; h s_x, h s_y, \rho) \right\}$$

where m is the modal height of w. This last expression defines an M-estimator of location, with redescending weight function $m - w(\cdot)$ and provides a link between density estimation and robust methods previously observed by Bowman (1981).

Finally, the concept of atypicality was introduced by Aitchison and Kay (1975) to quantify the extremity of an observation in a sample. In a parametric setting this can be done by calculating the probability mass which lies inside the normal density contour on which the observation lies. A non-parametric analogue of atypicality is provided simply by the rank of the observation as defined above, divided by the sample size to produce a proportion.
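The non-parametric atypicality just described amounts to a rank divided by the sample size. A minimal sketch follows, assuming NumPy, the hypothetical function name `nonparametric_atypicality`, and the convention (an assumption, chosen to match the parametric definition) that rank 1 is given to the observation with the highest density height, so that values near 1 indicate extreme observations.

```python
import numpy as np

def nonparametric_atypicality(heights):
    """Rank of each observation by density height, divided by the sample size.

    Rank 1 goes to the highest density height (assumed convention), so an
    observation in the sparsest region receives the value 1.
    """
    heights = np.asarray(heights, dtype=float)
    n = len(heights)
    order = np.argsort(-heights)          # indices from highest to lowest height
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(1, n + 1)    # rank 1 = highest density
    return ranks / n

# Four artificial density heights.
atyp = nonparametric_atypicality([0.3, 0.1, 0.4, 0.2])
```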

4. Identifying interesting directions

When interesting features of a bivariate distribution have been identified, it can be helpful to quantify the directions in which these lie. The first principal component locates the direction along which the projected data have maximum variation. Projection pursuit (Friedman, 1987) has similar aims but uses other criteria for identifying interesting directions. However, the nature of a projection has the potential disadvantage of blurring together features which may actually be well separated in a bivariate distribution.

An alternative approach arises from the fact that features such as clusters and 'arms' in a distribution have the effect of producing a large cross-sectional area in the density function in certain directions. In two dimensions, one criterion for identifying such effects is therefore to find directions $v$ from the mean which maximise $\int_0^\infty f(c v)\, dc$ where $f$ is the underlying density function, $v = (v_1, v_2)$ is a unit vector and $c$ is a positive scalar. This approach is not based on projections, although in the special case of correlated bivariate normal data it is straightforward to prove that it is equivalent to finding the first principal component.
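The bivariate normal special case can be verified directly. The short derivation below is a sketch under the assumption that the data have been centred so that the mean is at the origin and that the covariance matrix $\Sigma$ is positive definite; the symbol $\Sigma$ is introduced here for illustration.

```latex
% Normal density along the ray c v, with v a unit vector and c > 0:
f(c v) = \frac{1}{2\pi |\Sigma|^{1/2}}
         \exp\left\{ -\tfrac{1}{2} c^2 \, v^{T} \Sigma^{-1} v \right\}

% Cross-sectional integral, using \int_0^\infty e^{-a c^2/2}\,dc = \tfrac{1}{2}\sqrt{2\pi/a}:
\int_0^\infty f(c v)\, dc
  = \frac{1}{2\sqrt{2\pi |\Sigma|}} \left( v^{T} \Sigma^{-1} v \right)^{-1/2}

% Maximising over unit vectors v is therefore equivalent to minimising
% v^T \Sigma^{-1} v, which is achieved by the eigenvector of \Sigma^{-1} with
% the smallest eigenvalue, i.e. the eigenvector of \Sigma with the largest
% eigenvalue: the first principal component direction.
```

Since the integrand depends on $v$ only through $v^{T} \Sigma^{-1} v$, the directions $v$ and $-v$ give the same value, which is consistent with a principal component being an axis rather than a single ray.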

As we are interested only in one direction, it is convenient to reparametrise the problem through polar co-ordinates. Each observation is then represented by a radius and angle as $(r_i, \theta_i)$ and interesting directions correspond to modes in the marginal distribution of $\theta$. The kernel method can be employed to estimate this marginal density function as

$$\frac{1}{n} \sum_{i=1}^{n} V(\theta - \theta_i; h)$$

where $V(\theta; h)$ denotes the von Mises density function on the circle with location parameter zero and concentration parameter $h$. Since we are interested only in the relative heights of the estimate, we can ignore the awkward scaling constant in the von Mises density and use for $V$ the simple function $\exp\{h \cos \theta\}$.
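The directional estimate can be sketched in a few lines. The code below is an illustration only, assuming NumPy and the hypothetical function name `directional_density`; it takes directions from the sample mean, uses the unnormalised kernel exp{h cos(.)} and rescales the result so that its largest value is 1. Note that h is a concentration parameter, so larger values mean less smoothing.

```python
import numpy as np

def directional_density(x, y, h):
    """Scaled von Mises kernel estimate of the density of directions.

    Directions are measured from the sample mean. Returns the angle of each
    observation (radians) and the estimate at that angle, scaled to a maximum of 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = np.arctan2(y - y.mean(), x - x.mean())
    # Unnormalised von Mises kernel exp{h cos(theta - theta_i)}, averaged over i.
    dens = np.exp(h * np.cos(theta[:, None] - theta[None, :])).mean(axis=1)
    return theta, dens / dens.max()

# Artificial example: three points in one direction, one in the opposite direction.
x = np.array([1.0, 1.0, 2.0, -4.0])
y = np.array([0.01, -0.01, 0.0, 0.0])
theta, dens = directional_density(x, y, h=48.0)
```

Modes of the scaled estimate, located for example by inspecting the values in order of angle, correspond to candidate directions of interest.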

Figure 4 displays the density estimate, scaled to lie in the range 0 to 1, evaluated at each observed direction $\theta_i$ and using the smoothing parameter h = 48, derived by likelihood based cross-validation, with the aircraft data from the third time period. The three clear modes correspond to the arms of the distribution as displayed in the contours of Fig. 2, where the identified directions are displayed as dotted lines. These lines quantify in a simple way the directions along which the arms of the distribution lie. The directions are close to, but distinct from, principal components, which are based on projections rather than on local features of the data. The relative sizes of these local features are quantified through the cross-sectional integrals 0.24, 0.24 and 0.14 along the three identified directions.

Fig. 4. A scaled density estimate of the directional distribution, evaluated at each data point, for the third time period in the aircraft data. The identified directions are marked as d1, d2 and d3. (Horizontal axis: Angle.)

Instead of using the mean as the central location from which direction vectors are constructed, the mediancentre or the highest mode of the density could also be used.

With p-dimensional, rather than bivariate, data the same ideas can be applied through the estimation of the density of directions on the (p - 1)-dimensional sphere. The difficulty in identifying the modes of this density estimate through numerical methods increases rapidly with p. However, the ranking imposed on the data by estimated density height will cause a sudden jump in the $\theta_i$ as a new local mode is detected. The performance of this algorithm is under investigation and will be reported elsewhere.

Acknowledgements

The financial support of P. J. Foster by a Science and Engineering Research Council research studentship is gratefully acknowledged. We are also grateful to Dr P. Saviotti for permission to use the aircraft data.

References

Aitchison, J. and Kay, J. W. (1975) Principles, practice and performance in decision making in clinical medicine. In The Role and Effectiveness of Theories of Decision in Practice, D. J. White and K. C. Bowen (eds.) pp. 252-72, Hodder & Stoughton, London.

Becketti, S. and Gould, W. (1987) Rangefinder boxplots: a note. American Statistician 41, 149.

Bowman, A. W. (1981) Some Aspects of Density Estimation by the Kernel Method. Ph.D. thesis, University of Glasgow.

Bowman, A. W. (1985) A comparative study of some kernel-based nonparametric density estimates. J. Stat. Comp. Sim. 21, 313-327.

Friedman, J. H. (1987) Exploratory projection pursuit. J. Amer. Statist. Assoc. 82, 249-266.

Goldberg, K. M. and Inglewicz, I. (1992) Bivariate extensions of the boxplot. Technometrics 34, 307-20.

Gower, J. C. (1974) The mediancentre. Applied Statistics 23, 466-470.

Green, P. J. (1981) Peeling bivariate data. In Interpreting Multivariate Data (Chapter 1) V. Barnett (ed.) Wiley, Chichester.

Jane (1978) Jane's Encyclopaedia of Aviation. Jane's, London.

Kittler, J. (1976) A locally sensitive method for cluster analysis. Pattern Recognition 8, 23-33.

Saviotti, P. P. and Bowman, A. W. (1984) Indicators of output of technology. Proc. ICSSR/SSRC Workshop on Science and Technology Policy in the 1980s; M. Gibbons et al. (eds.) Harvester Press, Brighton.

Scott, D. W. (1978) Plasma lipids as collateral risk factors in coronary artery disease - a study of 371 males with chest pain. J. Chron. Dis. 31, 337-345.

Scott, D. W. (1992) Multivariate Density Estimation: Theory, Practice and Visualisation. Wiley, New York.

Scott, D. W. and Factor, L. E. (1981) Monte Carlo study of three data-based nonparametric density estimators. J. Amer. Statist. Assoc. 76, 9-15.

Silverman, B. W. (1986) Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

Tukey, P. A. and Tukey, J. W. (1981) Data-driven view selection; agglomeration and sharpening. In Interpreting Multivariate Data (Chapter 11), V. Barnett (ed.) Wiley, Chichester.