

METHODS OF ENVIRONMENTAL DATA ANALYSIS


Environmental Management Series

Edited by

Prof. J. Cairns, Jr, University Center for Environmental and Hazardous Materials Studies, Virginia Polytechnic Institute, USA

and

Prof. R.M. Harrison, Institute of Public and Environmental Health, University of Birmingham, UK

This series has been established to meet the need for a set of in-depth volumes dealing with environmental issues, particularly with regard to a sustainable future. The series provides a uniform and quality coverage, building up to form a library of reference books spanning major topics within this diverse field.

The level of presentation is advanced, aimed primarily at a research/consultancy readership. Coverage includes all aspects of environmental science and engineering relevant to evaluation and management of the natural and human-modified environment, as well as topics dealing with the political, economic, legal and social considerations pertaining to environmental management.

Previously published titles in the Series include:

Biomonitoring of Trace Aquatic Contaminants D.J.H. Phillips and P.S. Rainbow (1993, reprinted 1994)

Global Atmospheric Chemical Change C.N. Hewitt and W.T. Sturges (eds) (1993, reprinted 1995)

Atmospheric Acidity: Sources, Consequences and Abatement M. Radojevic and R.M. Harrison (eds) (1992)

Methods of Environmental Data Analysis C. N. Hewitt (ed.) (1992, reprinted 1995)

Please contact the Publisher or one of the Series' Editors if you would like to contribute to the Series.

Dr R.C.J. Carling, Senior Editor, Environmental Sciences, Chapman & Hall, 2-6 Boundary Row, London SE1 8HN, UK email: [email protected]

Prof. Roy Harrison, The Institute of Public and Environmental Health, School of Chemistry, University of Birmingham, Edgbaston B15 2TT, UK email: [email protected]

Prof. John Cairns, Jr, Environmental and Hazardous Materials Studies, Virginia Polytechnic Institute and State University, Blacksburg, Virginia 24061-0414, USA email: [email protected]


METHODS OF ENVIRONMENTAL DATA ANALYSIS

Edited by

C.N. HEWITT
INSTITUTE OF ENVIRONMENTAL & BIOLOGICAL SCIENCES, LANCASTER UNIVERSITY, LANCASTER LA1 4YQ, UK

CHAPMAN & HALL
London · Glasgow · Weinheim · New York · Tokyo · Melbourne · Madras


Published by Chapman & Hall, 2-6 Boundary Row, London SE1 8HN

Chapman & Hall, 2-6 Boundary Row, London SE1 8HN, UK

Blackie Academic & Professional, Wester Cleddens Road, Bishopbriggs, Glasgow G64 2NZ, UK

Chapman & Hall GmbH, Pappelallee 3, 69469 Weinheim, Germany

Chapman & Hall USA, 115 Fifth Avenue, New York, NY 10003, USA

Chapman & Hall Japan, ITP-Japan, Kyowa Building, 3F, 2-2-1 Hirakawacho, Chiyoda-ku, Tokyo 102, Japan

Chapman & Hall Australia, 102 Dodds Street, South Melbourne, Victoria 3205, Australia

Chapman & Hall India, R. Seshadri, 32 Second Main Road, CIT East, Madras 600 035, India

First published by Elsevier Science Publishers Ltd 1992

© 1992 Chapman & Hall

Typeset by Alden Multimedia Ltd, Northampton

ISBN 0 412 73990 9

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the UK Copyright Designs and Patents Act, 1988, this publication may not be reproduced, stored, or transmitted, in any form or by any means, without the prior permission in writing of the publishers, or in the case of reprographic reproduction only in accordance with the terms of the licences issued by the Copyright Licensing Agency in the UK, or in accordance with the terms of licences issued by the appropriate Reproduction Rights Organization outside the UK. Enquiries concerning reproduction outside the terms stated here should be sent to the publishers at the London address printed on this page.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data available

Special regulations for readers in the USA

This publication has been registered with the Copyright Clearance Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the USA. All other copyright questions, including photocopying outside the USA, should be referred to the publisher.


Foreword

ENVIRONMENTAL MANAGEMENT SERIES

The current expansion of both public and scientific interest in environmental issues has not been accompanied by a commensurate production of adequate books, and those which are available are widely variable in approach and depth.

The Environmental Management Series has been established with a view to co-ordinating a series of volumes dealing with each topic within the field in some depth. It is hoped that this Series will provide a uniform and quality coverage and that, over a period of years, it will build up to form a library of reference books covering most of the major topics within this diverse field. It is envisaged that the books will be of single, or dual authorship, or edited volumes as appropriate for respective topics.

The level of presentation will be advanced, the books being aimed primarily at a research/consultancy readership. The coverage will include all aspects of environmental science and engineering pertinent to management and monitoring of the natural and man-modified environment, as well as topics dealing with the political, economic, legal and social considerations pertaining to environmental management.

J. CAIRNS JNR and R.M. HARRISON


Preface

In recent years there has been a dramatic increase in public interest and concern for the welfare of the planet and in our desire and need to understand its workings. The commensurate expansion in activity in the environmental sciences has led to a huge increase in the amount of data gathered on a wide range of environmental parameters. The arrival of personal computers in the analytical laboratory, the increasing automation of sampling and analytical devices and the rapid adoption of remote sensing techniques have all aided in this process. Many laboratories and individual scientists now generate thousands of data points every month or year.

The assimilation of data of any given variable, whether they be straightforward, as for example, the annual average concentrations of a pollutant in a single city, or more complex, say spatial and temporal variations of a wide range of physical and chemical parameters at a large number of sites, is itself not useful. Raw numbers convey very little readily assimilated information: it is only when they are analysed, tabulated, displayed and presented that they can serve the scientific and management functions for which they were collected.

This book aims to aid the active environmental scientist in the process of turning raw data into comprehensible, visually intelligible and useful information. Basic descriptive statistical techniques are first covered, with univariate methods of time series analysis (of much current importance as the implications of increasing carbon dioxide and other trace gas concentrations in the atmosphere are grappled with), regression, correlation and multivariate factor analysis following. Methods of analysing and determining errors and detection limits are covered in detail, as are graphical methods of exploratory data analysis and the visual representation of data. The final chapter describes in detail the management procedures necessary to ensure the quality and integrity of environmental chemical data. Numerous examples are used to illustrate the way in which particular techniques can be used.

The authors of these chapters have been selected to ensure that an authoritative account of each topic is given. I sincerely hope that a wide range of readers, including undergraduates, researchers, policy makers and administrators, will find the book useful and that it will help scientists produce information, not just numbers.

NICK HEWITT

Lancaster


Contents

Foreword ........ v

Preface ........ vii

List of Contributors ........ xi

Chapter 1  Descriptive Statistical Techniques
           A.C. BAJPAI, I.M. CALUS and J.A. FAIRLEY ........ 1

Chapter 2  Environmetric Methods of Nonstationary Time-Series Analysis: Univariate Methods
           P.C. YOUNG and T. YOUNG ........ 37

Chapter 3  Regression and Correlation
           A.C. DAVISON ........ 79

Chapter 4  Factor and Correlation Analysis of Multivariate Environmental Data
           P.K. HOPKE ........ 139

Chapter 5  Errors and Detection Limits
           M.J. ADAMS ........ 181

Chapter 6  Visual Representation of Data Including Graphical Exploratory Data Analysis
           J.M. THOMPSON ........ 213

Chapter 7  Quality Assurance for Environmental Assessment Activities
           A.A. LIABASTRE, K.A. CARLBERG and M.S. MILLER ........ 259

Index ........ 301


List of Contributors

M.J. ADAMS

School of Applied Sciences, Wolverhampton Polytechnic, Wulfruna Street, Wolverhampton WV1 1SB, UK

A.C. BAJPAI

Department of Mathematical Sciences, Loughborough University of Technology, Loughborough LE11 3TU, UK

I.M. CALUS

72 Westfield Drive, Loughborough LE11 3QL, UK

K.A. CARLBERG

29 Hoffman Place, Belle Mead, New Jersey 08502, USA

A.C. DAVISON

Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK

J.A. FAIRLEY

Department of Mathematical Sciences, Loughborough University of Technology, Loughborough LE11 3TU, UK

A.A. LIABASTRE

Environmental Laboratory Division, US Army Environmental Hygiene Activity-South, Building 180, Fort McPherson, Georgia 30330-5000, USA

M.S. MILLER

Automated Compliance Systems, 673 Emory Valley Road, Oak Ridge, Tennessee 37830, USA



J.M. THOMPSON

Department of Biomedical Engineering and Medical Physics, University of Keele Hospital Centre, Thornburrow Drive, Hartshill, Stoke-on-Trent, Staffordshire ST4 7QB, UK. Present address: Department of Medical Physics and Biomedical Engineering, Queen Elizabeth Hospital, Birmingham B15 2TH, UK

P.C. YOUNG

Institute of Environmental and Biological Sciences, Lancaster University, Lancaster, Lancashire LA1 4YQ, UK

T. YOUNG

Institute of Environmental and Biological Sciences, Lancaster University, Lancaster, Lancashire LA1 4YQ, UK. Present address: Maths Techniques Group, Bank of England, Threadneedle Street, London EC2R 8AH, UK


Chapter 1

Descriptive Statistical Techniques

A.C. BAJPAI,a IRENE M. CALUSb and J.A. FAIRLEYa
aDepartment of Mathematical Sciences, Loughborough University of Technology, Loughborough, Leicestershire LE11 3TU, UK; b72 Westfield Drive, Loughborough, Leicestershire LE11 3QL, UK

1 RANDOM VARIATION

The air quality in a city in terms of, say, the level of sulphur dioxide present, cannot be adequately assessed by a single measurement. This is because air pollutant concentrations in the city do not have a fixed value but vary from one place to another. They also vary with respect to time. Similar considerations apply in the assessment of water quality in a river in terms of, say, the level of nitrogen or number of faecal coliforms present, or in assessing the activity of a radioactive pollutant. In such situations, while it may be that some of the variation can be attributed to known causes, there still remains a residual component which cannot be fully explained or controlled and must be regarded as a matter of chance. It is this random variation that explains why, for instance, two samples of water, of equal volume, taken at the same point on the river at the same time give different coliform counts, and why, in the case of a radioactive source, the number of disintegrations in, say, a 1-min time interval varies from one interval to another.

Random variation may be caused, wholly or in part, by errors in measurement or it may simply be inherent in the nature of the variable under consideration. When a die is cast, no error is involved in counting the number of dots on the uppermost face. The score is affected by a multitude of factors (the force with which the die is thrown, the angle at which it is thrown, and so on) which combine to produce the end result; here the variation is wholly inherent in the process, with no contribution from measurement error.


At the other extreme, variation in the results of repeated determinations of nitrogen concentrations in the same sample of water must be entirely due to error. Just as the die is never thrown in exactly the same way on successive occasions, so repeated determinations are not exact repetitions even though made under apparently identical conditions. Lee & Lee¹ point out that in a laboratory there may be slight changes in room temperature, pressure or humidity, fluctuations in the mains electricity supply, or variation in the level to which a pipette or standard flask is filled. In a titrimetric analysis, the two burette readings and the judging of the end point are amongst possible sources of variation. All such factors, of which the observer is unaware, combine to produce the random error which is causing the random variation.

Between the two extremes is the situation where random error partly, but not wholly, explains the variation. An example of this would be given by determinations of nitrogen made on a number of samples of water taken at the same time at a gauging station. While random error would contribute to the variation in results, there would also be sample-to-sample variation in the actual amount of nitrogen present, as river water is unlikely to be perfectly homogeneous.

The score when a die is thrown and the observation made on the amount of nitrogen present in a water sample are both examples of a random variable or variate, but are of two different types. The distinction between these two types is relevant when a probability model is used to describe the pattern of variation. When the variable can take only certain specific values, and not intermediate ones, it is said to be discrete. This would apply if it is the result of a counting process, where only integer values can result. Thus the score when a die is thrown, the number of emissions from a radioactive source in a 30-s interval and the number of coliforms in 1 ml of water are all examples of a discrete variate. When a variable can take any value within the range covered it is said to be continuous. Values of a variable of this kind are obtained by measurement along a continuous scale as, for instance, when dealing with length, mass, volume or time. Measurements of the level of nitrogen in water, lead in blood or sulphur dioxide in air fall into this category.

In situations such as those which have been described here, a single observation would be inadequate for providing the information required. Hence sets of data must be dealt with and the remainder of this chapter will be devoted to ways of presenting and summarising them.


2 TABULAR PRESENTATION

2.1 The frequency table
A mass of numerical data gives only a confused impression of the situation. Reorganising it into a table can make it more informative, as illustrated by the following examples.

Example 1. A 'colony' method for counting bacteria in liquids entails, as a first step, the dilution with sterile water of the liquid under examination. Then 1 ml of diluted liquid is placed in a nutrient medium in a dish and incubated. The colonies of bacteria which have formed by the end of the incubation period are then counted. This gives the number of bacteria originally present, assuming that each colony has grown from a single bacterium. Recording the number of colonies produced in each of 40 dishes might give the results shown here.

2 3 4 1 3 0 3 2 4 1
2 2 0 2 3 2 5 3 2 1
4 0 1 4 2 1 2 4 5 1
2 2 1 2 0 1 0 3 3 1

In Table 1 the data are presented as a frequency table. It shows the number of dishes with no colony, the number with one colony, and so on. To form this table you will probably find it easiest to work your way systematically through the data, recording each observation in its appropriate category by a tally mark, as shown. In fact, these particular data may well be recorded in this way in the first place.

The variate here is 'number of colonies' and the column headed 'number of dishes' shows the frequency with which each value of the variate occurred.

TABLE 1
Frequency table for colony counts

Number of colonies   Tally       Number of dishes
0                    卌                  5
1                    卌 ||||             9
2                    卌 卌 ||           12
3                    卌 ||               7
4                    卌                  5
5                    ||                  2
                                        40


Sometimes it is useful to give the frequencies as a proportion of the total, i.e. to give the relative frequency. Corresponding to 0, 1, 2, ... colonies, the relative frequencies would be respectively 5/40, 9/40, 12/40, ... Using f to denote the frequency with which a value x of the variate occurred, the corresponding relative frequency is f/N, where N is the total number of observations.
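To make the tallying concrete, here is a minimal sketch of the same computation in modern terms (Python is assumed as the illustration language; the book itself predates any particular software choice). The counts are the 40 colony observations of Example 1.

```python
from collections import Counter

# The 40 colony counts of Example 1
counts = [2, 3, 4, 1, 3, 0, 3, 2, 4, 1,
          2, 2, 0, 2, 3, 2, 5, 3, 2, 1,
          4, 0, 1, 4, 2, 1, 2, 4, 5, 1,
          2, 2, 1, 2, 0, 1, 0, 3, 3, 1]

freq = Counter(counts)   # frequency f of each value x of the variate
N = len(counts)          # total number of observations (40 dishes)

print("x    f    f/N")
for x in sorted(freq):
    print(f"{x}  {freq[x]:3d}  {freq[x] / N:.3f}")
```

Running this reproduces the frequencies of Table 1 (5, 9, 12, 7, 5, 2) together with the relative frequencies f/N.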

Example 2. Random error in measurement, of which mention has already been made, would be the cause of the variation in the results, shown here, of 40 replicate determinations of nitrate ion concentration (in μg/ml) in a water specimen.

0·49 0·45 0·48 0·48 0·49 0·48 0·46 0·48 0·51 0·52

0·51 0·49 0·50 0·49 0·47 0·48 0·50 0·50 0·47 0·50

0·50 0·51 0·48 0·50 0·50 0·47 0·50 0·48 0·49 0·47

0·52 0·50 0·51 0·49 0·48 0·51 0·46 0·49 0·50 0·49

In Example 1 we were dealing with a discrete variate, which could take only integer values. The variate in this example has the appearance of being discrete, in that it only takes values 0·45, 0·46, 0·47, etc., and not values in between, but this is because recordings have been made to only 2 decimal places. Nitrate ion concentration is measured on a continuous scale and should therefore be regarded as a continuous variate. The formation of a frequency table follows along the same lines as in Example 1. Table 2 shows the frequency distribution obtained.


TABLE 2
Frequency table for nitrate ion concentration measurements

Nitrate ion concentration (μg/ml)   Frequency
0·45                                     1
0·46                                     2
0·47                                     4
0·48                                     8
0·49                                     8
0·50                                    10
0·51                                     5
0·52                                     2
                                        40


The variation in this case is entirely attributable to random error and only 8 different values were taken by the variate. If, however, the determinations had been made on different water specimens, there would have been a greater amount of variation. Giving the frequency corresponding to each value taken by the variate would then make the table too unwieldy and grouping becomes advisable. This is illustrated by the next example, where a similar situation exists.

Example 3. The lead concentrations (in μg/m³) shown here represent recordings made on 50 weekday afternoons at an air monitoring station near a US freeway.

5·4 6·8 6·1 10·6 7·0 5·2 4·9 6·5 8·3 7·1

6·0 5·0 7·8 5·9 6·0 8·7 6·0 6·2 6·0 10·1

6·4 7·2 6·4 6·4 8·0 8·3 8·0 8·1 9·9 6·8

5·3 2·1 7·2 7·6 7·3 3·9 10·9 6·1 6·8 9·3

5·0 9·2 7·9 8·6 3·2 6·9 8·6 9·5 11·2 6·4

The smallest value is 2·1 and the largest 11·2. It is desirable that groups of equal width should be chosen. In Table 3, values from 2·0 up to and including 2·9 have been placed in the first group, values from 3·0 up to and including 3·9 in the second group, and so on. The data then being covered by 10 groups, a table of reasonable size is obtained. If the number of groups is too small, too much information is lost. Too many groups would mean the retention of too much detail and show little improvement on the original data.

TABLE 3
Frequency table for lead concentration measurements

Lead concentration (μg/m³)   True group boundaries   Frequency
2·0-2·9                       1·95-2·95                 1
3·0-3·9                       2·95-3·95                 2
4·0-4·9                       3·95-4·95                 1
5·0-5·9                       4·95-5·95                 6
6·0-6·9                       5·95-6·95                16
7·0-7·9                       6·95-7·95                 8
8·0-8·9                       7·95-8·95                 8
9·0-9·9                       8·95-9·95                 4
10·0-10·9                     9·95-10·95                3
11·0-11·9                     10·95-11·95               1
                                                       50



As with the measurements of nitrate ion concentration in Example 2, we are here dealing with a variate which is, in essence, continuous. In this case, readings were recorded to 1 decimal place. Thus 2·9 represents a value between 2·85 and 2·95, 3·0 represents a value between 2·95 and 3·05, and so on. Hence there is not really a gap between the upper end of one group and the lower end of the next. The true boundaries of the first group are 1·95 and 2·95, of the next 2·95 and 3·95, and so on, as shown in Table 3. Notice that no observation falls on these boundaries and hence no ambiguity arises when allocating an observation to a group. If the boundaries were 2·0, 3·0, 4·0, etc., then the problem would arise of whether 3·0, say, should be allocated to the group 2·0-3·0 or to the group 3·0-4·0. Various conventions exist for dealing with this problem but it can be avoided by a judicious choice of boundaries, as seen here.
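A short sketch of this grouping (Python with NumPy, both my choice rather than the book's): using the true boundaries 1·95, 2·95, ..., 11·95 as bin edges guarantees that no reading, recorded to 1 decimal place, can fall on a boundary.

```python
import numpy as np

# The 50 lead concentration recordings (ug/m3) of Example 3
lead = [5.4, 6.8, 6.1, 10.6, 7.0, 5.2, 4.9, 6.5, 8.3, 7.1,
        6.0, 5.0, 7.8, 5.9, 6.0, 8.7, 6.0, 6.2, 6.0, 10.1,
        6.4, 7.2, 6.4, 6.4, 8.0, 8.3, 8.0, 8.1, 9.9, 6.8,
        5.3, 2.1, 7.2, 7.6, 7.3, 3.9, 10.9, 6.1, 6.8, 9.3,
        5.0, 9.2, 7.9, 8.6, 3.2, 6.9, 8.6, 9.5, 11.2, 6.4]

# True group boundaries 1.95, 2.95, ..., 11.95: readings were made to
# 1 decimal place, so no observation can fall exactly on a boundary
edges = np.arange(1.95, 12.0, 1.0)

freq, _ = np.histogram(lead, bins=edges)
print(freq)   # [ 1  2  1  6 16  8  8  4  3  1], the frequencies of Table 3
```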

2.2 Table of cumulative frequencies
It may be of interest to know on how many days the lead concentration was below a stated level. From Table 3 it is readily seen that there were no observations below 1·95, 1 below 2·95, 3 (= 1 + 2) below 3·95, 4 (= 1 + 2 + 1) below 4·95 and so on. The complete set of cumulative frequencies thus obtained is shown in Table 4.

In the case of a frequency table, it has been mentioned that it may be more useful to think in terms of the relative frequency, i.e. the proportion of observations falling into each category. Similar considerations apply in the case of cumulative frequencies.

TABLE 4
Table of cumulative frequencies for data in Table 3

Lead concentration (μg/m³)   Cumulative frequency   % Cumulative frequency
1·95                           0                        0
2·95                           1                        2
3·95                           3                        6
4·95                           4                        8
5·95                          10                       20
6·95                          26                       52
7·95                          34                       68
8·95                          42                       84
9·95                          46                       92
10·95                         49                       98
11·95                         50                      100


Thus, in the present example, it may be more useful to know the proportion of days on which the recorded lead concentration was below a stated level. In Table 4 this is given as a percentage, a common practice with cumulative frequencies.

We have considered here the 'less than' cumulative frequencies as this is more customary but the situation could obviously be looked at from an opposite point of view. For example, saying that there were 10 days with recordings below 5·95 μg/m³ is equivalent to saying that there were 40 days when recordings exceeded 5·95 μg/m³. In percentage terms, a 20% 'less than' cumulative frequency corresponds to an 80% 'more than' cumulative frequency.
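The cumulative table is just a running sum of the Table 3 frequencies; a sketch (again Python/NumPy, my assumption):

```python
import numpy as np

freq = np.array([1, 2, 1, 6, 16, 8, 8, 4, 3, 1])   # frequencies from Table 3
upper = np.arange(2.95, 12.0, 1.0)                  # upper true boundaries

cum = np.cumsum(freq)               # 'less than' cumulative frequencies
pct = 100 * cum / freq.sum()        # the same, as percentages

for b, c, p in zip(upper, cum, pct):
    print(f"below {b:5.2f}: {c:2d} days ({p:3.0f}%)")
# e.g. below 5.95: 10 days (20%), below 6.95: 26 days (52%), as in Table 4
```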

3 DIAGRAMMATIC PRESENTATION

For some people a diagram conveys more than a table of figures, and ways of presenting data in this form will now be considered.

3.1 Dot diagram
In a dot diagram, each observation is marked by a dot on a horizontal axis which represents the scale of measurement. An example will show how this is done.

Example 4. Values of biochemical oxygen demand (BOD), which is a measure of biodegradable organic matter, were recorded on water specimens taken throughout 1987 and 1988 from the River Clyde at Station 12A (Addersgill) downstream of Carbams sewage treatment works. The results (in mg/litre) were as follows:

1987: 3·2 2·9 2·1 4·3 2·9 3·8 4·6 2·4 4·5 3·9 4·5 2·0 4·2

1988: 3·7 2·8 2·6 2·9 3·8 5·1 3·2 5·0 2·3 2·8 2·2 2·8 2·2

The dot diagram in Fig. 1(a) displays the 1988 data. Although 5·0 and 5·1 appear as outlying values, they seem less exceptional in Fig. 1(b) where the combined results for the two years are displayed. Obviously this very simple form of diagram would not be suitable when the number of observations is much larger.

Fig. 1. Dot diagrams for data recorded at Station 12A on River Clyde: (a) 1988; (b) 1987 and 1988.

3.2 Line diagram
The frequency distribution in Table 1 can be displayed as shown in Fig. 2. The variate (number of colonies) is shown along the base axis and the heights of the lines (or bars) represent the frequencies. This type of diagram is appropriate here as it emphasises the discrete nature of the variate.

Fig. 2. Line diagram for data in Table 1.


3.3 Histogram
For data in which the variate is of the continuous type, as in Examples 2 and 3, the frequency distribution can be displayed as a histogram. Each frequency is represented by the area of a rectangle whose base, on a horizontal scale representing the variate, extends from the lower to the upper boundary of the group. Taking the distribution in Table 3 as an example, the base of the first rectangle should extend from 1·95 to 2·95, the next from 2·95 to 3·95, and so on, with no gaps between the rectangles, as shown in Fig. 3.

In the case of the distribution of nitrate ion concentration readings in Table 2, 0·45 is considered as representing a value between 0·445 and 0·455, 0·46 as representing a value between 0·455 and 0·465, and so on. Thus, when the histogram is drawn, the bases of the rectangles would have these boundaries as their end points. Again, there must be continuous coverage of the base scale representing nitrate ion concentration.


Fig. 3. Histogram for frequency distribution in Table 3.


It must be emphasised that it is really the areas of the rectangles that represent the frequencies, and not, as is often thought, the heights. There are two reasons why it is important to make this clear. Firstly, the heights are proportional to the frequencies only when the groups are all of equal width. That is the most common situation and one therefore becomes accustomed to measuring the heights on a scale representing frequency when drawing a histogram. Thus in Fig. 3, after choice of a suitable vertical scale, heights of 1, 2, 1, 6, ... units were measured off when constructing the rectangles. However, occasionally tables are encountered in which the group widths are not all equal (e.g. wider intervals may have been used at the ends of the distribution where observations are sparse). In such cases, the bases of the rectangles will not be of equal width and the heights will not therefore be proportional to the frequencies if the histogram is drawn correctly. Suppose, for instance, that in Table 3 the final group had extended from 9·95 to 11·95. Its frequency would then have been 4, the same as for the adjacent group (8·95 to 9·95). For the rectangles representing these frequencies to have equal areas, they will not be of equal height, as Fig. 4 shows.

The second reason is that if one proceeds to the stage of using a probability model to describe the distribution of a continuous variate, it is an area, not a height, that represents relative frequency and therefore, also, frequency. The true interpretation of the height of a rectangle in the histogram is that it represents frequency density, i.e. frequency per unit along the horizontal axis.


Fig. 4. Construction of histogram where groups are of unequal width.
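A small sketch (Python assumed) of the frequency-density calculation behind the unequal-width variant just described, in which the last two groups of Table 3 are merged:

```python
# Frequency density = frequency / group width, so that the AREA of each
# rectangle, not its height, represents the frequency. The last two groups
# of Table 3 are here merged into one group, 9.95-11.95, of frequency 4.
edges = [1.95, 2.95, 3.95, 4.95, 5.95, 6.95, 7.95, 8.95, 9.95, 11.95]
freq = [1, 2, 1, 6, 16, 8, 8, 4, 4]

for lo, up, f in zip(edges[:-1], edges[1:], freq):
    height = f / (up - lo)   # equals f only where the width is 1
    print(f"{lo:5.2f}-{up:5.2f}  f = {f:2d}  height = {height:4.1f}")
# The merged group has width 2 and height 4/2 = 2, half that of the
# adjacent 8.95-9.95 group (4/1 = 4), exactly as drawn in Fig. 4.
```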

3.4 Cumulative frequency graph
A distribution of cumulative frequencies can be displayed diagrammatically by plotting a graph with relative cumulative frequency (usually given as a percentage) on the vertical scale and the variate on the horizontal scale. Figure 5 shows the plot for the distribution in Table 4. The vertical scale represents the percentage of days when the lead concentration was below the value given on the horizontal scale. Thus, for instance, 2% on the vertical scale corresponds to 2·95 μg/m³ on the horizontal scale.

The term 'ogive' is often given to a cumulative frequency graph, though it properly applies only in the case when the frequency distribution is symmetrical with a central peak and the cumulative frequency curve then has the shape of an elongated 'S'.

Fig. 5. Plot of percentage cumulative frequencies in Table 4.


The value on the horizontal scale corresponding to P% on the vertical scale is the Pth percentile of the distribution. Thus, in the present example, 5·95 μg/m³ is the 20th percentile, i.e. 20% of recordings gave lead concentrations below 5·95 μg/m³. The Global Environment Monitoring System (GEMS)² report on the results of health-related environmental monitoring gives 90th percentile values for various aspects of the water quality of rivers. For example, for the 190 rivers monitored for biochemical oxygen demand (BOD) the 90th percentile was 6·5 mg/litre. Or, to put it another way, for 10% of rivers, i.e. 19 of them, the level of BOD exceeded 6·5 mg/litre. The levels of pesticide residue and industrial chemicals in foods are also described by showing the 90th percentile in each case. For instance, for DDT in meat the 90th percentile value was 100 μg/kg and thus in 10% of participating countries this level was exceeded.

World Health Organization (WHO) guidelines established for urban air quality (e.g. levels of sulphur dioxide and suspended particulate matter) are expressed in terms of the 98th percentile, meaning that this level should not be exceeded more than 2% of the time, or approximately seven days in a year. A GEMS³ report on air quality in urban areas presents frequency distributions for data in terms of the 10th, 20th, ... and 90th percentiles (sometimes referred to as deciles). This form of presentation looks at cumulative frequencies from the reverse viewpoint to that used in Table 4. There, percentage cumulative frequencies were given for selected values of the variate, whereas the GEMS report gives values of the variate corresponding to selected percentage cumulative frequencies.

4 MEASURES OF LOCATION (TYPICAL VALUE)

We shall now look at ways of summarising a set of data by calculating numerical values which measure various aspects of it. The first measures to be considered will be ones which aim to give a general indication of the size of values taken by the variate. These are sometimes termed 'measures of location' because they are intended to give an indication of where the values of the variate are located on the scale of measurement. To illustrate this idea, let us suppose that, over a period of 24 h, recordings were made each hour of the carbon monoxide (CO) concentration (in parts per million) at a point near a motorway. The first 12 results, labelled 'day', were obtained in the period beginning at 08.00 h. The second 12 results, labelled 'night', were obtained in the remaining part of the period, i.e. beginning at 20.00 h.




Day: 5·8 6·9 6·7 6·7 6·3 5·8 5·5 6·1 6·8 7·0 7·4 6·4

Night: 5·0 3·8 3·5 3·3 3·1 2·4 1·8 1·5 1·3 1·3 2·0 3·4

From a glance at the two sets of data it is seen that higher values were recorded during the daytime period. (This is not surprising because the density of traffic, which one would expect to have an effect on CO concentration, is higher during the day.) This difference between the two sets of data is highlighted by the two dot diagrams shown in Fig. 6.

Fig. 6. Dot diagrams showing recordings of CO concentration.

4.1 The arithmetic mean

4.1.1 Definition
One way of indicating where each set of data is located on the scale of measurement is to calculate the arithmetic mean. This is the measure of location in most common use. Often it is simply referred to as the mean, as will sometimes be done in this chapter. There are other types of mean, e.g. the geometric mean, of which mention will be made later. However, in that case the full title is usually given so there should be no misunderstanding. In common parlance the term 'average' is also used for the arithmetic mean although, strictly speaking, it applies to any measure of location:

\text{Arithmetic mean} = \frac{\text{Sum of all observations}}{\text{Total number of observations}} \qquad (1)

Applying eqn (1) to the daytime recordings gives the mean CO concentration as

\frac{5.8 + 6.9 + \cdots + 7.4 + 6.4}{12} = \frac{77.4}{12} = 6.45 \text{ ppm}

Similarly, for the night recordings the mean is 32·4/12, i.e. 2·7 ppm.


Generalising, if x1, x2, x3, ..., xn are n values taken by a variate x, their mean x̄ is given by

\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}

Using Σ to denote 'the sum from i = 1 to i = n of', this can be written as

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (2)

Where there is no possibility of misunderstanding, the use of the suffix i on the right-hand side can be dropped and the sum simply written as Σx.

Example 5. An important aspect of solid waste is its physical composition. Knowledge of this is required, for instance, in the design and operation of a municipal incinerator or in the effective use of landfilling for disposal. An estimate of the composition can be obtained by manually separating portions of the waste into a number of physical categories and calculating the percentage of waste in each category. Let us suppose that in a preliminary study 6 portions, each weighing approximately 150 kg, are taken from waste arriving at a disposal site. Worker A separates 4 portions and reports the mean percentage (by weight) in the food waste category as 8·8. The corresponding 2 percentages obtained by Worker B in the separation of the remaining portions have a mean of 5·2. In order to obtain, for all 6 portions, the mean percentage in the food waste category, we return to the basic definition in eqn (1).

For Worker A, sum of observations = 4 × 8·8 = 35·2
For Worker B, sum of observations = 2 × 5·2 = 10·4
Combined sum of observations = 35·2 + 10·4 = 45·6
Combined number of observations = 4 + 2 = 6
For all 6 observations, mean percentage = 45·6/6 = 7·6

Note that the same answer would not be obtained by taking the mean of 8·8 and 5·2, as this would not take account of these two means being based on different numbers of observations. The evaluation of the overall mean was given by

\frac{4 \times 8.8 + 2 \times 5.2}{4 + 2}

and in this form you can see how 8·8 and 5·2 are weighted in accordance with the number of observations on which they are based.
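The same weighted combination, as a sketch in Python (my choice of language; the variable names are illustrative only):

```python
# Example 5: combining group means, weighted by the number of
# observations on which each is based.
means = [8.8, 5.2]   # mean food-waste percentages from Workers A and B
sizes = [4, 2]       # portions separated by each worker

overall = sum(m * n for m, n in zip(means, sizes)) / sum(sizes)
print(overall)   # 7.6, not the unweighted value (8.8 + 5.2)/2 = 7.0
```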


4.1.2 Properties of the arithmetic mean
Two properties of the arithmetic mean may now be noted. An appreciation of them will lead to a better understanding of procedures met with later on. The first concerns the values of x − x̄, i.e. the deviations of individual observations from their mean. As an example, consider the 4 values of x given here. Their mean, x̄, is 4. Hence the deviations are as shown, and it will be obvious that their sum is zero, the negative and positive deviations cancelling each other out.

x        1    5    8    2
x − x̄   −3    1    4   −2

Try it with any other set of numbers and you will find that the same thing happens. A little simple algebra shows why this must be so. For n observations x1, x2, ..., xn,

\sum (x - \bar{x}) = (x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_n - \bar{x})
                   = (x_1 + x_2 + \cdots + x_n) - n\bar{x}
                   = n\bar{x} - n\bar{x}
                   = 0

We have, therefore, the general result that the sum of the deviations from the arithmetic mean is always zero.

The second property concerns the values of the squares of the deviations from the mean, i.e. (x − x̄)². Continuing with the same set of 4 values of x, squaring the deviations and summing gives

\sum (x - \bar{x})^2 = (-3)^2 + 1^2 + 4^2 + (-2)^2 = 30

Now let us look at the sum of squares of deviations from a value other than x̄. Choosing 2, say, gives

\sum (x - 2)^2 = (1 - 2)^2 + (5 - 2)^2 + (8 - 2)^2 + (2 - 2)^2 = 46

Choosing 10 gives

\sum (x - 10)^2 = (1 - 10)^2 + (5 - 10)^2 + (8 - 10)^2 + (2 - 10)^2 = 174

Both these sums are larger than 30, as also would have been the case if numbers other than 2 and 10 had been chosen. This illustrates the general result that the sum of the squares of the deviations from the arithmetic mean is less than the sum of the squares of the deviations taken from any other


value. An algebraic proof of this is given by Bajpai et al.⁴ Here, another approach will be adopted by beginning with the question: 'What value of a will make Σ(xi − a)² a minimum?'

For convenience, let

S = \sum_{i=1}^{n} (x_i - a)^2 = (x_1 - a)^2 + (x_2 - a)^2 + \cdots + (x_n - a)^2

For a particular set of n values of x, S will vary with a. From differential calculus it is known that, for S to be a minimum, dS/da = 0. Now

\frac{dS}{da} = -2(x_1 - a) - 2(x_2 - a) - \cdots - 2(x_n - a)

Putting this equal to zero then gives

(x_1 - a) + (x_2 - a) + \cdots + (x_n - a) = 0

i.e.

(x_1 + x_2 + \cdots + x_n) - na = 0

whence

a = \bar{x}

As d²S/da² = 2n, which is positive, a minimum value of S is confirmed.

If x1, x2, ..., xn are replicate measurements obtained in the chemical analysis of a specimen, as was the case with the nitrate ion concentration data from which Table 2 was formed, it is a common practice to use their mean x̄ as an estimate of the actual concentration present. In so doing, an estimator is being used which satisfies the criterion that the sum of the squares of the deviations of the observations from it should be a minimum. A criterion of this kind is the basis of the method of least squares. An important instance of its application occurs in the estimation of the equations of regression lines, described in another chapter.
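A quick numerical check of this least-squares property, as a Python sketch using the four values worked through above:

```python
# Sum of squared deviations S(a) of the data from a trial value a,
# minimised when a equals the arithmetic mean (here x-bar = 4).
x = [1, 5, 8, 2]

def sum_sq(a):
    return sum((xi - a) ** 2 for xi in x)

for a in [2, 4, 10]:
    print(a, sum_sq(a))   # prints 2 46, 4 30 (the minimum), 10 174
```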

4.1.3 Calculation of the mean from a frequency table
We shall now consider how to interpret the definition of the arithmetic mean, as given in eqn (1), when the data are presented in the form of a frequency table. The colony counts in Example 1 will be used as an illustration. By reference to Table 1 it will be seen that the sum of the original 40 observations will, in fact, be the sum of five 0s, nine 1s, twelve 2s, seven 3s, five 4s and two 5s. Hence the value of the numerator in eqn


TABLE 5
Calculation of arithmetic mean for data in Table 1

Number of colonies (x)   Frequency (f)   fx
0                         5               0
1                         9               9
2                        12              24
3                         7              21
4                         5              20
5                         2              10
                         40              84

(1) is given by

5 × 0 + 9 × 1 + 12 × 2 + 7 × 3 + 5 × 4 + 2 × 5 = 84

The calculations can be set out as shown in Table 5.

Mean number of colonies per dish = 84/40 = 2·1

It will be noted that Σfx is the sum of all the observations (in this case the total number of colonies observed) and Σf is the total number of observations (in this case the total number of dishes).

An extension to the general case is easily made. If the variate takes values x1, x2, ..., xn with frequencies f1, f2, ..., fn respectively, the arithmetic mean is given by

\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n}

or, more simply,

\bar{x} = \frac{\sum fx}{\sum f} \qquad (3)

On examination of eqn (3), it will be seen that x1, x2, ..., xn are weighted according to the frequencies f1, f2, ..., fn with which they occur. The mean for the data in Example 2 can be found by a straightforward application of this formula to Table 2, thus giving 0·49. In Examples 1 and 2 no information was lost by putting the data into the form of a frequency table. In each case it would be possible, from the frequency table, to say what the original 40 observations were. The same is not true of the frequency table in Example 3. By combining values together in each class interval, some of the original detail was lost and it would not be possible


to reproduce the original data from Table 3. To calculate the mean from such a frequency table, the values in each class interval are taken to have the value at the mid-point of the interval. Thus, in Table 3, observed values in the first group (1·95-2·95) are taken to be 2·45, in the next group (2·95-3·95) they are taken to be 3·45, and so on. This is, of course, an approximation, but an unavoidable one. It may be of interest to know that the mean calculated in this way from Table 3 is 7·15, whereas the original 50 recordings give a mean of 7·08, so the approximation is quite good.
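As a sketch (Python assumed), the grouped-mean calculation for Table 3 using eqn (3) with mid-point values:

```python
# Mean from the grouped frequency table (Table 3), each observation being
# taken at the mid-point of its class interval: x-bar = sum(fx) / sum(f)
mids = [2.45, 3.45, 4.45, 5.45, 6.45, 7.45, 8.45, 9.45, 10.45, 11.45]
freq = [1, 2, 1, 6, 16, 8, 8, 4, 3, 1]

mean = sum(f * x for f, x in zip(freq, mids)) / sum(freq)
print(mean)   # 7.15, against 7.08 from the ungrouped recordings
```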

4.2 The mode
Although the arithmetic mean is the measure of location most often used, it has some drawbacks. In the case of a discrete variate, for instance, it will often turn out to be a value which cannot be taken by the variate. This happened with the colony counts in Table 5 where the mean was found to be 2·1, a value that would never result from counting the number of colonies in a dish. Nevertheless, in this particular situation, although the figure of 2·1 may appear to be nonsensical, it can be given an interpretation. The colony count, it will be recalled, was assumed to give the number of bacteria present in 1 ml of diluted liquid. Now, although the notion of 2·1 bacteria may appear ridiculous, it becomes meaningful when 2·1 bacteria per ml is converted into 2100 bacteria per litre.

Suppose, however, that a builder wants to decide what size of house will be in greatest demand. The fact that the mean size of household in the region is 3·2 persons is not helpful, as there will not be any potential customers of that kind. It might be more useful, in that case, to know the size of household that occurs most frequently. This is the mode. For a discrete variate it is the value where peak frequency occurs. Reference to Table 1 shows that for the colony counts the mode is 2. For a continuous variate, the mode occurs where the frequency density has a peak value. All the frequency distributions that have been considered in this chapter have a single peak and are therefore said to be unimodal. Sometimes, a bimodal distribution, with two peaks, may occur. This could happen if, for instance, the distribution is really a combination of two distributions.

4.3 The median
Another drawback of the arithmetic mean is that a few, or even just one, extreme recorded values can make it unrepresentative of the data as a whole, as an example will now show.

Values of conductivity (in μS/cm) were recorded for water specimens taken at Station 20 (Tidal Weir) on the River Clyde at intervals of 3 to 4 weeks in 1988. The results, in chronological order, were as follows:



240 289 290 308 279 380 574 488

3590 17200 235 260 323 318 188

The mean is 1664, but 13 of the 15 results are way below this. Clearly the value of the mean has been inflated by the two very high results, 3590 and 17200, recorded in July and August. (In fact, they arose from the tidal intrusion of sea water into the lower reaches of the river. The river board's normal procedure is to avoid taking samples when this occurs, but occasionally there is a residue of sea water in the samples after the tide has turned.) If these two values are excluded the mean of the remaining 13 is found to be 321.

The median is a more truly middle value, in that half of the observations are below it and half above. It is, in fact, the 50th percentile. Thus when the GEMS² report states that, for the 190 rivers monitored for biochemical oxygen demand (BOD), the median was 3 mg/litre, it indicates that 95 rivers reported a BOD level above 3 mg/litre and 95 reported a value below.

To calculate the value of the median from a set of data, it is necessary to consider the recorded values rearranged in order from the smallest to the largest (or the largest to the smallest, though this is less usual). For the River Clyde data, such rearrangement gives

188 235 240 260 279 289 290 308 318 323 380 488 574 3590 17200

There are 15 values so the middle one is the 8th and the median conductivity is therefore 308 μS/cm. Where the number of observations is even there will be two middle values and the median is taken to be midway between them. Thus for the 190 rivers in the GEMS survey, the median would have been midway between the 95th and 96th. Sorting such a large amount of data into order is easily done with the aid of a computer.
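A sketch (Python's statistics module assumed) contrasting the two measures on the conductivity data:

```python
import statistics

# River Clyde conductivity recordings (uS/cm); the two tidal values
# 3590 and 17200 inflate the mean but barely influence the median.
cond = [240, 289, 290, 308, 279, 380, 574, 488,
        3590, 17200, 235, 260, 323, 318, 188]

print(statistics.mean(cond))     # about 1664: unrepresentative
print(statistics.median(cond))   # 308, the 8th of the 15 ordered values
```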

When incomes of people in a particular industry or community are being described, it is often the median income that is specified rather than the mean. This is because when just a few people have very high incomes compared with the rest, the effect of those high values may be to raise the mean to a value that is untypical, as happened with the conductivity data. The mean would then give a false impression of the general level of income. In that situation the distribution of incomes would be of the type depicted in the histogram in Fig. 7(b). A similar pattern of distribution has been found to apply to the carbon monoxide emissions (in mass per unit distance travelled) of cars, and to blood lead concentrations.⁵⁻⁷ Such a distribution is said to be skew, in contrast to the symmetric distribution depicted in Fig. 7(a).


Fig. 7. Histograms showing distributions that are (a) symmetric; (b) skew.


While perfect symmetry can occur in a theoretical distribution used as a model, a set of data will usually show only approximate symmetry. Where variation is entirely attributable to random error, as with the repeated determinations of nitrate ion concentration in Example 2, the distribution is, in theory, symmetric. When perfect symmetry exists as in Fig. 7(a), it follows that

Mode = Median = Mean

When a distribution is skewed as in Fig. 7(b), with a long right-hand tail,

Mode < Median < Mean

A skew distribution can also be the reverse of that in Fig. 7(b), having the long tail at the left-hand end, though this situation is a less common occurrence. Whereas previously it was seen that a few high values resulted in the mean being untypically large, now the tail of low values reduces the mean to an untypically low level. The relation between the three measures of location is now

Mode > Median > Mean

A common practice in use by analytical chemists is to perform several determinations on the specimen under examination and then use the mean of the results to estimate the required concentration. Let us suppose that a titrimetric method was involved and that in four titrations the following volumes (in ml) were recorded:

25·06 24·89 25·03 25·01

Suspicion at once falls on the value 24·89 which appears to be rather far away from the other three readings. An observation in a set of data which seems to be inconsistent with the remainder of the set is termed an outlier. The question arises: 'Should it be included or excluded in subsequent calculations?'


Finding an answer becomes less important if the median is to be used instead of the mean. As has previously been noted, the median is affected much less by one or two extreme values than is the mean. Thus, taking the titration data as an example, rearrangement in order of magnitude gives:

24·89 25·01 25·03 25·06

The median is ½(25·01 + 25·03), i.e. 25·02. Now, it might be that 24·89 was the result of wrongly recording 24·98 (transposing numbers in this way is a common mistake). However, although such a mistake affects the value of the mean, it has no effect on the median. In general, the effect an outlier can have on the median is limited and this is an argument in favour of using the median, instead of the mean, in this kind of situation.

There are also situations in which, compared with the median, the mean is less practical to calculate or perhaps even impossible. As an example, the toxicity of an insecticide might be measured by observing how quickly insects are killed by it. To calculate the mean survival time of a batch of insects after having been exposed to the insecticide under test, it would be necessary to record the time of death of each insect. On the other hand, obtaining the median survival time would be just a matter of recording when half of the batch had been killed off and, moreover, it would not be necessary to wait until every insect had died. The 'half-life' of a radioactive element is another example. It is the time after which half the atoms will have disintegrated and is thus the median life of the atoms present, i.e. half of them will have a life less than the 'half-life' and half will have a life which exceeds it.

4.4 The geometric mean
A change of variable can transform a skew distribution of the type shown in Fig. 7(b) to a symmetric distribution which is more amenable to further statistical analysis. Such an effect could be produced by a logarithmic transformation, which entails changing from the variable x to a new variable y, where y = log x (using any convenient base). Then if the distribution of x is as in Fig. 7(b), the distribution of y would show symmetry. For a set of data x1, x2, ..., xn, such a transformation would produce a new set of values y1, y2, ..., yn where y1 = log x1, y2 = log x2, ..., yn = log xn. The mean ȳ of the new values is then given by

\bar{y} = \frac{1}{n} (y_1 + y_2 + \cdots + y_n)

The geometric mean (GM) of the original data is the value of x that


transforms to ȳ, i.e. is such that its logarithm is ȳ. Thus

\log \mathrm{GM} = \bar{y} = \frac{1}{n} (\log x_1 + \log x_2 + \cdots + \log x_n)

Rewriting this as

\log \mathrm{GM} = \frac{1}{n} \log (x_1 x_2 \cdots x_n)

leads to

\mathrm{GM} = \sqrt[n]{x_1 x_2 \cdots x_n}

Both the GEMS³ report on air quality and the UK Blood Lead Monitoring Programme⁵⁻⁷ include geometric means in their presentation of results.
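A sketch of the computation (Python assumed; the five data values are invented purely for illustration):

```python
import math

# Geometric mean via the log transformation: log GM = mean of log x,
# so GM = exp(y-bar). The five values here are invented for illustration.
x = [2.0, 3.5, 5.0, 8.0, 40.0]

y_bar = sum(math.log(v) for v in x) / len(x)
gm = math.exp(y_bar)

print(gm)               # geometric mean, about 6.45
print(sum(x) / len(x))  # arithmetic mean 11.7; larger, as the data are skew
```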

5 MEASURES OF DISPERSION

In addition to a measure that indicates where the observations are located on the measurement scale, it is also useful to have a measure that provides some indication of how widely they are scattered. Let us suppose that the percentage iron content of a specimen is required and that two laboratories, A and B, each make five determinations, with the following results:

Laboratory A: 13·99 14·15 14·28 13·93 14·30

Laboratory B: 14·12 14·10 14·15 14·11 14·17

Both sets of data give a mean of 14·13. (This would be unlikely to happen in practice, but the figures have been chosen to emphasise the point that is being made here.) It can, however, be seen at a glance that B's results show a much smaller degree of scatter, and thus better precision, than those obtained by A. Although the terms precision and accuracy tend to be used interchangeably in everyday speech, the theory of errors makes a clear distinction between them. A brief discussion of the difference between these two features of a measuring technique or instrument will be useful at this point.

Mention has already been made of the occurrence of random error in measurements. Another type of error that can occur is a systematic error (bias). Possible causes, cited by Lee & Lee,¹ are instrumental errors such as the zero incorrectly adjusted or incorrect calibration, reagent errors such as the sample used as a primary standard being impure or made up


to the wrong concentration of solution, or personal errors arising from always viewing a voltmeter from the side in an identical way or from the individual's judgment in detecting a colour change. Such a systematic error is present in every measurement. If it is constant in size, it is the amount by which the mean of an infinitely large number of repeated measurements would differ from the true value (of what is being measured). The accuracy of a measuring device is determined by the systematic error. If none exists, the accuracy is perfect. The greater the systematic error, the poorer the accuracy.

Accuracy, then, is concerned with the position of the readings on the measurement scale, the existence of the systematic error causing a general shift in one direction. This, of course, is the aspect of data which is described by a measure of location such as the mean. Precision, however, is concerned with the closeness of agreement between replicate test results, i.e. with the variation due to the presence of random error. Therefore, in describing precision, a measure of dispersion is appropriate. The greater the scatter, the poorer is the precision.

Although the need to measure dispersion has been introduced here in relation to the precision of a test procedure, the amount of variation is an important feature of most sets of data. For instance, the consistency of the quality of a product may matter to the consumer as well as the general level of quality.

5.1 The range
This is the simplest measure of spread, easy to understand and easy to calculate, being given by:

Range = Largest observation - Smallest observation

Thus returning to the %Fe determinations, we have

Laboratory A: Range = 14·30 − 13·93 = 0·37
Laboratory B: Range = 14·17 − 14·10 = 0·07

With calculators and computers now readily available, ease of calculation has become less important. This advantage of the range is now outweighed by the disadvantage that it is based on just two of the observations, the only account taken of the others being that they lie somewhere in between. One observation which is exceptionally large or small can exert a disproportionate influence on it and give a false impression of the general amount of scatter in the data.


5.2 The interquartile range
The influence of extreme values is removed by a measure that ignores the highest 25% and the lowest 25% of the observations. It is the range of the remaining middle 50%. You have already seen how the total frequency is divided into two equal parts by the median. Now we are considering the total frequency being divided into four equal parts by the quartiles. The lower quartile, Q1, has 25% of the observations below it and is therefore the 25th percentile. The upper quartile, Q3, is exceeded by 25% of the observations and is therefore the 75th percentile. The subdivision is completed by the middle quartile, Q2, which is, of course, the median. To summarise, there are thus:

25% of observations less than Q1
25% of observations between Q1 and Q2
25% of observations between Q2 and Q3
25% of observations above Q3

Using the range of the middle 50% of observations as a measure of spread:

Interquartile range = Upper quartile − Lower quartile = Q3 − Q1

The exceptionally high value at the upper extreme of the Clyde conductivity data (for which the median was found in Section 4.3) would have a huge effect on the range, but none at all on the interquartile range, which will now be calculated.

In the calculation of the median from a set of data there is a universal convention that when there is no single middle value, the median is taken to be midway between the two middle values. There is, however, no universally agreed procedure for finding the quartiles, or, indeed, percentiles generally. It can be argued that, in the same way that the median is found from the ordered data by counting halfway from one extreme to the other, the quartiles should be found by counting halfway from each extreme to the median. Applying this to the conductivity data, there are two middle values, the 4th and 5th, which are halfway between the 1st and 8th (the median). They are 260 and 279, so the lower quartile would then be taken as (260 + 279)/2, i.e. 269·5. Similarly, the upper quartile would then be (380 + 488)/2, i.e. 434, giving the interquartile range as 434 − 269·5 = 164·5.

Another approach is based on taking the Pth percentile of n ordered observations to be the (P/100)(n + 1)th value. For the conductivity data, n = 15 and the quartiles (P = 25 and P = 75) would be taken as the 4th


and 12th values, i.e. 260 and 488. For a larger set of data, the choice of approach would have less effect.
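The following minimal Python sketch implements the (P/100)(n + 1)th-value convention just described; note that statistical software may use yet other interpolation rules, so results for small samples can differ slightly between packages:

    def percentile_position(n, p):
        """Position of the Pth percentile among n ordered observations,
        using the (P/100)(n + 1) convention described in the text."""
        return (p / 100.0) * (n + 1)

    # For the conductivity data, n = 15: the quartiles fall at the 4th
    # and 12th ordered values (260 and 488), exactly as in the text.
    print(percentile_position(15, 25))   # 4.0
    print(percentile_position(15, 75))   # 12.0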

5.3 The mean absolute deviation
Although the interquartile range may be an improvement on the range, it still does not use to the full all the information given by the data. Greater use of the values recorded would be made by a measure which takes into account how much each value deviates from the arithmetic mean. Taking the mean of such deviations would, however, be of no avail. It has already been noted in Section 4.1.2 that their sum is always zero. One way round this would be to ignore the negative signs, i.e. take all deviations as positive. Their mean is then the mean absolute deviation (more often loosely referred to as the mean deviation, though that can be a misleading description). For n observations x_1, x_2, ..., x_n whose mean is x̄, it would be

\text{Mean absolute deviation} = \frac{1}{n} \sum |x - \bar{x}|

This measure of spread is, however, of limited usefulness and is not often met with nowadays.

5.4 Variance and standard deviation
Another way of dealing with the problem of the negative and positive deviations cancelling each other out is to square them. This idea fits in more neatly with the mathematical theory of statistics than does the use of absolute values. Taking the mean of the squares of the deviations gives the variance. Thus:

Variance = Mean of squares of deviations from the mean

However, this will not be in the original units. For example, if the data are in mm, the variance will be in mm². Usually, it is desirable to have a measure of spread expressed in the same units as the data themselves. Hence the positive square root of the variance is taken, giving the standard deviation, i.e.

Standard deviation = √Variance

The standard deviation is an example of a root-mean-square (r.m.s.) value. This is a general concept which finds application when the 'positives' and 'negatives' would cancel each other out, as for instance, in finding the mean value of an alternating current or voltage over a cycle. Thus the


figure of 240 V stated for the UK electricity supply is, in fact, the r.m.s. value of the voltage.

The calculations required when finding a standard deviation are more complicated than for other measures of dispersion, but this disadvantage has been reduced by the aids to computation now available. A calculator with the facility to carry out statistical calculations usually offers a choice of two possible values. To explain this we must now make a slight digression.

5.4.1 Population parameters and sample estimates
One of the main problems with which statistical analysis has to deal is that, of necessity, conclusions have to be made about what is called the population from only a limited number of values in it, called the sample.

A manufacturer of electric light bulbs who states that their mean length of life is 1000 h is making a statement about a population: the lengths of life of all the bulbs being manufactured by the company. The claim is made on the basis of a sample: the lengths of life of a small number of bulbs which have been tested in the factory's quality control department.

In introducing measures of dispersion, the results of 5 determinations of %Fe content made by each of two laboratories, A and B, were considered. What has to be borne in mind, when interpreting such data, is that if a further 5 determinations were made by, say, Laboratory A, it is most unlikely that they would give the same results as before. Such a set of 5 measurements, therefore, represents a sample from the population of all the possible measurements that might result from determinations by Laboratory A of the %Fe content of the specimen. If no bias is present in the measurement process, the mean of this population would be the actual %Fe content, which is, of course, what the laboratory is seeking to determine.

As a further illustration, let us take the solid waste composition study referred to in Example 5. The values for the percentage of food waste in the six 150-kg lots taken for separation can be regarded as a sample from the population of the values that would be obtained from the separation of all the 150-kg lots into which the waste arriving at the site could be divided. Obviously, it is the composition of the waste as a whole that the investigators would have wanted to know about, not merely the composition of the small amounts taken for separation.

In all these three cases, the situation is the same. Information about the population is what is required, but only information about a sample is available. Where values of population parameters, such as the mean and


standard deviation, are required, estimates based on a sample will have to be used. The convention of denoting population parameters by Greek letters is well established and will be followed here, with μ for the mean and σ for the standard deviation. The sample mean x̄ can be used as an estimate of the population mean μ. Different samples would produce a variety of values of x̄. Some would underestimate μ, some would overestimate it but, on average, there is no bias in either direction. In other words, x̄ provides an unbiased estimate of μ.

For a population of size N, the variance σ² is given by

\sigma^2 = \frac{1}{N} \sum (x - \mu)^2

where the summation is taken over all N values of x. Mostly, however, a set of data represents a sample from a population, not the population itself. It might be thought that Σ(x - x̄)²/n should be the estimate of σ² provided by a sample of size n. However, just as different samples give a variety of values of x̄, so also they will give a variety of values for Σ(x - x̄)²/n. While some of these underestimate σ² and some overestimate it, there is, on average, a bias towards underestimation. An unbiased estimate of σ² is given by Σ(x - x̄)²/(n - 1). (Explanation of this, in greater detail, can be found in Ref. 4.) Obviously, the larger the value of n, the less it matters whether n - 1 or n is used as the denominator. When n = 100, the result of dividing by n - 1 will differ little from that obtained when dividing by n, but when n = 4 the difference would be more appreciable. The estimate will be denoted here by s². Although this notation is widely adopted, it is not a universal convention so one should be watchful in this respect. For the unbiased estimate of the population variance, then, we have

s^2 = \frac{1}{n-1} \sum (x - \bar{x})^2 \qquad (4)

and the corresponding estimate of the population standard deviation is

s = \sqrt{\frac{1}{n-1} \sum (x - \bar{x})^2} \qquad (5)

5.4.2 Calculating s from the definition
Even though, in practice, a calculator may be used in finding s, setting out the calculations in detail here will help to clarify what the summation process in eqns (4) and (5) involves. This is done in Table 6 for the


TABLE 6
Calculation of s for Laboratory B's measurements of %Fe

x        x - x̄    (x - x̄)²    x²
14·12    -0·01    0·0001      199·3744
14·10    -0·03    0·0009      198·8100
14·15     0·02    0·0004      200·2225
14·11    -0·02    0·0004      199·0921
14·17     0·04    0·0016      200·7889
-----------------------------------------
70·65             0·0034      998·2879

x̄ = 70·65/5 = 14·13
s² = 0·0034/4 = 0·00085
s = 0·029

determinations of %Fe made by Laboratory B. (Ignore the column headed x² for the time being.)

Similar calculations give, for Laboratory A's results, s = 0·167, the larger value reflecting the greater amount of spread.
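The same results can be checked with the Python standard library, where statistics.stdev uses the n - 1 divisor of eqn (5) and statistics.pstdev the divisor N appropriate to a complete population (a minimal sketch using Laboratory B's data from Table 6):

    import statistics

    lab_b = [14.12, 14.10, 14.15, 14.11, 14.17]

    print(round(statistics.mean(lab_b), 2))    # 14.13
    print(round(statistics.stdev(lab_b), 3))   # 0.029, as in Table 6 (n - 1 divisor)
    print(round(statistics.pstdev(lab_b), 3))  # 0.026 (N divisor, for comparison)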

5.4.3 A shortcut method
The formula for s can be converted into an alternative form which cuts out the step of calculating deviations from the mean. By simple algebra, it can be shown that

\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} \qquad (6)

The %Fe results obtained by Laboratory B are again chosen for illustration.

From Table 6, Σx = 70·65 and Σx² = 998·2879. Substitution in eqn (6) gives

Σ(x - x̄)² = 998·2879 - 70·65²/5 = 998·2879 - 998·2845 = 0·0034

s is then found as before.

5.4.4 The use of coding

If you examine the calculations that have just been carried out in applying eqn (6), you will see that they involve a considerable increase in the number of digits involved. While the original items of data involved only


4 digits, the process of squaring and adding nearly doubled this number. The example will serve to show a hazard that may exist, not only in the use of the shortcut formula but in similar situations when a computer or calculator is used.

In the previous calculations all the figures were carried in the working and there was no rounding off. Now, let us look at the effect of working to a maximum of 5 significant figures. The values of x² would then be taken as 199·37, 198·81, 200·22, 199·09 and 200·79, giving Σx² = 998·28. Substitution in eqn (6) gives Σ(x - x̄)² = 998·28 - 998·28 = 0, leading to s = 0. What has happened here is that the digits discarded in the rounding off process are the very ones that produce the correct result. This occurs when the two terms on the right-hand side of eqn (6) are near to each other in value. Hence, in the present example, 7 significant figures must be retained in the working in order to obtain 2 significant figures in the answer.

Calculators and computers carry only a limited number of digits in their working. Having seen what can happen when just 5 observations are involved, only a little imagination is required to realise that, with a larger amount of data, overflow of the capacity could easily occur. The consequent round-off errors can then lead to an incorrect answer, of which the unsuspecting user of the calculator will be unaware (unless an obviously ridiculous answer like s = 0 is obtained).
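The hazard is easily reproduced. In the Python sketch below, rounding each squared value (and the correction term) to 5 significant figures before applying eqn (6) destroys the answer, exactly as in the worked example above:

    lab_b = [14.12, 14.10, 14.15, 14.11, 14.17]
    n = len(lab_b)
    sum_x = sum(lab_b)                                   # 70.65

    # Full precision retained throughout: correct sum of squared deviations.
    exact = sum(x * x for x in lab_b) - sum_x ** 2 / n
    print(round(exact, 4))                               # 0.0034

    # Squares rounded to 5 significant figures (2 decimal places here):
    # the digits discarded were the very ones carrying the answer.
    rounded = sum(round(x * x, 2) for x in lab_b) - round(sum_x ** 2 / n, 2)
    print(round(rounded, 4))                             # 0.0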

The number of digits in the working can be reduced by coding the data, i.e. making a change of origin and/or size of unit. For Laboratory B's %Fe data, subtracting 14·1 from each observation (i.e. moving the origin to x = 14·1) gives the coded values 0·02, 0·00, 0·05, 0·01 and 0·07. The spread of these values is just the same as that of the original data and they will yield the same value of s. The number of digits in the working will, however, be drastically reduced. This illustrates how a convenient number can be subtracted from each item of data without affecting the value of s. Apart from reducing possible risk arising from round-off error, fewer digits mean fewer keys to be pressed, thus saving time and reducing opportunities of making wrong entries. In fact, the %Fe data could be even further simplified by making 0·01 the unit, so that the values become 2, 0, 5, 1 and 7. It would then be necessary, when s has been calculated, to convert back to the original units by multiplying by 0·01. A more detailed explanation of coding is given in Ref. 4.
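A minimal Python sketch of the coding procedure confirms that shifting the origin to 14·1 and taking 0·01 as the unit leaves s unchanged once the result is converted back to the original units:

    import statistics

    lab_b = [14.12, 14.10, 14.15, 14.11, 14.17]
    coded = [round((x - 14.1) / 0.01) for x in lab_b]   # [2, 0, 5, 1, 7]

    s = statistics.stdev(coded) * 0.01    # convert back to the original units
    print(coded)
    print(round(s, 3))                    # 0.029, the same as before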


TABLE 7
Calculation of s for frequency distribution in Table 1

x    f     fx    x - x̄    (x - x̄)²    f(x - x̄)²
0    5      0    -2·1     4·41        22·05
1    9      9    -1·1     1·21        10·89
2    12    24    -0·1     0·01         0·12
3    7     21     0·9     0·81         5·67
4    5     20     1·9     3·61        18·05
5    2     10     2·9     8·41        16·82
-------------------------------------------------
     40    84                         73·60

x̄ = 84/40 = 2·1
s = √(73·60/39) = 1·37

5.4.5 Calculating s from a frequency table
As with the arithmetic mean, calculating s from data given in the form of a frequency table is just a matter of adapting the original definition. Thus eqn (5) now becomes

s = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f - 1}} \qquad (7)

The appropriate form of eqn (6) is now

\sum f(x - \bar{x})^2 = \sum f x^2 - \frac{(\sum f x)^2}{\sum f} \qquad (8)

Applying eqn (7) to the colony count data in Table 1 gives the calculations shown in Table 7. Applying eqn (8) to the same data,

Σfx² = 0 + 9 + 48 + 63 + 80 + 50 = 250

and hence

Σf(x - x̄)² = 250 - 84²/40 = 73·60

Calculation of s then proceeds as before.
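A short Python sketch of eqns (7) and (8) applied to the Table 7 frequencies reproduces the same value of s:

    from math import sqrt

    freq = {0: 5, 1: 9, 2: 12, 3: 7, 4: 5, 5: 2}   # colony counts: x -> f

    total_f = sum(freq.values())                        # 40
    sum_fx = sum(f * x for x, f in freq.items())        # 84
    sum_fx2 = sum(f * x * x for x, f in freq.items())   # 250

    ss = sum_fx2 - sum_fx ** 2 / total_f   # eqn (8): 250 - 84^2/40 = 73.60
    s = sqrt(ss / (total_f - 1))           # eqn (7)
    print(round(s, 2))                     # 1.37, as in Table 7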

5.5 Coefficient of variation (relative standard deviation)
An error of 1 mm would usually be of much greater consequence when measuring a length of 1 cm than when a length of 1 m is being measured. Although the actual error is the same in both cases, the relative error, i.e. the error as a proportion of the true value, is quite different. Similar remarks apply to a measure of spread. Hence, when using the standard


deviation as a measure of precision, an analytical chemist will often prefer to express it in relative terms. The coefficient of variation (CV) does this by expressing the standard deviation as a proportion (usually a percentage) of the true value as estimated by the mean. Thus, for a set of data,

CV = \frac{s}{\bar{x}} \times 100

This is also known as the percentage relative standard deviation (RSD). It is independent of the size of unit in which the variate is measured. For example, if data in litres were converted into millilitres, both x̄ and s would be multiplied by the same factor and thus their ratio would remain unchanged. The value of the CV is, however, not independent of the position of the origin on the measurement scale. Thus the conversion of data from °C to °F would affect its value, because the two temperature scales do not have the same origin. Although Marriott8 warns of its sensitivity to error in the mean, the CV enjoys considerable popularity.

6 MEASURES OF SKEWNESS

Another feature of a distribution, to which reference has already been made, is its skewness. Measures of location and dispersion have been dealt with at length because of their widespread use. Measures of skewness, however, are encountered less frequently and so only a brief mention is given here. The effect that lack of symmetry in a distribution has on the relative positions of the mean, median and mode was noted earlier. Measures of skewness have been devised which make use of this, by incorporating either the difference between mean and mode or the difference between mean and median. Another approach follows along the lines of the mean and variance in that it is based on the concept of moments. This is a mathematical concept which it would not be appropriate to discuss in detail here. Let it suffice to say that the variance, which involves the squares of the deviations from the mean, is a second moment. The coefficient of skewness uses a third moment which involves the cubes of the deviations from the mean.

Whichever measure of skewness is chosen, its value is:

- positive when the distribution has a long right tail as in Fig. 7(b),
- zero when the distribution is symmetric,
- negative when the distribution has a long tail at the left-hand end.
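By way of illustration, a minimal Python sketch of one common moment-based coefficient (the third moment about the mean divided by the cube of the standard deviation; exact normalisations vary between texts, so this is an assumption rather than a formula prescribed by the chapter):

    def moment_skewness(data):
        """Third moment about the mean divided by the s.d. cubed."""
        n = len(data)
        mean = sum(data) / n
        m2 = sum((x - mean) ** 2 for x in data) / n   # second moment
        m3 = sum((x - mean) ** 3 for x in data) / n   # third moment
        return m3 / m2 ** 1.5

    # A long right tail gives a positive value, as listed above.
    print(round(moment_skewness([1, 2, 2, 3, 10]), 2))   # 1.36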


7 EXPLORATORY DATA ANALYSIS

Various ways of summarising a set of data have been described in this chapter. Some of them are incorporated in two visual forms of display that have been developed in recent years and are now widely used in exploratory data analysis (EDA), in which J.W. Tukey has played a leading role. As its name suggests this may be used in the initial stages of an investigation in order to indicate what further analysis might be fruitful. A few simple examples of the displays will be given here, by way of introduction to the ideas involved. A more detailed discussion is provided by Tukey,9 and also by Erickson & Nosanchuk.10

7.1 Stem-and-leaf displays
We have already seen that one way of organising a set of data is to form a frequency table but that when this entails grouping values together some information about the original data is lost. In a stem-and-leaf display, observations are classified into groups without any loss of information occurring. To demonstrate how this is done, we shall use the following values of alkalinity (as calcium carbonate in mg/litre) recorded for water taken from the River Clyde at Station 20 (Tidal Weir) at intervals throughout 1987 and 1988.

68 90 102 74 68 108 122 85 115 62 66 89 66 70 112

50 67 60 66 60 92 117 126 88 133 60 68 76 82 42

A frequency table could be formed from the data by grouping together values from 40 to 49, 50 to 59, 60 to 69 and so on. All observations in the first group would then begin with 4, described by Tukey as the 'starting part'. This is now regarded as the stem. For the other groups the stems are thus 5, 6, etc. All observations in a particular group have the same stem. The remaining part of an observation, which varies within the group, is the leaf. Thus 68 and 62 have the same stem, 6, but their leaves are 8 and 2 respectively. The way that the stem and its leaves are displayed is shown in Table 8. If desired a column giving the frequencies can be shown alongside the display.

In the present example, the stem represented 'tens' and the leaf 'units'. If, however, the data on lead concentration in Example 3 were to be displayed, the stem would represent units and the leaf the first place of decimals. Thus, for the first reading, 5·4, the stem would be 5 and the leaf 4. In other situations a two-digit leaf might be required, in which case each pair of digits is separated from the next by a comma.
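A stem-and-leaf display of this kind is easily produced in a few lines of Python; the sketch below groups the alkalinity readings by their 'tens' stem and reproduces the layout of Table 8 (sorting each leaf string would give the ordered display of Table 9):

    alkalinity = [68, 90, 102, 74, 68, 108, 122, 85, 115, 62, 66, 89, 66, 70, 112,
                  50, 67, 60, 66, 60, 92, 117, 126, 88, 133, 60, 68, 76, 82, 42]

    stems = {}
    for value in alkalinity:
        stems.setdefault(value // 10, []).append(value % 10)   # stem = tens, leaf = units

    for stem in sorted(stems):
        leaves = "".join(str(leaf) for leaf in stems[stem])
        print(f"{stem:>2} | {leaves}   ({len(stems[stem])})")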


TABLE 8
Stem-and-leaf display for River Clyde (Station 20) alkalinity data

Stem   Leaf            Frequency
 4     2               1
 5     0               1
 6     88266706008     11
 7     406             3
 8     5982            4
 9     02              2
10     28              2
11     527             3
12     26              2
13     3               1

A further refinement of Table 8 can be achieved by placing the leaves in order of magnitude, as shown in Table 9. It then becomes easier to identify such measures as the quartiles, for instance.

Two sets of data can be compared using a back-to-back stem-and-leaf display. With a common stem, the leaves for one set of data are shown on the left and the leaves for the other set on the right. Table 10 shows the alkalinity data for Station 20 on the River Clyde, previously displayed in Table 8, on the right. The leaves on the left represent values of alkalinity recorded during the same period at Station 12A.

TABLE 9
Ordered stem-and-leaf display for data in Table 8

Stem   Leaf
 4     2
 5     0
 6     00026667888
 7     046
 8     2589
 9     02
10     28
11     257
12     26
13     3


TABLE 10
Back-to-back stem-and-leaf display for River Clyde alkalinity data

Station 12A            Station 20

        924    3
        688    4    2
    9885234    5    0
      20205    6    88266706008
         21    7    406
        780    8    5982
         62    9    02
          6   10    28
         40   11    527
              12    26
              13    3

7.2 Box-and-whisker displays (box plots)
Another form of display is one which features five values obtained from the data: the three quartiles (lower, middle and upper) and two extremes (one at each end). For reasons which will become obvious, it is called a box-and-whisker plot. You may also find it referred to as a box-and-dot plot or, more simply, as a box plot.

The box represents the middle half of the distribution. It extends from the lower quartile to the upper quartile, and a line across it indicates the position of the median. The 'whiskers' extend to the extremities of the data. Taking again, as an example, the alkalinity data recorded for the River Clyde at Station 20, the lowest value recorded was 42 and the highest 133. There were 30 observations so, when they are placed in order, the median is halfway between the 15th and 16th, i.e. 75. Taking the lower quartile as the 8th, i.e. 66, and the upper quartile as the 23rd, i.e. 102, the box-and-whisker display is as shown in Fig. 8.
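The five values just quoted are straightforward to extract in Python (a minimal sketch using the counting conventions adopted in the text):

    alkalinity = sorted([68, 90, 102, 74, 68, 108, 122, 85, 115, 62, 66, 89,
                         66, 70, 112, 50, 67, 60, 66, 60, 92, 117, 126, 88,
                         133, 60, 68, 76, 82, 42])

    low, high = alkalinity[0], alkalinity[-1]          # extremes: 42 and 133
    median = (alkalinity[14] + alkalinity[15]) / 2     # between 15th and 16th: 75.0
    q1, q3 = alkalinity[7], alkalinity[22]             # 8th and 23rd: 66 and 102

    print(low, q1, median, q3, high)   # 42 66 75.0 102 133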

[Fig. 8. Box-and-whisker display for River Clyde (Station 20) alkalinity data; horizontal axis: alkalinity (as CaCO3 in mg/litre), from 40 to 140.]

Here the box is aligned with a horizontal scale but, if preferred, it can be positioned vertically as in Fig. 9. Where a larger amount of data is involved, the bottom and top 5% or 10% of the distribution may be ignored when drawing the whiskers, thus cutting out any freak values that might have occurred. This is done in the reports of the UK Blood Lead Monitoring Programme6,7 where the extremities of the whiskers are the 5th and 95th percentiles. These reports provide an example of a particularly effective use of box plots, in that they can be shown alongside one another to compare various data sets, e.g. blood lead concentrations in men and women, in different years or in various areas where surveys were carried out. Figure 9 gives an indication of how this can be done, enabling a rapid visual assessment of any differences between sets of data to be made quite easily.

[Fig. 9. Box-and-whisker displays comparing blood lead concentrations in men and women in successive years; vertical axis: blood lead concentration; groups: men and women in Year 1 and Year 2.]


ACKNOWLEDGEMENT

The authors are indebted to Desmond Hammerton, Director of the Clyde River Purification Board, for supplying data on water quality and for giving permission for its use in illustrative examples in this chapter.

REFERENCES

1. Lee, J.D. & Lee, T.D., Statistics and Numerical Methods in BASIC for Biologists. Van Nostrand Reinhold, Wokingham, 1982.

2. GEMS: Global Environment Monitoring System, Global Pollution and Health. United Nations Environment Programme and World Health Organization, London, 1987.

3. GEMS: Global Environment Monitoring System, Air Quality in Selected Urban Areas 1975-1976. World Health Organization, Geneva, 1978.

4. Bajpai, A.C., Calus, I.M. & Fairley, J.A., Statistical Methods for Engineers and Scientists. John Wiley, Chichester, 1978.

5. Department of the Environment, UK Blood Lead Monitoring Programme 1984-1987: Results for 1984. Pollution Report No. 22. Her Majesty's Stationery Office, London, 1986.

6. Department of the Environment, UK Blood Lead Monitoring Programme 1984-1987: Results for 1985. Pollution Report No. 24. Her Majesty's Stationery Office, London, 1987.

7. Department of the Environment, UK Blood Lead Monitoring Programme 1984-1987: Results for 1986. Pollution Report No. 26. Her Majesty's Stationery Office, London, 1988.

8. Marriott, F.H.C., A Dictionary of Statistical Terms. 5th edn, Longman Group, UK, Harlow, 1990.

9. Tukey, J.W., Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.

10. Erickson, B.H. & Nosanchuk, T.A., Understanding Data. Open University Press, Milton Keynes, 1979.


Chapter 2

Environmetric Methods of Nonstationary Time-Series Analysis: Univariate Methods

PETER YOUNG and TIM YOUNG*

Centre for Research in Environmental Systems, Institute of Environmental and Biological Sciences, Lancaster University, Lancaster, Lancashire, LA1 4YQ, UK

1 INTRODUCTION

By 'environmetrics', we mean the application of statistical and systems methods to the analysis and modelling of environmental data. In this chapter, we consider a particular class of environmetric methods; namely the analysis of environmental time-series. Although such time-series can be obtained from planned experiments, they are more often obtained by passively monitoring the environmental variables over long periods of time. Not surprisingly, therefore, the statistical characteristics of such series can change considerably over the observation interval, so that the series can be considered nonstationary in a statistical sense. Figure 1(a), for example, shows a topical and important environmental time-series: the variations of atmospheric CO2 measured at Mauna Loa in Hawaii over the period 1974 to 1987. This series exhibits a clear upward trend, together with pronounced annual periodicity. The trend behaviour is a classic example of statistical nonstationarity of the mean, with the local mean value of the series changing markedly over the observation interval.

*Present address: Maths Techniques Group, Bank of England, Threadneedle Street, London EC2R 8AH.



[Fig. 1. Three examples of typical nonstationary environmental time series: (a) the Mauna Loa CO2 series (1974-1987); (b) the Waldmeier sunspot series (1700-1950); (c) the Raup-Sepkoski 'Extinctions' series, percentage extinctions against sample number and time in My BP (between the Permian (253 My BP) and the Tertiary (11·3 My BP) periods of geologic time).]


The nature of the periodicity (or seasonality), on the other hand, shows no obvious indications of radical change although, as we shall see, there is some evidence of mild nonstationarity even in this aspect of the series.

The well-known Waldmeier annual sunspot data (1700-1950) shown in Fig. 1(b) are also quite clearly nonstationary, although here it is the amplitude modulation of the periodic component which catches the eye, with the mean level remaining relatively stable. However, here we see also that the periodic variations are clearly distorted as they fluctuate around the 'mean' value, with the major amplitude variation appearing in the upper maxima. This suggests more complex behavioural characteristics and the possibility of nonlinearity as well as nonstationarity in the data.

Finally, Fig. 1(c) shows 70 samples derived from the 'Extinction' series compiled by Raup & Sepkoski.1 This series is based on the percentage of families of fossil marine animals that appear to become extinct over the period between the Permian (253 My BP) and the Tertiary (11·3 My BP) periods of geologic time (My = 1 000 000 years). The series in Fig. 1(c) was obtained from the original, irregularly spaced Raup & Sepkoski series by linear interpolation at a sampling interval of 3·5 My. Using a simple analysis, which gave equal weighting to all the observed peaks, Raup & Sepkoski1 noted that the series appeared to have a cyclic component with a period in the region of 26 My. However, any evaluation of this cyclic pattern must be treated with caution since it will be affected by the shortness of the record and, as in the case of the sunspot data, by the great variation in amplitude and associated asymmetry of the main periodic component. In other words there are, once again, clear indications of both nonlinearity and nonstationarity in the data, both of which are not handled well by conventional techniques of time-series analysis.

The kinds of nonstationarity and/or nonlinearity observed in the three series shown in Fig. 1 are indicative of changes in the underlying statistical characteristics of the data. As a result, we might expect that any mathematical models used to characterise these data should be able to represent this nonstationarity or nonlinearity if they are to characterise the time-series in an acceptable manner. Box & Jenkins,2 amongst many other statisticians, have recognised this problem and have proposed various methods of tackling it. In their case, they propose simple devices, such as differencing the data prior to model identification and estimation, in order to remove the trend,2 or nonlinear transformation prior to analysis in order to purge the series of its nonlinearity. But what if, for example, we do not wish to difference the data, since we feel that this will amplify high frequency components in the series? Can we account for the nonstationarity


of the mean in other ways? Or again, if we do not choose to perform prior nonlinear transformation, or feel that even after such transformation the seasonality is still distorted in some manner, then how can we handle this situation? Similarly, what if we find that, on closer inspection, the amplitude of the seasonal components of the CO2 data set is not, in fact, exactly constant: can we develop a procedure for estimating these variations so that the estimates can be useful in exercises such as 'adjusting' the data to remove the seasonal effects?

In this chapter, we try to address such questions as these by considering some of the newest environmetric tools available for evaluating nonstationary and nonlinear time-series. There is no attempt to make the treatment comprehensive; rather the authors' aim is to stimulate interest in these new and powerful methodological tools that are just emerging and appear to have particular relevance to the analysis of environmental data. In particular, a new recursive estimation approach is introduced to the modelling, forecasting and seasonal adjustment of nonstationary time-series and its utility is demonstrated by considering the analysis of both the Mauna Loa CO2 and the Extinctions data.

2 PRIOR EVALUATION OF THE TIME-SERIES

We must start at the beginning: the first step in the evaluation of any data set is to look at it carefully. This presumes the availability of a good computer filing system and associated plotting facilities. Fortunately, most scientists and engineers have access to microcomputers with appropriate software; either the IBM-PC-AT/IBM PS2 and their compatibles, or the Macintosh SE/II family. In this chapter, we will use mainly the microCAPTAIN program developed at Lancaster,3 which is designed for the IBM-type machines but will be extended to the Macintosh in the near future. Other programs, such as StatGraphics®, can also provide similar facilities for the basic analysis of time-series data, but they do not provide the recursive estimation tools which are central to the microCAPTAIN approach to time-series analysis.

Visual appraisal of time-series data is dependent very much on the background and experience of the analyst. In general, however, factors such as nonstationarity of the mean value and the presence of pronounced periodicity will be quite obvious. Moreover, the eye is quite good at analysing data and perceiving underlying patterns, even in the presence of background noise: in other words, the eye can effectively 'filter' off the


effects of stochastic (random) influences from the data and reveal aspects of the data that may be of importance to their understanding within the context of the problem under consideration. The comments above on the CO2 and sunspot data are, for example, typical of the kind of initial observations the analyst might make on these two time-series.

Having visually appraised the data, the next step is to consider their more quantitative statistical properties. There are, of course, many different statistical procedures and tests that can be applied to time-series data and the reader should refer to any good text on time-series analysis for a comprehensive appraisal of the subject. Here, we will consider only those statistical procedures that we consider to be of major importance in day-to-day analysis; namely, correlation analysis in the time-domain, and spectral analysis in the frequency-domain.

A discrete time-series is a set of observations taken sequentially in time; thus N observations, or samples, taken from a series y(t) at times t1, t2, ..., tk, ..., tN may be denoted by y(t1), y(t2), ..., y(tk), ..., y(tN). In this chapter, however, we consider only sampled data observed at some fixed interval δt: thus we then have N successive values of the series available for analysis over the observation interval of N samples, so that we can use y(1), y(2), ..., y(k), ..., y(N) to denote the observations made at equidistant time intervals t0, t0 + δt, t0 + 2δt, ..., t0 + kδt, ..., t0 + Nδt. If we adopt t0 as the origin and δt as the sampling interval, then we can regard y(k) as the observation at time t = tk.

A stationary time-series is one which can be considered in a state of statistical equilibrium; while a strictly stationary time-series is one in which its statistical properties are unaffected by any change in the time origin to. Thus for a strictly stationary time-series, the joint distribution of any set of observations must be unaffected by shifting the observation interval. A nonstationary time-series violates these requirements, so that its statistical description may change in some manner over any selected observation interval.

What do we mean by 'statistical description'? Clearly a time-series can be described by numerous statistical measures, some of which, such as the sample mean ȳ and the sample variance σ_y², are very well known, i.e.

\bar{y} = \frac{1}{N} \sum_{k=1}^{N} y(k); \qquad \sigma_y^2 = \frac{1}{N} \sum_{k=1}^{N} [y(k) - \bar{y}]^2

If we are to provide a reasonably rich description of a fairly complex time-series, however, it is necessary to examine further the temporal patterns in the data and consider other, more sophisticated but still


conventional statistical measures. It is necessary to emphasise, however, that these more sophisticated statistics are all defined for stationary processes, since it is clearly much more difficult to consider the definition and computation of statistical properties that may change over time. We will leave such considerations, therefore, until Section 3 and restrict discussion here to the conventional definitions of these statistics.

2.1 Statistical properties of time-series in the time domain: correlation analysis
The mean and variance provide information on the level and the spread about this level: they define, in other words, the first two statistical moments of the data. For a time-series, however, it is also useful to evaluate the covariance properties of the data, as defined by the sample covariance matrix, i.e.

C = \begin{bmatrix} c_0 & c_1 & \cdots & c_{r-1} \\ c_1 & c_0 & \cdots & c_{r-2} \\ \vdots & \vdots & \ddots & \vdots \\ c_{r-1} & c_{r-2} & \cdots & c_0 \end{bmatrix}

where c_n is the covariance at lag n, which is defined as

c_n = \frac{1}{N} \sum_{k=1}^{N-n} [y(k) - \bar{y}][y(k+n) - \bar{y}]; \qquad n = 0, 1, 2, \ldots, r-1

which provides a useful indication of the average statistical relationship of the time-series separated, or lagged, by n samples. Two other related time-domain statistical measures that are of particular importance in characterising the patterns in time-series data are the sample autocorrelation function and the sample partial autocorrelation function, which are exploited particularly in the time-series analysis procedures proposed by Box and Jenkins in their classic text on time-series analysis, forecasting and control.2

For a stationary series, the sample autocorrelation function (AF), r(n), at lag n is simply the normalised covariance function, defined as

r(n) = \frac{c_n}{c_0}; \qquad n = 0, 1, 2, \ldots, r-1


Clearly, there is perfect correlation for lag zero and, by definition, r(0) = 1·0. If r(n) is insignificantly different from zero for any other lag, then there is no significant relationship at lag n; if r(n) is close to unity, then there is a significant positive autocorrelation at lag n; and if r(n) is close to minus unity, then there is a significant negative autocorrelation at lag n. Thus for a strongly seasonal series of period p samples, we would expect the lag p autocorrelation to be highly significant and positive; and the lag p/2 autocorrelation to be highly significant and negative. Statistical significance in these terms is discussed, for example, in Ref. 2.
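A minimal Python sketch of these definitions, applied to an artificial seasonal series of period 12, illustrates the behaviour just described:

    import math

    def autocorrelation(y, max_lag):
        """Sample autocorrelation r(n) for n = 0..max_lag, divisor N throughout."""
        n = len(y)
        mean = sum(y) / n
        c0 = sum((v - mean) ** 2 for v in y) / n
        r = []
        for lag in range(max_lag + 1):
            c = sum((y[k] - mean) * (y[k + lag] - mean) for k in range(n - lag)) / n
            r.append(c / c0)
        return r

    # A strictly periodic series, period p = 12 samples.
    y = [math.sin(2 * math.pi * k / 12) for k in range(120)]
    r = autocorrelation(y, 12)
    print(round(r[6], 2), round(r[12], 2))   # -0.95 0.9: negative at p/2, positive at p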

The partial autocorrelation function is not such an obvious measure and derives directly from modelling the time-series as an autoregressive process of order n, i.e. by assuming that sequential samples from the series can be related by the following expression,

y(k) + a_1 y(k-1) + a_2 y(k-2) + \cdots + a_n y(k-n) = e(k)

or, in vector terms,

y(k) = z^T(k) a + e(k)

where

a = [a_1 \; a_2 \; \ldots \; a_n]^T; \qquad z(k) = [-y(k-1) \; -y(k-2) \; \ldots \; -y(k-n)]^T

in which the superscript T denotes the vector-matrix transpose; a_i, i = 1, 2, ..., n, are unknown constant coefficients or autoregressive parameters; and e(k) is a zero mean, serially uncorrelated sequence of random variables with variance σ², i.e. discrete 'white noise'. This AR(n) model suggests that there is a linear relationship between y(k) and its n previous values y(k-1), ..., y(k-n), which is also affected by a stochastic influence in the form of the white noise input e(k); in other words, y(k) is a weighted linear aggregate of its past values and a 'random shock' e(k).

If we introduce the backward shift operator z^{-r} of order r, i.e. z^{-r} y(k) = y(k-r), then this AR(n) process can be written in the transfer function form,

y(k) = \frac{1}{1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_n z^{-n}} \, e(k)


which can be represented in the following block diagram terms,

e(k) ---> [ 1 / (1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_n z^{-n}) ] ---> y(k)
                         (autoregressive filter)

where we see that y(k) can be interpreted as the output of an autoregressive filter which shapes the 'white' noise input e(k) to produce 'colour' or temporal pattern in y(k). This temporal pattern, or autocorrelation, can be evaluated quantitatively by the statistical identification of the order n of the AR process, and the estimation of the associated model parameters a_i, i = 1, 2, ..., n, and σ² (see, e.g. Refs. 2 and 5). Here identification is normally accomplished by reference to a statistical structure (or order) identification procedure, of which the best known is that proposed by Akaike;6 while estimation of the parameters characterising this identified model structure is normally accomplished by least squares or recursive least squares procedures, as discussed in Section 3.

The partial autocorrelation function (PAF) exploits the fact that, whereas it can be shown that the autocorrelation function for an AR(n) process is theoretically infinite in extent, it can clearly be represented in terms of a finite number of coefficients, namely the n parameters of the AR model. This topic is discussed fully in Ref. 2 but, in effect, the PAFs are obtained by estimating successively AR models of order 1, 2, 3, ..., p by least squares estimation. The PAF values are then defined as the sequence of estimates of the last (i.e. nth) coefficient in the AR model at each successive step, with the 'lag' n = 1, 2, 3, ..., p, where p is chosen as a sufficiently large integer by the analyst. For a time-series described adequately by an AR(n) process, the PAF should be significantly non-zero for p ≤ n but will be insignificantly different from zero for p > n: in other words, the PAF of an nth order autoregressive process should have a 'cut-off' after lag n. In practice, however, this cut-off is often not sufficiently sharp to unambiguously define the appropriate AR order and order identification criteria such as the AIC seem preferable in practice (although it can be argued that the AIC tends to over-identify the order).

2.2 Statistical properties of time-series in the frequency-domain: spectral analysis
Time-domain methods of statistical analysis, such as those discussed above, are often easier to comprehend than their relatives in the


frequency-domain: after all, we observe the series and plot them initially in temporal terms, so that the first patterns we see in the data are those which are most obvious in a time-domain context. An alternative way of analysing a time-series is on the basis of Fourier-type analysis, i.e. to assume that the series is composed of an aggregation of sine and cosine waves with different frequencies. One of the simplest and most useful procedures which uses this approach is the periodogram introduced by Schuster in the late nineteenth century.2

The intensity of the periodogram, I(f_i), is defined by

I(f_i) = \frac{2}{N} \left\{ \left[ \sum_{k=1}^{N} y(k)\cos(2\pi f_i k) \right]^2 + \left[ \sum_{k=1}^{N} y(k)\sin(2\pi f_i k) \right]^2 \right\}; \qquad i = 1, 2, \ldots, q

where q = (N - 1)/2 for odd N and q = N/2 for even N. The periodogram is then the plot of I(f_i) against f_i, where f_i = i/N is the ith harmonic of the fundamental frequency 1/N, up to the Nyquist frequency of 0·5 cycles per sampling interval (which corresponds to the smallest identifiable wavelength of two samples). Since I(f_i) is obtained by multiplying y(k) by sine and cosine functions of the harmonic frequency, it will take on relatively large values when this frequency coincides with a periodicity of this frequency occurring in y(k). As a result, the periodogram maps out the spectral content of the series, indicating how its relative power varies over the range of frequencies between f_i = 0 and 0·5. Thus, for example, pronounced seasonality in the series with period T = 1/f_i samples will induce a sharp peak in the periodogram at f_i cycles/sample; while if the seasonality is amplitude modulated or the period is not constant then the peak will tend to be broader and less well defined.
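A direct Python transcription of the periodogram formula (a sketch, computed naively rather than via the FFT that would be used in practice) confirms the sharp peak for a strictly periodic series:

    import math

    def periodogram(y):
        """I(f_i) at the harmonic frequencies f_i = i/N, i = 1..q."""
        n = len(y)
        q = (n - 1) // 2 if n % 2 else n // 2
        result = []
        for i in range(1, q + 1):
            f = i / n
            c = sum(yk * math.cos(2 * math.pi * f * t) for t, yk in enumerate(y, 1))
            s = sum(yk * math.sin(2 * math.pi * f * t) for t, yk in enumerate(y, 1))
            result.append((f, (2 / n) * (c * c + s * s)))
        return result

    # A 12-sample periodicity gives a sharp peak at f = 1/12 cycles/sample.
    y = [math.sin(2 * math.pi * t / 12) for t in range(1, 121)]
    f_peak, i_peak = max(periodogram(y), key=lambda p: p[1])
    print(round(f_peak, 4), round(i_peak, 1))   # 0.0833 60.0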

The sample spectrum is simply the periodogram with the frequency f allowed to vary continuously over the range 0 to 0·5 cycles, rather than restricting it to the harmonics of the fundamental frequency (often, as in later sections of the present chapter, the sample spectrum is also referred to as the periodogram). This sample spectrum is, in fact, related directly to the autocovariance function by the relationship

I(f) = 2 \left\{ c_0 + 2 \sum_{k=1}^{r-1} \lambda_k c_k \cos(2\pi f k) \right\}; \qquad 0 \le f \le 0·5

with λ_k = 1·0 for all k. In other words, the sample spectrum is the Fourier cosine transform of the estimate of the autocovariance function. It is clearly a very useful measure of the relative power of y(k) at different frequencies and provides a quick and easy method of evaluating the series


in this regard. It is true that the sample spectrum obtained in this manner has high variance about the theoretical 'true' spectrum, which has led to the computation of 'smoothed' estimates obtained by choosing the λ_k in the above expression to have suitably chosen weights called the lag window. However, the raw sample spectrum remains a fundamentally important statistical characterisation of the data which is complementary in the frequency-domain with the autocovariance or autocorrelation function in the time-domain.

There is also a spectral representation of the data which is complementary with the partial autocorrelation function, in the sense that it depends directly on autoregression estimation: this is the autoregressive spectrum. Having identified and estimated an AR model for the time-series data in the time-domain, its frequency-domain characteristics can be inferred by noting that the spectral representation of the backward shift operator, for a sampling interval δt, is given by

z^{-r} = \exp(-j 2\pi r f \delta t) = \cos(2\pi r f \delta t) - j \sin(2\pi r f \delta t); \qquad 0 \le f \le 0·5

so that by substituting for z^{-r}, r = 1, 2, ..., n, in the AR transfer function, it can be represented as a frequency-dependent complex number of the form A(f) + jB(f). The spectrum associated with this representation is then obtained simply by plotting the squared amplitude A(f)² + B(f)², or its logarithm, either against f in the range 0 to 0·5 cycles/sample interval, or against the period 1/f in samples. This spectrum, which is closely related to the maximum entropy spectrum,4 is much smoother than the sample spectrum and appears to resolve spectral peaks rather better than the more directly smoothed versions of the sample spectrum.
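A sketch of the computation in Python is given below; the AR(2) coefficients are hypothetical, chosen simply to place a spectral peak near f = 1/12 cycles/sample (δt is taken as one sample here):

    import cmath
    import math

    def ar_spectrum(a, f):
        """Squared amplitude of 1/(1 + a_1 z^-1 + ... + a_n z^-n) at frequency f,
        with z^-r replaced by exp(-j 2 pi f r)."""
        denom = 1 + sum(a_r * cmath.exp(-2j * math.pi * f * r)
                        for r, a_r in enumerate(a, start=1))
        return 1 / abs(denom) ** 2

    # Hypothetical AR(2) coefficients: complex poles at radius 0.95, period 12.
    a = [-2 * 0.95 * math.cos(2 * math.pi / 12), 0.95 ** 2]
    for f in (0.02, 1 / 12, 0.2):
        print(round(f, 3), round(ar_spectrum(a, f), 1))   # largest value at f = 1/12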

3 RECURSIVE ESTIMATION

The AF, PAF, sample spectrum and AR spectrum computations reduce a stationary time-series to a set of easily digestible plots which describe very well its statistical properties and provide a suitable platform for subsequent time-series analysis and modelling. It seems likely, however, that if we wish to develop a general procedure for modelling nonstationary time-series, then we should utilise estimation techniques that are able to handle models, such as the AR process discussed in the last section, with parameters which are not constant, as in the conventional approach to time-series analysis, but which may vary over time. This was one of the major motivations for the development of recursive techniques for Time


Variable Parameter (TVP) estimation, in which the object is to 'model the parameter variations'7-9 by some form of stochastic state-space model. Such TVP models have been in almost continual use in the control and systems field since the early 1960s, when Kopp and Orford,10 and Lee11

pioneered their use in the wake of Kalman's seminal paper on state variable filtering and estimation theory.12 The present first author made liberal use of this same device in the 1960s within the context of self-adaptive control7,8,13-16 and, in the early 1970s, reminded a statistical audience of the extensive systems literature on recursive estimation (see Refs. 17 and 18; also the comments of W.D. Ray on Ref. 19). Since the early 1970s, time-varying parameter models have also been proposed and studied extensively in the statistical econometrics literature. Engle et al. have presented a brief review of this literature and discuss an interesting application to electricity sales forecasting, in which the model is a time-variable parameter regression plus an adaptive trend.20

The best known recursive estimation procedure is the recursive least squares (RLS) algorithm.5,9,21 It is well known that the least squares estimates of the parameters in the AR(n) model are obtained by choosing that estimate â of the parameter vector a which minimises the least squares cost function J, where

J = \sum_{k=n+1}^{N} [y(k) - z^T(k)\hat{a}]^2

and that this minimisation, obtained simply by setting the gradient of J with respect to â equal to zero in the usual manner, i.e.

\frac{\partial J}{\partial \hat{a}} = -2 \sum_{k=n+1}^{N} z(k)[y(k) - z^T(k)\hat{a}] = 0

results in an estimate â(N), for an N sample observation interval, which is obtained by the solution of the normal equations of least squares regression analysis,

\left[ \sum_{k=n+1}^{N} z(k) z^T(k) \right] \hat{a}(N) = \sum_{k=n+1}^{N} z(k) y(k)

Alternatively, at an arbitrary kth sample within the N observations, the RLS estimate â(k) can be obtained from the following algorithm:

\hat{a}(k) = \hat{a}(k-1) + g(k) \{ y(k) - z^T(k)\hat{a}(k-1) \} \qquad (1)

where

g(k) = P(k-1) z(k) [\sigma^2 + z^T(k) P(k-1) z(k)]^{-1}


and

P(k) = P(k-1) - P(k-1) z(k) [\sigma^2 + z^T(k) P(k-1) z(k)]^{-1} z^T(k) P(k-1) \qquad (2)

In this algorithm, y(k) is the kth observation of the time-series data; â(k) is the recursive estimate at the kth recursion of the autoregressive parameter vector a, as defined for the AR model; and P(k) is a symmetric, n × n matrix which provides an estimate of the covariance matrix associated with the parameter estimates. A full derivation and description of this algorithm, the essence of which can be traced back to Gauss at the beginning of the nineteenth century, is given in Ref. 9. Here, it will suffice to note that eqn (1) generates an estimate â(k) of the AR parameter vector a at the kth instant by updating the estimate â(k-1) obtained at the previous (k-1)th instant in proportion to the prediction error e(k/k-1), where

e(k/k-1) = y(k) - z^T(k)\hat{a}(k-1)

is the error between the latest sample y(k) and its predicted value z^T(k)â(k-1), conditional on the estimate â(k-1) at the (k-1)th instant. This update is controlled by the vector g(k), which is itself a function of the covariance matrix P(k). As a result, the magnitude of the recursive update is seen to be a direct function of the confidence that the algorithm associates with the parameter estimates at the (k-1)th sampling instant: the greater the confidence, as indicated by a P(k-1) matrix with elements having low relative values, the smaller the attention paid to the prediction error, since this is more likely to be due to the random noise input e(k) and less likely to be due to estimation error on â(k-1).
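A minimal Python/NumPy sketch of eqns (1) and (2) is given below for a constant-parameter AR(2) series; the true coefficients, noise variance and initialisation (a 'diffuse' P matrix with large diagonal elements) are illustrative assumptions, not values taken from the chapter:

    import numpy as np

    rng = np.random.default_rng(0)
    n_obs, a_true, sigma2 = 2000, np.array([-1.5, 0.7]), 1.0

    # Simulate y(k) + a1*y(k-1) + a2*y(k-2) = e(k).
    y = np.zeros(n_obs)
    for k in range(2, n_obs):
        y[k] = -a_true @ y[k - 2:k][::-1] + rng.normal(0.0, np.sqrt(sigma2))

    a_hat = np.zeros(2)       # initial parameter estimate
    P = np.eye(2) * 1e6       # 'diffuse' initial covariance
    for k in range(2, n_obs):
        z = -y[k - 2:k][::-1]                        # z(k) = [-y(k-1), -y(k-2)]
        denom = sigma2 + z @ P @ z
        g = P @ z / denom                            # gain vector g(k)
        a_hat = a_hat + g * (y[k] - z @ a_hat)       # eqn (1)
        P = P - np.outer(P @ z, z @ P) / denom       # eqn (2)

    print(np.round(a_hat, 2))   # close to the true values [-1.5, 0.7]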

In the RLS algorithm shown above, there is an implicit assumption that the parameter vector a is time-invariant. In the recursive TVP version of the algorithm, on the other hand, this assumption is relaxed and the parameter may vary over the observation interval to reflect some changes in the statistical properties of the time-series y(k), as described by an assumed stochastic model of the parameter variations. In the present chapter, we make extensive use of recursive TVP estimation. In particular, we exploit the excellent spectral properties of certain recursive TVP estimation and smoothing algorithms to develop a practical and unified approach to adaptive time-series analysis, forecasting and seasonal adjustment, i.e. where the results of TVP estimation are used recursively to update the forecasts or seasonal adjustments to reflect any nonstationary or nonlinear characteristics in y(k).

The approach is based around the well-known 'structural' or


'component' time-series model and, like previous state-space solutions,22-24 it employs the standard Kalman filter-type12 recursive algorithms. (The term 'structural' has been used in other connections in both the statistical and economics literatures and so we will employ the latter term.) Except in the final forecasting and smoothing stages of the analysis, however, the justification for using these algorithms is not the traditional one based on 'optimality' in a prediction error or maximum likelihood (ML) sense. Rather, the algorithms outlined here are utilised in a manner which allows for straightforward and effective spectral decomposition of the time series into quasi-orthogonal components. A unifying element in this analysis is the modelling of nonstationary state variables and time-variable parameters by a stochastic model in the form of a class of second order random walk equations. As we shall see, this simple device not only facilitates the development of the spectral decomposition algorithms but it also injects an inherent adaptive capability which can be exploited in both forecasting and seasonal adjustment.

4 THE TIME-SERIES MODEL

Although the analytical procedures proposed in this paper can be applied to multivariable (vector) processes,25 we will restrict the discussion, for simplicity of exposition, to the following component model of a univariate (scalar) time-series y(k),

y(k) = t(k) + p(k) + e(k) (3)

where t(k) is a low frequency or trend component; p(k) is a perturbational component around the long period trend which may be either a zero mean, stochastic component with fairly general statistical properties, or a sustained periodic or seasonal component; and, finally, e(k) is a zero mean, serially uncorrelated, discrete white noise component, with variance σ².

Component models such as eqn (3) have been popular in the literature on econometrics and forecasting26,27 but it is only in the last few years that they have been utilised within the context of state-space estimation. Probably the first work of this kind was by Harrison and Stevens19,22 who exploited state-space methods by using a Bayesian interpretation applied to their 'Dynamic Linear Model', which is related to models we shall discuss here. More recent papers which exemplify this state-space approach and which are particularly pertinent to the present paper, are those of Jakeman & Young,28 Kitagawa & Gersch,29 and Harvey.24


In the state-space approach, each of the components t(k) and p(k) is modelled in a manner which allows the observed time series y(k) to be represented in terms of a set of discrete-time state equations. These state equations then form the basis for recursive state estimation, forecasting and smoothing based on the recursive filtering equations of Kalman.12 In order to exemplify this process, within the present context, we consider below a simple TVP, linear regression model for the sum of t(k) + p(k). It should be noted, however, that this model is a specific example of the more general component models discussed in Refs. 30-33.

4.1 The dynamic linear regression (DLR) model
In the general DLR model, it is assumed that y(k) in eqn (3) can be expressed in the form of a linear regression with time-variable coefficients c_i(k), i = 0, 1, 2, ..., n, i.e.

y(k) = c_0(k) + c_1(k)x_1(k) + c_2(k)x_2(k) + \cdots + c_n(k)x_n(k) + e(k) \qquad (4)

where c_0(k) = t(k) represents the time-variable trend component; c_i(k), i = 1, 2, ..., n, are time-variable parameters or regression coefficients, with the associated regression variables x_i(k) selected so that the sum of the terms c_i(k)x_i(k) provides a suitable representation of the perturbational component p(k). For example, if x_i(k) = y(k - i), then the model for p(k) is a dynamic autoregression (DAR) in y(k), i.e. an autoregression model with time-variable parameters.

Clearly, an important and novel aspect of this model is the presence of the time-variable parameters. In order to progress further with the identification and estimation of the model, therefore, it is necessary to make certain assumptions about the nature of the time-variation. Here, we assume that each of the n + 1 coefficients c_i(k) can be modelled by the following stochastic, second order, generalised random walk (GRW) process,

x_i(k) = F_i x_i(k-1) + G_i \eta_i(k-1) \qquad (5)

where

x_i(k) = \begin{bmatrix} c_i(k) \\ d_i(k) \end{bmatrix}; \qquad F_i = \begin{bmatrix} 1 & \beta \\ 0 & \gamma \end{bmatrix}; \qquad \eta_i(k) = \begin{bmatrix} \eta_{i1}(k) \\ \eta_{i2}(k) \end{bmatrix}

and G_i is a 2 × 2 input matrix (the identity in the random walk cases considered below). Here, d_i(k) is a second state variable, the function of which is discussed


below; while η_i1(k) and η_i2(k) represent zero mean, serially uncorrelated, discrete white noise inputs, with the vector η_i(k) normally characterised by a covariance matrix Q_i, i.e.

E\{\eta_i(k)\eta_i^T(j)\} = Q_i \delta_{kj}; \qquad \delta_{kj} = \begin{cases} 1 & \text{for } k = j \\ 0 & \text{for } k \ne j \end{cases}

where, unless there is evidence to the contrary, Q_i is assumed to be diagonal in form with unknown diagonal elements q_i11 and q_i22, respectively.

This GRW model subsumes, as special cases,9 the very well-known random walk (RW: β = γ = 0; η_i2(k) = 0); and the integrated random walk (IRW: β = γ = 1; η_i1(k) = 0). In the case of the IRW, we see that c_i(k) and d_i(k) can be interpreted as level and slope variables associated with the variations of the ith parameter, with the random disturbance entering only through the d_i(k) equation. If η_i1(k) is non-zero, however, then both the level and slope equations can have random fluctuations defined by η_i1(k) and η_i2(k), respectively. This variant has been termed the 'Linear Growth Model' by Harrison & Stevens.19,22

The advantage of these random walk models is that they allow, in a very simple manner, for the introduction of nonstationarity into the regression model. By introducing a parameter variation model of this type, we are assuming that the time-series can be characterised by a stochastically variable mean value, arising from c_0(k) = t(k), and a perturbational component with potentially very rich stochastic properties deriving from the TVP regression terms. The nature of this variability will depend upon the specific form of the GRW chosen: for instance, the IRW model is particularly useful for describing large smooth changes in the parameters; while the RW model provides for smaller scale, less smooth variations.34 As we shall see below, these same models can also be used to handle large, abrupt changes or discontinuities in the level and slope of either the trend or the regression model coefficients.

The state space representation of this dynamic regression model is obtained simply by combining the GRW models for the n + 1 parameters into the following composite state space form,

x(k) = F x(k-1) + G \eta(k-1) \qquad (6a)

y(k) = H x(k) + e(k) \qquad (6b)

where the composite state vector x(k) is composed of the c_i(k) and d_i(k) parameters, i.e.

x(k) = [c_0(k) \; d_0(k) \; c_1(k) \; d_1(k) \; \ldots \; c_n(k) \; d_n(k)]^T


the stochastic input vector r,(k) is composed of the disturbance terms to the GRW models for each of the time-variable regression coefficients, i.e.

r,(k)T = [r,OI (k) r,02(k) r,11 (k) r,12(k) ... r,nl (k) r,n2(kW

while the state transition matrix F, the input matrix G and the observation matrix H are defined as,

F = block diag{F_0, F_1, ..., F_n},    G = block diag{G_0, G_1, ..., G_n}

i.e. F and G are block diagonal matrices, with the F_i and G_i matrices of the constituent GRW models (5) on their diagonals, and

H = [1  0  x_1(k)  0  x_2(k)  0  ...  x_n(k)  0]

In other words, the observation eqn (6b) represents the regression model (4); with the state equations in (6a) describing the dynamic behaviour of the regression coefficients; and the disturbance vector η(k) in eqn (6a) defined by the disturbance inputs to the constituent GRW sub-models.
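Assembling the composite matrices of eqn (6) is purely mechanical. The sketch below (Python with numpy and scipy; the function and argument names are our own) builds F, G and the time-dependent observation row H(k) for n regression variables whose coefficients each follow the GRW of eqn (5).

```python
import numpy as np
from scipy.linalg import block_diag

def build_state_space(x_regressors, alpha=1.0, beta=1.0, gamma=1.0,
                      delta=0.0, eps=1.0):
    """F, G and H of eqn (6) for a trend c0(k) plus n TVP regression
    coefficients ci(k); x_regressors has shape (N, n), holding xi(k)."""
    N, n = x_regressors.shape
    Fi = np.array([[alpha, beta], [0.0, gamma]])   # one GRW block, eqn (5)
    Gi = np.diag([delta, eps])
    F = block_diag(*[Fi] * (n + 1))                # state order: c0 d0 c1 d1 ...
    G = block_diag(*[Gi] * (n + 1))
    H = np.zeros((N, 2 * (n + 1)))                 # one observation row per sample
    H[:, 0] = 1.0                                  # the trend level c0(k)
    H[:, 2::2] = x_regressors                      # levels ci(k); slopes unobserved
    return F, G, H
```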

We have indicated that one obvious choice for the definition of the regression variables x_i(k) is to set them equal to the past values of y(k), so allowing the perturbations to be described by a TVP version of the AR(n) model. This is clearly a sensible choice, since we have seen in Section 2 that the AR model provides a most useful description of a stationary stochastic process, and we might reasonably assume that, in its TVP form, it provides a good basis for describing nonstationary time-series. If the perturbational component is strongly periodic, however, the spectral analysis in Section 2 suggests an alternative representation in the form of the dynamic harmonic regression (DHR) model.35 Here t(k) is defined as in the DAR model but p(k) is now defined as the linear sum of sine and cosine variables at F different frequencies, suitably chosen to reflect the nature of the seasonal variations, i.e.

p(k) = Σ_{i=1}^{F} [a_1i(k)cos(2πf_i k) + a_2i(k)sin(2πf_i k)]    (7)

where the regression coefficients a_ji(k), j = 1, 2 and i = 1, 2, ..., F, are assumed to be time-variable, so that the model is able to handle any nonstationarity in the seasonal phenomena. The DHR model is then in the form of the regression eqn (4), with c_0(k) = t(k), as before, and appropriate definitions for the remaining c_i(k) coefficients, in terms of the a_ji(k). The integer n, in this case, has to be set equal to 2F, so that the regression variables x_i(k), i = 1, 2, ..., 2F, can be defined in terms of the cosine and sine components, cos(2πf_i k) and sin(2πf_i k), i = 1, 2, ..., F, respectively, i.e.

H = [1  0  cos(2πf_1 k)  0  sin(2πf_1 k)  0  ...  cos(2πf_F k)  0  sin(2πf_F k)  0]

Finally, it should be noted that, since there are two parameters associated with each frequency component, the amplitude A_i(k) of each component, as defined by

A_i(k) = √[a_1i(k)² + a_2i(k)²]

provides a useful indication of the estimated amplitude modulation on each of the frequency components. In many practical situations, the overall periodic behaviour can be considered in terms of a primary frequency and its principal harmonics, e.g. a 12-month annual cycle and its harmonics at periods of 6, 4, 3, 2·4 and 2 months, respectively. As a result, A_i(k), i = 1, 2, ..., F, will represent the amplitude of these components and will provide information on how the individual harmonics vary over the observation interval.
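In the DHR case the regression variables are known sinusoids, so the regressor set of eqn (7) and the amplitude measure A_i(k) can be formed directly; a minimal sketch (again with our own illustrative names) follows.

```python
import numpy as np

def dhr_regressors(N, freqs):
    """Columns cos(2*pi*fi*k), sin(2*pi*fi*k) for each fi (cycles/sample),
    k = 0..N-1, as used in eqn (7) and in the H row of the DHR model."""
    k = np.arange(N)[:, None]
    f = np.asarray(freqs)[None, :]
    X = np.empty((N, 2 * len(freqs)))
    X[:, 0::2] = np.cos(2.0 * np.pi * f * k)
    X[:, 1::2] = np.sin(2.0 * np.pi * f * k)
    return X

def amplitude(a1, a2):
    """A_i(k) = sqrt(a1i(k)^2 + a2i(k)^2): the estimated amplitude
    modulation of one frequency component from its two TVP coefficients."""
    return np.hypot(a1, a2)

# e.g. a 12-month annual cycle and its 6-month harmonic in monthly data:
X = dhr_regressors(160, freqs=[1.0/12.0, 1.0/6.0])
```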

5 THE RECURSIVE FORECASTING AND SMOOTHING ALGORITHMS

In this chapter, recursive forecasting and smoothing is achieved using algorithms based on optimal state-space (Kalman) filtering and fixed-interval smoothing equations. The Kalman filtering algorithm12 is, of course, well known and can be written most conveniently in the following general 'prediction-correction' form,9

Prediction:

x̂(k/k − 1) = Fx̂(k − 1)

P(k/k − 1) = FP(k − 1)Fᵀ + GQ_rGᵀ    (8)

Correction:

x̂(k) = x̂(k/k − 1) + P(k/k − 1)Hᵀ[1 + HP(k/k − 1)Hᵀ]⁻¹{y(k) − Hx̂(k/k − 1)}

P(k) = P(k/k − 1) − P(k/k − 1)Hᵀ[1 + HP(k/k − 1)Hᵀ]⁻¹HP(k/k − 1)    (9)


In these equations, we use x̂(k) to denote the estimate of the composite state vector x(k) of the complete state-space model (eqn (6)). The reader should note the similarity between the correction eqns (9) and the RLS algorithm discussed in Section 3. This is no coincidence: the RLS algorithm is a special case of the Kalman filter in which the unknown 'states' are considered as time-invariant parameters.9

Since the models used here are all characterised by a scalar observation equation, the filtering algorithm (8) and (9) has been manipulated into the well-known form (see, e.g. Ref. 9) where the observation noise variance is normalised to unity and Q_r represents a 'noise variance ratio' (NVR) matrix. This matrix and the P(k) matrix are then both defined in relation to the white measurement noise variance σ², i.e.

P(k) = P*(k)/σ²    (10)

where P*(k) is the error covariance matrix associated with the state estimates. In the RW and IRW models, moreover, there is only a single white noise disturbance input term to the state equations for each unknown parameter, so that only a single, scalar NVR value has to be specified by the analyst in these cases.
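The prediction-correction equations translate almost line for line into code. The sketch below is a minimal numpy transcription of eqns (8)-(10), under our own naming and not the microCAPTAIN implementation: it works in the NVR-normalised form, so only the Q_r matrix need be specified, and it accommodates missing observations, which is exactly the device used for forward interpolation below.

```python
import numpy as np

def kalman_step(x_hat, P, y_k, h_k, F, G, Qr):
    """One prediction-correction cycle of eqns (8) and (9). P is the
    normalised covariance P(k) = P*(k)/sigma^2 of eqn (10); h_k is the
    observation row H at sample k; y_k may be np.nan (missing), in which
    case the correction of eqn (9) is omitted."""
    # Prediction, eqn (8)
    x_pred = F @ x_hat
    P_pred = F @ P @ F.T + G @ Qr @ G.T
    if np.isnan(y_k):
        return x_pred, P_pred, np.nan
    # Correction, eqn (9); h_k P h_k^T is a scalar for a scalar observation
    denom = 1.0 + h_k @ P_pred @ h_k
    gain = P_pred @ h_k / denom
    innovation = y_k - h_k @ x_pred
    return x_pred + gain * innovation, P_pred - np.outer(gain, h_k @ P_pred), innovation
```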

In fixed interval smoothing, an estimate x̂(k/N) of x(k) is obtained which provides an estimate of the state at the kth sampling instant based on all N samples over the observation interval (in contrast to the filtered estimate x̂(k) = x̂(k/k), which is based only on the data up to the kth sample). There are a variety of algorithms for off-line, fixed interval smoothing but the one we will consider here utilises the following backwards recursive algorithm, subsequent to application of the above Kalman filtering forwards recursion,9,34

x̂(k/N) = F⁻¹[x̂(k + 1/N) + GQ_rGᵀL(k)]    (11a)

where,

L(N) = 0

N is the total number of observations (the 'fixed interval'); and

L(k) = [I − P(k + 1)HᵀH]ᵀ{FᵀL(k + 1) − Hᵀ[y(k + 1) − HFx̂(k)]}    (11b)

is an associated backwards recursion for the 'Lagrange Multiplier' vector L(k) required in the solution of this two-point boundary value problem. An alternative update equation to eqn (11a) is the following,

x̂(k/N) = x̂(k/k) − P(k)FᵀL(k)    (11c)


where we see that x̂(k/N) is obtained by reference to the estimate x̂(k/k) generated during the forward recursive (filtering) algorithm. Finally, the covariance matrix P*(k/N) = σ²P(k/N) for the smoothed estimate is obtained by reference to P(k/N) generated by the following matrix recursion,

P(k/N) = P(k) + P(k)Fᵀ[P(k + 1/k)]⁻¹{P(k + 1/N) − P(k + 1/k)}[P(k + 1/k)]⁻¹FP(k)    (12)

while the smoothed estimate of the original series y(k) is given simply by,

ŷ(k/N) = Hx̂(k/N)    (13)

i.e. the appropriate linear combination of the smoothed state variables.

Provided we have estimates for the unknown parameters in the state space model (6), then the procedures for smoothing and forecasting of y(k) now follow by the straightforward application of these state-space filtering/smoothing algorithms. This allows directly for the following operations:

(1) Forecasting. The f-step ahead forecasts of the composite state vector x(k) in eqn (6) are obtained at any point in the time-series by repeated application of the prediction eqns (8) which, for the complete model, yields the equation,

x̂(k + f/k) = Fᶠx̂(k)    (14)

where f denotes the forecasting period. The associated forecast of y(k) is provided by,

ŷ(k + f/k) = Hx̂(k + f/k)    (15)

with the variance of this forecast computed from

var{ỹ(k + f/k)} = σ²[1 + HP(k + f/k)Hᵀ]    (16)

where ỹ(k + f/k) is the f-step ahead prediction error, i.e.

ỹ(k + f/k) = y(k + f) − ŷ(k + f/k)

In relation to more conventional alternatives to forecasting, such as those of Box & Jenkins,2 the present state-space approach, with its inherent structural decomposition, has the advantage that the estimates and forecasts of individual component state variables can also be obtained simply as by-products of the analysis. For example, it is easy to recover the estimate and forecast of the trend component, which provides a measure of the underlying 'local' trend at the forecasting origin.


(2) Forward interpolation. Within this discrete-time setting, the process of forward interpolation, in the sense of estimating the series y(k) over a section of missing data, based on the data up to that point, follows straightforwardly; the missing data points are accommodated in the usual manner by replacing the observation y(k) by its conditional expectation (or predicted value) ŷ(k/k − 1) and omitting the correction eqns (9).

(3) Smoothing. Finally, the smoothed estimate ŷ(k/N) of y(k) for all values of k is obtained directly from eqn (13); and associated smoothed estimates of all the component states are available from eqn (11). Smoothing can, of course, provide a superior interpolation over gaps in the data, in which the interpolated points are now based on all of the N samples.
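All three operations can be sketched by combining the kalman_step function above with a backward pass implementing eqns (11b) and (11c). The fragment below is again illustrative rather than definitive (it reuses the filtered covariance at missing-data points and initialises the filter with a diffuse prior); it returns the smoothed states and the f-step ahead state forecast of eqn (14).

```python
import numpy as np

def smooth_and_forecast(y, H, F, G, Qr, f_steps=12):
    """Forward KF pass, backward FIS pass (eqns (11b), (11c)), and an
    f-step ahead forecast from the final sample (eqn (14))."""
    N, nx = H.shape
    x_filt = np.zeros((N, nx)); P_filt = np.zeros((N, nx, nx))
    x_hat, P = np.zeros(nx), 1e6 * np.eye(nx)    # diffuse prior
    for k in range(N):                           # forward filtering pass
        x_hat, P, _ = kalman_step(x_hat, P, y[k], H[k], F, G, Qr)
        x_filt[k], P_filt[k] = x_hat, P
    x_smooth = np.zeros_like(x_filt)
    x_smooth[-1] = x_filt[-1]                    # L(N) = 0 at the final sample
    L = np.zeros(nx)
    for k in range(N - 2, -1, -1):               # backward smoothing, eqn (11b)
        if np.isnan(y[k + 1]):                   # gap: no correction term
            L = F.T @ L
        else:
            h, Pk1 = H[k + 1], P_filt[k + 1]
            resid = y[k + 1] - h @ (F @ x_filt[k])
            L = (np.eye(nx) - np.outer(Pk1 @ h, h)).T @ (F.T @ L - h * resid)
        x_smooth[k] = x_filt[k] - P_filt[k] @ F.T @ L   # eqn (11c)
    x_fcast = np.linalg.matrix_power(F, f_steps) @ x_filt[-1]  # eqn (14)
    return x_smooth, x_fcast
```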

6 IDENTIFICATION AND ESTIMATION OF THE TIME-SERIES MODEL

The problems of structure identification and subsequent parameter estimation for the complete state space model (6) are clearly non-trivial. From a theoretical standpoint, the most obvious approach is to formulate the problem in Maximum Likelihood (ML) terms. If the stochastic disturbances in the state-space model are normally distributed, the likelihood function for the observations may then be obtained from the Kalman Filter via 'prediction error decomposition'.36 For a suitably identified model, therefore, it is possible, in theory, to maximise the likelihood with respect to any or all of the unknown parameters in the state-space model, using some form of numerical optimisation.

This kind of maximum likelihood approach has been tried by a number of research workers but their results (e.g. Ref. 37, which provides a good review of competing methods of optimisation) suggest that it can be quite complex, even in the case of particularly simple models such as those proposed here. In addition, it is not easy to solve the ML problem in practically useful and completely recursive terms, i.e. with the parameters being estimated recursively, as well as the states.

The alternative spectral decomposition (SD) approach used in this chapter has been described fully in Refs 30-33 and 38; the reader is referred to these papers for detailed information on the method. It is based on the application of the state-space, fixed interval smoothing algorithms discussed in Section 5, as applied to the component models such as eqn (3). In particular, it exploits the excellent spectral properties of the smoothing algorithms derived in this manner to decompose the time-series into quasi-orthogonal estimates of the trend and perturbational components. This considerably simplifies the solution, albeit at the cost of strict optimality in a ML sense, and yields an analytical procedure that is completely recursive in form and robust in practical application. It also allows for greater user-interaction than other alternatives. As we shall see, this procedure is well suited for adaptive implementations of both on-line forecasting and off-line smoothing.

In relation to more conventional approaches to data analysis, we can consider that SD replaces better known, and more ad hoc, filtering techniques,* by a unified, more sophisticated and flexible approach based on the recursive smoothing algorithms.

*Such as the use of centralised moving averaging or smoothing spline functions for trend estimation; and constant parameter harmonic regression (HR) or equivalent techniques for modelling the seasonal cycle; see Ref. 39 which uses such procedures with the HR based on the first two harmonic components associated with the annual CO2 cycle.

6.1 The IRWSMOOTH algorithm
A full discussion of the spectral properties of the fixed interval smoothing algorithms discussed in this chapter is given in the various references cited above. In order to exemplify the approach, however, it is interesting to consider how the smoothing algorithms can be used in a sub-optimal fashion to obtain a smoothed estimate t̂(k/N) of the trend component, t(k). This is accomplished quite simply by applying the forward-pass filtering and backward-pass smoothing equations directly to the model (3) with the p(k) term removed. In this manner, y(k) is modelled simply as the sum of a trend, represented by a GRW process, and the white noise disturbance e(k). This IRWSMOOTH algorithm will normally be sub-optimal in a minimum variance sense, since it seems unlikely that y(k) could, in general, be described by such a simplistic model: for most real environmental data, the perturbations about the long term trend, i.e. δ(k) = y(k) − t(k), are likely to be highly coloured or periodic in form and white noise perturbations seem most unlikely. Nevertheless, the algorithm is a most useful one in practical terms and, while sub-optimal in the context of the minimum variance Kalman filter, it can be justified in terms of its more general low-pass filtering properties.
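In these terms, IRWSMOOTH is simply the filtering/smoothing pair of Section 5 applied to a two-state IRW trend model, with the scalar NVR as its only tuning parameter; a minimal sketch, reusing the smooth_and_forecast routine sketched in Section 5:

```python
import numpy as np

def irwsmooth(y, nvr):
    """Smoothed trend estimate t(k/N): IRW trend plus white noise, i.e.
    model (3) with the p(k) term removed; nvr is the scalar NVR."""
    N = len(y)
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # IRW transition: level + slope
    G = np.array([[0.0], [1.0]])             # disturbance enters the slope only
    Qr = np.array([[nvr]])                   # scalar NVR matrix
    H = np.tile([1.0, 0.0], (N, 1))          # only the level is observed
    x_smooth, _ = smooth_and_forecast(y, H, F, G, Qr, f_steps=0)
    return x_smooth[:, 0]                    # column 0: trend; column 1: slope
```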

Fig. 2. Spectral properties of the recursive smoothing algorithms: (a) amplitude spectrum of the recursive IRWSMOOTH algorithm; (b) amplitude spectrum of the recursive DHRSMOOTH algorithm.

These filtering properties are illustrated in Fig. 2. The amplitude spectrum for the IRWSMOOTH algorithm is shown in Fig. 2(a), which demonstrates that the behaviour of the algorithm is controlled by the selected NVR value; this is simply a scalar in this case because the IRW model has only a single white noise input disturbance. It is clear that this scalar NVR, which is the only parameter to be specified by the analyst, defines the 'bandwidth' of the smoothing filter. The phase characteristics are not shown, since the algorithm is of the 'two-pass' smoothing type and so exhibits zero phase lag for all frequencies. We see from Fig. 2(a) that the IRWSMOOTH algorithm is a very effective 'low-pass' filter, with particularly sharp 'cut-off' properties for low values of the NVR. The relationship between log10(F50), where F50 is the 50% cut-off frequency, and log10(NVR) is approximately linear over the useful range of NVR values, so that the NVR which provides a specified cut-off frequency can be obtained from the following approximate relationship,35

NVR = 1605[F50]⁴    (17)

In this manner, the NVR which provides specified low-pass filtering characteristics can be defined quite easily by the analyst. For an intermediate value of NVR = 0·0001 (cut-off frequency = 0·016 cycles/sample), for example, the estimated trend reflects the low frequency movements in the data while attenuating higher frequency components; for NVR = 0 the bandwidth is also zero and the algorithm yields a linear trend with constant slope; and for large values greater than 10, the estimated 'trend' almost follows the data and the associated derivative estimate d(k) provides a good smoothed numerical differentiation of the data. The band-pass nature of the DHR recursive smoothing algorithm (DHRSMOOTH) is clear from Fig. 2(b) and a similarly simple relationship once again exists between the bandwidth and the NVR value. These convenient bandwidth-NVR relationships for IRWSMOOTH and DHRSMOOTH are useful in the proposed procedure for spectral decomposition discussed below.
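Equation (17), and its inverse, reduce the choice of NVR to a one-line computation; the sketch below reproduces the correspondence used later in Section 7.1 between NVR = 0·000005 and a 50% cut-off frequency of about 0·0075 cycles/sample.

```python
def nvr_from_cutoff(f50):
    """NVR giving a specified 50% cut-off frequency F50 (eqn (17))."""
    return 1605.0 * f50 ** 4

def cutoff_from_nvr(nvr):
    """Inverse of eqn (17): the 50% cut-off frequency for a given NVR."""
    return (nvr / 1605.0) ** 0.25

print(cutoff_from_nvr(5e-6))   # approx. 0.00747 cycles/sample
```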

Clearly, smoothing algorithms based on other simple random walk and TVP models can be developed: for instance, the double integrated random walk (DIRW, see Refs 9 and 40) smoothing algorithm has even sharper cut-off characteristics than the IRW, but its filtering characteristics exhibit much higher levels of distortion at the ends of the data set.40

6.2 Parametric nonstationarity and variance intervention
As has been pointed out, the response characteristics of environmental systems are clearly subject to change with the passage of time. As a result, the time-series obtained by monitoring such systems will be affected by these changes and any models that are based on the analysis of these time-series will need to reflect the nature of this nonstationarity, for example by changes in the model parameters. Similarly, many environmental systems exhibit nonlinear dynamic behaviour and linear, small perturbation models of such systems will need to account for the nonlinear dynamics by allowing for time-variable parameters. However, the nature of such parametric variation is difficult to predict. While changes in environmental systems are often relatively slow and smooth, more rapid and violent changes do occur from time-to-time and lead to similarly rapid changes, or even discontinuities, in the related time series. One approach to this kind of problem is 'intervention analysis',41 but an alternative, recursively based approach is possible which exploits the special nature of the GRW model.

If the variances q_i11 and q_i22 are assumed constant, then the GRW model, in its RW or IRW form, can describe a relatively wide range of smooth variation in the associated trend or regression model parameters. Moreover, if we allow these variances to change over time, then an even wider range of behaviour can be accommodated. In particular, large, but otherwise arbitrary, instantaneous changes in q_i11 and q_i22 (e.g. increases to values ≥ 10²) introduced at selected 'intervention' points, can signal to the associated estimation algorithm the possibility of significant changes in the level or slope, respectively, of the modelled variable (i.e. the trend or parameter estimates in the present context) at these same points. The sample number associated with such intervention points can be identified either objectively, using statistical detection methods;42 or more subjectively by the analyst.43
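In recursive terms, variance intervention amounts to nothing more than replacing the constant Q_r by a time-indexed sequence that is inflated at the nominated samples; a hedged sketch (our own helper, written for a single two-state GRW block, with an arbitrary boost value) of how such a sequence might be prepared for the filtering loop:

```python
import numpy as np

def qr_with_interventions(Qr, n_samples, level_points=(), slope_points=(),
                          boost=1e2):
    """Sequence Qr(k): the constant NVR matrix Qr for a two-state GRW,
    except at the given intervention samples, where the level (q11) or
    slope (q22) variance is inflated to admit an abrupt change."""
    Qr_seq = np.tile(Qr, (n_samples, 1, 1))
    for k in level_points:
        Qr_seq[k, 0, 0] = boost   # possible jump in level at sample k
    for k in slope_points:
        Qr_seq[k, 1, 1] = boost   # possible jump in slope at sample k
    return Qr_seq
# In the filter, Qr_seq[k] then replaces Qr in the prediction of eqn (8).
```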

It is interesting to note that this same device, which is termed 'variance intervention',43 can be applied to any state-space or TVP model: Young,7,8,14-16 for example, has used a similar approach to track the significant and rapid changes in the level of the model parameters of an aerospace vehicle during a rocket boost phase.

6.3 The DHR model and seasonal adjustment
Recursive identification and estimation of the DHR model is straightforward. In the case where the regression parameters are assumed constant, the normal recursive least squares (RLS) algorithm can be used. When the parameters are assumed time-variable, then it is simply necessary to represent the variations by the GRW model, with or without variance intervention as appropriate, and use the recursive least squares filtering and fixed interval smoothing algorithms outlined in Section 5. Recursive state-space forecasting then follows straightforwardly by application of the estimation and forecasting eqns (7), (8) and (14)-(16).

If off-line analysis of the nonstationary time-series is required, then the recursive fixed interval smoothing eqns (11)-(13) can be used to provide smoothed estimates of the model components and any associated time-variable parameters. The main effect of allowing the parameters and, therefore, the amplitude and phase of the identified seasonal components to vary over time in this manner, is to include in the estimated seasonal component, components of other frequencies with periods close to the principal period. As pointed out above, the chosen NVR then controls the band of frequencies that are taken into account by the DHRSMOOTH algorithm. If it is felt that the amplitude variations in the seasonal component are related to some known longer period fluctuations, then such prior knowledge can be used to influence the choice of the NVR.35

If the DHR model is identified and estimated for all the major periodic components characterising the data (i.e. those components that are associated with the peaks in the periodogram or AR spectrum) then the DHRSMOOTH algorithm can be used to construct and remove these nonstationary 'seasonal' components in order to yield a 'seasonally adjusted' data set. Such an 'adaptive' seasonal adjustment procedure is, of course, most important in the evaluation of business and economic data, where existing SA methods, such as the Census X-11 procedure44 which uses a procedure based on centralised moving average (CMA) filters, are well established. The proposed DHR-based approach offers various advantages over techniques such as X-11: by virtue of variance intervention, it can handle large and sudden changes in the dynamic characteristics of the series, including amplitude and phase changes; it is not limited to the normally specified seasonal periods (i.e. annual periods of 12 months or 4 quarters); and it is more objective and simpler to apply in practice.31,32,35,40 In this sense, it provides an excellent vehicle for seasonal adjustment of environmental data.

7 PRACTICAL EXAMPLES

In this section, we consider in detail the analysis of two of the three environmental time-series discussed briefly in the Introduction to this chapter; namely the atmospheric CO2 and Raup-Sepkoski Extinctions data. Similar analyses have been conducted on the Sunspot data35 and these reveal various interesting aspects of the series, including a possible link between the long term modulation of the cyclic component and the central England temperature record for the period 1659-1973, compiled by Manley.45


7.1 Analysis of the Mauna Loa CO2 data
Following the establishment of the World Climate Programme by the 8th Meteorological Congress, the World Meteorological Organisation (WMO) Project on Research and Monitoring of Atmospheric Carbon Dioxide was incorporated into the World Climate Research Programme (WCRP) and, in 1979, the Executive Committee recognised the potential threat to future climate posed by increasing amounts of atmospheric CO2.

In the ensuing decade, the importance of atmospheric CO2 levels has been stressed, not only in scientific journals, but also in the more general media. Discussion of the potential perils of the 'Greenhouse Effect' has become ever more prevalent during this period as the public have perceived the possibility of a link between factors such as atmospheric CO2 and extreme climatic events throughout the world.

One important aspect of the various scientific programmes on atmospheric CO2, currently being carried out in many countries, is the analysis and interpretation of monitored CO2 data from different sites in the world. Indeed, one of the first events organised under the auspices of the WCRP was the WMO/ICSU/UNEP Scientific Conference on this subject held at Bern in September 1981; and the interested reader is directed to Ref. 46 for an introduction to the topic. References 39, 47 and 48 are of particular relevance to the analysis reported here. The CO2 data shown in Fig. 1(a) are monthly figures based on an essentially continuous record made at Mauna Loa in Hawaii between 1974 (May) and 1987 (Sept.). Here these data are used to exemplify the kind of analysis made possible by the recursive procedures described in previous sections: no attempt is made to interpret the results, which are part of a joint research programme between Lancaster and the UK Meteorological Office.

The more traditional time-series analysis of the CO2 data, such as that reported in the above references, uses various approaches, both to the estimation and removal of the low frequency trend in the data and the analysis of the seasonality in the resulting detrended series. These include the use of smoothing spline functions for trend estimation and constant parameter harmonic regression (HR) for modelling the seasonal cycle.48

Here, we first consider the amplitude periodogram of the CO2 series shown in Fig. 3. The scale is chosen to expose some of the lower magnitude peaks in the spectrum: we see that the 12- and 6-monthly period harmonics dominate the periodogram, with only a very small peak at 4 months and, apparently, no detectable harmonics at 3, 2·4 and 2 months. A small peak worth noting for future reference, however, is the one marked with an asterisk at about 40 months; this mode of behaviour will be commented on later in this Section.

Fig. 3. Amplitude periodogram of the Mauna Loa CO2 series.
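An amplitude periodogram of the kind shown in Fig. 3 can be computed directly from the discrete Fourier transform of the mean-removed series. The minimal numpy sketch below uses our own normalisation (a pure sinusoid of amplitude A yields a peak of height A) and is not the microCAPTAIN routine.

```python
import numpy as np

def amplitude_periodogram(y):
    """Amplitude (not variance) periodogram of a real series."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                 # remove the mean (zero-frequency) term
    N = len(y)
    amp = 2.0 * np.abs(np.fft.rfft(y)) / N
    freqs = np.fft.rfftfreq(N)       # 0 to 0.5 cycles/sample
    return freqs, amp
```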

The low frequency trend, so obvious in the CO2 data, is estimated and removed by the IRWSMOOTH algorithm with NVR = 0·000005. The results are given in Fig. 4, which shows: (a) the estimated trend superimposed on the data; (b) the estimated slope or derivative of the trend; and (c) the detrended data. The value of the NVR is selected interactively so that the trend derivative contains only a very slight leakage of the higher frequency components associated with the annual, 12-monthly cycle. This NVR corresponds to a 50% cut-off frequency F50 = 0·00747 cycles/sample, which induces 50% attenuation of periods less than about 12 years; and 95% attenuation of components with periods less than about 5 years.

Fig. 4. Initial analysis of the Mauna Loa CO2 series: (a) the IRWSMOOTH estimated trend superimposed on the data; (b) the IRWSMOOTH estimated slope or derivative of the trend; (c) the detrended CO2 series.

It is worth commenting upon the estimated slope or derivative of the trend in Fig. 4(b), which reveals evidence of what appears to be very long term oscillatory characteristics. Little can be concluded about this mode of behaviour, which appears to have a periodicity, if repeated, of some 10-11 years (a possible link with the sun-spot cycle?), since the observational period is too short. However, the results help to demonstrate the potential usefulness of the IRWSMOOTH algorithm in allowing the analyst to expose such characteristics in the data. In the analysis of numerous sets of socio-economic data, for instance, the authors have found that this estimated slope series provides strong evidence for the existence of underlying quasi-periodic 'economic cycles'.

Let us now continue the analysis of the CO2 data by considering AR modelling of the detrended data shown in Fig. 4(c). The AIC indicates that an AR(15) model is most appropriate: this yields a coefficient of determination R_T² = 0·9818 (i.e. 98·18% of the data explained by the model) and AIC = -2·274. However, further examination indicates that a subset AR(15) model with the 3rd, 5th and 7th parameters constrained to zero provides a superior AIC = -2·304, with R_T² only marginally less than the full AR(15) at 0·9816. Moreover, a subset AR(15) with all intermediate parameters from 3rd to 9th constrained to zero provides a quite reasonable model with R_T² = 0·980 and AIC = -2·25. Here, the insignificant parameters were identified by the usual significance tests and reference to the convergence of the recursive estimates.9 The subset AR(15) spectrum, which reveals clearly the nature of the periodicity with the dominant 12 and 6 month components already noted in the periodogram of Fig. 3, also provides some evidence of higher order harmonics (see below).
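This kind of order and subset selection can be reproduced, at least approximately, with standard least squares AR fitting. The sketch below uses the AutoReg class from the statsmodels package, which accepts a list of lags for subset models; its AIC convention differs from that used in microCAPTAIN, so the absolute values will not match the figures quoted above.

```python
from statsmodels.tsa.ar_model import AutoReg

def fit_subset_ar(y, lags):
    """Least squares fit of a (possibly subset) AR model; the returned
    results object carries an .aic attribute for comparing models."""
    return AutoReg(y, lags=lags, trend="n").fit()

# Compare the full AR(15) with a subset AR(15) omitting lags 3, 5 and 7
# (i.e. those coefficients constrained to zero), for a detrended series y:
# full   = fit_subset_ar(y, lags=15)
# subset = fit_subset_ar(y, lags=[1, 2, 4, 6, 8, 9, 10, 11, 12, 13, 14, 15])
```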

The forecasting and smoothing results given in Fig. 5(a) and (b) are generated by an adaptive DHR model incorporating the fundamental component and the first harmonic (i.e. 12 and 6 months period, respectively), plus an IRW model for the trend, with an NVR = 0·000005. (The full details of this DHR analysis (forecasting, smoothing and seasonal adjustment) are given in Ref. 38.) The DHR model coefficients are modelled here by an RW with NVR = 0·001. One step ahead forecasts are given up to a forecasting origin (FO) of 100 samples and, thereafter, multistep ahead forecasts are generated for the remaining 60 samples (five years). We see that the multistep forecasts are very good, with small standard errors over the whole 5-year period, even though the algorithm is not referring at all to the data over this period. Note that a small negative forecasting error develops over this period, indicating that the atmospheric CO2 has risen slightly less than might be expected from the prior historical record. Figure 5(b) shows how, subsequent to the forecasting analysis, it is a simple matter to continue on to the backward smoothing pass: note the improvement in the residuals in this case compared with the forward, filtering pass. Smoothed estimates of the A_i(k) amplitudes of the seasonal components are also available from the analysis, if required.

Fig. 5. Time variable parameter (TVP) forecasting, smoothing and seasonal adjustment of the CO2 series based on dynamic harmonic regression (DHR) analysis: (a) forecasting results based on IRW trend and two component DHR model (one step ahead up to forecasting origin (FO) at 100 samples; then multiple step ahead from FO); (b) smoothed estimates obtained by backwards smoothing recursions from the 100 sample FO in (a); (c) seasonally adjusted series compared with the original CO2 series; (d) the residual or 'anomaly' series; (e) amplitude periodogram of the anomaly series.

The seasonally adjusted CO2 series, as obtained using the proposed method of DHR analysis, is compared with the original CO2 measurements in Fig. 5(c), while Figs 5(d) and (e) show, respectively, the residual or 'anomaly' series (i.e. the series obtained by subtracting the trend and total seasonal components from the original data), together with its associated amplitude periodogram. Here, the trend was modelled as an IRW process with NVR = 0·0001 and the DHR parameters (associated, in this case, with all the possible principal harmonics at 12, 6, 4, 3, 2·4 and 2 month periods) were also modelled as IRW processes with NVR = 0·001. These are the 'default' values used for the 'automatic' seasonal analysis option in the microCAPTAIN program; they are used here to show that the analysis is not too sensitive to the selection of the models and their associated NVR values.

The residual component in Fig. 5(d) can be considered as an 'anomaly' series in the sense that it reveals the movements of the seasonally adjusted series about the long term 'smooth' trend. Clearly, even on visual appraisal, this estimated anomaly series has significant serial correlation, which can be directly associated with the interesting peak at 40 months period on its amplitude periodogram. This spectral characteristic obviously corresponds to the equivalent peak in the spectrum of the original series, which we noted earlier on Fig. 3. The seasonal adjustment process has nicely revealed the potential importance of the relatively low, but apparently significant power in this part of the spectrum. In this last connection, it should be noted that Young et al.38 discuss further the relevance of the CO2 anomaly series and consider a potential dynamic, lagged relationship between this series and a Pacific Ocean Sea Surface Temperature (SST) anomaly series using multivariate (or vector) time-series analysis.

7.2 Analysis of the Raup-Sepkoski Extinctions data
The variance periodogram of the Extinctions data, as shown in Fig. 6, indicates that most of the variance can be associated with the very low frequency component which arises mainly from the elevated level of extinctions at either end of the series. The next most significant peak occurs at a period of 30·65 My. The positions of the maxima associated with the least squares best fit cycle of this wavelength are marked by vertical dashed lines in Fig. 1(c), and we see that, as might be expected, they also correspond closely to the large peaks at either end of the data.

Fig. 6. Variance periodogram of the Extinctions series.

In order to correct some of the amplitude asymmetry, the log-transformed Extinctions series is shown in Fig. 7 together with the IRWSMOOTH-estimated long term trend. This is obtained with NVR = 0·001, which ensures that the estimate contains negligible power around those frequencies associated with the quasi-periodic behaviour. It is interesting also to note that the trend, as estimated here, appears to be quite sinusoidal in form, with an almost complete period covering the 242 My of the record: this is consistent with the speculative suggestion by Rampino & Stothers49 of a 260-My period in comet impacts, arising from the orbit of the solar system about the galactic centre.

Fig. 7. IRWSMOOTH estimated trend superimposed on the log-transformed Extinctions series.

The variance periodogram of the detrended and log-transformed series shown in Fig. 8 shows that, following removal of the low frequency power, the power is now seen to be concentrated in two major peaks, at periods of 30·65 and 40·87 My, with a smaller peak at the 17·52-My period (which appears to be a combination of the harmonic wavelengths associated with the two lower frequency peaks).

Fig. 8. Variance periodogram of the log-transformed Extinctions series.

Because the Extinctions series is so short, there are only 36 ordinates over the spectral range. In order both to determine more precisely the position and nature of the spectral peaks, and also to investigate the possibility of nonstationarity, DAR spectral analysis can be applied to the detrended and log-transformed series. For reference, Fig. 9 shows the AR(7) variance spectrum in the stationary case, i.e. with assumed time-invariant AR parameters. Figure 10 is a contour spectrum obtained from the DAR(7) analysis using an IRW process to model the parameter variations, with the NVR = 0·007 for all parameters. This contour spectrum is obtained by computing the AR(7) spectrum at all 63 sample points between 8 and 70, based on the smoothed estimates of the AR parameter values at each sample obtained from the DAR analysis. Because of the extremely sharp spectral peaks resolved by the spectral model, the contours are plotted in binary orders of magnitude, rather than linearly, in order to give fewer contours around the peaks where the slope is steep. (Clearly, a three-dimensional plot of the spectrum against frequency would be an alternative, and possibly more impressive, way of presenting these results but the authors feel this contour plot allows for more precise and easy measurement.)

Fig. 9. AR(7) spectrum of the log-transformed Extinctions series.

Fig. 10. Contour spectrum based on the smoothed DAR estimates of the AR(7) parameter values at each of the 63 sample points between 8 and 70.
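The AR spectra in Figs 9 and 10 follow directly from the estimated coefficients: for an AR(p) model y(k) + a_1y(k − 1) + ... + a_py(k − p) = e(k) with var{e(k)} = σ², the variance spectrum at frequency f is σ²/|1 + Σ_j a_j exp(−i2πfj)|². A minimal sketch of this evaluation (our own helper; the contour plot of Fig. 10 is built by applying it to the smoothed DAR coefficient estimates at each sample):

```python
import numpy as np

def ar_spectrum(a, sigma2, freqs):
    """Variance spectrum of y(k) + a1*y(k-1) + ... + ap*y(k-p) = e(k),
    var{e} = sigma2, at the given frequencies (cycles/sample)."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    phase = np.exp(-2j * np.pi * np.outer(np.arange(1, p + 1), freqs))
    A = 1.0 + a @ phase              # denominator polynomial A(f)
    return sigma2 / np.abs(A) ** 2

freqs = np.linspace(0.0, 0.5, 256)
# spectrum_k = ar_spectrum(a_smoothed[k], sigma2, freqs) for each sample k
```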

This rather novel contour spectrum clearly illustrates the very different nature of the series in the middle section, where there are few high peaks; in fact, we see that this region is characterised by two peaks at periods of around 18 and 39 My, which is in contrast to either end of the series, where there is a dominant peak of between 30 and 32 My, lengthening to around 36 My over the last few samples. Obviously, this kind of DAR analysis, based on only 70 data points, should be assessed carefully since it is difficult to draw any firm conclusions about the significance of estimated spectral peaks based on such a small sample size, particularly when using this novel form of nonstationary time-series analysis. These and other important aspects of this analysis are, however, discussed in Ref. 35, to which the interested reader is referred for a more detailed discussion of the methods and results.

Finally, the high frequency variation in the middle of the series can be effectively attenuated by smoothing the log-transformed and detrended series using the IRWSMOOTH algorithm with NVR = 0·1, as shown in Fig. 11. The aim here is to reduce the power at frequencies greater than those under major investigation and it shows how the six peaks in the Jurassic and Cretaceous periods are smoothed to produce three lower magnitude peaks of longer wavelength. This leaves eight smoothed peaks in all (including the two high points at the ends of the series), with a mean separation of 34·5 My. This result is reasonably insensitive to the smoothness of the line and is in reasonable agreement with the treatment of this part by Rampino & Stothers49 who, by rejecting peaks below 10% extinctions, remove two of the peaks in the mid-region.

Fig. 11. Additional smoothing of the detrended, log-transformed series using the IRWSMOOTH algorithm with NVR = 0·1.

The analysis in this example has produced spectral estimates of the changes in the quasi-periodic behaviour which characterises the Extinctions data, based on the assumption that the series may be nonstationary in statistical terms. The results indicate that the period of the cyclic component is at least 30 My, and may be as large as 34·5 My. It should be stressed, however, that these results should be regarded as simply providing further critical evaluation of this important series, rather than providing an accurate estimate of the cyclic period. The small number of complete cycles in the record, coupled with the uncertainty about the geologic dating of the events, unavoidably restricts our ability to reach unambiguous conclusions about the statistical properties of the series. The authors hope, however, that the example helps to illustrate the extra dimension to time-series analysis provided by recursive time variable parameter estimation.

8 CONCLUSIONS

This chapter has discussed the analysis of nonstationary environmental data and introduced a new approach to environmental time-series analysis and modelling based on sophisticated recursive filtering and smoothing algorithms. The two practical examples show how it can be used quite straightforwardly to investigate nonstationarity in typical environmental series. The approach can also be extended to the analysis of multivariable (vector) time-series25,32 and we have seen that it allows for the development of various time-variable parameter (TVP) modelling procedures. These include the dynamic (TVP) forms of the common regression models, i.e. the dynamic linear, auto and harmonic regression models discussed here, as well as TVP versions of the various transfer function (TF) modelling procedures.9,50-53 Such transfer function modelling techniques are not well known in environmental data analysis but the topic is too extensive to be covered in the present chapter. Additional information on these procedures within an environmental context can, however, be found in Refs 5, 9, 54-60 and from the many references cited therein.

Finally, the new procedure can be considered as a method of analysing certain classes of nonlinear dynamic systems by the development of TVP58 or the closely related state dependent model (SDM: see Refs 59 and 61) approximations. This is discussed fully in Ref. 62.

ACKNOWLEDGEMENTS

Some parts of this chapter are based on the analysis of the Mauna Loa CO2 data carried out by one of the authors (P.Y.) while visiting the Institute of Empirical Macroeconomics at the Federal Reserve Bank of Minneapolis, as reported in Ref. 38; the author is grateful to the Institute for its support during his visit and to the Journal of Forecasting for permission to use this material.

REFERENCES

1. Raup, D.M. & Sepkoski, J.J., Periodicity of extinctions in the geologic past. Proc. Nat. Acad. Sci. USA, 81 (1984) 801-05.
2. Box, G.E.P. & Jenkins, G.M., Time Series Analysis, Forecasting and Control. Holden-Day, San Francisco, 1970.
3. Young, P.C. & Benner, S., microCAPTAIN Handbook: Version 2.0. Lancaster University, 1988.
4. Priestley, M.B., Spectral Analysis and Time Series. Academic Press, London, 1981.
5. Young, P.C., Recursive approaches to time-series analysis. Bull. of Inst. of Math. and its Applications, 10 (1974) 209-24.
6. Akaike, H., A new look at statistical model identification. IEEE Trans. on Aut. Control, AC-19 (1974) 716-22.
7. Young, P.C., The Differential Equation Error Method of Process Parameter Estimation. PhD Thesis, University of Cambridge, UK, 1969.
8. Young, P.C., The use of a priori parameter variation information to enhance the performance of a recursive least squares estimator. Tech. Note 404-90, Naval Weapons Center, China Lake, CA, 1969.
9. Young, P.C., Recursive Estimation and Time-Series Analysis. Springer-Verlag, Berlin, 1984.
10. Kopp, R.E. & Orford, R.J., Linear regression applied to system identification for adaptive control systems. AIAA Journal, 1 (1963) 2300-06.
11. Lee, R.C.K., Optimal Identification, Estimation and Control. MIT Press, Cambridge, MA, 1964.
12. Kalman, R.E., A new approach to linear filtering and prediction problems. ASME Trans., J. Basic Eng., 83-D (1960) 95-108.
13. Young, P.C., On a weighted steepest descent method of process parameter estimation. Control Div., Dept. of Eng., Univ. of Cambridge.
14. Young, P.C., An instrumental variable method for real time identification of a noisy process. Automatica, 6 (1970) 271-87.
15. Young, P.C., A second generation adaptive pitch autostabilisation system for a missile or aircraft. Tech. Note 404-109, Naval Weapons Center, China Lake, CA, 1971.
16. Young, P.C., A second generation adaptive autostabilisation system for airborne vehicles. Automatica, 17 (1981) 459-69.
17. Young, P.C., Comments on 'Dynamic equations for economic forecasting with GDP-unemployment relation and the growth in the GDP in the U.K. as an example'. J. Royal Stat. Soc., Series A, 134 (1971) 167-227.
18. Young, P.C., Comments on 'Techniques for assessing the constancy of a regression relationship over time'. J. Royal Stat. Soc., Series B, 37 (1975) 149-92.
19. Harrison, P.J. & Stevens, C.F., Bayesian forecasting. J. Royal Stat. Soc., Series B, 38 (1976) 205-47.
20. Engle, R.F., Brown, S.J. & Stern, G., A comparison of adaptive structural forecasting models for electricity sales. J. of Forecasting, 7 (1988) 149-72.
21. Young, P.C., Applying parameter estimation to dynamic systems, Parts I and II. Control Engineering, 16 (10) (1969) 119-25 and 16 (11) (1969) 118-24.
22. Harrison, P.J. & Stevens, C.F., A Bayesian approach to short-term forecasting. Operational Res. Quarterly, 22 (1971) 341-62.
23. Kitagawa, G., A non-stationary time-series model and its fitting by a recursive filter. J. of Time Series Analysis, 2 (1981) 103-16.
24. Harvey, A.C., A unified view of statistical forecasting procedures. J. of Forecasting, 3 (1984) 245-75.
25. Ng, C.N., Young, P.C. & Wang, C.L., Recursive identification, estimation and forecasting of multivariate time-series. In Proc. IFAC Symposium on Identification and System Parameter Estimation, ed. H.F. Chen. Pergamon Press, Oxford, 1988, pp. 1349-53.
26. Nerlove, M., Grether, D.M. & Carvalho, J.L., Analysis of Economic Time Series: A Synthesis. Academic Press, New York, 1979.
27. Bell, W.R. & Hillmer, S.C., Issues involved with the seasonal adjustment of economic time series. J. Bus. and Econ. Stat., 2 (1984) 291-320.
28. Jakeman, A.J. & Young, P.C., Recursive filtering and the inversion of ill-posed causal problems. Utilitas Mathematica, 35 (1984) 351-76. (Appeared originally as Report No. AS/R28/1979, Centre for Resource and Environmental Studies, Australian National University, 1979.)
29. Kitagawa, G. & Gersch, W., A smoothness priors state-space modelling of time series with trend and seasonality. J. American Stat. Ass., 79 (1984) 378-89.
30. Young, P.C., Recursive extrapolation, interpolation and smoothing of nonstationary time-series. In Proc. IFAC Symposium on Identification and System Parameter Estimation, ed. H.F. Chen. Pergamon Press, Oxford, 1988, pp. 33-44.
31. Young, P.C., Recursive estimation, forecasting and adaptive control. In Control and Dynamic Systems, vol. 30, ed. C.T. Leondes. Academic Press, San Diego, 1989, pp. 119-66.
32. Young, P.C., Ng, C.N. & Armitage, P., A systems approach to economic forecasting and seasonal adjustment. International Journal on Computers and Mathematics with Applications, 18 (1989) 481-501.
33. Ng, C.N. & Young, P.C., Recursive estimation and forecasting of nonstationary time-series. J. of Forecasting, 9 (1990) 173-204.
34. Norton, J.P., Optimal smoothing in the identification of linear time-varying systems. Proc. IEE, 122 (1975) 663-8.
35. Young, T.J., Recursive Methods in the Analysis of Long Time Series in Meteorology and Climatology. PhD Thesis, Centre for Research on Environmental Systems, University of Lancaster, UK, 1987.
36. Schweppe, F., Evaluation of likelihood function for Gaussian signals. IEEE Trans. on Inf. Theory, 11 (1965) 61-70.
37. Harvey, A.C. & Peters, S., Estimation procedures for structural time-series models. London School of Economics, Discussion Paper No. A28, 1984.
38. Young, P.C., Ng, C.N., Lane, K. & Parker, D., Recursive forecasting, smoothing and seasonal adjustment of nonstationary environmental data. J. of Forecasting, 10 (1991) 57-89.
39. Bacastow, R.B. & Keeling, C.D., Atmospheric CO2 and the southern oscillation: effects associated with recent El Nino events. In Proceedings of the WMO/ICSU/UNEP Scientific Conference on the Analysis and Interpretation of Atmospheric CO2 Data, Bern, Switzerland, WCP-14, 14-18 Sept. 1981. World Meteorological Organisation, pp. 109-12.
40. Ng, C.N., Recursive Identification, Estimation and Forecasting of Non-Stationary Time-Series. PhD Thesis, Centre for Research on Environmental Systems, University of Lancaster, UK.
41. Box, G.E.P. & Tiao, G.C., Intervention analysis with application to economic and environmental problems. J. American Stat. Ass., 70 (1975) 70-9.
42. Tsay, R.S., Outliers, level shifts and variance changes in time series. J. of Forecasting, 7 (1988) 1-20.
43. Young, P.C. & Ng, C.N., Variance intervention. J. of Forecasting, 8 (1989) 399-416.
44. Shiskin, J., Young, A.H. & Musgrave, J.C., The X-11 variant of the Census Method II seasonal adjustment program. US Dept of Commerce, Bureau of Economic Analysis, Tech. Paper No. 15.
45. Manley, G., Central England temperatures: monthly means 1659-1973. Quart. J. Royal Met. Soc., 100 (1974) 387-405.
46. WCRP, Proceedings of the WMO/ICSU/UNEP Scientific Conference on the Analysis and Interpretation of Atmospheric CO2 Data, Bern, Switzerland, WCP-14, 14-18 Sept. World Meteorological Organisation, 1981.
47. Schnelle et al., in Proceedings of the WMO/ICSU/UNEP Scientific Conference on the Analysis and Interpretation of Atmospheric CO2 Data, Bern, Switzerland, WCP-14, 14-18 Sept. World Meteorological Organisation, 1981, pp. 155-62.
48. Bacastow, R.B., Keeling, C.D. & Whorf, T.P., Seasonal amplitude in atmospheric CO2 concentration at Mauna Loa, Hawaii, 1959-1980. In Proceedings of the WMO/ICSU/UNEP Scientific Conference on the Analysis and Interpretation of Atmospheric CO2 Data, Bern, Switzerland, WCP-14, 14-18 Sept. 1981. World Meteorological Organisation, pp. 169-76.
49. Rampino, M.R. & Stothers, R.B., Terrestrial mass extinctions, cometary impacts and the sun's motion perpendicular to the galactic plane. Nature, 308 (1984) 709-12.
50. Young, P.C., Some observations on instrumental variable methods of time-series analysis. Int. J. of Control, 23 (1976) 593-612.
51. Ljung, L. & Soderstrom, T., Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA, 1983.
52. Young, P.C., Recursive identification, estimation and control. In Handbook of Statistics 5: Time Series in the Time Domain, ed. E.J. Hannan, P.R. Krishnaiah & M.M. Rao. North Holland, Amsterdam, 1985, pp. 213-55.
53. Young, P.C., The instrumental variable method: a practical approach to identification and system parameter estimation. In Identification and System Parameter Estimation 1985, vols 1 and 2, ed. H.A. Barker & P.C. Young. Pergamon Press, Oxford, 1985, pp. 1-16.
54. Young, P.C., The validity and credibility of models for badly defined systems. In Uncertainty and Forecasting of Water Quality, ed. M.B. Beck & G. van Straten. Springer-Verlag, Berlin, 1983, pp. 69-100.
55. Young, P.C., Time-series methods and recursive estimation in hydrological systems analysis. In River Flow Modelling and Forecasting, ed. D.A. Kraijenhoff & J.R. Moll. D. Reidel, Dordrecht, 1986, pp. 129-80.
56. Young, P.C. & Wallis, S.G., Recursive estimation: a unified approach to identification, estimation and forecasting for hydrological systems. J. of App. Maths. and Computation, 17 (1985) 299-334.
57. Wallis, S.G., Young, P.C. & Beven, K.J., Experimental investigation of the aggregated dead zone model for longitudinal solute transport in stream channels. Proc. Instn Civ. Engrs, Part 2, 87 (1989) 1-22.
58. Young, P.C., A general theory of modelling for badly defined dynamic systems. In Modeling, Identification and Control in Environmental Systems, ed. G.C. Vansteenkiste. North-Holland, Amsterdam, 1978, pp. 103-35.
59. Priestley, M.B., State-dependent models: a general approach to nonlinear time-series analysis. J. of Time Series Analysis, 1 (1980) 47-71.
60. Young, P.C. & Wallis, S.G., The Aggregated Dead Zone (ADZ) model for dispersion in rivers. In BHRA Int. Conf. on Water Quality Modelling in the Inland Natural Environment. BHRA, Cranfield, Bedfordshire, 1986, pp. 421-33.
61. Haggan, V., Heravi, S.M. & Priestley, M.B., A study of the application of state-dependent models in nonlinear time-series analysis. J. of Time Series Analysis, 5 (1984) 69-102.
62. Young, P.C. & Runkle, D.E., Recursive estimation and the modelling of nonstationary and nonlinear time-series. In Adaptive Systems in Control and Signal Processing, vol. 1. Institute of Measurement and Control, London, 1989, pp. 49-64.


Chapter 3

Regression and Correlation

A.C. DAVISON
Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK

1 BASIC IDEAS

1.1 Systematic and random variation
Statistical methods are concerned with the analysis of data with an appreciable random component. Whatever the reason for an analysis, there is almost always pattern in the data, obscured to a greater or lesser extent by random variation. Both pattern and randomness can be summarized by statistical methods, but their relative importance and the eventual summary depend on the aim of the analysis. At one extreme are techniques intended solely to explore and describe the data, and at the other extreme is the analysis of data for which there is an accepted probability model based on biological or physical mechanisms. Regression methods stretch between these extremes, and are among the most widely used techniques in statistics.

There are various ways to classify data. Typically there are a number of individuals or units on each of which a number of variables are measured. Often in environmental data the units correspond to evenly-spaced points in time, and if there is the possibility of dependence between measurements on successive units the techniques of time series analysis must be considered, especially if interest is focused primarily on a single variable. In this chapter we suppose that this dependence is at most weak, or that a major part of any apparent serial dependence is due to changes in another variable for which data are available.

The variables themselves may be continuous or discrete, and bounded or unbounded. Discrete variables may be measurements, or counts, or proportions of counts, or they may represent a classification of other variables. Various classifications are possible. Examples are ordinal data, where categories 1, 2, ..., k represent categories for which the order is important, or nominal categories, for which there is no natural ordering.

In many problems there is a variable of central interest, a response. The variables on which the response is thought to depend are explanatory variables which, though possibly random themselves, are regarded as fixed for the purpose of analysis. Some authors use the terms dependent and independent for response and explanatory variables. The collection of response and explanatory variables for a unit is sometimes called a case.

In regression problems the aim is to explain variation in the response in terms of the other variables. Even though different variables may be chosen as responses at different stages of an analysis, it helps to focus discussion and simplifies interpretation if there is a single response at each stage. A central aspect in formulating a regression model is therefore to determine which variable is to be taken as the response. Very often, though not invariably, this is dictated by the purpose of the analysis. Sometimes several variables measuring essentially the same quantity may usefully be combined into a single derived variable to be used as the response.

Once a response has been identified, the question arises of how to model its dependence on explanatory variables. There are two aspects to this. The first is to specify a plausible form for the systematic variation. A linear relation is often chosen for its simplicity of interpretation, and this may adequately summarize the data. Consistency with any known limiting behaviour is important, however, in order to avoid such absurdities as negative predictions for positive quantities. Formulations which encompass known features of the relationship between the response and explanatory variables, such as asymptotes, are to be preferred.

A second aspect is the form chosen to describe departures from the systematic structure. Sometimes a method of fitting such as least squares is regarded as self-justifying, as when data subject to virtually no measurement error are summarized in terms of a fitted polynomial. More often scatter about the systematic structure is described in terms of a suitable probability distribution. The normal distribution is often used to model measurement error, but other distributions may be more suitable for modelling positively skewed responses, proportions, or counts. Examples are given below. In some cases it is necessary only to specify the mean and variance properties of the model, so that full specification of a probability distribution is not required.

The two most widely used methods of fitting are least squares and


TABLE 1
Annual maximum sea levels (cm) in Venice, 1931-1981 (values read across rows, from 1931 to 1981)

103  78 121 116 115 147 119 114  89 102
 99  91  97 106 105 136 126 132 104 117
151 116 107 112  97  95 119 124 118 145
122 114 118 107 110 194 138 144 138 123
122 120 114  96 125 124 120 132 166 134
138

maximum likelihood. Routines for least squares have been widely available for many years and the method provides generally satisfactory answers over a range of settings. Some loss of sensitivity can result from their use, however, and with the computational facilities now available, maximum likelihood estimates, sometimes previously avoided as being harder to calculate, are used for their good theoretical properties. The methods coincide in an important class of situations (Section 3.1).

In many situations interest is focused primarily on one variable, whose variation is to be explained. In problems of correlation, however, the relationships between variables that are to be treated on an equal footing are of interest. Perhaps as a result, examples of successful correlation analyses are rarer than successful examples of regression, where the focus on a single variable gives a potentially more incisive analysis. Another reason may be the relative scarcity of flexible and analytically tractable distributions with which to model the myriad forms of multivariate dependence which can arise in practice. In some cases, however, variables must be treated on an equal footing and an attempt made to unravel their joint structure.

1.2 Examples
Some examples are given below to elaborate on these general comments.

Example 1. The data in Table 1 consist of annual maximum sea levels in centimetres in Venice for the years 1931-1981.1 The plot of the data in Fig. 1 shows a marked increase in sea levels over the period in question, superimposed upon which there are apparently random fluctuations. There are different possible interpretations of the systematic trend, but clearly it would be foolish not to take it into account in predictions for the future. In this case the trend appears to be linear, and we might express the annual maximum sea level Yt in year t as

Yt = β0 + β1(t − 1931) + εt    (1)


Fig. 1. Annual maximum sea levels measured at Venice, 1931-1981. (Scatterplot of annual maximum sea level (cm) against year; figure not reproduced.)

where β0 (cm) and β1 (cm/year) are unknown parameters representing respectively the mean sea level in 1931 and the rate of increase per year. The random component εt is a random variable which represents the scatter in the data about the regression line. In many cases it would be assumed that the εt are uncorrelated with mean zero and constant variance, but a stronger assumption would be that the εt were independently normally distributed. A further elaboration of the systematic part of eqn (1) would be to suppose that the trend was polynomial rather than purely linear, or that it contained cyclical behaviour, which could be represented by adding trigonometric functions of t to the right-hand side of eqn (1).
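To make the distinction between the systematic and random components concrete, the following minimal sketch (in Python with NumPy; the parameter values are assumed for illustration only and are not estimates from the data) simulates observations from model (1) under the stronger normal assumption:

    import numpy as np

    rng = np.random.default_rng(1)
    years = np.arange(1931, 1982)                  # 51 years, as in Table 1
    beta0, beta1, sigma = 105.0, 0.57, 19.0        # assumed illustrative values
    eps = rng.normal(0.0, sigma, size=years.size)  # random component eps_t
    y = beta0 + beta1 * (years - 1931) + eps       # systematic trend plus scatter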

Data on river flows or sea levels are continuous: measurements could in principle be made and recorded to any desired accuracy. In other cases data take a small number of discrete values.

Example 2. Table 2 gives data on levels of ozone at two sites in east Texas.2 For each of 48 consecutive months the table gives the number of days on which ambient ozone levels exceeded 0·08 parts per million (ppm), the number of days per month on which data were recorded, and the average daily maximum temperature for the month. Figure 2 displays a


TABLE 2
Ozone data for two locations in Texas, 1981-1984

                      1981           1982           1983           1984
                   A  B   C       A  B   C       A  B   C       A  B   C
Beaumont
January            0 30 14·61    0 31 15·50    0 13 14·56    5 31 12·72
February           1 25 16·89    1 28 14·44    0  0 16·17    2 28 17·67
March              0 24 20·22    1 27 22·33    0  5 19·83    3 31 19·94
April              0 24 25·50    4 29 23·89    0 18 20·39    5 24 24·22
May                0 31 26·78    4 25 27·39    9 28 26·89   10 31 27·11
June               0 30 30·78    8 22 31·50    3 21 29·28    4 24 29·00
July               3 30 32·11    0  3 32·39    0  0 31·67    5 25 30·33
August             8 25 32·17    6 20 32·33    0  0 30·94    6 29 31·28
September         11 26 29·56    3 24 30·06    4  9 27·50    2 29 28·56
October            2 31 25·28    5 25 25·06    6 26 25·83    2 30 26·61
November           4 29 21·17    1 26 20·50    0 27 21·44    0 30 19·22
December           0 31 15·72    1 24 19·06    1 29 12·11    0 30 21·67

North Port Arthur
January            0  0  0       0 23 16·50    2 17 15·00    5 31 12·39
February           0  0  0       2 28 16·61    1 25 17·78    1 28 17·33
March              0  0  0       1 31 21·94    2 29 22·00    3 27 19·89
April              0  0  0       2 28 24·17    2 20 24·00    3 22 24·56
May                0  0  0       5 31 28·94    1 14 29·94    4 31 27·89
June               4 18 31·06    6 28 31·94    0  0 30·00    0  3 30·33
July               6 19 31·94    0 10 32·28    1  5 31·67    4 26 31·56
August             3 13 32·11    4 11 36·28    8 19 35·67    7 30 32·61
September          9 23 29·44    3 15 34·67    6 20 30·28    3 30 29·06
October            2 18 25·28    4 26 27·61    6 27 26·94    0 17 27·28
November           3 24 21·83    0 30 20·50    0 28 21·61    1 28 20·89
December           2 30 16·44    0 23 18·28    0 29 12·17    0 31 21·67

A Number of days per month with exceedances over 0·08 ppm ozone. B Number of days of data per month. C Daily maximum temperature (°C) for the month.

plot of the proportion of days on which 0·08 ppm was exceeded against temperature. There is clearly a relation between the two, but the discreteness of the data and the large number of zero proportions mean that a simple relation of the same form as eqn (1) is not appropriate.

Example 3. Table 3 gives annual peak discharges of four rivers in Texas for the years 1924-1982. Data were collected for the San Saba


Fig. 2. Proportion of days per month on which ozone levels exceed 0·08 ppm for the years 1981-1984 at Beaumont and North Port Arthur, Texas, plotted against average daily maximum temperature (°C). (Figure not reproduced.)

and Colorado Rivers at and near San Saba, data from Llano and near Castell were combined for the Llano River, and data from near Johnson City and Spicewood were combined for the Pedernales River (see Fig. 3). If prevention of flooding downstream at Austin is of interest, it will be important to explain joint variation in the peak discharges, which will have a combined effect at Austin if they arise from the same rainfall episode.

There are many possible joint distributions which might be fitted to the data. About the simplest is the multivariate normal distribution, whose probability density function is

f(y; μ, Ω) = (2π)^{−p/2} |Ω|^{−1/2} exp{−½(y − μ)ᵀΩ^{−1}(y − μ)}    (2)

where y is the p × 1 vector of observations, μ is the p × 1 vector of their means, and the p × p non-negative definite matrix Ω is the covariance matrix of Y. Thus the diagonal elements of Ω are the variances of the individual components of Y, the off-diagonal elements are their covariances, and so on. In this example the vector observation for each year consists of the peak discharges for the p = 4 rivers. There are n = 59 independent copies of this if we assume that discharges are independent in different years.


[TABLE 3: Annual peak discharges (cusecs × 10³) for four Texas rivers (Llano, Pedernales, San Saba, Colorado), 1924-1982; the table itself is not reproduced.]


Fig. 3. Sketch of central Texas rivers showing position of Austin relative to Llano, Pedernales, San Saba and Colorado rivers. (Sketch not reproduced.)

The fitting of this distribution to the data is discussed in Section 2.

The examples above are observational studies. They are typical of environmental data where usually there is no prospect of controlling other factors thought to influence the variables of central interest. The situation is different in designed experiments as used in, for example, agricultural and some biological studies, where the experimenter has a good deal of control over the allocation of, say, fertilizer treatments to plots of land. Random allocation of plots to treatments helps reduce bias and may make causal statements possible. That is, the experimenter may at the end of the day be able to say that the average effect of this treatment is to cause that response. Sometimes randomization is possible in environmental studies, but more often it is not, and this limits the strength of inferences that can be made. Generally inferences regarding association may be drawn, so that it may be possible to say, for example, that a 1°C rise in temperature is associated with a rise in the rate at which some ozone level is exceeded. The further implication of a causal relation must come from considerations external to the data, such as a knowledge of underlying physical, biological or chemical mechanisms, or from additional data from suitably designed investigations.


2 CORRELATION

Two variables are said to be correlated if there is an association between them. The strength and direction of association between independent data pairs (X1, Y1), ..., (Xn, Yn) may be informally assessed by a scatterplot of the values of the Yj against the Xj. A numerical measure of linear association is the (product moment) sample correlation coefficient,

R = Σ_{j=1}^{n}(Xj − X̄)(Yj − Ȳ) / {Σ_{j=1}^{n}(Xj − X̄)² Σ_{j=1}^{n}(Yj − Ȳ)²}^{1/2}    (3)

where X̄ = n⁻¹ΣXj and Ȳ = n⁻¹ΣYj are the averages of the Xj and Yj. It is possible to show that −1 ≤ R ≤ 1 and that |R| is location and scale invariant, i.e. unaltered by changes of the form X → a + bX, Y → c + dY, provided b and d are non-zero. The implication is that R measures association only and does not depend on the units in which the data are measured. If R = ±1, the pairs lie on a straight line; if R = 0 the variables are said to be uncorrelated.
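For example, eqn (3) may be computed directly, as in the following sketch (Python with NumPy; the data values are invented), which also checks the result against a library routine:

    import numpy as np

    def sample_correlation(x, y):
        # Product-moment sample correlation coefficient R of eqn (3)
        x, y = np.asarray(x, float), np.asarray(y, float)
        xc, yc = x - x.mean(), y - y.mean()
        return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    print(sample_correlation(x, y))   # close to 1: strong linear association
    print(np.corrcoef(x, y)[0, 1])    # agrees with the library value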

If the pairs (Xj, Yj) have a bivariate normal distribution (eqn (2)) with p = 2, and population covariance matrix

( ω11  ω12 )
( ω21  ω22 )

the population correlation coefficient is ρ = ω12/{ω11ω22}^{1/2}, and the distribution of R is tabulated. If tables are unavailable a good approximation is that Z = ½ log{(1 + R)/(1 − R)} is approximately normal with mean tanh⁻¹(ρ) + ρ/{2(n − 1)} and variance 1/(n − 3). If ρ = 0 then ω12 = 0, and it follows from the factorization of eqn (2) into a product of separate probability density functions for x and y that X and Y are independent.
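This approximation yields a simple interval for ρ; a minimal sketch (Python with SciPy) is given below, in which the small bias term ρ/{2(n − 1)} is ignored for simplicity:

    import numpy as np
    from scipy.stats import norm

    def fisher_interval(r, n, level=0.95):
        # Z = (1/2) log{(1 + R)/(1 - R)}, approximately normal, variance 1/(n - 3)
        z = 0.5 * np.log((1 + r) / (1 - r))
        half = norm.ppf(0.5 + level / 2) / np.sqrt(n - 3)
        return np.tanh(z - half), np.tanh(z + half)  # tanh undoes the transformation

    print(fisher_interval(0.70, 59))  # e.g. a correlation of 0.70 based on n = 59 pairs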

It is important to realise that R measures only linear association between X and Y. The (x, y) pairs (−2, 4), (−1, 1), (0, 0), (1, 1) and (2, 4) are strongly and non-randomly related, with y = x², but their correlation coefficient is zero. A scatterplot of data pairs is therefore an essential preliminary to calculation and subsequent interpretation of their sample correlation coefficient. Apparently strong correlation may be due to a single extreme pair, for example, but the numerical value of R alone will not convey this information.

Example 3 (contd). Figure 4 shows a scatterplot matrix for the Texas rivers data and for the year for which the measurements are observed. This


Fig. 4. Scatterplot matrix for data on Texas rivers. (Figure not reproduced.)

consists of all possible scatterplots of pairs of variables of the original data, together with plots of the data against time. Thus the cell second from the left in the bottom row consists of values for Llano on the y-axis plotted against values for Pedernales on the x-axis. The diagonal cells consist of data plotted against themselves and are not informative.

The data vary over several orders of magnitude and have very skewed distributions. Both these features are removed by a logarithmic transformation of the peak discharges. There seem to be high peak discharges of the rivers, especially the Colorado, in early years. Table 4 gives the matrix of correlations for the original data and for the natural logarithm of the peak discharges. Thus 0·29 is the sample correlation coefficient for the original Llano and Pedernales data. The correlation between a vector of observations and itself is 1, and the diagonal is usually omitted from a correlation matrix, which is of course symmetric.

Figure 5 shows a scatterplot matrix for the data after log transformation. The correlations between the rivers are very similar to those for the untransformed data, but the skewness of the distributions has been removed and the signs and sizes of the sample correlation coefficients are clearly supported by the entire datasets. The appreciable correlation between the San Saba and Colorado rivers is seen clearly. The correlation


TABLE 4
Correlation matrices for the Texas rivers data: original data and data after log transformation. [The matrices are not fully recoverable here; values cited in the text include 0·29 between the original Llano and Pedernales data and 0·70 between the log-transformed San Saba and Colorado data.]

Fig. 5. Scatterplot matrix for Texas rivers data after log transformation. (Figure not reproduced.)


between the Llano and Pedernales rivers would be strengthened by removal of the lowest observation, which is highlighted. The other plots show little correlation between the variables. There seems to be a downward trend with time for the Colorado and a possible slight upward trend for the Pedernales.

In this example transformation to remove skewness makes a summary of the data in terms of a multivariate normal distribution sensible for some purposes. The largest correlation for the transformed data is 0·70 for the San Saba and Colorado rivers. The only other correlation significantly different from zero at the 95% level is 0·29 between the Llano and Pedernales rivers. Taken together with Fig. 5, this suggests that after transformation the data can be summarized in terms of two independent bivariate normal distributions, one for the Llano and Pedernales rivers and another for the San Saba and Colorado rivers.

Discussion of other parametric measures of correlation is deferred to Chapter 4. There are a number of non-parametric measures of association between two variables, generally based on comparisons of the rankings of the (X, Y) pairs. Let Dj, for example, be the difference between the rank of Xj and that of Yj. Spearman's S is defined as ΣDj², and is obviously zero if there is a monotonic increasing relation between the measurements. This and other statistics based on ranks do not suffer from the disadvantage of measuring linear association only. Percentage points of S under the hypothesis of independence between rankings are available.3
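A sketch of this statistic (Python with SciPy; the data are invented, chosen so that y increases monotonically with x):

    import numpy as np
    from scipy.stats import rankdata, spearmanr

    def spearman_S(x, y):
        d = rankdata(x) - rankdata(y)  # differences D_j between the two rankings
        return (d ** 2).sum()          # Spearman's S

    x = np.array([3.1, 1.2, 5.0, 2.2, 4.4])
    y = x ** 3                         # a monotonic increasing relation
    print(spearman_S(x, y))            # 0.0, as stated above
    print(spearmanr(x, y)[0])          # the corresponding rank correlation is 1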

Although the multivariate normal distribution is the most widely used, others may be appropriate in analysing data on multivariate survival times,5 discrete multivariate data,6 mixtures of discrete and continuous multivariate data,7 and other situations.8 We take as an example the study of multivariate extremes.

Example 4. The need to model sample minima and maxima arises in many environmental contexts, such as hydrology, meteorology and oceanography. Suppose that X and Y represent the annual minimum daily levels of two nearby rivers, and that data are available for a number of years. The assumption that the annual minima are independent may obviously be incorrect, as a drought might affect both rivers at once. It may therefore be necessary to model the joint behaviour of X and Y simultaneously, either to obtain more accurate predictions of how each varies individually, or in order to model the behaviour of some function of both X and Y.


One class of joint distributions for such minima is specified by9,10

pr(X > x, Y > y) = exp{−(x + y) + θxy/(x + y)},   0 ≤ θ ≤ 1, (x, y > 0)    (4)

The population correlation coefficient between X and Y is then

(1 − ¼θ)^{−3/2} θ^{−1/2} {sin⁻¹(½θ^{1/2}) − ½θ^{1/2}(1 − ¼θ)^{1/2}(1 − ½θ)}

The variables X and Y are independent if θ = 0, when eqn (4) factorizes into the product e^{−x} × e^{−y}. Tawn9 discusses this and related models in the context of modelling extreme sea levels.

3 LINEAR REGRESSION

3.1 Basics
Consider again the data on sea levels in Venice pictured in Fig. 1. If we suppose that model (1) is appropriate, we need to estimate the parameters β0 and β1. We suppose that the errors εt are uncorrelated with mean zero and variance σ², also to be estimated. We write eqn (1) in matrix form as

[Y1931]   [1   0]          [ε1931]
[Y1932]   [1   1]          [ε1932]
[Y1933] = [1   2] [β0]  +  [ε1933]    (5)
[  ⋮  ]   [⋮   ⋮] [β1]     [  ⋮  ]
[Y1981]   [1  50]          [ε1981]

or

Y = Xβ + ε    (6)

in an obvious notation. More generally we take a model written in form (6) in which Y is an n × 1 vector of responses, X is an n × p matrix of explanatory variables or covariates, β is a p × 1 vector of parameters to be estimated, with p < n, and ε is an n × 1 vector of unobservable random disturbances. The sum of squares which corresponds to any value of β is

(Y − Xβ)ᵀ(Y − Xβ) = Σ_{j=1}^{n} (Yj − xjᵀβ)²    (7)


where superscript T denotes matrix transpose, Yj is the jth element of Y, and xjᵀ is the jth row of X. The sum of squares (eqn (7)) measures the squared distance, and hence the discrepancy, between the summary of the data given by the model with parameters β and the data themselves. The model value corresponding to a given set of explanatory values and parameters is xjᵀβ, from which the observation Yj differs by Yj − xjᵀβ; the overall discrepancy for the entire dataset is eqn (7).

As their name suggests, the least squares estimates of the parameters minimize the sum of squares (eqn (7)). Provided that the matrix X has rank p, so that the inverse (XᵀX)⁻¹ exists, the least squares estimate of β is the p × 1 vector

β̂ = (XᵀX)⁻¹XᵀY    (8)

The fitted value corresponding to Yj is Ŷj = xjᵀβ̂. The n × 1 vector of residuals

e = Y − Ŷ = Y − Xβ̂ = {I − X(XᵀX)⁻¹Xᵀ}Y = (I − H)Y    (9)

say, can be thought of as an estimate of the vector of errors ε. The n × n matrix H is called the 'hat' matrix because it 'puts hats' on Y: Ŷ = HY.

The interpretation of e as the estimated error suggests that the residuals e be used to estimate the variance σ² of the εj, and it can be shown that provided the model is correct, the residual sum of squares

eᵀe = Σ_{j=1}^{n} (Yj − xjᵀβ̂)²    (10)

has expected value (n − p)σ². Therefore

s² = (n − p)⁻¹ Σ_{j=1}^{n} (Yj − xjᵀβ̂)²    (11)

is unbiased as an estimate of σ². The divisor n − p can be thought of as compensating for the estimation of p parameters from n observations and is called the degrees of freedom of the model.

The estimators β̂ and s² have many desirable statistical properties.11 The estimators can be derived under the second-order assumptions made above, which concern only the means and covariance structure of the errors. These assumptions are that

E(εj) = 0,   var(εj) = σ²,   cov(εj, εk) = 0   (k ≠ j)

A stronger set of assumptions for the linear model are the normal-theory


assumptions. These are that the εj have independent normal distributions, with mean zero and variance σ², i.e. Yj is normally distributed with mean xjᵀβ and variance σ². Thus the probability density function of Yj is

f(yj; β, σ²) = (2πσ²)^{−1/2} exp{−(yj − xjᵀβ)²/(2σ²)}    (12)

An alternative derivation of β̂ is then as follows. The Yj are independent, and so the joint density of the observations Y1, ..., Yn is the product of the densities (eqn (12)) for the individual Yj:

∏_{j=1}^{n} f(Yj; β, σ²) = (2πσ²)^{−n/2} exp{−(1/2σ²) Σ_{j=1}^{n} (Yj − xjᵀβ)²}    (13)

For a given set of data, i.e. having observed values of Yj and xj, this can be regarded as a function, the likelihood, of the unknown parameters β and σ². We now seek the value of β which makes the data most plausible in the sense of maximizing eqn (13). It is numerically equivalent but algebraically more convenient to maximize the log likelihood

L(β, σ²) = −(n/2) log(2πσ²) − (1/2σ²) Σ_{j=1}^{n} (Yj − xjᵀβ)²

which is clearly equivalent to minimizing the sum of squares (eqn (7)), and so the maximum likelihood estimate of β equals the least squares estimate β̂. The maximum likelihood estimate of σ² is

σ̂² = n⁻¹ Σ_{j=1}^{n} (Yj − xjᵀβ̂)²

This is a biased estimate, and s² is used in most applications. This derivation of β̂ and σ̂² as maximum likelihood estimates has the advantage of generalizing naturally to more complicated situations, as we shall see in Section 5.

Under the normal-theory assumptions, confidence intervals for the parameters β are based on the fact that the joint distribution of the estimates β̂ is multivariate normal with dimension p, mean β and covariance matrix σ²(XᵀX)⁻¹. If we denote the rth diagonal element of (XᵀX)⁻¹ by vrr, β̂r has the normal distribution with mean βr and variance σ²vrr. If σ² is known, a (1 − 2α) × 100% confidence interval for the true value of βr is

β̂r ± zα σ vrr^{1/2}

where Φ(zα) = α and Φ(·) is the cumulative distribution function of the


standard normal distribution, namely

Φ(z) = (2π)^{−1/2} ∫_{−∞}^{z} e^{−u²/2} du

Usually in applications σ² is unknown, and then it follows that since (n − p)s²/σ² has a chi-squared distribution on n − p degrees of freedom independent of β̂, a (1 − 2α) × 100% confidence interval for βr is

β̂r ± s vrr^{1/2} t_{n−p}(α)

where t_{n−p}(α) is the α × 100% point of the Student t distribution on n − p degrees of freedom. Values of zα and t_{n−p}(α) are widely tabulated.

Example 1 (contd). For the Venice data and model (1), there are p = 2 elements of the parameter vector β. The matrices X and Y are given in eqn (5). The least squares estimates are β̂0 = 105·4 and β̂1 = 0·567, the residual sum of squares is 16988, and s² = 16988/(51 − 2) = 346·7. The matrix s²(XᵀX)⁻¹ is

(  26·41     −0·7844 )
( −0·7844     0·03138 )

and the standard errors of β̂0 and β̂1 are respectively 5·139 and 0·1771. The value of σ² is unknown, and t49(0·025) ≈ 2·01. A 95% confidence interval for β1 is (0·211, 0·923), so the hypothesis of no trend, β1 = 0, is decisively rejected at the 95% level. The straight fitted line shown in Fig. 6 seems to be an adequate summary of the trend, although there is considerable scatter about the line. The interpretation of the parameter values is that the estimated level in 1931 was 105·4 cm, and the increase in mean annual maximum sea level is about 0·57 cm per year.
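These figures can be verified numerically. The sketch below (Python with NumPy; it is not part of the original analysis, but uses the Table 1 values) computes eqns (8) and (11), the standard errors, and the 1990 extrapolation discussed in the following paragraphs:

    import numpy as np

    y = np.array([103, 78, 121, 116, 115, 147, 119, 114, 89, 102,
                  99, 91, 97, 106, 105, 136, 126, 132, 104, 117,
                  151, 116, 107, 112, 97, 95, 119, 124, 118, 145,
                  122, 114, 118, 107, 110, 194, 138, 144, 138, 123,
                  122, 120, 114, 96, 125, 124, 120, 132, 166, 134, 138], float)
    t = np.arange(y.size, dtype=float)         # years since 1931
    X = np.column_stack([np.ones_like(t), t])  # the n x p matrix of eqn (5)
    n, p = X.shape

    beta = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimates, eqn (8)
    e = y - X @ beta
    s2 = (e @ e) / (n - p)                     # unbiased variance estimate, eqn (11)
    cov = s2 * np.linalg.inv(X.T @ X)          # estimated covariance of beta-hat
    print(beta, s2, np.sqrt(np.diag(cov)))     # about (105.4, 0.567), 346.7, (5.14, 0.177)

    x0 = np.array([1.0, 59.0])                 # the year 1990
    pred_se = np.sqrt(x0 @ cov @ x0 + s2)      # fitted-point plus intrinsic variability
    print(x0 @ beta, pred_se)                  # about 138.85 and 19.74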

One use to which the fitted line could be put is extrapolation. The estimated average sea level in 1990, for example, is β̂0 + (1990 − 1931)β̂1 = β̂0 + 59β̂1 = 138·85, and this has a standard error of

var(β̂0 + 59β̂1)^{1/2} = {var(β̂0) + 2 × 59 × cov(β̂0, β̂1) + 59² var(β̂1)}^{1/2}

which is 6·56. Comparison with the natural variability in Fig. 6 shows that this is not the likely variability of the level to be observed in 1990, but the variability of the point on the fitted line to be observed that year. The actual level to be observed can be estimated as Ŷ1990 = β̂0 + 59β̂1 + e1990, where e1990 is independent of previous years and has zero mean and estimated variance s². Thus

var(Ŷ1990) = var(β̂0 + 59β̂1) + var(e1990)


Fig. 6. Linear and cubic polynomials fitted to Venice sea level data. (Figure not reproduced.)

and this has an estimated value 389·78; the standard error of Ŷ1990 is 19·74. Thus most of the prediction uncertainty for the future value is due to the intrinsic variability of the maximum sea levels. An approximate 95% predictive confidence interval is given by 138·85 ± 1·96 × 19·74 = (100·2, 177·5). This is very wide.

Apart from uncertainty due to the variability of estimated model parameters, and intrinsic variability which would remain in the model even if the parameters were known, there is phenomenological uncertainty which makes it dangerous to extrapolate the model outside the range of the data observed. The annual change in the annual maxima may change over the years 1981-1990 for reasons that cannot be known from the data. This makes a prediction based on the data alone possibly meaningless and certainly risky.

3.2 Decomposition of variability
The sum of squares (eqn (7)) has an important geometric interpretation. If we think of Y and Xβ as being points in an n-dimensional space, then minimization of (Y − Xβ)ᵀ(Y − Xβ) amounts to choice of the value of β which minimizes the squared distance between Y and Xβ. The data vector Y lies in the full n-dimensional space, but the fitted point Xβ̂ lies in the p-dimensional subspace of the full space spanned by the columns of X.


Fig. 7. The geometry of least squares. The x-y plane is spanned by the columns of the covariate matrix X, and the least squares estimate minimizes the distance between the fitted value, Ŷ, which lies in the x-y plane, and the data, Y. The x-axis is spanned by a column of ones, and the overall average, Ȳ, minimizes the distance between that axis and the fitted value Ŷ. The extra variation accounted for by the model beyond that in the x-direction is orthogonal to the x-axis. (Figure not reproduced.)

In Fig. 7 the x-y plane is spanned by the columns of X, and the x-axis is spanned by a vector of ones. Since the distance from Y to Ŷ is minimized, Ŷ must be orthogonal to Y − Ŷ. Likewise Ȳ, which is an n × 1 vector all of whose elements are the overall average n⁻¹ΣYj and which therefore lies along the x-axis, is orthogonal to Ŷ − Ȳ. The role of H is to project the n-dimensional space onto the p-dimensional subspace of it spanned by the columns of X, which is the x-y plane in the figure. Now Y = Ŷ + (Y − Ŷ), and by Pythagoras' theorem,

YᵀY = ŶᵀŶ + (Y − Ŷ)ᵀ(Y − Ŷ)

Thus the total sum of squares of the data splits into a component due to the model and one due to the variation unaccounted for by the model.

The variability due to the model can be further decomposed. If we write Ŷ as Ȳ + (Ŷ − Ȳ), it follows from the orthogonality of Ȳ and Ŷ − Ȳ that

YᵀY = ȲᵀȲ + (Ŷ − Ȳ)ᵀ(Ŷ − Ȳ) + (Y − Ŷ)ᵀ(Y − Ŷ)

or equivalently

Σ Yj² = nȲ² + Σ (Ŷj − Ȳ)² + Σ (Yj − Ŷj)²    (14)

TABLE 5
Analysis of variance table for linear regression model

Source                           df      Sum of squares         Mean square
Regression (adjusted for mean)   p − 1   SSReg = Σ(Ŷj − Ȳ)²     SSReg/(p − 1)
Residual                         n − p   SSRes = Σ(Yj − Ŷj)²    SSRes/(n − p)
Total (adjusted for mean)        n − 1
Mean                             1
Total                            n

The degrees of freedom (df) of a sum of squares is the number of parameters to which it corresponds, or equivalently the dimension of the subspace spanned by the corresponding columns of X. The degrees of freedom of the terms on the right-hand side of eqn (14) are respectively 1, p − 1, and n − p.

These sums of squares can be laid out in an analysis of variance table as shown in Table 5. The sum of squares for the regression can be further broken down into reductions due to individual parameters or sets of them. An analysis of variance (ANOVA) table displays concisely the relative contributions to the overall variability accounted for by the inclusion of the model parameters and their corresponding covariates. The last two rows of an analysis of variance table are usually omitted, on the grounds that the reduction in sum of squares due to an overall mean is rarely of interest.

A good model is one in which the fitted value Ŷ is close to the observed Y, so that a high proportion of the overall variability is accounted for by the model and the residual sum of squares is small relative to the regression sum of squares. The ratio of the regression sum of squares to the adjusted total sum of squares can be used to express the proportion of the overall variability accounted for by the fitted regression model. This is numerically equal to the square of the correlation coefficient between the fitted values and the data.
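The decomposition is easily checked numerically; a self-contained sketch (Python with NumPy, on simulated data) follows:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 40
    x = np.linspace(0.0, 10.0, n)
    X = np.column_stack([np.ones(n), x])
    y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n)     # hypothetical data

    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    yhat = X @ beta
    ybar = y.mean()

    ss_mean = n * ybar ** 2
    ss_reg = ((yhat - ybar) ** 2).sum()             # regression (adjusted for mean)
    ss_res = ((y - yhat) ** 2).sum()                # residual
    assert np.isclose((y ** 2).sum(), ss_mean + ss_reg + ss_res)   # eqn (14)

    R2 = ss_reg / (ss_reg + ss_res)                 # proportion of variability explained
    assert np.isclose(R2, np.corrcoef(yhat, y)[0, 1] ** 2)  # squared correlation, as stated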


TABLE 6
Analysis of variance for fit of cubic regression model to Venice sea level data

Source                                       df    Sum of squares    Mean square
Linear                                       1     3552              3552
Quadratic and cubic (adjusted for linear)    2     9·2               4·6
Regression (adjusted for mean)               3     3561·2            1187·1
Residual                                     47    16979             361·26
Total (adjusted for mean)                    50    20540

When the observations have independent normal distributions with common variance σ², the residual sum of squares SSRes has a chi-squared distribution with mean (n − p)σ² and n − p degrees of freedom, and this implies that the residual mean square SSRes/(n − p) is an estimate of σ². If there was in fact no regression on the columns of X after adjusting for the mean, i.e. if in the model

Yj = β0 + β1xj1 + ... + βp−1xj,p−1 + εj

the parameters β1, ..., βp−1 were all equal to zero, the regression sum of squares SSReg would have a chi-squared distribution on p − 1 degrees of freedom, with mean (p − 1)σ². If any of these parameters were non-zero the regression sum of squares would tend to be larger, since more of the overall variability would be accounted for by the model. A statistical implication of the orthogonality of Ŷ − Ȳ and Y − Ŷ is that the corresponding sums of squares are independent, and so if the regression parameters are zero, the statistic

{SSReg/(p − 1)} / {SSRes/(n − p)}

has an F-distribution on p − 1 and n − p degrees of freedom. This can be used to determine if a reduction in sum of squares is too large to have occurred by chance and so is possibly due to non-zero regression parameters.

Example 1 (contd). The analysis of variance table when the cubic model

β0 + β1(t − 1930) + β2(t − 1930)² + β3(t − 1930)³

is fitted to the Venice data is given in Table 6.


The estimate of σ² based on the regression is 1187, as compared to 361·26 based on the residual sum of squares. The corresponding F-statistic is 1187/361·26 = 3·29, which is just significant at the 2·5% level, and gives strong evidence that the average sea level changes with the passage of time. The decomposition of the regression sum of squares, however, shows that the regression effect is due almost wholly to the linear term; the F-statistic for the extra terms is 0·013 = 4·6/361·26, which is not significant at any reasonable level. Figure 6 shows the fitted values for the cubic model. These differ very little from those for the linear model, thus explaining the result of the analysis of variance. The proportion of total variation accounted for by the model is 3561·2/20540 × 100% = 17·3%, corresponding to a correlation coefficient of √0·173 = 0·42 between observations and fitted values. This is not very large, but the line seems to be an adequate summary of the trend in Fig. 6. Perhaps more striking than the trend is the large degree of scatter about the fitted line.

3.3 Model selection
In larger problems than Example 1, there are typically many explanatory variables. The problem then arises of how to select a suitable model from the many that are possible. The simplest model is the minimal model, which summarizes the response purely in terms of its average, a single number. This is the most concise summary of the response possible, but it makes no allowance for any systematic variation which might be explained by explanatory variables. The least concise summary is to use the data themselves as the summary. This uses n numbers to summarize n numbers, so there is no reduction in complexity. The problem of model choice is to find a middle path where as much systematic variation as possible is explained by dependence of the response on the covariates, and the remaining variation is summarized in the random component of the model.

In many situations there is prior knowledge about which covariates are likely to prove important; indeed the sign and the approximate size of a regression coefficient may be known. Sometimes there is a natural order in which covariates should be fitted. With polynomial covariates, as in Example 1, it would make no sense to fit a model with linear and cubic terms but no quadratic term unless there were strong prior grounds for the belief that the quadratic coefficient is zero, perhaps based on theoretical knowledge of the relationship. If a polynomial term is fitted, then all terms of lower order should normally also be fitted. A similar rule applies also to qualitative terms, for reasons illustrated in Section 5.5.


If several covariates measure essentially the same thing, it may be sensible to use their average, or some other function of them, as a covariate, instead of trying to fit them separately. Regardless of this it is essential to plot the response variable against any likely covariates, in order to verify prior assumptions, indicate covariates unlikely to be useful, and to screen the data for unusual observations. It is valuable to check if any pairs of covariates are highly correlated, or collinear, as this can lead to trouble. For example, collinear covariates that are individually significant may both be insignificant if included in the same model. A model with collinear covariates may have very wide confidence intervals for parameter estimates, predictions, and so forth because the matrix XᵀX is close to singular. Weisberg12 gives a good discussion of this. Collinearity often arises in models containing polynomial terms, in which case orthogonal polynomials should be used.

There are broadly three ways to proceed in regression model selection. Forward selection begins by fitting each covariate individually. The model with the intercept and the covariate leading to the largest reduction in residual sum of squares is then regarded as the base model. Each of the other covariates is then added separately to this model, and the residual sum of squares for each model with the intercept, the first covariate chosen, and the other covariates taken separately is recorded. Whichever of these models has the smallest residual sum of squares is then taken as the base model, and each remaining covariate is then added separately to it. The procedure is repeated until a suitable stopping rule applies. Often the stopping rule used is that the residual sum of squares is not reduced significantly by the addition of any covariate not in the model.

Backwards elimination starts with all available covariates fitted, and drops the least significant covariate. The model without this covariate is then fitted, and the least significant covariate dropped; the procedure is repeated until a stopping rule applies. A common stopping rule is that deletion of any remaining covariate leads to a significant increase in the residual sum of squares for the model.

In both forward selection and backward elimination the choice of significant covariates is usually made by reference to tables of the F or t distributions. The two methods may not lead to the same eventual model, and consequently a combination of them known as stepwise regression is available in a number of regression packages. In this procedure four options are considered at each stage: add a variable, delete a variable, swap a variable in the model for one not in the model, or stop. This more complicated algorithm is often used in practice, but like forward selection


and backwards elimination it will not necessarily produce a satisfactory or even a sensible model. Examples are known where automatic procedures find complicated models to fit data that are known to have no systematic structure whatever. The best guide to a satisfactory model is the knowledge the investigator has of his field of application, and automatic output from any regression package should be examined critically.
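For illustration of the mechanics only, a minimal forward-selection sketch follows (Python with NumPy; the F-to-enter threshold of 4·0 and the dictionary interface are arbitrary choices, not recommendations):

    import numpy as np

    def rss(X, y):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        return ((y - X @ beta) ** 2).sum()

    def forward_select(candidates, y, f_in=4.0):
        # candidates: dict mapping a covariate name to its column of values
        n = y.size
        X = np.ones((n, 1))                    # start from the intercept-only model
        chosen, remaining = [], dict(candidates)
        while remaining:
            base = rss(X, y)
            trial = {k: rss(np.column_stack([X, v]), y) for k, v in remaining.items()}
            best = min(trial, key=trial.get)   # largest reduction in residual SS
            p_new = X.shape[1] + 1
            F = (base - trial[best]) / (trial[best] / (n - p_new))
            if F < f_in:                       # no worthwhile reduction: stop
                break
            X = np.column_stack([X, remaining.pop(best)])
            chosen.append(best)
        return chosen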

3.4 Model checking
There are many ways that a regression model can be wrongly specified. Sometimes one or a few observations are misrecorded, and so appear unusual compared to the rest. A problem like this can be called an isolated discrepancy. A systematic discrepancy arises when, for example, an explanatory variable is included in the model but inclusion of its logarithm would be more appropriate, or when the random component of the model is incorrectly specified, by supposing correlated responses to be independent, for example. A range of statistical techniques has been developed to detect and make allowance for such difficulties, and we discuss some of these below. Many of these model-checking techniques are based on the use of residuals.

We saw in Section 3.1 that the vector e = Y − Ŷ of differences between the data and fitted values can be thought of as estimating the unobservable vector ε of errors. Since assumptions about the εj concern the random component of the model, it is natural to use the ej to check those assumptions. One problem with direct use of the ej, however, is that although they have zero means, unlike the true errors they do not have constant variance. Standardized residuals are therefore defined as

Rj = (Yj − Ŷj) / {s(1 − hjj)^{1/2}}

where hjj is the jth diagonal element of the hat matrix H = X(XᵀX)⁻¹Xᵀ. If the model is correct, the Rj should have zero means and approximately unit variance, and should display no forms of non-randomness, the most usual of which are likely to be

(i) the presence of outliers, sometimes due to a misrecorded or mistyped data value which may show up as lying out of the pattern of the rest, and sometimes indicating a region of the space of covariates in which there are departures from the model. Single outliers are likely to be detected by any of the plots described below, whereas multiple outliers may lead to masking difficulties in which each outlier is concealed by the presence of others;


(ii) omission of a further explanatory variable, detected by plotting residuals against that variable;

(iii) incorrect form of dependence on an explanatory variable, for example a linear rather than nonlinear relation, detectable by plotting residuals against the variable;

(iv) correlation between residuals, as in time series data, detected by scatterplots between lagged residuals; and

(v) incorrect assumptions regarding the distribution of the errors εj, detected by plots of residuals against fitted values Ŷ to detect systematic variation of their means or variances with the mean response, or of ordered residuals against expected normal order statistics to detect non-normal errors (a sketch of these computations follows the list).
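The quantities needed for these checks follow directly from the definitions above; a minimal sketch (Python with NumPy and SciPy) is:

    import numpy as np
    from scipy.stats import norm

    def standardized_residuals(X, y):
        n, p = X.shape
        h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverages h_jj
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ beta                               # ordinary residuals
        s2 = (e @ e) / (n - p)
        return e / np.sqrt(s2 * (1 - h)), h            # R_j and h_jj

    def normal_order_statistics(n):
        j = np.arange(1, n + 1)
        return norm.ppf((j - 3 / 8) / (n + 1 / 2))     # plotting positions of the text

    # Plot np.sort(R) against normal_order_statistics(n): a straight line of unit
    # gradient through the origin suggests approximately normal errors.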

Figure 8 shows some possible patterns and their causes. Parts (a)-(d) should show random scatter, although allowance should be made for apparently non-random scatter caused by variable density of points along the abscissa. Plots (e) and (f) are designed to check for outliers and non-normal errors. The idea is that if the Rj are roughly a random sample from the normal distribution, a plot of the ordered Rj against approximate normal order statistics Φ⁻¹{(j − 3/8)/(n + 1/2)} should be a straight line of unit gradient through the origin. Outliers manifest themselves as extreme points lying off the line, and skewness of the errors shows up through a nonlinear plot.

The value of xj may give a case unduly high leverage or influence. The distinction is a somewhat subtle one. An influential observation is one whose deletion changes the model greatly, whereas deletion of an observation with high leverage changes the accuracy with which the model is determined. Figure 9 makes the distinction clearer. The covariates for a point with high leverage lie outwith the covariates for the other observations. The measure of leverage in a linear model is the jth diagonal element of the hat matrix, hjj, which has average value p/n, so an observation with leverage much in excess of this is worth examining.

One measure of influence is the overall change in fitted values when an observation is deleted from the data. Let Ŷ(j) denote the vector of fitted values when Yj is deleted from the data. Then one simple measure of the change from Ŷ to Ŷ(j) is Cook's distance,13 defined as

Cj = (ps²)⁻¹ (Ŷ − Ŷ(j))ᵀ(Ŷ − Ŷ(j))

where Ŷ(j) = Xβ̂(j), and subscript (j) denotes a quantity calculated without the jth observation.


Fig. 8. Examples of residual plots: (a) nonlinear relation between response and x; (b) variance of response increasing with y; (c) null plot; (d) nonlinearity and increasing variance; (e) null normal order statistics plot with possible outlier; and (f) normal order statistics plot showing skewed error distribution. (Figure not reproduced.)

An alternative, more easily calculated form in which to express Cj is

Cj = Rj² hjj / {p(1 − hjj)}

It is hard to give guidance as to when an observation is unduly influential,


Fig. 9. The relation between the leverage and the influence of a point. The light line shows the fitted regression with the point x included, and the heavy line shows the fitted regression with it excluded. In (a) the point has little leverage but some influence on the intercept, though not on the estimate of slope; in (b) the point has high leverage but little influence; and in (c) both the leverage and the influence of the point are high. (Figure not reproduced.)

but if one or two values of Cj are large relative to the rest it is worth refitting the model without such observations in order to see if there are major changes in the interpretation or strength of the regression.

Even though it is an outlier, an observation with high leverage may have a small standardized residual because the regression line passes close to it. This problem can be overcome by use of jackknifed residuals

R′j = (Yj − xjᵀβ̂(j)) / [s(j){1 − hjj}^{1/2}] = Rj {(n − p − 1)/(n − p − Rj²)}^{1/2}

which measure the discrepancy between the observation Yj and the model obtained when Yj is not included in the fitting. The R′j are more useful than the ordinary residuals Rj for detecting outliers.
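Both diagnostics can be computed from the leverages and standardized residuals alone; a self-contained sketch (Python with NumPy), using the closed forms just quoted, is:

    import numpy as np

    def case_diagnostics(X, y):
        n, p = X.shape
        h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)        # leverages h_jj
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ beta
        s2 = (e @ e) / (n - p)
        R = e / np.sqrt(s2 * (1 - h))                        # standardized residuals
        C = R ** 2 * h / (p * (1 - h))                       # Cook's distance
        Rjack = R * np.sqrt((n - p - 1) / (n - p - R ** 2))  # jackknifed residuals
        return C, Rjack, h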

Example 1 (contd). Figures 10-13 display diagnostic plots for the


Fig. 10. Plot of jackknifed residuals, R′j, against fitted values, Ŷj, for Venice data. (Figure not reproduced.)

model of linear trend fitted to the sea level data. Figure 10 shows the jackknifed residuals R′j plotted against the fitted values Ŷj. There is one outstandingly large residual, for the year 1966, but the rest seem reasonable compared to a standard normal distribution. Figure 11 shows the values of Cook's distance Cj plotted against case number. The largest value is for observation 36, which corresponds to 1966, but cases 6 (1936) and 49 (1979) also seem to have large influence on the fitted model.

Fig. 11. Plot of Cook distances, Cj, against case numbers, j, for Venice data. (Figure not reproduced.)


Fig. 12. Plot of leverages, hjj, against case numbers, j, for Venice data. (Figure not reproduced.)

Reinspection of Fig. 10 shows that these observations have the largest positive residuals.

Figure 12 shows a strong systematic pattern in the values of the measures of leverage, hjj. On reflection, the reason for this is clear, namely that in a model Yj = βᵀxj = β0 + β1zj, the matrix (XᵀX)⁻¹ has the form

{n Σ(zj − z̄)²}⁻¹ (  Σzj²   −Σzj )
                 ( −Σzj      n   )

where z̄ = n⁻¹Σzj, and so hkk, which is the kth diagonal element of the

Fig. 13. Plot of ordered residuals against normal order statistics for Venice data. (Figure not reproduced.)


matrix X(XᵀX)⁻¹Xᵀ and so equals xkᵀ(XᵀX)⁻¹xk, can be written in the form

hkk = xkᵀ(XᵀX)⁻¹xk = 1/n + (zk − z̄)²/Σj(zj − z̄)²

Thus hkk consists of a constant plus a term representing the squared distance between zk and the average value of the zj, which gives rise to the quadratic shape of Fig. 12. These two terms may be interpreted as the leverage an observation at zk has on the mean level of the regression line and on the estimate of slope, respectively; an observation which is more extreme in terms of its value of z has a larger leverage on the estimate of slope. In this case the plot of the leverages is uninformative, but in more complicated problems it can be valuable to examine the hjj.

Figure 13 shows ordered standardized residuals plotted against normal order statistics. If the errors εj were independent and normally distributed with constant variance, and the systematic part of the model was correct, this plot would be expected to be a straight line of unit gradient through the origin. However, the plot shows a clear curvature in its upper tail, which implies that the errors arise from a distribution which is skewed to the right. Reinspection of Fig. 10 with hindsight tells the same story. We conclude that the assumption of normal errors is unsuitable for this set of data. This finding calls into question the previous suggestion that the observation for 1966 is an outlier. Though it is unlikely to have arisen from the standard normal distribution, the observation might arise from a positively skewed distribution.

The most important assumption made in regression modelling is usually that the errors εj are independent. Independence is also the hardest assumption to check. The direct effect of correlated errors on the estimates β̂ may be small, provided the systematic part of the model is correctly specified, but the standard errors of the β̂r may be seriously affected. It may be clear on general grounds that the errors may be regarded as independent, but in cases such as Examples 1 and 2 above where the errors may not be independent, scatterplots of consecutive residuals may reveal serial correlation. A formal test of autocorrelation of residuals is based on the Durbin-Watson statistic, which is often available from regression packages.

There can be problems of interpretation if discrepancies arise due to a single observation or a small subset of the data. If there is access to the original data records, it may be possible to account for unusual


observations in terms of faulty recording or transcription of the data. The obvious remedy is then to delete the offending observation(s). The situation where there is no external evidence about the validity of an observation is less straightforward. Then it is often sensible to report analyses with and without the observation(s), especially if conclusions are substantially altered. Evidence from related studies may be useful. Systematic failure of a model at the highest or lowest responses is particularly important, and may indicate severe limitations on its applicability that make extrapolation more than usually unwise.

Reliable conclusions cannot be drawn from statistical methods, however sophisticated, applied to data for which they are unsuitable. A discussion of model adequacy is therefore an essential component in a statistical analysis. Nowadays residuals and the other measures of fit described above are often calculated automatically in regression packages, so the checks described above are easily performed.

3.5 Transformations
One requirement of a successful model is consistency with any known asymptotic behaviour of the phenomenon under investigation. This is especially important if the model is to be used for prediction or forecasting. Many quantities are necessarily non-negative, for example, so a linear model for them can lead to logically impossible negative predictions. One remedy for this is to study the behaviour of the data after a suitable transformation. Even where considerations such as these do not apply, it may be sensible to investigate the possibility of transformation to satisfy model assumptions more closely.

The interpretation and possible usefulness of a linear model are complicated by such factors as

(a) non-constant variance;
(b) asymmetrically distributed errors;
(c) non-linearity in one or more covariates; or
(d) interactions between explanatory variables.

A transformation can sometimes overcome one or more of these difficulties. In many applications a suitable transformation will be obvious on general grounds, or after inspection of the data, or from previous experience of related sets of data. In less clear-cut situations a more formal approach may be helpful, and one possibility is described below.

One class of transformations for positive observations is the power


family. In one version the transformed value of an observation y is14

y^(λ) = (y^λ − 1)/λ   (λ ≠ 0)
      = log y         (λ = 0)

which changes continuously from power to log forms at λ = 0. The idea now is to use the data themselves to determine a value of λ which makes them close to normality. That is, we aim to find the value of λ which makes the model

Y^(λ) = Xβ + ε

most plausible in the sense of maximizing the likelihood. Here Y^(λ) = (y1^(λ), y2^(λ), ..., yn^(λ))ᵀ is the vector of transformed observations.

The density of yj is

f(yj; β, σ², λ) = (2πσ²)^{−1/2} exp{−(yj^(λ) − xjᵀβ)²/(2σ²)} yj^{λ−1}    (15)

where the Jacobian yj^{λ−1} is needed for eqn (15) to be a density for yj rather than yj^(λ). The optimum value of λ is chosen by maximizing the log likelihood

L(β, σ², λ) = Σ_{j=1}^{n} log f(yj; β, σ², λ)

Let ỹ = (∏_{j=1}^{n} yj)^{1/n} denote the geometric mean of the data, and define zj^(λ) to be equal to yj^(λ)/ỹ^{λ−1}. Then the profile log likelihood for λ is

Lp(λ) = max_{β,σ²} L(β, σ², λ) = −(n/2) log SSRes(λ) + constant

where SSRes(λ) is the residual sum of squares when the vector Z^(λ) is regressed on the columns of X. We now plot Lp(λ) as a function of λ, and aim to choose a value of λ which is easily interpreted but close to the maximum of the profile log likelihood.
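A sketch of the profile calculation (Python with NumPy; the grid of λ values is an arbitrary choice):

    import numpy as np

    def profile_loglik(y, X, lambdas):
        # Lp(lambda) = -(n/2) log SSRes(lambda), up to a constant
        n = y.size
        gm = np.exp(np.log(y).mean())           # geometric mean of the data
        values = []
        for lam in lambdas:
            ylam = np.log(y) if lam == 0 else (y ** lam - 1) / lam
            z = ylam / gm ** (lam - 1)          # the z(lambda) defined above
            beta = np.linalg.lstsq(X, z, rcond=None)[0]
            values.append(-0.5 * n * np.log(((z - X @ beta) ** 2).sum()))
        return np.array(values)

    # Evaluate over, say, np.linspace(-2, 2, 41) and choose an easily interpreted
    # value of lambda (such as -1, 0, 1/2 or 1) near the maximum.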

Example 1 (contd). Figure 14 shows the profile log likelihood for λ for the Venice data. It suggests that the inverse transformation λ = −1 or the log transformation λ = 0 would be more sensible than the identity transformation λ = 1. The interpretation of the parameters β0 and β1 would change entirely if either of these scales were used. If the logarithmic scale was used the mean sea level would increase by a multiplicative factor exp(β1) each year, for example, which seems harder to interpret physically than


Fig. 14. Profile log likelihood for power transformation for Venice data. (Figure not reproduced.)

a constant additive change. Analysis of transformed data is unsatisfactory in this example because of these difficulties of interpretation.

A traditional use of transformations is in the analysis of counts. Small counts are often modelled as Poisson random variables, with density function

f(y; μ) = μ^y e^{−μ}/y!,   y = 0, 1, ...    (16)

the mean and variance of which are both μ. In the case of non-constant variance we might aim to find a transformation h(Y) whose variance is constant. Taylor series expansion gives h(Y) ≈ h(μ) + (Y − μ)h′(μ), so E{h(Y)} ≈ h(μ) and var{h(Y)} ≈ h′(μ)² var(Y) = μh′(μ)². If this is to be constant, we must have h(Y) ∝ Y^{1/2}. A more refined calculation shows that h(Y) = (Y + 1/4)^{1/2} has approximate variance 1/4. The procedure would now be to analyse the transformed counts as variables with known variance. Difficulties of interpretation like those experienced for the Venice data arise, however, because a linear model for the square root of the count suggests that the count itself depends quadratically on the explanatory variables through the linear part of the model. This would seem highly artificial in most circumstances, and a more satisfactory approach is given in Section 5.
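A quick simulation (Python with NumPy; the means and sample size are arbitrary) illustrates the variance-stabilizing behaviour:

    import numpy as np

    rng = np.random.default_rng(2)
    for mu in (2.0, 5.0, 20.0, 100.0):
        counts = rng.poisson(mu, size=200_000)
        print(mu, np.var(np.sqrt(counts + 0.25)))  # approaches 1/4 as mu grows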


3.6 Weighted least squares
One immediate generalization of the results of Section 3.1 is to a situation where different observations $Y_j$ have different variances. We suppose that $\mathrm{var}(Y_j) = \sigma^2/w_j$, where $w_j$ is the weight ascribed to $Y_j$. The least squares estimate of $\beta$ is then $\hat{\beta} = (X^T W X)^{-1} X^T W Y$, where W is the matrix with jth diagonal element $w_j$ and zeros elsewhere. The covariance matrix of $\hat{\beta}$ is $\sigma^2 (X^T W X)^{-1}$, and the estimate of $\sigma^2$ is given by

$$ s^2 = \frac{1}{n-p} \sum_{j=1}^{n} w_j \left( y_j - x_j^T \hat{\beta} \right)^2 $$

One case where such a model is appropriate is when $Y_j$ is an average of $m_j$ observations, so $\mathrm{var}(Y_j) = \sigma^2/m_j$. An example with other instructive features is as follows.
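In matrix terms the weighted fit is a short computation. The sketch below (numpy; the data are simulated, with each response treated as an average of $m_j$ observations and the $m_j$ used as weights) implements the estimates just given:

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """WLS estimate beta = (X^T W X)^{-1} X^T W y with W = diag(w),
    plus s^2 = sum_j w_j (y_j - x_j^T beta)^2 / (n - p)."""
    XtWX_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    beta = XtWX_inv @ X.T @ (w * y)
    n, p = X.shape
    s2 = np.sum(w * (y - X @ beta) ** 2) / (n - p)
    return beta, s2 * XtWX_inv, s2   # estimate, covariance of beta, s^2

# Simulated example: y_j is the mean of m_j observations, var(y_j) = sigma^2/m_j
rng = np.random.default_rng(2)
n = 30
m = rng.integers(1, 20, n).astype(float)   # differing numbers of observations
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n) / np.sqrt(m)
beta, cov, s2 = weighted_least_squares(X, y, m)
print("estimates:", beta, " s^2:", s2)
```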

Example 5.15 In studies of pollutant dispersal from a point source, data on pollutant levels may be available at a number of spatial locations. Exposures at each individual location may be summarized by fitting a suitable distribution,16 so that a vector of parameter estimates for that distribution, $\hat{\gamma}_j$ say, summarizes the $m_j$ observations at the jth site. In almost all cases the estimates will have a joint asymptotic covariance matrix of form $m_j^{-1} M(\gamma_j)$. Variation in the entries of M is often small compared to variation in the $m_j$, which will be heavily dependent on the distance of the location from the source. Thus the uncertainty attached to the fitted distribution may differ substantially from one location to another.

Although the values of the $\hat{\gamma}_j$ may be of interest in themselves, it will often be useful to summarize how the distributions depend on such factors as distance from the source, meteorological factors, and pollutant characteristics. Variation in the estimates $\hat{\gamma}_j$ may then be of only indirect interest. The Weibull distribution, for example, is often parametrized so that its probability density function is

$$ f(y; \gamma_1, \gamma_2) = \frac{\gamma_2}{\gamma_1} \left( \frac{y}{\gamma_1} \right)^{\gamma_2 - 1} \exp\left\{ -\left( \frac{y}{\gamma_1} \right)^{\gamma_2} \right\}, \quad y > 0; \ \gamma_1, \gamma_2 > 0 $$

but in applications the parameters of direct interest are usually the mean and variance


$$ \kappa_1 = \gamma_1 \Gamma(1 + 1/\gamma_2), \qquad \kappa_2 = \gamma_1^2 \left\{ \Gamma(1 + 2/\gamma_2) - \Gamma(1 + 1/\gamma_2)^2 \right\} $$

of the distribution of exposures, where

$$ \Gamma(z) = \int_0^\infty u^{z-1} e^{-u} \, du $$

is the gamma function. Since the mean and variance $\kappa_1$ and $\kappa_2$ are generally more capable of physical interpretation than $\gamma_1$ and $\gamma_2$, it makes sense to model their variation directly. A form whereby $\kappa_1$ and $\kappa_2$ depend on the covariates may be suggested by exploration of the data, but more satisfactory covariates are likely to be derived if a heuristic argument for their form based on physical considerations can be found. Once suitable combinations of covariates are found, the estimated mean and variance $\hat{\kappa}_{j1}$ and $\hat{\kappa}_{j2}$ can be regressed on them with weights $m_j$. A more refined approach would use the covariance matrices $\hat{M}_j$ and multivariate regression17 for the pairs $(\hat{\kappa}_{j1}, \hat{\kappa}_{j2})$.

This example illustrates two general points. The first is that in combining estimates from different samples with differing sample sizes it is necessary to take the sample sizes into account. The second is that when empirical distributions are replaced by fitted probability models it is wise to present results in a parametrization of the model which is readily interpreted on subject-matter grounds, and this may not coincide with a mathematically convenient parametrization.

Section 5 deals with further generalizations of the linear model.

4 ANALYSIS OF VARIANCE

The idea of decomposition of the variability in a dataset arose in Section 3 in the context of linear regression. This section discusses the corresponding decomposition when the explanatory variables represent qualitative effects rather than quantitative covariates. The simplest example is the one-way layout. Suppose that m independent measurements have been taken at each of p different sites, so that n = mp observations are available in all. If the variances of the observations at the sites are equal, the jth observation at the rth site may be written

$$ Y_{jr} = \beta_r + \varepsilon_{jr}, \quad j = 1, \ldots, m, \ r = 1, \ldots, p \quad (17) $$

where the $\varepsilon_{jr}$ are supposed to be independent errors with zero means and variances $\sigma^2$, and the parameter $\beta_r$ represents the mean at the rth site. In


terms of matrices this model may be expressed as

$$ \begin{pmatrix} y_{11} \\ \vdots \\ y_{m1} \\ y_{12} \\ \vdots \\ y_{m2} \\ \vdots \\ y_{mp} \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_{11} \\ \vdots \\ \varepsilon_{m1} \\ \varepsilon_{12} \\ \vdots \\ \varepsilon_{m2} \\ \vdots \\ \varepsilon_{mp} \end{pmatrix} \quad (18) $$

which has the same linear structure, $Y = X\beta + \varepsilon$, as eqn (6). Here however the columns of the $n \times p$ matrix X are dummy variables indicating the presence or absence of a parameter in the linear expression for the corresponding response variable. The qualitative nature of the columns of the covariate matrix in eqn (18) contrasts with eqn (5), where the covariates represent the quantitative effect of time on the mean response.

The qualitative nature of the explanatory variables has no effect on the algebra of least squares estimation, however, and estimates of the $\beta_r$ are obtained as before. Thus $\hat{\beta} = (X^T X)^{-1} X^T Y$, and in fact the estimate of $\beta_r$ for the one-way layout is the obvious one, namely the average observed at the rth site, $\bar{y}_{\cdot r} = m^{-1} \sum_j y_{jr}$. The results in Section 3 on confidence intervals, tests, model-checking, and so forth apply directly because the linear structure of the model is unaffected by the nature of the covariates.

The model which asserts that the mean observation is the same for each site contains a single parameter $\beta$ representing the overall mean, and is a restriction of eqn (17) that insists that $\beta_1 = \beta_2 = \cdots = \beta_p = \beta$. The matrix of covariates for this model consists of a single column of ones, and the estimate of $\beta$ is the overall average $\bar{y} = n^{-1} \sum_{j,k} y_{jk}$. This model is said to be nested within the model represented by eqn (18), which reduces to it when the parameters of eqn (18) are constrained to be equal. The difference in degrees of freedom between the models is $p - 1$, and the analysis of variance table is given in Table 7.

TABLE 7
Analysis of variance table for a one-way layout

Source                               df          Sum of squares                                                        Mean square
Between sites (adjusted for mean)    $p-1$       $SS_{Reg} = \sum_{jr} (\bar{y}_{\cdot r} - \bar{y}_{\cdot\cdot})^2$   $SS_{Reg}/(p-1)$
Within sites                         $p(m-1)$    $SS_{Res} = \sum_{jr} (y_{jr} - \bar{y}_{\cdot r})^2$                 $SS_{Res}/\{p(m-1)\}$
Total (adjusted for mean)            $mp-1$      $\sum_{jr} (y_{jr} - \bar{y}_{\cdot\cdot})^2$

In this table $SS_{Reg}$ represents variability due to differences between sites

adjusted for the overall mean, and $SS_{Res}$ represents variability at sites adjusted for their different means. This corresponds to the decomposition

$$ y_{jr} = \bar{y}_{\cdot\cdot} + (\bar{y}_{\cdot r} - \bar{y}_{\cdot\cdot}) + (y_{jr} - \bar{y}_{\cdot r}) $$

in which the terms on the right-hand side correspond successively to the overall mean, the difference between the site mean and the overall mean, and the difference of the jth observation at the rth site from the mean there. If the differences between sites are non-zero, $SS_{Reg}$ will be inflated relative to its probable value if there were no differences. Assessment of whether between-site differences are important is based on

$$ \frac{SS_{Reg}/(p-1)}{SS_{Res}/\{p(m-1)\}} $$

which would have an F distribution with $p - 1$ and $p(m - 1)$ degrees of freedom if there were no differences between sites and the errors were independent and normal.
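A minimal numerical sketch of this one-way analysis (numpy and scipy; the site means and sample sizes are invented for illustration):

```python
import numpy as np
from scipy.stats import f as f_dist

def one_way_anova(y):
    """One-way ANOVA for a balanced layout: y has shape (m, p),
    column r holding the m observations at site r."""
    m, p = y.shape
    grand_mean = y.mean()
    site_means = y.mean(axis=0)
    ss_reg = m * np.sum((site_means - grand_mean) ** 2)   # between sites
    ss_res = np.sum((y - site_means) ** 2)                # within sites
    f_stat = (ss_reg / (p - 1)) / (ss_res / (p * (m - 1)))
    p_value = f_dist.sf(f_stat, p - 1, p * (m - 1))
    return ss_reg, ss_res, f_stat, p_value

rng = np.random.default_rng(3)
true_means = np.array([5.0, 5.0, 7.0, 9.0])        # p = 4 sites
y = rng.normal(true_means, 1.0, size=(10, 4))      # m = 10 per site
print(one_way_anova(y))
```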

Equation (17) represents one way to write down the model of different means at each site. Another possibility is

$$ Y_{jr} = \alpha + \gamma_r + \varepsilon_{jr}, \quad j = 1, \ldots, m, \ r = 1, \ldots, p \quad (19) $$

in which the overall mean is represented by $\alpha$ and the parameters $\gamma_1, \ldots, \gamma_p$ represent the differences between the site means and the overall mean. This formulation of the problem contains $p + 1$ parameters, whereas eqn (17) contains p parameters. Plainly $p + 1$ parameters cannot be estimated from data at p sites, and at first sight it seems that eqns (17) and (19) are incompatible. This difficulty is resolved by noting that only certain linear combinations of parameters can be estimated from the data; such combinations are called estimable. In eqn (17) the parameters $\beta_1, \ldots, \beta_p$ are estimable, but in eqn (19) only the quantities $\alpha + \gamma_1, \alpha + \gamma_2, \ldots, \alpha + \gamma_p$ are estimable, so that estimates of the parameters $\alpha, \gamma_1, \ldots, \gamma_p$ cannot all be found without some constraint to ensure that the estimates are unique. Possible constraints that can be imposed are:

(a) $\hat{\alpha} = 0$, which gives the same estimates as eqn (17);
(b) $\hat{\gamma}_1 = 0$, so that $\hat{\alpha}$ represents the mean at site 1, and $\hat{\gamma}_r$ represents the difference between the mean at site r and at site 1; and
(c) $\sum_r \hat{\gamma}_r = 0$, so that $\hat{\alpha}$ is the average of the site means and $\hat{\gamma}_r$ is the difference between the mean at the rth site and the overall mean.


Equation (19) leads to an $n \times (p+1)$ matrix of covariates

$$ X = \begin{pmatrix} 1 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{pmatrix} \quad (20) $$

which has rank p because its first column equals the sum of the other columns. Thus the $(p+1) \times (p+1)$ matrix $X^T X$ is not invertible and $\hat{\beta} = (X^T X)^{-1} X^T Y$ cannot be found. This difficulty can be overcome by the use of generalized inverse matrices, but it is more satisfactory in practice to force X to have rank p by dropping one of its columns. Constraints (a) and (b) above correspond respectively to dropping the first and second columns of eqn (20).

It is important to appreciate that the parametrization in which a model is expressed has no effect on the fitted values obtained from the model, which are the same whatever parametrization is used. Thus the fitted value for an observation at the rth site in the example above is the average at that site, $\bar{y}_{\cdot r}$, for parametrization (17) or (19) and any of the constraints (a), (b) or (c). The parametrization used affects only the interpretation of the resulting estimates and not any aspect of model fit.
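This invariance is easy to verify numerically. The sketch below (numpy; simulated data) fits the dummy-variable parametrization (17) and the parametrization (19) under constraint (b), and confirms that the fitted values agree:

```python
import numpy as np

rng = np.random.default_rng(4)
m, p = 6, 3
site = np.repeat(np.arange(p), m)                # site label per observation
y = rng.normal(np.repeat([4.0, 6.0, 5.0], m), 1.0)

# Parametrization (17): one dummy column per site
X1 = (site[:, None] == np.arange(p)).astype(float)
# Parametrization (19) under constraint (b): intercept plus dummies for
# sites 2..p (the dummy for site 1 is dropped)
X2 = np.column_stack([np.ones(m * p), X1[:, 1:]])

fits = [X @ np.linalg.lstsq(X, y, rcond=None)[0] for X in (X1, X2)]
print(np.allclose(fits[0], fits[1]))             # True: identical fitted values
```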

Example 6. As part of a much larger investigation into the atmospheric chemistry of vehicle emissions, data were collected on a number of substances in the atmosphere at the side of a London road. Table 8 gives hourly average levels of toluene (ppb) for the hours 8-9, 9-10, ..., 15-16, measured on each day over a 2 week period. Here we shall be interested in modelling temporal changes in toluene levels, without regard to other pollutants, although in practice the levels of different pollutants are closely related. Some of the observations are missing due to machine breakdown and other effects.


TABLE 8
Hourly average toluene levels (ppb) in London street

                     Start of hour
               8      9      10     11     12     13     14     15
Week 1
  Sunday      4·56   4·05   4·22   9·82   4·84   6·59   4·95   5·41
  Monday     19·55   9·48   7·24
  Tuesday    11·86  15·13  12·13   7·67   7·08   9·55   7·21   8·31
  Wednesday  21·68  22·13  18·40  12·16  13·35  11·42  13·80  10·13
  Thursday   20·01  28·15  19·59  24·39  25·18  20·44  23·06  26·96
  Friday     45·42  57·78  42·45  74·78  75·16  61·90  46·25  52·38
  Saturday    4·23   8·65  15·64  14·29  17·02  23·22  18·35  14·14

Week 2
  Sunday      4·46   5·59   6·48   5·86   6·27   3·68   3·93   6·29
  Monday      5·41   5·41   6·37   7·51   6·40   6·01   6·01   6·22
  Tuesday    10·80  11·27  10·34  11·62  11·41  11·32  12·43  15·60
  Wednesday   5·90  29·20  21·81  32·61  18·19  18·95  19·05  19·41
  Thursday   12·62  14·33  10·24   8·66  14·69  10·50  14·65  14·83
  Friday      8·46  11·61  11·90  11·94  10·12  11·41   9·28  11·51
  Saturday   15·88  14·37  12·81  12·17  12·97  13·09  12·72

One approach to detailed analysis of such data would be via time series analysis, but this is problematic here since the data consist of a number of short series with large gaps intervening. The aim instead is to summarize the sources of variability by investigating the relative importance of hourly variation and daily variation. Hourly variation might be due to variable traffic intensity, as well as diurnal changes in temperature and other meteorological variables, whereas day-to-day variation may be due to different traffic intensity on different days of the week as well as the weather.

Some possible models for the data in Table 8 are that $Y_{wdt}$, the toluene level observed at time t of day d in week w, has an expected value given by

$$ \alpha, \qquad \alpha + \beta_d, \qquad \alpha + \beta_d + \gamma_t, \qquad \alpha + \beta_{wd} + \gamma_t \quad (21) $$


TABLE 9
Analysis of variance for toluene data

Source                       df    Sum of squares    Mean square
Day of week                   6          8263           1373
Time of day                   7           196·7          28·1
Days                          7          9254           1322
Residual                     85          2116·8          24·9
Total (adjusted for mean)   105         19830

The parameter $\alpha$ corresponds to an overall mean, $\beta_d$ to the difference between the mean on day d and the overall mean, $\gamma_t$ to the difference between the mean at time t and the overall mean, and $\beta_{wd}$ to the difference between the mean on day d of week w and the overall mean. The first model in (21) is that the data all have the same mean. The second is that there is a day-of-the-week effect but no variation between hours on the same day; note that this implies that variation is due to different traffic patterns on different days of the week, but that all weeks have the same pattern. The third model is that there is a day-of-the-week effect, and also variation between times of the day. The fourth model is that there is variation between times, but that the between-day variation is due not to causes which remain unvarying from week to week, but to other causes, possibly overall meteorological changes. This would not imply that every Monday, Tuesday, and so forth had the same average toluene levels.

Table 9 shows the analysis of variance when the models in (21) are successively fitted to the data. The seven extra degrees of freedom for days correspond to the extra day parameters which are fitted in the final model in (21) but not in the previous one. The table shows large differences due to days, but no significant effect for times of the day once differences between days are taken into account. The between-day variation is not due solely to a day-of-the-week effect, as the final model in (21) gives a big reduction in overall sum of squares. The between-day mean squares in the table are very large compared to the residual mean square, which gives an estimate of $\sigma^2$ of 24·9 on 85 degrees of freedom.

Figure 15 shows the average levels of toluene for the different days. The two lowest values are for Sundays, which supports the contention that the levels are related to traffic intensity, but the variability between days is very substantial, and no definite weekly pattern emerges. Considerably more



Fig. 15. Daily average toluene level (ppb). Sundays (days 1 and 8) have the lowest values.

data would be required to discern a day-of-the-week effect with any confidence. The estimates of the hour-of-the-day effects are all ±2 ppb or so, and are very small compared to the variation between days.

Figure 16 shows a plot of standardized residuals against fitted values for the final model in (21). There is an increase of variability with fitted value, and this implies that the assumption of constant variance is inappropriate. This example is returned to in Section 5.

5 GENERALIZED LINEAR MODELS

5.1 The components of a generalized linear model
Linear regression is the most popular tool in applied statistics. However, there are difficulties in using it with data in the form of counts or proportions, not because it cannot be applied to the data, perhaps suitably transformed, but because of difficulties in the interpretation of parameter estimates. An important extension of linear models is the class of generalized linear models, in which this problem may be overcome.6,18

The normal linear model with normal-theory assumptions outlined in Section 3.1 can be thought of as having three components:

(l) the observation Y has normal density with mean f1 and variance 0'2;

(2) a linear predictor 1'/ = xT fJ through which covariates enter the model; and

(3) a function linking f1 and 1'/, which in this case is f1 = 1'/.

In a generalized linear model, (1) and (3) are replaced by:



Fig. 16. Residual plot for fit of linear model with normal errors to toluene data. The tendency for variability to increase with fitted value suggests that the assumption of constant variance is inappropriate.

(1′) the observation Y has a density function of form

$$ f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} \quad (22) $$

from which it can be shown that the mean and variance of Y are $\mu = b'(\theta)$ and $a(\phi) b''(\theta)$; and

(3′) the mean $\mu$ of Y is connected to the linear predictor $\eta$ by $\eta = g(\mu)$, where the link function $g(\cdot)$ is a differentiable monotonic increasing function.

The Poisson, binomial, gamma and normal distributions are among those which can be expressed in the form (22). These allow for the modelling of data in the form of counts, proportions of counts, positive continuous measurements, or unbounded continuous measurements respectively.

We write $a(\phi) = \phi/w$, where $\phi$ is the dispersion parameter and w is a weight attached to the observation. The quantity $a(\phi)$ is related to the second parameter which appears in the binomial, normal, and gamma


distributions. If $Y_j$ is the average of $m_j$ independent normal variables each with variance $\sigma^2$, for example, we write $a(\phi) = \phi/w_j$ for the jth observation, with $\phi = \sigma^2$ and $w_j = m_j$, to take account of the fact that $Y_j$ has weight $m_j$.

The parameter $\theta$ in eqn (22) is related implicitly to the linear predictor $\eta = x^T \beta$. Since $\mu = b'(\theta)$ and the linear predictor is $\eta = g(\mu)$ in terms of the mean, $\mu$, we see that $\eta = g\{b'(\theta)\}$. If we invert this relationship we can write $\theta$ as a function of $\eta$. Hence eqn (22) can be expressed as a function of $\beta$ and x, which are of primary interest, rather than of $\theta$, which enters mainly for mathematical convenience. Thus the logarithm of eqn (22) can be written as

$$ \ell(\eta, \phi) = \frac{y\,\theta(\eta) - b\{\theta(\eta)\}}{a(\phi)} + c(y, \phi) \quad (23) $$

in terms of $\eta = x^T \beta$, where $\theta(\eta)$ denotes that $\theta$ is regarded here as a function of $\eta$ and through $\eta$ of $\beta$. This expression arises below in connection with estimation of the parameters $\beta$.

For each density which can be written in the form of eqn (22), there is one link function for which the model is particularly simple. This, the canonical link function, for which $\theta(\eta) = \eta$, is of some theoretical importance.

The value of the link function is that it can remove the need for the data to be transformed in order for a linear model to apply. Consider data where the response consists of counts, for example. Such data usually have variance proportional to their mean, which suggests that suitable models may be based on the Poisson distribution (eqn (16)). Direct use of the linear model (eqn (6)) would, however, usually be inappropriate for two reasons. Firstly, if the counts varied substantially in size the assumption of constant variance would not apply. This could of course be overcome by use of a variance-stabilizing transformation, such as the square root transformation derived in Section 3.5, but the difficulties of interpretation mentioned there would arise. Secondly, a linear model fitted to the counts themselves would lead to the unsatisfactory possibility of negative fitted means. The link function can remove these difficulties by use of the Poisson distribution with mean $\mu = e^\eta$, which is positive whatever the value of $\eta$. This model corresponds to the logarithmic link function, for which $\eta = \log \mu$. This is the canonical link function for the Poisson distribution. When this link function is used, the effect of increasing $\eta$ by one unit is to increase the mean value of the response by a factor e. Such a model is known as a log-linear model. Since the Poisson probability


density function (16) can be written as

$$ f(y) = \exp(y \log \mu - \mu - \log y!) $$

we see by comparison with eqn (22) that $\theta = \log \mu$, $b(\theta) = e^\theta$, $a(\phi) = 1$, and $c(y, \phi) = -\log y!$. The mean and variance of Y both equal $\mu = e^\theta$.

It has already been pointed out that distributions whose density function is of form eqn (22) have mean $\mu = b'(\theta)$ and variance $a(\phi) b''(\theta)$. It follows that since $\theta$ can be expressed as $b'^{-1}(\mu)$, the variance can be expressed in terms of the mean $\mu$ as $a(\phi) b''\{b'^{-1}(\mu)\} = a(\phi) V(\mu)$, say, where $V(\mu)$ is the variance function, an important characteristic of the distribution. The mean and variance of the Poisson distribution are equal, so that this distribution has $V(\mu) = \mu$. Similarly $V(\mu) = 1$ for the normal distribution, for which there is no relation between the mean and variance. However data arise with a variety of mean-variance relationships, and it is useful that several of these are encompassed by models of form eqn (22).

It is frequently observed that the ratio of the variance to the square of the mean is roughly constant across related samples of positive continuous data. A distribution with this property is the gamma distribution. Its probability density function is

$$ f(y; \mu, \nu) = \frac{1}{\Gamma(\nu)} \frac{\nu^\nu}{\mu} \left( \frac{y}{\mu} \right)^{\nu - 1} \exp(-\nu y/\mu), \quad (y > 0; \ \mu, \nu > 0) $$

The two parameters $\mu$ and $\nu$ are respectively the mean and shape parameter of the distribution, which is flexible enough to take a variety of shapes. For $\nu < 1$ it has the shape of a reversed 'J', but with a pole at $y = 0$; for $\nu = 1$, the distribution is exponential; and when $\nu > 1$ the distribution is peaked, approaching the characteristic bell-shaped curve of the normal distribution for large $\nu$. The gamma distribution has variance $\mu^2/\nu$, so its variance function is quadratic, i.e. $V(\mu) = \mu^2$. Comparison with eqn (22) shows that $a(\phi) = 1/\nu$, $\theta = -1/\mu$, $b(\theta) = -\log(-\theta)$, and $c(y; \phi) = \nu \log(\nu y) - \log(y) - \log \Gamma(\nu)$. Various link functions are possible, of which the logarithmic, for which $g(\mu) = \log \mu = \eta$, or the canonical link, the inverse, with $g(\mu) = 1/\mu = \eta$, are most common in practice.

Example 6 (contd). We saw in Section 4 that the variance of the toluene data apparently depends on the mean. Figure 17 shows a plot of log variance against log mean for each of the 14 days. The slope is close to two, which suggests that the variance is proportional to the square of the mean and that the gamma distribution may be suitable. The conclusion that between-day variation is important but that between-time


Fig. 17. Plot of log variance against log average for each day for toluene data. The linear form of the plot shows a relation between the average and variance. The slope is about 2, indicating that the coefficient of variation is roughly constant.

variation is not important is again reached when the models with gamma error distribution, log link function and linear predictors in eqn (21) are fitted to the data. Residual plots show that this model gives a better fit to the data than a model with normal errors.

Apart from the normal, Poisson and gamma distributions, the most common generalized linear models used in practice are based on the binomial distribution, an application of which is described in Section 5.5.

5.2 Estimation
The parameters $\beta$ of a generalized linear model are usually estimated by the method of maximum likelihood. The log likelihood is

$$ L(\beta) = \sum_{j=1}^{n} \log f\{y_j; \theta_j(\beta), \phi\} \quad (24) $$

where f is defined in eqn (22). If we assume that $\phi$ is known, a maximum of the likelihood is determined by the likelihood equation

$$ \frac{\partial L}{\partial \beta} = 0 \quad (25) $$

which must be solved iteratively. For this problem and many others,


Newton's method can be rewritten as a weighted least squares algorithm19 in which the weights are updated at each iteration. The derivation, given below, involves more matrix algebra than other sections of this chapter and may be omitted at a first reading.

The fitting algorithm is derived as follows. First, note that eqn (25) can be written as

$$ \frac{\partial \eta^T}{\partial \beta} \frac{\partial L}{\partial \eta} = 0 \quad (26) $$

evaluated at the overall maximum likelihood estimate $\hat{\beta}$. The jth element of the $n \times 1$ vector of linear predictors, $\eta$, is $x_j^T \beta$, so $\eta = X\beta$, and therefore the $p \times n$ matrix $\partial \eta^T/\partial \beta$ equals $X^T$. Also the $n \times 1$ vector $\partial L/\partial \eta$ has jth element

$$ \frac{\partial \log f\{y_j; \theta_j(\eta_j), \phi\}}{\partial \eta_j} = \frac{\partial \theta_j}{\partial \eta_j} \, \frac{y_j - b'(\theta_j)}{a(\phi)} \quad (27) $$

Newton's method applied to eqn (25) involves a first-order Taylor series expansion of $\partial L/\partial \beta$ about an initial value of $\beta$. Thus

$$ 0 = \frac{\partial L}{\partial \beta}(\hat{\beta}) \approx \frac{\partial L}{\partial \beta}(\beta) + \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} (\hat{\beta} - \beta) \quad (28) $$

where $\partial^2 L(\beta)/\partial \beta \, \partial \beta^T$ is the $p \times p$ matrix whose (r, s) element is

$$ \frac{\partial^2 L(\beta)}{\partial \beta_r \, \partial \beta_s} = \sum_{j=1}^{n} x_{jr} x_{js} \frac{\partial^2 \log f\{y_j; \theta_j(\eta_j), \phi\}}{\partial \eta_j^2} $$

and $x_{jr}$ is the rth element of the $p \times 1$ covariate vector $x_j$, and so forth. In the derivation of Newton's method, the right-hand side of eqn (28) is now rearranged by writing

$$ \hat{\beta} \approx \beta - \left\{ \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right\}^{-1} \frac{\partial L}{\partial \beta}(\beta) \quad (29) $$

whose right-hand side depends only on quantities evaluated at $\beta$. However, there can be numerical difficulties associated with the use in the algorithm of the $p \times p$ matrix of second derivatives

$$ -\frac{\partial^2 L}{\partial \beta \, \partial \beta^T} $$

and it is replaced by its expected value, which can be written in the form $X^T W X$, where W is the $n \times n$ matrix with zeros off the diagonal and jth


diagonal element

$$ w_j = E\left[ -\frac{\partial^2 \log f\{y_j; \theta_j(\eta_j), \phi\}}{\partial \eta_j^2} \right] $$

For distributions whose density function can be written in the form eqn (22), it turns out that $w_j$ equals $(\partial \mu_j/\partial \eta_j)^2 / \{a(\phi) V(\mu_j)\}$ in terms of the mean and variance function $\mu_j$ and $V(\mu_j)$ of the jth observation.

From eqns (26), (27) and (29) we see that

$$ \hat{\beta} \approx \beta + (X^T W X)^{-1} X^T \frac{\partial L}{\partial \eta} = (X^T W X)^{-1} \left( X^T W X \beta + X^T \frac{\partial L}{\partial \eta} \right) = (X^T W X)^{-1} X^T W z \quad (30) $$

where

$$ z = X\beta + W^{-1} \frac{\partial L}{\partial \eta} = \eta + W^{-1} (y - \mu)/a(\phi) $$

is an $n \times 1$ vector known as the adjusted dependent variable. We see that $\hat{\beta}$ is obtained as the result of applying the weighted least squares algorithm of Section 3.6 with matrix of covariates X, weight matrix W, and response variable z. The solution to eqn (25) is not usually obtained in a single step, and the value of $\hat{\beta}$ is obtained by repeated application of eqn (30). This iterative weighted least squares algorithm can be set out as follows:

(1a) First time through, calculate initial values for the mean vector $\mu$ and the linear predictor vector $\eta$ based on the observed y. Go to (2).
(1b) If not the first time through, calculate $\mu$ and $\eta$ from the current $\hat{\beta}$.
(2) Calculate the weight matrix W and the adjusted dependent variable z from $\mu$ and $\eta$.
(3) Regress z on the columns of X using weights W, to obtain a current vector of parameter estimates $\hat{\beta}$.
(4) Decide whether to stop based on the change in $L(\hat{\beta})$ or in $\hat{\beta}$ from the previous iteration; if continuing, go to (1b) and repeat until the change in $L(\hat{\beta})$ or in $\hat{\beta}$ between two successive iterations is sufficiently small.

This algorithm gives a flexible and rapid method for maximum likelihood estimation in a wide variety of regression problems.19
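The following is a minimal sketch of this algorithm for the special case of a Poisson log-linear model, where $w_j = \mu_j$ and $z_j = \eta_j + (y_j - \mu_j)/\mu_j$; the data are simulated placeholders:

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Iterative weighted least squares for a Poisson log-linear model.

    Each pass regresses the adjusted dependent variable
    z = eta + (y - mu)/mu on the columns of X with weights w = mu.
    """
    mu = np.maximum(y, 0.5)          # step (1a): initial values from the data
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = mu                                   # step (2): weight matrix
        z = eta + (y - mu) / mu                  # adjusted dependent variable
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X),
                                   X.T @ (w * z))              # step (3)
        if np.max(np.abs(beta_new - beta)) < tol:              # step (4)
            return beta_new
        beta = beta_new
        eta = X @ beta                           # step (1b): update mu and eta
        mu = np.exp(eta)
    return beta

# Simulated placeholder data
rng = np.random.default_rng(5)
x = rng.uniform(0, 2, 200)
X = np.column_stack([np.ones(200), x])
y = rng.poisson(np.exp(0.5 + 0.8 * x))
print("estimates:", irls_poisson(X, y))          # should be near (0.5, 0.8)
```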


Estimation of the dispersion parameter $\phi$, where necessary, is discussed in Ref. 6. The dispersion parameter is known for the Poisson and binomial distributions, and for normal data $\phi = \sigma^2$ is estimated, as usual, using the residual sum of squares. For the gamma distribution, estimation of $\phi$ is equivalent to estimation of the shape parameter $\nu$.

Confidence intervals for the components of $\beta$ are based on the result that in large samples $\hat{\beta}_r$ is approximately normally distributed with mean $\beta_r$ and variance $v_{rr}$, where $v_{rr}$ is the rth diagonal element of the $p \times p$ matrix $(X^T W X)^{-1}$, evaluated at $\hat{\beta}$. Similarly the approximate covariance of $\hat{\beta}_r$ and $\hat{\beta}_s$ is the (r, s) element of $(X^T W X)^{-1}$ evaluated at $\hat{\beta}$. These results generalize those in Section 3.1.

5.3 The deviance and model fit
In a linear model with normal errors the residual sum of squares is used to measure the effect of adding covariates to the model. For a generalized linear model the corresponding quantity for a model with parameter estimates $\hat{\beta}$ is the deviance, which is defined as

$$ D(\hat{\beta}) = 2\phi \sum_{j=1}^{n} \{\tilde{\ell}_j - \ell_j(\hat{\beta})\} \quad (31) $$

Here $\ell_j(\hat{\beta}) = \log f(y_j; \hat{\beta})$, from eqn (24), $\phi$ is the dispersion parameter, and $\tilde{\ell}_j$ is the biggest possible log likelihood attainable by the jth observation. The deviance is non-negative, and for a normal linear model it equals the residual sum of squares for the model, $SS_{Res}$. The deviance is used to compare models, and to judge the overall adequacy of a model.

As in Section 4, model $M_0$ is said to be nested within model $M_1$ if $M_1$ can be reduced to $M_0$ by restrictions on the values of the parameters. Consider for example a model with Poisson error distribution, log link function, and linear predictor $\eta_0 = \beta_0 + \beta_1 z_1$. This is nested within the model with linear predictor $\eta_1 = \beta_0 + \beta_1 z_1 + \beta_2 z_2$, which reduces to $\eta_0$ if $\beta_2 = 0$. However $\eta_1$ cannot be reduced to $\eta_b = \beta_0 + \beta_3 z_3$ by restrictions on $\beta_0$, $\beta_1$ and $\beta_2$. The corresponding models are not nested, although $\eta_b$ has fewer parameters than $\eta_1$.

If there are $p_0$ unknown parameters in $M_0$ and $p_1$ in $M_1$ and the models are nested, the degrees of freedom $n - p_0$ of $M_0$ exceed the degrees of freedom $n - p_1$ of $M_1$. General statistical theory then indicates that for binomial and Poisson data the difference in deviances $D(M_0) - D(M_1)$, which is necessarily non-negative, has a chi-squared distribution on $p_1 - p_0$ degrees of freedom if model $M_0$ is in fact adequate for the data. A difference $D(M_0) - D(M_1)$ which is large relative to that distribution


may indicate that $M_1$ fits the data substantially better than $M_0$. For normal models, inference proceeds based on F-statistics, as outlined in Section 3.

The forward selection, backwards elimination, and stepwise regression algorithms described in Section 3.4 can be used for selection of a suitable generalized linear model, though the caveats there continue to apply. The role of the residual sum of squares is taken by the deviance, and reductions in deviance are judged to be significant or not, relative to the appropriate chi-squared distribution.

Under some circumstances the deviance has a $\chi^2_{n-p_1}$ distribution if model $M_1$ is in fact adequate for the data. For Poisson data the deviance has an approximate chi-squared distribution if a substantial number of the individual counts are fairly large. For binomial data the distribution is approximately chi-squared if the denominators $m_j$ are fairly large and the binomial numerators are not all very close to zero or to the denominator. For normal models with known $\sigma^2$ the deviance has an exact chi-squared distribution when the model is adequate, and the distribution is approximately chi-squared for gamma data when $\nu$ is known.
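A sketch of such a comparison for two nested Poisson log-linear models (numpy and scipy; the covariates and responses are simulated, and the Poisson deviance is computed as $2\sum\{y \log(y/\hat{\mu}) - (y - \hat{\mu})\}$ with the convention $0 \log 0 = 0$):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def fit_poisson(X, y):
    """ML fit of a Poisson log-linear model; the negative log likelihood
    (up to a constant) is sum(exp(eta) - y*eta) with eta = X beta."""
    nll = lambda b: np.sum(np.exp(X @ b) - y * (X @ b))
    return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

def poisson_deviance(y, mu):
    """D = 2 sum{y log(y/mu) - (y - mu)}, with 0 log 0 taken as 0."""
    ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * np.sum(ylogy - (y - mu))

rng = np.random.default_rng(6)
n = 300
z1, z2 = rng.uniform(0, 1, (2, n))
y = rng.poisson(np.exp(0.3 + 1.0 * z1))          # z2 has no real effect
X0 = np.column_stack([np.ones(n), z1])           # model M0
X1 = np.column_stack([np.ones(n), z1, z2])       # M0 is nested within M1

D0 = poisson_deviance(y, np.exp(X0 @ fit_poisson(X0, y)))
D1 = poisson_deviance(y, np.exp(X1 @ fit_poisson(X1, y)))
diff = D0 - D1               # ~ chi-squared on 1 df if M0 is adequate
print(f"deviance difference {diff:.2f}, p-value {chi2.sf(diff, df=1):.3f}")
```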

5.4 Residuals and diagnostics
The residuals and measures of leverage and influence outlined in Section 3.4 can be extended to generalized linear models. Thorough surveys are given in Refs 6 and 20.

The most useful general definition of a residual for a generalized linear model is based on the idea of comparing the deviances for the same model with and without an observation. This is analogous to the jackknifed residual described in Section 3.4. The change in deviance when the jth observation is deleted from the model is approximately the square of the jackknifed deviance residual

$$ R_j = \mathrm{sign}(y_j - \hat{\mu}_j) \left( d_j^2 + h_{jj} r_{Pj}^2 \right)^{1/2} \quad (32) $$

In eqn (32) $d_j^2$ is the contribution to the deviance from the jth observation, i.e. $d_j^2 = 2\phi\{\tilde{\ell}_j - \ell_j(\hat{\beta})\}$, and $h_{jj}$ is given by the jth diagonal element of $H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2}$, where W is the diagonal matrix of


weights $w_j$ evaluated at the current model. The quantity

$$ r_{Pj} = \frac{\partial \log f\{y_j; \theta_j(\eta_j), \phi\}/\partial \eta_j}{\{w_j (1 - h_{jj})\}^{1/2}} = \frac{y_j - \hat{\mu}_j}{\{a(\phi) V(\hat{\mu}_j)(1 - h_{jj})\}^{1/2}} \quad (33) $$

is known as a standardized Pearson residual.

The $R_j$ can be calibrated by reference to a normal distribution with unit variance and mean zero. Observations whose residuals have values that are unusual relative to the standard normal distribution merit the close scrutiny that such observations would get in a linear model. For most purposes $R_j$ may be used to construct the plots described in Section 3.4.

A useful measure of approximate leverage in a generalized linear model is $h_{jj}$ as defined above, and the measure of influence is the approximate Cook statistic

$$ C_j = \frac{h_{jj}}{p(1 - h_{jj})} \, r_{Pj}^2 $$

where p is the dimension of the parameter vector $\beta$. Both $h_{jj}$ and $C_j$ may be used as described in Section 3.4.

The definitions of $R_j$, $h_{jj}$ and $C_j$ given here reduce to those given in Section 3.4 for normal linear models.
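A sketch of these diagnostics, specialized to the Poisson log-linear model where $a(\phi) = 1$ and $w_j = \hat{\mu}_j$ (numpy and scipy; simulated data, with the model fitted by direct maximization of the likelihood rather than by the iterative algorithm):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 100
x = rng.uniform(0, 2, n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(0.5 + 0.8 * x))

# Fit the Poisson log-linear model by maximum likelihood
nll = lambda b: np.sum(np.exp(X @ b) - y * (X @ b))
beta = minimize(nll, np.zeros(2), method="BFGS").x
mu = np.exp(X @ beta)

# Leverages: diagonal of H = W^(1/2) X (X^T W X)^(-1) X^T W^(1/2), W = diag(mu)
A = np.sqrt(mu)[:, None] * X
h = np.sum(A * (A @ np.linalg.inv(X.T @ (mu[:, None] * X))), axis=1)

# Deviance contributions, standardized Pearson residuals, and Cook statistics
ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
d2 = 2.0 * (ylogy - (y - mu))
r_P = (y - mu) / np.sqrt(mu * (1.0 - h))
R = np.sign(y - mu) * np.sqrt(d2 + h * r_P**2)   # deviance residual, eqn (32)
C = h * r_P**2 / (2.0 * (1.0 - h))               # approximate Cook statistic (p = 2)
print("largest |R|:", np.abs(R).max(), " largest C:", C.max())
```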

5.5 An application
The data in Table 2 are in the form of counts $r_j$ of the numbers of days out of $m_j$ on which ozone levels exceed 0·08 ppm. One model for this is that the $r_j$ have a binomial distribution with probability $\pi_j$ and denominator $m_j$:

$$ f(r_j; \pi_j) = \binom{m_j}{r_j} \pi_j^{r_j} (1 - \pi_j)^{m_j - r_j}, \quad r_j = 0, 1, \ldots, m_j, \ 0 < \pi_j < 1 \quad (34) $$

The mean of this distribution is $m_j \pi_j$ and its variance is $m_j \pi_j (1 - \pi_j)$. The idea underlying the use of this distribution is that the probability that the ozone level exceeds 0·08 ppm is $\pi_j$ on each day of the jth month. Exceedances are assumed to occur independently on different days, and the number of days with exceedances in a month for which $m_j$ days of data are recorded then has the binomial distribution eqn (34). The probabilities $\pi_j$ are likely to depend heavily on time of year since there are marked seasonal


variations in ozone levels, and a possible form for the dependence is suggested by the following argument.

If exceedances over the threshold during month j occur at random as a Poisson process with a positive constant rate $\lambda_j$ exceedances per day, the number of exceedances $Z_j$ occurring on any day that month has the Poisson distribution (16) with mean $\lambda_j$. Thus the probability of one or more exceedances that day is

$$ \pi_j = \Pr(Z_j \geq 1) = 1 - \exp(-\lambda_j) $$

If $\lambda_j = \exp(x_j^T \beta)$, corresponding to a log-linear model for the rate of the Poisson process, we find that the expected number of days per month on which there are exceedances is

$$ \mu_j = m_j \pi_j = m_j [1 - \exp\{-\exp(x_j^T \beta)\}] $$

which corresponds to the complementary log-log link function, $\eta = x_j^T \beta = \log\{-\log(1 - \mu_j/m_j)\}$. This model automatically makes allowance for the different numbers of days of recorded data in each month, and would enable the data analyst to impute numbers of days with exceedances even for months for which very little or even no data are recorded.2

Other link functions that are useful for the binomial distribution are the probit, for which $\eta = \Phi^{-1}(\mu/m)$, and the logistic, for which $\eta = \log\{\mu/(m - \mu)\}$. The canonical link function is the logistic, which is widely used in applications but does not have the clear-cut interpretation that can be given to the complementary log-log model in this context.
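For reference, a small numerical sketch of the three inverse link functions (scipy's standard normal distribution function supplies $\Phi$; the grid of $\eta$ values is arbitrary):

```python
import numpy as np
from scipy.stats import norm

eta = np.linspace(-3.0, 3.0, 7)
pi_cloglog = 1.0 - np.exp(-np.exp(eta))   # inverse complementary log-log
pi_logit = 1.0 / (1.0 + np.exp(-eta))     # inverse logistic (canonical link)
pi_probit = norm.cdf(eta)                 # inverse probit
for e, a, b, c in zip(eta, pi_cloglog, pi_logit, pi_probit):
    print(f"eta={e:5.1f}  cloglog={a:.3f}  logit={b:.3f}  probit={c:.3f}")
```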

The ozone levels in this example are thought to depend on several effects. There is an overall effect due to differences between sites, which vary in their proximity to sources of pollutants. There is possibly a similar effect due to differences between years, which can be explained on the basis of different weather patterns from one year to another. There may be interaction between sites and years, corresponding to the overall ozone levels showing different patterns at the two sites over the four years involved. There is likely to be a strong effect of temperature, here used as a surrogate for insolation. Furthermore an effect for the 'ozone season', the months May-September, may also be required. Of these effects only that due to temperature is quantitative; the remainder are qualitative factors of the type discussed in Section 4.

Table 10 shows the contributions of these different terms towards explaining the variation observed in the data. The table is analogous to an analysis of variance table, but with the distinction that the deviance explained by a term would be approximately distributed as chi-squared on


TABLE 10
Texas ozone data: deviance explained by model terms when entered in the order given

Model term                       Degrees of freedom    Deviance
Site                                      1                0·17
Year                                      3                1·00
Site-year interaction                     3               13·34
Daily maximum temperature                 1              103·16
Year-ozone season interaction             4               18·83
Residual                                 74              143·55

the corresponding degrees of freedom if that term were not needed in the model, and otherwise would tend to be too large. If the data showed no effect of daily maximum temperature, for example, the deviance explained in the table would be randomly distributed as chi-squared on one degree of freedom. In fact 103·16 is very large relative to that distribution: overwhelming evidence of a strong effect of daily maximum temperature.

The deviance explained by each set of terms except for differences between sites and years is statistically significant at less than the 5% level. The site and year effects are retained because of the interaction of site and year: it makes little sense to allow overall rates of exceedance to vary with year and site but to require that overall year effects and overall site effects must necessarily be zero.

The coefficient of the monthly daily maximum temperature is approximately 0·07, which indicates that the rate at which exceedances occur increases by a factor $e^{0.07t}$ for each rise in temperature of t°C. Since the temperature variation over the course of a year is about 15°C, the rate of exceedances varies by a factor 2·7 or so due to annual fluctuations in air temperature, regarded in this context as a surrogate for insolation. Other parameter estimates can be interpreted in a similar fashion.

The size of the residual deviance compared to its degrees of freedom suggests that the model is not adequate for the data. Standard theory would suggest that if the model was adequate, the residual deviance would have a chi-squared distribution on 74 degrees of freedom, but the observed residual deviance of 143·55 is very large compared to that distribution. However, in this example there is doubt about the adequacy of a chi-squared approximation to the distribution of the residual deviance, because many of the observed counts $r_j$ are rather small. Furthermore some lack of fit is to be expected, because of the gross simplifications made


Fig. 18. Fitted exceedance rate and observed frequency of exceedance for Beaumont, 1981-1984. The fitted rate, $\hat{\lambda}_j$ (solid curve), smooths out fluctuations in the frequencies, $r_j/m_j$ (dots).

in deriving the model. There is evidently variation in the rates $\lambda_j$ within months, as well as a tendency for days with exceedances to occur together due to short-term persistence of the weather conditions which generate high ozone levels. These effects will generate data that are over-dispersed relative to the model, and hence have a larger deviance than would be expected.

There is also the possibility that individual observations are poorly fitted by the model. Figures 18 and 19 show plots of the estimated rates based on the model, $\hat{\lambda}_j$, and the observed proportions of days with exceedances, $r_j/m_j$, for the two sites. There are clearly some months where there are more exceedances than would be expected under the model, and examination of residuals confirms this. In particular there are substantially more days with exceedances at Beaumont in August and September 1981, in September 1983, and in January and May 1984; and at North Port Arthur in January 1984. Some of these outliers can be traced to specific pollution incidents. Plots of residuals and measures of influence show these characteristics, and the deviance drops considerably when they are deleted from the data.


Fig. 19. Fitted exceedance rate and observed frequency of exceedance for North Port Arthur, 1981-1984. The fitted rate, $\hat{\lambda}_j$ (solid curve), smooths out fluctuations in the frequencies, $r_j/m_j$ (dots).

5.6 More general models
The models described above can be broadened in several ways. One generalization is to some types of correlated data, which in general leads to a weight matrix W which is not diagonal.

A second possibility stems from the observation that the iterative weighted least squares algorithm derived in Section 5.2 extends readily to distributions which are not of form eqn (22) and is best illustrated by example.

Example 1 (contd). We saw in Section 3 that the assumption of normal errors is not suitable for the Venice sea level data, whose errors are skewed to the right. The distribution of the maximum of a large number of independent and identically distributed observations may be modelled by the generalized extreme-value distribution

$$ H(y; \eta, \psi, k) = \exp\left[ -\left\{ 1 - k \, \frac{y - \eta}{\psi} \right\}^{1/k} \right] \quad (35) $$

over the range of y for which $k(y - \eta) < \psi$, where $\psi > 0$ and k, $\eta$ are


arbitrary. The case k = 0 is interpreted as the limit $k \to 0$, i.e.

$$ H(y; \eta, \psi, 0) = \exp\left[ -\exp\left\{ -\frac{y - \eta}{\psi} \right\} \right] $$

the Gumbel distribution. This distribution arises as one of the limiting stable distributions of extreme value theory, and is widely used in engineering, hydrological and meteorological contexts when it is required to model the behaviour of the extremes of a sample. The relatively simpler Gumbel distribution is an obvious initial point from which to attempt to model the Venice data, and we assume as before that a linear model

$$ Y_t = \eta_t + \varepsilon_t $$

holds for the data, but that the 'errors' $\varepsilon_t$ have the Gumbel distribution. The log likelihood for this model is

$$ L(\beta) = \sum_{t=1931}^{1981} \left( -\log \psi - (y_t - \eta_t)/\psi - e^{-(y_t - \eta_t)/\psi} \right) $$

where $\eta_t = \beta_0 + \beta_1(t - 1931)$ is the linear predictor for year t. Apart from the presence of $\psi$, the structure of this model is analogous to that of eqn (24) and the derivation of the fitting algorithm given in Section 5.2 can be carried through in a very similar manner. The value of $\psi$, which plays a role like that of the unknown variance in the linear model with normal errors, is estimated by a single extra step in the algorithm.19,21
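Such a fit is easy to reproduce by direct numerical maximization of this log likelihood; the sketch below uses scipy's general-purpose optimizer rather than the iterative weighted least squares algorithm described in the text, and a simulated series with a linear trend stands in for the Venice data, which are not reproduced here:

```python
import numpy as np
from scipy.optimize import minimize

def fit_gumbel_regression(t, y):
    """Maximize the Gumbel log likelihood
    L = sum(-log psi - (y - eta)/psi - exp(-(y - eta)/psi)),
    with eta_t = beta0 + beta1*t and psi = exp(log_psi) > 0."""
    def negloglik(par):
        beta0, beta1, log_psi = par
        psi = np.exp(log_psi)
        u = (y - beta0 - beta1 * t) / psi
        return np.sum(np.log(psi) + u + np.exp(-u))
    res = minimize(negloglik, x0=[np.mean(y), 0.0, np.log(np.std(y))],
                   method="Nelder-Mead")
    beta0, beta1, log_psi = res.x
    return beta0, beta1, np.exp(log_psi)

# Simulated annual maxima with a linear trend and Gumbel errors
rng = np.random.default_rng(8)
t = np.arange(51.0)                        # years since 1931
y = 100.0 + 0.6 * t + rng.gumbel(0.0, 20.0, 51)
print(fit_gumbel_regression(t, y))
```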

When this model is fitted to the data in Table 1, the estimates of $\beta_0$ and $\beta_1$ are 97·15 and 0·574, compared to the values 105·4 and 0·567 obtained previously. The estimates of slope are very similar, and the estimates of the intercept differ partly because the mean of the Gumbel distribution is not zero. However, examination of residuals shows that the Gumbel distribution, unlike the normal, gives an adequate fit to the data. There is no reason to suppose on the basis of the data that the more complicated form eqn (35) is needed. Figure 20 shows the normal order statistics plot of residuals defined by extending eqn (32). The use of the Gumbel distribution has removed the skewness that was evident in Fig. 13.

One message from this example is that regression parameter estimates are usually fairly insensitive to the choice of error distribution. If interest was focused on the prediction of extreme sea levels from these data, however, the symmetry of the normal distribution might lead to substantial underprediction of extreme levels in any particular year. For purposes such as this it would be important to use an error distribution which



Fig. 20. Plot of ordered residuals against normal order statistics for fit of model with Gumbel errors to Venice data.

matches the asymmetry of the data, and this precludes the use of the normal distribution.

6 COMPUTING PACKAGES

Almost every statistical package has facilities for fitting the linear regression models described in Section 3, and most now provide some form of diagnostic aids for the modeller. At the time of writing, only two packages, GLIM and GENSTAT, have facilities for direct fitting of generalized linear models. There is no space here for a comprehensive discussion of computing aspects of linear models. Some general issues to consider when choosing a package are:

(a) the flexibility of the package in terms of interactive use, plotting facilities, model fitting, and control over the level of output;

(b) the ease with which standard and non-standard models can be fitted;
(c) the ease with which models can be checked; and
(d) the quality of documentation and the level of help available, both in terms of on-line help and in the form of local gurus.

As ever, there is likely to be a trade-off between immediate ease of use


and the power of a package. Whatever decision is made, there is no excuse for wasting valuable scientific time by writing computer programs to fit regression models instead of using an off-the-shelf package whenever possible.

7 BIBLIOGRAPHIC NOTES AND DISCUSSION

There is a vast literature on regression models. Weisberg12 is a good introduction to applied regression analysis, and Seber11 is good on the more theoretical aspects. Another standard reference is Draper & Smith.22

The standard book on generalized linear models is McCullagh & Nelder;6 Aitkin et al.18 give an account based on the use of the package GLIM. Seber & Wild23 deal specifically with nonlinear models with normal errors.

Aspects of model-checking are covered by Atkinson,24 who pays special attention to graphical methods and transformations, and by Cook & Weisberg,25 and Carroll & Ruppert.26

Two chapters in Hinkley et al. give short accounts of generalized linear models27 and residuals and diagnostics.20

Reference was made to overdispersion in Section 5.5. Overdispersion is widespread in applications, and often arises because a model cannot hope to capture all the variability which arises in data. A response will often depend on factors that were not recorded, as well as on those that were, and this will lead to data that are overdispersed, apparently at random, relative to the postulated model. One remedy is to broaden the applicability of generalized linear models by making model assumptions only involving the mean-variance relationship and hence the variance function. The relation of these to the generalized linear models described in Section 5 is the same as that of the second-order assumptions to the normal-theory assumptions described in Section 3.1. The topic is discussed fully in Ref. 6.

This chapter has concentrated on the analysis of single sets of data, rather than the automatic analysis of many similar sets. When there are many sets of data to be analysed in an automatic way and there is the possibility of gross errors in some observations, it may be useful to consider robust or resistant methods of analysis. Robust methods are insensitive to changes in the assumptions which underlie a statistical model, whereas resistant methods are insensitive to large changes in a few observations. If a suitable model is known to apply, there are obvious advantages to the use of such methods. The difficulty is that usually


models are uncertain, and it is then valuable to detect the ways in which the model departs from the data. Robust and resistant methods can make this harder precisely because of their insensitivity to changes in assumptions or in the data. Such methods are considered in more detail by Green,19 Li,28 Rousseeuw & Leroy,29 and Hampel et al.30

The full parametric assumption that $\eta = x^T \beta$ for all values of x and $\beta$ can be relaxed in various ways. One possibility is non-parametric smoothing techniques, which aim to fit a smooth curve to the data. This avoids the full parametric assumptions made above, while giving tractable and reasonably fast methods of curve-fitting. Hastie & Tibshirani31 give a full discussion of the topic.

ACKNOWLEDGEMENTS

This work was supported by a grant from the Nuffield Foundation. The author is grateful to L. Tierney for a copy of his statistical package XLISP-STAT.

REFERENCES

1. Smith, R.L., Extreme value theory based on the r largest annual events. J. Hydrology, 86 (1986) 27-43.
2. Davison, A.C. & Hemphill, M.W., On the statistical analysis of ambient ozone data when measurements are missing. Atmospheric Environment, 21 (1987) 629-39.
3. Lindley, D.V. & Scott, W.F., New Cambridge Elementary Statistical Tables. Cambridge University Press, Cambridge, 1984.
4. Pearson, E.S. & Hartley, H.O., Biometrika Tables for Statisticians, 3rd edn, vols 1 and 2. Biometrika Trust, University College, London, 1976.
5. Crowder, M.J., A multivariate distribution with Weibull connections. J. Roy. Statist. Soc. B, 51 (1989) 93-107.
6. McCullagh, P. & Nelder, J.A., Generalized Linear Models, 2nd edn. Chapman & Hall, London, 1989.
7. Edwards, D., Hierarchical interaction models (with Discussion). J. Roy. Statist. Soc. B, 52 (1990) 3-20, 51-72.
8. Jensen, D.R., Multivariate distributions. In Encyclopedia of Statistical Sciences, vol. 5, ed. S. Kotz, N.L. Johnson & C.B. Read. Wiley, New York, 1985, pp. 43-55.
9. Tawn, J.A., Bivariate extreme value theory: Models and estimation. Biometrika, 75 (1988) 397-415.
10. Tawn, J.A., Modelling multivariate extreme value distributions. Biometrika, 77 (1990) 245-53.
11. Seber, G.A.F., Linear Regression Analysis. Wiley, New York, 1977.
12. Weisberg, S., Applied Linear Regression, 2nd edn. Wiley, New York, 1985.
13. Cook, R.D., Detection of influential observations in linear regression. Technometrics, 19 (1977) 15-18.
14. Box, G.E.P. & Cox, D.R., An analysis of transformations (with Discussion). J. Roy. Statist. Soc. B, 26 (1964) 211-52.
15. ApSimon, H.M. & Davison, A.C., A statistical model for deriving probability distributions of contamination for accidental releases. Atmospheric Environment, 20 (1986) 1249-59.
16. Holland, D.M. & Fitz-Simons, T., Fitting statistical distributions to air quality data by the maximum likelihood method. Atmospheric Environment, 16 (1982) 1071-6.
17. Seber, G.A.F., Multivariate Observations. Wiley, New York, 1985.
18. Aitkin, M., Anderson, D., Francis, B. & Hinde, J., Statistical Modelling in GLIM. Clarendon Press, Oxford, 1989.
19. Green, P.J., Iteratively reweighted least squares for maximum likelihood estimation and some robust and resistant alternatives (with Discussion). J. Roy. Statist. Soc. B, 46 (1984) 149-92.
20. Davison, A.C. & Snell, E.J., Residuals and diagnostics. In Statistical Theory and Modelling: In Honour of Sir David Cox, ed. D.V. Hinkley, N. Reid & E.J. Snell. Chapman & Hall, London, 1990, pp. 83-106.
21. Jørgensen, B., The delta algorithm and GLIM. Int. Statist. Rev., 52 (1984) 283-300.
22. Draper, N. & Smith, H., Applied Regression Analysis, 2nd edn. Wiley, New York, 1981.
23. Seber, G.A.F. & Wild, C.J., Nonlinear Regression. Wiley, New York, 1989.
24. Atkinson, A.C., Plots, Transformations, and Regression. Clarendon Press, Oxford, 1985.
25. Cook, R.D. & Weisberg, S., Residuals and Influence in Regression. Chapman & Hall, London, 1982.
26. Carroll, R.J. & Ruppert, D., Transformation and Weighting in Regression. Chapman & Hall, London, 1988.
27. Firth, D., Generalized linear models. In Statistical Theory and Modelling: In Honour of Sir David Cox, ed. D.V. Hinkley, N. Reid & E.J. Snell. Chapman & Hall, London, 1990, pp. 55-82.
28. Li, G., Robust regression. In Exploring Data Tables, Trends, and Shapes, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. Wiley, New York, 1985, pp. 281-343.
29. Rousseeuw, P.J. & Leroy, A.M., Robust Regression and Outlier Detection. Wiley, New York, 1987.
30. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A., Robust Statistics. Wiley, New York, 1986.
31. Hastie, T. & Tibshirani, R.J., Generalized Additive Models. Chapman & Hall, London, 1990.


Chapter 4

Factor and Correlation Analysis of Multivariate Environmental Data

PHILIP K. HOPKE
Department of Chemistry, Clarkson University, Potsdam, New York 13699, USA

1 INTRODUCTION

In studies of the environment, many variables are measured to characterize the system. However, not all of the variables are independent of one another. Thus, it is essential to have mathematical techniques that permit the study of the simultaneous variation of multiple variables. One such analysis is based on examining the relationships between pairs of variables. This correlation analysis, however, does not provide a clear view of the multiple interactions in the data. Thus, various forms of eigenvector analysis are used to convert the correlation data into multivariate information. Factor analysis is the name given to one of the variety of forms of eigenvector analysis. It was originally developed and used in psychology to provide mathematical models of psychological theories of human ability and behavior.1 However, eigenvector analysis has found wide application throughout the physical and life sciences. Unfortunately, a great deal of confusion exists in the literature in regard to the terminology of eigenvector analysis. Various changes in the way the method is applied have resulted in it being called factor analysis, principal components analysis, principal components factor analysis, empirical orthogonal function analysis, Karhunen-Loeve transform, etc., depending on the way the data are scaled before analysis or how the resulting vectors are treated after the eigenvector analysis is completed.



All of the eigenvector methods have the same basic objective: the compression of data into fewer dimensions and the identification of the structure of interrelationships that exist between the variables measured or the samples being studied. In many chemical studies, the measured properties of the system can be considered to be the linear sum of terms representing the fundamental effects in that system times appropriate weighting factors. For example, the absorbance at a particular wavelength of a mixture of compounds for a fixed path length, z, is considered to be a sum of the absorbances of the individual components

$$ A_\lambda = z \sum_i \varepsilon_{\lambda i} c_i \quad (1) $$

where $\varepsilon_{\lambda i}$ is the molar extinction coefficient for the ith compound at wavelength $\lambda$, and $c_i$ is the corresponding concentration. Thus, if the absorbances of a mixture of several absorbing species are measured at m various wavelengths, a series of equations can be obtained:

$$ A_{\lambda_j} = z \sum_i \varepsilon_{\lambda_j i} c_i, \quad j = 1, \ldots, m \quad (2) $$

If we know what components are present and what the molar extinction coefficients are for each compound at each wavelength, the concentrations of each compound can be determined using a multiple linear regression fit to these data. However, in many cases neither the number of compounds nor their absorbance spectra may be known. For example, several compounds may elute from an HPLC column at about the same retention time so that a broad elution peak or several poorly resolved peaks containing these compounds may be observed. Thus, at any point in the elution curve, there would be a mixture of the same components but in differing proportions. If the absorbance spectrum of each of these different mixtures could be measured, such as by using a diode array system, then the resulting data set would consist of a number of absorption spectra for a series of n different mixtures of the same compounds.

$$ A_{jk} = z \sum_i \varepsilon_{ji} c_{ik}, \quad j = 1, \ldots, m; \ k = 1, \ldots, n \quad (3) $$

For such a data set, factor analysis can be employed to identify the number of components in the mixture, the absorption spectra of each component, and the concentration of each compound for each of the mixtures. Similar problems are found throughout analytical and environmental chemistry


where there are mixtures of unknown numbers of components and the properties of each pure component are not known a priori.

Another use for such methods is in physical chemistry, where the measured property can also be related to a linearly additive sum of independent causative processes. For example, the effects of solvents on the proton NMR shifts for non-polar solutes can be expressed in the form

$$\delta_{i\alpha} = \sum_{j=1}^{p} s_{ij} f_{j\alpha} \qquad (4)$$

where $\delta_{i\alpha}$ is the chemical shift of solute i in solvent $\alpha$, $s_{ij}$ refers to the jth solute factor of the ith solute, and $f_{j\alpha}$ refers to the jth solvent factor of the $\alpha$th solvent, with the summation over all of the physical factors that might give rise to the measured chemical shifts. Similar examples have been found for a variety of chemical problems and are described in Ref. 2.

Finally, similar problems arise in the resolution of environmental mixtures into their source contributions. For example, a sample of airborne particulate matter collected at a specific site is made up of particles of soil, motor vehicle exhaust, secondary sulfate particles, primary emissions from industrial point sources, etc. It may be of interest to determine how much of the total collected mass of particles comes from each source. It is then assumed that the measured ambient concentration of some species, $x_i$, where i = 1, ..., m measured elements, is a linear sum of contributions from p independent sources of particles. These species are normally elemental concentrations such as lead or silicon and are given in μg of element per cubic meter of air. Each kth source emits particles that have a profile of elemental concentrations, $a_{ik}$, and the mass contribution per unit volume of the kth source is $f_k$. When the compositions are measured for a number of samples, an equation of the following form is obtained:

$$x_{ij} = \sum_{k=1}^{p} a_{ik} f_{kj} \qquad (5)$$

The use of factor analysis for this type of study is reviewed in Ref. 3.

Thus, a factor analysis can help compress multivariate data to sufficiently few dimensions to make visualization possible and assist in identifying the interrelated variables. Depending on the approach used, the results can be interpreted statistically or they can be directly related to the structural variables that describe the system being studied. Examples of both types of results will be presented in this chapter. However, in all cases, the investigator must then interpret the interrelationships determined through the analysis within the context of his problem to provide a more detailed understanding of the system being studied.

2 EIGENVECTOR ANALYSIS

2.1 Dispersion matrix
The initial step in the analysis of the data requires the calculation of a function that can indicate the degree of interrelationship that exists within the data. Functions exist that can provide this measure between two variables when calculated over all of the samples, or between two samples when calculated over all of the variables. The best known of these functions is the product-moment correlation coefficient. To be more precise, this function should be referred to as the correlation about the mean. The 'correlation coefficient' between two variables, $x_i$ and $x_k$, over all n samples is given by

$$r_{ik} = \frac{\sum_{j=1}^{n}(x_{ij}-\bar{x}_i)(x_{kj}-\bar{x}_k)}{\left[\sum_{j=1}^{n}(x_{ij}-\bar{x}_i)^2\right]^{1/2}\left[\sum_{j=1}^{n}(x_{kj}-\bar{x}_k)^2\right]^{1/2}} \qquad (6)$$

The original variables can be transformed by subtracting the mean value and dividing by the standard deviation,

$$z_{ij} = \frac{x_{ij} - \bar{x}_i}{s_i} \qquad (7)$$

Using the standardized variables, eqn (6) can be simplified to

$$r_{ik} = \frac{1}{n}\sum_{j=1}^{n} z_{ij} z_{kj} \qquad (8)$$

The standardized variables have several other benefits to their use. Each standardized variable has a mean value of zero and a standard deviation of 1. Thus, each variable carries 1 unit of system variance and the total variance for a set of measurements of m variables would be m.
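
In present-day numerical practice, eqns (7) and (8) can be carried out directly. The following sketch (Python with the numpy library; the data are randomly generated and purely illustrative) verifies that the standardized form agrees with the direct product-moment calculation of eqn (6):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 5, 100                            # m variables measured on n samples
    X = rng.normal(size=(m, n))              # rows are variables

    # eqn (7): standardize each variable to zero mean, unit standard deviation
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

    # eqn (8): the correlation matrix as (1/n) Z Z'
    R = (Z @ Z.T) / n
    assert np.allclose(R, np.corrcoef(X))    # agrees with eqn (6) directly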

There are several other measures of interrelationship that can also be utilized. These measures include covariance about the mean as defined by

$$c_{ik} = \sum_{j=1}^{n} d_{ij} d_{kj} \qquad (9)$$

where

$$d_{ij} = x_{ij} - \bar{x}_i \qquad (10)$$

are called the deviations, and $\bar{x}_i$ is the average value of the ith variable. The covariance about the origin is defined by

$$c^{0}_{ik} = \sum_{j=1}^{n} x_{ij} x_{kj} \qquad (11)$$

and the correlation about the origin by

$$r^{0}_{ik} = \frac{\sum_{j=1}^{n} x_{ij} x_{kj}}{\left[\sum_{j=1}^{n} x_{ij}^{2}\right]^{1/2}\left[\sum_{j=1}^{n} x_{kj}^{2}\right]^{1/2}} \qquad (12)$$

The matrix of either the correlations or covariances, called the dispersion matrix, can be obtained from the original or transformed data matrices. The data matrix contains the data for the m variables measured over the n samples. The correlation about the mean is given by

$$\mathbf{R}_m = \mathbf{Z}\mathbf{Z}' \qquad (13)$$

where Z' is the transpose of the standardized data matrix Z. The correlation about the origin is

$$\mathbf{R}_0 = \mathbf{Z}_0\mathbf{Z}_0' = (\mathbf{V}^{-1}\mathbf{X})(\mathbf{V}^{-1}\mathbf{X})' \qquad (14)$$

where

$$z^{0}_{ij} = \frac{x_{ij}}{\left[\sum_{j=1}^{n} x_{ij}^{2}\right]^{1/2}} \qquad (15)$$

is a normalized variable still referenced to the original variable origin, and V is a diagonal matrix whose elements are defined by

$$V_{ik} = \delta_{ik}\left[\sum_{j=1}^{n} x_{ij}^{2}\right]^{1/2} \qquad (16)$$

This normalized variable also carries a variance of 1, but the mean value is not zero. The covariance about the mean is given as

$$\mathbf{C}_m = \mathbf{D}\mathbf{D}' \qquad (17)$$

where D is the matrix of deviations from the mean whose elements are calculated using eqn (10) and D' is its transpose. The covariance about the origin is

$$\mathbf{C}_0 = \mathbf{X}\mathbf{X}' \qquad (18)$$

the simple product of the data matrix by its transpose. As written, these product matrices would be of dimension m by m and would represent the pairwise interrelationships between variables. If the order of the multiplication is reversed, the resulting n by n dispersion matrices contain the interrelationships between samples.
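
The four dispersion matrices of eqns (13), (14), (17) and (18) can be assembled directly from the data matrix, as in this sketch (Python/numpy; the synthetic data and names are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.lognormal(size=(4, 30))          # m = 4 variables, n = 30 samples

    D = X - X.mean(axis=1, keepdims=True)    # deviation matrix, eqn (10)
    Cm = D @ D.T                             # covariance about the mean, eqn (17)
    Co = X @ X.T                             # covariance about the origin, eqn (18)

    Z = D / np.sqrt((D ** 2).sum(axis=1, keepdims=True))
    Rm = Z @ Z.T                             # correlation about the mean, eqn (13)
    Zo = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    Ro = Zo @ Zo.T                           # correlation about the origin, eqn (14)

    # Reversing the order of multiplication (e.g. D.T @ D) would instead give
    # the n by n dispersion matrices between samples (the Q-mode form).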

The relative merits of these functions in reflecting the total information content of the data have been discussed in the literature.3,4 Rozett & Petersen3 argue that, since many types of physical and chemical variables have a real zero, the information regarding the location of the true origin is lost by using the correlation and covariance about the mean, which include only differences from the variable means. The normalization made in calculating the correlations from the covariances causes each variable to have an identical weight in the subsequent analysis. In mass spectrometry, where the variables consist of the ion intensities at the various m/e values observed for the fragments of a molecule, the normalization represents a loss of information because the variable metric is the same for all of the m/e values. In environmental studies, where measured species concentrations range from the trace level (sub parts per million) to major constituents at the per cent level, the use of covariance may weight the major constituents too heavily in the subsequent analyses. The choice of dispersion function depends heavily on the nature of the variables being measured.

Another use of the correlation coefficient is that it can be interpreted in a statistical sense to test the null hypothesis as to whether a linear relationship exists between the pair of variables being tested. It is important to note that the existence of a correlation coefficient that is different from zero does not prove that a cause and effect relationship exists. Also, it is important to note that the use of probabilities to determine if correlation coefficients are 'significant' is very questionable for environmental data. In the development of those probability relationships, explicit assumptions are made that the underlying distributions of the variables in the correlation analysis are normal. For most environmental variables, normal distributions are uncommon. Generally, the distributions are positively skewed and heavy tailed. Thus, great care should be taken in making probability arguments regarding the significance of pairwise correlation coefficients between variables measured in environmental samples.

Another problem with interpreting correlation coefficients is that environmental systems are often truly multivariate systems. Thus, there may be more than two variables that covary because of the underlying nature of the processes being studied. Although there can be very strong correlations between two variables, the correlation may arise through a causal factor in the system that cannot be detected.

For each of the equations previously given in this section, the resulting dispersion matrix provides a measure of the interrelationship between the measured variables. Thus, in the use of a matrix of correlations between the pairs of variables, each variable is given equal weight in the subsequent eigenvector analysis. This form of factor analysis is commonly referred to as an R-mode analysis. Alternatively, the order of multiplication could be reversed to yield covariances or correlations between the samples obtained in the system. The eigenvector analysis of these matrices would be referred to as a Q-analysis. The differences between these two approaches will be discussed further after the approach to eigenvector analysis has been introduced.

2.2 Eigenvector calculation
The primary goal of eigenvector analysis is to represent a data matrix as a product of two other matrices containing specific information regarding the sources of the variation observed in the data. It can be shown5 that any matrix can be expressed as a product of two matrices

$$\mathbf{X}_{n\times m} = \mathbf{A}_{n\times p}\,\mathbf{F}_{p\times m} \qquad (19)$$

where the subscripts denote the dimensions of the respective matrices. There will be an infinite number of different A and F matrices that satisfy this equation.

To divide the matrix into two cofactor matrices as in eqn (19), a question is raised about the minimum value p can have and still yield a solution. This value is the 'rank' of matrix X (Ref. 5, p. 335). The rank clearly cannot be greater than a matrix's smaller dimension and the rank of a product moment matrix cannot be greater than the smaller of the number of columns or the number of rows. The rank of the product moment matrix must be the same as that of the matrix from which it was formed.

Associated with the idea of matrix rank is the concept of linear independence of variables. We can look at the interrelationships between columns (or rows) of a matrix and determine if columns (or rows) are linearly independent of one another. To understand linear independence, let us examine the relationship between the two vectors in Fig. 1. We can find the vector t such that

$$\mathbf{s} = \mathbf{r} - \mathbf{t} \qquad (20)$$

Fig. 1. Illustration of the interrelationship between two vectors.

The vector t can then be generalized to be the resultant of the sum of r and s with coefficients $a_r$ and $a_s$:

$$\mathbf{t} = a_r\mathbf{r} + a_s\mathbf{s} \qquad (21)$$

If t = 0, then r and s are said to be collinear or linearly dependent vectors. Thus, a vector y is linearly dependent on a set of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m$ if

$$\mathbf{y} = a_1\mathbf{v}_1 + a_2\mathbf{v}_2 + \cdots + a_m\mathbf{v}_m \qquad (22)$$

and at least one of the coefficients, $a_j$, is non-zero. If all of the $a_j$ values in eqn (22) are zero, then y is linearly independent of the set of vectors $\mathbf{v}_j$. The number of linearly independent column vectors in a matrix defines the minimum number of dimensions needed to contain all of the vectors. The idea of the rank or true dimensionality of a data matrix is an important concept in receptor modeling, as it defines the number of separately identifiable, independent sources contributing to the system under study. Thus, finding the rank of a data matrix will be an important task. In addition, the ability to resolve sources of material with similar properties, or the resolution of various receptor models, needs to be carefully examined. To examine this question, several additional mathematical concepts need to be discussed.

A given data matrix can be reproduced by one of an infinite number of sets of independent column vectors or basis vectors that will describe the axes of the reduced dimensionality space. The rank of the matrix can be determined and a set of linearly independent basis vectors can be developed by the use of an eigenvalue analysis. In this discussion, only the analysis of real, symmetric matrices such as those obtained as the minor or major product of a data matrix will be discussed. Suppose there exists a real, symmetric matrix R that is to be analyzed for its rank, remembering that the rank of a product moment matrix is the same as that of the data matrix from which it is formed.

An eigenvector of R is a vector u such that

$$\mathbf{R}\mathbf{u} = \mathbf{u}\lambda \qquad (23)$$

where $\lambda$ is an unknown scalar. The problem then is to find a vector u so that the vector Ru is proportional to u. This equation can be rewritten as

$$\mathbf{R}\mathbf{u} - \mathbf{u}\lambda = \mathbf{0} \qquad (24)$$

or

$$(\mathbf{R} - \lambda\mathbf{I})\mathbf{u} = \mathbf{0} \qquad (25)$$

implying that u is a vector that is orthogonal to all of the row vectors of (R − λI). This vector equation can be considered as a set of p equations, where p is the order of R:

$$u_1(1-\lambda) + u_2 r_{12} + u_3 r_{13} + \cdots + u_p r_{1p} = 0 \qquad (26)$$
$$u_1 r_{21} + u_2(1-\lambda) + u_3 r_{23} + \cdots + u_p r_{2p} = 0$$
$$\vdots$$
$$u_1 r_{p1} + u_2 r_{p2} + u_3 r_{p3} + \cdots + u_p(1-\lambda) = 0 \qquad (27)$$

Unless u is a null vector, this set of equations can hold only if the determinant of the coefficient matrix on the left side is zero:

$$|\mathbf{R} - \lambda\mathbf{I}| = 0 \qquad (28)$$

This equation yields a polynomial in $\lambda$ of degree p. It is then necessary to obtain the p roots of this equation, $\lambda_i$, $i = 1, \ldots, p$. For each $\lambda_i$ there is an associated vector $\mathbf{u}_i$ such that

$$\mathbf{R}\mathbf{u}_i - \mathbf{u}_i\lambda_i = \mathbf{0} \qquad (29)$$

If these $\lambda_i$ values are placed as the elements of a diagonal matrix $\mathbf{\Lambda}$, and the eigenvectors are collected as the columns of the matrix U, then we can express eqn (25) as

$$\mathbf{R}\mathbf{U} = \mathbf{U}\mathbf{\Lambda} \qquad (30)$$

The matrix U is square and orthonormal, so that

$$\mathbf{U}'\mathbf{U} = \mathbf{U}\mathbf{U}' = \mathbf{I} \qquad (31)$$

Postmultiplying eqn (30) by U' yields

$$\mathbf{R} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}' \qquad (32)$$

Thus, any symmetric matrix R may be represented in terms of its eigenvalues and eigenvectors:

$$\mathbf{R} = \sum_{i=1}^{p} \lambda_i\,\mathbf{u}_i\mathbf{u}_i' \qquad (33)$$

so that R is a weighted sum of the matrices $\mathbf{u}_i\mathbf{u}_i'$, each of order p by p and of rank 1. Each term is orthogonal to all of the other terms, so that for $i \neq j$

$$\mathbf{u}_i'\mathbf{u}_j = 0 \qquad (34)$$

and

$$(\mathbf{u}_i\mathbf{u}_i')(\mathbf{u}_j\mathbf{u}_j') = \mathbf{0} \qquad (35)$$

Premultiplying eqn (30) by U' yields

$$\mathbf{U}'\mathbf{R}\mathbf{U} = \mathbf{\Lambda} \qquad (36)$$

so that U is a matrix that reduces R to a diagonal form. The eigenvalues have a number of useful properties:6

(1) trace $\mathbf{\Lambda}$ = trace R; the sum of the eigenvalues equals the sum of the elements in the principal diagonal of the matrix.

(2) $\prod_{i=1}^{p}\lambda_i = |\mathbf{R}|$; the product of the eigenvalues equals the determinant of the matrix. If one or more of the eigenvalues is zero, then the determinant is zero and the matrix R is called a singular matrix. A singular matrix cannot be inverted.

(3) The number of non-zero eigenvalues equals the rank of R.

Therefore, if for a matrix R of order p there are m zero eigenvalues, the rank of R is (p − m); the (p − m) eigenvectors corresponding to the non-zero eigenvalues form a set of linearly independent vectors of R.
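
These three properties are easy to verify numerically, as the following sketch does for a correlation matrix built from deliberately rank-deficient data (Python/numpy; illustrative data only):

    import numpy as np

    rng = np.random.default_rng(2)
    B = rng.normal(size=(3, 50))             # three independent variables
    X = np.vstack([B, B[0] + B[1]])          # a fourth, linearly dependent one
    R = np.corrcoef(X)                       # 4 by 4 correlation matrix, rank 3

    lam = np.linalg.eigvalsh(R)
    print(np.isclose(lam.sum(), np.trace(R)))           # property (1)
    print(np.isclose(np.prod(lam), np.linalg.det(R)))   # property (2): both ~0 here
    print(int((lam > 1e-10).sum()))                     # property (3): rank -> 3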

Another approach can be taken to examine the basic structure of a matrix. This method is called the singular value decomposition of an arbitrary rectangular matrix. A very detailed discussion of this process is given in Ref. 7. According to the singular value decomposition theorem, any matrix can be uniquely written as

$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}' \qquad (37)$$

where X is an n by m data matrix, U is an n by n orthogonal matrix, V is an m by m orthogonal matrix, and D is an n by m diagonal matrix. All of the diagonal elements of D are non-negative and exactly k of them are strictly positive. These elements are called the singular values of X. The columns of the U matrix are the eigenvectors of XX'; the columns of the V matrix are the eigenvectors of X'X. Zhou et al.8 show that the R- and Q-mode factor solutions are interrelated as follows:

$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}' = \underbrace{(\mathbf{U}\mathbf{D})}_{\mathbf{A}_Q}\,\underbrace{\mathbf{V}'}_{\mathbf{F}_Q} = \underbrace{\mathbf{U}}_{\mathbf{F}_R}\,\underbrace{(\mathbf{D}\mathbf{V}')}_{\mathbf{A}_R} \qquad (38)$$

Although there has been discussion of the relative merits of R- and Q-mode analyses in the literature,3,9 the direction of multiplication is not the factor that alters the solutions obtained. Different solutions are obtained depending on the direction in which the scaling is performed. Thus, different vectors are derived depending on whether the data are scaled by row, by column, or both. Zhou et al.8 discuss this problem in more detail.

By making appropriate choices of A and F in eqn (38), the singular value decomposition is one method to partition any matrix. The singular value decomposition is also a key diagnostic tool in examining collinearity problems in regression analysis.10 The application of the singular value decomposition to regression diagnostics is beyond the scope of this chapter.
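
The relationships of eqns (37) and (38) can be checked with a single singular value decomposition, as in this sketch (Python/numpy; random data, purely illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(6, 4))              # n = 6 samples by m = 4 variables

    U, d, Vt = np.linalg.svd(X, full_matrices=False)

    # The squared singular values are the non-zero eigenvalues of both
    # XX' (whose eigenvectors are the columns of U) and X'X (columns of V).
    w1 = np.linalg.eigvalsh(X @ X.T)
    w2 = np.linalg.eigvalsh(X.T @ X)
    print(np.allclose(np.sort(d ** 2), w1[-4:]))
    print(np.allclose(np.sort(d ** 2), w2))

    # eqn (38): X = (UD)V' = U(DV'), two equivalent loading/score partitions
    print(np.allclose(X, (U * d) @ Vt))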

In the discussion of the dispersion matrix, it becomes necessary to discuss some of the semantic problems that arise in 'factor' analysis. If one consults the social science literature on factor analysis, a major distinction is made between factor analysis and principal components analysis. Because there are substantial problems in developing quantitative models analogous to those given in the introduction to this chapter in eqns (1) to (5), the social sciences want to obtain 'factors' that have significant values for two or more of the measured variables. Thus, they are interested in factors that are common to several variables. The model then being applied to the data is of the form

$$z_{ij} = \sum_{k=1}^{p} a_{ik} f_{kj} + d_i u_{ij} \qquad (39)$$

where the standardized variables, $z_{ij}$, are related to the product of the common factor loadings, $a_{ik}$, and the common factor scores, $f_{kj}$, plus the unique loading, $d_i$, and score, $u_{ij}$. The system variance is therefore partitioned into the common factor variance, the specific variance unique to the particular variable, and the measurement error:

System variance = Common variance + Specific variance + Error   (40)

where the common and specific variances together constitute the true variance, and the specific and error variances together constitute the unique variance.

In order to make this separation, an estimate is made of the partitioning of the variance between the common factors and the specific factors. A common approach to this estimation is to replace each '1' on the diagonal of the correlation matrix with an estimate of the 'communality' defined by

$$h_i^2 = \sum_{k=1}^{p} a_{ik}^2 \qquad (41)$$

The multiple correlation coefficients for each variable against all of the remaining variables are often used as initial estimates of the communalities. Alternatively, the eigenvector analysis is made and the communalities for the initial solution are then substituted into the diagonal elements of the correlation matrix to produce a communality matrix. This matrix is then analyzed and the process repeated until stable communality values are obtained.
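
The following sketch (Python/numpy) outlines this iteration; the use of squared multiple correlations as starting values and the substitution of communalities into the diagonal follow the description above, while the function name, the fixed iteration count and the assumption that p is chosen in advance are illustrative:

    import numpy as np

    def iterate_communalities(R, p, n_iter=50):
        # Initial estimate: squared multiple correlation of each variable
        # with all of the remaining variables.
        h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
        Rh = R.copy()
        for _ in range(n_iter):
            np.fill_diagonal(Rh, h2)                   # communality matrix
            lam, U = np.linalg.eigh(Rh)
            lam, U = lam[::-1][:p], U[:, ::-1][:, :p]  # p largest factors
            A = U * np.sqrt(np.clip(lam, 0.0, None))   # factor loadings
            h2 = (A ** 2).sum(axis=1)                  # new communalities, eqn (41)
        return A, h2

In practice the loop would be stopped when successive communality values agree within a tolerance, rather than after a fixed number of passes.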

The principal components analysis simply decomposes the correlation matrix and leads to the model outlined in eqn (39) without the $d_i u_{ij}$ term. It can produce components that have a strong relationship with only one variable. The single variable component could also be considered to be the unique factor. Thus, both principal components analysis and classical factor analysis really lead to similar solutions, although reaching these solutions by different routes. Since it is quite reasonable for many environmental systems to show factors that produce such single variable behavior, it is advisable to use a principal components analysis and extend the number of factors to those necessary to reproduce the original data within the error limits inherent in the data set.

Typically, this approach to making the eigenvector analysis compresses the information content of the data set into as few eigenvectors as possible. Thus, in considering the number of factors to be used to describe the system, it is necessary to carefully examine the problems of reconstructing both the variability within the data and reconstructing the actual data itself.

2.3 Number of retained factors
Following the diagonalization of the correlation or covariance matrix, it is necessary to make the difficult choice of the number of factors, p, to use in the subsequent analysis. This problem occurs in any application of an eigenvector analysis of data containing noise. In the absence of error, the eigenvalues beyond the true number of sources become zero except for calculational error. The choice becomes more difficult depending on the error in the data. Several approaches have been suggested.4,11

A large relative decrease in the magnitude of the eigenvalues is one indicator of the correct number of factors. It can often be useful to plot the eigenvalues as a function of factor number and look for sharp breaks in the slope of the line.12 If the eigenvalue is a measure of the information content of the corresponding eigenvector, then only sufficiently 'large' eigenvalues need to be retained in order to reproduce the variation initially present in the data. One of the most commonly used and abused criteria for selecting the number of factors is retaining only those eigenvalues greater than 1.13 The argument is made that the normalized variables each carry one unit of variance. Thus, if an eigenvalue is less than one, it carries less information than one of the initial variables and is therefore not needed. However, Kaiser & Hunka14 make a strong argument that although an eigenvalue greater than one does set a lower limit on the number of factors to be retained, it does not set a simultaneous upper bound. Thus, there must be at least as many factors as there are eigenvalues greater than one, but there can be more than that number that are important to the understanding of the system's behavior.

Hopke15 has suggested a useful empirical criterion for choosing the number of retained eigenvectors. In a number of cases of airborne particulate matter composition source identification problems, Hopke found that choosing the number of factors containing variance greater than one after an orthogonal rotation provided a stable solution. Since the eigenvector analysis artificially compresses variance into the first few factors, reapportioning the variance using the rotations described in the next section will result in more factors with total variance greater than one than there are eigenvalues greater than one. In many cases, this number of factors will stay the same even after rotating more factors.

For a different type of test, the original data are reproduced using only the first factor and compared point-by-point with the original data. Several measures of the quality of fit are calculated, including chi-squared,

$$\chi^2 = \sum_{j=1}^{n}\sum_{i=1}^{m}\left(\frac{x_{ij} - \hat{x}_{ij}}{\sigma_{ij}}\right)^2 \qquad (42)$$

where $\hat{x}_{ij}$ is the reconstructed data point using p factors and $\sigma_{ij}$ is the uncertainty in the value of $x_{ij}$. The Exner function16 is a similar measure and is calculated by

$$EP = \left[\frac{\sum_{j=1}^{n}\sum_{i=1}^{m}\left(x_{ij} - \hat{x}_{ij}\right)^{2}}{\sum_{j=1}^{n}\sum_{i=1}^{m}\left(x_{ij} - \bar{x}\right)^{2}}\right]^{1/2} \qquad (43)$$

where $\bar{x}$ is a grand ensemble average value. The empirical indicator function suggested by Malinowski17 can be used for this purpose and is calculated as follows:

$$\mathrm{RSD} = \left[\frac{\sum_{j=p+1}^{m}\lambda_j}{n(m-p)}\right]^{1/2} \quad \text{for } n > m \qquad (44)$$

$$\mathrm{IND} = \frac{\mathrm{RSD}}{(m-p)^{2}} \qquad (45)$$

$$\mathrm{RSD} = \left[\frac{\sum_{j=p+1}^{n}\lambda_j}{m(n-p)}\right]^{1/2} \quad \text{for } n < m \qquad (46)$$

$$\mathrm{IND} = \frac{\mathrm{RSD}}{(n-p)^{2}} \qquad (47)$$

where the $\lambda_j$ are the eigenvalues from the diagonalization. This function has proven very successful with spectroscopy results.17,17a However, it has not proven to be as useful with other types of environmental data.18 Finally, the root-mean-square error and the arithmetic average of the absolute values of the point-by-point errors are also calculated. The data are next reproduced with both the first and second factors and again compared point-by-point with the original data. The procedure is repeated, each time with one additional factor, until the data are reproduced with the desired precision. If p is the minimum number of factors needed to adequately reproduce the data, then the remaining n − p factors can be eliminated from the analysis. These tests do not provide unequivocal indicators of the number of factors that should be retained. Judgement becomes necessary in evaluating all of the test results and deciding upon a value of p. In this manner the dimension of the A and F matrices is reduced from n to p.
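
These retention diagnostics can be scanned over trial values of p in a single pass, as in this sketch (Python/numpy; the uncertainties are taken as unity in standardized units, n > m is assumed for the indicator function, and the function name is illustrative):

    import numpy as np

    def retention_tests(X, sigma=1.0):
        m, n = X.shape                                # m variables, n samples
        Z = (X - X.mean(1, keepdims=True)) / X.std(1, keepdims=True)
        lam, U = np.linalg.eigh(Z @ Z.T / n)
        lam, U = lam[::-1], U[:, ::-1]                # descending eigenvalues
        for p in range(1, m):
            Zhat = U[:, :p] @ (U[:, :p].T @ Z)        # data rebuilt with p factors
            chi2 = (((Z - Zhat) / sigma) ** 2).sum()  # eqn (42)
            rsd = np.sqrt(lam[p:].sum() / (n * (m - p)))  # eqn (44)
            ind = rsd / (m - p) ** 2                  # eqn (45)
            print(p, lam[p - 1], chi2, ind)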

The compression of variance into the first factors will improve the ease with which the number of factors can be determined. However, their nature has now been mixed by the calculational method. Thus, once the number of factors has been determined, it is often useful to rotate the axes in order to provide a more interpretable structure.

Fig. 2. Illustration of the rotation of a coordinate system ($x_1$, $x_2$) to a new system ($y_1$, $y_2$) by an angle $\theta$.

2.4 Rotation of factor axes
The axis rotations can retain the orthogonality of the eigenvectors or they can be oblique. Depending on the initial data treatment, the axis rotations may be in the scaled and/or centered space or in the original variable scale space. The latter approach has proved quite useful in a number of chemical applications2 and in environmental systems.19 To begin the discussion of factor rotation, it is useful to describe how one set of coordinate system axes can be transformed into a new set. Suppose that it is necessary to change from the coordinate system $x_1$, $x_2$ to the system $y_1$, $y_2$ by rotating through an angle $\theta$ as shown in Fig. 2. For this two-dimensional system, it is easy to see that

$$y_1 = \cos\theta\,x_1 + \sin\theta\,x_2$$
$$y_2 = -\sin\theta\,x_1 + \cos\theta\,x_2 \qquad (48)$$

In matrix form, this equation could be written as

$$\mathbf{y}' = \mathbf{x}'\mathbf{T} \qquad (49)$$

where T is the transformation matrix.

In order to obtain a more general case, it is useful to define the angles of rotation in the manner shown in Fig. 3.

Fig. 3. Illustration of the rotation of a coordinate system ($x_1$, $x_2$) to a new coordinate system ($y_1$, $y_2$), showing the definitions of the full set of angles used to describe the rotation.

Now the angle $\theta_{ij}$ is the angle between the ith original reference axis and the jth new axis. Assuming that this is a rotation that maintains orthogonal axes (rigid rotation), then, for two dimensions,

$$\theta_{21} = 90^\circ - \theta_{11}, \qquad \theta_{12} = 90^\circ + \theta_{11}, \qquad \theta_{22} = \theta_{11} \qquad (50)$$

There are also trigonometric relationships that exist,

$$\sin\theta_{11} = \cos(90^\circ - \theta_{11}) = \cos\theta_{21}, \qquad -\sin\theta_{11} = \cos(90^\circ + \theta_{11}) = \cos\theta_{12} \qquad (51)$$

so that eqn (48) can be rewritten as

$$y_1 = \cos\theta_{11}\,x_1 + \cos\theta_{21}\,x_2$$
$$y_2 = \cos\theta_{12}\,x_1 + \cos\theta_{22}\,x_2 \qquad (52)$$

This set of equations can then be easily expanded to n orthogonal axes, yielding

$$y_1 = \cos\theta_{11}\,x_1 + \cos\theta_{21}\,x_2 + \cdots + \cos\theta_{n1}\,x_n$$
$$\vdots \qquad (53)$$
$$y_n = \cos\theta_{1n}\,x_1 + \cos\theta_{2n}\,x_2 + \cdots + \cos\theta_{nn}\,x_n$$

A transformation matrix T can then be defined such that its elements are

$$t_{ij} = \cos\theta_{ij} \qquad (54)$$

Then, for a collection of N row vectors in a matrix X with n columns,

$$\mathbf{Y} = \mathbf{X}\mathbf{T} \qquad (55)$$

and Y has the coordinates for all N row vectors in terms of the n rotated axes. For the rotation to be rigid, T must be an orthogonal matrix. Note that the column vectors in Y can be thought of as new variables made up of linear combinations of the variables in X, with the elements of T being the coefficients of those combinations. Also, a row vector of X gives the properties of a sample in terms of the original variables, while a row vector of Y gives the properties of a sample in terms of the transformed variables.
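
The following sketch (Python/numpy; a 30° rotation of ten random points, all values illustrative) puts eqns (48)-(55) into practice:

    import numpy as np

    theta = np.deg2rad(30.0)
    T = np.array([[np.cos(theta), -np.sin(theta)],   # t_ij = cos(theta_ij),
                  [np.sin(theta),  np.cos(theta)]])  # eqn (54)

    rng = np.random.default_rng(4)
    X = rng.normal(size=(10, 2))          # N = 10 row vectors in the old axes
    Y = X @ T                             # coordinates in the new axes, eqn (55)

    # T is orthogonal, so the rotation is rigid: lengths are preserved.
    assert np.allclose(T.T @ T, np.eye(2))
    assert np.allclose(np.linalg.norm(X, axis=1), np.linalg.norm(Y, axis=1))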

3 APPLICATIONS

3.1 Data screening
To illustrate this approach, a set of geological data from the literature has been perturbed in various ways to demonstrate the utility of a factor analysis.20 These data for 29 elements were obtained by neutron activation analysis of 15 samples taken from Borax Lake, California, where a lava flow had once occurred.21 The samples were formed by two different sources of lava that combined in varying amounts and thus formed a mixture of two distinct mineral phases. The data were acquired to identify the elemental profiles of each source mineral phase.

3.1.1 Decimal point errors
Decimal point errors are common mistakes made in data handling. Therefore, to simulate a decimal point error, the geological data were changed so that one of the 15 thorium values was off by a factor of ten. The data were processed by classical factor analysis to produce factor loadings and factor scores for 2-10 causal factors.

TABLE 1
Factor loadings of the geological data with a decimal point error

Element   Factor-1   Factor-2   Factor-3
Th          0·030      0·046      0·972
Sm          0·899      0·360      0·155
U           0·959      0·264      0·090
Na          0·778      0·362      0·342
Sc         -0·971     -0·221     -0·063
Mn         -0·970     -0·226     -0·075
Cs          0·962      0·247      0·093
La          0·904      0·286      0·181
Fe         -0·970     -0·226     -0·064
Al         -0·937     -0·155      0·034
Dy         -0·002      0·828      0·208
Hf          0·708      0·288      0·481
Ba         -0·262     -0·834      0·095
Rb          0·939      0·074      0·204
Ce          0·956      0·220      0·128
Lu          0·771      0·266      0·173
Nd          0·878      0·142      0·211
Yb          0·925      0·084     -0·119
Tb          0·923      0·024      0·032
Ta          0·837      0·303      0·334
Eu         -0·972     -0·212     -0·068
K           0·908      0·263     -0·092
Sb          0·621      0·674      0·024
Zn         -0·927     -0·247     -0·123
Cr         -0·976     -0·191     -0·058
Ti         -0·972     -0·211     -0·039
Co         -0·971     -0·216     -0·073
Ca         -0·963     -0·119     -0·121
V          -0·948     -0·208     -0·091

The factor loadings were then examined, starting with the two factor solution, to determine the nature of the identified factors. For the two factor solution, one factor of many highly correlated variables was identified, in addition to a second factor with high loadings for Dy, Ba and Sb. For the three factor solution, the above factors were again observed, in addition to a factor containing most of the variance in thorium and little of any other variable (see Table 1). A factor that reflects a high variance for only a single variable shows the typical identifying characteristics of a possible data error. To further investigate the nature of this factor, the factor scores of the three factor solution were calculated (Table 2).

TABLE 2
Factor scores of geological data with a decimal point error

Sample   Factor-1   Factor-2   Factor-3
1         -0·179      0·101      3·51
2         -0·240      2·63      -0·591
3          0·037      1·06      -0·027
4          0·846     -0·517     -0·462
5          0·753     -0·560      0·011
6          0·782     -0·520     -0·029
7          0·900     -0·333     -0·426
8          0·645      0·522     -0·387
9          1·23      -1·53      -0·162
10         0·804      0·662     -0·074
11        -2·08      -1·02      -0·601
12        -1·58      -0·236      0·148
13        -1·12      -0·376     -0·500
14        -0·679     -0·486     -0·290
15        -0·109      0·613     -0·120

As can be seen, there is a very large contribution of factor-3 for sample-1, indicating that most of the variance was due to sample-1, the sample with the altered thorium value. Having made these observations, one must then go back to the raw data and decide if an explanation can be found for this abnormality, and then take the appropriate corrective action.

3.1.2 Random variables
If there was no physical explanation for the unique thorium contribution in the preceding example and the factor scores did not identify a specific sample, then the error may have been distributed throughout the data for the variable. An example of this would be the presence of a noisy variable in the set of data. To demonstrate how to identify a variable dominated by random noise, an additional variable, X1, was added to the geological data set. The values for X1 were generated using a normally-distributed random number generator to produce values with a mean of 5·0 and standard deviation 0·5. Using the procedure described above, factor analysis identified the random variable, as can be seen in Table 3. Again, the same basic characteristics were observed in the factor, i.e. a large loading for variable X1 and little contribution from the other variables. If one were to look at the factor scores for this case, the distribution of the values for factor-3 would not identify a specific sample. Thus, the problem with variable X1 is distributed over all the samples.

TABLE 3
Factor loadings of the geological data with a random variable X1

Element   Factor-1   Factor-2   Factor-3
Th          0·966      0·245     -0·045
Sm          0·903      0·389     -0·002
U           0·955      0·286     -0·058
Na          0·832      0·340      0·153
Sc         -0·968     -0·236      0·048
Mn         -0·966     -0·243      0·060
Cs          0·958      0·270     -0·063
La          0·915      0·311      0·005
Fe         -0·966     -0·242      0·056
Al         -0·888     -0·223      0·296
Dy          0·048      0·757      0·341
Hf          0·728      0·362     -0·127
Ba         -0·231     -0·841     -0·50
Rb          0·963      0·078     -0·011
Ce          0·948      0·260     -0·122
Lu          0·710      0·408     -0·439
Nd          0·909      0·137      0·041
Yb          0·865      0·146     -0·343
Tb          0·944     -0·010      0·058
Ta          0·847      0·370     -0·059
Eu         -0·968     -0·226      0·074
K           0·880      0·277     -0·097
Sb          0·596      0·690     -0·100
Zn         -0·943     -0·241     -0·032
Cr         -0·975     -0·200      0·042
Ti         -0·968     -0·223      0·049
Co         -0·969     -0·233      0·050
Ca         -0·958     -0·154      0·116
V          -0·941     -0·232      0·109
X1         -0·006      0·290      0·856

If there is no physical reason for this variable being unique, then the investigator should explore his analytical method for possible errors.

3.1.3 Interfering variables
In some cases, especially in spectral analysis, two variables may interfere with each other. For example, in neutron activation analysis, Mn and Mg have overlapping peaks. If there is a systematic error in separating these two peaks, errors in analytical values will be obtained. To demonstrate this, two correlated random variables, Y1 and Y2, were generated and added to the geological data set.

TABLE 4
Factor loadings of the geological data with two correlated random variables, Y1 and Y2

Element   Factor-1   Factor-2   Factor-3
Y1          0·133     -0·160      0·888
Y2         -0·116     -0·010      0·857
Th          0·976      0·202     -0·004
Sm          0·916      0·352     -0·003
U           0·968      0·242     -0·009
Na          0·813      0·382     -0·145
Sc         -0·978     -0·193      0·027
Mn         -0·978     -0·200      0·018
Cs          0·972      0·224     -0·011
La          0·920      0·288      0·067
Fe         -0·977     -0·198      0·027
Al         -0·933     -0·117     -0·143
Dy          0·031      0·858     -0·066
Hf          0·742      0·361      0·229
Ba         -0·277     -0·783      0·044
Rb          0·956      0·066     -0·095
Ce          0·967      0·208      0·057
Lu          0·785      0·270      0·303
Nd          0·898      0·136     -0·048
Yb          0·917      0·020      0·050
Tb          0·924     -0·009     -0·136
Ta          0·862      0·340      0·189
Eu         -0·980     -0·182      0·049
K           0·904      0·210     -0·004
Sb          0·646      0·616     -0·273
Zn         -0·938     -0·234     -0·022
Cr         -0·982     -0·163     -0·025
Ti         -0·977     -0·180      0·013
Co         -0·979     -0·191      0·017
Ca         -0·972     -0·103      0·041
V          -0·956     -0·188     -0·033

The Y1 variables were produced in the same manner as the X1 variables described above. The Y2 variables were generated by dividing the Y1 variable by 3 and adding a normally distributed random error with a standard deviation of 0·15. After factor analysis, the factor loadings identified the nature of the problem in the data, as can be seen in Table 4. The large correlation between variables Y1 and Y2 for factor-3, along with the absence of any other contributing variable, was very obvious.

The question now arises: what if the two interfering variables are also correlated with other variables in the data? To investigate this, the concentrations of two uncorrelated elements, thorium and europium, were altered to simulate an interference.

TABLE 5
Factor loadings of the geological data with errors present in two previously uncorrelated variables, Th and Eu

Element   Factor-1   Factor-2   Factor-3
Th          0·965      0·234      0·091
Sm          0·899      0·383      0·093
U           0·956      0·275      0·087
Na          0·821      0·391     -0·109
Sc         -0·969     -0·226     -0·067
Mn         -0·967     -0·233     -0·077
Cs          0·958      0·259      0·093
La          0·904      0·309      0·129
Fe         -0·968     -0·232     -0·069
Al         -0·897     -0·165     -0·290
Dy          0·032      0·827     -0·198
Hf          0·712      0·370      0·253
Ba         -0·231     -0·820     -0·102
Rb          0·964      0·091     -0·021
Ce          0·947      0·242      0·160
Lu          0·717      0·324      0·498
Nd          0·899      0·156      0·024
Yb          0·882      0·081      0·264
Tb          0·944      0·009     -0·095
Ta          0·831      0·360      0·256
Eu          0·035     -0·178      0·935
K           0·887      0·248      0·099
Sb          0·626      0·669     -0·082
Zn         -0·932     -0·253     -0·070
Cr         -0·976     -0·194     -0·059
Ti         -0·968     -0·213     -0·076
Co         -0·969     -0·223     -0·075
Ca         -0·968     -0·137     -0·056
V          -0·942     -0·219     -0·118

Europium values were altered by adding 10% of the thorium value plus a normally distributed random error about thorium with a 5% standard deviation. The thorium values were altered similarly, except that 10% of the europium value was subtracted rather than added. As before, the problem in the data set was identified (see Table 5) in factor-3 by the same characteristics previously described. Again, this factor establishes the possibility of a problem and it is up to the researcher to identify the nature of the difficulty.

TABLE 6
Factor loadings of the geological data with errors present in two previously correlated variables, Th and Sm

Element   Factor-1   Factor-2   Factor-3
Th          0·854      0·472      0·206
Sm          0·414      0·810     -0·208
U           0·860      0·428      0·268
Na          0·831      0·132      0·412
Sc         -0·883     -0·405     -0·224
Mn         -0·879     -0·412     -0·230
Cs          0·862      0·431      0·252
La          0·811      0·427      0·297
Fe         -0·880     -0·408     -0·229
Al         -0·732     -0·616     -0·134
Dy          0·106     -0·178      0·857
Hf          0·624      0·421      0·340
Ba         -0·108     -0·340     -0·796
Rb          0·927      0·273      0·104
Ce          0·835      0·483      0·226
Lu          0·486      0·779      0·257
Nd          0·861      0·279      0·165
Yb          0·720      0·597      0·054
Tb          0·936      0·187      0·034
Ta          0·706      0·516      0·330
Eu         -0·890     -0·391     -0·217
K           0·783      0·435      0·242
Sb          0·542      0·282      0·678
Zn         -0·877     -0·339     -0·257
Cr         -0·896     -0·389     -0·194
Ti         -0·882     -0·408     -0·211
Co         -0·882     -0·411     -0·220
Ca         -0·875     -0·409     -0·132
V          -0·846     -0·436     -0·210

Now, consider if the two interfering variables are correlated as well as interfering with each other. Again the data set was altered as described above, except that the samarium values were substituted for the europium values. The resulting factor loadings are given in Table 6. The problem appears in factor-2, but it is not obvious that it is a difficulty, since there are high values for some other elements. For these data, the researcher might be able to identify the problem concerning samarium if he had sufficient insight into the true nature of the data. In this case it is known that there are only two mineral phases. The existence of factor-3 indicates that either an error has been made in assuming only two phases or there are errors in the data set. Thus, knowledge of the nature of the system under study would be needed to find this type of error. The two variables involved could also be an indication of a problem. If two variables that would not normally be thought to be interrelated appear together in a factor, it could indicate a correlated error.

In both of the previous two examples, there were two interfering variables present, either thorium and samarium or thorium and europium. Potential problems in the data were recognized after using factor analysis, i.e. in samarium and europium. However, no associated problem was noticed with thorium in either case because of the relative concentrations of the two variables. Since thorium was much less sensitive to a change (because of its large magnitude relative to samarium or europium), the added error in thorium was interpreted as unnoticeable increases in the variance of the thorium values. For an actual common interference, consider the Mn-Mg problem in neutron activation analysis. If the spectral analysis program used was having a problem properly separating the Mn and Mg peaks, the factor analysis would usually identify the problem as being in the Mg values, since the sensitivity of Mg to neutron activation analysis is so much less than that of Mn. However, if the actual levels of Mn were so low that the Mg peak was relatively the same size as the Mn peak, then the problem could show up in both variables.

3.1.4 Summary and conclusions
The approach to classical factor analysis described in this section, i.e. doing the analysis for varying numbers of factors without prior assumptions as to the number of factors, prevents one from obtaining erroneous results through implicit computer code assumptions. Identification of a factor containing most of the variance of one variable, with little variance from the other variables, pinpoints a possible difficulty in the data if the singularity has no obvious physical significance. Examination of the factor scores will determine whether the problem is isolated to a few samples or spread over all the samples. Having this information, one may then go back to the raw data and take the appropriate corrective action.

Classical factor analysis has the ability to identify several types of errors in data after they have been generated. It is thus ideally suited for scanning large data sets. The ease of the identification technique makes it a beneficial tool to use before reduction and analysis of large data sets and should, in the long run, save time and effort.
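
The whole screening procedure of this section can be sketched compactly (Python/numpy; a synthetic two-source data set stands in for the geological data, and all names and numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(5)
    mix = rng.uniform(size=30)                       # two-source mixing fractions
    X = np.outer([5.0, 1.0, 3.0], mix) + np.outer([1.0, 4.0, 2.0], 1.0 - mix)
    X += rng.normal(scale=0.05, size=X.shape)        # measurement noise
    X[0, 0] *= 10.0                                  # simulated decimal-point error

    Z = (X - X.mean(1, keepdims=True)) / X.std(1, keepdims=True)
    lam, U = np.linalg.eigh(Z @ Z.T / X.shape[1])
    loadings = U[:, ::-1] * np.sqrt(np.clip(lam[::-1], 0.0, None))
    scores = U[:, ::-1].T @ Z

    print(loadings.round(2))   # look for a factor dominated by a single variable
    print(scores.round(2))     # its scores should single out the aberrant sample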

3.2 Data interpretation
To illustrate the use of factor analysis to assist in the interpretation of multivariate environmental data, it will be applied to a set of physical and chemical data resulting from the analysis of surficial sediment samples taken from a shallow, eutrophic lake in western New York State, Chautauqua Lake.

Chautauqua Lake is a water resource with a long history of management attempts of various types. The outfall of the lake, the Chadakoin River, is controlled by a dam at Jamestown, New York, so that the lake volume can be adjusted. Various chemical treatment techniques have been employed to control biological activity in this eutrophic lake: copper sulfate was applied in the late 1930s and early 1940s to control algae, and sodium arsenite was added to control rooted macrophytes in 1953, with intensive spraying from 1955 to 1963. Since 1963, various organic herbicides have been employed.

Chautauqua Lake is a 24-km long, narrow lake in southwestern New York State. It is similar in surface configuration to the Finger Lakes of New York. However, in contrast to the Finger Lakes, Chautauqua Lake is a warm, shallow lake. Beginning in 1971, this lake has been the subject of an intensive multidisciplinary study by the State University College at Fredonia (Lake Erie Environmental Studies, unpublished). In the early studies of the lake, Lis & Hopke22 found unusually high arsenic concentrations in the waters of Chautauqua Lake.

One possible source of this arsenic was suggested to be the release from the bottom sediments of arsenic residues from the earlier chemical treatment of the lake. In an effort to investigate this possibility, the abundance of arsenic23 along with 14 other elements was determined by neutron activation analysis for grab samples of the bottom sediments which had been taken for sediment particle size analysis.24 The concentrations of 15 elements were determined quantitatively and are given in Ref. 25. Ruppert et al.23 reported the particle size distribution as characterized by per cent sand, per cent silt, and per cent clay, as well as the per cent organic matter and water depth above the sample. In addition, parameters describing the particle size distribution as determined by Clute24 include measures of the average grain size, mean grain size, median grain size, and parameters describing the shape of the distribution: sorting (standard deviation), skewness, kurtosis and normalized kurtosis. These values are calculated using the methods described in Ref. 26.

Several additional variables were also calculated. These variables were the square of the median and mean grain sizes in millimeters and the reciprocal of the median and mean grain sizes in millimeters. The squared diameters provide a measure of the available surface area, assuming that the particles are spherical. The reciprocals of the diameter should help indicate the surface area per unit volume of the particles. Seventy-nine samples were available for which there were complete data for all of the variables.

An interpretation can be made of the factor loading matrix (Table 7) in terms of both the physical and chemical variables. The variables describing particle diameter in millimeters, and per cent sand, have high positive loadings for factor one. This factor could be thought of as the amount of coarse grain source material in the sediment. The amount of sand (coarse-grained sediment) is positively related only to this first factor. Of the elements determined, only sodium and hafnium have positive coefficients for this factor.

The second factor is related to the available specific surface area of the sedimental particles. The amount of surface area per unit volume would be given by

$$R = \frac{4\pi r^{2}}{\tfrac{4}{3}\pi r^{3}} = \frac{3}{r} = \frac{6}{d}$$

where d is the particle diameter. Thus the surface to volume ratio is proportional to the inverse of the particle radius. It is the inverse diameter that has the highest dependence on the second factor. The clay fraction contains the smallest particles which are the ones with the largest surface to volume ratio and, thus, it is reasonable that the per cent clay is also strongly dependent on the second factor. The elements that have positive loadings are adsorbing onto the surface of the particles. This factor has a significant value for As, Br, K and La.

The third factor represents the glacial till source material. The source for sediments found in the lake consists of Upper Devonian siltstones and shales overlain by Pleistocene glacial drift.27,28 In order to compare the average elemental abundances of the lake sediments with the source material, two samples of surface soil were obtained from forested areas that had been left dormant for more than 50 years. Two samples were obtained of shale typical of the area. These samples show quite similar elemental concentrations to those samples taken from the center area of the lake where there has been little active sedimentation. It is suggested that factor-3 accounts for the presence of this glacial till in the sedimental material and explains the high correlation coefficients (> 0·8) reported by Hopke et al.25 between Sb, Cs, Sc and Ta.

The only variable having a high loading for factor-4 is the per cent silt. Since the silt represents a mixture of size fractions that can be carried by streams, this factor may represent active sedimentation at or near stream deltas. The silty material is deposited on the delta and then winnowed away by the action of lake currents and waves.

The final factor has a high loading for sorting. The sorting parameter is a measure of the standard deviation of the particle size distribution. A large standard deviation implies that there is a wide mixture of grain sizes in the sample. Wave action sorts particles by carrying away fine-grained particles and leaving the coarse-grained. Therefore, the fifth factor may represent the wave and current action that transports the sedimental material within the lake.

3.3 Receptor modeling
The first receptor modeling applications of classical factor analysis were by Prinz & Stratmann29 and Blifford & Meeker.30 Prinz & Stratmann examined both the aromatic hydrocarbon content of the air in 12 West German cities and data from Colucci & Begeman31 on the air quality of Detroit. In both cases they found three factor solutions and used an orthogonal varimax rotation to give more readily interpretable results. Blifford & Meeker30 used a principal component analysis with both varimax and a non-orthogonal rotation to examine particle composition data collected by the National Air Sampling Network (NASN) during 1957-1961 in 30 US cities. They were generally not able to extract much interpretable information from their data. Since there are a very wide variety of particle sources among these 30 cities and only 13 elements were measured, it is not surprising that they were not able to provide much specificity to their factors.

The factor analysis approach was then reintroduced by Hopke et al.25 and Gaarenstroom et al.32 for their analyses of particle composition data from Boston, MA and Tucson, AZ, respectively. In the Boston data for 90 samples at a variety of sites, six common factors were identified that were interpreted as soil, sea salt, oil-fired power plants, motor vehicles, refuse incineration and an unknown manganese-selenium source. The six factors accounted for about 78% of the system variance. There was also a high unique factor for bromine that was interpreted to be fresh automobile exhaust. Since lead was not determined, these motor vehicle related factor loading assignments remain uncertain. Large unique factors for antimony and selenium were found. These factors represent emission of volatile species whose concentrations do not covary with other elements emitted by the same source.

TABLE 7
Varimax-rotated maximum likelihood factor matrix for Chautauqua Lake sediment

Variable                  Factor-1   Factor-2   Factor-3   Factor-4   Factor-5       h²
Arsenic                    -0·3446     0·3794     0·2824     0·0426     0·4099    0·5123
Bromine                    -0·2552     0·3883     0·3767    -0·0036     0·3810    0·5030
Cesium                     -0·2115     0·2661     0·9106     0·0575     0·2278    0·9999
Europium                   -0·4300     0·2567     0·1856     0·0869     0·2087    0·3363
Iron                       -0·2276     0·1747     0·4452     0·0495     0·3400    0·3986
Gallium                    -0·1034     0·2065     0·2147     0·0626     0·0994    0·1132
Hafnium                     0·2603    -0·2164     0·3318    -0·2404    -0·1461    0·3038
Potassium                  -0·0706     0·4068     0·4062     0·1228     0·4017    0·5119
Lanthanum                  -0·4579     0·3777     0·3797     0·1805     0·4553    0·7364
Manganese                  -0·2092     0·1733     0·1289     0·1643     0·2809    0·1963
Sodium                      0·2971    -0·3436    -0·3540    -0·1368    -0·5410    0·6430
Antimony                   -0·1217     0·1677     0·9321     0·0185     0·0578    0·9154
Scandium                   -0·1094     0·2535     0·8657    -0·0188     0·1611    0·8625
Tantalum                   -0·1097     0·0536     0·7240     0·0491     0·1009    0·5517
Terbium                    -0·1290     0·1073     0·4483     0·0192     0·2587    0·2964
% Sand                      0·7607    -0·2507    -0·2248    -0·3705    -0·4070    0·9950
% Silt                     -0·8331    -0·1036     0·0872     0·5029     0·1782    0·9971
% Clay                     -0·3729     0·7066     0·2704     0·0840     0·5236    0·9955
% Organic matter           -0·2670     0·1018     0·3310    -0·0249     0·4020    0·3534
Depth (m)                  -0·2290     0·3135     0·3716    -0·0203     0·2623    0·3580
Median grain size (mm)      0·9282    -0·2265    -0·1732    -0·1148    -0·1925    0·9931
Mean grain size (mm)        0·9293    -0·1446    -0·1661    -0·0122    -0·2828    0·9922
(Median grain size)²        0·9639    -0·1087    -0·1188     0·0619    -0·0690    0·9636
(Mean grain size)²          0·9307    -0·0357    -0·1105     0·1900    -0·1615    0·9419
Median grain size (φ)      -0·6532     0·5788     0·2711     0·1909     0·3442    0·9901
Mean grain size (φ)        -0·6925     0·4754     0·2631     0·1842     0·4306    0·9941
Sorting                    -0·1680    -0·0328     0·0874    -0·0008     0·6647    0·4788
Skewness                   -0·1075    -0·7110    -0·1268     0·0847     0·1005    0·5504
Kurtosis                    0·4975    -0·3011    -0·2123    -0·1736    -0·3754    0·5543
Normalized kurtosis         0·4322    -0·4230    -0·2549    -0·1618    -0·4844    0·6915
(Median grain size)⁻¹      -0·2184     0·9163     0·1567     0·0715     0·1241    0·9324
(Mean grain size)⁻¹        -0·3226     0·8463     0·2490     0·1103     0·3114    0·9914

Subsequent studies by Thurston et al.,33 in which other elements including sulfur and lead were measured, showed a similar result. They found that the selenium was strongly correlated with sulfur for the warm season (6 May to 5 November). This result is in agreement with the Whiteface Mountain results34 and suggests that selenium is an indicator of long-range transport of coal-fired power plant effluents to the northeastern US. They found lead to be strongly correlated with bromine, and this pair was readily interpreted as motor vehicle emissions.

In the study of Tucson, AZ,32 whole filter data were analyzed separately at each site. Factors were identified as soil, automotive, several secondary aerosols such as (NH4)2SO4, and several unknown factors. Also discovered was a factor that represented the variation of elemental composition in the aliquots of their neutron activation standard containing Na, Ca, K, Fe, Zn and Mg. This finding illustrates one of the important uses of factor analysis: screening the data for noisy variables or analytical artifacts.

One of the valuable uses of this type of analysis is in screening large data sets to identify errors.20 With the use of atomic and nuclear methods to analyze environmental samples for a multitude of elements, very large data sets have been generated. Because of the ease in obtaining these results with computerized systems, the elemental data acquired are not always as thoroughly checked as they should be, leading to some, if not many, bad data points. It is advantageous to have an efficient and effective method to identify problems with a data set before it is used for further studies. Principal component factor analysis can provide useful insight into several possible problems that may exist in a data set including incorrect single values and some types of systematic errors.

Gatz35 used a principal components analysis of aerosol composition and meteorological data for St Louis, MO, taken as part of project METROMEX.36,37 Nearly 400 filters collected at 12 sites were analyzed for up to 20 elements by ion-induced X-ray fluorescence. Gatz used additional parameters in his analysis, including day of the week, mean wind speed, per cent of time with the wind from NE, SE, SW or NW quadrants or variable, ventilation rate, and rain amount and duration. At several sites the inclusion of wind data permitted the extraction of additional factors that allowed identification of motor vehicle emissions in the presence of specific point sources of lead such as a secondary copper smelter. An important advantage of this form of factor analysis is the ability to include parameters such as wind speed and direction or particle size in the analysis.

In the early applications of factor analysis to particulate compositional data, it was generally easy to identify a fine particle mode lead/bromine factor that could be assigned as motor vehicle emissions. In many cases, a calcium factor sometimes associated with lead could be found in the coarse mode analysis and could be assigned as road dust. However, there is a problem of diminishing lead concentrations in gasoline. As the lead and related bromine concentrations diminish, the clearly distinguishable covariance of these two elements is disappearing. In a study of particle sources in southeast Chicago based on samples from 1985 and 1986, much lower lead levels were observed and the lead/bromine correlation was quite weak.38 Thus, the identification of highway emissions through factor analysis based on lead, or lead and bromine, is becoming more and more difficult, and other analyte species are going to be needed in the future.

A problem that exists with these forms of factor analysis is that they do not permit quantitative source apportionment of particle mass or of specific elemental concentrations. In an effort to find an alternative method that would provide information on source contributions when only the ambient particulate analytical results are available,18,38-40 target transformation factor analysis (TTFA) has been developed, in which uncentered but standardized data are analyzed. In this analysis, resolution similar to that obtained from a chemical mass balance (CMB) analysis can be obtained. However, a CMB analysis can be made on a single sample if the source data are known, while TTFA requires a series of samples with varying impacts by the same sources but does not require a priori knowledge of the source characteristics. The objectives of TTFA are (1) to determine the number of independent sources that contribute to the system, (2) to identify the elemental source profiles, and (3) to calculate the contribution of each source to each sample.
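
A minimal sketch of the target-transformation idea follows (Python/numpy). This is an illustration of the projection test only, not the FANTASIA implementation; the row/column conventions, the normalization and the names are assumptions. A trial source profile is projected onto the retained factor space and the overlap is reported as the cosine between the test vector and its projection:

    import numpy as np

    def target_overlap(X, p, test_vector):
        # Uncentered but normalized data; samples in rows, elements in columns.
        Xs = X / np.linalg.norm(X, axis=0, keepdims=True)
        _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
        V = Vt[:p].T                          # basis of the p retained factor axes
        b = V @ (V.T @ test_vector)           # projection of the test vector
        return b @ test_vector / (np.linalg.norm(b) * np.linalg.norm(test_vector))

An overlap near unity suggests that the trial profile lies within the space spanned by the retained factors; profiles can also be refined iteratively from simple starting vectors, as described below.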

One of the first applications of TTFA was to the source identification of urban street dust.11 A sample of street dust was physically fractionated by particle size, density and magnetic susceptibility to produce 30 subsamples. Each subsample was analyzed by instrumental neutron activation analysis and atomic absorption spectroscopy to yield analytical results for 35 elements. The number of sources is determined by performing an eigenvalue analysis on the matrix of correlations between the samples. A target transformation determines the degree of overlap between an input source profile and one of the calculated factor axes. The input source profiles, called test vectors, are developed from existing knowledge of the emission profiles of various sources or by an iterative technique from simple test vectors.41 The identified source profiles are then used in a simple weighted least-squares determination of the mass contributions of the sources.42
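
The steps just described can be made concrete in a few lines of code. The following Python sketch is purely illustrative (the synthetic two-source data set and all names are invented, and it is not the software used in the published studies): it performs the eigenvalue analysis, a target transformation of a trial test vector, and a least-squares calculation of the mass contributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic receptor data: 30 samples x 8 elements generated from two
# hypothetical source profiles with varying non-negative contributions.
true_profiles = rng.uniform(0.1, 1.0, size=(2, 8))
contributions = rng.uniform(0.0, 5.0, size=(30, 2))
X = contributions @ true_profiles + rng.normal(0.0, 0.01, size=(30, 8))

# Step 1: eigenvalue analysis -- large eigenvalues mark real factors,
# small ones noise (here two eigenvalues dominate).
_, svals, Vt = np.linalg.svd(X, full_matrices=False)
print("eigenvalues:", np.round(svals**2, 1))

factor_axes = Vt[:2]                      # basis for the factor space

# Step 2: target transformation -- project a trial source profile (test
# vector) onto the factor space; close overlap marks a plausible source.
test_vector = true_profiles[0] + rng.normal(0.0, 0.05, size=8)
projected = factor_axes.T @ (factor_axes @ test_vector)
overlap = projected @ test_vector / (
    np.linalg.norm(projected) * np.linalg.norm(test_vector))
print("overlap of test vector with factor space:", round(float(overlap), 3))

# Step 3: least-squares determination of each source's contribution to
# each sample, given the accepted source profiles (the weighting by
# measurement uncertainty is omitted here for brevity).
contrib_hat, *_ = np.linalg.lstsq(true_profiles.T, X.T, rcond=None)
print("contributions for first sample:", np.round(contrib_hat[:, 0], 2))
```

In a real application the test vectors would come from published emission profiles, and the projection would be iteratively refined rather than accepted in one pass.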

In the analysis of the street dust, six sources were identified including soil, cement, tire wear, direct automobile exhaust, salt and iron particles. The lead concentration of the motor vehicle source was found to be 15%, with a lead to bromine ratio of 0·39. This ratio is in good agreement with the values obtained by Dzubay et al.43 for Los Angeles freeways and in the range presented by Harrison & Sturges44 in their extensive review of the literature. A comparison of the actual mass fractions with those calculated from the TTFA results shows that the TTFA provided a good reproduction of the mass distribution, and the source apportionments of the street dust suggest that a substantial fraction of the urban roadway dust is anthropogenic in origin.

One of the principal advantages of TTFA is that it can identify the source composition profiles as they exist at the receptor site. There can be changes in the composition of the particles in transit from the source to the receptor, and approaches that provide these modified source profiles should improve the receptor model results. Chang et al.45 have applied TTFA to an extensive set of data from St Louis, MO, to develop source composition profiles based on a subset selection process developed by Rheingrover & Gordon.46 They selected samples from a data base such as the one obtained in the Regional Air Pollution Study (RAPS) of St Louis, MO, that were heavily influenced by major sources of each element. These samples were identified according to the following criteria:

(1) Concentration of the element in question X > X̄ + Zσ, where X̄ is the average concentration of that particular element for each station and size fraction (coarse or fine particle size fraction), Z is typically set at about three for most elements, and σ is the standard deviation of the concentration of that element.

(2) The standard deviation of the 6- or 12-h average wind directions for most samples, or minute averages for 2-h samples, taken during intensive periods, is less than 20°.

Samples that are strongly affected by emissions from a source were identified through the observation of clustering of the mean wind directions for the selected sampling periods, with the angles pointing toward the source.
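
These two criteria are simple threshold rules and are easy to express in code. The sketch below is a hedged illustration only: the data-frame layout, the column names and the select_impacted helper are assumptions for the example, not the RAPS database format.

```python
import numpy as np
import pandas as pd

def select_impacted(df: pd.DataFrame, element: str,
                    z: float = 3.0, wind_sd_max: float = 20.0) -> pd.DataFrame:
    """Keep samples whose concentration of `element` is far above the
    site mean (criterion 1) and that had steady winds (criterion 2)."""
    mean, sd = df[element].mean(), df[element].std()
    high = df[element] > mean + z * sd
    steady = df["wind_dir_sd"] < wind_sd_max
    return df[high & steady]

# Illustrative use on a made-up data frame:
rng = np.random.default_rng(1)
frame = pd.DataFrame({"Pb": rng.lognormal(3.0, 0.5, 500),
                      "wind_dir_sd": rng.uniform(0.0, 60.0, 500)})
print(len(select_impacted(frame, "Pb")))
```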

The RAPS data of about 35 000 individual ambient aerosol samples were collected at 10 selected sampling sites in the vicinity of St Louis, MO, and were screened according to the criteria stated above. With wind trajectory analysis, specific emission sources could be identified even in cases where the sources were located very close together.46 A compilation of the selected impacted samples was made so that target transformation factor analysis could be employed to obtain elemental profiles for these sources at the various receptor sites.45

Thus, TTFA may be very useful in determining the concentration of lead in motor vehicle emissions as the mix of leaded fuel continues to change. Multivariate methods can thus provide considerable information regarding the sources of particles, including highway emissions, from only the ambient data matrix. The TTFA method represents a useful approach when source information for the area is lacking or suspect and if there is uncertainty as to the identification of all of the sources contributing to the measured concentrations at the receptor site. TTFA has been performed using FANTASIA.18a,40

Further efforts have recently been made by Henry & Kim47 on extending eigenvector analysis methods. They have been examining ways to incorporate the explicit physical constraints that are inherent in the mixture resolution problem into the analysis. Through the use of linear programming methods, they are better able to define the feasible region in which the solution must lie. There exists a limited region in the solution space because the elements of the source profiles must all be greater than or equal to zero (non-negative source profiles) and the mass contributions of the identified sources must also be greater than or equal to zero. Although there have been only limited applications of this expanded method, it offers an important additional tool to apply to those systems where a priori source profile data are not available. These methods provide a useful parallel analysis with CMB to help ensure that the profiles used are reasonable representations of the sources contributing to a given set of samples.
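
A minimal stand-in for this idea (not Henry & Kim's linear-programming code) is non-negative least squares, which imposes the same physical constraint that fitted contributions cannot be negative; the two source profiles below are invented for the example.

```python
import numpy as np
from scipy.optimize import nnls

# Two hypothetical source profiles (rows) over three chemical species.
profiles = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.3, 0.6]])
sample = np.array([0.45, 0.25, 0.30])    # one ambient sample

# nnls solves min ||Ax - b|| subject to x >= 0, so the estimated mass
# contributions are forced to be physically meaningful (non-negative).
contributions, residual = nnls(profiles.T, sample)
print(contributions, residual)
```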

3.4 Illustrative example
In order to demonstrate the use of target transformation factor analysis for the resolution of sources of urban aerosols, TTFA will be applied to a compositional data set obtained from aerosol samples collected during the RAPS program in St Louis, Missouri.

3.4.1 Description of data
In the RAPS program, automated dichotomous samplers were operated over a 2-year period at 10 sites in the St Louis metropolitan area. Ambient aerosol samples were collected in fine (<2·4-μm) and coarse (2·4- to 20-μm) fractions and were analyzed at the Lawrence Berkeley Laboratory for total mass by beta-gauge measurements and for 27 elements by X-ray fluorescence. The data for the samples collected during July and August 1976 from station 112 were selected for the TTFA process.

Station 112 was located near Francis Field, the football stadium on the campus of Washington University, west of downtown St Louis. During the 62 days of July and August, filters were changed at 12-h intervals, producing a total of 124 samples in each of the fine and coarse fractions. Data were missing for 24 pairs of samples, leaving a total of 100 pairs of coarse and fine fraction samples. Of the 27 elements determined for each sample, a majority of the determinations of 10 elements had values below the detection limits. Since a complete and accurate data set is required to perform a factor analysis, these 10 elements were eliminated from the analysis. For example, arsenic was excluded because almost all of the values were below the detection limits. Arsenic determinations by X-ray fluorescence are often unreliable because of an interference between the arsenic K X-ray and the lead L X-ray. A neutron activation analysis of these samples would produce better arsenic determinations. Reliable data for arsenic may be important to the differentiation of coal flyash and crustal material, two materials with very similar source profiles. The low percentage of measured elements can lead to distortions in the scaling factors produced by the multiple regression analysis. The remaining mass consists primarily of hydrogen, oxygen, nitrogen and carbon. Although no measurements of carbon are included in the RAPS data, that portion of the sample mass must still be accounted for by the resolved sources. In order to produce the best possible source resolutions, it is vital to have accurate measurements of the mass of total suspended particles (TSP) as well as determinations for as many elements as possible.
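
Screening a data set for completeness in this way is easily automated. The sketch below is an assumption-laden illustration (the column names and the coding of below-detection values as NaN are invented), not the procedure actually applied to the RAPS data.

```python
import numpy as np
import pandas as pd

def drop_sparse_elements(conc: pd.DataFrame,
                         max_missing_frac: float = 0.5) -> pd.DataFrame:
    """Drop element columns where a majority of the determinations fell
    below the detection limit (coded here as NaN)."""
    keep = conc.columns[conc.isna().mean() <= max_missing_frac]
    return conc[keep]

conc = pd.DataFrame({"Pb": [730.0, 690.0, 755.0],
                     "As": [np.nan, np.nan, 2.1]})   # mostly below detection
print(drop_sparse_elements(conc).columns.tolist())    # ['Pb']
```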

The fine and the coarse samples were analyzed separately and only the fine fraction results will be reported here. In this target transformation analysis, a set of potential source profiles was assembled from the literature to use as initial test vectors. In addition, the set of unique vectors was also tested.

3.4.2 Results
The eigenvector analysis provided the results presented in Table 8. Examination of the eigenvectors suggests the presence of four major sources, possibly two weak sources, and noise. To begin the analysis, a four-vector solution was obtained. The iteratively refined source profiles are given in Table 9. The first three vectors can be easily identified as motor vehicles (Pb, Br), regional sulfate, and soil/flyash (Si, Al) based on their apparent elemental composition.

TABLE 8
Results of eigenvector analysis of July and August 1976 fine fraction data at site 112 in St Louis, Missouri

Factor   Eigenvalue   Chi square   Exner    % Error
1        90·0         210          0·324    204
2        5·0          156          0·214    164
3        1·7          65           0·141    129
4        1·3          63           0·064    93
5        0·16         55           0·047    72
6        0·09         26           0·034    68
7        0·03         24           0·027    67
8        0·02         24           0·021    58
9        0·02         15           0·016    49

TABLE 9
Refined source profiles for the four-source solution at RAPS Site 112, July-August 1976

Element   Motor vehicle   Flyash/sulfate   Soil    Fireworks
Al        3·0             0·9              62·0    60·0
Si        0·0             2·8              140·0   0·0
S         0·0             232·0            14·0    26·0
Cl        5·2             1·6              0·31    19·0
K         0·0             0·06             43·0    580·0
Ca        12·0            0·006            17·0    0·27
Ti        2·8             1·8              2·3     0·0
Mn        1·5             0·1              0·8     3·6
Fe        5·8             3·8              38·0    9·0
Ni        0·2             0·06             0·05    0·3
Cu        1·9             0·2              0·03    4·6
Zn        9·8             1·4              0·0     24·0
Se        0·1             0·1              0·0     0·01
Br        26·0            0·0              2·7     2·0
Sr        0·0             0·0              0·9     12·0
Ba        1·45            0·3              0·8     15·0
Pb        105·0           8·0              3·8     0·0

However, the fourth vector showed high K, Zn, Ba and Sr and was not initially obvious as to its origin. The resulting mass loadings were then calculated and the only significant values were for the sampling periods of noon to midnight on 4 July and midnight to noon on 5 July. This was 4 July 1976 and there was a bicentennial fireworks display at this location. Thus, these two highly influenced samples change the whole analysis.

TABLE 10
Comparison of data with and without samples from 4 and 5 July (ng/m³), RAPS Station 112, July and August 1976, fine fraction

Element   With 4 and 5 July mean   Without 4 and 5 July mean
Al        220 ± 30                 200 ± 30
Si        440 ± 60                 450 ± 60
S         4370 ± 310               4360 ± 320
Cl        90 ± 10                  80 ± 9
K         320 ± 130                150 ± 9
Ca        110 ± 10                 110 ± 10
Ti        63 ± 13                  64 ± 13
Mn        17 ± 3                   17 ± 3
Fe        220 ± 20                 220 ± 20
Ni        2·3 ± 0·2                2·3 ± 0·2
Cu        16 ± 3                   15 ± 3
Zn        78 ± 8                   75 ± 8
Se        2·7 ± 0·2                2·7 ± 0·2
Br        140 ± 9                  130 ± 8
Sr        5 ± 4                    1·1 ± 0·1
Ba        19 ± 5                   15 ± 4
Pb        730 ± 50                 720 ± 50

To illustrate this further, Table 10 gives the average values of the elemental composition of the fine fraction samples with and without the 4 and 5 July samples included. It can be seen that these two samples from 4 and 5 July from the 100-sample set have changed the average value of K by a factor of 2 and the average Sr by a factor of 5. Thus, TTFA can find strong, unusual events in a large complex data set. After dropping the samples from 4 and 5 July, the analysis was repeated and the results are presented in Table 11. Now there are three strong factors, two weaker ones, and a more continuous falloff into noise. Thus, a five-factor solution was sought. These results are presented in Table 12.

TABLE 11
Results of eigenvector analysis of July and August 1976 fine fraction data at Site 112 in St Louis, Missouri, excluding 4 and 5 July data

Factor   Eigenvalue   Chi square   Exner    Average % error
1        87·0         210          0·304    197
2        4·9          152          0·304    197
3        2·0          57           0·070    123
4        0·2          42           0·050    98
5        0·1          26           0·037    73
6        0·1          25           0·029    69
7        0·02         26           0·023    69
8        0·02         17           0·019    67
9        0·01         16           0·015    53

TABLE 12
Refined source profiles (mg/g), RAPS Station 112, July and August 1976, fine fraction without 4 and 5 July data

Element   Motor vehicle   Sulfate   Soil/flyash   Paint    Refuse
Al        5·0             1·1       53·0          0·0      0·0
Si        0·0             1·9       130·0         0·0      7·0
S         0·2             240·0     19·0          6·0      0·0
Cl        2·4             1·1       0·0           4·6      22·0
K         1·4             1·6       15·0          5·7      48·0
Ca        11·0            0·0       16·0          34·0     1·2
Ti        0·0             0·7       2·5           110·0    0·0
Mn        0·0             0·0       0·7           4·8      8·6
Fe        0·0             1·1       36·0          90·0     36·0
Ni        0·08            0·04      0·042         0·011    0·7
Cu        0·6             0·01      0·0           0·0      8·7
Zn        0·8             0·0       0·0           3·7      65·0
Se        0·1             0·1       0·001         0·2      0·2
Br        30·0            0·3       2·5           0·0      0·05
Sr        0·09            0·01      0·15          0·1      0·001
Ba        0·7             0·035     0·07          28·0     0·5
Pb        107·0           6·5       5·0           0·0      46·0

The target transformation analysis for the fine fraction without 4 and 5 July data indicated the presence of a motor vehicle source, a sulfate source, a soil or flyash source, a paint-pigment source and a refuse source. The presence of the sulfate, paint-pigment and refuse factors was determined by the uniqueness test for the elements sulfur, titanium and zinc, respectively. In the paint-pigment factor, titanium was found to be associated with the elements sulfur, calcium, iron and barium. This plant used iron titanate as its input material, and the profile obtained in this analysis appears to be realistic. The zinc factor, associated with the elements chlorine, potassium, iron and lead, is attributed to refuse-incinerator emissions, since a high chlorine concentration is usually associated with particles from refuse incinerators;48,49 the factor might also represent particles from zinc and/or lead smelters.

The results of this analysis provide quite reasonable fits to the elemental concentrations and to the fine mass concentrations for this system. Thus, the TTFA provided a resolution of source types and concentrations that appears plausible, although specific sources are not identified and quantitatively apportioned. From other studies with other data sets, it appears TTFA is typically able to identify 5 to 7 source types as long as they are reasonably distinct from one another.

4 SUMMARY

The purpose of this chapter has been to introduce a number of ways in which eigenvector analysis can be used to reduce multivariate environmental data to manageable and interpretable proportions. These methods have been made much more accessible with the quite sophisticated statistical packages that are now available for microcomputers. It is now quite easy to perform analyses that previously required mainframe computer capabilities. The key to utilizing eigenvector analysis is the recognition that the structure that it finds in the data may arise from many different sources, both real causal processes in the system being studied and errors in the sampling, analysis, or data transcription. There is often reluctance on the part of many investigators to use such 'complex' data analysis tools. However, they have often been surprised at how much useful information can be gleaned from large data sets using eigenvector methods. The key ingredients are learning enough about the methods to know what assumptions are being made, and applying some healthy skepticism with regard to the interpretation of the results, to be certain that they make sense within the context of the problem under investigation. The data are generally trying to convey information. The critical problem is to properly interpret the message, even if it is that these data are wrong. It is hoped that this chapter has provided some assistance in understanding eigenvector methods and their application to environmental data analysis.


ACKNOWLEDGEMENTS

Many of the studies reported here have been performed by students or post-doctoral associates in the author's group and their substantial contributions to the results presented here must be acknowledged. The work could not have been conducted without the support of the US Department of Energy under contract DE-AC02-80EV10403, the US Environmental Protection Agency under Grant No. R808229 and Cooperative Agreement R806236, and the National Science Foundation under Grants ATM 85-20533, ATM 88-10767 and ATM 89-96203.

REFERENCES

1. Harman, H.H., Modern Factor Analysis, 3rd edn. University of Chicago Press, Chicago, 1976.

2. Malinowski, E.R. & Howery, D.G., Factor Analysis in Chemistry. J. Wiley & Sons, New York, 1980.

3. Rozett, R.W. & Petersen, E.M., Methods of factor analysis of mass spectra. Anal. Chem., 47 (1975) 1301-08.

4. Duewer, D.L., Kowalski, B.R. & Fasching, J.L., Improving the reliability of factor analysis of chemical data by utilizing the measured analytical uncertainty. Anal. Chem., 48 (1976) 2002-10.

5. Horst, P., Matrix Algebra for Social Scientists. Holt, Rinehart and Winston, New York, 1963.

6. Joreskog, K.G., Klovan, J.E. & Reyment, R.A., Geological Factor Analysis. Elsevier Scientific Publishing Company, Amsterdam, 1976.

7. Lawson, C.L. & Hanson, R.J., Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs, NJ, 1974.

8. Zhou, D., Chang, T. & Davis, J.C., Dual extraction of R-mode and Q-mode factor solutions. J. Int. Assoc. Math. Geol., 15 (1983) 581-606.

9. Hwang, C.S., Severin, K.G. & Hopke, P.K., Comparison of R- and Q-mode factor analysis for aerosol mass apportionment. Atmospheric Environ., 18 (1984) 345-52.

10. Belsley, D.A., Kuh, E. & Welsch, R.E., Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York, 1980.

11. Hopke, P.K., Lamb, R.E. & Natusch, D.F., Multielemental characterization of urban roadway dust. Environ. Sci. Technol., 14 (1980) 164-72.

12. Cattell, R.B., Handbook of Multivariate Experimental Psychology. Rand McNally, Chicago, 1966, pp. 174-243.

13. Guttman, L., Some necessary conditions for common factor analysis. Psychometrika, 19 (1954) 149-61.

14. Kaiser, H.F. & Hunka, S., Some empirical results with Guttman's stronger lower bound for the number of common factors. Educational and Psychological Measurement, 33 (1973) 99-102.

15. Hopke, P.K., Comments on 'Trace element concentrations in summer aerosols at rural sites in New York State and their possible sources' by P. Parekh and L. Husain and 'Seasonal variations in the composition of ambient sulfur-containing aerosols' by R. Tanner and B. Leaderer. Atmospheric Environ., 16 (1982) 1279-80.

16. Exner, O., Additive physical properties. Collection Czech. Chem. Commun., 31 (1966) 3222-53.

17. Malinowski, E.R., Determination of the number of factors and the experimental error in a data matrix. Anal. Chem., 49 (1977) 612-17.

17a. Malinowski, E.R. & McCue, M., Qualitative and quantitative determination of suspected components in mixtures by target transformation factor analysis of their mass spectra. Anal. Chem., 49 (1977) 284-7.

18. Hopke, P.K., Target transformation factor analysis as an aerosol mass apportionment method: a review and sensitivity analysis. Atmospheric Environ., 22 (1988a) 1777-92.

18a. Hopke, P.K., FANTASIA, a program for target transformation factor analysis; for program availability, contact P.K. Hopke (1988b).

19. Hopke, P.K., Receptor Modeling in Environmental Chemistry. J. Wiley & Sons, New York, 1985.

20. Roscoe, B.A., Hopke, P.K., Dattner, S.L. & Jenks, J.M., The use of principal components factor analysis to interpret particulate compositional data sets. J. Air Pollut. Control Assoc., 32 (1982) 637-42.

21. Bowman, H.R., Asaro, F. & Perlman, I., On the uniformity of composition in obsidians and evidence for magmatic mixing. J. Geology, 81 (1973) 312-27.

22. Lis, S.A. & Hopke, P.K., Anomalous arsenic concentrations in Chautauqua Lake. Env. Letters, 5 (1973) 45-55.

23. Ruppert, D.F., Hopke, P.K., Clute, P.R., Metzger, W.J. & Crowley, D.J., Arsenic concentrations and distribution of Chautauqua Lake sediments. J. Radioanal. Chem., 23 (1974) 159-69.

24. Clute, P.R., Chautauqua Lake sediments. MS Thesis, State University College at Fredonia, NY, 1973.

25. Hopke, P.K., Ruppert, D.F., Clute, P.R., Metzger, W.J. & Crowley, D.J., Geochemical profile of Chautauqua Lake sediments. J. Radioanal. Chem., 29 (1976) 39-59.

25a. Hopke, P.K., Gladney, E.S., Gordon, G.E., Zoller, W.H. & Jones, A.G., The use of multivariate analysis to identify sources of selected elements in the Boston urban aerosol. Atmospheric Environ., 10 (1976) 1015-25.

26. Folk, R.L., A review of grain-size parameters. Sedimentology, 6 (1964) 73-93.

27. Tesmer, I.H., Geology of Chautauqua County, New York, Part I: Stratigraphy and Paleontology (Upper Devonian). New York State Museum and Science Service Bull. no. 391: Albany, University of the State of New York, State Education Department, 1963, p. 65.

28. Muller, E.H., Geology of Chautauqua County, New York, Part II: Pleistocene Geology. New York State Museum and Science Service Bull. no. 392: Albany, University of the State of New York, State Education Department, 1963, p. 60.

29. Prinz, B. & Stratmann, H., The possible use of factor analysis in investigating air quality. Staub-Reinhalt. Luft, 28 (1968) 33-9.


30. Blifford, I.H. & Meeker, G.O., A factor analysis model of large scale pollution. Atmospheric Environ., 1 (1967) 147-57.

31. Colucci, J.M. & Begeman, C.R., The automotive contribution to air-borne polynuclear aromatic hydrocarbons in Detroit. J. Air Pollut. Control Assoc., 15 (1965) 113-22.

32. Gaarenstrom, P.D., Perone, S.P. & Moyers, J.P., Application of pattern recognition and factor analysis for characterization of atmospheric particulate composition in southwest desert atmosphere. Environ. Sci. Technol., 11 (1977) 795-800.

33. Thurston, G.D. & Spengler, J.D., A quantitative assessment of source contributions to inhalable particulate matter pollution in metropolitan Boston. Atmospheric Environ., 19 (1985) 9-26.

34. Parekh, P.P. & Husain, L., Trace element concentrations in summer aerosols at rural sites in New York State and their possible sources. Atmospheric Environ., 15 (1981) 1717-25.

35. Gatz, D.F., Identification of aerosol sources in the St Louis area using factor analysis. J. Appl. Met., 17 (1978) 600-08.

36. Changnon, S.A., Huff, F.A., Schickedanz, P.T. & Vogel, J.L., Summary of METROMEX, Vol. 1: Weather Anomalies and Impacts. Illinois State Water Survey Bulletin 62, Urbana, IL, 1977.

37. Ackerman, B., Changnon, S.A., Dzurisin, G., Gatz, D.F., Grosh, R.C., Hilberg, S.D., Huff, F.A., Mansell, J.W., Ochs, H.T., Peden, M.E., Schickedanz, P.T., Semonin, R.G. & Vogel, J.L., Summary of METROMEX, Vol. 2: Causes of Precipitation Anomalies. Illinois State Water Survey Bulletin 63, Urbana, IL, 1978.

38. Hopke, P.K., Wlaschin, W., Landsberger, S., Sweet, C. & Vermette, S.J., The source apportionment of PM10 in South Chicago. In PM-10: Implementation of Standards, ed. C.V. Mathai & D.H. Stonefield. Air Pollution Control Association, Pittsburgh, PA, 1988, pp. 484-94.

39. Alpert, D.J. & Hopke, P.K., A quantitative determination of sources in the Boston urban aerosol. Atmospheric Environ., 14 (1980) 1137-46.

39a. Alpert, D.J. & Hopke, P.K., A determination of the sources of airborne particles collected during the regional air pollution study. Atmospheric Environ., 15 (1981) 675-87.

40. Hopke, P.K., Alpert, D.J. & Roscoe, B.A., FANTASIA - a program for target transformation factor analysis to apportion sources in environmental samples. Computers & Chemistry, 7 (1983) 149-55.

41. Roscoe, B.A. & Hopke, P.K., Comparison of weighted and unweighted target transformation rotations in factor analysis. Computers & Chem., 5 (1981) 5-7.

42. Severin, K.G., Roscoe, B.A. & Hopke, P.K., The use of factor analysis in source determination of particulate emissions. Particulate Science and Technology, 1 (1983) 183-92.

43. Dzubay, T.G., Stevens, R.K. & Richards, L.W., Composition of aerosols over Los Angeles freeways. Atmospheric Environ., 13 (1979) 653-9.

44. Harrison, R.M. & Sturges, W.T., The measurement and interpretation of Br/Pb ratios in airborne particles. Atmospheric Environ., 17 (1983) 311-28.

45. Chang, S.N., Hopke, P.K., Gordon, G.E. & Rheingrover, S.G., Target transformation factor analysis of airborne particulate samples selected by wind-trajectory analysis. Aerosol Sci. Technol., 8 (1988) 63-80.

46. Rheingrover, S.G. & Gordon, G.E., Wind-trajectory method for determining compositions of particles from major air pollution sources. Aerosol Sci. Technol., 8 (1988) 29-61.

47. Henry, R.C. & Kim, B.M., Extension of self-modeling curve resolution to mixtures of more than three components. Part 1. Finding the basic feasible region. Chemometrics and Intelligent Laboratory Systems, 8 (1990) 205-16.

48. Greenberg, R.R., Zoller, W.H. & Gordon, G.E., Composition and size distribution of particles released in refuse incineration. Environ. Sci. Technol., 12 (1978) 566-73.

49. Greenberg, R.R., Gordon, G.E., Zoller, W.H., Jacko, R.B., Neuendorf, D.W. & Yost, K.J., Composition of particles emitted from the Nicosia municipal incinerator. Environ. Sci. Technol., 12 (1978) 1329-32.

50. Heidam, N. & Kronborg, D., A comparison of R- and Q-modes in target transformation factor analysis for resolving environmental data. Atmospheric Environ., 19 (1985) 1549-53.

51. Mosteller, F., The jackknife. Rev. Int. Stat. Inst., 39 (1971) 363-8.


Chapter 5

Errors and Detection Limits

M.J. ADAMS

School of Applied Sciences, Wolverhampton Polytechnic, Wulfruna Street, Wolverhampton, WV1 1SB, UK

1 TYPES OF ERROR

It is important to appreciate that in all practical sciences any measurement we make will be subject to some degree of error, no matter how much care is taken. As this measurement error can influence the subsequent conclusions drawn from the analysis of the data, it is essential to identify, reduce and minimise the effects of the errors wherever possible. This procedure requires that we have a thorough knowledge of the measurement process and implies a knowledge of the cause and source of errors in our measurements. For their treatment and analysis, experimental errors are assigned to one of three types: gross, random or systematic errors.

1.1 Gross errors
Gross errors are, hopefully, rare in a well organised laboratory. Their occurrence does not fit into any common pattern, and the analytical data following a gross error should be rejected as providing little or no analytical information. For example, if a powder sample is spilled and contaminated prior to examination or a solution is incorrectly diluted, then any subsequent analysis is likely to provide erroneous results and it would be wrong to infer any conclusions or base any meaningful decision on the resultant analytical data. It is worth noting here that gross errors can easily arise outside of the laboratory. If a soil or plant sample is taken from the wrong site, then no matter how well the laboratory analysis is performed, the results constitute a gross error.

The presence of previously unidentified gross errors in any data set can often be inferred from the presence of outliers, which are most easily identified by graphing or producing some pictorial representation of the data. It is important to realise that gross errors can occur in the best planned laboratories and a watch should always be kept for their presence.

1.2 Random errors
The most common and easily analysed errors are random; their occurrence is irregular and individual errors are not predictable. Random errors are present in all measurement operations and they provide the observed variability in any repeated determination. Random errors in an analysis can be, and always should be, subject to quantitative statistical analysis to provide a measure of their magnitude. This chapter is largely concerned with the analysis and testing of random errors in analytical measurements.

1.3 Systematic errors
By definition, systematic errors are not random; the presence of a fixed systematic error affects equally each determination in a series of measurements. The source of a systematic error may be human, for example always reading from the top of a burette meniscus during a titration, or mechanical or instrumental. A faulty balance may result in all analyses in that laboratory producing results that are too high, a positive bias due to the fixed systematic error. Whilst the presence of systematic errors does not give rise to increased variability within a set or batch of data, it can increase variability between batches. Thus, if a student repeatedly weighs the same sample on several balances, then the results using a particular balance will indicate the random error for that student and the differences between balances may indicate systematic errors in the balances. Systematic errors are frequently encountered in modern instrumental methods of analysis and their effects are minimised by the analysis of standards and subsequent correction of the results by calibration. These procedures are discussed in more detail below. The nature and effects of these types of errors on a series of analytical results may be best appreciated by an illustrative example.

Example. In the determination of lead in a standard sample of dried stream sediment, four independent analysts provided the series of results shown in Table 1, where each result is expressed in mg of Pb per kg sample.

TABLE 1
The concentration of lead in stream sediment samples, mg Pb kg⁻¹, as determined by four analysts

                                 Analyst
                     A        B        C        D
Results (mg kg⁻¹)    49·2     48·6     55·2     57·1
                     49·4     47·4     54·8     58·6
                     51·2     51·4     54·6     56·4
                     50·5     49·7     55·8     57·8
                     51·5     51·8     56·1     55·3
                     48·7     52·8     54·8     53·9
                     49·6     47·8     55·3     57·9
                     50·2     47·4     55·9     55·4
                     49·1     49·2     56·1     53·5
                     51·6     51·6     55·2     58·2
Sum (mg Pb)          501·0    497·7    553·8    564·1
Mean (mg Pb)         50·1     49·77    55·38    56·41
Variance (mg² Pb)    1·122    4·013    0·315    3·294
s (mg Pb)            1·059    2·003    0·561    1·815
CV (%)               2·114    4·025    1·014    3·218

Known value = 50·00 mg kg⁻¹ Pb.

Fig. 1. The variability in the lead concentration data, from Table 1, as determined by four analysts A, B, C and D. The correct value is illustrated by the dotted line. Analyst A is accurate and precise, analyst B is accurate but less precise, analyst C is inaccurate but of high precision, and the data from analyst D have low accuracy and poor precision.

The variability in these data can be seen clearly in the blob chart of Fig. 1. The average or mean of the 10 determinations from each analyst is shown along with the known true result as provided with the standard sample. From these data we are able to assign quantitative, statistical values to the variability in each analyst's set of results and to the difference between their mean values and the known value. Evaluation of these measures leads us naturally to the important analytical concepts of accuracy and precision. Accuracy is usually considered as a measure of the correctness of a result. Visual inspection of the data in Table 1 and Fig. 1 can provide a qualitative conclusion that the mean results from analysts A and B are similarly close to the expected true value, and they both can be described as providing accurate data. Analyst B, however, exhibits much greater variability or scatter in the results than analyst A, and B therefore is considered less precise. In the case of analysts C and D, both show considerable deviation of their average result from the known true value, so both are inaccurate, but C is more precise than D. The terms accuracy and precision are not synonymous in analytical and measurement science. The accuracy of a measurement is its closeness to some known, true value and is dependent on the presence and extent of any systematic errors. On the other hand, precision indicates the variability in a data set and is a measure of the random errors. It is important that this distinction is appreciated. No matter how many times analysts C and D repeat their measurements, because of some inherent bias in their procedure they cannot improve their accuracy. The precision might be improved by exercising greater care to reduce random errors.

Regarding precision and the occurrence of random errors, two other terms are frequently encountered in analytical science: repeatability and reproducibility. If analyst B had performed the measurements sequentially on a single occasion, then a measure of precision would reflect the repeatability of the analysis, the within-batch precision. If the tests were run over, say, 2 days, then an analysis of data from each occasion would provide the between-batch precision or reproducibility.

2 DISTRIBUTION OF ERRORS

Whilst the qualitative merits of each of our four analysts can be inferred immediately from the data, to provide a quantitative assessment a statistical analysis is necessary and should always be performed on data from any quantitative analytical procedure. This quantitative treatment requires that some assumptions are made concerning the measurement process. Any measure of a variable, mass, size, concentration, etc., is expected to approximate the true value, but it is not likely to be exactly equal to it. Similarly, repeated measurements of the same variable will provide further discrepancies between the observed results and the true value, as well as differences between each practical measure, due to the presence of random errors. As more repeated measurements are made, a pattern to the scatter of the data will emerge; some values will be too high and some too low compared with the known correct result, and, in the absence of any systematic error, they will be distributed evenly about the true value. If an infinite number of such measurements could be made, then the true distribution of the data about the known value would be known. Of course, in practice this exercise cannot be accomplished, but the presence of some such parent distribution can be hypothesised. In analytical science this distribution is assumed to approximate the well-known normal form, and our data are assumed to represent a sample from this idealised parent population. Whilst this assumption may appear to be taking a big step in describing our data, many scientists over many years of study have demonstrated that in a variety of situations repeat measurements do approximate well to the normal distribution, and the normal error distribution is accepted as being the most important for use in statistical studies of data.

2.1 The normal distribution
The shape of the normal distribution curve is illustrated in Fig. 2. Expressed as a probability function, the curve is defined by

    p = (1/(σ√(2π))) exp[−½((x − μ)/σ)²]    (1)

where (x − μ) represents the measured deviation of any value x from the population mean, μ, and σ is the standard deviation which defines the spread of the curve. The formula given in eqn (1) need not be remembered; its importance is in defining the shape of the normal distribution curve. As drawn in Fig. 2, the total area under the curve is one for all values of μ and σ by standardising the data. This standard normal transform is achieved by subtracting the mean value from each data value and dividing by the standard deviation,

    z = (x − μ)/σ    (2)

Fig. 2. The normal distribution curve, standardised to a mean value of 0 and a standard deviation of 1. The cumulative probability function values, F(z), are shown for standardised values z of 0, 1 and 2 standard deviations from the mean and are available from statistical tables.

This operation produces a distribution with a mean of zero and unit standard deviation. As the function described by eqn (1) expresses the idealised case of the assumed parent population, in practice the sample mean, x̄, and standard deviation, s, are substituted and employed to characterise our analytical sample, the subset of measured data. These sample measures and the related quantity, the variance, are given by the mean,

    x̄ = (Σ xᵢ)/n    (3)

the variance,

    V = Σ(xᵢ − x̄)²/(n − 1)    (4)

and the standard deviation,

    s = √V    (5)

where n is the number of sample data values, i.e. the number of repeated measures.

The application of these formulae to the data from the four analysts is shown in Table 1 and confirms the qualitative conclusions from the visual inspection of the data.
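
These quantities are easy to verify numerically. The short Python sketch below (not part of the original text) reproduces analyst A's column of Table 1, using the (n − 1) divisor of eqn (4).

```python
from statistics import mean, stdev, variance

analyst_a = [49.2, 49.4, 51.2, 50.5, 51.5, 48.7, 49.6, 50.2, 49.1, 51.6]

m = mean(analyst_a)        # 50.1
v = variance(analyst_a)    # sample variance with (n - 1) divisor: 1.122
s = stdev(analyst_a)       # 1.059
cv = 100 * s / m           # coefficient of variation: 2.114 %

print(f"mean = {m:.3f}, variance = {v:.3f}, s = {s:.3f}, CV = {cv:.3f}%")
```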

A point to note is that the divisor in the calculation of the variance, and, hence, the standard deviation, is (n − 1) and not n. This arises because V and s are estimates for the parent distribution, and the use of n would serve to underestimate the deviation in the data set.

The standard deviation and variance are directly related (see eqns (4) and (5)) and in practice the standard deviation of a set of measures is usually reported, as this is in the same units as the original results. Another frequently reported measure is the relative standard deviation or coefficient of variation, CV, which is defined by

    CV = (s/x̄) × 100    (6)

CV is thus a percentage spread of the data relative to the mean value.

Armed with the assumption that the analytical data are taken from a parent normal distribution as described by eqn (1), the properties of the normal function can be used to infer further information from the data. As shown in Fig. 2, the normal curve is symmetrical about the mean and its shape depends on the standard deviation. The larger σ, or s, is, the greater the spread of the distribution. For all practical purposes there is little interest in the height of the normal curve; we are more concerned with the area under sections of the curve and the cumulative distribution function. Whatever the actual values of the mean and standard deviation describing the curve, approximately 68% of the data will lie within one standard deviation of the mean, 95% will be within two standard deviations, and less than 1 observation in 300 will lie more than three standard deviations from the mean. These values are obtained from the cumulative distribution function, which is derived by integrating the standard normal curve, with a mean of zero and unit standard deviation. This integral is evaluated numerically and the results are usually presented in tabular form. This table is available from many statistical texts, but for our immediate purposes a few useful values are presented below:

    standardised variable, z:                  0      1      2       3
    cumulative probability function, F(z):     0·5    0·84   0·977   0·9987

and are shown in Fig. 2.

Returning to the results submitted by analyst A (Table 1), we can now calculate the proportion of the results produced by A which would be expected to be above some selected value, say 52 mg Pb. To use the cumulative standard normal tables, the analyst's data must first be standardised. From eqn (2) this is achieved by z = (x − x̄)/s which, as required, moves the mean of the data to zero and provides unit standard deviation. For our example,

    z = (52·0 − 50·1)/1·059 = 1·8

the value of z is 1·8 standard deviations above the mean and, from the cumulative distribution function, about 3·6% of determinations can be expected to exceed this value.
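
The same calculation can be checked with scipy's normal survival function in place of printed tables (a sketch, not from the original text):

```python
from scipy.stats import norm

x_bar, s = 50.1, 1.059            # analyst A's mean and s from Table 1
z = (52.0 - x_bar) / s            # about 1.8 standard deviations
print(f"z = {z:.2f}, P(result > 52 mg/kg) = {norm.sf(z):.3f}")  # ~0.036
```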

2.2 The central limit theorem
Providing there is no systematic error or bias in our measurements, the mean result of a set of values provides us with an estimate of the true value. It is unlikely, however, that the determined sample mean will be exactly the same as the true value, and the mean should be recorded and presented along with some measure of confidence limits, some indication of our belief that the mean is close to the correct value. The range of values denoted by such limits will depend on the precision of the determinations and the number of the measurements. It is intuitive that the more measurements that are made, the more confidence we will have in the recorded mean value. In our example above, analyst A may decide to perform 30 tests to provide a more reliable estimate and indication of the true value. It is unlikely in practice that 30 repeated measurements would be performed in a single batch experiment; a more common arrangement might be to conduct the tests on 6 batches of 5 samples each. The mean values determined from each of the 6 sets of data can be considered as being derived from some parent population distribution, and it is a well established and important property of this distribution of means that it tends to the normal distribution as the number of determinations increases, even if the original data population is not normally distributed. This is the central limit theorem and is important as most statistical tests are undertaken using the means of repeated experiments and assume normality. The mean of the sample means distribution will be the same as the mean value of the original data, of course, but the standard deviation of the means is smaller and is given by σ/√n, which is often termed the standard error of the sample mean.

The central limit theorem can be applied and used to define confidence limits to accompany a declared analytical result. Given that the parent distribution is normal, then from the cumulative normal distribution function, 95% of the standard normal distribution lies between ±1·96 standard deviations of the mean and, therefore, the 95% confidence limits can be defined as x̄ ± 1·96σ/√n. If the mean concentration of lead in 30 samples is determined as being 50·1 mg kg⁻¹ and the standard deviation is 1·9 mg kg⁻¹, then the 95% confidence interval for the analysis is given by 50·1 ± 1·96σ/√n = 50·1 ± 0·68. Because σ is unknown, being derived from the idealised, infinite parent distribution, the estimate of standard deviation, s, must be used, and then the 95% confidence interval is given by x̄ ± (ts/√n), where the value of the factor t tends to 1·96 as n increases. The smaller the sample size, the larger is t. Above n = 25 an approximate value of t = 2·0 is often used. For 99% confidence limits, a t value of 3·0 can be used. Other values of t for any confidence interval can be obtained from standard statistical tables; they are derived from the t-distribution, a symmetric distribution with zero mean, the shape of which is dependent on the degrees of freedom of the sample data. In the above example, where s is calculated from a sample size of n measures, there are (n − 1) degrees of freedom.
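
The interval is equally direct to compute. This sketch takes the t factor from scipy rather than from tables, so the interval is slightly wider than the large-sample 1·96 value used above:

```python
import math
from scipy.stats import t

n, x_bar, s = 30, 50.1, 1.9                    # values from the example
half_width = t.ppf(0.975, df=n - 1) * s / math.sqrt(n)
print(f"95% CI: {x_bar:.1f} +/- {half_width:.2f} mg/kg")   # ~ +/- 0.71
```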

2.3 Propagation of errors
Most analytical procedures are not simple single measures but are comprised of a number of discrete steps, e.g. dissolution and dilution. If any analytical process involves measuring and combining the results from several actions, then the cumulative effects of the individual errors associated with each stage in the procedure must be considered, quantified and combined to give the total experimental error.

2.3.1 Summed combinations
If the final determined analytical result, y, is the sum of several independent variables, xᵢ, such that

    y = a₁x₁ + a₂x₂ + … + aₙxₙ    (7)

where the aᵢ represent some constant coefficients, then it is a property of the variance that

    s_y² = a₁²s₁² + a₂²s₂² + … + aₙ²sₙ²    (8)

The fact that the linear combination of errors can be determined by summing the individual variances associated with each stage of the analysis provides for a relatively simple calculation of the total experimental error.
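
A two-term instance of eqn (8), with invented coefficients and standard deviations, shows that it is the variances, not the deviations, that add:

```python
import math

a = [1.0, -1.0]      # e.g. a result formed as a difference of two masses
s_x = [0.02, 0.03]   # standard deviation of each individual measurement

s_y = math.sqrt(sum((ai * si) ** 2 for ai, si in zip(a, s_x)))
print(f"s_y = {s_y:.4f}")   # 0.0361, not 0.05
```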

2.3.2 Multiplicative combination of errors
Combining measures of precision in a non-linear multiplication or division calculation is much more complex than in the simple linear case shown above. An approximate but more useful and simple procedure makes use of the relative standard deviation, the coefficient of variation. Given that the final result, y, is obtained from two measurements, x₁ and x₂, by y = x₁/x₂, then rather than combining the variance values, which involves complex formulae, the CV values can be used:

    CV_y² = CV_x1² + CV_x2²    (9)

    CV_y = √(CV_x1² + CV_x2²)    (10)

and the standard deviation, s_y, associated with the final result can be obtained by rearranging eqn (6) such that

    s_y = (CV_y × ȳ)/100    (11)
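
The multiplicative case of eqns (9)-(11) in the same style, again with invented numbers:

```python
import math

x1, s1 = 25.0, 0.5    # e.g. an analyte signal and its standard deviation
x2, s2 = 5.0, 0.2     # e.g. a sample mass and its standard deviation

cv1, cv2 = 100 * s1 / x1, 100 * s2 / x2      # 2% and 4%
cv_y = math.sqrt(cv1**2 + cv2**2)            # eqns (9) and (10): 4.47%
y = x1 / x2
s_y = cv_y * y / 100                         # eqn (11): 0.224
print(f"y = {y}, CV_y = {cv_y:.2f}%, s_y = {s_y:.3f}")
```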

3 SIGNIFICANCE TESTS


As stated above, the characteristics of the normal distribution are well known from theoretical considerations and countless experimental tests. If the variance of a parent normal population is known, then the probability of any sampled data element occurring within the population can be calculated from the cumulative probability distribution function. Similarly, any sample datum can be considered as not belonging to the parent population if its measured value is beyond some selected and specified distance from the population mean. Such assumptions and resultant calculations provide experimentalists with the powerful, yet simple to use, tools referred to as significance tests. A simple example will illustrate the value and basic stages involved in such an analysis.

Example. If an identified river catchment pool is extensively (infinitely) sampled and analysed for, say, aluminium over a long period of time, then an accurate and established distribution curve for the aluminium content of the water will be available. If at some subsequent stage an unspecified water sample is analysed for aluminium, a test of significance can indicate whether the unknown sample was derived from a source different from that providing our parent population of standard samples.

To answer such a question, the problem is usually expressed as, 'Do the two samples, the unknown and the standard set, have the same mean values for their aluminium concentration?' In statistics this statement is referred to as the null hypothesis and it is usually written in the form

    H₀: x̄ = μₛ

where x̄ is the mean of replicate analyses of the unknown sample and μₛ is the mean of the standard, known source population. The alternative hypothesis is that the mean values of the two sets of data are significantly different, i.e.

    H₁: x̄ ≠ μₛ

If the mean value of the unknown sample set is identified as belonging to that area of the population curve corresponding to a region of significantly low probability, then a safe conclusion is that the unknown sample did not come from our identified source and the null hypothesis is rejected. If, in our example, the mean and standard deviation of the parent population of aluminium concentration values are known to be 5·2 mg kg⁻¹ and 0·53 mg kg⁻¹ respectively, and the mean concentration of five replicate analyses of the unknown sample is 7·8 mg kg⁻¹, then the test statistic is given by

    Z = (x̄ − μₛ)/(σ/√n)    (12)

The level of significance below which we are willing to accept the wrong conclusion must be chosen and 5% (0·05 probability) is common. This implies that we are willing to risk rejecting the null hypothesis when it is in fact correct 1 time in 20. For our example,

    Z = (7·8 − 5·2)/(0·53/√5) = 10·97

We are not interested in whether the mean of the unknown sample is significantly less than or more than that of the parent population; both will lead to rejection of the null hypothesis. From the standardised normal distribution curve, therefore, we must determine the critical region containing 5% of the area of the curve, i.e. the extreme 2·5% on either side of the mean. The critical region is illustrated in Fig. 3, and from tables of cumulative probabilities for the standardised normal distribution the critical test value is ±1·96. The calculated test value, Z = 10·97, is greater than 1·96 and, therefore, the null hypothesis is rejected and we assume the sample to come from a different source than our parent, standard population.
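
The whole test takes only a few lines; this sketch follows the example's numbers, with scipy supplying the critical value instead of tables:

```python
import math
from scipy.stats import norm

mu_s, sigma = 5.2, 0.53        # parent population mean and deviation
x_bar, n = 7.8, 5              # unknown sample: mean of 5 replicates

z = (x_bar - mu_s) / (sigma / math.sqrt(n))
z_crit = norm.ppf(0.975)       # 1.96 for a two-tailed 5% test

print(f"Z = {z:.2f}, critical value = {z_crit:.2f}, "
      f"reject H0: {abs(z) > z_crit}")
```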

Fig. 3. The standardised normal distribution curve and the critical regions containing the extreme 2·5% and 5% of the curve area. Cumulative probabilities, from statistical tables, are also shown.

Note that we could not prove that the samples were from the same source, but if the value of the test statistic was less than the critical value of 1·96, then statistics would indicate that there was no reason to assume any difference.

In this example a so-called two-tailed test was performed: we are not interested in whether the new sample was significantly more or less concentrated than the standards, just different. To indicate that the unknown sample was significantly more concentrated, a one-tailed test would have been appropriate to reject the null hypothesis and, at the 5% significance level, the critical test value would have been 1·64 (see Fig. 3).

3.1 t-Test
In the above example it was assumed that all possible samples of the known water source had been analysed, a clearly impossible situation. If the true standard deviation of the parent population distribution is not known, as is usually the case, then the test statistic calculation proceeds in a similar fashion but depends on the number of samples analysed. If a large number of samples are analysed (n > 25), then the sample deviation, s, is considered a good estimate of σ and the test statistic, now denoted as t, is given by

    t = (x̄ − μ₀)/(s/√n)    (13)

When n is less than 25, s may be a poor estimate of σ and the test statistic is compared not with the normal curve but with the t-distribution curve, which is broader and more spread out. The t-distribution is symmetric and similar to the normal distribution, but its wider spread is dependent on the number of samples examined. For an infinite sample size (n = ∞) the t-distribution and the normal curve are identical. As with the normal distribution, critical values of the t-distribution are tabulated and available in standard statistical texts. A value is selected by the user according to the level of significance required and the number of degrees of freedom in the experiment. In simple statistical tests, such as discussed here, the number of degrees of freedom is generally one less than the number of samples or observations (n − 1).

TABLE 2
The concentration of copper determined from ten soil samples by AAS

Sample no.                  1    2    3    4    5    6     7    8     9    10
Copper content (mg kg⁻¹)    72   71   78   98   98   116   76   104   84   96

Sum = 893 mg kg⁻¹; mean = 89·3 mg kg⁻¹; s² = 232·46 (mg kg⁻¹)²; s = 15·246 mg kg⁻¹; t = 1·93.

Example. Copper is routinely determined in soils, following extraction with EDTA, by atomic absorption spectrometry. At levels greater than 80 mg kg⁻¹ soil, copper toxicity in crops may occur. In Table 2 the results of 10 soil samples from a site analysed for copper are presented. Is this site likely to suffer from copper toxicity?

In statistical terms we are testing the null hypothesis

    H₀: x̄ ≤ μ₀

against the alternative hypothesis

    H₁: x̄ > μ₀

The null hypothesis states that our 10 samples are from some parent population with a mean equal to or less than 80 mg kg⁻¹. The alternative hypothesis is that the parent population has a mean copper concentration greater than 80 mg kg⁻¹. The test statistic, t, is given by eqn (13), in which s approximates the standard deviation of the parent population, x̄ is the calculated mean of the analysed samples and μ₀ is assumed to be the mean of our parent population, 80 mg kg⁻¹. From Table 2 and eqn (13), t = (89·3 − 80)/(15·25/√10) = 1·93. From statistical tables, the value of t must exceed 1·83 for a significance level of 5% and nine degrees of freedom (n − 1). This is indeed the case: our mean result lies in the critical region beyond the 5% level and the null hypothesis is rejected. Thus the conclusion is that the soil samples do arise from a site the copper content of which is greater than the level of 80 mg kg⁻¹, and copper toxicity may be expected.
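
scipy packages this one-sample, one-tailed test directly; a sketch using the Table 2 data, where the alternative='greater' option matches H₁:

```python
from scipy.stats import ttest_1samp

copper = [72, 71, 78, 98, 98, 116, 76, 104, 84, 96]   # Table 2, mg/kg
result = ttest_1samp(copper, popmean=80, alternative='greater')

# t = 1.93, one-tailed p ~ 0.04 < 0.05, so H0 is rejected as in the text.
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```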

3.1.1 Comparing means
A common application of the t-test in analytical science is comparing the means of two sets of sample data. We may wish to compare two samples for similarity, or to check that two analytical methods or analysts provide similar results. In applying the t-test it is assumed that both sets of observations being compared are normally distributed, that the parent populations have the same variance, and that the measurements are independent. Our null hypothesis in such cases can then be expressed as

    H₀: μ₁ = μ₂

against the alternative

    H₁: μ₁ ≠ μ₂

In comparing two sets of data, it is evident that the greater the difference between their mean values, x̄₁ and x̄₂, the less likely it is that the two samples or results are the same, i.e. from the same parent population. The test statistic is therefore obtained by dividing the difference (x̄₁ − x̄₂) by the standard error. Since the variance associated with the mean x̄₁ is given by

    variance(x̄₁) = σ²/n₁    (14)

and the variance associated with the mean x̄₂ is given by

    variance(x̄₂) = σ²/n₂    (15)

and the combined variance of the sum or difference of two independent variables is equal to the sum of the variances of the two samples (see Section 2.3.1), then the variance associated with the difference of the two means is given by

    variance(x̄₁ − x̄₂) = σ²(1/n₁ + 1/n₂)    (16)

The standard error can be obtained from the square root of this variance, i.e. σ√(1/n₁ + 1/n₂). The test statistic is therefore given by

    Z = (x̄₁ − x̄₂)/(σ√(1/n₁ + 1/n₂))    (17)

If the variance of the underlying population distribution is not known, then σ is replaced by the sample standard deviation, s, and the t-test statistic is used,

    t = (x̄₁ − x̄₂)/(s√(1/n₁ + 1/n₂))    (18)

where s is the pooled estimate derived from the standard deviations of each sample set by

    s² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²]/(n₁ + n₂ − 2)    (19)

and in the special case of n₁ = n₂,

    s² = (s₁² + s₂²)/2    (20)

If the null hypothesis is true, i.e. the two means are equal, then the t-statistic can be compared with the t-distribution with (n₁ + n₂ − 2) degrees of freedom at some selected level of significance.

Example. Two samples of an expanded polyurethane foam are suspected of coming from the same source and to have been prepared using chlorofluorocarbons (CFCs). Using GC-mass spectrometry, the CFC has been identified, and the quantitative results for 10 determinations on each sample are shown in Table 3. Is there a significant difference between the two sample means?

From eqn (20), the combined estimate of the variance is

    s² = (s₁² + s₂²)/2 = 44·66
    s = 6·68

and the test statistic is, from eqn (18),

    t = (x̄₁ − x̄₂)/(s√(1/10 + 1/10)) = 1·27

A two-tailed test is appropriate as we are not concerned whether any one sample contains more or less CFC, and from statistical tables for a t-distribution with 18 degrees of freedom (n₁ + n₂ − 2), t(0·025, 18) = 2·101.

TABLE 3
The analysis of two expanded polyurethane foams, A and B, for CFC content, expressed as mg CFC m⁻³ foam

                        Sample A   Sample B
CFC content (mg m⁻³)    66         78
                        74         70
                        70         72
                        82         84
                        68         69
                        70         78
                        80         92
                        76         75
                        64         68
                        74         76
Sum (mg m⁻³)            724        762
Mean (mg m⁻³)           72·4       76·2
s² (mg m⁻³)²            34·49      54·84
s (mg m⁻³)              5·87       7·41

Our result (t = 1·27) is less than this critical value, therefore we have no evidence that the two samples came from different sources and the null hypothesis is accepted.
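
The same comparison in code, as a sketch using scipy's equal-variance two-sample test on the Table 3 data:

```python
from scipy.stats import ttest_ind

foam_a = [66, 74, 70, 82, 68, 70, 80, 76, 64, 74]
foam_b = [78, 70, 72, 84, 69, 78, 92, 75, 68, 76]

# equal_var=True matches the pooled-variance t-test of eqns (18)-(20).
result = ttest_ind(foam_a, foam_b, equal_var=True)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")  # |t| = 1.27
```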

3.1.2 Paired experiments
If two analytical methods are being compared, they may be tested on a wide range of sample types and analyte concentrations. Use of eqn (18) in such a case would be inappropriate because the differences in the analytical results between the sample types might be greater than the observed differences between the two methods. In paired experiments of this kind, therefore, the differences between the results from paired analyses are tested.

Example. Two students are required to determine iron by colorimetric analysis in a range of water samples. The results are given in Table 4. Is there a significant difference between the two students' results?

The null hypothesis is that the differences between the pairs of results come from a parent population with zero mean. Thus, from eqn (13), the test statistic to be calculated is

    t = d̄/(s_d/√n)    (21)

where d̄ and s_d are the mean and standard deviation of the n paired differences.

TABLE 4
The comparison of two students' results for the determination of iron in water samples (mg kg⁻¹)

Sample   Student A   Student B   Difference
1        2·74        2·12        0·62
2        3·52        3·72        −0·20
3        0·82        0·62        0·20
4        3·47        3·07        0·40
5        10·82       9·21        1·61
6        16·92       15·60       1·32

Sum = 3·95 mg kg⁻¹; mean = 0·658 mg kg⁻¹; s² = 0·472 (mg kg⁻¹)²; s = 0·687 mg kg⁻¹; t = 2·35.

From Table 4, t = d̄/(s_d/√6) = 2·35. Using a two-tailed test at the 10% significance level, this value exceeds the tabulated value of t(0·05, 5) = 2·01. Therefore, we have evidence that the results from the two students are significantly different at the 10% level.
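
scipy's paired test reproduces this result; a sketch using the Table 4 data:

```python
from scipy.stats import ttest_rel

student_a = [2.74, 3.52, 0.82, 3.47, 10.82, 16.92]
student_b = [2.12, 3.72, 0.62, 3.07, 9.21, 15.60]

# Tests whether the mean paired difference is zero (two-tailed).
result = ttest_rel(student_a, student_b)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")  # t = 2.35
```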

3.2 F-Test
In applying the t-test an assumption is made that the two sets of sample data have the same variance. Whether this assumption is valid or not can be evaluated using a further statistical measure, the F-test. This test, to determine the equality of variance of samples, is based on the so-called F-distribution, a distribution calculated from the ratios of all possible sample variances from a normal population. As the sample variance is poorly defined when only a small number of observations or trials are made, the F-distribution, like the t-distribution discussed above, is dependent on the sample size, i.e. the number of degrees of freedom. As with t-values, F-test values are tabulated in standard texts and in this case the critical value selected is dependent on two values of degrees of freedom, one associated with each sample variance to be compared, and the level of significance to be used.

In practice, we assume our two samples are drawn from the same parent population; the variance of each sample is calculated and the F-ratio of variances evaluated. The null hypothesis is

H₀: σ₁² = σ₂²

against

H₁: σ₁² ≠ σ₂²

and we can determine the probability that, by chance, the observed F-ratio came from two samples taken from a single population distribution.

Example. Returning to the CFC data in Table 3, an F-test can indicate if the variation in CFC concentration is similar for the two sets of samples. If we are willing to use a 5% level of significance then we accept a 1 in 20 chance of concluding that the variances are different when in fact they are the same.

The F-ratio is used to compare variances and is calculated from

F = s₁²/s₂²    (22)

Substituting the data from Table 3, F = 54·84/34·49 = 1·59. From statistical tables, the critical value for F with 9 degrees of freedom associated with each data set is F₉,₉,₀.₀₅ = 3·18. Our value is less than this, so the null hypothesis is accepted; we can conclude that the variances of the two sample sets are the same and the application of the t-test is valid.
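The F-ratio and its critical value are easily checked in code; the sketch below (not part of the original text) uses scipy's F-distribution quantile function.

```python
from scipy import stats

var_a, var_b = 34.49, 54.84                # sample variances from Table 3
F = max(var_a, var_b) / min(var_a, var_b)  # larger variance on top: ~1.59

F_crit = stats.f.ppf(0.95, dfn=9, dfd=9)   # upper 5% point F(9,9), ~3.18
print(F < F_crit)                          # True: variances compatible
```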

3.3 Analysis of variance
Many experiments involve comparing analytical results from more than two samples and the repeated application of the t-test may not be appropriate. In these cases the problem is examined by the statistical techniques called the analysis of variance.

Suppose the phosphate content of a soil is to be determined to assess its fertility for crop growth. If, say, six samples of soil are taken as being representative of the plot then we need to determine if the phosphate concentration is similar in each. Phosphate is determined colorimetrically using acidified ammonium molybdate and ascorbic acid.

It is important in such analyses that the order of analysis in the laboratory be randomised to reduce any systematic error. This can be achieved by assigning to each analytical sample a sequential number taken from a random number table. Randomising the samples in this way mixes up the various sources of experimental error over all the replicates and the errors are said to be confounded. The equivalency of the six soil samples for phosphate is determined using a one-way analysis of variance. The null hypothesis is

H₀: μ₁ = μ₂ = … = μ₆

TABLE 5
The concentration of phosphate, determined colorimetrically in six soil samples, using five replications on each sample. The results are expressed as mg kg⁻¹ phosphate in dry soil

Replicate no.   Soil 1   Soil 2   Soil 3   Soil 4   Soil 5   Soil 6
1               38·2     36·7     24·5     40·3     38·9     44·4
2               36·7     28·3     28·3     44·5     49·3     32·2
3               42·3     40·2     16·7     34·6     34·6     32·5
4               32·5     34·6     22·4     36·4     40·2     38·0
5               34·3     38·3     18·5     30·9     36·4     38·1

and the alternative,

H₁: at least one mean is different

The data for this exercise are presented in Table 5.

For the one-way analysis it is necessary to split the total variance into two sources, the experimental variance observed within each set of replicates and the variance between the soil samples. A common method of performing an analysis of variance is to complete an ANOVA table. This is shown in Table 6.

The total variance for all the analyses is given by SS_T,

SS_T = ΣΣ xᵢⱼ² − (ΣΣ xᵢⱼ)²/(m × n)    (23)

where the double sums run over j = 1, …, m and i = 1, …, n, and xᵢⱼ represents the ith replicate of the jth sample. The variance

TABLE 6
An ANOVA tableᵃ

Source of variation   Sum of squares   Degrees of freedom   Mean squares (variance)   F-test
Between groups        SS_B             m − 1                s_B²                      s_B²/s_W²
Within groups         SS_W             m(n − 1)             s_W²
Total variation       SS_T             m·n − 1

ᵃ Where n is the number of replicates and m is the number of different samples (soils).


between the samples is given by SS_B,

SS_B = Σⱼ [(Σᵢ xᵢⱼ)²/n] − (ΣΣ xᵢⱼ)²/(m × n)    (24)

and the within-group variance can be found by difference,

SS_W = SS_T − SS_B    (25)

The second term in eqns (23) and (24) occurs in many ANOVA calculations and is often referred to as the correction factor, CF,

CF = (ΣΣ xᵢⱼ)²/(m × n)    (26)

From the data in Table 5, CF is calculated by summing all the values, squaring the result and dividing by the number of analyses performed,

CF = 36248

Using eqn (23) we can determine the total variance, which is obtained by summing every squared value and subtracting CF,

SS_T = ΣΣ xᵢⱼ² − 36248 = 1558

and similarly for the between-group variance, by totalling the squares of the sums of each replicate set, dividing by the number of replicates and subtracting CF,

SS_B = Σⱼ [(Σᵢ xᵢⱼ)²/n] − 36248 = 1009

and finally, the within-group sum of squares by difference,

SS_W = SS_T − SS_B = 1558 − 1009 = 549

and we can complete the ANOVA table (Table 7).

At the 1% level of significance the value of the F-ratio, from tables, for 5 and 24 degrees of freedom is F₀.₀₁,₅,₂₄ = 3·90. The calculated value of 8·81 for the soils analysis exceeds this and we can confidently conclude that there is a significant difference in the means of the replicate analyses; the soil samples are not similar in their phosphate content.

TABLE 7
The completed ANOVA table

Source of variation   Sum of squares   Degrees of freedom   Mean squares (variance)   F-test
Between groups        1009             5                    201·8                     8·81
Within groups         549              24                   22·9
Total variation       1558             29

The analysis of variance is an important topic in the statistical examination of experimental data and there are many specialised texts on the subject.
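A minimal one-way ANOVA sketch for the Table 5 data (not part of the original text); scipy.stats.f_oneway performs the same partitioning of variance as Tables 6 and 7, and small differences from the hand calculation reflect the rounding of CF to 36248.

```python
from scipy import stats

soils = [
    [38.2, 36.7, 42.3, 32.5, 34.3],   # soil 1
    [36.7, 28.3, 40.2, 34.6, 38.3],   # soil 2
    [24.5, 28.3, 16.7, 22.4, 18.5],   # soil 3
    [40.3, 44.5, 34.6, 36.4, 30.9],   # soil 4
    [38.9, 49.3, 34.6, 40.2, 36.4],   # soil 5
    [44.4, 32.2, 32.5, 38.0, 38.1],   # soil 6
]

F, p = stats.f_oneway(*soils)          # F ~ 8.6-8.8, p << 0.01
print(f"F = {F:.2f}, p = {p:.5f}")     # exceeds F(0.01, 5, 24) = 3.90
```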

4 ANALYTICAL MEASURES

In recent years instrumental methods of analysis have largely superseded the traditional 'wet chemical' techniques and even the most modest analytical laboratory will contain some example of modern instrumentation, e.g. spectrophotometers, HPLC, electrochemical cells, etc. An important characteristic and figure of merit for any instrumental measuring system is its detection limit for an analyte. Using the basic statistical techniques developed in previous sections we can now consider the derivation of the detection limit for an analytical method. As the limit of detection for any measurement is dependent on the inherent noise or random error associated with the measure, the form of noise and the concept of the signal-to-noise ratio will be considered first.

4.1 Noise and signal-to-noise ratio
The electrical signals produced by instruments used in scientific measurements are carriers of encoded information about some chemical or physical quantity. Such signals consist of the desirable component related to the quantity of interest and an undesirable component, which is termed noise and which can interfere with the accurate measurement of the required signal. There are numerous sources of noise in all instruments and the interested reader is recommended to seek further details from one of the excellent texts on electronic measurements and instrumentation. Briefly, whatever its source, the noise produced by an instrument will be a combination of three distinct types: white noise, flicker noise and interference noise. White noise is of nearly equal amplitude at all frequencies and it can be considered as a mixture of signals of all frequencies with random amplitudes and phases. It is this random, white noise that will concern us in these discussions. Flicker noise, or 1/f noise, is characterised by a power spectrum which is pronounced at low frequencies and is minimised by a.c. detection and signal processing. Many instruments will also display interference noise due to pick-up, usually from the 50-Hz or 60-Hz power lines. Most instruments will operate detector systems well away from these frequencies to minimise such interference. One of the aims of instrument manufacturers is to produce instruments that can extract an analytical signal as effectively as possible. However, because noise is a fundamental characteristic of all instruments, complete freedom from noise can never be realised in practice. A figure of merit to describe the quality of a measurement is the signal-to-noise ratio, S/N, which is defined as

S/N = (average signal magnitude)/(rms noise)    (27)

The rms (root mean square) noise is defined as the square root of the average squared deviation of the signal, s, from its mean value, s̄, i.e.

rms noise = √[Σ(s − s̄)²/n]    (28)

or, if the number of measurements is small,

rms noise = √[Σ(s − s̄)²/(n − 1)]    (29)

Comparison with eqns (4) and (5) illustrates the equivalency of the rms value and the standard deviation of the signal, σ_s. The signal-to-noise ratio, therefore, can be defined as s̄/σ_s.

The S/N can be measured easily in one of two ways. One method is to repeatedly measure the analytical signal, determine the mean value and calculate the rms noise using eqn (29). A second method of estimating S/N is to record the analytical signal on a strip-chart recorder. Assuming the noise is random, white noise, then it is 99% likely that the deviations in the signal lie within ±2·5σ_s of the mean value. The rms value can thus be determined by measuring the peak-to-peak deviation of the signal from the mean and dividing by 5. With both methods it is important that the signal be monitored for sufficient time to obtain a reliable estimate of the standard deviation. Note also that it has been assumed that at low analyte concentrations the noise associated with the analyte signal is the same as that when no analyte signal is present, the blank noise σ_B, i.e. at low concentrations the measurement error is independent of the concentration.

Fig. 4. An amplified trace of an analytical signal recorded at low response (the signal amplitude is close to the background level). The mean signal response is denoted by s̄, the standard deviation of the signal by σ_s and the peak-to-peak noise is given by 5σ_s.

This assumption is continued throughout this section. It is further assumed that the mean signal from the blank measurement (no analyte) will be zero, i.e. μ_B = 0.

The ideas associated with S/N are illustrated in Fig. 4.
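Both routes to S/N described above are easy to emulate numerically; the sketch below (not part of the original text) uses a simulated noisy signal in place of real instrument readings.

```python
import numpy as np

rng = np.random.default_rng(1)
signal = 10.0 + rng.normal(0.0, 2.0, size=500)   # mean 10, sigma 2

# Method 1: repeated measurements; rms noise from eqn (29)
sn_rms = signal.mean() / signal.std(ddof=1)      # ~5

# Method 2: chart-recorder style; peak-to-peak deviation divided by 5
sn_pp = signal.mean() / ((signal.max() - signal.min()) / 5)

print(f"S/N (rms) = {sn_rms:.1f}, S/N (p-p/5) = {sn_pp:.1f}")
```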

4.2 Detection limits
The detection limit for any analytical method is a measure frequently quoted by instrument manufacturers and in the scientific literature. Unfortunately, it is not always clear by which particular method this figure has been determined. A knowledge of detection limits and their calculation is important in evaluating and comparing instrument performance, as it relates directly to the sensitivity of an analytical method and the noise associated with the instrument.

The concept of a detection limit for an analysis implies, in the first instance, that we can make a qualitative decision about the presence or absence of the analyte in a sample. In arriving at such a decision there are two errors that could arise. The first, a so-called Type I error, is made if we conclude that the analyte is present in the sample when in fact it is not; the second, a Type II error, is made if we decide the analyte to be absent when it is in fact present. To be of practical value, our definition of the detection limit must minimise both types of decision error.

Assuming the noise associated with our analysis is random, then the distribution of the electrical signal will approximate the normal form, as

Fig. 5. (a) The normal distribution with the 5% critical region marked; (b) two normally distributed signals overlapping, with the mean of one located at the 5% point of the second: the so-called decision limit; (c) two normally distributed signals overlapping at their 5% points, with their means separated by 3·29σ: the so-called detection limit; (d) two normally distributed signals with equal variance and their means separated by 10σ: the so-called determination limit. The abscissa is in units of the standard deviation of the blank.

illustrated in Fig. 1. From tables of the cumulative normal distribution function, and using a one-tailed test as discussed above, 95% of this random signal will occur below the critical value of μ_B + 1·65σ_B. This case is illustrated in Fig. 5(a). If we are willing to accept a 5% chance of committing a Type I error, a reasonable value, then any average signal detected as being greater than μ_B + 1·65σ_B can be assumed to indicate the presence of the analyte. This measure has been referred to in the literature as the decision limit and is defined as the concentration producing the signal at which we may decide whether or not the result of the analysis indicates detection, i.e.

Decision Limit = z₀.₉₅σ_B = 1·65σ_B    (30)

given that μ_B = 0 for a well-established noise measurement when σ_B is known.

If the noise or error estimate is calculated from relatively few measures then the t-distribution should be used and the definition is now given by

Decision Limit = t₀.₉₅ s_B    (31)

where s_B is an estimate of σ_B and, as before (see Section 2.2), the value of t depends on the degrees of freedom, i.e. the number of measurements; t₀.₉₅ approaches z₀.₉₅ as the number of measurements increases.

This definition of the decision limit addresses our concern with Type I errors but says nothing about the effects of Type II errors. If an analyte producing a signal equivalent to the decision limit is repeatedly analysed, then the distribution of results will appear as illustrated in Fig. 5(b). Whilst the mean value of this analyte signal, μ_s, is equivalent to the decision limit, 50% of the results are below this critical value and no analyte will be reported present in 50% of the measurements. This decision limit must therefore be modified to take account of these Type II errors so as to obtain a more practical definition of the limit of detection.

If we are willing to accept a 5% chance of committing a Type II error, the same probability as for a Type I error, then the relationship between the blank measurements and the sample reading is as indicated in Fig. 5(c). In such a case

Detection Limit = 2z₀.₉₅σ_B    (32)

or, if s_B is an estimate of σ_B,

Detection Limit = 2t₀.₉₅ s_B    (33)

Under these conditions we have a 5% chance of reporting the presence of the analyte in a blank solution and a 5% chance of missing the analyte in a true sample. Before we accept this definition of detection limit, it is worth considering the precision of measurements made at this level. Repeated measurements on an analytical sample at the detection limit will lead to the analyte being reported as below the detection limit 50% of the time. In addition, from eqn (6) the relative standard deviation, or CV, of the sample measurement is

CV = 100σ_s/μ_s = 100/3·29 = 30·4%

compared with 60% at the decision limit. Thus, while quantitative measurements can be made at these low concentrations, they do not constitute the accepted degree of precision for quantitative analysis, in which the relative error should be below 20% or, better still, 10%.

If a minimum relative standard deviation of, say, 10% is required from the analysis then a further critical value must be defined. This is sometimes referred to as the determination limit and for a 10% CV is defined as

Determination Limit = 10σ_B    (34)

This case is illustrated in Fig. 5(d).

In summary, we now have three critical values to indicate the lower limits of analysis. It is recommended that analyses with signals less than the decision limit should be reported as 'not detected' and that for all results above this value, even if below the detection limit, the result with appropriate confidence limits should be recorded.
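The three critical values follow directly from an estimate of the blank noise; the sketch below (not part of the original text, with illustrative blank readings) applies eqns (31), (33) and (34).

```python
import numpy as np
from scipy import stats

blank = np.array([0.21, -0.13, 0.05, 0.18, -0.22,
                  0.09, -0.04, 0.11, -0.17, 0.02])   # blank signals, mean ~0
s_B = blank.std(ddof=1)                              # estimate of sigma_B

# Few measurements, so use t (eqns (31) and (33)) rather than z
t95 = stats.t.ppf(0.95, df=len(blank) - 1)

decision_limit = t95 * s_B                           # eqn (31)
detection_limit = 2 * t95 * s_B                      # eqn (33)
determination_limit = 10 * s_B                       # eqn (34), 10% CV

print(decision_limit, detection_limit, determination_limit)
```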

5 CALIBRATION

Unlike the traditional volumetric and gravimetric methods of analysis, which are based on stoichiometric reactions and which can provide absolute measures of analyte in a sample, modern instrumental techniques usually provide results which are relative to some known standard sample. Thus the instrumental method must be calibrated prior to an analysis being undertaken. This procedure typically involves the preparation of a suitable range of accurately known standard samples, against which the instrumental response is monitored, followed by the recording of the results for the unknown samples; the concentration of the analyte is subsequently determined by interpolation against the standard results.

5.1 Linear calibration
In most analytical procedures a linear relationship between the instrument response and the analyte concentration is sought and a straight-line calibration graph fitted. For example, colorimetric analysis is extensively employed for the determination of a wide range of cations and anions in a variety of sample types and, from the Beer-Lambert Law, there is a linear relationship between absorbance by the sample and the analyte concentration. In Table 8 the results are presented of the absorbance of a set of standard aqueous complexed copper solutions, recorded at a wavelength of 600 nm. These data are plotted in Fig. 6.

Visual inspection of the data suggests that a straight line can be fitted and the construction of the calibration graph, or working curve, is undertaken by fitting the best straight line of the form

Absorbance, Aᵢ = a + bxᵢ    (35)

where a and b are constants denoting the A intercept and slope of the fitted


TABLE 8
Analytical data of concentration and absorbance obtained from AAS for the determination of copper in aqueous media using standard solutions

Copper concentration   Absorbance   d₁ = (x − x̄)   d₁²      d₂ = (A − Ā)   d₁d₂
(mg kg⁻¹)              (600 nm)
1·0                    0·122        −4·2            17·64    −0·167         0·7014
3·0                    0·198        −2·2             4·84    −0·091         0·2002
5·0                    0·281        −0·2             0·04    −0·008         0·0016
7·0                    0·374         1·8             3·24     0·085         0·1530
10·0                   0·470         4·8            23·04     0·181         0·8688

Mean x̄ = 5·2; Ā = 0·289. Sums: Σd₁² = 48·8; Σd₁d₂ = 1·925.

Fig. 6. The calibration curves produced by plotting the data from Table 8 using the least-squares regression technique. The upper line uses the original data and the lower line uses the same data following a correction for a blank analysis.


line respectively, and xᵢ represents the concentration of the standard solutions. In our treatment of this data set we will assume that the dependent variable, the absorbance readings, is subject to scatter but that the independent variable, the concentrations of the standard solutions, is not.

A common method of estimating the best straight line through data is the method of least squares. Using the fitted line, any concentration xᵢ of copper will be expected to produce an absorbance value of (a + bxᵢ) and this will deviate from the true measured absorbance, Aᵢ, for this solution by some error, eᵢ,

eᵢ = Aᵢ − (a + bxᵢ)    (36)

The least squares estimates of a and b are calculated so as to minimise the sum of the squares of these errors. If this sum of squares is denoted by S, then

S = Σᵢ [Aᵢ − (a + bxᵢ)]² = Σᵢ (Aᵢ − a − bxᵢ)²    (37)

S is a function of the two unknown parameters a and b and its minimum value can be evaluated by simple differentiation of eqn (37) with respect to a and b, setting both partial derivatives to zero (the condition for a minimum) and solving the two simultaneous equations,

∂S/∂a = Σ 2(−1)(Aᵢ − a − bxᵢ) = 0

∂S/∂b = Σ 2(−xᵢ)(Aᵢ − a − bxᵢ) = 0

which upon rearrangement give

na + bΣxᵢ = ΣAᵢ

aΣxᵢ + bΣxᵢ² = ΣAᵢxᵢ

and solving for a and b,

b = Σ(xᵢ − x̄)(Aᵢ − Ā) / Σ(xᵢ − x̄)²    (38)

a = Ā − bx̄    (39)

The calculated values are presented with Table 8 and for these data the least squares estimates for a and b are

b = 1·925/48·8 = 0·0395

a = 0·289 − 5·2b = 0·0839

and the best straight line is given by

A = 0·0839 + 0·0395x    (40)


which is illustrated in Fig. 6. From rearranging eqn (40), or the working curve, a value of x from an unknown sample can be obtained from its measured absorbance value.
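The working curve of eqns (38)-(40) can be verified with a few lines of Python (not part of the original text); numpy.polyfit gives the same least-squares estimates.

```python
import numpy as np

conc = np.array([1.0, 3.0, 5.0, 7.0, 10.0])            # mg/kg
absorb = np.array([0.122, 0.198, 0.281, 0.374, 0.470])

# Slope and intercept, eqns (38) and (39)
b = np.sum((conc - conc.mean()) * (absorb - absorb.mean())) \
    / np.sum((conc - conc.mean()) ** 2)                # ~0.0395
a = absorb.mean() - b * conc.mean()                    # ~0.0839

slope, intercept = np.polyfit(conc, absorb, deg=1)     # same values

x_unknown = (0.182 - a) / b                            # inverse prediction
print(a, b, x_unknown)                                 # x ~ 2.48 mg/kg
```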

In practice, the absorbance values recorded, as presented in Table 8, will be the mean results of several readings. In such cases, the fitted line connecting these mean values is referred to as the regression line of A on x. Assuming that the errors associated with measuring absorbance are normally distributed and that the standard deviation of the errors is independent of the concentration of analyte, then we can calculate confidence intervals for our estimates of the intercept and slope of the fitted line.

The residual sum of squares is defined by eqn (37) as the difference between the measured and predicted values from the fitted line, and this allows us to define a residual standard deviation, s_R, as

s_R = √[S/(n − 2)] = √[Σ(Aᵢ − a − bxᵢ)²/(n − 2)]    (41)

and, using a two-tailed test with the t-distribution (t₀.₀₂₅ with n − 2 degrees of freedom), the 95% confidence limits for the slope b are defined by

b ± t s_R/√[Σ(xᵢ − x̄)²]

and for the intercept,

a ± t s_R √[Σxᵢ²/(nΣ(xᵢ − x̄)²)]

and for the fitted line at some value x₀,

(a + bx₀) ± t s_R √[1/n + (x₀ − x̄)²/Σ(xᵢ − x̄)²]

For our example, therefore, the characteristics of the fitted line and their 95% confidence limits are (t₀.₀₂₅,₃ = 3·182)

slope, b = 0·0395 ± 6·2 × 10⁻⁶

intercept, a = 0·0839 ± 2·6 × 10⁻⁴

If an unknown solution is subsequently analysed and gives an average absorbance reading of 0·182 then, rearranging eqn (40),

x = (A − 0·0839)/0·0395 = 2·48 mg kg⁻¹


and the 95% confidence limits for the fitted value are

2·48 ± 3·182 s_R √[1/5 + (2·48 − 5·2)²/48·8] = 2·48 ± 1·80 mg kg⁻¹

TABLE 9
Analysis for potassium using the method of standard additions (all volumes in cm³)

Solution no.              1    2    3    4    5
Sample volume             20   20   20   20   20
Water volume              5    4    3    2    1
Standard K volume         0    1    2    3    4
Emission response (mV)    36   45   57   69   80

5.2 Using a blank solution

It is usual in undertaking analyses of the type discussed above to include with the standard solutions a blank solution, i.e. a sample similar to the standards but known to contain no analyte. The measured instrument response from this blank is subsequently subtracted from each of the measured standard response values. When this procedure is undertaken it may be more pertinent to use as a mathematical model to fit the best straight line an equation of the form

A = b′x    (42)

and assume that the intercept with the axis of the dependent variable is zero, i.e. the line passes through the origin. Proceeding as above, the error in the line can be described by

eᵢ = Aᵢ − b′xᵢ

and the sum of squared deviations,

S = Σ(Aᵢ − b′xᵢ)²

and following differentiation with respect to b′ and setting the derivative equal to zero to find the minimum error,

b′ = ΣAᵢxᵢ / Σxᵢ²    (43)

If in our example data from Table 8, the blank value is measured as

Fig. 7. The determination of potassium in an acetic acid soil extract solution, using the method of standard additions, by flame emission photometry. Emission intensity (mV) is plotted against mg K added; the fitted line cuts the concentration axis at −0·062 mg.

0·081 absorbance units and this is subtracted from each response, then the value of b′ is calculated as

b′ = 7·333/184 = 0·0399

and the new line is shown in Fig. 6. For our unknown solution, of corrected absorbance (0·182 − 0·081 = 0·102),

x = A/b′ = 0·102/0·0399 = 2·56 mg kg⁻¹

The results are similar and the use of the blank solution simplifies the calculations and removes the effect of bias or systematic error due to the background level from the sample and the instrument.
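A minimal sketch of the blank-corrected, through-origin fit of eqn (43) on the same data (not part of the original text):

```python
import numpy as np

conc = np.array([1.0, 3.0, 5.0, 7.0, 10.0])
absorb = np.array([0.122, 0.198, 0.281, 0.374, 0.470]) - 0.081  # blank

b_prime = np.sum(absorb * conc) / np.sum(conc ** 2)   # 7.333/184 ~ 0.0399
x_unknown = (0.182 - 0.081) / b_prime                 # ~2.56 mg/kg
print(b_prime, x_unknown)
```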

5.3 Standard additions The construction and use of a calibration graph implies that the standard solutions employed and the subsequently analysed solutions are of similar composition, with the exception of the concentration of the analyte. If this is not the case then it would be unwise to rely on the calibration graph to provide accurate results. It would obviously be incorrect to use a range of aqueous standards to calibrate an atomic absorption spectrometer for the analysis of petroleum products. The problem of badly matched standards


and samples can give rise to a severe systematic error in an analysis. One technique to overcome the problem is to use the method of standard additions. By this method the sample to be analysed is split into several sub-samples and to each is added a small volume of a known standard. A simple example will serve to illustrate the technique and subsequent calculations.

Example. Flame photometry is a common method of analysis for the determination of the alkali metals in solution. The technique employs inexpensive instrumentation and a typical analysis is the measurement of extractable potassium in soils. A sample of the soil (5 g) is extracted with 200 ml of 0·5 M acetic acid and the resultant solution, containing many elements other than potassium, can be analysed directly. From previous studies it is known that the potassium concentration is likely to be about 100 mg kg⁻¹ dry soil. Five 20-ml aliquots of the sample solution are taken and each is made up to 25 ml using distilled water and a standard 20 mg kg⁻¹ potassium solution in the proportions shown in Table 9, and the emission intensity is measured using the flame photometer.

The results are plotted in Fig. 7. The intercept of the fitted regression line on the concentration axis indicates the amount of potassium in the original sample to be 0·062 mg. This is in 20 ml of solution, therefore 0·62 mg in the 200-ml extractant from 5 g of soil, which indicates a soil concentration of 124 mg kg⁻¹. Confidence intervals can be determined as for the general calibration graph discussed above.
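A minimal sketch of the standard-additions arithmetic (not part of the original text; the emission value for solution 2 follows the reconstruction of Table 9, and the additions of 0·02 mg per cm³ of standard are inferred from the axis of Fig. 7). The x-intercept of the fitted line, intercept/slope, estimates the potassium already present in the aliquot.

```python
import numpy as np

k_added = np.array([0.00, 0.02, 0.04, 0.06, 0.08])    # mg K added per 25 ml
emission = np.array([36.0, 45.0, 57.0, 69.0, 80.0])   # mV

slope, intercept = np.polyfit(k_added, emission, deg=1)
mg_in_aliquot = intercept / slope                     # ~0.062 mg in 20 ml

# Scale up: 20 ml aliquot -> 200 ml extract, from 5 g (0.005 kg) of soil
mg_per_kg = mg_in_aliquot * (200 / 20) / 0.005        # ~124 mg/kg
print(mg_in_aliquot, mg_per_kg)
```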

The statistical treatment and evaluation of analytical data is of paramount importance in interpreting the results. This chapter has attempted to highlight some of the more important and more pertinent techniques. The interested reader is advised and encouraged to proceed with further reading and study on this fascinating subject.

BIBLIOGRAPHY

Bevington, P.R., Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, New York, 1969.
Caulcutt, R. & Boddy, K., Statistics for Analytical Chemistry. Chapman and Hall, London, 1983.
Chatfield, C., Statistics for Technology. Chapman and Hall, London, 1975.
Davis, J.C., Statistics and Data Analysis in Geology. J. Wiley and Sons, New York, 1973.
Malmstadt, H.V., Enke, C.G., Crouch, S.R. & Horlick, G., Electronic Measurements for Scientists. W.A. Benjamin, California, 1974.


Chapter 6

Visual Representation of Data Including Graphical Exploratory Data Analysis

JOHN M. THOMPSON*
Department of Biomedical Engineering and Medical Physics, University of Keele Hospital Centre, Thornburrow Drive, Hartshill, Stoke-on-Trent, Staffordshire ST4 7QB, UK

1 INTRODUCTION

1.1 Uses and misuses of visual representation
Reliance on simple number summaries, such as correlation coefficients, without plotting the data used to derive the coefficients, can lead one to misinterpret their real meaning.1-3 An excellent example of this is shown in Fig. 1, which shows a series of plots of data sets, all of which have correlation coefficients, some of which apparently indicate reasonably strong correlation. However, the plots reveal clearly the dangers of relying only on number summaries. They also demonstrate the value of graphical displays in understanding data behaviour and identifying influential observations.

Figure 2 shows two plots on different scales of the same observations used in studies of subjective elements in exploratory data analysis.4 It was found that the rescaling of plots had most effect on subjective assessment

*Present address: Department of Medical Physics and Biomedical Engineering, Queen Elizabeth Hospital, Birmingham B15 2TH, UK.


Fig. 1. Scatter plots of bivariate data, demonstrating that a favourable correlation coefficient is not sufficient evidence that variables are correlated. The four panels have correlation coefficients of 0·855, 0·980, 0·666 and 0·672.

Fig. 2. Effect of scale of plot on perception of plot; both plots have correlation coefficients of 0·8. (Reprinted with permission from W.S. Cleveland, P. Diaconis & R. McGill, Science 216, 1138-1141; copyright 1982 by the American Association for the Advancement of Science.)


of data sets with correlation coefficients between 0·3 and 0·8. The perceived degree of association was found to shift by 10-15% as a result of rescaling.

The purpose of visual representation of data is to provide the scientist/technologist, as a data analyst, with insights into data behaviour not readily obtained by nonvisual methods. However, one must be vigilant about the psychological problems of experimenter bias resulting from anchoring of one's interpretation onto preconceived notions as to the outcome of an investigation.5,6 The importance of designing the visual presentation of data so as to avoid bias in perception of either the presenter or the receiver of such displays has been emphasized by several groups.7-10 These perceptual problems are active areas of collaborative research between statisticians and psychologists. Awareness of the importance of subjective, perceptual elements in the design of effective visual representations, and a willingness to take account of these in data analysis and in the communication of the results of such analysis, should now be considered an essential part of statistical good practice.

1.2 Exploratory data analysis
In the early development of statistical analysis, data exploration was necessarily an important feature but, as sophisticated parametric statistical tools were developed for confirmatory analysis and came into general use, the exploratory approach was eclipsed and rather neglected. Pioneering work in recent decades by Tukey11,12 and others (see Bibliography) has resulted in new and powerful tools for visual representation of data and exploratory/graphical data analysis. Many of these tools have an elegant simplicity which makes them readily approachable by any person of reasonable arithmetic competence.

The four main themes of exploratory data analysis have been described by Hoaglin et al.13 as resistance, residuals, re-expression and revelation. These themes are of fundamental importance to the visual representation of data. Resistance is the provision of insensitivity to localized misbehaviour of data. Data from real world situations rarely match the idealized models of parametric statistics, so resistance to quirkiness is a valuable quality in any tool for data analysis. Residuals are what remain after we have performed some data analysis and have tried to fit the data to a model. In exploratory data analysis, careful analysis of the residuals is an essential part of the strategy. The use of resistant data analysis helps to ensure that the residuals contain both the chance variations and the more unusual departures from the main pattern. Unusual observations may


distort nonresistant analyses, so that the true extent of their influence is masked and therefore not apparent when examining the residuals. Some brief glimpses at various aspects of exploratory data analysis (EDA) are outlined below and aspects relevant to the theme of this chapter are discussed in more detail in the appropriate sections. The reader is encouraged to delve further into this subject by studying texts listed in the Bibliography and References.

1.2.1 Tabular displays
Tabular displays can either be simple, showing various key number summaries, or complex but designed in such a way as to highlight various patterns. Several tabular display tools have evolved from EDA and are described in Sections 4.3, 4.4, 5.2.1, 6.5.1 and 6.5.2.

1.2.2 Graphical displays
Many different kinds of graphical display have been developed enabling us to look at different features of the data. The most important point to bear in mind is that one should use several tools to examine various aspects of the data, so as to avoid missing key features. Examples given in later sections will illustrate this particular issue.

1.2.3 Diagnosing types of data behaviour
Many statistical software packages are now available for use on personal computers. A black box approach to their use in data analysis has its drawbacks. A salutary example of this was given by Ames & Szonyi,14 who cited a study in which they wished to evaluate the influence of an additive in improving the quality of a manufactured product. When applying the standard parametric tests, the additive apparently did not improve product quality, which did not make much sense physically. Data exploration was then undertaken to evaluate the underlying distributions of the data from product with and without additive. Both were found to be non-Gaussian and subsequent application of more appropriate nonparametric tests demonstrated clear differences.

1.2.4 Identifying outliers for further investigation
An important role for the visual representation of data is in the identification of unusual (outlier) data for further investigation, not for rejection at this stage. Fisher15 advised that 'a point is never to be excluded on statistical grounds alone'. One should then establish whether the outlier arises from transcription or transmission errors, faulty measurement,


calibration, etc. If the outlier still appears genuine, then it may well become the starting point for new experiments or surveys.

2 TYPES OF DATA

Environmental data can be classified in a variety of ways which have an influence on the choice of visual representations that may be useful, either in analysis, or for communication of ideas or information to others.

Continuous data arise from observations that are made on a measurement scale which, from an experimental viewpoint, is infinitely divisible. Measurements involving counting of animals, plants, radioactive particles or photons give us discrete data. Proportional data are in the form of ratios, such as percentages.

Spatial and time-dependent (temporal) data present special display problems, but are obviously of major concern to environmental scientists and technologists, and various techniques will be discussed in appropriate sections.

Data may be in the form of a single variable or many variables and the discussion on visual representation in this chapter starts with single variables, proceeding then to two-variable displays, following on with multivariate displays and maps. The chapter ends with a brief look at some of the software available to assist in visual representation.

3 TYPES OF DISPLAY

Although not conventionally considered as visual representation, many modern tabular display techniques provide us with useful tools for showing patterns/relationships amongst data in a quite striking visual way.

A wide range of ways of displaying data using shade, pattern, texture and colour on maps have been developed, which can be used singly, or in combination, to illustrate the geographical distribution of environmental variables and their interactions.

Graphs are the classical way of displaying data and modern computer graphics have extended the range of possibilities, especially in the projection of three or more dimensional data onto a two-dimensional display. However, as we will see later in this chapter, even very simple graphical techniques can provide the user with powerful ways of understanding and analysing data. The development of dynamic displays and animation in computer graphics has enhanced the presentation of time-dependent phenomena.

Fig. 3. Histogram of air pollution data showing the distribution of time-weighted average occupational exposures of histopathology laboratory technicians to xylene vapour (mg m⁻³) (data of the author and R. Sithamparanadarajah; generated using Stata).

Hybrid displays combining different display tools, e.g. the use of graphs and statistical symbols on maps, are an interesting development. We are all familiar with the use of weather symbols on maps, but the use of symbolic tools in the visual representation of multivariate data has developed considerably and the wide range of such tools is discussed in Section 6.3.

4 STUDYING SINGLE VARIABLES AND THEIR DISTRIBUTIONS

4.1 Histograms and frequency plots
The histogram is an often used tool for displaying distributional information and a typical histogram of a time-weighted average air pollutant concentration (occupational exposures of histopathology technicians to xylene vapour) in a specific laboratory is shown in Fig. 3. Each bar of the histogram shows the frequency of occurrence of a given concentration range of xylene vapour exposures. Another alternative is to plot the

Fig. 4. (a) Cluttered one-dimensional scatter plot using vertical bars to represent individual observations of xylene vapour exposure of histopathology technicians; it is not obvious from this plot that several observations overlap (generated using Stata). (b) Alternative means of reducing clutter using jittering of the same data used for Figs 3 and 4(a); this method of plotting reveals that there are several observations with the same value (generated using Stata).

histogram of the cumulative frequency. Here the histogram may be simplified by merely plotting the tops of the bars. This enables us to compare two histograms on the same plot (see also Section 5.2.2). Thus the cumulative frequency distribution of the observations can be compared with a model distribution of the same median and spread, in order to judge their similarities. The largest gap between the cumulative plots is the test statistic for the Kolmogorov-Smirnov one sample test.16

If the frequencies are plotted as points which are then joined by lines, we have a frequency plot and, as with the histogram, cumulative frequency plots are also useful.

4.2 Scatter plots Despite their simplicity, scatter plots are very useful for highlighting patterns in data and even for comparisons or revealing the presence of outliers.

Plotting a single variable can either be done horizontally or vertically in one dimension only, as in Fig. 4(a), in which each observation is represented as a vertical bar. It is not obvious from this particular plot that observations overlap. In order to show this, one may either stack points, or use the technique of jittering, as in Fig. 4(b). The jittering is achieved by plotting on the vertical axis uᵢ (i = 1 to n) versus the variable to be jitter plotted, xᵢ, where uᵢ is the series of integers 1 to n, in random order.17

The range of the vertical axis is kept small relative to the horizontal axis.
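A minimal matplotlib sketch of the jittering recipe just described (not part of the original text; the data here are simulated stand-ins for the xylene readings):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.poisson(150, size=53).astype(float)    # stand-in observations
u = rng.permutation(np.arange(1, len(x) + 1))  # integers 1..n, random order

fig, ax = plt.subplots(figsize=(6, 1.5))       # keep vertical range small
ax.plot(x, u, "k.", markersize=4)
ax.set_yticks([])
ax.set_xlabel("xylene")
plt.show()
```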


4.3 Stem and leaf displays
A very useful alternative to the histogram is the stem and leaf display.18 Not only can this display convey more information than the conventional histogram, showing how each bar is built up, but it can also be used to determine the median, fourths or hinges (similar to quartiles; see Section 4.4 below), and other summary statistics in a quick and simple way. Sometimes the display may be too widely spread out and we need to find a more compact form. On other occasions the display is too compact and needs to be spread out by splitting the stems. Many software packages do this automatically. For those without access to such packages, or wishing to experiment with suitable layouts for the display, various rules have been proposed based upon different functions of the number of observations in the data set.19 These produce different numbers of lines for any given number of observations. The 1 + log₂n rule makes no allowance for outliers, skewed data and multiple clumps separated by gaps, and is regarded as poor for information transfer. Emerson & Hoaglin19 recommend using the 2n^1/2 rule below 100 observations and the 10 log₁₀n rule above 100 observations. Extreme values may cause the display to be distorted but this can be avoided by placing these in two categories, 'hi' and 'lo', and then sorting the data for the stem and leaf display without the extremes.

4.4 Box and whisker plots
Various kinds of box and whisker plots20 provide useful displays of the key number summaries: median, fourths or hinges (which are closely related to the quartiles) and extremes. They enable us to show key features of a set of data, to gain an overall impression of the shape of the data distribution and to identify potential outliers.

4.4.1 Simple box and whisker plots
The simple version of this plot is illustrated in Fig. 5. The upper and lower boundaries of the box are, respectively, the upper and lower fourths (also known as hinges), which are derived as follows:21

depth of fourth = [(depth of median) + 1]/2

The depth of a data value is defined as the smaller of its upward and downward ranks. Upward ranks are derived from ordering data from the smallest value upwards; downward ranks start from the largest observation as rank 1. The numerical values of the observations at the fourths determine the upper and lower box boundaries. In-between these is a line


Fig. 5. Box and whisker plot.

Fig. 6. Notched box and whisker plot.

representing the position of the median. Beyond these boundaries stretch the whiskers. The whiskers extend out as far as the most remote points within outlier cutoffs defined as follows: 22

upper cutoff = upper fourth + 1·5 (fourth spread)

lower cutoff = lower fourth − 1·5 (fourth spread)

fourth spread = upper fourth − lower fourth

Beyond the outlier cutoffs, data may be considered as outliers and each such outlying observation is plotted as a distinct and separate point. The position of the median line, relative to the fourths, indicates the presence or absence of skewness in the data and the direction of any skewness. The lengths of the whiskers give us an indication of whether a set of data is heavy or light tailed. If the data is distributed in a Gaussian fashion, then only 0·7% of the data lie outside the outlier cutoffs.23
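The fourths and cutoffs are simple to compute from the depth rules above; the following sketch (not part of the original text) implements them directly on a small illustrative data set.

```python
import numpy as np

def fourths(data):
    """Lower fourth, median and upper fourth via depths (Section 4.4.1)."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    def at_depth(d):
        # average two order statistics when the depth is a half-integer
        return (x[int(np.floor(d)) - 1] + x[int(np.ceil(d)) - 1]) / 2
    depth_median = (n + 1) / 2
    depth_fourth = (np.floor(depth_median) + 1) / 2
    return (at_depth(depth_fourth), at_depth(depth_median),
            at_depth(n + 1 - depth_fourth))

data = [71, 102, 125, 144, 153, 170, 185, 201, 224, 305]
lf, median, uf = fourths(data)
spread = uf - lf
print("outlier cutoffs:", lf - 1.5 * spread, uf + 1.5 * spread)
```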

4.4.2 Extended versions
The outlier cutoffs are also termed inner fences by Tukey24 and observations beyond the inner fences are then termed outside values. Tukey also


defines outer fences as:

upper outer fence = upper fourth + 3 (fourth spread)

lower outer fence = lower fourth − 3 (fourth spread)

Observations beyond the outer fences are then termed far out values. Thus one now has a way of identifying and classifying outliers according to their extremeness from the middle zone of the distribution.

Recently, Frigge et al.25 have highlighted the problem that a number of statistical software packages produce box plots according to different definitions of quartiles and fences. They have offered recommendations for a single standard form of the box plot. Of the eight different definitions for the quartile that they list, they suggest that, for the time being, definition 6 (currently in Minitab and Systat) should be used as the standard. They do suggest, however, that definition 7 may eventually become the standard. These two definitions (of the lower quartile Q₁, in terms of the ordered observations x₁ ≤ x₂ ≤ x₃ ≤ … ≤ xₙ) are listed below:

- definition 6: standard fourths or hinges

Q₁ = (1 − g)xⱼ + gxⱼ₊₁

where [(n + 3)/2]/2 = j + g and g = 0 or g = 1/2; n is the number of observations, j is an integer;

- definition 7: ideal or machine fourths

Q₁ = (1 − g)xⱼ + gxⱼ₊₁

where n/4 + 1/2 = j + g.

The multiplier of the fourth spread used in calculating the fences, as

described above and in Section 4.4.1, also varies. Frigge et al.25 suggest that the use of a multiplier of 1·0 now seems too small, on the basis of accumulated experience, and 1·5 would seem to them to be a more satisfac­tory value to use for estimating the inner fences for exploratory purposes. The use of 3·0 as the multiplier for the outer fences (already discussed earlier in this section) is regarded as a useful option.

4.4.3 Notched box and whisker plots
The box plot may be modified to convey information on the confidence in the estimated median by placing a notch in the box at the position of the median.26 The size of the notch indicates the confidence interval around the median. A commonly used confidence interval is 95%. The upper and lower confidence bounds are calculated by a distribution-free procedure


TABLE 1
Relationship between letter values, their tags, tail areas and Gaussian letter spreads

Tag   Tail area   Upper Gaussian letter value   Letter spread
M     1/2
F     1/4         0·6745                        1·349
E     1/8         1·1503                        2·301
D     1/16        1·5341                        3·068
C     1/32        1·8627                        3·725
B     1/64        2·1539                        4·308
A     1/128       2·4176                        4·835
Z     1/256       2·6601                        5·320
Y     1/512       2·8856                        5·771
X     1/1024      3·0973                        6·195
W     1/2048      3·2972                        6·594

based on the Sign Test26 in Minitab, whereas Velleman & Hoaglin22 describe an alternative, in which the notches are placed at:

median ± 1·58 (fourth spread)/√n

Figure 6 illustrates the notched box plot. The applications of this display method in multivariate analysis are discussed in Section 6.1.2.

4.5 Letter value displays
So far we have dealt with two of the number summaries known as letter values: the median and the fourths or hinges. In Section 4.4.1 the calculation of the depth of the fourth from the depth of the median was shown. This may be generalized to the following:27

depth of letter value = [(previous depth) + 1]/2

so that the next letter values in this sequence are the eighths, the sixteenths, the thirty-seconds, etc. For convenience these number summaries are given letters as tags or labels; hence they have become known as letter values. Table 1 shows the relationship between letter values, their associated tags and the fraction of the data distribution remaining outside boundaries defined by the letter values (the tail areas). Tables of letter values can range from simple 5 or 7 number summaries, as in Fig. 7, to much more comprehensive letter value displays, as in Fig. 8. The latter shows other information obtainable from the letter values, including mid summaries (abbreviated to mid), spreads and pseudosigmas. These are discussed in


(a) 5-number summary, # 53

Tag   Depth   Lower    Mid      Upper
M     27               153·0
F     14      125·0             201·0
      1        71·0             305·0

(b) 7-number summary, # 53

Tag   Depth   Lower    Mid      Upper
M     27               153·0
F     14      125·0             201·0
E     7·5     116·5             224·0
      1        71·0             305·0

Fig. 7. (a) Five and (b) seven number summaries: simple letter value displays of the xylene vapour exposure data.

218 174 125 170 145 180 124 135 115 148 264 305 107 144 202 239 106 171 201 137 216 224 153 102 173 125 119 154 186 118 204 141 226 105 150 194 129 118 233 224 227 170 128 173 71 185 144 209 155 124 108 124 144

       Depth   Lower     Upper     Mid       Spread
N 53
M      27·0    153·000   153·000   153·000
H      14·0    125·000   201·000   163·000    76·000
E       7·5    116·500   224·000   170·250   107·500
D       4·0    106·000   233·000   169·500   127·000
C       2·5    103·500   251·500   177·500   148·000
B       1·5     86·500   284·500   185·500   198·000
        1       71·000   305·000   188·000   234·000

Fig. 8. Comprehensive letter value display of xylene vapour exposure data generated using Minitab, showing the depth of each letter value in from each end of the ranked data, the lower and upper letter values, the spread between them and the mean of the lower and upper letter values known as the 'mid'. Tabulated above the display is the original data set.


detail in Refs. 24, 27 and 28, and will be briefly discussed here. The mid summary is the average of a pair of letter values, so we can have the midfourth, the mideighth, etc., all the way through to the midextreme or midrange. The median is already a midsummary. The spread is the difference between the upper and lower letter values. The pseudosigma is calculated by dividing the letter spread by the Gaussian letter spread.
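A minimal sketch of a letter value display with mids, spreads and pseudosigmas (not part of the original text); the Gaussian letter spreads are computed from the normal quantiles, consistent with Table 1.

```python
import numpy as np
from scipy import stats

def letter_values(data, tags="MFEDCB"):
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    def at_depth(d):
        return (x[int(np.floor(d)) - 1] + x[int(np.ceil(d)) - 1]) / 2
    depth = (n + 1) / 2
    for i, tag in enumerate(tags):
        lower, upper = at_depth(depth), at_depth(n + 1 - depth)
        mid, spread = (lower + upper) / 2, upper - lower
        tail = 2.0 ** -(i + 1)                 # 1/2, 1/4, 1/8, ...
        gauss = 2 * stats.norm.ppf(1 - tail)   # Gaussian letter spread
        pseudosigma = spread / gauss if gauss > 0 else float("nan")
        print(f"{tag} {depth:5.1f} {lower:8.2f} {upper:8.2f} "
              f"{mid:8.2f} {spread:8.2f} {pseudosigma:8.2f}")
        depth = (np.floor(depth) + 1) / 2      # next letter value depth
```

Applied to the 53 xylene observations tabulated above, this should reproduce the depths, letter values, mids and spreads of Fig. 8; for approximately Gaussian data the pseudosigmas are roughly constant.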

Letter values and the derived parameters can be used to provide us with a powerful range of graphical methods for exploring the shapes of data distributions, especially for the analysis of skewness and elongation.28

These techniques will be discussed in more detail in Section 4.7.

4.6 Graphical assessment of assumptions about data distributions
It is often a prerequisite to the use of a particular parametric statistical test that the assumptions about the distribution the data are thought to follow be tested. There are many useful graphical tools with which to test such assumptions, whether one is dealing with discrete or continuous data.

4.6.1 Looking at discrete data distributions
Since much environmental/ecological data is of the discrete type, it is useful to have available some suitable graphical means of calculating distributional parameters and deciding on which discrete distribution best describes the data. Amongst the most widely known discrete data distributions that are identifiable by graphical methods are the following:29

Poisson:

pₖ = e⁻ᴸ Lᵏ/k!;   k = 0, 1, 2, …;   L > 0

Binomial:

pₖ = [N!/(k!(N − k)!)] pᵏ(1 − p)ᴺ⁻ᵏ;   k = 0, 1, 2, …, N;   0 < p < 1

Negative binomial:

pₖ = [(k + m − 1)!/(k!(m − 1)!)] pᵐ(1 − p)ᵏ;   k = 0, 1, 2, …;   0 < p < 1, m > 0


Fig. 9. Ord's procedure for graphical analysis of discrete data; the slope b of the fitted line distinguishes the binomial (b < 0), Poisson (b = 0), negative binomial (b > 0) and log series distributions. (Reproduced with permission from S.C.H. du Toit, A.G.W. Steyn and R.H. Stumpf, 'Graphical Exploratory Data Analysis', p. 38, Fig. 3.1; copyright 1986 Springer-Verlag, New York Inc.)

Logarithmic series:

pₖ = −φᵏ/[k ln(1 − φ)];   k = 1, 2, …;   0 < φ < 1

where pₖ is the relative number of observations in group k in the sample, nₖ is the number of observations in group k in the sample, L is the mean of the Poisson distribution, p is the proportion of the population with the particular attribute and φ is the logarithmic series parameter.

4.6.1.1 Ord's procedure for large samples. Ord's procedure for large samples30 involves calculating the parameter

uₖ = kpₖ/pₖ₋₁

for all observed k values and then plotting uₖ versus k for all nₖ₋₁ > 5. If this plot appears reasonably linear, one of the models listed in Section 4.6.1 can be chosen by comparison with those in Fig. 9, but this procedure is regarded as suitable only for use with large samples of data.30 More appropriate procedures for smaller samples are described briefly in the next three sections; they are discussed in detail in Ref. 29.

4.6.1.2 Poissonness plots. Poissonness plots were originally proposed by Hoaglin31 and were subsequently modified by Hoaglin & Tukey,29 in order to simplify comparisons for frequency distributions with different total counts, N. We plot:

logₑ(k! nₖ/N) versus k


where k is the number of classes into which the distribution is divided and nₖ is the number of observations in each class. The slope of this plot is logₑ(L) and the intercept is −L. The slope may be estimated by resistantly fitting a line to the points and then estimating L from L = eᵇ. This procedure works even if the Poisson distribution being fitted is truncated or is a 'no-zeroes' Poisson distribution. Discrepant points in the plot do not affect the position of other points, so the procedure is reasonably resistant. A further improvement, which Hoaglin & Tukey29 introduced, 'levels' the Poissonness plot by plotting:

logₑ(k! nₖ/N) + [L₀ − k logₑ(L₀)] versus k

where L₀ is a rough value for the Poisson parameter L; this new plot would have a slope of logₑ(L) − logₑ(L₀) and intercept L₀ − L. If the original L₀ was a reasonable estimate then this plot will be nearly as flat as it is possible to achieve. If there is systematic curvature in the Poissonness plot then the distribution is not Poisson.
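A minimal sketch of a levelled Poissonness plot (not part of the original text); the counts here are simulated Poisson data, and scipy's gammaln supplies logₑ(k!).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gammaln

rng = np.random.default_rng(2)
counts = np.bincount(rng.poisson(3.0, size=200))
k = np.arange(len(counts))
keep = counts > 0                         # log of zero counts undefined
k, n_k = k[keep], counts[keep]
N = n_k.sum()

phi = gammaln(k + 1) + np.log(n_k / N)    # log_e(k! n_k / N)
L0 = 3.0                                  # rough value for L (simulation mean)
levelled = phi + (L0 - k * np.log(L0))    # Hoaglin & Tukey levelling

plt.plot(k, levelled, "ko")
plt.xlabel("k"); plt.ylabel("levelled count metameter")
plt.show()                                # nearly flat if the data are Poisson
```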

4.6.1.3 Confidence interval plots. Sometimes an isolated point appears to have deviated from an otherwise linear Poissonness plot and it would be useful to judge how discrepant that point is. A confidence interval plot enables us to determine the extent of the discrepancy. However, in such a plot, the use of confidence intervals based on the significance levels for individual categories will mean that there is a substantial likelihood of one category appearing discrepant even when it is not. To avoid this trap, we can use simultaneous significance levels. Hoaglin & Tukey29 suggest using both individual and simultaneous confidence intervals on a levelled plot so that the discrepant points are more readily identifiable.

4.6.1.4 Plots for other discrete data distributions. The same approach can be used for checking binomialness, negative binomialness or tendency to conform to the logarithmic or the geometric series. Many contagious insect populations follow a negative binomial distribution, whose characteristic parameter is a measure of dispersion; Southwood32 discusses the problems of estimating this parameter using the more traditional approach. It would be useful to re-assess data, previously assessed using the methods discussed by Southwood,32 with the methods proposed in Hoaglin & Tukey.29 However, space and time constraints limit the author to alerting the interested reader to studying Ref. 29 on these very useful plots for estimating discrete distribution parameters, and for hunting out discrepancies in a way that is resistant to misclassification.

4.6.2 Continuous data distributions
Whilst discrete data are important in environmental and ecological investigations, continuous data are obviously a major interest and the main thrust of research on graphical and illustrative methods remains in this area.

4.6.2.1 Theoretical quantile-quantile or probability plots. In exploring the data to establish conformity to a particular distribution, one approach is to sort the data into ascending order; obtain quantiles of the distribution of interest; and finally to plot the sorted data against the theoretical quantiles, thus producing a theoretical quantile-quantile plot. A 'quantile' such as the 0·76 quantile of a set of data is that number in the ordered set below which lies a fraction 0·76 of the observations and above which we find a fraction 0·24 of the data. Such plots can be done for both discrete and continuous data. Chambers et al.33 give useful guidance on appropriate parameters to plot as ordinate and abscissa and how to add variability information into the plot. They also give help on dealing with censored and grouped data, problems which are common in environmental studies. Clustering of data or absence of data in specific zones of the ordered set can produce misleading distortions of such plots, as can the natural variability of the data. Hoaglin34 has developed a simplified version of the quantile-quantile plot using letter values.

4.6.2.2 Hanging histobars and suspended rootograms. An interesting way of checking the fit of data to a distribution consists of hanging the bars of a histogram from a plot of the model distribution; this is the hanging histobars plot.35 An example of such a comparison with a Gaussian distribution is shown in Fig. 10. This kind of plot is available in Statgraphics PC.

Fig. 10. Hanging histobars plot of the xylene vapour exposure data (generated using Statgraphics PC).

An alternative plot which enables one to check the fit to a Gaussian is the suspended rootogram devised by Velleman & Hoaglin.36 Instead of using the actual count of observations in each bin (or bar) of the histogram, we use a function based on the square root of the bin count, in order to stabilize the variance. The comparison is made with a Gaussian curve fitted using the median (as an indicator of the middle of the data) and the fourths as characteristics of the spread of the data (to protect the fit against outliers and unusual bin values). The function used is the double root residual (DRR), which is calculated as follows:36

DRR = [2 + 4(observed)]^(1/2) - [1 + 4(fitted)]^(1/2),   if observed > 0
    = 1 - [1 + 4(fitted)]^(1/2),                         if observed = 0

This kind of plot is available in Minitab as ROOTOGRAM.
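
As a short illustration of the double root residual, here is a sketch assuming numpy is available; the bin counts and fitted values are invented:

```python
import numpy as np

def double_root_residuals(observed, fitted):
    """Double root residuals for a suspended rootogram,
    following the definition quoted above."""
    observed = np.asarray(observed, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    return np.where(observed > 0,
                    np.sqrt(2 + 4 * observed) - np.sqrt(1 + 4 * fitted),
                    1 - np.sqrt(1 + 4 * fitted))

obs = np.array([2, 7, 15, 21, 14, 6, 1])                 # histogram bin counts
fit = np.array([1.8, 7.4, 16.0, 19.9, 14.2, 5.8, 1.4])   # fitted Gaussian counts
print(double_root_residuals(obs, fit))                   # values near 0 => good fit
```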

4.7 Diagnostic plots for studying distribution shape
Rather than testing the data for adherence to a particular model distribution, an alternative approach is to examine features of the actual distribution of the observations in an exploratory fashion. In this case, we are looking at features such as skewness and light or heavy tails. The letter values, described earlier, provide a suitable set of summary statistics with which to explore these kinds of feature visually, and from which we can calculate other parameters for graphical analysis of shape. The five plots discussed below are described in more detail in Ref. 37.

4.7.1 Upper versus lower letter value plots
If the data distribution is not skewed, plotting the upper letter values versus the corresponding lower letter values should give a straight line of slope -1. This reference line of slope -1 should be plotted explicitly on the graph as well, so that the direction of any deviations can be seen more clearly.

4.7.2 Mid versus spread plots
An improvement on the upper versus lower plot, enabling deviations to be more clearly seen, is obtained by plotting the midsummaries of the letter values (mids) versus the letter value spreads. If the plotted points curve steadily upwards to the right, away from the horizontal line through the median, this indicates right skewness. On the other hand, if the points curve downwards, we have left skewness.

4.7.3 Mid versus z² plots
A problem with the mid versus spread plot is that the more extreme spreads are adversely affected by outliers and, as a consequence, the plot will be strung out over an unnecessarily large range. To overcome this, we can plot the mids versus the square of the corresponding Gaussian quantile (z²). A distribution with no skewness again gives a horizontal line.

4.7.4 Pseudosigma versus z² plots
If pseudosigmas are calculated from the corresponding letter values and the data are normally distributed, the pseudosigmas will not differ greatly from one another; but if the data distribution is elongated, the pseudosigmas will increase with increasing letter value. With data distributions less elongated than the Gaussian, the pseudosigmas will decrease. Thus plotting pseudosigmas against z² will enable us to diagnose elongation of a data distribution.
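
The quantities used in these shape plots are easily computed. The following is a rough sketch assuming numpy and scipy are available, and taking letter values as simple tail quantiles rather than by the exact depth-based EDA rules:

```python
import numpy as np
from scipy.stats import norm

def letter_value_shape_summaries(x):
    """Mids, spreads and pseudosigmas from letter values
    (fourths, eighths, ...), for the shape plots above."""
    x = np.asarray(x, dtype=float)
    rows = []
    for i in range(2, 7):                  # tail areas 1/4, 1/8, ..., 1/64
        p = 2.0 ** -i
        lower, upper = np.quantile(x, [p, 1 - p])
        z = norm.ppf(1 - p)                # standard Gaussian quantile
        rows.append({"mid": (lower + upper) / 2,        # midsummary
                     "spread": upper - lower,           # letter spread
                     "z2": z ** 2,
                     "pseudosigma": (upper - lower) / (2 * z)})
    return rows

# Roughly constant mids => little skewness; pseudosigmas growing
# with letter value => tails more elongated than the Gaussian.
```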

4.7.5 Pushback analysis and flattened letter values versus z plots
Sometimes a data distribution will be elongated to differing extents in each tail. In order to study this, we need to subtract a Gaussian shape from the distribution, essentially pushing back or flattening the distribution and the letter values. A quick and resistant way of obtaining a scale factor with which to flatten the letter values is to find the median s of the pseudosigmas. Multiplying s by the standard Gaussian quantile z for each upper and lower letter value, and subtracting that product from the actual letter value, yields a set of flattened letter values. A plot of these versus z will reveal the behaviour of each tail separately. If the data are normally distributed, the plot is a straight line. If more than one straight-line segment is found, then the distribution may be built up from more than one Gaussian. Deviations from linearity indicate more complex behaviour.

4.7.6 Diagnostic plots to quantitate skewness and elongation
Hoaglin38 describes graphical methods for obtaining numerical summaries of shape using further functions calculated from the letter values. These summaries disentangle skewness from elongation. If data are collected from a skewed distribution, we may attribute elongation to the longer tail, which may have arisen purely as a function of skewness. Hoaglin's g and h functions enable us to extract from the data that part of any tail elongation arising from skewness and separately extract that part arising from pure elongation. After calculating the skewness function, its influence is removed by adjustment of the letter value half-spreads. Separate plots for the elongation analysis of the left and right tails are then constructed of the natural logarithms of the upper and lower adjusted letter value half-spreads versus z², the slopes of which are elongation factors for each tail.

5 REPRESENTING BIVARIATE DATA

5.1 Graphics for regression and correlation
As was pointed out in Section 1, it is very important not to rely solely on number summaries, such as correlation and regression coefficients, when investigating relationships between variables. A wide variety of graphical tools is available to enable one to present bivariate data and explore relationships. As our understanding of the limitations of some of the earlier tools has grown, new methods have evolved which provide new insights. Some of the limitations and new tools are discussed below.

5.1.1 Two dimensional scatter plots
Simply plotting a two dimensional scatter plot can provide useful insight into the relationship between two variables, but it should be only the starting point for data exploration. One can go further by plotting a regression line and confidence envelopes on the scatter plot, but there are numerous ways of performing regression and of calculating confidence envelopes. In environmental studies, much data is collected by observation in the field or by measurements made on field-collected samples, as opposed to laboratory experiments. When regressing such variables against one another, ordinary least squares regression is inappropriate because the x variable is assumed to be error free; clearly this is not the case in practice. Deming39 and Mandel40 have discussed ways of allowing for errors in x within the framework of ordinary least squares. If outliers and/or non-Gaussian behaviour are present, any such extreme data will have a disproportionate effect on the position of the regression line, even for Deming/Mandel regression. Various exploratory, robust and nonparametric methods of regression are available which protect against such behaviour.41-43 No one method will give a unique best line, but methods are now appearing which enable confidence envelopes around resistant lines to be calculated.44,45 Many of these resistant and robust regression methods are likely to be in reasonable agreement with one another. Using such methods, one can plot useful regression lines and confidence envelopes. The scatter plot is also essential as a diagnostic tool in the examination of regression residuals, guiding us on whether the data need transformation to improve the regression.
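
One resistant method of the kind cited above is Theil's estimator (Ref. 41), based on the median of all pairwise slopes; SciPy provides an implementation that also returns a confidence interval for the slope. A minimal sketch with invented data:

```python
import numpy as np
from scipy.stats import theilslopes

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, x.size)
y[::13] += 4.0                    # a few gross outliers

# median-of-pairwise-slopes line, resistant to the outliers
slope, intercept, lo, hi = theilslopes(y, x, alpha=0.95)
print(f"y = {intercept:.2f} + {slope:.2f} x; 95% slope CI ({lo:.2f}, {hi:.2f})")
```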

5.1.1.1 Problems with plotting local high data densities. Sometimes, in plotting data from large samples, one observes such local high densities of data points that it is difficult to convey visually the density of points. Minitab overcomes this by plotting numbers representing the numbers of points it is trying to plot at particular coordinates. Other ways of dealing with this problem include the use of symbols indicating the number of overlapping points (Section 5.1.1.2) and the use of sharpened plots (Section 5.1.1.3).

5.1.1.2 Use of sunflower symbols. Where there is substantial overlap on scatter plots, this may be alleviated by means of symbols indicating the amount of local high density, which were devised by Cleveland & McGill46 and which they called sunflowers. A dot means one observation, a dot with two lines means two observations, a dot with three lines means three observations, etc. Dividing the plot up into square cells and counting the number of observations per cell enables us to plot a bivariate histogram with a sunflower at the centre of each cell. The process of dividing the plot into cells was termed cellulation by Tukey & Tukey.47
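
The cellulation step can be sketched with a two-dimensional histogram (numpy assumed, data invented; drawing the sunflower petals themselves is left to the plotting layer):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(scale=0.5, size=500)

# divide the plotting region into square cells and count the points
# in each; every non-empty cell would be drawn as a sunflower whose
# number of petals equals its count
counts, xedges, yedges = np.histogram2d(x, y, bins=12)
for i, j in zip(*np.nonzero(counts)):
    cx = 0.5 * (xedges[i] + xedges[i + 1])   # cell centre, x
    cy = 0.5 * (yedges[j] + yedges[j + 1])   # cell centre, y
    petals = int(counts[i, j])               # symbol to place at (cx, cy)
```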

Fig. 11. Comparing cumulative frequency plots.

5.1.1.3 Sharpened scatter plots. Chambers et al.48 describe ways of revealing detail in scatter plots by a technique which they term sharpening, in which a series of plots is produced, each successive plot in the series retaining only the regions of successively higher relative data density. The plots give us an impression of contours in the data without needing to draw them in. They give algorithms for sharpened scatter plots applicable even when the data scales are different on each axis.

5.1.2 Strip box plots
It is sometimes useful to simplify a detailed bivariate plot by dividing it into a series of vertical strips and summarising the distribution of y values in each strip as a box plot.

5.2 Diagnostic plots for bivariate regression analysis

5.2.1 Back-to-back stem and leaf displays
We saw the use of stem and leaf displays in Section 4.3; we can extend their use to the comparison of the distributions of two variables by placing their stem and leaf displays back-to-back, allowing visual appreciation of differences in the location and spread of each distribution.

5.2.2 Cumulative frequency plots
In a similar way to the back-to-back stem and leaf display, we may compare the distributions of two sets of observations by plotting each set's cumulative frequency distribution on a common scale, as in Fig. 11. We use the largest difference between the two plots as the test statistic for the two-sample Kolmogorov-Smirnov test for comparing two empirical distributions (cf. Section 4.1).
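
For reference, the two-sample Kolmogorov-Smirnov statistic described here is available in SciPy; a sketch with invented site data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
site_a = rng.normal(0.0, 1.0, 80)
site_b = rng.normal(0.4, 1.2, 65)

# D is the largest vertical gap between the two cumulative
# frequency curves of Fig. 11
D, p = ks_2samp(site_a, site_b)
print(f"D = {D:.3f}, p = {p:.3f}")
```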

Fig. 12. Patterns of residuals from linear regressions.

5.2.3 Empirical quantile-quantile plots
In this kind of display, we plot the median of the y variable versus the median of the x variable, the upper y quartile versus the upper x quartile, and so on; each point on the plot corresponds to the same quantile in each distribution. If we plot the line of identity (y = x) on this plot, we can readily see how the two distributions compare and in what manner they may differ.
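
A sketch of the computation (numpy assumed, data invented): matching quantiles of the two samples are paired, and would be plotted against each other together with the line y = x:

```python
import numpy as np

def empirical_qq(x, y, n_points=19):
    """Paired quantiles for an empirical quantile-quantile plot."""
    probs = np.linspace(0.05, 0.95, n_points)
    return np.quantile(x, probs), np.quantile(y, probs)

rng = np.random.default_rng(3)
qx, qy = empirical_qq(rng.normal(size=120), rng.lognormal(size=150))
# points rising away from the line y = x reveal the heavier upper
# tail of the second (lognormal) sample
```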

5.2.4 Diagnosing the need to transform variables: residuals plots
When exploring the relationship between two variables, we usually first test whether the relationship is linear. It is then helpful to test the reliability of the hypothesis of a linear relationship by various graphical methods, because these frequently reveal deviations which would not be revealed by simply relying on the correlation coefficient. Tiley2 demonstrated how easily one can be misled into believing that one is dealing with a strongly linear relationship: in a series of model experiments of supposedly increasing 'precision', the correlation coefficient was found to improve with each increase in experimental 'precision', yet a residuals plot demonstrated the nonlinearity in the data.

The difference between the observed and fitted y values, at any given x value, is the residual of that corresponding y value. These residual y values may be plotted against the x values; this is termed a residuals plot. The shape of this plot is a useful indicator of whether we need to transform the y values using, for example, the logarithmic transformation. Figure 12 shows several different residuals plots characteristic of both linear and nonlinear relationships. The unstructured cloud of points shows us that the relationship is linear; the sloping band indicates the need for addition of a linear term; the curved band suggests a nonlinear dependence; the wedge-shaped band indicates increasing variability of y with increasing x values. Such plots should be used both before and after any transformation of data as a first stage in demonstrating whether the transformation is satisfactory.
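
A residuals plot of this kind takes only a few lines; the sketch below (numpy assumed, data invented) fits a straight line to curved data, so the residuals form the curved band of Fig. 12:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 30)
y = 2 * x ** 1.5 + rng.normal(0, 1, 30)   # genuinely nonlinear data

b, a = np.polyfit(x, y, 1)                # straight-line fit
residuals = y - (a + b * x)               # observed minus fitted y
# plotted against x, these residuals curve systematically,
# suggesting a transformation of y before refitting
```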

A useful brief introduction to residuals plots is given in Ref. 49 and a more detailed discussion is given in the excellent introductory monograph by Atkinson50 on graphical methods of diagnostic regression analysis.

5.2.5 Identifying outliers and points of high influence or leverage
An important role for visual representations of data is in the identification and classification of unusual data within a sample. With univariate data analysis, we have already seen useful ways of identifying and classifying unusual data features. Those methods can be applied to each variable separately in a multivariate analysis, but with the warning that variables may influence others and it is vital to hunt for unusual behaviour under those conditions as well. In order to do so, we must understand some basic concepts relating to unusual behaviour in bivariate regression. We can also make use of these concepts again in the discussion of exploratory graphical multivariate analysis.

The first point to emphasize is the necessity of using robust and resistant regression methods, because ordinary least squares regression is too easily influenced by unusual points. Identification of outliers in such circumstances is fraught with difficulty because outliers may mask one another's influence (this is quite definitely not the same as inferring that they are cancelling out each other's influence). Graphical exploration combined with subjective assessment is helpful in protecting against mislabelling points as outliers and missing other points which may well be outliers, but unfortunately it is not a foolproof safety net.

Another important concept is that points may actually exert a large influence without appearing to be outliers at all, even in a bivariate scatter plot. Such points may be ones with a high 'leverage' on the regression, and a number of diagnostic measures of influence and leverage have been proposed which can be usefully employed in graphical identification of unusual points. Many of these diagnostics have been incorporated into Minitab, Stata and other PC-based software. As Hampel et al.51 and Rousseeuw & Leroy52 make clear, the purpose of such plots is identification, not rejection. If a point is clearly erroneous, as a result of faulty measurement, recording or transcription, then we have sound reasons for rejection. Otherwise the identification of outliers should be a starting point for the search for the reasons for the unusual behaviour and/or refinement of the hypothesized model.

5.3 Robust and resistant methods
The lack of resistance and robustness of ordinary least squares regression and related methods to unusual data behaviour has prompted many statisticians and data analysts to search for better methods of assessment of data. The subject is now a vast one and graphical methods play a key role. The bibliography at the end of this chapter provides a suitable list for interested readers wishing to learn about these areas.

6 STUDYING MULTIVARIATE DATA

6.1 Multiple comparisons and graphical one-way analysis of variance
Often in environmental studies, it is of interest to determine whether the concentrations of a pollutant found in specimens of an indicator species at various sites differ significantly from one site to another. This is a problem of one-way analysis of variance, and such problems can be approached graphically. Such an approach may not only be useful to the data analyst but may be helpful in communicating the analysis to others.

6.1.1 Multiple one-dimensional scatter plots
In a recent study53 of the exposure of operating theatre staff to volatile anaesthetic agent contamination of the theatre air, the time-weighted average exposures for individuals in different occupational and task groups in a theatre were measured for several operating sessions. Jittered scatter plots of these exposures are shown in Fig. 13 and they may be used to explore differences in group exposure to this occupational contaminant. The qualitative impression of these differences was confirmed by Kruskal-Wallis one-way analysis of variance by ranks and subsequent multiple comparisons. Thus the multiple scatter plot served a useful role in supporting the formal analysis of variance.

Fig. 13. Multiple jittered scatter plots of occupational exposure to the volatile anaesthetic agent, halothane, in a poorly ventilated operating theatre, mg m⁻³ (data of the author and R. Sithamparanadarajah; generated using Stata).

6.1.2 Multiple notched box and whisker plots
The same data may be plotted as box plots. This can have the advantage of reducing clutter in the plot and focussing attention on two main aspects of each data set: the medians and the spreads. This makes the comparison more useful, but it still rests on somewhat shaky foundations in relying on subjective analysis. This may be improved upon by using notched boxes and looking for overlap or lack of overlap of the notches, as in Fig. 14. Where notches do not overlap, this is an indication that the medians may be significantly different. The further apart the notches are, the more reliable that conclusion is. Thus, we have a simple form of graphical one-way analysis of variance, including a multiple comparisons test.

Fig. 14. Graphical one-way analysis of variance, using multiple notched box and whisker plots, of the same data as in Fig. 13 (generated using Statgraphics PC).
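
Notched box plots of this kind are available in matplotlib; a sketch with invented exposure data for three of the groups:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
groups = {"anaesth": rng.lognormal(3.5, 0.6, 25),
          "surgeon": rng.lognormal(3.0, 0.5, 25),
          "scrbnurs": rng.lognormal(2.6, 0.5, 25)}

fig, ax = plt.subplots()
# notch=True draws the median notches; notches that fail to
# overlap suggest the medians may differ significantly
ax.boxplot(list(groups.values()), notch=True)
ax.set_xticklabels(list(groups))
ax.set_ylabel("exposure, mg m$^{-3}$")
plt.show()
```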

6.2 Solving the problems of illustrating multivariate data in two dimensions
Multivariate analysis is often helped by the use of graphical and/or symbolic plots, which can sometimes provide quite striking illustrations of the distributions of variables through the range of specimens examined. We must be very wary, though, of becoming fascinated by the symbolism and missing the purpose of the display. Tufte73 has drawn attention to the ever-present dangers of poor use of graphics and warns us to be careful not to fall into the trap of inadvertently and unintentionally 'lying', or conveying the wrong message, by using displays that are poorly designed in terms of the use of symbols, labelling, graphical 'junk' (such as distracting lines, etc.), shading and colour. Although this is a problem with univariate and bivariate data, it is much more acute when dealing with multivariate data.

Much attention is being devoted to the devising of clear displays, and there is much to be gained from collaboration between statisticians/data analysts and psychologists in this area. Kosslyn55 has usefully reviewed some recent monographs on graphical display methods from the perceptual psychologist's viewpoint.

6.2.1 One dimensional views
When beginning to look at multivariate data, it is sensible to start by looking at individual variables using the methods already discussed for univariate samples. Thus one builds up a picture of the shapes of the distributions of each variable before 'plunging in at the deep end' with the immediate use of multivariate methods. Adopting the exploratory approach, one is alerted to some of the potential problems at the start and a strategy can then be evolved to deal with them in a systematic way.

6.2.2 Two dimensional views for three variables
The next useful stage is the examination of plots for all the pairs of variables in the set. This can be illustrated with sets of three variables.

6.2.2.1 Draftsman's plots. We can arrange the three possible pairwise plots of a set of three variables so that adjacent plots share an axis in common. As an example, we can look at data from a study of occupational exposure to air contaminants in an operating theatre. The time-weighted average exposure to halothane is compared with the concentrations in the blood and expired air from anaesthetists in Fig. 15. This approach is similar to that which a draftsman takes in providing 'front', 'top' and 'side' views of an object. This analogy led Tukey & Tukey47,56 to call such a display a draftsman's display (or scatterplot matrix, as it is called in Stata).

Relationships between variables can be explored qualitatively using this approach, aiding the data analyst in the subsequent multivariate analysis. If several groups are present in the data, then the use of different symbols for the different groups helps in the visual exploration of the data.
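
A draftsman's display (scatterplot matrix) can be produced with pandas; a sketch with invented exposure, blood and expired-air values:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
exposure = rng.lognormal(3.0, 0.5, 40)
df = pd.DataFrame({
    "exposure": exposure,
    "blood": 0.05 * exposure + rng.normal(0, 0.5, 40),
    "expired": 0.02 * exposure + rng.normal(0, 0.2, 40),
})

# all pairwise scatter plots, adjacent panels sharing an axis
pd.plotting.scatter_matrix(df, diagonal="hist")
plt.show()
```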

6.2.2.2 Casement plots. An alternative approach is to partition the data into various subsets according to the values of one of the variables. This is rather like taking slices through the cube in which the three variables might be plotted. The cubical (i.e. three dimensional) scatter plot is shown in Fig. 16 and the slices through the cube in Fig. 17. Tukey & Tukey56 call this a casement display and the process by which the display is built up was illustrated schematically by Chambers et al.57 Each of these kinds of display is useful in drawing one's attention to features of the data that may not be so apparent in the other.

Fig. 15. Draftsman's plot of occupational exposures to halothane and the corresponding expired air and venous blood concentrations from the same individuals (generated using Statgraphics PC from the data of the author and R. Sithamparanadarajah).

6.2.3 Generalization to four or more variables
Both the draftsman's and the casement plots can be extended to apply to four or more variables, and then the main difficulty arises in trying to analyse such complex displays, especially as the number of dimensions studied increases. An example of a draftsman's display or scatterplot matrix for a many-variable problem is shown in Fig. 18, which shows the results of trace-element analysis (for 6 of 13 assayed trace metals) of various stream sediments taken from streams draining an old mining district in Sweden.58 Chambers et al.57 discuss the role of these plots in multivariate analysis, including the assessment of the effects of various data transformations in achieving useful data models.

Fig. 16. Three dimensional scatter plot (same data as Fig. 15; generated using Statgraphics PC).

Becker & Cleveland59 have developed a dynamic, interactive method of highlighting various features in many-variable draftsman's plots to perform various tasks, including: single-point and cluster linking, conditioning on single and on two variables, subsetting with categorical variables, and stationarity probing of a time series. More recently, Mead60 has introduced a new technique of graphical exploratory data analysis of multivariate data, which may be all continuous, all binary or a mixture of these, which he calls the sorted binary plot. He developed the plot originally to enable pattern analyses to be made of mass spectrometric measurements on process-effluent samples, in order to evaluate suspected fluctuations in the process. The first step is to prepare a table of residuals by taking the median values for each variable from the data vectors for the individual samples. Only the signs of the residuals are used to make the plot. Mead describes simple sorting routines and illustrates these ideas with various data sets.60
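
Mead's first step is easy to sketch (numpy assumed, data invented): subtract each variable's median and keep only the signs of the residuals; his sorting routines are omitted here:

```python
import numpy as np

def binary_sign_table(X):
    """Signs of residuals about each column's median: the raw
    material of the sorted binary plot (sorting step omitted)."""
    X = np.asarray(X, dtype=float)
    return np.where(X - np.median(X, axis=0) >= 0, "+", "-")

X = np.random.default_rng(7).normal(size=(6, 4))   # 6 samples, 4 variables
print(binary_sign_table(X))
```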

Fig. 17. Casement display of same data as Fig. 15 (generated using Statgraphics PC).

6.3 Use of multicode symbols: the alternative approach for multidimensional data displays
Several distinctive multicode symbols have been developed which enable one to display the values of the variables in an individual data vector and simultaneously to display several such symbols together, in order to show the variations from one individual sample to another. Because we are adept at recognizing shape changes, this opens up the possibility of graphical cluster or discriminant analysis, for example.

Fig. 18. Scatter-plot matrix or draftsman's display of the concentrations of six trace metals in sediments from streams draining an old mining district in Sweden (generated using Statgraphics PC with data from Ref. 58).

6.3.1 Profile symbols
The simplest way to represent a many-dimensional observation is by the use of vertical bars or by the use of lines joining the midpoints of the bar tops61 (see Fig. 19). The appearance of this plot depends on the ordering of the variables, so this must be kept unvarying from one data vector to another. The plot is termed a 'profile symbol' and many profiles may be plotted on the same graph for comparison.

Fig. 19. Profile symbol plot.

Fig. 20. Star symbol plots of the concentrations of various trace metals in 20 sediments collected from streams draining an old mining district in Sweden (generated using Stata with data from Ref. 58).

6.3.2 Star symbols
Star symbols may be regarded as profile symbols in polar coordinates62 and a typical form is shown in Fig. 20. This form is much easier to use visually and is available in Statgraphics PC in two slightly different versions: Star Symbol and Sun Ray plots. Star plots are also available in SYSTAT and Stata.
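
A star symbol is simply a closed polygon in polar coordinates; a matplotlib sketch with invented, 0-1 scaled variables:

```python
import numpy as np
import matplotlib.pyplot as plt

def star(ax, values):
    """One star: variable values as radii at equally spaced angles."""
    theta = np.linspace(0, 2 * np.pi, len(values), endpoint=False)
    ax.plot(np.append(theta, theta[0]), np.append(values, values[0]))

X = np.random.default_rng(8).random((4, 6))  # 4 samples, 6 variables in [0, 1)
fig, axes = plt.subplots(1, 4, subplot_kw={"projection": "polar"})
for ax, row in zip(axes, X):                 # keep the variable order fixed!
    star(ax, row)
    ax.set_xticks([]); ax.set_yticks([])
plt.show()
```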

6.3.3 Kleiner-Hartigan trees
As with profiles, stars depend critically on the ordering of the variables and changing this may radically change the appearance. To overcome this problem, Kleiner & Hartigan63 used a tree symbol in which each variable is represented by a branch; branches are connected to limbs, which are connected to the trunk. Using hierarchical cluster analysis to group variables together enables us to put clustered variables together on the same limb (at the outset, one must be very clear whether one is using agglomerative or divisive clustering). Going higher up the dendrogram (see Section 6.5.4) we can see which groups of variables themselves form clusters; these can in turn be represented by limbs joining together before joining the trunk. Thus, closely correlated variables will cluster together on the same limb and less closely correlated groups may appear as limbs joined together. The structure of the tree is determined by the dendrogram from which it is derived. The dendrogram represents the population average picture of all the data vectors, whereas the trees represent individual data vectors. Several benefits ensue from the use of trees:

(1) groups of variables that are closely correlated can reinforce one another's influence on the lengths of shared limbs;
(2) tree symbols are also sensitive to deviations from joint behaviour of closely linked variables.

An example of Kleiner-Hartigan trees is given in Fig. 21.

Fig. 21. Kleiner-Hartigan trees.

6.3.4 Anderson's glyphs
In 1960 Anderson64 suggested the use of glyphs in which each variable is represented as a ray drawn outwards from the top of a circle but, unlike profiles and stars, the rays are not joined together. Anderson provided some guidelines for the use of glyphs:

(1) seven variables are the most that can be represented;
(2) divide the range of each variable into categories, e.g. quartiles or deciles, so that all variables are normalized to the same rank scale;
(3) correlated variables should as far as possible be associated with adjacent rays;
(4) if two or more types of case (e.g. different sex or species) are being analysed together, the circles of the glyphs could be coloured or shaded to distinguish them.

An example of the glyph representation is shown in Fig. 22.

Fig. 22. Anderson's glyphs.

6.3.5 Andrews' curves
The use of a Fourier function to generate a curve representing a data vector was suggested by Andrews in 1972.65 For a p-variable data vector x' = (x_1, x_2, ..., x_p), the Andrews curve is calculated using the function:

f_x(t) = x_1/√2 + x_2 sin t + x_3 cos t + x_4 sin 2t + x_5 cos 2t + ...

This function is plotted over the range -π < t < π, and it has the following interesting and useful properties:

(1) it preserves the means, so that at each point t, the function corresponding to the mean vector is the mean of the n functions corresponding to the n observations;
(2) it preserves Euclidean distances, i.e. the Euclidean distance between two observation vectors is directly proportional to the Euclidean distance between the two functions which represent those observations.

An example of an Andrews curve plot is shown in Fig. 23. These plots are available in SYSTAT under the heading of Fourier plots.
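
The Andrews function itself is straightforward to evaluate; a sketch assuming numpy, with invented data:

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t + ..."""
    f = np.full_like(t, x[0] / np.sqrt(2))
    for j, xj in enumerate(x[1:], start=1):
        k = (j + 1) // 2                    # harmonic 1, 1, 2, 2, 3, ...
        f += xj * (np.sin(k * t) if j % 2 else np.cos(k * t))
    return f

t = np.linspace(-np.pi, np.pi, 200)
X = np.random.default_rng(10).normal(size=(5, 4))  # 5 samples, 4 variables
curves = [andrews_curve(row, t) for row in X]      # one curve per sample
```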

Fig. 23. Andrews' curves.

Fig. 24. Chernoff faces.

6.3.6 Chernoff faces
Originally introduced by Chernoff66 in 1973, the idea of using faces to represent data vectors was considered useful because we are all used to distinguishing between faces with different features. The procedure is simple and adaptable to suit the needs of different data sets, so that, for example:

- x_1 could be associated with the size of the nose,
- x_2 with the left eyebrow,
- x_3 with the left eye socket, etc.

Chernoff's implementation was suitable for use with up to 18 variables and required a plotter for good quality representations. Flury & Riedwyl67-69 have developed the method to produce asymmetrical faces in order to handle up to 36 variables on a CalComp plotter, and Schupbach70,71 has written Basic and Pascal versions of this asymmetrical faces routine for use on IBM-compatible PCs linked to plotters. The SYSTAT and Solo packages also implement Chernoff faces routines, and Schupbach has written an SAS macro for asymmetrical faces.

It is necessary to experiment with the assignment of features to variables, the most important variables being assigned to the most prominent features. A simple version of the Chernoff face is illustrated in Fig. 24.

6.3.7 Which symbols help us most?
The value of different symbols in analysing multivariate data depends to some extent upon the intended use of the analysis. If one is primarily interested in clustering, Andrews' curves are regarded by du Toit et al.,72 Tufte73 and Krzanowski54 as useful for small data sets, whereas, for large data sets, Chernoff faces and Kleiner-Hartigan trees are much more suitable. Stars and glyphs are also quite valuable, but profile plots are regarded by du Toit et al.72 as the least useful.

6.4 Spatial data displays
Spatial distributions of variables or sets of variables, with or without superposition of time variations, present us with additional problems, especially with the huge amounts of data available from aerial and satellite remote sensing. The problems associated with the high data density in geographical information systems will not be tackled in detail in this chapter.

6.4.1 Contour, shade and colour maps
Contour, shade and colour are the traditional methods of displaying data on maps, but considerable enhancement has been achieved with computer-aided techniques. Although the eye can separate tiny features, achievable separability will depend upon the visual context, including the size, shape, orientation, shading, colour, intensity, etc., of the feature and of its background. A useful review of the problems involved, and of some monographs offering guidance, was published recently by Kosslyn.55

An example of a highly sophisticated environmental data mapping system, which uses colour and shade, is the Geochemical Interactive Systems Analysis (GISA) service, funded by the UK Department of Trade and Industry and based at the National Remote Sensing Centre, Farnborough, Hampshire, England. Amongst its information sources, it uses the data bank of the British Geological Survey's Geochemical Survey Programme (GSP), as well as images from Landsat satellites, to construct colour maps which essentially enable visual multivariate analysis to be done for such purposes as geochemical prospecting.

6.4.2 Use of coded symbols on maps
The combination of symbols such as Kleiner-Hartigan trees, Chernoff faces or star symbols with maps enables the spatial distribution of multivariate observations to be displayed. Schmid74 urges caution in the use of symbols in this way, as such displays may possess the characteristics of a complex and difficult puzzle, and so frustrate the objective of clear communication. Ideally, the symbols should be readily interpretable when used in combination with maps.

Provided that the density of display of such symbols is not too high, this should provide a useful approach to spatial cluster analysis. It could also be quite an effective way of demonstrating or illustrating, for example, regional variations of species diversity or water quality in a drainage basin.

A well-known multivariate coded symbol is the weather vane, which combines quantitative and directional information in a well-tried way.

Varying the size of a symbol to convey quantitative information is not a straightforward solution, because our perception of such symbols does not always follow the designer's intentions. For example, using circles of different areas to convey quantity is not a reliable approach, because we seem unable to judge the relative sizes of circles at all accurately, and our judgement of sphere symbols appears to fare even worse.75

6.4.3 Combining graphs and maps
The alternative to the use of coded symbols in presenting complex information on a map is the use of small graphs, such as histograms or frequency polygons, in different map regions to illustrate such features as regional variations in monthly rainfall or other climatic information. Sometimes, this presentation can be enhanced and made easier to interpret by the addition of brief annotation, in order to draw one's attention to spatial patterns.

6.5 Graphical techniques of classification
A major goal of much ecological and environmental work is the classification of sites into areas with similar species composition, patterns of pollution or geochemical composition, for example.

Techniques of cluster analysis are important tools of classification and can be divided into two main groups: hierarchical and nonhierarchical. Both groups rely on graphical presentation to display the results of the analysis.

Within the hierarchical group, a further subdivision is possible into agglomerative and divisive methods. Seber61 lists 11 different agglomerative methods, so it is absolutely necessary to identify clearly which technique is being used. Rock76 has vividly demonstrated the problem of choosing a suitable agglomerative hierarchical method by analysing a set of data obtained from analyses of systematically collected samples from 6 isolated clusters of limestone outcrops within otherwise limestone-free metamorphic terrains in the Grampian Highlands in Scotland. He used the same agglomerative, polythetic algorithm, but nine different choices of input matrix, similarity measure and linkage method. The result was a set of nine completely different dendrograms showing how the limestones might be related! These different methods are compared in detail in Refs 54 and 61. Dendrograms may also be constructed from divisive hierarchical clustering methods.

Fig. 25. Dendrogram of results of cluster analysis of the log abundances of major Park Grass species (reproduced with permission from P.G.N. Digby and R.A. Kempton, 'Multivariate Analysis of Ecological Communities', p. 140, Fig. 5.5, Chapman and Hall, 1987. Copyright 1987 P.G.N. Digby and R.A. Kempton).

Data may not always have a hierarchical structure, in which case a nonhierarchical method is preferable and two basic approaches are possible: nested and fuzzy clusters. Rock76 considers fuzzy clustering more appropriate for real geological data.

Dendrograms are used to show overall groups, as in Fig. 25, but they may mask individual similarities. Thus an individual may be shown as belonging to a particular group merely by virtue of its similarity to one other individual within that group, and yet it may be vastly different from the rest. Such anomalous behaviour is best discovered by inspecting the shaded similarity matrix, in which the individual units are re-ordered to match the order in the dendrogram, and numerical similarity values are replaced by shading in each cell of the matrix. The darkness of the shading increases as the similarity value increases (see, for example, Fig. 26). Dendrograms and shaded similarity matrices may be combined to give a more complete picture of relationships, as in Fig. 27.

Fig. 26. Artificial similarity matrix showing its representation as a shaded matrix with progressively darker shading as the similarity increases (reproduced with permission from P.G.N. Digby and R.A. Kempton, 'Multivariate Analysis of Ecological Communities', p. 141, Fig. 5.6, Chapman and Hall, 1987. Copyright 1987 P.G.N. Digby and R.A. Kempton).

Fig. 27. Combined display of a hanging dendrogram and a shaded similarity matrix for the Park Grass species data; the shading legend distinguishes similarities of >92·5%, 85-92·5% and 75-85% (reproduced with permission from P.G.N. Digby and R.A. Kempton, 'Multivariate Analysis of Ecological Communities', p. 143, Fig. 5.7, Chapman and Hall, 1987. Copyright 1987 P.G.N. Digby and R.A. Kempton).
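
A dendrogram and a re-ordered shaded similarity matrix can be produced with SciPy and matplotlib; the linkage method, metric and similarity scaling below are illustrative choices only and, as Rock76 shows, different choices can yield very different trees:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(11).normal(size=(10, 5))  # 10 sites, 5 variables

d = pdist(X)                        # pairwise Euclidean distances
Z = linkage(d, method="average")    # one of many agglomerative methods

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
dendrogram(Z, ax=ax1)               # overall grouping

order = leaves_list(Z)              # re-order units to match the tree
D = squareform(d)
sim = 1 - D / D.max()               # crude similarity in [0, 1]
ax2.imshow(sim[np.ix_(order, order)], cmap="Greys")  # darker = more similar
plt.show()
```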

7 SOFTWARE FOR ILLUSTRATIVE TECHNIQUES

Whilst most of the techniques of illustration described in this chapter can be implemented manually, complex or large data sets are best processed by computerized methods, with the proviso that the computerized processing will often need improvement by manual techniques, particularly for annotation and titling.

7.1 General purpose and business graphics
Many general purpose packages, such as Lotus 123 and Symphony, Supercalc, dBase, etc., have quite useful simple graphics routines, which can be readily adapted to a number of the display tasks discussed above. They may also be interfaced with a variety of other graphical/illustrative packages, such as Harvard Graphics, Gem Paint or Freelance, in order to enhance the quality of the display.

7.2 Statistical software and computerized mapping
Several packages are available for use on personal computers and many have a wide range of the display tools discussed above. Amongst the leaders in this area (not in any specific order of merit) are the following:

- Stata from Computing Resource Center, 1640 Fifth Street, Santa Monica, California 90401, USA.

- Systat from Systat, Inc., 1800 Sherman Avenue, Evanston, Illinois 60201-3973, USA or Eurostat Ltd, Icknield House, Eastcheap, Letchworth SG6 3DA, UK.

- Minitab (preferably with a graphplotter) from Minitab, Inc., 3081 Enterprise Drive, State College, Pennsylvania 16801, USA or CleCom Ltd, The Research Park, Vincent Drive, Birmingham B15 2SQ, UK.

- Solo and PC90 from BMDP Statistical Software, 1440 Sepulveda Blvd., Los Angeles, California 90025, USA or at Cork Technology Park, Model Farm Road, Cork, Eire.

- Statgraphics PC from STSC, Inc., 2115 East Jefferson Street, Rockville, Maryland 20852, USA or Mercia Software Ltd., Aston Science Park, Love Lane, Birmingham B7 4BJ, UK.

- SPSS/PC+ 4.0 from SPSS, Inc., 444 North Michigan Avenue, Chicago, Illinois 60611, USA.

- CSS/Statistica from StatSoft, 2325 East 13th St., Tulsa, Oklahoma 74104, USA or Eurostat Ltd., Icknield House, Eastcheap, Letchworth SG6 3DA, UK.

Many special purpose statistical graphics software packages are also available and news and reviews of these may be found in the following journals: The American Statistician, Technometrics, The Statistician, Applied Statistics, Journal of Chemometrics, etc.

In the UK, a very useful source of information is the CHEST Directory published by the UK NISS (National Information on Software and Services), whose publications are available via most university and polytechnic computer centres.

A powerful Geographic Information System (GIS) called ARC/INFO is available for use on mainframes, minicomputers, workstations and even on IBM PC/AT or AT-compatibles and PS/2. This offers the capability of advanced cartographic facilities with a relational data base management system and extensive possibilities for multi-variable dynamic environmental modelling, together with data analysis and display both in graphical and map formats. The software is available from the Environmental Systems Research Institute, 380 New York Street, Redlands, California 92373, USA or one of its distributors.

Other GIS packages for use on PCs or Macintoshes were recently reviewed by Mandel77 and some of these have considerable data analysis capability combined with their mapping facilities.

REFERENCES

1. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, 1983, pp. 76-80.

2. Tiley, P.F., The misuse of correlation coefficients. Chemistry in Britain, (Feb. 1985) 162-3.

3. Tukey, J.W., Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977, Chapter 5.

4. Cleveland, W.S., Diaconis, P. & McGill, R., Variables on scatterplots look more highly correlated when the scales are increased. Science, 216 (1982) 1138-41.

5. Diaconis, P., Theories of data analysis: From magical thinking through classical statistics. In Exploring Data Tables, Trends, and Shapes, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1985, Chapter 1.

6. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, 1983, Chapter 8.

7. Chernoff, H., Graphical representations as a discipline. In Graphical Representation of Multivariate Data, ed. P.C. Wang. Academic Press, New York, 1978, pp. 1-11.

8. Ross, L. & Lepper, M.R., The perseverance of beliefs: empirical and normative considerations. In New Directions for Methodology of Behavioural Sciences: Fallible Judgement in Behavioral Research. Jossey-Bass, San Francisco, 1980.

9. Kahneman, D., Slovic, P. & Tversky, A. (ed.), Judgement under Uncertainty: Heuristics and Biases. Cambridge University Press, Cambridge, 1982.

10. Schmid, C.F., Statistical Graphics. John Wiley & Sons, New York, 1983.

11. Tukey, J.W., Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.

12. Tukey, J.W. & Mosteller, F.W., Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley, Reading, MA, 1977.

13. Hoaglin, D.C., Mosteller, F. & Tukey, J.W., Understanding Robust and Exploratory Data Analysis. John Wiley & Sons, New York, 1983, pp. 1-4.

14. Ames, A.E. & Szonyi, G., How to avoid lying with statistics. In Chemometrics: Theory and Applications. ACS Symposium Series 52, American Chemical Society, Washington, DC, 1977, Chapter 11.

15. Fisher, R.A., On the mathematical foundations of theoretical statistics. Phil. Trans. Royal Soc., 222A (1922) 322.

16. Gibbons, J.D., Nonparametric Methods for Quantitative Analysis. Holt, Rinehart and Winston, New York, 1976, pp. 56-77.

17. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical Methods for Data Analysis. Wadsworth International Group, Belmont, CA, 1983, pp. 19-21.

18. Velleman, P.F. & Hoaglin, D.C., Applications, Basics and Computing of Exploratory Data Analysis. Duxbury Press, Boston, MA, 1981, Chapter 1.

19. Emerson, J.D. & Hoaglin, D.C., Stem-and-leaf displays. In Understanding Robust and Exploratory Data Analysis, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1983, Chapter 1.

20. McGill, R., Tukey, J.W. & Larsen, W.A., Variations of box plots. The American Statistician, 32 (1978) 12-16.

21. Mosteller, F. & Tukey, J.W., Data Analysis and Regression: A Second Course in Statistics, Addison-Wesley, Reading, MA, 1977, Chapter 3.

22. Velleman, P.F. & Hoaglin, D.C., Applications, Basics and Computing of Ex­ploratory Data Analysis. Duxbury Press, Boston, MA, 1981, Chapter 3.

23. Emerson, J.D. & Strenio, J., Boxplots and batch comparison. In Understanding Robust and Exploratory Data Analysis, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1983, Chapter 3.

24. Tukey, J.W., Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977, Chapter 2.

25. Frigge, M., Hoaglin, D.C. & Iglewicz, B., Some implementations of the boxplot. The American Statistician, 43 (1989) 50-4.

26. Minitab, Inc. Minitab Reference Manual, release 6, January 1988. Minitab, Inc., State College, PA, pp. 234-6.

27. Hoaglin, D.C., Letter values: a set of order statistics. In Understanding Robust and Exploratory Data Analysis, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1983, Chapter 2.

28. Velleman, P.F. & Hoaglin, D.C., Applications, Basics and Computing of Exploratory Data Analysis. Duxbury Press, Boston, MA, 1981, Chapter 2.

29. Hoaglin, D.C. & Tukey, J.W., Checking the shape of discrete distributions. In Exploring Data Tables, Trends, and Shapes, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey, John Wiley & Sons, New York, 1985, Chapter 9.

30. Ord, J.K., Graphical methods for a class of discrete distributions. J. Roy. Statist. Soc. A, 130 (1967) 232-8.

31. Hoaglin, D.C., A Poissonness plot. The American Statistician, 34 (1980) 146-9.

32. Southwood, T.R.E., Ecological Methods with Particular Reference to the Study of Insect Populations. Chapman and Hall, London, 1978, Chapter 2.

33. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical Methods for Data Analysis. Wadsworth International Group, Belmont, CA, 1983, Chapter 6.

34. Hoaglin, D.C., In Exploring Data Tables, Trends, and Shapes, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1985, pp. 437-8.

35. Statistical Graphics Corporation, STATGRAPHICS Statistical Graphics System: User's Guide, Version 2.6. STSC, Inc., Rockville, MD, 1987, pp. 11-13 to 11-14.

36. Velleman, P.F. & Hoaglin, D.C., Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury Press, Boston, MA, 1981, Chapter 9.

37. Hoaglin, D.C., Using quantiles to study shape. In Exploring Data Tables, Trends, and Shapes, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1985, Chapter 10.

38. Hoaglin, D.C., Summarizing shape numerically: the g-and-h distributions. In Exploring Data Tables, Trends, and Shapes, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1985, Chapter 11.

39. Deming, W.E., Statistical Adjustment of Data. John Wiley & Sons, New York, 1943, pp. 178-82.

40. Mandel, J., The Statistical Analysis of Experimental Data. John Wiley & Sons, New York, 1964, pp. 288-92.

41. Theil, H., A rank invariant method of linear and polynomial regression analysis, parts I, II and III. Proc. Kon. Nederl. Akad. Wetensch., A, 53 (1950) 386-92, 521-5, 1397-412.

42. Tukey, J.W., Exploratory Data Analysis, Limited Preliminary Edition. Addison-Wesley, Reading, MA, 1970.

43. Rousseeuw, P.J. & Leroy, A.M., Robust Regression and Outlier Detection. John Wiley & Sons, New York, 1987, Chapter 2.

44. Lancaster, J.F. & Quade, D., A nonparametric test for linear regression based on combining Kendall's tau with the sign test. J. Amer. Statist. Assoc., 80 (1985) 393-7.

45. Thompson, J.M., The use of a robust resistant regression method for personal monitor validation with decay of trapped materials during storage. Analytica Chimica Acta, 186 (1986), 205-12.

46. Cleveland, W.S. & McGill, R., The many faces of a scatterplot. J. Amer. Statist. Assoc., 79 (1984) 807-22.

47. Tukey, J.W. & Tukey, P.A., Some graphics for studying four-dimensional data. In Computer Science and Statistics: Proceedings of the 14th Symposium on the Interface. Springer-Verlag, New York, 1983, pp. 60-6.

48. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical Methods for Data Analysis. Duxbury Press, Belmont, CA, 1983, pp. 110-21.

49. Goodall, C., Examining residuals. In Understanding Robust and Exploratory Data Analysis, ed. D.C. Hoaglin, F. Mosteller & J.W. Tukey. John Wiley & Sons, New York, 1983, Chapter 7.

50. Atkinson, A.C., Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford University Press, Oxford, 1985.

51. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A. Robust Statistics. The Approach Based on Influence Functions. John Wiley & Sons, New York, 1986, Chapter 1.

52. Rousseeuw, P.J. & Leroy, A.M., Robust Regression and Outlier Detection. John Wiley & Sons, New York, 1987, Chapter 1.

53. Thompson, J.M., Sithamparanadarajah, R., Robinson, J.S. & Stephen, W.J., Occupational exposure to nitrous oxide, halothane and isopropanol in operating theatres. Health & Hygiene, 8 (1987) 60-8.

54. Krzanowski, W.J., Principles of Multivariate Analysis. A User's Perspective. Oxford University Press, Oxford, 1988, Chapter 2.

55. Kosslyn, S.M., Graphics and human information processing: A review of five books. J. Amer. Statist. Assoc., 80 (1985) 499-512.

56. Tukey, P.A. & Tukey, J.W., Graphical display of data sets in 3 or more dimensions. In Interpreting Multivariate Data, ed. V. Barnett. John Wiley & Sons, New York, 1981, Chapters 10, 11 & 12.

57. Chambers, J.M., Cleveland, W.S., Kleiner, B. & Tukey, P.A., Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, 1983, Chapter 5.

58. Davis, J.C., Statistics and Data Analysis in Geology. John Wiley & Sons, New York, 1986, p. 488.

59. Becker, R.A. & Cleveland, W.S., Brushing scatterplots. Technometrics, 29 (1987) 127-42.

60. Mead, G.A., The sorted binary plot: A new technique for exploratory data analysis. Technometrics, 31 (1989) 61-7.

61. Seber, G.A.F., Multivariate Observations. John Wiley & Sons, New York, 1984, Chapter 4.

62. Newton, C.M., Graphics: From alpha to omega in data analysis. In Graphical Representation of Multivariate Data, ed. P.C. Wang. Academic Press, New York, 1978, pp. 59-92.

63. Kleiner, B. & Hartigan, J.A., Representing points in many dimensions by trees and castles. J. Amer. Statist. Assoc., 76 (1981) 260-76.

64. Anderson, E., A semigraphical method for the analysis of complex problems. Technometrics, 2 (1960) 387-91.

65. Andrews, D.F., Plots of high-dimensional data. Biometrics, 28 (1972) 125-36.

66. Chernoff, H., Using faces to represent points in K-dimensional space graphically. J. Amer. Statist. Assoc., 68 (1973) 361-8.

67. Flury, B. & Riedwyl, H., Multivariate Statistics: A Practical Approach. Chapman and Hall, London, 1988, Chapter 4.

68. Flury, B., Construction of an asymmetrical face to represent multivariate data graphically. Technische Bericht No. 3, Institut für Mathematische Statistik und Versicherungslehre, Universität Bern, 1980.

69. Flury, B. & Riedwyl, H., Some applications of asymmetrical faces. Technische Bericht No. 11, Institut für Mathematische Statistik und Versicherungslehre, Universität Bern, 1983.

70. Schupbach, M., ASYMFACE: Asymmetrical faces on IBM and Olivetti PC. Technische Bericht No. 16, Institut für Mathematische Statistik und Versicherungslehre, Universität Bern, 1984.


71. Schüpbach, M., ASYMFACE: Asymmetrical faces in Turbo Pascal. Technische Bericht No. 25, Institut für Mathematische Statistik und Versicherungslehre, Universität Bern, 1987.

72. du Toit, S.H.C., Steyn, A.W.G. & Stumpf, R.H., Graphical Exploratory Data Analysis. Springer-Verlag, New York, 1986, Chapter 4.

73. Tufte, E.R., The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, 1983.

74. Schmid, C.F., Statistical Graphics. John Wiley & Sons, New York, 1983, pp. 188-90.

75. Dickenson, G.C., Statistical Mapping and the Presentation of Statistics, 2nd edn. Edward Arnold, London, 1973, Chapter 5.

76. Rock, N.M.S., Numerical Geology. A Source Guide, Glossary and Selective Bibliography to Geological Uses of Computers and Statistics. Springer-Verlag, Berlin, 1988, Topic 21, pp. 275-96.

77. Mandel, R., The world according to Micros. Byte, 15 (1990) 256-67.

78. Monmonier, M.S., Computer-Assisted Cartography: Principles and Prospects. Prentice-Hall, Englewood Cliffs, NJ, 1982, p. 90.

79. Seber, G.A.F., Multivariate Observations. John Wiley & Sons, New York, 1984, Chapter 7.

BIBLIOGRAPHY

In addition to the literature listed in the References Section, as well as references within those sources, the following books may prove useful to the reader.

Cleveland, W.S. & McGill, M.E. (ed.), Dynamic Graphics for Statistics. Wadsworth, Belmont, CA, 1989.

Davis, J.C., Statistics and Data Analysis in Geology. John Wiley & Sons, New York, 1973.

Eddy, W.F. (ed.), Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface. Springer-Verlag, New York, 1981.

Green, W.R., Computer-aided Data Analysis. A Practical Guide. John Wiley & Sons, New York, 1985.

Haining, R., Spatial Data Analysis in the Social and Environmental Sciences. Cambridge University Press, Cambridge, 1990.

Isaaks, E.H. & Srivastava, R.M., An Introduction to Applied Geostatistics. Oxford University Press, New York, 1989.

Ripley, B.D., Spatial Statistics. John Wiley & Sons, New York, 1981.

Ripley, B.D., Statistical Inference for Spatial Processes. Cambridge University Press, Cambridge, 1988.

Upton, G. & Fingleton, B., Spatial Data Analysis by Example. Volume 1, Point Pattern and Quantitative Data and Volume 2, Categorical and Directional Data. John Wiley & Sons, New York, 1985, 1989.


Chapter 7

Quality Assurance for Environmental Assessment Activities

ALBERT A. LIABASTRE
US Army Environmental Hygiene Activity-South, Building 180, Fort McPherson, Georgia 30330-5000, USA

KATHLEEN A. CARLBERG
29 Hoffman Place, Belle Mead, New Jersey 08502, USA

MITZI S. MILLER
Automated Compliance Systems, 673 Emory Valley Road, Oak Ridge, Tennessee 37830, USA

1 INTRODUCTION

1.1 Background
Environmental assessment activities may be viewed as comprising four parts: establishment of Data Quality Objectives (DQO); design of the Sampling and Analytical Plan; execution of the Sampling and Analytical Plan; and Data Assessment.

During the last 20 years, numerous environmental assessments have been conducted, many of which have not met the needs of the data users. In an effort to resolve many of these problems, the National Academy of Sciences of the United States was requested by the US Environmental Protection Agency (EPA) to review the Agency's Quality Assurance (QA) Program.1


This review and the efforts of the EPA Quality Assurance Management Staff have led to the use of both Total Quality Principles and the DQO concept. Data quality objectives are interactive management tools used to interpret and communicate the data users' needs to the data supplier such that the supplier can develop the necessary objectives for QA and appropriate levels of quality control. In the past, it was not considered important that data users convey to data suppliers what the quality of the data should be.

Data use objectives are statements relating to why data are needed, what questions have to be answered, what decisions have to be made, and/or what decisions need to be supported by the data. It should be recognized that the highest attainable quality may be unrelated to the quality of data adequate for a stated objective. The suppliers of data should know what quality is achievable, respond to the users' quality needs and eliminate unnecessary costs associated with providing data of much higher quality than needed, or the production of data of inadequate quality.

Use of the DQO process in the development of the assessment plan assures that the data adequately support decisions, provides for cost effective sampling and analysis, prevents unnecessary repetition of work due to incorrect quality specifications, and assures that decisions which require data collection are considered in the planning phase.

The discussions presented here assume the DQO process is complete and the data use objectives are set, thus allowing the sampling and analytical design process to begin.

The sampling and analytical design process should minimally address several basic issues including: historical information; purpose of the sample collection, along with the rationale for the number, location and type of samples and the analytical parameters; and field and laboratory QA procedures.

Once these and other pertinent questions are addressed, a QA program may be designed that is capable of providing the quality specifications required for each project. By using this approach, the QA program ensures a project capable of providing the data necessary for meeting the data use objectives.

1.2 The quality assurance program
The QA program is a system of activities aimed at ensuring that the information provided in the environmental assessment meets the data users' needs. It is designed to provide control of field and laboratory operations, traceable documentation, and reporting of results of sample collection and analytical activities.

The organizations involved in providing sampling and analytical services must demonstrate their qualifications to perform the work. Qualification to perform the work involves assuring that adequate personnel, facilities and equipment/instrumentation are available.

One way of demonstrating qualifications is to develop, maintain and register2 a documented Quality Management System (QMS), one that conforms to the generic quality systems model standards for QA of either the American Society for Quality Control (Q-90 Series)3-7 or the International Organization for Standardization (ISO Guide 9000 Series).

The organizations involved must demonstrate that personnel are qualified, based on education, experience and/or training, to perform their job functions. In addition, the organizations must also demonstrate that both the equipment and facilities are available to perform the work and that the staff are trained in the operation and maintenance of the equipment and facilities. The term 'facilities' as used in this text includes either a permanent laboratory or a temporary field facility which provides for sample handling. A controlled environment appropriate for the work to be performed must also be provided.

Field and laboratory operations may be controlled by following written standard operating procedures (SOP). It is essential that the SOPs be documented, controlled, maintained and followed without exception unless authorized by a change order. Such standardization of procedures assures that no matter who performs the work, the information provided will be reproducible. Among the quality control procedures that may be necessary are those that identify and define control samples, data acceptance criteria, and corrective actions.

Documentation and reporting are essential in providing the traceability necessary for legally defensible data. The quality program must address and include detailed documentation requirements.

2 ORGANIZATIONAL AND DOCUMENTATION REQUIREMENTS

It is essential that environmental assessment activities provide information that has been developed using an adequate QA program or a plan that provides documentation of the process employed.

A number of organizations, including the EPA,8-10 the International Standards Organization,11 the American Association for Laboratory Accreditation12,13 and the American Society for Testing and Materials,14-16 have developed requirements and criteria that field sample collection organizations and/or laboratories should meet in order to provide quality environmental assessment information.

The following sections detail minimal policies and procedures that must be developed, implemented and maintained within a field sampling organization and/or laboratory, in order to provide accurate and reliable information to meet environmental DQOs.

Many of the recommendations presented here are applicable to both field and laboratory organizations. Where there are specialized requirements, the organization will be identified. These recommendations are written in this manner to emphasize the similarity and interdependence of the requirements for field and laboratory organizations. It is well recognized that the sample is the focal point of all environmental assessments. It is essential that the sample, to the extent possible, actually represents the environment being assessed. The laboratory can only evaluate what is in the sample. Thus it is imperative that an adequate QA plan be in place.

In most situations, field sample collection and laboratory operations are conducted by separate organizations under different management. Therefore, it is necessary that communication exists between field, laboratory and project managers to ensure that the requirements of the QA plan are met. Where deviations from the plan occur, they must be discussed with project managers and documented.

2.1 Organization and personnel
The field sample collection group and the laboratory must be legally identifiable and have an organizational structure, including quality systems, that enables them to maintain the capability to perform quality environmental testing services.

2.1.1 Organization
The field and laboratory organizations should be structured such that each member of the organization has a clear understanding of their role and responsibilities. To accomplish this, a table of organization indicating lines of authority, areas of responsibility, and job descriptions for key personnel should be available. The organization's management must promote QA ideals and provide the technical staff with a written QA policy. This policy should require development, implementation and maintenance of a documented QA program. To further demonstrate their capability, both field and laboratory organizations should maintain a current list of accreditations from governmental agencies and/or private associations.

2.1.2 Personnel
Current job descriptions and qualification requirements, including education, training, technical knowledge and experience, should be maintained for each staff position. Training shall be provided for each staff member to enable proper performance of their assigned duties. All such training must be documented and summarized in the individual's training file. Where appropriate, evidence demonstrating that the training met minimal acceptance criteria is necessary to establish competence with testing, sampling, or other procedures. The proportion of supervisory to non-supervisory staff should be maintained at the level needed to ensure production of quality environmental data. In addition, back-up personnel should be designated for all senior technical positions, and they should be trained to provide for continuity of operations in the absence of the senior technical staff member. There should be sufficient personnel to provide timely and proper conduct of sampling and analysis in conformance with the QA program.

2.1.2.1 Management. The management is responsible for establishing and maintaining organizational, operational and quality policies, and also for providing the personnel and necessary resources to ensure that methodologies used are capable of meeting the needs of the data users. In addition, management should inspire an attitude among the staff of continuing unconditional commitment to implementation of the QA plan. A laboratory director and field manager should be designated by management. Each laboratory and field sampling organization should also have a QA Officer and/or Quality Assurance Unit.

2.1.2.2 Laboratory director/field manager. The director/manager is responsible for overall management of the field or laboratory organization. These responsibilities include oversight of personnel selection, development and training; development of performance evaluation criteria for personnel; review, selection and approval of methods of analysis and/or sample collection; and development, implementation and maintenance of a QA program. It is important that the management techniques employed rely on a vigorous program of education, training and modern methods of supervision, which emphasize commitment to quality at all levels rather than numerical goals or test report inspection17 alone.


The director/manager should be qualified to assume administrative, organizational, professional and educational responsibility and must be involved in the day-to-day management of the organization. The laboratory director should have at least a BS degree in chemistry and 6 years experience in environmental chemistry, or an advanced degree in chemistry and 3 years experience in environmental chemistry. The field manager should have at least a BS degree in environmental, chemical or civil engineering or natural sciences and 6 years experience in environmental assessment activities.

2.1.2.3 Technical staff. The technical staff should consist of personnel with the education, experience and training to conduct the range of duties assigned to their position. Every member of the technical staff must demonstrate proficiency with the applicable analytical, sampling and related procedures prior to analyzing or collecting samples. The criteria for acceptable performance shall be those specified in the procedure or best practice or, if neither of these exists, the criteria shall be determined and published by the laboratory director or field manager. Development of data acceptance criteria should be based on: consideration of precision, accuracy, quantitation limit, and method detection limit information published for the particular procedure; information available from Performance Evaluation Study data, previously demonstrated during satisfactory operation; or the requirements of the data user. Many of the more complex laboratory test technologies (gas chromatography, gas chromatography-mass spectroscopy, inductively coupled plasma-atomic emission spectroscopy and atomic absorption spectroscopy, for example) require specific education and/or experience, which should be carefully evaluated. As a general rule,10,13 specific requirements for senior laboratory technical personnel should include a BS degree in chemistry or a degree in a related science with 4 years experience as minimum requirements. In addition, each analyst accountable for supervisory tasks in the following areas should meet minimum experience requirements: general chemistry and instrumentation, 6 months; gas chromatography and mass spectroscopy, 1 year; atomic absorption and emission spectroscopy, 1 year; and spectra interpretation, 2 years.

2.1.2.4 Support staff. The support staff is comprised of personnel who perform sampling, laboratory and administrative support functions. These activities include: cleaning of sample containers, laboratory ware and equipment; transportation and handling of samples and equipment; and clerical and secretarial services. Each member of the support staff must be provided with on-the-job training to enable them to perform the job in conformance with the requirements of the job description and the QA program. Such training must enable the trainee to meet adopted performance criteria and it must also be documented in the employee's training file.

2.1.2.5 Quality assurance function. The organization must have a QA Officer and/or a QA Unit that is responsible for monitoring the field and laboratory activities. The QA staff assures management that facilities, equipment, personnel, methods, practices and records are in conformance with the QA plan. The QA function should be separate from and independent of the personnel directly involved in environmental assessment activities. The QA Unit should: (1) inspect records to assure that sample collection and testing activities were performed properly and within the sample holding times or specified turnaround times; (2) maintain and distribute copies of the laboratory QA plan along with any project-specific QA plan dealing with sampling and/or testing for which the laboratory is responsible; (3) perform assessments of the organization to ensure adherence to the QA plan; (4) periodically submit to management written status reports, noting any problems and corrective actions taken; (5) assure that deviations from the QA plan or SOPs were properly authorized and documented; and (6) assure that all data and sampling reports accurately describe the methods and SOPs used and that the reported results accurately reflect the raw data. The responsibilities, procedures and records of archiving applicable to the QA Unit should be documented and maintained in a manner reflecting the current practices.

2.1.3 Subcontractors
All subcontractors should be required to meet the same quality standards as the primary laboratory or sample collection organization. Subcontractors should be audited against the same criteria as the in-house laboratories, using quality control samples such as double-blind samples, and site visits conducted by the QA Officer.

2.2 Facilities
Improperly designed and poorly maintained laboratory facilities can have a significant effect on the results of analyses, the health, safety and morale of the analysts, and the safe operation of the facilities. Although the emphasis here is the production of quality data, proper facility design serves to protect personnel from chemical exposure health hazards, as well as to protect them from fires, explosions and other hazards.18-23 Furthermore, poorly maintained facilities detract from the production of quality data. An additional consideration is facility and equipment requirements for field measurements, which should be addressed in the field QA plan. These requirements should include consideration of the types of measurements and an appropriate area to perform the measurements, and should address ventilation, climate control, power, water, gases, and safety needs.

2.2.1 General
Each laboratory should be sized and constructed to facilitate the proper conduct of laboratory analyses and associated operations. Adequate bench space or working area per analyst should be provided; 4·6-7·6 meters of bench space or 14-28 square meters of floor space per analyst has been recommended.21 Lighting requirements may vary depending upon the tasks being performed. Lighting levels in the range of 540-1075 lumens per square meter are usually adequate.24 A stable and reliable source of power is essential to proper operation of many laboratory instruments. Surge suppressors are required for computers and other sensitive instruments, and uninterruptible power supplies, as well as isolated ground circuits, may be required. The actual requirements depend on the equipment or apparatus utilized, power line supply characteristics and the number of operations that are to be performed at one time (many laboratories have more work stations than analysts). The specific instrumentation, equipment, materials and supplies required for performance of a test method are usually described in the approved procedure. If the laboratory intends to perform a new test, it must acquire the necessary instrumentation and supplies, provide the space, and conduct the training necessary to demonstrate competence with the new test before analyzing any routine samples.

2.2.2 Laboratory environment
The facility should be designed so that no one activity will have an adverse effect on another activity. In addition, separate space for laboratory operations and ancillary support should be provided. The laboratory should be well-ventilated, adequately lit, free of dust and drafts, protected from extremes of temperature and humidity, and have access to a stable and reliable source of power. The laboratory should be minimally equipped with the following safety equipment: fire extinguishers of the appropriate number and class; eye wash and safety shower facilities; eye, face, skin and respiratory personal protective equipment; and spill control materials appropriate to each laboratory's chemical types and volumes. Laboratories may also have need of specialized facilities, such as a perchloric acid hood, glovebox, special ventilation arrangements, and a sample disposal area. It is frequently desirable and often essential to separate types of laboratory activities. Glassware cleaning and portable equipment storage areas should be convenient, but separated from the laboratory work area. When conducting analyses over a wide concentration range, caution must be exercised to prevent cross-contamination among samples in storage and during analysis. Sample preparation areas where extraction, separation, clean-up or digestion activities are conducted must be separate from instrumentation rooms to reduce hazards and avoid sample contamination and instrument damage. Where samples, calibration standards and reference materials are not stored separately, documentation must be available demonstrating that contamination is not a problem. Field and mobile laboratories may also have special requirements based on function or location.

2.2.3 Ventilation system
The laboratory ventilation system should: provide a source of air for both breathing and input to local ventilation devices; ensure that laboratory air is continually replaced to prevent concentrating toxic substances during the work day; provide air flow from non-laboratory areas into the laboratory areas; and direct exhaust air flow above the low pressure envelope around the building created by prevailing windflow. The best way to prevent contamination or exposure due to airborne substances is to prevent their escape into the working atmosphere by using hoods or other ventilation devices. Laboratory ventilation should be accomplished using a central air conditioning system which can: filter incoming air to reduce airborne contamination; provide stable operation of instrumentation and equipment through more uniform temperature maintenance; lower humidity, which reduces moisture problems associated with hygroscopic substances and reduces instrument corrosion problems; and provide an adequate supply of make-up air for hoods and local exhaust devices. It has been recommended that a laboratory hood with 0·75 meter of hood width per person be provided for every two workers20 (it is a rare occurrence for a hood not to be required). Each hood should be equipped with a continuous monitoring device to afford convenient confirmation of hood performance prior to use (typical hood face velocity should be in the range of 18·5-30·5 meters per minute).18,20,22 In addition, other local ventilation devices, such as ventilated storage cabinets, canopy hoods, and snorkels should be provided where needed. These devices shall be exhausted separately to the exterior of the building. The laboratory must also have adequate and well ventilated storerooms (4-12 room air changes per hour).20,23,25 The laboratory ventilation system, hoods and local exhaust ventilation devices should be evaluated on installation, monitored every 3 months where continuous monitors are not installed, and must be reevaluated when any changes in the system are made.

2.2.4 Sample handling, shipping, receiving and storage area
The attainment of quality environmental data depends not only on collecting representative samples, but also on ensuring that those samples remain as close to their original condition as possible until analyzed. When samples cannot be analyzed upon collection, they must be preserved and stored as required for the analytes of interest and shipped to a laboratory for analysis. Shipping of samples to the laboratory must be done in a manner that provides sufficient time to perform analyses within specified holding times. To ensure safe storage and prevent contamination and/or misidentification, there must be adequate facilities for handling, shipping, receipt, and secure storage of samples.
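Holding-time checks lend themselves to simple automation. The following minimal sketch (in Python) illustrates one way such a check might be performed; the analytes and holding times shown are illustrative assumptions only, as actual holding times are prescribed by the approved method for each analyte, matrix and preservation technique.

from datetime import datetime, timedelta

# Illustrative holding times only; actual values must be taken from the
# approved method for the analyte, matrix and preservation used.
HOLDING_TIMES = {
    'volatile organics': timedelta(days=14),
    'nitrate': timedelta(hours=48),
    'metals': timedelta(days=180),
}

def within_holding_time(analyte, collected, planned_analysis):
    # True if the planned analysis falls within the analyte's holding time.
    return planned_analysis - collected <= HOLDING_TIMES[analyte]

collected = datetime(1991, 6, 3, 9, 30)
print(within_holding_time('nitrate', collected, datetime(1991, 6, 4, 16, 0)))  # True
print(within_holding_time('nitrate', collected, datetime(1991, 6, 6, 10, 0)))  # False

A scheduling system built on such a check can flag samples approaching expiry so that analyses are queued in time.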

In addition, adequate storage facilities are required for reagents, standards, and reference materials to preserve their identity, concentration, purity and stability.

2.2.5 Chemical and waste storage areas
Facilities adequate for the collection, storage and disposal of waste chemicals and samples must be provided at the field sampling site, as well as in the laboratory. These facilities are to be operated in a manner that minimizes the chance of environmental contamination and complies with all applicable regulations.

Where practical, it is advisable to either recover, reuse, or dispose of wastes in-house. Sample disposal presents complex problems for laboratories. In many situations, these materials can be returned to the originator; however, in those situations where this is not possible, other alternatives must be considered, such as recovery, waste exchange, incineration, solidification, lab pack containerization and/or disposal to the sewer system or landfill site. Procedures outlining disposal practices must be available, and staff members must be trained in disposal practices applicable to their area.


2.2.6 Data and record storage areas
Space must be provided for the storage and retrieval of all documents related to field sampling and laboratory operations and reports. Such space must also include an archive area for older records. The environment of these storage areas must be maintained to ensure the integrity of all records and materials stored. Access to these areas must be controlled and limited to authorized personnel.

2.3 Equipment and instrumentation
There should be well-defined and documented purchasing guidelines for the procurement of equipment.2-6,11-13 These guidelines should ensure that procured items meet the data quality needs of the users.

2.3.1 General
The laboratory and field sampling organizations should have available all items of equipment and instrumentation necessary to correctly and accurately perform all the testing and measuring services that they provide to their users. For the field sampling activities, the site facilities should be examined prior to beginning work to ensure that appropriate facilities, equipment, instrumentation, and supplies are available to accomplish the objectives of the QA plan. Records should be maintained on all major items of equipment and instrumentation; these should include the name of the item, the manufacturer and model number, serial number, date of purchase, date placed in service, accessories, any modifications, updates or upgrades that have been made, current location of the equipment, as well as related accessories and manuals, and all details of maintenance. Items of equipment used for environmental testing must meet certain minimum requirements. For example, analytical balances must be capable of weighing to 0·1 mg; pH meters must have scale graduations of at least 0·01 pH units and must employ a temperature sensor or thermometer; sample storage refrigerators must be capable of maintaining the temperature in the range 2-5°C; the laboratory must have access to a certified National Institute of Standards and Technology (NIST) traceable thermometer; thermometers used should be calibrated and have graduations no larger than that appropriate to the testing or operating procedure; and probes for conductivity measurements must be of the appropriate sensitivity.
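Because the equipment record described above is a structured collection of fields, it maps naturally onto a simple data structure. The following is a minimal sketch in Python; the field names merely mirror the list in the text and are not a prescribed schema, and the instrument named in the example is hypothetical.

from dataclasses import dataclass, field
from typing import List

@dataclass
class MaintenanceEntry:
    date: str           # date of the maintenance action
    action: str         # description of the work performed
    performed_by: str

@dataclass
class EquipmentRecord:
    # Fields mirror the record contents recommended in the text.
    name: str
    manufacturer: str
    model_number: str
    serial_number: str
    date_of_purchase: str
    date_placed_in_service: str
    current_location: str
    accessories: List[str] = field(default_factory=list)
    modifications: List[str] = field(default_factory=list)
    maintenance_log: List[MaintenanceEntry] = field(default_factory=list)

balance = EquipmentRecord(
    name='Analytical balance', manufacturer='ExampleCorp',  # hypothetical
    model_number='AB-204', serial_number='123456',
    date_of_purchase='1991-02-15', date_placed_in_service='1991-03-01',
    current_location='Balance room')
balance.maintenance_log.append(
    MaintenanceEntry('1991-09-01', 'Annual calibration and service', 'QA Officer'))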

2.3.2 Maintenance and calibration4,11
The organization must have a program to verify the performance of newly installed instrumentation to manufacturers' specifications. The program must provide clearly defined and written maintenance procedures for each measurement system and the required support equipment. The program must also detail the records required for each maintenance activity to document the adequacy of maintenance schedules and the parts inventory.

All equipment should be properly maintained to ensure protection from corrosion and other causes of deterioration. A proper maintenance procedure should be available for those items of equipment which require periodic maintenance. Any item of equipment which has been subject to mishandling, which gives suspect results, or has been shown by calibration or other procedure to be defective, should be taken out of service and clearly labelled until it has been repaired. After repair, the equipment must be shown by test or calibration to be performing satisfactorily. All actions taken regarding calibration and maintenance should be documented in a permanent record.

Calibration of equipment used for environmental testing usually falls in either the operational or the periodic calibration category. Operational calibration is usually performed prior to each use of an instrumental measurement system. It typically involves developing a calibration curve and verification with a reference material. Periodic calibration is performed depending on use, or at least annually, on items of equipment or devices that are very stable in operation, such as balances, weights, thermometers and ovens. Typical calibration frequencies for instrumentation are either prior to use, daily or every 12 h. All such calibrations should be traceable to a recognized authority such as NIST. The calibration process involves: identifying equipment to be calibrated; identifying reference standards (both physical and chemical) used for calibration; identifying, where appropriate, the concentration of standards; use of calibration procedures; use of performance criteria; use of a stated frequency of calibration; and the appropriate records and documentation to support the calibration process.

2.4 Standard operating procedures28,39

All organizations need to have documented SOPs which clearly delineate the steps to be followed in performing a given operation.

It is essential that SOPs be clearly written and easily understood. Some excellent advice for writing clearly was given by Sir Ernest Gowers, who said: 'It is not enough to write with such precision that a person reading with good will may understand; it is necessary to write with such precision that a person reading with bad will cannot misunderstand.' A standard format should be adopted and employed in writing SOPs for all procedures and methods, including sample collection and analysis, laboratory functions and ancillary functions. Published industry-accepted procedures often serve as the foundation for these written in-house analytical procedures, and these in-house protocols must include the technical restrictions of the applicable regulatory approved methods.

Each page of the final draft of an SOP should be annotated with a document control heading containing the procedure number, revision number, date of revision, implementation date, page number and the total number of pages in the SOP. The cover page of the SOP should indicate whether it is an uncontrolled copy or a controlled copy (with the control number annotated on the page) and must include endorsing signatures of approval by appropriate officers of the organization. A controlled copy is one that is numbered and issued to an individual, the contents of which are updated after issue; an uncontrolled copy is current at the time of issue, but no attempt is made to update it after issuance. Whenever an SOP is modified, it should be rewritten with the changes and given a new revision number and a new implementation date. A master copy of each SOP should be kept in a file under the control of the QA Officer. The master copy should contain the signatures of the Laboratory Director/Field Manager and the Laboratory/Field QA Officer or their designees. Records should be maintained of the distribution of each SOP and its revisions. Use of a comprehensive documentation procedure allows tracking and identification of the procedure that was being employed at any given time.

It is essential that appropriate SOPs be available to all personnel involved in field sampling and laboratory operations and that these SOPs address the following areas: (1) sample control, including sample collection and preservation, sample storage and chain of custody; (2) standard and reagent preparation; (3) instrument and equipment maintenance; (4) procedures to be used in the field and laboratory; (5) analytical methods; (6) quality control SOPs; (7) corrective actions; (8) data reduction and validation; (9) reporting; (10) records management; (11) chemical and sample disposal; and (12) health and safety.

2.5 Field and laboratory records26-28

All information relating to field and laboratory operations should be documented, providing evidence related to project activities and supporting the technical interpretations and judgments. The information usually contained in these records relates to: project planning documents, including maps and drawings; all QA plans involved in the project; sample management, collection and tracking; all deviations from procedural and planning documents; all project correspondence; personnel training and qualifications; equipment calibration and maintenance; SOPs, including available historical information about the site; traceability information relating to calibrations, reagents, standards and reference materials; original data, including raw data and calculated results for field samples, QC samples and standards; method performance, including detection and quantitation limits, precision, bias, and spike/surrogate recovery; and the final report. It is essential that these records be maintained as though intended for use in regulatory litigation. These records must be traceable to provide the historical evidence required for reconstruction of events and for review and analysis.

Clearly defined SOPs should be available providing guidance for reviewing, approving and revising field and laboratory records. These records must be legible, identifiable, retrievable and protected against damage, deterioration or loss. All records should be written or typed with black waterproof ink. For laboratory operations, each page of the notebook, logbook or each bench sheet must be signed and dated by the person who entered the data. Documentation errors should be corrected by drawing a line through the incorrect entry, initialing and dating the deletion, and writing the correct entry adjacent to the deletion. The reason for the change should be indicated.

Field records minimally consist of: bound field notebooks with prenumbered pages; sampling point location maps; sample collection forms; sample analysis request forms; chain of custody forms; equipment maintenance and calibration logs; and field change authorization forms.

Laboratory records minimally consist of: bound notebooks and logbooks with prenumbered pages; bench sheets; graphical and/or computerized instrument output; files and other sample tracking or data entry forms; SOPs; and QA plans.

2.6 Document storage11,27

Procedures should be established providing for safe storage of the documentation necessary to recreate sampling, analysis and reporting information. These documents should minimally include planning documents, SOPs, logbooks, field and laboratory data records, sample management records, photographs, maps, correspondence and final reports.

Field and laboratory management should have a written policy specifying the minimum length of time that documentation is to be retained. This policy should conform to the most stringent of regulatory requirements, project requirements or organizational policy. Deviations from the policy should be documented and available to those responsible for maintaining the documentation. No records should be disposed of without written notice being provided to the client stating a disposition date, which provides an opportunity for them to obtain the records. If the testing organization or an archive contract facility goes out of business prior to expiration of the documentation storage time specified in the policy, all documentation should be transferred to the archives of the client involved.

The documentation should be stored in a facility capable of maintaining the integrity and minimizing deterioration of the records for the length of time they are retained.

Control of the records involved is essential in providing evidence of their traceability and integrity. These records should be identified, readily retrievable, and organized to prevent loss. Access to the archived records should be restricted to authorized personnel, and an authorized access list should be maintained naming the personnel permitted to access the archived information. All accesses to the archives should be documented. The documentation involved should include: name of the individual; date and reason for accessing the data; and all changes, deletions or withdrawals that may have occurred.

3 MEASUREMENT SYSTEM CONTROL21,29-31

An essential part of a QA system involves continuous and timely monitoring of measurement system control. Organizations may be able to demonstrate the capability to conduct field and laboratory aspects of environmental assessments, but this alone does not provide the ongoing day-to-day performance checks necessary to guarantee and document performance. Performance check procedures must be in place to ensure that the quality of work conforms to the quality specifications on an ongoing basis.

The first step in instituting performance checks is to evaluate the measurement systems involved in terms of sensitivity, precision, linearity and bias to establish conditional measurement system performance characteristics. This is accomplished by specifying: instrument and method calibration criteria; acceptable precision and bias of the measurement system; and method detection and reporting limits. The performance characteristics may serve as the basis for acceptance criteria used to validate data. If published performance characteristics are to be used, they must be verified as reasonable and achievable with the instrumentation available and under the operating conditions employed.

3.1 Calibration
Calibration criteria should be specified for each test technology and/or analytical instrument and method in order to verify measurement system performance. The calibration used should be consistent with the data use objectives. Such criteria should specify the number of standards necessary to establish a calibration curve, the procedures to employ for determining the linear fit and linear range of the calibration, and acceptance limits for (continuing) analysis of calibration curve verification standards. All SOPs for field and laboratory measurement system operation should specify system calibration requirements, acceptance limits for analysis of calibration curve verification standards, and the detailed steps to be taken when acceptance limits are exceeded.
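To make these requirements concrete, the sketch below (Python with NumPy) fits a straight-line calibration curve to a set of multi-point standards and checks a continuing calibration verification standard against an acceptance limit. The five-point design, the r-squared statistic and the 10% limit are assumed examples only; actual criteria must come from the method or the project QA plan.

import numpy as np

def fit_calibration(conc, response):
    # Least-squares straight line; returns slope, intercept and r-squared.
    slope, intercept = np.polyfit(conc, response, 1)
    predicted = slope * np.asarray(conc) + intercept
    ss_res = np.sum((np.asarray(response) - predicted) ** 2)
    ss_tot = np.sum((np.asarray(response) - np.mean(response)) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot

def ccv_acceptable(true_conc, measured_conc, limit=0.10):
    # Calibration verification: relative error within an assumed 10% limit.
    return abs(measured_conc - true_conc) / true_conc <= limit

conc = [0.0, 1.0, 2.0, 5.0, 10.0]            # standard concentrations
response = [0.02, 0.98, 2.05, 4.95, 10.10]   # instrument responses
slope, intercept, r2 = fit_calibration(conc, response)
ccv_conc = (4.95 - intercept) / slope        # back-calculated verification standard
print(f'slope={slope:.3f}, intercept={intercept:.3f}, r2={r2:.4f}')
print('CCV acceptable:', ccv_acceptable(5.0, ccv_conc))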

3.2 Precision and bias32-34

The precision and bias of a measurement system must be determined prior to routine use, using solutions of known concentration. The degree to which a measurement is reproducible is frequently measured by the standard deviation or relative per cent difference of replicates. Bias, the determination of how close a measurement is to the true value, is usually calculated in terms of its complement, the per cent recovery of known concentrations of analytes from reference or spiked samples. Precision and bias data may be used to establish acceptance criteria, also called control limits, which define the limits of acceptable performance. Data that fall inside established control limits are judged to be acceptable, while data lying outside of the control interval are considered suspect. Quality control limits are often two-tiered: warning limits are established at ±2 standard deviation units around the mean, and control limits are established at ±3 standard deviation units around the mean. All limits should be updated periodically in order to accurately reflect the current state of the measurement system.
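These calculations are easily sketched in code. The example below (Python; the historical recovery data are invented for illustration) computes the per cent recovery of a control sample and the two-tiered warning (±2s) and control (±3s) limits described above.

import statistics

def percent_recovery(measured, true_value):
    # Bias expressed as per cent recovery of a known concentration.
    return 100.0 * measured / true_value

def control_limits(recoveries):
    # Two-tiered limits: warning at +/-2s, control at +/-3s about the mean.
    mean = statistics.mean(recoveries)
    s = statistics.stdev(recoveries)
    return {'mean': mean,
            'warning': (mean - 2 * s, mean + 2 * s),
            'control': (mean - 3 * s, mean + 3 * s)}

history = [98.2, 101.5, 99.7, 96.8, 102.3, 100.1, 97.9, 101.0]  # per cent
limits = control_limits(history)
new_result = percent_recovery(measured=10.4, true_value=10.0)   # 104.0%
low, high = limits['control']
print('suspect' if not low <= new_result <= high else 'acceptable')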

3.3 Method detection and reporting limits
Prior to routine use, the sensitivity of the measurement system must be determined. There are many terms that have been used to describe sensitivity, including: detection limit (DL),35 limit of detection (LOD),36 method detection limit (MDL),37 instrument detection limit (IDL),38 method quantitation limit (MQL),39 limit of quantitation (LOQ),36 practical quantitation limit (PQL),39,40 contract required detection limit (CRDL),38 and criteria of detection (COD)33 (see Table 1).

In an analytical process, the IDL is generally used to describe instrument sensitivity and the MDL is generally used to describe the overall sensitivity of the measurement system, including sample preparation and analysis. Frequently, either the IDL or MDL is designated as the lowest reportable concentration level, and analytical data are quantitatively reported at or above the IDL or MDL. As seen from Table 1, there is a great deal of disagreement and confusion regarding definitions and reporting limits. The definition of quantitation limit presumes that samples at the concentration can be quantitated with virtual assurance. This assumes the use of '< MDL' as a reporting threshold. However, results reported as '< LOQ' are not quantitative, which negates the definition of LOQ. This is a bad and unnecessary practice.

It is essential that the sensitivity of the measurement system and reporting limits be established prior to routine use of the measurement system. Although MDLs may be published in analytical methods, it is necessary for the laboratory to generate its own sensitivity data to support its own performance.
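The MDL determination summarized in Table 1 (a minimum of seven replicate spiked samples; standard deviation multiplied by the Student t-value, 3·14 at the 99% confidence level for seven replicates) can be sketched as follows; the replicate results shown are hypothetical.

import statistics

# Student t-value at the 99% confidence level for 6 degrees of freedom
# (seven replicates), as quoted in Table 1.
T_99_SEVEN_REPLICATES = 3.14

def method_detection_limit(replicates):
    # MDL = standard deviation of replicate spiked results x Student t (99%).
    if len(replicates) != 7:
        raise ValueError('the 3.14 constant applies to exactly seven replicates')
    return T_99_SEVEN_REPLICATES * statistics.stdev(replicates)

spiked_results = [1.9, 2.1, 2.0, 1.8, 2.2, 2.0, 1.9]  # ug/L, spiked near the limit
print(f'MDL = {method_detection_limit(spiked_results):.2f} ug/L')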

3.4 Quality control samples41-45

The continuing performance of the measurement system is verified by using Quality Control (QC) samples. The QC samples are used to evaluate measurement system control and the effect of the matrix on the data generated. Control samples are of known composition and are measured by the field or laboratory organization on an ongoing basis. Data from these control samples are compared to established acceptance criteria to determine whether the system is 'in a state of statistical control'. Control samples evaluate measurement system performance independent of matrix effects.

The effect of the matrix on measurement system performance is evaluated by collecting and analyzing environmental samples from the site or location being evaluated. These environmental samples are typically collected and/or analyzed in duplicate to assess precision, or spiked with known concentrations of target analytes to assess accuracy.

It is essential that both system control and matrix effects be evaluated to get a true picture of system performance and data quality. However, the types of QC samples, and the frequency of their usage, depend on the end-use of the data. Prior to beginning a sampling effort, the types of QC samples required should be determined. Data validation procedures should specify how the results of analysis of these samples will be used in evaluating the data.

TABLE 1
Definition of detection limit terms

Detection limit (DL)
  Definition: The concentration which is distinctly detectable above, but close to, a blank.
  Determination: Analysis of replicate standards.
  Calculation: Two times the standard deviation.
  Reference: 35

Limit of detection (LOD)
  Definition: The lowest concentration that can be determined to be statistically different from a blank.
  Determination: Analysis of replicate samples.
  Calculation: Three times the standard deviation.
  Reference: 36

Method detection limit (MDL)
  Definition: The minimum concentration of a substance that can be identified, measured and reported with 99% confidence that the analyte concentration is greater than zero.
  Determination: Analysis of a minimum of seven replicates spiked at 1-5 times the expected detection limit.
  Calculation: The standard deviation times the Student t-value at the desired confidence level (for seven replicates, the value is 3·14).
  Reference: 37

Instrument detection limit (IDL)
  Definition: The smallest signal above background noise that an instrument can detect reliably.
  Determination: Analysis of three replicate standards at concentrations of 3-5 times the detection limit.
  Calculation: Three times the standard deviation.
  Reference: 38

Method quantitation limit (MQL)
  Definition: The minimum concentration of a substance that can be measured and reported.
  Determination: Analysis of replicate samples.
  Calculation: Five times the standard deviation.
  Reference: 39

Limit of quantitation (LOQ)
  Definition: The level above which quantitative results may be obtained with a specified degree of confidence.
  Determination: Analysis of replicate samples.
  Calculation: Ten times the standard deviation.
  Reference: 36

Practical quantitation limit (PQL)
  Definition: The lowest level that can be reliably determined within specified limits of precision and accuracy during routine laboratory operating conditions.
  Determination: Interlaboratory analysis of check samples.
  Calculation: (1) Ten times the MDL; (2) the value where 80% of laboratories are within 20% of the true value.
  References: 39, 40

Contract required detection limit (CRDL)
  Definition: Reporting limit specified for laboratories under contract to the EPA for Superfund activities.
  Determination: Unknown.
  Calculation: Unknown.
  Reference: 38

Common types of field and laboratory QC samples are discussed in the following sections, along with how they are commonly used in evaluating data quality.

3.4.1 Field QC samples14,34

Field QC samples typically consist of the following kinds of samples:

(1) Field blank: a sample of analyte-free media similar to the sample matrix that is transferred from one vessel to another or is exposed to or passes through the sampling device or environment at the sampling site. This blank is preserved and processed in the same manner as the associated samples. A field blank is used to document contamination in the sampling and analysis process.

(2) Trip blank: a sample of analyte-free media taken from the laboratory to the sampling site and returned to the laboratory unopened. A trip blank is used to document contamination attributable to shipping, field handling procedures and sample container preparation. The trip blank is particularly useful in documenting contamination of volatile organics samples and is recommended when sampling for volatile organics.

(3) Equipment blank: a sample of analyte-free media that has been used to rinse sampling equipment (also referred to as an equipment rinsate). It is collected after completion of decontamination and prior to sampling. An equipment blank is useful in documenting adequate decontamination of sampling equipment.

(4) Material blank: a sample of construction materials such as those used in monitoring wells or other sampling point construction/installation, well development, pump and flow testing, and slurry wall construction. Samples of these materials are used to document contamination resulting from the use of construction materials.

(5) Field duplicates: independent samples collected as close as possible to the same point in space and time and which are intended to be identical; they are carried through the entire analytical process. Field duplicates are used to indicate the overall precision of the sampling and analytical process.

(6) Background sample: a sample taken from a location on or proximate to the site of interest, also known as a site blank. It is generally taken from an area thought to be uncontaminated in order to document baseline or historical information.


3.4.2 Laboratory QC samples11-14,21,31,33,34

Laboratory QC samples typically include the following kinds of samples:

(1) Method blank (also referred to as laboratory blank): an analyte-free media to which all reagents are added in the same volumes or proportions as used in the sample processing. The method blank is carried through the entire sample preparation and analysis procedure and is used to document contamination resulting from the analytical process. A method blank is generally analyzed with each group of samples processed.

(2) Laboratory control sample: a sample of known composition spiked with compound(s) representative of target analytes, which is carried through the entire analytical process. The results of the laboratory control sample(s) analyses are compared to laboratory acceptance criteria (control limits) to document that the laboratory is in control during an individual analytical episode. The laboratory control sample must be appropriately batched with samples to assure that the quality of both the preparation and analysis of the batch is monitored.

(3) Reference material: a sample containing known quantities of target analytes in solution or in a homogeneous matrix. Reference materials are generally provided to an organization through external sources. Reference materials are used to document the accuracy of the analytical process.

(4) Duplicate samples: two aliquots of sample taken from the same container after sample homogenization and intended to be identical. Matrix duplicates are analyzed independently and are used to assess the precision of the analytical process. Duplicates are used to assess precision when there is a high likelihood that the sample contains the analyte of interest.

(5) Matrix spike: an aliquot of sample (natural matrix) spiked with a known concentration of target analyte(s) which is carried through the entire analytical process. A matrix spike is used to document the effect of the matrix on the accuracy of the method, when compared to an aliquot of the unspiked sample (a worked example follows this list).

(6) Matrix spike duplicates: two aliquots of sample (natural matrix) taken from the same container and spiked with identical concentrations of target analytes. Matrix spike duplicates are analyzed independently and are used to assess the effect of the matrix on the precision of the analytical process. Matrix spike duplicates are used to assess precision when there is little likelihood the sample contains compounds of interest.

(7) Surrogate standard: a compound, usually organic, which is similar to the target analyte(s) in chemical composition and behavior in the analytical process, but which is generally not found in environmental samples. Surrogate standards are usually added to environmental samples prior to sample extraction and analysis and are used to monitor the effect of the matrix on the accuracy of the analytical process.

4 QUALITY SYSTEMS AUDITS3-6,10-14,46-50

The effectiveness of quality systems is evaluated through the audit process. The audit process should be designed to provide for both consistent and objective evaluation of the item or element undergoing evaluation.

The QA plan should provide general guidance regarding the frequency and scope of audits; rationale for determining the need for additional audits; recording and reporting formats; guidance for initiation and training; and corrective action plans, including implementation and tracking of corrective actions.

The audit process provides the means for continuously monitoring the effectiveness of the QA program. The audit process can involve both internal and third party auditors. Where internal auditors are involved, it is essential that the auditors have no routine involvement in the activities they are auditing.

Audit results should be reported to management in a timely manner. Management should respond to the auditor by outlining its plan for correcting the quality system deficiencies through the implementation of corrective actions.

4.1 Internal QC audit
The field and laboratory organization should develop written guidelines for conducting internal QC audits. The written guidelines should include specifications of reporting formats and checklist formats to ensure the audits are performed in a consistent manner.

The internal QC audits should be conducted by the QA officer or a trained audit team. These audits must be well-planned to ensure minimization of their impact on laboratory or field operations. Audit planning elements should include guidance on: (1) scheduling and notification; (2) development of written SOPs for the conduct of audits; (3) development of standard checklists and reporting formats; and (4) corrective action. The auditors should be trained in the areas they audit, with emphasis on consistently recording information collected during audits. To be most effective, internal QC audits should be timely, thorough, and accurate. The results of the audits should be reported to the staff both verbally and in writing in a timely manner.

4.2 Third party QC audits
Third party QC audits are used by the organizations responsible for conducting the sampling and analysis, and by the group that contracted the work, to verify the validity of the QA plan and the internal auditing process.

Third party audits are important because internal QC audits do not necessarily result in the implementation of the QC procedures specified in the QA plan; third party audits also provide an independent evaluation of the quality system.

4.3 Types of audits (Refs 46, 47)

There are several types of audits that are commonly conducted by both internal and third party auditors: system, performance, data quality, and contract/regulatory compliance audits. Frequently, an audit will combine several of these audit types to obtain a more complete understanding of an organization's performance.

4.3.1 System audit

A system audit is performed on a scheduled, periodic (usually semi-annual) basis by an independent auditor. These audits are usually conducted by auditors external to the auditee organization, and the results are reported to the auditee management.

The system audit involves a thorough overview of implementation of the QA plan within the auditee organization and includes inspection and evaluation of: (1) facilities; (2) staff; (3) equipment; (4) SOPs; (5) sample management procedures; and (6) QA/QC procedures.

4.3.2 Performance audit

A performance audit involves a detailed inspection of specific areas within an organization and its implementation of the QA program, including: (1) sample maintenance; (2) calibration; (3) preventive maintenance; (4) receipt and storage of standards, chemicals and gases; (5) analytical methods; (6) data verification; and (7) records management.

The performance audit may also be more restricted in scope and simply test the ability of the organization to correctly analyze samples of known composition. This involves the use of performance evaluation (PE) samples, which may be submitted as blind or double blind samples by either an internal or external source. The data from PE sample analyses are compared to acceptance limits in order to identify problems with qualitative identification or quantitative analysis. The organizations involved are usually asked to provide both an explanation for any data outside acceptance limits and a listing of corrective actions taken.
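As a concrete, entirely hypothetical illustration of that comparison, the sketch below checks reported PE results against their acceptance limits and flags any analyte requiring an explanation and corrective action; the analytes, results and limits are all invented.

```python
# Hypothetical PE sample evaluation; analytes, results and limits invented.
pe_results = {"benzene": 4.7, "toluene": 12.9, "lead": 0.031}
acceptance_limits = {"benzene": (4.0, 6.0),     # (lower, upper)
                     "toluene": (9.0, 11.0),
                     "lead": (0.025, 0.040)}

for analyte, value in pe_results.items():
    low, high = acceptance_limits[analyte]
    if low <= value <= high:
        print(f"{analyte}: {value} within acceptance limits")
    else:
        print(f"{analyte}: {value} OUTSIDE limits ({low}-{high}); "
              "explanation and corrective action required")
```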

This audit is performed on an ongoing basis within field and laboratory organizations by the QA officer or staff. The results of these audits are reported to the management of the organization.

4.3.3 Data quality audit

The data quality audit involves an assessment of the precision, bias (accuracy), representativeness and completeness of the data sets obtained. This audit is performed on representative data sets produced by the organization involved. These audits are usually conducted on a project-specific basis. As with other audits, the results are reported to management.

4.3.4 Contract/regulatory compliance audit

This audit is conducted to evaluate the effectiveness of the QA plan and protocols in ensuring contract/regulatory compliance by the organization involved. The contract/regulatory compliance audit generally revolves around ensuring that correct protocols are followed and the resulting reports are in the contractually prescribed format.

4.4 Field and laboratory checklists (Refs 10, 34, 49, 51)

A checklist should be developed for each audit activity, ensuring that all areas of the field and laboratory organizations' operations are systematically addressed. The actual content of the checklist depends on the objective of the audit but should minimally address QC procedures, the sampling and analysis plan and conformance to the quality plan. These checklists are helpful in ensuring that the audits are as objective, comprehensive and consistent as possible.

4.5 Corrective action

The result of the auditing process, whether by internal auditors or third party auditors, is the audit report, which serves as the basis for the development of corrective actions. These corrective actions are implemented to eliminate any deficiencies uncovered during the audit. The final step in this process is the evaluation of the effectiveness of the corrective actions in eliminating the deficiencies. All corrective actions and their effectiveness must be documented.

5 DATA ASSESSMENT (Refs 34, 52)

Data assessment involves determining whether the data meet the requirements of the QA plan and the needs of the data user. Data assessment is a three-part process involving assessment of the field data, the laboratory data and the combined field and laboratory data. Both the field and laboratory assessments involve comparison of the data obtained to the specifications stated in the QA plan, whereas the combined, or overall, assessment involves determining the data usability. The data usability assessment is the determination of the data's ability to meet the DQOs and whether they are appropriate for the intended use.

5.1 Field data assessment (Refs 34, 46, 52)

This aspect of assessment involves verification of the documentation, and both quantitative and qualitative validation, of field data from measurements such as pH, conductivity, and flow rate and direction, as well as of information such as soil stratigraphy, groundwater well installation and sample management records, observations, anomalies and corrective actions.

5.1.1 Field data completeness

The process of reviewing the field data for completeness ensures that the QA plan requirements for record traceability and procedure documentation have been implemented. The documentation must be sufficiently detailed to permit recreation of the field episode if necessary. Incomplete records should be identified and should include data qualifiers that specify the usefulness of the data in question.

5.1.2 Field data validity

Review of field data for validity ensures that problems affecting the accuracy and precision of quantitative results, as well as the representativeness of samples, are identified so that the data may be qualified as necessary. Problems of this type may include improper sample preservation and well screening, instability of pH or conductivity, or collection of volatile organic samples adjacent to sources of contamination.

5.1.3 Field data comparisons

Review of the data should include comparison/correlation of data obtained by more than one method. This review will allow identification of anomalous field test data, such as groundwater samples with pH several units higher than those from similar wells in the same aquifer.
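A minimal sketch of such a comparison, assuming only that pH values are available for a group of similar wells; the wells, values and the two-unit threshold are invented for illustration.

```python
import statistics

# Hypothetical pH readings from wells screened in the same aquifer.
well_ph = {"MW-1": 6.8, "MW-2": 7.1, "MW-3": 6.9, "MW-4": 11.2}

reference = statistics.median(well_ph.values())
for well, ph in sorted(well_ph.items()):
    if abs(ph - reference) > 2.0:    # "several units" threshold (assumed)
        print(f"{well}: pH {ph} is anomalous (aquifer median {reference})")
```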

5.1.4 Field laboratory data validation

Field laboratory data validation involves using data validation techniques which mirror those used in fixed laboratory operations. Details of these requirements are given in Section 5.2.

5.1.5 Field quality plan variances

This review involves documenting all quality plan variances, including the rationale for the changes that were made. It involves evaluation of all failures to meet data acceptance criteria and of the corrective actions implemented, and should include appropriate data qualification where necessary.

5.2 Laboratory data assessment (Refs 34, 46, 52)

The laboratory data assessment involves a review of the methods employed, conformance with QA plan requirements and a review of records. The discussion presented here is limited to the technical data requirements because contract requirements and technical data conformance requirements are usually not the same.

5.2.1 Laboratory data completeness

Laboratory data completeness is usually defined as the percentage of valid data obtained. It is necessary to define procedures for establishing data validity (Sections 5.2.2-5.2.7).
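A one-function sketch of this completeness calculation, assuming validity has already been established under Sections 5.2.2-5.2.7; the counts shown are hypothetical.

```python
def completeness(valid_results: int, planned_results: int) -> float:
    """Completeness: percentage of valid data obtained."""
    return 100.0 * valid_results / planned_results

print(f"{completeness(186, 200):.1f}% complete")   # hypothetical counts
```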

5.2.2 Evaluation of laboratory QC samples (Refs 26-28, 31, 44, 45)

(1) Laboratory blank: the results of these blank sample analyses should be compared with the sample analyses on a per-analyte basis to determine whether any analyte present is due to laboratory contamination or is actually present in the sample.

(2) QC sample: the results of control sample analyses are usually evaluated statistically in order to determine the mean and standard deviation. These data can be used to determine data acceptance criteria (control limits), and to determine precision and accuracy characteristics for the method. The data can be statistically analyzed to determine outliers, bias and trends (see the sketch following item (3)). The control limits from these data should be used on a real-time basis to demonstrate measurement system control and sample data validity.

(3) Performance evaluation sample: the results of PE sample analyses are used to evaluate and compare different laboratories. In order for such comparisons to be valid, the PE sample results must have associated control limits that are statistically valid.
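The statistical treatment of QC sample results described in item (2) can be sketched as follows. The mean ± 2s warning limits and ± 3s control limits are the conventional control-chart choices, assumed here rather than mandated by the chapter, and the recoveries are invented.

```python
import statistics

qc_recoveries = [98.2, 101.5, 99.7, 100.8, 97.9,
                 102.1, 100.3, 99.1, 101.0, 98.8]   # % recovery, hypothetical

mean = statistics.mean(qc_recoveries)
s = statistics.stdev(qc_recoveries)                 # sample standard deviation

warning_limits = (mean - 2 * s, mean + 2 * s)
control_limits = (mean - 3 * s, mean + 3 * s)

def in_control(new_result: float) -> bool:
    """Real-time acceptance check of a new QC result."""
    low, high = control_limits
    return low <= new_result <= high

print(f"mean = {mean:.1f}, s = {s:.2f}")
print(f"control limits = ({control_limits[0]:.1f}, {control_limits[1]:.1f})")
print("in control" if in_control(103.9) else "out of control")
```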

5.2.3 Evaluation and reporting of low-level data (Ref. 33)

Sample data should be evaluated on an analyte basis in determining whether values to be reported are above detection or reporting limits. Since there are numerous definitions (see Section 3.3) of these terms, it is essential that the QA plan clearly defines the terms and specifies the procedures for determining their values for each analyte/procedure.

5.2.4 Evaluation of matrix effects

Since each environmental sample has the potential for containing a different matrix, it may be necessary to document the effect of each matrix on the analyte of interest. Matrix effects are usually evaluated by analyzing the spike recovery from an environmental sample spiked with the analyte of interest prior to sample preparation and analysis. Surrogate spike recoveries and matrix spike recoveries may be used to evaluate matrix effects. If a method shows significant bias or imprecision with a particular matrix, then data near the regulatory limit (or an action level) should be carefully evaluated.
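One way to express that qualification step in code, as a sketch under assumed values: the 75-125% recovery window, the factor-of-two neighbourhood of the action level and the data are all invented, not taken from the chapter.

```python
def qualify(result: float, action_level: float, spike_recovery: float,
            window: tuple = (75.0, 125.0)) -> str:
    """Flag a result near an action level when the matrix shows bias."""
    low, high = window
    biased = not (low <= spike_recovery <= high)
    near_action_level = 0.5 * action_level <= result <= 2.0 * action_level
    if biased and near_action_level:
        return "qualify: evaluate carefully (matrix effect near action level)"
    return "no matrix-effect qualifier"

print(qualify(result=4.6, action_level=5.0, spike_recovery=62.0))
```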

5.2.5 Review of sample management data

Review of sample management information involves examining records that address sample traceability, storage conditions and holding times. The field and laboratory organizations must evaluate whether the information contained in the records is complete and whether it documents improper preservation, exceeded holding times or improper sample storage procedures.
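A sketch of the holding-time portion of this review, assuming collection and analysis timestamps are available from the sample management records; the 14-day limit is an example value only, as real limits are method- and analyte-specific.

```python
from datetime import datetime, timedelta

MAX_HOLDING_TIME = timedelta(days=14)        # example value only

collected = datetime(1992, 6, 12, 10, 15)    # hypothetical timestamps
analyzed = datetime(1992, 6, 29, 14, 0)

held = analyzed - collected
if held > MAX_HOLDING_TIME:
    print(f"holding time exceeded by {held - MAX_HOLDING_TIME}; qualify data")
else:
    print("holding time met")
```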

5.2.6 Calibration

Instrument calibration information, including sensitivity checks, instrument calibration and continuing calibration checks, should be evaluated and compared to historical information and acceptance criteria. Control charts are a preferred method for documenting calibration performance. Standard and reference material traceability must be evaluated and documented, along with verifying and documenting their concentration/purity.
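A hypothetical sketch of a continuing calibration check: the response to a mid-range check standard is compared with the initial calibration response, and a percent-difference acceptance criterion (10% here, an assumed value) decides whether recalibration is needed.

```python
def percent_drift(initial_response: float, check_response: float) -> float:
    """Percent difference of a continuing check from the initial response."""
    return abs(check_response - initial_response) / initial_response * 100.0

drift = percent_drift(initial_response=0.250, check_response=0.268)
print(f"drift = {drift:.1f}%:",
      "acceptable" if drift <= 10.0 else "recalibrate")
```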

5.3 Assessment of combined data (Refs 34, 52)

At some point after the field and laboratory organizations have validated and reported their data, these reported data are combined to determine the data's usability. This assessment of data usability, occurring after completion of the data collection activity, involves evaluating the documentation of information and performance against established criteria.

5.3.1 Combined data assessment

The final data assessment involves integration of all data from the field and laboratory activities. The data are evaluated for their conformance with the DQOs in terms of their precision, accuracy, representativeness, completeness and comparability. The precision and accuracy are evaluated in terms of the method. Representativeness expresses the degree to which the data represent the sample population. Completeness is expressed as the percentage of valid field and laboratory data. Comparability, based on the degree of standardization of methods and procedures, attempts to express how one data set compares to another data set. Comparison of the field, rinsate and trip blanks, which usually occurs only after the field and laboratory data have been combined, is accomplished by employing procedures similar to those used in evaluating laboratory blanks. Sample management records should be assessed to assure that sample integrity was maintained and not compromised.

5.3.2 Criteria for classification of data

The criteria for data classification must be specified in the QA plan and based on the documentation of specific information and evaluation of the data in specific quality terms. Where multiple data usability levels are permitted, the minimum criteria for each acceptable level of data usability must be specified in the QA plan. To be categorized as fully usable, the data must be supported (documented) by the minimum informational requirements described in the QA plan.

5.3.3 Classification of the data

The assessment process should evaluate the data using procedures and criteria documented in the QA plan. A report should be prepared which describes the usability level assigned to the data relative to the DQOs. This report should also delineate the reason(s) the data have been given a particular usability level classification.

6 FIELD AND LABORATORY STANDARD OPERATING PROCEDURES (Refs 26, 53)

The following paragraphs specify recommended minimal levels of detail for SOPs used in obtaining environmental data.

6.1 Sample control (Ref. 28)

Sample control SOPs detail sample collection, management, receipt, handling, preservation, storage and disposal requirements, including chain of custody procedures. The purpose of these procedures is to permit traceability from the time samples are collected until they are released for final disposition.

The field manager and laboratory director should appoint a sample custodian with the responsibility for carrying out the provisions of the sample control SOPs.

6.1.1 Sample collection and field management (Refs 14, 54)

A large number of sample collection procedures are necessary to provide for proper collection of the wide variety of environmental samples. Since sample collection is the most critical part of the environmental assessment, it is essential that these sample collection protocols (SOPs) be implemented as written. These SOPs should not be simple references to a published method unless the method is to be implemented exactly as written. Sample collection SOPs should include at least the following sections: (1) purpose; (2) scope; (3) responsibility; (4) references; (5) equipment required; (6) detailed description of the procedure; (7) QA data acceptance criteria; (8) reporting and records; and (9) health and safety considerations.

Field sample management SOPs should describe the numbering and labeling system; sample collection point selection methods; chain of custody procedures; the specification of holding times; the sample volumes and preservatives required for analysis; and sample shipping requirements.

Samples should be identified by attaching a label or tag to the container, usually prior to sample collection. The label or tag used should be firmly attached to the sample container, made of waterproof paper, and filled in with waterproof ink. The label or tag should uniquely identify the sample and include information such as: (1) date and time of sample collection; (2) location of sample point; (3) sample type; (4) preservative, if added; (5) safety considerations; (6) company name and project identification; and (7) observations, remarks, and name and signature of the person recording the data. (One possible record structure is sketched below.)
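The label contents listed above map naturally onto a simple record; the sketch below invents field names purely for illustration, since the chapter specifies the information to record, not a format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SampleLabel:
    sample_id: str               # unique sample identifier
    collected: str               # (1) date and time of collection
    location: str                # (2) sample point
    sample_type: str             # (3) sample type
    preservative: Optional[str]  # (4) preservative, if added
    safety: str                  # (5) safety considerations
    company_project: str         # (6) company name and project identification
    remarks: str                 # (7) observations and remarks
    recorded_by: str             # (7) name/signature of recorder

label = SampleLabel("GW-92-007", "1992-06-12 10:15", "MW-3", "groundwater",
                    "HNO3", "acid preservative", "ACME / Site 4",
                    "slight turbidity", "K. Smith")
print(label.sample_id, label.collected)
```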

6.1.2 Sample receipt, handling and preservation (Refs 54, 55)

These SOPs describe procedures to be followed in: opening sample shipment containers; verifying chain of custody maintenance; examining samples for damage; checking for proper preservatives and temperature; assignment to the testing program; and logging samples into the laboratory sample stream.

Samples should be inspected to determine the condition of the sample and custody seal, if used. If any sample has leaked or any custody seal is broken, the condition is noted and the custodian, along with the supervisor responsible, must decide if it is a valid sample. Sample documentation is verified for agreement between the chain of custody/shipping record, sample analysis request form, and sample label. Any discrepancies must be resolved prior to sample assignment for analysis.

The results of all inspections and investigations of discrepancies are noted on the sample analysis request forms, as well as in the laboratory sample logbook. All samples are assigned a unique laboratory sample number and logged into the laboratory sample logbook or a computerized sample management system.

6.1.3 Sample storage (Refs 54, 55)

Sample storage SOPs describe storage conditions required for all samples, procedures used to verify and document storage conditions, and procedures used to ensure that sample custody has been maintained from sample collection to sample disposal.

Sample storage conditions address holding times and preservation and refrigeration requirements. The SOP dealing with storage conditions should specify that a log be maintained and entries be made at least twice a day, at the beginning and end of the workday. Such entries should note, at minimum, the security condition, the temperature of the sample storage area and any comments. The required sampling chain of custody procedures should minimally meet standard practices such as those described in Ref. 28.

6.1.4 Sample disposal (Ref. 56)

Sample disposal SOPs should describe the process of sample release and disposal. The authority for sample release should be assigned to one individual, the sample custodian. Laboratory and field management should designate an individual responsible for waste disposal, with adequate training, experience and knowledge to deal with the regulatory aspects of waste disposal. Typical procedures used are listed in Section 2.2.5 above.

6.2 Standard and reagent preparation (Ref. 57)

Standard and reagent preparation SOPs detail the procedures used to prepare, verify and document standard and reagent solutions, including reagent-grade water and dilution water used in the laboratory. These SOPs should include: (1) information concerning specific grades of materials used; (2) appropriate laboratory ware and containers for preparation and storage; (3) labeling and record-keeping for stocks and dilutions; (4) procedures used to verify concentration and purity; and (5) safety precautions to be taken.

For environmental analyses, certain minimal requirements must be met: (1) use of reagent water meeting a standard equivalent to ASTM Type II (ASTM D1193); (2) chemicals of ACS grade or better; (3) glass and plastic laboratory ware, volumetric flasks and transfer pipets shall be Class A, precision grade and within tolerances established by the National Institute of Standards and Technology (NIST); (4) balances and weights shall be calibrated at least annually (traceable to NIST); (5) solution storage should include light-sensitive containers if required, storage conditions for solutions should be within the range 15-30°C unless special requirements exist, and shelf life should be determined if not known; (6) purchased and in-house calibration solutions should be assayed periodically; (7) labeling should include preparation date, name and concentration of analyte, lot number, assay and date; and (8) the preparation of calibration solutions, titrants and assay results should be documented.

6.3 Instrument/equipment maintenance (Refs 4, 11)

Instrument/equipment maintenance SOPs describe the procedures necessary to ensure that equipment and instrumentation are operating within the specifications required to provide data of acceptable quality. These SOPs specify: (1) calibration and maintenance procedures; (2) performance schedules for these functions; (3) record-keeping requirements, including maintenance logs; (4) service contracts and/or service arrangements for all non-in-house procedures; and (5) the availability of spare parts. The procedures employed should incorporate the manufacturer's recommendations to ensure that the equipment performs within specifications.

6.4 General field and laboratory procedures (Refs 4, 8, 9, 11-14, 21, 26)

General field and laboratory SOPs detail all essential operations or requirements not detailed in other SOPs. These SOPs include: (1) preparation and cleaning of sampling site preparation devices; (2) sample collection devices; (3) sample collection bottles and laboratory ware; (4) use of weighing devices; (5) dilution techniques; (6) use of volumetric glassware; and (7) documentation of thermally controlled devices.

Many of the cleaning and preparation procedures have special requirements necessary to prevent contamination problems associated with specific analytes. Therefore, individual procedures are required for metals analyses, various types of organic analyses, total phosphorus analysis, and ammonia analysis, for example (Refs 35, 39, 58).

6.5 Analytical methods (Refs 35, 37, 39, 58-61)

Analytical method SOPs detail the procedures followed in the field or laboratory to determine a chemical or physical parameter and should describe how the analysis is actually performed. They should not simply be a reference to standard methods, unless the analysis is to be performed exactly as described in the reference method.

Whenever possible, reference method sources used should include those published by the US Environmental Protection Agency (EPA), American Public Health Association (APHA), American Society for Testing and Materials (ASTM), Association of Official Analytical Chemists (AOAC), Occupational Safety and Health Administration (OSHA), National Institute for Occupational Safety and Health (NIOSH), or other recognized organizations.

A test method that involves selection of options depending on conditions or the sample matrix requires that the SOP be organized into subsections dealing with each optional path, each optional path being treated as a separate test method. The laboratory must validate each SOP by verifying that it can obtain results which meet the minimum requirements of the published reference methods. For test methods, this involves obtaining results with precision and bias comparable to the reference method and documenting this performance.

Each analyst should be provided with a copy of the procedure currently being employed. Furthermore, the analyst should be instructed to follow the procedure exactly as written. Where the test method allows for options, the particular option employed should be reported with the final results.

A standardized format for SOPs (Ref. 26) should be adopted and employed. This format should include: title, principal reference, application, summary of the test method, interferences, sample handling, apparatus, chemicals and reagents, safety, procedure, sample storage, calculation, data management, QA and QC, references and appendices.

Of particular importance to this process are those method-specific QA and QC procedures necessary to provide data of acceptable quality. These method-specific procedures should describe the QA elements that must be routinely performed. These should include equipment and reagent checks, instrument standardization, the linear standardization range of the method, frequency of analysis of calibration standards, recalibrations, check standards, calibration data acceptance criteria and other system checks as appropriate. Necessary statistical QC parameters should be specified along with their frequency of performance, and include the use of: QC samples; batch size; reference materials; and data handling, validation and acceptance criteria. The requirements for reporting low-level data for all analytes in the method should be addressed and include limits of detection and reporting level (see Section 3.3 above). The laboratory should estimate and tabulate the precision and accuracy of data obtained using the specific test method, by both matrix and concentration. Finally, appropriate references to any applicable QC SOPs should be included.

6.6 Quality control SOPs (Refs 29-31, 48)

Quality control SOPs address QC procedures necessary to assure that data produced meet the needs of the users. The procedures should detail minimal QC requirements to be followed unless otherwise specified in the analytical method SOP. The topics addressed by these SOPs should include: standardization/calibration requirements; use of QC samples; statistical techniques for the determination of method performance characteristics, detection limit values, and precision and bias/recovery estimates; and development of data acceptance criteria.

A useful approach to developing minimal QC requirements has been recommended by A2LA (Ref. 13) and is based on controlling the testing technology involved; it is intended to augment, where necessary, the method-specific requirements.

The SOPs concerning the use of QC samples should address the types, purpose and frequency of employment. These SOPs provide information and data for demonstrating measurement system control, estimating matrix effects, evaluating method recovery/bias and precision, demonstrating the effectiveness of cleaning procedures and evaluating reagent and other blanks used to detect contamination.

The SOPs dealing with method performance characteristics should detail the statistical procedures necessary to determine the value and uncertainty associated with the laboratory measurement using validated methods. It should be noted that there are a number of different and often conflicting detection limit definitions and procedures for their determination. This topic needs to be clarified. However, it appears that currently the soundest approach may be to use the method detection limit (MDL) procedure proposed by the USEPA (see Table 1) (Ref. 37).
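For reference, the USEPA MDL procedure cited here (Ref. 37) amounts to analyzing at least seven replicates of a low-level spiked sample and multiplying their standard deviation by the one-sided 99% Student's t value for n-1 degrees of freedom. A minimal sketch with invented replicate values:

```python
import statistics

replicates = [1.9, 2.3, 2.1, 2.6, 1.8, 2.2, 2.4]  # 7 low-level spikes (hypothetical)
T_99_6DF = 3.143                                  # one-sided 99% t, 6 degrees of freedom

s = statistics.stdev(replicates)
mdl = T_99_6DF * s

print(f"s = {s:.3f}; MDL = {mdl:.2f} (same units as the replicates)")
```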

6.7 Corrective actions

Corrective action SOPs describe procedures used in identifying and correcting non-conformances associated with out-of-control situations occurring within the testing process. These SOPs should describe specific steps to be followed in evaluating and correcting loss of system control, such as reanalysis of the reference material; preparation of new standards, reagents and/or reference materials; recalibration/restandardization of equipment; evaluation of the effectiveness of the corrective action taken; reanalysis of samples; and/or recommending retraining of laboratory analysts in the use of the affected procedures.

The evaluations of non-conformance situations should be documented using a corrective action report which describes: (1) the non-conformance; (2) the samples affected by the non-conformance; (3) the corrective action taken and evaluation of its effectiveness; and (4) the date of corrective action implementation.

6.8 Data reduction and validation (Refs 21, 29-33, 53, 62)

Data reduction and validation SOPs describe the procedures used in reviewing and validating sample and QC data. They should include procedures used in computing, evaluating and interpreting results from analyses of QC samples, and procedures used in certifying intrasample and/or intersample consistency among multiparameter analyses and/or sample batches.

Furthermore, these SOPs should also include those elements necessary to establish and monitor the precision and bias/recovery characteristics associated with the analysis of the full range of QC samples: (1) blank samples (field, trip and reagent); (2) calibration standards; (3) check standards; (4) control standards; (5) reference standards; (6) duplicate samples; (7) matrix spike samples; and (8) surrogate recovery samples.

6.9 Reporting (Refs 11-13, 33, 36, 63)

Reporting SOPs describe the process for reporting testing results and should clearly and unambiguously present the test results and all other relevant information. These SOPs should include the procedures for: (1) summarizing the testing results and QC data; (2) presentation format and content, and the report review process; and (3) issuing and amending test reports.

Each testing report should minimally contain the following information (Refs 11-13): (1) name and address of the testing organization; (2) unique identification of the report, and of each page of the report; (3) name and address of the client; (4) description and identification of the test item; (5) date of receipt of the test item and date(s) of performance of the test; (6) identification of the testing procedure; (7) description of the sampling procedure, where relevant; (8) any deviations, additions to or exclusions from the testing procedure, and any other information relevant to the specific test; (9) disclosure of any non-standard testing procedure used; (10) measurements, examinations and derived results, supported by tables, graphs, sketches and photographs as appropriate; (11) where appropriate, a statement on measurement uncertainty, to include precision and bias/recovery; (12) signature and title of person(s) accepting technical responsibility for the testing report, and date of report issue; and (13) a statement that only a complete copy of the testing report may be made.

6.10 Records management (Refs 27, 34)

Records management SOPs describe the procedures for generating, controlling and archiving laboratory records. These SOPs should detail the responsibilities for record generation and control, and policies for record retention, including type, time, security, and retrieval and disposal authorities.

Records documenting overall laboratory and project-specific operations should be maintained in conformance with laboratory and project policy and any regulatory requirements. These records include correspondence, chain of custody, requests for testing, DQOs, QA plans, notebooks, equipment performance and maintenance logs, calibration records, testing data, QC samples, software documentation, control charts, reference/control material certification, personnel files, SOPs and corrective action reports.

6.11 Chemical and sample disposal (Refs 14, 56)

Chemical and sample disposal SOPs describe the policies and procedures necessary to properly dispose of chemicals, standard and reagent solutions, process wastes and unused samples. Disposal of all chemicals and samples must be in conformance with applicable regulatory requirements. These SOPs should detail appropriate disposal and pretreatment/recovery methods. They must take into consideration the discharge requirements of the publicly-owned treatment works and landfills. The pretreatment/recovery SOPs should include recovery, reuse, dilution, neutralization, oxidation, reduction, and controlled reactions/processes.

6.12 Health and safety (Refs 18-23, 25, 64)

Health and safety SOPs describe the policies and procedures necessary to meet health and safety regulatory requirements in providing a safe and healthy working environment for field and laboratory personnel engaged in environmental sample collection and testing operations. The SOPs should be work-practice oriented, that is, they should detail how to accomplish the task safely and in conformance with appropriate regulatory requirements. For example, in the UK the specific requirements of the Control of Substances Hazardous to Health (COSHH) regulations must be fully addressed.

The SOPs should detail the procedures necessary for operation and maintenance of laboratory safety devices including fume hoods, glove boxes, miscellaneous ventilation devices, eye washes, safety showers, fire extinguishers, fire blankets and self-contained breathing apparatus.

In addition, the SOPs should fully describe emergency procedures for contingency planning, evacuation plans, appropriate first aid measures, chemical spill plans, proper selection and use of protective equipment, hazard communication, chemical hygiene plans, and information and training requirements.

6.13 Definitions and terminology (Refs 14, 34, 39, 65-70)

The QA plan as well as associated SOPs and other documentation must be supported by a set of standardized and consistently used definitions and terminology. The quality system should have an SOP that describes and lists the terms, acronyms and symbols that are acceptable for use.

ACKNOWLEDGEMENT

The author would like to thank James H. Scott, Senior Environmental Analyst, Georgia Power Company, Atlanta, Georgia for his suggestions and assistance with the preparation and proof-reading of this manuscript.

REFERENCES*

* The majority of the references cited are periodically reviewed and updated; the reader is advised to consult the latest edition of these documents.

1. National Research Council, Final report on quality assurance to the Environmental Protection Agency. Report to Environmental Protection Agency, National Academy Press, Washington, DC, 1988.

2. American Association for Registration of Quality Systems, Program Specific Requirements, 656 Quince Orchard Road, Gaithersburg, MD 20878, 1990.

3. American National Standards Institute, Quality management and quality assurance standards-Guidelines for selection and use. American Society for Quality Control, Designation ANSI/ASQC Q90, Milwaukee, WI, 1987.

4. American National Standards Institute, Quality systems-Model for quality assurance in design/development, production, installation, and servicing. American Society for Quality Control, Designation ANSI/ASQC Q91, Milwaukee, WI, 1987.

5. American National Standards Institute, Quality systems-Model for quality assurance in production and installation. American Society for Quality Control, Designation ANSI/ASQC Q92, Milwaukee, WI, 1987.

6. American National Standards Institute, Quality systems-Model for quality assurance in final inspection and test. American Society for Quality Control, Designation ANSI/ASQC Q93, Milwaukee, WI, 1987.

7. American National Standards Institute, Quality management and quality system elements-Guidelines. American Society for Quality Control, Designation ANSI/ASQC Q94, Milwaukee, WI, 1987.

8. US Environmental Protection Agency, Interim Guidelines and Specifications for Preparing Quality Assurance Project Plans. QAMS-005/80, EPA-600/4-83-004, USEPA, Quality Assurance Management Staff, Washington, DC, 1983.

9. US Environmental Protection Agency, Guidance for Preparation of Combined Work/Quality Assurance Project Plans for Environmental Monitoring. OWRS QA-1, USEPA, Office of Water Regulations and Standards, Washington, DC, 1984.

10. US Environmental Protection Agency, Manual for the Certification of Laboratories Analyzing Drinking Water. EPA-570/9-90-008, USEPA, Office of Drinking Water, Washington, DC, 1990.

11. International Standards Organization, General requirements for the technical competence of testing laboratories. ISO/IEC Guide 25, Switzerland, 1990.

12. American Association for Laboratory Accreditation, General Requirements for Accreditation. A2LA, Gaithersburg, MD, 1991.

13. American Association for Laboratory Accreditation, Environmental Program Requirements. A2LA, Gaithersburg, MD, 1991.

14. American Society for Testing and Materials, Standard practice for the generation of environmental data related to waste management activities. ASTM Designation ES16, Philadelphia, PA, 1990.

15. American Society for Testing and Materials, Standard practice for preparation of criteria for use in the evaluation of testing laboratories and inspection bodies. ASTM Designation E548, Philadelphia, PA, 1984.

16. American Society for Testing and Materials, Standard guide for laboratory accreditation systems. ASTM Designation E994, Philadelphia, PA, 1990.

17. Locke, J.W., Quality, productivity, and the competitive position of a testing laboratory. ASTM Standardization News, July (1985) 48-52.

18. American National Standards Institute, Fire protection for laboratories using chemicals. National Fire Protection Association, Designation ANSI/NFPA 45, Quincy, MA, 1986.

19. Koenigsberg, J., Building a safe laboratory environment. American Laboratory, 19 June (1987) 96-105.

20. US Code of Federal Regulations, Title 29, Part 1910, Occupational Safety and Health Standards, Subpart Z, Toxic and Hazardous Substances, Section .1450, Occupational exposure to hazardous chemicals in laboratories, OSHA, 1990, pp. 373-89.

21. American Society for Testing and Materials, Standard guide for good laboratory practices in laboratories engaged in sampling and analysis of water. ASTM Designation D3856, Philadelphia, PA, 1988.

22. Committee on Industrial Ventilation, Industrial Ventilation: A Manual of Recommended Practice, 20th edn. American Conference of Governmental Industrial Hygienists, Cincinnati, OH, 1988.

23. American National Standards Institute, Flammable and combustible liquids code. National Fire Protection Association, Designation ANSI/NFPA 30, Quincy, MA, 1987.

24. Kaufman, J.E. (ed.), IES Lighting Handbook: The Standard Lighting Guide, 5th edn. Illuminating Engineering Society, New York, 1972.

25. US Code of Federal Regulations, Title 29, Part 1910, Occupational Safety and Health Standards, Subpart H, Hazardous Materials, Section .106, Flammable and combustible liquids, OSHA, 1990, pp. 242-75.

26. American Society for Testing and Materials, Standard guide for documenting the standard operating procedure for the analysis of water. ASTM Designation D5172, Philadelphia, PA, 1991.

27. American Society for Testing and Materials, Standard guide for records management in spectrometry laboratories performing analysis in support of nonclinical laboratory studies. ASTM Designation E899, Philadelphia, PA, 1987.

28. American Society for Testing and Materials, Standard practice for sampling chain of custody procedures. ASTM Designation D4840, Philadelphia, PA, 1988.

29. American Society for Testing and Materials, Standard guide for accountability and quality control in the chemical analysis laboratory. ASTM Designation E882, Philadelphia, PA, 1987.

30. American Society for Testing and Materials, Standard guide for quality assurance of laboratories using molecular spectroscopy. ASTM Designation E924, Philadelphia, PA, 1990.

31. US Environmental Protection Agency, Handbook for Analytical Quality Control in Water and Wastewater Laboratories. EPA-600/4-79-019, USEPA, Environmental Monitoring and Support Laboratory, Cincinnati, OH, 1979.

32. American Society for Testing and Materials, Standard practice for determination of precision and bias of applicable methods of Committee D-19 on water. ASTM Designation D2777, Philadelphia, PA, 1986.

33. American Society for Testing and Materials, Standard practice for intralaboratory quality control procedures and a discussion on reporting low-level data. ASTM Designation D4210, Philadelphia, PA, 1989.

34. US Environmental Protection Agency, Guidance Document for Assessment of RCRA Environmental Data Quality. DRAFT, USEPA, Office of Solid Waste and Emergency Response, Washington, DC, 1987.

35. US Environmental Protection Agency, Methods for Chemical Analysis of Water and Wastes. EPA/600/4-79-020, Environmental Monitoring and Support Laboratory, Cincinnati, OH, revised 1983.

36. Keith, L.H., Crummett, W., Deegan, J., Jr, Libby, R.A., Taylor, J.K. & Wentler, G., Principles of environmental analysis. Analytical Chemistry, 55 (1983) 2210-18.

37. US Code of Federal Regulations, Title 40, Part 136, Guidelines establishing test procedures for the analysis of pollutants, Appendix B-Definition and procedure for the determination of the method detection limit, Revision 1.11, USEPA, 1990, pp. 537-9.

38. US Environmental Protection Agency, User's Guide to the Contract Laboratory Program and Statements of Work for Specific Types of Analysis. Office of Emergency and Remedial Response, USEPA 9240.0-1, December 1988, Washington, DC.

39. US Environmental Protection Agency, Test Methods for Evaluating Solid Waste, SW-846, 3rd edn, Office of Solid Waste (RCRA), Washington, DC, 1990.

40. US Code of Federal Regulations, Title 40, Protection of Environment, Part 141 - National Primary Drinking Water Regulations, Subpart C: Monitoring and Analytical Requirements, Section .24, Organic chemicals other than total trihalomethanes, sampling and analytical requirements, USEPA, 1990, pp. 574-9.

41. Britton, P., US Environmental Protection Agency, Estimation of generic acceptance limits for quality control purposes in a drinking water laboratory. Environmental Monitoring and Support Laboratory, Cincinnati, OH, 1989.

42. Britton, P., US Environmental Protection Agency, Estimation of generic quality control limits for use in a water pollution laboratory. Environmental Monitoring and Support Laboratory, Cincinnati, OH, 1989.

43. Britton, P. & Lewis, D., US Environmental Protection Agency, Statistical basis for laboratory performance evaluation limits. Environmental Monitor­ing and Support Laboratory, Cincinnati, OH, 1986.

44. US Environmental Protection Agency, Data Quality Objectives for Remedial Response Activities Example Scenario: RI/FS Activities at a Site with Contaminated Soils and Ground Water. EPA/540/G-87/004, USEPA, Office of Emergency and Remedial Response and Office of Waste Programs Enforcement, Washington, DC, 1987.

45. US Environmental Protection Agency, Field Screening Methods Catalog: User's Guide. EPA/540/2-88/005, USEPA, Office of Emergency and Remedial Response, Washington, DC, 1988.

46. US Environmental Protection Agency, Guidance Document for the Preparation of Quality Assurance Project Plans. Office of Toxic Substances, Office of Pesticide and Toxic Substances; Battelle Columbus Division, Contract No. 68-02-4243, Washington, DC, 1987.

47. Worthington, J.C. & Lincicome, D., Internal and third party quality control audits: more important now than ever. In Proceedings of the Fifth Annual Waste Testing and QA Symposium, ed. D. Friedman. USEPA, Washington, DC, 1989, pp. 1-7.

48. US Environmental Protection Agency, NPDES Compliance Inspection Manual. EN-338, USEPA, Office of Water Enforcement and Permits, Washington, DC, 1984.

49. Carlberg, K.A., Miller, M.S., Tait, S.R., Beiro, H. & Forsberg, D., Minimal QA/QC criteria for field and laboratory organizations generating environmental data. In Proceedings of the Fifth Annual Waste Testing and QA Symposium, ed. D. Friedman. USEPA, Washington, DC, 1989, p. 321.

50. Liabastre, A.A., The A2LA accreditation approach: Lab certification from an assessor's point of view. Environmental Lab., 2 (Oct. 1990) 24-5.

51. American Association for Laboratory Accreditation, Environmental field of testing checklists for potable water, nonpotable water, and solid/hazardous waste. A2LA, Gaithersburg, MD, 1990.

52. US Environmental Protection Agency, Report on Minimum Criteria to Assure Data Quality. EPA/530-SW-90-021, USEPA, Office of Solid Waste, Washington, DC, 1990.

53. Ratliff, T.A., The Laboratory Quality Assurance System. Van Nostrand Reinhold, New York, 1990.

54. US Environmental Protection Agency, Handbook for Sampling and Sample Preservation of Water and Wastewater. Document No. EPA-600/4-82-029, Environmental Monitoring and Support Laboratory, Cincinnati, OH.

55. American Society for Testing and Materials, Standard practice for estimation of holding time for water samples containing organic and inorganic constituents. ASTM Designation D4841, Philadelphia, PA, 1988.

56. American Society for Testing and Materials, Standard guide for disposal of laboratory chemicals and samples. ASTM Designation D4447, Philadelphia, PA, 1990.

57. American Society for Testing and Materials, Standard practice for the preparation of calibration solutions for spectrophotometric and for spectroscopic atomic analysis. ASTM Designation E1330, Philadelphia, PA, 1990.

58. American Public Health Association, American Water Works Association and Water Pollution Control Federation, Standard Methods for the Examination of Water and Wastewater, 16th edn. American Public Health Association, Washington, DC.

59. US Department of Labor, Occupational Safety and Health Administration, Official analytical method manual. OSHA Analytical Laboratory, Salt Lake City, UT, 1985.

60. US Department of Health and Human Services, National Institute for Occupational Safety and Health, NIOSH manual of analytical methods, 3rd edn. NIOSH Publication No. 84-100, Cincinnati, OH, 1984.

61. Horwitz, W. (ed.), Official Methods of Analysis of the Association of Official Analytical Chemists, 14th edn. Association of Official Analytical Chemists, Washington, DC, 1985.

62. American Society for Testing and Materials, Standard practice for the verification and the use of control charts in spectrochemical analysis. ASTM Designation E1329, Philadelphia, PA, 1990.

63. American Society for Testing and Materials, Standard method of reporting results of analysis of water. ASTM Designation D596, Philadelphia, PA, 1983.

64. US Code of Federal Regulations, Title 29, Part 1910, Occupational Safety and Health Provisions, OSHA, 1990.

65. International Organization for Standardization, Quality - Vocabulary. ISO Designation 8402, Switzerland, 1986.

66. American Society for Testing and Materials, Standard terminology for statis­tical methods. ASTM Designation E456, Philadelphia, PA, 1990.

67. American Society for Testing and Materials, Standard terminology relating to chemical analysis of metals. ASTM Designation E1227, Philadelphia, PA, 1988.

68. American Society for Testing and Materials, Standard practice for use of the terms precision and bias in ASTM test methods. ASTM Designation E177, Philadelphia, PA, 1986.

69. American Society for Testing and Materials, Standard definitions of terms relating to water. ASTM Designation D1129, Philadelphia, PA, 1988.

70. American Society for Testing and Materials, Standard specification for reagent water. ASTM Designation D1193, Philadelphia, PA, 1983.

Index

Absolute deviation, mean, 24
Accuracy, 21, 283-4
  see also Precision
Adjusted dependent variables, 125
AF, see Autocorrelation function
Analysis of variance table, see ANOVA
Analytical methods, 290-1
Anderson's glyphs, 245
Andrew's curves, 245-6
ANOVA, 97-8, 198-201
Approximate Cook statistic, 128
Arithmetic mean, 12-7
  definition of, 12-23
  frequency tables and, 15-7
  properties of, 14-5
  see also Mean
Audits
  checklists, 282
  corrective action and, 282-3
  internal, 280-1
  third party, 281
  types of, 281-2
Autocorrelation function, 42
Autoregressive filter, 44
Autoregressive process, 43
Background samples, 278
Back-to-back stem-and-leaf display, 32-3, 233

Backwards elimination, 100
Beer-Lambert law, 206
Bias, 21, 274
Bimodal distribution, 17
Bivariate data representation, 231-6
  graphics for correlation/regression, 231-3
  regression analysis, diagnostic plots, 233-5
  robust/resistant methods, 236
Blanks, 278-9
Blank solution, 210-11
Box plots, 33-4
Box and whisker displays, 33-4
  extended versions, 221-2
  notched box, 222
  simple, 220-1
Business graphics, 251
Calibration, 206-12, 269-70, 274, 285-6
  blanks, 210-11
  linear, 206-10
  standard deviation, 211-12
Canonical link function, 121
Carbon dioxide, and time-series, 62-8
Casement plots, 239
Cellulation, 232
Central limit theory, 188-9
Centralised moving average filters, 61

Checklists, 282
Chemical disposal, 293-4
Chemicals, 268
Chernoff faces, 246-7
Classification
  of data, 286
  graphical, 249-50
Cluster analysis, 249-50
Coded symbols, on maps, 248
Coding, and standard deviation, 27-9
Coefficient of variation, see Relative standard deviation
Collection, of samples, 287-8
Collinearity, 100
Combined (field/laboratory) data assessment, 286
Complementary log-log link function, 129
Computer software, for illustrative techniques, 250-4
  business, 251
  general purpose, 251
  mapping, 251, 253
  see also specific names of
Confidence interval plots, 227
Confounded error, 198
Continuous data distribution, 227
Contour, 247-8
Contract audit, 282
Cook statistic, 128
Correction, 200
Corrective actions, 292
  audits, 282-3
Correlation, 87-91
  analysis, 42-44
  coefficient, 87
Correlative analysis, 139-80
  application, 155-76
  Eigenvector analysis, 142-55
  interpretation of data, 163-5
  receptor modelling, 165-71
  screening data, 155-62
Critical region, and significance testing, 191
Cumulative frequency graphs, 10-11
Cumulative frequency plots, 233
Cumulative tables, 6-7
CV, see Relative standard deviation

Data assessment, 283-6
  combined, 286
  field, 283-4
  laboratory, 284-6
Data behaviour types, 216
Data classification criteria, 286
Data distribution, graphical assessment, 225-8
  confidence interval plots, 227
  continuous data, 227-8
  discrete data distribution, 225
  Ord's procedure, 226
  Poissonness plots, 226-7
  quantile-quantile plots, 228
  suspended rootograms, 228-9
Data distribution shape, 229-31
  mid vs. spread plots, 230
  mid vs. z2 plots, 230
  pseudosigma vs. z2 plots, 230
  push back analysis, 230
  skewness and, 231
  upper vs. lower letter value plots, 229
Data interpretation, 163-5
Data quality audit, 282
Data quality objectives, 259-60
Data representation, see Visual data representation
Data screening, in factor/correlation analysis, 155-62
  decimal point errors, 155-7
  interfering variables, 158-62
  random variables, 157-8
Data storage, 269, 271-2
  document, 272-3
Data types, 217
Decimal point errors, 155-7
Decision limit, 204
Decomposition of variability, 95-9
Degrees of freedom, 92, 97
Deming-Mandel regression, 231-2
Dendrograms, 249-50
Depth, of data, 220
Descriptive statistics, 1-35
  diagrams, 7-11
  exploratory data analysis, 31-4
  measures of, 11-30
    dispersion, 21-30

    location, 11-21
    skewness, 30
  random variation, 1-2
  tables, 3-7
Designed experimentation, 86
Detection limits, 274-5
  analytical methods, 203-6
  definitions of, 276-7
Determination limit, 206
Deviance, 126-7
DHRSMOOTH, 58, 59, 61
Diagrams, 7-11
  cumulative frequency graphs, 10-11
  histograms, 8-10
  line, 7-8
Discrete data distributions, 225-7
Discrete variables, 2
  continuous, and, 2
Dispersion, 121
  measures of, 21-30
    interquartile range, 23-4
    mean absolute deviation, 24
    range, 22
    standard deviation, 24-9
    variance, 24-9
Dispersion matrix, 142-5
Display, see specific types of
DLM, see Dynamic Linear Model
Document storage, 272-3
Dot diagram, 7
Draftsman's plots, 239
Duplicate samples, 279
Dynamic autoregression, 50
Dynamic harmonic model, 52
Dynamic Linear Model, 49-53
EDA, see Exploratory data analysis
Eigenvector analysis, 142-55
  calculation in, 145-50
  dispersion matrix, 142-5
  factor axes rotation, 153-5
  number of retained factors, 151-3
Elongation, 231
Entropy spectrum, 46
Environmetrics, definition of, 37
Equipment blanks, 278
Error, 181-90
  central limit theory, 188-9
  confounded, 198
  distribution, 184-5
    normal, 185-8
  propagation of, 189-90
  types of, 181-4
    I/II, 203-6
    gross, 181-2
    random, 182
    systematic, 182-4
Error bias, 21-2
Error propagation, 189-90
  multiplicative combination, 189-90
  summed combinations, 189
Estimable combinations, 115
Explanatory variables, 80
Exploratory data analysis, 31-4, 215-6
  box-and-whisker displays, 33-4, 220-1
  stem-and-leaf displays, 33-4, 219-20
Factor analysis, 139-80
  applications of, 155-76
  data interpretation, 163-5
  data screening, 155-62
  receptor modelling, 165-71
Factor axes rotation, 153-5
FANTASIA, 171
Far out values, 221
Field blanks, 278
Field data assessment, 283-4
  comparisons, 284
  completeness of, 283
  quality plan variances, 284
  validity of, 283, 284
Field duplicates, 278
Field quality-control samples, 278
Field managers, 263-4
Field records, 271-2
Flattened letter values, 230
Flicker noise, 202
Forward interpolation, 56
Forward selection, 100

Four-variable, two-dimensional views, 239-42
Frequency-domain time-series, see Spectral analysis
Frequency plots, 218-9
Frequency tables, 3-6
F-test, 197-8
Gaussian letter spread, 223
Generalized linear model, 119-23
  deviance and, 126-7
  diagnostics, 127-8
  estimation, 123-6
  parts comprising, 119-23
  residuals and, 127-8
Generalized random walk, 50-3
GENSTAT, 133-4
Geometric mean, 20-1
GLIM, 133-4
Glyphs, 245
Graphical classification, 249-50
Graphical one-way analysis, 236-7
Graphs, 216, 217
Gross error, 181-2
Gumbel distribution, 132
Handling, of samples, 268, 288
Hanging histobars, 228-9
Harmonic regression, 62
Health & safety, 266-8, 293-4
Hierarchical cluster analysis, 249-50
Histobars, hanging, 228-9
Histograms, 8-10, 218-9
Hybrid displays, 217-8
Illustrative techniques, software for, 250-4
Inner fences, 221
Instrumentation, 269-70
Interfering variables, 158-62
Internal audits, 280-1
Interquartile range, 23-4
IRWSMOOTH, 57-9
  Mauna Loa CO2 data, and, 62-8
  Raup Sepkoski extinction data, and, 68-73
Isolated discrepancy, 102
Iterative weighted least squares, 125
Jackknifed deviance residuals, 127-8
Jackknifed residuals, 104-5
Jittered scatter plots, 236-7
'Junk', graphical, 237-8
Kalman filtering algorithm, 53-6
Kleiner-Hartigan trees, 243-4
Kolmogorov-Smirnov test, 219, 233
Laboratories, 266-7
  safety in, 293-4
Laboratory blanks, 279, 284
Laboratory control samples, 279
Laboratory data assessment, 284-6
Laboratory directors, 263-4
Laboratory quality-control samples, 279-80
Laboratory records, 271-2
Lag window, 46
'Lagrange Multiplier' vector, 54
Leaf, see Stem-and-leaf displays
Least squares
  estimates, 92
  method, 208
  weighted, 111-2
  see also Recursive estimation
Letter value displays, 223-5
Letter value plots, upper vs. lower, 229
Leverage, 102, 235
Likelihood, 94
  equation, 124
Limits of detection, 274-5
Linear calibration, 206-10
Linear correlation, 87
Linear regression, 91-112
  basics of, 91-5
  decomposition, 95-9

Linear regression-contd. model checking, 10 1-8 model selection, 99-10 I weighted least squares, 111-2

Line diagrams, 7-8 Link functions, 120 Local high data density plotting, 232 Location, measures of, ll-21

mean, 12-7; see also Arithmetic mean

median, 17-20 geometric, 20-1

mode, 17 Logistic, 129 Log likelihood, 93 Low level data, 285

Maintenance, 289 instrumentation, of, 269-70

Malinowski empirical indicator function, 152

Maps, 247-8 coded symbols, 248 graphs, combined, and, 249

Masking, 10 1 Material blanks, 278 Matrix effects, 285 Matrix spike, 279 Mauna Loa CO2 data, 62-8 Maximum entropy spectrum, 46 Maximum likelihood, 56-7 Mean

non stationarity of, 37, 39 standard error of, 188-9 (-test and, 194-6 see also Arithmetic mean and also

Geometric mean and Location, measures of

Mean absolute deviation, 24 Measurement system control, 273-80 Median, 17-20 Method blanks, 279 microCaptain, 40 Midplots, 230 MINIT AB, 222

Mode, 17 Moments, 30, 42 Multicode symbols, 241-7

Anderson's glyphs, 245 Andrew's curves, 245-6 assessment, 247 Chernoff faces, 246-7 Kleiner-Hartigan trees, 243-4 profile, 242 star, 242

Multidimensional data display, multicode symbols, 241-7

Multiple notched box plots, 236-7 Multiple one-dimensional scatter

plots, 236 Multiple whisker plots, 236-7 Multiplicative combination, of errors,

189-90 Multivariate environmental data,

139-80, 236-50 graphical classification, 249-80 graphical one-way analysis, 236-7 multicode symbols, 241-7 spatial display, 247-9 two dimensions, in, 237-41

Nesting, 126
Noise, see Signal-to-noise ratio
'Noise variance ratio' matrix, 54
Nonhierarchical cluster analysis, 249-50
Nonstationarity, 37
Nonstationary time-series analysis, 37-77
  model, 49-53
  model identification/estimation, 56-61
  practical examples, 61-73
  prior evaluation, 40-6
  recursive forecasting, 53-6
  smoothing algorithms, 53-6
Normal distribution, 184-8
Notched box, multiple, 236-7
Notched box plots, 222
Null hypothesis, 190-1
Nyquist frequency, 45

Observation, 86
Ogive, 10
One-way layout, 112
Ord's procedure, 226
Outer fences, 221
Outlier identification, 216
Outliers, 101, 235

Paired experimentation, and t-test, 196-7

Parametric nonstationarity, 59-60
Parent distribution, 185
Partial autocorrelation function, 44
Pearson residual, 128
Percentage relative standard deviation, 30
Performance audits, 281-2
Periodograms, 45
Personnel, and QA, 263-5
  support staff, 264-5
  technical staff, 264

Poissonness plots, 226-7
Population parameters, 25-6
Precision, 21, 183-4, 274
Prior evaluation, in time-series, 40-6
  correlation analysis, 42-4
  spectral analysis, 44-6

Probability plots, 228
Probit, 129
Profile symbols, 242
Pseudosigma, 230
Push back analysis, 230

Qualifications, of staff, 263-5
Quality assurance, 259-99
  audits, 280-3
  data assessment, 283-6
  document storage, 272-3
  equipment, 269-70
  facilities, 265-9
  function, 265
  management, 263-4
  measurement system control, 273-80
  organization, 261-5
  program, 260-1
  records, 271-2
  SOPs, 270-1, 287-94
  support staff, 264-5
  technical staff, 264
  use of subcontractors, 265

Quality systems audits, see Audits
Quantile-quantile plots, 228
  empirical, 233
Quartiles, 23-4
  lower, 23
  middle, 23
  upper, 23

Random errors, 182
Random variables, 157-8
Random variation, 1-2, 79-86
Range, 22
  interquartile, 23-4
Raup Sepkoski extinction data, 68-73
Receptor modelling, 165-71
Records, 271-2
  management, 269, 271-3, 293
  storage, 269

Recursive estimation, 46-9
Recursive forecasting, 53-6
  forecasting, 55
  forward interpolation, 56
  see also Smoothing algorithms

Reference materials, 279
Regression, 91-112, 209
  see also Linear regression
Regression lines, 209
Regulatory compliance audit, 282
Relative frequency, 4
Relative standard deviation, 29-30, 187
Repeatability, 184
Reporting
  limits, 274-5
  SOPs, of, 293
Reproducibility, 184
Residual plots, 234-5
Residuals, 102, 215
Residual standard deviation, 209

Residual sum of squares, 92, 209
Resistance, 215
Resistant methods, 236
Retained factors, in Eigenvector analysis, 151-3
RLS, see Recursive estimation
Robust methods, 236
Root-mean-square, 24
ROOTOGRAM, 228-9

Safety, 294
  chemicals, 268
  handling, 268
  laboratory, 266-7
  ventilation, 267-8
  waste, 268, 293-4
Samples
  autocorrelation function, 42
  collection of, 287-8
  covariance matrix, 42
  estimates, 25-6
  handling of, 268
  large, and Ord's procedure, 226
  partial autocorrelation function, 42
  quality control, 275-80
    field, 278
    laboratory, 279-80

Scatter plots, 87-91, 214, 219
  jittered, 236-7
  multiple one-dimensional, 236
  sharpened, 232
  two-dimensional, 231-2

Schuster periodograms, 45
Shade, 247-8
Sharpened scatter plots, 232
Shipping, 268
Signal-to-noise ratio, 201-3
Significance testing, 190-201
  F-test, 197-8
  t-test, 192-7
    means comparison, 194-6
    paired experimentation, 196-7
  variance analysis, see ANOVA
Sign test, 222
Skewness, 30, 231
Smoothing algorithm, 53-6
  forecasting, 55
  forward interpolation, 56
  Kalman filtering, 53-4

S/N, see Signal-to-noise ratio
Software, 251, 253
SOPs, 261, 270-1, 287-94
  analytical methods, 290-1
  chemicals, 293-4
  corrective action, 292
  data reduction, 292
  definition, 294
  health & safety, 294
  instrument maintenance, 289
  quality control, 291-2
  records management, 293
  reporting, 293
  sample control, 287-9
  standard preparation, 289
  terminology, 294
  validation, 292
  wastes, 293-4

Sorted binary plot, 240-1
Spatial data display, 247-9
  coded symbols, 248
  colour and, 247-8
  graph-map combinations, 249

Spectral analysis, 44-6
Spectral decomposition, 56-7
Spiking, 279
Standard addition, 211-2
Standard deviation
  calculations, 26-7
    frequency tables, 29
    shortcut, 27
  coding, 27-9
  definition, calculation, 26-7
  frequency, see Relative standard deviation
  population parameters, 25-6
  residual, 209, 285
  sample estimates, 25-6
  shortcut calculation, 27

Standard error, of sample mean, 188
Standard operating procedures, see SOPs
Standard preparation, 289
Standardized Pearson residual, 128
Standardized residuals, 101

Star symbols, 242
StatGraphics™, 40
Stationarity, time-series, 41
Stationary time-series, 41
Statistical software, 251, 253
Stem-and-leaf display, 31-3, 219-20, 233
  back-to-back, 32-3, 233

Stepwise regression, 100
Storage, 268, 288
  data, of, see Data storage
  documents, of, 272-3
  see also under Health & safety

Strip box plots, 233
Student's t-test, see t-test
Subcontractors, 265
Summed combinations, in error propagation, 189
Sunflower symbols, 232
Support staff, 264-5
Surrogate standards, 279-80
Suspended rootograms, 228-9
SYSTAT, 222, 246
System audits, 281
Systematic discrepancy, 101
Systematic errors, 21, 282-4
Systematic variation, 79-86

Tables, 216
  presentation of, 3-7
  cumulative, 6-7
  frequency, 3-6, 15-7, 29
  mean, of, 15-7
  standard deviation, of, 29

Target transformation factor analysis (TTFA), 169-76

t-distribution, 192-3
Technical staff, 264
Third party audits, 281
Three-variable, two-dimensional views, 239
Time-domain time-series, see Correlation analysis
Time-series
  analysis, see Nonstationary time-series analysis
  frequency domain, in, see Spectral analysis
  time domain, in, see Correlation analysis
Time-series models, 49-53
  DLR, 50-3
  identification/estimation, 56-7
  IRWSMOOTH, 57-9
  parametric nonstationarity, 59-60
  seasonal adjustment, 60-1

Time variable parameters, 46-9
  DLM, 49-53

Transformations, 108-10
Trees, see Kleiner-Hartigan trees
Trip blanks, 278
t-test, 192-7
  means comparison, 194-6
  paired experimentation, 196-7

Tukey analysis, 31-4
Two-dimensional multivariate data illustrations, 237-41
  casement plots, 239
  draftsman's plots, 239
  four or more variables, 239-41
  one-dimensional views, 238-9
  three variables, 239
Two-dimensional scatter plots, 231-2
Type I/II errors, 203-6

Uncorrelated variables, 87
Unimodal distribution, 17
Univariate methods, see Nonstationary time-series analysis

Validation, 283-4, 292
Variability, decomposition, 95-9
Variable transformation, 234-5
Variables, see specific types of
Variance, 24-9, 112-9
  see also Standard deviation
Variance analysis, see ANOVA

Variance function, 122
Variance intervention, 59-60
Variates, 2
Variation, 79-86
  see also Random variation and also Relative standard deviation

Vectors, see Eigenvector analysis
Ventilation systems, 267-8
Visual data representation, 213-58
  bivariate data, 231-6
  data types, 217
  display, 217-8
  exploratory data analysis, 215-7
  multivariate data, 236-50
  single variable distributions, 218-31
  software for illustrations, 250-4
  uses/misuses, 213, 215

Waldmeier sunspot data, 38-9
Waste, 268
  disposal, 288-9, 293-4
Weibull distribution, 111
Weighted least squares, 111-2
Whisker plots, 222
  multiple, 236-7
  see also Box-and-whisker displays

Z2 plots, 230