Computers & Geosciences 33 (2007) 1233–1260
www.elsevier.com/locate/cageo

Determining the effect of asymmetric data on the variogram. II. Outliers

R. Kerry (a), M.A. Oliver (b)
(a) Department of Geography, Brigham Young University, Provo, Utah, USA
(b) Department of Soil Science, University of Reading, Reading, England
Corresponding author: R. Kerry. E-mail address: [email protected]
Received 21 May 2005; accepted 26 July 2006

Abstract
Asymmetry in a distribution can arise from a long tail of values in the underlying process or from outliers that belong to another population that contaminate the primary process. The first paper of this series examined the effects of the former on the variogram and this paper examines the effects of asymmetry arising from outliers. Simulated annealing was used to create normally distributed random fields of different size that are realizations of known processes described by variograms with different nugget:sill ratios. These primary data sets were then contaminated with randomly located and spatially aggregated outliers from a secondary process to produce different degrees of asymmetry. Experimental variograms were computed from these data by Matheron's estimator and by three robust estimators. The effects of standard data transformations on the coefficient of skewness and on the variogram were also investigated. Cross-validation was used to assess the performance of models fitted to experimental variograms computed from a range of data contaminated by outliers for kriging.
The results showed that where skewness was caused by outliers the variograms retained their general shape, but showed an increase in the nugget and sill variances and nugget:sill ratios. This effect was only slightly more for the smallest data set than for the two larger data sets and there was little difference between the results for the latter. Overall, the effect of size of data set was small for all analyses. The nugget:sill ratio showed a consistent decrease after transformation to both square roots and logarithms; the decrease was generally larger for the latter, however. Aggregated outliers had different effects on the variogram shape from those that were randomly located, and this also depended on whether they were aggregated near to the edge or the centre of the field. The results of cross-validation showed that the robust estimators and the removal of outliers were the most effective ways of dealing with outliers for variogram estimation and kriging.
© 2007 Elsevier Ltd. All rights reserved.
0098-3004/$ - see front matter. doi:10.1016/j.cageo.2007.05.009
Keywords: Geostatistics; Normality; Outliers; Simulation; Skewness; Variogram; Robust estimators; Data transformation

1. Introduction
Departures from normality can arise from a long tail of larger or smaller values in the underlying process; in our first paper in this issue (Kerry and Oliver, 2007), we examined the effect of this on the variogram. Here, we focus on the presence of a relatively small number (ε) of extreme values from another population (C) that contaminate a primary Gaussian, N_P(0,1), process (P); these values may be considered as outliers. There has been much
discussion about what values in a distribution constitute outliers and how to deal with them (see Barnett and Lewis (1994) for a thorough discussion on this subject). In summary, outliers are either distributional in nature and are usually obvious as particularly large or small values in a histogram or box and whisker plot (Tukey, 1977), or they can be spatial, whereby the values are particularly different from other values in their spatial vicinity. The latter need not be distributional outliers and may be identified to an extent from a pixel map of values or by other more elaborate methods such as those proposed by Gnanadesikan and Kettenring (1972) and Haslett et al. (1991), for example. For the purpose of this investigation, we assume that the outliers are distributional, are not erroneous values and have been identified by a thorough preceding exploratory data analysis.

Environmental data frequently have asymmetric distributions caused by a small number of marginal values or outliers from a secondary process. In the spatial context, the secondary process may give rise to randomly located outliers, such as localized enrichments of ores, additions to the soil by animal faecal deposits (McBratney and Webster, 1986), accidental localized spills of fertilizers and so on. Outliers can also be spatially aggregated, for example, on industrial sites where the secondary process has led to localized contamination of the surface materials at isolated points or in fields where animal faecal deposits are in one part of the field. These can all be regarded as quasi-point processes superimposed on a primary continuous process.

Although Matheron's (1965) variogram estimator is asymptotically unbiased for any intrinsic random function (Cressie, 1993), it is sensitive to departures from normality or from a symmetric distribution (Webster and Oliver, 2001) because it is based on squared differences. It is particularly sensitive to outlying values of Z and even a single outlier can distort the experimental variogram because it might be involved in several paired comparisons over many or all lag intervals. Several robust estimators of the variogram have been devised to solve the problem of asymmetry resulting from outliers, such as those of Armstrong and Delfiner (1980), Cressie and Hawkins (1980), Dowd (1984) and Genton (1998a). Lark (2000) investigated three of these robust variogram estimators with simulated and real soil data contaminated by outliers. He showed that robust estimators were generally useful for data contaminated with outliers, but not for data where the asymmetry has a more general underlying cause. This was to be expected because an underlying assumption of these robust estimators is that the data have a contaminated normal distribution. Genton (1998b) further showed that the shape of various robust estimators changed in response to the presence of different proportions of outliers. He used the term 'breakdown point' to refer to the number of outliers necessary to make an estimator explode (tend towards infinity) or implode (tend to 0). Lark (2000) concluded from his investigation that robust estimators are not a substitute for a thorough exploratory data analysis with appropriate editing and transformation of the data prior to variography.

Although we consider robust variogram estimators here, we are aware that analysts do not always have access to appropriate software to compute them. Therefore, based on this and on Lark's (2000) comment mentioned above, we also consider how one should proceed with exploratory data analysis, editing and transformation of data prior to geostatistical analysis where data are contaminated by outliers. The procedure generally agreed on in geostatistical texts is summarized in Fig. 1 of Kerry and Oliver (2007); however, this is based on informed intuition rather than rigorous investigation. If the skewness is outside the bounds of ±1, the histogram and/or box and whisker plot should be investigated. If the asymmetry is caused by outliers, this is often more evident in a schematic box and whisker plot than in a histogram, and the extreme values should be investigated further. If they are clearly the result of errors in the assembly of data or laboratory analysis, they should be removed permanently from the data. If they are true values they are likely to be of interest, particularly in pollution studies. Outliers can be treated as a separate statistical population and removed for computing the variogram because it is often the variogram of the underlying process that is of interest (Cressie, 1993). The removal of outliers is less problematic in spatial than classical statistics because this action does not affect the randomness of the sample (Barnett and Lewis, 1994). Transformation of the data can also be considered, but Goovaerts (1997) has indicated that this is not ideal if the aim is prediction. In general, those who ultimately use the predictions, such as land managers, environmental scientists and so on, want values on the original scale of measurement, which involves a back transformation. For square root and logarithmic transformations, the back-transform tends to exaggerate any error associated with prediction through squaring and exponentiation, respectively. This can affect extreme values the most, which are of most interest in pollution studies. Therefore, one should question the appropriateness of any data transformation where asymmetry is caused by outliers.

Kerry and Oliver (2007) showed that the effect of underlying asymmetry on the variogram was less for large sets of data than for small ones, as illustrated in Figs. 2 and 8 of that paper. The effects of sample size, the degree of continuity in the variation and the location of outliers, however, have not been investigated thoroughly where asymmetry has been caused by outliers. This paper explores the following in this context:

• Do similar degrees of asymmetry in the distribution caused by outliers affect the variogram equally for data sets of different size?
• For a given sample size how do different degrees of asymmetry affect data generated by variogram functions with different nugget:sill ratios, i.e. different degrees of spatial continuity?
• Do randomly located and spatially aggregated outliers for different degrees of asymmetry influence the variogram differently?
• To what extent do the removal of outliers, data transformations and robust variogram estimators solve the effects of asymmetry caused by outliers on the variogram and the accuracy of prediction?

With greater insight into the effects of outliers on the variogram, the standard best practice described above will be appraised and suggestions made as to how it might be improved.

2. Methods

2.1. Simulation of two-dimensional data contaminated by outliers

Twelve random fields with a standard normal distribution, N_P(0,1), were simulated using the simulated annealing procedure of Deutsch and Journel (1992) for a 200 m × 200 m hypothetical field at the nodes of 5-, 10- and 20-m grids, to give 1600, 400 and 100 data, respectively. These data are realizations of the primary process N_P(μ_P, σ_P, φ_P), where φ is a vector of spatial parameters; in this case, a spherical function with a sill variance (c0+c) of 1, a range (a) of 75 m and nugget variances (c0) of 0, 0.25, 0.5 and 0.75. At a proportion, ε, of the sites, selected at random, values from a secondary random process N_C(μ_C, σ_C) were added to those of the primary Gaussian process. Two rates of contamination were used: ε = 0.02 and 0.05; the latter was the main focus of attention because the smaller rate gave only two contaminated sites for data on the 20-m grid. The contaminants were drawn from random, normally distributed populations with different means, N_C(1,1), N_C(1.25,1), N_C(1.5,1), …, N_C(10,1), and added to the original values of the primary process to give skewness coefficients of 0.5, 1.0, 1.5, 2.0 and 3.0. Table 1 gives the coefficient of skewness, means of the secondary process and the number of randomly located sites contaminated for the 0.05 rate of contamination for each set of data simulated using a variogram with no nugget variance. The mean of the secondary process used to produce a desired coefficient of skewness varied slightly between the data sets because of differences in the original values of the primary process and the locations selected at random for contamination. The overall distribution function of the data is given by

$$Z(x) = \{(1-\varepsilon)\,N_P(\mu_P,\sigma_P,\varphi_P) + \varepsilon\,N_C(\mu_C,\sigma_C)\}. \tag{1}$$
Table 1
Coefficient of skewness, means of the secondary process and number of sites for a rate of contamination of 0.05 for each set of data produced by simulated annealing using a variogram with a nugget:sill ratio of 0

Grid interval   Coefficient   Mean of secondary   Number of sites in
of data (m)     of skewness   process             secondary process
 5              0.5            2.00               80
 5              1.0            3.50               80
 5              1.5            4.50               80
 5              2.0            5.50               80
 5              3.0            9.00               80
10              0.5            2.50               20
10              1.0            4.00               20
10              1.5            5.00               20
10              2.0            6.00               20
10              3.0           10.00               20
20              0.5            3.25                5
20              1.0            4.25                5
20              1.5            5.25                5
20              2.0            6.50                5
20              3.0           10.00                5
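As an illustration of the generating functions described in Section 2.1, the following is a minimal sketch of the spherical variogram model with a total sill (c0+c) of 1 and a range of 75 m for the four nugget variances. The function name `spherical_variogram` and its argument names are illustrative assumptions and are not taken from the paper or from any particular software.

```python
import numpy as np

def spherical_variogram(h, nugget, partial_sill, a):
    """Spherical model (illustrative helper, not from the paper):
    c0 + c*(1.5*h/a - 0.5*(h/a)**3) for 0 < h <= a, c0 + c for h > a, 0 at h = 0."""
    h = np.asarray(h, dtype=float)
    ratio = np.minimum(h / a, 1.0)          # the model is flat beyond the range
    gamma = nugget + partial_sill * (1.5 * ratio - 0.5 * ratio**3)
    return np.where(h > 0, gamma, 0.0)      # the nugget applies only for h > 0

# The four generating functions: total sill 1, range 75 m
lags = [0.0, 5.0, 37.5, 75.0, 100.0]
for c0 in (0.0, 0.25, 0.5, 0.75):
    print(c0, spherical_variogram(lags, c0, 1.0 - c0, 75.0))
```

With no nugget the model reaches the sill of 1 exactly at the 75-m range; with a nugget the curve starts at c0 for any lag greater than 0.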
In addition, the primary process was contaminated at a rate of 0.05 in such a way that the outliers were aggregated either near the edge or centre of the field. A random location for an outlier was selected either near to the edge or centre of the field; the remaining outliers (ε − 1) were then placed at the appropriate number of surrounding sites. The contaminants at the spatially aggregated locations were drawn from the same populations as described above and added to the original values of the primary process to give skewness coefficients of 0.5, 1.0, 1.5, 2.0 and 3.0. Although the outliers are spatially aggregated, they are spatially independent in this case. This might not always be the case; for example, a large chemical spill in a restricted area could result in spatially dependent sample data. We have not considered this scenario here as it did not accord with our underlying model, Eq. (1) above.
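The contamination step can be sketched as follows. This is only an illustration under stated assumptions: the primary field is a spatially uncorrelated N(0,1) stand-in rather than a simulated-annealing realization, and the helper names (`skewness`, `aggregated_sites`, `contaminate`) are hypothetical. Aggregated outliers are placed at the nodes closest to a chosen seed node, which is one simple reading of "surrounding sites".

```python
import numpy as np

def skewness(x):
    """Moment coefficient of skewness (illustrative helper)."""
    d = np.asarray(x, dtype=float).ravel()
    d = d - d.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

def aggregated_sites(shape, n_out, seed_rc):
    """Flat indices of the n_out grid nodes closest to the seed node (row, col)."""
    rows, cols = np.indices(shape)
    d2 = (rows - seed_rc[0]) ** 2 + (cols - seed_rc[1]) ** 2
    return np.argsort(d2.ravel(), kind="stable")[:n_out]

def contaminate(primary, rate, mean_c, rng, sites=None):
    """Add N(mean_c, 1) contaminants to a proportion `rate` of the nodes."""
    z = primary.copy()
    n_out = int(round(rate * z.size))
    if sites is None:                        # randomly located outliers
        sites = rng.choice(z.size, size=n_out, replace=False)
    flat = z.ravel()                         # view into the contiguous copy
    flat[sites] += rng.normal(mean_c, 1.0, size=len(sites))
    return z, np.asarray(sites)

rng = np.random.default_rng(0)
primary = rng.standard_normal((20, 20))     # stand-in for the 10-m grid (400 nodes)
z_rand, s_rand = contaminate(primary, 0.05, 4.0, rng)
centre = aggregated_sites(primary.shape, 20, (10, 10))
z_aggr, _ = contaminate(primary, 0.05, 4.0, rng, sites=centre)
print(len(s_rand), skewness(primary), skewness(z_rand), skewness(z_aggr))
```

In practice the mean of the secondary process would be adjusted, as in Table 1, until the desired coefficient of skewness is reached.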
2.2. Approaches to reduce asymmetry
Values that have been identified as distributional outliers are often removed from the data to achieve a normal or near-normal distribution. However, as mentioned above, there is some reluctance in doing this when the outliers are associated with contaminated sites and are the values of most interest or concern. Therefore, we also transformed the data to square roots and common logarithms (log10) to assess the extent to which they reduced asymmetry and the effects on the variograms computed from the range of data sets described above after transformation. A consistent procedure was adopted before transformation: a constant of 4 was added to each value in the data to make all values just positive as described in Kerry and Oliver (2007).
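The transformation step can be illustrated with a short sketch. The data below are a deterministic, right-skewed stand-in (a symmetric bulk plus a small upper tail), not the simulated fields of the paper, and the function names are assumptions made for the example. The shift of 4 follows the procedure described above.

```python
import numpy as np

def skewness(x):
    """Moment coefficient of skewness (illustrative helper)."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

def shift_and_transform(z, shift=4.0):
    """Add a constant so all values are positive, then take square roots and log10."""
    shifted = np.asarray(z, dtype=float) + shift
    if np.any(shifted <= 0):
        raise ValueError("the shift does not make every value positive")
    return np.sqrt(shifted), np.log10(shifted)

# Stand-in data: 380 'primary' values and 20 large values (a 0.05 rate)
z = np.concatenate([np.linspace(-2.5, 2.5, 380), np.linspace(3.0, 6.0, 20)])
z_sqrt, z_log = shift_and_transform(z)
s_raw, s_sqrt, s_log = skewness(z), skewness(z_sqrt), skewness(z_log)
print(s_raw, s_sqrt, s_log)
```

Both transformations are concave, so each reduces the coefficient of skewness of right-skewed data, the logarithm more strongly than the square root; whether the result is closer to 0 depends on the original skewness.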
2.3. Matheron's variogram estimator and robust variogram estimators

Omni-directional experimental variograms were computed using Matheron's (1965) estimator as in Kerry and Oliver (2007); it is given by

$$\hat{\gamma}_M(h) = \frac{1}{2m(h)} \sum_{i=1}^{m(h)} \{z(x_i) - z(x_i+h)\}^2, \tag{2}$$

where $\hat{\gamma}_M(h)$ is the semivariance at a given lag distance h, z(x_i) and z(x_i+h) are the observed values of Z at x_i and x_i+h, and m(h) is the number of paired comparisons at lag h.

Experimental variograms were computed on the simulated normally distributed data and on those contaminated with outliers, with initial lag intervals based on the grid spacings of the data of 5, 10 and 20 m. They were then modelled by weighted least-squares approximation using GenStat (Payne, 2006). In addition, variograms were computed on data transformed to square roots and common logarithms (log10), and with outliers removed, using Eq. (2) and with robust estimators. Cressie (1993) uses the term 'robust' to describe inference procedures that are stable when model assumptions depart from those of a central model, for example by a small amount of contamination by an independent Gaussian process. Robust variogram estimators are a possible solution to the problem of outliers because the goal is to estimate the variogram of the non-contaminated part of the data, (1 − ε)P. Consequently, they are less sensitive to outliers than is Eq. (2).

Lark (2000) gives a succinct description of robust variogram estimators and their properties, and we do not repeat this here. We have used the same three robust estimators as used by Lark (2000), namely those of Cressie and Hawkins (1980), Dowd (1984) and Genton (1998a). We summarize these estimators below following Lark (2000).

Cressie and Hawkins' (1980) estimator estimates the variogram at lag h for a primary process with a normal distribution of differences, Z(x) − Z(x+h), and damps the effect of outliers from a secondary process. For a given lag, it is an estimation of the location (first-order moment) of the squared differences. Cressie and Hawkins' (1980) estimator is based on taking the fourth roots of the squared differences, and it is given by

$$2\hat{\gamma}_{CH}(h) = \frac{\left\{ \dfrac{1}{m(h)} \sum_{i=1}^{m(h)} |z(x_i) - z(x_i+h)|^{1/2} \right\}^{4}}{0.457 + 0.494/m(h) + 0.045/m^2(h)}. \tag{3}$$

The denominator in Eq. (3) is a correction based on the assumption that the underlying process to be estimated has normally distributed differences over all lags. Genton (1998a) says that Cressie and Hawkins' estimator is not really a solution to the problem because a single outlier can still have an adverse effect.

Cressie (1993) suggests that variogram estimation can also be regarded as a problem of identifying the scale at various lags, i.e. the second-order moment of the differences, Z(x) − Z(x+h), and that this approach might be the most suitable for data contaminated by outliers. The estimators of Dowd (1984) and of Genton (1998a) are both scale estimators. They estimate the variogram for a dominant intrinsic process, for which the differences Z(x) − Z(x+h) are normal, in the presence of outliers from a secondary process. Dowd's (1984) estimator is given as

$$2\hat{\gamma}_D(h) = 2.198\,\{\mathrm{median}(|y_i(h)|)\}^2, \tag{4}$$

where y_i(h) = z(x_i) − z(x_i+h), i = 1, 2, …, m(h). The term within the braces is the median absolute pair difference (MAPD) for lag h, which is a scale estimator only for variables where the expectation of the differences is 0. In addition, the pair difference must be distributed symmetrically so that the expectation of the median pair difference is 0. The constant, 2.198, in Eq. (4) is a correction for consistency that scales the MAPD to the standard deviation of a normally distributed population.

Genton's (1998a) estimator is based on the scale estimator, Q_N, of Rousseeuw and Croux (1992, 1993). The quantity Q_N is given by

$$Q_N = 2.219\,\{|X_i - X_j|;\ i<j\}_{\binom{H}{2}}, \tag{5}$$

where the subscript denotes the $\binom{H}{2}$th order statistic of the absolute differences, the constant 2.219 is a correction for consistency with the standard deviation of the normal distribution, and H is the integral part of (N/2)+1. Genton's (1998a) estimator uses Eq. (5) as an estimator of scale applied to the differences at each lag; it is given by

$$2\hat{\gamma}_G(h) = \left[ 2.219\,\{|y_i(h) - y_j(h)|;\ i<j\}_{\binom{H}{2}} \right]^2, \tag{6}$$

where y_i(h) is the same as for Eq. (4), and now H is the integral part of {m(h)/2}+1.
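The four estimators of Eqs. (2)-(6) can be sketched for a single lag as below. This is a minimal illustration, not the software used in the paper: the pair differences are taken along a regular one-dimensional transect, the data are spatially uncorrelated N(0,1) values contaminated at a rate of 0.05 (so the uncontaminated semivariance is about 1 at every lag), and all function names are assumptions. Each function returns the semivariance, i.e. half of the quantity on the left-hand side of the corresponding equation.

```python
import numpy as np
from math import comb

def pair_differences(z, lag):
    """y_i(h) = z(x_i) - z(x_i + h) on a regular transect; lag in grid steps (>= 1)."""
    z = np.asarray(z, dtype=float)
    return z[:-lag] - z[lag:]

def matheron(y):
    return 0.5 * np.mean(y**2)                              # Eq. (2)

def cressie_hawkins(y):
    m = y.size
    num = np.mean(np.sqrt(np.abs(y))) ** 4
    return 0.5 * num / (0.457 + 0.494 / m + 0.045 / m**2)   # Eq. (3)

def dowd(y):
    return 0.5 * 2.198 * np.median(np.abs(y)) ** 2          # Eq. (4)

def genton(y):
    m = y.size
    if m < 2:
        raise ValueError("at least two pair differences are needed")
    k = comb(m // 2 + 1, 2)                                 # order statistic index
    i, j = np.triu_indices(m, k=1)                          # all pairs with i < j
    d = np.sort(np.abs(y[i] - y[j]))
    return 0.5 * (2.219 * d[k - 1]) ** 2                    # Eqs. (5) and (6)

rng = np.random.default_rng(42)
clean = rng.standard_normal(400)
dirty = clean.copy()
sites = rng.choice(400, size=20, replace=False)             # 0.05 rate
dirty[sites] += rng.normal(5.0, 1.0, size=20)
y = pair_differences(dirty, lag=1)
for name, est in [("Matheron", matheron), ("Cressie-Hawkins", cressie_hawkins),
                  ("Dowd", dowd), ("Genton", genton)]:
    print(f"{name:16s} {est(y):.3f}")
```

Genton's estimator is computed here by brute force over all pairs of differences, which is adequate for a few hundred pairs per lag but scales quadratically with m(h).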
2.4. Cross-validation

Cross-validation was done as described in Kerry and Oliver (2007) for a selection of the fitted models and associated data sets. The method used involved removing each datum in turn and then kriging at the point with the relevant model parameters and neighbouring data points. The diagnostic statistics derived from cross-validation for this investigation were the mean error (ME), mean squared error (MSE), mean squared deviation ratio (MSDR) and median squared deviation ratio (MeSDR). The ratios are derived from the squared errors and can be used to distinguish between variogram models for the range of data examined. Lark (2000), however, recommended the MeSDR to determine the best model for kriging with skewed data because the mean is affected by asymmetry, which means that the MSDR is not robust if the data are contaminated by outliers.

The ME is given by

$$\mathrm{ME} = \frac{1}{N} \sum_{i=1}^{N} \{z(x_i) - \hat{z}(x_i)\}$$

and the MSE is given by

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \{z(x_i) - \hat{z}(x_i)\}^2,$$

where N is the number of data values, z(x_i) the true value at x_i and $\hat{z}(x_i)$ the estimated value there. The MSDR is

$$\mathrm{MSDR} = \frac{1}{N} \sum_{i=1}^{N} \frac{\{z(x_i) - \hat{z}(x_i)\}^2}{\hat{\sigma}^2(x_i)},$$

where $\hat{\sigma}^2(x_i)$ is the kriging variance at the point. The closer the MSDR is to 1, the better the model is for kriging.

The MeSDR was determined by dividing the squared errors by the kriging variances for each data point, and then ordering the values; the middle value was taken as the MeSDR. When the correct model is used for kriging, the MeSDR should be close to 0.455, which is the median of the standard χ² distribution with one degree of freedom.

3. Results and discussion

Fig. 1 shows the schematic box and whisker plots for the 10-m data for the range of skewness coefficients examined arising from randomly located outliers. The graphs show the box, which contains the middle 50% of the distribution, and the horizontal line is the median. The circles beyond the whiskers are large values at the margins of the distribution and the crosses are the outliers, which are beyond three times the inter-quartile range. For the normal distribution and a skewness coefficient of 0.5 (Fig. 1a and b, respectively) there are no crosses, indicating that the marginal values are not extreme. As the asymmetry increases, the number of large values increases at the extremes on the positive side of the distribution.
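The cross-validation diagnostics of Section 2.4 (ME, MSE, MSDR and MeSDR) reduce to a few lines once the true values, the kriged estimates and the kriging variances are available. The sketch below assumes those three arrays have already been produced by some kriging routine; the function name `cv_diagnostics` and the synthetic check are illustrative assumptions, not part of the paper's workflow.

```python
import numpy as np

def cv_diagnostics(z, z_hat, krig_var):
    """ME, MSE, MSDR and MeSDR from true values, estimates and kriging variances."""
    z, z_hat, krig_var = (np.asarray(a, dtype=float) for a in (z, z_hat, krig_var))
    err = z - z_hat
    sdr = err**2 / krig_var                  # squared deviation ratios
    return {"ME": err.mean(), "MSE": np.mean(err**2),
            "MSDR": sdr.mean(), "MeSDR": np.median(sdr)}

# Synthetic check: errors that are exactly N(0, kriging variance)
rng = np.random.default_rng(7)
kv = rng.uniform(0.2, 1.5, size=20000)       # hypothetical kriging variances
errors = rng.standard_normal(20000) * np.sqrt(kv)
stats = cv_diagnostics(errors, np.zeros_like(errors), kv)
print(stats)
```

When the errors are consistent with the kriging variances, as in this synthetic case, the MSDR is close to 1 and the MeSDR close to 0.455; a few very large errors inflate the MSDR far more than the MeSDR.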
Fig. 1. Schematic box and whisker plots for data on the 10-m grid and 0 nugget variance for (a) a normal distribution, and skewness coefficients of: (b) 0.5, (c) 1.0, (d) 1.5, (e) 2.0 and (f) 3.0; the circles represent large values in the margin of the distribution and the crosses are outliers.
Fig. 2. Experimental variograms computed from data on 5-, 10- and 20-m grids simulated by a variogram function with a nugget:sill ratio of 0 (solid line) for (a) a normal distribution, and skewness coefficients of: (b) 0.5, (c) 1.0, (d) 1.5, (e) 2.0 and (f) 3.0 caused by randomly located outliers. Axes: lag distance (m) against variance.
3.1. The effect of randomly located outliers on variograms computed from simulated data of different sample sizes

Fig. 2 shows the experimental variograms computed from the three sizes of data set simulated with a nugget:sill ratio of 0 for the range of skewness coefficients (0, 0.5, 1.0, 1.5, 2.0 and 3.0) resulting from randomly located outliers with a rate of contamination of 0.05. The exhaustive variograms of the normally distributed data are very similar to those used to simulate the data (the solid line in Fig. 2); therefore, the latter were used to assess the effects of asymmetry in the data on the variogram as in Kerry and Oliver (2007). Tables 2 and 3 give the parameters of the models fitted to the range of data. The general pattern for all sizes of data set is that as skewness increases the sill and nugget variances increase quite dramatically; this is particularly so when the skewness coefficient reaches 3.0 and the vertical scale of the graph (Fig. 2f) is larger. Cressie and Hawkins (1980), Genton (1998a) and Lark (2000) all noted that the effect of outliers on the variogram was to increase the nugget and sill variance. Fig. 3a and b summarizes this effect of increasing asymmetry in the data for the 5- and 20-m data sets for a rate of contamination by outliers of 0.05. Tables 2 and 3 also indicate an increase in nugget:sill ratio as asymmetry in the distribution increases. For skewness coefficients ≥2, the variogram tends towards pure nugget. Genton (1998a) also noted that Matheron's estimator tended towards pure nugget when the proportion of outliers increased to 20–30% of the data, which we can interpret as resulting in greater asymmetry.

For all coefficients of skewness >0, the 5-m data (1600 sites) have variograms with the smallest nugget and sill variances (Fig. 2 and Table 2). There is little difference between the variograms for the 10-m (400 sites) and 20-m (100 sites) data, except for the skewness coefficient of 2.0 (Fig. 2e). Overall, the effect of the size of data set is less than that observed where asymmetry is caused by a long tail in the distribution (see Kerry and Oliver, 2007). As the asymmetry caused by randomly located outliers increases, the variograms for all sizes of data set are affected similarly (Fig. 2). Changes in the variogram caused by skewness coefficients >1 suggest a need to mitigate the effects of the outliers before computing it.

Fig. 3c and d shows the experimental variograms for the 5- and 10-m data, with skewness coefficients from 0 to 2.0 caused by contamination with outliers at a rate of 0.02. We excluded the 20-m data from this comparison because there were just two outliers. The nugget and sill variances (Fig. 3c and d and Table 4) are smaller than those for the larger rate of contamination for both sizes of data set (Tables 2–4). These graphs (Fig. 3c and d) further confirm that the effect of data-set size is less than that of either the rate of contamination or degree of asymmetry. The difference in nugget:sill ratio between the 0.02 and 0.05 rates of contamination is least for a coefficient of skewness of 0.5 and this difference increases gradually from 0.033 to 0.230 for the 5-m data and from 0.047 to 0.325 for the 10-m data as the skewness increases.

Table 2
Parameters of models fitted to experimental variograms computed from data on 5- and 20-m grids generated by a variogram with a nugget:sill ratio of 0 and with asymmetry caused by randomly located outliers at a rate of contamination of 0.05

Coefficient   Grid interval   Model type   c0      c       a (m)   c0+c    c0:(c0+c)
of skewness   of data (m)
0              5              Spherical    0       1       75.0    1.00    0
0.5            5              Spherical    0.340   0.948   70.2    1.288   0.264
1.0            5              Spherical    0.847   0.901   71.4    1.748   0.484
1.5            5              Spherical    1.320   0.866   73.1    2.186   0.604
2.0            5              Circular     1.931   0.796   68.3    2.728   0.708
3.0            5              Circular     5.248   0.704   80.3    5.952   0.882
0             20              Spherical    0       1       75.0    1.00    0
0.5           20              Spherical    0.364   1.030   57.0    1.394   0.261
1.0           20              Spherical    0.520   1.220   50.5    1.740   0.299
1.5           20              Spherical    0.981   1.289   54.3    2.270   0.432
2.0           20              Spherical    1.374   1.622   48.6    2.996   0.459
3.0           20              Circular     3.674   2.221   40.8    5.895   0.623
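As a worked check of the nugget:sill ratios discussed in Section 3.1, the ratio c0:(c0+c) can be recomputed from the tabulated nugget and sill variances of Table 2. The arrays below simply transcribe the contaminated rows of that table; the tolerance of 0.001 is an assumption made to allow for the rounding of the published values to three decimal places.

```python
import numpy as np

# Rows of Table 2 for skewness coefficients 0.5, 1.0, 1.5, 2.0 and 3.0
c0_5m   = np.array([0.340, 0.847, 1.320, 1.931, 5.248])
sill_5m = np.array([1.288, 1.748, 2.186, 2.728, 5.952])
c0_20m   = np.array([0.364, 0.520, 0.981, 1.374, 3.674])
sill_20m = np.array([1.394, 1.740, 2.270, 2.996, 5.895])

ratio_5m = c0_5m / sill_5m        # nugget:sill ratio, 5-m grid
ratio_20m = c0_20m / sill_20m     # nugget:sill ratio, 20-m grid
print(np.round(ratio_5m, 3))
print(np.round(ratio_20m, 3))
```

The recomputed ratios rise steadily with the coefficient of skewness on both grids, which is the trend towards pure nugget described in the text.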
3.2. Effect of spatial continuity on variograms computed from data simulated on a 10-m grid with randomly located outliers

Experimental variograms were computed and modelled for the three sizes of data set from data simulated with different nugget:sill ratios (0, 0.25, 0.5 and 0.75) and for the range of asymmetry considered. The results showed that the effects of increasing asymmetry in the distribution and discontinuity in the spatial variation are similar for the three sizes of data set; therefore, the results are given for the 10-m data only (Table 3). The results are summarized in Fig. 4 for data on the 10-m grid. For the data simulated by functions with nugget:sill ratios of 0, 0.25 and 0.5, the variogram shape remains fairly constant as the asymmetry increases. However, for data simulated with a nugget:sill ratio of 0.75, the nugget variance increases as skewness increases (Fig. 4 and Table 3) and for skewness coefficients ≥1.5 the variograms are almost pure nugget (Fig. 4d and Table 3). For a given coefficient of skewness, the sill variance changes little as spatial continuity in the data decreases. Fig. 5 summarizes the effects of asymmetry for the different degrees of spatial continuity on the variogram for data on the 10-m grid.

The results described above suggest that the degree of asymmetry caused by outliers has a greater effect on the shape of the variogram when the nugget:sill ratio of the original generating variogram of the primary process is >0.5.

Table 3
Parameters of models fitted to experimental variograms computed from data on the 10-m grid with asymmetry caused by randomly located outliers at a rate of contamination of 0.05 and with different nugget:sill ratios, and cross-validation results

Coefficient   Nugget:sill ratio of   Model type       c0      c       a (m)   c0+c    c0:(c0+c)   MSE (a)   MeSDR (a)
of skewness   generating function
0             0                      Spherical        0       1       75.0    1.00    0           0.1677    0.496
0.5           0                      Spherical        0.240   0.962   74.2    1.202   0.200       0.5812    0.219
1.0           0                      Spherical        0.636   0.950   76.2    1.586   0.401       1.077     0.129
1.5           0                      Spherical        1.017   0.944   77.9    1.961   0.519       1.530     0.113
2.0           0                      Circular         1.522   0.905   71.9    2.427   0.627       2.083     0.087
3.0           0                      Circular         3.890   0.926   78.7    4.817   0.808       5.326     0.065
0             0.25                   Spherical        0.250   0.750   75.0    1.00    0.250       0.4260    0.447
0.5           0.25                   Spherical        0.503   0.764   77.0    1.267   0.397       0.7829    0.291
1.0           0.25                   Spherical        0.903   0.763   78.0            0.542       1.080     0.262
1.5           0.25                   Spherical        1.286   0.765   79.0    2.051   0.627       1.480     0.203
2.0           0.25                   Spherical        1.761   0.769   80.3    2.531   0.696       2.271     0.155
3.0           0.25                   Circular         4.175   0.777   77.7    4.952   0.843       5.009     0.099
0             0.50                   Spherical        0.500   0.500   75.0    1.00    0.500       0.6591    0.434
0.5           0.50                   Spherical        0.754   0.550   83.9    1.304   0.579       1.054     0.316
1.0           0.50                   Circular         1.168   0.546   75.2    1.714   0.681       1.573     0.210
1.5           0.50                   Circular         1.549   0.559   75.7    2.108   0.735       2.047     0.163
2.0           0.50                   Circular         2.022   0.575   76.5    2.597   0.779       2.622     0.143
3.0           0.50                   Circular         4.398   0.625   79.8    5.023   0.875       5.946     0.101
0             0.75                   Spherical        0.750   0.250   75.0    1.00    0.750       0.8825    0.430
0.5           0.75                   Pentaspherical   1.006   0.264   93.9    1.270   0.792       1.196     0.366
1.0           0.75                   Pentaspherical   1.390   0.289   94.8    1.679   0.828       1.692     0.293
1.5           0.75                   Pentaspherical   1.762   0.308   96.9    2.070   0.851       2.156     0.240
2.0           0.75                   Pentaspherical   2.227   0.329   99.5    2.556   0.871       2.717     0.213
3.0           0.75                   Circular         4.599   0.399   81.9    4.998   0.920       5.987     0.116
(a) MSE is the mean squared error and MeSDR is the median squared deviation ratio.

Fig. 3. Experimental variograms (symbols) computed from 5-m (a, c), 10-m (d) and 20-m (b) data with a normal distribution and different coefficients of skewness caused by contamination at rates of 0.05 (a, b) and 0.02 (c, d) with randomly located outliers. Legend: generating model; normal; skew = 0.5, 1.0, 1.5 and 2.0. Axes: lag distance (m) against variance.

3.3. Effects of spatially aggregated outliers on the variogram

Fig. 6 shows the experimental variograms computed from data simulated with a nugget:sill ratio of 0 for all sizes of data set and all coefficients of skewness, and with outliers (contamination rate of 0.05) aggregated either near to the edge or the centre of the field. The variograms are quite different in form from those computed from data with randomly located outliers (Fig. 3). Table 5 gives the parameters of the fitted functions for the spatially aggregated outliers; it shows that they have small or 0 nugget effects for all sizes of data set and skewness coefficients, whereas for the randomly located outliers the nugget variance increases with increasing asymmetry in the distribution (Tables 2 and 3). There is also a marked difference between the variograms computed from data with outliers aggregated near the edge and centre of the field (Fig. 6a, c and e and Fig. 6b, d and f, respectively); Table 5 shows that, for the latter, the sill variances are considerably larger and that the range decreases as skewness increases. The large difference between the effect of outliers near the edge and near the centre of the field can be explained by the fact that outliers near the centre are involved in many more paired comparisons than are those at the edge of the field, in particular at the longer lags. The greater continuity in the variogram near to the origin for the spatially aggregated outliers probably relates to strong continuity over the majority of the field, whereas the randomly located outliers increase discontinuity in the variation at several isolated places in the field. The effect of size of data set is again small compared with the other effects.
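The explanation given in Section 3.3, that a site near the centre of the field enters more paired comparisons than one at the edge, can be illustrated by counting pairs directly. The sketch below uses the 10-m grid (20 × 20 nodes) and omni-directional lag classes of ±5 m; the function name `pairs_per_lag` and the choice of nodes are assumptions made for the example.

```python
import numpy as np

def pairs_per_lag(shape, node, spacing, lags, tol):
    """Number of paired comparisons one grid node enters at each lag.

    A pair is assigned to lag h when its separation is within h +/- tol.
    """
    rows, cols = np.indices(shape)
    dist = spacing * np.hypot(rows - node[0], cols - node[1])
    return [int(np.sum(np.abs(dist - h) <= tol)) for h in lags]

lags = list(range(10, 101, 10))              # lag distances (m)
grid = (20, 20)                              # 400 nodes at 10-m spacing
centre = pairs_per_lag(grid, (10, 10), 10.0, lags, 5.0)
edge = pairs_per_lag(grid, (0, 10), 10.0, lags, 5.0)
for h, n_centre, n_edge in zip(lags, centre, edge):
    print(h, n_centre, n_edge)
```

At every lag up to 100 m the central node takes part in more comparisons than the edge node, so an outlier placed there contributes its large squared differences to more of the terms in each semivariance.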
Table 4
Parameters of models fitted to experimental variograms computed from … 0 nugget variance and with skewness caused by contamination at a rate …

Coefficient   Grid interval of data (m)             Model type
of skewness   (rate of contamination by
              outliers of 0.02)
0              5                                    Spherical
0.5            5                                    Spherical
1.0            5                                    Spherical
1.5            5                                    Spherical
2.0            5                                    Pentaspherical
3.0            5                                    Circular
0              5                                    Spherical
0.5           10                                    Circular
1.0           10                                    Circular
1.5           10                                    Circular
2.0           10                                    Circular
3.0           10                                    Circular

3.4. Effect of data transformation (square root and log10) on the variogram

Data simulated by variogram functions with a nugget:sill ratio of 0 were used to determine the effects of square root and log10 transformations, and the removal of outliers on the variogram for all coefficients of skewness examined. This was done for all sizes of data set for a 0.05 rate of contamination and for data on the 5- and 10-m grids for the 0.02 rate. Table 6 gives the coefficients of skewness before and after transformation for all sizes of data set and 0.05 rate of contamination. As Table 6 shows, the results for both rates of contamination followed a similar form; only those for the 0.05 rate are focused upon because this includes all sizes of data set. For original skewness coefficients ≤1 the square root transformation was more effective in reducing the coefficient of skewness, and for original skewness coefficients >1 the logarithmic one was generally the more effective.

Fig. 7 shows the experimental variograms for the three sizes of data set after transformation to square roots and log10 for all levels of asymmetry examined. The shapes of the variograms remain similar for all sizes of data set. For data with a skewness coefficient of 3.0, the variograms after transformation to log10 are far less different from those computed from less skewed data than were those computed from the raw data (Figs. 7f and 2f, respectively). The experimental variograms in Fig. 7 indicate that the larger the size of data set, the smaller are the nugget and sill variances for both transformations, but this difference is small.

Tables 7 and 8 give the model parameters of
variograms computed for the 0.05 rate of contam-
data on 5- and 10-m grids generated by a variogram function with
e of 0.02 with randomly located outliers
c0 c a c0+c c0:c0+c
0 1 75.0 1.00 0
0.200 0.998 73.8 1.198 0.167
0.387 0.997 74.4 1.383 0.279
0.582 0.994 75.0 1.576 0.369
0.789 1.026 89.9 1.815 0.435
1.387 1.014 92.0 2.401 0.578
0 1 75.0 1.00 0
0.275 0.991 65.8 1.266 0.217
0.475 1.029 66.5 1.504 0.316
0.679 1.067 67.3 1.746 0.389
0.924 1.110 68.2 2.034 0.454
1.532 1.217 70.1 2.748 0.557ination by outliers for all sizes of data set andskewness coefcients after transformation to squareroots and logarithms, respectively. For data on the5- and 10-m grids, the nugget:sill ratios show aconsistent decrease after transformation to bothsquare roots and log10 (Tables 7 and 8, respectively)compared with the results for the raw data (Tables 2and 3), and the decrease is greater after the log10transformation (Table 8). The same is true for the20-m data for skewness coefcients 41 (Tables 2, 7and 8), but the decrease is less than that for theother data sets.Table 6 gives the coefcients of skewness after
transforming the 10-m data with outliers aggregatedat both the edge and centre of the eld to squareroots and log10; there is little difference betweenthese results and those for randomly locatedoutliers. This is largely to be expected as thetransformations do not take into account the spatialpositions of the outliers. Fig. 8a and c shows the
experimental variograms computed from the square root-transformed values with outliers at the edge and centre of the field, respectively, and Fig. 8b and d those computed from log10-transformed values at the edge and centre of the field, respectively. The differences in sill variance for data with outliers at the centre and edge of the field remain, but there is little difference between the effects of the two transformations. The sill variances for the square root- and log10-transformed data with randomly located outliers and those aggregated at the edge of the field are similar (Figs. 7 and 8a, b, respectively), whereas the sill variances of variograms computed from data with outliers grouped near the centre are far larger after transformation (Fig. 8c and d).

Fig. 4. Experimental variograms (symbols) computed from 10-m data with a normal distribution and different coefficients of skewness caused by randomly located outliers and the variogram functions (solid lines) used to generate the data with nugget:sill ratios of: (a) 0, (b) 0.25, (c) 0.5 and (d) 0.75.

Fig. 5. Comparison of experimental variograms computed from 10-m data generated by functions with different nugget:sill ratios for (a) a normal distribution, and skewness coefficients of: (b) 0.5, (c) 1.0, (d) 1.5 and (e) 2.0 caused by randomly located outliers.

Outliers were removed from the data and the variogram computed for the primary process (1 − ε)P with 0 nugget variance for each size of data
set; this was done once only as the outliers were at the same locations for each coefficient of skewness. Table 9 shows that there is a small increase in the nugget:sill ratio compared with the generating model, which can be explained by the gaps in the data that cause some apparent loss of continuity in the variation at the first lag interval. The results for the 0.02 rate of contamination are similar to those described above for the larger contamination rate (Table 9). Variograms computed for the 10-m data after the removal of outliers aggregated at the edge of the field are surprisingly similar to those computed after the removal of randomly located outliers (Table 9), whereas after the removal of those at the centre of the field the nugget:sill ratio is smaller than for the latter variograms.

Fig. 6. Experimental variograms (symbols) computed from data on 5-m (a, b), 10-m (c, d) and 20-m (e, f) grids with a normal distribution and different coefficients of skewness caused by contamination with outliers spatially aggregated in the corner (a, c, e) or the centre (b, d, f) of the field.

Table 5
Parameters of models fitted to experimental variograms computed from data on 5-, 10- and 20-m grids generated by a variogram function with a nugget:sill ratio of 0 and skewness caused by spatially aggregated outliers (rate of contamination of 0.05)
Coefficient   Grid interval   Location of   Model type       c0      c       a       c0+c    c0:c0+c
of skewness   of data (m)     outliers
0             5               -             Spherical        0       1       75.0    1.00    0
0.5           5               Edge          Circular         0.066   1.008   56.3    1.075   0.062
1.0           5               Edge          Circular         0.068   1.245   53.6    1.313   0.052
1.5           5               Edge          Circular         0.071   1.480   52.7    1.551   0.046
2.0           5               Edge          Circular         0.075   1.776   52.3    1.852   0.041
3.0           5               Edge          Circular         0.116   3.284   53.0    3.400   0.034
0             5               -             Spherical        0       1       75.0    1.00    0
0.5           5               Centre        Circular         0.036   1.620   78.1    1.656   0.022
1.0           5               Centre        Circular         0.004   2.219   79.8    2.223   0.002
1.5           5               Centre        Circular         0.000   2.972   80.5    2.972   0.000
2.0           5               Centre        Circular         0.000   3.889   80.6    3.889   0.000
3.0           5               Centre        Circular         0.000   9.314   79.9    9.314   0.000
0             10              -             Spherical        0       1       75.0    1.00    0
0.5           10              Edge          Circular         0.054   1.059   68.5    1.113   0.048
1.0           10              Edge          Circular         0.068   1.298   70.9    1.366   0.050
1.5           10              Edge          Circular         0.101   1.513   72.2    1.614   0.063
2.0           10              Edge          Spherical        0.104   1.836   82.8    1.940   0.054
3.0           10              Edge          Pentaspherical   0.451   3.487   108.9   3.938   0.115
0             10              -             Spherical        0       1       75.0    1.00    0
0.5           10              Centre        Circular         0.022   1.632   67.9    1.654   0.013
1.0           10              Centre        Circular         0.000   2.169   67.2    2.169   0.000
1.5           10              Centre        Circular         0.000   2.866   66.4    2.866   0.000
2.0           10              Centre        Circular         0.000   4.117   65.4    4.117   0.000
3.0           10              Centre        Spherical        0.000   8.444   71.5    8.444   0.000
0             20              -             Spherical        0       1       75.0    1.00    0
0.5           20              Edge          Circular         0.028   1.211   79.5    1.239   0.023
1.0           20              Edge          Spherical        0.032   1.552   102.8   1.584   0.020
1.5           20              Edge          Pentaspherical   0.048   1.948   139.8   1.996   0.024
2.0           20              Edge          Pentaspherical   0.115   2.401   154.7   2.516   0.046
3.0           20              Edge          Pentaspherical   0.542   5.370   222.0   5.912   0.092
0             20              -             Spherical        0       1       75.0    1.00    0
0.5           20              Centre        Circular         0.000   1.616   43.8    1.616   0.000
1.0           20              Centre        Circular         0.200   2.000   48.0    2.200   0.091
1.5           20              Centre        Circular         0.018   2.719   49.4    2.737   0.007
2.0           20              Centre        Spherical        0.000   3.601   55.5    3.601   0.000
3.0           20              Centre        Spherical        0.000   8.750   63.4    8.750   0.000

3.5. Robust variogram estimators

The results for the robust variograms are given in detail only for data on the 10-m grid with nugget:sill ratios of 0 and 0.5 and for a rate of contamination by outliers of 0.05, as those for the other sizes of data set followed a similar pattern.

Fig. 9 shows the experimental variograms computed by Matheron's estimator and the three robust estimators, $\hat{\gamma}_{\mathrm{CH}}(h)$, $\hat{\gamma}_{\mathrm{D}}(h)$ and $\hat{\gamma}_{\mathrm{G}}(h)$, for data
simulated with a nugget:sill ratio of 0, with a normal distribution and all coefficients of skewness. Tables 3 and 10 give the parameters of the models fitted to Matheron's and the robust variograms, respectively. For the normal distribution, $\hat{\gamma}_{\mathrm{M}}(h)$ and $\hat{\gamma}_{\mathrm{CH}}(h)$ are closest to the generating model (Fig. 9a and Table 10). As the skewness starts to increase, Matheron's estimator shows an increasing departure from the original variogram and so does that of Cressie and Hawkins (1980), but to a lesser extent (Fig. 9b-f and Table 10). For all degrees of asymmetry, $\hat{\gamma}_{\mathrm{D}}(h)$ departs less from the generating variogram than $\hat{\gamma}_{\mathrm{G}}(h)$, but there is little difference between these two robust variograms (Fig. 9b-f and Table 10). For Dowd's (1984) estimator, the sill variance is closer to 1 than for Genton's (1998a), but there is little change in either as the asymmetry increases (Table 10). For Cressie and Hawkins's estimator, the nugget and sill variances increase with increasing asymmetry (Table 10).

Fig. 10 shows the experimental variograms computed by Matheron's estimator and the three robust estimators, $\hat{\gamma}_{\mathrm{CH}}(h)$, $\hat{\gamma}_{\mathrm{D}}(h)$ and $\hat{\gamma}_{\mathrm{G}}(h)$, for data simulated with a nugget:sill ratio of 0.5 with a normal distribution and all coefficients of skewness. Tables 3 and 11 give the parameters of the models fitted to Matheron's and the robust variograms, respectively. Discontinuity in the variation has an adverse effect on the shape of Matheron's estimator as asymmetry in the distribution increases. This is not so for the robust estimators; the forms of these variograms remain close to the original function. The robust variograms show the same pattern in relation to each other as in Fig. 9 for the data with a nugget:sill ratio of 0. Table 11 shows that as asymmetry increases the sill variance also increases in the robust variograms, and this effect is greater for Cressie and Hawkins's and Genton's estimators than for Dowd's. It is also greater overall for the variograms computed from data with nugget:sill ratio of 0.5 (Table 11) than for those computed from data with a nugget:sill ratio of 0 (Table 10).

Table 6
Coefficients of skewness for data on 5-, 10- and 20-m grids generated by a variogram function with a nugget:sill ratio of 0 and contaminated with randomly located and spatially aggregated outliers (rate of 0.05) after transformation to square roots and log10
Coefficient of skewness   Coefficient of skewness of     Coefficient of skewness of
of original data          square root of data (+4)a      log10 data (+4)a
5-m grid, random outliers
0.5                       0.108                          0.764
1.0                       0.353                          0.433
1.5                       0.711                          0.176
2.0                       1.059                          0.077
3.0                       2.013                          0.819
10-m grid, random outliers
0.5                       0.101                          0.934
1.0                       0.378                          0.575
1.5                       0.734                          0.309
2.0                       1.077                          0.051
3.0                       2.111                          0.787
10-m grid, aggregated outliers, edge
0.5                       0.160                          0.967
1.0                       0.315                          0.614
1.5                       0.677                          0.345
2.0                       1.028                          0.081
3.0                       2.095                          0.779
10-m grid, aggregated outliers, centre
0.5                       0.053                          0.882
1.0                       0.271                          0.639
1.5                       0.629                          0.374
2.0                       1.145                          0.013
3.0                       2.053                          0.743
20-m grid, random outliers
0.5                       0.185                          1.328
1.0                       0.203                          1.016
1.5                       0.604                          0.703
2.0                       1.067                          0.339
3.0                       2.026                          0.469
a (+4): constant added to standard normal data so that smallest value was just positive for transformation.
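The behaviour of the estimators compared in this section can be reproduced in miniature. The sketch below is an illustration, not the authors' implementation: it uses the commonly cited forms of Matheron's, Cressie and Hawkins's and Dowd's estimators, applies them to a hypothetical one-dimensional series of uncorrelated standard normal values (true semivariance 1 at every lag), and then shifts 5% of the values by an arbitrary amount to mimic randomly located outliers. Genton's estimator is omitted for brevity; all function names, the series length and the size of the shift are assumptions.

```python
import numpy as np

def lag_differences(z, lag):
    """Differences z(x + lag) - z(x) along a regular 1-D transect."""
    return z[lag:] - z[:-lag]

def matheron(d):
    """Method-of-moments semivariance: half the mean squared difference."""
    return 0.5 * np.mean(d ** 2)

def cressie_hawkins(d):
    """Cressie and Hawkins form: fourth power of the mean root absolute difference."""
    n = d.size
    num = np.mean(np.sqrt(np.abs(d))) ** 4
    return 0.5 * num / (0.457 + 0.494 / n + 0.045 / n ** 2)

def dowd(d):
    """Dowd form: scaled squared median of the absolute differences."""
    return 0.5 * 2.198 * np.median(np.abs(d)) ** 2

rng = np.random.default_rng(0)
z = rng.standard_normal(4000)            # primary process, variance 1

# Contaminate 5% of locations (illustrative shift of +6, an assumption).
zc = z.copy()
idx = rng.choice(z.size, size=z.size // 20, replace=False)
zc[idx] += 6.0

estimators = (matheron, cressie_hawkins, dowd)
clean = {f.__name__: f(lag_differences(z, 1)) for f in estimators}
contaminated = {f.__name__: f(lag_differences(zc, 1)) for f in estimators}

for name in clean:
    print(f"{name:16s} clean {clean[name]:.3f}  contaminated {contaminated[name]:.3f}")
```

Under these assumptions all three estimates are close to 1 for the clean series, and after contamination Matheron's estimate is inflated most, Cressie and Hawkins's less so, and Dowd's least, which is the same ordering as reported above for the simulated fields.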
3.6. Cross-validation results
The MEs for all cross-validation analyses were close to 0, showing that the estimators are unbiased. The MSDRs were close to 1 for most analyses; this supports Lark's (2000) observation that the MSDR is poor for comparing the effects of asymmetry in the data on the variogram. As the MEs and MSDRs do not provide any insight into the effect of outliers in the data on the variogram and on the accuracy of prediction, we do not include them here.

Fig. 7. Experimental variograms computed from data on 5-m (a, b), 10-m (c, d) and 20-m (e, f) grids transformed to square roots (a, c, e) and log10 (b, d, f) with skewness caused by randomly located outliers.

Table 12 gives the results of cross-validation (MSEs and MeSDRs) for data simulated by the spherical function with 0 nugget for the three grid spacings and the range of asymmetry. Cross-validation was also done for both rates of contamination; the results for the 0.05 rate of contamination only are given because the differences between the two rates are small. To summarize, the MSEs for the 0.02 rate of contamination are smaller and the departure of the MeSDR from 0.455 is slightly less than are those for the larger rate of contamination for the three grid sizes. Cross-validation was also done for the 10-m data with different nugget:sill ratios. The results are given in Table 3 and follow a similar pattern to those for the
data with 0 nugget. As the nugget variance increases, the MSEs increase more for coefficients of skewness >1.5, but overall there is little difference. For a given coefficient of skewness, the MeSDRs depart less from 0.455 as the nugget variance increases, but again this is small. The effects on the MeSDR with increasing skewness caused by outliers appear to be less when the original data were simulated with a nugget variance (Table 3). This might be because the larger the nugget variance, the more is the weight given to the more distant points in the kriging neighbourhood. This results in greater smoothing of the predictions and of the kriging errors, in particular when outliers are present in the data.

Table 7
Parameters of models fitted to experimental variograms computed on data transformed to square roots on 5-, 10- and 20-m grids generated by a variogram function with a nugget:sill ratio of 0 and with skewness caused by randomly located outliers
Original skewness   Grid interval   Model type   c0       c        a      c0+c    c0:c0+c
of data             of data (m)
0.5                 5               Spherical    0.0133   0.0591   76.1   0.072   0.183
1.0                 5               Spherical    0.0299   0.0578   77.9   0.088   0.341
1.5                 5               Spherical    0.0440   0.0571   79.3   0.101   0.435
2.0                 5               Spherical    0.0600   0.0566   80.8   0.117   0.515
3.0                 5               Circular     0.1284   0.0538   77.0   0.182   0.705
0.5                 10              Circular     0.0185   0.0695   65.7   0.088   0.210
1.0                 10              Circular     0.0391   0.0581   66.7   0.097   0.402
1.5                 10              Circular     0.0562   0.0565   67.6   0.113   0.498
2.0                 10              Circular     0.0753   0.0551   68.7   0.130   0.577
3.0                 10              Circular     0.1663   0.0506   74.0   0.217   0.767
0.5                 20              Spherical    0.0225   0.0616   62.9   0.084   0.268
1.0                 20              Spherical    0.0327   0.0654   59.2   0.098   0.333
1.5                 20              Spherical    0.0441   0.0703   55.8   0.114   0.385
2.0                 20              Spherical    0.0586   0.0788   51.7   0.137   0.426
3.0                 20              Spherical    0.1254   0.0938   53.1   0.219   0.572

Table 8
Parameters of models fitted to experimental variograms computed on data transformed to logarithms (log10) on 5-, 10- and 20-m grids generated by a variogram function with a nugget:sill ratio of 0 and with skewness caused by randomly located outliers
Original skewness   Grid interval   Model type       c0        c         a       c0+c    c0:c0+c
of data             of data (m)
0.5                 5               Pentaspherical   0.00242   0.01186   94.1    0.014   0.170
1.0                 5               Pentaspherical   0.00466   0.01155   96.4    0.016   0.288
1.5                 5               Pentaspherical   0.00633   0.01141   98.0    0.018   0.357
2.0                 5               Pentaspherical   0.00806   0.01130   99.6    0.019   0.416
3.0                 5               Pentaspherical   0.01408   0.01107   105.3   0.025   0.560
0.5                 10              Circular         0.00308   0.01301   68.0    0.016   0.191
1.0                 10              Circular         0.00575   0.01260   68.7    0.018   0.313
1.5                 10              Circular         0.00770   0.01237   69.3    0.020   0.384
2.0                 10              Circular         0.00970   0.01218   69.9    0.022   0.443
3.0                 10              Circular         0.01763   0.01160   72.4    0.029   0.603
0.5                 20              Circular         0.00526   0.01181   63.1    0.017   0.308
1.0                 20              Circular         0.00677   0.01202   61.2    0.019   0.360
1.5                 20              Spherical        0.00767   0.01294   65.4    0.021   0.372
2.0                 20              Spherical        0.00959   0.01336   63.3    0.023   0.418
3.0                 20              Spherical        0.01483   0.01463   58.8    0.029   0.503

Table 12 shows that the MSEs increase markedly as the coefficient of skewness increases for all sizes of data set. It is notable from Table 12 that the effect of data-set size is small; the MSEs for the 20-m data are larger than for the two larger sets of data, but not especially so. The MeSDRs are
considerably less than 0.455 for all coefficients of skewness and for all sizes of data set (Table 12); this means that the function is over-estimating the kriging variance. The departure from 0.455 is considerable even for a skewness of 0.5. The MeSDRs, however, do not show a consistent pattern in relation to the size of data set.

Fig. 8. Experimental variograms computed from data on 10-m grid transformed to square roots (a, c) and log10 (b, d) with skewness caused by outliers spatially aggregated in the corner (a, b) and centre (c, d) of the field.

Table 12 gives the cross-validation results for all sizes of data set and skewness after transformation to square roots and logarithms. The pattern of the MeSDRs with increasing asymmetry for the 10-m data after transformation was similar for all nugget variances, so the results are not presented here. The MSEs for the square root- and log10-transformed data increase as the original skewness in the data increases and also as the size of data set decreases. The effect of size of data set is small, however. Table 12 shows that the MeSDRs for all sizes of data set and for both transformations depart considerably from 0.455. There is little difference between the MeSDRs for a given coefficient of skewness for both transformations; sometimes the MeSDR departs less from 0.455 for the square root transformation and at others it departs less for the log10 one. The MeSDRs for the transformed data depart only slightly less from 0.455 than do those for the raw data (Table 12). In all cases, the greater the original asymmetry in the data, the greater the degree of departure from 0.455. In general, these results suggest that data transformations do not improve markedly the performance of the model for kriging, although they reduce the coefficient of skewness.

For all sizes of data set, the MSEs and MeSDRs for the variograms and data where the outliers had been removed were the smallest and showed the least departure from 0.455, respectively (Table 12).
If the outliers are essentially nuisance data, then this approach would be the most sensible. However, if the outliers are important values, such as large values of a pollutant, they should be returned to the data for kriging. The cross-validation results with the outliers restored and using the variogram computed after outliers had been removed are given in parentheses in Table 12 for the 10-m data. The MSEs for all original coefficients of skewness of the 10-m data are larger than are those computed with the original variograms of these data. The MeSDRs also depart more from 0.455 than do those for the original data; they are larger, which means that the kriging variances have been under-estimated. These results show that the variogram model for data with outliers performs better in kriging than that with the outliers removed.

Table 13 gives the cross-validation results for outliers aggregated either near the edge or the centre of the field for all sizes of data set. This is the only analysis for which there is an obvious effect of the size of data set; the MSEs increase as asymmetry increases and, for a given coefficient of skewness, as the size of the data set decreases. However, the MSEs are considerably smaller for the aggregated outliers than for the randomly located ones (Table 12). The MSEs for the centrally grouped outliers are larger than are those for outliers at the edge, but the difference is small for all sizes of data. For randomly located outliers, the kriging errors are large in their vicinity because of the lack of any relation with surrounding values. In addition, the larger errors associated with the outliers are widely distributed over the field of data. For the aggregated outliers, the errors in the uncontaminated part of the field are small and they will also be smaller in the contaminated area than for the randomly located outliers. This is because adjacent values within the kriging neighbourhood in the area of aggregated outliers will be more similar to each other than those where the outliers are randomly located.

The MeSDRs for aggregated outliers depart less overall from 0.455 than do those of the randomly located ones (Tables 13 and 12, respectively). The MeSDRs for outliers aggregated near the edge of the field show less departure from 0.455 in general than do those for outliers aggregated near the centre for the two smaller sets of data; the opposite is the case for the 5-m data. The centrally located outliers are likely to result in larger MSEs than those located at the edge of the field because they will be involved in many more predictions as the kriging neighbourhood moves over the field.

Table 9
Parameters of models fitted to experimental variograms computed from data on 5-, 10- and 20-m grids generated by a variogram function with a nugget:sill ratio of 0 where randomly located and spatially aggregated outliers have been removed
Location of outliers   Grid interval   Rate of         Model type       c0      c       a      c0+c    c0:c0+c
                       of data (m)     contamination
Random                 5               0.05            Circular         0.028   0.946   65.0   0.974   0.029
Random                 10              0.05            Circular         0.021   0.963   64.0   0.984   0.021
Random                 20              0.05            Pentaspherical   0.060   0.913   94.5   0.973   0.062
Random                 5               0.02            Circular         0.022   0.958   65.2   0.980   0.022
Random                 10              0.02            Circular         0.023   0.961   65.3   0.984   0.023
Aggregated, edge       10              0.05            Circular         0.042   0.908   68.6   0.950   0.044
Aggregated, centre     10              0.05            Circular         0.010   0.940   60.9   0.950   0.011
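The cross-validation diagnostics used throughout this section can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function name and the synthetic errors are assumptions, and the reference value of 0.455 is taken to be the median of a chi-square distribution with one degree of freedom, which is what the squared standardized error follows when the errors are normal and the kriging variances are correct.

```python
import numpy as np

def cross_validation_stats(observed, predicted, kriging_var):
    """ME, MSE, MSDR and MeSDR from cross-validation residuals."""
    err = observed - predicted
    sq_ratio = err ** 2 / kriging_var      # squared error over kriging variance
    return {
        "ME": err.mean(),
        "MSE": (err ** 2).mean(),
        "MSDR": sq_ratio.mean(),
        "MeSDR": np.median(sq_ratio),
    }

# Synthetic check (assumption): normal errors whose variance equals the
# stated kriging variance, so the model of the errors is correct.
rng = np.random.default_rng(1)
n = 20000
kriging_var = np.full(n, 0.5)
observed = rng.standard_normal(n)
predicted = observed - rng.normal(0.0, np.sqrt(kriging_var))

well_specified = cross_validation_stats(observed, predicted, kriging_var)

# Same errors, but kriging variances overstated by a factor of 4.
inflated = cross_validation_stats(observed, predicted, 4.0 * kriging_var)

print(well_specified)
print(inflated)
```

With correct variances the MSDR is close to 1 and the MeSDR close to 0.455; overstating the kriging variance pulls the MeSDR below 0.455, which is the direction of departure described above for the contaminated data.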
Fig. 9. Experimental variograms computed from data simulated by a variogram function with a nugget:sill ratio of 0 on a 10-m grid by Matheron's estimator and Cressie and Hawkins's, Dowd's and Genton's robust estimators for skewness coefficients of: (a) 0, (b) 0.5, (c) 1.0, (d) 1.5, (e) 2.0 and (f) 3.0 caused by randomly located outliers.
Table 10
Parameters of models fitted to robust experimental variograms computed from data on a 10-m grid generated by a variogram function with a nugget:sill ratio of 0 contaminated by randomly located outliers (rate of 0.05) and cross-validation results
Robust estimator   Model type       c0        c       a       c0+c    c0:c0+c   MSEa    MeSDRa
and skewness
Cressie and Hawkins
0.5                Pentaspherical   0         1.166   71.75   1.166   0         0.654   0.497
1.0                Pentaspherical   0.1291    1.259   77.18   1.388   0.0930    1.205   0.387
1.5                Pentaspherical   0.1658    1.372   77.66   1.538   0.1078    1.736   0.352
2.0                Pentaspherical   0.2018    1.477   78.02   1.679   0.1202    2.382   0.334
3.0                Pentaspherical   0.3552    1.846   78.99   2.201   0.1614    6.085   0.301
Dowd
0.5                Pentaspherical   0.02142   1.155   84.70   1.176   0.0182    0.641   0.515
1.0                Circular         0.00244   1.162   54.89   1.164   0.0021    1.286   0.753
1.5                Circular         0         1.200   56.16   1.200   0         1.880   0.885
2.0                Spherical        0         1.214   65.08   1.214   0         2.608   0.866
3.0                Spherical        0         1.215   65.14   1.215   0         6.833   1.127
Genton
0.5                Spherical        0.02945   1.025   56.65   1.055   0.0279    0.639   0.469
1.0                Spherical        0.00691   1.200   56.99   1.207   0.0057    1.286   0.631
1.5                Spherical        0         1.280   58.89   1.280   0         1.886   0.735
2.0                Spherical        0         1.318   60.15   1.318   0         2.611   0.740
3.0                Spherical        0         1.336   60.82   1.336   0         6.839   0.956
a MSE is the mean squared error and MeSDR is the median squared deviation ratio.

For the transformed data, results for the 10-m data only are given. For both transformations the MSEs increase little as skewness in the original data increases (Table 13). This suggests that transformation has been successful for the aggregated outliers, regardless of whether they are near the edge or the centre of the field. This is supported by the MeSDRs, which are closer to 0.455 than for the original data and for the data with randomly located outliers (Table 12). The results also suggest that the transformation to square roots is more successful for outliers aggregated near the edge of the field and that to log10 for outliers near the centre; however, the difference between them is small. The MeSDRs for data with the outliers removed are the closest to 0.455 for both scenarios, but those for outliers near the edge are the closer. The results after the removal of outliers appear to suggest that this is preferable to data transformation before computing the variogram. However, outliers were not removed for the transformation and this might be important on contaminated sites so that information is not lost. When the outliers are returned to the data for kriging, however, the MeSDRs depart more from 0.455 (Table 13, see results in parentheses for coefficient of skewness of 0.5) than do those for the square root-transformed data for outliers aggregated near the edge and log10-transformed data aggregated near the centre.

Where the original data contain randomly located outliers, the MeSDRs for the transformed data tend to be more similar for different coefficients of skewness compared with those where there was underlying asymmetry. The latter showed a more pronounced decrease in MeSDR as skewness increased (Kerry and Oliver, 2007). The overall pattern of MeSDR values suggests that for all grid sizes the log10 transformation is the more effective.

Tables 10 and 11 give the cross-validation results for the robust variogram estimators for the 10-m data only simulated with 0 and 0.5 nugget variance, respectively. The MSEs increase in all cases with increasing asymmetry, but to a slightly smaller extent for the Cressie and Hawkins estimator than for the Dowd and Genton ones. The MSEs are also smaller for the data simulated with a nugget:sill ratio of 0.5, which supports the previous observations for the 10-m data with Matheron's estimator. As above, the smoothing effect on kriged predictions of a nugget effect in the variogram reduces the
Fig. 10. Experimental variograms computed from data simulated by a variogram function with a nugget:sill ratio of 0.5 on a 10-m grid by Matheron's estimator and Cressie and Hawkins's, Dowd's and Genton's robust estimators for skewness coefficients of: (a) 0, (b) 0.5, (c) 1.0, (d) 1.5, (e) 2.0 and (f) 3.0 caused by randomly located outliers.
Table 11
Parameters of models fitted to robust experimental variograms computed from data on a 10-m grid generated by a variogram function with a nugget:sill ratio of 0.5 contaminated by randomly located outliers (rate of 0.05) and cross-validation results
Robust estimator and skewness   Model type   c0   c   a   c0+c   c0:c0+c   MSEa   MeSDRa
Cressie and Hawkins
0.5 Spherical 0.7718 0.5025 68.98 1.274 0.6057 1.054 0.386
1.0 Spherical 0.9425 0.5762 71.56 1.519 0.6206 1.577 0.353
1.5 Spherical 1.046 0.6262 72.24 1.672 0.6255 2.056 0.316
2.0 Spherical 1.146 0.6714 72.81 1.817 0.6306 2.639 0.305
3.0 Circular 1.570 0.7901 69.03 2.360 0.6653 5.988 0.328
Dowd
0.5 Spherical 0.6609 0.5645 71.58 1.225 0.5395 1.055 0.433
1.0 Spherical 0.6933 0.6503 73.14 1.343 0.5162 1.587 0.458
1.5 Spherical 0.6988 0.6651 73.28 1.364 0.5123 2.075 0.470
2.0 Spherical 0.6974 0.6697 73.18 1.367 0.5102 2.669 0.503
3.0 Spherical 0.6974 0.6697 73.18 1.367 0.5102 6.104 0.667
Genton
0.5 Spherical 0.7436 0.5207 68.37 1.264 0.5883 1.054 0.394
1.0 Spherical 0.8273 0.6419 70.78 1.469 0.5632 1.582 0.394
1.5 Spherical 0.8265 0.7692 71.54 1.596 0.5179 2.075 0.398
2.0 Spherical 0.8266 0.7707 71.75 1.597 0.5176 2.668 0.423
3.0 Spherical 0.8214 0.7893 71.49 1.611 0.5099 6.111 0.561
a MSE is the mean squared error and MeSDR is the median squared deviation ratio.
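The columns c0, c and a in these tables are the nugget variance, the spatially correlated component and the range of the fitted model, and the last two variogram columns follow from them. The sketch below shows this for the standard spherical model using the Dowd row for a skewness of 0.5 in Table 11; the function name and the chosen lags are illustrative assumptions, and small differences from the tabulated ratio reflect rounding of the published parameters.

```python
import numpy as np

def spherical(h, c0, c, a):
    """Spherical variogram: nugget c0, correlated component c, range a."""
    h = np.asarray(h, dtype=float)
    rising = c0 + c * (1.5 * h / a - 0.5 * (h / a) ** 3)
    gamma = np.where(h <= a, rising, c0 + c)   # flat at the sill beyond the range
    return np.where(h == 0, 0.0, gamma)        # semivariance is 0 at lag 0

# Parameters from Table 11 (Dowd, skewness 0.5).
c0, c, a = 0.6609, 0.5645, 71.58

sill = c0 + c                 # tabulated as c0+c
ratio = c0 / sill             # tabulated as c0:c0+c

lags = np.array([0.0, 10.0, 40.0, 71.58, 100.0])
gamma = spherical(lags, c0, c, a)

print(f"sill = {sill:.3f}, nugget:sill = {ratio:.4f}")
print(np.round(gamma, 3))
```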
Table 12
Mean squared errors and median squared deviation ratios, MSE and MeSDR, respectively, from cross-validation using variogram model
parameters from original data contaminated by randomly located outliers (rate 0.05) on 5-, 10- and 20-m grids (nugget:sill ratio of 0), from
data transformed to square roots and logarithms (log10), and from data with outliers removed
Coefficient of    Original data      Transformed data                          Outliers removed (outliers
skewness of                          Square root^a       Log10^a               returned for kriging)
original data     MSE      MeSDR     MSE       MeSDR     MSE        MeSDR      MSE               MeSDR
5-m grid
0                 0.1322   0.703
0.5               0.3944   0.214     0.02215   0.226     0.004089   0.206      0.1254 (0.4285)   0.516 (0.773)
1.0               0.8138   0.128     0.03979   0.146     0.006451   0.144
1.5               1.217    0.100     0.05479   0.116     0.008219   0.123
2.0               1.720    0.082     0.07182   0.097     0.01005    0.112
3.0               4.258    0.058     0.1434    0.069     0.01646    0.076
10-m grid
0                 0.1677   0.496
0.5               0.5812   0.219     0.03137   0.234     0.005387   0.245      0.1678 (0.6361)   0.490 (0.660)
1.0               1.077    0.129     0.05197   0.153     0.008115   0.166             (1.256)          (0.869)
1.5               1.530    0.113     0.06863   0.130     0.01007    0.149             (1.829)          (1.017)
2.0               2.083    0.087     0.08719   0.103     0.01205    0.126             (2.531)          (1.006)
3.0               5.326    0.065     0.1758    0.074     0.01987    0.091             (6.626)          (1.424)
20-m grid
0                 0.3691   0.394
0.5               0.9816   0.199     0.05992   0.211     0.00922    0.189      0.3306 (1.007)    0.303 (0.586)
1.0               1.373    0.188     0.07594   0.184     0.01131    0.164
1.5               1.854    0.185     0.09371   0.185     0.1332     0.155
2.0               2.605    0.148     0.1183    0.175     0.01595    0.165
3.0               5.528    0.082     0.1989    0.126     0.02293    0.137
MSE is the mean squared error.
MeSDR is the median squared deviation ratio.
Values in brackets are for kriging using the variogram with outliers removed but with the outliers returned to the data for kriging.
^a A constant of 4 was added to values in each original data set so that all values were just positive for the logarithmic transformation.
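The shift-and-transform step described in footnote a can be sketched as below. The function names are hypothetical, the sample values are made up, and two details are assumptions rather than statements from the paper: that the constant is added before both the square-root and the logarithmic transformation, and that the coefficient of skewness is the moment-based form (third central moment divided by the 1.5 power of the variance).

```python
import math


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def shift_and_transform(values, constant=4.0):
    """Add a constant so that all values are positive, then return the
    square roots and the base-10 logarithms of the shifted values."""
    shifted = [v + constant for v in values]
    if min(shifted) <= 0:
        raise ValueError("constant too small: shifted values must be positive")
    roots = [math.sqrt(v) for v in shifted]
    logs = [math.log10(v) for v in shifted]
    return roots, logs


# Hypothetical sample: roughly standard-normal values plus two large outliers.
sample = [-1.8, -0.9, -0.4, 0.0, 0.3, 0.7, 1.1, 1.6, 6.5, 7.2]
roots, logs = shift_and_transform(sample)
print(skewness(sample), skewness(roots), skewness(logs))
```

Comparing the three printed coefficients shows how much of the asymmetry each transformation removes for a given data set. Note that cross-validation statistics computed on transformed values are on the transformed scale.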
Table 13
Mean squared errors and median squared deviation ratios, MSE and MeSDR, respectively, from cross-validation using model parameters
of original data on grids of 5-, 10- and 20-m (nugget:sill ratio of 0), and on the 10-m grid for data transformed to square roots and
logarithms (log10) and with outliers removed where outliers are spatially aggregated
Data with      Skewness         Original data      Transformed data                          Outliers removed (outliers
grouped        coefficient of                      Square root^a       Log10^a               returned for kriging)
outliers       original data    MSE      MeSDR     MSE       MeSDR     MSE        MeSDR      MSE               MeSDR
5-m corner     0                0.1322   0.703
5-m corner     0.5              0.1898   0.357
5-m corner     1.0              0.2083   0.310
5-m corner     1.5              0.2254   0.268
5-m corner     2.0              0.2464   0.234
5-m corner     3.0              0.3524   0.136
5-m centre     0                0.1322   0.703
5-m centre     0.5              0.1982   0.404
5-m centre     1.0              0.2174   0.437
5-m centre     1.5              0.2397   0.348
5-m centre     2.0              0.2686   0.268
5-m centre     3.0              0.4544   0.114
10-m corner    0                0.1677   0.496
10-m corner    0.5              0.2809   0.426     0.01659   0.429     0.002922   0.362      0.1684 (0.2816)   0.458 (0.515)
10-m corner    1.0              0.3363   0.376     0.01825   0.394     0.003002   0.335
10-m corner    1.5              0.3983   0.299     0.02037   0.363     0.003203   0.312
10-m corner    2.0              0.4808   0.262     0.02315   0.298     0.003462   0.267
10-m corner    3.0              1.041    0.111     0.03870   0.241     0.004750   0.203
10-m centre    0                0.1677   0.496
10-m centre    0.5              0.3154   0.357     0.01365   0.361     0.003052   0.404      0.1616 (0.3161)   0.475 (0.578)
10-m centre    1.0              0.3661   0.318     0.01433   0.305     0.003153   0.356
10-m centre    1.5              0.4286   0.240     0.01552   0.255     0.003327   0.323
10-m centre    2.0              0.5502   0.172     0.01799   0.199     0.003655   0.273
10-m centre    3.0              0.9632   0.084     0.02590   0.118     0.004521   0.196
20-m corner    0                0.3691   0.394
20-m corner    0.5              0.4138   0.386
20-m corner    1.0              0.4547   0.368
20-m corner    1.5              0.5130   0.305
20-m corner    2.0              0.5957   0.215
20-m corner    3.0              1.156    0.112
20-m centre    0                0.3691   0.394
20-m centre    0.5              0.7320   0.229
20-m centre    1.0              0.8937   0.163
20-m centre    1.5              1.107    0.196
20-m centre    2.0              1.279    0.101
20-m centre    3.0              2.592    0.059
MSE is the mean squared error.
MeSDR is the median squared deviation ratio.
Values in brackets are for kriging using the variogram with outliers removed but with the outliers returned to the data for kriging.
^a A constant of 4 was added to values in each original data set so that all values were just positive for the logarithmic transformation.

localized errors caused by the outliers. The MeSDRs are closer to 0.455 for Cressie and Hawkins's estimator for data simulated with a nugget:sill ratio of 0 (Table 10), followed by those for Genton's estimator. For data simulated with a nugget:sill ratio of 0.5, the MeSDRs are closest to 0.455 for Dowd's estimator for skewness coefficients <2 and for Genton's for skewness coefficients of 2.0 and 3.0. The MeSDRs are sometimes ≫0.455 for the robust variograms, which suggest that the function is under-estimating the kriging variance possibly because of the effect of non-normality on the robust estimator. The MeSDRs for Cressie and Hawkins's estimator for data with no nugget variance and those for Dowd's and Genton's estimators for data with a nugget variance are similar to those for data with the outliers removed. However, it is important to note that when the outliers are returned for kriging and the variogram computed on data with the outliers removed is used, the MSEs are still slightly smaller than are those for Dowd's and Genton's estimators, but overall the MeSDRs depart more from 0.455 than do those of the robust estimators.

4. Conclusions

Where asymmetry arises from contamination of a primary process by a secondary process at a relatively small number of randomly located sites (outliers), the effect of size of data set on both the form of the variogram and the results of cross-validation is small compared with that observed where asymmetry arises from a long tail in the distribution (Kerry and Oliver, 2007). The rate of contamination by outliers, however, has more effect on the variogram and results of cross-validation than does the size of data set. A more modest coefficient of skewness of 3.0 caused by outliers distorts the variogram considerably more than that observed for underlying asymmetry with larger coefficients of skewness.

For a given coefficient of skewness caused by randomly located outliers, an increase in nugget:sill ratio of the generating variogram up to and including 0.5 has little effect on the shape of the variogram for different coefficients of skewness, but for a nugget:sill ratio of 0.75 the variograms are almost pure nugget for the larger coefficients of
skewness. However, cross-validation showed that the larger the nugget:sill ratios in the generating variogram, the smaller the departure of the MeSDR from 0.455 as skewness increases.

The effect of aggregated outliers compared with randomly located ones on the nugget variance is marked; the former have very little effect on it. There is also a difference in the effects on the variogram if outliers are aggregated near to the edge of the field or the centre. The MSEs indicate that if outliers are aggregated, their effect on the accuracy of prediction is less than if they are randomly located. This is further confirmed by the smaller departure of the MeSDR from 0.455 for aggregated outliers compared with randomly located ones. For aggregated outliers, there is less need to ameliorate their effect until the skewness coefficient ≥2.

Transformation to square roots and logarithms reduces the coefficient of skewness and improves the shape of the variogram for the larger initial skewness coefficients. However, the MeSDRs depart considerably from 0.455 for all coefficients of skewness and this is only marginally less so than for the raw data. In contrast, the MeSDRs for the robust variograms show less departure from 0.455 overall than do those for the transformed data. For a generating variogram with a nugget:sill ratio of 0, Cressie and Hawkins's (1980) estimator resulted in MeSDRs that were closest to 0.455, whereas with a nugget:sill ratio of 0.5 Dowd's estimator performed the best in this context. These results indicate that one should compute several robust estimators, if the presence of outliers is suspected, as they perform differently under different circumstances. The removal of outliers, however, resulted in MeSDRs that were generally closer to 0.455, but, as noted above, this does not hold when the outliers are returned to the data for kriging.

The MSEs for randomly located outliers suggest that the standard approach that many geostatisticians have adopted of mitigating the effects of asymmetry only when the skewness coefficient exceeds the bounds ±1 might need revising when asymmetry arises from randomly located outliers. There is a large increase in MSE (Table 12) between skewness coefficients of 0.5 and 1.0, which suggests that for skewness coefficients >0.75 there is a need to reduce the asymmetry. The current best practice approach of removing outliers appears to be the most appropriate method, when they are randomly located and will not be returned to the data for kriging. However, when the outliers form a crucial part of the investigation, as on contaminated sites, or when they are difficult to identify in the exploratory data analysis, we recommend that practitioners compute several robust variogram estimators in preference to the removal of outliers and use the one that gives the best cross-validation results.

Acknowledgements

We thank Professors R. Webster and A.B. McBratney and an unknown referee for their guidance in revising this paper.

References

Armstrong, M., Delfiner, P., 1980. Towards a more robust variogram: a case study on coal. Unpublished Note N-671, Centre de Géostatistique et de Morphologie Mathématique, Fontainebleau, 49pp.
Barnett, V., Lewis, T., 1994. Outliers in Statistical Data, third ed. Wiley, Chichester, 604pp.
Cressie, N.A.C., 1993. Statistics for Spatial Data. Wiley, New York, 900pp.
Cressie, N., Hawkins, D., 1980. Robust estimation of the variogram. Mathematical Geology 12 (2), 115–125.
Deutsch, C.V., Journel, A.G., 1992. GSLIB: Geostatistical Software Library and User's Guide. Oxford University Press, New York, 369pp.
Dowd, P.A., 1984. The variogram and kriging: robust and resistant estimators. In: Verly, G., David, M., Journel, A.G., Maréchal, A. (Eds.), Geostatistics for Natural Resources Characterization. Reidel, Dordrecht, pp. 221–236.
Genton, M.G., 1998a. Highly robust variogram estimation. Mathematical Geology 30 (2), 213–221.
Genton, M.G., 1998b. Spatial breakdown point of robust estimators. Mathematical Geology 30 (7), 853–871.
Gnanadesikan, R., Kettenring, J.R., 1972. Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28 (1), 81–124.
Goovaerts, P., 1997. Geostatistics for Natural Resources Evaluation. Oxford University Press, New York, 483pp.
Haslett, J., Bradley, R., Craig, P.S., Wills, G., Unwin, A.R., 1991. Dynamic graphics for exploring spatial data, with application to locating global and local anomalies. American Statistician 45 (3), 234–242.
Kerry, R., Oliver, M.A., 2007. Determining the effect of asymmetric data on the variogram. I. Underlying asymmetry. Computers and Geosciences, doi:10.1016/j.cageo.2007.05.008.
Lark, R.M., 2000. A comparison of some robust estimators of the variogram for use in soil survey. European Journal of Soil Science 51 (1), 137–157.
Matheron, G., 1965. Les variables régionalisées et leur estimation: une application de la théorie de fonctions aléatoires aux sciences de la nature. Masson et Cie, Paris, 306pp.
McBratney, A.B., Webster, R., 1986. Choosing functions for semi-variograms of soil properties and fitting them to sampling estimates. Journal of Soil Science 37 (4), 617–639.
Payne, R.W. (Ed.), 2006. The Guide to GenStat Release 9 - Part 2: Statistics. VSN International, Hemel Hempstead.
Rousseeuw, P.J., Croux, C., 1992. Explicit scale estimators with high breakdown point. In: Dodge, Y. (Ed.), L1 Statistical Analyses and Related Methods. North-Holland, Amsterdam, pp. 77–92.
Rousseeuw, P.J., Croux, C., 1993. Alternatives to the median absolute deviation. Journal of the American Statistical Association 88 (424), 1273–1283.
Tukey, J.W., 1977. Exploratory Data Analysis. Addison-Wesley, Reading, MA, 688pp.
Webster, R., Oliver, M.A., 2001. Geostatistics for Environmental Scientists. Wiley, Chichester, 271pp.