Project number: RSM-80820
BPA number: 324-00-RSM/INTERN
Date: 19 July 2000
Statistics Netherlands, Division Research and Development, Department of Statistical Methods
ROBUST MULTIVARIATE OUTLIER DETECTION
Peter de Boer and Vincent Feltkamp
Summary: Two robust multivariate outlier detection methods, based on the
Mahalanobis distance, are reported: the projection method and the Kosinski
method. The ability of those methods to detect outliers is exhaustively tested.
A comparison is made between the two methods as well as a comparison with
other robust outlier detection methods that are reported in the literature.
The opinions in this paper are those of the authors and do not necessarily reflect
those of Statistics Netherlands.
1. Introduction
The statistical process can be separated into three steps. The input phase involves the
collection of data by means of surveys and registers. The throughput phase involves
preparing the raw data for tabulation purposes, weighting and variance estimation.
The output phase involves the publication of population totals, means, correlations,
etc., which come out of the throughput phase.
Data editing is one of the first steps in the throughput process. It is the procedure for
detecting and adjusting individual errors in data. Editing also comprises the
detection and treatment of correct but influential records, i.e. records that have a
substantial contribution to the aggregates to be published.
The search for suspicious records, i.e. records that are possibly wrong or influential,
can be done in basically two ways. The first way is by examining each record and
looking for strange or wrong fields or combinations of fields. In this view a record
includes all fields referring to a particular unit, be it a person, household or business
unit, even if those fields are stored in separate files, like files containing survey data
and files containing auxiliary data.
The second way is by comparing each record with the other records. Even if the
fields of a particular record obey all the edit rules one has laid down, the record
could be an outlier. An outlier is a record that does not follow the bulk of the
records.
The data can be seen as a rectangular file, each row denoting a particular record and
each column a particular variable. The first way of searching for suspicious data can
be seen as searching in rows, the second way as searching in columns. It is remarked
that some and possibly many errors can be detected by both ways.
Records can be outliers while their outlyingness is not apparent from examining the
variables, or columns, one by one. For instance, a company that has a relatively
large turnover but has paid relatively little tax might be an outlier in neither of
the variables separately, but could be an outlier considering the combination. Outliers
involving more than one variable are called multivariate outliers.
In order to quantify how far a record lies from the bulk of the data, one needs a
measure of distance. In the case of categorical data no useful distance measure
exists, but in the case of continuous data the so-called Mahalanobis distance is often
employed.
A distance measure should be robust against the presence of outliers. It is known
that the classical Mahalanobis distance is not. This means that the outliers, which are
to be detected, seriously hamper the detection of those outliers. Hence, a robust
version of the Mahalanobis distance is needed.
In this report two robust multivariate outlier detection algorithms for continuous
data, based on the Mahalanobis distance, are reported. In the next section the
classical Mahalanobis distance is introduced and ways to robustify this distance
measure are discussed. In sections 3 and 4 the two algorithms, successively the
Kosinski method and the projection method, are presented. In section 5 a
comparison between the two algorithms is made as well as a comparison with other
algorithms reported in the outlier literature. A practical example, and problems
involved with it, is the subject of section 6. In section 7 some concluding remarks
are made.
2. The Mahalanobis distance
The Mahalanobis distance is a measure of the distance between a point and the
center of all points, with respect to the scale of the data, and in the multivariate
case with respect to the shape of the data as well. It is remarked that in regression
analysis another distance measure is more convenient: instead of the distance
between a point and the center of the data, the distance between the point and the
regression plane is used (see also section 5).
Suppose we have a continuous data set $y_1, y_2, \ldots, y_n$. The vectors $y_i$ are
$p$-dimensional, i.e. $y_i = (y_{i1}\, y_{i2} \ldots y_{ip})^t$, where $y_{iq}$ denotes a real number. The
classical squared Mahalanobis distance is defined by

$$MD_i^2 = (y_i - \bar{y})^t C^{-1} (y_i - \bar{y})$$

where $\bar{y}$ and $C$ denote the mean and the covariance matrix respectively:

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad C = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^t$$

In the case of one-dimensional data the covariance matrix reduces to the variance
and the Mahalanobis distance to $MD_i = |y_i - \bar{y}| / \sigma$, where $\sigma$ denotes the standard
deviation.
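As an illustration, the classical (non-robust) Mahalanobis distance defined above can be computed in a few lines. The sketch below (Python with NumPy; not part of the original report) uses the sample mean and the $1/(n-1)$ covariance matrix exactly as in the formulas:

```python
import numpy as np

def mahalanobis_sq(Y):
    """Classical squared Mahalanobis distance of each row of Y (n x p)
    to the sample mean, using the sample covariance matrix."""
    Y = np.asarray(Y, dtype=float)
    center = Y.mean(axis=0)
    C = np.atleast_2d(np.cov(Y, rowvar=False))   # 1/(n-1) normalisation
    C_inv = np.linalg.inv(C)
    diff = Y - center
    # (y_i - ybar)^t C^{-1} (y_i - ybar) for every row i at once
    return np.einsum('ij,jk,ik->i', diff, C_inv, diff)
```

For one-dimensional input this reduces to $((y_i - \bar{y})/\sigma)^2$, matching the univariate formula above.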
Another point of view results by noting that the Mahalanobis distance is the solution
of a maximization problem. The maximization problem is defined as follows. The
data points $y_i$ can be projected on a projection vector $a$. The outlyingness of the
point $y_i$ is the squared projected distance $(a^t(y_i - \bar{y}))^2$, with respect to the
projected variance $a^t C a$. Assuming that the covariance matrix $C$ is positive
definite, there exists a non-singular matrix $A$ such that $AA^t = C$. Using the
Cauchy-Schwarz inequality we have

$$\frac{(a^t(y_i - \bar{y}))^2}{a^t C a}
= \frac{(a^t A A^{-1}(y_i - \bar{y}))^2}{a^t C a}
\le \frac{(a^t A A^t a)\,\big((y_i - \bar{y})^t (A^{-1})^t A^{-1}(y_i - \bar{y})\big)}{a^t C a}
= \frac{(a^t C a)\,\big((y_i - \bar{y})^t C^{-1}(y_i - \bar{y})\big)}{a^t C a}
= MD_i^2$$

with equality if and only if $A^t a = c\,A^{-1}(y_i - \bar{y})$ for some constant $c$. Hence

$$MD_i^2 = \sup_{a} \frac{(a^t(y_i - \bar{y}))^2}{a^t C a}$$

i.e., the Mahalanobis distance is equal to the supremum of the outlyingness of $y_i$
over all possible projection vectors.
If the data set $y_i$ is multivariate normal, the squared Mahalanobis distances $MD_i^2$
follow the $\chi^2$ distribution with $p$ degrees of freedom.
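This $\chi^2_p$ property suggests a natural cutoff for flagging points. A small numerical check (a sketch, not from the report; for $p=2$ the $\chi^2$ quantile has the closed form $-2\ln\alpha$, so no statistics library is needed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 100_000, 2, 0.01

# For standard normal data with known center 0 and covariance I, the squared
# Mahalanobis distance reduces to the squared norm and is chi^2_p distributed.
Y = rng.standard_normal((n, p))
md_sq = (Y ** 2).sum(axis=1)

cutoff = -2.0 * np.log(alpha)          # chi^2_{2,1-alpha} quantile, ~9.21 for alpha=0.01
frac_above = (md_sq > cutoff).mean()   # close to alpha
```

The report uses the cutoff $\chi^2_{p,1-\alpha}$ throughout; for general $p$ a library quantile function would replace the closed form used here.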
The classical Mahalanobis distance suffers however from the masking and
swamping effect. Outliers seriously affect the mean and the covariance matrix in
such a way that the Mahalanobis distance of outliers could be small (masking),
while the Mahalanobis distance of points which are not outliers could be large
(swamping).
Therefore, robust estimates of the center and the covariance matrix should be found
in order to calculate a useful Mahalanobis distance. In the univariate case the most
robust choice is the median (med) and the median of absolute deviations (mad)
replacing the mean and the standard deviation respectively. The med and mad have a
robustness of 50%. The robustness of a quantity is defined as the maximum
percentage of data points that can be moved arbitrarily far away while the change in
that quantity remains bounded.
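In one dimension this robust distance is simple to write down. The sketch below (an illustration, not code from the report) replaces the mean by the median and the standard deviation by 1.484 times the mad, the consistency factor used later in section 3.2:

```python
import numpy as np

def mad_outlyingness(x):
    """Robust univariate outlyingness |x - med| / (1.484 * mad):
    the one-dimensional robust analogue of the Mahalanobis distance."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))   # median of absolute deviations
    return np.abs(x - med) / (1.484 * mad)
```

Because med and mad have a 50% breakdown point, a single wild value barely changes them, so its outlyingness remains large instead of masking itself.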
It is not trivial to generalize the robust one-dimensional Mahalanobis distance to the
multivariate case. Several robust estimators for the location and scale of multivariate
data have been developed. We have tested two methods, the projection method and
the Kosinski method. Other methods for robust outlier detection will be discussed in
section 5, where we will compare the different methods on their ability to detect
outliers.
In the next two sections the Kosinski method and the projection method will be
discussed in detail.
3. The Kosinski method
3.1 The principle of Kosinski

The method discussed in this section was quite recently published by Kosinski. The
idea of Kosinski is basically the following:
1) start with a few, say g, points, denoted the “good” part of the data set;
2) calculate the mean and the covariance matrix of those points;
3) calculate the Mahalanobis distances of the complete data set;
4) increase the good part of the data set with one point by selecting the g+1 points
with the smallest Mahalanobis distance and define g=g+1;
5) return to step 2 or stop as soon as the good part contains more than half the data
set and the smallest Mahalanobis distance of the remaining points is higher than
a predefined cutoff value. At the end the remaining part, or the “bad” part,
should contain the outliers.
In order to assure that the good part will contain no outliers at the end, it is important
to start the algorithm with points that are all good. In the paper by Kosinski this
problem is solved by repeatedly choosing a small set of random points and
performing the algorithm for each set. The number of starting sets is
taken high enough to be sure that at least one set contains no outliers.
We made two major adjustments to the Kosinski algorithm. The first one is the
choice of the starting data set. The demanded property of the starting data set is that
it contains no outliers. It does not matter how these points are found. We choose the
starting data set by robustly estimating the center of the data set and selecting the
p+1 closest points. In the case of a p-dimensional data set, p+1 points are needed to
get a useful starting data set, since the covariance matrix of a set of at most p points
is always non-invertible. A separation of the data set in p+1 good points and n-p-1
bad points is called an elemental partition.
The center is estimated by calculating the mean of the data set, neglecting all
univariate robustly detected outliers. This is of course just a crude estimate, but it is
satisfactory for the purpose of selecting a good starting data set. Another crude
estimate of the center that was tried out is the coordinate-wise median. The
coordinate-wise median appeared to result in less satisfactory starting data sets.
The p+1 points closest to the mean are chosen, where closest is defined by an
ordinary distance measure. In order to take the different scales and units of the
different dimensions into account, the data set is coordinate-wisely scaled before the
mean is calculated, i.e. each component of each point is divided by the median of
absolute deviations of the dimension concerned. It is remarked that, after the first
p+1 points are selected, the algorithm continues with the original unscaled data.
It is, of course, possible to construct a data set for which this algorithm fails to select
p+1 points that are all good points. However, in all the data sets exploited in this
report, artificial and real, this choice of a starting data set worked very well.
This adjustment results in a spectacular gain in computer time, since the algorithm
has to be run only once instead of more than once. Kosinski estimates the required
number of random starting data sets in his own original algorithm to be
approximately 35 in the case of 2-dimensional data sets, and up to 10000 in 10
dimensions.
The other adjustment is in the expansion of the good part. In the Kosinski paper the
increment is always one point. We implemented an increment proportional to the
good part already found, for instance 10%. This means that the good part grows by
10% at each step. This speeds up the algorithm as well,
especially in large data sets. The original algorithm with one-point increment scales
with $n^2$, where $n$ is the number of data points, while the algorithm with proportional
increment scales with $n \ln n$. This adjustment was also tested and appeared to work
very well.
In the remainder of this report, "the Kosinski method" denotes the adjusted Kosinski
method, unless otherwise noted.
3.2 The Kosinski algorithm

The purpose of the algorithm is, given a set of $n$ multivariate data points
$y_1, y_2, \ldots, y_n$, to calculate the outlyingness $u_i$ for each point $i$. The algorithm can be
summarized as follows.

Step 0. In: data set

The algorithm is started with a set of continuous $p$-dimensional data $y_1, y_2, \ldots, y_n$,
where $y_i = (y_{i1} \ldots y_{ip})^t$.
Step 1. Choose an elemental partition

A good part of $p+1$ points is found as follows.

• Calculate the med and mad for each dimension $q$:

$$M_q = \mathrm{med}_k\, y_{kq}, \qquad S_q = \mathrm{med}_l\, |y_{lq} - M_q|$$

• Divide each component $q$ of each data point $i$ by the mad of the dimension
concerned. The scaled data points are denoted by the superscript $s$:

$$y_{iq}^s = \frac{y_{iq}}{S_q}$$

• Declare a point to be a univariate outlier if at least one component of the data
point is farther than 2.5 standard deviations away from the scaled median. The
standard deviation is approximated by 1.484 times the mad (see section 4.1 for
the background of the factor 1.484). So calculate for each component $q$ of each
point $i$:

$$u_{iq} = \frac{1}{1.484} \left| y_{iq}^s - \frac{M_q}{S_q} \right|$$

If $u_{iq} > 2.5$ for any $q$, then point $i$ is a univariate outlier.

• Calculate the mean of the data set, neglecting the univariate outliers:

$$\bar{y}^s = \frac{1}{n_0} \sum_{i:\ y_i\ \text{is no outlier}} y_i^s$$

where $n_0$ denotes the number of points that are no univariate outliers.

• Select the $p+1$ points that are closest to the mean. Define those points to be the
good part of the data set. So calculate:

$$d_i = \| y_i^s - \bar{y}^s \|$$

The $g = p+1$ points with the smallest $d_i$ form the good part, denoted by $G$.
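Step 1 can be sketched as follows (a Python/NumPy illustration under the assumption that no mad is zero; the report's prototype was written in Borland Pascal):

```python
import numpy as np

def elemental_partition(Y, z=2.5):
    """Step 1 sketch: return the indices of the p+1 points closest to the
    mean of the scaled data, the mean being taken over points that are
    univariate outliers in no dimension."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    M = np.median(Y, axis=0)                 # med per dimension
    S = np.median(np.abs(Y - M), axis=0)     # mad per dimension (assumed nonzero)
    Ys = Y / S                               # coordinate-wise scaled data
    u = np.abs(Ys - M / S) / 1.484           # scaled deviation from the median
    ok = (u <= z).all(axis=1)                # univariate outlier in no component
    center = Ys[ok].mean(axis=0)
    d = np.linalg.norm(Ys - center, axis=1)  # ordinary distance in scaled space
    return np.argsort(d)[:p + 1]             # good part: the p+1 closest points
```

A point far out in any single coordinate is excluded from the mean, so the returned starting set avoids univariately visible outliers.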
Step 2. Iteratively increase the good part

The good part is increased until a certain stop criterion is fulfilled.

• Continue with the original data set $y_i$, not with the scaled data set $y_i^s$.

• Calculate the mean and the covariance matrix of the good part:

$$\bar{y} = \frac{1}{g} \sum_{i \in G} y_i, \qquad C = \frac{1}{g-1} \sum_{i \in G} (y_i - \bar{y})(y_i - \bar{y})^t$$

• Calculate the Mahalanobis distance of all the data points:

$$MD_i^2 = (y_i - \bar{y})^t C^{-1} (y_i - \bar{y})$$

• Calculate the number of points with a Mahalanobis distance smaller than a
predefined cutoff value. A useful cutoff value is $\chi^2_{p,1-\alpha}$, with $\alpha = 1\%$.

• Increase the good part with a predefined percentage (a useful percentage is 20%)
by selecting the points with the smallest Mahalanobis distances, but not more
than up to

a) half the data set if the good part is smaller than half the data set
($g < h = [\frac{1}{2}(n+p+1)]$);

b) the number of points with a Mahalanobis distance smaller than the cutoff if
the good part is larger than half the data set.

• Stop the algorithm if the good part was already larger than half the data set and
no more points were added in the last iteration.
Step 3. Out: outlyingnesses
The outlyingness of each point is now simply the Mahalanobis distance of the point,
calculated with the mean and the covariance matrix of the good part of the data set.
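Steps 2 and 3 amount to a forward search. A compact sketch of the loop (Python/NumPy; an illustration of the scheme above, not the report's program — the $\chi^2$ cutoff is passed in as a number and `inc` is the proportional increment):

```python
import numpy as np

def forward_search(Y, good_idx, cutoff_sq, inc=0.2):
    """Steps 2-3 sketch: grow the good part from the elemental partition and
    return the outlyingness (squared Mahalanobis distance) of every point."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    g = len(good_idx)
    h = (n + p + 1) // 2                         # "half the data set"
    while True:
        G = Y[good_idx]
        center = G.mean(axis=0)
        C_inv = np.linalg.inv(np.cov(G, rowvar=False))
        diff = Y - center
        md_sq = np.einsum('ij,jk,ik->i', diff, C_inv, diff)
        below = int((md_sq < cutoff_sq).sum())   # points under the cutoff
        target = int(np.ceil(g * (1 + inc)))     # proportional increment
        if g < h:
            target = min(target, h)              # grow at most to half the data
        else:
            target = min(target, below)          # then only up to the cutoff count
            if target <= g:                      # nothing added any more: stop
                return md_sq
        good_idx = np.argsort(md_sq)[:target]    # points with smallest distances
        g = target
```

Combined with the elemental partition of Step 1, the points whose returned outlyingness exceeds the cutoff are reported as outliers.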
3.3 Test results

A prototype/test program was implemented in a Borland Pascal 7.0 environment.
Documentation of the program is published elsewhere. We successively tested the
choice of the elemental partition by means of the mean, the amount of swamped
observations of data sets containing no outliers, the amount of masked and swamped
observations of data sets containing outliers, the algorithm with proportional
increment and the time-performance of the proportional increment of the good part
compared to the one-point increment. Finally, we tested the sensitivity of the
number of detected outliers to the cutoff value and the increment percentage in some
known data sets.
3.3.1 Elemental partition
First of all, the choice of the elemental partition was tested with the generated data
set published by Kosinski. The Kosinski data set is a kind of worst-case data set. It
contains a large amount of outliers (40% of the data) and the outliers are distributed
with a variance much smaller than the variance of the good points.
Before using the mean, we calculated the coordinate-wise median as a robust
estimator of the center of the data, and selected the three closest points. This strategy
failed. Although the median has a 50% robustness, the 40% outliers strongly shift
the median. Hence, one of the three selected points appeared to be an outlier. As a
consequence, the forward search algorithm indicated all points to be good points, i.e.
all the outliers were masked.
This was the reason we searched for another robust measure of the location of the
data. One of the simplest ideas is to search for univariate outliers first, and to
calculate the mean of the points that are outlier in none of the dimensions.
The selected points, the three points closest to the mean, all appeared to be good
points. Moreover, the forward search algorithm, applied with this elemental
partition, successfully distinguished the outliers from the good points.
All following tests were performed using this “mean” to select the first p+1 points.
For all tested data sets the selected p+1 points appeared to be good points, resulting
in a successful forward search. It is possible, in principle, to construct a data set for
which this selection algorithm still fails, for instance a data set with a large fraction
of outliers which are univariately invisible and with no unambiguous dividing line
between the group of outliers and the group of good points. This is, however, a very
hypothetical situation.
3.3.2 Swamping
A simulation study was performed in order to determine the average fraction of
swamped observations in normally distributed data sets. In large data sets almost
always a few points are indicated to be outliers, even if the whole data set nicely
follows a normal distribution. This is due to the cutoff value. If a cutoff value of
$\chi^2_{p,1-\alpha}$ is used as discriminator between good points and outliers in a
$p$-dimensional standard normal data set, a fraction of $\alpha$ data points will have a
Mahalanobis distance larger than the cutoff value.
For each dimension p between 1 and 8 we generated 100 standard normal data sets
of 100 points. The Kosinski algorithm was run twice on each data set, once with a
cutoff value $\chi^2_{p,0.99}$ and once with $\chi^2_{p,0.95}$. Each point that is indicated to be an
outlier is a swamped observation since there are no true outliers by construction. We
calculated the average fraction of swamped observations (i.e. the number of
swamped observations of each data set divided by 100, the number of points in the
data set, averaged over all 100 data sets). Results are shown in Table 3.1.
α      p=1    2      3      4      5      6      7      8
0.01   0.015  0.011  0.010  0.008  0.008  0.008  0.007  0.007
0.05   0.239  0.112  0.081  0.070  0.059  0.052  0.045  0.042

Table 3.1. The average fraction of swamped observations of the simulations of 100
generated p-dimensional data sets of 100 points for each p between 1 and 8, with
cutoff value $\chi^2_{p,1-\alpha}$.
For $\alpha = 0.01$ the fraction of swamped observations is very close to the value of $\alpha$
itself. These results are very similar to the results of the original Kosinski algorithm.
For $\alpha = 0.05$, however, the average fraction of swamped observations is much larger
than 0.05 for the lower dimensions, especially for p=1 and p=2. The reason for this
is the following. Consider a one-dimensional standard normal data set. If the
variance of all points is used, the outlyingness of a fraction of $\alpha$ points will be larger
than $\chi^2_{1,1-\alpha}$. However, in the Kosinski algorithm the variance is calculated over all
points except at least the fraction of $\alpha$ points with the largest outlyingnesses. This
variance is smaller than the variance of all points. Hence, the Mahalanobis distances
are overestimated and too many points are indicated to be outliers. This is a self-
magnifying effect. More outliers lead to a smaller variance, which leads to more
points indicated to be outliers, etc.
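The effect described above can be made visible in one step with a small simulation (a sketch, not from the report): estimating the variance after setting aside the 5% most deviating points shrinks it, so that markedly more than 5% of the points end up above the cutoff.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)
cutoff = 3.841                     # chi^2_{1,0.95} quantile (= 1.96 squared)

# Using the variance of all points: about 5% of the points are flagged.
md_full = (x - x.mean()) ** 2 / x.var()
frac_full = (md_full > cutoff).mean()

# Using the variance of all points except the 5% with the largest deviations
# from the median: the variance shrinks, the distances are inflated, and
# clearly more than 5% of the points are flagged.
keep = np.argsort(np.abs(x - np.median(x)))[: int(0.95 * x.size)]
core = x[keep]
md_trim = (x - core.mean()) ** 2 / core.var()
frac_trim = (md_trim > cutoff).mean()
```

Iterating this trimming would shrink the variance further, which is the self-magnifying effect noted above.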
The effect is strongest in one dimension. In higher dimensions the points with a
large Mahalanobis distance are "all around". Therefore they influence the
variance in the separate directions less.
Apparently, the effect is quite strong for $\alpha = 0.05$, but almost negligible for $\alpha = 0.01$. In
the remaining tests $\alpha = 0.01$ is used, unless otherwise stated.
3.3.3 Masking and swamping
The ability of the algorithm to detect outliers was tested in another simulation. We
generated data sets in the same way as is done in the Kosinski paper in order to get a
fair comparison between the original and our adjusted Kosinski algorithm. Thus we
generated data sets of 100 points containing good points as well as outliers. Both the
good points and the outliers were generated from a multivariate normal distribution, with
$\sigma^2 = 40$ for the good points and $\sigma^2 = 1$ for the bad points. The distance between
the center of the good points and the bad points is denoted by $d$. The vector between
the centers is along the vector of 1's.
We varied the dimension ($p$ = 2, 5), the fraction of outliers (0.10-0.45), and the
distance ($d$ = 20-60). We calculated the fraction of masked outliers (the number of
masked outliers of each data set divided by the number of outliers) and the fraction
of swamped points (the number of swamped points of each data set divided by the
number of good points), both averaged over 100 simulation runs for each set of
parameters p, d, and fraction of outliers. Results are shown in Table 3.2.
                  p=2                                          p=5
fraction of   fraction of    fraction of      fraction of   fraction of    fraction of
outliers      masked obs.    swamped obs.     outliers      masked obs.    swamped obs.

                  d=20                                          d=25
0.10          0.81           0.009            0.10          0.90           0.008
0.20          0.89           0.014            0.20          0.91           0.021
0.30          0.88           0.022            0.30          0.93           0.146
0.40          0.86           0.146            0.40          0.97           0.551
0.45          0.88           0.350            0.45          1.00           0.855

                  d=30                                          d=40
0.10          0.03           0.011            0.10          0.00           0.008
0.20          0.00           0.011            0.20          0.04           0.008
0.30          0.01           0.010            0.30          0.03           0.022
0.40          0.05           0.043            0.40          0.02           0.020
0.45          0.01           0.019            0.45          0.01           0.014

                  d=40                                          d=60
0.10          0.00           0.011            0.10          0.00           0.008
0.20          0.00           0.011            0.20          0.00           0.007
0.30          0.00           0.011            0.30          0.00           0.009
0.40          0.00           0.009            0.40          0.00           0.010
0.45          0.00           0.010            0.45          0.00           0.008

Table 3.2. Average fraction of masked and swamped observations of 2- and
5-dimensional data sets over 100 simulation runs. Each data set consisted of 100
points with a certain fraction of outliers. The good (bad) points were generated from
a multivariate normal distribution with $\sigma^2 = 40$ ($\sigma^2 = 1$) in each direction. The
distance between the center of the good points and the bad points is denoted by $d$.
The following conclusions can be drawn from these results. The algorithm is said to
be performing well if the fraction of masked outliers is close to zero and the fraction
of swamped observations is close to $\alpha = 0.01$. The first conclusion is: the larger the
distance between the good points and the bad points the better the algorithm
performs. This conclusion is not surprising and is in agreement with Kosinski’s
results. Secondly, the higher the dimension, the worse the performance of the
algorithm. In five dimensions the algorithm starts to perform well at d=40, and close
to perfect at d=60, while in two dimensions the performance is good at d=30,
respectively perfect at d=40. The original algorithm did not show such a dependence
on the dimension. It is remarked, however, that the paper by Kosinski does not give
enough details for a good comparison on this point. Third, for both two and five
dimensions the adjusted algorithm performs worse than the original algorithm. The
original algorithm is almost perfect at d=25 for both p=2 and p=5, while the adjusted
algorithm is not perfect until d=40 or d=60. This is the price that is paid for the large
gain in computer time. The fourth conclusion is: the performance of the algorithm is
almost not dependent on the fraction of outliers, in agreement with Kosinski’s
results. In some cases, the algorithm even seems to perform better for higher
fractions. This is however due to the relatively small number of points (100) per data
set. For very large data sets and very large number of simulation runs this artifact
will disappear.
p   d    fr     inc    masked   swamped
2   20   0.10   1p     0.79     0.010
2   20   0.10   10%    0.80     0.009
2   20   0.10   100%   0.80     0.009
2   20   0.40   1p     0.86     0.225
2   20   0.40   10%    0.86     0.146
2   20   0.40   100%   0.89     0.093

2   30   0.10   1p     0.00     0.011
2   30   0.10   10%    0.03     0.011
2   30   0.10   100%   0.02     0.011
2   30   0.40   1p     0.05     0.042
2   30   0.40   10%    0.05     0.043
2   30   0.40   100%   0.08     0.038

2   40   0.10   1p     0.00     0.011
2   40   0.10   10%    0.00     0.011
2   40   0.10   100%   0.00     0.011
2   40   0.40   1p     0.00     0.010
2   40   0.40   10%    0.00     0.009
2   40   0.40   100%   0.02     0.009

5   40   0.10   1p     0.00     0.008
5   40   0.10   10%    0.00     0.008
5   40   0.10   100%   0.01     0.008
5   40   0.40   1p     0.01     0.016
5   40   0.40   10%    0.01     0.016
5   40   0.40   100%   0.06     0.035

Table 3.3. Average fraction of masked and swamped observations for p-dimensional
data sets with a fraction of fr outliers at a distance d from the good points (for more
details about the data sets see Table 3.2), calculated with runs with either one-point
increment (1p) or proportional increment (10% or 100% of the good part).
3.3.4 Proportional increment
Until now all tests have been performed using the one-point increment, i.e. at each
step of the algorithm the size of the good part is increased with just one point. In
section 3.1 it was already mentioned that a gain in computer time is possible by
increasing the size of the good part with more than one point per step. The
simulations on the masked and swamped observations were repeated with the
proportional increment algorithm. The increment with a certain percentage was
tested for percentages up to 100% (which means that the size of the good part is
doubled at each step).
The results of Table 3.1, showing the average fraction of swamped observations in
outlier-free data sets, did not change. Small changes showed up for large
percentages in the presence of outliers. A summary of the results is shown in Table
3.3. In order to avoid an unnecessary profusion of data we only show the results for
p=2 in some relevant cases and, as an illustration, in a few cases for p=5.
A general conclusion from the table is that for a wide range of percentages the
proportional increment algorithm works satisfactorily. For a percentage of 100%
outliers are masked slightly more frequently than for lower percentages. The
differences between 10% increment and one-point increment are negligible.
3.3.5 Time dependence
To illustrate the possible gain with the proportional increment we measured the time
per run for p-dimensional data sets of n points, with p ranging from 1 to 8 and n
from 50 to 400. The simulations were performed with outlier-free generated data
sets so that the complete data sets had to be included in the good part. This was done
in order to obtain useful information about the dependence of the simulation times
on the number of points. Table 3.4 shows the results for the simulation runs with
one-point increment. The results for the runs with a proportional increment of 10%
are shown in Table 3.5.
n     p=1    2      3      4      5      6      7      8
50    0.09   0.18   0.29   0.45   0.64   0.84   1.08   1.35
100   0.36   0.68   1.05   1.75   2.5    3.3    4.3    5.5
200   1.46   2.8    4.6    7.0    10
400   6.2    12

Table 3.4. Time (in seconds) per run on p-dimensional data sets of n points, using
the one-point increment.
n     p=1    2      3      4      5      6      7      8
50    0.05   0.10   0.16   0.23   0.31   0.39   0.52   0.62
100   0.14   0.24   0.39   0.56   0.76   1.00   1.25   1.55
200   0.33   0.60   0.92   1.35   1.90
400   0.80   1.40

Table 3.5. Time (in seconds) per run on p-dimensional data sets of n points, using
the proportional increment (perc=10%).
Let us denote the time per run as a function of n for fixed p by $t_p$, and the time per
run as a function of p for fixed n by $t_n$. For the one-point increment simulations $t_p$ is
approximately proportional to $n^2$. This is as expected since there are O(n) steps with
an increment of one point, and at each step the Mahalanobis distances have to be
calculated for each point (O(n)) and sorted (O(n ln n)). For the simulations with
proportional increment $t_p$ is approximately O(n ln n), due to the fact that only
O(ln n) steps are needed instead of O(n). As a consequence there is a substantial
gain in the time per run, ranging from a factor of 2 for 50 points up to a factor of 8
for 400 points.
The time per run for fixed n, $t_n$, is approximately proportional to $p^{1.5}$, for both one-
point and proportional increment runs. The exponent 1.5 is just an empirical average
over the range p=1..8 and is the result of several O(p) and O(p²) steps. Since the
exponent is much smaller than 2, it is more efficient to search for outliers in one
p-dimensional run than in ½p(p-1) 2-dimensional runs, one for each pair of
dimensions, even if one is not interested in outliers in more than 2 dimensions.
Consider for instance p=8, n=50. One run takes 0.62 seconds. However, a total of 1.4
seconds would be needed for the 28 runs in each pair of dimensions, each run taking
0.05 seconds.
3.3.6 Sensitivity to parameters
The Kosinski algorithm was tested on the twelve data sets described in section 5. A
full description of the outliers and a comparison of the results with the results of the
projection algorithm as well as with other methods described in the literature is
given in that section. In the present section we restrict the discussion to the
sensitivity of the number of outliers to the cutoff and the increment percentage.
The algorithm was run with a cutoff $\chi^2_{p,1-\alpha}$ for $\alpha = 1\%$ as well as $\alpha = 5\%$.
Furthermore, both one-point increment and proportional increment (in the range 0-
40%) were used. The number of detected outliers of the twelve data sets is shown in
Table 3.6.
It is clear that the number of outliers for a specific data set is not the same for each
set of parameters. It is remarked that, in all cases, if different sets of parameters lead
to the same number of outliers, the outliers are exactly the same points. Moreover, if
one set of parameters leads to more outliers than another set, all outliers detected by
the latter are also detected by the former (these are empirical results).
Let us first discuss the differences between the detection with $\alpha = 1\%$ and with $\alpha = 5\%$.
It is obvious that in many cases $\alpha = 5\%$ results in slightly more outliers than $\alpha = 1\%$.
However, in two cases the differences are substantial, i.e. in the Stackloss data and
in the Factory data.
In the Stackloss data five outliers are found for $\alpha = 5\%$ using moderate increments,
while $\alpha = 1\%$ shows no outliers at all. The reason for this difference is the relatively
small number of points related to the dimension of the data set. It has been argued
by Rousseeuw that the ratio n/p should be larger than 5 in order to be able to detect
outliers reliably. If n/p is smaller than 5 one comes to a point where it is not useful
to speak about outliers since there is no real bulk of data.
With n=21 and p=4 the Stackloss data lie on the edge of meaningful outlier
detection. Moreover, if the five points which are indicated as outliers with $\alpha = 5\%$ are
left out, only 16 good points remain, resulting in a ratio n/p=4. In such a case any
outlier detection algorithm will presumably fail to find outliers consistently.
Data set                   p    n     inc       α=5%   α=1%
1. Kosinski                2    100   1p        42     40
                                      ≤40%      42     40
2. Brain mass              2    28    1p        5      3
                                      ≤10%      5      3
                                      15-20%    4      3
                                      30-40%    3      3
3. Hertzsprung-Russel      2    47    1p        7      6
                                      ≤30%      7      6
                                      40%       6      6
4. Hadi                    3    25    1p        3      3
                                      ≤5%       3      3
                                      10%       3      0
                                      15-25%    3      3
                                      30%       3      0
                                      40%       3      3
5. Stackloss               4    21    1p        5      0
                                      ≤17%      5      0
                                      18-24%    4      0
                                      25-30%    1      0
                                      40%       0      0
6. Salinity                4    28    1p        4      2
                                      ≤30%      4      2
                                      40%       2      2
7. HBK                     4    75    1p        15     14
                                      ≤30%      15     14
                                      40%       14     14
8. Factory                 5    50    1p        20     0
                                      ≤40%      20     0
9. Bush fire               5    38    1p        16     13
                                      ≤40%      16     13
10. Wood gravity           6    20    1p        6      5
                                      ≤20%      6      5
                                      30%       6      6
                                      40%       6      5
11. Coleman                6    20    1p        7      7
                                      ≤40%      7      7
12. Milk                   8    85    1p        20     17
                                      ≤30%      20     17
                                      40%       18     15

Table 3.6. Number of outliers detected by the Kosinski algorithm with a cutoff of
$\chi^2_{p,1-\alpha}$, for $\alpha = 1\%$ respectively $\alpha = 5\%$, with either one-point (1p) or proportional
increment in the range 0-40%.
The Factory data is an interesting case. For $\alpha = 5\%$ twenty outliers are detected,
which is 40% of all points, while detection with $\alpha = 1\%$ shows no outliers.
Explorative data analysis shows that about half the data set is quite narrowly
concentrated in a certain region, while the other half is distributed over a much
larger space. There is however no clear distinction between these two parts. The
more widely distributed part is rather a very thick tail of the other part. In such a
case the effect that the algorithm with $\alpha = 5\%$ tends to detect too many outliers, which
was explained in the discussion of Table 3.1, is very strong. It is questionable whether the
indicated points should be considered as outliers.
Let us now discuss the sensitivity of the number of detected outliers to the
increment. At low percentages the number of outliers is always the same as for the
one-point increment; in fact, at very low percentages the proportional increment
procedure leads to an increment of just one point per step, making the two
algorithms identical. For most data sets the number of outliers is constant over a
wide range of percentages and starts to differ slightly only at 30-40% or higher.
Three of the twelve data sets behave differently: the Brain mass data, the Hadi data,
and the Stackloss data.
The Brain mass data shows 5 outliers at low percentages for α=5%. At percentages
around 15% the number of outliers is only 4 and at 30% only 3. So the number of
outliers changes earlier (at 15%) than in most other data sets (≥30%). For α=1% the
number of outliers is constant over the whole range. In fact, the three outliers which
are found at 30-40% for α=5% are exactly the same as the three outliers found for
α=1%. The two outliers which are missed at higher percentages for α=5% both lie
just above the cutoff value. Therefore it is disputable whether they are real outliers at
all.
The Hadi data shows strange behavior. At all percentages for α=5% and at most
percentages for α=1% three outliers are found. However, near 10% and near 30% no
outliers are detected. Again, the three outliers are disputable. All have a
Mahalanobis distance just above the cutoff (see Table 5.2). Hence it is not strange
that sometimes these three points are included in the good part (the three points lie
close together; hence the inclusion of one of them in the good part leads to low
Mahalanobis distances for the other two as well). On the other hand, it is also not a
big problem, since it is rather a matter of taste than a matter of science to call the
three points outliers or good points.
The Stackloss data shows a decreasing number of outliers for α=5% at relatively low
percentages, as in the Brain mass data. Here the sensitivity to the percentage is
related to the low ratio n/p, as discussed previously.
In conclusion, for increments up to 30% the same outliers are found as with the one-
point increment. In cases where this is not true, the supposed outliers always have an
outlyingness slightly above or below the cutoff, so that missing such outliers has no
serious consequences. Furthermore, relatively low cutoff values could lead to
disproportionate swamping.
4. The projection method

4.1 The principle of projection

The projection method is based on the idea that outliers in univariate data are easily
recognized, visually as well as by computational means. In one dimension the
Mahalanobis distance is simply

    |y_i - ȳ| / σ.

A robust version of the univariate outlyingness is found by replacing the mean by
the med and the standard deviation by the mad. Denoting the robust outlyingness by
u_i, this leads to

    u_i = |y_i - M| / S

where M and S denote the med respectively the mad:

    M = med_k y_k
    S = med_l |y_l - M|
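As an illustration (ours, not part of the report), this univariate med/mad outlyingness takes only a few lines of Python; the function name is our own:

```python
from statistics import median

def robust_outlyingness(y):
    """u_i = |y_i - M| / S, with M the median (med) of the data and
    S the median of absolute deviations from M (mad)."""
    M = median(y)
    S = median(abs(v - M) for v in y)
    return [abs(v - M) / S for v in y]

# A clear univariate outlier receives a very large outlyingness,
# while the bulk of the points stays small:
u = robust_outlyingness([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])
```

Because med and mad ignore the extreme value, the outlier at 25.0 scores far above the cutoff while the five bulk points stay below 2.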
In the case of multivariate data the idea is to “look” at the data set from all possible
directions and to “see” whether a particular data point lies far away from the bulk of
the data points. Looking, in this context, means projecting the data set on a projection
vector a; seeing means calculating the outlyingness as is done for univariate data. The
ultimate outlyingness of a point is the maximum of the outlyingnesses over all
projection directions.
The outlyingness defined in this way corresponds to the multivariate Mahalanobis
distance, as is shown in section 2. Recalling the expression for the Mahalanobis
distance:

    MD_i² = sup_{a: a^t a = 1} (a^t (y_i - ȳ))² / (a^t C a)

Robustifying the Mahalanobis distance leads to

    u_i = sup_{a: a^t a = 1} |a^t y_i - M| / S

where M and S are now defined as follows:

    M = med_k (a^t y_k)
    S = med_l |a^t y_l - M|

It is remarked that MD_i² corresponds to u_i².
How is the maximum calculated? The outlyingness

    |a^t y_i - M| / S

as a function of a could possess several local maxima, making gradient search
methods unfeasible. Therefore the outlyingness is calculated on a grid of a finite
number of projection vectors. The grid should be fine enough to calculate
the maximum outlyingness with sufficient accuracy.
This robust measure of outlyingness was first developed by Stahel and Donoho.
More recent work on this subject has been reported by Maronna and Yohai. These
authors used the outlyingness to calculate a weighted mean and covariance
matrix. Outliers were given small weights, so that the Stahel-Donoho estimator of the
mean is robust against the presence of outliers. It is of course possible to use the
weighted mean and covariance matrix to calculate a weighted Mahalanobis distance.
This is not done in the projection method discussed here.
The robust outlyingness u_i was slightly adjusted for the following reason. The mad
of univariate standard normal data, which has a standard deviation of 1 by definition,
is 0.674 = 1/1.484. In order to assure that, in the limiting case of an infinitely large
multivariate normal data set, the outlyingness u_i² is equal to the squared
Mahalanobis distance, the mad in the denominator is multiplied by 1.484:

    u_i = sup_{a: a^t a = 1} |a^t y_i - M| / (1.484 · S)
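The consistency factor can be checked with a quick simulation (our illustration, not from the report): for a large standard normal sample, 1.484 times the mad should be close to the true standard deviation of 1.

```python
import random
from statistics import median

random.seed(0)
y = [random.gauss(0.0, 1.0) for _ in range(200_000)]

M = median(y)
mad = median(abs(v - M) for v in y)

# For N(0,1) data the mad tends to 0.674 = 1/1.484, so the
# corrected scale 1.484 * mad tends to sigma = 1.
corrected_scale = 1.484 * mad
```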
4.2 The projection algorithm

The purpose of the algorithm is, given a set of n multivariate data points
y_1, y_2, .., y_n, to calculate the outlyingness u_i for each point i. The algorithm can be
summarized as follows.

Step 0. In: data set

The algorithm is started with a set of continuous p-dimensional data y_1, y_2, .., y_n,
with y_i = (y_i1 .. y_ip)^t.
Step 1. Define a grid

There are (p choose q) subsets of q dimensions in the total set of p dimensions. The
“maximum search dimension” q is predefined. Projection vectors a in a certain
subset are parameterized by the angles θ_1, θ_2, .., θ_{q-1}:

    a = ( cos θ_1
          sin θ_1 cos θ_2
          sin θ_1 sin θ_2 cos θ_3
          ...
          sin θ_1 sin θ_2 ·· sin θ_{q-1} )^t
Robust multivariate outlier detection
17
A certain predefined step size step (in degrees) is used to define the grid.
The first angle θ_1 can take the values i·step_1, with step_1 the largest angle smaller
than or equal to step for which 180/step_1 is an integer, and with
i = 1, 2, .., 180/step_1.
The second angle can take the values j·step_2, with step_2 the largest angle smaller
than or equal to step_1/cos θ_1 for which 180/step_2 is an integer, and with
j = 1, 2, .., 180/step_2.
The r-th angle can take the values k·step_r, with step_r the largest angle smaller
than or equal to step_{r-1}/cos θ_{r-1} for which 180/step_r is an integer, and with
k = 1, 2, .., 180/step_r.
Such a grid is defined in each subset of q dimensions.
Step 2. Outlyingness for each grid point

For each grid point a, calculate the outlyingness of each data point y_i:
• Calculate the projections a^t y_i.
• Calculate the median M_a = med_k (a^t y_k).
• Calculate the mad L_a = med_l |a^t y_l - M_a|.
• Calculate the outlyingness u_i(a) = |a^t y_i - M_a| / (1.484 · L_a).

Step 3. Out: outlyingness

The outlyingness u_i is the maximum over the grid:

    u_i = sup_a u_i(a).
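Steps 0-3 can be sketched as follows for the special case q = 2, where each pair of dimensions is scanned with a single angle grid; this is our minimal reading of the algorithm, not the Excel/Visual Basic implementation documented elsewhere:

```python
import math
import random
from itertools import combinations
from statistics import median

def projection_outlyingness(points, step_deg=10):
    """Projection outlyingness for maximum search dimension q = 2.
    For every pair of dimensions, directions a = (cos t, sin t) are
    scanned on an angle grid; each point keeps the maximum of
    |a.y_i - M| / (1.484 * S) over all grid directions."""
    n, p = len(points), len(points[0])
    u = [0.0] * n
    for d1, d2 in combinations(range(p), 2):
        for k in range(int(180 / step_deg)):
            t = math.radians(k * step_deg)
            c, s = math.cos(t), math.sin(t)
            proj = [c * pt[d1] + s * pt[d2] for pt in points]
            M = median(proj)
            S = median(abs(v - M) for v in proj)
            if S == 0.0:
                continue  # degenerate direction (cf. the sparsity remark in 5.2.4)
            for i, v in enumerate(proj):
                u[i] = max(u[i], abs(v - M) / (1.484 * S))
    return u

# Usage: a tight 2-D standard normal cloud plus one far-away point.
random.seed(1)
cloud = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
u = projection_outlyingness(cloud + [(10.0, 10.0)])
```

Because med and mad are taken per direction, the far point at (10, 10) dominates the maximum in the directions near 45 degrees, while the cloud points keep modest outlyingnesses.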
4.3 Test results

A prototype/test program was implemented in an Excel/Visual Basic environment.
Documentation of the program is published elsewhere. We successively tested the
amount of swamped observations in data sets containing no outliers, the amount of
masked observations in data sets containing outliers, the dependence of the running
time on the parameters step and q, and the sensitivity of the number of detected
outliers to these parameters in some known data sets.
4.3.1 Swamping
A simulation study was performed in order to determine the average fraction of
swamped observations in normally distributed data sets. See section 3.3.2 for more
detailed remarks about the swamping effect and about generating the data sets. The
results of the simulations are shown in Table 4.1.
α     step   p=1     p=2     p=3     p=4     p=5
1%     10   0.010   0.011   0.016   0.018   0.023
5%     10   0.049   0.052   0.067   0.071   0.088
1%     30   0.010   0.010   0.012   0.011   0.012
5%     30   0.049   0.049   0.051   0.049   0.058
Table 4.1. The average fraction of swamped observations in the simulations on
several generated p-dimensional data sets of 100 points, with cutoff value χ_{p,1-α}
and step size step. The parameter q is equal to p.
         p=2, q=2             p=5, q=2             p=5, q=5
outl.   masked          outl.   masked         outl.   masked
        d=20                    d=30                   d=30
0.12     0.83           0.12     1.00          0.12     0.22
0.23     1.00           0.23     1.00          0.23     0.54
0.34     1.00           0.34     1.00          0.34     1.00
0.45     1.00           0.45     1.00          0.45     1.00
        d=40                    d=50                   d=50
0.12     0.00           0.12     0.00          0.12     0.00
0.23     0.00           0.23     0.67          0.23     0.00
0.34     0.62           0.34     1.00          0.34     0.65
0.45     1.00           0.45     1.00          0.45     1.00
        d=50                    d=80                   d=60
0.12     0.00           0.12     0.00          0.12     0.00
0.23     0.00           0.23     0.00          0.23     0.00
0.34     0.00           0.34     0.00          0.34     0.00
0.45     1.00           0.45     1.00          0.45     1.00
        d=90                    d=140                  d=120
0.12     0.00           0.12     0.00          0.12     0.00
0.23     0.00           0.23     0.00          0.23     0.00
0.34     0.00           0.34     0.00          0.34     0.00
0.45     0.00           0.45     0.00          0.45     0.00
Table 4.2. Average fraction of masked outliers of 2- and 5-dimensional generated
data sets (see also section 3.3.3). outl.: fraction of outliers; masked: fraction of
masked observations; d: distance of the outliers.
For low dimensions the average fraction of swamped observations tends to be almost
equal to α. The fraction increases, however, with increasing dimension. This is due to
the decreasing ratio n/p. It is remarkable that for step size 30 the fraction of
swamped observations seems to be much better than for step size 10. This is just a
coincidence. The fact that more observations are declared to be an outlier is
compensated by the fact that outlyingnesses are usually smaller if high step sizes are
used. In fact, the differences between step size 10 and 30 are so large for higher
dimensions that this is an indication that a step size of 30 could be too coarse to result
in reliable outlyingnesses.
4.3.2 Masking and swamping
The ability of the projection algorithm to detect outliers was tested by generating
data sets that contain good points as well as outliers. See section 3.3.3 for details on
how the data sets were generated.
Results are shown in Table 4.2. In all cases the ability to detect the outliers is
strongly dependent on the contamination of outliers. If there are many outliers, they
can only be detected if they lie very far away from the cloud of good points. This is
due to the fact that, although the med and the mad have a breakdown point of 50%, a
large concentrated fraction of outliers strongly shifts the med towards the cloud of
outliers and enlarges the mad.
In higher dimensions it is more difficult to detect the outliers, as in the Kosinski
method. The ability to detect the outliers also depends on the maximum search
dimension q. If q is taken equal to p, fewer outliers are masked.
4.3.3 Time dependence
The time dependence of the projection algorithm on the step size step and the
maximum search dimension q is shown in Table 4.3.
  n    p   q   step      t    |    n    p   q   step      t
 400   2   2    36     13.0  |   100   2   2     9      8.0
               18      21.0  |         3               19.3
                9      32.7  |         4               33.5
                4.5    56.8  |         5               50.1
 400   3   3    36     28.1  |         6               71.4
               18      68.6  |         7               98.9
                9     209.1  |         8              128.0
                4.5   719.3  |   100   5   1     9      5.9
  50   5   2     9     26.3  |             2           50.1
 100                   50.1  |             3          479.8
 200                  107.7  |             4         2489.1
 400                  202.9  |             5         4692.1
Table 4.3. Time t (in seconds) per run on p-dimensional data sets of n points using
maximum search dimension q and step size step (in degrees).
Asymptotically the time per run should be proportional to

    n ln n · (p choose q) · (180/step)^(q-1),

since for each of the (p choose q) subsets a grid is defined with a number of grid
points of the order of (180/step)^(q-1), and at each grid point the median of the
projected points has to be calculated (n ln n). The results in the table roughly confirm this theoretical
estimation. The most important conclusion from the table is that the time per run
strongly increases with the search dimension q. This makes the algorithm only
useful for relatively low dimensions.
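As a small illustration (ours) of this cost formula, one can tabulate the estimate and check how it scales with the step size:

```python
import math

def cost_estimate(n, p, q, step_deg):
    """Asymptotic work estimate for the projection algorithm:
    C(p,q) subsets of dimensions, of the order of (180/step)^(q-1)
    grid points per subset, and an n log n median cost per grid point."""
    return math.comb(p, q) * (180 / step_deg) ** (q - 1) * n * math.log(n)

# Halving the step multiplies the cost by 2^(q-1):
r_q2 = cost_estimate(400, 2, 2, 18) / cost_estimate(400, 2, 2, 36)
r_q5 = cost_estimate(100, 5, 5, 9) / cost_estimate(100, 5, 5, 18)
```

The factor 2^(q-1) per halving of the step, together with the binomial number of subsets, is what makes high search dimensions q so expensive in Table 4.3.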
4.3.4 Sensitivity to parameters
The projection method was tested with the twelve data sets that are fully described
in section 5, as was done with the Kosinski method (see section 3.3.6). The results
are shown in Table 4.4.
Let us first discuss the differences between α=5% and α=1%. In almost all cases the
number of outliers detected with α=5% is larger than with α=1%. This is
completely due to stronger swamping. It is remarked that there is no algorithmic
dependence on the cutoff value, as there is in the Kosinski method. In the projection
method a set of outlyingnesses is calculated, and only after the calculation is a certain
cutoff value used to discriminate between good and bad points. Hence a smaller
cutoff value leads to more outliers, but all points still have the same outlyingness. In
the Kosinski method the cutoff value is already used during the algorithm: the cutoff
is used to decide whether more points should be added to the good part. A
smaller cutoff leads not only to more outliers but also to a different set of
outlyingnesses, since the mean and the covariance matrix are calculated with a
different set of points. As a consequence, in cases where the Kosinski method possibly
shows a rather strong sensitivity to the cutoff value, this sensitivity is absent in the
projection method.
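The forward-search idea described here can be sketched as follows (a 2-D toy version under our own assumptions, not Kosinski's actual code): the good set grows unconditionally to half the data, after which the cutoff decides whether the nearest remaining point is still added.

```python
from statistics import mean

def maha_sq(pt, pts):
    """Squared Mahalanobis distance of pt to the mean and covariance
    of the 2-D point set pts."""
    n = len(pts)
    mx = mean(p[0] for p in pts)
    my = mean(p[1] for p in pts)
    sxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in pts) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
    det = sxx * syy - sxy * sxy
    dx, dy = pt[0] - mx, pt[1] - my
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def forward_search(points, start, cutoff_sq):
    """Grow the good set from `start`, one nearest point at a time,
    re-estimating mean/covariance at each step; once the good set
    covers half the data, stop as soon as the nearest remaining
    point exceeds the cutoff."""
    good = list(start)
    rest = [p for p in points if p not in good]
    while rest:
        rest.sort(key=lambda p: maha_sq(p, good))
        if len(good) >= len(points) // 2 and maha_sq(rest[0], good) > cutoff_sq:
            break
        good.append(rest.pop(0))
    return good

# Usage: a regular 7x7 grid of good points plus 3 distant outliers;
# 9.21 is the 99% point of the chi-square distribution with 2 dof.
grid = [(0.3 * i, 0.3 * j) for i in range(-3, 4) for j in range(-3, 4)]
outliers = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
start = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (0.3, 0.3)]
good = forward_search(grid + outliers, start, cutoff_sq=9.21)
```

The sketch also shows why the cutoff influences the outlyingnesses themselves: the final mean and covariance, and hence every distance, depend on where the search stops.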
Now let us discuss the dependence of the number of outliers on the maximum search
dimension q. In the Hertzsprung-Russel data set and in the HBK data set the number
of outliers found with q=1 is already as large as that found with higher values of q. In
the Brain mass data set and in the Milk data set, however, the number of outliers for
q=1 is much smaller than for large values of q. In those cases many outliers are
truly multivariate.
In the Hadi data set, the Factory data set and the Bush fire data set there is also a
rather large discrepancy between q=2 and q=3. It is remarked that the Hadi data set
was constructed so that all outliers are invisible when looking at two dimensions only
(see section 5.2.4). Also in the other two data sets it is clear that many outliers can
only be found by inspecting three or more dimensions at the same time.
If q is higher than three, only slightly more outliers are found than for q=3. The
differences can be explained by the fact that searching in higher dimensions with
the projection method leads to more outliers (see section 4.3.1).
Data set                 p    n    q   step   α=5%   α=1%
1. Kosinski              2  100    2    10     78     34
                                   2    20     77     34
                                   2    30     42     31
2. Brain mass            2   28    2     5      9      6
                                   2    10      9      4
                                   2    30      8      4
                                   1   n/a      3      1
3. Hertzsprung-Russel    2   47    2     1      7      6
                                   2    30      6      5
                                   2    90      6      5
                                   1   n/a      6      5
4. Hadi                  3   25    3     5     11      5
                                   3    10      8      0
                                   2    10      0      0
5. Stackloss             4   21    4     5     14      9
                                   4    10     10      9
                                   4    15      8      6
                                   4    20      9      7
                                   4    30      6      6
6. Salinity              4   28    4    10     12      8
                                   4    20      9      7
                                   3    30      6      4
7. HBK                   4   75    4    10     15     14
                                   4    20     14     14
                                   1   n/a     14     14
8. Factory               5   50    5    10     24     18
                                   5    20     14      9
                                   4    10     24     17
                                   3    10     22     14
                                   2    10      9      9
9. Bush fire             5   38    5    10     24     19
                                   5    20     19     17
                                   4    10     22     19
                                   3    10     21     17
                                   2    10     13     12
10. Wood gravity         6   20    5    20     14     14
                                   5    30     12     11
                                   3    10     15     14
11. Coleman              6   20    5    20     10      8
                                   5    30      4      4
12. Milk                 8   85    5    20     18     14
                                   5    30     15     13
                                   4    20     16     14
                                   4    30     15     13
                                   3    20     15     13
                                   3    30     15     12
                                   2    20     13     11
                                   2    30     12      7
                                   1   n/a      6      5
Table 4.4. Number of outliers detected by the projection algorithm with a cutoff of
χ_{p,1-α}, for α=5% and α=1%, with maximum search dimension q and angular
step size step (in degrees).
The sensitivity to the step size is not large in most cases. In cases like the Hadi data,
the Stackloss data, the Salinity data and the Coleman data, the sensitivity can be
explained by the sparsity of the data sets. A step size near 10-20 seems to work well
in most cases.
In conclusion, the number of outliers is not very sensitive to the parameters q and
step. However, the sensitivity is not completely negligible. In most practical cases
q=3 and step=10 work well enough.
5. Comparison of methods
In this section the projection method and the Kosinski method are compared with
each other as well as with other robust outlier detection methods. In section 5.1 we
shortly describe some other methods reported in the literature. The comparison is
made by applying the projection method and the Kosinski method to data sets that
have been analyzed by at least one of the other methods. Those data sets and the
results of the said methods are described in section 5.2. In section 5.3 the results are
discussed.
Unfortunately, most papers on outlier detection methods say very little about the
efficiency of the methods, i.e. how fast the algorithms are and how the speed depends
on the number of points and the dimension of the data set. Therefore we restrict the
discussion to the ability to detect outliers.
5.1 Other methods

It is important to note that two different types of outliers are distinguished in the
outlier literature. The first type, which is used in this report, is a point that lies far
away from the bulk of the data. The second type is a point that lies far away from the
regression plane formed by the bulk of the data. The two types will be denoted by
bulk outliers respectively regression outliers.
Of course, outliers are often so according to both points of view. That is why we
compare the results of the projection method and the Kosinski method, which are
both bulk outlier methods, also with regression outlier methods. An outlier that is
declared to be so by both kinds of methods is called a bad leverage point. In the case
that a point lies far away from the bulk of the points but close to the regression plane,
it is called a good leverage point.
Rousseeuw (1987, 1990) developed the minimum volume ellipsoid (MVE) estimator
in order to robustly detect bulk outliers. The principle is to search for the ellipsoid,
covering at least half the data points, for which the volume is minimal. The mean
and the covariance matrix of the points inside the ellipsoid are inserted in the
expression for the Mahalanobis distance. This method is costly due to the
complexity of the algorithm that searches for the minimum volume ellipsoid.
A related technique is based on the minimum covariance determinant (MCD)
estimator. This technique is employed by Rocke. The aim of this technique is to
search for the set of points, containing at least half the data, for which the
determinant of the covariance matrix is minimal. Again, the mean and the
covariance matrix determined by that set of points are inserted in the Mahalanobis
distance expression. This method is also rather complex, although it has been
substantially optimized by Rocke.
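A naive version of the MCD idea can be sketched as follows (our illustration; real implementations such as Rocke's are far more refined): sample random half-sets and keep the one with the smallest covariance determinant.

```python
import random
from statistics import mean

def cov_det_2d(pts):
    """Determinant of the sample covariance matrix of a 2-D point set."""
    n = len(pts)
    mx = mean(p[0] for p in pts)
    my = mean(p[1] for p in pts)
    sxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in pts) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
    return sxx * syy - sxy * sxy

def naive_mcd(points, trials=500, seed=3):
    """Crude MCD approximation: among random subsets containing just
    over half the data, keep the one with minimal covariance determinant."""
    rng = random.Random(seed)
    h = len(points) // 2 + 1
    best, best_det = None, float("inf")
    for _ in range(trials):
        sub = rng.sample(points, h)
        d = cov_det_2d(sub)
        if d < best_det:
            best, best_det = sub, d
    return best

# Usage: any half-set containing a far outlier has a hugely inflated
# determinant, so the winning half-set avoids the outliers.
rng = random.Random(0)
cloud = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
outliers = [(50.0, 50.0), (51.0, 50.0), (50.0, 51.0), (51.0, 51.0)]
best_half = naive_mcd(cloud + outliers)
```

The mean and covariance of the winning half-set would then be inserted in the Mahalanobis distance expression, as described above.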
Hadi (1992) developed a bulk outlier method that is very similar to the Kosinski
method. He also starts with a set of p+1 “good” points and increases the good set
point by point. The difference lies in the choice of the first p+1 points. Hadi
orders the n points using another robust measure of outlyingness. The question
arises why that other outlyingness would not itself be appropriate for outlier
detection. A reason could be that an arbitrary robust measure of outlyingness
deviates relatively strongly from the “real” Mahalanobis distance.
Atkinson combines the MVE method of Rousseeuw with the forward search
technique also employed by Kosinski. A few sets of p+1 randomly chosen points are
used for a forward search. The set that results in the ellipsoid with minimal volume
is used for the calculation of the Mahalanobis distances.
Maronna employed a projection-like method, but slightly more complicated. The
outlyingnesses are calculated as in the projection method. Then weights are
assigned to each point, with low weights for the outlying points, i.e. the influence of
outliers is restricted. The mean and the covariance matrix are calculated using these
weights. They form the Stahel-Donoho estimator for location and scatter. Finally,
Maronna inserts this mean and this covariance matrix in the expression for the
Mahalanobis distance.
Egan proposes resampling by the half-mean method (RHM) and the smallest half-
volume method (SHV). In the RHM method several randomly selected portions of
the data are generated. In each case the outlyingnesses are calculated. For each point
it is counted how many times it has a large outlyingness; a point is declared to be a
true outlier if this happens often. In the SHV method the distance between each pair
of points is calculated and put in a matrix. The column with the smallest sum of the
smallest n/2 distances is selected. The corresponding n/2 points form the smallest
half-volume. The mean and the covariance of those points are inserted in the
Mahalanobis distance expression.
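The SHV selection step can be sketched directly from this description (our illustration, with our own function name):

```python
import math

def smallest_half_volume(points):
    """For each point j, sum its n/2 smallest distances to the points
    of the data set; the point with the smallest sum, together with
    its n/2 nearest points, forms the smallest half-volume."""
    n = len(points)
    half = n // 2
    best_sum, best_members = float("inf"), None
    for pj in points:
        dists = sorted((math.dist(pj, pi), i) for i, pi in enumerate(points))
        total = sum(d for d, _ in dists[:half])
        if total < best_sum:
            best_sum, best_members = total, [i for _, i in dists[:half]]
    return best_members

# Usage: ten clustered points and two far outliers (indices 10 and 11);
# the selected half consists of cluster members only.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5),
       (2, 1), (1, 2), (2, 2), (0, 2), (2, 0),
       (100, 100), (101, 100)]
members = smallest_half_volume(pts)
```

Note that each column sum here includes the zero distance of a point to itself, which does not change which half-set wins.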
The detection of regression outliers is mainly done with the least median of squares
(LMS) method, developed by Rousseeuw (1984, 1987, 1990). Instead of minimizing
the sum of the squared residuals, as in the least squares method (which in this
context should rather be called the least sum of squares method), the median of the
squared residuals is minimized. Outliers are simply the points with large residuals as
calculated with the regression coefficients determined by the LMS method.
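A brute-force sketch of LMS for a single predictor (ours; Rousseeuw's algorithms are far more efficient): try the line through every pair of data points and keep the one minimizing the median squared residual.

```python
from itertools import combinations
from statistics import median

def lms_line(xs, ys):
    """Approximate least median of squares regression y = b0 + b1*x:
    among all lines through two data points, return the (b0, b1)
    minimizing the median of the squared residuals."""
    best, best_med = None, float("inf")
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # vertical line, no finite slope
        b1 = (ys[j] - ys[i]) / (xs[j] - xs[i])
        b0 = ys[i] - b1 * xs[i]
        med = median((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
        if med < best_med:
            best_med, best = med, (b0, b1)
    return best

# Usage: y = 1 + 2x with three gross outliers; ordinary least squares
# would be pulled away, LMS recovers the true line.
xs = list(range(10))
ys = [1 + 2 * x for x in xs]
ys[7] = ys[8] = ys[9] = 50
b0, b1 = lms_line(xs, ys)
```

Because the median ignores up to half of the residuals, the three corrupted points have no influence on the fitted line.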
Hadi (1993) uses a forward search to detect the regression outliers. The regression
coefficients of a small good set are determined. The set is increased by subsequently
adding the points with the smallest residuals and recalculating the regression
coefficients until a certain stop criterion is fulfilled. A small good set has to be found
beforehand.
Atkinson combines forward search and LMS. A few sets of p+1 randomly chosen
points are used in a forward search. The set that results in the smallest LMS is used
for the final determination of the regression residuals.
A completely different approach is the genetic algorithm for detection of regression
outliers by Walczak. We will not describe this approach here since it lies beyond the
scope of deterministic calculation of outlyingnesses.
Fung developed an adding-back algorithm for the confirmation of regression outliers.
Once points are declared to be outliers by any other robust method, the points are
added back to the data set in a stepwise way. The extent to which the estimates of the
regression coefficients are affected by adding back a point is used as a diagnostic
measure to decide whether that point is a real outlier. This method was developed
because robust outlier methods tend to declare too many points to be outliers.
5.2 Known data sets

In this section the projection method and the Kosinski method are compared by
running both algorithms on the twelve data sets given in Table 5.1. Most of these
data sets are well described in the robust outlier detection literature. Hence we are
able to compare the results of the two algorithms with known results.
The outlyingnesses as calculated by the projection method and the Kosinski method
are shown in Table 5.2, Table 5.4 and Table 5.5. In both methods the cutoff value
for α=1% is used. In the Kosinski method a proportional increment of 20% was
used. The outlyingnesses of the projection method were calculated with q=p (if p<6;
if p>5 then q=5) and the lowest step size that is shown in Table 4.4.
We will now discuss the data sets one by one.
Data set                 p    n   Source
1. Kosinski              2  100   Ref. [1]
2. Brain mass            2   28   Ref. [3]
3. Hertzsprung-Russel    2   47   Ref. [3]
4. Hadi                  3   25   Ref. [4]
5. Stackloss             4   21   Ref. [3]
6. Salinity              4   28   Ref. [3]
7. HBK                   4   75   Ref. [3]
8. Factory               5   50   This work
9. Bush fire             5   38   Ref. [5]
10. Wood gravity         6   20   Ref. [6]
11. Coleman              6   20   Ref. [3]
12. Milk                 8   85   Ref. [7]
Table 5.1. The name, the dimension p, the number of points n, and the source of the
tested data sets.
5.2.1 Kosinski data
The Kosinski data form a data set that is difficult to handle from the point of view of
robust outlier detection. The two-dimensional data set contains 100 points. Points
1-40 are generated from a bivariate normal distribution with µ1 = µ2 = -18,
σ1² = σ2² = 1, ρ = 0, and are considered to be outliers. Points 41-100 are good
points and are a sample from the bivariate normal distribution with µ1 = µ2 = 0,
σ1² = σ2² = 40, ρ = 0.7.
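The construction can be reproduced in a few lines (a sketch with our own seed; the exact sample of course differs from Kosinski's):

```python
import random

random.seed(42)

def bivariate_normal(mu1, mu2, var1, var2, rho):
    """One draw from a bivariate normal with the given means,
    variances and correlation."""
    z1 = random.gauss(0.0, 1.0)
    z2 = random.gauss(0.0, 1.0)
    s1, s2 = var1 ** 0.5, var2 ** 0.5
    return (mu1 + s1 * z1,
            mu2 + s2 * (rho * z1 + (1 - rho ** 2) ** 0.5 * z2))

# Points 1-40: tightly concentrated outliers around (-18, -18).
outliers = [bivariate_normal(-18, -18, 1, 1, 0.0) for _ in range(40)]
# Points 41-100: the good part, widely spread around the origin.
good = [bivariate_normal(0, 0, 40, 40, 0.7) for _ in range(60)]
data = outliers + good
```

The 40% contamination in a tight cluster is exactly what shifts the med and distorts the mad, as discussed below.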
The Kosinski method correctly identifies all outliers (see Table 5.2). The projection
method identifies none of the outliers and declares many good points to be outliers.
The reason for this failure is the large contamination and the small scatter of the
outliers. Since there are so many outliers they strongly shift the median towards the
outliers. Hence, the outliers are not detected. Furthermore, since they are narrowly
distributed, they almost completely determine the median of absolute deviations in
the projection direction perpendicular to the vector pointing from the center of the
good points to the center of the outliers. Hence, many points, lying at the end points
of the ellipsoid of good points, have a large outlyingness.
It is remarked that this data set is not an arbitrarily chosen data set. It was generated
by Kosinski in order to demonstrate the superiority of his own method over other
methods.
5.2.2 Brain mass data
The Brain mass data contain three outliers according to the Kosinski method: points
6, 16 and 25. Those points are also indicated to be outliers by Rousseeuw (1990) and
Hadi (1992). Those authors also declare point 14 to be an outlier, but with an
outlyingness slightly above the cutoff. The projection method declares points 6, 14,
16, 17, 20 and 25 to be outliers.
5.2.3 Hertzsprung-Russel data
The two methods produce almost the same outlyingnesses for all points. Both
declare points 11, 20, 30 and 34 to be large outliers, in agreement with results by
Rousseeuw (1987) and Hadi (1993). However, the projection method and the
Kosinski method also declare points 7 and 14 to be outliers, and point 9 is an outlier
according to the Kosinski method. The outlyingness of these three points is
relatively small. Visual inspection of the data (see page 28 in Rousseeuw (1987))
shows that these points are indeed moderately outlying.
5.2.4 Hadi data
The Hadi data set is an artificial one. It contains three variables x1, x2 and y.
The two predictors were originally created as uniform(0,15) and were then
transformed to have a correlation of 0.5. The target variable was then created by
y = x1 + x2 + ε with ε ~ N(0,1). Finally, cases 1-3 were perturbed to have
predictor values around (15,15) and to satisfy y = x1 + x2 + 4.
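A sketch of this construction (our approximation: we induce the 0.5 correlation with a simple linear mix of two uniforms, which may differ from Hadi's exact transformation):

```python
import random

random.seed(7)
n = 25

u1 = [random.uniform(0, 15) for _ in range(n)]
u2 = [random.uniform(0, 15) for _ in range(n)]

# Mix the two independent uniforms so that corr(x1, x2) = 0.5
# in expectation (equal variances, weights 0.5 and sqrt(3)/2).
x1 = u1
x2 = [0.5 * a + (3 ** 0.5 / 2) * b for a, b in zip(u1, u2)]

# Target variable with standard normal noise.
y = [a + b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Perturb cases 1-3: predictors near (15, 15) and y = x1 + x2 + 4,
# placed so that the outliers are hard to see in two dimensions only.
for i in range(3):
    x1[i] = 15 + random.uniform(-0.5, 0.5)
    x2[i] = 15 + random.uniform(-0.5, 0.5)
    y[i] = x1[i] + x2[i] + 4
```

This is why a maximum search dimension q of at least 3 is needed for this data set, as noted in section 4.3.4.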
The Kosinski method finds the outliers, with a relatively small outlyingness. The
projection method finds these outliers too, but also declares two good points to be
outliers.
A: Kosinski Brain mass Hertzsprung-Russel HadiB: 3,035 3,035 3,035 3,368C: Proj Kos Proj Kos Proj Kos Proj Kos Proj Kos
1 2,59 7,45 51 4,37 1,01 1 1,79 0,75 1 0,80 1,20 1 4,75 3,472 2,80 7,96 52 1,53 0,98 2 1,05 1,13 2 1,39 1,46 2 4,75 3,473 2,46 7,14 53 2,22 1,05 3 0,37 0,16 3 1,41 1,83 3 4,76 3,464 2,87 8,21 54 4,69 1,32 4 0,65 0,13 4 1,39 1,46 4 2,86 1,845 2,78 7,97 55 3,97 1,50 5 1,99 0,92 5 1,42 1,90 5 0,96 0,706 2,59 7,48 56 3,47 1,44 6 8,40 6,19 6 0,80 1,04 6 3,43 1,577 2,84 8,09 57 4,59 2,55 7 2,08 1,27 7 5,55 6,35 7 2,21 0,918 2,75 7,89 58 2,27 0,37 8 0,66 0,55 8 1,44 1,38 8 0,46 0,369 2,51 7,22 59 2,96 0,51 9 0,94 0,91 9 2,59 3,26 9 0,99 0,35
10 2,45 7,12 60 2,22 0,54 10 1,93 0,99 10 0,61 0,93 10 1,74 1,3411 2,69 7,71 61 4,94 1,83 11 1,23 0,51 11 11,01 12,67 11 2,50 1,6512 2,84 8,12 62 5,07 1,29 12 0,96 0,90 12 0,91 1,21 12 1,54 1,1313 2,77 7,95 63 4,66 1,13 13 0,64 0,60 13 0,79 0,88 13 2,81 1,2514 2,68 7,72 64 1,68 1,17 14 3,87 2,21 14 3,04 3,51 14 0,98 0,6815 2,37 6,95 65 3,32 1,03 15 2,22 1,44 15 1,55 1,22 15 2,65 1,3716 2,46 7,17 66 2,25 1,03 16 7,54 5,63 16 1,23 0,99 16 0,97 0,8417 2,64 7,59 67 2,59 1,13 17 3,18 1,83 17 2,17 1,80 17 3,31 1,6418 2,40 6,96 68 3,89 1,04 18 0,90 0,92 18 2,17 2,04 18 3,17 1,3919 2,46 7,11 69 1,82 0,88 19 3,00 1,43 19 1,77 1,54 19 2,78 1,4920 2,45 7,15 70 5,96 1,59 20 3,59 1,71 20 11,26 13,01 20 2,94 1,3721 2,70 7,71 71 2,29 0,70 21 1,54 0,66 21 1,35 1,07 21 0,90 0,6622 2,62 7,54 72 3,91 0,86 22 0,50 0,25 22 1,62 1,28 22 1,61 1,2723 2,82 8,11 73 2,15 1,30 23 0,66 0,74 23 1,60 1,41 23 3,89 1,3924 2,68 7,67 74 6,76 2,00 24 2,18 1,11 24 1,21 1,10 24 2,80 1,2225 2,37 6,88 75 6,20 2,01 25 8,97 6,75 25 0,34 0,58 25 2,04 1,1226 2,75 7,86 76 3,37 0,77 26 2,61 1,24 26 1,04 0,7827 2,67 7,70 77 2,67 0,49 27 2,59 1,41 27 0,88 1,0728 2,85 8,14 78 1,83 0,50 28 1,13 1,17 28 0,36 0,3329 2,78 7,98 79 4,19 2,45 29 1,43 1,6030 2,78 8,00 80 2,71 0,46 30 11,61 13,4831 2,45 7,14 81 4,49 1,12 31 1,36 1,0932 2,91 8,29 82 2,74 0,79 32 1,59 1,4833 2,51 7,27 83 1,62 0,31 33 0,49 0,5234 2,33 6,80 84 2,81 0,47 34 11,87 13,8835 2,68 7,72 85 5,94 1,57 35 1,50 1,5036 2,82 8,08 86 3,50 1,01 36 1,57 1,7037 2,52 7,31 87 1,38 1,93 37 1,27 1,1338 2,65 7,66 88 2,21 1,57 38 0,49 0,5239 2,49 7,18 89 5,47 1,73 39 1,14 1,0340 2,61 7,52 90 3,07 1,44 40 1,17 1,5241 1,89 0,50 91 2,94 1,54 41 0,88 0,6042 1,84 0,41 92 6,02 1,59 42 0,46 0,3043 7,94 2,03 93 3,65 0,80 43 0,81 0,7744 3,04 0,61 94 3,89 0,98 44 0,61 0,8045 2,35 0,67 95 6,68 1,64 45 1,17 1,1946 6,42 1,76 96 2,50 0,84 46 0,58 0,3747 5,36 1,68 97 4,59 1,32 47 1,41 1,2048 3,74 0,77 98 5,65 1,4649 3,92 0,92 99 2,12 1,6450 6,53 1,78 100 2,31 0,30
Table 5.2. The outlyingness of each point of the Kosinski, the Brain mass, the Hertzsprung-
Russel and the Hadi data. A: Name of data set. B: Cutoff value for α=1%; outlyingnesses
higher than the cutoff are shown in bold. C: Method (Proj: projection method; Kos: Kosinski
method). The Kosinski data occupy the first two column pairs (points 1-50 and 51-100).
The projection method finds consistently larger outlyingnesses than the Kosinski
method, roughly a factor of 2 for most points. This is related to the sparsity of the data
set. Consider for instance the extreme case of three points in two dimensions. Every
point will have an infinitely large outlyingness according to the projection method.
This can be understood by noting that the mad of the projected points is zero if the
projection vector is perpendicular to the line through two of the points; the
remaining point then has an infinite outlyingness. For data sets with more points the
situation is less extreme, but as long as there are relatively few points the projection
outlyingnesses will be relatively large. In such a case the cutoff values based on the
χ²-distribution are in fact too low, leading to the swamping effect.
5.2.5 Stackloss data
The Stackloss data outlyingnesses show large differences between the two methods.
One of the reasons is the sensitivity of the Kosinski results to the cutoff value in this
case, as is discussed in section 3. If a cutoff value χ_{4,0.95} = 3.080 is used instead
of χ_{4,0.99} = 3.644, the Kosinski method shows outlyingnesses as in Table 5.3.
point  outl.   point  outl.   point  outl.
  1    4.73      8    0.98     15    1.07
  2    3.30      9    0.76     16    0.87
  3    4.42     10    0.98     17    1.14
  4    4.19     11    0.83     18    0.71
  5    0.63     12    0.93     19    0.80
  6    0.76     13    1.24     20    1.04
  7    0.87     14    1.04     21    3.80
Table 5.3. The outlyingnesses of the Stackloss data, calculated with the Kosinski
method with cutoff value χ_{4,0.95} = 3.080. Outlyingnesses above this value are
shown in bold; outlyingnesses that are even higher than χ_{4,0.99} = 3.644 are shown
in bold italic.
Here 5 points have an outlyingness exceeding the cutoff value for • =5%, four of
them (points 1, 3, 4 and 21) even above the value for • =1%. Even in this case the
differences with the projection method are large. The projection outlyingnesses are
up to 5 times larger than the Kosinski ones.
For comparison, Walczak and Atkinson declared points 1, 3, 4 and 21 to be outliers;
Rocke also indicated point 2 as an outlier, while points 1, 2, 3 and 21 are outliers
according to Hadi (1992). These results are comparable with the results of the
Kosinski method with α = 5%. Hence, considering the results in Table 5.4, the
Kosinski method finds too few outliers and the projection method too many. In
both cases the origin lies in the low n/p ratio.
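For reference, the cutoff values used throughout this section are square roots of χ² quantiles (the outlyingnesses live on the Mahalanobis distance scale) and can be reproduced with, for instance, scipy:

```python
import numpy as np
from scipy.stats import chi2

# The cutoffs quoted in the text are square roots of chi-squared quantiles,
# with the degrees of freedom equal to the dimension p of the data set.
for p, alpha in [(4, 0.95), (4, 0.99), (5, 0.99)]:
    print(p, alpha, round(float(np.sqrt(chi2.ppf(alpha, df=p))), 3))
# -> 4 0.95 3.08
# -> 4 0.99 3.644
# -> 5 0.99 3.884
```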
A:   Stackloss         Salinity          HBK               HBK (cont.)       Factory
B:   3,644             3,644             3,644             3,644             3,884
C:   Proj   Kos        Proj   Kos       Proj   Kos        Proj   Kos       Proj   Kos
 1   8,42   1,62    1  2,67   1,29   1  30,38  32,34  51  1,99   1,64   1  5,23   2,12
 2   6,92   1,53    2  2,58   1,46   2  31,36  33,36  52  2,20   2,06   2  5,66   1,67
 3   8,14   1,45    3  4,65   1,84   3  32,81  34,90  53  3,18   2,80   3  5,55   1,91
 4   9,00   1,51    4  3,54   1,63   4  32,60  34,97  54  2,13   1,96   4  4,57   2,05
 5   1,74   0,41    5  6,06   4,06   5  32,71  34,92  55  1,57   1,22   5  3,28   2,34
 6   2,33   0,82    6  3,12   1,41   6  31,42  33,49  56  1,78   1,46   6  2,19   1,48
 7   3,45   1,31    7  2,62   1,25   7  32,34  34,33  57  1,81   1,61   7  2,27   1,49
 8   3,45   1,24    8  2,87   1,59   8  31,35  33,24  58  1,67   1,55   8  1,85   1,23
 9   2,15   1,11    9  3,31   1,90   9  32,13  34,35  59  0,89   1,13   9  2,15   1,17
10   4,26   1,16   10  2,08   0,91  10  31,84  33,86  60  2,08   2,05  10  3,56   1,70
11   3,01   1,11   11  2,76   1,24  11  28,95  32,68  61  1,78   1,99  11  3,64   1,87
12   3,30   1,34   12  0,77   0,43  12  29,42  33,82  62  2,29   2,00  12  3,67   1,99
13   3,25   1,01   13  2,36   1,28  13  29,42  33,82  63  1,70   1,70  13  2,24   1,43
14   3,75   1,15   14  2,52   1,24  14  33,97  36,63  64  1,62   1,75  14  2,13   1,79
15   3,90   1,20   15  3,71   2,16  15  1,99   1,89   65  1,90   1,85  15  1,84   1,29
16   2,88   0,85   16  14,83  8,08  16  2,33   2,03   66  1,78   1,87  16  3,52   2,34
17   7,09   1,78   17  3,68   1,60  17  1,65   1,74   67  1,34   1,20  17  2,42   1,79
18   3,56   0,98   18  1,84   0,82  18  0,86   0,70   68  2,93   2,20  18  5,55   2,49
19   3,07   1,04   19  2,93   1,79  19  1,54   1,18   69  1,97   1,56  19  5,65   1,76
20   2,48   0,61   20  2,00   1,22  20  1,67   1,95   70  1,59   1,93  20  5,91   2,83
21   8,85   2,11   21  2,50   0,95  21  1,57   1,76   71  0,75   1,01  21  4,35   1,90
                   22  3,34   1,23  22  1,90   1,70   72  1,00   0,83  22  2,20   1,63
                   23  5,20   2,07  23  1,72   1,72   73  1,70   1,53  23  2,77   1,62
                   24  4,62   1,90  24  1,70   1,56   74  1,77   1,80  24  2,14   0,90
                   25  0,77   0,42  25  2,06   1,83   75  2,44   1,98  25  3,11   2,13
                   26  1,80   0,87  26  1,73   1,80                   26  2,27   1,31
                   27  2,85   1,11  27  2,17   2,01                   27  4,88   2,02
                   28  3,72   1,48  28  1,41   1,13                   28  5,08   2,67
                                    29  1,33   1,13                   29  4,49   2,59
                                    30  2,04   1,86                   30  1,91   1,27
                                    31  1,61   1,53                   31  1,13   0,83
                                    32  1,78   1,70                   32  2,00   1,34
                                    33  1,55   1,45                   33  3,13   2,05
                                    34  2,10   2,07                   34  2,43   1,70
                                    35  1,41   1,80                   35  5,96   2,82
                                    36  1,63   1,61                   36  5,78   2,25
                                    37  1,75   1,87                   37  5,75   1,83
                                    38  2,01   1,86                   38  4,14   1,62
                                    39  2,16   1,93                   39  3,16   2,19
                                    40  1,25   1,17                   40  2,77   1,62
                                    41  1,65   1,81                   41  2,75   1,86
                                    42  1,91   1,72                   42  2,56   1,67
                                    43  2,50   2,17                   43  4,54   2,15
                                    44  2,04   1,91                   44  4,25   1,89
                                    45  2,07   1,86                   45  3,91   2,14
                                    46  2,04   1,91                   46  2,10   1,52
                                    47  2,92   2,56                   47  1,06   0,84
                                    48  1,40   1,70                   48  1,47   1,10
                                    49  1,73   2,01                   49  3,34   2,16
                                    50  1,05   1,36                   50  2,51   1,39
Table 5.4. The outlyingness of each point of the Stackloss, the Salinity, the HBK
and the Factory data. A, B, C: see Table 5.2.
A:   Bush fire         Wood gravity      Coleman           Milk              Milk (cont.)
B:   3,884             4,100             4,100             4,482             4,482
C:   Proj   Kos        Proj   Kos       Proj   Kos        Proj   Kos       Proj   Kos
 1   3,48   1,38    1  4,72   2,65   1  3,56   2,84   1   9,06   9,46  51  2,62   1,98
 2   3,27   1,04    2  2,71   1,20   2  4,92   6,37   2   10,57  10,81 52  3,64   2,98
 3   2,76   1,11    3  3,68   2,19   3  6,76   2,94   3   4,04   5,09  53  2,38   2,22
 4   2,84   1,02    4  14,45  33,75  4  2,99   1,53   4   3,86   2,83  54  1,22   1,16
 5   3,85   1,40    5  3,02   2,80   5  2,70   1,43   5   2,23   2,52  55  1,68   1,69
 6   4,92   1,90    6  16,19  38,83  6  5,74   10,43  6   2,97   2,84  56  1,10   1,01
 7   11,79  4,37    7  7,90   5,00   7  3,11   2,23   7   2,36   2,35  57  1,96   2,19
 8   17,96  11,87   8  15,85  37,88  8  1,48   1,83   8   2,32   2,08  58  2,05   1,95
 9   18,36  12,18   9  6,12   2,72   9  2,49   5,95   9   2,58   2,49  59  1,47   2,21
10   14,75  7,64   10  8,59   2,37  10  5,71   12,04  10  2,20   1,98  60  2,04   1,76
11   12,31  6,76   11  5,38   3,04  11  5,07   7,70   11  5,28   4,60  61  1,48   1,42
12   6,17   2,38   12  6,79   2,65  12  4,31   2,77   12  6,65   6,05  62  2,64   2,07
13   5,83   1,77   13  7,14   1,98  13  3,49   2,92   13  5,63   5,38  63  2,33   2,60
14   2,30   1,59   14  2,38   2,09  14  1,95   2,16   14  6,17   5,48  64  2,58   1,90
15   4,70   1,55   15  2,40   1,47  15  6,11   6,56   15  5,47   5,73  65  1,85   1,56
16   3,43   1,38   16  4,74   2,86  16  2,18   2,30   16  3,84   4,56  66  2,01   1,64
17   3,06   0,92   17  6,07   2,12  17  3,78   5,95   17  3,59   4,76  67  3,28   2,59
18   2,75   1,41   18  3,28   2,49  18  7,86   3,09   18  3,74   3,30  68  2,41   2,33
19   2,82   1,38   19  18,33  44,49 19  3,48   2,11   19  2,43   2,85  69  46,45  44,61
20   2,89   1,20   20  7,16   2,07  20  2,80   1,56   20  4,14   3,44  70  1,99   1,87
21   2,47   1,13                                      21  2,26   2,08  71  2,19   2,27
22   2,44   1,73                                      22  1,69   1,59  72  3,24   3,02
23   2,46   1,04                                      23  1,81   2,04  73  6,89   6,99
24   3,44   1,04                                      24  2,28   2,05  74  5,01   4,90
25   1,90   0,91                                      25  2,81   2,83  75  2,02   2,03
26   1,69   0,97                                      26  1,83   2,09  76  4,77   4,51
27   2,27   0,99                                      27  4,24   3,71  77  1,35   1,43
28   3,31   1,35                                      28  3,29   3,04  78  1,49   1,87
29   4,82   1,83                                      29  3,19   2,57  79  2,93   2,66
30   5,06   2,18                                      30  1,47   1,39  80  1,40   1,38
31   6,00   5,66                                      31  2,87   2,29  81  2,59   2,34
32   13,48  14,08                                     32  2,37   2,66  82  2,14   2,42
33   15,34  16,35                                     33  1,78   1,33  83  3,00   2,56
34   15,10  16,11                                     34  2,09   1,96  84  3,88   3,06
35   15,33  16,43                                     35  2,73   2,10  85  2,19   2,36
36   15,02  16,04                                     36  2,66   2,32
37   15,17  16,30                                     37  2,61   2,23
38   15,25  16,41                                     38  2,23   2,07
                                                      39  2,27   2,07
                                                      40  3,31   2,89
                                                      41  10,63  10,11
                                                      42  3,69   3,04
                                                      43  3,20   2,85
                                                      44  7,67   6,08
                                                      45  1,99   2,28
                                                      46  1,78   2,41
                                                      47  5,19   5,35
                                                      48  2,92   2,58
                                                      49  3,43   2,70
                                                      50  3,96   2,69
Table 5.5. The outlyingness of each point of the Bush fire, the Wood gravity, the
Coleman, and the Milk data. A, B, C: see Table 5.2.
5.2.6 Salinity data
The outlyingnesses of the Salinity data are roughly two times larger for the
projection method than for the Kosinski method. As a consequence, the latter
shows just 2 outliers (points 5 and 16), whereas the former shows 8. Rousseeuw
(1987) and Walczak agree that points 5, 16, 23 and 24 are outliers, with points 23
and 24 lying just above the cutoff. Fung finds the same points at first, but after
applying his adding-back algorithm he concludes that point 16 is the only outlier.
The projection method thus shows too many outliers, while the Kosinski method
misses points 23 and 24.
5.2.7 HBK data
In the case of the HBK data the projection method and the Kosinski method agree
completely. Both indicate points 1-14 to be outliers. This is also in agreement with
the results of the original Kosinski method and of Egan, Hadi (1992, 1993), Rocke,
Rousseeuw (1987, 1990), Fung and Walczak. It is remarked that some of these
authors only find points 1-10 as outliers, but they use the "regression" definition of
an outlier. The HBK data set is an artificial one, in which the good points lie along a
regression plane. Points 1-10 are bad leverage points, i.e. they lie far away from the
center of the good points and from the regression plane as well. Points 11-14 are
good leverage points, i.e. although they lie far away from the bulk of the data they
still lie close to the regression plane. If one considers the distance from the
regression plane, points 11-14 are not outliers.
5.2.8 Factory data
The Factory data set is a new one¹. It is given in Table 5.6.
The outlyingnesses show a big discrepancy between the two methods. The
projection outlyingnesses are much larger than the Kosinski ones, resulting in 18
versus 0 outliers. The outlyingnesses are so large due to the shape of the data: about
half the data set is quite narrowly concentrated around the center, while the other
half forms a rather thick tail. Hence, in many projection directions the mad is very
small, leading to large outlyingnesses for the points in the tail. It is remarked that
the projection outliers are well comparable to the Kosinski outliers found with a
cutoff for α = 5% (see also section 3.3.6).
¹ The Factory data is a generated data set, originally used in an exercise on regression
analysis in the CBS course "multivariate techniques with SPSS". It is interesting to note that
the regression coefficients change radically if the points that are indicated to be outliers by
the projection method and the Kosinski method with low cutoff are removed from the data
set. In other words, the regression coefficients are mainly determined by the "outlying"
points.
     x1    x2      x3   x4   x5           x1    x2      x3   x4   x5
 1   14.9   7.107  21  129  11.609   26   12.3  12.616  20  192  11.478
 2    8.4   6.373  22  141  10.704   27    4.1  14.019  20  177  14.261
 3   21.6   6.796  22  153  10.942   28    6.8  16.631  23  185  15.300
 4   25.2   9.208  20  166  11.332   29    6.2  14.521  19  216  10.181
 5   26.3  14.792  25  193  11.665   30   13.7  13.689  22  188  13.475
 6   27.2  14.564  23  189  14.754   31   18.0  14.525  21  192  14.155
 7   22.2  11.964  20  175  13.255   32   22.8  14.523  21  183  15.401
 8   17.7  13.526  23  186  11.582   33   26.5  18.473  22  205  14.891
 9   12.5  12.656  20  190  12.154   34   26.1  15.718  22  200  15.459
10    4.2  14.119  20  187  12.438   35   14.8   7.008  21  124  10.768
11    6.9  16.691  22  195  13.407   36   18.7   6.274  21  145  12.435
12    6.4  14.571  19  206  11.828   37   21.2   6.711  22  153   9.655
13   13.3  13.619  22  198  11.438   38   25.1   9.257  22  169  10.445
14   18.2  14.575  22  192  11.060   39   26.3  14.832  25  191  13.150
15   22.8  14.556  21  191  14.951   40   27.5  14.521  24  177  14.067
16   26.1  18.573  21  200  16.987   41   17.6  13.533  24  186  12.184
17   26.3  15.618  22  200  12.472   42   12.4  12.618  21  194  12.427
18   14.8   7.003  22  130   9.920   43    4.3  14.178  20  181  14.863
19   18.2   6.368  22  144  10.773   44    6.0  16.612  21  192  14.274
20   21.3   6.722  21  123  15.088   45    6.6  14.513  20  213  10.706
21   25.0   9.258  20  157  13.510   46   13.1  13.656  22  192  13.191
22   26.1  14.762  24  183  13.047   47   18.2  14.525  21  191  12.956
23   27.4  14.464  23  177  15.745   48   22.8  14.486  21  189  13.690
24   22.4  11.864  21  175  12.725   49   26.2  18.527  22  200  17.551
25   17.9  13.576  23  167  12.119   50   26.1  15.578  22  204  13.530
Table 5.6. The Factory data (n=50, p=5). The average temperature (x1, in degrees
Celsius), the production (x2, in 1000 pieces), the number of working days (x3), the
number of employees (x4) and the water consumption (x5, in 1000 liters) at a factory
in 50 successive months.
5.2.9 Bushfire data
The outliers found by the adjusted Kosinski method (points 7-11, 31-38) agree
perfectly with those found by the original algorithm of Kosinski and with the results
by Rocke and Maronna. The projection method shows as additional outliers points 6,
12, 13, 15, 29 and 30. Due to the large contamination the projected median is shifted
strongly, leading to relatively large outlyingnesses for the good points and,
consequently, many swamped points.
5.2.10 Wood gravity data
Rousseeuw (1984), Hadi (1993), Atkinson, Rocke and Egan declare points 4, 6, 8
and 19 to be outliers. The Kosinski method finds these outliers too, but additionally
flags point 7. The projection method shows strange results: fourteen points, i.e. 70%
of the data set, have an outlyingness above the cutoff. This is of course not realistic.
The reason is again the sparsity of the data set. Hence, it is rather surprising that the
Kosinski method and the methods of the other authors perform relatively well in
this case.
5.2.11 Coleman data
The Coleman data contain 8 outliers according to the projection method, 7 according
to the Kosinski method. However, they agree only upon 5 points (2, 6, 10, 11, 15).
The Kosinski method finds as additional outliers points 9 and 17, the projection
method points 3, 12 and 18. Only one author has searched for outliers in this data
set, to our knowledge. Rousseeuw (1987) declares points 3, 17 and 18 to be outliers.
A straightforward conclusion is difficult. None of the outliers is found by all three
methods. There is more agreement between the Kosinski method and the projection
method than between the Rousseeuw method and any of the other two.
However, it is possible that the original Kosinski method would give different
results, since the data set is very sparse. If the number of outliers is truly 7 or 8, the
contamination is also extremely large, since [½(n+p+1)]=13 should be the minimum
number of good points.
5.2.12 Milk data
The adjusted Kosinski method is in good agreement with the results of the original
Kosinski method and with the results of Rocke, which both give points 1-3, 12-17,
41, 44, 47, 69, 73 and 74 as outliers. The adjusted Kosinski method finds points 11
and 76 as additional outliers. Point 76 is also found by Atkinson, who misses point
69 and who finds point 27 as another additional outlier. The projection method
misses points 3, 16 and 17 compared to the Kosinski method. Hence, there is good
agreement between the several methods, while the disagreement concerns only the
points with an outlyingness just below or above the cutoff.
5.3 Discussion
In general, both the projection method and the Kosinski method show roughly the
same outliers as other methods. If there are any differences between methods, the
disagreement usually concerns points that have an outlyingness just below or just
above the arbitrarily chosen cutoff, i.e. points of which the true outlyingness is
disputable.
In the case of sparse data sets the projection method tends to give too many outliers
and the Kosinski method too few. The Kosinski method is more reliable in the case
of very many outliers. If the distribution deviates from the normal distribution,
especially when there are thick tails, the projection method and the Kosinski method
with low cutoff declare many points in the tails to be outliers.
In almost all cases both the projection method and the Kosinski method show larger
outlyingnesses for points declared to be outliers by other methods than for points
declared to be good by those other methods. This holds even if the projection
method and/or the Kosinski method declare too little outliers to be so or declare too
many good points to be an outlier. This means that the ordering of the Mahalanobis
distances is, roughly, similar across methods. Exceptions only occur in the case of
very sparse or very contaminated data sets.
6. A practical example
As an illustration we will show some results in a practical case, a file with VAT data
of the retail trade in 1996. This file contains 87376 companies with data on several
VAT entries. A complete search on statistical outliers would meet the problem that
many cells are filled with zeroes. In the VAT file there are many different VAT
entries. Many companies show zeroes for most entries. In many SBI classes this is
even true for one of the two most important variables, turnover of goods with the
high VAT rate and with the low VAT rate, i.e. in many classes almost all companies
have a zero on either turnover high or turnover low. Hence, the file is mainly filled
with zeroes, with non-zero values in only a few places.
If this fact is neglected, application of the projection method or the Kosinski method
will lead to strange results. The projection method will show a zero median of
absolute deviations in cases where more than half the records show a zero for a
particular variable. The Kosinski method will show non-invertible covariance
matrices in those cases. From the point of view of distributions, this is because data
with many zeroes and few non-zeroes deviate extremely strongly from the normal
distribution. From the point of view of the definition of an outlier, if more than half
the records show a zero for a particular variable, all records that do not show a zero
should be considered outliers.
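The first problem can be illustrated with a minimal numerical example (the variable and its values are hypothetical):

```python
import numpy as np

# If more than half the values of a variable are zero, the median is zero and
# so is the median of absolute deviations (mad): the projection method then
# divides by zero for any direction along this variable.
turnover = np.array([0, 0, 0, 0, 0, 0, 120, 45, 300, 8])  # hypothetical VAT entry
med = np.median(turnover)
mad = np.median(np.abs(turnover - med))
print(med, mad)  # -> 0.0 0.0
```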
The presence of a zero is, in fact, often (but not necessarily always) due to an
implicit categorical variable. If a particular variable is zero (non-zero) for a
particular company, one could say that a hypothetical variable indicates that this
company has (has no) contribution to this variable. As is previously mentioned, the
projection method and the Kosinski method can only be used for numerical
continuous variables, not for categorical variables. Hence, the algorithms should be
used with care. Searching for outliers is only useful if appropriate categories and
combinations of variables are selected.
If one would still like to apply an outlier detection method to a file like the VAT file
in one run (with the advantage of searching in a single pass), the following
adjustment to the Kosinski method is a possible solution: simply neglect the
zero-value cells in all summations in the expressions for the mean, the covariance
matrix and the Mahalanobis distance. The expressions then become:
\bar{y}_j = \frac{1}{n_j} \sum_{i:\, y_{ij} \neq 0} y_{ij}

C_{jk} = \frac{1}{n_{jk} - 1} \sum_{i:\, y_{ij} \neq 0,\, y_{ik} \neq 0} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k)

MD_i^2 = \sum_{\substack{j,k = 1 \\ y_{ij} \neq 0,\, y_{ik} \neq 0}}^{p} (y_{ij} - \bar{y}_j)\,(C^{-1})_{jk}\,(y_{ik} - \bar{y}_k)
with n_j denoting the number of points for which y_ij is non-zero and n_jk denoting the
number of points for which both y_ij and y_ik are non-zero. This possibility is promising if
one assumes that the presence of a zero is not strongly correlated to the magnitude of
other variables. The Kosinski method including the alternative expressions is worth
further examination. Unfortunately, such a simple adjustment to the projection
method is not possible since zeroes disappear in projection directions that are not
parallel to one of the axes.
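A sketch of these adjusted expressions (Python with numpy; the function name is ours, and no attempt is made to reproduce the rest of the Kosinski algorithm) could look as follows:

```python
import numpy as np

def zero_adjusted_md(Y):
    """Mean, covariance and Mahalanobis distance with zero-valued
    cells left out of all summations, as in the adjusted expressions."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    nz = Y != 0
    # mean over the non-zero cells of each variable (n_j terms)
    ybar = np.array([Y[nz[:, j], j].mean() for j in range(p)])
    # covariance over the cells where both variables are non-zero (n_jk terms)
    C = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            both = nz[:, j] & nz[:, k]
            njk = both.sum()
            C[j, k] = ((Y[both, j] - ybar[j]) * (Y[both, k] - ybar[k])).sum() / (njk - 1)
    Cinv = np.linalg.inv(C)
    # Mahalanobis distance per point, restricted to its non-zero cells
    md2 = np.empty(n)
    for i in range(n):
        d = np.where(nz[i], Y[i] - ybar, 0.0)  # zero cells drop out of the sum
        md2[i] = d @ Cinv @ d
    return np.sqrt(md2)

# With no zero cells the expressions reduce to the ordinary mean, covariance
# and Mahalanobis distance.
Y = np.array([[1., 2.], [2., 3.], [3., 5.], [4., 4.], [5., 7.]])
print(zero_adjusted_md(Y))
```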
So for the moment, for a successful application of the methods, variables have to be
selected carefully. Here we only search for outliers among the companies in SBI
5211, and we only take the variables "annual turnover high" and "annual turnover
low" into account. We choose SBI 5211 since almost all companies in this class
show substantial contributions for both variables, so that a two-dimensional search
for outliers makes sense.
Since some companies make a declaration once a month or once a quarter, the sum
of the declarations was calculated for each company. So a new file was created
containing 3755 companies and three variables: turnover high, turnover low, and
size class. Outliers were searched per size class. The number of companies in each
size class is shown in Table 6.1. Size classes 7, 8 and 9 contain few companies,
making an outlier search useless. These classes were combined with class 6, merely
as an example.
                      cutoff=50.0        cutoff=9.210
size class      n    #good   #outl.     #good   #outl.
     0         748    733      15        609     139
     1         952    942      10        835     117
     2         696    685      11        623      73
     3         417    412       5        357      60
     4         484    482       2        439      45
     5         420    418       2        382      38
     6          30     30       0         30       0
     7           5
     8           2
     9           1
    6-9         38     35       3         33       5
Table 6.1. Number of companies, number of good points and number of outliers for
cutoff values 50.0 and 9.210, for each size class and for the combined size classes
6-9 in SBI 5211, as found with the Kosinski method.
Outliers were searched with the Kosinski method as well as with the projection
method. The methods showed roughly the same results. Therefore only the results of
the Kosinski method are discussed here.
Outliers were searched with two different cutoff values. Results are shown in figures
6.1, 6.2 and 6.3. First a cutoff value χ²(2; 0.99) = 9.210 (corresponding to
√χ²(2; 0.99) = 3.035 on the distance scale) was used.
It appeared that many companies were indicated to be outliers, roughly 10-20% in
each size class. The reason for this phenomenon is the distribution of the data, which
deviates from the normal distribution rather strongly. The data show very thick tails
relative to the variance of the bulk of the data, which is small due to the large
amount of data in the neighborhood of the origin. For this reason a second search
with a much larger cutoff value (50.0) was performed. This led to more realistic
numbers of outliers.
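The swamping effect of a χ²-based cutoff on thick-tailed data can be illustrated with simulated data; the coordinatewise median/mad standardization below is only a crude stand-in for the robust fit of either method, and the t-distributed sample is purely illustrative:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
# Heavy-tailed two-dimensional data: t-distribution with 2 degrees of freedom.
X = rng.standard_t(df=2, size=(5000, 2))

# Robust coordinatewise standardization (median / mad, with the mad rescaled
# to be consistent with the normal distribution).
med = np.median(X, axis=0)
mad = np.median(np.abs(X - med), axis=0) / 0.6745
Z = (X - med) / mad
d2 = (Z ** 2).sum(axis=1)

# Under normality about 1% of the points would exceed this cutoff; with thick
# tails the flagged fraction is roughly an order of magnitude larger.
frac = (d2 > chi2.ppf(0.99, df=2)).mean()
print(frac)
```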
7. Conclusions
Both the projection method and the Kosinski method are well able to detect
multivariate outliers. In the case of strong contamination it is slightly more difficult
to find the outliers with the projection method than with the Kosinski method. The
Kosinski method tends to overestimate the number of outliers rather strongly when
a low cutoff value is used. At a given cutoff value, the number of outliers is slightly
more sensitive to the tunable parameters for the projection method than for the
Kosinski method.
The time dependence of the two algorithms on the number of points in a data set is
roughly the same, i.e. they are both roughly proportional to n ln n. It is remarked that
the absolute times per run, as shown in this report, cannot be compared due to the
different implementations.
The time dependence on the dimension of the data set is much worse for the
projection method than for the Kosinski method. In the case of the Kosinski method,
the time per run is between linear and quadratic in the dimension. In the case of the
projection method, it is either exponential if the maximum search dimension is taken
equal to the dimension itself or cubic if a moderate maximum search dimension of
three is chosen.
As far as the ability to detect outliers and the expected time performance are
concerned, there is no strong preference for either of the two methods if outlier
detection is restricted to 2 or at most 3 dimensions. Since it is expected that for large
data sets in higher dimensions the projection method could lead to undesirably large
run times, the Kosinski method is the recommended method if multivariate outlier
detection in high dimensions is required.
References
[1] A.S. Kosinski, Computational Statistics & Data Analysis 29, 145 (1999).
[2] D.M. Rocke and D.L. Woodruff, Journal of the American Statistical Association
91, 1047 (1996).
[3] P.J. Rousseeuw and A.M. Leroy, Robust regression & outlier detection (Wiley,
NY, 1987).
[4] A.S. Hadi and J.S. Simonoff, Journal of the American Statistical Association 88,
1264 (1993).
[5] R.A. Maronna and V.J. Yohai, Journal of the American Statistical Association
90, 330 (1995).
[6] P.J. Rousseeuw, Journal of the American Statistical Association 79, 871 (1984).
[7] J.J. Daudin, C. Duby, and P. Trecourt, Statistics 19, 241 (1988).
[8] P.J. Rousseeuw and B.C. van Zomeren, Journal of the American Statistical
Association 85, 633 (1990).
[9] A.S. Hadi, J.R. Statist. Soc. B 54, 761 (1992).
[10] A.C. Atkinson, Journal of the American Statistical Association 89, 1329 (1994).
[11] B. Walczak, Chemometrics and Intelligent Laboratory Systems 28, 259 (1995).
[12] W.-K. Fung, Journal of the American Statistical Association 88, 515 (1993).
[13] W.J. Egan and S.L. Morgan, Anal. Chem. 70, 2372 (1998).
[14] W.A. Stahel, Research Report 31, Fachgruppe für Statistik, E.T.H. Zürich
(1981).
[15] D.L. Donoho, Ph.D. qualifying paper, Harvard University (1982).