Copyright 2010 IEEE. Published in the IEEE 2010 International Geoscience & Remote Sensing Symposium (IGARSS 2010), scheduled for July 25-30, 2010 in Honolulu, Hawaii, U.S.A. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.



STATISTICS FOR CHARACTERIZING DATA ON THE PERIPHERY

James Theiler and Don Hush

Space and Remote Sensing Sciences
Los Alamos National Laboratory

Los Alamos, NM 87545

ABSTRACT

We introduce a class of statistics for characterizing the periphery of a distribution, and show that these statistics are particularly valuable for problems in target detection. Because so many detection algorithms are rooted in Gaussian statistics, we concentrate on ellipsoidal models of high-dimensional data distributions (that is to say: covariance matrices), but we recommend several alternatives to the sample covariance matrix that more efficiently model the periphery of a distribution, and can more effectively detect anomalous data samples.

Index Terms— anomaly detection, outlier, target detection, probability distribution, robust statistics, Gaussian mixture models, expectation-maximization, leptokurtosis

1. INTRODUCTION

What makes target detection difficult is that the target must be distinguished from the background clutter, and this requires that the background be well characterized. More particularly, when that characterization is a probability distribution, it is the periphery of the background distribution that must be most carefully characterized. Targets in the core of the distribution are impossible to detect; targets far out on the tail of the distribution are easy to detect. It is the targets on the periphery, the targets that are difficult but detectable, that are of most interest to the algorithm developer who wants improved ROC curves.

The detection of anomalies (and of anomalous changes) requires that the samples that are anomalous be distinguished from the samples that are normal [1]. One way this can be achieved is by identifying two probability distributions: one for normal data and one for anomalies. The normal data distribution is generally fit to the data, while the anomalies are (often implicitly) defined with a distribution that is much broader and flatter than the normal data distribution. If both distributions were precisely known, then their ratio would provide the Bayes optimal detector of those anomalies.

While the choice of distribution for modeling the anomalies does require some care, the main technical challenge in anomaly detection is the characterization of the normal data distribution. The more tightly the distribution is fit to the normal data, the more accurately one can detect those data that do not fit the normal model.

(This work was supported by the Laboratory Directed Research and Development (LDRD) program at Los Alamos National Laboratory.)

For anomaly detection problems, very low false alarm rates are desired. Thus the challenge is even greater, because we need to characterize the density in regions where the data are sparse; that is, on the periphery (or the “tail”) of the distribution. Yet traditional density estimation methods for anomaly detection (e.g., the simplest and most common approach is to fit a single Gaussian to the data) are dominated by the high-density core.

In the examples here, our model for characterizing the periphery of a multivariate distribution will be an ellipsoid; our aim, then, is to estimate a covariance matrix that characterizes that ellipsoid. We remark that the overall scale of the covariance is not of particular concern to us; for that single scalar measure of overall size, we can adjust the parameter to achieve the desired false alarm rate α. What is of more concern is the O(p²) parameters, where p is the number of spectral channels, that characterize the shape of the ellipsoid.
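For concreteness, here is a minimal NumPy sketch of that scale adjustment (our own illustration, not code from the paper; the function name is ours): given a covariance shape R, the threshold that yields false alarm rate α is the empirical (1 − α) quantile of the squared Mahalanobis distance.

```python
import numpy as np

def threshold_for_alarm_rate(X, mu, R, alpha):
    """Scale the ellipsoid whose shape is given by R so that a
    fraction alpha of the (normal) data falls outside it: take the
    (1 - alpha) quantile of the squared Mahalanobis distance."""
    Xc = X - mu
    r2 = np.einsum('ij,ij->i', Xc @ np.linalg.inv(R), Xc)
    return np.quantile(r2, 1.0 - alpha)
```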

In this work, we will investigate a variety of approaches for characterizing the periphery of a data distribution: these include anti-robust statistics (Section 2), anti-shrinkage (Section 3), eigenvalue adjustment (Section 4), Gaussian mixture modeling (Section 5), and support vector machines (Section 6). We will introduce a volume-versus-coverage plot to evaluate their performance in Section 7, and will conclude in Section 8.

2. IN DEFIANCE OF ROBUST STATISTICS

The goal of robust statistics is to produce characterizations of data that are insensitive to a few bad data samples. This is typically achieved by discounting (or de-weighting) those samples that, because of their long “lever arm,” have undue influence on the estimation. While this can produce better estimates for some kinds of target detection [2], we will consider a contrary approach that puts extra weight on points that are far from the centroid.

To estimate the mean µ and covariance matrix R from a set of m samples x ∈ R^p, Campbell [3] suggests

µ = ∑_{i=1}^{m} w_i x_i / ∑_{i=1}^{m} w_i ,
R = ∑_{i=1}^{m} w_i² (x_i − µ)(x_i − µ)ᵀ / ∑_{i=1}^{m} w_i² .  (1)

When the weights are all equal (e.g., w_i = 1 for all i), the standard sample estimators for the mean and covariance are obtained. But one can alter these weights depending on how far the samples are from the mean. The Mahalanobis distance is given by

r_i = [ (x_i − µ)ᵀ R⁻¹ (x_i − µ) ]^{1/2} .  (2)

To make the robust estimator less sensitive to outliers, one discounts the large-r samples; for instance [3]:

Robust:  w(r) = 1 if r ≤ r_o ;  w(r) = r_o / r if r > r_o .  (3)

In practice this requires an iterative approach, since the weights depend on the Mahalanobis distance, the Mahalanobis distance depends on µ and R, and µ and R depend on the weights.
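The loop below is a minimal NumPy sketch of this iteration (ours, not the authors' code; the fixed iteration count is an assumption, and a practical version would test for convergence). It implements Eqs. (1)-(3) with a pluggable weight function.

```python
import numpy as np

def weighted_mean_cov(X, w):
    """Eq. (1): weights w for the mean, w**2 for the covariance."""
    mu = (w[:, None] * X).sum(axis=0) / w.sum()
    Xc = X - mu
    R = (w[:, None] ** 2 * Xc).T @ Xc / (w ** 2).sum()
    return mu, R

def mahalanobis(X, mu, R):
    """Eq. (2): r_i = [(x_i - mu)^T R^{-1} (x_i - mu)]^{1/2}."""
    Xc = X - mu
    return np.sqrt(np.einsum('ij,ij->i', Xc @ np.linalg.inv(R), Xc))

def iterate_weighted_estimator(X, weight_fn, n_iter=20):
    """Alternate between re-estimating (mu, R) from the weights and
    recomputing the weights from the Mahalanobis distances."""
    w = np.ones(len(X))
    for _ in range(n_iter):
        mu, R = weighted_mean_cov(X, w)
        w = weight_fn(mahalanobis(X, mu, R))
    return mu, R

def robust_weight(r, ro):
    """Eq. (3): discount samples beyond the cutoff radius ro."""
    return np.where(r <= ro, 1.0, ro / r)
```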

But for problems which depend primarily on the periphery of the distribution, this scheme seems to be getting it exactly backwards: it discounts just the data that we most need to pay attention to. Therefore, we considered a weighting scheme that discounts the points with small Mahalanobis distance:

Anti-robust:  w(r) = (r/r_o)^µ if r ≤ r_o ;  w(r) = (r/r_o)^ν if r > r_o .  (4)

Here the exponents µ and ν (the exponent µ is distinct from the mean µ) control the weighting: µ = ν = 0 corresponds to the standard sample covariance, while µ = 0, ν = −1 corresponds to the robust estimator suggested by Campbell [3]. An anti-robust estimator takes µ > 0. Note that the choice of a large r_o and a negative ν imbues the estimator with some robustness to extreme values of r, even as it emphasizes data on the periphery.

One must also choose a value for the cutoff radius r_o. For a p-dimensional Gaussian, the squared Mahalanobis distance r² is chi-squared distributed with p degrees of freedom; this is approximately Gaussian with mean p and variance 2p. For our experiments, we take r_o = √p + b/√2 with b = 2.

In the adaptive version of this scheme, we choose a fraction α ≪ 1 of the points to emphasize, and then (at each iteration) choose r_o so that a fraction α of the data points have Mahalanobis distance larger than r_o.
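A sketch of the anti-robust variant, reusing iterate_weighted_estimator from the sketch above (the exponent values below are illustrative choices, not the paper's; the paper requires only µ > 0):

```python
def anti_robust_weight(r, ro, mu_exp=0.5, nu_exp=-1.0):
    """Eq. (4): (r/ro)**mu_exp below the cutoff, (r/ro)**nu_exp above;
    mu_exp > 0 emphasizes the periphery, and nu_exp < 0 keeps some
    robustness to extreme outliers."""
    return np.where(r <= ro, (r / ro) ** mu_exp, (r / ro) ** nu_exp)

# Fixed cutoff: r_o = sqrt(p) + b/sqrt(2) with b = 2.
# p = X.shape[1]; ro = np.sqrt(p) + 2.0 / np.sqrt(2.0)
# mu_hat, R_hat = iterate_weighted_estimator(
#     X, lambda r: anti_robust_weight(r, ro))

# Adaptive cutoff: at each iteration, a fraction alpha of the points
# falls beyond r_o.
# alpha = 0.01
# mu_hat, R_hat = iterate_weighted_estimator(
#     X, lambda r: anti_robust_weight(r, np.quantile(r, 1 - alpha)))
```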

3. ANTI-SHRINKAGE ESTIMATOR

One difficulty with the anti-robust estimators is that the iterations can be unstable. An alternative is to estimate a robust covariance matrix and to recognize that the sample covariance is a positive linear combination of the robust and anti-robust estimators. In general, “shrinkage” refers to the statistical approach of modifying an estimator by taking a positive linear combination with a simpler estimator. Since what we want is the anti-robust estimator, we will take a non-positive linear combination of the sample covariance and the robust estimator:

R̂ = α R_robust + (1 − α) R_sample  (5)

where α < 0 is chosen to optimize an in-sample measure of coverage versus volume, as described in Section 7.
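A sketch of this selection (our illustration; the grid of candidate α values, the positive-definiteness check, and the default alarm rate are assumptions). It reuses the mahalanobis helper from the Section 2 sketch and scores each candidate with the volume-versus-coverage criterion of Section 7: the log-volume of the ellipsoid scaled to contain a fraction 1 − α of the data.

```python
import numpy as np
from scipy.special import gammaln

def log_volume_at_coverage(X, mu, R, alarm_rate=1e-3):
    """Log-volume of the ellipsoid with shape R, scaled so that it
    contains a fraction 1 - alarm_rate of the data."""
    p = X.shape[1]
    r_q = np.quantile(mahalanobis(X, mu, R), 1.0 - alarm_rate)
    log_unit_ball = (p / 2) * np.log(np.pi) - gammaln(p / 2 + 1)
    return log_unit_ball + p * np.log(r_q) + 0.5 * np.linalg.slogdet(R)[1]

def anti_shrinkage(X, mu, R_robust, R_sample, alarm_rate=1e-3):
    """Eq. (5): scan negative alphas; keep the combination whose
    1 - alarm_rate ellipsoid has the smallest in-sample volume."""
    best = None
    for alpha in np.linspace(-2.0, 0.0, 41):
        R_hat = alpha * R_robust + (1 - alpha) * R_sample
        if np.linalg.eigvalsh(R_hat).min() <= 0:
            continue  # skip combinations that lose positive definiteness
        v = log_volume_at_coverage(X, mu, R_hat, alarm_rate)
        if best is None or v < best[0]:
            best = (v, alpha, R_hat)
    return best
```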

4. EIGENVALUE ADJUSTMENT APPROACH

In the spirit of the anomaly detector suggested by Adler-Golden [4], we use the sample covariance R to set the alignment of the covariance matrix, but adjust the magnitudes within that alignment. Specifically, we write R = EΛEᵀ, where E is the matrix of eigenvectors and Λ is a diagonal matrix of eigenvalues; we then adjust the values of Λ. (A similar adjustment was also suggested for estimating local covariances [5].)

Initially, the kth element Λ_kk is the variance in the e_k direction, where e_k is the kth column vector of the matrix E; i.e., Λ_kk = (1/n) ∑_i (e_kᵀ x_i)². In place of the variance we will use an inter-percentile difference: let Λ̆_kk be the squared distance between the tth lowest value of e_kᵀ x_i and the tth highest value, thus enclosing a fraction (n − 2t)/n of the samples. In our experiments, we took this fraction to be 0.999. Using these new values Λ̆_kk, we estimate the covariance matrix with EΛ̆Eᵀ. Here Λ̆_kk > Λ_kk, simply because the inter-percentile distance is larger than the standard deviation; but the overall magnitude of R does not matter. We find that the ratio Λ̆_kk/Λ_kk tends to be larger for small values of k, consistent with observations made elsewhere that the tails are fatter in the high-variance directions [4, 6].
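A sketch of the adjustment (our illustration; the trimming-index arithmetic below is one way to realize the "tth lowest / tth highest" rule):

```python
import numpy as np

def eigenvalue_adjusted_cov(X, R, coverage=0.999):
    """Keep the eigenvectors of the sample covariance R; replace each
    eigenvalue with the squared inter-percentile width of the data
    projected onto that eigenvector, enclosing `coverage` of the
    samples."""
    n = len(X)
    t = int(np.floor((1.0 - coverage) * n / 2))  # trim t from each end
    _, E = np.linalg.eigh(R)
    lam = np.empty(E.shape[1])
    for k in range(E.shape[1]):
        proj = np.sort(X @ E[:, k])          # data along eigendirection k
        lam[k] = (proj[n - 1 - t] - proj[t]) ** 2
    return E @ np.diag(lam) @ E.T
```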

We remark that, in addition to the original sample matrix decomposition, one can also apply this correction to the decomposition of other matrices, such as the anti-robust covariances of the previous section.

5. GAUSSIAN MIXTURE MODEL APPROACH

Weighting pixels by Mahalanobis distance makes intuitive sense, but a more formal approach explicitly models the data with a Gaussian mixture model. Write

N(x; µ, R) = (2π)^{−p/2} |R|^{−1/2} exp( −(1/2)(x − µ)ᵀ R⁻¹ (x − µ) )  (6)

as the normal distribution with mean µ and covariance R. We will consider a two-component mixture model

P(x) = (1 − α) N(x; µ, R_lo) + α N(x; µ, R_hi)  (7)

where the first component describes the core and the second the periphery, and in which we impose a number of constraints. One, we will take the same µ for both components; that is, they will be concentric. In fact, for simplicity, we will use the sample mean for µ. Two, we take α ≪ 1 to be fixed at a user-specified value. We want R_lo ≪ R_hi, but we will not require that the shapes of these covariances be the same. Subject to these constraints, we use the usual expectation-maximization algorithm [7] to estimate R_lo and R_hi. One minor modification was to use a trimmed estimator that, at each iteration, sets the weights to zero for a tiny fraction ε of the points with the largest Mahalanobis distance with respect to R_hi.
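A sketch of this constrained EM (ours, not the authors' code; the initialization of R_hi, the iteration count, and the stabilized E-step are our own choices):

```python
import numpy as np

def gauss_logpdf(Xc, R):
    """log N(x; 0, R) for rows of the already-centered matrix Xc."""
    p = Xc.shape[1]
    _, logdet = np.linalg.slogdet(R)
    q = np.einsum('ij,ij->i', Xc @ np.linalg.inv(R), Xc)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + q)

def concentric_gmm(X, alpha=0.05, eps=1e-3, n_iter=50):
    """EM for Eq. (7): shared sample mean, fixed mixing weight alpha,
    and a trimmed update for the periphery covariance R_hi."""
    mu = X.mean(axis=0)
    Xc = X - mu
    R_lo = np.cov(Xc, rowvar=False)
    R_hi = 4.0 * R_lo  # broader initial guess for the periphery
    for _ in range(n_iter):
        # E-step: posterior probability that each point is peripheral,
        # computed from the log-odds for numerical stability.
        log_odds = (np.log(alpha) + gauss_logpdf(Xc, R_hi)
                    - np.log(1 - alpha) - gauss_logpdf(Xc, R_lo))
        w_hi = 1.0 / (1.0 + np.exp(-log_odds))
        w_lo = 1.0 - w_hi
        # Trimming: zero the periphery weights of the eps fraction of
        # points with largest Mahalanobis distance w.r.t. R_hi.
        r2 = np.einsum('ij,ij->i', Xc @ np.linalg.inv(R_hi), Xc)
        w_hi[r2 >= np.quantile(r2, 1.0 - eps)] = 0.0
        # M-step: weighted covariances about the common, fixed mean.
        R_lo = (w_lo[:, None] * Xc).T @ Xc / w_lo.sum()
        R_hi = (w_hi[:, None] * Xc).T @ Xc / w_hi.sum()
    return mu, R_lo, R_hi
```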

6. SUPPORT VECTOR MACHINE APPROACH

As noted in the Introduction, if both the normal and the anomaly distributions were known, then their ratio would provide the Bayes optimal anomaly detector. It follows that if we have samples from both distributions, then we can design a support vector machine (SVM) to approximate the Bayes optimal detector [8]. In this paper we use a training set that contains both normal samples and synthetically generated anomalies to design a quadratic SVM that (approximately) optimizes a weighted linear combination of the false alarm and missed detection rates. The SVM discriminant function takes the form¹

f(x) = xᵀ Q x + qᵀ x + q₀  (8)

and can be converted to a Mahalanobis distance classifier using

R = Q⁻¹ ,  µ = −(1/2) Q⁻¹ q .  (9)

Instead of computing moments (or Mahalanobis-distance-weighted moments), the support vector machine more directly estimates the decision boundary between the two distributions. Increasing the weight on false alarms moves the decision boundary toward the periphery of the data, so that the solution has fewer false alarms, though at the expense of more missed detections. Furthermore, the SVM solution for Q takes the form

Q = ∑_{x_i ∈ data} a_i x_i x_iᵀ − ∑_{x_i ∈ anomalies} a_i x_i x_iᵀ  (10)

where all a_i ≥ 0. The support vector property of SVM solutions implies that the nonzero coefficients in the first sum correspond to normal samples that lie near or beyond the decision boundary. Thus the solution is defined explicitly in terms of the peripheral normal samples.

The SVM approach requires us to generate samples from the anomaly distribution. The results in this paper were obtained using random samples from a uniform distribution over a hyper-rectangle that encompasses the normal data. Although increasing the number of samples promises more accurate solutions, it also increases the computational demand, and so the number of samples must be chosen to balance these two concerns. The results in this paper were obtained using approximately five times as many anomalous samples as normal samples.

¹ This form can be realized by using a quadratic kernel, or by quadratically extending the original training vectors and using a linear kernel.
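The sketch below shows one way to realize this scheme (ours, using scikit-learn; the class weights, regularization constant, random seed, and explicit quadratic feature map are assumptions, and this is not the authors' exact training procedure). Labeling anomalies +1 makes f increase with anomalousness, so Q in Eq. (8) is typically positive definite and Eq. (9) applies directly.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

def quadratic_svm_covariance(X_normal, anom_factor=5, fa_weight=10.0, C=1.0):
    """Fit f(x) = x^T Q x + q^T x + q0 (Eq. 8) by quadratically
    extending the inputs and training a linear SVM against uniform
    synthetic anomalies; convert to (R, mu) via Eq. (9)."""
    n, p = X_normal.shape
    rng = np.random.default_rng(0)
    lo, hi = X_normal.min(axis=0), X_normal.max(axis=0)
    X_anom = rng.uniform(lo, hi, size=(anom_factor * n, p))
    X = np.vstack([X_normal, X_anom])
    y = np.r_[-np.ones(n), np.ones(len(X_anom))]  # normal=-1, anomaly=+1
    # Quadratic feature map: [x_1..x_p, x_1^2, x_1 x_2, ..., x_p^2].
    phi = PolynomialFeatures(degree=2, include_bias=False)
    # Up-weighting the normal class penalizes false alarms, pushing
    # the decision boundary toward the periphery.
    clf = LinearSVC(C=C, class_weight={-1: fa_weight, 1: 1.0})
    clf.fit(phi.fit_transform(X), y)
    w, q0 = clf.coef_.ravel(), clf.intercept_[0]
    q, Q = w[:p], np.zeros((p, p))
    k = p
    for i in range(p):        # unpack quadratic coefficients into Q
        for j in range(i, p):
            Q[i, j] = Q[j, i] = w[k] if i == j else 0.5 * w[k]
            k += 1
    # Eq. (9); in practice Q may need regularizing before inversion.
    R = np.linalg.inv(Q)
    mu = -0.5 * np.linalg.solve(Q, q)
    return R, mu, (Q, q, q0)
```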

7. A MEASURE OF PERFORMANCE FOR ANOMALY DETECTION

Because anomalies are rare, measuring the performance of an anomaly detection algorithm can be problematic. Rather than concentrate on the anomalies, however, we will emphasize how well the model fits the normal data. In particular, given an alarm rate α (the rate at which normal samples are predicted to be anomalous), we will compute the volume V(α) of the ellipsoid which contains a fraction 1 − α of the data. We will plot V versus α, and our best algorithms will give the smallest values of V at low α. As we adjust the overall radius of the ellipsoid whose shape is specified by a given covariance matrix, we trace out a curve in the V-versus-α space that has the flavor of a ROC curve. In fact, α directly corresponds to the false alarm rate, and V corresponds to a kind of missed detection rate, since the anomalies that are inside the volume V are the ones that will not be detected.
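Tracing the curve is straightforward given the log_volume_at_coverage helper from the Section 3 sketch (again our illustration; the alarm-rate grid is an assumption):

```python
import numpy as np

def coverage_curve(X_test, mu, R, alarm_rates):
    """V-versus-alpha curve for a fixed covariance shape R: for each
    alarm rate, scale the ellipsoid to contain a fraction 1 - alpha
    of the held-out data and record its log-volume."""
    return np.array([log_volume_at_coverage(X_test, mu, R, a)
                     for a in alarm_rates])

# Out-of-sample protocol, as in Fig. 1(b,c): estimate the covariance
# on one half of the data, trace the curve on the other half.
# alphas = np.logspace(-4, 0, 25)
# curve_sample = coverage_curve(X_test, mu, R_sample, alphas)
# curve_robust = coverage_curve(X_test, mu, R_robust, alphas)
```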

Fig. 1(b,c) shows two such curves. As the alarm rate decreases, the volume necessary for achieving that alarm rate increases. For the low alarm rates, we see that the periphery-characterizing estimates outperform the standard and robust estimates. The robust estimator is best at larger values of α – that is, it does a better job of characterizing the core of the distribution – but substantially worse at the low values of α that we care about. Some algorithms (such as eigenvalue adjustment) do not have much influence at small p but are very effective in large dimensions, while others (such as the support vector machine) are difficult to implement at high dimension.

We remark that the MINVOL [10] algorithm seeks the minimum-volume ellipsoid that covers h out of m points in a multi-dimensional dataset. This is exactly the condition we want to optimize, but MINVOL is notoriously expensive. A faster heuristic, which computes a covariance from those h points, has been suggested [11]; but this amounts to a robust estimator of the core covariance, and we care about the periphery.

8. DISCUSSION AND CONCLUSIONS

In the ideal case of a multivariate Gaussian distribution, the contours are concentric ellipsoids, fully characterized by a mean vector and covariance matrix. Furthermore, the optimal estimators of these parameters are the sample mean and sample covariance. These statistics give equal weight to all data samples, whether they are in the core or the periphery of the distribution. But for deviations from this ideal, it may be preferable to emphasize data in the periphery of the distribution. This is done explicitly in the weighting function shown in Eq. (4), and implicitly when a support vector machine is used to learn that contour.


[Figure 1 appears here: three panels (a,b,c). Panel (a) plots the data over its first two coordinates, with contours labeled lo and hi. Panels (b,c) plot Log Volume against False Alarm Rate (10⁻⁴ to 10⁰); the legend of panel (b) lists sample, robust, eigenvalue adjusted, anti-shrinkage, and GMM, and the legend of panel (c) lists sample, robust, anti-robust, and SVM.]
Fig. 1. (a) The mixture-of-Gaussians model is illustrated on the first two coordinates of a hyperspectral AVIRIS (Airborne Visible/InfraRed Imaging Spectrometer [9]) image of the Florida coastline, from data set f960323t01p02 r04 sc01. Contours corresponding to coverage of 95% and 99.9% of the data are shown for R_lo and R_hi. Although R_lo more effectively (i.e., with smaller area) covers the core of the data, we see that R_hi more effectively characterizes the periphery. (b,c) Coverage plots show how the volume V of the ellipsoid increases as the fraction of uncovered data (the alarm rate) α decreases, using various algorithms to estimate the covariance matrix. The middle panel is for the first p = 3 principal components, and the right panel is for all p = 224 spectral channels of the AVIRIS data. Half the points are used to estimate covariance, and the other half are used to estimate performance, so these are out-of-sample results.

It is widely recognized that hyperspectral data is generally more fat-tailed than a Gaussian distribution, but it has recently become apparent that the “fatness” of those tails is different in different directions [4, 6, 12]. A consequence of this observation is that the best covariance matrix for characterizing the core of the data may differ from the best covariance matrix for characterizing the periphery. The approach we suggest here follows Vapnik’s dictum [13] – rather than attempt to characterize the full distribution, we seek instead to characterize only the contour on the periphery.

9. REFERENCES

[1] A. Schaum, “Hyperspectral anomaly detection: Beyond RX,” Proc. SPIE, vol. 6565, 2007.

[2] W. F. Baesner, “Clutter and anomaly removal for enhanced target detection,” Proc. SPIE, vol. 7695, p. 769525, 2010.

[3] N. A. Campbell, “Robust procedures in multivariate analysis I: Robust covariance estimation,” Applied Statistics, vol. 29, pp. 231–237, 1980.

[4] S. M. Adler-Golden, “Improved hyperspectral anomaly detection in heavy-tailed backgrounds,” Proc. First IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), 2009. DOI: 10.1109/WHISPERS.2009.5289019.

[5] C. E. Caefer, J. Silverman, O. Orthal, D. Antonelli, Y. Sharoni, and S. R. Rotman, “Improved covariance matrices for point target detection in hyperspectral data,” Optical Engineering, vol. 47, p. 076402, 2008.

[6] J. Theiler, B. R. Foy, and A. M. Fraser, “Characterizing non-Gaussian clutter and detecting weak gaseous plumes in hyperspectral imagery,” Proc. SPIE, vol. 5806, pp. 182–193, 2005.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm (with discussion),” Journal of the Royal Statistical Society B, vol. 39, pp. 1–38, 1977.

[8] I. Steinwart, D. Hush, and C. Scovel, “A classification framework for anomaly detection,” Journal of Machine Learning Research, vol. 6, pp. 211–232, 2005.

[9] G. Vane, R. O. Green, T. G. Chrien, H. T. Enmark, E. G. Hansen, and W. M. Porter, “The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS),” Remote Sensing of Environment, vol. 44, pp. 127–143, 1993.

[10] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, Wiley-Interscience, New York, 1987.

[11] P. J. Rousseeuw and K. Van Driessen, “A fast algorithm for the minimum covariance determinant estimator,” Technometrics, vol. 41, pp. 212–223, 1999.

[12] P. Bajorski, “Maximum Gaussianity models for hyperspectral images,” Proc. SPIE, vol. 6966, p. 69661M, 2008.

[13] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 2nd edition, 1999.