Multivariate outlier detection

Estimating Distance Distributions and Testing Observation Outlyingness for Complex Surveys

Jianqiang Wang

Major Professor: Jean Opsomer
Committee: Wayne A. Fuller, Song X. Chen, Dan Nettleton, Dimitris Margaritis

Upload: jay-jianqiang-wang

Post on 22-Jan-2017



Page 2: Outline

- Introduction
- Notation and assumptions
- Mean- and median-based inference
- Variance estimation
- Simulation study
- Application to the National Resources Inventory
- Theoretical extensions

Page 3: Structure of survey data

- Many finite populations targeted by surveys consist of homogeneous subpopulations.
- "Homogeneity" refers to the variables being collected, which are generally different from the design variables.
- Example: interested in the health condition of U.S. residents between 45 and 60 years old, we stratify by county; homogeneity refers to the health condition variables we collect.

Page 4: Conceptual ideas

- Given this population structure, provide a measure of outlyingness and flag unusual points.
- Assign each point to a subpopulation and define a measure of outlyingness within it.
- Dimension reduction: describe multivariate populations, identify outliers, and discriminate objects.

Page 5: Outlier identification procedure (1)

- Identify the target variables on which to test outlyingness.
- Partition the population into a number of relatively homogeneous groups.
- Define a measure of subpopulation center and a distance metric from each point to its subpopulation center.
- Define the outlyingness of each point as the fraction of points in its subpopulation with a less extreme distance.

Page 6: Outlier identification procedure (2)

- Estimate the distance distribution and the outlyingness of each point.
- Flag observations whose measure of outlyingness exceeds a prespecified threshold (e.g., 0.95 or 0.98).
- Make decisions on the list of suspicious points.
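The two-slide procedure can be sketched in code. A minimal unweighted illustration (the function name, the group mean as center, and the Euclidean distance are my choices; the talk estimates these quantities with survey weights and allows other centers and norms):

```python
import numpy as np

def flag_outliers(y, groups, threshold=0.95):
    """Within each group, score each point by the fraction of group
    members with a strictly smaller Euclidean distance to the group
    mean, then flag scores above the threshold. Unweighted sketch of
    the procedure; the talk estimates these quantities with survey
    weights and allows other centers and norms."""
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    scores = np.zeros(len(y))
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        d = np.linalg.norm(y[idx] - y[idx].mean(axis=0), axis=1)
        # outlyingness = fraction of same-group points with smaller distance
        scores[idx] = (d[:, None] > d[None, :]).mean(axis=1)
    return scores, scores > threshold
```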

Page 7: Inference in survey sampling

- The target measure of outlyingness is defined at the finite population level.
- Two mechanisms:
  - a mechanism for generating the finite population;
  - a mechanism for drawing a sample.
- Condition on the finite population and use design-based inference.
- Asymptotic theory in survey sampling:
  - a sequence of finite populations;
  - a sequence of sampling designs.

Page 8: Sequence of finite populations

- Let $\nu$ be the population index.
- Associated with the $i$-th population element is a $p$-dimensional vector $y_i = (y_{i,1}, \ldots, y_{i,p})$, with inclusion probability $\pi_i$.
- Finite population $U_\nu = \{1, 2, \cdots, N_\nu\}$ of size $N_\nu$, partitioned as $U_\nu = \bigcup_{g=1}^{G} U_{\nu g}$ with subpopulation sizes $N_{\nu g}$; sample $A_\nu = \bigcup_{g=1}^{G} A_{\nu,g}$ with subpopulation sample sizes $n_{\nu g}$ and expected sample sizes $n^*_{\nu g} = E(n_{\nu g} \mid F_\nu)$.
- Population composition $f_{Ng} = N_g / N$, with $f_{Ng} \in [f_L, f_H]$.
- Assume we know $G$ and the subpopulation association of each element.
- Let $F_\nu$ be the power set of $\{y_1, y_2, \cdots, y_{N_\nu}\}$.

Page 9: Sequence of sampling designs

- A probability sample $A_N$ is drawn from $U_N$ with respect to some measurable design.
- Associate a sampling indicator $I_{(i \in A_N)}$ with each element.
- Inclusion probabilities
  $\pi_i = \Pr(i \in A_N) = E(I_{(i \in A_N)} \mid F_N)$ and
  $\pi_{ij} = \Pr(i, j \in A_N) = E(I_{(i \in A_N)} I_{(j \in A_N)} \mid F_N)$.
- Sample size $n$ with expectation $n^* = E(n \mid F_N)$.

Page 10: Examples of sampling designs and estimators

- Simple random sampling without replacement: inclusion probabilities
  $\pi_i = \frac{n}{N}$, and $\pi_{ij} = \frac{(n-1)n}{(N-1)N}$ for $i \ne j$.
- Poisson sampling: arbitrary $\pi_i$, with
  $\pi_{ij} = \pi_i \pi_j$ for $i \ne j$ and $\pi_{ij} = \pi_i$ for $i = j$.
- Horvitz-Thompson and Hájek estimators of the mean.
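The two mean estimators mentioned on the slide can be sketched as follows (illustrative function names, not code from the talk):

```python
import numpy as np

def ht_mean(y, pi, N):
    """Horvitz-Thompson estimator of the population mean: weight each
    sampled value by 1/pi_i and divide by the known population size N."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    return (y / pi).sum(axis=0) / N

def hajek_mean(y, pi):
    """Hajek estimator: divide by the estimated population size
    N_hat = sum(1/pi_i) instead of the known N."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    return (y / pi).sum(axis=0) / (1.0 / pi).sum()
```

Under simple random sampling the two coincide; they differ under unequal-probability designs.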

Page 11: Norms

- Use the notion of a norm to quantify the distance between an observation and the measure of center.
- A norm $\|\cdot\| : \mathbb{R}^p \to \mathbb{R}^+$ satisfies:
  - Non-degeneracy: $\|\mu\| = 0 \Leftrightarrow \mu = 0$
  - Homogeneity: $\|\alpha\mu\| = |\alpha| \|\mu\|$
  - Triangle inequality: $\|\mu_1 + \mu_2\| \le \|\mu_1\| + \|\mu_2\|$

Page 12: Examples of norms and unit circles

- Manhattan distance: $L_1: \|\mu\|_1 = \sum_{i=1}^{p} |\mu_i|$
- Euclidean distance: $L_2: \|\mu\|_2 = \sqrt{\sum_{i=1}^{p} \mu_i^2}$
- Supremum norm: $L_\infty: \|\mu\|_\infty = \max\{|\mu_1|, \cdots, |\mu_p|\}$
- Quadratic norm: $L_A: \|\mu\|_A = \sqrt{\mu' A \mu}$
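The four norms can be illustrated directly (`norms` is an illustrative helper, not from the talk; `A` must be positive definite for $L_A$ to be a norm):

```python
import numpy as np

def norms(mu, A=None):
    """The four norms from the slide for a vector mu; A is the positive
    definite matrix of the quadratic norm (identity if omitted)."""
    mu = np.asarray(mu, float)
    if A is None:
        A = np.eye(len(mu))
    return {
        "L1": np.abs(mu).sum(),             # Manhattan
        "L2": np.sqrt((mu ** 2).sum()),     # Euclidean
        "Linf": np.abs(mu).max(),           # supremum
        "LA": float(np.sqrt(mu @ A @ mu)),  # quadratic
    }
```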

Page 13: Distribution of population distances

- Distance distribution ($\mu$ is the location, $d$ the radius):
  - Population: $D_{\nu,d}(\mu_\nu) = \frac{1}{N_\nu} \sum_{U_\nu} I(\|y_i - \mu_\nu\| \le d)$
  - Sample: $\hat{D}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{1}{\pi_i} I(\|y_i - \hat{\mu}_\nu\| \le d)$, where $\hat{N}_\nu = \sum_{A_\nu} \frac{1}{\pi_i}$
- Measure of center: the mean vector
  - Population: $\mu_\nu = \frac{1}{N_\nu} \sum_{U_\nu} y_i$
  - Sample: $\hat{\mu}_\nu = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{y_i}{\pi_i}$
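The sample (Hájek-type) estimator of the distance distribution can be sketched as follows (illustrative function; Euclidean norm assumed, though the talk allows any norm):

```python
import numpy as np

def distance_distribution(y, pi, center, d):
    """Weighted estimator of the within-subpopulation distance
    distribution: the 1/pi_i-weighted fraction of sampled points whose
    Euclidean distance to `center` is at most d."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    w = 1.0 / pi
    dist = np.linalg.norm(y - center, axis=1)
    return float((w * (dist <= d)).sum() / w.sum())
```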

Page 14: Bivariate population

[Figure: a bivariate population with the circle $\{y : \|y - \mu\| = d\}$ drawn around the center $\mu$, illustrating $D_{\nu,d}(\mu_\nu)$ and $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$.]

Page 15: Nondifferentiability with respect to location

Page 16: General design assumptions

- Assumptions on $\pi_i$, $\pi_{ij}$, and the design variance:
  - $K_L \le \frac{N}{n}\pi_i \le K_U$
  - $n = O_p(N_\nu^{\beta})$, with $\beta \in \left(\frac{2p}{2p+1}, 1\right]$
- For any vector $z$ with finite $2+\delta$ moments, define $\bar{z}_{N,\pi}$ as the HT estimator of the mean, and assume
  $\mathrm{Var}(\bar{z}_{N,\pi} \mid F_N) \le K_1 \mathrm{Var}_{SRS}(\bar{z}_{N,SRS} \mid F_N)$.
- For any $z$ with positive definite population variance-covariance matrix and finite fourth moment,
  $n^{*1/2} (\bar{z}_{N,\pi} - \bar{z}_N) \mid F_N \xrightarrow{d} N(0, \Sigma_{zz})$
  and
  $[V(\bar{z}_{N,\pi} \mid F_N)]^{-1} \hat{V}_{HT}\{\bar{z}_{N,\pi}\} - I_{p \times p} = O_p(n^{*-1/2})$.

Page 17: Application-specific assumption 1

- The population distance distribution converges to a limiting function,
  $\lim_{N \to \infty} D_{\nu,d}(\mu) = D_d(\mu)$, for $(d, \mu) \in [0, \infty) \times \mathbb{R}^p$.
- The limiting function $D_d(\mu)$ is continuous in $d \in [0, \infty)$ and $\mu \in \mathbb{R}^p$, with finite derivatives $\frac{\partial D_d(\mu)}{\partial d}$, $\frac{\partial D_d(\mu)}{\partial \mu}$ and $\frac{\partial^2 D_d(\mu)}{\partial \mu^2}$.
- The norm $\|\cdot\|$ is continuous on $\mathbb{R}^p$, with a continuous derivative $\psi(\cdot)$ and bounded second derivative matrix $H_s(\cdot)$.

Page 18: Application-specific assumption 2

- The population quantity
  $\sqrt{\frac{N_\nu}{n}} \left\{ \frac{1}{N_\nu} \sum_{U_\nu} I(d < \|y_i - \mu\| \le d + h_{N_\nu}) - \frac{\partial D_d(\mu)}{\partial d} h_{N_\nu} \right\} \to 0$,
  where $h_\nu = O(N_\nu^{-\alpha})$ and $\alpha \in [\frac{1}{4}, 1)$.
- Justification assumes a probabilistic model.
- Proof: Markov's inequality, Borel-Cantelli lemma.

Page 19: Application-specific assumption 3

- The population quantity
  $\frac{n_\nu^{*1/2}}{N_\nu} \sum_{U_\nu} \left[ I(\|y_i - \mu - n_\nu^{*-1/2} s\| \le d) - I(\|y_i - \mu\| \le d) - D_d(\mu + n_\nu^{*-1/2} s) + D_d(\mu) \right]$
  converges to 0 uniformly for $s \in C_s$ and $\mu \in \mathbb{R}^p$.
- Justification assumes a probabilistic model.
- Proof: empirical process theory.

Page 20: Design consistency

- Decomposition:
  $n_\nu^{*1/2} \left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - D_{\nu,d}(\mu_\nu) \right)
  = n_\nu^{*1/2} \left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\mu_\nu) - D_d(\hat{\mu}_\nu) + D_d(\mu_\nu) \right)
  + n_\nu^{*1/2} \left( \hat{D}_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu) \right)
  + n_\nu^{*1/2} \left( D_d(\hat{\mu}_\nu) - D_d(\mu_\nu) \right)$
- Intermediate results:
  $n_\nu^{*1/2} \left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\mu_\nu) - D_d(\hat{\mu}_\nu) + D_d(\mu_\nu) \right) \xrightarrow{p} 0$
  and
  $n_\nu^{*1/2} \left( \hat{D}_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu) \right) \big| F_\nu = O_p(1)$
- Consistency follows.

Page 21: Asymptotic normality

- Let $b_{\mu,i} = (I(\|y_i - \mu_\nu\| \le d), 1, y_i)'$ and let $\Sigma_{\mu,d}$ be the design variance-covariance matrix of the HT estimator of the mean of $b_{\mu,i}$.
- Let
  $a_\mu = \left[ 1, \; -D_{\nu,d}(\mu_\nu) - \left( \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu} \right)' \mu_\nu, \; \left( \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu} \right)' \right]'$,
  whose components reflect, respectively, the case where the subpopulation size and mean are assumed known, the unknown subpopulation size, and the unknown subpopulation mean.
- Asymptotic normality:
  $\left( a_\mu' \Sigma_{\mu,d} a_\mu \right)^{-1/2} \left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - D_{\nu,d}(\mu_\nu) \right) \big| F_\nu \xrightarrow{d} N(0, 1)$

Page 22: Multivariate median

- Mean vector (as before).
- Generalized median:
  - Population: $q_\nu = \arg\inf_q \sum_{U_\nu} \|y_i - q\|$
  - Sample: $\hat{q}_\nu = \arg\inf_q \sum_{A_\nu} \frac{1}{\pi_i} \|y_i - q\|$
- Existence and uniqueness.

Page 23: Multivariate median

- Estimating equations:
  - Population: $\sum_{U_\nu} \psi(y_i - q) = 0$
  - Sample: $\sum_{A_\nu} \frac{1}{\pi_i} \psi(y_i - q) = 0$
- Linearization of $\hat{q}_\nu$:
  $\hat{q}_\nu = q_\nu + \left[ \frac{1}{N_\nu} \sum_{i \in A_\nu} \frac{H_s(y_i - q_\nu)}{\pi_i} \right]^{-1} \frac{1}{N_\nu} \sum_{i \in A_\nu} \frac{\psi(y_i - q_\nu)}{\pi_i} + o_p(n_\nu^{*-1/2})$
- What if the estimating equation is not differentiable?
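The generalized median has no closed form; one standard way to compute it is a weighted Weiszfeld iteration. A sketch under the assumption that the survey weights $1/\pi_i$ play the role of `w` (the algorithm choice is mine; the talk only defines the estimating equations):

```python
import numpy as np

def weighted_spatial_median(y, w, n_iter=200, eps=1e-9):
    """Weiszfeld-type iteration for the weighted spatial (L2) median:
    the q minimizing sum_i w_i * ||y_i - q||. In the talk's setting the
    weights would be the survey weights 1/pi_i."""
    y = np.asarray(y, float)
    w = np.asarray(w, float)
    q = np.average(y, axis=0, weights=w)   # start at the weighted mean
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(y - q, axis=1), eps)
        u = w / dist                        # reweight by inverse distance
        q_new = (u[:, None] * y).sum(axis=0) / u.sum()
        if np.linalg.norm(q_new - q) < eps:
            return q_new
        q = q_new
    return q
```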

Page 24: Median-based distances: asymptotic results

- Design consistency and asymptotic normality of $\hat{q}_\nu$ for $q_\nu$.
- Design consistency and asymptotic normality of $\hat{D}_{\nu,d}(\hat{q}_\nu)$ as an estimator of $D_{\nu,d}(q_\nu)$.

Page 25: Mahalanobis distances

- Mean- and median-based inference.
- Choose an appropriate norm to match the shape of the underlying multivariate distribution.
- Estimate the variance-covariance matrix (or another shape measure) of each subpopulation and use the Mahalanobis distance.
- Estimate the distribution of the Mahalanobis distances.
- See the application section for more details.

Page 26: Naive variance estimator

- Use the mean-based case to explain the variance estimators.
- Recall the asymptotic variance of $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$:
  $V\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right) = a_\mu' \Sigma_{\mu,d} a_\mu$,
  where
  $a_\mu = \left( 1, \; -D_{\nu,d}(\mu_\nu) - \left( \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu} \right)' \mu_\nu, \; \left( \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu} \right)' \right)'$
- Claim: the extra variance due to estimating the center can be ignored for elliptical distributions with a quadratic norm.
- Naive variance estimator, ignoring the gradient vector:
  $\sigma^2_{\mu,d,\mathrm{naive}} = \left( 1, -\hat{D}_{\nu,d}(\hat{\mu}_\nu) \right) \hat{\Sigma}_{\mu,d} \left( 1, -\hat{D}_{\nu,d}(\hat{\mu}_\nu) \right)'$

Page 27: Estimating the gradient vector by kernel smoothing

- Idea: estimate $D_d(\mu) = \lim_{\nu \to \infty} \frac{1}{N_\nu} \sum_{U_\nu} I(\|y_i - \mu\| \le d)$ by the smoothed estimator
  $\frac{1}{\hat{N}_\nu} \sum_{A_\nu} K\!\left( \frac{d - \|y_i - \mu\|}{h} \right) \frac{1}{\pi_i}$,
  where $K(t) = \int_{-\infty}^{t} k(u)\, du$, e.g. the CDF of the standard normal.
- Kernel estimator of the gradient:
  $\hat{\zeta}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu h} \sum_{A_\nu} k\!\left( \frac{d - \|y_i - \hat{\mu}_\nu\|}{h} \right) \psi(y_i - \hat{\mu}_\nu) \frac{1}{\pi_i}$
- Design consistent for $\frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}$ under mild assumptions.
- Jackknife variance estimation has been proposed for the mean-based case.
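The smoothed estimator on this slide can be sketched with the Gaussian-CDF kernel it mentions (illustrative function, not the talk's code; as $h \to 0$ it reduces to the indicator-based estimator):

```python
import numpy as np
from math import erf

def smoothed_distance_distribution(y, pi, center, d, h):
    """Kernel-smoothed analogue of the weighted distance-distribution
    estimator: the indicator I(||y_i - center|| <= d) is replaced by the
    standard normal CDF of (d - ||y_i - center||)/h, the example kernel
    given on the slide."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    w = 1.0 / pi
    dist = np.linalg.norm(y - center, axis=1)
    z = (d - dist) / (h * np.sqrt(2.0))
    K = 0.5 * (1.0 + np.array([erf(t) for t in z]))  # Phi((d - dist)/h)
    return float((w * K).sum() / w.sum())
```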

Page 28: Jackknife variance estimator

- Recall $\hat{D}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{1}{\pi_i} I(\|y_i - \hat{\mu}_\nu\| \le d)$.
- Recalculate the mean for each jackknife replicate? Inconsistent!
- Proposed idea: incorporate an estimated gradient vector in the replication estimation.
- For the $l$-th replicate sample, calculate
  $\hat{D}^{(l)}(\hat{\mu}_\nu) = \hat{D}^{(l)}_{\nu,d}(\hat{\mu}_\nu) + \hat{\zeta}_{\nu,d}(\hat{\mu}_\nu) \left( \hat{\mu}^{(l)}_\nu - \hat{\mu}_\nu \right)$
  and use
  $\hat{V}_{JK}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right) = \sum_{l=1}^{L} c_l \left( \hat{D}^{(l)}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right)^2$
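A delete-one-unit sketch of the gradient-corrected jackknife (assumptions: the coefficients $c_l = (n-1)/n$ correspond to an SRS-type design, and `zeta` stands in for the kernel-estimated gradient; the talk's replicates follow the actual survey design):

```python
import numpy as np

def jackknife_variance_D(y, pi, d, zeta):
    """Delete-one jackknife for the distance-distribution estimator with
    the slide's gradient correction: each replicate keeps the full-sample
    center mu_hat inside the indicator and adds zeta @ (mu_hat_l - mu_hat)."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    n = len(y)
    w = 1.0 / pi

    def hajek_mean(mask):
        return (w[mask][:, None] * y[mask]).sum(axis=0) / w[mask].sum()

    def d_hat(mask, center):
        dist = np.linalg.norm(y[mask] - center, axis=1)
        return (w[mask] * (dist <= d)).sum() / w[mask].sum()

    full = np.ones(n, dtype=bool)
    mu = hajek_mean(full)
    d_full = d_hat(full, mu)
    reps = []
    for l in range(n):
        mask = full.copy()
        mask[l] = False   # drop unit l, keep the full-sample center
        reps.append(d_hat(mask, mu) + zeta @ (hajek_mean(mask) - mu))
    return float((n - 1) / n * ((np.asarray(reps) - d_full) ** 2).sum())
```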

Page 29: Simulation study

- Goals of the simulation study:
  - Assess the asymptotic properties of the estimators.
  - Compare the naive variance estimator with the kernel-based estimator.
- Simulation parameters:
  - $p = 2$, $G = 5$.
  - Subpopulations 1-4 are elliptically contoured; subpopulation 5 is skewed.
  - Stratified SRS.
  - Norm: Euclidean.

Page 30: Simulated population

Page 31: Subpopulation distance distribution functions

Page 32: Effect of estimating the center ($N$ = 5000, $n$ = 1000, $G$ = 5)

                                                     Cluster 4             Cluster 5
  $d$                                             1.00   1.41   2.45    1.00   1.41   2.45
  $D_{\nu,d}(\mu_\nu)$                             .44    .54    .71     .31    .52    .85
  bias$(\hat D(\hat\mu))/$sd$(\hat D(\hat\mu))$  -0.11   0.00  -0.00   -0.00  -0.00   0.05
  bias$(\hat D(\mu))/$sd$(\hat D(\mu))$          -0.01   0.00  -0.01    0.00   0.00   0.01
  sd$(\hat D(\hat\mu))/$sd$(\hat D(\mu))$         1.03   1.00   1.00    1.30   1.13   1.00

Page 33: Effect of estimating the center ($N$ = 5000, $n$ = 200, $G$ = 5)

                                                     Cluster 4             Cluster 5
  $d$                                             1.00   1.41   2.45    1.00   1.41   2.45
  $D_{\nu,d}(\mu_\nu)$                             .43    .53    .68     .35    .55    .88
  bias$(\hat D(\hat\mu))/$sd$(\hat D(\hat\mu))$  -0.28  -0.05   0.12    0.05   0.14   0.10
  bias$(\hat D(\mu))/$sd$(\hat D(\mu))$           0.03   0.04   0.06    0.00  -0.01   0.02
  sd$(\hat D(\hat\mu))/$sd$(\hat D(\mu))$         1.17   1.03   1.01    1.16   1.04   0.96

Page 34: Average estimated variance relative to MC variance ($N$ = 5000, $n$ = 1000, $G$ = 5)

                                                       Cluster 4             Cluster 5
  $d$                                               1.00   1.41   2.45    1.00   1.41   2.45
  $D_{\nu,d}(\mu_\nu)$                              0.44   0.54   0.71    0.31   0.52   0.85
  $\sigma^2_{d,NV}/\sigma^2_{d,MC}$                 0.94   1.00   1.00    0.53   0.78   1.07
  $\sigma^2_{d,SM}/\sigma^2_{d,MC}$ ($h = 0.1$)     1.21   1.15   1.12    1.00   1.01   1.04
  $\sigma^2_{d,SM}/\sigma^2_{d,MC}$ ($h = 0.4$)     1.07   1.06   1.04    0.85   0.94   0.98

Page 35: NRI application

- Introduction to the NRI.
- Outlier identification for a longitudinal survey.
- Strategy for initial partitioning in the NRI.
- How to define Mahalanobis distances.
- Analysis of identified points.

Page 36: National Resources Inventory (1)

- The National Resources Inventory is a longitudinal survey of natural resources on non-Federal land in the U.S.
- Conducted by the USDA NRCS, in cooperation with the CSSM at Iowa State University.
- Produces a longitudinal database containing numerous agro-environmental variables for scientific investigation and policy-making.
- Information was updated every 5 years before 1997, and is now updated annually through a partially overlapping subsampling design.

Page 37: National Resources Inventory (2)

- Covers various aspects of land use, farming practice, and environmentally important variables such as wetland status and soil erosion.
- Measures both the level of and the change over time in these variables.
- The primary mode of data collection is a combination of aerial photography and field collection.
- Outliers arise from errors in data collection or processing, or from real points that behave abnormally.

Page 38: Outlier identification for a longitudinal survey

- Identify outliers in periodically updated data.
- Build outlier identification rules on previous years' data and use the rules to flag current observations.
- Observed years 2001-2005: training set (2001, 2002, 2003); test set (2003, 2004, 2005).

Page 39: Target variables

- Non-pseudo core points with soil erosion in years 2001-2005.
- Variables: broad use, land use, USLE C factor, support practice factor, slope, slope length, and USLE loss.
- USLE loss represents the potential long-term soil loss in tons/acre:
  USLELOSS = R * K * LS * C * P

Page 40: Point classification

  b.u.  Point type                 b.u.  Point type
  1     Cultivated cropland         7    Urban and built-up land
  2     Noncultivated cropland      8    Rural transportation
  3     Pastureland                 9    Small water areas
  4     Rangeland                  10    Large water areas
  5     Forest land                11    Federal land
  6     Minor land                 12    CRP

Page 41: Initial partitioning

- Initial partitioning uses geographical association and broad use category:
  1. Partition the national data into state-wise categories.
  2. Collapse the northeastern states.
  3. Partition each region based on the broad use sequence into (1,1,1), (2,2,2), (3,3,3), (12,12,12), and points with broad use change.
  4. Merge points with the same broad use change pattern, say (2,2,3) or (1,1,12).

Page 42: Defining distances

- Estimate the subpopulation mean vector and covariance matrix:
  $\hat{\mu}_\nu = \frac{1}{\hat{N}_\nu} \sum_{S_\nu} \frac{y_i}{\pi_i}$,
  $\hat{\Sigma}_\nu = \frac{1}{\hat{N}_\nu} \sum_{S_\nu} (y_i - \hat{\mu}_\nu)(y_i - \hat{\mu}_\nu)' \frac{1}{\pi_i}$
- Calculate the distance to the center:
  $\|y_i - \hat{\mu}_\nu\|_{\hat{\Sigma}_\nu} = \sqrt{(y_i - \hat{\mu}_\nu)' \hat{\Sigma}_\nu^{-} (y_i - \hat{\mu}_\nu)}$
- The inverse matrix $\hat{\Sigma}_\nu^{-}$ is defined through a principal value decomposition.
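The distance definition above can be sketched as follows; the eigendecomposition-based pseudo-inverse mirrors the slide's principal value decomposition (the cutoff for dropping small eigenvalues is my choice, and this is not the NRI production code):

```python
import numpy as np

def mahalanobis_distances(y, pi):
    """Weighted Mahalanobis distances: 1/pi_i-weighted estimates of the
    mean and covariance, with the inverse built from an eigendecomposition
    so that near-zero eigenvalues are dropped."""
    y = np.asarray(y, float)
    w = 1.0 / np.asarray(pi, float)
    w = w / w.sum()
    mu = w @ y
    r = y - mu
    sigma = (w[:, None] * r).T @ r
    vals, vecs = np.linalg.eigh(sigma)
    keep = vals > 1e-10 * vals.max()   # drop near-zero eigenvalues
    pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T
    return np.sqrt(np.einsum("ij,jk,ik->i", r, pinv, r))
```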

Page 43: Source of outlyingness

- Flagged 1% of the points in the training set, and compared test distances with the 99%-quantile of the training distances.
- Source of outlyingness:
  $e_{\nu,i} = \frac{\hat{\Sigma}_\nu^{-1/2} (\hat{\mu}_\nu - y_i)}{\| \hat{\Sigma}_\nu^{-1/2} (\hat{\mu}_\nu - y_i) \|}$

Page 44: Analysis of flagged points

- Agricultural specialists analyzed the identified points by suspicious variable.
- C factor: almost all points were considered suspicious.
  - Data entry errors, e.g. (0.013, 0.13, 0.013, 0.013, 0.013).
  - Invalid entries: C factor = 1 for hayland, pastureland, or CRP.
  - Unusual levels or trends in relation to land use, e.g. (0.011, 0.06, 0.11, 0.003, 0.003).

Page 45: Analysis of flagged points

- P factor: all points are candidates for review because of the change over time, e.g. (1.0, 1.0, 1.0, 0.6, 1.0).
- Slope length: all points were flagged because of the level, not change over time.
- Land use: most points were flagged because of a change in the type of hayland or pastureland over time; not a major concern to NRCS reviewers.

Page 46: Nondifferentiable survey estimators

- The sample distance distribution $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$ is a nondifferentiable function of the estimated location parameter.
- A general class of survey estimators:
  $\hat{T}(\hat{\lambda}) = \frac{1}{\hat{N}} \sum_{i \in S_\nu} \frac{1}{\pi_i} h(y_i; \hat{\lambda})$
  with corresponding population quantity
  $T_N(\lambda_N) = \frac{1}{N} \sum_{i=1}^{N} h(y_i; \lambda_N)$,
  where $h$ is not necessarily differentiable.
- A direct Taylor linearization may not be applicable; again use a differentiable limiting function $T(\gamma) = \lim_{N \to \infty} T_N(\gamma)$, with derivative $\zeta(\gamma)$.

Page 47: Asymptotics

- We provide a set of sufficient conditions on the limiting function and a number of population quantities under which
  $n^{*1/2} \left[ V(\hat{T}(\hat{\lambda})) \right]^{-1/2} \left( \hat{T}(\hat{\lambda}) - T_N(\lambda_N) \right) \big| F \xrightarrow{d} N(0, 1)$,
  where
  $V(\hat{T}(\hat{\lambda})) = \left( 1, [\zeta(\lambda_N)]' \right) V(\bar{z}_\pi) \begin{pmatrix} 1 \\ \zeta(\lambda_N) \end{pmatrix}$.
- The extra variance due to estimating the unknown parameter may or may not be negligible, depending on the derivative.
- We propose a kernel estimator for the unknown derivative.

Page 48: Estimating a distribution function using auxiliary information

- Ratio model: $y_i = R x_i + \epsilon_i$, $\epsilon_i \sim ID(0, x_i \sigma^2)$.
- Use $\hat{R} x_i$ as a substitute for $y_i$, where $\hat{R} = \frac{\sum_{S_\nu} y_i / \pi_i}{\sum_{S_\nu} x_i / \pi_i}$.
- Difference estimator:
  $\hat{T}(\hat{R}) = \frac{1}{N} \left\{ \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le t) + \left[ \sum_{U} I(\hat{R} x_i \le t) - \sum_{S_\nu} \frac{1}{\pi_i} I(\hat{R} x_i \le t) \right] \right\}$
- The extra variance due to estimating the ratio is negligible (Rao, Kovar and Mantel, 1990).
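The difference estimator on this slide can be sketched as follows (illustrative names; assumes the ratio $\hat{R}$ is estimated from the same sample and that `x_pop` holds the auxiliary variable for every population element):

```python
import numpy as np

def difference_estimator_cdf(y_s, x_s, pi_s, x_pop, t):
    """Difference estimator of the distribution function at t using the
    ratio-model fitted values R_hat * x_i: the HT term for the sample
    plus a population-level correction based on the fitted values."""
    y_s = np.asarray(y_s, float)
    x_s = np.asarray(x_s, float)
    w = 1.0 / np.asarray(pi_s, float)
    x_pop = np.asarray(x_pop, float)
    N = len(x_pop)
    R = (w * y_s).sum() / (w * x_s).sum()        # weighted ratio estimate
    ht = (w * (y_s <= t)).sum()                  # HT term for the sample
    corr = (R * x_pop <= t).sum() - (w * (R * x_s <= t)).sum()
    return float((ht + corr) / N)
```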

Page 49: Estimating a fraction below an estimated quantity

- Estimate the fraction of households in poverty when the poverty line is drawn at 60% of the median income:
  $\hat{T}(\hat{q}) = \frac{1}{\hat{N}} \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le 0.6 \hat{q})$
  with population quantity
  $T_N(q_N) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \le 0.6 q_N)$
- Assume that $\lim_{N \to \infty} T_N(\gamma) = F_Y(0.6\gamma)$; the extra variance depends on $\frac{\partial F_Y(0.6\gamma)}{\partial \gamma}$.
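The poverty-fraction example can be sketched as follows (illustrative function; the weighted-median convention, the smallest value whose cumulative weight share reaches 0.5, is my choice):

```python
import numpy as np

def poverty_fraction(y, pi):
    """Weighted fraction of incomes below 60% of the weighted median,
    the slide's example of a plug-in estimator with an estimated
    threshold. Returns (fraction, poverty line)."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    w = 1.0 / pi
    order = np.argsort(y)
    cum = np.cumsum(w[order]) / w.sum()
    q = y[order][np.searchsorted(cum, 0.5)]   # weighted median
    line = 0.6 * q                            # estimated poverty line
    return float((w * (y <= line)).sum() / w.sum()), float(line)
```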

Page 50: Nondifferentiable estimating equations

- The sample $p$-th quantile can be defined through estimating equations:
  $\hat{S}(t) = \frac{1}{\hat{N}} \sum_{i \in S} \frac{1}{\pi_i} I(y_i - t \le 0) - p$, with $\hat{\xi} = \inf\{ t : \hat{S}(t) \ge 0 \}$,
  and the population analogue
  $S_N(t) = \frac{1}{N} \sum_{i=1}^{N} I(y_i - t \le 0) - p$, with $\xi_N = \inf\{ t : S_N(t) \ge 0 \}$.
- The usual practice is to linearize the estimating function, but this approach is not applicable because of the nondifferentiability.
- We provide a set of sufficient conditions on the monotonicity and smoothness of $S_N(t)$ and its limit for the proof.

Page 51: Concluding remarks

- Proposed an estimator for the subpopulation distance distribution and demonstrated its statistical properties.
- Applied it in a large-scale longitudinal survey.
- Developed theoretical extensions to nondifferentiable survey estimators.

Page 52: Thank you