Multivariate outlier detection

Estimating Distance Distributions and Testing Observation Outlyingness for Complex Surveys

Jianqiang Wang

Major Professor: Jean Opsomer
Committee: Wayne A. Fuller, Song X. Chen, Dan Nettleton, Dimitris Margaritis

Upload: jay-jianqiang-wang

Post on 22-Jan-2017



Page 2: Outline

- Introduction
- Notation and assumptions
- Mean- and median-based inference
- Variance estimation
- Simulation study
- Application to the National Resources Inventory
- Theoretical extensions

Page 3: Structure of survey data

- Many finite populations targeted by surveys consist of homogeneous subpopulations.
- "Homogeneity" refers to the variables being collected, which are generally different from the design variables.
- Example: interested in the health condition of U.S. residents between 45 and 60 years old, we stratify by county; homogeneity refers to the health condition variables we collect.

Page 4: Conceptual ideas

- Given this population structure, provide a measure of outlyingness and flag unusual points.
- Assign each point to a subpopulation and define a measure of outlyingness within it.
- Dimension reduction: describe multivariate populations, identify outliers, and discriminate objects.

Page 5: Outlier identification procedure (1)

- Identify the target variables on which to test outlyingness.
- Partition the population into a number of relatively homogeneous groups.
- Define a measure of subpopulation center and a distance metric from each point to its subpopulation center.
- Define the outlyingness of each point as the fraction of points in its subpopulation with a less extreme distance.

Page 6: Outlier identification procedure (2)

- Estimate the distance distribution and the outlyingness of each point.
- Flag observations whose measure of outlyingness exceeds a prespecified threshold (e.g., 0.95 or 0.98).
- Make decisions on the list of suspicious points.
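The two-slide procedure can be sketched in code. A minimal unweighted illustration (the function name, the group mean as center, and the Euclidean distance are my choices; the talk estimates these quantities with survey weights and allows other centers and norms):

```python
import numpy as np

def flag_outliers(y, groups, threshold=0.95):
    """Within each group, score each point by the fraction of group
    members with a strictly smaller Euclidean distance to the group
    mean, then flag scores above the threshold. Unweighted sketch of
    the procedure; the talk estimates these quantities with survey
    weights and allows other centers and norms."""
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    scores = np.zeros(len(y))
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        d = np.linalg.norm(y[idx] - y[idx].mean(axis=0), axis=1)
        # outlyingness = fraction of same-group points with smaller distance
        scores[idx] = (d[:, None] > d[None, :]).mean(axis=1)
    return scores, scores > threshold
```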

Page 7: Inference in survey sampling

- The target measure of outlyingness is defined at the finite population level.
- Two mechanisms:
  - a mechanism for generating the finite population;
  - a mechanism for drawing a sample.
- Condition on the finite population and use design-based inference.
- Asymptotic theory in survey sampling:
  - a sequence of finite populations;
  - a sequence of sampling designs.

Page 8: Sequence of finite populations

- Let $\nu$ be the population index.
- Associated with the $i$-th population element is a $p$-dimensional vector $y_i = (y_{i,1}, \ldots, y_{i,p})$, with inclusion probability $\pi_i$.
- Finite population $U_\nu = \{1, 2, \cdots, N_\nu\}$ of size $N_\nu$, partitioned as $U_\nu = \bigcup_{g=1}^{G} U_{\nu g}$ with subpopulation sizes $N_{\nu g}$; sample $A_\nu = \bigcup_{g=1}^{G} A_{\nu,g}$ with subpopulation sample sizes $n_{\nu g}$ and expected sample sizes $n^*_{\nu g} = E(n_{\nu g} \mid F_\nu)$.
- Population composition $f_{Ng} = N_g / N$, with $f_{Ng} \in [f_L, f_H]$.
- Assume we know $G$ and the subpopulation association of each element.
- Let $F_\nu$ be the power set of $\{y_1, y_2, \cdots, y_{N_\nu}\}$.

Page 9: Sequence of sampling designs

- A probability sample $A_N$ is drawn from $U_N$ with respect to some measurable design.
- Associate a sampling indicator $I_{(i \in A_N)}$ with each element.
- Inclusion probabilities
  $\pi_i = \Pr(i \in A_N) = E(I_{(i \in A_N)} \mid F_N)$ and
  $\pi_{ij} = \Pr(i, j \in A_N) = E(I_{(i \in A_N)} I_{(j \in A_N)} \mid F_N)$.
- Sample size $n$ with expectation $n^* = E(n \mid F_N)$.

Page 10: Examples of sampling designs and estimators

- Simple random sampling without replacement: inclusion probabilities
  $\pi_i = \frac{n}{N}$, and $\pi_{ij} = \frac{(n-1)n}{(N-1)N}$ for $i \ne j$.
- Poisson sampling: arbitrary $\pi_i$, with
  $\pi_{ij} = \pi_i \pi_j$ for $i \ne j$ and $\pi_{ij} = \pi_i$ for $i = j$.
- Horvitz-Thompson and Hájek estimators of the mean.
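The two mean estimators mentioned on the slide can be sketched as follows (illustrative function names, not code from the talk):

```python
import numpy as np

def ht_mean(y, pi, N):
    """Horvitz-Thompson estimator of the population mean: weight each
    sampled value by 1/pi_i and divide by the known population size N."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    return (y / pi).sum(axis=0) / N

def hajek_mean(y, pi):
    """Hajek estimator: divide by the estimated population size
    N_hat = sum(1/pi_i) instead of the known N."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    return (y / pi).sum(axis=0) / (1.0 / pi).sum()
```

Under simple random sampling the two coincide; they differ under unequal-probability designs.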

Page 11: Norms

- Use the notion of a norm to quantify the distance between an observation and the measure of center.
- A norm $\|\cdot\| : \mathbb{R}^p \to \mathbb{R}^+$ satisfies:
  - Non-degeneracy: $\|\mu\| = 0 \Leftrightarrow \mu = 0$
  - Homogeneity: $\|\alpha\mu\| = |\alpha| \|\mu\|$
  - Triangle inequality: $\|\mu_1 + \mu_2\| \le \|\mu_1\| + \|\mu_2\|$

Page 12: Examples of norms and unit circles

- Manhattan distance: $L_1: \|\mu\|_1 = \sum_{i=1}^{p} |\mu_i|$
- Euclidean distance: $L_2: \|\mu\|_2 = \sqrt{\sum_{i=1}^{p} \mu_i^2}$
- Supremum norm: $L_\infty: \|\mu\|_\infty = \max\{|\mu_1|, \cdots, |\mu_p|\}$
- Quadratic norm: $L_A: \|\mu\|_A = \sqrt{\mu' A \mu}$
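The four norms can be illustrated directly (`norms` is an illustrative helper, not from the talk; `A` must be positive definite for $L_A$ to be a norm):

```python
import numpy as np

def norms(mu, A=None):
    """The four norms from the slide for a vector mu; A is the positive
    definite matrix of the quadratic norm (identity if omitted)."""
    mu = np.asarray(mu, float)
    if A is None:
        A = np.eye(len(mu))
    return {
        "L1": np.abs(mu).sum(),             # Manhattan
        "L2": np.sqrt((mu ** 2).sum()),     # Euclidean
        "Linf": np.abs(mu).max(),           # supremum
        "LA": float(np.sqrt(mu @ A @ mu)),  # quadratic
    }
```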

Page 13: Distribution of population distances

- Distance distribution ($\mu$ is the location, $d$ the radius):
  - Population: $D_{\nu,d}(\mu_\nu) = \frac{1}{N_\nu} \sum_{U_\nu} I(\|y_i - \mu_\nu\| \le d)$
  - Sample: $\hat{D}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{1}{\pi_i} I(\|y_i - \hat{\mu}_\nu\| \le d)$, where $\hat{N}_\nu = \sum_{A_\nu} \frac{1}{\pi_i}$
- Measure of center: the mean vector
  - Population: $\mu_\nu = \frac{1}{N_\nu} \sum_{U_\nu} y_i$
  - Sample: $\hat{\mu}_\nu = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{y_i}{\pi_i}$
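The sample (Hájek-type) estimator of the distance distribution can be sketched as follows (illustrative function; Euclidean norm assumed, though the talk allows any norm):

```python
import numpy as np

def distance_distribution(y, pi, center, d):
    """Weighted estimator of the within-subpopulation distance
    distribution: the 1/pi_i-weighted fraction of sampled points whose
    Euclidean distance to `center` is at most d."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    w = 1.0 / pi
    dist = np.linalg.norm(y - center, axis=1)
    return float((w * (dist <= d)).sum() / w.sum())
```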

Page 14: Bivariate population

[Figure: a bivariate population with the circle $\{y : \|y - \mu\| = d\}$ drawn around the center $\mu$, illustrating $D_{\nu,d}(\mu_\nu)$ and $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$.]

Page 15: Nondifferentiability with respect to location

Page 16: General design assumptions

- Assumptions on $\pi_i$, $\pi_{ij}$, and the design variance:
  - $K_L \le \frac{N}{n}\pi_i \le K_U$
  - $n = O_p(N_\nu^{\beta})$, with $\beta \in \left(\frac{2p}{2p+1}, 1\right]$
- For any vector $z$ with finite $2+\delta$ moments, define $\bar{z}_{N,\pi}$ as the HT estimator of the mean, and assume
  $\mathrm{Var}(\bar{z}_{N,\pi} \mid F_N) \le K_1 \mathrm{Var}_{SRS}(\bar{z}_{N,SRS} \mid F_N)$.
- For any $z$ with positive definite population variance-covariance matrix and finite fourth moment,
  $n^{*1/2} (\bar{z}_{N,\pi} - \bar{z}_N) \mid F_N \xrightarrow{d} N(0, \Sigma_{zz})$
  and
  $[V(\bar{z}_{N,\pi} \mid F_N)]^{-1} \hat{V}_{HT}\{\bar{z}_{N,\pi}\} - I_{p \times p} = O_p(n^{*-1/2})$.

Page 17: Application-specific assumption 1

- The population distance distribution converges to a limiting function,
  $\lim_{N \to \infty} D_{\nu,d}(\mu) = D_d(\mu)$, for $(d, \mu) \in [0, \infty) \times \mathbb{R}^p$.
- The limiting function $D_d(\mu)$ is continuous in $d \in [0, \infty)$ and $\mu \in \mathbb{R}^p$, with finite derivatives $\frac{\partial D_d(\mu)}{\partial d}$, $\frac{\partial D_d(\mu)}{\partial \mu}$ and $\frac{\partial^2 D_d(\mu)}{\partial \mu^2}$.
- The norm $\|\cdot\|$ is continuous on $\mathbb{R}^p$, with a continuous derivative $\psi(\cdot)$ and bounded second derivative matrix $H_s(\cdot)$.

Page 18: Application-specific assumption 2

- The population quantity
  $\sqrt{\frac{N_\nu}{n}} \left\{ \frac{1}{N_\nu} \sum_{U_\nu} I(d < \|y_i - \mu\| \le d + h_{N_\nu}) - \frac{\partial D_d(\mu)}{\partial d} h_{N_\nu} \right\} \to 0$,
  where $h_\nu = O(N_\nu^{-\alpha})$ and $\alpha \in [\frac{1}{4}, 1)$.
- Justification assumes a probabilistic model.
- Proof: Markov's inequality, Borel-Cantelli lemma.

Page 19: Application-specific assumption 3

- The population quantity
  $\frac{n_\nu^{*1/2}}{N_\nu} \sum_{U_\nu} \left[ I(\|y_i - \mu - n_\nu^{*-1/2} s\| \le d) - I(\|y_i - \mu\| \le d) - D_d(\mu + n_\nu^{*-1/2} s) + D_d(\mu) \right]$
  converges to 0 uniformly for $s \in C_s$ and $\mu \in \mathbb{R}^p$.
- Justification assumes a probabilistic model.
- Proof: empirical process theory.

Page 20: Design consistency

- Decomposition:
  $n_\nu^{*1/2} \left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - D_{\nu,d}(\mu_\nu) \right)
  = n_\nu^{*1/2} \left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\mu_\nu) - D_d(\hat{\mu}_\nu) + D_d(\mu_\nu) \right)
  + n_\nu^{*1/2} \left( \hat{D}_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu) \right)
  + n_\nu^{*1/2} \left( D_d(\hat{\mu}_\nu) - D_d(\mu_\nu) \right)$
- Intermediate results:
  $n_\nu^{*1/2} \left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\mu_\nu) - D_d(\hat{\mu}_\nu) + D_d(\mu_\nu) \right) \xrightarrow{p} 0$
  and
  $n_\nu^{*1/2} \left( \hat{D}_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu) \right) \big| F_\nu = O_p(1)$
- Consistency follows.

Page 21: Asymptotic normality

- Let $b_{\mu,i} = (I(\|y_i - \mu_\nu\| \le d), 1, y_i)'$ and let $\Sigma_{\mu,d}$ be the design variance-covariance matrix of the HT estimator of the mean of $b_{\mu,i}$.
- Let
  $a_\mu = \left[ 1, \; -D_{\nu,d}(\mu_\nu) - \left( \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu} \right)' \mu_\nu, \; \left( \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu} \right)' \right]'$,
  whose components reflect, respectively, the case where the subpopulation size and mean are assumed known, the unknown subpopulation size, and the unknown subpopulation mean.
- Asymptotic normality:
  $\left( a_\mu' \Sigma_{\mu,d} a_\mu \right)^{-1/2} \left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - D_{\nu,d}(\mu_\nu) \right) \big| F_\nu \xrightarrow{d} N(0, 1)$

Page 22: Multivariate median

- Mean vector (as before).
- Generalized median:
  - Population: $q_\nu = \arg\inf_q \sum_{U_\nu} \|y_i - q\|$
  - Sample: $\hat{q}_\nu = \arg\inf_q \sum_{A_\nu} \frac{1}{\pi_i} \|y_i - q\|$
- Existence and uniqueness.

Page 23: Multivariate median

- Estimating equations:
  - Population: $\sum_{U_\nu} \psi(y_i - q) = 0$
  - Sample: $\sum_{A_\nu} \frac{1}{\pi_i} \psi(y_i - q) = 0$
- Linearization of $\hat{q}_\nu$:
  $\hat{q}_\nu = q_\nu + \left[ \frac{1}{N_\nu} \sum_{i \in A_\nu} \frac{H_s(y_i - q_\nu)}{\pi_i} \right]^{-1} \frac{1}{N_\nu} \sum_{i \in A_\nu} \frac{\psi(y_i - q_\nu)}{\pi_i} + o_p(n_\nu^{*-1/2})$
- What if the estimating equation is not differentiable?
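The generalized median has no closed form; one standard way to compute it is a weighted Weiszfeld iteration. A sketch under the assumption that the survey weights $1/\pi_i$ play the role of `w` (the algorithm choice is mine; the talk only defines the estimating equations):

```python
import numpy as np

def weighted_spatial_median(y, w, n_iter=200, eps=1e-9):
    """Weiszfeld-type iteration for the weighted spatial (L2) median:
    the q minimizing sum_i w_i * ||y_i - q||. In the talk's setting the
    weights would be the survey weights 1/pi_i."""
    y = np.asarray(y, float)
    w = np.asarray(w, float)
    q = np.average(y, axis=0, weights=w)   # start at the weighted mean
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(y - q, axis=1), eps)
        u = w / dist                        # reweight by inverse distance
        q_new = (u[:, None] * y).sum(axis=0) / u.sum()
        if np.linalg.norm(q_new - q) < eps:
            return q_new
        q = q_new
    return q
```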

Page 24: Median-based distances: asymptotic results

- Design consistency and asymptotic normality of $\hat{q}_\nu$ for $q_\nu$.
- Design consistency and asymptotic normality of $\hat{D}_{\nu,d}(\hat{q}_\nu)$ as an estimator of $D_{\nu,d}(q_\nu)$.

Page 25: Mahalanobis distances

- Mean- and median-based inference.
- Choose an appropriate norm to match the shape of the underlying multivariate distribution.
- Estimate the variance-covariance matrix (or another shape measure) of each subpopulation and use the Mahalanobis distance.
- Estimate the distribution of the Mahalanobis distances.
- See the application section for more details.

Page 26: Naive variance estimator

- Use the mean-based case to explain the variance estimators.
- Recall the asymptotic variance of $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$:
  $V\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right) = a_\mu' \Sigma_{\mu,d} a_\mu$,
  where
  $a_\mu = \left( 1, \; -D_{\nu,d}(\mu_\nu) - \left( \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu} \right)' \mu_\nu, \; \left( \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu} \right)' \right)'$
- Claim: the extra variance due to estimating the center can be ignored for elliptical distributions with a quadratic norm.
- Naive variance estimator, ignoring the gradient vector:
  $\sigma^2_{\mu,d,\mathrm{naive}} = \left( 1, -\hat{D}_{\nu,d}(\hat{\mu}_\nu) \right) \hat{\Sigma}_{\mu,d} \left( 1, -\hat{D}_{\nu,d}(\hat{\mu}_\nu) \right)'$

Page 27: Estimating the gradient vector by kernel smoothing

- Idea: estimate $D_d(\mu) = \lim_{\nu \to \infty} \frac{1}{N_\nu} \sum_{U_\nu} I(\|y_i - \mu\| \le d)$ by the smoothed estimator
  $\frac{1}{\hat{N}_\nu} \sum_{A_\nu} K\!\left( \frac{d - \|y_i - \mu\|}{h} \right) \frac{1}{\pi_i}$,
  where $K(t) = \int_{-\infty}^{t} k(u)\, du$, e.g. the CDF of the standard normal.
- Kernel estimator of the gradient:
  $\hat{\zeta}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu h} \sum_{A_\nu} k\!\left( \frac{d - \|y_i - \hat{\mu}_\nu\|}{h} \right) \psi(y_i - \hat{\mu}_\nu) \frac{1}{\pi_i}$
- Design consistent for $\frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}$ under mild assumptions.
- Jackknife variance estimation has been proposed for the mean-based case.
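The smoothed estimator on this slide can be sketched with the Gaussian-CDF kernel it mentions (illustrative function, not the talk's code; as $h \to 0$ it reduces to the indicator-based estimator):

```python
import numpy as np
from math import erf

def smoothed_distance_distribution(y, pi, center, d, h):
    """Kernel-smoothed analogue of the weighted distance-distribution
    estimator: the indicator I(||y_i - center|| <= d) is replaced by the
    standard normal CDF of (d - ||y_i - center||)/h, the example kernel
    given on the slide."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    w = 1.0 / pi
    dist = np.linalg.norm(y - center, axis=1)
    z = (d - dist) / (h * np.sqrt(2.0))
    K = 0.5 * (1.0 + np.array([erf(t) for t in z]))  # Phi((d - dist)/h)
    return float((w * K).sum() / w.sum())
```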

Page 28: Jackknife variance estimator

- Recall $\hat{D}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{1}{\pi_i} I(\|y_i - \hat{\mu}_\nu\| \le d)$.
- Recalculate the mean for each jackknife replicate? Inconsistent!
- Proposed idea: incorporate an estimated gradient vector in the replication estimation.
- For the $l$-th replicate sample, calculate
  $\hat{D}^{(l)}(\hat{\mu}_\nu) = \hat{D}^{(l)}_{\nu,d}(\hat{\mu}_\nu) + \hat{\zeta}_{\nu,d}(\hat{\mu}_\nu) \left( \hat{\mu}^{(l)}_\nu - \hat{\mu}_\nu \right)$
  and use
  $\hat{V}_{JK}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right) = \sum_{l=1}^{L} c_l \left( \hat{D}^{(l)}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right)^2$
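A delete-one-unit sketch of the gradient-corrected jackknife (assumptions: the coefficients $c_l = (n-1)/n$ correspond to an SRS-type design, and `zeta` stands in for the kernel-estimated gradient; the talk's replicates follow the actual survey design):

```python
import numpy as np

def jackknife_variance_D(y, pi, d, zeta):
    """Delete-one jackknife for the distance-distribution estimator with
    the slide's gradient correction: each replicate keeps the full-sample
    center mu_hat inside the indicator and adds zeta @ (mu_hat_l - mu_hat)."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    n = len(y)
    w = 1.0 / pi

    def hajek_mean(mask):
        return (w[mask][:, None] * y[mask]).sum(axis=0) / w[mask].sum()

    def d_hat(mask, center):
        dist = np.linalg.norm(y[mask] - center, axis=1)
        return (w[mask] * (dist <= d)).sum() / w[mask].sum()

    full = np.ones(n, dtype=bool)
    mu = hajek_mean(full)
    d_full = d_hat(full, mu)
    reps = []
    for l in range(n):
        mask = full.copy()
        mask[l] = False   # drop unit l, keep the full-sample center
        reps.append(d_hat(mask, mu) + zeta @ (hajek_mean(mask) - mu))
    return float((n - 1) / n * ((np.asarray(reps) - d_full) ** 2).sum())
```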

Page 29: Simulation study

- Goals of the simulation study:
  - Assess the asymptotic properties of the estimators.
  - Compare the naive variance estimator with the kernel-based estimator.
- Simulation parameters:
  - $p = 2$, $G = 5$.
  - Subpopulations 1-4 are elliptically contoured; subpopulation 5 is skewed.
  - Stratified SRS.
  - Norm: Euclidean.

Page 30: Simulated population

Page 31: Subpopulation distance distribution functions

Page 32: Effect of estimating the center ($N$ = 5000, $n$ = 1000, $G$ = 5)

                                                     Cluster 4             Cluster 5
  $d$                                             1.00   1.41   2.45    1.00   1.41   2.45
  $D_{\nu,d}(\mu_\nu)$                             .44    .54    .71     .31    .52    .85
  bias$(\hat D(\hat\mu))/$sd$(\hat D(\hat\mu))$  -0.11   0.00  -0.00   -0.00  -0.00   0.05
  bias$(\hat D(\mu))/$sd$(\hat D(\mu))$          -0.01   0.00  -0.01    0.00   0.00   0.01
  sd$(\hat D(\hat\mu))/$sd$(\hat D(\mu))$         1.03   1.00   1.00    1.30   1.13   1.00

Page 33: Effect of estimating the center ($N$ = 5000, $n$ = 200, $G$ = 5)

                                                     Cluster 4             Cluster 5
  $d$                                             1.00   1.41   2.45    1.00   1.41   2.45
  $D_{\nu,d}(\mu_\nu)$                             .43    .53    .68     .35    .55    .88
  bias$(\hat D(\hat\mu))/$sd$(\hat D(\hat\mu))$  -0.28  -0.05   0.12    0.05   0.14   0.10
  bias$(\hat D(\mu))/$sd$(\hat D(\mu))$           0.03   0.04   0.06    0.00  -0.01   0.02
  sd$(\hat D(\hat\mu))/$sd$(\hat D(\mu))$         1.17   1.03   1.01    1.16   1.04   0.96

Page 34: Average estimated variance relative to MC variance ($N$ = 5000, $n$ = 1000, $G$ = 5)

                                                       Cluster 4             Cluster 5
  $d$                                               1.00   1.41   2.45    1.00   1.41   2.45
  $D_{\nu,d}(\mu_\nu)$                              0.44   0.54   0.71    0.31   0.52   0.85
  $\sigma^2_{d,NV}/\sigma^2_{d,MC}$                 0.94   1.00   1.00    0.53   0.78   1.07
  $\sigma^2_{d,SM}/\sigma^2_{d,MC}$ ($h = 0.1$)     1.21   1.15   1.12    1.00   1.01   1.04
  $\sigma^2_{d,SM}/\sigma^2_{d,MC}$ ($h = 0.4$)     1.07   1.06   1.04    0.85   0.94   0.98

Page 35: NRI application

- Introduction to the NRI.
- Outlier identification for a longitudinal survey.
- Strategy for initial partitioning in the NRI.
- How to define Mahalanobis distances.
- Analysis of identified points.

Page 36: National Resources Inventory (1)

- The National Resources Inventory is a longitudinal survey of natural resources on non-Federal land in the U.S.
- Conducted by the USDA NRCS, in cooperation with the CSSM at Iowa State University.
- Produces a longitudinal database containing numerous agro-environmental variables for scientific investigation and policy-making.
- Information was updated every 5 years before 1997, and is now updated annually through a partially overlapping subsampling design.

Page 37: National Resources Inventory (2)

- Covers various aspects of land use, farming practice, and environmentally important variables such as wetland status and soil erosion.
- Measures both the level of and the change over time in these variables.
- The primary mode of data collection is a combination of aerial photography and field collection.
- Outliers arise from errors in data collection or processing, or from real points that behave abnormally.

Page 38: Outlier identification for a longitudinal survey

- Identify outliers in periodically updated data.
- Build outlier identification rules on previous years' data and use the rules to flag current observations.
- Observed years 2001-2005: training set (2001, 2002, 2003); test set (2003, 2004, 2005).

Page 39: Target variables

- Non-pseudo core points with soil erosion in years 2001-2005.
- Variables: broad use, land use, USLE C factor, support practice factor, slope, slope length, and USLE loss.
- USLE loss represents the potential long-term soil loss in tons/acre:
  USLELOSS = R * K * LS * C * P

Page 40: Point classification

  b.u.  Point type                 b.u.  Point type
  1     Cultivated cropland         7    Urban and built-up land
  2     Noncultivated cropland      8    Rural transportation
  3     Pastureland                 9    Small water areas
  4     Rangeland                  10    Large water areas
  5     Forest land                11    Federal land
  6     Minor land                 12    CRP

Page 41: Initial partitioning

- Initial partitioning uses geographical association and broad use category:
  1. Partition the national data into state-wise categories.
  2. Collapse the northeastern states.
  3. Partition each region based on the broad use sequence into (1,1,1), (2,2,2), (3,3,3), (12,12,12), and points with broad use change.
  4. Merge points with the same broad use change pattern, say (2,2,3) or (1,1,12).

Page 42: Defining distances

- Estimate the subpopulation mean vector and covariance matrix:
  $\hat{\mu}_\nu = \frac{1}{\hat{N}_\nu} \sum_{S_\nu} \frac{y_i}{\pi_i}$,
  $\hat{\Sigma}_\nu = \frac{1}{\hat{N}_\nu} \sum_{S_\nu} (y_i - \hat{\mu}_\nu)(y_i - \hat{\mu}_\nu)' \frac{1}{\pi_i}$
- Calculate the distance to the center:
  $\|y_i - \hat{\mu}_\nu\|_{\hat{\Sigma}_\nu} = \sqrt{(y_i - \hat{\mu}_\nu)' \hat{\Sigma}_\nu^{-} (y_i - \hat{\mu}_\nu)}$
- The inverse matrix $\hat{\Sigma}_\nu^{-}$ is defined through a principal value decomposition.
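The distance definition above can be sketched as follows; the eigendecomposition-based pseudo-inverse mirrors the slide's principal value decomposition (the cutoff for dropping small eigenvalues is my choice, and this is not the NRI production code):

```python
import numpy as np

def mahalanobis_distances(y, pi):
    """Weighted Mahalanobis distances: 1/pi_i-weighted estimates of the
    mean and covariance, with the inverse built from an eigendecomposition
    so that near-zero eigenvalues are dropped."""
    y = np.asarray(y, float)
    w = 1.0 / np.asarray(pi, float)
    w = w / w.sum()
    mu = w @ y
    r = y - mu
    sigma = (w[:, None] * r).T @ r
    vals, vecs = np.linalg.eigh(sigma)
    keep = vals > 1e-10 * vals.max()   # drop near-zero eigenvalues
    pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T
    return np.sqrt(np.einsum("ij,jk,ik->i", r, pinv, r))
```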

Page 43: Source of outlyingness

- Flagged 1% of the points in the training set, and compared test distances with the 99%-quantile of the training distances.
- Source of outlyingness:
  $e_{\nu,i} = \frac{\hat{\Sigma}_\nu^{-1/2} (\hat{\mu}_\nu - y_i)}{\| \hat{\Sigma}_\nu^{-1/2} (\hat{\mu}_\nu - y_i) \|}$

Page 44: Analysis of flagged points

- Agricultural specialists analyzed the identified points by suspicious variable.
- C factor: almost all points were considered suspicious.
  - Data entry errors, e.g. (0.013, 0.13, 0.013, 0.013, 0.013).
  - Invalid entries: C factor = 1 for hayland, pastureland, or CRP.
  - Unusual levels or trends in relation to land use, e.g. (0.011, 0.06, 0.11, 0.003, 0.003).

Page 45: Analysis of flagged points

- P factor: all points are candidates for review because of the change over time, e.g. (1.0, 1.0, 1.0, 0.6, 1.0).
- Slope length: all points were flagged because of the level, not change over time.
- Land use: most points were flagged because of a change in the type of hayland or pastureland over time; not a major concern to NRCS reviewers.

Page 46: Nondifferentiable survey estimators

- The sample distance distribution $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$ is a nondifferentiable function of the estimated location parameter.
- A general class of survey estimators:
  $\hat{T}(\hat{\lambda}) = \frac{1}{\hat{N}} \sum_{i \in S_\nu} \frac{1}{\pi_i} h(y_i; \hat{\lambda})$
  with corresponding population quantity
  $T_N(\lambda_N) = \frac{1}{N} \sum_{i=1}^{N} h(y_i; \lambda_N)$,
  where $h$ is not necessarily differentiable.
- A direct Taylor linearization may not be applicable; again use a differentiable limiting function $T(\gamma) = \lim_{N \to \infty} T_N(\gamma)$, with derivative $\zeta(\gamma)$.

Page 47: Asymptotics

- We provide a set of sufficient conditions on the limiting function and a number of population quantities under which
  $n^{*1/2} \left[ V(\hat{T}(\hat{\lambda})) \right]^{-1/2} \left( \hat{T}(\hat{\lambda}) - T_N(\lambda_N) \right) \big| F \xrightarrow{d} N(0, 1)$,
  where
  $V(\hat{T}(\hat{\lambda})) = \left( 1, [\zeta(\lambda_N)]' \right) V(\bar{z}_\pi) \begin{pmatrix} 1 \\ \zeta(\lambda_N) \end{pmatrix}$.
- The extra variance due to estimating the unknown parameter may or may not be negligible, depending on the derivative.
- We propose a kernel estimator for the unknown derivative.

Page 48: Estimating a distribution function using auxiliary information

- Ratio model: $y_i = R x_i + \epsilon_i$, $\epsilon_i \sim ID(0, x_i \sigma^2)$.
- Use $\hat{R} x_i$ as a substitute for $y_i$, where $\hat{R} = \frac{\sum_{S_\nu} y_i / \pi_i}{\sum_{S_\nu} x_i / \pi_i}$.
- Difference estimator:
  $\hat{T}(\hat{R}) = \frac{1}{N} \left\{ \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le t) + \left[ \sum_{U} I(\hat{R} x_i \le t) - \sum_{S_\nu} \frac{1}{\pi_i} I(\hat{R} x_i \le t) \right] \right\}$
- The extra variance due to estimating the ratio is negligible (Rao, Kovar and Mantel, 1990).
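The difference estimator on this slide can be sketched as follows (illustrative names; assumes the ratio $\hat{R}$ is estimated from the same sample and that `x_pop` holds the auxiliary variable for every population element):

```python
import numpy as np

def difference_estimator_cdf(y_s, x_s, pi_s, x_pop, t):
    """Difference estimator of the distribution function at t using the
    ratio-model fitted values R_hat * x_i: the HT term for the sample
    plus a population-level correction based on the fitted values."""
    y_s = np.asarray(y_s, float)
    x_s = np.asarray(x_s, float)
    w = 1.0 / np.asarray(pi_s, float)
    x_pop = np.asarray(x_pop, float)
    N = len(x_pop)
    R = (w * y_s).sum() / (w * x_s).sum()        # weighted ratio estimate
    ht = (w * (y_s <= t)).sum()                  # HT term for the sample
    corr = (R * x_pop <= t).sum() - (w * (R * x_s <= t)).sum()
    return float((ht + corr) / N)
```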

Page 49: Estimating a fraction below an estimated quantity

- Estimate the fraction of households in poverty when the poverty line is drawn at 60% of the median income:
  $\hat{T}(\hat{q}) = \frac{1}{\hat{N}} \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le 0.6 \hat{q})$
  with population quantity
  $T_N(q_N) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \le 0.6 q_N)$
- Assume that $\lim_{N \to \infty} T_N(\gamma) = F_Y(0.6\gamma)$; the extra variance depends on $\frac{\partial F_Y(0.6\gamma)}{\partial \gamma}$.
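The poverty-fraction example can be sketched as follows (illustrative function; the weighted-median convention, the smallest value whose cumulative weight share reaches 0.5, is my choice):

```python
import numpy as np

def poverty_fraction(y, pi):
    """Weighted fraction of incomes below 60% of the weighted median,
    the slide's example of a plug-in estimator with an estimated
    threshold. Returns (fraction, poverty line)."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    w = 1.0 / pi
    order = np.argsort(y)
    cum = np.cumsum(w[order]) / w.sum()
    q = y[order][np.searchsorted(cum, 0.5)]   # weighted median
    line = 0.6 * q                            # estimated poverty line
    return float((w * (y <= line)).sum() / w.sum()), float(line)
```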

Page 50: Nondifferentiable estimating equations

- The sample $p$-th quantile can be defined through estimating equations:
  $\hat{S}(t) = \frac{1}{\hat{N}} \sum_{i \in S} \frac{1}{\pi_i} I(y_i - t \le 0) - p$, with $\hat{\xi} = \inf\{ t : \hat{S}(t) \ge 0 \}$,
  and the population analogue
  $S_N(t) = \frac{1}{N} \sum_{i=1}^{N} I(y_i - t \le 0) - p$, with $\xi_N = \inf\{ t : S_N(t) \ge 0 \}$.
- The usual practice is to linearize the estimating function, but this approach is not applicable because of the nondifferentiability.
- We provide a set of sufficient conditions on the monotonicity and smoothness of $S_N(t)$ and its limit for the proof.

Page 51: Concluding remarks

- Proposed an estimator for the subpopulation distance distribution and demonstrated its statistical properties.
- Applied it in a large-scale longitudinal survey.
- Developed theoretical extensions to nondifferentiable survey estimators.

Page 52: Thank you