Comparison of Perturbation Approaches for Spatial Outliers in Microdata

Natalie Shlomo* and Jordi Marés**

* Social Statistics, University of Manchester, [email protected]
** IIIA and CSIC, Barcelona, [email protected]

The project was funded by the Census Statistical Disclosure Control project at Westat, Inc. through the sponsorship of the U.S. Bureau of the Census




Topics Covered

• Introduction
• Description of Data
• Outlier Detection
• Coherence Function
• Perturbation Methods
  • Record Swapping Method
  • Hot Deck Method
• Results
• Conclusions


Introduction

• Geographical spatial outliers arise from multivariate relationships between spatial and non-spatial characteristics and have a high probability of identification

• Treat them through targeted statistical disclosure control (SDC) perturbation of the microdata

• Focus on US American Community Survey (ACS) transportation outputs, with trajectories defined as vectors of coordinates: place of residence (origin) and workplace (destination)

• Example of an outlier: an overly long commute to work by a non-typical means of transportation (MOT), such as cycling

• Objective: to inform and guide decisions about best practices that the US Census Bureau could use in future dissemination strategies for these and similar types of datasets


Description of Data

• Simulation study based on an artificial population produced from the 2006-2008 combined PUMS of the ACS

• Those living in California, employed, and working within the US (N=438,850)

• Latitude and longitude of residence and workplace were generated by adding random distances within a radius of the centroid of the relevant PUMA (public-use microdata area with population greater than 100K)

• Survey weights were not taken into account (they need to be recalibrated following perturbation); however, other calibration variables were used as controls to minimize distortions to the original weights


Outlier Detection

• Outlier detection methods include univariate and multivariate methods and can take parametric or non-parametric forms

• For this study we use multivariate outlier detection based on the Mahalanobis distance, where large values indicate outliers

• Replace the mean vector by the median vector and the covariance matrix by the minimum covariance determinant (MCD) estimate (Rousseeuw, 1985)

• Let h be the minimum number of points which are not outlying:

$$M_{MCD} = \frac{1}{h} \sum_{j=1}^{h} x_{i_j}$$

$$S_{MCD} = ccf \cdot sscf \cdot \frac{1}{h-1} \sum_{j=1}^{h} (x_{i_j} - M_{MCD})(x_{i_j} - M_{MCD})^T$$

where ccf is a consistency correction factor and sscf is a small-sample correction factor

• The cut-off for squared Mahalanobis distances based on p variables generally uses a quantile of the chi-square distribution:

$$D_0 = \chi^2_p(0.975)$$

• Under robust Mahalanobis distances $RD_i$, use the adjusted cut-off:

$$D_0 = \chi^2_p(0.975) \, \frac{\mathrm{median}\{RD_1^2, \ldots, RD_n^2\}}{\chi^2_p(0.5)}$$
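The detection rule on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SAS procedure used in the study: it handles only two variables (so the chi-square quantile has the closed form -2 ln(1 - q)), finds the MCD subset by exhaustive search (practical only for a handful of points), and sets the correction factors ccf and sscf to 1. The function names and the commute data are made up for illustration.

```python
import math
from itertools import combinations
from statistics import median


def mean_and_cov(points):
    """Mean vector and covariance (divisor h - 1) of a sequence of 2-D points."""
    h = len(points)
    mx = sum(x for x, _ in points) / h
    my = sum(y for _, y in points) / h
    sxx = sum((x - mx) ** 2 for x, _ in points) / (h - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (h - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (h - 1)
    return (mx, my), (sxx, sxy, syy)


def det2(cov):
    """Determinant of a symmetric 2x2 matrix stored as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy ** 2


def exact_mcd(points, h):
    """Exhaustive MCD: the h-subset whose covariance determinant is smallest.
    Assumption: correction factors ccf and sscf are both taken as 1."""
    best = min(combinations(points, h), key=lambda s: det2(mean_and_cov(s)[1]))
    return mean_and_cov(best)


def squared_distance(point, centre, cov):
    """Squared Mahalanobis distance of one 2-D point."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det2(cov)


def chi2_quantile_2df(q):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - q)


def flag_outliers(points, h):
    """Flag points whose squared robust distance exceeds the adjusted cut-off."""
    centre, cov = exact_mcd(points, h)
    rd2 = [squared_distance(p, centre, cov) for p in points]
    cutoff = chi2_quantile_2df(0.975) * median(rd2) / chi2_quantile_2df(0.5)
    return [d > cutoff for d in rd2], cutoff


# Made-up (distance in miles, minutes) pairs; the last is a long trip in little time.
commutes = [(5, 15), (8, 21), (10, 26), (12, 28), (15, 36),
            (7, 18), (20, 44), (9, 24), (60, 30)]
flags, cutoff = flag_outliers(commutes, h=7)
```

For data of realistic size the exhaustive search would have to be replaced by an approximate MCD algorithm, and a general p would need a chi-square quantile routine from a statistics library.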


Outlier Detection

• Robust Mahalanobis distances are calculated on distance travelled and minutes to work

DistanceToWork=geodist(latitude,longitude,POW_latitude,POW_longitude,'DM');

• Determine explanatory variables predictive of distance travelled to produce classes: mode of transport, sex, earnings and occupation

• SAS macro: ‘Robcov’ Version 1.3-2 (written by Michael Friendly)

• Collapse classes so that each contains at least 20 individuals, calculate the robust Mahalanobis distance, and flag a record if its distance exceeds the critical value

• Dataset reduced to 283,423 records without missing values and with a high degree of consistency: 60,007 outliers (21.2%), reduced to 59,080 outliers (20.8%) after deleting the ‘other’ mode of transport
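For readers without SAS, the `geodist` call above (coordinates in degrees, result in miles) can be approximated with the haversine great-circle formula. This is a sketch only: the function name, the Earth-radius constant, and the example coordinates are assumptions, and the result may differ slightly from the value SAS returns.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical residence (Los Angeles) and workplace (San Francisco).
distance_to_work = haversine_miles(34.0522, -118.2437, 37.7749, -122.4194)
```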


Coherence Function

• The coherence function is built from the maximum and minimum velocity for each mode of transport, based on the set of non-outliers

• Assign high coherence to individuals whose travelled distance is close to the mean, and low coherence to individuals whose travelled distance is far from the mean

• Use it as the objective function to guide perturbation, where we aim to obtain a higher coherence for outliers
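The slides do not give the functional form of the coherence function, so the following is one hypothetical version, offered only to make the idea concrete. It assumes that velocity is distance divided by minutes, that the expected distance is the mean velocity times the minutes travelled, and that coherence falls linearly from 1 at the expected distance to 0 at the edge of the velocity range. All names and data are illustrative.

```python
def velocity_bounds(non_outliers):
    """Per-mode (min, mean, max) velocity in miles per minute.
    non_outliers: iterable of (mode, distance_miles, minutes) tuples."""
    by_mode = {}
    for mode, distance, minutes in non_outliers:
        by_mode.setdefault(mode, []).append(distance / minutes)
    return {m: (min(v), sum(v) / len(v), max(v)) for m, v in by_mode.items()}


def coherence(distance, minutes, bounds):
    """Hypothetical coherence in [0, 1]: 1 at the mean-velocity distance,
    0 once the distance is outside the range implied by the velocity bounds."""
    v_min, v_mean, v_max = bounds
    expected = v_mean * minutes
    half_width = max(v_max - v_mean, v_mean - v_min) * minutes
    if half_width == 0:
        return 1.0 if distance == expected else 0.0
    return max(0.0, 1.0 - abs(distance - expected) / half_width)


# Made-up non-outlier cyclists: velocities 0.1, 0.2 and 0.3 miles per minute.
bounds = velocity_bounds([("bicycle", 1, 10), ("bicycle", 4, 20), ("bicycle", 9, 30)])
typical = coherence(6, 30, bounds["bicycle"])    # 30 minutes, 6 miles
far_out = coherence(40, 30, bounds["bicycle"])   # 30 minutes, 40 miles
```

Here the 6-mile trip scores close to 1 and the 40-mile trip scores 0, which is the ordering the perturbation methods rely on.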


Record Swapping

• Pair outliers with different workplaces by swapping their places of residence, so that the coherence function increases for at least one of the outliers and decreases for neither

• Carry out within classes: mode of transport, sex and age group

• Split outliers according to workplace, then calculate the coherence function obtained by swapping the residence of each outlier with every other outlier in a different workplace

• If at least one of the outliers has higher coherence, then swap

• Continue iteratively
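A single greedy pass of the swapping rule might look like the sketch below. It is illustrative only: locations are reduced to positions along one road, the coherence function is a toy stand-in (the negative gap between actual and expected commute distance), and the record fields are hypothetical. The caller is assumed to pass in the outliers of one class at a time and to repeat the pass until no further swaps occur.

```python
def coherence(residence, person):
    """Toy coherence: negative gap between actual and expected commute distance."""
    actual = abs(residence - person["workplace"])
    expected = person["speed"] * person["minutes"]
    return -abs(actual - expected)


def swap_pass(outliers):
    """One greedy pass over the outliers of a single class.
    Swaps residences in place and returns the list of swapped index pairs."""
    used, pairs = set(), []
    for i, a in enumerate(outliers):
        if i in used:
            continue
        best_gain, best_j = 0.0, None
        for j in range(i + 1, len(outliers)):
            b = outliers[j]
            if j in used or a["workplace"] == b["workplace"]:
                continue  # partners must come from a different workplace
            gain_a = coherence(b["residence"], a) - coherence(a["residence"], a)
            gain_b = coherence(a["residence"], b) - coherence(b["residence"], b)
            # Accept only if nobody loses coherence and at least one gains.
            if gain_a >= 0 and gain_b >= 0 and gain_a + gain_b > best_gain:
                best_gain, best_j = gain_a + gain_b, j
        if best_j is not None:
            b = outliers[best_j]
            a["residence"], b["residence"] = b["residence"], a["residence"]
            used.update((i, best_j))
            pairs.append((i, best_j))
    return pairs


# Made-up outliers: positions in miles, speed in miles per minute.
outliers = [
    {"residence": 0.0, "workplace": 50.0, "minutes": 10, "speed": 0.5},
    {"residence": 48.0, "workplace": 2.0, "minutes": 10, "speed": 0.5},
    {"residence": 30.0, "workplace": 50.0, "minutes": 10, "speed": 0.5},
]
pairs = swap_pass(outliers)
```

In this toy data the first two records trade residences, since each ends up two miles from work instead of roughly fifty, while the third has no eligible partner left.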


Hot Deck

• Impute the residence of an outlier with the residence of a non-outlier within the same class and having the same workplace

Two approaches for selecting the donor (note: need more than one individual in the workplace):

1. Candidate donors are those having distance to work within the coherence range of distances; select the donor that maximizes the coherence function, i.e. the candidate donor whose distance to work is closest to that implied by the mean velocity

2. Instead of the coherence function, choose the donor from the non-outliers in the same workplace having similar travelled minutes (nearest neighbor)
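The two donor-selection rules can be sketched as follows. This is one reading of the slide, not the authors' code: the coherence range is taken to be the distances between minimum and maximum velocity times the outlier's minutes, and the record fields, velocity values, and identifiers are all hypothetical.

```python
def donors_for(outlier, non_outliers):
    """Non-outliers in the same class and workplace as the outlier."""
    return [d for d in non_outliers
            if d["cls"] == outlier["cls"] and d["workplace"] == outlier["workplace"]]


def donor_by_coherence(outlier, non_outliers, v_min, v_mean, v_max):
    """Approach 1: among donors whose distance lies in the coherent range,
    pick the one closest to the distance implied by the mean velocity."""
    t = outlier["minutes"]
    in_range = [d for d in donors_for(outlier, non_outliers)
                if v_min * t <= d["distance"] <= v_max * t]
    return min(in_range, key=lambda d: abs(d["distance"] - v_mean * t), default=None)


def donor_by_minutes(outlier, non_outliers):
    """Approach 2: nearest neighbour on minutes travelled."""
    return min(donors_for(outlier, non_outliers),
               key=lambda d: abs(d["minutes"] - outlier["minutes"]), default=None)


def impute_residence(outlier, donor):
    """Copy the donor's residence onto the outlier when a donor exists."""
    if donor is not None:
        outlier["residence"] = donor["residence"]
    return outlier


# Made-up records; velocities are in miles per minute.
outlier = {"cls": "bicycle", "workplace": "W1", "minutes": 30, "residence": "R_far"}
non_outliers = [
    {"id": 1, "cls": "bicycle", "workplace": "W1", "minutes": 28, "distance": 2.0, "residence": "R1"},
    {"id": 2, "cls": "bicycle", "workplace": "W1", "minutes": 45, "distance": 6.5, "residence": "R2"},
    {"id": 3, "cls": "bicycle", "workplace": "W2", "minutes": 30, "distance": 6.0, "residence": "R3"},
]
by_coherence = donor_by_coherence(outlier, non_outliers, v_min=0.1, v_mean=0.2, v_max=0.3)
by_minutes = donor_by_minutes(outlier, non_outliers)
```

The example is built so that the two rules disagree: the coherence rule picks donor 2, whose distance fits a 30-minute ride, while the nearest-neighbour rule picks donor 1, whose travel time is closest.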


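Results

Column percentages in parentheses.

| Original Outliers | Total | Outliers after Swapping: Yes | Outliers after Swapping: No | Outliers after HD (Coherence Measure): Yes | Outliers after HD (Coherence Measure): No | Outliers after HD (Minutes): Yes | Outliers after HD (Minutes): No |
|---|---|---|---|---|---|---|---|
| Yes | 59,080 (20.9%) | 42,788 (92.0%) | 16,292 (6.9%) | 27,099 (76.2%) | 31,981 (12.9%) | 28,123 (79.3%) | 30,957 (12.5%) |
| No | 224,343 (79.2%) | 3,731 (8.0%) | 220,612 (93.1%) | 8,456 (23.8%) | 215,887 (87.1%) | 7,321 (20.7%) | 217,022 (87.5%) |
| Total | 283,423 (100%) | 46,519 (100%) | 236,904 (100%) | 35,555 (100%) | 247,868 (100%) | 35,444 (100%) | 247,979 (100%) |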

• Swapping corrected fewer outliers than the hot deck methods (16K vs 31K), but swapping was carried out only on outliers

• Some non-outliers became outliers because perturbation changed the distribution structure (4K under swapping vs 8K under hot deck)

• The number of non-outliers flagged as outliers following perturbation was much smaller than the number of outliers corrected to non-outliers


Results

• Individuals who had their PUMA changed due to the perturbation: Swapping Method: 56,562; Hot Deck Method (Minutes): 53,945; Hot Deck Method (Coherence): 53,181

• Hot deck methods perturb bivariate counts more than swapping, since swapping does not change marginal frequencies

• Hot deck using the coherence function resulted in less information loss than the nearest neighbor approach

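| Bivariate Variables Crossed with PUMA | Normalized Absolute Difference: Swapping | Normalized Absolute Difference: HD Minutes | Normalized Absolute Difference: HD Coherence | Normalized Hellinger’s Distance: Swapping | Normalized Hellinger’s Distance: HD Minutes | Normalized Hellinger’s Distance: HD Coherence |
|---|---|---|---|---|---|---|
| AGE9 | 0 | 0.109 | 0.095 | 0 | 0.154 | 0.134 |
| AGEP | 0.048 | 0.119 | 0.107 | 0.059 | 0.166 | 0.147 |
| SEX | 0 | 0.113 | 0.095 | 0 | 0.161 | 0.140 |
| OCCUPATION | 0.094 | 0.134 | 0.125 | 0.164 | 0.215 | 0.203 |
| EARNINGS | 0.024 | 0.104 | 0.089 | 0.029 | 0.148 | 0.129 |
| MODE | 0 | 0.104 | 0.087 | 0 | 0.154 | 0.131 |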

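Normalized absolute difference:

$$dist(attr_1, attr_2) = \sum_{i \in Dom(attr_1)} \sum_{j \in Dom(attr_2)} |freq_{ij} - qfre_{ij}| \, / \, 2N$$

Normalized Hellinger’s distance:

$$dist(attr_1, attr_2) = \sqrt{0.5 \sum_{i \in Dom(attr_1)} \sum_{j \in Dom(attr_2)} \left(\sqrt{freq_{ij}} - \sqrt{qfre_{ij}}\right)^2 / N}$$

where $freq_{ij}$ and $qfre_{ij}$ are the cell counts of the two tables being compared (original and perturbed) and N is the number of records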
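As a rough illustration of the two information-loss measures, the sketch below compares a bivariate table before and after perturbation. It assumes the normalized absolute difference is the sum of absolute cell differences divided by 2N, and that the normalized Hellinger distance is the square root of half the summed squared differences of square-rooted counts, divided by N. The function names and the tiny tables are made up.

```python
import math


def normalized_absolute_difference(orig, pert, n):
    """Sum of absolute cell differences divided by 2N."""
    cells = set(orig) | set(pert)
    return sum(abs(orig.get(c, 0) - pert.get(c, 0)) for c in cells) / (2 * n)


def normalized_hellinger(orig, pert, n):
    """Square root of 0.5 * sum of (sqrt(count) differences)^2 / N."""
    cells = set(orig) | set(pert)
    total = sum((math.sqrt(orig.get(c, 0)) - math.sqrt(pert.get(c, 0))) ** 2
                for c in cells)
    return math.sqrt(0.5 * total / n)


# Made-up PUMA x SEX counts for N = 4 records, before and after perturbation.
original = {("PUMA1", "M"): 2, ("PUMA1", "F"): 2}
perturbed = {("PUMA1", "M"): 4}

nad = normalized_absolute_difference(original, perturbed, n=4)
hellinger = normalized_hellinger(original, perturbed, n=4)
```

Both measures are 0 when the two tables are identical, which matches the zero entries for swapping in the rows where the crossed variable is a swapping control.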


Discussion

• Record swapping had the lowest information loss (especially for bivariate counts of the swapping variable with other control variables) but reduced the total number of outliers by only 21.3% (from 59,080 to 46,519), while the hot deck methods reduced it by about 40%

• The hot deck methods transformed more non-outliers into outliers than record swapping did

• The recommendation would be to carry out both methods, starting with record swapping and then applying the hot deck method to the remaining outliers

• Survey weights need to be recalibrated to the new place of residence, but including calibration variables as controls minimizes distortion to the survey weights, especially under record swapping


Thank you for your attention