
1

Statistical Disclosure Control

Methods for Census Outputs

Natalie Shlomo

SDC Centre, ONS

January 11, 2005

2

• Introduction

• Developing SDC methods » Output requirements, risk management

» Risk-Utility decision problem

• Methods for protecting census outputs

» Pre and post-tabular methods

» Safe settings, contracts

• Work plan for Census 2011

» Assessment of Census 2001 SDC methods

» User and other stakeholders’ requirements

» Assessment of alternative methods

Topics for Discussion

3

• Previous ONS talk on SDC implementation for Census 2001 and lessons learnt

• This talk will focus on developing SDC methodology for Census 2011 based on census output requirements and user needs more generally. The planned approach is a risk-utility decision framework.

Goal:

To provide adequate protection against the risk of re-identification, taking into account user needs, output requirements and the usefulness of the data

Improve on existing and develop new SDC methods, assess SDC impact on data quality, and clarify advantages and disadvantages of each method.

Introduction

4

Developing SDC Methodology

• What are the Census output requirements?

» Variables to be disseminated

» Standard and flexible (web generated?) tables

» Origin-Destination tables (workplace tables)

» SARs microdata

• What are the disclosure risk scenarios, i.e. realistic assumptions on information available to the public that increases the probability of disclosure?

Comment: Note that the problem is the 1’s and 2’s in tables and not the 0’s (except in extreme cases)

5

• Disclosure risk measures - quantify the risk of re-identification:

» probability that a sample unique is a population unique,

» probability of a correct match between a record in the microdata to an external file

» probability that a record is perturbed

• Information loss measures - quantifies the loss of information content in the data as a result of the SDC method. Utility depends on the user needs and use of the data:

» distortion to distributions (bias)

» weaknesses in the measures of association

» impact on the variance of estimates

» changes to the likelihood functions.

6

Developing SDC Methodology

• Methods for protecting outputs:

» Data masking:

perturbative methods - methods that alter the data: swapping, random noise, over-imputation, rounding

non-perturbative methods - methods that preserve data integrity: recoding, sub-sampling, suppression

» Data access under contract in a safe setting, and additionally ensuring non-disclosive outputs

Need to develop SDC methods for checking outputs, e.g. residuals of regression analysis, bandwidth for kernel estimation of distributions.

7

Developing SDC Methodology

• In some cases, parameters can be given to users to make corrections in their analysis or they are embedded in the SDC method to minimize information loss.

• SDC as an optimization problem:

Choose the SDC method f(.) as a function of the data which maximizes the utility of the data, subject to the constraint that the disclosure risk stays below a threshold ε:

arg max { U[ f(data) ] }   s.t.   DR[ f(data) ] ≤ ε

• Risk-Utility decision framework for choosing the optimum method
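This decision rule can be sketched as a simple grid search over candidate SDC parameters. The risk and utility functions and the candidate swap rates below are hypothetical stand-ins, not the functions used for Census outputs:

```python
# Illustrative grid search for the risk-utility decision problem:
# choose the SDC method f (here, a swap rate) that maximises
# U[f(data)] subject to DR[f(data)] <= epsilon.

def disclosure_risk(swap_rate):
    # Hypothetical: risk falls as more records are swapped.
    return 1.0 - swap_rate

def utility(swap_rate):
    # Hypothetical: utility falls as more records are perturbed.
    return 1.0 - swap_rate ** 2

def choose_method(candidates, epsilon):
    """Return the candidate with the highest utility among those
    whose disclosure risk is at or below the threshold epsilon."""
    feasible = [c for c in candidates if disclosure_risk(c) <= epsilon]
    if not feasible:
        return None  # no candidate meets the risk constraint
    return max(feasible, key=utility)

best = choose_method([0.05, 0.10, 0.15, 0.20], epsilon=0.9)
```

Under these stand-in functions, the lowest feasible swap rate wins, because any extra perturbation beyond the risk threshold only costs utility.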

8

SDC for Census Outputs

Pre-tabular Methods

1. Random Record Swapping (UK 2001, USA 1991)

Small percentage of records have geographical identifiers (or other variables) swapped with other records matching on control variables (larger geographical areas, household size and age-sex distribution)

Advantages:

» Consistent totals between tables

» Preserves marginal distributions at higher aggregated levels

» Some protection against disclosure by differencing

Disadvantages:

» High proportion of risky (unique) records unperturbed

» Errors (bias and variance of estimates) due to perturbation, in particular joint distributions affected
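A minimal sketch of the mechanism, under simplifying assumptions: a fraction of household records exchange their geography code with a randomly chosen partner that matches on a control variable (here only household size). The field names are hypothetical:

```python
import random

def random_swap(records, rate, seed=0):
    """Randomly swap the 'geo' field between pairs of records that
    match on 'hh_size' (a stand-in for the control variables)."""
    rng = random.Random(seed)
    records = [dict(r) for r in records]          # work on a copy
    n_swap = int(len(records) * rate)
    for i in rng.sample(range(len(records)), n_swap):
        # candidate partners: same household size, different geography
        partners = [j for j, r in enumerate(records)
                    if j != i and r["hh_size"] == records[i]["hh_size"]
                    and r["geo"] != records[i]["geo"]]
        if partners:
            j = rng.choice(partners)
            records[i]["geo"], records[j]["geo"] = \
                records[j]["geo"], records[i]["geo"]
    return records

data = [{"geo": g, "hh_size": s} for g in ("A", "B") for s in (1, 2, 3)]
swapped = random_swap(data, rate=0.3)
```

Because only the geography moves, the overall counts of each geography and each household size are unchanged, which is why swapping keeps totals consistent between tables.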

9

Pre-tabular Methods

2. Targeted Record Swapping (USA 2001)

Large percentage of unique records on set of key variables (large households, ethnicity), have geographical identifiers (or other variables) swapped with other records matching on control variables.

Advantages:

» Risky (unique) records more likely to be perturbed

» Consistent totals between tables

» Preserves marginal distributions at higher aggregated levels

» Some protection against disclosure by differencing

Disadvantages:

» Larger errors (bias and variance of estimates) since perturbation is carried out on uniques and outliers, in particular joint distributions affected
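The targeting step can be sketched as follows, assuming (hypothetically) that "risky" means unique on a set of key variables; here the only key is household size and the pairing of risky records is deliberately naive:

```python
from collections import Counter

def targeted_swap(records, keys=("hh_size",)):
    """Swap the 'geo' field between records that are unique on the
    given key variables (a simplified stand-in for targeting)."""
    records = [dict(r) for r in records]
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    risky = [i for i, r in enumerate(records)
             if counts[tuple(r[k] for k in keys)] == 1]
    # pair consecutive risky records and exchange their geographies
    for a, b in zip(risky[::2], risky[1::2]):
        records[a]["geo"], records[b]["geo"] = \
            records[b]["geo"], records[a]["geo"]
    return records

data = [{"geo": "A", "hh_size": 1}, {"geo": "A", "hh_size": 1},
        {"geo": "A", "hh_size": 7}, {"geo": "B", "hh_size": 9}]
out = targeted_swap(data)  # the two uniques (sizes 7 and 9) swap geography
```

In a real implementation the swap partner would also have to match on control variables, which is what drives the larger errors noted above: uniques and outliers are exactly the records whose joint distributions are most distorted by moving them.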

10

Pre-tabular Methods

3. Over-Imputation

A percentage of randomly selected records have certain variables erased and standard imputation methods are applied by selecting donors matching on control variables.

Advantages:

» Imputation software already in place

» Consistent totals between tables

» Some protection against disclosure by differencing

Disadvantages:

» Does not preserve marginal distributions at higher aggregations

» Need for re-editing and further imputations

» Errors (bias and variance of estimates) due to imputation, in particular for mis-specified models

» Assumptions of “missing at random” may not be applicable

» High proportion of risky records unperturbed unless targeted
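A rough sketch of the procedure, with hypothetical field names: a fraction of records have one variable erased, and the value is then re-imputed by hot-deck, i.e. copied from a donor record that matches on a control variable:

```python
import random

def over_impute(records, var, control, rate, seed=0):
    """Erase `var` on a random fraction of records and re-impute it
    by hot-deck from donors matching on `control` (simplified)."""
    rng = random.Random(seed)
    records = [dict(r) for r in records]
    chosen = set(rng.sample(range(len(records)),
                            int(len(records) * rate)))
    for i in chosen:
        donors = [r[var] for j, r in enumerate(records)
                  if j not in chosen
                  and r[control] == records[i][control]]
        if donors:
            records[i][var] = rng.choice(donors)  # hot-deck donor value
    return records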

11

Pre-tabular Methods

4. Post-Randomisation Method (PRAM) (UK 2001)

A percentage of records have certain variables misclassified according to prescribed probabilities. Includes a variant that preserves marginal (compounded) distributions and edit constraints.

Advantages:

» Marginal (compounded) distributions and edits can be maintained

» Method can be targeted to risky records

» Consistent totals between tables

» Some protection against disclosure by differencing

» All records (risky or not) have the same probability of being perturbed

Disadvantages:

» Errors (bias and variance of estimates) due to perturbation

» Need for re-editing and further imputations

» Probability of perturbation very small
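The core of PRAM can be sketched with a transition matrix P, where P[a][b] is the prescribed probability that true category a is published as category b. The matrix below is illustrative only, not one used for Census outputs:

```python
import random

# Illustrative 3-category transition matrix; each row sums to 1 and
# puts most mass on the diagonal (record usually keeps its category).
P = {
    "A": {"A": 0.9, "B": 0.05, "C": 0.05},
    "B": {"A": 0.05, "B": 0.9, "C": 0.05},
    "C": {"A": 0.05, "B": 0.05, "C": 0.9},
}

def pram(values, matrix, seed=0):
    """Misclassify each value according to its row of the matrix."""
    rng = random.Random(seed)
    out = []
    for v in values:
        cats, probs = zip(*matrix[v].items())
        out.append(rng.choices(cats, weights=probs)[0])
    return out

published = pram(["A", "B", "C"] * 10, P)
```

When the matrix is chosen so that the marginal distribution is an eigenvector of P (the "invariant" variant), expected category totals are preserved, which is the "marginal distributions can be maintained" advantage above.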

12

Preliminary Evaluation of Record Swapping

• 16,120 households from 1995 Israel CBS Census sample file

• Households randomly swapped within control strata: Broad region, type of locality, age groups (10) and sex

• Strata collapsed for unique records with no swap pair

• Disclosure Risk measure:

R = (# unswapped records in C_{1,2}) / (total # records in C_{1,2})

where C_{1,2} is the set of cells of size 1 and 2

• Information Loss measure:

IL = Σ_{c ∈ C} ( f_preswapped(c) − f_postswapped(c) )²

where f(c) is the frequency of cell c
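The two measures can be computed directly from the pre- and post-swap files. In this sketch a "cell" is assumed to be a (geography, household size) combination, and C_{1,2} is the set of cells of size 1 or 2 in the original data; field names are hypothetical:

```python
from collections import Counter

def risk_and_loss(pre, post, key=("geo", "hh_size")):
    """Return (R, IL): the share of records in small cells left
    unperturbed, and the squared distance between cell counts."""
    def cell(r):
        return tuple(r[k] for k in key)
    f_pre, f_post = Counter(map(cell, pre)), Counter(map(cell, post))
    small = {c for c, n in f_pre.items() if n in (1, 2)}
    # risk: share of records in small cells untouched by the method
    at_risk = [i for i, r in enumerate(pre) if cell(r) in small]
    unperturbed = sum(1 for i in at_risk if cell(post[i]) == cell(pre[i]))
    risk = unperturbed / len(at_risk) if at_risk else 0.0
    # information loss: squared distance between pre and post counts
    cells = set(f_pre) | set(f_post)
    loss = sum((f_pre[c] - f_post[c]) ** 2 for c in cells)
    return risk, loss
```

Doing nothing gives the extreme point of the R-IL map: risk 1.0 at information loss 0, with stronger perturbation trading risk for loss.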

13

Preliminary Evaluation of Record Swapping

• Risk-Information Loss (R-IL) Map:

» Swapping rates 5%, 10%, 15%, 20%

» Information Loss - distortion to the distribution: Number of persons in household in each district

[Chart: Risk-Information Loss Assessment, Random Record Swapping 5%-20% — Risk (0.7-1.0) plotted against Information Loss (0-40)]

14

Preliminary Evaluation of Record Swapping

• Random record swapping vs. targeted record swapping (on uniques in control strata, i.e. large households).

» Swapping rate: 10%

» Information Loss - distortion to the distribution: Number of persons in household in each district

[Chart: Risk-Information Loss Assessment, 10% Random and Targeted Record Swapping — Risk (0.4-1.0) plotted against Information Loss (0-100)]

15

Preliminary Evaluation of Over-Imputation

• 10% of households had geographic identifier erased:

» Random selection of households

» Targeted selection of households from unique control strata (i.e., large households)

• Geographic identifier imputed using hot-deck imputation within strata: sex and age groups.

• Risk measure:

R = (# not-imputed records + # records imputed with the same variable value) / (total # records in C_{1,2})

where C_{1,2} is the set of cells of size 1 and 2

• Information Loss measure:

IL = Σ_{c ∈ C} ( f_not-imputed(c) − f_imputed(c) )²

where f(c) is the frequency of cell c

16

Preliminary Evaluation of Over-Imputation

• Risk-Information Loss (R-IL) Map:

» 10% Selected Records (Random and Targeted on Uniques)

» Information Loss - distortion to the distribution: Number of persons in household in each district

[Chart: Risk-Information Loss Assessment — 10% Random and Targeted Record Swapping (blue) vs. 10% Random and Targeted Over-Imputation (pink); Risk (0.4-1.0) plotted against Information Loss (0-140)]

17

Final Comments for Pre-tabular Methods

• Geographies are swapped because they introduce fewer edit failures and are generally less correlated with other variables. If other variables were swapped (or over-imputed), such as age, the data would be badly damaged, a large amount of re-editing would be necessary and further imputations carried out.

• Swapping does not affect higher (geographical) level distributions within which the records are swapped. This is an advantage and not a disadvantage.

• Over-imputation is similar to record swapping but causes more damage to the data. Assumptions of “missing at random” are problematic for the analysis of full data sets.

18

SDC for Census Outputs

Post-tabular Methods

1. Barnardization (UK 1991)

Every internal cell in an output table is modified by (+1, 0, -1) according to prescribed probabilities (q, 1-2q, q). No adjustments are made to zero cells.

Advantages:

» Some protection against disclosure by differencing

Disadvantages:

» High proportion of risky (unique) records unperturbed

» Inconsistent totals between tables since margins are calculated from perturbed internal cells
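The rule translates almost directly into code. This is a minimal sketch of the (+1, 0, -1) scheme with probabilities (q, 1-2q, q), skipping zero cells as described:

```python
import random

def barnardize(cells, q, seed=0):
    """Add +1, 0 or -1 to each non-zero cell with probabilities
    (q, 1-2q, q); zero cells are left unchanged."""
    rng = random.Random(seed)
    out = []
    for n in cells:
        if n == 0:
            out.append(0)  # zeros are not adjusted
        else:
            out.append(n + rng.choices([-1, 0, 1],
                                       weights=[q, 1 - 2 * q, q])[0])
    return out

published = barnardize([0, 5, 3, 0, 1], q=0.1)
```

The perturbation is unbiased (expected change per cell is zero), but because margins are computed from the perturbed internal cells, the same marginal count published in two different tables can disagree, which is the inconsistency noted above.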

19

Post-tabular Methods

2. Small Cell Adjustments (UK 2001, Australia)

Small cells randomly adjusted upwards or downwards to a base using an unbiased stochastic method with prescribed probabilities.

Advantages:

» Protects the risky (unique) records

» Lower loss of information for standard tables

Disadvantages:

» Inconsistent totals between tables since margins are calculated from perturbed internal cells

» Can have high errors in totals

» Little protection against disclosure by differencing

» Implementation problems for sparse tables (e.g., origin-destination tables)
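A sketch of an unbiased version of the adjustment, assuming base 3 (the base used for UK 2001 small cell adjustment): a small cell n is moved up to the base with probability n/base and down to 0 otherwise, so the expected published value equals the true count:

```python
import random

def adjust_small_cells(cells, base=3, seed=0):
    """Move cells below `base` to 0 or `base` so that E[out] = n;
    all other cells pass through unchanged (simplified sketch)."""
    rng = random.Random(seed)
    out = []
    for n in cells:
        if 0 < n < base:
            # round up to `base` with probability n/base, else to 0
            out.append(base if rng.random() < n / base else 0)
        else:
            out.append(n)
    return out

published = adjust_small_cells([1, 2, 5, 0])
```

For example a cell of 1 becomes 3 with probability 1/3 and 0 with probability 2/3, giving expectation 1; uniques and pairs never appear in the output, which is how the method protects the risky records.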

20

Post-tabular Methods

3. Unbiased Random Rounding (UK NeSS, New Zealand, Canada)

All cells in tables rounded up or down according to an unbiased prescribed probability scheme.

Advantages:

» Provides good protection against disclosure by differencing (although not a 100% guarantee)

» Easy to apply

» Totals are consistent between tables within the rounding base

Disadvantages:

» Rounds all cells, including safe cells

» Requires complex auditing to ensure protection

» Totals rounded independently from internal cells, so tables are not additive
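The unbiased scheme is simple to state: a cell n with residual r = n mod b is rounded up with probability r/b and down otherwise, so the expected published value equals n. A minimal sketch with base 5:

```python
import random

def random_round(n, base=5, rng=random):
    """Unbiased random rounding of a single cell count to a multiple
    of `base`: up with probability r/base, down otherwise."""
    r = n % base
    if r == 0:
        return n  # already a multiple of the base
    return n - r + (base if rng.random() < r / base else 0)
```

Each cell is rounded independently (that is what makes the method easy to apply), which is also why the rounded total need not equal the sum of the rounded internal cells.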

21

Post-tabular Methods

4. Controlled Rounding (UK NeSS)

All cells in tables rounded up or down in an optimal way that ensures the marginal totals are maintained (up to the base)

Advantages:

» Fully protects against disclosure by differencing

» Tables fully additive

» Minimal information loss

» Works with linked tables and external constraints

Disadvantages:

» Rounds all cells, including safe cells

» Requires the complex SDC tool Tau-Argus (and a licence)

» Would require more development to work with Census-size tables
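The additivity constraint is what separates this from random rounding. A deterministic one-dimensional sketch (the real problem, over linked multi-way tables, is an optimisation solved by tools such as Tau-Argus): round every cell to a multiple of the base, rounding up the cells with the largest residuals so the rounded cells still add to the rounded total:

```python
def controlled_round(cells, base=5):
    """Round cells to multiples of `base` so that their sum equals
    the rounded total (one-dimensional, deterministic sketch)."""
    down = [n - n % base for n in cells]            # round all down
    target = round(sum(cells) / base) * base        # rounded total
    need = (target - sum(down)) // base             # cells to lift
    order = sorted(range(len(cells)),
                   key=lambda i: cells[i] % base, reverse=True)
    for i in order[:need]:
        down[i] += base                             # round these up
    return down

rounded = controlled_round([3, 7, 4], base=5)  # sums to the rounded total 15
```

Every published cell is a multiple of the base and the table is additive, so differencing two overlapping tables can never reveal an exact small count.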

22

Post-tabular Methods

5. Table Design Methods

• Population thresholds

• Level of detail and number of dimensions in the table

• Minimum average cell size

6. Further development of SDC methods

• Controlled small cell adjustments, controlled rounding

• Better implementation and benchmarking techniques for maintaining totals at higher aggregated levels.

23

Evaluation Study

Origin-Destination (Workplace) Tables and Small Cell Adjustments

• Totals in tables obtained by aggregating internal perturbed cells

• Different tables produced different results; the number of flows differed between tables

• ONS guidelines: (1) use table with minimum number of categories; (2) combine minimum number of smaller geographical areas for obtaining estimates for larger areas

• Some problems in implementation for origin-destination tables

24

Evaluation Study

• Workplace (ward to ward) Table W206 for West Midlands: small cell adjustment method unbiased (errors within confidence intervals of the perturbation scheme), ward-to-ward totals not badly damaged, skewness in lower geographical areas.

[Chart: Perturbation Interval, Workplace Table — Error (-70 to 130) plotted against Perturbed Total (0 to 9,000)]
25

• The optimum SDC method is likely a mixture of different methods depending on risk-utility management, output requirements and user needs more generally.

» What is the optimum balance between perturbative and non-perturbative methods of SDC?

» How transparent should the SDC method be? Pre-tabular methods have hidden effects and users are not able to make adjustments in their analysis.

» What are the data used for and how to measure information loss and the impact of the SDC method on data quality?

» Can we improve on post-tabular methods?

» Policies and strategies for access to data through contracts and safe settings?

Work has started on optimal methods as part of the overall planning for the 2011 Census

26

Work Plan Census 2011

I. Assessment of Census 2001 SDC Methods:

• Risk-Utility analysis

• Comprehensive report, forums and discussion groups on SDC methods with users and other agencies

II. Alternative methods for SDC based on results of phase I, user requirements for census outputs and feedback

27

Final Remarks:

We are evaluating our methods and planning future improvements

Our SDC methodology is based on a scientific approach, understanding the needs and requirements of the users and international best practice

Methods for SDC are greatly enhanced by the cooperation and feedback from the user community!

28

Contact Details

Natalie Shlomo

SDC Centre, Methodology Directorate

Office for National Statistics

Segensworth Road, Titchfield

Fareham PO15 5RR

01329 812612

[email protected]