shortcomings of census interaction data
School of Geography
FACULTY OF EARTH & ENVIRONMENT
Shortcomings of Census Interaction Data
Oliver Duke-Williams
Shortcomings
• Overall data quality
• Statistical Disclosure Control
• Variant geographies
• Lack of comparability over time
Overall data quality
Generic issues
• Unit non-response
• Item non-response
Interaction data issues
• Problems of address recall for migration data
• Problems of address accuracy for workplace data
• Changing concept of usual residence
Non-response
Unit non-response – under-enumeration – is a problem for all Census data
• It particularly affects migration data
• Migrants are 2-10 times more likely to be missed from a Census than residents who have not moved – Simpson & Middleton (1997)
Item non-response refers to those people who have completed a Census form, but not answered a specific question
Patterns of non-response: 2001
• Address one year ago, non-response quantiles
Legend: 2%; 2% <= 3%; 3% <= 4%; > 4%
Patterns of non-response: 2001
• Workplace postcode, non-response quantiles
Legend: 4% <= 6%; 6% <= 7%; 7% <= 9%; > 9%
Patterns of non-response: 2001
• Method of travel, non-response quantiles
Legend: 3% <= 4%; 4% <= 5%; 5% <= 7%; > 7%
Item non-response
• Various possibilities for former residence and workplace addresses
• Address correct but no postcode
• Part postcode given (e.g. ‘LS1’)
• No information given
• The 1991 interaction data included the categories ‘address not stated’ and ‘workplace not stated’
Migrant origin not stated
Migrants with origin unstated as % of total inflow, 1990-91
• Limited spatial patterns
• Significant numbers for most districts
Legend: <= 5%; 5% <= 10%; 10% <= 15%; > 15%
Item non-response
In 2001, unknown or incomplete addresses were imputed using donor records
• First, select possible donors on the basis of predictive variables
• SWS: Industry, occupation, establishment size, mode of transport
• SMS: Other migrants in household, country of birth, marital status
• Use partial information if available
• Then, select geographically nearest donor
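The two-step donor selection described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for the 2001 Census: the `Record` fields, the coordinates, the exact-match rule on predictors and the postcode values are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional


@dataclass
class Record:
    """A hypothetical census record; field names are illustrative only."""
    industry: str
    occupation: str
    mode_of_transport: str
    x: float  # residence easting (assumed known for every record)
    y: float  # residence northing
    workplace: Optional[str]  # None when the workplace address is missing


def impute_workplace(recipient: Record, records: List[Record]) -> Optional[str]:
    """Copy the workplace of the nearest donor that matches on the predictors."""
    # Step 1: possible donors have a known workplace and the same
    # predictive variables as the recipient (assumed exact match).
    donors = [
        r for r in records
        if r.workplace is not None
        and (r.industry, r.occupation, r.mode_of_transport)
        == (recipient.industry, recipient.occupation, recipient.mode_of_transport)
    ]
    if not donors:
        return None
    # Step 2: choose the geographically nearest donor (straight-line distance).
    nearest = min(donors, key=lambda r: hypot(r.x - recipient.x, r.y - recipient.y))
    return nearest.workplace


# Made-up records for illustration.
records = [
    Record("retail", "sales", "bus", 0.0, 0.0, "LS1"),
    Record("retail", "sales", "bus", 5.0, 5.0, "LS6"),
    Record("health", "nurse", "car", 0.5, 0.5, "LS9"),
]
recipient = Record("retail", "sales", "bus", 4.0, 4.0, None)
imputed = impute_workplace(recipient, records)
print(imputed)
```

Partial information, such as a part postcode, could be used to narrow the donor pool further before the distance step; that refinement is left out of the sketch.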
Statistical Disclosure Control
Methods applied to interaction data
• 1981
• 1991
• 2001
SDC: 1981
• Workplace data – based on 10% sample, therefore no further modification required
• Migration data
• Set 1
• Within ward
• Ward to rest of district (for flows > 25 persons) or ward to rest of county etc.
• Set 2
• Ward level, total males and females only
SDC: 1991
• Workplace data – based on 10% sample, therefore no further modification required
• Migration data
• Suppression applied to some tables
SDC: 1991 – SMS
• Set 1: Flows within and between wards
• Set 2: Flows within and between districts
SDC: 1991 – SMS Set 2
Table  Symbol  Contents
1      m       Age (broad) x Sex
2      77m     Wholly moving households & residents
3      m       Age (5 yr) x Sex
4      m       Marital Status x Sex
5      m       Ethnic group
6      m       Household residency status x LLTI status
7      m       Economic position (16+)
8      7       Tenure
8S     7       Tenure
9      7       Households: by sex & economic position of head
10     7m      Residents: by sex & economic position of head
11S    m       Gaelic speakers
11W    m       Welsh speakers
Extent of suppression
• Districts are grouped by county
• Shading:
• Red: Total migrants >= 10
• Blue: Total migrants 0 < n < 10
• White: Total migrants = 0
[Figure: per district totals, with districts grouped as Greater London, Metro counties, and Other counties (sorted alphabetically)]
Effect of suppression: White migrants, 1990-91
Published value as % of estimated correct value
Origins / Destinations
Regions, in row and column order: North, Yorkshire and Humberside, East Midlands, East Anglia, South East, South West, West Midlands, North West, Wales, Scotland
North 99% 73% 22% 20% 25% 17% 18% 50% 11% 41%
Yorkshire and Humberside 69% 100% 77% 51% 50% 46% 51% 73% 27% 37%
East Midlands 20% 72% 99% 65% 40% 28% 59% 44% 21% 22%
East Anglia 15% 46% 64% 100% 55% 33% 29% 26% 23% 39%
South East 23% 48% 42% 73% 97% 67% 41% 36% 25% 34%
South West 16% 43% 25% 34% 56% 99% 52% 31% 45% 29%
West Midlands 24% 55% 65% 34% 42% 56% 99% 59% 50% 21%
North West 54% 75% 43% 29% 38% 38% 55% 100% 65% 27%
Wales 6% 23% 18% 24% 28% 47% 48% 53% 99% 17%
Scotland 41% 33% 24% 33% 35% 30% 21% 31% 14% 99%
Effect of suppression: Black migrants, 1990-91
Published value as % of estimated correct value
Origins / Destinations
Regions, in row and column order: North, Yorkshire and Humberside, East Midlands, East Anglia, South East, South West, West Midlands, North West, Wales, Scotland
North 100% 83% 50% 0% 47% 45% 33% 71% 29% 25%
Yorkshire and Humberside 86% 100% 88% 62% 72% 70% 70% 83% 22% 63%
East Midlands 60% 86% 99% 84% 66% 81% 83% 72% 17% 38%
East Anglia 0% 89% 64% 100% 59% 35% 40% 50% 33% 25%
South East 39% 74% 67% 87% 99% 71% 70% 62% 43% 47%
South West 54% 63% 31% 56% 66% 99% 69% 62% 73% 67%
West Midlands 60% 80% 85% 56% 69% 71% 100% 76% 67% 75%
North West 54% 72% 58% 17% 70% 39% 69% 100% 67% 20%
Wales 33% 36% 45% 50% 53% 69% 50% 73% 99% 17%
Scotland 38% 65% 54% 33% 61% 67% 29% 63% 17% 98%
Effect of suppression: Indian, Pakistani, Bangladeshi migrants, 1990-91
Published value as % of estimated correct value
Origins / Destinations
Regions, in row and column order: North, Yorkshire and Humberside, East Midlands, East Anglia, South East, South West, West Midlands, North West, Wales, Scotland
North 99% 79% 38% 68% 50% 44% 45% 46% 67% 67%
Yorkshire and Humberside 87% 100% 87% 72% 74% 56% 85% 85% 38% 61%
East Midlands 30% 89% 99% 70% 64% 72% 85% 56% 27% 26%
East Anglia 18% 88% 79% 100% 69% 9% 58% 15% 33% 22%
South East 36% 66% 65% 73% 98% 67% 66% 55% 51% 51%
South West 60% 48% 72% 27% 65% 100% 72% 63% 50% 41%
West Midlands 58% 83% 88% 34% 70% 72% 100% 69% 69% 49%
North West 40% 90% 76% 57% 49% 34% 65% 100% 55% 30%
Wales 5% 11% 26% 20% 49% 48% 62% 38% 99% 38%
Scotland 71% 84% 76% 58% 50% 8% 70% 65% 0% 99%
Effect of suppression: Chinese and other migrants, 1990-91
Published value as % of estimated correct value
Origins / Destinations
Regions, in row and column order: North, Yorkshire and Humberside, East Midlands, East Anglia, South East, South West, West Midlands, North West, Wales, Scotland
North 100% 82% 45% 60% 52% 31% 32% 50% 0% 38%
Yorkshire and Humberside 85% 100% 84% 59% 69% 64% 81% 88% 55% 51%
East Midlands 42% 81% 99% 63% 57% 32% 73% 66% 33% 46%
East Anglia 5% 78% 72% 100% 63% 55% 70% 32% 27% 58%
South East 35% 61% 54% 74% 98% 71% 55% 51% 40% 47%
South West 26% 72% 34% 71% 59% 99% 54% 40% 47% 33%
West Midlands 48% 82% 65% 30% 65% 75% 100% 67% 50% 35%
North West 67% 86% 56% 49% 53% 38% 71% 99% 55% 54%
Wales 56% 26% 42% 33% 54% 65% 72% 66% 99% 19%
Scotland 51% 65% 30% 38% 54% 52% 74% 61% 0% 99%
Effect of suppression: Mis-reporting of largest non-white migrant group
Origins / Destinations
Regions, in row and column order: North, Yorkshire and Humberside, East Midlands, East Anglia, South East, South West, West Midlands, North West, Wales, Scotland
North X X
Yorkshire and Humberside X X
East Midlands X
East Anglia X X
South East X
South West X X X
West Midlands
North West X
Wales X X
Scotland X X
Coping with problems - 1991
Under-enumeration
Suppression
The MIGPOP data set
[Chart: number of migrants (thousands, axis 0 to 1,200) by five-year age group from 1-4 to 85+; series: MIGPOP and SMS2]
The MIGPOP data set
[Chart: number of migrants (thousands, axis 0 to 1,200) by age group, labelled from 1-4 to 80-84]
MIGPOP data set
• Produced by Simpson and Middleton (1999)
• Available from CIDER through WICID
• Allows for
• ‘Missing million’
• Under-reporting of migrants
• Migrants with unknown origin
• Contains one age by sex table
Suppression
Migration from Mid-Bedfordshire to Avon, 1990-91
                                  Bath  Bristol  Kingswood  Northavon  Wansdyke  Woodspring  TOTAL
White                                0       11          0         16         0          69     99   3
Black                                0        1          0          0         0           0      1   0
Indian, Pakistani, Bangladeshi       0        0          0          0         0           0      0   0
Chinese and other                    0        0          0          0         0           1      1   0
TOTAL                                3       12          0         16         0          70    101
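The margins of the Mid-Bedfordshire to Avon table show where suppression has removed migrants. The sketch below compares each published row and column total with the sum of its interior cells; it is only a consistency check on the figures above, not the estimation method of Rees and Duke-Williams (1997). The variable names are invented for the example.

```python
# Published counts, Mid-Bedfordshire to Avon, 1990-91, copied from the table above.
districts = ["Bath", "Bristol", "Kingswood", "Northavon", "Wansdyke", "Woodspring"]
published = {
    "White": [0, 11, 0, 16, 0, 69],
    "Black": [0, 1, 0, 0, 0, 0],
    "Indian, Pakistani, Bangladeshi": [0, 0, 0, 0, 0, 0],
    "Chinese and other": [0, 0, 0, 0, 0, 1],
}
row_totals = {
    "White": 99,
    "Black": 1,
    "Indian, Pakistani, Bangladeshi": 0,
    "Chinese and other": 1,
}
column_totals = [3, 12, 0, 16, 0, 70]

# Migrants counted in a row total but absent from that row's interior cells.
row_gaps = {group: row_totals[group] - sum(cells) for group, cells in published.items()}

# Migrants counted in a column total but absent from that column's interior cells.
col_gaps = {
    district: column_totals[i] - sum(cells[i] for cells in published.values())
    for i, district in enumerate(districts)
}

# A cell is a candidate for suppression when both its row and column have a gap.
candidates = [
    (group, district)
    for group, gap in row_gaps.items() if gap > 0
    for district, dgap in col_gaps.items() if dgap > 0
]
print(row_gaps)
print(col_gaps)
print(candidates)
```

Here the only gaps are 3 in the White row and 3 in the Bath column, so the missing migrants can be placed in a single cell. In general several cells share a gap and the values have to be estimated rather than read off.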
SMSGAPS
The SMSGAPS data set incorporates recovered and estimated data for most suppressed tables
• Produced by Rees and Duke-Williams (1997)
• Contains versions of all SMS Set 2 tables except 11S and 11W
• Available from CIDER through WICID
SDC: 2001
• Outputs of the 2001 Census were subject to Small Cell Adjustment Methodology
• Initial version of cross-tabulation produced from raw data
• ‘Small values’ were then modified
• Sub-totals and totals for each table were then recalculated from the modified values
SCAM example
Persons Male Female
Total
0-15 1 2
15-Pensionable 17 15
Pensionable+ 1 4
SCAM example
Persons Male Female
Total 40 19 21
0-15 3 1 2
15-Pensionable 32 17 15
Pensionable+ 5 1 4
SCAM example
Persons Male Female
Total 40 19 21
0-15 3 1? 2?
15-Pensionable 32 17 15
Pensionable+ 5 1? 4
SCAM example
Persons Male Female
Total 40 19 21
0-15 3 0 3
15-Pensionable 32 17 15
Pensionable+ 5 3 4
SCAM example
Persons Male Female
Total 42 20 22
0-15 3 0 3
15-Pensionable 32 17 15
Pensionable+ 7 3 4
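The sequence in the example (build the table, modify small values, recalculate totals) can be sketched in code. The rounding rule below is an assumption: small values of 1 and 2 are moved to 0 or 3 with probabilities chosen so that the expected value is unchanged. The probabilities actually used are not given here, and the function names are invented for the illustration.

```python
import random


def adjust_small_cell(value, rng):
    """Move a small value (1 or 2) to 0 or 3; leave other values unchanged.

    ASSUMPTION: unbiased rounding, i.e. the cell becomes 3 with
    probability value / 3. This is not a documented rule.
    """
    if value in (1, 2):
        return 3 if rng.random() < value / 3 else 0
    return value


def scam_like(interior, rng):
    """Adjust interior cells, then recalculate every total from the adjusted cells."""
    adjusted = [[adjust_small_cell(v, rng) for v in row] for row in interior]
    row_totals = [sum(row) for row in adjusted]          # Persons, per age band
    col_totals = [sum(col) for col in zip(*adjusted)]    # Male and Female totals
    return adjusted, row_totals, col_totals, sum(row_totals)


# Interior cells (Male, Female) for the three age bands in the example.
raw = [[1, 2], [17, 15], [1, 4]]
adjusted, persons, by_sex, total = scam_like(raw, random.Random(2001))
print(adjusted, persons, by_sex, total)
```

Because the totals are rebuilt from the adjusted interior cells, the published total can differ from the true total of 40, as it does in the example.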
SCAM
• SCAM was applied differentially across the UK
• This is particularly confusing for the interaction data, as they are explicitly presented as a UK-level data set
• SCAM was applied on the basis of where the data were collected
• Migration data were collected at the destination
• Flows with destinations in England, Wales and Northern Ireland were subject to SCAM
• Workplace data were collected at the residence (origin)
• Flows with origins in England, Wales and Northern Ireland were subject to SCAM
• In addition, OA level workplace data with origins in Scotland were subject to SCAM
• OA level workplace data were not published for Northern Ireland
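The rules above can be written as a small lookup. This is a sketch of the logic as stated on the slide; the function name, the string labels for data sets and geography levels, and the argument order are assumptions made for the example. It does not check whether a given combination was actually published.

```python
# Countries whose collected data were subject to SCAM, per the rules above.
SCAM_COUNTRIES = {"England", "Wales", "Northern Ireland"}


def subject_to_scam(dataset, origin_country, destination_country, level):
    """Return True if a flow would have been subject to SCAM.

    dataset: "migration" or "workplace" (hypothetical labels)
    level:   "OA", "ward" or "district" (hypothetical labels)
    """
    if dataset == "migration":
        # Migration data were collected at the destination.
        return destination_country in SCAM_COUNTRIES
    if dataset == "workplace":
        # Workplace data were collected at the residence (origin).
        if origin_country in SCAM_COUNTRIES:
            return True
        # OA level workplace data with origins in Scotland were also adjusted.
        return origin_country == "Scotland" and level == "OA"
    raise ValueError(f"unknown dataset: {dataset!r}")


print(subject_to_scam("migration", "England", "Scotland", "ward"))
print(subject_to_scam("workplace", "Scotland", "England", "OA"))
```

A single origin-destination matrix can therefore mix adjusted and unadjusted flows, depending on which end of the flow lies in Scotland.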
Effects of SCAM
Interaction data are characterised by:
• Sparse matrices
• Dominance of small values
• 2001 data characterised by over-reporting of multiples of 3
[Chart: Frequency of flow totals, 2001 SMS Table MG301. Number of flows (logarithmic axis, 1 to 10,000,000) against flow size (number of migrants, 0 to 400)]
[Chart: Frequency of flow totals, 2001 SMS Table MG301: detail. Frequency (axis 0 to 1,400,000) against flow size (0 to 50)]
[Chart: Frequency of flow totals, 2001 SWS Table W301: detail. Frequency (axis 0 to 6,000,000) against flow size (0 to 50)]
2001 data and multiples of 3
• It is the interior cells that are modified
• Flow totals are re-calculated from these modified values
Contribution of interior cells to SCAM adjustment of MG301
Potentially modified interior cells per table   Number of flows   Number of migrants
6                                                     1,761,815            5,852,646
5                                                        31,973              218,929
4                                                         4,626               78,013
3                                                           457                8,428
2                                                           182                6,435
1                                                             6                  388
0                                                             2                  157
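The scale of the effect is easier to see as shares. The short calculation below uses only the figures in the table above; the variable names are invented for the example.

```python
# (potentially modified interior cells, number of flows, number of migrants)
rows = [
    (6, 1_761_815, 5_852_646),
    (5, 31_973, 218_929),
    (4, 4_626, 78_013),
    (3, 457, 8_428),
    (2, 182, 6_435),
    (1, 6, 388),
    (0, 2, 157),
]

total_flows = sum(flows for _, flows, _ in rows)
total_migrants = sum(migrants for _, _, migrants in rows)

for cells, flows, migrants in rows:
    print(f"{cells} cells: {flows / total_flows:.4%} of flows, "
          f"{migrants / total_migrants:.4%} of migrants")

# Share of flows in which all six interior cells are potentially modified.
share_all_six = rows[0][1] / total_flows
```

About 98% of flows, carrying about 95% of migrants, have all six interior cells potentially modified, which is why multiples of 3 dominate the recalculated flow totals.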
Coping with problems: 2001
Tactics for using SCAM affected data
• Use average values?
• Useful in some situations, but could lead to errors if rates are calculated
• Use minimum number of cells to calculate required value
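The advice to use the minimum number of cells follows from how adjustment errors accumulate. The sketch below computes the variance introduced by adjusting one small cell and by summing several. It rests on two assumptions that are not stated in the source: that a small value becomes 3 with probability value/3 and 0 otherwise, and that cells are adjusted independently.

```python
from fractions import Fraction


def adjustment_variance(value):
    """Variance of the published value for one small cell (true value 1 or 2).

    ASSUMPTION: the cell becomes 3 with probability value / 3, else 0.
    """
    p_three = Fraction(value, 3)
    mean = 3 * p_three  # equals the true value under this assumption
    return p_three * (3 - mean) ** 2 + (1 - p_three) * (0 - mean) ** 2


def variance_of_sum(n_small_cells):
    """Variance of a value built by adding n independently adjusted small cells."""
    # Under the assumption above, cells with true value 1 and 2 have the
    # same variance, so the variance of the sum is n times that value.
    return n_small_cells * adjustment_variance(1)


print(adjustment_variance(1), adjustment_variance(2))
print(variance_of_sum(1), variance_of_sum(6))
```

Under these assumptions every extra adjusted cell adds the same amount of variance, so a value taken from one published cell is more reliable than the same value assembled from many.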
Variant geographies
Changes between Censuses
• A problem that is common across all Census outputs
Differences compared to other Census products
• Problems specific to the interaction data, in particular the 2001 data
Differences between Census products
The 2001 interaction data have geographies that do not always match those in the other aggregate data
• Level 1: Output Areas
• Interaction data are the same as other outputs
• Level 2: ‘Wards’
• Interaction data are an amalgam of
• CAS wards in England and Wales
• ST wards in Scotland
• Standard wards in Northern Ireland
• Level 3: ‘Districts’
• Interaction data are an amalgam of
• London boroughs, metro and other districts, Unitary authorities, Scottish Council Areas
• Parliamentary constituencies in Northern Ireland
Problems of different geographies
• When mapping data, correct boundary sets are time-consuming to assemble
• When constructing rates, correct denominators are time-consuming to gather
• Not all area data are easily available for all of these geographies
Lack of comparability over time
As well as changes in geography, there are significant changes in data structure over time
General issues
• Changes in population base, inclusion of students etc.
• Handling of unknown migrant origins or workplace locations
Migration data
• Handling of overseas origins
• Use of ‘no usual residence’
Workplace data
• Handling of off-shore workers
• Handling of home-workers
No usual residence in 2001 migration data
[Chart, axis 0% to 20%: Percentage of all migrants 2000-1, by district, who had ‘no usual residence’ one year prior to the Census]
• Mean: 6.9%
• Minimum: 3.7% - Ribble Valley
• Maximum: 19% - Newham
• 19 of the 20 districts with the highest levels are in London
Home-workers
1981 – Workplace at home is part of general ‘within ward’ flow
• Home-workers can only be distinguished from others in the ‘mode of transport’ table
1991 – Workplace at home is a distinct workplace location
• All tables can be extracted separately for home-workers
2001 – Workplace at home is part of general ‘within ward’ flow
• Home-workers can only be distinguished from others in the ‘mode of transport’ table
Coping with compatibility issues
Various data sets exist that attempt to bridge some of these gaps
• Re-estimate for newer geographies
• e.g. 1981 data on 1991 and 2001 boundaries (Boyle and Feng, 2002)
• Create hybrid sets
• e.g. merge home-workers into main flow for 1991
• Create best-fit geographies that span time periods
• e.g. CIDS common geographies
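A hybrid set of the kind mentioned above, with 1991 home-workers merged into the within-ward flow so that the structure matches 1981 and 2001, can be sketched as follows. The dictionary layout, the function name, the ward codes and all counts are made up for illustration and do not reflect the format of any published data set.

```python
def merge_home_workers(flows, home_workers):
    """Add home-worker counts to the within-ward flow for each ward.

    flows:        {(origin_ward, workplace_ward): count}
    home_workers: {ward: count of people working at home}
    Returns a new dictionary; the inputs are left unchanged.
    """
    merged = dict(flows)
    for ward, count in home_workers.items():
        key = (ward, ward)  # home-workers become part of the within-ward flow
        merged[key] = merged.get(key, 0) + count
    return merged


# Made-up 1991-style inputs: home-workers held separately from the flows.
flows_1991 = {("A", "A"): 120, ("A", "B"): 40, ("B", "B"): 75}
home_1991 = {"A": 10, "C": 5}

hybrid = merge_home_workers(flows_1991, home_1991)
print(hybrid)
```

The merge discards the 1991 distinction between home-workers and other within-ward commuters, which is the price of comparability with the other two Censuses.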
Summary
• The interaction data suffer from problems related to
• Disclosure control modifications
• Changes over time
• Awkward geographies in 2001
• These have been addressed by
• Estimated and re-worked data sets
• Data estimated for different boundary sets
References
Boyle, P.J. and Feng, Z. (2002) A method for integrating the 1981 and 1991 GB Census interaction data, Computers, Environment and Urban Systems, 26: 241-56
Rees, P.H. and Duke-Williams, O. (1997) Methods for estimating missing data on migrants in the 1991 British Census, International Journal of Population Geography, 3: 323-368
Simpson, S. and Middleton, E. (1997) Who is missed by a national Census? A review of empirical results from Australia, Britain, Canada and the USA, CCSR Working Paper No 2, Centre for Census and Survey Research, University of Manchester
Simpson, S. and Middleton, E. (1999) Undercount of migration in the UK 1991 Census and its impact on counterurbanisation and population projections, International Journal of Population Geography, 5: 387-405