26/11/2015
www.statcan.gc.ca
Postal Code Conversion for Data Analysis
An overview of the PCCF and PCCF+
Saeeda Khan, Michael Tjepkema
Health Analysis Division, Statistics Canada
December 1, 2015
Outline
1. Postal codes
• Components of a postal code
• Uses of small-area data
2. Introduction to the Postal Code Conversion File (PCCF) and the Postal Code Conversion File Plus (PCCF+)
3. Single link indicator geocoding versus population-weighting
4. Why PCCF+?
5. Limitations of PCCF & PCCF+
11/26/2015Statistics Canada • Statistique Canada2
1. Postal Codes
What are postal codes?
• An identifier managed by Canada Post Corporation for the efficient sorting and delivery of mail.
• They are not created as units for the analysis or mapping of population, business or dwelling characteristics.
• However, postal codes are part of most administrative data sets and are usually the only variable available for geographic identification
• Thus, they are important identifiers for geocoding
Components of a postal code
• The postal code is a six-character alphanumeric code
• Postal codes are not geographic attributes
• Only spatial in that mail is delivered by geographic area
• Format ‘ANA NAN’, where A is a letter and N is a digit
• First 3 – Forward Sortation Area (FSA)
• Last 3 – Local Delivery Unit (LDU)
Statistics Canada. Postal Codes Conversion File (PCCF), Reference Guide. Catalogue no. 92-153-G, no 02. Ottawa, ON: Statistics Canada, 2011.
What is a postal code?
ANA NAN
• Characters 1-3 (ANA): Forward Sortation Area (FSA)
• Characters 4-6 (NAN): Local Delivery Unit (LDU)
• Second character of the FSA: if 0 then rural; if 1-9 then urban
Province / Territory / Region          First Character
Newfoundland and Labrador              A
Nova Scotia                            B
Prince Edward Island                   C
New Brunswick                          E
Eastern Québec                         G
Metropolitan Montréal                  H
Western Québec                         J
Eastern Ontario                        K
Central Ontario                        L
Metropolitan Toronto                   M
Southwestern Ontario                   N
Northern Ontario                       P
Manitoba                               R
Saskatchewan                           S
Alberta                                T
British Columbia                       V
Northwest Territories and Nunavut      X
Yukon                                  Y
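The structure on this slide can be sketched in a few lines of Python. This is an illustration only, not part of the PCCF or PCCF+; the function name `split_postal_code` and the returned keys are assumptions made for the example. It checks the ANA NAN format only and does not check whether a postal code actually exists.

```python
import re

# First character -> province/territory/region, transcribed from the slide.
FIRST_CHAR_REGION = {
    "A": "Newfoundland and Labrador",
    "B": "Nova Scotia",
    "C": "Prince Edward Island",
    "E": "New Brunswick",
    "G": "Eastern Québec",
    "H": "Metropolitan Montréal",
    "J": "Western Québec",
    "K": "Eastern Ontario",
    "L": "Central Ontario",
    "M": "Metropolitan Toronto",
    "N": "Southwestern Ontario",
    "P": "Northern Ontario",
    "R": "Manitoba",
    "S": "Saskatchewan",
    "T": "Alberta",
    "V": "British Columbia",
    "X": "Northwest Territories and Nunavut",
    "Y": "Yukon",
}

# ANA NAN: letter-digit-letter, optional space, digit-letter-digit.
_PATTERN = re.compile(r"^([A-Z]\d[A-Z]) ?(\d[A-Z]\d)$")


def split_postal_code(raw):
    """Hypothetical helper: describe a postal code, or return None if malformed."""
    match = _PATTERN.match(raw.strip().upper())
    if match is None:
        return None
    fsa, ldu = match.groups()
    return {
        "fsa": fsa,                                # Forward Sortation Area
        "ldu": ldu,                                # Local Delivery Unit
        "region": FIRST_CHAR_REGION.get(fsa[0]),   # None for letters not in the table
        "rural": fsa[1] == "0",                    # second character 0 = rural
    }
```

For example, `split_postal_code("K1A 0T6")` returns FSA `K1A`, LDU `0T6`, region Eastern Ontario, and `rural` set to `False`.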
Components of a postal code
• Local Delivery Unit (LDU)
• Letter carrier delivery to ordinary urban address
• Community mailbox
• Apartment building
• Business building
• Large firm or organisation (Foothills Medical Centre: T2N 2T9; CBC: M5W 1E6)
• Federal department or agency (Statistics Canada: K1A 0T6)
• Mail delivery route (suburban, rural, or mobile)
• General delivery and post office boxes (large or small)
Statistics Canada. Postal Codes Conversion File (PCCF), Reference Guide. Catalogue no. 92-153-G, no 02. Ottawa, ON: Statistics Canada, 2011.
Components of a postal code
Haydu G. The Postal Code – Geographic classification code conversion file, a tool for social science research. Paper presented at the 1979 annual meeting of the Canadian Association of Geographers, Victoria, BC, Canada.
How can postal codes be used for analysis?
• Postal codes are part of most administrative data sets
• PCCF, PCCF+, and related tools are now the standard
• Allows for the conversion of address and postal code attributes to standard geographical codes
• Used in data collection, processing, and analysis, e.g., dissemination area (DA), census tract (CT), health region (HR)
• The resulting small-area geographies have a variety of uses
• Familiarity with the methods, strengths, and limitations will help researchers exploit the potential
Uses of small area data
• Add policy relevance by aggregating to admin areas
• Health Regions, School Districts, etc…
• Deal with changes over time (boundary shifts)
• Assign neighbourhood socio-economic status (SES) and other confounders
• Determine point-distance, road distance, travel time
• Allow for studies of migration over time (longitudinal)
• Help in the imputation of missing data
• Obtain additional identifiers for record linkage
2. Introduction to the PCCF and PCCF+
What is the PCCF?
• A flat file that links postal codes (active and retired) to standard geographic areas
• Allows for:
• Association of postal codes to standard geographic areas
• Selection of statistical units by geographic areas
• Provides linkages (including a single link indicator (SLI)) to block face (BF), dissemination block (DB), and dissemination area (DA)
• However, some postal codes are only linked to post office locations, many serve multiple DAs, and some are non-residential (government offices, etc.)
Statistics Canada. Postal Codes Conversion File (PCCF), Reference Guide. Catalogue no. 92-153-G, no 02. Ottawa, ON: Statistics Canada, 2011.
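To make the idea of a flat link file concrete, here is a minimal sketch with made-up rows. The postal codes, DA labels and column names are hypothetical and do not reproduce the actual PCCF record layout; only the idea of one flagged record per postal code (SLI = 1) comes from the slide.

```python
# Toy rows in the spirit of a PCCF extract (all values are invented).
toy_pccf = [
    {"postal_code": "A1A1A1", "da": "DA 1", "sli": 1},
    {"postal_code": "A1A1A1", "da": "DA 2", "sli": 0},
    {"postal_code": "A1A1A1", "da": "DA 3", "sli": 0},
    {"postal_code": "B2B2B2", "da": "DA 4", "sli": 1},
]


def single_link_lookup(rows):
    """Map each postal code to the one DA flagged by the single link indicator."""
    return {row["postal_code"]: row["da"] for row in rows if row["sli"] == 1}


def das_served(rows):
    """Map each postal code to every DA it is linked to, flagged or not."""
    served = {}
    for row in rows:
        served.setdefault(row["postal_code"], []).append(row["da"])
    return served
```

Comparing the two results shows what the single link hides: the first postal code is linked to three DAs, but the single-link lookup returns only one of them.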
What is the PCCF+?
• The PCCF+ consists of:
1. a SAS control program
2. reference files derived primarily from the PCCF
3. a postal code population-weight file derived from the Census of Population
• Assigns geographic identifiers based on postal codes
• Full diagnostic output (troublesome postal codes, precision of geocoding, etc.)
• Provides residential & institutional coding separately
Wilkins R, Peters PA. PCCF+ Version 5K User’s Guide: Automated geocoding based on the Statistics Canada Postal Code Conversion File.
Catalogue no. 82F0086-XDB. Ottawa, ON: Statistics Canada, 2011.
Importance of Identifying Non-residential PCs
• PCCF+ is able to identify non-residential postal codes
• Government Offices, e.g., Statistics Canada
• Coroners’ offices
• Children’s Aid Societies
• Hospitals in a birth file
• Tax preparers’ offices in a tax file
• UPS Store, Mailboxes Etc.
How does the PCCF+ geocode postal codes?
• Assigns geographic identifiers based on postal codes in a staged approach:
1. assigns 6-character postal codes in rural areas to dissemination areas (DA) and dissemination blocks (DB) using population-weighted random allocation
2. assigns 6-character postal codes with an exact match to a PCCF unique record
3. randomly assigns 6-character postal codes with an exact match to a PCCF duplicate record
4. imputes full geography from the first 5, first 4 and first 3 characters of the postal code using census population weights
5. imputes partial geography from the first 2 characters of the postal code
Wilkins R, Peters PA. PCCF+ Version 5K User’s Guide: Automated geocoding based on the Statistics Canada Postal Code Conversion File.
Catalogue no. 82F0086-XDB. Ottawa, ON: Statistics Canada, 2011.
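The fallback logic of stages 2 to 5 can be sketched as follows. This is a simplified illustration, not the PCCF+ SAS program: the function name, the three input sets and the returned labels are all assumptions, and stage 1 (population-weighted allocation of rural postal codes) is left out.

```python
def matching_stage(postal_code, unique_codes, duplicate_codes, known_prefixes):
    """Hypothetical helper: label the stage at which a postal code could be resolved.

    unique_codes    : 6-character codes with exactly one reference record
    duplicate_codes : 6-character codes with several reference records
    known_prefixes  : leading fragments (2 to 5 characters) with stored weights
    """
    code = postal_code.replace(" ", "").upper()
    if len(code) == 6:
        if code in unique_codes:
            return "exact match, unique record"
        if code in duplicate_codes:
            return "exact match, random choice among duplicate records"
    # Fall back to progressively shorter leading fragments.
    for length in (5, 4, 3):
        if len(code) >= length and code[:length] in known_prefixes:
            return f"full geography imputed from first {length} characters"
    if len(code) >= 2 and code[:2] in known_prefixes:
        return "partial geography imputed from first 2 characters"
    return "not coded"
```

The point of the ordering is that a record is always resolved at the most specific level available, and shorter fragments are used only when nothing longer matches.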
Uses of the PCCF and the PCCF+
• A 2011 literature review identified 622 publications using the PCCF and PCCF+
• Health Sciences 463 (74%)
• Social Sciences & Economics 93 (15%)
• Education, data, & statistics 34 (6%)
• Natural & applied sciences 12 (2%)
• Other 20 (3%)
• Articles appeared in 233 different journals, top two:
• Canadian Medical Association Journal (23)
• Canadian Journal of Public Health (19)
Peller P. An analysis of the Postal Code Conversion File’s use in research. DLI research paper series, 2011. Calgary, AB: University of Calgary.
3. PCCF-SLI vs. PCCF+
Single-link (PCCF-SLI) vs. PCCF+
• PCCF-SLI forces each postal code to be assigned to a single dissemination area (DA) & dissemination block (DB), regardless of how large the actual service area may be
• For most research purposes, the distribution of the population across the entire service area is needed
• PCCF+ uses a population-weighted method of geocoding where multiple matches are possible
• As such, the distribution of respondents more accurately reflects the underlying population
• “Numerator-denominator consistency”
Example: postal code A1A 1A1 serves three dissemination areas, with weights of 60% (DA 1), 30% (DA 2) and 10% (DA 3)
PCCF (SLI): Of 10 records reporting this postal code, all 10 will be assigned to DA 1 using the PCCF single link indicator (SLI)
PCCF+: Of 10 records reporting this postal code, 6 will be assigned to DA 1, 3 to DA 2 and 1 to DA 3 using the PCCF+
Records assigned     DA 1    DA 2    DA 3
PCCF (SLI)           10      0       0
PCCF+                6       3       1
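The contrast in this example can be imitated with a short Python sketch. It is not the PCCF+ algorithm itself: the function names are invented, the weights are the 60/30/10 split from the example, and Python's standard `random` module stands in for the SAS random-number routines.

```python
import random
from collections import Counter

# Population weights for the three DAs served by the example postal code.
DA_WEIGHTS = {"DA 1": 0.6, "DA 2": 0.3, "DA 3": 0.1}


def assign_single_link(n_records, sli_da="DA 1"):
    """Single-link coding: every record goes to the one flagged DA."""
    return Counter({sli_da: n_records})


def assign_weighted(n_records, weights, seed):
    """Population-weighted coding: draw a DA for each record in proportion to its weight."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    das = list(weights)
    draws = rng.choices(das, weights=[weights[d] for d in das], k=n_records)
    return Counter(draws)
```

With only 10 records, a single random draw will not necessarily give exactly 6, 3 and 1; that split is what is expected on average. With many records (for example `assign_weighted(10000, DA_WEIGHTS, seed=1)`), the shares settle close to 60%, 30% and 10%, while the single-link result stays at 100% in DA 1.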
Population assignment using PCCF-SLI
[Map: Alberta, Saskatchewan, Manitoba]
Population assignment using PCCF+
[Map: Alberta, Saskatchewan, Manitoba]
Population non-assignment via PCCF-SLI & PCCF+
Percent of total 2006 census population in areas with no respondent assignment
Geographic Unit   PCCF-SLI                              PCCF+
                  # of Units   Percent of Population    # of Units   Percent of Population
DA                8,476        2.9                      187          0
CT                73           0.1                      7            0
CMA               ..           ..                       ..           ..
CSD               1,438        0.6                      109          0
CD                ..           ..                       ..           ..
Population assignment using PCCF-SLI
[Map: Gatineau, Ottawa]
Population assignment using PCCF+
[Map: Gatineau, Ottawa]
Population misassignment using PCCF-SLI & PCCF+
Comparison of population coding errors using PCCF-SLI versus PCCF+ (5J)*
Geographic Unit   PCCF-SLI (% of total population)   PCCF+ (% of total population)
DA                37.4                               7.6
CT                6.6                                1.4
CMA               4.3                                0.1
CSD               11.4                               2.7
CD                1.1                                0.3
* Population coding errors are defined as the sum over all areas at this geographic level of the absolute value of the population coded less the population known from the census sample, expressed as a percentage of the total population in all areas at this level.
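The footnote's definition translates directly into a small calculation. The sketch below is one reading of that definition; the function name and the dictionary inputs are assumptions, and the figures in the usage note are invented rather than taken from the table.

```python
def population_coding_error(coded, census):
    """Percent coding error across the areas at one geographic level.

    coded  : dict of area -> population assigned by geocoding
    census : dict of area -> population known from the census
    """
    areas = set(coded) | set(census)
    total = sum(census.values())
    # Sum of absolute gaps between coded and known population, area by area.
    absolute_gap = sum(abs(coded.get(a, 0) - census.get(a, 0)) for a in areas)
    return 100.0 * absolute_gap / total
```

As a made-up illustration, take three DAs with census populations of 600, 300 and 100. If single-link coding places all 1,000 people in the first DA, the error is (400 + 300 + 100) / 1,000 = 80%. If the coded populations match the census in every DA, the error is 0%.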
Limitation of SLI (e.g., 2001 Census Geography)
• Over a third of the total population of rural and small town Canada can never get the correct dissemination area (DA) code when using the PCCF SLI since nearly 11,000 DAs are never linked to postal codes when only the SLI is selected.
• Also at the census subdivision (CSD) level, over a quarter of all CSDs never get coded using SLI. In rural and small town Canada, nearly 30% of CSDs never get coded using the SLI.
4. Why PCCF+?
Why PCCF+ and not regular PCCF (with SLI=1)?
1. Population weighted approach
2. Supplemental coding
3. Postal codes less than perfect
4. Documentation and diagnostics
5. Modifiable SAS code
6. Vintage of postal codes
7. Postal codes used by residents for “incompletely enumerated Indian Reserves”
Why PCCF+? – 1: population weighting
• Almost all rural and several urban categories of postal code provide service to multiple dissemination areas (DAs), census subdivisions (CSDs), etc…
• Use of the single link indicator (SLI) equal to 1 in PCCF forces every occurrence of a postal code onto a single set of geocodes
• Using the single-link approach introduces systematic bias
• PCCF+ probabilistically assigns each postal code record using census-derived population weights
Why PCCF+? – 2: supplemental coding
Available in both PCCF-SLI & PCCF+:
• ID, PCODE
• PR, CD, CSD, CCSD
• CMA, CT, MIZ, ER, FED
• DA, BLK
• BLKURB*, DPL*
• LAT, LONG
Available in PCCF+ only:
• HR, AHR
• QAIPPE, IMMTER
• CSIZE, NSREL, AIRLIFT, AR
• EA81uid, EA86uid, EA91uid, EA96uid, DA01uid, DA06uid, DA11uid
* Poorly coded and not recommended for analytic use
Why PCCF+? – 3: postal codes less than perfect
• Most files will include some postal codes that never existed (reporting or data capture errors)
• Sensitive files may omit the last digit of the postal code
• Some files may only contain the first 3 characters of the postal code
• PCCF+ can be used to geocode records in all of the above cases
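Before geocoding, it can help to see how much of each reported postal code is usable. The sketch below is a hypothetical pre-check, not a PCCF+ feature: it counts how many leading characters follow the ANA NAN pattern. A well-formed code may still never have existed; only a lookup against the reference files can show that.

```python
import re

# Expected character class at each position of ANA NAN (A = letter, N = digit).
_POSITION_PATTERNS = ["[A-Z]", r"\d", "[A-Z]", r"\d", "[A-Z]", r"\d"]


def usable_length(raw):
    """Hypothetical helper: number of leading characters (0 to 6) that fit ANA NAN."""
    code = raw.replace(" ", "").upper()
    count = 0
    for char, pattern in zip(code, _POSITION_PATTERNS):
        if not re.fullmatch(pattern, char):
            break  # stop at the first character that breaks the pattern
        count += 1
    return count
```

A full code such as `K1A 0T6` gives 6, a code with the last digit withheld gives 5, and an FSA-only entry such as `K1A` gives 3.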
Why PCCF+? – 4: documentation & diagnostics
• Output is documented with user manual and version
• Method has been validated in many publications
• Diagnostic codes for problem codes are provided
• Two outputs: Full file & Problem File
• Variables shown on the slide:
• DMT, DMTDIFF
• LINK (PROB)
• SOURCE
• NCSD, NCD
• RESFLG, INSTFLG
• RPF, SERV, PREC
• BLG NAME + ADR
• CSDNAME + TYPE
• CPCCODE
• Callout on one of these variables: “This variable provides a measure of the quality of the geographic coordinates assigned to the representative point”
Why PCCF+? – 5: Modifiable SAS code
• Length of ID variable can be changed
• SAS code can be easily tweaked so results are exactly reproducible
• Define a specific kernel for probabilistic assignment
/********************************************************************************************/
/* Random Seed Value */
/* If the seed value is 0 (default) then computer time is used */
/* Change this value as desired to use the same seed between PCCF+ trials */
%let seedVal=0;
Why PCCF+? – 6: “Vintage” of postal codes
• PCCF+ assigns full census geography for the most recent census year
• It also assigns the dissemination area (DA) or enumeration area (EA) from each previous census back to 1981
• Useful for time-varying analysis
• For higher levels of vintage geography (e.g., CMA) use the Geographic Attributes File (GAF) or the Geographic Tape File (GTF)
Why PCCF+? – 7: Indian Reserves
• Your file may include postal codes used by residents of “incompletely enumerated Indian Reserves”
• These postal codes will not be coded properly by PCCF-SLI
• PCCF+ includes census population weights adjusted to account for estimates of the population living on the incompletely enumerated reserves
Summary: PCCF+ vs PCCF-SLI
• Consider using PCCF+ rather than PCCF-SLI if any of the following apply
• You want to do better coding in rural areas
• You want to use variables present on the PCCF+ which are not present in regular PCCF
• Your file is less than perfect with respect to postal codes
• You want help to evaluate the quality of the postal code on your data file
• The “vintage” of the postal codes on your file spans more than one census
• Your file includes postal codes used by residents of “incompletely enumerated Indian Reserves”
5. Limitations of the PCCF-SLI & the PCCF+
Limitations with PCCF-SLI & PCCF+
• In rural areas and at the urban fringe, probabilistic assignment leads to random misclassification of dissemination area (DA) and neighbourhood income quintiles
• Reduced ability to detect effects in rural areas
• Lower risk ratios (RRs) and risk differences (RDs) for epidemiologic studies
• This is effect modification, not confounding, so it is recommended to stratify the analysis by urban and rural
• Take care in interpreting lower effect estimates in rural versus urban areas
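A small worked calculation shows why random misclassification lowers effect estimates. The sketch makes strong simplifying assumptions that are not in the slides: two equally sized groups (for example, lowest versus highest income quintile), and the same share of each group swapped into the other, independent of outcome. The function name and the risks used are invented.

```python
def observed_effects(risk_exposed, risk_unexposed, misclassified):
    """Risk ratio and risk difference seen after random group misclassification.

    Assumes two equally sized groups and the same share `misclassified` of each
    group swapped into the other, independent of outcome.
    """
    # Each observed group is a mixture of correctly and wrongly coded members.
    seen_exposed = (1 - misclassified) * risk_exposed + misclassified * risk_unexposed
    seen_unexposed = (1 - misclassified) * risk_unexposed + misclassified * risk_exposed
    return seen_exposed / seen_unexposed, seen_exposed - seen_unexposed
```

With true risks of 0.02 and 0.01, the risk ratio is 2.0 and the risk difference is 0.01 when nobody is misclassified. If 20% of each group is swapped, the observed risk ratio falls to about 1.5 and the risk difference to about 0.006, even though the underlying risks have not changed.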
Limitations with PCCF and PCCF+
• Postal codes may change over time
1. Many technical changes to address ranges
• Usually no change at block-face or block level
• Very little change at higher levels
2. Some reuse of retired postal codes within the same FSA
3. Two FSAs in British Columbia moved in the mid-1990s
• Generally, these changes translate to
• no change of the block face (BF) or dissemination block (DB) latitude/longitude
• very little change at higher levels (dissemination area (DA), census tract (CT), etc.)
• Moral – code as received and interpret the output
Concluding remarks
• Small-area geography & spatial coordinates are part of most data sets and useful in most studies
• Familiarity with methods, limitations, and interpretation of data helps researchers more meaningfully exploit data potential
• It is not enough to use the data mechanically; users need to think about what they are doing and why
• Consult the PCCF+ documentation
Thank you!
• Acknowledgments
• Russell Wilkins (retired), Paul A Peters (University of New Brunswick) & Michael Tjepkema (Health Analysis Division)
• For more information please contact: