Illustration of spatial linkage between electronic healthcare records and data from the US Census
Bureau to incorporate socioeconomic status information in epidemiologic studies
Rishi J Desai,1 Chandrasekar Gopalakrishnan,1 Sara Dejene,1 Joshua J Gagne1
1 Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and
Women’s Hospital and Harvard Medical School, Boston, MA
Correspondence
Rishi J. Desai, M.S., Ph.D.
Division of Pharmacoepidemiology and Pharmacoeconomics,
Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School,
1620 Tremont Street, Suite 3030-R, Boston, MA 02120, USA
Phone: 617-278-0930 | Fax: 617-232-8602
Email: [email protected]
Words: 1,151
Figures: 2
Background and objective
Large healthcare databases are increasingly being used to conduct studies of the comparative
effectiveness and safety of drug treatments. To draw reliable inferences about drug effects on health
outcomes in such studies, it is important to account for a variety of patient characteristics, including
demographics, health status, and socioeconomic status. These characteristics can affect treatment
selection as well as the risk of health outcomes and can therefore confound the drug-outcome
relation under investigation. While the data sources most commonly used for these studies (electronic
medical records (EMRs) and health insurance claims) capture patient demographics and health services
utilization information,1 they typically do not contain data on socioeconomic variables. In this
manuscript, we outline the steps for linking data from the American Community Survey (ACS), informally
known as the census, which is an ongoing survey administered by the US Census Bureau that captures
data on socioeconomic factors, such as income, education, and unemployment,2 with patient-level
medical records data using spatial linkage methods.
Case study
To reflect a typical pharmacoepidemiologic investigation using EMR data, a cohort of patients
hospitalized with acute myocardial infarction (AMI) and prescribed either atorvastatin or rosuvastatin
for secondary prevention of cardiovascular events at two large academic hospitals in Boston, MA was
identified. Patients’ street-level addresses were retrieved from their EMRs for linkage with data from
the ACS and the following steps were taken. This process assumes that a patient cohort has been
identified and researchers have access to data on patient addresses to perform the spatial merge.
Step 1- Processing of area-level socioeconomic data
Data on socioeconomic variables are collected annually by the US Census Bureau as a part of
their Decennial Census Program. Each year, the ACS randomly samples approximately 3.5 million
addresses in a way that is representative of the US population and produces area-level statistics that
cover 1-year, 3-year, and 5-year periods for neighborhoods across the US. These data are freely
available online (http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml). Using either the
‘Advanced Search’ menu or the ‘Download Center’ tab on this website, users can specify the
socioeconomic variables of interest and download summary statistics for each variable at the given
geographic unit of interest. Census tracts, census block groups and ZIP codes are some examples of
geographic units for which aggregate data are available. Census block group is the most granular
geographic unit at which aggregate statistics are available. According to the 2010 tally, within the US
there are 73,057 census tracts, each containing 1,200 to 8,000 individuals and 217,740 block groups,
each containing 600 to 3,000 individuals.3 We downloaded data from the 2015 ACS for the median
household income at the census tract level for all counties of the state of Massachusetts (MA) for this
exercise. Examples of other relevant socioeconomic variables available in the ACS include: percentage of
unemployment; percentage below the poverty line; median value of owner-occupied homes;
percentage without a high school diploma or general education development (GED) certificate;
percentage with a Bachelor’s degree or higher; and percentage living in crowded residences. In the
downloaded dataset, one tract is represented in each row and identified with a unique identifier (11 or
12-digit string ‘Geo.id2’).
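As a rough illustration of the file layout described above (one row per tract, keyed by an 11- or 12-digit identifier), the following Python sketch reads such an extract. Only the column name `Geo.id2` comes from the text; the column name `median_income`, the sample rows, and the rule for treating non-numeric cells as missing are assumptions made for illustration. Identifiers are kept as strings so that leading zeros are not lost.

```python
import csv
import io

# Hypothetical extract: only the 'Geo.id2' column name comes from the text;
# 'median_income' and all values below are made up for illustration.
SAMPLE_ACS_CSV = """Geo.id2,median_income
25025000100,61250
25025000201,-
01001020100,48500
"""

def read_acs_rows(handle, id_col="Geo.id2", value_col="median_income"):
    """Return {tract_id: income or None}, keeping tract ids as strings."""
    incomes = {}
    for row in csv.DictReader(handle):
        tract_id = row[id_col].strip()
        # 11 digits identify a census tract, 12 digits a block group.
        if len(tract_id) not in (11, 12) or not tract_id.isdigit():
            raise ValueError(f"unexpected identifier: {tract_id!r}")
        raw = row[value_col].strip()
        # Assumption: any cell that is not a plain number is treated as missing.
        is_number = raw.replace(".", "", 1).isdigit()
        incomes[tract_id] = float(raw) if is_number else None
    return incomes

acs = read_acs_rows(io.StringIO(SAMPLE_ACS_CSV))
```

Reading the identifier as text rather than as a number matters mainly for states whose codes begin with zero; Massachusetts identifiers begin with 25 and are unaffected either way.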
To spatially link this file with patient-level data, information on geographic location of each tract is
required. This information is available in TIGER/Line files, which are shapefiles available for download on
the Census Bureau’s website (http://www.census.gov/geo/maps-data/data/tiger-line.html) and can be
opened in any geographic information system (GIS) software. These files contain data on coordinates
encompassing the entire area covered by a geographic unit of interest (e.g., census tract) and are
linkable with the data from US censuses and surveys through a common identifier (‘Geoid’, linkable to
‘Geo.id2’).
In this example, we used the GIS software ArcMap (Version 10.4.1, Environmental Systems Research
Institute, Redlands, CA) to read in the TIGER/Line shapefiles and perform a tabular join with the ACS
data. Please refer to the Appendix for details of this procedure.
Step 2- Processing and geocoding of patient-level data
A cohort of 1,839 patients treated with either atorvastatin or rosuvastatin after an AMI
hospitalization episode was identified. Street-level addresses were extracted from patients’ EMRs after
receiving Institutional Review Board (IRB) approval from the Brigham and Women’s Hospital. Based on
the geocoding strategy outlined by Goldstein et al.4, patient addresses were assigned spatial coordinates
using the R packages RCurl and RJSONIO (R version 3.2.3). Briefly, these packages are used to query Google Maps
application programming interfaces (APIs), which geocode the specified addresses using proprietary matching
algorithms. For additional information on this approach and the R code, we refer the reader to the research
letter by Goldstein et al.4 GIS software packages (including ArcMap) also have the capability of
geocoding, but we opted for the strategy using R because of its ease of application and high matching
rate.
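The request and response cycle behind this kind of geocoding can be sketched as follows. This is a Python illustration, not the R code used in the study or published by Goldstein et al. It assumes the JSON endpoint of the Google Geocoding API and an API key; the address, the key placeholder, and the trimmed response body are made up, and no network call is performed.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Google Geocoding API (JSON output).
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address, api_key):
    """Build the request URL for one street address."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})

def parse_geocode_response(body):
    """Return (latitude, longitude) of the first match, or None if unmatched."""
    payload = json.loads(body)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Trimmed, made-up response body used in place of a live request.
SAMPLE_BODY = json.dumps({
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 42.3359, "lng": -71.1073}}}],
})
NO_MATCH_BODY = json.dumps({"status": "ZERO_RESULTS", "results": []})

url = build_geocode_url("1 Example Street, Boston, MA", "YOUR_API_KEY")
coords = parse_geocode_response(SAMPLE_BODY)
unmatched = parse_geocode_response(NO_MATCH_BODY)
```

Returning None for unmatched addresses keeps them visible, so the matching rate can be computed before the spatial join.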
Step 3- Spatially linking area-level socioeconomic data with patient-level data
In the last step, datasets obtained from Step 1 (ACS data merged with TIGER/Line shapefiles
containing coordinates corresponding to each census tract in MA) and Step 2 (patient data containing
coordinates corresponding to their residential addresses) are spatially linked in ArcMap (Version 10.4.1,
Environmental Systems Research Institute, Redlands, CA). Briefly, this spatial join assigns the value of
specified area-level socioeconomic variables to a particular patient based on the census tract of his/her
residential address. Please refer to the Appendix for details on how to perform this join.
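Conceptually, the spatial join is a point-in-polygon lookup: each patient coordinate is tested against tract boundaries, and the attributes of the containing tract are copied to the patient. The sketch below uses a simple ray-casting test in Python to show the idea. It is not how ArcMap implements the join, and the square "tracts", income values, and patient points are all made up. Real tract boundaries may have multiple parts and holes, which a GIS handles and this sketch does not.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: is point (x, y) inside the polygon given by `ring`?

    `ring` is a list of (x, y) vertices; the last vertex connects to the first.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def assign_tract(lon, lat, tracts):
    """Return the id of the first tract polygon containing the point, else None."""
    for tract_id, ring in tracts.items():
        if point_in_polygon(lon, lat, ring):
            return tract_id
    return None

# Two hypothetical square "tracts" side by side (coordinates are made up).
TRACTS = {
    "tract_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "tract_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
INCOME = {"tract_A": 50000, "tract_B": 90000}  # made-up area-level values

patients = [("p1", 0.5, 0.5), ("p2", 1.5, 0.25), ("p3", 3.0, 3.0)]
linked = {pid: INCOME.get(assign_tract(lon, lat, TRACTS)) for pid, lon, lat in patients}
```

A patient whose point falls outside every polygon (here `p3`) receives no area-level value, which mirrors patients living outside the downloaded geography.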
In Figure 1, we illustrate the results of this linkage strategy on a map of Massachusetts. Census
tracts are marked by borders within the map and points represent residential addresses of individual
patients in our example cohort. Census tracts are color coded by quintile of median household
income. Points are color coded based on an example outcome of 1-year
mortality as recorded in the patients’ EMRs (dead or alive). As shown on the map, each patient was
assigned to one of five categories serving as a proxy for their income status based on the median
household income of the census tract in which they live. In Figure 2, crude estimates of 1-year mortality
proportions are plotted in each of the five income categories. These data suggest a lower mortality
proportion (14%) in the quintile with highest median household income and higher proportions in lower
quintiles (19-23%), indicating a crude association between income status and mortality. While this
example is not designed to draw inference about this association, it serves as a good illustration of how
aggregate-level socioeconomic variables can be defined in patient cohorts using spatial linkage methods.
These variables can then be used as explanatory variables in predictive modeling or as confounders for
risk adjustment. Another application of such linkage methods is monitoring of healthcare disparities by
demographic characteristics based on aggregate-level socioeconomic data as demonstrated in the Public
Health Disparities Geocoding Project.5
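The categorization and the crude proportions described for Figures 1 and 2 can be sketched as follows in Python. The text does not state how the quintile cut points were computed, so this sketch assumes they are taken over the linked income values of the patients; the toy (income, died) pairs are made up and do not reproduce the study results.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import quantiles

def quintile_cutpoints(values):
    """Four cut points that split `values` into five groups."""
    return quantiles(values, n=5)

def income_category(income, cutpoints):
    """Map an income to a category from 1 (lowest) to 5 (highest)."""
    return bisect_left(cutpoints, income) + 1

def crude_mortality_by_category(patients, cutpoints):
    """Return {category: deaths / patients} for (income, died) pairs."""
    deaths, totals = defaultdict(int), defaultdict(int)
    for income, died in patients:
        category = income_category(income, cutpoints)
        totals[category] += 1
        deaths[category] += int(died)
    return {category: deaths[category] / totals[category] for category in sorted(totals)}

# Made-up toy data: (tract median household income, died within 1 year).
PATIENTS = [
    (10000, True), (20000, False), (30000, False),
    (40000, False), (50000, True), (60000, False),
    (70000, False), (80000, False), (90000, False),
]

cutpoints = quintile_cutpoints([income for income, _ in PATIENTS])
rates = crude_mortality_by_category(PATIENTS, cutpoints)
```

The resulting category can be carried into a regression model as a covariate in the same way as any other patient-level variable.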
Conclusion and recommendations
Large-scale electronic healthcare data are becoming widely available and are increasingly being used
to answer clinical research questions. The unavailability of socioeconomic data, a major limitation of
these databases, can be addressed by using aggregate-level data recorded by the US Census Bureau
through the spatial linkage methods outlined in this report.
Figure legends
Figure 1- Illustration of spatial linkage of geocoded patient addresses with US census data
Figure 2- Crude estimates of 1-year mortality proportions in each of the five income categories
References
1. Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic
research on therapeutics. J Clin Epidemiol. Apr 2005;58(4):323-337.
2. United States Census Bureau. ACS summary file technical documentation. Available at
http://www2.census.gov/programs-surveys/acs/summary_file/2013/documentation/tech_docs/2013_SummaryFile_Tech_Doc.pdf.
3. US Census Bureau. Geographic Terms and Concepts, available from
https://www.census.gov/geo/reference/gtc/gtc_bg.html, accessed 1-17-2017.
4. Goldstein ND, Auchincloss AH, Lee BK. A no-cost geocoding strategy using R. Epidemiology
(Cambridge, Mass.). 2014;25(2):311-313.
5. Krieger N, Chen JT, Waterman PD, Rehkopf DH, Subramanian SV. Race/ethnicity, gender, and
monitoring socioeconomic gradients in health: a comparison of area-based socioeconomic
measures--the public health disparities geocoding project. American journal of public health.
Oct 2003;93(10):1655-1671.
Figure 1 map legend:
Survival at 1-year follow-up: Alive (1,496); Dead (343)
Median annual household income in census tract ($): 13488 - 53485; 53486 - 70057; 70058 - 83393; 83394 - 102434; 102435 - 226181
Census tracts contributing no patients in the study
Appendix
Tabular join between area-level socioeconomic data from ACS and TIGER/Line files in ArcMap (Step 1)
1. First, both the ACS file (saved as csv or excel) and the TIGER/Line shapefile must be added to ArcMap using the
File > Add Data > Add Data function (or the Add Data button on the toolbar).
2. To properly execute the join, we need to convert the type of the ‘Geoid’ variable in the TIGER/Line shapefile from
‘String’ to ‘Double’. To do this, right click the TIGER/Line shapefile > Open Attribute Table > Add Field (found
under the white button in the top left corner). Select ‘Double’ as the type of variable you want to add and name the
variable (named here ‘id_for_mer’); a new column will be added. Right click on the new column and select Field
Calculator. In the Fields box, find and double click the ‘Geoid’ variable and click OK. This populates the new column
with the identifier stored in the proper format to execute the tabular join with the ACS data.
3. Next, right click the TIGER/Line shapefile > Joins and Relates > Join, and select the join options on the ‘Join Data’
screen as demonstrated in the following screenshot. This step will add the ACS columns to the TIGER/Line
shapefile.
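The purpose of the type conversion in step 2 is that both join keys must have the same type before rows can be matched. Outside ArcMap, an alternative is to normalize both keys to zero-padded strings instead of numbers. The Python sketch below shows that alternative; it is not the procedure used in this report. The helper names, the `median_income` column, and the sample rows are hypothetical, and the width of 11 assumes tract-level identifiers.

```python
def normalize_geoid(value, width=11):
    """Return a geographic identifier as a zero-padded digit string."""
    if isinstance(value, float):
        if not value.is_integer():
            raise ValueError(f"non-integer identifier: {value!r}")
        value = int(value)
    text = str(value).strip()
    if not text.isdigit():
        raise ValueError(f"identifier is not numeric: {value!r}")
    return text.zfill(width)

def tabular_join(shape_rows, acs_rows, shape_key="Geoid", acs_key="Geo.id2"):
    """Left-join ACS attributes onto shapefile attribute rows by identifier."""
    acs_by_id = {normalize_geoid(row[acs_key]): row for row in acs_rows}
    joined = []
    for row in shape_rows:
        match = acs_by_id.get(normalize_geoid(row[shape_key]), {})
        extra = {k: v for k, v in match.items() if k != acs_key}
        joined.append({**row, **extra})
    return joined

# Made-up rows: the shapefile stores the key as text, the ACS file as numbers.
SHAPE_ROWS = [{"Geoid": "25025000100"}, {"Geoid": "01001020100"}, {"Geoid": "25025999999"}]
ACS_ROWS = [
    {"Geo.id2": 25025000100.0, "median_income": 61250},
    {"Geo.id2": 1001020100, "median_income": 48500},
]

joined = tabular_join(SHAPE_ROWS, ACS_ROWS)
```

Rows without a match keep only their shapefile attributes, which makes unmatched tracts easy to count after the join.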
Spatial join between TIGER/Line files with ACS data (from Step 1) and patient-level geocoded address data (Step 3)
1. First, patient address data geocoded using R containing longitude and latitude information (saved as csv or
excel) must be added to ArcMap using the File > Add Data > Add Data function (or the Add Data button on the
toolbar). We then need to specify which numeric columns in this Excel file correspond to geographic coordinates.
To do so, right click the imported file > ‘Display XY Data’ and select the options demonstrated in the following
screenshot. As the original geocoded data are stored and imported as an Excel file, the resulting file does not
contain a unique ArcMap identifier variable (‘Object ID’) and therefore is not ready for a spatial join. To address
this, we first need to save this file as a shapefile by right clicking this file > Data > Export Data and add the
resulting shapefile as a new layer in ArcMap.
2. Finally, we can perform a Spatial Join between this new layer and the TIGER/Line file prepared for merge in Step
1 by right clicking this layer > Joins and Relates > Join and selecting the following join options on the ‘Join
Data’ screen. This step will result in a final dataset in which the value of the specified area-level
socioeconomic variables derived from ACS files is assigned to each patient based on the census tract of his/her
residential address.
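A common pitfall when declaring coordinate columns, as in step 1 above, is swapping them: X corresponds to longitude and Y to latitude. The Python sketch below reads geocoded rows into (x, y) points and sets aside rows that are missing or fall outside a rough bounding box. The column names, sample rows, and bounding box values are assumptions chosen for illustration (the box loosely surrounds Massachusetts, the only state for which tract data were downloaded in this example).

```python
import csv
import io

# Hypothetical geocoder output; column names and values are made up.
SAMPLE_GEOCODED_CSV = """patient_id,latitude,longitude
p1,42.3359,-71.1073
p2,-71.0589,42.3601
p3,,
"""

# Rough, assumed bounding box around the study area (degrees).
LAT_RANGE = (41.0, 43.0)
LON_RANGE = (-73.6, -69.8)

def read_points(handle):
    """Return ({patient_id: (x, y)}, [rejected ids]) with x=longitude, y=latitude."""
    points, rejected = {}, []
    for row in csv.DictReader(handle):
        try:
            lat, lon = float(row["latitude"]), float(row["longitude"])
        except ValueError:
            rejected.append(row["patient_id"])  # address was not geocoded
            continue
        in_box = LAT_RANGE[0] <= lat <= LAT_RANGE[1] and LON_RANGE[0] <= lon <= LON_RANGE[1]
        if not in_box:
            rejected.append(row["patient_id"])  # swapped columns or outside study area
            continue
        points[row["patient_id"]] = (lon, lat)
    return points, rejected

points, rejected = read_points(io.StringIO(SAMPLE_GEOCODED_CSV))
```

Reviewing the rejected rows before the spatial join helps separate geocoding failures from patients who simply live outside the downloaded geography.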
This dataset can be exported as an Excel file by clicking ArcToolbox > Conversion Tools > Excel > Table To Excel
and then read into your software of choice (e.g., R or SAS) for running analytic models.