Illustration of spatial linkage between electronic healthcare records and data from the US Census

Bureau to incorporate socioeconomic status information in epidemiologic studies

Rishi J Desai,1 Chandrasekar Gopalakrishnan, 1 Sara Dejene, 1 Joshua J Gagne 1

1 Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and

Women’s Hospital and Harvard Medical School, Boston, MA

Correspondence

Rishi J. Desai, M.S., Ph.D.

Division of Pharmacoepidemiology and Pharmacoeconomics,

Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School,

1620 Tremont Street, Suite 3030-R, Boston, MA 02120, USA

Phone: 617-278-0930 | Fax: 617-232-8602

Email: [email protected]

Words: 1,151

Figures: 2

Background and objective

Large healthcare databases are increasingly being used to conduct studies of the comparative

effectiveness and safety of drug treatments. To draw reliable inferences about drug effects on health

outcomes, such studies must account for a variety of patient characteristics, including demographics,

health status, and socioeconomic status. These characteristics can affect both treatment selection and

the risk of health outcomes, and can therefore confound the drug-outcome relation under

investigation. While the data sources most commonly used for these studies (electronic

medical records (EMRs) and health insurance claims) capture patient demographics and health services

utilization information,1 they typically do not contain data on socioeconomic variables. In this

manuscript, we outline the steps for linking data from the American Community Survey (ACS), informally

known as the census, which is an ongoing survey administered by the US Census Bureau that captures

data on socioeconomic factors, such as income, education, and unemployment,2 with patient-level

medical records data using spatial linkage methods.

Case study

To reflect a typical pharmacoepidemiologic investigation using EMR data, a cohort of patients

hospitalized with acute myocardial infarction (AMI) and prescribed either atorvastatin or rosuvastatin

for secondary prevention of cardiovascular events at two large academic hospitals in Boston, MA was

identified. Patients’ street-level addresses were retrieved from their EMRs for linkage with data from

the ACS and the following steps were taken. This process assumes that a patient cohort has been

identified and researchers have access to data on patient addresses to perform the spatial merge.

Step 1- Processing of area-level socioeconomic data

Data on socioeconomic variables are collected annually by the US Census Bureau as a part of

their Decennial Census Program. Each year, the ACS randomly samples approximately 3.5 million

addresses in a way that is representative of the US population and produces area-level statistics that

cover 1-year, 3-year, and 5-year periods for neighborhoods across the US. These data are freely

available online (http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml). Using either the

‘Advanced Search’ menu or the ‘Download Center’ tab on this website, users can specify the

socioeconomic variables of interest and download summary statistics for each variable at the given

geographic unit of interest. Census tracts, census block groups and ZIP codes are some examples of

geographic units for which aggregate data are available. Census block group is the most granular

geographic unit at which aggregate statistics are available. According to the 2010 tally, within the US

there are 73,057 census tracts, each containing 1,200 to 8,000 individuals and 217,740 block groups,

each containing 600 to 3,000 individuals.3 We downloaded data from the 2015 ACS for the median

household income at the census tract level for all counties of the state of Massachusetts (MA) for this

exercise. Examples of other relevant socioeconomic variables available in the ACS include: percentage of

unemployment; percentage below the poverty line; median value of owner-occupied homes;

percentage without a high school diploma or general education development (GED) certificate;

percentage with a Bachelor’s degree or higher; and percentage living in crowded residences. In the

downloaded dataset, one tract is represented in each row and identified with a unique identifier (11 or

12-digit string ‘Geo.id2’).
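As a rough illustration of how these identifiers are structured, the sketch below splits one into its components. It assumes the standard Census Bureau layout (2-digit state FIPS code, 3-digit county code, 6-digit tract code and, for block groups, 1 additional digit). The function name `parse_geoid` and the example identifier are hypothetical and not part of the workflow described here.

```python
from typing import NamedTuple, Optional


class GeoId(NamedTuple):
    state: str                  # 2-digit state FIPS code
    county: str                 # 3-digit county code
    tract: str                  # 6-digit census tract code
    block_group: Optional[str]  # 1 digit, present only in 12-digit identifiers


def parse_geoid(geoid: str) -> GeoId:
    """Split an 11-digit (tract) or 12-digit (block group) identifier."""
    geoid = geoid.strip()
    if not geoid.isdigit() or len(geoid) not in (11, 12):
        raise ValueError(f"expected an 11- or 12-digit string, got {geoid!r}")
    block_group = geoid[11] if len(geoid) == 12 else None
    return GeoId(geoid[:2], geoid[2:5], geoid[5:11], block_group)


# Hypothetical tract identifier; 25 is the state FIPS code for Massachusetts.
example = parse_geoid("25025010405")
print(example)
```

Keeping the identifier as a string preserves leading zeros, which matters for states whose FIPS code begins with 0.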

To spatially link this file with patient-level data, information on geographic location of each tract is

required. This information is available in TIGER/Line files, which are shapefiles available for download on

the Census Bureau’s website (http://www.census.gov/geo/maps-data/data/tiger-line.html) and can be

opened in any geographic information system (GIS) software. These files contain data on coordinates

encompassing the entire area covered by a geographic unit of interest (e.g., census tract) and are

linkable with the data from US censuses and surveys through a common identifier (‘Geoid’, linkable to

‘Geo.id2’).

In this example, we used the GIS software ArcMap (Version 10.4.1, Environmental Systems Research

Institute, Redlands, CA) to read in the TIGER/Line shapefiles and perform a tabular join with the ACS

data. Please refer to the Appendix for details of this procedure.
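Outside ArcMap, the logic of this tabular join can be sketched in a few lines of Python. The following is a minimal illustration, not the procedure used in this report: it assumes the ACS rows and the shapefile attribute rows have already been read into lists of dictionaries, and the column name `median_income`, the helpers `normalize_key` and `tabular_join`, and the toy values are all hypothetical. Keys are compared as zero-padded strings so that identifiers stored as numbers in one file (which lose leading zeros) still match.

```python
def normalize_key(value, width=11):
    """Return a tract identifier as a zero-padded string of fixed width."""
    return str(int(float(value))).zfill(width)


def tabular_join(tracts, acs_rows, tract_key="Geoid", acs_key="Geo.id2"):
    """Left-join ACS attributes onto tract records by their shared identifier."""
    acs_by_id = {normalize_key(row[acs_key]): row for row in acs_rows}
    joined = []
    for tract in tracts:
        match = acs_by_id.get(normalize_key(tract[tract_key]), {})
        extra = {k: v for k, v in match.items() if k != acs_key}
        joined.append({**tract, **extra})
    return joined


# Toy records (hypothetical values, not data from the study).
tracts = [{"Geoid": "25025010405"}, {"Geoid": "25025010500"}]
acs_rows = [{"Geo.id2": 25025010405, "median_income": 61000}]

for row in tabular_join(tracts, acs_rows):
    print(row)
```

Tracts without a matching ACS row are kept unchanged, which makes unmatched identifiers easy to spot before the spatial join.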

Step 2- Processing and geocoding of patient-level data

A cohort of 1,839 patients treated with either atorvastatin or rosuvastatin after an AMI

hospitalization episode was identified. Street-level addresses were extracted from patients’ EMRs after

receiving Institutional Review Board (IRB) approval from the Brigham and Women’s Hospital. Based on

the geocoding strategy outlined by Goldstein et al.4, patient addresses were assigned spatial coordinates

using the R packages, RCurl and RJSONIO (R version 3.2.3). Briefly, these packages use Google Maps

application programming interfaces (APIs) to geocode specified addresses using proprietary matching

algorithms. For additional information on this approach and R code, we refer the reader to the research

letter by Goldstein et al.4 GIS software packages (including ArcMap) also have the capability of

geocoding, but we opted for the strategy using R because of its ease of application and high matching

rate.
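To make the geocoding step concrete, the sketch below shows in Python (rather than the R code of Goldstein et al.) how a request to the Google Maps Geocoding API can be formed and how coordinates can be read from its JSON response. It is a minimal sketch under stated assumptions: the endpoint and response layout follow Google's public documentation as we understand it, the API key and address are placeholders, and the sample response is hand-written for illustration. No network call is made.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Google Maps Geocoding API (JSON output).
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL for one street address."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def extract_coordinates(response_text):
    """Return (latitude, longitude) from a JSON response, or None if unmatched."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample response (illustrative coordinates, not study data).
sample = json.dumps({
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 42.3358, "lng": -71.1066}}}],
})

# Placeholder address and key, for illustration only.
print(build_geocode_url("1 Example St, Boston, MA", "YOUR_API_KEY"))
print(extract_coordinates(sample))
```

Returning `None` for unmatched addresses keeps them in the cohort file, so the matching rate can be computed before the spatial join.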

Step 3- Spatially linking area-level socioeconomic data with patient-level data

In the last step, datasets obtained from Step 1 (ACS data merged with TIGER/Line shapefiles

containing coordinates corresponding to each census tract in MA) and Step 2 (patient data containing

coordinates corresponding to their residential addresses) are spatially linked in ArcMap (Version 10.4.1,

Environmental Systems Research Institute, Redlands, CA). Briefly, this spatial join assigns the value of

specified area-level socioeconomic variables to a particular patient based on the census tract of his/her

residential address. Please refer to the Appendix for details on how to perform this join.
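Conceptually, the spatial join is a point-in-polygon test: each patient's coordinates are checked against the tract boundaries, and the attributes of the containing tract are copied to the patient record. The sketch below illustrates this with a simple ray-casting test in Python. It is not ArcMap's implementation; it treats coordinates as planar and ignores holes and multi-part polygons, and the square "tracts", income values, and patient points are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a rightward ray crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(patients, tracts):
    """Attach the median income of the containing tract to each patient."""
    joined = []
    for patient in patients:
        income = None
        for tract in tracts:
            if point_in_polygon(patient["lon"], patient["lat"], tract["boundary"]):
                income = tract["median_income"]
                break
        joined.append({**patient, "median_income": income})
    return joined


# Two adjacent unit squares standing in for census tracts (toy data).
tracts = [
    {"geoid": "A", "median_income": 48000,
     "boundary": [(0, 0), (1, 0), (1, 1), (0, 1)]},
    {"geoid": "B", "median_income": 95000,
     "boundary": [(1, 0), (2, 0), (2, 1), (1, 1)]},
]
patients = [
    {"id": 1, "lon": 0.5, "lat": 0.5},
    {"id": 2, "lon": 1.5, "lat": 0.2},
    {"id": 3, "lon": 3.0, "lat": 3.0},  # falls outside every tract
]

for row in spatial_join(patients, tracts):
    print(row)
```

Patients whose coordinates fall outside every tract receive no income value, which flags addresses that were geocoded outside the study area.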

In Figure 1, we illustrate the results of this linkage strategy on a map of Massachusetts. Census

tracts are marked by borders within the map and points represent residential addresses of individual

patients in our example cohort. Census tracts are color coded by stratifying them into

median household income quintiles. Points are color coded based on an example outcome of 1-year

mortality as recorded in the patients’ EMRs (dead or alive). As shown on the map, a patient was

assigned to one of five categories serving as a proxy of their income status based on the median

household income of the census tract in which they live. In Figure 2, crude estimates of 1-year mortality

proportions are plotted in each of the five income categories. These data suggest a lower mortality

proportion (14%) in the quintile with highest median household income and higher proportions in lower

quintiles (19-23%), indicating a crude association between income status and mortality. While this

example is not designed to draw inference about this association, it serves as a good illustration of how

aggregate-level socioeconomic variables can be defined in patient cohorts using spatial linkage methods.

These variables can then be used as explanatory variables in predictive modeling or as confounders for

risk adjustment. Another application of such linkage methods is monitoring of healthcare disparities by

demographic characteristics based on aggregate-level socioeconomic data as demonstrated in the Public

Health Disparities Geocoding Project.5
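The income categorization and crude mortality summary described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: cut points are computed from the supplied tract-level incomes with the default method of `statistics.quantiles`, which may differ from how the map categories in Figure 1 were derived; all records are toy values, not study data; and the helper names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import quantiles


def income_quintile(income, cut_points):
    """Map an income to a category from 1 (lowest) to 5 (highest)."""
    return bisect_left(cut_points, income) + 1


def mortality_by_quintile(patients, cut_points):
    """Crude proportion of deaths within each income category."""
    deaths = defaultdict(int)
    totals = defaultdict(int)
    for patient in patients:
        q = income_quintile(patient["median_income"], cut_points)
        totals[q] += 1
        deaths[q] += patient["died"]
    return {q: deaths[q] / totals[q] for q in sorted(totals)}


# Toy tract-level incomes and patient records (hypothetical values).
tract_incomes = [30000, 45000, 52000, 60000, 68000,
                 75000, 82000, 90000, 110000, 150000]
cut_points = quantiles(tract_incomes, n=5)  # four cut points -> five categories

patients = [
    {"median_income": 30000, "died": 1},
    {"median_income": 45000, "died": 0},
    {"median_income": 68000, "died": 1},
    {"median_income": 68000, "died": 0},
    {"median_income": 150000, "died": 0},
]

print(cut_points)
print(mortality_by_quintile(patients, cut_points))
```

The resulting category can be carried forward as a covariate for risk adjustment, in the same way as any other patient-level variable.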

Conclusion and recommendations

Large scale electronic healthcare data are becoming widely available and increasingly being used

to answer clinical research questions. The unavailability of socioeconomic data, a major limitation of

these databases, can be addressed by using aggregate level data recorded by the US Census Bureau

through spatial linkage methods outlined in this report.

Figure legends

Figure 1- Illustration of spatial linkage of geocoded patient addresses with US census data

Figure 2- Crude estimates of 1-year mortality proportions in each of the five income categories

References

1. Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic

research on therapeutics. J Clin Epidemiol. Apr 2005;58(4):323-337.

2. United States Census Bureau. ACS summary file technical documentation. Available at

http://www2.census.gov/programs-surveys/acs/summary_file/2013/documentation/tech_docs/2013_SummaryFile_Tech_Doc.pdf.

3. US Census Bureau. Geographic Terms and Concepts, available from

https://www.census.gov/geo/reference/gtc/gtc_bg.html, accessed 1-17-2017.

4. Goldstein ND, Auchincloss AH, Lee BK. A no-cost geocoding strategy using R. Epidemiology

(Cambridge, Mass.). 2014;25(2):311-313.

5. Krieger N, Chen JT, Waterman PD, Rehkopf DH, Subramanian SV. Race/ethnicity, gender, and

monitoring socioeconomic gradients in health: a comparison of area-based socioeconomic

measures--the public health disparities geocoding project. American journal of public health.

Oct 2003;93(10):1655-1671.

Figure 1 map legend. Points: survival at 1-year follow-up, Alive (1,496) or Dead (343). Census tract shading: median annual household income in census tract ($), in five categories: 13488 - 53485; 53486 - 70057; 70058 - 83393; 83394 - 102434; 102435 - 226181. A separate category marks census tracts contributing no patients in the study.

Appendix

Tabular join between area-level socioeconomic data from ACS and TIGER/Line files in ArcMap (Step 1)

1. First, both the ACS file (saved as csv or Excel) and the TIGER/Line shapefile must be added to ArcMap using the File →

Add Data → Add Data function (or the corresponding button on the toolbar).

2. To properly execute the join, we need to convert the type of the ‘Geoid’ variable in the TIGER/Line shapefile from

‘String’ to ‘Double’. To do this, right click the TIGER/Line shapefile → Open Attribute Table → Add Field (found

under the white button in the top left corner). Select ‘Double’ as the type of the variable you want to add and name the

variable (named here ‘id_for_mer’); a new column will be added. Right click on the new column and select Field

Calculator. In the Fields box, find and double click the ‘Geoid’ variable and click OK. This populates the new column in the

proper format to execute the tabular join with the ACS data.

3. Next, right click the TIGER/Line shapefile → Joins and Relates → Join, and select join options on the ‘Join Data’

screen as demonstrated in the following screenshot. This step will add the ACS columns to the TIGER/Line

shapefile.

Spatial join between TIGER/Line files with ACS data (from Step 1) and patient-level geocoded address data (Step 3)

1. First, patient address data geocoded using R and containing longitude and latitude information (saved as csv or

Excel) must be added to ArcMap using the File → Add Data → Add Data function (or the corresponding button on the toolbar). We

then need to specify which numeric columns in this file correspond to geographic coordinates. To do so,

right click the imported file → ‘Display XY Data’ and select the options demonstrated in the following screenshot.

As the original geocoded data are stored and imported as an Excel file, the resulting layer does not contain a

unique ArcMap identifier variable (‘Object ID’) and therefore is not ready for a spatial join. To address this, we first

need to save this layer as a shapefile by right clicking it → Data → Export Data, and then add the resulting

shapefile as a new layer in ArcMap.

2. Finally, we can perform a spatial join between this new layer and the TIGER/Line file prepared for merging in Step

1 by right clicking this layer → Joins and Relates → Join and selecting the join options on the ‘Join

Data’ screen. This step produces a final dataset in which each patient is assigned the values of the specified area-level

socioeconomic variables derived from the ACS files, based on the census tract of his/her

residential address.

This dataset can be exported to Excel by clicking on ArcToolbox → Conversion Tools → Excel → Table To Excel

and then read into the software of your choice (e.g., R or SAS) for running analytic models.