gis in public health research: understanding spatial analysis and interpreting outcomes 1-31-14

GIS in Public Health Research: Understanding Spatial Analysis & Interpreting Outcomes

Kristin Osiecki, PhD

Houston Aerosol Characterization & Health Experiment (HACHE)

• UT Health Science Center School of Biomedical Informatics

• University of Houston Department of Earth and Atmospheric Sciences

• Rice University Department of Sociology and Department of Civil & Environmental Engineering

Applications in Public Health Research

• Space matters – communities, census tracts, counties, states

• Multidisciplinary and Interdisciplinary

• Collaborative

• Simple and Complex Models

What research questions are we trying to answer?

• Do we need visualizations or maps?

OR

• Are we interested in investigating possible spatial relationships within the data?

ArcGIS Toolbox

Handyman’s Dream or Do-it-yourself nightmare?

Objectives

• Traditional Statistics & Spatial Analysis

• Permutations

• Spatial Weights

• EDA & ESDA

"Spatial Statistics" does not mean applying traditional (non-spatial) statistical methods to data that just happens to be spatial (has X and Y coordinates).

Source: ESRI, http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_tools/how_generate_spatial_weights_matrix_spatial_statistics_works.htm

Traditional Statistical Methodology

Spatial Methodology

Spatial Analysis

Global Model

Local Model

EDA ESDA

Global & Local

Global autocorrelation Local autocorrelation

The most crucial step in the process

Exploring the Data: EDA & ESDA

Scatter Plot Matrix

[Figure: scatter plot matrix of p_blck, p_FHH, and pct_pov; axes scaled from 0 to 1]

Exploratory Spatial Data Analysis

• Interactively visualize and explore data where space matters

• Detect patterns

• Hypothesis generation

– Spatial modeling is needed to test hypotheses

• Works on point features and polygon features (e.g. census, epidemiology, demographic layers)

What is Spatial Randomness?

• The observed spatial pattern of values is equally as likely as any other spatial pattern

• The value at one location does not depend on values at neighboring locations

• Under spatial randomness, the location of values may be altered without affecting the information content of the data

• Random permutation or reshuffling of values

Dr. Luc Anselin 2012

Spatial Randomness

• Spatial Randomness Null Hypothesis

– Spatial randomness is the absence of any pattern

– If rejected, there is evidence of spatial structure

Dr. Luc Anselin 2012

ArcGIS Spatial Autocorrelation

• The Randomization Null Hypothesis: Where appropriate, the tools in the Spatial Statistics toolbox use the randomization null hypothesis as the basis for statistical significance testing. The randomization null hypothesis postulates that the observed spatial pattern of your data represents one of many (n!) possible spatial arrangements. If you could pick up your data values and throw them down onto the features in your study area, you would have one possible spatial arrangement of those values. (Note that picking up your data values and throwing them down arbitrarily is an example of a random spatial process). The randomization null hypothesis states that if you could do this exercise (pick them up, throw them down) infinite times, most of the time you would produce a pattern that would not be markedly different from the observed pattern (your real data). Once in a while you might accidentally throw all the highest values into the same corner of your study area, but the probability of doing that is small. The randomization null hypothesis states that your data is one of many, many, many possible versions of complete spatial randomness. The data values are fixed; only their spatial arrangement could vary.

http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000006000000

Permutations

• A numerical approach to testing for statistical significance (in contrast to analytical approaches)

• It is data-driven and makes no assumptions (such as normality) about the data

Permutations in Geoda

• Permutation inference is shuffling values around and re-computing statistics each time with a different set of random numbers to construct a reference distribution.

• Permutations are used to determine how likely it would be to observe the Moran’s I value of an actual distribution under conditions of spatial randomness.

• P-values are dependent on the number of permutations so they are “pseudo p-values”
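The permutation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not GeoDa's implementation: the function names `morans_i` and `permutation_test`, the six-location example, and the line-shaped neighbor structure are all hypothetical, and the code assumes only that NumPy is available.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for a 1-D array of values and an n x n weights matrix."""
    z = values - values.mean()          # deviations from the overall mean
    s0 = weights.sum()                  # sum of all weights
    n = len(values)
    return (n / s0) * (z @ weights @ z) / (z @ z)

def permutation_test(values, weights, n_perm=999, seed=0):
    """Reshuffle values over locations to build a reference distribution."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, weights)
    reference = np.array(
        [morans_i(rng.permutation(values), weights) for _ in range(n_perm)]
    )
    # One-sided pseudo p-value: share of arrangements at least as extreme,
    # counting the observed arrangement itself.
    n_extreme = int((reference >= observed).sum())
    pseudo_p = (n_extreme + 1) / (n_perm + 1)
    return observed, reference, pseudo_p

# Hypothetical data: six locations on a line; adjacent positions are neighbors.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n = len(values)
weights = np.zeros((n, n))
for i in range(n - 1):
    weights[i, i + 1] = weights[i + 1, i] = 1.0

observed, reference, pseudo_p = permutation_test(values, weights, n_perm=999)
print(f"Moran's I = {observed:.3f}, pseudo p-value = {pseudo_p:.3f}")
```

Because the smallest attainable pseudo p-value is 1/(n_perm + 1), running more permutations allows finer significance levels, which is why the result depends on the number of permutations.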

Permutations

The first step in the analysis of spatial autocorrelation is to construct a spatial weights file that contains information on the “neighborhood” structure for each location (Luc Anselin)

Spatial Weights

Generation of Spatial Weights (ESRI)

• For binary strategies (fixed distance, K nearest neighbors, or contiguity) a feature is either a neighbor (1) or it is not (0).

• For weighted strategies (inverse distance or zone of indifference) neighboring features have a varying amount of impact (or influence) and weights are computed to reflect that variation.

Row Standardization

• Adjusts the weights in a spatial weights matrix

• Each weight is divided by its row sum

• The row sum is the sum of weights for a feature’s neighbors

• A weights matrix is row-standardized when the values of each of its rows sum to one

Binary vs. row-standardized

• A binary weights matrix looks like:

0    1    0    0
0    0    1    1
1    1    0    0
0    1    1    1

• A row-standardized matrix looks like:

0    1    0    0
0    0    .5   .5
.5   .5   0    0
0    .33  .33  .33
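As a small sketch of row standardization, the snippet below applies the "divide each weight by its row sum" rule to the binary matrix shown above. The helper name `row_standardize` is hypothetical, and leaving rows without neighbors as all zeros is an assumption made here for illustration.

```python
import numpy as np

def row_standardize(weights):
    """Divide each weight by its row sum; rows with no neighbors stay all zero."""
    weights = np.asarray(weights, dtype=float)
    row_sums = weights.sum(axis=1, keepdims=True)
    return np.divide(
        weights, row_sums, out=np.zeros_like(weights), where=row_sums != 0
    )

# The binary example matrix from the slide.
binary = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

standardized = row_standardize(binary)
print(np.round(standardized, 2))
```

Each row of the result sums to one, so a feature with three neighbors gives each of them a weight of about .33, while a feature with a single neighbor gives it a weight of 1.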

Spatial Weights

• Formal expression of locational similarity

Distance Models

• Inverse distance – all features influence all other features, but the closer something is, the more influence it has

• Distance band – features outside a specified distance do not influence the features within the area

• Zone of indifference – combines inverse distance and distance band

Inverse Distance (impedance) (ArcGIS)

• Features impact/influence all other features

– The farther away something is, the smaller the impact

• Specify a Distance Band/Threshold Distance value to reduce the number of required computations, especially with large datasets

– If not specified, a default threshold value is computed for you

• Choosing an appropriate distance is important

– Some spatial statistics require each feature to have at least one neighbor for the analysis to be reliable

Distance band (sphere of influence)

• Imposes a sphere of influence, or moving window conceptual model of spatial interactions, onto the data

• Neighbors within the specified distance are weighted equally. Features outside have no influence (weight = 0)

• Use to evaluate the statistical properties of your data at a particular (fixed) spatial scale

• Every feature should have at least one neighbor, or results will not be valid

• If the input data is skewed, make sure that your distance band is neither too small (only one or two neighbors) nor too large (all other features included as neighbors); either extreme makes the resultant z-scores less reliable
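The two distance models can be compared with a short NumPy sketch. This is not the ArcGIS implementation: the function names, the four example coordinates, and the inclusive (`<=`) threshold are assumptions chosen for illustration. The last line checks for features without neighbors, the situation the slides warn about.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance between every pair of points."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_band_weights(coords, threshold):
    """Binary weights: 1 inside the band, 0 outside (threshold inclusive here)."""
    d = pairwise_distances(coords)
    w = (d <= threshold).astype(float)
    np.fill_diagonal(w, 0.0)            # a feature is not its own neighbor
    return w

def inverse_distance_weights(coords, threshold=None):
    """Weights of 1/d; an optional threshold zeroes out distant pairs."""
    d = pairwise_distances(coords)
    mask = d > 0
    if threshold is not None:
        mask &= (d <= threshold)
    w = np.zeros_like(d)
    w[mask] = 1.0 / d[mask]
    return w

# Hypothetical point locations; the last point lies far from the others.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

band = distance_band_weights(coords, threshold=2.5)
inv = inverse_distance_weights(coords)

# Features with no neighbors inside the band make results unreliable.
isolated = np.where(band.sum(axis=1) == 0)[0]
print("features without neighbors:", isolated)
```

With a 2.5-unit band the distant fourth point has no neighbors, so either the band would need to be widened or a different conceptualization chosen.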

Adjacency Models

• K Nearest Neighbors – a specified number of neighboring features are included in calculations

• Polygon Contiguity – polygons that share an edge or node influence each other

K-nearest neighbors

• Each feature is assessed in the spatial context of a specified number of its closest neighbors

• If K is 8, then the eight closest neighbors to the target feature will be included

• If feature density is high, the spatial context of the analysis will be smaller

• If feature density is sparse, the spatial context for the analysis will be larger

• This method is available using the Generate Spatial Weights Matrix tool
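A minimal sketch of K-nearest-neighbor weights follows, assuming point coordinates and Euclidean distance. It is not the Generate Spatial Weights Matrix tool itself: `knn_weights` and the five example points are hypothetical, and ties are broken by index order, which is one arbitrary choice among several.

```python
import numpy as np

def knn_weights(coords, k):
    """Binary weights: w[i, j] = 1 if j is among the k closest features to i."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    if not 0 < k < n:
        raise ValueError("k must be between 1 and n - 1")
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)         # a feature is never its own neighbor
    # Stable sort: equal distances are resolved by index order (an assumption).
    nearest = np.argsort(d, axis=1, kind="stable")[:, :k]
    w = np.zeros((n, n))
    w[np.arange(n)[:, None], nearest] = 1.0
    return w

# Hypothetical points: three close together on the left, two on the right.
coords = [[0, 0], [1, 0], [3, 0], [7, 0], [8, 0]]
w = knn_weights(coords, k=2)
print(w)
```

Every feature gets exactly k neighbors, but the relationship is not always mutual: the point at x = 7 counts the point at x = 3 as a neighbor, while the reverse does not hold.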

Polygon contiguity (first order)

• Polygons that share an edge (that have coincident boundaries) are included in computations for the target polygon

• Use when modeling some type of contagious process or when dealing with continuous data represented as polygons

Binary Contiguity Weights

• Contiguity = common border

• If i and j share a border, then wij = 1

• If i and j are not neighbors, then wij = 0

• Weights are 0 or 1, hence binary
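To make binary contiguity concrete, the sketch below builds weights for a regular grid of square cells, which stands in for real polygons (an assumption made only to keep the example self-contained). With `queen=True`, cells sharing an edge or a corner node are neighbors; with `queen=False`, only shared edges count, commonly called rook contiguity. The function name `grid_contiguity` is hypothetical.

```python
import numpy as np

def grid_contiguity(n_rows, n_cols, queen=True):
    """Binary contiguity weights for cells of an n_rows x n_cols lattice."""
    n = n_rows * n_cols
    w = np.zeros((n, n), dtype=int)
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue        # skip the cell itself
                    if not queen and dr != 0 and dc != 0:
                        continue        # rook: ignore corner-only contact
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        w[i, rr * n_cols + cc] = 1
    return w

queen = grid_contiguity(3, 3, queen=True)
rook = grid_contiguity(3, 3, queen=False)
print("center cell neighbors (queen, rook):", queen[4].sum(), rook[4].sum())
```

The choice matters for the same layer: on a 3 x 3 grid the center cell has eight neighbors under the queen criterion but four under the rook criterion.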

Distance-Based Weights

• Distance between points

• Distance between polygon centroids or central points

• Distance-band weights: wij nonzero for dij < d, where d is a critical distance

• k-nearest neighbor weights: same number of neighbors for all observations; potential problems with ties

Global vs. Local Statistics

• Global statistics (Clustering)

– Identify and measure the pattern of the entire study area

– Do not indicate where specific patterns occur

• Local statistics (Clusters)

– Identify variation across the study area, focusing on individual features and their relationships to nearby features (i.e. specific areas of clustering)

Spatial Autocorrelation (Moran’s I)

• Global statistic

• Measures whether the pattern of feature values is clustered, dispersed, or random.

• Compares the difference between the value of the target feature and the mean for all features to the difference between the value of each neighbor and the mean for all features.

[Diagram: value of target feature, mean of all features, value of each neighbor]
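For reference, the comparison described above corresponds to the standard textbook form of global Moran's I, written here in LaTeX. The symbols are notation introduced for this note rather than taken from the slides: x_i is the value at feature i, \bar{x} is the mean of all features, w_{ij} is the spatial weight between features i and j, and n is the number of features.

```latex
I = \frac{n}{S_0}\,
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}
```

When neighboring features deviate from the mean in the same direction, the cross-products are positive and I is positive (clustering); when they deviate in opposite directions, I is negative (dispersion).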

Z-Score & P-value (ArcGIS)

• Very high or very low (negative) z-scores, associated with very small p-values, are found in the tails of the normal distribution

• In those cases, it is unlikely that the observed spatial pattern reflects the theoretical random pattern represented by your null hypothesis (CSR)

• The null hypothesis for the pattern analysis tools is Complete Spatial Randomness (CSR), either of the features themselves or of the values associated with those features.

http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000006000000
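As a rough illustration of how a z-score maps to a p-value under the normal approximation, the sketch below uses only the Python standard library. It assumes a two-sided test, which is an assumption of this example; the function name `two_sided_p_value` is hypothetical.

```python
import math

def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

for z in (0.5, 1.96, 2.58, -3.0):
    print(f"z = {z:+.2f}  ->  p = {two_sided_p_value(z):.4f}")
```

Z-scores far out in either tail give small p-values, which is the basis for rejecting the null hypothesis of complete spatial randomness.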

Pseudo P-Value

• significance levels are dependent on the number of permutations

• One-sided significance test

• For instance, if an observed Moran's I value is higher than any of the randomly generated Moran's I values, the pseudo p-value would be 1/100 = 0.01 for 99 permutations or 1/1,000 = 0.001 for 999 permutations
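The arithmetic in this example follows the usual pseudo p-value formula, written here in LaTeX. The symbols are notation introduced for this note: M is the number of permutations and R is the number of permuted statistics that are at least as large as the observed one.

```latex
p_{\text{pseudo}} = \frac{R + 1}{M + 1}
\qquad\Rightarrow\qquad
R = 0,\; M = 99:\ \frac{0 + 1}{99 + 1} = 0.01,
\qquad
R = 0,\; M = 999:\ \frac{0 + 1}{999 + 1} = 0.001
```

The "+1" in both numerator and denominator counts the observed arrangement as one of the possible arrangements, so the pseudo p-value can never be exactly zero.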

Spatial Autocorrelation (Moran’s I) Polygon Contiguity (first order)

Spatial Autocorrelation (Moran’s I) Polygon Contiguity (first order)

Percent Black Population, Cook County, IL

Generate Spatial Weights Matrix K-Nearest Neighbor

Spatial Autocorrelation (Moran’s I) K-Nearest Neighbor


If the z-score value is positive, the observed General G index is larger than the expected General G index, indicating high values for the attribute are clustered in the study area

Spatial Autocorrelation (Getis-Ord General G High/Low Clustering) Polygon Contiguity


Geoda Spatial Autocorrelation (Moran’s I) Percent Black Population, Cook County, IL

Geoda Spatial Autocorrelation (Moran’s I) Queen Contiguity Weight (1st order)


Geoda Spatial Autocorrelation (Moran’s I) K-Nearest Neighbor (eight)


Geoda Spatial Autocorrelation (Moran’s I) K-Nearest Neighbor (four)


Anselin Local Moran’s I

• Local statistic

• Measures the strength of patterns for each specific feature.

• Compares the value of each feature in a pair to the mean value for all features in the study area.


• Positive I value:

– Feature is surrounded by features with similar values, either high or low.

– Feature is part of a cluster.

– Statistically significant clusters can consist of high values (HH) or low values (LL)

• Negative I value:

– Feature is surrounded by features with dissimilar values.

– Feature is an outlier.

– Statistically significant outliers can be a feature with a high value surrounded by features with low values (HL) or a feature with a low value surrounded by features with high values (LH).

• The z-scores and p-values are measures of statistical significance which tell you whether or not to reject the null hypothesis, feature by feature.

• Indicate whether the apparent similarity (or dissimilarity) in values for a feature and its neighbors is greater than one would expect in a random distribution.

http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_tools/cluster_and_outlier_analysis_colon_anselin_local_moran_s_i_spatial_statistics_.htm
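The HH, LL, HL, and LH categories can be illustrated with a short sketch of a local Moran's I calculation. This uses one common formulation (standardized values multiplied by the row-standardized average of neighboring values) and is not necessarily the exact computation in ArcGIS or GeoDa; the function name, the seven example values, and the line-shaped neighbor structure are hypothetical. The labels describe the quadrant only; significance would still need to be assessed, for example with permutations.

```python
import numpy as np

def local_morans_i(values, weights):
    """Local Moran's I and quadrant label for each feature."""
    values = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    row_sums = w.sum(axis=1, keepdims=True)
    w = np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums != 0)
    z = (values - values.mean()) / values.std()   # standardized values
    lag = w @ z                                   # average of neighbors' z
    local_i = z * lag
    labels = np.where(
        z >= 0,
        np.where(lag >= 0, "HH", "HL"),
        np.where(lag >= 0, "LH", "LL"),
    )
    return local_i, labels

# Hypothetical data: seven locations on a line; adjacent positions are neighbors.
values = [9, 10, 9, 1, 2, 1, 12]
n = len(values)
weights = np.zeros((n, n))
for i in range(n - 1):
    weights[i, i + 1] = weights[i + 1, i] = 1.0

local_i, labels = local_morans_i(values, weights)
for v, i_val, lab in zip(values, local_i, labels):
    print(f"value={v:>2}  I={i_val:+.2f}  {lab}")
```

Features with a positive local I fall in the HH or LL quadrants (potential clusters), and features with a negative local I fall in HL or LH (potential outliers).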


Anselin’s Local Moran’s I Polygon Contiguity Weight Percent Black Population Cook County, IL

p-value z-score index

HH LH

Geoda Univariate LISA Queen Contiguity Weight


p-values 499 Permutations p-values 999 Permutations

Geoda Univariate LISA Queen Contiguity Weight


HH HL 999 Permutations

Comparison ArcGIS & Geoda Results Queen Contiguity Weight

Percent Black Population, Cook County, IL p-values

Comparison ArcGIS & Geoda Univariate LISA Queen Contiguity Weight


HH HL 999 Permutations HH HL

Legend: High-High, High-Low, Low-High, Low-Low

Bivariate LISA Scatterplot

[Figure: bivariate LISA scatterplot; axis labels: Percent Poverty, Non-point Source Cancer Risk]

Regression results (INTERCEPT: Constant through the first p-value; SLOPE: Slope through the second p-value)

# of Observations   R^2      Constant   Std Error   t-statistic   p-value   Slope   Std Error   t-statistic   p-value
1343                0.209    0.00442    0.0176      0.251         0.802     0.332   0.0176      18.8          0
80                  0.1116   1.58       0.0797      19.8          0         0.045   0.0475      0.957         0.342
1263                0.118    -0.0794    0.0161      -4.92         0         0.223   0.0172      13            0

Chow test for selected/unselected regression subsets distribution F(2,1339) ratio=214.6 p-value=0
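The Chow test compares one regression fitted to all observations against separate regressions for the selected and unselected subsets. The sketch below shows the standard F computation on synthetic data; the data, the function names, and the selection rule are hypothetical and are not the Cook County results. With k = 2 coefficients (intercept and slope) the degrees of freedom are (k, n - 2k), which for n = 1343 gives the F(2, 1339) reported above.

```python
import numpy as np

def rss_simple_ols(x, y):
    """Residual sum of squares from regressing y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def chow_f(x, y, selected):
    """Chow F statistic for selected vs. unselected observations."""
    k = 2                                # intercept and slope
    n = len(x)
    rss_pooled = rss_simple_ols(x, y)
    rss_split = (rss_simple_ols(x[selected], y[selected])
                 + rss_simple_ols(x[~selected], y[~selected]))
    df2 = n - 2 * k
    f = ((rss_pooled - rss_split) / k) / (rss_split / df2)
    return f, (k, df2)

# Synthetic data with two regimes (purely illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
selected = x > 0.7
y = np.where(selected, 1.5 + 0.05 * x, 0.2 * x) + rng.normal(0, 0.1, size=200)

f_stat, dof = chow_f(x, y, selected)
print(f"F{dof} = {f_stat:.1f}")
```

A large F value indicates that the intercept and slope differ between the two subsets, so a single global regression line would describe the data poorly.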

Global Model

Local Model

EDA ESDA
