GIS in Public Health Research: Understanding Spatial Analysis and Interpreting Outcomes 1-31-14
DESCRIPTION
Geographic information systems (GIS) allow us to visualize data to better understand public health issues in our communities. Maps help recognize patterns for hypothesis generation; however, spatial analysis is necessary to substantiate relationships and produce meaningful outcomes. In this presentation we will discuss a few of the basic questions related to spatial analysis:
TRANSCRIPT
GIS in Public Health Research: Understanding Spatial Analysis & Interpreting Outcomes
Kristin Osiecki PhD
Houston Aerosol Characterization & Health Experiment (HACHE)
• UT Health Science Center School of Biomedical Informatics
• University of Houston Department of Earth and Atmospheric Sciences
• Rice University Department of Sociology and Department of Civil & Environmental Engineering
Applications in Public Health Research
• Space matters – communities, census tracts, counties, states
• Multidisciplinary and Interdisciplinary • Collaborative • Simple and Complex Models
What research questions are we trying to answer?
• Do we need visualizations or maps?
OR
• Are we interested in investigating possible spatial relationships within the data?
ArcGIS Toolbox
Handyman’s Dream or
Do-it-yourself nightmare?
Objectives
• Traditional Statistics & Spatial Analysis • Permutations • Spatial Weights • EDA & ESDA
"Spatial Statistics" does not mean applying traditional (non-spatial)
statistical methods to data that just happens to be spatial (has X and Y
coordinates). Source: ESRI
http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_tools/how_generate_spatial_weights_matrix_spatial_statistics_works.htm
Traditional Statistical Methodology
Spatial Methodology
Spatial Analysis
Global Model
Local Model
EDA ESDA
Global & Local
Global autocorrelation Local autocorrelation
The most crucial step in the process
Exploring the Data: EDA & ESDA
Scatter Plot Matrix
[Figure: scatter plot matrix of p_blck, p_FHH, and pct_pov; axes scaled 0 to 1]
Exploratory Spatial Data Analysis
• Interactively visualize and explore data where space matters
• Detect patterns • Hypothesis generation
• Spatial modeling is needed to test hypotheses
• Works on point feature and polygon features (i.e. census, epidemiology, demographic layers)
What is Spatial Randomness?
• Observed spatial pattern of values is equally as likely as any other spatial pattern
• Value at one location does not depend on values at neighboring locations
• Under spatial randomness, the location of values may be altered without affecting the information content of the data
• Random permutation or reshuffling of values
Dr. Luc Anselin 2012
Spatial Randomness
• Spatial Randomness Null Hypothesis
– Spatial randomness is the absence of any pattern
– If rejected, evidence of spatial structure
Dr. Luc Anselin 2012
ArcGIS Spatial Autocorrelation
• The Randomization Null Hypothesis: Where appropriate, the tools in the Spatial Statistics toolbox use the randomization null hypothesis as the basis for statistical significance testing. The randomization null hypothesis postulates that the observed spatial pattern of your data represents one of many (n!) possible spatial arrangements. If you could pick up your data values and throw them down onto the features in your study area, you would have one possible spatial arrangement of those values. (Note that picking up your data values and throwing them down arbitrarily is an example of a random spatial process). The randomization null hypothesis states that if you could do this exercise (pick them up, throw them down) infinite times, most of the time you would produce a pattern that would not be markedly different from the observed pattern (your real data). Once in a while you might accidentally throw all the highest values into the same corner of your study area, but the probability of doing that is small. The randomization null hypothesis states that your data is one of many, many, many possible versions of complete spatial randomness. The data values are fixed; only their spatial arrangement could vary.
http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000006000000
Permutations
• A numerical approach to testing for statistical significance (in contrast to analytical approaches)
• It is data-driven and makes no assumptions (such as normality) about the data
Permutations in GeoDa
• Permutation inference is shuffling values around and re-computing statistics each time with a different set of random numbers to construct a reference distribution.
• Permutations are used to determine how likely it would be to observe the Moran’s I value of an actual distribution under conditions of spatial randomness.
• P-values are dependent on the number of permutations so they are “pseudo p-values”
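The shuffling procedure described above can be sketched in a few lines. This is a minimal illustration assuming NumPy, not GeoDa's actual implementation; the function names `morans_i` and `permutation_test` and the five-location example are hypothetical.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for n values and an n x n spatial weights matrix."""
    values = np.asarray(values, dtype=float)
    z = values - values.mean()           # deviations from the overall mean
    s0 = weights.sum()                   # sum of all weights
    return (len(values) / s0) * (z @ weights @ z) / (z @ z)

def permutation_test(values, weights, n_perm=999, seed=0):
    """Build a reference distribution by reshuffling values over locations."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, weights)
    # Each permutation keeps the data values fixed and changes only their locations.
    reference = np.array([
        morans_i(rng.permutation(values), weights) for _ in range(n_perm)
    ])
    # One-sided pseudo p-value: how often a shuffled result is at least as large.
    pseudo_p = (np.sum(reference >= observed) + 1) / (n_perm + 1)
    return observed, reference, pseudo_p

# Hypothetical example: five locations on a line, neighbors are adjacent locations.
w = np.zeros((5, 5))
for i in range(4):
    w[i, i + 1] = w[i + 1, i] = 1.0
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
obs, ref, p = permutation_test(x, w, n_perm=99)
```

With 99 permutations the smallest pseudo p-value this sketch can return is 0.01, which is why the result depends on the number of permutations chosen.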
Permutations
The first step in the analysis of spatial autocorrelation is to construct a spatial weights file that contains information on the “neighborhood” structure for each location (Luc Anselin)
Spatial Weights
Generation of Spatial Weights ESRI
• For binary strategies (fixed distance, K nearest neighbors, or contiguity) a feature is either a neighbor (1) or it is not (0).
• For weighted strategies (inverse distance or zone of indifference) neighboring features have a varying amount of impact (or influence) and weights are computed to reflect that variation.
Row Standardization
• Adjusts the weights in a spatial weights matrix
• Each weight is divided by its row sum
• The row sum is the sum of weights for a feature’s neighbors.
• A weights matrix is row-standardized when the values of each of its rows sum to one.
Binary vs. row-standardized
• A binary weights matrix looks like:
0 1 0 0
0 0 1 1
1 1 0 0
0 1 1 1
• A row-standardized matrix looks like:
0 1 0 0
0 0 .5 .5
.5 .5 0 0
0 .33 .33 .33
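The row-standardization rule can be checked against the binary matrix from the slide. This is a minimal sketch assuming NumPy; the helper name `row_standardize` is hypothetical.

```python
import numpy as np

# Binary weights matrix as shown on the slide.
binary = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
], dtype=float)

def row_standardize(w):
    """Divide each weight by its row sum so that every row sums to one."""
    row_sums = w.sum(axis=1, keepdims=True)
    # Rows with no neighbors (row sum of 0) are left as zeros to avoid dividing by zero.
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums != 0)

standardized = row_standardize(binary)
print(np.round(standardized, 2))
```

The last row has three neighbors, so each weight becomes 1/3, which the slide rounds to .33.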
Spatial Weights • Formal expression of locational similarity
Distance Models
• Inverse distance – all features influence all other features, but the closer something is, the more influence it has
• Distance band – features outside a specified distance do not influence the features within the area
• Zone of indifference – combines inverse distance and distance band
Inverse Distance (impedance) (ArcGIS)
• Features impact/influence all other features
– The farther away something is, the smaller the impact
• Specify a Distance Band/Threshold Distance value to reduce the number of required computations, especially with large datasets.
– If not specified, a default threshold value is computed for you
• Choosing an appropriate distance is important
– Some spatial statistics require each feature to have at least one neighbor for the analysis to be reliable.
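The effect of a threshold on inverse distance weights can be sketched as follows. This is an illustration assuming NumPy and straight-line distances, not the ArcGIS implementation; the function name, the `power` parameter, and the three sample points are hypothetical.

```python
import numpy as np

def inverse_distance_weights(coords, threshold=None, power=1.0):
    """Weights of 1 / distance**power, optionally cut off beyond a threshold."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = np.zeros_like(dist)
    positive = dist > 0                  # skip self-pairs and coincident points
    w[positive] = 1.0 / dist[positive] ** power
    if threshold is not None:
        w[dist > threshold] = 0.0        # features beyond the threshold have no influence
    return w

pts = [(0, 0), (1, 0), (3, 0)]
w_all = inverse_distance_weights(pts)                  # every feature influences every other
w_cut = inverse_distance_weights(pts, threshold=1.5)   # third point loses all its neighbors
```

In `w_cut` the third point ends up with no neighbors, which is the situation the slide warns about when choosing a distance.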
Distance band (sphere of influence)
• Imposes a sphere of influence, or moving window conceptual model of spatial interactions, onto the data
• Neighbors within the specified distance are weighted equally. Features outside have no influence (weight = 0)
• Evaluates the statistical properties of your data at a particular (fixed) spatial scale
• Every feature should have at least one neighbor, or results will not be valid
• If the input data is skewed, make sure that your distance band is neither too small (only one or two neighbors) nor too large (includes all other features as neighbors); otherwise the resultant z-scores are less reliable.
Adjacency Models
• K Nearest Neighbors – a specified number of neighboring features are included in calculations
• Polygon Contiguity – polygons that share an edge or node influence each other
K-nearest neighbors
• Each feature is assessed in the spatial context of a specified number of its closest neighbors.
• If K (the number of neighbors) is 8, then the eight closest neighbors to the target feature will be included.
• If feature density is high, the spatial context of the analysis will be smaller.
• If feature density is sparse, the spatial context for the analysis will be larger.
• This method is available using the Generate Spatial Weights Matrix tool
Polygon contiguity (first order)
• Polygons that share an edge (that have coincident boundaries) are included in computations for the target polygon
• Use when modeling some type of contagious process or dealing with continuous data represented as polygons.
Binary Contiguity Weights
• contiguity = common border
• if i and j share a border, then wij = 1
• if i and j are not neighbors, then wij = 0
• weights are 0 or 1, hence binary
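Binary contiguity is easiest to see on a regular grid, where each cell stands in for a polygon. The sketch below assumes NumPy and a hypothetical 3 x 3 grid; real census polygons would need a GIS tool to detect shared borders. The `include_corners` flag switches between "share an edge or node" and "share an edge only".

```python
import numpy as np

def grid_contiguity(n_rows, n_cols, include_corners=True):
    """Binary contiguity weights for cells of a regular grid (row-major order)."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue          # a cell is not its own neighbor
                    if not include_corners and dr != 0 and dc != 0:
                        continue          # diagonal cells touch only at a node
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        w[i, rr * n_cols + cc] = 1.0
    return w

edges_and_nodes = grid_contiguity(3, 3, include_corners=True)
edges_only = grid_contiguity(3, 3, include_corners=False)
```

The center cell has eight neighbors when nodes count and four when only edges count, so the choice of contiguity rule changes the neighborhood structure.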
Distance-Based Weights
• distance between points
• distance between polygon centroids or central points
• distance-band weights: wij nonzero for dij < d, where d is a critical distance
• k-nearest neighbor weights: same number of neighbors for all observations; potential problems with ties
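Distance-band and k-nearest neighbor weights can be compared on the same points. This is a minimal sketch assuming NumPy and straight-line distances between points or centroids; the function names and sample coordinates are hypothetical, and ties are broken by index order here, which real tools may handle differently.

```python
import numpy as np

def pairwise_distances(coords):
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def distance_band_weights(coords, d):
    """wij = 1 when dij < d, otherwise 0."""
    w = (pairwise_distances(coords) < d).astype(float)
    np.fill_diagonal(w, 0.0)             # a feature is not its own neighbor
    return w

def knn_weights(coords, k):
    """wij = 1 for the k closest features to i, otherwise 0."""
    dist = pairwise_distances(coords)
    np.fill_diagonal(dist, np.inf)       # exclude the feature itself
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    w = np.zeros_like(dist)
    rows = np.arange(len(dist))[:, None]
    w[rows, nearest] = 1.0
    return w

pts = [(0, 0), (1, 0), (2, 0), (5, 0)]
band = distance_band_weights(pts, d=1.5)   # the isolated point gets no neighbors
knn = knn_weights(pts, k=2)                # every point gets exactly two neighbors
```

The isolated fourth point has no neighbors under the distance band but always has two under k-nearest neighbors, at the cost of an asymmetric matrix.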
Global vs. Local Statistics
• Global statistics (Clustering) – identify and measure the pattern of the entire study area – Do not indicate where specific patterns occur
• Local Statistics (Clusters) – identify variation across the study area, focusing on individual features and their relationships to nearby features (i.e. specific areas of clustering)
Spatial Autocorrelation (Moran’s I)
• Global statistic
• Measures whether the pattern of feature values is clustered, dispersed, or random.
• Compares the difference between the value of the target feature and the mean for all features to the difference between the value of each neighbor and the mean for all features.
[Figure labels: Mean of Target Feature; Mean of all features; Mean of each neighbor]
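For reference, the comparison described above corresponds to the standard formula for global Moran's I. The notation is an assumption, not taken from the slides: x_i is the value of feature i, x-bar is the mean of all features, w_ij is the spatial weight between features i and j, and n is the number of features.

```latex
I = \frac{n}{S_0} \,
    \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \, (x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
S_0 = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}
```

When a feature and its neighbors deviate from the mean in the same direction, their cross-product is positive and pushes I up (clustering); opposite deviations push it down (dispersion).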
Z-Score & P-value (ArcGIS)
• Very high or very low (negative) z-scores, associated with very small p-values, are found in the tails of the normal distribution
• When z-scores fall in the tails, it is unlikely that the observed spatial pattern reflects the theoretical random pattern represented by your null hypothesis (CSR)
• The null hypothesis for the pattern analysis tools is Complete Spatial Randomness (CSR), either of the features themselves or of the values associated with those features.
http://resources.arcgis.com/en/help/main/10.1/index.html#//005p00000006000000
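The link between a z-score and its p-value can be sketched with the standard normal distribution. This assumes a two-sided test and uses only the Python standard library; the function name is hypothetical and this is not the ArcGIS code.

```python
import math

def two_sided_p_value(z):
    """Probability of a standard normal value at least as far from 0 as z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

for z in (0.5, 1.96, 2.58, -3.0):
    print(f"z = {z:+.2f}  p = {two_sided_p_value(z):.4f}")
```

Z-scores far from zero in either direction give small p-values, which is the "tails of the normal distribution" case on the slide.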
Pseudo P-Value
• significance levels are dependent on the number of permutations
• One-sided significance test
• For instance, if an observed Moran's I value is higher than any of the randomly generated Moran's I values, the pseudo p-value would be 1/100=0.01 for 99 permutations or 1/1,000=0.001 for 999 permutations
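The arithmetic in this example follows the usual (count + 1) / (permutations + 1) rule, sketched below. The function name and arguments are hypothetical.

```python
def pseudo_p_value(n_at_least_as_large, n_permutations):
    """One-sided pseudo p-value from a permutation reference distribution.

    n_at_least_as_large: number of permuted statistics >= the observed one.
    """
    return (n_at_least_as_large + 1) / (n_permutations + 1)

p_99 = pseudo_p_value(0, 99)     # observed value beats all 99 permutations
p_999 = pseudo_p_value(0, 999)   # observed value beats all 999 permutations
```

The observed value is counted as one of the arrangements, so the pseudo p-value can never be exactly zero.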
Spatial Autocorrelation (Moran’s I) Polygon Contiguity (first order)
Percent Black Population, Cook County, IL
Generate Spatial Weights Matrix K-Nearest Neighbor
Spatial Autocorrelation (Moran’s I) K-Nearest Neighbor
Percent Black Population, Cook County, IL
If the z-score value is positive, the observed General G index is larger than the expected General G index, indicating high values for the attribute are clustered in the study area
Spatial Autocorrelation (Getis-Ord General G High/Low Clustering) Polygon Contiguity
Percent Black Population, Cook County, IL
GeoDa Spatial Autocorrelation (Moran’s I) Percent Black Population, Cook County, IL
GeoDa Spatial Autocorrelation (Moran’s I) Queen Contiguity Weight (1st order)
Percent Black Population, Cook County, IL
GeoDa Spatial Autocorrelation (Moran’s I) K-Nearest Neighbor (eight)
Percent Black Population, Cook County, IL
GeoDa Spatial Autocorrelation (Moran’s I) K-Nearest Neighbor (four)
Percent Black Population, Cook County, IL
Anselin Local Moran’s I
• Local statistic
• Measures the strength of patterns for each specific feature.
• Compares the value of each feature in a pair to the mean value for all features in the study area.
Anselin Local Moran’s I
• Positive I value:
– Feature is surrounded by features with similar values, either high or low.
– Feature is part of a cluster.
– Statistically significant clusters can consist of high values (HH) or low values (LL).
• Negative I value:
– Feature is surrounded by features with dissimilar values.
– Feature is an outlier.
– Statistically significant outliers can be a feature with a high value surrounded by features with low values (HL) or a feature with a low value surrounded by features with high values (LH).
• The z-scores and p-values are measures of statistical significance which tell you whether or not to reject the null hypothesis, feature by feature.
• They indicate whether the apparent similarity (or dissimilarity) in values for a feature and its neighbors is greater than one would expect in a random distribution.
http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_tools/cluster_and_outlier_analysis_colon_anselin_local_moran_s_i_spatial_statistics_.htm
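The HH, LL, HL, and LH labels can be illustrated with a small local Moran's I calculation. This is a sketch assuming NumPy, row-standardized weights, and a hypothetical line of six locations; it assigns a label to every feature and does not test significance, which would require z-scores or permutations as described above.

```python
import numpy as np

def local_morans_i(values, weights):
    """Local Moran's I and cluster/outlier label for every feature."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    z = (values - values.mean()) / values.std()      # standardized values
    row_sums = weights.sum(axis=1, keepdims=True)
    w = np.divide(weights, row_sums, out=np.zeros_like(weights), where=row_sums != 0)
    lag = w @ z                                      # average standardized value of neighbors
    local_i = z * lag                                # positive: similar, negative: dissimilar
    labels = np.where(z >= 0,
                      np.where(lag >= 0, "HH", "HL"),
                      np.where(lag >= 0, "LH", "LL"))
    return local_i, labels

# Hypothetical example: six locations on a line, neighbors are adjacent locations.
w = np.zeros((6, 6))
for i in range(5):
    w[i, i + 1] = w[i + 1, i] = 1.0
x = [9, 8, 9, 1, 2, 1]
local_i, labels = local_morans_i(x, w)
```

The two locations at the boundary between the high and low runs get negative values, matching the outlier interpretation of a negative I.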
Anselin’s Local Moran’s I Polygon Contiguity Weight
Percent Black Population, Cook County, IL
[Map panels: p-value, z-score, index; cluster types shown: HH, LH]
GeoDa Univariate LISA Queen Contiguity Weight
Percent Black Population, Cook County, IL
[Map panels: p-values, 499 Permutations; p-values, 999 Permutations]
GeoDa Univariate LISA Queen Contiguity Weight
Percent Black Population, Cook County, IL
[Map: HH and HL, 999 Permutations]
Comparison ArcGIS & GeoDa Results Queen Contiguity Weight
Percent Black Population, Cook County, IL
[Map panels: p-values]
Comparison ArcGIS & GeoDa Univariate LISA Queen Contiguity Weight
Percent Black Population, Cook County, IL
[Map panels: HH HL, 999 Permutations; HH HL]
[Legend: High-High, High-Low, Low-High, Low-Low]
Bivariate LISA Scatterplot
[Figure: scatterplot; axis labels: Percent Poverty, Non-point Source Cancer Risk]
                            INTERCEPT                                   SLOPE
# of Observations  R^2     Constant  Std Error  t-statistic  p-value   Slope  Std Error  t-statistic  p-value
1343               0.209   0.00442   0.0176     0.251        0.802     0.332  0.0176     18.8         0
80                 0.1116  1.58      0.0797     19.8         0         0.045  0.0475     0.957        0.342
1263               0.118   -0.0794   0.0161     -4.92        0         0.223  0.0172     13           0
Chow test for selected/unselected regression subsets distribution F(2,1339) ratio=214.6 p-value=0
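The degrees of freedom reported above are consistent with the standard Chow test for two regression subsets. The symbols are assumptions, not taken from the slides: RSS_all is the residual sum of squares from the regression on all n observations, RSS_1 and RSS_2 are from the selected and unselected subsets, and k is the number of estimated coefficients per regression.

```latex
F = \frac{\left( RSS_{\text{all}} - (RSS_1 + RSS_2) \right) / k}
         {\left( RSS_1 + RSS_2 \right) / (n - 2k)}
\sim F(k,\; n - 2k)
```

With k = 2 (intercept and slope) and n = 1343 observations, this gives F(2, 1339), matching the distribution shown.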
Global Model
Local Model
EDA ESDA