An Optimal Spatial Sampling Design for Social Surveys

Naresh Kumar, Dong Liang and Marc Linderman

Division of Environment and Public Health
University of Miami, Miami, FL 33146

Email: [email protected]
Web: web.ccs.miami.edu/~nkumar


ABSTRACT

This paper presents an optimal spatial sampling (OSS) design for fielding social surveys. The proposed design (a) develops a context-specific sampling frame at a fine spatial resolution, (b) captures the maximum spatial autocorrelation-controlled semivariance in the selected attribute of the sampling domain (in this paper, a composite index of population concentration and socio-economic characteristics), (c) ensures spatial coverage and representation, (d) minimizes sample size, and (e) minimizes redundancy in the selection of sample sites. OSS was tested by drawing a sample for fielding a pilot General Social Survey (GSS) in the Chicago metropolitan statistical area (MSA) in the summer of 2010. Fine resolution LandScan population data, coupled with U.S. Census data, were used to develop a multivariate contextual sampling frame. A semivariance optimization algorithm with a control for spatial autocorrelation was developed and implemented in C++. Our analysis suggests that a set of 97 sample sites captured 80% of the total spatial autocorrelation-controlled semivariance in the composite index used for optimizing sample sites. Maximizing spatial autocorrelation-controlled semivariance using OSS also ensured representation of the population variance.

Our analysis suggests that the OSS design outperformed other widely used spatial sampling designs, such as Generalized Random Tessellation Stratified (GRTS) sampling, in terms of spatial coverage and population representation. The domain (or area) of each optimal site, defined using the extent of local spatial autocorrelation, serves as a stratum and forms the basis for drawing inferences. The simulation experiment suggests that the relative efficiency of the OSS was better than that of other sampling designs. However, for a skewed quantity the efficiency of OSS drops and the prediction bias (measured by the percent difference between the observed and predicted mean) increases. Therefore, the variable used for optimizing sample sites should be normalized to achieve the best performance of the OSS.

Various methods, including reverse geocoding, can be used to develop an enumeration list and draw respondent(s) from each stratum. Geocoding respondents is also useful for collecting multi-layer socio-physical contextual data at reduced cost. This, in turn, is likely to extend the scope of the survey data to a multi-level, interdisciplinary setting.

Keywords: spatial sampling, sampling frame, geospatial analysis, socio-physical contexts, GIS, geospatial technologies and Chicago.


INTRODUCTION

The spatial distribution of the human population and its socio-economic characteristics shows significant spatial patterns of segregation and a slower tempo of change (Osborne and Rose 1999). This means that people with similar socio-economic characteristics tend to live closer to each other, and that the spatial patterns of their socio-economic activities change relatively slowly. Consequently, people's long-term exposure to their immediate place-specific socio-physical environment is likely to influence their attitudes, behavior, social and economic outcomes, and health (Caughy and O'Campo 2006, Chen et al. 2008, Frumkin 2003). For example, recent literature suggests that place-specific socio-physical contexts are significantly associated with the prevalence of obesity, communicable diseases, behavioral outcomes, school performance, and socio-economic achievement/success (Abercrombie et al. 2008, Cerin et al. 2006, Downey 2006, Freedman et al. 2008, Messer et al. 2006, Nelson et al. 2004, Wiehe et al. 2008, Winslow and Shaw 2007, Yonas et al. 2006, Zhang et al. 2006). Therefore, researchers are increasingly seeking georeferenced multi-level socio-physical contextual data. These data are critically important for linking individual responses to their social and physical environment, developing an understanding of how these socio-physical environmental contexts shape an individual's attitudes, behavior, and health, and theorizing the role of place (and its associated socio-physical contexts). Spatial sampling is ideally suited for ensuring spatial coverage and representation of the population, and for collecting socio-physical contextual data.

Although there has been a recent surge of interest in spatial sampling among social scientists (Goodchild and Janelle 2004), its use has been predominantly restricted to sampling environmental phenomena, such as air and water pollution and plant and animal species. Researchers also continue to face a number of challenging problems in implementing spatial sampling. First, a spatially organized, finite, and complete sampling frame of the population is rarely available. For example, to select sample sites for air pollution monitoring, it is practically impossible to have prior information about air pollution at each and every point of the sampling domain. Generally, the sampling domain, i.e. the area of the sampling frame, is partitioned into a finite set of pixels (or areas) by overlaying a geometric grid on it. Second, the size and composition of the sampling frame are largely influenced by the way the population is represented (by point, line and polygon/pixel) and by the spatial resolution of this representation. For example, for continuously varying phenomena (air pressure, elevation and air pollution) the sampling frame is an infinite set of point locations, or the sampling domain is partitioned arbitrarily into a finite set of pixels. Third, population intensity varies across geographic space, and this variation should be taken into account when deriving a spatially balanced sample (Stevens and Olsen 2004). However, information about this varying population intensity at a local spatial scale is rarely available. Fourth, most environmental and human phenomena exhibit spatial patterns (i.e. spatial autocorrelation), which poses a major challenge for obtaining observations that are sufficiently independent to draw classical statistical inferences. Researchers suggest the use of systematic sampling on a regular grid to reduce spatial autocorrelation and draw a spatially balanced sample (Bickford et al. 1963; Stevens and Olsen 2004). Since the strength and extent of spatial autocorrelation can vary locally or regionally, sample sites chosen at a systematic distance interval are unlikely to account for geographically varying spatial autocorrelation or to ensure independence in the sample selection.


Modern advances in geo-spatial technologies, spatial analytical methods, and computationally efficient algorithms offer solutions to most of the above problems. Building on these advances, an optimal spatial sampling (OSS) design is developed for fielding a pilot General Social Survey (GSS) in the Chicago MSA. Unlike the classical probabilistic sampling designs, the OSS design optimizes the locations of sample sites so that they capture the maximum spatial autocorrelation-controlled population semivariance and achieve maximum spatial coverage.

The OSS offers several advantages over the classical designs (see footnote 1). First, the implementation of OSS relies on a contextual (proxy) sampling frame consisting of a finite number of pixels (areas), instead of a sampling frame of the human population per se. This is done because a spatially organized, finite and complete sampling frame of human populations is rarely available, and the shape and size of the sampling units (at which population data are compiled and aggregated) vary dramatically. These limitations constrain our ability to draw a spatially balanced sample. In this research, we utilize LandScan data at a fine spatial resolution (400 x 400 m for this study), together with the U.S. Census and other ancillary data, to estimate the socio-economic characteristics of the population for each pixel and thereby construct a contextual sampling frame. A unique advantage of this method is that the sampling frame can be updated frequently and can be constructed at any spatial resolution for the population in question.

Second, the OSS is an efficient design in terms of spatial coverage, population representation, and sample size. It captures the maximum spatial semivariance with a minimal sample size. Since autocorrelated sites are excluded from the sample, the sample sites selected by the OSS design provide better spatial representation than those selected by other spatial sampling designs. Capturing population variance is an essential characteristic of a spatial sampling design. Unlike the classical sampling designs, the OSS design provides a prior estimate of the semivariance likely to be captured by the selected sample.

Third, the OSS minimizes redundancy by avoiding the selection of autocorrelated sites. In the presence of spatial autocorrelation, the classical methods of sampling can inflate the sample size (Griffith 2005). Things that are closer to each other in geographic space are more likely to be similar than those far apart (Tobler 2004). According to this first law of geography (with some exceptions, of course), most social and natural phenomena exhibit spatial patterns, i.e. spatial autocorrelation. This indicates similarity in the presence and quantity of a phenomenon among neighbors. This similarity, however, declines as the geographic distance between neighbors increases.

Finally, geocoded respondents (around the optimal sites) can be embedded in multiple levels and layers of socio-physical contexts without placing an extra burden on respondents. Numerous socio-physical datasets are available from various public and private agencies, such as the U.S. Census, EPA, USGS, NASA and Google, Inc. These datasets can be imputed and integrated with respondents' data with the aid of geostatistical methods, such as interpolation, aggregation and disaggregation. This, in turn, can augment the scope of the survey data and support interdisciplinary multi-level social science research.

1 Here and in the rest of the paper, the classical design refers to probabilistic sampling designs, such as simple random, stratified random and systematic random sampling.


The remainder of this paper provides a theoretical framework for the OSS design, followed by details on the implementation of this design. The final section presents a summary of our findings and a detailed discussion of the limitations and implications of the OSS for fielding socio-economic and demographic surveys.

2. THEORETICAL FRAMEWORK

Classical probability sampling methods, such as simple random sampling, stratified random sampling or clustered sampling, and prediction based sampling have been used extensively for spatial sampling. Although there have been significant developments in spatial sampling during the past several decades, we continue to face the fundamental question: "what is an appropriate or optimal spatial sampling method?" The optimality of a spatial sampling design depends on a number of factors: the population quantity of interest, the estimation objectives, the chosen estimators, and stochastic or deterministic views of the sampling frame. Thompson (2002) and Haining (2003) provide an extensive discussion of the appropriateness/optimality of a sampling method. The literature suggests that the methods of spatial sampling can be grouped into four categories: design based, model based, hybrid and deterministic. Next, we summarize the literature on these four categories of spatial sampling and then present a theoretical framework for the optimal spatial sampling design.

Design based approaches to spatial sampling assume that the population is fixed and that randomness originates purely from repeated sampling. A sampling design assigns a known positive first order inclusion probability to each element of the frame. Global parameters such as the mean, total or proportion of the population can be estimated with known precision. The standard errors can be estimated using the classical sampling methods, such as independent random, systematic random or stratified random sampling (Stehman and Overton 1996). For example, Overton and Stehman (1993) suggest a stratified design using a tessellation based on a triangular grid. Stevens and Olsen (2004) proposed generalized random tessellation stratified (GRTS) sampling to achieve a spatially balanced sample. The population was stratified regularly by mapping the geographical space onto the unit interval with a known probability measure. Such a mapping preserves the spatial structure and accounts for variable inclusion probability.

Model based approaches to spatial sampling are conceptualized and implemented in a geostatistical framework. The population is viewed as a realization of the underlying data generating model. Assuming the model is known, the sampling design can be optimized according to a given objective function. Such approaches account for both estimation and prediction using approximations based on maximum likelihood (Zhu and Stein 2006, Zimmerman 2006) or Bayesian methods (Diggle and Lophaven 2006). A systematic review of the model based approaches for optimal parameter estimation is available elsewhere (Müller 1998); Cressie (1993) provides a systematic review of optimal designs for prediction based on Kriging theory. In practice the model is rarely known, and the sampling design involves estimation of this model. The uncertainty in the model estimation thus affects the optimality of a design. Different design objectives usually lead to different optimal designs. It is important to note that the optimal designs for estimating spatial covariance functions differ from those for optimal prediction (Zimmerman 2006).


Recent developments in spatial sampling include hybrid approaches, termed model assisted sampling (Sarndal et al. 1992). The design unbiased estimator has been shown to be robust in model assisted sampling (Thompson 2002). Griffith (2005, 2008) utilized a model assisted approach for contaminant mapping of urban soil. He utilized auxiliary data to estimate a spatial model; the estimated model was then used to calculate effective sample sizes, and a design based approach (Stehman and Overton 1996) was utilized to draw the final sample.

More recent developments in spatial sampling involve the use of deterministic approaches to draw a set of optimal sites. For example, Kanaroglou et al. (2005) used a location allocation model (LAM) to select a set of NOx monitoring locations. Such a design can be used to partition the study area to achieve uniform semivariance of environmental covariates (Minasny et al. 2007).

Model based approaches offer a conceptual framework for generalizing inferences to a super-population, but the unknown model plays a significant role in the sampling design and therefore requires validation using ancillary knowledge. Design based approaches, in contrast, rely on classical sampling theory and are conditioned on the availability of data. Consequently, design based sampling is more robust to model misspecification. This paper offers a pragmatic, model assisted, design based solution for large scale social surveys and presents an optimal spatial sampling (OSS) method that identifies a set of sample sites based on a specified optimization function and ensures greater spatial coverage; the proposed design can be used to draw robust statistical inferences as well. The remainder of this section presents a theoretical framework for implementing OSS to draw a sample of optimal sites in the Chicago MSA.

Once the study domain (D) is identified, the implementation of OSS involves the following: (a) construction and characterization of the sampling frame (based on the selected attributes) of the selected sub-regions, (b) optimization of sampling locations, (c) identification of the potential households (or residential units) around the optimal locations, and (d) recruitment of respondents.

2.1 Constructing and characterizing the spatial sampling frame:

Spatial sampling requires determining both the sample size $n$ and the sample locations $s = \{s_1, \ldots, s_n\}$ in the study area D. The frame is usually formed by a regular tessellation of the study domain D. In a classical sampling design, we assume that the sampling frame is known and that a complete list of the ultimate sampling units, such as households or employees, is available. For spatial sampling, however, a finite population of identifiable geographic units is constructed as a proxy sampling frame. This is achieved by tessellation, i.e. overlaying a geometric (square, circle, hexagon, or rectangle) pattern onto the geographic area to generate a finite number of identifiable units, after which one of the classical methods of sampling (such as simple random, stratified random, or cluster random) is employed to select the sampling units (not the residential units or respondents per se).

Constructing and characterizing a sampling frame of human populations, however, can be challenging because of incomplete and/or missing data about populations and their associated socio-economic attributes. In the areal probabilistic sampling employed for the GSS and other social surveys, census units (such as blocks or block groups) are identified, and then households are drawn from the identified units using a classical method of probability sampling. However, the key problem with the U.S. Census data is that they are aggregated by census units, such as blocks, block groups and census tracts, and the distribution of the population within unit boundaries is hard to determine.

The LandScan population distribution dataset, developed by the Oak Ridge National Laboratory (Bhaduri 2005, ORNL 2008), provides population estimates at three different spatial resolutions. Utilizing the most current land use and land cover (LULC) data coupled with ancillary data, fine resolution population estimates can be generated in a cost-effective and timely manner. These data can be utilized for constructing the sampling frame. For illustration purposes we utilized LandScan data at 400 m spatial resolution (Fig 1). In these datasets, a grid is overlaid onto the study area and the population is estimated for each pixel; an uninhabited pixel is assigned a value of zero (grey color in Fig 1). Unlike the U.S. Census data, LandScan data provide estimates of population concentration at a high spatial resolution and allow uninhabited pixels to be eliminated from the sampling frame. If a sample is to be drawn based on population concentration, all pixels with a value greater than zero could serve as the sampling frame (of inhabited areas), and the population of each pixel can serve as its weight, affecting the probability of selection.

2.2 Contextualizing sampling frame: If population (distribution) is the only criterion, the sampling frame of population is adequate to optimize population weighted locations; this means that pixels with high population are given higher weights and vice-versa. However, capturing socio-economic and demographic characteristics is equally important for most social surveys to ensure adequate socio-economic representation of the population. The data on socio-economic characteristics and the physical environment are available at different geographic resolutions.
Therefore, spatial analytical techniques, including Kriging and spatial aggregation and disaggregation, can be used to derive estimates of these characteristics for each stratification dimension of the sampling frame. The methodology adopted to construct $Z_k$ (i.e. the weighted index of population concentration and selected socio-economic characteristics) for the Chicago MSA is described in the implementation section.

2.3 Optimizing sample sites: Overlaying LandScan data onto the study region partitions the sampling domain D into K pixels. Our goal is to draw a subset of sample sites $s = \{s_1, \ldots, s_n\}$, $n < K$, with a given set of optimization functions, such as spatial coverage and variance maximization. To begin with, let the objective function be the maximization of the variance of the selected variable $Z$, which can be written as

$$\max|Z_n| = \max_{s} \sum_{k \in s} (Z_k - \bar{Z})^2 \qquad (1)$$

where $\max|Z_n|$ is the maximum possible variance of $Z$ that can be captured by $n$ sites and $\bar{Z}$ denotes the population mean of the $K$ pixels.

Equation (1) may be problematic because it does not take into account spatial autocorrelation and spatial heterogeneity. Variance measures the average squared deviation of attribute values from the mean; sites with similar attribute values contribute similarly to the variance, regardless of the values in their neighborhood. However, sites with high local spatial autocorrelation need to be sampled sparsely, because selecting one of these sites can represent the whole group of autocorrelated sites.
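The limitation noted above can be illustrated with a minimal Python sketch. It is not part of the OSS implementation (which the paper reports was written in C++); the two toy transects and the helper name `captured_variance` are hypothetical. It shows that a variance objective of the kind in equation (1) depends only on the attribute values, not on where they are located, so a pair of adjacent, redundant sites can score exactly as high as a pair of distant, dissimilar sites.

```python
from statistics import pvariance


def captured_variance(values, sites, mean):
    """Sum of squared deviations from the population mean over the chosen sites."""
    return sum((values[k] - mean) ** 2 for k in sites)


# Two hypothetical transects holding the same values in a different spatial order.
clustered = [1, 1, 1, 1, 9, 9, 9, 9]  # similar values sit next to each other
scattered = [1, 9, 1, 9, 1, 9, 1, 9]  # similar values are spread apart

# The population variance is identical: it ignores the spatial arrangement.
var_clustered = pvariance(clustered)
var_scattered = pvariance(scattered)

mean_clustered = sum(clustered) / len(clustered)

# Two adjacent sites with the same value (redundant) versus two distant,
# dissimilar sites: the variance objective cannot tell them apart.
redundant_pick = captured_variance(clustered, [0, 1], mean_clustered)
diverse_pick = captured_variance(clustered, [0, 7], mean_clustered)
```

Both transects have the same variance and both site pairs receive the same score, which is why the text moves on to a criterion that accounts for the location of the values.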


To maximize representation while controlling for redundancy, we propose using a concept widely used in geostatistics: semivariance. We define the total semivariance over the entire study area (without the control for spatial autocorrelation) as

$$\gamma \equiv \frac{1}{2} \sum_{k=1}^{K} \sum_{l=1}^{K} (Z_k - Z_l)^2 \qquad (2)$$

Let $S^2 \equiv \sum_{k=1}^{K} (Z_k - \bar{Z})^2 / (K - 1)$ denote the sample variance; it is easy to show that the average semivariance and the sample variance ($S^2$) are about equal. Therefore, by maximizing the semivariance of $Z$ through the selection of $n$ sample sites, we also maximize $S^2$.

Utilizing the concept of local spatial autocorrelation (Anselin 1995), it is straightforward to control for spatial autocorrelation in semivariance by excluding autocorrelated sites around each pixel. The similarity between $Z_k$ at the kth site and the values at the other pixels of the sampling domain, $\{Z_l : l \neq k\}$, determines the strength and extent of its local spatial autocorrelation. The extent of local spatial autocorrelation in the selected variable at each pixel ($h_k$) can be computed with the aid of a local variogram, as suggested by Kumar (2009). Following Cressie (1993), the global (for the entire area) semivariance for distance interval $b$ can be defined as

$$\gamma(b) = \frac{\sum_{k} \sum_{l} I_{k,l}(b) \, (Z_k - Z_l)^2}{2 \sum_{k} \sum_{l} I_{k,l}(b)} \qquad (3)$$

where $d_{k,l}$ is the distance between the kth pixel and its lth neighbor, and the indicator $I_{k,l}(b) = 1$ when $d_{k,l} \le b$ and 0 otherwise. A variogram is the visual display of gamma (i.e. the average semivariance, which is inversely associated with spatial autocorrelation) against distance lags; a lower value of gamma indicates high spatial autocorrelation and vice-versa. A visual display of gamma can help determine the extent of local spatial autocorrelation. Since $\gamma(b)$ represents the trend of semivariance for the entire study area (termed the global trend) and semivariance can vary from place to place, following Kumar (2009) the site specific local semivariance can be computed as

$$\gamma_k(b) = \frac{\sum_{l} I_{k,l}(b) \, (Z_k - Z_l)^2}{2 \sum_{l} I_{k,l}(b)} \qquad (4)$$
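The semivariance quantities of equations (2) to (4) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code, and it rests on assumptions that are not stated in the text: semivariance carries the usual factor of one half, the distance interval $b$ is treated as a cumulative threshold ($d_{k,l} \le b$), self-pairs are excluded from the averages, and all function names and toy data are hypothetical.

```python
from math import dist
from statistics import variance


def total_semivariance(z):
    """Half the sum of squared differences over all ordered pairs (cf. equation 2)."""
    return 0.5 * sum((a - b) ** 2 for a in z for b in z)


def global_semivariance(coords, z, b):
    """Average semivariance over ordered pairs k != l with d(k, l) <= b (cf. equation 3)."""
    numerator, pairs = 0.0, 0
    for k in range(len(z)):
        for l in range(len(z)):
            if k != l and dist(coords[k], coords[l]) <= b:
                numerator += (z[k] - z[l]) ** 2
                pairs += 1
    return numerator / (2 * pairs) if pairs else float("nan")


def local_semivariance(coords, z, k, b):
    """The same average restricted to pairs that involve site k (cf. equation 4)."""
    squared = [(z[k] - z[l]) ** 2 for l in range(len(z))
               if l != k and dist(coords[k], coords[l]) <= b]
    return sum(squared) / (2 * len(squared)) if squared else float("nan")


# Hypothetical toy frame: four pixels on a line, one unit apart.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
z = [1.0, 2.0, 4.0, 8.0]
K = len(z)

# Averaged over the K(K-1) ordered pairs of distinct pixels, the total
# semivariance equals the sample variance exactly.
average_semivariance = total_semivariance(z) / (K * (K - 1))
sample_variance = variance(z)

gamma_global = global_semivariance(coords, z, b=1)
gamma_local_first = local_semivariance(coords, z, k=0, b=1)
gamma_local_last = local_semivariance(coords, z, k=3, b=1)
```

On this toy frame the local values at the two ends of the line differ markedly from the global value, which is the motivation for computing a site specific variogram.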

With the aid of the local variogram, the extent of local spatial autocorrelation $\{h_k\}$ can be computed. This, in turn, provides the basis for partitioning the neighbors of the kth pixel, $\{l : l \neq k\}$, into two sets: autocorrelated neighbors $A_k$ and un-autocorrelated neighbors $U_k$. The spatial autocorrelation controlled local semivariance ($\gamma^{*}_{k}$) of the kth pixel with respect to its un-autocorrelated neighbors can be written as

$$\gamma^{*}_{k} \equiv \frac{1}{2} \sum_{l \in U_k} (Z_k - Z_l)^2 \qquad (5)$$

Summation of $\gamma^{*}_{k}$ over all pixels gives the total spatial autocorrelation controlled semivariance,

$$\gamma^{*} = \frac{1}{2} \sum_{k=1}^{K} \sum_{l \in U_k} (Z_k - Z_l)^2 \qquad (6)$$
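As a rough illustration of equations (5) and (6), the following Python sketch computes the autocorrelation controlled semivariance on a toy frame. It assumes that the extent of local autocorrelation is already known for every pixel (the paper derives it from a local variogram; here it is simply supplied as the list `h`), that a neighbor counts as autocorrelated when its distance is at most that extent, and that semivariance carries a factor of one half. The function names and data are hypothetical.

```python
from math import dist


def split_neighbors(coords, k, h_k):
    """Partition all other pixels into autocorrelated (within h_k) and un-autocorrelated sets."""
    autocorrelated, unautocorrelated = [], []
    for l in range(len(coords)):
        if l == k:
            continue
        if dist(coords[k], coords[l]) <= h_k:
            autocorrelated.append(l)
        else:
            unautocorrelated.append(l)
    return autocorrelated, unautocorrelated


def controlled_local_semivariance(coords, z, k, h_k):
    """Semivariance of pixel k against its un-autocorrelated neighbors only (cf. equation 5)."""
    _, unautocorrelated = split_neighbors(coords, k, h_k)
    return sum(0.5 * (z[k] - z[l]) ** 2 for l in unautocorrelated)


def controlled_total_semivariance(coords, z, h):
    """Sum of the controlled local semivariances over all pixels (cf. equation 6)."""
    return sum(controlled_local_semivariance(coords, z, k, h[k])
               for k in range(len(z)))


# Hypothetical toy frame: four pixels on a line, autocorrelation extent of one unit.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
z = [1.0, 2.0, 4.0, 8.0]
h = [1.0, 1.0, 1.0, 1.0]

local_values = [controlled_local_semivariance(coords, z, k, h[k]) for k in range(len(z))]
total_controlled = controlled_total_semivariance(coords, z, h)
```

On this frame the uncontrolled total (half the sum of squared differences over all ordered pairs) is 115, while the controlled total is 94: the difference of 21 is the contribution of adjacent, autocorrelated pairs that the control removes.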


Since selecting a site from a set of autocorrelated pixels can represent the entire set (to some extent), the representation domain of each pixel is $D_k = \{k\} \cup A_k$, i.e. the pixel together with its autocorrelated neighbors $A_k$, and the semivariance that $D_k$ captures can be expressed as:

$$\Gamma_k \equiv \frac{1}{2} \sum_{j \in D_k} \sum_{l \in U_j} (Z_j - Z_l)^2 \qquad (7)$$

where $U_j$ denotes the un-autocorrelated neighbors of pixel $j$. Based on the spatial autocorrelation controlled semivariance, we propose the following objective function to maximize the semivariance captured by $n$ sample sites

$$\max|Z_n| = \max_{s} \sum_{k \in s} \Gamma_k \qquad (8)$$

where $\max|Z_n|$ is the maximum possible semivariance represented by a set of $n$ sample sites. The quantity $\Gamma_k$ measures the semivariance captured by the kth site and its neighboring autocorrelated pixels.

In the proposed OSS design, sample sites are selected sequentially without replacement. The initial sampling domain is the full set of pixels in D. The site (denoted by $k_1$) that captures the highest (spatial autocorrelation controlled) semivariance is selected first, and all pixels in its domain of representation $D_{k_1}$ are eliminated from the sampling domain ($D \setminus D_{k_1}$) for the selection of subsequent sites. The number of pixels left in the sampling domain gradually declines with the subsequent selection of sample sites. Likewise, for the second optimal site $k_2$, the sets of autocorrelated neighbors $A_{k_2}$ and un-autocorrelated neighbors $U_{k_2}$ decline with the elimination of the pixels in $D_{k_1}$.

The effectiveness of the chosen sample sites can be determined by evaluating the proportion of spatial autocorrelation controlled semivariance $c_n$ captured by $n$ sample sites: $c_n \approx \max|Z_n| / \gamma^{*}$, with $0 < c_n < 1$, where $\gamma^{*}$ is the total spatial autocorrelation controlled semivariance. For example, a value of $c_n = 0.5$ would ensure that the selected sample sites capture 50% of the total spatial autocorrelation controlled semivariance. The value $c_n$ allows for efficient inverse sampling. Unlike the conventional way of calculating sample size, an optimal sample size ($n$) for each D can be determined from the fraction of semivariance that the sample is required to capture: the target $c_n$ is fixed first, and sites are added until the selected set reaches that fraction of the total semivariance of Z, controlling for spatial autocorrelation.

2.4 Inferential methods of OSS: There have been extensive discussions on design based and model based inferences using survey data (Muller 2005, Thompson 2002). In general, design based inferences are suitable for exploring global parameters, such as the population mean and variance. However, model based inferences are more appropriate for prediction or for mapping spatial details (Haining 2003). Design based inference makes less efficient use of spatial information than the prediction based approach, and design based standard error calculation becomes analytically complex with spatial data. Model based inference, on the other hand, relies heavily on model assumptions. The validity of the model plays a significant role in determining the validity of the inference. The model based approach also becomes computationally intensive when enumerating all possible samples.
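Returning to the sequential selection of Section 2.3, the procedure can be sketched as a greedy loop. The sketch below is illustrative only and is not the authors' C++ implementation. It assumes that the autocorrelation extent of each pixel is supplied in `h`, that the representation domain of a pixel is the pixel plus the remaining pixels within that extent, that scores are recomputed on the shrinking domain at every step, that ties are broken by the lowest pixel index, and that selection stops once the captured share reaches `target` or no semivariance remains. All names and data are hypothetical.

```python
from math import dist


def oss_select(coords, z, h, target=0.8):
    """Greedy sketch: add sites until the captured share of controlled semivariance reaches target."""
    remaining = set(range(len(z)))

    def gamma_star(j, pool):
        # Semivariance of pixel j against pool members beyond its autocorrelation extent.
        return sum(0.5 * (z[j] - z[l]) ** 2
                   for l in pool
                   if l != j and dist(coords[j], coords[l]) > h[j])

    def domain(k, pool):
        # Pixel k together with its autocorrelated neighbors that are still in the pool.
        return {l for l in pool if dist(coords[k], coords[l]) <= h[k]}

    # Total controlled semivariance of the initial sampling domain.
    total = sum(gamma_star(j, remaining) for j in remaining)
    sites, captured = [], 0.0

    while remaining and total > 0 and captured / total < target:
        # Score every candidate by the semivariance its whole domain captures.
        scores = {k: sum(gamma_star(j, remaining) for j in domain(k, remaining))
                  for k in remaining}
        best = max(sorted(scores), key=scores.get)  # lowest index wins ties
        if scores[best] <= 0:
            break  # nothing left to capture
        sites.append(best)
        captured += scores[best]
        # Eliminate the selected site's domain from the sampling domain.
        remaining -= domain(best, remaining)

    share = captured / total if total > 0 else 0.0
    return sites, share


# Hypothetical toy frame: six pixels on a line, autocorrelation extent of one unit.
coords = [(i, 0) for i in range(6)]
z = [1.0, 2.0, 4.0, 8.0, 3.0, 9.0]
h = [1.0] * 6

sites_half, share_half = oss_select(coords, z, h, target=0.5)
sites_more, share_more = oss_select(coords, z, h, target=0.8)
```

On this small frame the first site alone captures just over half of the total, and a target of 0.8 cannot be reached before the domain is exhausted, so the loop returns the sites it could select together with the share actually achieved. A real application would work on thousands of pixels and would need a spatial index instead of the exhaustive distance checks used here.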


Let $Z$ denote a study variable and $D$ denote the finite frame of size $K$, assumed to be known. Suppose $x$ is an auxiliary variable known to be correlated with the study variable. Let $s$ denote a generic sample of size $n$, where $n < K$. We assume that $x_k$ is known for each element of the frame and that $Z_k$ is observable only for $k \in s$. Suppose that our objective is to estimate the unknown population mean $\bar{Z}_D \equiv \sum_{k \in D} \delta_k Z_k$, where $\delta_k$ denotes the population weight. For simplicity of notation, we drop the subscript $D$ and let $\bar{Z}$ without subscript denote the unknown weighted population mean.

In the presence of auxiliary information, a design based ratio estimator can greatly improve the precision of mean estimates (Sarndal, Swensson and Wretman 1992). Let $\hat{B} \equiv \arg\min_{B} \sum_{k \in s} (Z_k - x_k B)^2$. The design unbiased ratio estimator is $\hat{\bar{Z}} \equiv \bar{x}\hat{B} + \sum_{k \in s} \delta_k \check{e}_k$, where $\bar{x} = \sum_{k \in D} \delta_k x_k / \sum_{k \in D} \delta_k$, $\check{e}_k = e_k / \pi_k$, $\pi_k$ denotes the first order inclusion probability and $e_k = Z_k - x_k \hat{B}$ for $k \in s$. Since the OSS design suggests one sample per optimal site or stratum, the classical Horvitz-Thompson estimation of the standard error based on stratified sampling does not apply. A contrast based local neighborhood variance estimator, proposed for generalized random tessellation stratified (GRTS) designs (Stevens and Olsen 2003), can be used to address this problem.

We can also adopt a model based approach to draw inference, using a general linear model $M$. Let $z$ denote the unknown $K$ vector of the study variable, $X$ the $K$ by $p$ matrix of known auxiliary variables and $B$ the corresponding unknown coefficients. We assume $E_M(z) = XB$ and $\mathrm{Var}_M(z) = [\ldots]$, where $W$ denotes a known $K$ by $K$ diagonal population weight matrix. Suppose the first $n$ rows of $X$ are in sample $s$; then $X$ can be expressed as $(X_s^T, X_r^T)^T$. Partition the weight matrix accordingly into $\mathrm{Diag}(W_s, W_r)$ and let $Z_s$ denote the sample data. According to the general prediction theory (Valliant et al. 2000), the model based estimator of the population mean can be derived as $\hat{\bar{Z}} \equiv [\ldots]$ where $[\ldots]$. Let $\hat{\sigma}^2 = [\ldots]$; an estimated variability is $\widehat{\mathrm{Var}}(\hat{\bar{Z}}) = [\ldots]$.
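The design based estimator described at the start of this section can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's estimator verbatim: it uses a single auxiliary variable without an intercept, an unweighted least-squares slope fitted on the sample, population weights that sum to one, and known first order inclusion probabilities. The function name and the toy numbers are hypothetical, and the exact weighting used by the authors may differ.

```python
def regression_mean_estimate(x_frame, delta, sample, z_sample, pi):
    """Regression-type estimate of the weighted population mean of Z.

    x_frame  : auxiliary value x_k for every element of the frame
    delta    : population weights delta_k (assumed to sum to one)
    sample   : indices of the sampled elements
    z_sample : dict mapping sampled index -> observed Z_k
    pi       : dict mapping sampled index -> first order inclusion probability
    """
    # Least-squares slope through the origin, fitted on the sample only.
    sxz = sum(x_frame[k] * z_sample[k] for k in sample)
    sxx = sum(x_frame[k] ** 2 for k in sample)
    b_hat = sxz / sxx

    # Weighted frame mean of the auxiliary variable, known for all elements.
    x_bar = sum(d * x for d, x in zip(delta, x_frame)) / sum(delta)

    # Inverse-probability-weighted correction from the sample residuals.
    correction = sum(delta[k] * (z_sample[k] - x_frame[k] * b_hat) / pi[k]
                     for k in sample)
    return x_bar * b_hat + correction


# Hypothetical frame of four elements with equal weights; two are sampled.
x_frame = [1.0, 2.0, 3.0, 4.0]
delta = [0.25, 0.25, 0.25, 0.25]
sample = [0, 2]
pi = {0: 0.5, 2: 0.5}

# When Z is exactly proportional to x, the residuals vanish and the
# estimate equals the slope times the frame mean of x.
exact_estimate = regression_mean_estimate(x_frame, delta, sample, {0: 2.0, 2: 6.0}, pi)

# With non-zero residuals the correction term shifts the estimate.
noisy_estimate = regression_mean_estimate(x_frame, delta, sample, {0: 3.0, 2: 5.0}, pi)
```

The sketch returns a point estimate only; it does not attempt the local neighborhood variance estimator mentioned in the text.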

2.5 Validation of OSS: Independent random sampling (IRS) of the human population over a large geographical area is challenging because of significant spatial variation in the concentration of the population and its socio-economic characteristics. Complex designs, such as area probability designs, are used for fielding national social surveys such as the GSS (Smith 2006). GSS designs use systematic random sampling to draw households after sorting them according to socio-economic variables, so that the sample is representative of the population in terms of SES. The resulting sample, however, usually exhibits spatial autocorrelation, which reduces the efficiency of the sample and increases the complexity of deriving design-consistent estimators.

Since spatial sampling methods have been employed successfully for environmental monitoring (Stehman and Overton 1996), it is important to evaluate the OSS with respect to these widely used methods. The optimality of a sampling design can be evaluated by a number of criteria, including spatial balance of points (Stevens and Olsen 2004), minimal Kriging variance and efficiency of estimators (van Groenigen et al. 1999). To evaluate the robustness of OSS, we chose a statistic that evaluates both efficiency and population representation. Stevens and Olsen (2004) proposed a statistic based on Voronoi polygons to measure the regularity of sample points. Let $D$ denote the pixels of the sampling domain, and let $D_k$ denote the collection of autocorrelated points that are closer to sample point $k$ than to any other sample point; $D_k$ is called the Voronoi polygon of point $k$. With $\{Z_k\}$ denoting the composite index of individual pixels used for sampling purposes, they define $v'_k \equiv \sum_{l \in D_k} Z_l$. A set of spatially balanced sample points leads to little variation among the $v'_k$ because the inclusion probabilities ($\pi_k \propto Z_k$) and the sizes of the Voronoi polygons compensate for each other.

If the objective were to sample pixels with probability proportional to size ($Z_k$), the above statistic would be sufficient for validation. But we are interested in maximizing semivariance in the chosen attributes of the population. Sites with extreme values better represent heterogeneity in the population and contribute to the overall variability of a sample. Based on sample kurtosis, we define

$$v_k \equiv \sum_{l \in D_k} G_l, \quad \text{where } G_k = \frac{(Z_k - \bar{Z})^4}{S^4} - 3 \qquad (9)$$

where $\bar{Z}$ and $S$ measure the center and spread of the attribute distribution in the study region, respectively. For a given sample of size $n$, the $v_k$ value indicates the variance captured by the kth pixel (or site). We then use $\xi \equiv \mathrm{Var}\{v_k\}$, $k = 1, \ldots, n$, to describe both the regularity and the representation of a sample. Areas where attribute values vary significantly are on average sampled more densely: the Voronoi polygons are on average smaller but the $G_k$ values are larger. On the other hand, regions with common attribute values tend to contain fewer sample points: the Voronoi polygons are larger but the $G_k$ values are smaller. Therefore, small values of $\xi$ indicate a spatially balanced and representative sample.

The distribution of $\xi$ measures the overall efficiency and spatial coverage of a given probabilistic sampling design. A closed form for the distribution of $\xi$ is difficult to derive because of the spatial dependency of the underlying index values, but it is straightforward to simulate from the distribution of $\xi$ for a given design. Therefore, we perform a Monte Carlo simulation study to estimate this distribution. We use the independent random sampling (IRS) design as the reference against which other designs are compared. The ratio $\xi(\text{IRS})/\xi(\text{proposed design})$ quantifies the performance (or optimality) of the proposed design, where a ratio larger than one indicates that the proposed design generates a more spatially balanced sample with diverse attribute values. Using this method, the performance of the OSS can be examined with respect to other commonly used spatial sampling methods. The results on OSS performance are presented in the implementation section.

2.6 Recruiting Respondents: Once the $n$ optimal sites are identified, their representative area/domain ($D_k$) is determined using their corresponding extent ($h_k$) of local spatial autocorrelation. To select respondents (or residential units or addresses) we need an enumeration list for each $D_k$. There are many ways to develop an enumeration list of respondents/residential addresses: for example, the U.S. Postal Service can provide such a list, or it can be generated with the aid of reverse geocoding available with Google Earth and Yellow Pages (Kumar 2010). Since many respondents may decline to participate in the study, the number of respondents to be drawn around each site must be inflated to account for the non-response rate.
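
As a rough illustration of this inflation, the sketch below computes how many addresses would need to be contacted around one site so that at least one respondent is obtained with a chosen probability. It assumes that households respond independently with a common response rate, which is a simplification; the function name `contacts_per_site`, the 40% response rate and the 95% assurance level are hypothetical and are not taken from the survey.

```python
import math


def contacts_per_site(response_rate, assurance=0.95):
    """Smallest number of contacts giving P(at least one response) >= assurance.

    Assumes independent responses with a common rate (illustrative assumption).
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    if not 0 < assurance < 1:
        raise ValueError("assurance must be in (0, 1)")
    if response_rate == 1:
        return 1
    # Solve 1 - (1 - r)^n >= assurance for the smallest integer n.
    return math.ceil(math.log(1 - assurance) / math.log(1 - response_rate))


# With a hypothetical 40% response rate, six contacts are needed,
# because 0.6**5 > 0.05 while 0.6**6 < 0.05.
print(contacts_per_site(0.4))
```

A different rule, such as dividing the target number of respondents by the expected response rate, could be substituted if only the expected number of responses matters.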

The final respondents can be drawn randomly within the domain of each optimal site $k$ in $s$, the set of optimal locations. Thus optimal spatial sampling can be considered as stratified random sampling with one sample per stratum. In practice, multiple respondents can (or should) be contacted, and efforts should be made to ensure one respondent per optimal site. The inclusion probability of each respondent ($\pi_i$) within the domain of site $k$ ($D_k$) is dictated by several factors: (1) the distance of each respondent to the optimal location (we strive to recruit respondents close to the optimal site to maximize representation); (2) the abundance of the target population within the domain $D_k$; (3) the quality of the respondent frame; and (4) the response propensity of each individual respondent. Since these factors are not known, we assume that the abundance of population is proportional to the population data from LandScan, and that the non-response and frame errors are random.

3. DATA

3.1 Population Data: Modern advances in satellite remote sensing and geo-spatial methodologies have been used to develop population estimates at very fine spatial resolution. Oak Ridge National Laboratory has developed a nationwide estimate of population (ORNL 2008). These data are processed at three different spatial resolutions: (1) 1km with global coverage, (2) 400m for the U.S., and (3) 90m for the continental U.S., and are available to researchers. We utilized these data to construct the sampling frame of inhabited areas for the Chicago MSA (Fig 1). Pixels with a darker shade have a higher population concentration than pixels with a lighter shade; gray pixels are uninhabited and are not part of the sampling frame.

3.2 Socio-economic Data: Socio-economic data for this research come from the U.S. Census (2000).
We utilized a deprivation index and population size to compute the weight of each pixel ($Z_k$) of the sampling domain (Fig 2). Although various variables have been used to develop the deprivation index (Bell et al. 2007, Messer et al. 2006, Rezaeian et al. 2006), we utilized three variables, namely median family income ($I_h$), proportion of high school graduates ($X_h$), and ethnic/racial diversity ($E_h$). Summary statistics of these variables are available in Table 4. These three variables capture some aspects of social and economic characteristics. The data for these three variables were acquired from the U.S. Census. Using block level SF1 data, an ethnic/racial diversity index was computed as

$$E_h = \frac{1}{S} \sum_{s \in S} \left(P_{hs} - \bar{P}_s\right)^2$$

where $P_{hs}$ is the proportion of population of the sth racial/ethnic group in the hth census block of the region, and $\bar{P}_s$ is the proportion of population of the sth ethnic/racial group in the entire region. The proportion of different racial/ethnic groups was computed using the U.S. Census 2000 data (Table 2). The SES data were available at the Census Tract level, so their spatial resolution was coarser than that of the population data. Kriging (Cressie 1993) was employed to estimate SES for each pixel, and uninhabited pixels were excluded from the sampling domain.

4. IMPLEMENTATION

The implementation strategy of the OSS is shown in Fig 3. The proposed sampling design is implemented in three steps. In the first step, a sampling domain of inhabited areas is identified and contextualized (with socio-economic characteristics). In the second step, the extent and magnitude of spatial autocorrelation and semivariance are examined for controlling redundancy. In the final step, an optimal set of sample sites is chosen. These three steps are described below sequentially.

4.1 Construction and contextualization of the sampling frame: A complete sampling frame is critically important to implement any sampling design. We constructed a proxy sampling frame of inhabited areas. The population concentration alone can be used for a probabilistic sampling design, but it will not ensure adequate representation of the socio-economic and demographic (SES) characteristics of the sampling domain, which many social surveys need in order to represent populations in different socio-economic strata. We utilized the 2000 U.S. Census data to contextualize our sampling frame with the SES. The median family income ($I$), proportion of high school graduates ($X$), and ethnic/racial diversity ($E$) were standardized by computing their z-scores. Finally, we computed scale-free values (each divided by its mean) $\tilde{I}_k$, $\tilde{X}_k$, $\tilde{E}_k$ and $\tilde{P}_k$ for each item, where $\tilde{P}_k$ is the scale-free population size, and the final weight for each item was determined as $Z_k = 100 \times [(0.15 \times \tilde{I}_k) + (0.15 \times \tilde{X}_k) + (0.15 \times \tilde{E}_k) + (0.55 \times \tilde{P}_k)]$. The population size, which serves as an important criterion for probabilistic sampling designs, especially for social surveys, was given the highest weight of 0.55, and 0.15 was used for each of the remaining three variables to capture the socio-economic characteristics of the population.

4.2 Control for redundancy: The literature on neighborhood patterns and spatial segregation suggests that people with similar levels of SES tend to cluster (Chang 2006). We controlled for the selection of spatially autocorrelated sites by avoiding the selection of sites in the same neighborhood with similar values of the composite weight ($Z_k$).
We evaluated the global (for the entire MSA) spatial autocorrelation of the selected SES variables and the composite index with the aid of variograms (Fig 4a, 4b, 4c, and 4d), implemented in R (2006).

Fig 4a through 4d suggest that the extent and trend of spatial autocorrelation vary significantly across these four variables. For example, spatial autocorrelation drops steeply after 2.5km in the composite index. For income, however, the decline is relatively gradual and extends up to 20km. Empirical values of local average semivariance were used to identify the best-fitting model among four models widely used in geostatistics: spherical, exponential, Gaussian, and linear. The model that produced the least squared difference between the empirical and fitted (model) values was used to derive the predicted semivariance ($\hat{\gamma}_b$) at distance interval $b$. The extent of local spatial autocorrelation ($h_k$) was computed as the distance at which the best-fitting model reaches half of its sill, so that highly autocorrelated sites were excluded and less autocorrelated sites, located at a greater distance, still had a chance of selection.

4.3 Optimizing sample sites: A computer application was developed in C++ on the .NET framework to implement the optimized sampling design. The application requires a file with four variables for each item of the sampling frame, namely ID, x-coordinate, y-coordinate, and composite weight (Fig 5). The user has the option to define the distance threshold and distance lag to be used for computing the local variogram. In the global variograms of the four variables, the maximum distance within which spatial autocorrelation existed did not reach beyond 30,000m. Therefore, the maximum distance threshold used for computing spatial autocorrelation was restricted to 30,000m at 1,000m intervals. Selecting a large distance threshold is useful for determining the sill and the best-fitting model.
Nonetheless, the user has the option of defining the $h_k$ constraint. This constraint is particularly useful to avoid excessive exclusion of candidate locations: for example, the selection of a sample site with a large $h_k$ prevents the selection of all other candidates within $h_k$ distance of the selected site. We utilized the distance at half of the sill value (11,000m) in the global variogram of the weighted $Z_k$. Since this was a spherical model and the semivariance (gamma) at 5,500m represented more than 75% of the sill value, utilizing this constraint was unlikely to exclude many autocorrelated candidates.

Although the final sample was drawn using the composite index (comprising population and the three selected socio-economic variables), for illustration purposes four sets of optimal sites were computed using four different weights, namely the composite index (Fig 6A), income (Fig 6B), the race/ethnic diversity index (Fig 6C), and post-secondary education (Fig 6D). These four sets were overlaid onto each other (Fig 7). The size of the black circles in the first three maps shows the importance of a sample site: the larger the circle, the greater the semivariance it represents. These figures show that the optimal spatial configuration of the sample sites changes with the input weight ($Z_k$). The variable in which semivariance is to be maximized must therefore be chosen with care, because a single variable may not represent all desired characteristics of the population. The sampling application provides users with two options: (a) computing the sample size based on the proportion of autocorrelation controlled semivariance to be captured, or (b) selecting a pre-defined number of sample sites. We utilized a semivariance threshold (80%) to determine the sample size for all three scenarios.
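
Option (a) can be illustrated with a small greedy sketch. It is not the C++ application described in this section: the per-site contribution `gain`, the single exclusion distance `h`, and the definition of the total as the sum of contributions over all non-redundant sites are assumptions made purely for illustration, and the function name `select_sites` is hypothetical.

```python
import math


def select_sites(coords, gain, h, share=0.80):
    """Greedy illustration of threshold-based site selection.

    coords: list of (x, y) candidate locations
    gain:   assumed semivariance contribution of each candidate
    h:      assumed exclusion distance around a retained site
    share:  proportion of the controlled total to capture
    """
    # Visit candidates from the largest to the smallest contribution.
    order = sorted(range(len(gain)), key=lambda i: gain[i], reverse=True)
    kept = []
    for i in order:
        # Skip a candidate that lies within h of an already retained site.
        if all(math.dist(coords[i], coords[j]) > h for j in kept):
            kept.append(i)
    # Assumed "controlled" total: contributions of all non-redundant sites.
    total = sum(gain[i] for i in kept)
    chosen, captured = [], 0.0
    for i in kept:
        chosen.append(i)
        captured += gain[i]
        if captured >= share * total:
            break
    return chosen, captured / total


coords = [(0, 0), (1, 0), (10, 0), (20, 0), (30, 0)]
gain = [5, 4, 3, 2, 1]
sites, fraction = select_sites(coords, gain, h=2)
print(sites, round(fraction, 3))
```

In this toy input the second candidate is dropped because it lies within `h` of the first, and three of the four remaining sites are enough to pass the 80% threshold. Option (b) would correspond to stopping after a fixed number of retained sites instead.
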

Our analysis suggests that 97 sites, selected using the OSS design, captured 80% of the total spatial autocorrelation controlled semivariance of the composite index (Table 3 and Fig 8). Given that the extent and strength of spatial autocorrelation were high for family income and racial/ethnic diversity, only 45 and 62 sample sites were needed to capture 80% of the total spatial autocorrelation controlled semivariance in family income and ethnic/racial diversity, respectively. The spatial distribution of post-secondary education in the Chicago MSA was quite different from that of the three other variables used in the analysis: it required the largest sample size (108) to capture 80% of the total spatial autocorrelation controlled variability.

The relationship between the % of spatial autocorrelation controlled semivariance captured and sample size varies significantly across the four selected variables. The semivariance captured increases gradually with sample size for the composite index and post-secondary education. For family income and racial/ethnic diversity it increases rapidly with sample size, and the increase is quite abrupt up to 40% of the variability. For example, only 10 sample sites were needed to capture 40% of the total variability, but an additional 35 and 52 sample sites were needed to capture an additional 40% of the variability for family income and racial/ethnic diversity, respectively (Fig 8).

4.4 Recruiting Respondents: In the final stage, we plan to recruit a respondent around each of the optimal sites. It is important to note that a sample site represents an area and there can be many respondents within the sampling domain of a sample site ($D_k$). Recruiting a respondent around an optimal site requires a list of residential addresses or housing units within $D_k$. This list can be prepared in many different ways.
First, by overlaying a certain distance threshold, such as a 400m radius from the site, field investigators can prepare a list of all households. Although field listing is expensive and time consuming, the proposed method can reduce the cost dramatically because the number of households around all sites within the limited distance threshold is very small compared to the list of all households within the sampling frame. Second, utilizing the U.S. Census Street Data, a list of all potential addresses can be generated within the specified distance of a sample site. Third, a list of all addresses can be acquired from the U.S. Postal Service by zip code or by Census unit (such as census tracts or census block groups) at or around the optimal sites. Finally, a list of residential addresses can also be generated with the aid of reverse geocoding that utilizes publicly available georeferenced databases, such as Google Earth coupled with Yellow Pages. Unlike geocoding, in which a pair of XY coordinates is generated from an address, reverse geocoding finds the address(es) at or around a location (identified by XY coordinates). For fielding the pilot GSS in Chicago, we relied on the last two methods. Given that a significant number of households may not participate in the study, the number of sample sites will be inflated so that we have at least one respondent from the sampling domain of each optimal site.

4.5 Validation of optimal site: Since we utilized the composite index for the final draw of our sample sites, we use the same index to evaluate the performance of OSS vis-à-vis other spatial sampling designs. We used IRS as the reference to investigate the spatial coverage and representation of five sampling designs: (1) restricted random sampling (RRS); (2) weighted restricted random sampling (WRS); (3) Generalized Random Tessellation Stratified sampling (GRTS) (Stevens and Olsen 2004); (4) the Variance QuadTree algorithm (VQT) (Minasny, McBratney and Walvoort 2007); and (5) OSS. The RRS scheme is an independent sampling scheme with sequential removal of correlated sites; the WRS scheme further generalizes RRS by sampling with probability proportional to the weight index $Z_k$. The GRTS scheme performs regular stratification of the domain and draws one sample per stratum with probability proportional to $Z_k$. The VQT scheme divides the domain into quadrants with approximately equal variance and draws one sample per quadrant.
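
The RRS and WRS schemes can be sketched in a few lines. This is an illustration under stated assumptions rather than the C++ program used in the study: "correlated sites" are taken to be all candidates within a fixed distance `h` of a drawn site, and the function name `restricted_sample` and its parameters are hypothetical.

```python
import math
import random


def restricted_sample(coords, n, h, weights=None, seed=0):
    """Draw up to n sites, removing candidates within h of each drawn site.

    weights=None gives the unweighted scheme (RRS-like); passing weights
    draws each site with probability proportional to its weight (WRS-like).
    """
    rng = random.Random(seed)
    pool = list(range(len(coords)))
    picked = []
    while pool and len(picked) < n:
        if weights is None:
            i = rng.choice(pool)
        else:
            i = rng.choices(pool, weights=[weights[j] for j in pool], k=1)[0]
        picked.append(i)
        # Sequential removal: drop the drawn site and its near neighbours.
        pool = [j for j in pool if math.dist(coords[i], coords[j]) > h]
    return picked


grid = [(x, y) for x in range(10) for y in range(10)]
rrs = restricted_sample(grid, n=10, h=1.5)
wrs = restricted_sample(grid, n=10, h=1.5, weights=[1 + x + y for x, y in grid])
print(len(rrs), len(wrs))
```

Fewer than `n` sites are returned if the candidate pool is exhausted first, which is one reason the exclusion distance has to be chosen with the target sample size in mind.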

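The comparison of designs in this section rests on the statistic ξ from Section 2.5. A minimal sketch of its computation is given here, assuming the kurtosis-type score $G_k = (Z_k - \bar{Z})^4 / S^4 - 3$, nearest-sample assignment as the Voronoi polygon, and the population form of the variance; these choices, and the function name `xi`, are assumptions for illustration.

```python
import math
from statistics import mean, pstdev, pvariance


def xi(pixels, z, sample):
    """Variance of the summed kurtosis-type scores over Voronoi polygons.

    pixels: list of (x, y) pixel coordinates
    z:      attribute value of each pixel (must not be constant)
    sample: indices of the sampled pixels
    """
    zbar, s = mean(z), pstdev(z)
    g = [((zl - zbar) / s) ** 4 - 3 for zl in z]
    v = {k: 0.0 for k in sample}
    for l, p in enumerate(pixels):
        # Assign pixel l to its nearest sample point (its Voronoi polygon).
        nearest = min(sample, key=lambda k: math.dist(p, pixels[k]))
        v[nearest] += g[l]
    return pvariance(list(v.values()))


pixels = [(0, 0), (1, 0), (2, 0), (3, 0)]
z = [1, 1, 1, 5]
print(xi(pixels, z, sample=[0, 3]))
```

The ratio ξ(IRS)/ξ(design) is then obtained by evaluating `xi` on samples drawn under each design and dividing.
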
For validation purposes, 250 sites were selected using the OSS, and 1,000 replicated samples of the same size (250) were simulated for the other spatial (probabilistic) designs. We used the spsurvey package (Kincaid et al. 2009) for GRTS sampling, the Fortran code by Minasny et al. (2007) for VQT sampling, the sample function in the R statistical language (R Development Core Team 2006) for IRS sampling, and a C++ program for RRS and WRS sampling. The statistic ξ was computed for each sample, and the ratio ξ(proposed design)/ξ(OSS) was computed for each replication and probabilistic design (Fig 9). The ratios of ξ(IRS) to ξ for all spatial sampling designs were computed and compared with the ratio ξ(IRS)/ξ(OSS) (Table 4). Among all the selected designs, the OSS design performs the best, followed by the GRTS, WRS, VQT and RRS designs. Among the probabilistic designs, GRTS and WRS are significantly better than IRS because they control for both spatial correlation and variable probability selection (based on the weighted index). The RRS and VQT designs, which merely control for spatial correlation, do not significantly outperform IRS. This indicates the importance of semivariance maximization in spatial sampling to achieve spatial coverage and population representation.

We also compared the performance of the OSS and the other designs in terms of the % of spatial autocorrelation controlled semivariance ($c_n$) they captured (Table 4). As expected, the OSS design outperformed the probabilistic designs in terms of the proportion of spatial autocorrelation controlled semivariance. Its performance was also comparable with the best probabilistic designs in capturing population variance. This analysis suggests that although we optimized sample sites by maximizing spatial autocorrelation controlled semivariance, the performance of the OSS was comparable with the best spatial probabilistic designs in terms of the population variance it captured (Table 4).

4.6 Simulation experiments to study estimation efficiency of OSS: The estimation precision of the OSS was evaluated through simulation experiments. Suppose that our objective is to estimate the population weighted mean $\bar{Z}$. We validated OSS against probabilistic sampling methods, for which a known design-consistent estimator exists based on the Horvitz-Thompson theorem (Särndal, Swensson and Wretman 1992). Another objective of this experiment is to assess the robustness of OSS for drawing model-based inferences under the selected mis-specified generating models.

Data were simulated based on either known geostatistical models or real socio-economic status data in the Chicago MSA. We illustrate with the real data, which exhibit strong spatial patterns and a highly skewed empirical distribution. We utilized median family income, proportion of high school graduates and ethnic/racial diversity. Using Kriging, these data were interpolated to 400 meter pixels over the Chicago MSA. In the simulation experiments, we assumed that these data were unknown and only reported through surrogates. We considered two simulation factors in generating the surrogates: spatial structure and population weights. Errors in the surrogates were assumed to be either spatially correlated or independent. Spatial error represents any uncontrolled confounder that interacts over space. Errors were also assumed to be either inversely proportional to population density or identically distributed.
Population weighted error accounts for the heterogeneity in the underlying population. The surrogates were simulated according to a model in which $x_k$, the surrogate measure of the unknown population quantity $Z_k$ at pixel $k \in D$, is $Z_k$ distorted by an additive bias $\alpha_0$, a multiplicative bias $\alpha_1$, a spatially correlated error $S_k$ whose contribution is controlled by $\alpha_2$, and a random error $\varepsilon_k$. The additive bias $\alpha_0$ equals the minimum of $Z$ and the multiplicative bias $\alpha_1$ equals 0.95; $\alpha_2$ controls the spatial structure in the errors. Four cases were considered. In the first case, $\alpha_2 = 0$ and $\varepsilon_k \sim N(0, \sigma_\varepsilon^2)$, which corresponds to identical and independent errors (IIE), where $\sigma_\varepsilon^2$ equals the standard deviation in $Z$. In the second case, $\alpha_2 = 0.5$ and $\varepsilon_k \sim N(0, \sigma_\varepsilon^2/2)$, and $S$ denotes spatially correlated errors following an exponential model with sill $\sigma_\varepsilon^2/2$ and range 2,400 meters, so that spatial structure is introduced in the error. In the third case, $\alpha_2 = 0$ but $\varepsilon_k \sim N(0, \sigma_\varepsilon^2/n_k)$, where $n_k$ denotes the LandScan population density, so the errors are independent but weighted by population. In the fourth case, $\alpha_2 = 0.5$ and $\varepsilon_k \sim N(0, \sigma_\varepsilon^2/n_k)$, so that the errors exhibit both spatial structure and heterogeneity due to the population distribution. For each case, we generated D=399 datasets. We ran each probabilistic sampling design to select M=1000 replicates of a sample of size 250 and obtained the OSS sample of the same size based on the composite index.

We chose bias ($b$) and mean square error ($mse$) to measure the estimation precision. For a given dataset $d$, let $\hat{\bar{Z}}_{\text{design},m}$ denote the estimated mean for replication $m$ of a given design; then

$$b_{\text{design}} = \frac{1}{M} \sum_{m=1}^{M} \left(\hat{\bar{Z}}_{\text{design},m} - \bar{Z}\right) \quad \text{and} \quad mse_{\text{design}} = \frac{1}{M} \sum_{m=1}^{M} \left(\hat{\bar{Z}}_{\text{design},m} - \bar{Z}\right)^2$$

measure the bias and mean square error of a given design and dataset. For a large enough number of replications $M$, these Monte Carlo estimates accurately measure the true bias and mean square error of each design. Let $re_{\text{design}} = mse_{\text{IRS}} / mse_{\text{design}}$ denote the relative efficiency of a given design with respect to independent random sampling; larger $re$ values denote more precise designs and corresponding estimates.

The biases (measured by the % difference between the observed mean and the mean estimated using mis-specified models) of OSS for racial diversity, income and education are within 5%, 2% and 1% of the population quantity under all scenarios and hence are not reported. The variation of bias among the D datasets is larger than that of the probabilistic designs because OSS lacks randomization. Table 4 shows the relative efficiency (median and 80% confidence limits) of each design with respect to IRS for each SES variable and model mis-specification. The skewed distribution of the composite index seems to result in a larger bias (Table 5) and a reduction in the efficiency (Table 4) of the OSS. For example, some 80% confidence intervals of the relative efficiency of OSS, estimated under different model mis-specifications, include 1. Therefore, it is important that the variable used for optimization of sample sites be transformed to normality.

The overall median relative efficiency of OSS is consistently larger than that of the other probabilistic designs under all variables and model (mis)-specifications. This confirms that the OSS results in a smaller standard error than IRS, and that the performance of the OSS is robust to the chosen model (mis)-specification.

4.7 Scaling up OSS for a nationally representative survey: The methodology we have described thus far is useful for drawing a set of sample sites within small regions. Integrated with other classical methods of sampling, this methodology can be scaled up to multi-stage sampling to draw a nationally representative sample. To demonstrate how this can be used for a nationally representative GSS survey, let the sampling domain (A) be the continental U.S., representing 97.5% of the U.S. population.
In the first stage, A can be partitioned into homogeneous regions $R_i$, such as the U.S. Census Regions or the EPA regions (Fig 11). If homogeneity (in the desired attributes) within these regions is questionable, homogeneous regions can be formulated with the aid of multivariate spatial clustering. At this stage, let the ten EPA regions for the continental U.S. adequately represent the spatial distribution of environmental, topographic, and climatic characteristics. The number of respondents to be drawn from each EPA region (proportionate to population size) is presented in Table 6. For illustration purposes, let n be predetermined (e.g. ~2000 respondents to be drawn from the continental U.S.) and let the optimal sample size be unknown.

In the next stage, regions can be stratified into sub-regions (groups of census tracts or counties), and the sample size for each stratum can be proportionate to its population size (Table 7). Following the U.S. Census definition of urbanness, sub-regions can be stratified into metropolitan statistical areas (MSA), micropolitan or intermediate statistical areas, and rural areas (not included in the first two categories). Within each region there can be multiple sub-regions of each type; for example, there are many MSAs or counties within each region. Population weights are used to draw a sub-region of each type within a region. To accomplish this, sub-regions can be arranged randomly, and their cumulative population proportions can be generated. Using a pseudo random number (Venables and Ripley 2002), the range of the cumulative population proportion containing that number can be identified, and the sub-region of this range can be selected for the third stage of sampling.
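
The cumulative-proportion draw can be sketched as follows. The sub-region names and populations are made up, chosen only so that the second sub-region occupies the range 0.1 to 0.42 used as an illustration in this section; the function name `draw_subregion` is hypothetical, and the sub-regions are taken in the order given rather than shuffled first.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def draw_subregion(names, populations, u):
    """Return the sub-region whose cumulative-proportion range contains u."""
    total = sum(populations)
    # Upper bound of each sub-region's range, e.g. [0.1, 0.42, 1.0].
    upper = [c / total for c in accumulate(populations)]
    # Ranges are treated as [lower, upper); u == 1.0 maps to the last one.
    return names[min(bisect_right(upper, u), len(names) - 1)]


names = ["MSA 1", "MSA 2", "MSA 3"]
populations = [10, 32, 58]  # hypothetical population counts
print(draw_subregion(names, populations, u=0.3))              # falls in 0.1 to 0.42
print(draw_subregion(names, populations, u=random.random()))  # a random draw
```

Because a wider range is more likely to contain a uniform random number, each sub-region is selected with probability proportional to its population.
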
For example, let $R_{ij}$ be an MSA. If the cumulative population proportion range of the second MSA in the ith region is 0.1 to 0.42 (as in Fig 12) and the random number (which ranges between 0 and 1) falls within this range, that MSA can be chosen for the third stage of sampling. In essence, sub-regions with larger populations are more likely to be selected because their range will be wider, but this procedure does not exclude smaller sub-regions, because sub-regions of each type will be selected randomly.

One (or two) of each MSA, MSA/counties, and counties, respectively, can be drawn from the ith region, and the number of households or residential units ($n_{ij}$) to be drawn from the jth sub-region of the ith region is in proportion to its population size (Fig 11). For example, following probability sampling, 284 of 360 respondents should be drawn from MSAs, and 71 and 5 from micropolitan and rural counties, respectively, in EPA Region 5. It is interesting to note that the OSS suggests that only 97 sites can capture 80% of the total spatial autocorrelation controlled semivariance. Given the small number for rural counties, the sample size can be inflated to capture at least 50% of the total variability in the distribution of population and socio-economic status. In the final stage, the OSS is proposed to select sample sites and draw a household/residential unit around each of these sites.

5. DISCUSSION
This paper demonstrates the application of an optimal spatial sampling (OSS) design to draw a representative sample for social surveys. The proposed design, which builds heavily on the use of geospatial technologies and spatial analytical methods, offers several advantages over the classical methods of fielding social surveys. First, the proposed design ensures spatial coverage and population representation at multiple geographic scales (national, regional and sub-regional), which are critically important for fielding a nationally representative survey. The recent literature suggests that socio-physical contexts vary across geographic space. Therefore, geographic representation and coverage are becoming important for drawing a representative (and spatially balanced) sample.
Second, the methodology proposed can be used to construct and characterize an up-to-date sampling frame of inhabited areas (critically important for social surveys) at a fine spatial resolution (90m and 30m), because a complete, finite sampling frame of the human population is rarely available. Third, the implementation of the OSS at the third stage captures maximum spatial autocorrelation controlled variance in the selected weight (or attribute) with a minimum sample size, and controls for redundancy by avoiding the selection of autocorrelated sites. Therefore, the OSS design is efficient in terms of population representation, spatial coverage and sample size. Fourth, geocoding sample sites allows for the collection of multi-level, multi-layer socio-physical contextual data/information in a cost-effective manner without any additional burden on respondents' time. Finally, the OSS design outperforms other widely used spatial sampling designs in terms of spatial coverage and population representation, and supports robust statistical inferences about population characteristics.

The proposed sampling design is likely to have important implications for survey methodology for collecting social, demographic and health data, and for interdisciplinary multi-level social science research. While spatial sampling has received due attention and acceptance for collecting physical environment data, such as air pollution, soil minerals and plant species (Di Zio et al. 2004, Gan et al. 2006, Kumar, Peters, Nixon, Sinha, Jiang and Ziegenhorn 2009, Rodgers and Oliver 2007), its usage for collecting social data has been limited and is relatively new (Jie et al. 2009, Kumar 2007).
Unlike attributes studied in the natural sciences, social attributes exhibit greater spatial segregation and a slower tempo of creativity (Osborne and Rose 1999): for example, income and racial/ethnic composition within a neighborhood do not change overnight and change little even over years. For social surveys, we rely on existing sampling frames, such as household enumeration lists (developed by the Census and other organizations). But the existing sampling frames are not complete and may lack the attribute data needed to stratify and contextualize the sampling frame. As demonstrated in this paper, spatial sampling coupled with geo-analytical methods can be used to overcome this problem. A flow diagram for implementation of the OSS is shown in Fig 3.

Spatial sampling for collecting social survey data enhances the scope of these data in two major ways. First, georeferencing or geocoding the survey data provides a frame to integrate these data with other socio-physical contexts. This, in turn, adds a new dimension to social science research, because it allows social scientists to connect individual lives to wider social-physical environmental contexts, and to model space as a continuous, unbounded surface where contextual variables themselves endlessly vary, rather than being bounded by arbitrary census or jurisdictional boundaries (Downey 2006). Second, a sampling frame of population with socio-economic and demographic attributes can be developed from scratch, which is especially useful in the absence of a contextual and complete sampling frame. For example, if the sampling goal is to draw a sample that is representative of income, a sampling frame of the human population with indirect measures (or proxies) of income can be developed with the aid of remote sensing and other ancillary datasets. For public health research, the OSS is even more important for drawing a representative sample with adequate representation of potential risk factors. Modern geo-spatial technologies can be used to develop a sampling frame stratified by the main covariates (and/or known risk factors) of the health outcomes in question. For example, Kumar et al. (2007) developed a sampling frame for residential areas stratified by the levels of air pollution and (air pollution) emission sources, and respondents were drawn randomly from different strata in proportion to their residential area. This type of implementation ensured adequate representation of population exposure to ambient air pollution from different emission sources.

The availability of georeferenced data with multi-level (neighborhood/community, sub-regional, and regional) socio-physical contexts is likely to extend the scope of these data to interdisciplinary settings and to advance opportunities for interdisciplinary multi-level collaboration by attracting a variety of researchers, who are increasingly interested in understanding the roles that different covariates play at different levels. For example, in obesity/overweight research, while sociologists, psychologists, and epidemiologists are interested in individual and household characteristics, geographers, urban planners, statisticians, and economists are interested in understanding the role of neighborhood level socio-physical contexts, and of local and regional policy measures. The availability of these integrated multi-level datasets is likely to bring often-fragmented disciplines together. Likewise, georeferenced data also facilitate spatial proximity/adjacency research that focuses on the role of proximity/adjacency to social/environmental facilities/services in social, behavioral, and health outcomes. There are numerous examples of research investigating the relationships between spatial proximity and social and health outcomes, such as employment opportunities, prevalence of asthma, obesity, and crime (Cagney 2006, Cerin, Saelens, Sallis and Frank 2006, Cervero and Duncan 2003, Chen et al. 2008, Diez Roux 2001, Kasarda 1989, Lichter et al. 1998).
Survey data collection is expensive and time-consuming. Thus, it is essential that the sampling design be efficient and effective. The optimization of sample sites is efficient for a number of reasons. First, the proposed design captures the maximum variance in the selected attribute(s) and minimizes the sample size based on the chosen threshold of semivariance to be captured. Second, it narrows the areas around sample sites. Therefore, the list of households (or residential units) within the chosen distance threshold can be prepared in a cost-effective manner. Third, a large amount of contextual neighborhood/community data prepared for a sampling site can be directly integrated with the survey data, which further enhances the scope of the survey data. Finally, our results suggest that the OSS outperforms other widely used sampling methods for parameter estimation of the SES of the population.

Although the proposed sampling design offers a number of advantages over the classical methods, a number of weaknesses remain. First, the use of a variable with a skewed distribution can bias the sample selection in favor of extreme values, which are likely to produce a high sum of squared differences from the values observed at other locations. For example, the relative efficiency of the OSS dropped for the skewed composite index. But for the normally distributed median income variable, the relative efficiency of the OSS was the highest among all sampling designs used for estimating population parameters from mis-specified models (Table 5). Therefore, it is important that input attributes (to be used for semivariance maximization) are normalized. Second, the proposed optimization algorithm makes use of one variable. Therefore, we utilized a composite index developed from four variables. Future research will be geared towards hierarchical optimization, in which multiple variables could be used in the order of their weight.
Third, the specification of the extent of local spatial autocorrelation (h_k) could be problematic, especially when coarse resolution census data are used to contextualize the fine resolution sampling frame. For example, coarse resolution census tract data interpolated at 400m pixels (of the sampling frame) are likely to have stronger spatial autocorrelation, because all pixels within a coarse resolution census tract are likely to receive the same census value. Finally, geographic space is receiving diminished attention due to the increased use of cyber space and information technology for collecting social survey data. The proposed sampling design lacks the integration of cyber space to administer surveys. The integration of both cyber space and geographic space is important, because recent research suggests that geographic location and its associated socio-physical contexts continue to influence human behavior and health outcomes even with the phenomenal advances in cyber space and its impact on our lives. For example, in a recent article, Choi and colleagues (2010) demonstrate that geographic proximity coupled with similarities in socio-economic and demographic characteristics shapes the evolution of demand for online retailers. Thus, future research will be geared towards the integration of cyber and geographic space, as it has become possible to track and geocode cyber traffic.

In the gradually expanding cyber/virtual space, geographic distance is narrowing and geographic boundaries are blurring. Nonetheless, the importance of geographic space cannot be ignored. Human populations tend to segregate across geographic space based on socio-economic and demographic attributes (Do et al. 2008), which do not change frequently (Downey 2006). The proposed spatial sampling emphasizes the importance of geographic space and place, and ensures spatial coverage and population representation by capturing the maximum semivariance in the selected attribute(s) with a minimal sample size. In addition, geocoding respondents allows us to embed them in multi-level layers of socio-physical contextual information (in a cost-effective manner), thereby augmenting the scope of social survey data. This, in turn, is likely to enhance the scope of interdisciplinary research.
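The first weakness noted in this discussion, that a skewed input favors extreme values, can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the authors' C++ algorithm: it scores each site only by its sum of squared differences from all other sites, ignores the spatial autocorrelation control, uses hypothetical data, and takes a log transform as one possible normalization.

```python
import math


def squared_difference_scores(values):
    """Score each site by the sum of squared differences from all other sites."""
    return [sum((v - w) ** 2 for w in values) for v in values]


def top_sites(values, k):
    """Indices of the k highest-scoring sites."""
    scores = squared_difference_scores(values)
    return sorted(range(len(values)), key=lambda i: scores[i], reverse=True)[:k]


def count_above_mean(values, indices):
    """Number of the given sites whose value lies above the mean."""
    mean = sum(values) / len(values)
    return sum(1 for i in indices if values[i] > mean)


# Hypothetical right-skewed attribute: exponentials of evenly spaced numbers.
skewed = [math.exp(x / 2) for x in range(-4, 5)]
# One possible normalization (an assumption): the log transform.
normalized = [math.log(v) for v in skewed]

skewed_upper = count_above_mean(skewed, top_sites(skewed, 2))
normalized_upper = count_above_mean(normalized, top_sites(normalized, 2))
print(skewed_upper, normalized_upper)
```

With the skewed input, both of the two top-scoring sites come from the upper tail; after the log transform, one comes from each tail, which is the behavior the normalization recommendation aims for.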


Acknowledgement: We would like to thank three anonymous reviewers and Professor Daniel Griffith for the criticisms and suggestions that helped enormously to improve the revision. This work was supported by the NSF (Grant # 0825588).

References

L. C. Abercrombie, J. F. Sallis, T. L. Conway, L. D. Frank, B. E. Saelens and J. E. Chapman (2008) Income and racial disparities in access to public parks and private recreation facilities. American Journal of Preventive Medicine, 9-15.
L. Anselin (1995) Local indicators of spatial association – LISA. Geographical Analysis, 93-115.
N. Bell, N. Schuurman and M. V. Hayes (2007) Using GIS-based methods of multicriteria analysis to construct socio-economic deprivation indices. Int J Health Geogr, 17.
B. Bhaduri (2005) LandScan USA: High-Resolution Population Distribution Model. Oak Ridge National Laboratory.
K. A. Cagney (2006) Neighborhood age structure and its implications for health. J Urban Health, 827-834.
M. O. Caughy and P. J. O'Campo (2006) Neighborhood poverty, social capital, and the cognitive development of African American preschoolers. Am J Community Psychol, 141-154.
E. Cerin, B. E. Saelens, J. F. Sallis and L. D. Frank (2006) Neighborhood Environment Walkability Scale: validity and development of a short form. Med Sci Sports Exerc, 1682-1691.
R. Cervero and M. Duncan (2003) Walking, bicycling, and urban landscapes: Evidence from the San Francisco Bay area. American Journal of Public Health, 1478-1483.
V. W. Chang (2006) Racial residential segregation and weight status among US adults. Soc Sci Med, 1289-1303.
C. Chen, H. M. Gong and R. Paaswell (2008) Role of the built environment on mode choice decisions: additional evidence on the impact of density. Transportation, 285-299.
J. T. Chen, B. A. Coull, P. D. Waterman, J. Schwartz and N. Krieger (2008) Methodologic implications of social inequalities for analyzing health disparities in large spatiotemporal data sets: An example using breast cancer incidence data (Northern and Southern California, 1988-2002). Statistics in Medicine, 3957-3983.
J. Choi, S. Hui and D. R. Bell (2010) Spatiotemporal Analysis of Imitation Behavior Across New Buyers at an Online Grocery Retailer. Journal of Marketing Research, 1-15.
N. Cressie (1993) Statistics for Spatial Data, New York: Wiley.
S. Di Zio, L. Fontanella and L. Ippoliti (2004) Optimal spatial sampling schemes for environmental surveys. Environmental and Ecological Statistics, 397-414.
A. V. Diez Roux (2001) Investigating neighborhood and area effects on health. Am J Public Health, 1783-1789.
P. Diggle and S. Lophaven (2006) Bayesian geostatistical design. Scand J Stat, 53-64.
D. P. Do, B. K. Finch, R. Basurto-Davila, C. Bird, J. Escarce and N. Lurie (2008) Does place explain racial health disparities? Quantifying the contribution of residential context to the Black/white health gap in the United States. Soc Sci Med, 1258-1268.
L. Downey (2006) Using geographic information systems to reconceptualize spatial relationships and ecological context. American Journal of Sociology, 567-612.
V. A. Freedman, I. B. Grafova, R. F. Schoeni and J. Rogowski (2008) Neighborhoods and disability in later life. Soc Sci Med, 2253-2267.

H. Frumkin (2003) Healthy places: Exploring the evidence. American Journal of Public Health, 1451-1456.
P. Gan, R. Yu, B. E. Smets and A. A. Mackay (2006) Sampling methods to determine the spatial gradients and flux of arsenic at a groundwater seepage zone. Environ Toxicol Chem, 1487-1495.
M. F. Goodchild and D. G. Janelle, eds (2004) Spatially Integrated Social Science, Oxford: Oxford University Press.
D. A. Griffith (2005) Effective geographic sample size in the presence of spatial autocorrelation. Annals of the Association of American Geographers, 740-760.
D. A. Griffith (2008) Geographic sampling of urban soils for contaminant mapping: how many samples and from where. Environmental Geochemistry and Health, 495-509.
R. Haining (2003) Spatial Data Analysis Theory and Practice, New York, NY: Cambridge University Press.
Y. Jie, P. F. Landry and L. Y. Ren (2009) GPS in China Social Surveys: Lessons from the ILRC Survey. China Rev, 147-163.
P. S. Kanaroglou, M. Jerrett, J. Morrison, B. Beckerman, M. A. Arain, N. L. Gilbert and J. R. Brook (2005) Establishing an air pollution monitoring network for intra-urban population exposure assessment: A location-allocation approach. Atmospheric Environment, 2399-2409.
J. D. Kasarda (1989) Urban Industrial Transition and the Underclass. Ann Am Acad Polit Ss, 26-47.
T. Kincaid, T. Olsen, D. Stevens, C. Platt, D. White and R. Remington (2009) spsurvey: Spatial Survey Design and Analysis.
N. Kumar (2009) An Optimal Spatial Sampling Design for Intra-Urban Population Exposure. Atmospheric Environment, 1153-1155.
N. Kumar (2007) Spatial Sampling for a Demography and Health Survey. Population Research and Policy Review, 581-599.
N. Kumar, T. Peters, V. Nixon, K. Sinha, X. Jiang and S. Ziegenhorn (2009) An Optimal Spatial Configuration of Sample Sites for Air Pollution Monitoring. J. Air & Waste Manage. Assoc, 1308-1316.
D. T. Lichter, D. K. McLaughlin and D. C. Ribar (1998) State abortion policy, geographic access to abortion providers and changing family formation. Fam Plann Perspect, 281-287.
L. C. Messer, J. S. Kaufman, N. Dole, D. A. Savitz and B. A. Laraia (2006) Neighborhood crime, deprivation, and preterm birth. Ann Epidemiol, 455-462.
L. C. Messer, B. A. Laraia, J. S. Kaufman, J. Eyster, C. Holzman, J. Culhane, I. Elo, J. G. Burke and P. O'Campo (2006) The development of a standardized neighborhood deprivation index. J Urban Health, 1041-1062.
B. Minasny, A. B. McBratney and D. J. J. Walvoort (2007) The variance quadtree algorithm: Use for spatial sampling design. Computers & Geosciences, 383-392.
W. G. Müller (2005) A comparison of spatial design methods for correlated observations. Environmetrics, 495-505.
W. G. Müller (1998) Collecting Spatial Data: Optimum Design of Experiments for Random Fields, New York: Physica-Verlag.
J. A. Nelson, M. A. Chiasson and V. Ford (2004) Childhood overweight in a New York City WIC population. American Journal of Public Health, 458-462.
ORNL (2008) LandScan™ Global Population Database. Oak Ridge National Laboratory, Oak Ridge, TN (Available from http://www.ornl.gov/landscan/.)

T. Osborne and N. Rose (1999) Do the social sciences create phenomena?: the example of public opinion research. Brit J Sociol, 367-396.
W. S. Overton and S. V. Stehman (1993) Properties of Designs for Sampling Continuous Spatial Resources from a Triangular Grid. Communications in Statistics-Theory and Methods, 2641-2660.
R Development Core Team (2006) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (Available from http://www.R-project.org.)
M. Rezaeian, G. Dunn, S. St Leger and L. Appleby (2006) Ecological association between suicide rates and indices of deprivation in the north west region of England: the importance of the size of the administrative unit. J Epidemiol Community Health, 956-961.
S. E. Rodgers and M. A. Oliver (2007) A geostatistical analysis of soil, vegetation, and image data characterizing land surface variation. Geographical Analysis, 195-216.
C. E. Sarndal, B. Swensson and J. Wretman (1992) Model Assisted Survey Sampling, New York, NY: Springer.
T. W. Smith (2006) The Subsampling of Non-Respondents in the 2004 GSS. In GSS Methodological Report No. 106, Chicago, IL: NORC.
S. V. Stehman and W. S. Overton (1996) Spatial sampling. In Practical Handbook of Spatial Statistics (ed S. Arlinghaus), pp. 31-63, Boca Raton, FL: CRC Press.
D. L. Stevens and A. R. Olsen (2004) Spatially balanced sampling of natural resources. Journal of the American Statistical Association, 262-278.
D. L. Stevens and A. R. Olsen (2003) Variance estimation for spatially balanced samples of environmental resources. Environmetrics, 593-610.
S. K. Thompson (2002) On sampling and experiments. Environmetrics, 429-436.
W. Tobler (2004) On the first law of geography: A reply. Annals of the Association of American Geographers, 304-310.
U.S. Census (2000).
R. Valliant, A. Dorfman and R. M. Royall (2000) Finite population sampling and inference: a prediction approach, New York: John Wiley & Sons.
J. W. van Groenigen, W. Siderius and A. Stein (1999) Constrained optimisation of soil sampling for minimisation of the kriging variance. Geoderma, 239-259.
W. N. Venables and B. D. Ripley (2002) Modern Applied Statistics with S, New York: Springer.
S. E. Wiehe, A. E. Carroll, G. C. Liu, K. L. Haberkorn, S. C. Hoch, J. S. Wilson and J. D. Fortenberry (2008) Using GPS-enabled cell phones to track the travel patterns of adolescents. International Journal of Health Geographics, -.
E. B. Winslow and D. S. Shaw (2007) Impact of neighborhood disadvantage on overt behavior problems during early childhood. Aggressive Behav, 207-219.
M. A. Yonas, P. O'Campo, J. G. Burke and A. C. Gielen (2006) Neighborhood-Level Factors and Youth Violence: Giving Voice to the Perceptions of Prominent Neighborhood Individuals. Health Educ Behav.
X. Zhang, K. K. Christoffel, M. Mason and L. Liu (2006) Identification of contrastive and comparable school neighborhoods for childhood obesity and physical activity research. Int J Health Geogr, 14.
Z. Y. Zhu and M. L. Stein (2006) Spatial sampling design for prediction with estimated parameters. J Agr Biol Envir St, 24-44.

D. L. Zimmerman (2006) Optimal network design for spatial prediction, covariance parameter estimation, and empirical prediction. Environmetrics, 635-652.

Original Scale
Statistics    Population   Diversity Index   % Population with Post-Secondary Education   Median Household Income ($)
Mean          430          0.134             53.16                                        59070
SE of Mean    3.97         0.0009            0.051                                        88.21
Skewness      1.57         4.86              0.148                                        0.977
Kurtosis      4.56         28.65             2.57                                         8.05

Normalized Scale
Statistics    Population   Diversity Index   % Population with Post-Secondary Education   Median Household Income ($)
Mean          1            1.45              1                                            1
SE of Mean    0.009        0.005             0.0009                                       0.0015
Skewness      1.578        5.67              0.148                                        0.97
Kurtosis      4.56         8.6               2.5                                          8.0

Table 1: Summary statistics (sample size N = 25,011)

Population    White      Black      Others
Proportion    0.669214   0.182206   0.14858

Table 2: Population distribution by racial/ethnic groups.

% variance   Ethnic/Racial   Median Household   % Population with           Composite
captured     Diversity       Income             Post-Secondary Education    Index
<= 0.1       0 (0)           1 (1)              5 (5)                       4 (4)
<= 0.2       2 (2)           2 (3)              7 (12)                      7 (11)
<= 0.3       3 (5)           3 (6)              9 (21)                      8 (19)
<= 0.4       5 (10)          4 (10)             10 (31)                     12 (31)
<= 0.5       8 (18)          6 (16)             13 (44)                     13 (44)
<= 0.6       10 (28)         7 (23)             16 (60)                     16 (60)
<= 0.7       15 (43)         10 (33)            22 (82)                     17 (77)
<= 0.8       19 (62)         12 (45)            26 (108)                    20 (97)

Table 3: Sample size and % spatial autocorrelation-controlled semivariance captured (cumulative sample size in parentheses)
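The cumulative sample sizes in Table 3 are running totals of the sites added at each threshold. The short sketch below recomputes them from the per-threshold increments and looks up the sample size needed for a chosen threshold. The variable and function names are illustrative and not taken from the original program.

```python
from itertools import accumulate

# Thresholds of captured semivariance and sites added at each step (Table 3).
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
sites_added = {
    "ethnic_racial_diversity": [0, 2, 3, 5, 8, 10, 15, 19],
    "median_household_income": [1, 2, 3, 4, 6, 7, 10, 12],
    "post_secondary_education": [5, 7, 9, 10, 13, 16, 22, 26],
    "composite_index": [4, 7, 8, 12, 13, 16, 17, 20],
}

# Running totals reproduce the figures in parentheses in Table 3.
cumulative = {attr: list(accumulate(steps)) for attr, steps in sites_added.items()}


def sites_needed(attribute, target):
    """Cumulative number of sites at the first tabulated threshold >= target."""
    for threshold, total in zip(thresholds, cumulative[attribute]):
        if threshold >= target - 1e-9:  # tolerance for floating-point input
            return total
    raise ValueError("target exceeds the largest tabulated threshold")


print(sites_needed("composite_index", 0.8))
```

For the composite index, the 0.8 threshold corresponds to the 97 sites reported in the abstract.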

Sampling Designs           RRS                  WRS                  GRTS                 GRTSW                VQT                  OSS

Composite index
Cn(ij)                     0.27 (0.25,0.29)     0.31 (0.29,0.33)     0.38 (0.36,0.40)     0.37 (0.35,0.39)     0.31 (0.30,0.33)     0.50
ξ(IRS)/ξ(proposed design)  1.04 (0.61,1.78)     2.29 (1.48,3.74)     1.18 (0.81,1.90)     3.92 (2.77,6.23)     1.21 (0.73,2.08)     4.56
IIE                        1.04 (0.94,1.16)     0.15 (0.14,0.16)     1.37 (1.23,1.53)     0.08 (0.08,0.09)     1.62 (1.06,1.98)     0.14 (0.09,0.27)
SE                         1.05 (0.95,1.16)     0.15 (0.14,0.16)     1.38 (1.24,1.55)     0.08 (0.08,0.09)     1.67 (0.99,2.01)     0.14 (0.09,0.25)
WE*                        0.97 (0.86,1.12)     1.43 (1.25,1.61)     1.01 (0.89,1.13)     1.53 (1.35,1.73)     1.24 (0.58,1.65)     4.07 (0.41,2085.05)
SWE                        1.05 (0.95,1.17)     0.19 (0.17,0.21)     1.29 (1.13,1.47)     0.1 (0.09,0.11)      1.58 (0.89,2.02)     0.17 (0.1,0.42)

Median household income, known population quantity (55,186)
IIE                        1.04 (0.96,1.12)     1.29 (1.2,1.41)      1.31 (1.22,1.41)     0.97 (0.9,1.06)      1.42 (0.99,1.72)     1.71 (0.43,40.07)
SE                         1.05 (0.98,1.12)     1.33 (1.22,1.43)     1.33 (1.23,1.44)     0.99 (0.89,1.1)      1.46 (0.93,1.81)     1.79 (0.46,38.21)
WE*                        0.97 (0.9,1.05)      1.73 (1.59,1.86)     1.01 (0.94,1.1)      2.13 (1.96,2.31)     1.27 (0.89,1.46)     4.56 (0.78,148.87)
SWE                        1.06 (0.98,1.14)     1.48 (1.33,1.63)     1.23 (1.14,1.32)     1.23 (1.05,1.43)     1.49 (0.94,1.87)     2.53 (0.53,55.5)

Proportion of high school graduates, population quantity (53.34)
IIE                        1.08 (1,1.14)        1.96 (1.83,2.1)      1.44 (1.33,1.53)     3.39 (3.15,3.63)     0.93 (0.6,1.33)      13.5 (2.23,242.37)
SE                         1.09 (1.01,1.16)     2.01 (1.87,2.14)     1.45 (1.34,1.56)     3.46 (3.21,3.74)     0.95 (0.58,1.41)     13.87 (2.19,433.05)
WE*                        0.97 (0.89,1.06)     1.73 (1.59,1.88)     1.01 (0.93,1.1)      2.17 (2,2.36)        1.24 (0.87,1.49)     5.28 (0.81,189.82)
SWE                        1.08 (1,1.16)        1.95 (1.84,2.1)      1.32 (1.23,1.44)     3.04 (2.82,3.27)     1.11 (0.61,1.52)     9.21 (1.61,264.41)

Ethnic/racial diversity, population quantity (0.198)
IIE                        1.08 (1.01,1.15)     1.96 (1.8,2.1)       1.32 (1.22,1.42)     2.32 (2.16,2.52)     1.63 (1.18,1.9)      3.43 (0.73,129.17)
SE                         1.1 (1.02,1.18)      2.05 (1.91,2.21)     1.35 (1.25,1.46)     2.42 (2.18,2.67)     1.7 (1.21,2.04)      3.86 (0.75,94.34)
WE*                        0.98 (0.9,1.07)      1.76 (1.63,1.9)      1.02 (0.94,1.11)     2.2 (2.05,2.4)       1.27 (0.88,1.52)     5.59 (0.8,178.03)
SWE                        1.1 (1.03,1.17)      2.04 (1.87,2.23)     1.29 (1.19,1.39)     2.42 (2.17,2.68)     1.69 (1.14,2.01)     4.63 (0.76,115.25)

Table 4: Relative efficiency (median and 80% confidence limits in parentheses) of each given design with respect to independent random sampling (IRS). The experiment involved 399 independent datasets and 1000 sampling replications, and utilizes four models: identical independent errors (IIE), spatial errors (SE), population weighted errors (WE) and population weighted spatial errors (SWE) on simulated income, education and diversity socio-economic status data for the 2000 Chicago MSA. * specified (or known or correct) model

Sampling Methods   IRS     RRS     WRS     GRTS    GRTSW   VQT     OSS

Composite index
IIE                -0.44   -0.26   12.78   -0.25   17.88   0.76    13.77
SE                 -0.44   -0.27   12.78   -0.25   17.88   0.67    13.85
WE                 0.00    0.00    0.07    0.00    0.09    0.01    0.06
SWE                -0.26   -0.14   10.28   -0.12   14.66   0.71    11.32

Median household income
IIE                0.03    0.02    -1.06   -0.01   -1.8    -0.68   -1.8
SE                 0.04    0.02    -1.07   0       -1.8    -0.71   -1.72
WE                 0       0       0       0       0       -0.01   -0.01
SWE                0.01    0       -0.67   -0.01   -1.13   -0.4    -1.14

Table 5: Average percent deviation between sample and population mean, relative to the population mean, based on 399 simulated data sets.
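The quantity reported in Table 5 can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and the deviation is taken to be signed and averaged over datasets, which the mixed signs in Table 5 suggest but the caption does not state explicitly.

```python
def percent_deviation(sample_mean, population_mean):
    """Signed deviation of a sample mean from the population mean, in percent."""
    return 100.0 * (sample_mean - population_mean) / population_mean


def average_percent_deviation(sample_means, population_means):
    """Mean of the signed percent deviations over paired datasets."""
    deviations = [percent_deviation(s, p)
                  for s, p in zip(sample_means, population_means)]
    return sum(deviations) / len(deviations)


# Hypothetical values: one sample mean per simulated dataset.
example = average_percent_deviation([110.0, 90.0], [100.0, 100.0])
```

Because the deviations are signed, over- and under-estimates cancel, so a value near zero indicates little systematic bias rather than small errors in every dataset.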

Sampling        Sample Size                                                   Location
Stage           Calculation                                    Criteria                 Criteria                Representation
First Stage     ni = Pi/PUS                                    Population Size          EPA Regions             Social/Physical Environment
Second Stage    One/two each MSA, MSA/Counties and Counties    Randomly                 Urbanization            Rural & Urban
Third Stage     nij = Pij/Pj                                   Semivariance threshold   Variance Maximization   Population & SES

P = Population, PUS = Population in the continental US

Table 6: Sampling stages with key characteristics

EPA       Population in thousands                                    Sample   Proposed sample size
Region    Total     Metropolitan   Micropolitan   Rural County      Size     Metropolitan   Micropolitan   Rural County*
1         13,800    11,800         1,889          124               99       85             14             1
2         26,900    25,400         1,498          5                 194      183            11             0
3         27,200    22,200         4,496          529               196      160            32             4
4         53,300    38,700         12,700         1,820             383      279            91             13
5         50,100    39,400         9,911          745               360      284            71             5
6         32,900    25,200         7,113          635               237      181            51             5
7         12,900    7,542          4,431          949               93       54             32             7
8         9,327     6,315          2,343          670               67       45             17             5
9         41,000    39,000         1,927          48                295      281            14             0
10        10,600    7,909          2,575          111               76       57             19             1
Total     278,027   223,466        48,883         5,636             2000     1608           352            41

* The sample size of rural counties is relatively smaller. Using the methodology discussed in Ott et al.11, a sample size for each rural county will be computed such that it captures at least 50% of the total variability in the distribution of population and SES, using equation (4).

Table 7: Population and estimated sample size by metropolitan/urban status across EPA regions.
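The regional sample sizes in Table 7 follow the first-stage rule in Table 6, ni = Pi/PUS, applied to a total of 2000. The sketch below reproduces them from the regional population totals. The function name is illustrative, and rounding to the nearest integer is an assumption that happens to match the tabulated values.

```python
# Total population (thousands) by EPA region, from Table 7.
region_population = {
    1: 13800, 2: 26900, 3: 27200, 4: 53300, 5: 50100,
    6: 32900, 7: 12900, 8: 9327, 9: 41000, 10: 10600,
}


def proportional_allocation(populations, total_sample):
    """Allocate total_sample across strata in proportion to population.

    Each share is rounded to the nearest integer (an assumed rounding rule).
    """
    total_population = sum(populations.values())
    return {stratum: round(total_sample * pop / total_population)
            for stratum, pop in populations.items()}


allocation = proportional_allocation(region_population, 2000)
print(allocation)
```

Independent rounding does not guarantee in general that the allocated sizes sum to the target; here they do sum to 2000.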

Fig 1: 400m LandScan population data for Chicago, 2005.3

[Figure 2: four histograms (y-axis: frequency) of the weighted composite index, population with some post-secondary education (%), median household income ($), and racial/ethnicity index.]

Fig 2: Statistical distribution of the selected SES variables.

Fig 3: Implementation strategy of the Optimal Spatial Sampling Design.

[Figure 3: flow diagram. Stage labels: Input, Pre-processing, Processing, Output, Post-processing, Application. Box labels: Identify population or predictor(s) of population of interest; Collecting/Imputing Contextual Data; Geospatial Processing of data; Contextual Sampling Frame {Zijk}; Sampling Constraint (hijk); Sample Size (nij) or Variance Threshold (Cij); Optimal Spatial Sampling (OSS) (variance maximization & sample minimization), max|Zn(ij)|; Geo-spatial processing (control for spatial autocorrelation); An optimal set of sample sites; Multi-level socio-physical Contexts; List of residential units around optimal sites; Validation (comparison of OSS with other Spatial Sampling Methods); Drawing Respondents.]











[Figure 4: four variogram panels (x-axis: distance (m), 0 to 30000; y-axis: gamma). A. Semivariance of composite weighted index; B. Post-secondary education; C. Diversity Index; D. Median family income.]

Fig 4: Variograms of the four selected SES variables.

Fig 5: Sample site optimization program written in VC++.

[Figure 6 panels: A. Composite index (final sample of 97 sites); B. Median household income; C. Diversity index; D. Post-secondary education.]

Fig 6: Optimized sample sites based on four different input weights

Fig 7: Optimized sample sites based on composite index, household income, ethnic/racial disparity and post-secondary education

[Figure 8: line plot (x-axis: number of sample sites, 20 to 100; y-axis: % cumulative variance captured, .2 to .8). Series: Composite Index, Ethnicity/Racial Diversity, Household Income, Post-secondary Education.]

Fig 8: Sample size and % spatial autocorrelation-controlled semivariance captured.

Fig 9: Histogram of regularity statistics from 1000 samples using Independent Random Sampling, Restricted Random Sampling, Weighted Restricted Random Sampling, Generalized Random-Tessellation Stratified and Variance Quadtree, compared with the statistic from an Optimal Spatial Sample of the same size.

[Figure 10 panels: histogram of LandScan population count (y-axis: frequency); map of LandScan population (axes: Easting, Northing); map of logarithm weight index (axes: Easting, Northing).]

Fig 10: Distribution of population and composite index in Chicago MSA, 2000.

Fig 11: EPA regions

[Figure 12: schematic. Labels shown: uniform distribution of PRNG (k1, k2, on a 0 to 1 scale); cumulative population proportion of jth sub-regions of ith EPA region (Rij = 1, Rij = 2, ..., Rij = n; axis values 0, 0.1, 0.42, 1); range of Rij = 2.]

Fig 12: Process of selecting sub-regions (MSA, MSA/County and County) at the second stage.
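One reading of the schematic in Fig 12 is selection with probability proportional to population: a uniform pseudo-random draw is matched against the cumulative population proportions of the sub-regions. The sketch below follows that reading. It is an assumption-laden illustration, not the authors' implementation; the function name, sub-region names, and populations are hypothetical, chosen so that the cumulative proportions are 0.1, 0.42 and 1.

```python
import bisect
import random
from itertools import accumulate


def select_subregion(populations, u):
    """Pick the sub-region whose cumulative population proportion first reaches u.

    populations: dict mapping sub-region name -> population (insertion order is kept).
    u: a uniform draw in [0, 1).
    """
    names = list(populations)
    total = sum(populations.values())
    cumulative = list(accumulate(populations[name] / total for name in names))
    index = bisect.bisect_left(cumulative, u)
    # Guard against floating-point round-off in the final cumulative value.
    return names[min(index, len(names) - 1)]


# Hypothetical sub-regions with cumulative proportions 0.1, 0.42 and 1.
subregions = {"R1": 10, "R2": 32, "R3": 58}

rng = random.Random(2010)  # fixed seed so the draws are repeatable
draws = [select_subregion(subregions, rng.random()) for _ in range(5)]
```

Under this scheme a draw between 0.1 and 0.42 selects the second sub-region, so larger sub-regions occupy wider ranges and are selected more often.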
