use of gis in analyzing environmental math. ura-reports/032/zheng.cai/report.doc · web...


Post on 14-May-2018








Zheng Cai under the supervision of Prof. D. Myers and Seumas Rogan

University of Arizona

Spring 2003


The Atlas of Cancer Mortality in the United States (1950-1994) tabulates the distribution of cancer in the United States by county. Its construction and use rest on several assumptions. In particular, it implies that cancer risk is equal across a county and that the reported risk represents all residents of that county. This assumption would be reasonable if the county were fairly homogeneous in both the characteristics of the underlying population and exposure risks. Nevertheless, it may not be appropriate for many states in the western United States. These states are usually divided into only a few counties, each of which covers a large geographic area with an unevenly distributed population. Therefore, county-level statistics are unlikely to adequately represent the range of actual experiences within a county.

In general, the goal of this three-year research project is to examine the geographic variation in the relationship between cancer risk and arsenic in the United States. Exposure to arsenic may contribute to the development of bladder, lung, kidney, and skin cancers. Meanwhile, arsenic concentrations vary between geographic locations. The state of Arizona is divided into several counties, each covering a large geographic area with an unevenly distributed population. To represent the range of actual experiences within a county, we need to study the geographic variation further using Geographic Information Systems (GIS) software.

This report presents research on this geographic variation, using GIS software to analyze data for Cochise County, Arizona. Modeling groundwater arsenic concentrations was the main goal of this research project.

Methods applied

A geographic information system (GIS) is a type of mapping software that links data about real-world objects to an onscreen map. It has four main uses: data creation, data display, analysis, and output. In practice, it is used to connect multiple sources of georeferenced health statistics data. Insights into diverse health-environment-behavior interactions can be derived by identifying clusters of cancer incidence and then comparing them with cluster locations in the mapped distribution of arsenic. The Geostatistical Analyst feature in ArcMap version 8.1, built by ESRI Inc., was used in the geostatistical analyses reported below.

Interpolation techniques are mainly categorized as deterministic or geostatistical (stochastic). Deterministic interpolation creates surfaces from the measured points, based on either the extent of similarity or the degree of smoothing. It can be divided into two subgroups: global and local. Geostatistical interpolation uses the statistical properties of the measured points.

1. Inverse Distance Weighted (IDW)

Assuming that values at locations close to one another are more alike than those farther apart, Inverse Distance Weighted (IDW) interpolation uses the measured values surrounding the prediction location to predict a value for any unmeasured location. In IDW, measured values closer to the prediction location have more influence on the predicted value than those farther away.

In this experiment, IDW was conducted on the 170 points in the training set using a power of 2 and the neighborhood method. The search neighborhood included 15 neighbors (at least 10) within an elliptical (quadrant) search window. An example of this search neighborhood is shown in the accompanying figure.
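As an illustration of the weighting rule described above, the following is a minimal Python sketch of IDW prediction. It is not the ArcMap implementation: the function name `idw_predict`, the sample wells, and the plain nearest-neighbor search (instead of the elliptical quadrant window) are assumptions made for brevity. Only the power of 2 and the limit of 15 neighbors come from the report.

```python
import math

def idw_predict(samples, x0, y0, power=2.0, max_neighbors=15):
    """Predict a value at (x0, y0) from (x, y, value) samples by IDW."""
    # Distance from the prediction location to every measured point.
    dists = [(math.hypot(x - x0, y - y0), v) for x, y, v in samples]
    # Keep only the nearest neighbors (simple search, no quadrants).
    nearest = sorted(dists, key=lambda dv: dv[0])[:max_neighbors]
    # A measured point at the prediction location returns its own value.
    if nearest[0][0] == 0.0:
        return nearest[0][1]
    # Weights fall off with distance raised to the chosen power.
    weights = [1.0 / d ** power for d, _ in nearest]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / total

# Hypothetical wells: (x, y, arsenic concentration).
wells = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (0.0, 4.0, 40.0)]
estimate = idw_predict(wells, 1.0, 0.0)
```

In this toy example the two wells at distance 1 dominate the estimate, while the distant well with the highest value contributes only a small weight.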

2. Ordinary Kriging

Ordinary Kriging assumes the model $Z(s) = \mu + \varepsilon(s)$, where $\mu$ is an unknown constant and the $\varepsilon(s)$ are random fluctuations. It allows for local influences due to nearby neighborhood values. It can produce prediction, quantile, probability, or standard error maps from data that vary continuously in space. Because the mean is unknown, few assumptions are made, which makes the method particularly flexible.

In this analysis, Ordinary Kriging was conducted on the 170 points in the training set using a Spherical variogram model, an automatic lag size of 4364.1, and 12 lags. We also used the neighborhood method, including 5 neighbors (at least 2); the shape type of the search window is not stated. The equation for the variogram is

$\gamma(h) = 161.77 \cdot \mathrm{Spherical}(20473) + 32.728 \cdot \mathrm{Nugget}$,

where the Nugget effect is the sum of measurement error and small-scale irregularities (microscale variation). Because either component can be zero, the Nugget effect can consist entirely of one or the other.
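To make the fitted model concrete, the sketch below evaluates a spherical semivariogram in Python. It assumes the usual reading of the Geostatistical Analyst notation, in which 161.77 is the partial sill, 20473 is the range, and 32.728 is the nugget; the function and parameter names are illustrative and do not come from the report.

```python
def spherical_variogram(h, partial_sill=161.77, major_range=20473.0,
                        nugget=32.728):
    """Semivariance at lag distance h for a spherical model with nugget."""
    # By definition the semivariance at zero distance is zero.
    if h <= 0.0:
        return 0.0
    # Beyond the range the curve is flat at the sill (nugget + partial sill).
    if h >= major_range:
        return nugget + partial_sill
    r = h / major_range
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)

sill = spherical_variogram(20473.0)          # nugget + partial sill
half_range = spherical_variogram(20473.0 / 2.0)
```

Under these assumptions the curve jumps from 0 to the nugget for any positive distance, rises with distance, and levels off at the sill once the range is reached.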

The plot of the experimental variogram and the fitted model is shown in the following figure.

The figure above shows a typical search neighborhood for Ordinary Kriging.
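The way Ordinary Kriging turns a variogram into prediction weights can be sketched as follows. This is a simplified Python illustration, not the Geostatistical Analyst procedure: it uses all supplied points rather than a search neighborhood, the sample coordinates and values are hypothetical, and the variogram parameters are read from the fitted equation under the assumption that they are partial sill, range, and nugget.

```python
import math

def spherical(h, partial_sill=161.77, major_range=20473.0, nugget=32.728):
    """Spherical semivariogram with nugget (assumed parameter meaning)."""
    if h <= 0.0:
        return 0.0
    r = min(h / major_range, 1.0)
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ordinary_kriging(samples, x0, y0, variogram=spherical):
    """Return (prediction, weights) at (x0, y0) from (x, y, value) samples."""
    n = len(samples)
    a = [[0.0] * (n + 1) for _ in range(n + 1)]
    b = [0.0] * (n + 1)
    for i, (xi, yi, _) in enumerate(samples):
        for j, (xj, yj, _) in enumerate(samples):
            a[i][j] = variogram(math.hypot(xi - xj, yi - yj))
        # The extra row and column force the weights to sum to one,
        # which is how the unknown constant mean is handled.
        a[i][n] = a[n][i] = 1.0
        b[i] = variogram(math.hypot(xi - x0, yi - y0))
    b[n] = 1.0
    weights = solve(a, b)[:n]
    prediction = sum(w * v for w, (_, _, v) in zip(weights, samples))
    return prediction, weights

# Hypothetical wells: (x, y, arsenic concentration).
wells = [(0.0, 0.0, 10.0), (5000.0, 0.0, 20.0), (0.0, 8000.0, 40.0)]
pred, w = ordinary_kriging(wells, 2500.0, 2000.0)
```

The weights always sum to one, and predicting exactly at a measured location returns the measured value in this formulation.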

3. Local Polynomial Interpolation (mean value)

The conceptual basis for Local Polynomial Interpolation is to fit many smaller overlapping planes and then use the center of each plane as the prediction for each location in the study area. The resulting surface is more flexible, and perhaps more accurate, than a single plane fitted to all the data. This method fits many polynomials, each within a specified overlapping neighborhood, and it is sensitive to the neighborhood distance.

In the experiment, Local Polynomial Interpolation was conducted on the 170 points in the training set using a weight of 125644.96 and a power of 1. It also used the neighborhood method, including 165 neighbors (at least 10). An example neighborhood is shown in the following figure.

4. Global Polynomial Interpolation (mean value)

The Global Polynomial Interpolation method fits a plane between the sample points based on the overriding trend. A plane is a special case of a family of mathematical formulas called polynomials. The goal of interpolation is to minimize error. One can measure the error by subtracting each measured value from its predicted value on the plane, squaring the difference, and summing the results. Choosing the plane that minimizes this sum is referred to as a least-squares fit. This process is the theoretical basis for first-order Global Polynomial Interpolation. Global Polynomial Interpolation fits a smooth surface, defined by a mathematical function (a polynomial), to the input sample points. The resulting surface changes gradually and captures the coarse-scale pattern in the data.
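The least-squares idea behind a first-order global polynomial can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the five sample points are hypothetical, and the plane is found by solving the normal equations directly rather than by whatever routine ArcMap uses.

```python
def fit_plane(samples):
    """Least-squares fit of z = b0 + b1*x + b2*y to (x, y, z) samples."""
    rows = [(1.0, x, y) for x, y, _ in samples]
    zs = [z for _, _, z in samples]
    # Normal equations (X^T X) b = X^T z, stored as an augmented 3x4 matrix.
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * z for r, z in zip(rows, zs))] for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = sum(m[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (m[i][3] - s) / m[i][i]
    return coef

def sum_squared_error(samples, coef):
    """Sum of squared differences between the plane and the measurements."""
    b0, b1, b2 = coef
    return sum((b0 + b1 * x + b2 * y - z) ** 2 for x, y, z in samples)

# Hypothetical points lying exactly on the plane z = 1 + 2x + 3y.
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0), (0.0, 1.0, 4.0),
          (1.0, 1.0, 6.0), (2.0, 1.0, 8.0)]
coef = fit_plane(points)
sse = sum_squared_error(points, coef)
```

Because the sample points lie exactly on a plane, the fit recovers that plane and the sum of squared errors is essentially zero; with real well data the sum would be positive and the plane would capture only the broad trend.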


Before starting the geostatistical analysis, we randomly divided the data into two parts, with 170 data points in the training set and 64 in the validation set. The following diagrams show the frequency distributions of the all-wells data set, the training data set, and the validation data set.
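A random split of this kind can be sketched as follows. The report does not say how the split was generated, so the function name, the fixed seed, and the placeholder well identifiers are assumptions; only the 170 and 64 counts come from the text.

```python
import random

def split_wells(records, n_train=170, seed=0):
    """Randomly split records into a training and a validation list."""
    shuffled = list(records)
    # A fixed seed (an assumption here) makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder identifiers standing in for the 234 wells.
wells = list(range(234))
training, validation = split_wells(wells)
```

Comparing the frequency distributions of the two subsets with that of the full data set, as the report does, is a simple check that the random split did not produce an unrepresentative training set.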

It appears that all three diagrams have the same general shape.

1. Inverse Distance Weighted (IDW)

Figure 1 illustrates the modeling results using IDW. IDW results in a pattern with many local hot-spots and cold-spots. There appears to be a characteristic trend of high arsenic concentrations from south-central to north-central Cochise County.

Figure 2, with its table, summarizes the descriptive statistics for the difference (error) between the validation set (N = 64 points) and the modeled arsenic concentrations using IDW. The mean and standard deviation of the error are 1.977741 mg/L and 9.241922 mg/L respectively. The frequency distribution shows that most errors are between -4.9 and 10.5 mg/L.

Figure 3 shows that IDW is a conservative method, since it underestimates values compared to the measured points. The regression function for the blue line in the figure is 0.057*x + 4.249. The mean and root-mean-square errors are -0.2329 and 12.92 mg/L.

Figure 4 shows that the prediction error is small for values between 0.01 and 0.19, but a few points above 0.19 have quite large errors. Its regression function is -0.943*x + 4.249.
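The summary statistics and the two regression lines quoted for Figures 3 and 4 are related in a simple way, which the following Python sketch illustrates. It assumes the error is defined as predicted minus measured, and it uses small hypothetical data rather than the report's wells. Under that definition, the error regression has the same intercept as the prediction regression and a slope that is lower by exactly 1, which matches the reported pair 0.057*x + 4.249 and -0.943*x + 4.249.

```python
import math

def error_summary(measured, predicted):
    """Mean error and root-mean-square error (predicted minus measured)."""
    errors = [p - m for m, p in zip(measured, predicted)]
    n = len(errors)
    mean_error = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mean_error, rmse

def fit_line(xs, ys):
    """Ordinary least-squares line y = slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical measured and predicted concentrations.
measured = [2.0, 5.0, 9.0, 30.0]
predicted = [4.0, 5.0, 6.0, 7.0]

mean_error, rmse = error_summary(measured, predicted)
pred_slope, pred_icpt = fit_line(measured, predicted)
errors = [p - m for m, p in zip(measured, predicted)]
err_slope, err_icpt = fit_line(measured, errors)
```

A prediction slope far below 1, as in the report, means that high measured values are pulled strongly toward the average, which is the smoothing behavior the text describes as conservative.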

2. Ordinary Kriging

Figure 1 shows the results using Ordinary Kriging. Ordinary Kriging results in a pattern with many local hot-spots. There appears to be a characteristic trend of high arsenic concentrations from south-central to north-central Cochise County.

Figure 2, with its table, summarizes the descriptive statistics for the difference (error) between the validation set (N = 64 points) and the modeled arsenic concentrations using Ordinary Kriging. The mean and standard deviation of the error are 1.8047777 mg/L and 9.605348 mg/L respectively. The frequency distribution shows that most errors are between -6.5 and 6.9 mg/L.

Figure 3 plots the predicted values against the measured values. It shows that Ordinary Kriging underestimates values relative to the measured training set points. Thus, Ordinary Kriging is also a conservative method. Its regression function is 0.062*x + 4.761. The mean error is 0.07891 mg/L, the root-mean-square error is 12.54 mg/L, and the average standard error is 12.51 mg/L. The mean standardized error is 0.005696 and the root-mean-square standardized error is 1.002; both are dimensionless.

Figure 4 shows the errors between the predicted and measured points. In general the error is small, since most points lie near zero. Its regression function is -0.938*x + 4.761.

Figure 5 shows the QQ plot (quantile-quantile plot) of the standardized prediction errors. We note that the errors are not normally distributed.
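The standardized statistics quoted above can be sketched as follows, assuming each standardized error is the prediction error divided by the kriging standard error at that point. The function name and the three hypothetical values are illustrative only.

```python
import math

def standardized_summary(measured, predicted, std_errors):
    """Mean and root-mean-square of the standardized prediction errors."""
    # Each error is scaled by its own prediction standard error,
    # so the results carry no units.
    std = [(p - m) / s for m, p, s in zip(measured, predicted, std_errors)]
    n = len(std)
    mean_std = sum(std) / n
    rms_std = math.sqrt(sum(e * e for e in std) / n)
    return mean_std, rms_std

# Hypothetical measured values, predictions, and standard errors.
measured = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
std_errors = [2.0, 2.0, 3.0]
mean_std, rms_std = standardized_summary(measured, predicted, std_errors)
```

A root-mean-square standardized error close to 1, such as the reported 1.002, suggests that the kriging standard errors are a reasonable description of the actual prediction variability; values well above 1 would indicate that the variability is underestimated.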

The following map is the prediction standard error map obtained from the Ordinary Kriging method. It shows that the prediction is usually good around those areas with a great a