H2Uh-Oh, a Black Hawk County and Iowa Water Quality Analysis:

Exploring time-series using one-way analysis of variance and Mann-Kendall/Sen-Theil Regression

Salil Kalghatgi

12/31/13


Intro

Global warming and pollution risk a disturbing shift away from a Holocene Earth. Scientists struggle to quantify the major explanatory variables because the task is dauntingly large and vividly complex (Rockström et al., 2009). Interwoven fabrics of nature connect Iowa fertilizer runoff with increasing global temperatures, so local analysis is needed when postulating global risk (Groffman et al., 2006). Furthermore, pollution degrades local health for both humans and other creatures; humans must treat the Earth better, and monitoring ecology is vital to properly allocated policy. Water's centrality to life makes water quality an important variable of interest. To contribute and learn, I statistically analyze Black Hawk County and Iowa water quality, gaining preliminary insight into trends between wells (interwell) and within wells (intrawell) through the STORET/WQX Water Quality Database (Iowa DNR).

My limited familiarity with chemical interactions directed my focus toward the Iowa Water Quality Index (IWQI). Other indexes exist, but geographical variations make the IWQI much more relevant. 70 currently monitored well testing locations across Iowa compose the IWQI, with water quality rated on a scale from 0 to 100 (best). No data was collected for consecutive months in 2008, presumably due to floods. Special thanks go to Richard Langel, a geologist at the Iowa Department of Natural Resources, who provided database queries.

Fortunately, several statistical resources exist for amateur statisticians, including the EPA's “Unified Guidance” (2009) and Pazdernik's analysis in “Murky Waters” (2012), which respectively help determine acceptable statistical methods and analyze Iowa's water quality. I believe it is within my scope to present a preliminary analysis, first validating Pazdernik's state-wide conclusions and subsequently using these models to analyze Black Hawk County waters.

Our state-wide sample sites consist of:

Site Name                                 STORET ID
Volga River near Elkport                  10220002
Soldier River near Pisgah                 10430002
Cedar River Downstream of Cedar Rapids    10570001
South Skunk River near Oskaloosa          10620001
East Nodaway near Clarinda                10730002
North River near Norwalk                  10910002

While Black Hawk sample sites are located at:

Site Name                 STORET ID
Beaver Creek              10070001
Wolf Creek                10070002
West Fork Cedar River     10070003
Black Hawk Creek          10070004
Cedar River Upstream      10070005
Cedar River Downstream    10070006

Iowa's Water Quality Index is calculated using nine common water parameters: dissolved oxygen, E. coli bacteria, 5-day BOD, total phosphorus, nitrate + nitrite as N, total detected pesticides, pH, total dissolved solids, and total suspended solids.

The state-wide sample populations were chosen from the 70 high quality Iowa well sites using a random number generator.
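A random draw of this kind can be sketched in a few lines of Python. This is only an illustration and not the procedure actually used: the seed, the placeholder site labels, and the sample size of six (matching the six state-wide sites) are assumptions.

```python
import random

# Placeholder labels for the 70 candidate sites (hypothetical; not real STORET IDs).
candidate_sites = [f"site_{i:02d}" for i in range(1, 71)]

# A fixed seed (an arbitrary assumption) makes the draw repeatable.
rng = random.Random(2013)

# Sample six sites without replacement.
sample_sites = rng.sample(candidate_sites, k=6)

print(sample_sites)
```

Recording the seed alongside the drawn sites would let a reader reproduce the same selection.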

The data structure makes multiple contrasts an attractive goal for judging interwell aspects, where analysis may either determine the magnitude of difference between water quality across Iowa or identify the most desperate situations to which policy resources should be allocated (or, optimistically, good water quality models). Water's role in life is so vital that it is clearly a long-term resource (Spiceland, 2010), subject to and providing for essentially all observable life. To ensure the long-term health of this resource, continued monitoring is essential, which raises a set of questions regarding water quality trends. Admittedly, I did not originally intend to perform time series analysis, but the data's nature necessitates temporal exploration.

One major missing component of water quality in this report is 'water flow'. Water flow is very important in determining water quality, and further analysis should better incorporate it. Along with water flow, I have omitted chemical composition analysis and the creation of water quality sub-indexes in intrawell tests, due to time limitations (Pazdernik, 2012). Other equally important variables excluded from this analysis include precipitation, temperature, and cultural practices (exogenous variables; Helsel & Hirsch, 2011). Additionally, this research does not incorporate well-specific background data to identify naturally occurring groundwater constituents, meant to assess human and natural forces. Background data is noted as being very important to water quality analysis (EPA, 2009).

Matters of space and time are a common theme in ecological analysis, where different dimensional aspects of life dictate data patterns. Environmental statistical analysis compensates for these data distortions by reshaping data through transformations or by performing robust tests. Non-parametric tests are used when data do not follow a normal curve and, because of concepts such as seasonality, are used often in water analysis.

Each well location has been tested monthly for the past thirteen years, and I am comfortable with the sample size and quality of the data. However, due to the variations involved with water quality, we cannot assume strict independence, which is certainly one of the largest reasons for decreases in statistical power. As an example, in Black Hawk Creek at Waterloo, an Index rating of 54 in August 2013 is tied to the ratings of the preceding and following months, 28 and 85 respectively (autocorrelation), and therefore is not the result of an independent test. Similarly, from a spatial perspective, some well locations are closer to others, and water quality data is non-stationary because its mean and variance change with time and space. Because these independence assumptions can be difficult to swallow (although transformations certainly help), our multiple comparison tests (ANOVA and Kruskal-Wallis) primarily identify spatial and temporal variations, while our trend analysis (Mann-Kendall and Seasonal Mann-Kendall) is specific to each well population (Harrison, 2013; EPA, 2009). Some of our data organization grants us normality or equality of variances, but the tests conducted suggest non-normal, heteroscedastic data. It is important to note that while our tests may identify either spatial or temporal variations more concretely, it is difficult to completely separate out interactions.
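The month-to-month dependence described above can be checked with a lag-1 sample autocorrelation. The sketch below is a minimal illustration in Python; the function name and the IWQI values are hypothetical and are not taken from the STORET data.

```python
from statistics import fmean


def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = fmean(series)
    # Total sum of squared deviations from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Sum of products of deviations that are `lag` steps apart.
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Hypothetical monthly IWQI ratings for one well, invented for illustration.
iwqi = [62, 58, 40, 28, 54, 85, 70, 66, 61, 45, 30, 52]
r1 = lag_autocorrelation(iwqi, lag=1)
print(f"lag-1 autocorrelation: {r1:.3f}")
```

A value near zero would be consistent with independent monthly ratings, while values well away from zero point to the serial dependence that weakens the tests used here.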

Monthly Site Populations

Certain well locations seemingly share similar distributions, and distribution similarities (and differences) testify in some part to space, time, randomness, or other variables. After spatial-temporal variations are removed, distributions may explain water quality on a statewide basis, and in a manner easier for trend detections.

Each well location has at least 151 observations, so we are comfortable with the accuracy of the Shapiro-Francia test for normality, and comfortable rejecting the null hypothesis of normality within wells. While we can certainly continue our exploration into equality of variances, I believe it is more beneficial to first introduce concepts of seasonality.

From the boxplot above, we see different distribution patterns, outliers, large ranges, and mostly skewed data. The combination of these characteristics is not wholly inviting, but it is a fairly common reality in data and environmental analysis. Specifically, because many questions involve the 'end-result' and prediction, time series analyses are useful but add complexity. Whether one is trying to judge movements of a NYSE stock price (the combination of psychological attitudes derived from a multitude of variables) or the equally difficult task of analyzing water quality, we must at least attempt to uncover significance in patterns.

Annual Site Populations

Seasonality is perhaps best communicated through the following decomposition graph of the Volga River (note: due to missing data during 2008, I have enabled an approximation function, visualized by the straight line in the 'data' section of the graph). Our seasonal graph indicates a consistent recurring pattern with an annual cycle, that is, a frequency of 12 months. This seasonality is caused mainly by environmental and human patterns, including non-point-source fertilizer pollution from run-off.
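A straight-line approximation across a gap of missing months amounts to linear interpolation between the nearest observed values. The sketch below illustrates that idea under this assumption; the exact function used for the graph is not stated in this report, and the helper name and numbers are made up.

```python
def interpolate_gaps(values):
    """Fill interior None gaps by linear interpolation between neighbours.

    Leading or trailing None values are left untouched, because there is
    no observation on one side to interpolate from.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap > 1:
            step = (filled[right] - filled[left]) / gap
            for k in range(1, gap):
                filled[left + k] = filled[left] + step * k
    return filled


# Hypothetical monthly ratings with two missing months.
series = [60.0, None, None, 30.0, 45.0]
print(interpolate_gaps(series))
```

Interpolated points are placeholders that keep the series evenly spaced for decomposition; they carry no new information about water quality during the gap.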

Seasonality complicates trend analysis by obscuring long-term trends behind patterns existing at smaller time frames (e.g. quarterly, monthly, daily). Our goal is to use relevant data collected during the past 13 years to best characterize long-term trends. To do so, we isolate only the Aprils (a randomly chosen month) from each of the six well locations in our preliminary tests. By isolating one month across the 13 years (reducing our observation points per well from 151 to 13), and under the assumption of seasonality, we reason that the newer data set better showcases long-term patterns with less short-term temporal variation; we are essentially blocking on a single month. Further research should analyze seasonality with regard to each of the twelve months when determining trend existence.
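Isolating a single calendar month from a monthly record is a simple filter. The sketch below shows one way to do it, assuming records are stored as (year, month, IWQI) tuples; that layout, the helper name, and the values are hypothetical.

```python
# Hypothetical monthly records for one well: (year, month, IWQI rating).
records = [
    (2007, 3, 61.0), (2007, 4, 55.0), (2007, 5, 43.0),
    (2008, 3, 58.0), (2008, 4, 47.0), (2008, 5, 39.0),
    (2009, 4, 63.0),
]


def isolate_month(records, month):
    """Return {year: rating} keeping only the chosen calendar month."""
    annual = {}
    for year, m, rating in records:
        if m == month:
            annual[year] = rating
    # Sort by year so the result reads as a time series.
    return dict(sorted(annual.items()))


april = isolate_month(records, month=4)
print(april)
```

Repeating the call for months 1 through 12 would give the twelve single-month series suggested for further research.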

Annual data (our new data set)

                  Well
IWQI APRIL_07     y
IWQI APRIL_08     y

Monthly data (our original data set)

                  Well
IWQI APRIL_07     y
IWQI MAY_07       y

Note, the labeling of 'annual' and 'monthly' may seem counterintuitive, but this nomenclature strives to explain the purpose of choosing data from one month.

The boxplot below shows increased standardization, and Shapiro-Wilk normality tests indicate increased, but still relatively weak, normality. Due to the smaller sample sizes, we instead test the residuals for normality using the Shapiro-Francia test, which gives some evidence of normality (p=0.05743). Using the ladder of powers (Helsel & Hirsch, 2011), transforming the data to the -6th power (found through trial and error, and coincidentally the same transformation used in the annual analysis) maximizes equality of variance (Brown-Forsythe-Levene p-value). For both annual and monthly data, we obtain data with equal variances but no normality; rejecting the Kruskal-Wallis null hypothesis then leads to interwell multiple comparison tests, using paired tests under a Bonferroni alpha adjustment (later non-parametric post-hoc tests also incorporate Bonferroni alpha adjustments).
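For reference, the Bonferroni adjustment divides the family-wise alpha by the number of pairwise comparisons. With six wells there are 15 pairs. The sketch below assumes a family-wise alpha of 0.05, which is not stated in this report; the short well labels abbreviate the state-wide site names.

```python
from itertools import combinations

# Short labels for the six state-wide sites.
wells = ["Volga", "Soldier", "Cedar", "Skunk", "Nodaway", "North"]

# Every unordered pair of wells is one comparison.
pairs = list(combinations(wells, 2))

family_alpha = 0.05  # assumed family-wise error rate
adjusted_alpha = family_alpha / len(pairs)

print(f"{len(pairs)} pairwise comparisons")
print(f"per-comparison alpha: {adjusted_alpha:.5f}")
```

Under that assumption, each paired test would need a p-value below roughly 0.0033 to be declared significant, which is part of why fewer differences survive the adjustment.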

IOWA ANALYSIS               Monthly^-6      Annual^-6
Equality of Variance        P=0.1723        P=0.3617
Normality                   P<2.2e-16       P=4.339e-12
Detection of differences    P<2.2e-16       P=0.0409

Multiple comparisons
  Monthly^-6: Volga & Nodaway are similar to each other, and different from other wells (which are similar)
  Annual^-6:  Some signs of significant difference between Volga vs. Cedar and Skunk (which are similar)

After isolating some time variation, our annual analysis is apparently much more stringent in declaring differences, possibly suggesting greater similarities among Iowa water quality.

April Populations

One final data technique that I believe helps describe the data is simply transposing the April data:

          IWQI APRIL_07    IWQI APRIL_08
Well      y                y

Now we can trend across the April years, hopefully better understanding the temporal aspects of state-wide Iowa water quality.

Because each sample now has only six observations, we use the residuals to perform normality tests, and we fail to reject normality (p=0.4351). Because of the extreme sensitivity of Hartley's F-test to departures from normality, we continue using the BFL test, which gives some evidence of equal variance (p=0.07944).

At this point, the choice between ANOVA and the Kruskal-Wallis test rests on one's alpha level, and both can certainly be run to test under both equal and non-equal variance conditions.
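To make the Kruskal-Wallis option concrete, the sketch below computes the H statistic from pooled ranks, with the usual correction for ties. It is an illustration only, written with the standard library; the function names and the sample values are hypothetical, and the p-value step (comparing H with a chi-square distribution on k-1 degrees of freedom) is left to a statistics package.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) for a list of samples."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction, len(groups) - 1


# Hypothetical April ratings for three wells, invented for illustration.
samples = [[55, 61, 47, 70], [42, 38, 51, 45], [66, 72, 58, 63]]
h_stat, df = kruskal_wallis_h(samples)
print(f"H = {h_stat:.3f} on {df} degrees of freedom")
```

Because the statistic depends only on ranks, a monotonic power transformation of the ratings leaves H unchanged, unlike the ANOVA F statistic.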

I am inclined to suggest not using an ANOVA, mainly due to questions of independence; however, an experienced hydrologist may feel the assumptions are correctly met. Most of these methods are superseded by prediction limits and control charts, but we will not extend our analysis into these realms. In fact, “regulatory restrictions for per-constituent alpha levels using ANOVA make it difficult to adequately control site-wide false positive rates” (EPA, 2009).

While the variances are similar, the ratio between the largest and smallest years' standard deviations is greater than 3 (approximately 4.3), so the F-test will severely lose power (EPA, 2009).
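The ratio check itself is simple arithmetic: compute the sample standard deviation for each year and divide the largest by the smallest. The sketch below uses made-up ratings for three years, with six wells each, purely to show the calculation; the threshold of 3 is the one quoted above.

```python
from statistics import stdev


def sd_ratio(groups):
    """Ratio of the largest to the smallest sample standard deviation."""
    sds = [stdev(values) for values in groups.values()]
    return max(sds) / min(sds)


# Hypothetical April ratings for six wells in each of three years.
by_year = {
    2005: [60, 62, 58, 61, 59, 63],
    2006: [20, 45, 12, 38, 30, 50],
    2007: [35, 40, 28, 44, 31, 39],
}

ratio = sd_ratio(by_year)
print(f"largest/smallest SD ratio: {ratio:.2f}")
print("exceeds 3" if ratio > 3 else "does not exceed 3")
```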

To best compare April populations to our monthly and annual analysis, I performed both an F-test and a Kruskal-Wallis test, leading respectively to Tukey HSD and Wilcoxon rank-sum multiple comparison tests.

IOWA ANALYSIS                                April (untransformed)          April (transformed, ^-3)
Equality of Variance                         P=0.07944                      P=0.9195
Normality                                    P=0.4351                       P=1.8e-05
Detection of differences in means/medians    P=0.000673                     P=0.0003512
Multiple comparisons                         2002, 2005, & 2011 vs 2006     2002, 2003, 2005, 2011, 2013 vs 2006, 2007

Iowa Exploration Test Summaries

Transformations equalize variances and let us detect significant differences between higher and lower quality wells, but these differences are less pronounced when using annual data. Significant differences also exist between years, as we see low water quality across the board during 2006 and 2007, due to floods. Some suggest excluding outliers in water quality analysis, but with data existing only for the past thirteen years, I believe excluding the possibility of periodically occurring flood events is premature (Skopec, 2010). Notable differences in centrality when categorizing samples by site versus by year allude to existing spatial and temporal effects: events occurring across time significantly affect water quality throughout the state, and events occurring at different locations create significant differences between well water quality results. Dissecting the data differently and analyzing a greater number of sample sites will help us better understand temporal and spatial effects on Iowa water quality.

At this stage, we have separated our Iowa data into three distinct forms:

Characteristics/Name    Monthly Site Populations     Annual Site Populations             April Populations
Column Names            e.g. Volga, Soldier          Volga, Soldier                      April_07, April_08
Row Names               e.g. Apr_07, May_07          April_07, April_08                  Volga, Soldier
Description             12 observations, 13 years    1 observation per year, 13 years    6 observations per year, 13 years

Because the data exhibit extreme observations, whether through the lens of time or location, we find them generally hard to read using ANOVA or Kruskal-Wallis tests and are somewhat limited in our ability to perform interwell tests. However, using transformations, we achieve equal variance and can perform paired multiple comparison tests identifying differences between well locations and times. Our annual site population tests primarily display different distribution patterns among a spectrum of site water quality. If equal variance is found, a two-way ANOVA may help the analysis, but a major cause of the skewed data is the particularly low water quality during the summer months, and some suggest analyzing the summer months separately from the other months (Pazdernik, 2012). Similarly, our April population tests show that while centrality is often higher than 40, 4 years exhibit severely low water quality ratings across the board, attributable to heavy floods during those years (Skopec, 2010). We may be inclined to remove these flood years, but we must remember that floods seem to be part of a trend and must be included in water quality discussion. It can be tempting to extrapolate from time-series analysis (do flood data exhibit patterned characteristics?), but as the time frames of cyclical patterns increase, more data is necessary before existing flood patterns can be properly described.

Mann-Kendall & Theil-Sen tests

Commonly cited in water quality analysis, the Mann-Kendall and Theil-Sen tests are non-parametric, robust against heterogeneity in variance, resistant to outliers, and, most importantly, able to handle paired observations (Mann-Kendall tau-beta). Mann-Kendall tests for the existence of a trend by analyzing randomness about a constant mean through a comparison between IWQI and time rankings. If no trend exists, then the fluctuations between pairs of observations will not follow any discernible pattern across time. Tau summarizes the relationship between water quality rankings and time using concordant and discordant pairs (Stevenson, 2012). If the Mann-Kendall test suggests that a trend exists, the Theil-Sen trend line is then used to estimate the magnitude of the trend; it takes the median of the pairwise slopes as its slope value, but does not compensate for physical dependence (Butler, 2013).
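The two procedures can be sketched with the standard library alone. The code below is a minimal illustration, not the CRAN routines cited in the bibliography: it computes the Mann-Kendall S statistic (concordant minus discordant pairs), a two-sided p-value from the normal approximation with a tie-corrected variance, and the Theil-Sen slope as the median of pairwise slopes. The function names and the April ratings are hypothetical, and the normal approximation is only rough for short series.

```python
from itertools import combinations
from math import erfc, sqrt
from statistics import median


def mann_kendall(values):
    """Return (S, z, two-sided p) for a series ordered in time."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    # S adds +1 when a later value exceeds an earlier one and -1 when it is lower.
    s = sum((b > a) - (b < a) for a, b in combinations(values, 2))
    # Variance of S under the no-trend null, reduced for tied values.
    tie_counts = {}
    for v in values:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t * (t - 1) * (2 * t + 5) for t in tie_counts.values())
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18
    if var_s == 0:
        raise ValueError("all observations are identical")
    # Continuity correction: shrink |S| by one before standardizing.
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0
    p = erfc(abs(z) / sqrt(2))  # two-sided tail area of the standard normal
    return s, z, p


def theil_sen_slope(times, values):
    """Median of the slopes between every pair of observations."""
    slopes = [
        (y2 - y1) / (t2 - t1)
        for (t1, y1), (t2, y2) in combinations(zip(times, values), 2)
        if t2 != t1
    ]
    return median(slopes)


# Hypothetical April IWQI ratings for one well, 2001 through 2013.
years = list(range(2001, 2014))
april_iwqi = [48, 55, 52, 50, 57, 31, 35, 44, 53, 58, 60, 54, 59]

s, z, p = mann_kendall(april_iwqi)
slope = theil_sen_slope(years, april_iwqi)
print(f"S = {s}, z = {z:.2f}, p = {p:.4f}")
print(f"Theil-Sen slope: {slope:.3f} index points per year")
```

Because the slope is a median rather than a mean, a few flood-year ratings move it far less than they would move a least-squares line.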

Tests show that across our two data sets, trends are not common. There are signs that the Cedar and Skunk rivers are both trending positively, but the Theil-Sen confidence intervals, whose lower bounds include zero, show the trend may not have any magnitude. This analysis mirrors that found in “Murky Waters” and highlights the need for improvement, as the state's current water quality is fairly poor. We cannot use these tests with the April samples because those samples treat time as a categorical factor, not as a covariate.

Characteristic/Name    Monthly Site Populations    Annual Site Populations     April Populations
Trend Existence        Skunk p=0.0076              No significant existence    N/A
Trend Magnitude        Skunk trend=0.0323

Black Hawk County Analysis

Performing two separate analyses explores spatial variation and its significance for local populations. In accordance with our research concerning time series, we find both sets of data lack normality. Through our analysis of variance we find the Iowa stations do not share equal distributions, whereas the Black Hawk County stations do share equal distributions (without transformation), allowing us to perform a Kruskal-Wallis test. The test results are not significant, indicating the Black Hawk County wells are similar. April samples only obtain equal variance through a power transformation to the -3rd power. It is important that Black Hawk County take further steps to analyze local water as a group, to discern how chemical and spatial variations specifically impact local water quality, as indicated by the unequal variances found in our state-wide analysis.

We mold the data into the same groupings as the Iowa analysis:

Characteristics/Name    Monthly Site Populations^-6    Annual Site Populations^-3          April Populations
Column Names            e.g. Beaver, Wolf              Beaver, Wolf                        April_07, April_08
Row Names               e.g. Apr_07, May_07            April_07, April_08                  Beaver, Wolf
Description             12 observations, 13 years      1 observation per year, 13 years    6 observations per year, 13 years
Normality               P<2.2e-16                      P=9.265e-06                         5.492e-08
Equality of Variance    P=0.8694                       p=0.9919                            P=0.1517
Kruskal-Wallis          P=8.375e-12                    P=0.2415                            p=0.0002011

Purpose
  Monthly Site Populations^-6: Find differences in wells; analyze long-term trends
  Annual Site Populations^-3:  Compensate seasonality; find differences in wells
  April Populations:           Compensate seasonality; find differences in years

Multiple Comparisons
  Monthly Site Populations^-6: Beaver vs DownCedar; Wolf vs West, UpCedar; West vs BH, DownCedar; BH vs UpCedar; UpCedar vs DownCedar
  April Populations:           2001 vs 2002, 2010; 2006 vs 2002, 2010, 2011; 2008 vs 2002, 2003, 2010, 2011, 2013

Trend Existence
  Monthly Site Populations^-6: Beaver p=0.044142; Black Hawk p=0.093634
  Annual Site Populations^-3:  No significant trends
  April Populations:           N/A

Trend Magnitude
  Monthly Site Populations^-6: Beaver trend=0.061; Black Hawk trend=0.041

These tests highlight the benefit of approaching water quality data from different angles: we see no significant difference in centrality when comparing annual data, but very significant differences in centrality for monthly data. Without analyzing other annual data, it is hard to identify how time affects Black Hawk County wells; future research may hypothesize how different wells are treated over the course of the entire year, or how different wells react to events correlated with time. Dendrograms may help distinguish some of the similarities, treating different months and locations as different variables and using median measures, since water quality data exhibits large variability.

Similar to the Iowa analysis, we see existence of positive trends across two wells, but this is not especially comforting taking into consideration all trend lower-bounds include zero or negative numbers.

Conclusion

The validity and soundness of our assumptions (lacking stationarity, autocorrelation), and therefore the power of our tests, raise issues with our results. Nevertheless, the analysis offers useful perspective on water quality analysis in general, and on locally actionable progress. Identifying characteristics of water quality at a regional and local level reconfirms the importance of spatial variations. For instance, the time-series decomposition combined with the April analysis paints a picture of seasonality and of temporal variations such as floods. Separating the data along all twelve months (two-way ANOVA) is a simple procedure, and may be the next logical step for this research. Our goals for analysis are important to remember: as Hirsch believes, we need to move away from strict hypothesis testing and instead identify the “nature and magnitude of change”, and newer models are being developed (weighted regressions on time, discharge, and season, or WRTDS). Future research should also explore chemical compositions, as they provide additional avenues of insight. We also did not use control charts or prediction limits, which the EPA strongly suggests (2009); these techniques should certainly be considered in future analysis.

Ultimately, with the loss of statistical power and pollution heavily impacting water quality, I am inclined to believe a zero trend with a slight negative or positive relation is very concerning, with water inequality a potentially frightful situation.

Bibliography

Environmental

Prior, J. C. (2003). Iowa’s groundwater basics (1st ed.). Iowa City, Iowa: Iowa Dept. of Natural Resources.

Rockström, J., Steffen, W., Noone, K., Persson, Å., Chapin, F. S., Lambin, E. F., … Schellnhuber, H. J. (2009). A safe operating space for humanity. Nature, 461(7263), 472–475.

Skopec, M. (2010). Iowa floods: the “new normal.” Iowa Natural Heritage. Retrieved from http://www.inhf.org/pdfs/protect/pages8-10_inhf_fall_mag_pp1-16_final.pdf

Spiceland, J. D. (2011). Intermediate accounting (6th ed., combined ed.). New York: McGraw-Hill Irwin.

Water supply. (2008). New York: H.W. Wilson Co.

Water supply and pollution control. (2009) (8th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.

Statistical

Bronaugh, D., & Werner, A. (2013, September 19). Zhang + Yui-Pilon trends package. CRAN Repository.

Butler, K. (2013, February 26). “Assignment 10.” Statistics for the Life and Social Sciences. Course. Retrieved December 26, 2013, from http://www.utsc.utoronto.ca/~butler/d29/a10.html

Cox, C., Hug, A., & Pazdernik, K. (2012). Murky Waters: Farm Pollution Stalls Cleanup of Iowa streams. Environmental Working Group. Retrieved from static.ewg.org/reports/2012/murky_waters/Murky_Waters.pdf

Crichton, N. (2001). Kendall’s Tau. Journal of Clinical Nursing, (10). Retrieved from http://arizona.openrepository.com/arizona/handle/10150/194407

Environmental Protection Agency. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities - Unified Guidance (No. EPA 530/R-09-007).

Gross, J., & Ligges, M. U. (2012, 29). Nortest Package. CRAN Repository. Retrieved from http://cran.uvigo.es/web/packages/nortest/nortest.pdf

Harrison, J. (2013, May 27). The heat is on…. or is it? Trend Analysis of Toronto Climate Data. R-bloggers. Retrieved December 14, 2013, from http://www.r-bloggers.com/the-heat-is-on-or-is-it-trend-analysis-of-toronto-climate-data/

Helsel, D. R., & Hirsch, R. M. (2011). Statistical methods in water resources (Vol. 49). Elsevier.

Iowa Department of Natural Resources. (2013). STORET/WQX Iowa Water Quality Index. Retrieved from https://programs.iowadnr.gov/iastoret/

McLeod, A. I. (2011, 16). Kendall rank correlation and Mann-Kendall trend test. CRAN Repository. Retrieved from http://btr0x2.rz.uni-bayreuth.de/math/statlib/R/CRAN/doc/packages/Kendall.pdf

Mozejko, J. (2012). Detecting and Estimating Trends of Water Quality Parameters. InTech. Retrieved from http://cdn.intechopen.com/pdfs/35048/InTech-Detecting_and_estimating_trends_of_water_quality_parameters.pdf

State of Oregon Department of Environmental Quality. (n.d.). Trend Analysis and Presentation. Retrieved from www.deq.state.or.us/lab/wqm/docs/TrendAnalysisCD.pdf

Statistical Analysis for Monotonic Trends. (2011). National Nonpoint Source Monitoring Program, TechNotes(6). Retrieved from http://www.bae.ncsu.edu/bae/programs/extension/wqg/issues/notes135_monotonic_trends.pdf

Stevenson, W. (2012, September 5). Kendall-tau. Statistical-Research.com. Retrieved from http://statistical-research.com/wp-content/uploads/2012/09/kendall-tau1.pdf

Tian, J., & Fernandez, G. (2000). Seasonal trend analysis of monthly water quality data. University of Nevada, Reno. Retrieved from http://www.ag.unr.edu/gf/pdf/joyce.pdf

Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems: Data exploration. Methods in Ecology and Evolution, 1(1), 3–14. doi:10.1111/j.2041-210X.2009.00001.x