pollution source direction identification: embedding dispersion models to solve an inverse problem




Research Article Environmetrics

Received: 22 July 2010, Revised: 15 March 2011, Accepted: 9 May 2011, Published online in Wiley Online Library: 27 July 2011

(wileyonlinelibrary.com) DOI: 10.1002/env.1124

Pollution source direction identification: embedding dispersion models to solve an inverse problem

Basil Williams^a, William F. Christensen^b* and C. Shane Reese^c

We develop a Bayesian method for identifying pollution source directions that combines deterministic and stochastic models. We frame the source direction identification as an inverse problem, embedding the deterministic dispersion model American Meteorological Society/United States Environmental Protection Agency Regulatory Model (AERMOD) directly into the likelihood function. AERMOD’s fast computation time allows us to run the model at each iteration of the Markov chain Monte Carlo (MCMC), thereby creating a simulated likelihood function and obviating the need for an emulator. The method is flexible enough to identify multiple source directions for cases in which a species or source type of interest is emitted at more than one location, and reversible jump MCMC is used to evaluate the appropriate number of sources. Source direction identification is an important part of the pollution source apportionment problem, which entails identifying and describing pollution sources and their contributions. Copyright © 2011 John Wiley & Sons, Ltd.

Keywords: pollution source apportionment; deterministic model; computer model; wind direction; circular data; Bayesian hierarchical model

1. INTRODUCTION

On 22 July 2009, the US Environmental Protection Agency (EPA) announced that it would reconsider ambient monitoring requirements for lead, which has no known safe level in the body. Children are particularly vulnerable to the effects of lead, with demonstrated negative effects on cognition, IQ, and behavior. Part of the potential change in policy relates to increased monitoring of lead levels near large expected pollution sources and in population centers (US Environmental Protection Agency, 2009a). While point sources such as lead smelters would be natural sites for such monitoring, there is evidence that substantial amounts of lead may be emitted from less obvious locations including power plants, incinerators, recycling facilities, steel mills, and many other locations within an urban environment (US Environmental Protection Agency, 2009b). Thus, identification of the most prominent sources based on empirical data is an important component of complying with regulation and protecting human health.

Despite abundant evidence for negative health and environmental effects of air pollution, few studies have investigated formal statistical procedures for identifying the physical location of major pollution sources, which is surprising given the current position air pollution occupies in the scientific and political discussion. Some statistical methods of analyzing pollution sources have been developed in depth, such as pollution source apportionment (PSA), which attempts to apportion ambient particulate measurements taken at a receptor into identifiable sources, deriving from the data both the source profiles (the compositional “fingerprint” of each source) and their contributions (the impact of each source at the receptor). Reviews of approaches for PSA can be found in Christensen and Gunst (2004) and Hopke (1991). Most approaches, however, rarely employ anything more than heuristic, graphical techniques for describing the geographical location of prominent pollution sources.

In this paper, we develop new statistical methodology for the geographic identification of pollution sources. These methods can be of value for policy makers and analysts who create and evaluate environmental policy via PSA and for those who are tasked with monitoring the state of regulated pollutants such as lead. The method developed herein may also prove to be useful in that it presents a new way to incorporate deterministic computer models directly within a Bayesian hierarchical framework. Specifically, we develop an approach for utilizing the sophisticated scientific theory encompassed within a complex computer model in order to yield improved estimates for parameters of interest along with associated measures of uncertainty.

Our approach uses receptor measurements to determine the direction of the most prominent pollution sources near a receptor. The identification of source directions is important to environmental science for several reasons. Henry et al. (2002) cited the “identification of the causes of local toxic ‘hot spots’” as one use but also pointed out that “reconciliation of emission inventories to observed concentrations” also requires the identification of pollution source directions. See Henry et al. (1997) for a study that confirms significant misreporting of emissions. Perhaps the most important use, however, occurs during widely employed PSA analyses, in which identified source directions frequently occupy a central role in determining the identity of a source. For example, using PSA, one might identify a source profile believed to be a point source, but if the contributions seem to be emanating from a specific direction, the interpretation of the source (or the entire model) may need to be revised. Alternatively, the identification of a municipal incinerator is supported when large estimated contributions from the hypothesized incinerator occur on days where winds transport air from the known direction of an incinerator. Additionally, source direction identification can be an important tool for constructing prior distributions in a Bayesian PSA approach such as those considered by Lingwall et al. (2008) and Heaton et al. (2010). For examples of recent statistical developments in the modeling of wind-related phenomena generally, see Herring and Genton (2010) and Kestens and Teugels (2002).

* Correspondence to: William F. Christensen, Department of Statistics, Brigham Young University, Provo, UT, U.S.A. E-mail: williamstat.byu.edu

a Department of Finance, Duke University, Durham, NC, U.S.A.

b Department of Statistics, Brigham Young University, Provo, UT, U.S.A.

c Department of Statistics, Brigham Young University, Provo, UT, U.S.A.

Environmetrics 2011; 22: 962–974 Copyright © 2011 John Wiley & Sons, Ltd.

The complicated nature of atmospheric dispersion physics has been an obstacle to the development of sound statistical procedures for identifying source directions. Most scientific modeling of pollution dispersion involves numerically solving numerous partial differential equations simultaneously, an approach that does not lend itself well to standard statistical analysis. Direct inference on even ordinary differential equation parameters is a fledgling field in statistics, pioneered by such papers as Ramsay et al. (2007). Also see Tarantola et al. (2005) for Bayesian nonlinear inverse problem techniques. Such difficulties have led PSA practitioners in the past to incorporate highly simplified atmospheric physics into their models, such as Park et al. (2005), who use the Gaussian dispersion equations to sharpen their estimates of source emission rates and contributions, taking source directions as known. The computer experiments literature, however, such as Kennedy and O’Hagan (2001) and Higdon et al. (2004), is replete with cases where the difficulties of performing inference on complex dynamic systems are overcome through the use of dynamic system computer models, in which the dynamics of the computer model itself are treated as a black box in the statistical model. This is the approach taken in this paper. We use the output of the American Meteorological Society/US EPA Regulatory Model (AERMOD) (US Environmental Protection Agency, 2004a), a pollution dispersion model developed by the US EPA, to help build a simulated likelihood of the observed pollution concentrations. This likelihood is, in turn, used to solve the inverse problem of pollution source direction determination. We also compare the estimates obtained from AERMOD to a simpler, purely statistical approximation of dispersion.

We consider here a new approach for performing inference on computer models, in which we use the output of AERMOD directly in the likelihood function. In the field of computer experiments, the computational intensity of the deterministic models typically prohibits direct use of computer output in the statistical model. For example, many global climate models require months to run for a single set of parameter values. As a result, the values at which to run the climate model must be chosen carefully in order to adequately represent the behavior of the computer model throughout the parameter space. After several runs have been completed, the statistician builds a statistical emulator of the output in order to interpolate output for parameter values at which the model was not run. This emulator is then used in place of the actual computer model when the parameters are ultimately estimated. In this paper, however, we use the output directly, shedding the emulation step of the modeling process. AERMOD was designed to run on a simple desktop computer, on which it can simulate 2 years of hourly receptor concentrations in 1.6 s. Consequently, throughout the Markov chain Monte Carlo (MCMC), we directly compute AERMOD’s estimated receptor concentrations for any proposed source location, rather than simply interpolating between several previous runs of AERMOD for previously decided source locations. Therefore, in our proposed approach, AERMOD is run at every iteration of the MCMC, and its output is used directly in computing the likelihood of the observed concentrations, thus forming a simulated likelihood. In addition to providing a formal statistical framework for performing inference on the inverse problem of estimating source directions, the methods in this paper also incorporate current scientific knowledge of dispersion phenomenology by including a computational dispersion model directly as part of the statistical model. In this way, this article represents a new approach for integrating the work of both the dispersion and the receptor modeling communities via a Bayesian model.

Atmospheric physics have been incorporated into Bayesian hierarchical models in such papers as Wikle et al. (2001) and Fuentes and Raftery (2005). In Wikle et al., the objective is to interpolate wind data so that the forcing values used in the more complex numerical model have a finer resolution. They then interpolate by embedding a physical model that offers an impressive degree of sophistication in their estimates but is much simpler than the numerical model that will ultimately receive their interpolated estimates. In our case, however, the development of a simpler physical model is unnecessary because the full numerical model (AERMOD) computes estimates fast enough to be used directly as the process prior. In Fuentes and Raftery, on the other hand, the goal is to interpolate pollution concentration by combining measurements and numerical model output, as well as to validate the numerical model and estimate bias. Rather than use the model output as the process prior, they treat it as data, so model output is fixed in the statistical estimation, and model parameters are not estimated. In contrast, our numerical model (AERMOD) is the data-generating process (contained in the likelihood), so we recompute AERMOD estimates at each iteration of the MCMC in order to estimate AERMOD’s input parameter of source direction.

The data used to illustrate our approach consist of daily PM2.5 concentration measurements from the St. Louis–Midwest Supersite. The St. Louis airshed has been characterized with the use of source apportionment studies for over 30 years, beginning with the early work of Alpert and Hopke (1981), Liu et al. (1982), and Spengler and Thurston (1983). St. Louis particulate matter has received attention as a component of several epidemiological studies (e.g., Laden et al., 2000). More recently, advances in source apportionment methodology have led to new insights about St. Louis PM sources, with contributions by Lee and Hopke (2006), Lee et al. (2006), Lingwall and Christensen (2007), Hwang et al. (2008), Wang et al. (2009), and Heaton et al. (2010). The species we consider were measured between May 2001 and May 2003 and are a collection of metals typically associated with point sources. Details about data collection and chemical analyses can be found in Lee et al. (2006). In this paper, we use only elements typically associated with point (as opposed to area) sources of pollution. Specifically, we are interested in using Mn and Fe to identify a steel production facility located roughly 10 km to the north (10° east of due north), Cu and Zn to identify copper and zinc smelters located adjacent to one another, roughly 2 km to the SSW (210°), and Pb to identify a lead smelter located roughly 40 km to the SSW (205°).


Of course, researchers may well be interested in source characteristics in addition to direction. For example, source distance from the receptor would be important to identify, particularly if several prominent sources lie in the same direction from the receptor. We do not address this question in our paper, as our single-receptor data set does not allow us to jointly identify additional source characteristics such as distance, stack height, and emission rate. Data from several receptors at different locations would enable triangulation of distance and other parameters, but we leave such work to future research.

Section 2 presents new approaches for source direction identification based on Bayesian regression models. Section 3 motivates our choice of priors, and Section 4 presents results of our analysis of several metal species from the St. Louis Supersite. Section 5 presents conclusions and recommendations for implementation.

2. PARAMETRIC MODELS FOR LINKING CONCENTRATIONS WITH WIND

The most common methods of identifying the locations of point sources are graphical. They include the conditional probability function (Kim et al., 2003) and the potential source contribution function (Ashbaugh et al., 1985). These approaches are useful for visualizing source locations or directions, but they do not allow for statistical inference on the direction of the source. Another approach for source location identification based on non-parametric regression was developed by Henry et al. (2002). Although the authors suggest a method for obtaining “a rough estimate of the uncertainty in the peak location,” the approach still lacks formal inference such as confidence or credible intervals in the usual sense and can yield unsuitably wide intervals in the presence of small to moderate sample sizes.

Our goal here is to perform inference on parameters representing source direction, and we propose two statistical models to accomplish this. The first involves the use of a function based on the von Mises pdf, which relates wind direction with concentration. The second model replaces the von Mises-based function with the deterministic dispersion model AERMOD. AERMOD is treated as a black box, and it models the relationship between many meteorological variables, source parameters (including direction), and element concentration.

2.1. Von Mises-based regression model

Standard models for regressing a linear variable such as concentration on an angular variable such as wind direction typically involve mapping the angular variable to an ordinal space, such as through a sinusoid. For example, Mardia and Jupp (2000) proposed the following model for regressing a linear variable $X$ on an angular variable $\theta$:

$$X \mid \theta \sim N(\beta_0 + \alpha_1 \cos\theta + \alpha_2 \sin\theta, \sigma^2)$$

This can be rearranged using trigonometric identities into the more easily interpretable

$$X \mid \theta \sim N(\beta_0 + \beta_1 \cos(\theta - \mu), \sigma^2)$$

where $\beta_1 = \sqrt{\alpha_1^2 + \alpha_2^2}$ and $\mu = -\arctan(\alpha_1/\alpha_2) + \pi/2$, and $\mu$ represents the circular ordinate corresponding to the peak predicted value (the pollution source, in our case). However, inspection of model fit reveals that $\cos(\theta - \mu)$ is too smooth to capture the abrupt rise in element concentration when wind direction approaches the source direction. Pollution concentration spikes considerably when wind blows directly from the source to the receptor, even when the data are log-transformed, and the function above does not model this behavior effectively.
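The amplitude-phase rearrangement can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `to_amplitude_phase` and the example coefficients are hypothetical. It uses `arctan2` rather than a plain arctangent so that the peak direction lands in the correct quadrant for any signs of the two sinusoid coefficients.

```python
import numpy as np

def to_amplitude_phase(alpha1, alpha2):
    """Return (beta1, mu) such that
    alpha1*cos(t) + alpha2*sin(t) == beta1*cos(t - mu) for all t."""
    beta1 = np.hypot(alpha1, alpha2)               # sqrt(alpha1^2 + alpha2^2)
    mu = np.arctan2(alpha2, alpha1) % (2 * np.pi)  # peak direction in [0, 2*pi)
    return beta1, mu

# Illustrative coefficients (not estimates from any data set)
alpha1, alpha2 = -0.8, 1.5
beta1, mu = to_amplitude_phase(alpha1, alpha2)

theta = np.linspace(0.0, 2.0 * np.pi, 361)         # 1-degree grid of wind directions
lhs = alpha1 * np.cos(theta) + alpha2 * np.sin(theta)
rhs = beta1 * np.cos(theta - mu)
```

The two curves `lhs` and `rhs` coincide, and the grid point where `lhs` peaks sits at the direction `mu`, which is the quantity interpreted as the source direction.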

To correct this problem, we propose a function with circular support that allows a more peaked rise near the source direction. The von Mises pdf

$$f(\theta \mid \mu, \kappa) = \frac{e^{\kappa \cos(\theta - \mu)}}{2\pi I_0(\kappa)},$$

where $I_0(\kappa)$ is the modified Bessel function of order 0, has the desired shape; higher values of $\kappa$ create a more peaked function. Because we use the function to fit a curve instead of a distribution, we use only the kernel of the von Mises pdf, shifting by its minimum and scaling by its range to create a function with range [0,1]:

$$Z(\theta, \mu, \kappa) = \frac{e^{\kappa \cos(\theta - \mu)} - e^{-\kappa}}{e^{\kappa} - e^{-\kappa}} \qquad (1)$$

This function achieves its maximum of 1 when $\theta$ and $\mu$ are equal, that is, when wind direction is equal to source direction. As $\kappa$ goes to 0, $Z(\theta, \mu, \kappa)$ goes to $[(1/2)\cos(\theta - \mu) + (1/2)]$, so $Z(\theta, \mu, \kappa)$ allows for both the smooth sinusoid in standard models and the more peaked curves typical of pollution data. Wind speed was also found to have significant explanatory power, so we include a term for wind speed in the model, yielding

$$y_i \mid \theta_i, s_i \sim \mathrm{LN}(\beta_0 + \beta_1 Z(\theta_i, \mu, \kappa) + \beta_3 s_i, \sigma^2)$$

where $y_i$, $\theta_i$, and $s_i$ are pollutant concentration, wind direction, and wind speed on day $i$, respectively, $\beta_0$ is the mean “background” log-concentration, $\beta_1$ is the mean log-concentration attributable to the source when wind blows directly from the source to the receptor, $\beta_3$ is the effect of wind speed on log-concentration, $\sigma^2$ is the variance of the residuals of the log-concentration, $\kappa$ is the degree of peakedness in the curve, and $\mu$ is the source direction, our primary interest. We leave $\beta_2$ as the coefficient for an additional source, explained in Section 2.3.
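The scaled von Mises curve of Equation (1) and the resulting log-normal likelihood can be sketched in a few lines. This is a minimal illustration, assuming NumPy, angles in radians, and a strictly positive peakedness parameter; the function names `z_curve` and `vm_loglik` are hypothetical. The curve is computed after dividing numerator and denominator by the exponential of the peakedness parameter, an algebraically equivalent form chosen here only to avoid overflow for very peaked curves.

```python
import numpy as np

def z_curve(theta, mu, kappa):
    """Scaled von Mises kernel with range [0, 1]; requires kappa > 0.
    Equivalent to (exp(k*cos(theta-mu)) - exp(-k)) / (exp(k) - exp(-k))."""
    num = np.exp(kappa * (np.cos(theta - mu) - 1.0)) - np.exp(-2.0 * kappa)
    den = -np.expm1(-2.0 * kappa)                  # 1 - exp(-2*kappa)
    return num / den

def vm_loglik(y, theta, s, beta0, beta1, beta3, mu, kappa, sigma2):
    """Log-likelihood of concentrations y under the one-source log-normal model."""
    log_mean = beta0 + beta1 * z_curve(theta, mu, kappa) + beta3 * s
    log_y = np.log(y)
    resid = log_y - log_mean
    # Log-normal log-density: -log(y) - 0.5*log(2*pi*sigma2) - resid^2 / (2*sigma2)
    return np.sum(-log_y - 0.5 * np.log(2.0 * np.pi * sigma2)
                  - resid ** 2 / (2.0 * sigma2))

# Small peakedness recovers the smooth sinusoid 0.5*cos(theta - mu) + 0.5
grid = np.linspace(0.0, 2.0 * np.pi, 181)
near_sinusoid = z_curve(grid, 1.0, 1e-6)
```

Evaluating `z_curve` on a grid for increasing peakedness values shows the transition from the sinusoid to a sharp spike at the source direction.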


2.2. AERMOD-based regression model

The Z function in Equation (1) is a statistical approximation for complex atmospheric dispersion physics. Much of the dynamics of dispersion, however, can be modeled using a computational dispersion model. The EPA officially endorses and distributes a dispersion model known as AERMOD (US Environmental Protection Agency, 2004a), used to forecast the pollution impact of factories proposed for construction. AERMOD processes user specifications about the pollution source (such as stack height, temperature, emission rate, stack diameter, and source location) and meteorological data formatted by its preprocessor, AERMET (US Environmental Protection Agency, 2004a), to generate pollution concentration estimates at locations in the area surrounding the pollution source. AERMOD occupies 964 kB of memory and can simulate 2 years of hourly receptor concentrations in 1.6 s.

In our second model, we use the output of AERMOD, scaling it by the maximum estimated value in order to map the output to [0,1]. We denote the scaled output by $A(x_i, \mu)$ and substitute it for the approximating Z function, with an important distinction. The Z function was found to nicely describe the log-concentrations, but AERMOD is designed to simulate concentrations directly, not log-concentrations. However, because of the heavily right-skewed nature of the data, the log-normal distribution is still appropriate for describing the residuals. That is, we find the log of AERMOD plus background to be a good model of the log-concentrations. Consequently, we propose the model

$$y_i \mid x_i \sim \mathrm{LN}(\log(\alpha_0 + \alpha_1 A(x_i, \mu)), \sigma^2) \qquad (2)$$

where $x_i$ is the meteorological data for day $i$ and $\mu$ again represents the direction of the source. The parameter $\alpha_0$ is the median background concentration (not log-concentration, as with $\beta_0$), and $\alpha_0 + \alpha_1$ is the median concentration when meteorological conditions are such that the source contributes the 2-year maximum amount to the receptor. As in the von Mises-based model, $\sigma^2$ is the variance of the residuals of the log-concentration.
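The likelihood in Equation (2) can be sketched once a vector of simulated concentrations is available. The example below is a minimal illustration, assuming NumPy and assuming one simulated value per observation day; the array `raw` is a made-up stand-in for dispersion-model output (it is not produced by AERMOD), and the names `scale_to_unit` and `dispersion_loglik` are hypothetical.

```python
import numpy as np

def scale_to_unit(raw_conc):
    """Divide simulated concentrations by their maximum so values lie in [0, 1]."""
    raw_conc = np.asarray(raw_conc, dtype=float)
    return raw_conc / raw_conc.max()

def dispersion_loglik(y, raw_conc, alpha0, alpha1, sigma2):
    """Log-normal log-likelihood with median alpha0 + alpha1 * A on the original scale."""
    a_scaled = scale_to_unit(raw_conc)
    log_median = np.log(alpha0 + alpha1 * a_scaled)
    log_y = np.log(y)
    resid = log_y - log_median
    return np.sum(-log_y - 0.5 * np.log(2.0 * np.pi * sigma2)
                  - resid ** 2 / (2.0 * sigma2))

# Made-up numbers for illustration only
raw = np.array([0.0, 2.0, 8.0, 4.0])   # stand-in simulated concentrations
y = np.array([1.1, 1.9, 6.2, 3.4])     # stand-in observed concentrations
ll = dispersion_loglik(y, raw, alpha0=1.0, alpha1=5.0, sigma2=0.25)
ll_rescaled = dispersion_loglik(y, 100.0 * raw, alpha0=1.0, alpha1=5.0, sigma2=0.25)
```

Because the simulated output is divided by its own maximum, multiplying it by any positive constant leaves the likelihood unchanged, so `ll` and `ll_rescaled` agree.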

In addition to $\mu$ and $x$, AERMOD requires many inputs to describe chemical transport, such as source distance, emission rate, stack height, exit temperature, exit velocity, and stack diameter. We do not estimate these parameters but instead integrate over reasonable distributions for them, which are described below. That is, we draw random values for these parameters from phenomenologically justifiable distributions throughout the MCMC, without attempting to use the data to estimate their posterior distribution. This approach requires AERMOD to be run at each iteration of the estimation algorithm, and AERMOD’s fast computation time makes this possible.
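The idea of re-running the simulator at every iteration while redrawing nuisance inputs can be sketched with a toy example. Everything here is an assumption made for illustration: `toy_dispersion` is a hypothetical stand-in (it is not AERMOD), the scale parameters are held fixed, the direction has a flat prior, and the same nuisance draw is used for the current and proposed direction within an iteration. It is a sketch of the general scheme, not the authors' implementation.

```python
import numpy as np

def toy_dispersion(wind_dir, mu, distance_km):
    """Hypothetical stand-in for one dispersion-model run (this is NOT AERMOD)."""
    width = 0.2 + 1.0 / distance_km               # illustrative plume width (radians)
    d = np.angle(np.exp(1j * (wind_dir - mu)))    # signed angular difference in (-pi, pi]
    return np.exp(-0.5 * (d / width) ** 2)

def log_post(y, wind_dir, mu, distance_km, alpha0=1.0, alpha1=5.0, sigma2=0.25):
    """Log-posterior of the direction up to a constant (flat prior on mu)."""
    raw = toy_dispersion(wind_dir, mu, distance_km)
    a_scaled = raw / raw.max()                    # scale simulated output to [0, 1]
    resid = np.log(y) - np.log(alpha0 + alpha1 * a_scaled)
    return -0.5 * np.sum(resid ** 2) / sigma2

def sample_direction(y, wind_dir, n_iter, rng, init, step=0.3):
    """Random-walk Metropolis on the circle; the simulator runs at every iteration."""
    mu = init % (2.0 * np.pi)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        # Fresh nuisance draw each iteration (gamma with shape 2.20, rate 0.14)
        distance_km = rng.gamma(shape=2.20, scale=1.0 / 0.14)
        prop = (mu + rng.normal(0.0, step)) % (2.0 * np.pi)
        log_r = (log_post(y, wind_dir, prop, distance_km)
                 - log_post(y, wind_dir, mu, distance_km))
        if np.log(rng.uniform()) < log_r:         # Metropolis accept/reject
            mu = prop
        draws[t] = mu
    return draws
```

With a simulator that takes seconds rather than months per run, a loop of this shape is feasible without building an emulator, which is the design choice the text describes.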

The reasons we avoid estimating these additional parameters are twofold. First, the original motivation for this paper is to use concentration data to identify, not necessarily describe fully, pollution sources. That is, we may know several smelters emit lead, but knowing the direction of the chief lead emitter enables us to identify which smelter is primarily responsible for the lead observed at the receptor, whereas identifying exit velocity and emission rate would not be nearly as useful in distinguishing between several potential sources. Second, we found through trial runs of AERMOD that those stack parameters were unable to be identified given our data. For example, both distance and emission rate mainly have a scaling effect on the pollution concentrations. The only way to identify both distance and emission rate would be to make use of measurements from an additional receptor site to triangulate the location, which is beyond the focus of this paper. In addition, identification of these parameters would have little if any impact on the estimated direction, our primary concern.

The emission rate specified for AERMOD directly scales the estimated concentration at the modeled site. Thus, because we scale AERMOD’s estimates after each run to have a maximum of one, randomly varying emission rate has no effect on our method. Therefore, we fix emission rate at 100 g s⁻¹. The remaining stack parameter inputs to AERMOD are randomly varied according to gamma distributions, as they all are positive valued, with shape and rate given in Table 1. The distribution for distance was elicited from modeling experts who indicate that AERMOD’s estimates are less reliable beyond 50 km. AERMOD’s estimates are most reliable when the distance is less than 10 km, but it is routinely used for up to 50 km for air quality permit analyses. The distributions of the remaining parameters are designed to approximate default values specified in the Stack Parameter Defaults file on the Clearinghouse for Inventories and Emission Factors (CHIEF) website of the US EPA. The default values were derived from the analyses of point source data reported to the EPA by state agencies; details of the derivation can be found in the paper of the US Environmental Protection Agency (2006).

Table 1. Gamma parameters (a = shape; b = rate) used in generating random stack inputs to AERMOD

Input                      a        b
Distance (km)              2.20     0.14
Stack height (m)           5.04     0.33
Stack temperature (K)      56.18    0.15
Exit velocity (m s⁻¹)      13.26    1.47
Stack diameter (m)         4.02     4.46
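Drawing one set of stack inputs from the gamma distributions in Table 1 can be sketched as follows. This is a minimal example assuming NumPy; the dictionary keys and function names are hypothetical. NumPy parameterizes the gamma distribution by shape and scale, so each rate b from the table is converted to a scale of 1/b.

```python
import numpy as np

# (shape a, rate b) pairs copied from Table 1; key names are illustrative
STACK_GAMMA = {
    "distance_km": (2.20, 0.14),
    "stack_height_m": (5.04, 0.33),
    "stack_temperature_K": (56.18, 0.15),
    "exit_velocity_m_per_s": (13.26, 1.47),
    "stack_diameter_m": (4.02, 4.46),
}

def draw_stack_inputs(rng):
    """Draw one random value per stack input; NumPy's scale equals 1 / rate."""
    return {name: rng.gamma(shape=a, scale=1.0 / b)
            for name, (a, b) in STACK_GAMMA.items()}

def prior_mean(name):
    """Mean of a gamma distribution with shape a and rate b is a / b."""
    a, b = STACK_GAMMA[name]
    return a / b

rng = np.random.default_rng(0)
one_draw = draw_stack_inputs(rng)
```

Under this parameterization the implied prior means are roughly 15.7 km for distance, 15.3 m for stack height, 375 K for stack temperature, 9.0 m s⁻¹ for exit velocity, and 0.9 m for stack diameter.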


2.3. Multiple sources

We also specify a model to represent more than one source simultaneously. In this case, an additional term involving another $\mu$ is included in the model, as follows:

$$y_i \mid \theta_i, s_i \sim \mathrm{LN}(\beta_0 + \beta_1 Z(\theta_i, \mu_1, \kappa_1) + \beta_2 Z(\theta_i, \mu_2, \kappa_2) + \beta_3 s_i, \sigma^2)$$

and

$$y_i \mid x_i \sim \mathrm{LN}(\log(\alpha_0 + \alpha_1 A(x_i, \mu_1) + \alpha_2 A(x_i, \mu_2)), \sigma^2)$$

When the model is formulated this way, we can perform inference on two sources that have been estimated simultaneously.

2.4. Reversible jump: choosing the number of sources

Although the model in the preceding section allows us to model multiple sources, it requires us to specify in advance the number of sources to include in the model. We would like to be able to simulate the posterior distribution of the model space (consisting of several models with different numbers of sources), not simply the posterior distribution of the parameters within a fixed model. We simulate from the model posterior distribution using reversible jump Markov chain Monte Carlo (RJMCMC; see Green, 1995), applying the algorithm to the one-source and two-source von Mises-based models, and then to the one-source and two-source AERMOD-based models.

To implement RJMCMC in this context, at each iteration of the algorithm we propose to jump models with probability 1/2. If a model jump is not proposed, Metropolis–Hastings is used to draw from the posterior distribution of the parameters within the current model. If a model jump is proposed, we generate proposal values for parameters of the proposed model and then accept the model jump with probability min(r, 1), where r is the RJMCMC acceptance ratio given below.
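The control flow of one such iteration can be sketched as below. This is a minimal skeleton assuming NumPy; `rjmcmc_step` and its two callback arguments are hypothetical names, and the callbacks (the within-model update and the jump proposal with its log acceptance ratio) are left to the caller.

```python
import numpy as np

def rjmcmc_step(model, params, rng, within_update, propose_jump):
    """One reversible jump iteration.

    within_update(model, params, rng) -> new params for the same model
    propose_jump(model, params, rng)  -> (new_model, new_params, log_r),
        where log_r is the log of the acceptance ratio r
    """
    if rng.uniform() < 0.5:                          # propose a model jump w.p. 1/2
        new_model, new_params, log_r = propose_jump(model, params, rng)
        if np.log(rng.uniform()) < min(log_r, 0.0):  # accept w.p. min(r, 1)
            return new_model, new_params
        return model, params                         # jump rejected: state unchanged
    # No jump proposed: ordinary within-model update (e.g. Metropolis-Hastings)
    return model, within_update(model, params, rng)
```

Working with the log of the ratio avoids numerical underflow when likelihoods are very small; the fraction of iterations spent in each model then estimates the posterior model probabilities.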

When proposing to change models within an iteration of RJMCMC, the algorithm requires proposal values for the parameters of the model we propose to jump to. This is accomplished by generating an augmenting random variable $u$ from some proposal density $J$. The proposed parameter values are typically some deterministic function $g$ of the random variable $u$. Because only a small portion of the proposed model’s parameter space contains a relatively high posterior density, proposing reasonable values for the proposed model’s parameters can be difficult. If parameter proposals are chosen poorly, mixing across the model space can be quite poor, so $J(u)$ must be chosen wisely, ideally as an approximation to the posterior distribution of the proposed model’s parameters.

We choose the density $J(u)$ so that its support equals the parameter space of the model we are jumping to; in our application, for all parameters except $\theta$, the proposals are drawn from a truncated multivariate normal distribution, and for $\theta$, the proposals are drawn from von Mises distributions. The proposal density $J(u)$ is therefore the product of the truncated multivariate normal density function and the von Mises density function. In order to draw from a specific distribution, however, we must specify means and variances for the truncated multivariate normal and von Mises proposal distributions. We determine these means and variances by performing pilot MCMC runs in the one-source and two-source models and then computing the empirical means and variances of the pilot posterior draws. Then, we run the RJMCMC algorithm, using those means and variances in the proposal distribution $J(u)$. Our notation and method here follow a common approach to obtaining efficient proposals in RJMCMC. See, for example, Gelman et al. (2003).

The acceptance ratio for jumping from the one-source model to the two-source model is

$$
r = \frac{p_2(y \mid u_{1,2})\, p(u_{1,2})}{p_1(y \mid \boldsymbol{\theta}_1)\, p(\boldsymbol{\theta}_1)} \times \frac{J_{2,1}(\boldsymbol{\theta}_1)}{J_{1,2}(u_{1,2})}
$$

where $\boldsymbol{\theta}_a$ represents the current parameter values of the $a$-source model, $u_{a,b}$ represents the proposed parameter values for the $b$-source model, generated when jumping from the $a$-source model to the $b$-source model, $p_a(y \mid \boldsymbol{\theta})$ is the likelihood of observing $y$ under the $a$-source model with parameter vector $\boldsymbol{\theta}$, and $J_{a,b}$ is the density of the distribution used to generate proposed parameter values for the $b$-source model when jumping from the $a$-source model to the $b$-source model. Conversely, the acceptance ratio for jumping from the two-source model to the one-source model is

$$
r = \frac{p_1(y \mid u_{2,1})\, p(u_{2,1})}{p_2(y \mid \boldsymbol{\theta}_2)\, p(\boldsymbol{\theta}_2)} \times \frac{J_{1,2}(\boldsymbol{\theta}_2)}{J_{2,1}(u_{2,1})}
$$

In our implementation, the auxiliary variable $u_{a,b}$ is used directly for the proposed parameter values, with no transforming function $g$, so the Jacobian in the acceptance ratio $r$ is simply 1. Also, the prior probabilities for the one-source and two-source models are set to be equal, so they cancel in the acceptance ratio. The same is true for the probability of proposing to jump between competing models.
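The jump mechanics described in this section can be sketched on a toy problem. The following is a minimal illustration, not the implementation used in the paper: the two competing models are an intercept-only regression and an intercept-plus-slope regression (standing in for the one-source and two-source models), the proposal densities are independent normals (standing in for the truncated multivariate normal and von Mises proposals), and the pilot means and standard deviations are hard-coded hypothetical values rather than results of real pilot runs. All function and variable names are assumptions made for the example.

```python
import math
import random

def log_norm(x, mean, sd):
    """Log density of a normal distribution."""
    return -math.log(sd * math.sqrt(2.0 * math.pi)) - 0.5 * ((x - mean) / sd) ** 2

def log_post(k, params, x, y):
    """Unnormalized log posterior of toy model k (1: intercept only, 2: intercept + slope)."""
    a = params[0]
    b = params[1] if k == 2 else 0.0
    log_lik = sum(log_norm(yi, a + b * xi, 1.0) for xi, yi in zip(x, y))
    log_prior = sum(log_norm(p, 0.0, 10.0) for p in params)
    return log_lik + log_prior

def draw_J(k, pilot, rng):
    """Draw proposed parameters for model k from its pilot-based proposal density."""
    return [rng.gauss(m, s) for m, s in pilot[k]]

def log_J(k, params, pilot):
    """Log proposal density for model k evaluated at params."""
    return sum(log_norm(p, m, s) for p, (m, s) in zip(params, pilot[k]))

def rjmcmc(x, y, pilot, n_iter, rng):
    k = 1
    params = [m for m, _ in pilot[k]]
    visits = {1: 0, 2: 0}
    for _ in range(n_iter):
        if rng.random() < 0.5:
            # Propose a model jump; u is used directly, so the Jacobian is 1.
            k_new = 3 - k
            u = draw_J(k_new, pilot, rng)
            log_r = (log_post(k_new, u, x, y) - log_post(k, params, x, y)
                     + log_J(k, params, pilot) - log_J(k_new, u, pilot))
            if rng.random() < math.exp(min(0.0, log_r)):
                k, params = k_new, u
        else:
            # Random-walk Metropolis-Hastings step within the current model.
            prop = [p + rng.gauss(0.0, 0.1) for p in params]
            log_r = log_post(k, prop, x, y) - log_post(k, params, x, y)
            if rng.random() < math.exp(min(0.0, log_r)):
                params = prop
        visits[k] += 1
    # Fraction of iterations spent in each model estimates its posterior probability.
    return {m: c / n_iter for m, c in visits.items()}

# Toy data with a real slope, so the two-parameter model should dominate.
rng = random.Random(0)
x = [float(i) for i in range(20)]
y = [1.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
# Hypothetical pilot summaries (mean, sd) for each model's parameters.
pilot = {1: [(sum(y) / len(y), 0.25)], 2: [(1.0, 0.5), (0.5, 0.05)]}
probs = rjmcmc(x, y, pilot, n_iter=2000, rng=rng)
print(probs)
```

Because the model priors and jump probabilities are equal in both directions, they do not appear in `log_r`, mirroring the cancellation noted above.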

3. PRIORS

We now specify prior distributions for each of the parameters in our models. We choose the distributional form and hyperparameters by consulting atmospheric dispersion texts, running AERMOD simulations, inspecting the $Z$ function and Gaussian dispersion equation (Turner, 1994), and consulting an alternative data set gathered using different instruments during the same period. These activities lead us to choose the following prior distributions, with hyperparameter values displayed in Table 2:

$$
\begin{aligned}
\theta_{1,2} &\sim \text{von Mises}(m_{\theta_{1,2}}, k_{\theta_{1,2}}) \\
\sigma^2 &\sim \text{Gamma}(a_{\sigma^2}, b_{\sigma^2}) \\
\alpha_0 &\sim \text{Gamma}(a_{\alpha_0}, b_{\alpha_0}) \\
\alpha_{1,2} &\sim \text{Gamma}(a_{\alpha_{1,2}}, b_{\alpha_{1,2}}) \\
\beta_{1,2} &\sim \text{Gamma}(a_{\beta_{1,2}}, b_{\beta_{1,2}}) \\
\kappa_{1,2} &\sim \text{Gamma}(a_{\kappa_{1,2}}, b_{\kappa_{1,2}}) \\
\beta_0 &\sim N(m_{\beta_0}, s_{\beta_0}) \\
\beta_3 &\sim N(m_{\beta_3}, s_{\beta_3})
\end{aligned}
$$

In order to mimic a purely exploratory setting, we assume no prior information about $\theta$, a parameter with circular support, so we use a von Mises distribution with precision parameter equal to 0. Setting the precision parameter equal to 0 specifies a completely flat von Mises distribution. This is not an improper prior, however, because the von Mises distribution has finite support, $[0, 2\pi)$. Essentially then, the prior on $\theta$ is a uniform distribution $U(0, 2\pi)$ but continuously connected between $2\pi$ and 0.
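The flatness of the von Mises prior at zero precision can be checked numerically. The sketch below is illustrative only; it uses the standard von Mises density $\exp\{k\cos(\theta - m)\} / \{2\pi I_0(k)\}$, and the helper names `bessel_i0` and `von_mises_pdf` are introduced here for the example.

```python
import math

def bessel_i0(k, terms=60):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total, term = 1.0, 1.0
    for n in range(1, terms):
        term *= (k / 2.0) ** 2 / (n * n)
        total += term
    return total

def von_mises_pdf(theta, mean, k):
    """von Mises density on [0, 2*pi) with mean direction `mean` and precision `k`."""
    return math.exp(k * math.cos(theta - mean)) / (2.0 * math.pi * bessel_i0(k))

# With precision 0 the density equals 1 / (2*pi) at every angle,
# regardless of the mean direction (3.14 in Table 2).
flat_values = [von_mises_pdf(t, 3.14, 0.0) for t in (0.0, 1.0, 3.0, 6.0)]
print(flat_values, 1.0 / (2.0 * math.pi))
```

With `k = 0` the mean direction has no effect on the density, which is why the value of $m$ listed for $\theta$ in Table 2 is immaterial.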

The choice of the gamma prior for $\kappa$ is based on inspection of the effect of $\kappa$ on the $Z$ function. As $\kappa$ approaches 0, the $Z$ function approaches a simple sinusoid with period $2\pi$, a curve not peaked enough to capture the sharp rise in concentration near the direction of the source. Negative values of $\kappa$ create an even less peaked, more bulbous, wider rise in the curve, implying that relatively high concentration values can be expected when the wind direction is completely misaligned with the source direction, an idea ill supported by atmospheric physics. So although the $Z$ function is defined for negative values of $\kappa$, the function is only phenomenologically justifiable for positive values of $\kappa$ not close to 0. We use a gamma distribution for $\kappa$ in order to restrict it to the positive real line. The hyperparameters for the prior distribution on $\kappa$ are selected by running several AERMOD simulations, as AERMOD is a reasonable characterization of expert knowledge on atmospheric dispersion. We fix the receptor location and run AERMOD for several different source locations, randomly varying the source's direction, distance, emission rate, height, temperature, exit velocity, and diameter, according to the distributions described in Table 1. This process yields an estimated time series of pollutant concentrations at the receptor for each unique source location. The collection of these time series is used to inform a center for the prior distribution on $\kappa$.

The priors for the $\alpha$s, $\beta$s, and $\sigma^2$ are constructed using statistics from an alternative data set collected at the same St. Louis location during the same period. Unlike the 2-year daily data used for our analysis, this alternative data set consists of hourly data gathered using different instruments from those of our main data set, during several different focus weeks scattered throughout the 2-year study period. Statistics of the alternative data set were used as loose surrogates and bounds on our beliefs for each parameter. For example, $\sigma^2$ is interpreted as the variance of the residuals for the model, so we expect the variance of the log data to be much larger than the variance of the fitted residuals. Consequently, we set the variance of the log data as the 95th percentile in our prior on $\sigma^2$ and use half of the variance as the 50th percentile. Hyperparameters $a_{\sigma^2}$ and $b_{\sigma^2}$ (shape and rate) are chosen so that the resulting gamma distribution has the required percentiles.
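Matching a gamma distribution to two percentiles has no closed form, but it reduces to a one-dimensional search: the ratio of the 95th to the 50th percentile depends only on the shape, and the rate then sets the scale. The following sketch shows one way this could be done with the standard library alone; it is not the authors' code, the function names are hypothetical, and the log-data variance of 2.0 is a made-up input.

```python
import math

def gamma_cdf(x, shape):
    """CDF of a Gamma(shape, rate=1) variable, using the power series for the
    regularized lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    term, total, n = 1.0, 1.0, 1
    while term > 1e-16 * total:
        term *= x / (shape + n)
        total += term
        n += 1
    return total * math.exp(shape * math.log(x) - x - math.lgamma(shape + 1.0))

def gamma_quantile(p, shape):
    """Quantile of Gamma(shape, rate=1) by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, shape) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit_gamma(q50, q95):
    """Return (shape, rate) so that the gamma distribution has median q50
    and 95th percentile q95."""
    target = q95 / q50
    lo, hi = 0.1, 100.0  # the percentile ratio decreases as the shape grows
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        ratio = gamma_quantile(0.95, mid) / gamma_quantile(0.50, mid)
        if ratio > target:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    rate = gamma_quantile(0.50, shape) / q50
    return shape, rate

# Hypothetical variance of the log data: 2.0 is the 95th percentile, 1.0 the median.
log_data_variance = 2.0
shape, rate = fit_gamma(q50=log_data_variance / 2.0, q95=log_data_variance)
print(shape, rate)
```

Since the 95th percentile is always set to twice the median for $\sigma^2$, the fitted shape does not depend on the data and only the rate varies, which appears consistent with the single shape value listed for $\sigma^2$ across all five elements in Table 2.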

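Table 2. Hyperparameter values for prior distributions

| Parameter | Hyperparameters | Cu | Fe | Mn | Pb | Zn |
|---|---|---|---|---|---|---|
| $\theta_{1,2}$ | m, k | 3.14, 0.00 | 3.14, 0.00 | 3.14, 0.00 | 3.14, 0.00 | 3.14, 0.00 |
| $\sigma^2$ | a, b | 4.69, 3.61 | 4.69, 12.86 | 4.69, 10.78 | 4.69, 7.87 | 4.69, 7.73 |
| $\alpha_0$ | a, b | 0.53, 564.83 | 1.00, 123.50 | 1.10, 1280.73 | 1.21, 552.24 | 1.50, 224.02 |
| $\alpha_{1,2}$ | a, b | 1.00, 7.36 | 1.00, 9.34 | 1.00, 64.79 | 1.00, 14.55 | 1.00, 5.13 |
| $\beta_{1,2}$ | a, b | 1.00, 0.13 | 1.00, 0.26 | 1.00, 0.24 | 1.00, 0.20 | 1.00, 0.21 |
| $\kappa_{1,2}$ | a, b | 1.44, 0.05 | 1.44, 0.05 | 1.44, 0.05 | 1.44, 0.05 | 1.44, 0.05 |
| $\beta_0$ | m, s | −7.72, 1.26 | −5.18, 0.89 | −7.39, 0.85 | −6.43, 0.81 | −5.24, 0.73 |
| $\beta_3$ | m, s | −0.04, 0.16 | −0.04, 0.16 | −0.04, 0.16 | −0.04, 0.16 | −0.04, 0.16 |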


The parameter $\alpha_0$ can be interpreted as the median background concentration not attributable to a pollution point source. We use a gamma distribution in order to restrict the range of $\alpha_0$ to the positive real line. To choose the shape $a_{\alpha_0}$ and rate $b_{\alpha_0}$, we use the lower values of the data as a loose upper bound on the background. We calculate the 50th and 5th percentiles of the alternative data set and then choose hyperparameters that set those percentiles equal to the 95th and 50th percentiles, respectively, of the prior distribution for $\alpha_0$.

Because $\beta_0$ has a similar interpretation as the median background log-concentration, we use the same procedure as when generating the prior for $\alpha_0$, calculating the 50th and 5th percentiles of the alternative data set, and setting those equal to the 95th and 50th percentiles of $\beta_0$'s prior. There is no need to restrict $\beta_0$ to the positive real line, because it describes the median background on the log scale, so we use a normal distribution with hyperparameters such that the appropriate percentiles are achieved.

The parameters $\alpha_1$ and $\alpha_2$ can be interpreted as the change in median concentration attributable to the first and second sources when meteorological conditions result in maximum pollution transport to the receptor. By definition, this change must be non-negative, so we use a gamma distribution in order to capture this constraint. Setting $\alpha_0$, $\alpha_1$, and $\alpha_2$ as gamma distributed guarantees that the argument of the log function in Equation 2 will be positive. We used the 95th percentile of the data as a surrogate for the median under maximal meteorological conditions and used the 5th percentile as a surrogate for the “background”, that is, the median concentration when meteorology transports no pollution from the point source to the receptor. The difference between those values was used as the 50th percentile in our prior distribution. In the absence of an additional percentile for specifying two hyperparameters, we set the shape parameter $a_{\alpha_{1,2}}$ equal to 1, effectively choosing the single-parameter exponential distribution to represent our beliefs about $\alpha_1$ and $\alpha_2$. We expect point sources only to increase the concentration, and the exponential distribution restricts $\alpha_1$ and $\alpha_2$ to be positive. The same procedure was used for generating a prior distribution on $\beta_1$ and $\beta_2$, except that the log of the quotient of the 95th and 5th percentiles was used as the 50th percentile of the prior, because $\beta_1$ and $\beta_2$ have a multiplicative effect on the log scale.
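With the shape fixed at 1, a single median determines the exponential rate, since the median of an exponential distribution with rate $b$ is $\ln 2 / b$. A small sketch of that calculation follows; the percentile values are hypothetical placeholders rather than statistics from the alternative data set, and the function names are introduced only for this example.

```python
import math

def exponential_rate_from_median(median):
    """Rate b of an exponential (Gamma with shape 1) distribution with the given median."""
    return math.log(2.0) / median

def alpha_prior_median(p05, p95):
    """Prior median for alpha_1, alpha_2: difference of the data percentiles."""
    return p95 - p05

def beta_prior_median(p05, p95):
    """Prior median for beta_1, beta_2: log of the quotient of the data percentiles."""
    return math.log(p95 / p05)

# Hypothetical 5th and 95th percentiles of concentration in the alternative data set.
p05, p95 = 0.005, 0.05
rate_alpha = exponential_rate_from_median(alpha_prior_median(p05, p95))
rate_beta = exponential_rate_from_median(beta_prior_median(p05, p95))
print(rate_alpha, rate_beta)
```

The two medians differ only in whether the percentiles are compared by subtraction (untransformed scale) or by division followed by a log (log scale).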

The parameter $\beta_3$ is the main effect of wind speed on log-concentration, so we regress every element in our alternative data set on wind speed, generating a set of estimates of $\beta_3$. The maximum value in the set is used as the 80th percentile of the prior distribution, and the minimum value is used as the 20th percentile. There is no need to restrict the range of $\beta_3$, so we use a normal distribution with mean and variance generating the desired percentiles.
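For a normal prior, two percentiles determine the mean and standard deviation in closed form. The sketch below uses Python's `statistics.NormalDist`; the regression estimates −0.175 and 0.095 are hypothetical values chosen only so that the result lands near the mean of −0.04 listed for $\beta_3$ in Table 2, and `normal_from_percentiles` is a name introduced for this example.

```python
from statistics import NormalDist

def normal_from_percentiles(p_lo, x_lo, p_hi, x_hi):
    """Normal distribution whose p_lo and p_hi quantiles equal x_lo and x_hi."""
    z_lo = NormalDist().inv_cdf(p_lo)
    z_hi = NormalDist().inv_cdf(p_hi)
    sd = (x_hi - x_lo) / (z_hi - z_lo)
    mean = x_lo - sd * z_lo
    return NormalDist(mean, sd)

# Hypothetical minimum and maximum of the per-element wind speed slopes,
# used as the 20th and 80th percentiles of the prior.
beta3_prior = normal_from_percentiles(0.20, -0.175, 0.80, 0.095)
print(beta3_prior.mean, beta3_prior.stdev)
```

The same helper covers the prior for $\beta_0$ by passing the 50th and 95th percentiles instead.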

4. RESULTS

To compute the posterior distribution of both the von Mises-based model and the AERMOD-based model, we ran MCMC for 100,000 iterations. Assessment of time series plots indicated that adequate mixing had occurred. The Toxic Release Inventory (TRI) of the St. Louis area provides a report of the total amount of pollution emitted by each of the major point sources in the St. Louis area. The TRI also reports the distance and direction of each pollution source in relation to the receptor. Although the TRI is subject to error in both reported emissions and precise locations, the TRI provides primary directions that we expect the posterior distribution of $\theta$ to contain. We compare the posterior distributions of $\theta$ for each element with the direction of known substantial emitters of that element.

4.1. Source direction identification

Table 3 shows the 95% credible intervals of $\theta$ for both the von Mises-based and AERMOD-based one-source models. Each interval is no more than 17° wide, giving a substantially more precise estimate of the source direction than alternative graphical methods allow. In addition, this estimate is probabilistic, not merely graphical. Most importantly, almost all intervals contain the direction corresponding to the most dominant emitter listed in the TRI, listed in the table under “Primary ESD.” The exceptions are copper (both models) and zinc (AERMOD-based model), for which the intervals miss the source by roughly 5° or less.
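Because directions live on a circle and some interval endpoints are reported as negative angles, checking whether an interval contains a direction is best done modulo 360°. The following sketch applies such a check to the values reported in Table 3; the helper `contains_direction` is an illustrative name, and it assumes every interval is narrower than a full circle.

```python
def contains_direction(interval, direction):
    """True if `direction` (degrees) lies on the arc running clockwise from
    interval[0] to interval[1]; assumes the arc is shorter than 360 degrees."""
    lo, hi = interval
    return (direction - lo) % 360.0 <= (hi - lo) % 360.0

# Species: (primary ESD, von Mises interval, AERMOD interval), from Table 3.
TABLE3 = {
    "Cu": (207.6, (196.0, 207.4), (194.6, 202.3)),
    "Fe": (9.7, (9.2, 21.5), (5.3, 17.9)),
    "Mn": (9.7, (3.9, 13.7), (-2.9, 10.9)),
    "Pb": (205.8, (202.1, 215.6), (195.1, 212.1)),
    "Zn": (219.0, (211.2, 222.2), (211.1, 216.9)),
}

misses = [
    (species, model)
    for species, (esd, vm, aermod) in TABLE3.items()
    for model, interval in (("von Mises", vm), ("AERMOD", aermod))
    if not contains_direction(interval, esd)
]
print(misses)
```

The intervals flagged are those for copper under both models and for zinc under the AERMOD-based model, matching the exceptions named above.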

In order to account for both source emission volume and source distance, we define source dominance by dividing total yearly emissions by distance to the receptor. Sources make a large impact on the receptor by emitting large amounts of pollutant, but greater distance from the receptor diminishes this impact. Dividing emissions by distance allows both distant but voluminous emitters and close but sparse emitters to be considered as dominant sources. Thus we define the primary expected source direction (ESD) as the direction of the pollution source with largest emission-to-distance ratio. The TRI does not include data for iron, but we expect the steel production facility at 9.7° to be a major source of iron and define this direction as the primary ESD for iron (see Lee et al., 2006; Lingwall et al., 2008).

Table 3. 95% credible intervals for $\theta$, in degrees, based on both the von Mises-based and AERMOD-based one-source models

| Species | Primary ESD | von Mises | AERMOD |
|---|---|---|---|
| Cu | 207.6 | (196.0, 207.4) | (194.6, 202.3) |
| Fe | 9.7 | (9.2, 21.5) | (5.3, 17.9) |
| Mn | 9.7 | (3.9, 13.7) | (−2.9, 10.9) |
| Pb | 205.8 | (202.1, 215.6) | (195.1, 212.1) |
| Zn | 219.0 | (211.2, 222.2) | (211.1, 216.9) |

Each credible interval is compared with the primary expected source direction (ESD), defined as the direction of the facility with largest output-to-distance ratio.
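The emission-to-distance rule for picking the primary ESD amounts to a one-line ranking. In the sketch below, the facility records are invented for illustration and do not come from the TRI; the field names and the function `primary_esd` are likewise assumptions.

```python
def primary_esd(facilities):
    """Direction (degrees) of the facility with the largest emission-to-distance ratio."""
    dominant = max(facilities, key=lambda f: f["emissions"] / f["distance"])
    return dominant["direction"]

# Hypothetical facilities: yearly emissions, distance to receptor, direction from receptor.
facilities = [
    {"name": "A", "emissions": 1000.0, "distance": 20.0, "direction": 205.8},
    {"name": "B", "emissions": 300.0, "distance": 2.0, "direction": 9.7},
    {"name": "C", "emissions": 50.0, "distance": 5.0, "direction": 120.0},
]
print(primary_esd(facilities))
```

Here the nearby facility B outranks the larger but more distant facility A, which is the behavior the ratio is meant to capture.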

Table 4 shows the 95% credible intervals for $\theta_1$ and $\theta_2$ under the two-source von Mises-based and AERMOD-based models. We leave blank the second ESD for iron because only the primary ESD is identified a priori. The two-source models capture the primary ESD with the same success as the one-source models, containing it for all examined elements but copper (von Mises and AERMOD) and zinc (AERMOD), for which the credible intervals miss by roughly 5° or less. The direction of the secondary ESD, however, escapes the second credible interval for all elements, coming close only with zinc. We demonstrate below, however, that the two-source models capture clusters of sources extremely well.

Two factors may explain the cases where the credible interval misses the expected source directions: source clustering and emissions misreporting. At times, the most dominant sources cluster so closely together that the model treats them as a single source. Copper's second source, for example, is only 10° from its first source, and the credible intervals for $\theta_2$ are nowhere near the direction of the second most dominant source. According to the TRI, the top 10 expected source directions for lead are clustered in two groups: 198° to 219° and 0° to 9.7°. Although the top seven sources are all near 206° to 219°, the employed method effectively finds the top two source clusters. A similar situation exists for the source directions identified in the zinc analysis.

In addition, emissions misreporting may account for the lack of second source capture. The data contained in the TRI are reports issued by the plants themselves, rather than by independent external analysts; this policy is fraught with incentive to misreport. See Henry et al. (1997) for a study that confirms significant misreporting of emissions. Credible intervals that do not contain the ESD could well be finding a significant source of underreported pollution. This is likely the case with manganese, for which the TRI contains no record of substantial sources contained in the credible intervals for $\theta_2$.

4.2. Parameter estimates

Table 5 presents the von Mises-based and AERMOD-based parameter estimates for lead. The two models estimate roughly the same intervals for $\theta$, and they give comparable estimates for $\sigma^2$. Figure 1 gives intuition for the interpretation of the remaining parameters. The plotted data are log-concentration adjusted for the effect of wind speed, calculated by

$$
y'_i = \log y_i - \tilde{\beta}_3 s_i \qquad (3)
$$

where $y_i$ is the lead concentration for day $i$, $s_i$ is the wind speed for day $i$, $\tilde{\beta}_3$ is the median of the posterior draws of $\beta_3$, and $y'_i$ is the adjusted data plotted in the figure. We calculate the fitted curve by computing the median von Mises predicted value for 1°, 2°, ..., 360°, joining the points together in a line. The vertical solid line is the direction of the dominant source of lead, the dashed lines delineate the von Mises credible interval for $\theta$, and the dotted lines delineate the AERMOD credible interval for $\theta$.

Table 4. 95% credible intervals for $\theta_1$ and $\theta_2$, in degrees, using both the von Mises-based (vM) and AERMOD-based (A) two-source models

| Species | Primary ESD | $\theta_1$ (vM) | $\theta_1$ (A) | Secondary ESD | $\theta_2$ (vM) | $\theta_2$ (A) |
|---|---|---|---|---|---|---|
| Cu | 207.6 | (194.9, 206.1) | (194.8, 202.1) | 217.4 | (−114.8, 131.0) | (−103.3, 144.1) |
| Fe | 9.7 | (6.7, 21.9) | (2.7, 10.6) | | (65.2, 185.2) | (48.9, 92.1) |
| Mn | 9.7 | (3.7, 13.1) | (−2.9, 9.9) | 0.0 | (163.5, 207.9) | (192.9, 210.4) |
| Pb | 205.8 | (201.1, 214.4) | (194.1, 207.1) | 217.4 | (−11.0, 30.6) | (4.5, 19.6) |
| Zn | 219.3 | (212.0, 221.5) | (210.5, 216.3) | 9.7 | (−12.3, 4.1) | (−3.3, 9.6) |

Each credible interval is compared with the primary and secondary expected source direction (ESD), defined as the directions of the facilities with the largest output-to-distance ratios.

Table 5. Posterior medians and 95% credible intervals (95% CI) for the one-source models (von Mises and AERMOD) applied to lead

| Parameter | von Mises median | von Mises 95% CI | AERMOD median | AERMOD 95% CI |
|---|---|---|---|---|
| $\theta$ | 208.7 | (202.1, 215.6) | 200.3 | (195.1, 212.1) |
| $\sigma^2$ | 0.61 | (0.55, 0.69) | 0.63 | (0.60, 0.76) |
| $\beta_0$ | −4.0 | (−4.1, −3.8) | | |
| $\beta_1$ | 0.82 | (0.60, 1.2) | | |
| $\beta_3$ | −0.34 | (−0.42, −0.29) | | |
| $\kappa$ | 9.07 | (5.1, 17.2) | | |
| $\alpha_0$ | | | $7.1 \times 10^{-3}$ | ($6.0 \times 10^{-3}$, $8.1 \times 10^{-3}$) |
| $\alpha_1$ | | | $3.8 \times 10^{-2}$ | ($2.8 \times 10^{-2}$, $5.3 \times 10^{-2}$) |

Figure 1. Fitted von Mises model to the lead data, with credible intervals for $\theta$. Log-concentration is adjusted for the estimated effect of wind speed. The solid vertical line represents the most dominant source of lead indicated in the Toxic Release Inventory. Dashed vertical lines represent the credible interval from the von Mises-based model, and dotted vertical lines represent the credible interval from the AERMOD-based model. This figure is available in colour online at wileyonlinelibrary.com/journal/environmetrics
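The wind speed adjustment of Equation 3 is straightforward to compute from posterior draws. The snippet below is a sketch with made-up inputs; the concentrations, wind speeds, and posterior draws are hypothetical, and `adjust_for_wind_speed` is a name introduced here.

```python
import math
from statistics import median

def adjust_for_wind_speed(concentrations, wind_speeds, beta3_draws):
    """Log-concentration with the estimated wind speed effect removed (Equation 3)."""
    beta3_tilde = median(beta3_draws)  # posterior median of beta_3
    return [math.log(y) - beta3_tilde * s for y, s in zip(concentrations, wind_speeds)]

# Hypothetical daily concentrations, wind speeds, and posterior draws of beta_3.
adjusted = adjust_for_wind_speed(
    concentrations=[0.02, 0.05, 0.01],
    wind_speeds=[2.0, 1.0, 4.0],
    beta3_draws=[-0.40, -0.34, -0.30],
)
print(adjusted)
```

Because the posterior median of $\beta_3$ is negative, the adjustment raises the log-concentration on windy days, undoing the dilution effect before the data are plotted against wind direction.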

The parameter $\beta_0$ is the baseline of the peaked curve and indicates the expected log concentration when the wind is blowing from a direction much different from that of the dominant pollution source. This parameter represents “background” log-concentration not attributable to the dominant source, so $e^{\beta_0}$ is the median background concentration. According to Table 5, the median “background” log concentration not attributable to the dominant source is −4.0, and exponentiating gives an estimated median background concentration of $1.8 \times 10^{-2}$ μg m$^{-3}$. This corresponds roughly with the estimate for $\alpha_0$, which represents the median background concentration under the AERMOD model. The median of $\alpha_0$'s posterior, at $7.1 \times 10^{-3}$ μg m$^{-3}$, is slightly smaller than the exponentiated estimate of $\beta_0$, and this is because of the sophisticated way that AERMOD handles the meteorological data. On days with winds blowing outside a narrow angular range controlled by $\kappa$, the von Mises-based model classifies the concentration as background. In this case, the $\kappa$ median of 9.07 restricts this angular range to roughly ±50°. Figure 1 illustrates that on days with winds blowing outside the range of 210 ± 50°, the concentrations are treated as background. However, some days may contain highly variable hourly wind directions that transport some concentration from the pollution source to the receptor, but that ultimately result in a mean daily wind direction far from the source direction. Such days are treated as background in the von Mises model, but because AERMOD processes hourly meteorological data, it infers source contribution on days where the daily mean wind direction is much different from the source direction. Consequently, AERMOD can infer source contribution for many more days than the von Mises model can and therefore attributes less of the pollution to background and more to source contribution. This is why $\alpha_0$ is slightly smaller than $e^{\beta_0}$.

The value of $\beta_1$, on the other hand, is represented in Figure 1 as the height of the peak, that is, the distance from the $Z$ function's baseline to its maximum. It can be interpreted as the dominant source's expected contribution to log concentration when the wind is blowing directly from the source to the receptor (wind direction and source direction are equal). The median of $\beta_1$'s posterior distribution is 1.69, so the dominant source (presumed to be a copper mill at 208°) increments the log-concentration of copper by 1.69 when the wind blows directly from the mill to the receptor. The quantity $e^{\beta_1} = 5.403$ is the multiplicative change in median concentration when the wind blows directly from the source to the receptor. Ignoring the effect of wind speed, the median concentration when wind blows from the source to the receptor can be estimated by computing the median of $e^{\beta_0 + \beta_1}$, which is $4.3 \times 10^{-2}$ μg m$^{-3}$.

Alignment of mean daily wind direction with the source direction is the von Mises-based model's attempt to capture meteorological conditions with maximal transport potential. AERMOD's sophistication, however, allows a much more informed characterization of maximal transport meteorological conditions, taking into account additional information such as hourly wind direction, humidity, and barometric pressure. The parameter $\alpha_1$ has a similar interpretation to $\beta_1$, but on the untransformed scale, representing median additive source contribution when meteorological conditions are such as to transport the greatest amount of pollution to the receptor. The posterior median for $\alpha_1$ indicates that the impact of the primary lead source is incremented by $2.8 \times 10^{-2}$ μg m$^{-3}$ beyond baseline level when meteorological conditions are optimal for transport to the receptor. Further, the posterior median for $\alpha_0 + \alpha_1$ indicates that the median concentration given transport-maximizing meteorological conditions is $4.6 \times 10^{-2}$ μg m$^{-3}$. Because AERMOD incorporates more information than simply wind direction to determine what meteorological conditions characterize maximum transport potential, $\alpha_0 + \alpha_1$ is slightly higher than the analogous quantity from the von Mises model ($e^{\beta_0 + \beta_1}$).

The parameter $\beta_3$'s 95% credible interval is entirely less than 0; this is consistent with the phenomenology of atmospheric dispersion. When wind blows at higher speeds, the pollution plume is diluted with clean air, decreasing pollution concentration (Turner, 1994). This is also consistent with the Gaussian dispersion equation, widely used to model atmospheric dispersion. In the Gaussian dispersion equation, wind speed occurs in the denominator, multiplicatively decreasing concentration with higher wind speeds. Because our model is on the log scale, we would expect wind speed to have an additively negative effect on log concentration, just as the credible interval for $\beta_3$ indicates.
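The role of wind speed can be made explicit with the textbook form of the Gaussian plume equation. The notation below is the conventional one and is an assumption here, not notation taken from this paper: $Q$ is the emission rate, $u$ the wind speed, $H$ the effective release height, and $\sigma_y$, $\sigma_z$ the horizontal and vertical dispersion coefficients (unrelated to the residual variance $\sigma^2$ of our models).

$$
C(x, y, z) = \frac{Q}{2\pi\, u\, \sigma_y \sigma_z} \exp\!\left(-\frac{y^2}{2\sigma_y^2}\right) \left[\exp\!\left(-\frac{(z - H)^2}{2\sigma_z^2}\right) + \exp\!\left(-\frac{(z + H)^2}{2\sigma_z^2}\right)\right]
$$

Taking logs separates the wind speed term additively:

$$
\log C = -\log u + \log\frac{Q}{2\pi \sigma_y \sigma_z} - \frac{y^2}{2\sigma_y^2} + \log\!\left[\exp\!\left(-\frac{(z - H)^2}{2\sigma_z^2}\right) + \exp\!\left(-\frac{(z + H)^2}{2\sigma_z^2}\right)\right]
$$

Holding the other terms fixed, $\partial \log C / \partial u = -1/u < 0$, so higher wind speed lowers log concentration. Our models use a term linear in wind speed rather than $-\log u$, so the plume equation supports the sign of $\beta_3$ rather than its exact functional form.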

Table 6 presents the parameter estimates for the two-source von Mises-based and AERMOD-based models. Figure 2 illustrates the fitted von Mises model and plots the dominant source from each cluster, as explained in Section 4.1. The estimates for $\theta_1$ correspond closely to the estimate for $\theta$ in the one-source model, and the credible intervals for $\theta_1$ and $\theta_2$ contain the dominant source of each cluster. The estimates for $\sigma^2$ are smaller for the two-source model because the additional source explains additional variation. The estimates for $\beta_0$ and $\alpha_0$ are both lower than in the one-source models, because the additional source claims contribution for what was previously considered background. Figure 2 illustrates this effect, showing the baseline of the fitted von Mises curve to be significantly lower than in the one-source model. The lowered estimates for background in the two-source model force the estimates of $\beta_1$ and $\alpha_1$ to increase in order to reach the same height as in the one-source model.

Table 6. Posterior medians and 95% credible intervals (95% CI) for the two-source models (von Mises and AERMOD) applied to lead

| Parameter | von Mises median | von Mises 95% CI | AERMOD median | AERMOD 95% CI |
|---|---|---|---|---|
| $\theta_1$ | 207.8 | (201.1, 214.4) | 198.1 | (194.1, 207.1) |
| $\theta_2$ | 14.1 | (−11.2, 30.6) | 10.8 | (4.5, 19.6) |
| $\sigma^2$ | 0.60 | (0.53, 0.67) | 0.63 | (0.55, 0.72) |
| $\beta_0$ | −4.2 | (−5.4, −3.9) | | |
| $\beta_1$ | 1.0 | (0.75, 2.1) | | |
| $\beta_2$ | 0.58 | (0.22, 1.7) | | |
| $\beta_3$ | −0.35 | (−0.42, −0.29) | | |
| $\kappa_1$ | 4.5 | (1.3, 10.5) | | |
| $\kappa_2$ | 2.9 | (0.15, 23.9) | | |
| $\alpha_0$ | | | $5.0 \times 10^{-3}$ | ($3.5 \times 10^{-3}$, $6.2 \times 10^{-3}$) |
| $\alpha_1$ | | | $4.5 \times 10^{-2}$ | ($3.1 \times 10^{-2}$, $6.1 \times 10^{-2}$) |
| $\alpha_2$ | | | $2.1 \times 10^{-2}$ | ($1.1 \times 10^{-2}$, $3.6 \times 10^{-2}$) |

Figure 2. Fitted two-source von Mises-based model to the lead data, with credible intervals for $\theta_1$ and $\theta_2$. Log-concentration is adjusted for the estimated effect of wind speed. Solid vertical lines represent dominant sources of lead indicated in the Toxic Release Inventory, displaying only the dominant source of each cluster as explained in Section 4.1. Dashed vertical lines represent the credible intervals from the von Mises-based model, and dotted vertical lines represent credible intervals from the AERMOD-based model. This figure is available in colour online at wileyonlinelibrary.com/journal/environmetrics

4.3. Inference on the number of sources

Tables 7 and 8 present the posterior probabilities for the appropriate number of sources to include in each model, computed using RJMCMC. The tables indicate that the von Mises-based model and the AERMOD-based model roughly agree on the posterior probabilities. In both the von Mises-based and AERMOD-based models, zinc and manganese are strongly weighted toward the two-source model, iron is moderately but clearly weighted toward the two-source model, and copper is moderately weighted toward the one-source model. The only element for which the von Mises-based and AERMOD-based models differ substantially is lead, which is moderately two-source under the von Mises-based model, but strongly two-source under the AERMOD-based model. In fact, for every element, AERMOD assigns at least as much weight to the two-source model as the von Mises-based model does. AERMOD incorporates more information than does the von Mises-based model, so it has greater power to detect and confirm the presence of additional sources.

Table 7. Posterior model probabilities for reversible jump applied to the von Mises-based model

| Element | One source | Two sources |
|---|---|---|
| Cu | 0.85 | 0.15 |
| Fe | 0.24 | 0.76 |
| Mn | 0.01 | 0.99 |
| Pb | 0.20 | 0.80 |
| Zn | ≈0.00 | ≈1.00 |

Table 8. Posterior model probabilities for reversible jump applied to the AERMOD-based model

| Element | One source | Two sources |
|---|---|---|
| Cu | 0.79 | 0.21 |
| Fe | 0.14 | 0.86 |
| Mn | ≈0.00 | ≈1.00 |
| Pb | 0.01 | 0.99 |
| Zn | ≈0.00 | ≈1.00 |

The posterior probabilities are consistent with estimates of $\theta_2$ within each model. For example, copper's credible intervals for $\theta_2$ are (−114.8, 131.0) and (−103.3, 144.1) under the von Mises-based and AERMOD-based models, respectively. These two intervals each cover over two-thirds of the entire circle, giving virtually no information about the direction of the second source. The inability of the two-source models to find a focused interval for a second source indicates that the copper data do not support an additional source, so it is not surprising that the posterior probabilities give little weight to the two-source model. In fact, the posterior probabilities for the two-source model in Tables 7 and 8 exhibit a strong negative correlation with the credible interval widths for the $\theta_2$s in Table 4.
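The negative correlation can be reproduced directly from the reported tables. The sketch below computes a Pearson correlation between the widths of the $\theta_2$ intervals in Table 4 and the two-source probabilities in Tables 7 and 8; entries reported as ≈1.00 are treated as exactly 1.0, which is an assumption of this example, and `pearson` is a helper written for it.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# theta_2 intervals from Table 4, ordered Cu, Fe, Mn, Pb, Zn.
vm_intervals = [(-114.8, 131.0), (65.2, 185.2), (163.5, 207.9), (-11.0, 30.6), (-12.3, 4.1)]
aermod_intervals = [(-103.3, 144.1), (48.9, 92.1), (192.9, 210.4), (4.5, 19.6), (-3.3, 9.6)]
# Two-source posterior probabilities from Tables 7 and 8 (approx. 1.00 entered as 1.0).
vm_probs = [0.15, 0.76, 0.99, 0.80, 1.0]
aermod_probs = [0.21, 0.86, 1.0, 0.99, 1.0]

vm_widths = [hi - lo for lo, hi in vm_intervals]
aermod_widths = [hi - lo for lo, hi in aermod_intervals]
r_vm = pearson(vm_widths, vm_probs)
r_aermod = pearson(aermod_widths, aermod_probs)
print(r_vm, r_aermod)
```

With only five elements per model the correlation is descriptive rather than inferential, but both coefficients come out strongly negative.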

5. DISCUSSION

For PSA, the deterministic dispersion model has been an indispensable tool because of the sophisticated way it can incorporate meteorological observations into complex atmospheric physics theory, yielding valuable insights about the transport and fate of pollutants. The drawback of deterministic dispersion models is that they require an a priori identification of all pollution sources, along with their emission rates and other facts about the physical characteristics of the pollution-emitting facility (e.g., smokestack height and temperature of emissions). Without complete knowledge about major sources, source apportionments can be dramatically lacking.

In contrast, receptor-based models, which are statistical in nature, have the flexibility to incorporate unidentified sources but are relatively unsophisticated in that they make no use of known properties of atmospheric physics. In the source apportionment community, there has long been a recognized need to synthesize the best approaches of the dispersion modelers and the receptor modelers.

In this article, we synthesize deterministic and stochastic models in an effort to solve the inverse problem of identification of pollution source directions and carry out statistical inference about those pollution source directions. Additionally, this approach may be useful in developing hybrid approaches for formal PSA studies and for statistical modeling with deterministic components in general.

For this study, we analyzed five pollutants in the St. Louis area in order to identify the directions of their respective dominant pollution sources. As a tool for comparison with our deterministic/stochastic hybrid approach, a model based on the kernel of the von Mises density function was used to regress concentration on wind direction and wind speed. The simplicity of the model allows for ease of interpretation, relatively fast computation of the posterior distributions, and the easy insertion of additional sources into the model. In most cases, the von Mises-based model succeeds at generating credible intervals containing the “true” source direction, or clusters of “true” source directions, identified in the TRI.

We also used a model based on the deterministic dispersion model AERMOD, which allows a much more sophisticated, phenomenologically justifiable model. AERMOD is used in the simulated likelihood of the concentration data and is run at each iteration of the MCMC in order to estimate the source direction and integrate over other required inputs. The AERMOD-based model also succeeds at generating credible intervals containing dominant pollution sources. The AERMOD-based model incorporates significantly more meteorological data than the von Mises-based model allows, so the resulting parameter estimates attribute less concentration to background and more to source contribution. AERMOD's complexity also gives it greater power to detect the presence of additional sources, which is why RJMCMC in the AERMOD-based model assigns greater weight to two sources than in the von Mises-based model. For all elements other than copper, the data supported the use of the two-source model.

The approaches developed in this paper may be beneficial in three different areas. First, on a methodological level, we implement a new approach for incorporating deterministic computer models directly into a Bayesian hierarchical model, with the deterministic model providing a simulated likelihood that is used in every iteration of the MCMC. Second, PSA is increasingly used for the development and evaluation of environmental policy and the quantification of health effects associated with various pollution source types. Proper point source identification and proper understanding of the transport of point source pollution are central to the success of source apportionment analyses. Finally, the methods developed herein can be used to help reveal previously unidentified locations of local pollution, facilitating compliance with environmental regulation and protecting human health.

Acknowledgements

This work was supported in part by the STAR Research Assistance Agreement No. RD-83216001-0 awarded by the US Environmental Protection Agency (EPA) and National Science Foundation CMG grant No. ATM-0934490. The article has not been formally reviewed by the EPA. The views expressed in this document are solely those of the authors, and the EPA does not endorse any products or commercial services mentioned in this publication. The authors thank Dr Jay Turner for the assistance in obtaining the data from the EPA St. Louis–Midwest Supersite.

REFERENCES

Alpert DJ, Hopke PK. 1981. A determination of the sources of airborne particles collected during the regional air pollution study. Atmospheric Environment 15: 675–687.

Ashbaugh LL, Malm WC, Sadeh WZ. 1985. A residence time probability analysis of sulfur concentrations at Grand Canyon National Park. Atmospheric Environment 19: 1263–1270.

Christensen WF, Gunst RF. 2004. Measurement error models in chemical mass balance analysis of air quality data. Atmospheric Environment 38: 733–744.

Dockery DW, Pope CA, Xu X, Spengler JD, Ware JH, Fay ME, Ferris BG, Speizer FE. 1993. An association between air pollution and mortality in six U.S. cities. The New England Journal of Medicine 329: 1753–1759.

Dominici F, Samet JM, Zeger SL. 2000. Combining evidence on air pollution and daily mortality from the 20 largest U.S. cities: a hierarchical modelling strategy. Journal of the Royal Statistical Society. Series A (Statistics in Society) 163: 263–302.

Fuentes M, Raftery A. 2005. Model evaluation and spatial interpolation by Bayesian combination of observations with output from numerical models. Biometrics 61: 36–45.

Green PJ. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82: 711–732.

Gelman A, Carlin JB, Stern HS, Rubin DB. 2003. Bayesian Data Analysis. Chapman and Hall/CRC: Boca Raton, FL.

Heaton MJ, Reese CS, Christensen WF. 2010. Incorporating time-dependent source profiles using the Dirichlet distribution in multivariate receptor models. Technometrics 52: 67–79.

Henry RC, Spiegelman CH, Chang YS. 2002. Locating nearby sources of air pollution by nonparametric regression of atmospheric concentrations on wind direction. Atmospheric Environment 36: 2237–2244.

Henry RC, Spiegelman CH, Collins JF, Park E. 1997. Reported emissions of organic gases are not consistent with observations. Proceedings of the National Academy of Sciences 94: 6596–6599.

Hering AS, Genton MG. 2010. Powering up with space-time wind forecasting. Journal of the American Statistical Association 105: 92–104.

Higdon D, Kennedy M, Cavendish JC, Cafeo JA, Ryne RD. 2004. Combining field observations and simulations for calibration and prediction. SIAM Journal on Scientific Computing 26: 448–466.

Hopke PK. 1991. Receptor Modeling for Air Quality Management. Elsevier: New York.

Hwang I, Hopke PK, Pinto JP. 2008. Source apportionment and spatial distributions of coarse particles during the regional air pollution study. Environmental Science and Technology 42(10): 3524–3530.

Kennedy MC, O’Hagan A. 2001. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(3): 425–464.

Kestens E, Teugels JL. 2002. Challenges in modelling stochasticity in wind. Environmetrics 13: 821–830.

Kim E, Hopke PK, Edgerton ES. 2003. Source identification of Atlanta aerosol by positive matrix factorization. Journal of the Air and Waste Management Association 53: 731–739.

Laden F, Neas LM, Dockery DW, Schwartz J. 2000. Association of fine particulate matter from different sources with daily mortality in six U.S. cities. Environmental Health Perspectives 108(10): 941–947.

Lee JH, Hopke PK. 2006. Apportioning sources of PM2.5 in St. Louis, MO using speciation trends network data. Atmospheric Environment 40(Supplement 2): 360–377.

Lee JH, Hopke PK, Turner JR. 2006. Source identification of airborne PM2.5 at the St. Louis–Midwest Supersite. Journal of Geophysical Research 111: D10S10. doi:10.1029/2005JD006329.

Lingwall JW, Christensen WF. 2007. Pollution source apportionment using a priori information and positive matrix factorization. Chemometrics and Intelligent Laboratory Systems 87: 281–294.

Lingwall JW, Christensen WF, Reese CS. 2008. Dirichlet based Bayesian multivariate receptor modeling. Environmetrics 19: 618–629.

Liu CK, Roscoe BA, Severin KG, Hopke PK. 1982. The application of factor analysis to source apportionment of aerosol mass. American Industrial Hygiene Association Journal 43: 314–318.

Mardia KV, Jupp PE. 2000. Directional Statistics. Wiley: New York.

Park SS, Pancras JP, Ondov J. 2005. A new pseudodeterministic multivariate receptor model for individual source apportionment using highly time-resolved ambient concentration measurements. Journal of Geophysical Research 110: D07S15. doi:10.1029/2004JD004664.


Ramsay JO, Hooker G, Campbell D, Cao J. 2007. Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(5): 741–796.

Spengler JD, Thurston G. 1983. Mass and elemental composition of fine and coarse particles in six U.S. cities. Journal of the Air Pollution Control Association 33: 1162–1171.

Tarantola A. 2005. Inverse problem theory and methods for model parameter estimation. Society for Industrial and Applied Mathematics: Philadelphia, PA.

Turner DB. 1994. Workbook of Atmospheric Dispersion Estimates. Lewis Publishers: Boca Raton, FL.

US Environmental Protection Agency. 2004a. User’s guide for AERMOD meteorological preprocessor (AERMET), EPA-454/B-03-002. Office of Air Quality Planning and Standards, Research Triangle Park, NC.

US Environmental Protection Agency. 2004b. User’s guide for the AMS/EPA regulatory model (AERMOD), EPA-454/B-03-001. Office of Air Quality Planning and Standards, Research Triangle Park, NC.

US Environmental Protection Agency. 2006. NEI quality assurance and data augmentation for point sources. Emissions Monitoring and Analysis Division, Research Triangle Park, NC.

US Environmental Protection Agency. 2009a. EPA to reconsider lead air quality monitoring requirements fact sheet. Retrieved 2 September 2008, from U.S. EPA website: http://www.epa.gov/air/lead/actions.html.

US Environmental Protection Agency. 2009b. SPECIATE 4.2: speciation database development documentation. Office of Research and Development, Research Triangle Park, NC.

Wang G, Hopke PK, Fu G. 2009. Identification of major sources of PM2.5 in St. Louis Missouri USA. Journal of Ocean University of China 8(2): 101–110.

Wikle CK, Milliff RF, Nychka D, Berliner LM. 2001. Spatiotemporal hierarchical Bayesian modeling: tropical ocean surface winds. Journal of the American Statistical Association 96: 382–397.
