building statistical models to analyze species distributions

February 2006 33CONTEMPORARY STATISTICS AND ECOLOGY

33

Ecological Applications, 16(1), 2006, pp. 33–50q 2006 by the Ecological Society of America

BUILDING STATISTICAL MODELS TO ANALYZESPECIES DISTRIBUTIONS

ANDREW M. LATIMER,1,4 SHANSHAN WU,2 ALAN E. GELFAND,3 AND JOHN A. SILANDER, JR.1

1Department of Ecology and Evolutionary Biology, University of Connecticut, 75 North Eagleville Road, Unit 3043,Storrs, Connecticut 06269 USA

2Department of Statistics, University of Connecticut, 215 Glenbrook Road, Unit 4120, Storrs, Connecticut 06269 USA3Institute for Statistics and Decision Sciences, Box 90251, Duke University, Durham, North Carolina 27708 USA

Abstract. Models of the geographic distributions of species have wide application inecology. But the nonspatial, single-level, regression models that ecologists have oftenemployed do not deal with problems of irregular sampling intensity or spatial dependence,and do not adequately quantify uncertainty. We show here how to build statistical modelsthat can handle these features of spatial prediction and provide richer, more powerfulinference about species niche relations, distributions, and the effects of human disturbance.We begin with a familiar generalized linear model and build in additional features, includingspatial random effects and hierarchical levels. Since these models are fully specified sta-tistical models, we show that it is possible to add complexity without sacrificing inter-pretability. This step-by-step approach, together with attached code that implements asimple, spatially explicit, regression model, is structured to facilitate self-teaching. Allmodels are developed in a Bayesian framework. We assess the performance of the modelsby using them to predict the distributions of two plant species (Proteaceae) from SouthAfrica’s Cape Floristic Region. We demonstrate that making distribution models spatiallyexplicit can be essential for accurately characterizing the environmental response of species,predicting their probability of occurrence, and assessing uncertainty in the model results.Adding hierarchical levels to the models has further advantages in allowing human trans-formation of the landscape to be taken into account, as well as additional features of thesampling process.

Key words: Bayesian; Cape Floristic Region; hierarchical; Proteaceae; spatial dependence;spatial random effects; spatially explicit; species distribution model; uncertainty.

INTRODUCTION

Ecologists increasingly use species distribution mod-els to address theoretical and practical issues, includingpredicting the response of species to climate change(e.g., Midgley et al. 2002), identifying and managingconservation areas (e.g., Austin and Meyers 1996),finding additional populations of known species orclosely related sibling species (e.g., Raxworthy et al.2003), and seeking evidence of competition among spe-cies (e.g., Leathwick 2002). In all of these applications,the core problem is to use information about where aspecies occurs (and where it does not) and about theassociated environment to predict how likely the spe-cies is to be present or absent in unsampled locations.

Spatial prediction of species distributions is thus di-rectly related to the concept of the environmental niche,a specification of a species’ response to a suite of en-

Manuscript received 2 April 2004; revised 27 September2004; accepted 27 September 2004; final version received 2 De-cember 2004. Corresponding Editor: D. S. Schimel. For reprintsof this Invited Feature, see footnote 1, p. 3.

4 E-mail: [email protected]

vironmental factors (MacArthur et al. 1966, Austin etal. 1990, Brown et al. 1995). But are contemporaryenvironmental factors alone sufficient to account forspecies distributions? Other ecological processes in-cluding dispersal, reproduction, competition, and thedynamics of large and small populations may also af-fect the spatial arrangement of species distributions(Gaston 2003). Most species-distribution models ig-nore spatial pattern and thus are based implicitly ontwo assumptions: (1) that environmental factors are theprimary determinants of species distributions; and (2)that species have reached or nearly reached equilibriumwith these factors. These assumptions underlie bothtypes of modeling approaches that are currently dom-inant: generalized linear and generalized additive re-gression models (GLM and GAM), and climate enve-lope models (see Guisan and Zimmermann 2000).These assumptions may or may not be adequate ap-proximations, depending on the relative influence ofenvironmental change, including both climate changeand direct human transformation of landscapes, micro-evolution, and dispersal-related lags in species move-ment across landscapes (Peterson 2003). To the extent

34 INVITED FEATURE Ecological ApplicationsVol. 16, No. 1

that model assumptions are violated, or that speciesdistribution data are inadequate, simple species distri-bution models may fail to provide adequate predictivepower, or may underestimate the degree of uncertaintyin their predictions, resulting in poor characterizationof the environmental response of the species. In ad-dition to these fundamental ecological issues, spatialmodeling also raises practical complications relating tosampling, including observer error, variable samplingintensity, and gaps in sampling. There may also be aspatial mismatch between different data sources (e.g.,Agarwal et al. 2002). Here we present models that ad-dress these problems, show how they are related tosimpler models, and demonstrate how to build them.

The literature on species distribution modeling cov-ers many modeling approaches and applications. For-tunately, there are useful review papers that organizeand categorize model approaches (Guisan and Zim-mermann 2000, Ferrier et al. 2002, Guisan et al. 2002).This paper aims to complement these reviews by fo-cusing on Bayesian regression models for species pres-ence/absence and taking a model-building approach.

We start with a brief overview of the problem ofprediction of species distributions in space, and discussthe features of a statistical model suitable for this prob-lem. As an initial step, we introduce a familiar gen-eralized linear model that relates distribution data toenvironmental attributes. This basic model, and vari-ants of it, are currently in wide use in ecological re-search (Guisan and Zimmermann 2000). Then we con-sider what can be done to improve this model. We buildfrom this familiar starting point toward more complexmodels. The final model we consider is a spatially ex-plicit hierarchical regression model implemented in aBayesian framework (Wikle 2003, Gelfand et al.2005a). Hierarchical models are statistical models inwhich data can enter at different stages, where thesestages describe conceptual but unobservable latent pro-cesses (see, e.g., Carlin and Louis [2000] for a generaldiscussion of hierarchical modeling and Bannerjee etal. [2003] in the context of spatial data). As we con-struct the models, we consider the advantages of dif-ferent model features, such as including spatial featuresin the mean vs. using spatial random effects, specifyingspatial relationships (point vs. areal), and modelingwith hierarchical levels. The goal is to give ecologistreaders a coherent framework for understanding thebasic regression approach, and simultaneously to showhow to implement newer statistical techniques in theirown modeling. To facilitate this self-teaching, we in-clude code for some of the models that readers can runwith the free, publicly available software packageWinBUGS (available online).5

All of the models presented here are implementedas Bayesian models. The simpler models are equally

5 ^http://www.mrc-bsu.cam.ac.uk/bugs/winbugs&

easy to implement through classical maximum likeli-hood methods. However, the Bayesian approach hasstrong advantages when building complex, hierarchicalmodels (Link and Sauer 2002, Clark 2003, Hooten etal. 2003), so we construct all the models in this frame-work. Complexity can then be added without switchingthe basic inferential paradigm. A Bayesian model usesBayes’ theorem to combine the information in the datawith additional, independently available information(the prior) to produce a full probability distribution(posterior distribution) for all parameters (Gelman etal. 1995, Carlin and Louis 2000, Congdon 2001). TheBayesian perspective allows us to ask directly howprobable are the hypotheses, given the data. Bayes The-orem can be expressed as

P(Parameters z Data)

P(Data z Parameters) 3 P(Parameters)5 (1)

P(Data)

where the notation P(x z y) means the probability of xconditional on y. This theorem, introduced posthu-mously by Thomas Bayes in 1763, enables drawinginferences directly about the parameters of interest (theleft-hand side of the equation). This ‘‘posterior prob-ability distribution,’’ P(Parameters z Data), provides afull picture of what is known about each parameterbased on the model and the data, together with anyprior information, P(Parameters). For a particular pa-rameter, this posterior distribution, unlike the mean andconfidence interval produced by classical analyses, en-ables explicit probability statements about the param-eter. Thus the region bounded by the 0.025 and 0.975quantiles of the posterior distribution has an intuitiveinterpretation: under the model, the unknown param-eter is 95% likely to fall within this range of values.

One feature of Bayesian models, and also a sourceof much debate within the statistics community, is theirability to incorporate already-known or ‘‘prior’’ infor-mation about parameters into the models. In some ap-plications, this can be a critical advantage, particularlywhen data are sparse, decisions must be made, andexpert opinion is available (Gelman et al. 1995). Buthere we do not put informative priors on the parame-ters, because, though there may be previous studiesrelating presence/absence to various environmentalfactors, it is not straightforward to transform this in-formation into prior specifications for regression co-efficients in the complex models we consider. More-over, we are primarily interested in learning what in-formation is contained in the data. Thus, all priors inthese models are vague or uninformative, and thus theposterior distributions for all parameters are driven bythe data.

This paper does not attempt to be a primer in Bayes-ian statistics; the books mentioned above cover thatground (see also Ellison [2004] for an introduction to


the Bayesian perspective for ecologists). We do notaddress the use of informative priors in Bayesian mod-els, because we use only vague priors here; discussionis available in the foregoing texts on Bayesian statistics(e.g., Gelman et al. 1995). We also do not considerformal statistical model comparison. The current stateof the art is that there are tools to effectively comparemodels that share a common hierarchical specification.However, for comparing models having a differentnumber of hierarchical levels or different stochasticmechanisms at these levels, as would be the case acrossthe models we are examining here, criteria for modelcomparison are not well developed, and there is littletheory to provide guidance. See the paper by Spiegel-halter et al. (2002) and the associated discussion sec-tion for more on this issue. Moreover, customary modelcomparison that reduces a model to a single numbermay be misleading. We argue that examination of pre-dicted species distributions across models relative toobserved species presence and absence patterns will bemore illuminating. Indeed, this is what we do, provid-ing maps of predicted distributions and using receiveroperating characteristics (ROC) curves and related cri-teria to assess relative model performance.

Finally, this paper does not address in detail the me-chanics of fitting these models using Markov chainMonte Carlo (MCMC) methods. For more on imple-menting MCMC and assessing its convergence, sourcesinclude Carlin and Louis (2000), Gelman et al. (1995),and Gilks et al. (1995).

Problems in spatial prediction in ecology

Most problems in spatial ecological prediction arisefrom three features of species distributions and the datathat are gathered to describe them. First, there are prob-lems that relate to sampling. In general, the spatial areacovered by the species distributions is vastly larger thatthe area sampled, and correspondingly, the spatial unitof prediction is usually larger than the sites sampledon the ground. An additional sampling problem relatesto the heterogeneity of sampling intensity: while largeparts of a domain may be unsampled, other parts maybe relatively heavily sampled. Finally, the environ-mental data for the region of interest are typically avail-able at a much coarser spatial resolution than the scaleat which species distribution data may be collected.Relating the species distribution data to the environ-mental data and to the region where prediction is soughtthus often presents problems of spatial misalignment.For example, some data may be available at point lo-cations (point data), while other data are in the formof a regular grid (e.g., climate data) or irregular poly-gons (e.g., edaphic or geological data). These differentkinds of data sources do not line up (e.g., Agarwal etal. 2002). There may also be sample bias in the avail-able data, so that sampling locations and/or intensityare unevenly distributed with respect to relevant char-

acteristics of the region sampled (Mugglin et al. 2000,Gelfand et al. 2002). These problems are often ignoredin spatial modeling, or are addressed indirectly by at-tempting to minimize bias through stratified samplingacross major environmental gradients (e.g., Ferrier etal. 2002).

Second, there is the problem of spatial dependenceor spatial autocorrelation. Data are spatially dependentor autocorrelated when the degree of correlation amongobservations depends on their relative locations. Inecological data, pairs of observations that are closertogether often tend to be more similar than pairs ofobservations that are farther apart, and this producespositive autocorrelation. Because ecological processessuch as reproduction and dispersal typically generatespatial autocorrelation in species occurrences, and be-cause some residual autocorrelation in environmentalfactors tends to remain even when many environmentalfactors included in a model, the ‘‘residuals’’ of a purelyenvironment-based predictive model will usually ex-hibit some degree of autocorrelation. Using models thatignore this dependence can lead to inaccurate param-eter estimates and inadequate quantification of uncer-tainty (Ver Hoef et al. 2001). Equally important, toignore this spatial dependence is to throw out mean-ingful information (Wikle 2003). Ecological predictionoften ignores this problem or deals with it in a lessthan satisfactory way (see Guisan and Zimmerman2000). One might, for example, include latitude andlongitude through a trend surface in the mean to im-prove prediction, but this approach may still miss spa-tial dependence which explicit modeling of spatial as-sociation can capture. Another alternative provided bythe generalized regression analysis and spatial predic-tion (GRASP) modeling package treats spatial auto-correlation at the data stage, by feeding into the modela new data layer that reflects neighborhood values inmodel predictions (software available online).6 Themodel is then iterated to convergence (Augustin et al.1996, Lehmann et al. 2002). This approach does dealwith autocorrelation, but fails to quantify explicitly thestrength of spatial pattern in the residuals and associ-ated uncertainty.

Third, spatial modeling presents problems in quan-tifying uncertainty. Because the spatial domain of pre-diction is typically very large relative to the domainin which the data were collected, environmental dataare not available on a scale as fine as that experiencedby individual organisms. Predictions frequently in-volve extrapolation to unobserved parts of the studyregion and to larger-scale areal units. Assessment ofuncertainty in such predictions is crucial when they areused to set conservation policy or to evaluate the im-pact of climate change on species (e.g., Thomas et al.2004). But again, ecological distribution models that

6 ^http://www.cscf.ch/grasp/&


focus exclusively on the mean may yield false confi-dence in the predictions made (Ver Hoef et al. 2001).

The tools presented here can be used to deal withthese three kinds of problems in a straightforward,transparent way. Within the Bayesian framework, fullinference about uncertainty, given what we have ob-served (the data) and what we know or assume aboutthe process (the model), comes ‘‘free’’ with the modelpredictions. Spatial autocorrelation can be incorporatedinto a regression model through random effects thatcapture spatial dependence in the data. Since the ran-dom effects are model parameters, they also emergewith a full posterior distribution that allows quantifi-cation of uncertainty. By adding hierarchical levels toa regression model, issues of sampling intensity, holesin the data, and human transformation of the landscapecan be dealt with explicitly (Gelfand et al. 2005b).Hierarchical models are statistical models in which datacan enter at various levels and, thus, model parametersor unknowns are themselves functions of other modelparameters and data. These hierarchical stages can de-scribe conceptual but unobservable latent processesthat are ecologically important, such as error in theobservation process, or potential suitability of a site ina transformed landscape for a particular species. In thisway, uncertainty attached to unknowns at differentmodel stages is propagated across model levels to moreaccurately reflect overall inferential uncertainty.

Crucial to the approach presented here is the notionof transparency. One reaction of ecological modelersto the problems inherent in species distribution mod-eling is to adopt more flexible methods like neural net-works, genetic algorithms, and discriminant analysis(Manel et al. 1999, Moisen and Frescino 2002,D’Heygere et al. 2003). These methods offer the ad-vantages of responding flexibly to interactions and non-linearities in data relationships, but often at the expenseof interpretability or mechanistic insights. We hope toshow here that many of the difficulties involved inecological distribution modeling can be dealt with ina very general way through hierarchical modeling with-out sacrificing the interpretability of simpler statisticalmodels.

METHODS

Data sources and preparation

We illustrate the modeling through prediction of dis-tributions of plant species in the Cape Floristic Regionof South Africa (CFR), a global hotspot of plant di-versity, endemism, and rarity (Cowling et al. 1997,Meyers et al. 2000). The data used here consist of ob-servations of species presence/absence and environ-mental characteristics at grid-cell level. More precisely,each species presence/absence observation is refer-enced to a particular point in space, although, in fact,sampling was conducted in a small areal unit. The sam-ple locations are displayed with a map of the CFR in

Fig. 1. Fig. 1 also presents a close-up view of a smallpart of the CFR to illustrate the data in more detail,and shows in particular that the sampling intensity isquite variable and that there are many cells in whichthe species have been observed to be both present andabsent at different sample locations in a single cell.While these sample locations in fact represent compactareal surveys, not points, they are smaller than a gridcell and vastly smaller than the entire CFR, so it isreasonable to take these units as points. The speciesdata were collected by the Protea Atlas Project of SouthAfrica’s National Botanical Institute. The grid-cell ref-erenced environmental data are data layers listed inTable 1 and displayed in Appendix B. See also Gelfandet al. (2005b). Here the grid cells are 19 3 19 rectangleswhich, at this latitude, translates to about 1.55 3 1.85km. The data layers cover various aspects of the en-vironment potentially important for plant species per-sistence, including temperature, precipitation, topog-raphy (elevation, roughness), and edaphic features (soilfertility, texture, pH).

For illustration, we selected two species in the familyProteaceae, a diverse group of flowering woody plantsthat includes the king protea (Protea cynaroides), thenational flower of South Africa. These species, Proteamundii and P. punctata, are relatively widespread with-in the region and are closely related, with abutting butnonoverlapping distributions. The species exhibit someecological differences, with P. punctata occupyinghigher elevation, inland sites and P. mundii nearer thecoast and at lower elevations. The abrupt transitionbetween the two species’ distributions, however, is notassociated with any obvious sudden environmentaltransition. These two species thus present a good chal-lenge for models, both to predict distribution bound-aries correctly and to identify environmental variablesthat distinguish the environmental responses of the twospecies.

A critical aspect of environment that is not routinelyconsidered in species distribution modeling is the ex-tent to which the landscape has been transformed froma natural state by human intervention or by introducedalien species. Since most areas of the world, includingthe Cape region, have been significantly affected byurbanization, agriculture, alien invasive species, andother impacts, for most purposes it would seem nec-essary to consider the effect of these changes on theoccurrence of species. The final model presented here,the Bayesian hierarchical spatial model, enables thisinformation to be taken into account and predictionsto be made for untransformed conditions as well ascurrent, transformed conditions. To make these pre-dictions, we incorporate a data layer that combines themajor classes of human impact and specifies the pro-portion of the landscape transformed for each grid cell(Rouget et al. 2003).


FIG. 1. The Cape Floristic Region with Protea Atlas Project sample points overlaid. The thumbnail shows the region’slocation on the African continent. The pull-out box displays, for a small (12 3 13 km) part of the region, the grid cells usedfor modeling and sample locations scattered across these grid cells, including both sample locations at which Protea mundiiwas observed (solid triangles) and the ‘‘null sites’’ at which it was not (open triangles).

TABLE 1. Posterior summary of significant/suggestive coefficients (b’s) for Protea mundii.

Variable AbbreviationModel

1Model

2Model

3Model

4

Roughness ROUGH 2 1 (1) 1Elevation ELEV 1 1 NS NS

Potential evapotranspiration POTEVT 1 NS NS NS

Interannual CV precipitation PPCTV (2) NS NS NS

Frost season length FROST 2 NS NS NS

Heat units HEATU 2 2 2 2January maximum temperature JANMAXT 1 1 (1) 1July minimum temperature JULMINT NS 1 NS NS

Seasonal concentration of precipitation PPTCON 2 2 (2) 2Summer soil moisture days SUMSMD 1 NS 1 (1)Winter soil moisture days WINSMD 2 NS 2 NS

Enhanced vegetation index EVI 1 1 1 1Low fertility FERT1 1 NS 1 NS

Moderately low fertility FERT2 NS NS (1) NS

Moderately high fertility FERT3 2 NS NS NS

Fine texture TEXT1 1 NS NS NS

Acidic soil PH1 2 NS 2 NS

Alkaline soil PH3 2 NS 2 NS

Notes: For Tables 1 and 2, 1 and 2 denote positive and negative coefficients with 95%credible intervals that do not overlap 0; (1) and (2) denote positive and negative coefficientswith 90% credible intervals that do not overlap 0; NS, not statistically significant.


A simple generalized linear model

Let us begin with the simplest GLM that directlyrelates presence/absence data to environmental explan-atory variables. Here, we have two options to deal withthe misalignment between the species presence/ab-sence data, which are referenced to point locations, andthe environmental data, which are referenced to gridcells. We can work at the scale of the responses, i.e.,the sample sites, and assign to each sample site thevalues of the environmental factors for the grid cell inwhich the site falls. Alternatively, we can work at thegrid cell level, assigning presence to the grid cell ifany sample site in that cell showed presence, or as-signing absence if it was not found at any sample site.The latter choice seems less attractive in that it failsto account for the sampling intensity attached to a gridcell. Absence at four sample sites in a cell, for example,should be viewed differently from absence based onlyon one sample site. Expressed in a different way, thelatter reduces the response probability for a grid cellto a single Bernoulli trial (i.e., a coin toss) while theformer results in a binomial variable reflecting the num-ber of trials on the grid cell. Since the models areequally routine to fit, we choose the former here.

Let Y(s) be the presence/absence (1/0) of the modeledspecies at sample location s in cell i. Summing up Y(s)over the number of sample sites in cell i (ni) yieldsgrid-cell level counts: Yi1 5 Ss∈grid i Y(s). Since all ofthese sites are assigned the same levels for the envi-ronmental factors, if we assume independence for thetrials, a binomial distribution results for Yi1:

Y ; Binomial(n , p ).i1 i i (2)

We assume that the probability that the species occursin cell i, pi, is related functionally to the environmentalvariables. A logistic (logit) function is convenient torelate pi to the linear predictor b (other link func-w9itions, e.g., the probit, could be considered but wouldnot be expected to affect subsequent predictions of spa-tial distribution), yielding

pilog 5 w9b. (3)i1 21 2 pi

In this logistic regression model, is a vector of ex-w9iplanatory environmental variables associated with celli and b is a vector of the associated coefficients. Notethat here the first entry in is a 1, which provides anw9ioverall intercept for the regression, so that b 5 b0w9i1 b1wi1 1 · · · 1 btwit. For an unsampled cell (ni 5 0),there will be no contribution to the likelihood. For asampled cell (ni $ 1), there will be a contribution tothe likelihood and the likelihood component can bewritten as

9w b yin (e )iP(Y 5 y) 5 . (4)i1 9w b n1 2 ii(1 1 e )y

Model 1 can be fitted easily by classical or Bayesian

approaches. Available software includes SAS (SAS In-stitute, Cary, North Carolina, USA), S-Plus (InsightfulCorporation, Seattle, Washington, USA), and Win-BUGS (see footnote 5). The first two are widely usedin classical framework while the later is the most user-friendly software available for Bayesian modeling.Note that, to fully specify a Bayesian model, a priordistribution has to be assigned for every random pa-rameter. The only parameters in this model are the b’s.We use normal priors centered at 0 with a fixed largevariance, b ; N(0, ), where is a hyperprior whose2 2s sb b

value is assigned by the modeler. These priors are al-most flat; they contain very little information about thelocation of the parameters, and are thus examples of‘‘vague priors’’ (see, e.g., Gelman et al. [1995] for priorspecification methods). A script for implementingModel 1 in WinBUGS is provided, along with data andinitial values, in Appendix A.

A more detailed discussion of the model results ap-pears in the next section, but as we work through themodels we present species distribution maps createdfrom the model output to allow for an immediate visualcomparison. Fig. 2 shows this model’s prediction ofthe distributions of Protea mundii and P. punctata.Because of the large number of grid cells in the mod-eled region (36 907), pull-out boxes are used in thisand the other model prediction figures to display de-tailed views of small, representative parts of the grid.Though the model correctly picks out the core part ofP. mundii’s distribution, there are also areas of apparentunderprediction, where the model is not predicting ahigh probability of presence even though the specieshas been observed repeatedly. This problem is partic-ularly noticeable in the western populations of this spe-cies, which are separated from the rest by hundreds ofkilometers and might thus be expected to present adifficult challenge for the model.

A simple spatially explicit model

Spatial structure or autocorrelation in ecological pat-tern and process is pervasive. In the context of speciesdistribution patterns, we would anticipate that the pres-ence/absence of a species at one location may be as-sociated with presence/absence at neighboring loca-tions. A variety of different mechanisms could generatesuch association (e.g., dispersal of propagules amongneighboring cells, similar ecological attributes amongneighboring cells, etc.). We select a model that canrespond flexibly to this spatial dependence, while re-taining the environmental response coefficient (i.e.,

b) structure. This can be achieved by adding spatialw9irandom effects to the model, which yields our Model2. Modeling at the grid-cell level, under Eq. 2, a spatialterm ri associated with grid cell i is added in Eq. 3

pilog 5 w9b 1 r . (5)i i1 21 2 pi

In this equation, each grid cell has an associated ran-


FIG. 2. Predicted distribution for (a) Protea mundii and(b) P. punctata from Model 1. For this and all other distri-bution maps (Figs. 4–6), each grid cell is assigned the meanof the posterior distribution for pi, the predicted probabilityof occurrence of the species in cell i. Because the full regionincludes 37 000 grid cells, for purposes of visualization, pull-out boxes are used to present close-ups of selected portionsof the predicted region. Sample locations at which the specieswere observed are overlaid as triangles.

FIG. 3. Diagram of the grid cell neighborhood used in theconditional autoregressive (CAR) models employed in Mod-els 2 and 4.

dom effect ri that adjusts the probability of presenceof the modeled species up or down, depending on thevalues of r in cell i’s spatial neighborhood. To capturethis process, we employ a Gaussian (intrinsic) condi-tional autoregressive (CAR) model (Besag 1974). To

specify this model, we assume the conditional distri-bution of the spatial random effect in cell i, given val-ues for the spatial random effect in all other cells j ±i, depends only on the spatial random effect of theneighboring cells of i, dii. Here, we specify that cell iis a neighbor of j if their boundaries intersect (see Fig.3). In this version, the spatial effect for any given celldepends only on the values of r for the cells in itsneighborhood, and the neighborhood encompasses onlythe eight immediately adjacent cells. The neighborhoodcould alternatively be defined to be larger, and differentweights could be assigned to cells at different distanc-es.

Formally, the Gaussian CAR model for the spatialrandom effect at cell i can be presented by a conditionaldistribution (conditioned on all the other cells)

a r O i j j 2srj∈di r z r ø N , j ± i (6)i j a a i1 i1

where ai1 denotes the total number of cells which areneighbors of i, and aij 5 1 if sites i and j share thesame boundary and 0 otherwise. The conditional var-iance is a hyper-parameter (a parameter that is a2sr

prior for another parameter) and requires a prior dis-tribution. As is a variance, we assign an inverse2sr

gamma prior, ; IG(2, br). This distribution has mean2sr

br with infinite variance. Again, for the bs, we assignnormal priors centered at 0 with a fixed large variance.This spatially explicit model can be directly fitted inWinBUGS using Bayesian simulation-based methods.WinBUGS code is provided in Appendix A.

Fig. 4 shows predicted distributions for Protea mun-dii and P. punctata using Model 2. It is immediatelyapparent that this model improves over the nonspatialmodel. For example, unlike Model 1, this model is ableto predict the presence of P. mundii in its isolated west-ern populations, and also shows a tighter fit in the maineastern populations. Two methods of quantifying modelperformance are presented in Results, and both confirm


FIG. 4. Predicted distribution for (a) Protea mundii and(b) P. punctata from Model 2. See the explanation in the Fig.2 legend.

that making the model spatially explicit dramaticallyimproves performance.

A point-level spatial model

Models 1 and 2 were built at the grid-cell scale, sospatial association was specified as a relationship be-tween grid cells in a neighborhood. Point-level modelsprovide an alternative way of building spatially explicitmodels for species distributions. Point-level modelsuse the (geo-coded) locations of the observation pointsthemselves, rather than referencing the points to dis-crete grid cells, to specify spatial relationships. For

example, the Protea Atlas Project data we are usinghere includes abundance information as well as pres-ence/absence; these abundance estimates are only forthe point locations, not a continuous grid. So if wewere modeling abundance, it might be preferable tomodel at the point level. Point-level models have beenwidely used to explain ecological patterns and pro-cesses, such as animal movement and metapopulationdynamics (e.g., Turchin 1998, Ettema et al. 2000, Stoy-an et al. 2000). To build a point-level model, spatialdependence can be modeled directly between the pointsbased on the distances among them, with the degree ofautocorrelation decreasing as some function of dis-tance. This modeling of dependence as a continuousfunction of distance contrasts with the neighborhoodapproach typically used in grid-level or lattice models,in which spatial dependence is modeled between dis-crete sets of neighboring cells (compare the individualpoints in Fig. 1 with the grid-cell neighborhood in Fig.3). Kriging is perhaps the most familiar use of point-level modeling: kriging is a way of predicting the val-ues of a response variable (like species abundance orsoil characteristics) at new, unmeasured locations(Bannerjee et al. 2003). Our goal, however, is not sim-ply to predict but also to investigate the relationshipsbetween species distributions, environmental variables,and spatial pattern, so our modeling also includes en-vironmental covariates as well as spatial random ef-fects.

We introduce a point-level spatial model as Model3. In this model, data are referenced to the samplelocations themselves rather than to grid cells in whichthey occur. Because there is only one observation ofpresence or absence of the species per sample site, thedata point Y(s) is the outcome of a single Bernoullitrial:

Y(s) ; Bernoulli[p(s)]. (7)

The probability that the species occurs in site s, p(s),can be related to the set of environmental variablesthrough the logistic regression model shown in Eq. 3.To do this, of course, requires that we have the envi-ronmental data for each sample site. For this illustra-tion, we can use the same grid-based environmentaldata, and let w(s) 5 wi when s is within grid i.

The point-level spatial model can now be speci-fied as

p(s)log 5 w9(s)b 1 r(s) (8)

1 2 p(s)

where r(s) is the spatial random effect associated withpoint s. Because there are an uncountable number oflocations s it is not appropriate to model the r(s) usinga neighbor-based specification such as a CAR. Rather,we describe the spatial dependence at the point levelthrough a spatial process model that directly modelspairwise association using a parametric covariance


FIG. 5. Predicted distribution for (a) Protea mundii and(b) P. punctata from Model 3. See the explanation in the Fig.2 legend.

function: cov(si, s j) 5 f (dij; u), where f is a decreasingfunction of dij, the distance between s i and s j, and u isan unknown parameter of this function. In such a mod-el, the r(s)’s are spatial random effects whose spatialstructure is specified by the covariance function. Forexample, a widely used choice for the covariance func-tion is an exponential function exp(2(wdij)k), where wis a ‘‘decay’’ parameter and k is a ‘‘power’’ parameterapplied to the distance. Together, these parameters con-trol the rate of decay of covariance with distance dij.For a small number of sample sites, such models canbe implemented in WinBUGS. But while overcomingthe disadvantages of the CAR model, this model in-troduces its own disadvantage, computational limita-tion. The computational demand of the MCMC sam-pling algorithm for this model scales as the third powerof the number of sample points, so it rapidly becomesunfeasible to run as the number of points increases.Currently, WinBUGS can handle about 100 samplepoints; individually tailored code can accommodateperhaps up to 1000 points, but with more than that,some form of approximation will be required.

In our case, there are 52 275 sample locations, whichis more than an order of magnitude greater, so we offeras an alternative a kernel smoothing model to capturespatial association at the point level (Higdon et al.1999). This approach reduces the cost of modeling thespatial association to a function of the number of datapoints, rather than a function of the number of pointscubed, as in the spatial process model. We select Klocations over the region, ti, i 5 1, 2, . . . , K, whichcorrespond to the centroids of the 36 907 grid cells usedin Models 1 and 2, and define

r(s) 5 g(\s 2 t \ )Z (9)O i ii

where the Zi are assumed to be normally distributed,N(o, s2), and \ · \ denotes straight-line interpoint (Eu-clidean) distance. Under this specification we havecov(si, sj) 5 s2 Si g( \ s 2 ti \ )g( \ sj 2 ti \ ). g( \ · \ ) shouldbe a function which decreases in \ · \ to 0, for example,an inverse distance function, g( \ s 2 ti \ ) 5 exp{2g \s2 ti \ }. This inverse distance function is the ‘‘kernel’’that defines the spatial structure in r(s). The degree oflocal spatial smoothing decreases with g; in otherwords, a larger gamma makes the spatial random ef-fects in neighboring cells more strongly correlated. Forthe illustration presented here, we chose g such thatthe weight function g( \ s 2 ti\ ) approaches zero when\ s 2 ti \ is three times the length of the side of a gridcell (here 19).

Fairly sophisticated code is required to fit this model;no packaged software is available. Though we showthe results of fitting this model below and in Results,our primary purpose in presenting it is to show thatthere are spatial modeling alternatives to the grid-cell(areal) versions we have focused on.

The distributions predicted by this model are quitedifferent from those in the previous models, with manymore cells assigned high probabilities of occurrence.Fig. 5 shows that the model predicts reasonably highprobability of occurrence for both species where theyhave been observed, but also that the model predictsthem to occur in many adjacent areas where they havenot been observed, i.e., the model overpredicts the spe-cies’ distributions. This behavior appears to result fromthe stronger smoothing applied by the Gaussian kernelsin this model, and results in a smoother, generally high-


er-intensity probability surface. A larger g would im-part less spatial smoothing.

A hierarchical spatially explicit model

In the Cape Floristic Region, as in many other partsof the world, including those with highest conservationpriorities, the landscape has already been substantiallyaltered by human disturbance, including agriculture,urbanization, forestry, and invasion by exotic plant spe-cies. The data on current species presence or absencerepresent species distributions within this transformedlandscape. Some of the sample points inevitably fallwithin areas that have been transformed to some de-gree. The single-level regression models do not allowus to deal explicitly with this, and so the interpretationof the model predictions is somewhat ambiguous.Should they be interpreted as predicting the naturaldistribution of the species in unchanged conditions, oras predicting distributions given human disturbance?With a hierarchical model, here presented as Model 4,we can resolve this problem by incorporating data ontransformation of the landscape into the model, so thatthe contribution of each sample point to the modelprediction reflects the degree to which the cell in whichit falls has been transformed. Then the model will beable to predict explicitly what the distribution wouldhave been in the absence of transformation (i.e., po-tential distribution) and what the distribution would bein the current transformed landscape (i.e., transformeddistribution). To do this, we will add a second hier-archical level to the model.

We have been treating the sample points as Bernoullitrials, and specifying the probability of observing aspecies in a grid cell as a function of environment anda spatial random effect. But the sampling data are ac-tually more complex. Whether a species is observed ina cell is a function of both the ecological attributes ofthe site, or site suitability (which influences whether aspecies is in fact there), and of the sampling process(whether, in the inventory process, the species was ob-served). These distinct but related processes can beinvestigated by adding a third hierarchical level to themodel.

To demonstrate the advantages of hierarchical mod-eling, we develop here a multilevel hierarchical modelthat considers not only irregular sampling intensity andspatial dependence, but also incorporates the influenceof land transformation and the data collection process.In particular, for each grid cell i, we introduce twounobservable (or latent) binary variables. Xi denotesthe event that a randomly selected location in grid celli is environmentally suitable (1) or unsuitable (0) forthe species, that is, that the species could exist thereor not. Vi denotes the same event, adjusted for trans-formation. For a particular grid cell, the probabilitythat a randomly selected point in the cell is suitable ispi 5 P(Xi 5 1). Thus, cells with high values of pi are

predicted to be highly environmentally suitable for thespecies, that is, there is a high probability that the spe-cies can grow there, or a high potential presence. Asingle pi can be viewed as the relative suitability of aparticular grid cell, scaled 0 to 1. Across the grid-celldomain, the suite of pis can be seen as representing thepotential distribution of the species in the absence oflandscape transformation, while P(Vi 5 1) representsthe transformed distribution, given extant human al-terations of the landscape.

In modeling the pi 5 P(Xi 5 1), or the relative suit-ability of the grid cells, we begin with environmentalvariables that are expected to affect the suitability ofgrid cell i for the modeled species. The variables usedin the models described here are listed in Table 1. Forpi, we use a logistic regression model conditional oncell-level environmental variables and on cell-levelspatial random effects just as in Eq. 6, where wi isagain a vector of grid cell level environmental char-acteristics, and b is a vector of the associated coeffi-cients, including an intercept. As in Model 2, the ri

denote spatially associated random effects which aremodeled using a CAR specification.

Next, at the second hierarchical level, we modelP(Vi z Xi). Let Ui denote the proportion of area in theith cell which is transformed, 0 # Ui # 1. Intuitively,we model P(Vi 5 0 z Xi 5 1) 5 Ui, which captures theidea that the probability of absence caused by landtransformation given the potential presence is propor-tional to the transformed percentage. Of course, P(Vi

5 1 z Xi 5 0) 5 0. Marginalizing over Xi (that is, sum-ming P(Vi 5 1 z Xi 5 0) and P(Vi 5 1 z Xi 5 1)) we get

P(V 5 1) 5 (1 2 U )p .i i i (10)

This equation has the interpretation that the probabilityof transformed presence is adjusted by (1 2 Ui)·100%of the probability of the potential presence. A theo-retical justification, viewing the probabilities as aver-ages of binary processes, is presented in Gelfand et al.(2005b).

At the third hierarchical level of the model, we ad-dress the relationship between the observed presence/absence data and the environmental suitability vari-ables. Assume that cell i has been visited ni times (inuntransformed areas) within the cell. Further, let Yij bethe presence/absence status of the species in the ith cellat the jth sample location within that cell. Given Vi 51, we view the Yij as independent, identically distrib-uted, Bernoulli trials with success probability qi. Inother words, qi represents, for an arbitrarily selectedlocation in cell i, the probability that the species isobserved given that it could grow there. There are var-ious reasons the species might not have been observedin a highly suitable grid cell: there may be observationerror, or low sampling intensity, or the species maysimply not have established a population in the cell.


FIG. 6. Predicted potential distribution for (a) Proteamundii and (b) P. punctata from Model 4. See the explanationin the Fig. 2 legend.

Of course, given Vi 5 0 (i.e., the landscape is 100%transformed), Yij 5 0 with probability 1.

Marginalizing over Vi and using Eq. 10, we get P(Yij

5 1) 5 qi(1 2 Ui)pi. Given Vi 5 1, Yi1 5 Yij ;niSj51

Binomial(ni, qi). For an unsampled cell (ni 5 0) therewill be no contribution to the likelihood. For a sampledcell (ni $ 1) there will be a contribution to the like-lihood and, in fact, again marginalizing over Vi, for y. 0,

n y n 2yi iP(Y 5 y) 5 ( )(q ) (1 2 q ) (1 2 U )pi1 y i i i i

and for y 5 0,

niP(Y 5 0) 5 (1 2 q ) (1 2 U )p 1 [1 2 (1 2 U )p ].i1 i i i i i

The two components of this latter expression have spe-cific interpretation. The first, for an arbitrary locationin cell i, provides the probability that the species is notpresent though the location is suitable while the secondprovides the probability that the location is not suitable(so it could not be present). For the transformed cell,the contribution to the likelihood is adjusted by thepercentage of transformation, Ui. If a cell is 100%transformed, then it has no contribution to the likeli-hood.

In modeling for qi, the probability that the specieswas observed at the jth site in cell i, we use a logisticregression model:

qilog 5 w b (11)i1 21 2 qi

where wi are cell level environmental characteristicsthat may affect qi. For the current model, we allow thesame set of environmental characteristics at this levelas were included in the first level of the model (wi).There is also an intercept term in the model.

For fitting such a complex model, simulation-based(MCMC) Bayesian model fitting is currently the onlyoption. The computation is demanding, but in return,we can obtain the full posterior distributions of all pa-rameters, which makes possible full inference for quan-tities of interest. A disadvantage is that this model can-not be implemented in current releases of WinBUGS,and therefore the model must be custom coded.

Fig. 6 presents predictions from this model of thepotential probability of occurrence of both P. mundiiand P. punctata, that is, their potential distributions inthe absence of transformation. Because these species,particularly P. punctata, tend to occur in areas of highelevation and poor soils that are not likely to be usedfor agriculture or housing, the probability surfaces fortransformed probability of occurrence are nearly iden-tical and are not shown. Essentially all points wherethe species were observed lie in areas of predicted mod-erate or high probability of occurrence, with less ap-parent overprediction than Model 3 (compare Figs. 5and 6).

RESULTS

We have seen already that adding complexity to basicgeneralized linear models improved the models’ char-acterization of the distributions of our two examplespecies. We now turn to a more thorough evaluationof the model output, including the estimates for theenvironmental coefficients, the spatial random effectvariables, and the uncertainty associated with modelparameters.

Table 1 summarizes the estimates from all four mod-els for the bs, or coefficients for the explanatory en-


TABLE 2. Posterior summary of significant/suggestive coefficients (b’s) for Protea punctata.

Variable AbbreviationModel

1Model

2Model

3Model

4

Roughness ROUGH 1 1 1 1Elevation ELEV 1 1 1 1Potential evapotranspiration POTEVT 1 1 1 1Interannual CV precipitation PPCTV 2 2 2 2Frost season length FROST 1 1 1 1Heat units HEATU 2 NS 2 NS

January maximum temperature JANMAXT 2 NS 2 NS

July minimum temperature JULMINT 1 NS NS (2)Mean annual precipitation MAP 1 NS (1) NS

Seasonal concentration of precipitation PPTCON 2 NS 2 NS

Winter soil moisture days WINSMD 2 NS NS NS

Enhanced vegetation index EVI 2 NS 2 1Moderately high fertility FERT3 1 1 1 1Fine texture TEXT1 (1) NS NS NS

Medium coarse texture TEXT3 2 NS 2 NS

Coarse texture TEXT4 2 NS 2 NS

Alkaline soil PH3 2 NS NS NS

Notes: For Tables 1 and 2, 1 and 2 denote positive and negative coefficients with 95%credible intervals that do not overlap 0; (1) and (2) denote positive and negative coefficientswith 90% credible intervals that do not overlap 0; NS, not statistically significant.

vironmental variables (w) for Protea mundii. Table 2provides the same information for P. punctata. It isimmediately apparent that the different models presentsomewhat different assessments of the relative contri-butions of environmental explanatory variables. Themost striking and consistent pattern is that the non-spatial model, Model 1, has many more significant co-efficients than the other three models. This is intuitive;with a large sample including more than 52 000 samplesites, in the absence of flexible random effects, moreenvironmental factors would be expected to emerge assignificant. We elaborate this point further in the Dis-cussion section. Generally, the edaphic coefficients(i.e., fertility, texture, pH) drop out more often thanclimate coefficients (variables 3–12 in Tables 1 and 2)when spatial random effects are added to the model.Though the parameter estimates may differ across mod-els, they are generally consistent in sign across themodels, so all the models convey broadly the samepicture about the species’ relationship with environ-mental factors.

From those parameters that are significant acrossmore than one model, a picture emerges of the species’ecological characteristics. Both are associated withhigher elevation, with P. punctata having significantlylarger coefficients; it is typically found at higher ele-vations. P. punctata is also favored by a longer frostseason and higher potential evapotranspiration, anddiscouraged by higher January (summer) maximumtemperature. P. mundii is conversely positively asso-ciated with higher January maximum temperature,though negatively with higher mean summer heat levels(heat units, the sum of the number degrees by whicheach summer month’s mean daily maximum tempera-ture exceeds 188C), and is also significantly associatedwith a more even year-round rainfall pattern (negative

coefficient for seasonal concentration of precipitation,and positive coefficient for summer soil moisture avail-ability, SMDSUM). The edaphic coefficients show thatP. punctata is positively associated with moderatelyfertile soils, while P. mundii is associated, though lessconsistently, with the least fertile soils.

Posterior distributions are available for the spatialrandom effects, and they can be summarized in a mapthat displays the mean value of the effects. Fig. 7 pre-sents the mean value of the spatial random effect forProtea mundii for each grid cell for each of the threespatial models (Models 2, 3, and 4). The spatial effectsfor all models show strong spatial patterning at multiplescales. The spatial effects for Models 2 and 4 are qual-itatively similar, including larger regions and local ar-eas where the species are encouraged (1) or discour-aged (2). The spatial effect surface for Model 3 appearsquite different, with values for individual cells that arenoticeably higher or lower than cells in their immediateneighborhood. Further, these spatial effects appearrather uniform away from the areas in which the specieswere observed, which may contribute to the higheroverall level of predicted probability of occurrencefrom this model (see Fig. 5).

The posterior distributions for probability of speciespresence/absence can also be summarized to give apicture of the level of uncertainty about parameter val-ues. Fig. 8 displays the variance in probability of oc-currence for the four models for P. mundii. The figuresshow the lowest variance for Model 1, with the highestoverall level for Model 3. The nonspatial model, Model1, produces parameter estimates with relatively lowvariance. Further, since the model cannot respond tospatial relationships, the areas of slightly higher vari-ance are not associated spatially with patterns in thesample data. The variance for all spatial models, by


FIG. 7. Comparison of spatial random effects from thethree spatially explicit models for Protea mundii.

contrast, is much higher, and has strong spatial pat-terning. In particular, the variance tends to be high inareas where the species has not been observed, but thatare close to grid cells where it has been found. In thesecells, the model predicts a relatively high probabilityof occurrence, but also a higher level of uncertainty.

Table 3 gives the ‘‘prevalence,’’ or mean probabilityof occurrence across the region, i.e., Si pi, for the twospecies for all models. Prevalence can be seen as ameasure of overall commonness of occurrence, andprovides one simple way of comparing across models.Adding spatial random effects to the models does notin itself greatly change the predicted prevalence foreither species. Note, however, that the confidence in-terval is slightly wider in both cases for the spatialmodel; as in the case of the environmental coefficients,Model 1 appears to be underestimating the degree ofuncertainty about the pi. The point-level model, by con-trast, produces far higher prevalences for each speciesthan any of the other models. This result is consistentwith the overprediction that is clearly visible in thedistribution predictions from this model; see Fig. 5.The hierarchical model produces an intermediate levelof prevalence for both species; this is not simply due

to the effects of transformation, since even the trans-formed prevalence estimates from this model are sig-nificantly higher than those of Models 1 and 2. Noticethat whereas the distribution of Protea mundii is sub-stantially affected by transformation, P. punctata ispredicted to be almost completely unaffected (Table 3).

Receiver operating characteristics (ROC) analysisprovides a quantitative complement to visual assess-ments of the fit of model predictions. ROC plots sum-marize graphically the performance of a model as atradeoff between sensitivity, in this context the prob-ability of correctly predicting a true presence, and spec-ificity, the probability of correctly predicting a trueabsence (see Fielding and Bell 1997). A species dis-tribution model that fits the data well will predict highprobability of presence where the species was ob-served, while minimizing the probability of presencewhere the species was observed to be absent. ROC plotsdisplay sensitivity on the y-axis and (1 2 selectivity)on the x-axis. The area under the curve (AUC) thenprovides an integrated measure of the performance ofthe model, with high values reflecting better perfor-mance (i.e., sensitivity can be increased without losingmuch selectivity, and vice versa). For species distri-bution predictions, ROC curves can be generated usingonly sampled cells, and assigning each cell presenceor absence according to what was observed, or can alsobe generated under the assumption that the species isabsent in all cells except those in which it was ob-served, including unsampled cells. We present resultsusing both approaches in Table 4. Clearly, by the AUCcriterion, the spatially explicit GLM (Model 2) out-performs the nonspatial GLM (Model 1), while the hi-erarchical model (Model 4) performs best. Only whenthe AUC calculated for P. mundii using all cells ranksModel 4 as slightly worse than the other spatial models.This result reflects a disagreement between the model’sprediction and the assumption that species are absentfrom unsampled cells—given that we do not know thetruth about these cells, this result is difficult to inter-pret. Model 3’s tendency to overpredict is reflected ina curve that rises slowly as selectivity decreases, givingthis model an AUC between those of Models 1 and 2.

A recently proposed measure, ‘‘minimum predictedarea’’ (MPA; Engler et al. 2004) is closely related toROC curves: it counts the number of cells in the pre-dicted distribution that are above some threshold valuecorresponding to a specified sensitivity level. Becauseit corresponds to a distribution prediction that can bevisualized on a map, it has a more intuitive spatialinterpretation than the AUC. To find the MPA of amodel prediction, we select a threshold value pt thatgives the desired sensitivity level, say 0.95 (that is,95% of cells in which the species was observed havepredicted probability of occurrence greater than pt).Then we count how many sampled cells in the entirepredicted region have predicted probability of presence


FIG. 8. Variance in the predicted probability of occurrence of P. mundii for Models 1–4. Each grid cell is assigned thevariance of the posterior distribution of pi, the predicted probability of occurrence of the species in cell i.

TABLE 3. Predicted prevalence.

Model

Protea mundii quantile

0.025 0.50 0.975

Protea punctata quantile

0.025 0.50 0.975

Model 1 0.011 0.012 0.013 0.018 0.019 0.020Model 2 0.011 0.012 0.014 0.017 0.018 0.021Model 3 0.051 0.057 0.065 0.057 0.064 0.071Model 4 (potential) 0.040 0.046 0.051 0.042 0.047 0.052Model 4 (adjusted) 0.031 0.035 0.039 0.041 0.046 0.051

Notes: Potential prevalence is the mean probability of occurrence in an untransformed land-scape; adjusted prevalence is the mean probability of occurrence given human transformation.


TABLE 4. Model performance.

Model

Protea punctata

AUC(sam-pled)

AUC(all) MPA

Protea mundii

AUC(sam-pled)

AUC(all) MPA

Model 1 0.964 0.969 4487 0.940 0.948 8367Model 2 0.995 0.992 2128 0.996 0.993 2013Model 3 0.978 0.974 6127 0.992 0.992 2950Model 4 0.996 0.995 1180 0.998 0.989 1180

Notes: Table 4 displays the area under the receiver oper-ating characteristics (ROC) curve (AUC) and minimum pre-dicted area (MPA) for each model. MPA was calculated fora sensitivity value of 0.95. Larger AUC and smaller MPAreflect better model performance.

greater than pt. With this criterion, smaller is better.Note that because it requires setting an arbitrary sen-sitivity level, MPA is the equivalent of only a singlepoint on the ROC curve. In any case, at a sensitivitylevel of 0.95, the spatial models outperform the non-spatial model, with one exception, and Model 4 alwaysdoes best by a large margin (see Table 4).

DISCUSSION

We have seen that the basic nonspatial logistic re-gression model can be improved by adding features tothe model that reflect major known attributes of thedata, including variable sampling intensity, spatial au-tocorrelation, and land use impacts. There is a strongcase for making these models spatially explicit. Indeed,Gaston (2003) and others have called for the incor-poration of spatially explicit methods in determiningthe structure and dynamics of species geographic rang-es, but progress has been limited. Further, one of theproblems in comparative biogeography is that samplingand data gathering are conducted with a multitude ofdifferent methodologies (Gaston 2003, Graham et al.2004). The consequence is uncertainty in comparingand interpreting spatial patterns in species distribu-tions. Are the patterns real or are they simply artifactsof differences in sampling effort? The full, hierarchicalmodel developed here allows one to incorporate vari-ation in sampling effort directly in the model.

Our model results confirm that the spatial pattern ofpresence and absence of a species includes more in-formation than can be explained through just the meaneffect of a suite of environmental variables. There aretwo main explanations for this. One is that biologicalprocesses tend to generate spatial pattern. In the caseof species distributions, the probability that a site con-tains a species depends on not only on its climatic andedaphic characteristics, but also on its neighborhood.Such spatial dependence can arise from biological pro-cesses at a number of levels. Processes in the life his-tory of individual organisms, including reproduction,territoriality, and dispersal, can generate clustering orevenness in species distributions. Interactions of spe-

cies with each other and with resources (e.g., effectsof grazers on vegetation, the role of nurse plants inrecruitment) can likewise cause and perpetuate spatialassociation. The particular occupancy history of a sitecan also exert a long-term spatial influence on its neigh-borhood. These spatial patterns are not mere epiphe-nomena, but rather can strongly influence individualspecies distributions, as well as interspecific interac-tions and thus community composition and potentiallyecosystem processes. The second explanation for au-tocorrelation is the influence of unobserved environ-mental variables, and of nonlinearities in interactionsamong sets of (observed and unobserved) factors, allof which may have some degree of spatial dependenceand interdependence (Ver Hoef et al. 2001). Moreover,it is inevitable that the identity of critical explanatoryvariables may change from one part of a species’ geo-graphical range to another (Gaston 2003). Since modelscannot include all important variables, and may includesome unimportant ones, there will usually be some de-gree of autocorrelation in model residuals. Models(such as Models 2, 3, and 4) that can fit environmentalresponses despite spatial dependence in the data, andthat identify and quantify this spatial dependence, arethus essential tools for investigating and predicting spe-cies distributions.

Critically, without spatial structure in the model, thelevel of uncertainty about model parameters can bedramatically underestimated and poorly characterized.One effect of this is that a model will identify moreexplanatory variables as significantly related to speciespresence/absence (Ver Hoef et al. 2001). Since a non-spatial model like Model 1 will tend to underestimatethe uncertainty about the environmental response ofthe species (i.e., the bs), the credible intervals for theseparameters are smaller and are more likely to excludezero. The nonspatial model can thus be seen to be over-stating the contribution of some environmental param-eters to the species distributions. It is perhaps not sur-prising that many of the coefficients for edaphic var-iables drop out in the spatial models, since these var-iables reflect underlying features of the terrain withstrong local spatial association, pattern that will largelybe taken into account by the spatial random effects.From this perspective the spatial models are preferable,since their spatial random effects take into account bothspatial heterogeneity and spatial autocorrelation in themodel residuals and thereby reduce the confounding ofspatial autocorrelation with species niche relations.

Comparing the uncertainty surfaces for the four mod-els graphically illustrates how nonspatial models maypoorly characterize uncertainty. The variance in pre-dicted probability of occurrence from Model 1 is dra-matically lower than that for the three spatial models,and in particular shows no clear spatial pattern. Intu-itively, it seems obvious that knowing whether a spe-cies has been observed nearby should affect one’s ex-


pectation of encountering a species, and also one’s levelof uncertainty about this. But a nonspatial model, sinceit has no way of referencing grid cells or points inspace, cannot respond to neighborhood information.Accordingly, whereas the spatial models produce var-iance surfaces that respond to the intensity and outcomeof local sampling, the nonspatial model produces a fair-ly flat, low-intensity surface.

The spatial random effects themselves provide acharacterization of spatial pattern in the data (see Fig.7). As discussed above in Methods, the adjustment inthe spatial random effects (ri’s) is based solely on thevalues of the spatial effect in neighboring cells (orsites). Where the value of the spatial effect is positive,those cells (or sites) are more suitable for the speciesthan would be indicated by climate and edaphic fea-tures alone. The converse applies when the effect isnegative. The spatial effects that were implemented asCARs (Models 1 and 3) show spatial structure both atvery fine scales (sharp peaks and valleys in the surface),and at larger, subregional scales (e.g., an east–westtrend for Protea mundii). The spatial effects surfacefor the kernel-smoothing model (Model 3) appearsquite different, though it also indicates some areas ofstrong spatial association.

Model comparison for the four types of models wehave presented is a difficult issue. In general, with bi-nary data, there is no widely accepted criterion forchoosing between hierarchical and nonhierarchicalmodels (see Spiegelhalter et al. 2002, including theassociated discussion). It is not entirely clear how toascribe appropriate penalties for model complexity insuch cases. Even more important in our case is the fact,though the Y’s are always Bernoulli trials with regardto presence/absence, the p’s we are defining are dif-ferent across the models. That is, for Models 1 and 2,we ignore location for a trial in a given grid cell, forModel 3 we envision a p at every location in the regionwith pairwise dependence as a function of distance, forModel 4 the p’s are interpreted through environmentalsuitability as a surrogate for presence/absence. So,comparison among these models is a bit like comparing‘‘apples and oranges.’’ Moreover, it is not necessarilysensible to reduce a model to a single number, as need-ed for model selection. It is arguably preferable to com-pare ‘‘explanation’’ across models with regard to en-vironmental factors and ‘‘prediction’’ across the mod-els with regard to species distribution patterns.

By the two quantitative performance measures pre-sented here, the area under ROC curves and minimumpredicted area, the spatial models clearly outperformthe nonspatial Model 1, with the simple spatially ex-plicit GLM, Model 2, generally falling somewhat shortof the hierarchical Model 4 (see Table 4). The point-level Model 3 overpredicts the extent of the speciesdistribution at high sensitivity levels, while Model 4overall performs best at high sensitivity levels, as re-

flected in the MPA scores (see Table 4). In addition toimproving the quantification of uncertainty, then, andidentifying important environmental factors, making amodel spatially explicit can increase its predictive pow-er. The more complex hierarchical model has advan-tages in interpretability and also provides better pre-diction, giving the least overprediction at high sensi-tivity levels. If it is important to capture accurately alloccurrences of a species, including peripheral ones thatare difficult to capture with simpler models (e.g., Pro-tea mundii in the extreme west), a model like Model4 may be worth the extra work. These performanceevaluations still involve an ‘‘apples-to-oranges’’ com-parison in that the interpretation of pi differs across themodels, but from a purely functional perspective ofcomparing predictive ability, they provide a reasonableapproach.

Using these models to predict the distributions ofexample species from the CFR has demonstrated thatthe spatial regression models can predict species dis-tributions relatively well, and clearly better than a non-spatial model. These models also identify environmen-tal variables that are strongly associated with their pres-ence or absence. Here, for example, the two examplespecies Protea punctata and P. mundii are shown tobe distinguished by temperature, rainfall seasonality,and topographical variables, and less clearly by soilfertility. Of course, correlation is not causation, butsuch results provide hypotheses for testing; for ex-ample, the species can be transplanted across gradientsof temperature and seasonal moisture availability to testwhether the responses are consistent with model pre-diction, something that is now being done for these anda set of other, closely related species (A. M. Latimer,unpublished data).

For practical applications as well as for ecologicalinference, it is important to be able to quantify uncer-tainty. Obtaining the full probability distributions forparameter estimates is a major advantage of the Bayes-ian framework. Some potential applications for thismodel output include screening sites for reintroductionof a species or for reserve expansion. In these cases,the model could be used to select areas in which thereis a high degree of certainty that the species can occur;this can be done by identifying cells that have credibleintervals for predicted probability of occurrence (pi)that lie above some threshold value, say 0.5. Using theposterior distributions enables the level of uncertaintyto be specified, even when the distribution of the pa-rameter estimate is skewed or multimodal. The mostcomplex model presented here, Model 4, predicts spe-cies distributions in the absence of transformation aswell as distributions under current conditions. The po-tential distribution prediction has obvious conservationplanning applications, and by comparing the potentialand transformed predictions, it is possible to present aprecise estimate, again with uncertainty, of the extent


to which the species has been affected by transfor-mation. Where transformation can be subdivided intodifferent categories (e.g., agriculture, urbanization,alien invasive species), it is also possible to comparethe effects of these various forms of transformation (seeLatimer et al. 2004).

Characterizing species distributions, their limits, andtheir driving causes remains elusive, despite their closelinks to fundamental biogeographical questions relat-ing to rarity, species richness and turnover (Gaston2003). Going beyond ad hoc range maps to capturesuch features as holes in species distributions and un-certainty at edges requires probabilistic models that canspecify a species range as a probability surface. Thereality of species distributions is that species are neveractually continuously distributed; the edge of a geo-graphic range does not exist; rather ranges are contin-uously in flux in space and time. The spatial modelswe have presented can encapsulate aspects of this pro-cess via probability surfaces, along with a full speci-fication of associated uncertainty. By allowing rigorousprobabilistic inference about species distributions,these models should enable progress on problems re-lating to species distributions, including distinguishingdifferent components of species distributions, such asarea of occupancy, extent of occupancy, and preva-lence, and testing relationships among prevalence, lo-cal abundance, and range size (see Gaston 2003). Wehave shown here that building such models in a Bayes-ian framework by adding spatial random effects to aGLM is relatively straightforward. With increasingcomputer power, such models can be fitted even to verylarge and rich data sets. Such models hold promise foruntangling key questions about the causes and featuresof distributions of species as well their joint distribu-tion, biodiversity.

ACKNOWLEDGMENTS

This research was supported in part by NSF Grant DEB-008901 to J. A. Silander and A. E. Gelfand, and by NSFGrant DEB-0091961, supporting the Statistics in EcologyWorkshop. The research was also conducted as part of theBayesian Macro-Ecology Working Group and supported bythe National Center for Ecological Analysis and Synthesis, aCenter funded by NSF (Grant DEB-0072909), the Universityof California, and the Santa Barbara campus. A. M. Latimer’swork was also funded by an NSF Graduate Research Fellow-ship. We thank Dr. Anthony G. Rebelo, director of the ProteaAtlas Project of the South African National Botanical Institute(NBI), for providing the species distribution data and sharinghis unparalleled knowledge of protea biogeography. We alsothank Dr. Richard M. Cowling of the University of Port Eliz-abeth, Dr. Henri Laurie of the University of Cape Town, Dr.Alexandra Schmidt of the University of Rio de Janeiro, andWalter Smit of the National Botanical Institute, as well as Dr.Rebelo, for their invaluable participation the Bayesian Macro-Ecology Working Group and collaboration on the Protea Bio-geography Project. Dr. Guy F. Midgley, head of NBI’s climatechange program, provided some of the climate data layers.Finally, we acknowledge the two anonymous reviewerswhose comments have contributed greatly to the readabilityand, we hope, usefulness of this article.

LITERATURE CITED

Agarwal, D. K., A. E. Gelfand, and J. A. Silander, Jr. 2002.Investigating tropical deforestation using two-stage spa-tially misaligned regression models. Journal of AgriculturalBiological and Environmental Statistics 7:420–439.

Augustin, N. H., M. A. Mugglestone, and S. T. Buckland.1996. An autologistic model for the spatial distribution ofwildlife. Journal of Applied Ecology 33:339–347.

Austin, M. P., and J. A. Meyers. 1996. Current approachesto modelling the environmental niche of eucalypts: impli-cations for management of forest biodiversity. Forest Ecol-ogy and Management 85:95–106.

Austin, M. P., A. O. Nicholls, and C. R. Margules. 1990.Measurement of the realized qualitative niche: environ-mental niches of five Eucalyptus species. Ecological Mono-graphs 60:161–177.

Bannerjee, S., B. P. Carlin, and A. Gelfand. 2003. Hierar-chical modeling and analysis for spatial data. Chapman andHall, Boca Raton, Florida, USA.

Besag, J. 1974. Spatial interaction and the statistical analysisof lattice systems (with discussion). Journal of the RoyalStatistical Society Series B 36:192–236.

Brown, J. H., D. W. Mehlman, and G. C. Stevens. 1995.Spatial variation in abundance. Ecology 76:2028–2043.

Carlin, B. P., and T. A. Louis. 2000. Bayes and empiricalbayes methods for data analysis. Second edition. Chapmanand Hall, New York, New York, USA.

Clark, J. S. 2003. Uncertainty and variability in demographyand population growth: a hierarchical approach. Ecology84:1370–1381.

Congdon, P. 2001. Bayesian statistical modelling. John Wileyand Sons, Chichester, UK.

Cowling, R. M., D. M. Richardson, R. J. Schultze, M. T.Hoffman, J. J. Midgley, and C. Hilton-Taylor. 1997. Spe-cies diversity at the regional scale. Pages 447–473 in R.M. Cowling, D. M. Richardson, and S. M. Pierce, editors.Vegetation of southern Africa. Cambridge University Press,Cambridge, UK.

D’heygere, T., P. L. M. Goethals, and N. De Pauw. 2003. Useof genetic algorithms to select input variables in decisiontree models for the prediction of benthic macroinverte-brates. Ecological Modelling 160:189–300.

Ellison, A. M. 2004. Bayesian inference in ecology. EcologyLetters 7:509–520.

Engler, R., A. Guisan, and L. Rechsteiner. 2004. An improvedapproach for predicting the distribution of rare and endan-gered species from occurrence and pseudo-absence data.Journal of Applied Ecology 41:263–274.

Ettema, C. H., S. L. Rathbun, and D. C. Coleman. 2000. Onspatiotemporal patchiness and the coexistence of five spe-cies of Chronogaster (Nematoda: Chronogasteridae) in ariparian wetland. Oecologia 125:444–452.

Ferrier, S., G. Watson, J. Pearce, and M. Drielsma. 2002.Extended statistical approaches to modelling spatial patternin biodiversity in northeast New South Wales. I. Species-level modeling. Biodiversity and Conservation 11:2275–2307.

Fielding, A. H., and J. F. Bell. 1997. A review of methodsfor the assessment of prediction errors in conservation pres-ence/absence models. Environmental Conservation 24:38–49.

Gaston, K. J. 2003. The structure and dynamics of geographicranges. Oxford University Press, New York, New York,USA.

Gelfand, A., D. Agarwal, and J. A. Silander, Jr. 2002. In-vestigating tropical deforestation using two stage spatiallymisaligned regression models. Journal of Agricultural, Bi-ological and Environmental Statistics 7:420–439.


Gelfand, A. E., A. M. Schmidt, S. Wu, J. A. Silander, A.Latimer, and A. G. Rebelo. 2005a. Explaining species di-versity through species level hierarchical modeling. Ap-plied Statistics 65:1–20.

Gelfand, A. E., J. A. Silander, Jr., S. Wu, A. Latimer, P. O.Lewis, A. G. Rebelo, and M. Holder. 2005b. Explainingspecies distribution patterns through hierarchical modeling.Bayesian Analysis 1:41–92.

Gelman, A., J. B. Carlin, and D. B. Rubin. 1995. Bayesiandata analysis. CRC Press, Boca Raton, Florida, USA.

Gilks, W. R., S. Richardson, and D. J. Spiegelhalter. 1995.Markov chain Monte Carlo in practice. CRC Press, BocaRaton, Florida, USA.

Graham, C. H., S. Ferrier, F. Huettman, C. Moritz, and A. T.Peterson. 2004. New developments in museum-based in-formatics and applications in biodiversity analysis. Trendsin Ecology and Evolution 19:497–503.

Guisan, A., T. C. J. Edwards, and T. Hastie. 2002. Generalizedlinear and generalized additive models in studies of speciesdistributions: setting the scene. Ecological Modelling 157:89–100.

Guisan, A., and N. E. Zimmermann. 2000. Predictive habitatdistribution models in ecology. Ecological Modelling 135:147–186.

Higdon, D., J. Swall, and J. Kern. 1999. Nonstationary spatialmodeling. Pages 761–768 in J. M. Bernardo, J. O. Berger,A. P. Dawid, and A. F. M. Smith, editors. Bayesian statistics6. Oxford University Press, Oxford, UK.

Hooten, M. B., D. R. Larsen, and C. K. Wikle. 2003. Pre-dicting the spatial distribution of ground flora on large do-mains using a hierarchical Bayesian model. LandscapeEcology 5:487–502.

Latimer, A. M., J. A. Silander, Jr., A. E. Gelfland, A. G.Rebelo, and D. M. Richardson. 2004. Quantifying threatsto biodiversity from invasive alien plants and other factors:a case study from the Cape Floristic region. South AfricanJournal of Science 100:81–86.

Leathwick, J. R. 2002. Intra-generic competition amongNothofagus in New Zealand’s primary indigenous forests.Biodiversity and Conservation 11:2177–2187.

Lehmann, A., J. M. Overton, and J. R. Leathwick. 2002.GRASP: generalized regression analysis and spatial pre-diction. Ecological Modelling 157:189–207.

Link, W. A., and J. R. Sauer. 2002. A hierarchical analysisof population change with application to Cerulean War-blers. Ecology 83:2832–2840.

MacArthur, R. H., H. F. Recher, and M. L. Cody. 1966. Onthe relation between habitat selection and species diversity.American Naturalist 100:319–332.

Manel, S., J.-M. Dias, and S. J. Ormerod. 1999. Comparingdiscriminant analysis, neural networks and logistic regres-sion for predicting species distributions: a case study witha Himalayan river bird. Ecological Modelling 120:337–347.

Midgley, G. F., L. Hannah, D. Millar, M. C. Rutherford, andL. W. Powrie. 2002. Assessing the vulnerability of speciesrichness to anthropogenic climate change in a biodiversityhotspot. Global Ecology and Biogeography 11:445–451.

Moisen, G. G., and T. S. Frescino. 2002. Comparing fivemodelling techniques for predicting forest characteristics.Ecological Modelling 157:209–225.

Mugglin, A. S., B. P. Carlin, and A. Gelfand. 2000. Fullymodel based approaches for spatially misaligned data. Jour-nal of the American Statistical Association 95:877–887.

Peterson, A. T. 2003. Predicting the geography of species’invasions via ecological niche modeling. Quarterly Reviewof Biology 78:419–433.

Raxworthy, C. J., E. Martinez-Meyer, N. Horning, R. A.Nussbaum, G. E. Schneider, M. A. Ortega-Huerta, and A.T. Peterson. 2003. Predicting distributions of known andunknown reptile species in Madagascar. Nature 426:837–841.

Rouget, M., D. M. Richardson, R. M. Cowling, J. W. Lloyd,and A. T. Lombard. 2003. Current patterns of habitat trans-formation and future threats to biodiversity in terrestrialecosystems of the Cape Floristic Region, South Africa.Biological Conservation 112:63–85.

Spiegelhalter, D. J., N. Best, B. P. Carlin, and A. van derLinde. 2002. Bayesian measures of model complexity andfit (with discussion). Journal of the Royal Statistical SocietySeries B 64:583–639.

Stoyan, H., H. De-Polli, S. Bohm, G. P. Robertson, and E.A. Paul. 2000. Spatial heterogeneity of soil respiration andrelated properties at the plant scale. Plant and Soil 222:203–214.

Thomas, C. D., et al. 2004. Extinction risk from climatechange. Nature 427:145–148.

Turchin, P. 1998. Quantitative analysis of movement: mea-suring and modeling population redistribution in animalsand plants. Sinauer, Sunderland, Massachusetts, USA.

Ver Hoef, J. M., N. Cressie, R. N. Fisher, and T. J. Case.2001. Uncertainty and spatial linear models for ecologicaldata. Pages 214–237 in C. T. Hunsaker, M. F. Goodchild,M. A. Friedl, and T. J. Case, editors. Spatial uncertainty inecology. Springer-Verlag, New York, New York, USA.

Wikle, C. K. 2003. Hierarchical Bayesian models for pre-dicting the spread of ecological processes. Ecology 84:1382–1394.

APPENDIX

A table describing environmental data layers (Ecological Archives A016-003-A1).

SUPPLEMENT

Model code for WinBUGS (Ecological Archives A016-003-S1).

building statistical models to analyze species distributions

Documents