

[Figure residue: bar chart of county average %ba/acre for the 50th, 60th, 70th, and 80th percentile maps and the original data; variograms against separation distance (km); histograms of relative frequency against % basal area/acre of hemlock (cell a); annotated variogram, γ(h) against distance (m), marking the nugget effect, the sill, the range of autocorrelation, and the components and ranges of the 1st and 2nd structures.]

Stochastic simulation for mapping ground inventory variables: Creating and using the FIA species distribution maps

Abstract

SGCS output

Sequential Gaussian conditional simulation (SGS): Instead of producing a single best estimate, simulation calculates many different, equally probable, alternative realizations. From this set of realizations, an entire distribution function can be built for each cell by “stacking” the individual realizations, representing the range of possible values. From this frequency distribution for each cell we choose a summary statistic, such as the median, as our ‘estimate’ for that cell, and another summary statistic, such as the interquartile range, as the measure of uncertainty of that estimate.
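The stacking step can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `summarize_cells`, the three-cell grid, and the random numbers standing in for SGS realizations are all hypothetical, and only the standard library is used.

```python
import random
import statistics

def summarize_cells(realizations):
    """Stack realizations cell by cell and return (medians, iqrs).

    `realizations` is a list of equally sized lists, one value per grid cell.
    """
    n_cells = len(realizations[0])
    medians, iqrs = [], []
    for c in range(n_cells):
        stack = [r[c] for r in realizations]      # all simulated values for cell c
        q1, q2, q3 = statistics.quantiles(stack, n=4, method="inclusive")
        medians.append(q2)                        # the cell's 'estimate'
        iqrs.append(q3 - q1)                      # the cell's uncertainty
    return medians, iqrs

# Toy stand-in for SGS output (assumption): 100 realizations of a 3-cell grid,
# drawn independently around made-up cell means and clipped to [0, 1].
random.seed(0)
cell_means = [0.05, 0.20, 0.50]
realizations = [
    [min(1.0, max(0.0, random.gauss(m, 0.05))) for m in cell_means]
    for _ in range(100)
]
medians, iqrs = summarize_cells(realizations)
```

Any other percentile can be taken from the same per-cell stack in place of the median.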

Choosing a specific dataset (from the data provided)

Question/Description: I am doing research on hemlock woolly adelgid and would like to establish some study sites in the field. I know the adelgid is more frequently found in stands that have at least 20% ba/acre of hemlock (hypothetical). I have a limited amount of money, so sending crews out to a site that does not have enough hemlock is worse in this case than missing a site that might have had enough hemlock. I can incorporate these specific objectives into my decision as to which percentile to choose as my map.

Calculation: I have decided that visiting a site in error is 1.5 times as bad as missing a site that contained sufficient hemlock, i.e. my cost associated with overestimating is greater than my cost associated with underestimating by a ratio of 1.5:1.

Choosing a ‘general’ dataset (the ones presented on our web page)

Description: A ‘general’ dataset of species distribution is needed, for example to accompany the FIA reports. Since a decision criterion must be used, the percentile chosen here is the one that most closely matches the FIA means at the county level. This assumes that the FIA data are the most ‘correct’ we have available at this level.

Calculation: For each county or county-equivalent, the mean of %ba/acre values is calculated for all plots and compared with the mean of %ba/acre values for all estimates within that area. The percentile that yields estimates whose county-level mean values are closest to those of the FIA plot data is chosen.
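A minimal Python sketch of this county-level comparison follows. The poster does not state how "closest" is measured, so the mean absolute difference between county means is used here as an assumption; the function names and the toy stacks, county labels, and plot means are likewise hypothetical.

```python
import statistics

def percentile_of(values, p):
    """p-th percentile (0 <= p <= 100) by linear interpolation."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def choose_percentile(cell_stacks, cell_county, plot_means, candidates):
    """Return (best percentile, {percentile: dissimilarity}).

    Dissimilarity is the mean absolute difference between the county mean of
    the percentile map and the county mean of the plot data (an assumption).
    """
    scores = {}
    for p in candidates:
        by_county = {}
        for stack, county in zip(cell_stacks, cell_county):
            by_county.setdefault(county, []).append(percentile_of(stack, p))
        diffs = [abs(statistics.mean(est) - plot_means[c])
                 for c, est in by_county.items()]
        scores[p] = statistics.mean(diffs)
    return min(scores, key=scores.get), scores

# Toy inputs (hypothetical): two counties with two cells each.
cell_stacks = [
    [0.0, 0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.2, 0.3, 0.4],   # county A
    [0.0, 0.2, 0.4, 0.6, 0.8], [0.0, 0.2, 0.4, 0.6, 0.8],   # county B
]
cell_county = ["A", "A", "B", "B"]
plot_means = {"A": 0.28, "B": 0.55}   # county means of the plot data
best, scores = choose_percentile(cell_stacks, cell_county, plot_means,
                                 [50, 60, 70, 80])
```

With these made-up numbers the 70th percentile gives the smallest dissimilarity.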

Modeled estimate (nonforest areas are masked from another source: AVHRR, Zhu, 1992)

Conclusions

Geostatistics is a branch of statistics that studies phenomena in space. Tools such as variograms describe the spatial structure present in the data. Techniques such as kriging or conditional simulation use that spatial structure to estimate the value at unknown points based on the distance, direction, and redundancy of neighboring points. Unlike kriging, stochastic simulation incorporates uncertainty into the model and provides output with a more realistic heterogeneity as well as a clear measure of the uncertainty of each local estimate. Sequential Gaussian conditional simulation (SGS) assumes the random function is multivariate Gaussian and, as a result, is much less computationally intensive than sequential indicator simulation (SIS), which does not make this assumption. SGS is used here.

The USFS FIA ground inventory is an unbiased and relatively evenly distributed sample, with a spacing of approximately 5 km. The data values are individual tree species basal area per acre as a proportion of the total (% ba/acre), representing species importance. The data typically exhibit a skewed distribution. There is spatial structure present, ranging from 20 to 80% for tree species in this area. There is also unexplained structure (seen in the nugget of the variogram), where a species' relative dominance is affected by factors that vary at scales finer than the sampling intensity of the FIA plots (local changes in moisture, soil type, topography, etc.).

Approach and Model parameters used:

Guidelines for interpreting and utilizing the simulation output:

The goal of the current paper is to present a simulation-based estimation technique that incorporates information on the spatial dependence of the data, provides estimates that have properties similar to those of the original dataset, maintains some of the variability we know exists, incorporates uncertainty explicitly into the estimates, and allows the user to choose from among a range of estimates based on the specific objectives of the research or management question. It is this technique that is used to create the FIA species distribution maps.

Zones/populations: When subpopulations of a species have a significantly different pattern of spatial distribution, treating the populations separately in the modeling and interpolation will improve the final estimates. Here geographic zones were manually drawn for each species based on Bailey’s ecoregions and the distribution of %ba/acre values that actually occurred for each species. Ecoregions that were long and narrow were also joined to minimize edge effects.

Normal-scoring the data: To meet the assumptions of the Gaussian model, the data were transformed to a univariate normal distribution using a 1:1 invertible normal-score transform. Bivariate normality was checked and confirmed to be very close; multivariate normality was assumed.
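A normal-score transform of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the routine used for the maps: the plotting position (rank + 0.5)/n, the handling of ties by input order, and the linear back-interpolation are choices made here, and the function names and sample values are hypothetical.

```python
from statistics import NormalDist

def normal_score_transform(values):
    """Return (scores, table); table pairs each datum with its normal score."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])   # ties keep input order
    scores = [0.0] * n
    for rank, i in enumerate(order):
        p = (rank + 0.5) / n                 # plotting position inside (0, 1)
        scores[i] = NormalDist().inv_cdf(p)  # standard normal quantile
    table = sorted(zip(values, scores))
    return scores, table

def back_transform(y, table):
    """Map a normal score back to data units by linear interpolation."""
    xs = [x for x, _ in table]
    ys = [s for _, s in table]
    if y <= ys[0]:
        return xs[0]
    if y >= ys[-1]:
        return xs[-1]
    for k in range(1, len(ys)):
        if y <= ys[k]:
            t = (y - ys[k - 1]) / (ys[k] - ys[k - 1])
            return xs[k - 1] + t * (xs[k] - xs[k - 1])

# Skewed toy sample of %ba/acre proportions (hypothetical values)
data = [0.0, 0.02, 0.05, 0.10, 0.40]
scores, table = normal_score_transform(data)
restored = [back_transform(s, table) for s in scores]
```

Because each datum receives its own score, the table can be read in either direction, which is what makes the transform invertible after simulation.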

Model: In general, one or two structures were observed, exhibiting the short- and long-range types of spatial dependence that might be expected in the distribution of tree species. The nugget was set manually to a realistic value; a zero nugget was neither what we would expect from this phenomenon nor what the data suggested. The model parameters for hemlock-zone4 were:

Anisotropy: Anisotropy exists when the spatial structure in one direction is different from the spatial structure in another direction, as described by the sill and the range observed in the directional variograms. Where anisotropy was observed, it was modeled. Here, an anisotropy ratio of 2.5:1 was observed at 30°.
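One common way to model such geometric anisotropy is to stretch the lag component across the major axis before evaluating an isotropic variogram. The Python sketch below assumes the 30° angle is an azimuth measured clockwise from north, which the poster does not state; the function name is hypothetical.

```python
import math

def anisotropic_lag(dx, dy, ratio, azimuth_deg):
    """Equivalent isotropic lag for a separation (dx east, dy north).

    The major axis of continuity points along `azimuth_deg` (assumed to be
    clockwise from north); the range across it is shorter by `ratio`.
    """
    theta = math.radians(azimuth_deg)
    along = dx * math.sin(theta) + dy * math.cos(theta)    # along the major axis
    across = dx * math.cos(theta) - dy * math.sin(theta)   # across the major axis
    return math.hypot(along, ratio * across)

# 10 km along the major axis, and 10 km perpendicular to it (azimuth 120°)
h_major = anisotropic_lag(10 * math.sin(math.radians(30)),
                          10 * math.cos(math.radians(30)), 2.5, 30)
h_minor = anisotropic_lag(10 * math.sin(math.radians(120)),
                          10 * math.cos(math.radians(120)), 2.5, 30)
```

With a ratio of 2.5:1, a 10 km separation across the major axis is treated like a 25 km separation along it.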

Search parameters: The search radius is set to the distance over which at least 90% of the total autocorrelation is captured.

Other parameters: 100 realizations were run, using a cell size of 2x2 km. A minimum of 4 and a maximum of 24 data points (12 simulated) were used to create each estimate.

[Map panels: + uncertainty; - uncertainty]

• Maps such as these begin to address the need for locally reliable information of specific forest inventory variables. This example maps tree species distributions, but one could also map other FIA variables (growth, volume, etc.), with varying degrees of uncertainty.

• SGS provides considerable flexibility and the capability to create both generally useful and study-specific datasets from the output, making it worth the effort to both create and explain when spatial structure is present. It retains many of the characteristics of the original sample data and incorporates uncertainty explicitly into the output (which can be extracted in direct +/- terms). The assumption of multipoint normality of the random function is a big one, but if biases are checked for, the time saving over SIS can be worth it.

• In creating the ‘general’ dataset we have assumed that FIA plot data are the most correct information we have at the county level (with a minimum of 10 forested plots). For areas any smaller than counties, this assumption would be suspect.

• Provided on the web page (http://www.fs.fed.us/ne/fia/specdist/clickmap.html) are the general datasets of %ba/acre distribution and +/- uncertainty for each species. Maps to suit specific criteria can be easily calculated to address specific problems, and a contact address is provided. Information on FIA spatial statistics in general can be found at http://www.fs.fed.us/ne/fia/.

• The methodology presented here provides a current balance between the time available and the accuracy desired. Alternative methods are available when time is more limited, and also when more time and data are available to improve accuracies and decrease uncertainties. Using only FIA plot data, there is a limit to how localized and how precisely the results can actually be applied. Bringing in ancillary information (i.e. conditioning to more information), provided that that information is sufficiently related to the variable of interest, has the potential to improve both the spatial resolution and the uncertainties associated with the output dataset. Future work will be focused on both fronts.

Data characteristics and phenomena

The zones used for hemlock and the variograms calculated for each. (Substantially) Different variograms in adjacent regions warrant separate modeling and simulation when time permits.

Summary of output characteristics desired: provide a set of estimates that maintain as many characteristics of the original dataset as possible; provide a measure of uncertainty along with any estimate; allow the user to choose from among a range of estimates based on the specific goals of the study; preserve some of the local variability to indicate local heterogeneity where present; be produced within the constraints of time and computer resources.

Why geostats and SGS?

Plot map of sample data

Model

Number of structures: 2
Nugget effect: .7

Structure   Function      Range (m)   Component
1           spherical     10500       .15
2           exponential   55000       .15

Anisotropy: 2.75:1 at 30°
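As an illustration, the nested model listed above (nugget .7, a spherical structure with range 10500 m and component .15, an exponential structure with range 55000 m and component .15) can be evaluated with the short Python sketch below. The direction-independent form and the practical-range convention for the exponential structure are assumptions; the poster does not say which convention its software used.

```python
import math

NUGGET = 0.70
# (function, range in m, sill component) as listed for hemlock-zone4
STRUCTURES = [("spherical", 10500.0, 0.15), ("exponential", 55000.0, 0.15)]

def spherical(h, a):
    """Standardized spherical model: reaches 1 exactly at h = a."""
    if h >= a:
        return 1.0
    r = h / a
    return 1.5 * r - 0.5 * r ** 3

def exponential(h, a):
    """Standardized exponential model, with `a` as the practical range (assumed)."""
    return 1.0 - math.exp(-3.0 * h / a)

def gamma(h):
    """Nested variogram value at lag h (m); anisotropy is ignored here."""
    if h <= 0:
        return 0.0
    total = NUGGET
    for kind, a, c in STRUCTURES:
        shape = spherical(h, a) if kind == "spherical" else exponential(h, a)
        total += c * shape
    return total

for h in (1000, 10500, 55000, 120000):
    print(h, round(gamma(h), 3))
```

The three components sum to a total sill of 1.0, as expected for normal-scored data.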

[Figure residue: five variogram panels, γ(h) against distance (km), 0 to 120 km.]

Single realization 70th percentile Ordinary kriging

We can therefore pull one summary statistic from each cell’s distribution to represent its ‘estimate’, such as the median or any other percentile, and another to represent the uncertainty of that estimate, such as the interquartile range of each cell’s distribution.

A single realization contains all the spatial variation of the plots, but it doesn’t give us any uncertainty with the estimate, because it’s just one possible scenario. Also, like the points themselves, it is sometimes too noisy to give us a good sense of the spatial pattern of distribution.

Ordinary kriging honors the overall mean. The results of ordinary kriging are much more smoothed and there is no comparable estimate of uncertainty, since the kriging variance does not reflect the redundancy of data values, but only the number and location of data points.

[Map legend (%ba/acre), repeated for each of the three panels: 0 - .05, .05 - .2, .2 - .5, .5 - 1.0]

[Figure residue: autocorrelation (h) against separation distance (km); dissimilarity against percentile.]

Comparing percentile maps to FIA statistics using county area means. Here the 70th percentile appears optimal.

Uncertainty (+/-) in terms of %ba/acre, expressed here using the interquartile range (IQR)

[Map legend (% ba/acre): 0 - .05, .05 - .1, .1 - .2, .2 - .3, .3 - .4, .4 - .5, .5 - .6]

The probability distribution function (pdf) of values estimated at a single cell

[Figure residue: histograms of relative frequency against % basal area/acre of hemlock for single cells (including cell b and cell c), with the median and IQR marked.]

Alternatively, one can calculate from the SGS output the probability of any cell having at least 20%ba/acre of hemlock.

This might be useful if funds are best spent by sending crews to the areas where one is most likely to find hemlock.
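A minimal Python sketch of that calculation: for each cell, the probability is the share of realizations that meet the threshold. The function name and the four toy realizations are hypothetical.

```python
def exceedance_probability(realizations, threshold):
    """Per-cell fraction of realizations with a value >= threshold."""
    n = len(realizations)
    n_cells = len(realizations[0])
    return [
        sum(1 for r in realizations if r[c] >= threshold) / n
        for c in range(n_cells)
    ]

# Four toy realizations of a 3-cell grid (hypothetical %ba/acre proportions)
realizations = [
    [0.10, 0.25, 0.30],
    [0.15, 0.18, 0.40],
    [0.05, 0.22, 0.35],
    [0.21, 0.30, 0.19],
]
probs = exceedance_probability(realizations, 0.20)
```

With 100 realizations, as used for the maps, the probabilities are resolved in steps of 0.01.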

[Map legend, Probability of 20%ba/acre H: 0 - .1, .1 - .2, .2 - .3, .3 - .4, .4 - .5, .5 - .6, .6 - .7]

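If we assume a linear loss function using those ratios, with w_1 = 1.5 the cost per unit of overestimation and w_2 = 1 the cost per unit of underestimation, then the percentile (p) that we’re probably interested in is:

$$p = \frac{w_2}{w_1 + w_2} = \frac{1}{1.5 + 1} = .4, \quad \text{or the 40th percentile}$$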

Thus, this map of the 40th percentile provides a more limited picture of where concentrations of hemlock occur – increasing the likelihood that those cells contain at least that much %ba/acre of hemlock.
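The link between the cost ratio and the percentile can be checked numerically. The Python sketch below assumes the asymmetric linear loss described above, with a cost of 1.5 per unit of overestimation and 1 per unit of underestimation; the function names and the evenly spaced toy stack of values are hypothetical.

```python
def linear_loss(error, w_over, w_under):
    """Asymmetric linear loss, where error = estimate - true value."""
    return w_over * error if error > 0 else w_under * (-error)

def optimal_percentile(w_over, w_under):
    """Quantile level that minimizes expected asymmetric linear loss."""
    return w_under / (w_over + w_under)

def expected_loss(estimate, stack, w_over, w_under):
    """Average loss of `estimate` over a cell's stack of simulated values."""
    return sum(linear_loss(estimate - v, w_over, w_under) for v in stack) / len(stack)

p = optimal_percentile(1.5, 1.0)

# Toy cell distribution (assumption): values evenly spread from 0.00 to 1.00,
# so the value at percentile level p is simply p.
stack = [i / 100 for i in range(101)]
best = min(stack, key=lambda e: expected_loss(e, stack, 1.5, 1.0))
print(p, best)
```

In this toy case the candidate with the lowest expected loss coincides with the 40th percentile of the stack, which is why a higher cost of overestimation pushes the choice toward a lower percentile map.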

[Map legend (%ba/acre): 0 - .05, .05 - .2, .2 - .5, .5 - 1.0]

[Figure residue: linear loss function, loss against estimation error (%ba/acre, from -30 to 30), with a weight of 1 on the underestimation side and 1.5 on the overestimation side; variogram against separation distance (km).]

Rachel Riemann and Andy Lister

Depending on the shape of each cell’s distribution and the choice of percentile, the + and – uncertainties (i.e. the possible magnitude of under- and over-estimation) may be different, and mapping them separately can be useful.

[Map legend (%ba/acre): 0 - .05, .05 - .2, .2 - .5, .5 - 1.0, nonforest, water]

[email protected] / [email protected]

Another look: using dissimilarity between FIA and estimated county area means to choose optimal percentile