Brief Introduction to (Hierarchical) Bayesian Methods

Christopher K. Wikle

Department of Statistics

University of Missouri-Columbia

Outline

• Bayesian Statistics

• Hierarchical Bayesian Models

• Examples

– Blending Tropical Surface Winds

– Long-Lead Prediction of SST

– Tornado Report Climatology

• Conclusion

AMS Short Course, January, 2004

1

Geophysical Analysis

Two types of geophysical models?

• Physical:

– Laws of classical physics

– Uncertainties: neglecting higher-order terms; choice of space-time averaging (subgrid-scale parameterizations); linking systems (coupling biological processes to weather), etc.

• Statistical:

– Descriptive

– Uncertainties: inefficient in complex settings; data not representative of full dynamics; application under system changes; sampling; estimation

2

“Physical/Statistical Models”

Better to consider a “spectrum of models” (e.g., Berliner, 2003)

Deterministic ⇐⇒ Stochastic

Bayesian Approach:

• Natural for combining information sources while managing their uncertainties.

– Multiple data sources

– Uncertain physics

– Physical models with stochastic parameters

– Expert opinion

• Predictive distributions of quantities of interest, conditioned on observations

3

Distributional Notation

We use the short-hand notation [ ] for probability distribution.

Given random variables A and B, represent:

• Joint distribution of A and B: [A, B]

• Conditional distribution of A given B: [A|B]

• Marginal distribution of A: [A]

Also, note the following are equivalent:

$$Y \mid \mu, \sigma^2 \sim N(\mu, \sigma^2)$$

$$Y = \mu + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$$

where “∼” means “is distributed as”.

4

Bayesian Viewpoint

Bayes’ Theorem:

$$[Y \mid X] = \frac{[Y, X]}{[X]} = \frac{[X \mid Y]\,[Y]}{\int [X \mid Y]\,[Y]\,dY}$$

For random variables with known distributions, this is a “fact” from probability theory. It is not controversial!

Bayesian Statistics:

Modeling unknown distributions and updating those models based on data. Might be controversial!

5

Bayesian Statistics

$$[Y \mid X] = \frac{[X \mid Y]\,[Y]}{\int [X \mid Y]\,[Y]\,dY} \propto [X \mid Y]\,[Y]$$

• Want to make inference about Y , but we only observe X

• We update our uncertainty about Y after observing X

• [Y ] reflects our prior knowledge of Y

• [X|Y ] is the “likelihood”

• [Y |X ] is the posterior distribution

Bayesian Modeling:

Treat all unknowns as if they are random and evaluate probabilistically

[Process|Data] ∝ [Data|Process][Process]
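The proportionality above can be checked numerically in one dimension by evaluating likelihood times prior on a grid and normalizing. The sketch below is only an illustration: the normal prior, the normal likelihood, the observed value `x_obs` and the grid limits are all assumptions chosen for the example, not part of the slides.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def grid_posterior(x_obs, grid, prior_mean, prior_sd, obs_sd):
    """Approximate [Y|X] on an evenly spaced grid by normalizing [X|Y][Y]."""
    unnorm = [normal_pdf(x_obs, y, obs_sd) * normal_pdf(y, prior_mean, prior_sd)
              for y in grid]
    width = grid[1] - grid[0]
    const = sum(unnorm) * width  # Riemann-sum estimate of the integral of [X|Y][Y] dY
    return [u / const for u in unnorm]

# Illustrative values (assumptions): prior Y ~ N(0, 1), likelihood X|Y ~ N(Y, 1), observed X = 2.
grid = [-10.0 + 0.01 * i for i in range(2001)]
step = grid[1] - grid[0]
post = grid_posterior(x_obs=2.0, grid=grid, prior_mean=0.0, prior_sd=1.0, obs_sd=1.0)
post_mean = sum(y * p for y, p in zip(grid, post)) * step
```

With these values the grid posterior mean should sit halfway between the prior mean and the observation, because prior and likelihood are equally precise.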

6

Simple Example 1

• Temperature observations: T (s1), T (s2), . . . , T (sn)

• These are observations of the true (unknown) process: θ

• We have prior information about θ, say θ0 from a “numerical model”, but we don’t believe our model is perfect.

Probabilistically, we may believe:

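$$[\text{Data} \mid \text{Process}]: \quad T(s_i) \mid \theta, \sigma^2 \sim N(\theta, \sigma^2)$$

$$[\text{Process}]: \quad \theta \mid \theta_0, \tau^2 \sim N(\theta_0, \tau^2)$$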

7

We want the posterior: [Process|Data], i.e.,

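$$[\theta \mid T(s_1), \ldots, T(s_n)] \propto [T(s_1), \ldots, T(s_n) \mid \theta, \sigma^2]\,[\theta \mid \theta_0, \tau^2]$$

where, for simplicity, we assume $\theta_0$, $\tau^2$, $\sigma^2$ are known. Then,

$$\theta \mid T(s_1), \ldots, T(s_n) \sim N\!\left(w\bar{T} + (1 - w)\theta_0,\ \frac{\sigma^2 \tau^2}{\sigma^2 + n\tau^2}\right)$$

where

$$\bar{T} = \frac{1}{n}\sum_{i=1}^{n} T(s_i), \qquad w = \frac{\tau^2}{\tau^2 + \sigma^2/n}.$$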

Note: this is a weighted average of the sample mean and the prior mean, where the weights are related to our certainty about the observations and prior.
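The posterior mean, variance and weight for Simple Example 1 can be computed directly from the formulas on this slide. The following is a minimal sketch; the temperature values and the settings of θ0, τ² and σ² are made up for illustration.

```python
def normal_posterior(obs, theta0, tau2, sigma2):
    """Posterior of theta for T(s_i) | theta ~ N(theta, sigma2), theta ~ N(theta0, tau2).

    Returns (posterior mean, posterior variance, weight w on the sample mean).
    """
    n = len(obs)
    t_bar = sum(obs) / n                         # sample mean of the observations
    w = tau2 / (tau2 + sigma2 / n)               # weight on the sample mean
    post_mean = w * t_bar + (1.0 - w) * theta0   # weighted average of data and prior
    post_var = sigma2 * tau2 / (sigma2 + n * tau2)
    return post_mean, post_var, w

# Hypothetical temperatures and hyperparameters (assumptions for illustration only).
temps = [21.3, 19.8, 20.6, 22.1, 20.9]
post_mean, post_var, w = normal_posterior(temps, theta0=18.0, tau2=4.0, sigma2=1.0)
```

In this example the prior is vague relative to five fairly precise observations, so w is close to 1 and the posterior mean lies close to the sample mean.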

8

Basic Bayes is Not New to Meteorology

• Epstein, early 1960’s, used Bayesian decision making in applied meteorology

• Weather Modification Research in 1970’s (e.g., Simpson et al. 1973, JAS)

• Data Assimilation can be viewed from Bayesian viewpoint (e.g., Lorenc 1986, Q. J. Roy. Met. Soc.)

– Kalman filter is Bayesian (e.g., Meinhold and Singpurwalla, 1983)

– Ensemble Kalman Filters (Sequential Monte Carlo); e.g., Evensen and van Leeuwen, 2000, MWR

• Climate Change (e.g., Solow 1988, J. Clim.; Leroy 1998, J. Clim.)

• Satellite Retrieval (many!)

Hierarchical Bayes is new to meteorology!

9

Hierarchical Bayesian Viewpoint

Cornerstone of hierarchical modeling is conditional thinking.

Joint distribution can be represented as products of conditionals. e.g.,

[A, B, C] = [A|B, C][B|C][C]

NOTE: Our choice for this decomposition is based on what we know about the process, and what assumptions we are willing and able to make for simplification:

e.g., conditional independence: [A|B,C]=[A|B].

Often easier to express conditional models than full joint models.
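As a worked instance of the decomposition above, suppose (purely for illustration) that A is conditionally independent of C given B. Substituting that assumption into the general factorization gives a product in which every factor involves at most two variables:

```latex
\begin{align*}
[A, B, C] &= [A \mid B, C]\,[B \mid C]\,[C] \\
          &= [A \mid B]\,[B \mid C]\,[C]
          \qquad \text{(using } [A \mid B, C] = [A \mid B] \text{)}
\end{align*}
```

Each of the three factors can then be modeled separately, which is usually simpler than specifying the three-way joint distribution directly.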

10

Hierarchical Bayesian Modeling of Geophysical Systems

Separate unknowns into two groups:

• process variables: actual physical quantities of interest (e.g., temperature, pressure, wind)

• model parameters: quantities introduced in model development (e.g., propagator matrices, measurement error and sub-grid scale variances, unknown physical constants)

Basic Hierarchical Model

1. [data|process, parameters]

2. [process|parameters]

3. [parameters]

The posterior [process, parameters|data] is proportional to the product of these three distributions!

11

Simple Example 1 (cont.)

Say that in Example 1, we believe that the numerical model value of θ is biased. We do not know exactly the bias, but suspect it is related to known factors $x \equiv (x_1, x_2, \ldots, x_k)'$ (e.g., MOS). Then, our hierarchical model has the following stages:

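• Data Model:

$$T(s_i) \mid \theta, \sigma^2 \sim N(\theta, \sigma^2)$$

• Process Model:

$$\theta \mid \theta_0, x, \beta, \tau^2 \sim N(\theta_0 + x'\beta, \tau^2)$$

• Parameter Model:

$$\beta \mid \beta_0, \Sigma \sim N(\beta_0, \Sigma)$$

We then seek the posterior distribution: $[\theta, \beta \mid T(s_1), \ldots, T(s_n)]$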
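Because every stage of this hierarchical model is Gaussian, each unknown has a Gaussian full conditional, and a Gibbs sampler can alternate between them. The sketch below is a simplification rather than the model as stated: it assumes a single bias factor (k = 1, so x and β are scalars and Σ reduces to a scalar variance `s2`), and all numerical values are invented for illustration.

```python
import random

def gibbs_example1(obs, theta0, x, beta0, s2, tau2, sigma2, n_iter, seed=0):
    """Gibbs sampler for the three-stage model with scalar x and beta (assumed k = 1)."""
    rng = random.Random(seed)
    n = len(obs)
    t_bar = sum(obs) / n
    theta, beta = t_bar, beta0  # starting values
    draws = []
    for _ in range(n_iter):
        # theta | beta, data: combines the data model and the process model
        prec_t = n / sigma2 + 1.0 / tau2
        mean_t = (n * t_bar / sigma2 + (theta0 + x * beta) / tau2) / prec_t
        theta = rng.gauss(mean_t, prec_t ** -0.5)
        # beta | theta: combines the process model and the parameter model
        prec_b = 1.0 / s2 + x * x / tau2
        mean_b = (beta0 / s2 + x * (theta - theta0) / tau2) / prec_b
        beta = rng.gauss(mean_b, prec_b ** -0.5)
        draws.append((theta, beta))
    return draws

# Hypothetical data and hyperparameters (assumptions for illustration only).
temps = [21.3, 19.8, 20.6, 22.1, 20.9]
draws = gibbs_example1(temps, theta0=18.0, x=1.0, beta0=1.0, s2=1.0,
                       tau2=4.0, sigma2=1.0, n_iter=20000)
kept = draws[2000:]  # discard burn-in
theta_mean = sum(t for t, _ in kept) / len(kept)
```

In this all-Gaussian toy case the posterior is also available in closed form, so the sampler is only a stand-in for the non-conjugate models where sampling is actually needed.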

12

Complicated Processes

Usually, we can’t find analytically the normalizing constant in Bayes’ theorem. Thus we can’t easily get the posterior distribution.

• Numerical integration (o.k. for low dimensions)

• Use Monte Carlo methods to sample from the posterior. Specifically,

– Markov Chain Monte Carlo (MCMC)

∗ Sample from a Markov Chain that has the same ergodic distribution as the posterior

∗ Don’t need to know normalizing constant

∗ e.g., Metropolis, Gibbs sampler algorithms

∗ revolutionized Bayesian statistics in the late 1980’s

– Very computationally intensive
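To make the MCMC idea concrete, here is a minimal random-walk Metropolis sketch applied to the posterior of θ from Simple Example 1. It uses only the unnormalized log posterior, so the normalizing constant is never computed. The data, hyperparameters, proposal scale `step_sd` and chain length are illustrative assumptions.

```python
import math
import random

def log_unnorm_posterior(theta, obs, theta0, tau2, sigma2):
    """log([data|theta][theta]) up to an additive constant."""
    log_lik = -0.5 * sum((t - theta) ** 2 for t in obs) / sigma2
    log_prior = -0.5 * (theta - theta0) ** 2 / tau2
    return log_lik + log_prior

def metropolis(log_target, start, step_sd, n_iter, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    chain = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain

# Hypothetical data and hyperparameters (assumptions for illustration only).
temps = [21.3, 19.8, 20.6, 22.1, 20.9]
chain = metropolis(lambda th: log_unnorm_posterior(th, temps, 18.0, 4.0, 1.0),
                   start=18.0, step_sd=1.0, n_iter=30000)
kept = chain[3000:]  # discard burn-in
chain_mean = sum(kept) / len(kept)
```

For this conjugate example the chain average can be compared against the closed-form posterior mean; in realistic hierarchical models no such check is available and convergence has to be assessed by other diagnostics.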

13

Example 2: Tropical Ocean Winds

Problem: Blending tropical surface winds given high-resolution satellite scatterometer observations and low-resolution assimilated model output. (Wikle, Milliff, Nychka, Berliner, 2001; JASA)

14

Wind Problem: Process Model Motivation I

Consider the linear shallow water equations (on an equatorial beta-plane):

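$$\frac{\partial u}{\partial t} - \beta_0 y v + g\frac{\partial h}{\partial x} = 0$$

$$\frac{\partial v}{\partial t} + \beta_0 y u + g\frac{\partial h}{\partial y} = 0$$

$$\frac{\partial h}{\partial t} + h\left(\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y}\right) = 0$$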

• β0: constant related to rotation of earth

• g: gravitational acceleration

This linear system can be solved analytically, giving a series of traveling waves (equatorial normal modes). These are observed in the tropics!

15

Wind Problem: Process/Parameter Model Motivation II

Empirical results suggest turbulent scaling behavior for near-surface tropical winds (e.g., Wikle, Milliff and Large, 1999; JAS)

In particular:

$$S_v(\omega) \propto \frac{\sigma_v^2}{|\omega|^{\kappa}}$$

where κ = 5/3, and ω is the spatial frequency.
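A power-law spectrum is a straight line in log-log coordinates, so one informal way to look at the exponent is a least-squares fit of log spectrum against log frequency. The sketch below does this on synthetic, noise-free values generated with κ = 5/3; it is an illustration of the relationship only, not the estimation procedure used in the cited work, and the frequency grid and scale constant are assumptions.

```python
import math

def fit_power_law(freqs, spectrum):
    """Fit log S = log c - kappa * log|omega| by ordinary least squares.

    Returns (kappa, c).
    """
    xs = [math.log(abs(f)) for f in freqs]
    ys = [math.log(s) for s in spectrum]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return -slope, math.exp(intercept)

# Synthetic spectrum (assumption): S(omega) = 2 / |omega|^(5/3) on an arbitrary frequency grid.
freqs = [0.01 * k for k in range(1, 51)]
spectrum = [2.0 / f ** (5.0 / 3.0) for f in freqs]
kappa, scale = fit_power_law(freqs, spectrum)
```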

16

Wind Model: Hierarchical Model Sketch (v-Component)

• Data Model: Observations at different resolutions

$$[Z_t^{s\prime} \;\; Z_t^{a\prime}]' = K_t v_t + \varepsilon_t, \quad \varepsilon_t \sim N(0, \Sigma_\varepsilon)$$

• Process Model:

$$v_t = \Phi a_t + \gamma_t,$$

Φ - matrix containing “important” normal mode basis functions (importance determined from empirical studies, e.g., Wheeler and Kiladis, 1999)

$a_t$ - time-varying spectral coefficients

$\gamma_t$ - multiresolution spatio-temporal process

• Parameter Models:

$$a_t = H(\theta) a_{t-1} + \eta_t, \quad \eta_t \sim N(0, \Sigma_\eta)$$

θ - priors based on equatorial wave theory and empirical studies

$\gamma_t$ - priors to suggest turbulent power-law relationships

17

Wind Problem Results: Posterior Decomposition

18

Wind Problem Results: Tropical Cyclone Dale

19

Wind Blending: “Operational”

Tim Hoar, Geophysical Statistics Project, NCAR has made this model “operational” on QSCAT data (1/2 deg)

20

Example 3: Long Lead Prediction of SST

Goal: Predict Pacific SST anomalies (2° × 2° resolution) 7 months in advance while realistically accounting for uncertainty. [Berliner, Wikle, Cressie, 2000, J. Climate]

Consider data {Z(si; t)} to be anomalies from monthly means.

Want to predict $Z(s_0; T + \tau)$, for τ = 7 months, from space-time data $D(T) \equiv \{Z_T, Z_{T-1}, \ldots, Z_1\}$.

21

Data Model

Dimension Reduction:

$$Z_t = \Phi a_t + \nu_t,$$

where

• $\nu_t \sim \mathrm{Gau}(0, \Sigma_\nu)$: measurement error and small-scale spatial variability that is uncorrelated in time

• Φ: truncated empirical orthogonal function basis set, obtained from SVD of $Z \equiv (Z_1, \ldots, Z_T)$.
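The dimension-reduction step can be sketched with a singular value decomposition: the leading left singular vectors of the space-by-time data matrix serve as the truncated EOF basis Φ, and projecting the data onto them gives the coefficients a_t. The example below assumes NumPy is available and uses synthetic data; the matrix sizes, the truncation level `p` and the noise level are illustrative choices, not values from the SST application.

```python
import numpy as np

def eof_basis(Z, p):
    """Truncated EOF basis from the SVD of Z (rows = locations, columns = times).

    Returns Phi (n_locations x p, orthonormal columns) and a (p x n_times) = Phi' Z.
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    Phi = U[:, :p]   # leading p spatial patterns
    a = Phi.T @ Z    # coefficient time series for each pattern
    return Phi, a

# Synthetic anomalies (assumption): 3 spatial patterns plus small-scale noise.
rng = np.random.default_rng(0)
n, T, p = 30, 40, 3
true_patterns, _ = np.linalg.qr(rng.standard_normal((n, p)))
coeffs = rng.standard_normal((p, T)) * np.array([[5.0], [3.0], [2.0]])
Z = true_patterns @ coeffs + 0.1 * rng.standard_normal((n, T))

Phi, a = eof_basis(Z, p)
resid = Z - Phi @ a  # plays the role of nu_t: what the truncated basis does not capture
```

The input here is taken to be anomalies already, matching the slide's definition of the data; raw fields would need their monthly means removed first.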

22

Process Model

$$a_{t+\tau} = \mu_t + H_t a_t + \eta_{t+\tau},$$

where $\eta_t \sim \mathrm{Gau}(0, \Sigma_\eta)$ for all t.

Critical modeling assumption: Let $H_t$ and $\mu_t$ be dependent on both the current and future climate regimes:

$$H_t = H(I_t, J_t)$$

$$\mu_t = \mu(I_t, J_t),$$

where

• $I_t$ - classifies the current regime as “warm” (2), “normal” (1), or “cold” (0)

• $J_t$ - anticipates a transition to one of the three regimes at time t + τ

23

Climate States

Current, $I_t$: [Threshold Model, e.g., Tong 1990]

$$I_t = \begin{cases} 0, & \text{if } SOI_t > \text{low threshold} \\ 1, & \text{otherwise} \\ 2, & \text{if } SOI_t < \text{upper threshold,} \end{cases}$$

where $SOI_t$ is the Southern Oscillation Index (assumed “known”).

Future, $J_t$: [Latent (hidden) Process Model]

$$J_t = \begin{cases} 0, & \text{if } W_t > \text{low threshold} \\ 1, & \text{otherwise} \\ 2, & \text{if } W_t < \text{upper threshold,} \end{cases}$$

where $W_t$ is a latent process which anticipates the future climate regime.

24

Latent Process to Assess “Future”

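$$W_t \mid \beta_w, \sigma_w^2 \sim \mathrm{Gau}(x_t'\beta_w, \sigma_w^2)$$

where,

$$x_t = \left(1,\ U_t,\ U_t \sin\!\left(\frac{2\pi m_t}{12}\right),\ U_t \cos\!\left(\frac{2\pi m_t}{12}\right),\ U_t^2\right)',$$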

where

• $U_t$ - the lowpass filtered E-W component of the wind at 10 meters above the surface at 5° N and 157° E. (This is absolutely critical!)

• $m_t$ - index of month (0 - 11) at time t
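The covariate vector for the latent process combines the wind predictor with an annual harmonic and a quadratic term. A small sketch of how it and the latent mean could be assembled follows; the function names and the numerical inputs are hypothetical, and only the form of the vector is taken from the slide.

```python
import math

def latent_covariates(u_t, month):
    """x_t = (1, U_t, U_t sin(2 pi m_t / 12), U_t cos(2 pi m_t / 12), U_t^2)."""
    angle = 2.0 * math.pi * month / 12.0
    return [1.0, u_t, u_t * math.sin(angle), u_t * math.cos(angle), u_t ** 2]

def latent_mean(u_t, month, beta_w):
    """Mean of W_t given beta_w, i.e. the inner product of x_t and beta_w."""
    return sum(x * b for x, b in zip(latent_covariates(u_t, month), beta_w))

# Illustrative inputs (assumptions): wind value 2.0 in month index 3, arbitrary coefficients.
x = latent_covariates(u_t=2.0, month=3)
w_mean = latent_mean(2.0, 3, beta_w=[0.5, 1.0, 0.0, 0.0, 0.25])
```

The sine and cosine terms let the effect of the wind on the anticipated regime vary smoothly through the year.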

25

Priors on Model Parameters

At next stage of hierarchy:

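$$\beta_w \sim \mathrm{Gau}(\beta_{SOI},\ c_\beta \operatorname{var}(\beta_{SOI})),$$

$$\sigma_w^2 \sim IG(q, r)$$

where the mean and variance and IG (“Inverse Gamma”) parameters are obtained from fitting

$$SOI_{t+\tau} = x_t'\beta_{SOI} + e_t$$

(gives $R^2 \approx 0.7$!)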

• vec(Hj), j = 1, 2, 3 - Multivariate Normal distributions

• µj, j = 1, 2, 3 - Multivariate Normal distributions

• Covariance matrices - Wishart distributions

(“Empirical Bayes” priors)

26

Bayesian Prediction

MCMC Implementation

• Probability weighted mixtures:

$$E(a_{T+\tau} \mid D(T)) = \sum_{j=0}^{2} \Pr\{J_T = j \mid D(T)\}\, E(\mu(I_T, j) + H(I_T, j)\,a_T \mid D(T))$$

• Pixelwise percentiles of $[\Phi a_{T+\tau} \mid D(T)]$

• Summary measures of El Nino/La Nina (e.g., Nino 3.4 Index) and their posterior distributions
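The probability-weighted mixture can be approximated from MCMC output by estimating each regime probability as the fraction of draws in that regime and each conditional expectation as the average prediction within that regime. The sketch below assumes a hypothetical list of `(j, prediction)` pairs, one per retained MCMC draw; the numbers are invented to keep the arithmetic easy to follow.

```python
def mixture_mean(draws, n_regimes=3):
    """Probability-weighted mixture of regime-specific mean predictions.

    draws: list of (j, prediction) pairs, where j is the sampled future regime
    and prediction is the sampled value of mu(I_T, j) + H(I_T, j) a_T as a list.
    Returns (estimated regime probabilities, mixture mean vector).
    """
    dim = len(draws[0][1])
    counts = [0] * n_regimes
    sums = [[0.0] * dim for _ in range(n_regimes)]
    for j, pred in draws:
        counts[j] += 1
        for k, value in enumerate(pred):
            sums[j][k] += value
    total = len(draws)
    probs = [c / total for c in counts]
    mix = [0.0] * dim
    for j in range(n_regimes):
        if counts[j] == 0:
            continue  # regime never sampled: contributes zero weight
        for k in range(dim):
            mix[k] += probs[j] * sums[j][k] / counts[j]
    return probs, mix

# Hypothetical MCMC output (assumption): five draws of a 2-dimensional prediction.
draws = [(2, [1.0, 0.5]), (2, [1.2, 0.7]), (1, [0.1, 0.0]),
         (2, [0.8, 0.6]), (0, [-0.9, -0.4])]
probs, mix = mixture_mean(draws)
```

The result equals the plain average over all draws; writing it as a mixture is useful because the regime probabilities are themselves of interest.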

27

Results: Nino 3.4 Prediction of 10/97 from 3/97

28

Results: Posterior Predictions for 10/97 and 10/98

29

Results: Component Predictions for 10/98

30

Results: Out of Sample Nino 3.4 Posterior Predictions

31

Example 4: Tornado Counts (Wikle and Anderson, 2003; JGR-A)

Investigate possible dependence between observed tornado counts and climate indices (ENSO, NAO, NPI).

NWS Storm Reports 1953-2001

32

Modeling Issues

Complicated Variability

• Non-normal data (counts)

• Observational uncertainty in counts

– Uncertainty in procedure for assigning Fujita scale (reliance on human observation and variable rating procedure)

– Demographic and technological influences

• Spatial variability in long-term frequency of tornado counts

• Temporal trends that may vary with space

• Exogenous climatological factors with effects that may vary spatially (e.g., regionally)

33

Data Model

Y (si; t) - tornado count in grid box si in year t

Poisson Data Model

Y (si; t)|λ(si; t) ∼ Poisson(λ(si; t))

assumes conditional independence (i.e., counts are independent conditioned on the true Poisson intensity process, λ)

NOTE: In paper, we consider a “zero-inflated Poisson model”.

34

Process Model

$$\log(\lambda(s_i; t)) = \beta_1(s_i)X_1(t) + \ldots + \beta_6(s_i)X_6(t) + \eta(s_i; t) + \varepsilon(s_i; t)$$

where,

• X1 - time

• X2 - ENSO index

• X3 - NPI index

• X4 - NAO index

• X5 - ENSO by NPI interaction

• X6 - ENSO by NAO interaction

• $\eta(s_i; t)$ - spatially correlated temporal process (large-scale) [accounting for other potential climate effects]

• $\varepsilon(s_i; t)$ - uncorrelated random effect [$\varepsilon(s_i; t) \sim N(0, \sigma_\varepsilon^2)$]
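For a single grid box and year, the data and process models combine as follows: the log intensity is a linear function of the covariates plus the two random effects, and the count is Poisson given that intensity. The sketch below evaluates these two pieces; the coefficient and covariate values are invented, and it covers the plain Poisson model only, not the zero-inflated variant mentioned for the paper.

```python
import math

def log_intensity(betas, covariates, eta=0.0, eps=0.0):
    """log lambda = sum_k beta_k * X_k + eta + eps for one grid box and year."""
    return sum(b * x for b, x in zip(betas, covariates)) + eta + eps

def poisson_log_pmf(y, lam):
    """log Pr(Y = y) for Y ~ Poisson(lam)."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

# Illustrative values (assumptions): two covariates instead of six, no random effects.
lam = math.exp(log_intensity(betas=[0.1, 0.5], covariates=[1.0, 2.0]))
log_prob_three = poisson_log_pmf(3, lam)
total_prob = sum(math.exp(poisson_log_pmf(y, lam)) for y in range(100))
```

Under the conditional independence assumption, the log likelihood for all grid boxes and years is the sum of such terms.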

35

Parameter Models

$\beta_k \sim N(0, \Sigma(\theta_k))$ - Spatially correlated random fields specified by variance and dependence parameters ($\theta_k$).

$\eta_t \sim N(\mu, \Sigma(\theta_\eta))$ - spatial process, with mean µ and variance and dependence parameters given by $\theta_\eta$.

Distributions must also be specified for the “hyper-parameters”: $\theta_k$, $\theta_\eta$, µ, and $\sigma_\varepsilon^2$ (see Wikle and Anderson (2003) for details)

36

Posterior Summary: Time Trend (β1) Coefficients

Posterior Mean and Posterior 95% “Credibility”

37

Posterior Summary: ENSO (β2) Coefficients


38

Posterior Summary: NPI (β3) Coefficients


39

Some Other Recent Hierarchical Bayesian Examples

• Air-Sea Interaction: (Berliner, Milliff, Wikle, 2003, JGR-O)

• Climate Change: (Berliner, Levine, Shea, 2000, J. Clim.),

• Stochastic Boundary Value Problems: (Wikle, Berliner, Milliff, 2003, MWR)

• Nowcasting Radar: (Xu, Wikle, Fox, 2003)

• Hurricane Climatology: (Elsner and Bossak, 2001, J. Clim.)

• Analysis of Rainfall Data: (Sanso and Guenni, 1999, Applied Statistics)

40

Conclusion

• Bayesian methods allow one to quantify uncertainty in all aspects of the problem (data, process, parameters) and distributional output reflects the quantification of uncertainty

• Hierarchical Bayesian methods allow one to decompose the problem into a series of simpler conditional models

– Can accommodate data, model, and parameter uncertainty

– Can accommodate data of different resolution/alignment (e.g., GIS layers)

– Can accommodate complicated spatial, temporal, and spatio-temporal dependence

– Can accommodate physics (e.g., shallow water PDE; turbulence)

– Can accommodate expert opinion

• Downside: Complicated and computationally intense (canned software available for free, WinBUGS)

41

* Most of the papers discussed here can be found on my web page:

http://www.stat.missouri.edu/~wikle/

42
