Approximate Bayesian Computation for Astrostatistics · 2016-10-26
TRANSCRIPT
Approximate Bayesian Computation for Astrostatistics
Jessi Cisewski, Department of Statistics, Yale University
October 24, 2016
SAMSI Undergraduate Workshop
Our goals
Introduction to Bayesian methods
Likelihoods, priors, posteriors
Learn our ABCs....
Approximate Bayesian Computation
Image from http://livingstontownship.org/lycsblog/?p=241
Suppose we randomly select a sample of n women and measure their heights: x1, x2, x3, . . . , xn

Let's assume this follows a Normal distribution: X ∼ N(µ, σ)

f(x; µ, σ) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
Stellar Initial Mass Function
Probability density function (pdf) → the function f(x; µ, σ), with µ fixed and x variable:

f(x; µ, σ) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

[Figure: a Normal density over height (in); each observation is a draw from this density.]

Likelihood → the function f(x; µ, σ), with µ variable and x fixed:

[Figure: the likelihood as a function of µ, peaked near the observed data.]
Consider a random sample of size n (assuming independence, and a known σ): X1, . . . , Xn ∼ N(3, 2)

f(x1, . . . , xn; µ, σ) = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))

[Figure: normalized likelihoods in µ for n = 50, 100, 250, 500; the likelihood concentrates around µ as n grows.]
Likelihood Principle

All of the information in a sample is contained in the likelihood function, a density or distribution function.

The data are modeled by a likelihood function.

How do we infer µ? Or any unknown parameter(s) θ?
Maximum likelihood estimation

The parameter value θ̂ that maximizes the likelihood:

θ̂ = argmax_θ f(x1, . . . , xn; θ)

[Figure: the likelihood as a function of µ, with the MLE µ̂ marked at its peak.]

max_µ f(x1, . . . , xn; µ, σ) = max_µ ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))

Hence, µ̂ = (∑_{i=1}^{n} xi)/n = x̄
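The closed-form result above can be checked numerically: maximizing the Normal log-likelihood over a grid of µ values recovers the sample mean. This is an illustrative sketch, not from the slides; the simulated data, seed, and grid are arbitrary choices.

```python
import numpy as np

def log_likelihood(mu, x, sigma):
    """Log-likelihood of an i.i.d. N(mu, sigma^2) sample x, as a function of mu."""
    n = len(x)
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - np.sum((x - mu) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # simulated heights-style data

# Maximize over a grid of candidate mu values...
grid = np.linspace(1.0, 5.0, 4001)
mu_hat_grid = grid[np.argmax([log_likelihood(m, x, 2.0) for m in grid])]

# ...and compare with the closed-form MLE, the sample mean.
mu_hat = x.mean()
print(mu_hat_grid, mu_hat)   # agree to within the grid spacing (0.001)
```

The grid search and the analytic answer x̄ coincide up to the grid resolution.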
Bayesian framework

Suppose θ is our parameter(s) of interest.

Classical or Frequentist methods for inference consider θ to be fixed and unknown
→ performance of methods is evaluated by repeated sampling
→ consider all possible data sets

Bayesian methods consider θ to be random
→ consider only the observed data set and prior information
Posterior distribution

π(θ | x) = f(x | θ) · π(θ) / f(x) = f(x | θ) π(θ) / ∫_Θ f(x | θ) π(θ) dθ ∝ f(x | θ) π(θ)

where x is the data, f(x | θ) is the likelihood, and π(θ) is the prior.

The prior distribution allows you to "easily" incorporate your beliefs about the parameter(s) of interest.

The posterior is a distribution on the parameter space given the observed data.
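For a one-dimensional parameter, the "proportional to" relation above can be used directly: evaluate likelihood × prior on a grid and normalize. A minimal sketch, assuming a N(0, 5) prior and simulated N(3, 2) data with known σ = 2 (all illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=50)   # simulated data, known sigma = 2

# Grid approximation: unnormalized posterior = likelihood x prior, then normalize.
mu = np.linspace(-10.0, 10.0, 2001)
dx = mu[1] - mu[0]
log_lik = np.array([-np.sum((x - m) ** 2) / (2 * 2.0**2) for m in mu])
log_prior = -0.5 * mu**2 / 5.0**2             # N(0, 5) prior, an arbitrary choice
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())      # subtract max for numerical stability
post /= post.sum() * dx                       # normalize so it integrates to 1

post_mean = np.sum(mu * post) * dx
print(post_mean)   # close to the sample mean, since this prior is diffuse
```

Dropping f(x) and renormalizing at the end is exactly why the proportionality is all that is needed in practice.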
Data: y1, . . . , y4 ∼ N(µ = 3, σ = 2), ȳ = 1.278
Prior: N(µ0 = −3, σ0 = 5)
Posterior: N(µ1 = 1.114, σ1 = 0.981)

[Figure: likelihood, prior, and posterior densities for µ.]
Data: y1, . . . , y4 ∼ N(µ = 3, σ = 2), ȳ = 1.278
Prior: N(µ0 = −3, σ0 = 1)
Posterior: N(µ1 = −0.861, σ1 = 0.707)

[Figure: likelihood, prior, and posterior densities for µ; the tighter prior pulls the posterior toward the prior mean.]
Data: y1, . . . , y200 ∼ N(µ = 3, σ = 2), ȳ = 2.999
Prior: N(µ0 = −3, σ0 = 1)
Posterior: N(µ1 = 2.881, σ1 = 0.140)

[Figure: likelihood, prior, and posterior densities for µ; with n = 200 the data dominate the prior.]
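The three examples above follow from the standard Normal–Normal conjugate update with known σ. A short sketch that reproduces the slide numbers:

```python
import numpy as np

def normal_posterior(ybar, n, sigma, mu0, sigma0):
    """Posterior N(mu1, sigma1^2) for the mean of N(mu, sigma^2) data,
    with known sigma and prior mu ~ N(mu0, sigma0^2)."""
    prec = 1.0 / sigma0**2 + n / sigma**2                 # posterior precision
    mu1 = (mu0 / sigma0**2 + n * ybar / sigma**2) / prec  # precision-weighted mean
    sigma1 = np.sqrt(1.0 / prec)
    return mu1, sigma1

# n = 4, ybar = 1.278, sigma = 2, prior N(-3, 5):
print(normal_posterior(1.278, 4, 2.0, -3.0, 5.0))    # ≈ (1.114, 0.981)
# Tighter prior N(-3, 1) pulls the posterior toward the prior mean:
print(normal_posterior(1.278, 4, 2.0, -3.0, 1.0))    # ≈ (-0.861, 0.707)
# With n = 200 the data dominate:
print(normal_posterior(2.999, 200, 2.0, -3.0, 1.0))  # ≈ (2.881, 0.140)
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, which is why more data (or a looser prior) moves the posterior toward ȳ.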
Prior distribution

The prior distribution allows you to "easily" incorporate your beliefs about the parameter(s) of interest.

If one has a specific prior in mind, then it fits nicely into the definition of the posterior.

But how do you go from prior information to a prior distribution?

And what if you don't actually have prior information?

And what if you don't know the likelihood??
Approximate Bayesian Computation

"Likelihood-free" approach to approximating π(θ | xobs), i.e. f(xobs | θ) is not specified.

Proceeds via simulation of the forward process.

Why would we not know f(xobs | θ)?
1. Physical model too complex to write as a closed-form function of θ
2. Strong dependency in the data
3. Observational limitations

ABC in Astronomy: Ishida et al. (2015); Akeret et al. (2015); Weyant et al. (2013); Schafer and Freeman (2012); Cameron and Pettitt (2012)
Basic ABC algorithm

For the observed data xobs and prior π(θ):

Algorithm∗
1. Sample θprop from the prior π(θ)
2. Generate xprop from the forward process f(x | θprop)
3. Accept θprop if xobs = xprop
4. Return to step 1

∗Introduced in Pritchard et al. (1999) (population genetics)
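The four steps can be sketched for a toy discrete model, where the exact match in step 3 happens often enough to be practical. The binomial model, observed count, and seed here are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy discrete forward model: x ~ Binomial(20, theta); one observed count.
n_flips, x_obs = 20, 13

accepted = []
while len(accepted) < 2000:
    theta_prop = rng.uniform(0.0, 1.0)           # 1. sample from the prior
    x_prop = rng.binomial(n_flips, theta_prop)   # 2. run the forward process
    if x_prop == x_obs:                          # 3. accept on exact match
        accepted.append(theta_prop)              # 4. return to step 1

# With a Uniform(0,1) prior the exact posterior is Beta(14, 8), mean 14/22 ≈ 0.636;
# the accepted theta values are draws from it.
print(np.mean(accepted))
```

For continuous data an exact match has probability zero, which motivates the tolerance ε introduced next.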
Step 3: Accept θprop if xobs = xprop

Waiting for proposals such that xobs = xprop would be computationally prohibitive.

Instead, accept proposals with ∆(xobs, xprop) ≤ ε, for some distance ∆ and some tolerance threshold ε.

When x is high-dimensional, ε has to be made too large in order to keep the acceptance probability reasonable.

Instead, reduce the dimension by comparing summaries S(xprop) and S(xobs).
Gaussian illustration

Data xobs consists of 25 iid draws from Normal(µ, 1)
Summary statistic S(x) = x̄
Distance function ∆(S(xprop), S(xobs)) = |x̄prop − x̄obs|
Tolerance ε = 0.50 and 0.10
Prior π(µ) = Normal(0, 10)
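This setup can be sketched end to end: simulate "observed" data, then run rejection ABC with the sample mean as summary. A hedged sketch, not the slides' exact code: the true µ, seed, and number of accepted draws are arbitrary, and Normal(0, 10) is taken to have standard deviation 10.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 25
x_obs = rng.normal(loc=0.3, scale=1.0, size=n)   # "observed" data; mu unknown to ABC
s_obs = x_obs.mean()                             # summary statistic S(x) = sample mean

def abc_rejection(eps, n_accept=500):
    samples = []
    while len(samples) < n_accept:
        mu_prop = rng.normal(0.0, 10.0)          # prior pi(mu) = Normal(0, 10)
        x_prop = rng.normal(mu_prop, 1.0, size=n)
        if abs(x_prop.mean() - s_obs) <= eps:    # accept if summaries are close
            samples.append(mu_prop)
    return np.array(samples)

post_loose = abc_rejection(eps=0.50)
post_tight = abc_rejection(eps=0.10)

# Shrinking eps tightens the ABC posterior toward the true posterior,
# roughly N(s_obs, 1/n) here since the prior is diffuse.
print(post_loose.std(), post_tight.std())
```

The trade-off is visible in the run time: the smaller tolerance rejects far more proposals per accepted draw.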
Gaussian illustration: posteriors for µ

[Figure, two panels (N = 1000, n = 25): ABC posterior vs. true posterior for µ, with the input value marked, for tolerance 0.5 (left) and tolerance 0.1 (right).]
How to pick a tolerance, ε?

Instead of starting the ABC algorithm over with a smaller tolerance ε, decrease the tolerance and use the already-sampled particle system as the proposal distribution, rather than drawing from the prior distribution.

Particle system: (1) retained sampled values, (2) importance weights

Beaumont et al. (2009); Moral et al. (2011); Bonassi and West (2004)
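The particle-system idea can be sketched for the Gaussian illustration. This is a simplified adaptive-kernel sampler in the spirit of Beaumont et al. (2009), with an arbitrary tolerance ladder, particle count, and seed; it is an illustration of the mechanism, not the exact algorithm behind the slides.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 25
x_obs = rng.normal(0.3, 1.0, size=n)
s_obs = x_obs.mean()

def abc_pmc(eps_seq, n_particles=500, prior_sd=10.0):
    """Sequential ABC: reuse the previous particle system as the proposal
    instead of restarting from the prior at each smaller tolerance."""
    # First generation: plain rejection sampling from the prior.
    parts = []
    while len(parts) < n_particles:
        mu = rng.normal(0.0, prior_sd)
        if abs(rng.normal(mu, 1.0, n).mean() - s_obs) <= eps_seq[0]:
            parts.append(mu)
    parts = np.array(parts)
    w = np.full(n_particles, 1.0 / n_particles)
    for eps in eps_seq[1:]:
        tau2 = 2.0 * np.cov(parts, aweights=w)   # perturbation-kernel variance
        new_parts = np.empty(n_particles)
        new_w = np.empty(n_particles)
        i = 0
        while i < n_particles:
            # Propose by perturbing a particle drawn from the previous generation.
            mu = rng.normal(rng.choice(parts, p=w), np.sqrt(tau2))
            if abs(rng.normal(mu, 1.0, n).mean() - s_obs) <= eps:
                prior = np.exp(-0.5 * (mu / prior_sd) ** 2)        # unnormalized prior
                kern = np.exp(-0.5 * (mu - parts) ** 2 / tau2)     # proposal kernel
                new_w[i] = prior / np.sum(w * kern)                # importance weight
                new_parts[i] = mu
                i += 1
        parts, w = new_parts, new_w / new_w.sum()
    return parts, w

particles, weights = abc_pmc(eps_seq=[1.0, 0.5, 0.25, 0.1])
print(np.sum(weights * particles))   # weighted posterior mean, near x_obs.mean()
```

The importance weights correct for proposing from the previous generation instead of the prior, so the final weighted particles still target the ABC posterior at the smallest tolerance.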
Gaussian illustration: sequential posteriors

[Figure (N = 1000, n = 25): sequence of ABC posteriors converging toward the true posterior, with the input value marked.]

Tolerance sequence ε1:10: 1.00, 0.75, 0.53, 0.38, 0.27, 0.19, 0.15, 0.11, 0.08, 0.06
Stellar Initial Mass Function
Stellar Initial Mass Function: the distribution of star masses after a star formation event within a specified volume of space

Molecular cloud → Protostars → Stars

Image: adapted from http://www.astro.ljmu.ac.uk
Broken power-law (Kroupa, 2001)

Φ(M) ∝ M^(−αi), for M1,i ≤ M ≤ M2,i

α1 = 0.3 for 0.01 ≤ M/MSun∗ ≤ 0.08 [sub-stellar]
α2 = 1.3 for 0.08 ≤ M/MSun ≤ 0.50
α3 = 2.3 for 0.50 ≤ M/MSun ≤ Mmax

Many other models, e.g. Salpeter (1955); Chabrier (2003)
∗1 MSun = 1 Solar Mass (the mass of our Sun)
ABC for the Stellar Initial Mass Function

[Figure: histograms of a simulated cluster, as frequency vs. Mass (left) and frequency vs. Log10(Mass) (right).]

Image (left): Adapted from http://www.astro.ljmu.ac.uk
IMF Likelihood

Start with a power-law distribution: each star's mass is independently drawn from a power-law distribution with density

f(m) = [(1 − α) / (Mmax^(1−α) − Mmin^(1−α))] m^(−α), m ∈ (Mmin, Mmax)

Then the likelihood is

L(α | m1:ntot) = [(1 − α) / (Mmax^(1−α) − Mmin^(1−α))]^ntot × ∏_{i=1}^{ntot} mi^(−α)

ntot = total number of stars in the cluster
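The truncated power law above can be sampled by inverting its CDF, and the likelihood can be checked by maximizing it on simulated masses. A sketch with illustrative parameters (α = 2.35, Mmin = 2, Mmax = 60, matching the later simulation study; seed and grid are arbitrary):

```python
import numpy as np

def sample_powerlaw(alpha, m_min, m_max, size, rng):
    """Inverse-CDF draws from f(m) ∝ m^(-alpha) on (m_min, m_max), alpha != 1."""
    u = rng.uniform(size=size)
    a = 1.0 - alpha
    return (m_min**a + u * (m_max**a - m_min**a)) ** (1.0 / a)

def log_likelihood(alpha, m, m_min, m_max):
    """Log of the truncated power-law likelihood above."""
    a = 1.0 - alpha
    return len(m) * np.log(a / (m_max**a - m_min**a)) - alpha * np.sum(np.log(m))

rng = np.random.default_rng(4)
m = sample_powerlaw(2.35, 2.0, 60.0, size=5000, rng=rng)

# The log-likelihood over a grid of alpha values peaks near the true 2.35.
grid = np.linspace(1.5, 3.5, 2001)
alpha_hat = grid[np.argmax([log_likelihood(a, m, 2.0, 60.0) for a in grid])]
print(alpha_hat)
```

For α > 1 both the numerator 1 − α and the denominator Mmax^(1−α) − Mmin^(1−α) are negative, so the normalizing constant stays positive and the log is well defined.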
Observational limitations: aging

The lifecycle of a star depends on its mass → more massive stars die faster.

For a cluster age of τ Myr, we only observe stars with masses below Tage ≈ τ^(−2/5) × 10^(8/5).

If the age is 30 Myr, the aging cutoff is Tage ≈ 10 MSun.

Then the likelihood is

L(α | m1:nobs, ntot) = [(1 − α) / (Tage^(1−α) − Mmin^(1−α))]^nobs (∏_{i=1}^{nobs} mi^(−α)) × P(M > Tage)^(ntot − nobs)

ntot = # of stars in the cluster
nobs = # of stars observed in the cluster

Image: http://scioly.org
Observational limitations: completeness

Completeness function:

P(observing star | m) =
  0, m < Cmin
  (m − Cmin) / (Cmax − Cmin), m ∈ [Cmin, Cmax]
  1, m > Cmax

The probability of observing a particular star given its mass; depends on the flux limit, stellar crowding, etc.

Image: NASA, J. Trauger (JPL), J. Westphal (Caltech)
Observational limitations: measurement error

Incorporating log-normal measurement error gives our final likelihood:

L(α | m1:nobs, ntot) =
  [ P(M > Tage) + ((1 − α) / (Mmax^(1−α) − Mmin^(1−α))) ∫_{Cmin}^{Cmax} M^(−α) (1 − (M − Cmin)/(Cmax − Cmin)) dM ]^(ntot − nobs)
  × ∏_{i=1}^{nobs} { ∫_{Mmin}^{Tage} (2πσ²)^(−1/2) mi^(−1) exp(−(log mi − log M)²/(2σ²)) ((1 − α)/(Mmax^(1−α) − Mmin^(1−α))) M^(−α)
      × ( I{M > Cmax} + ((M − Cmin)/(Cmax − Cmin)) I{Cmin ≤ M ≤ Cmax} ) dM }
[Figure, two panels: the IMF density in mass (left), and the mass function with aging, completeness, and measurement error (right). Sample size = 1000 stars, [Cmin, Cmax] = [2, 4], σ = 0.25]
Simulation Study: forward model

Draw from

f(m) = [(1 − α) / (60^(1−α) − 2^(1−α))] m^(−α), m ∈ (2, 60)

Aged 30 Myr

Observational completeness:

P(obs | m) =
  0, m < 2
  (m − 2)/2, m ∈ [2, 4]
  1, m > 4

Uncertainty: log M = log m + 0.25η (with η ∼ N(0, 1))

Prior: α ∼ U[0, 6]
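The forward process above can be sketched as a single simulator: draw power-law masses, apply the aging cutoff (Tage ≈ 10 for a 30 Myr cluster), thin by the completeness function, then add log-normal measurement error. Defaults match the slide's settings; the seed is arbitrary, and this is an illustrative reconstruction rather than the authors' code.

```python
import numpy as np

def forward_model(alpha, n_total, rng,
                  m_min=2.0, m_max=60.0, t_age=10.0,
                  c_min=2.0, c_max=4.0, sigma=0.25):
    """One draw from the simulation-study forward process; returns observed masses."""
    # 1. True masses from f(m) ∝ m^(-alpha) on (m_min, m_max), via the inverse CDF.
    u = rng.uniform(size=n_total)
    a = 1.0 - alpha
    m = (m_min**a + u * (m_max**a - m_min**a)) ** (1.0 / a)
    # 2. Aging: stars above the cutoff T_age have died and are unobservable.
    m = m[m < t_age]
    # 3. Completeness: keep each star with probability 0, (m-c_min)/(c_max-c_min), or 1.
    p_obs = np.clip((m - c_min) / (c_max - c_min), 0.0, 1.0)
    m = m[rng.uniform(size=len(m)) < p_obs]
    # 4. Log-normal measurement error: log M_obs = log m + sigma * eta.
    return m * np.exp(sigma * rng.standard_normal(len(m)))

rng = np.random.default_rng(5)
m_obs = forward_model(alpha=2.35, n_total=1000, rng=rng)
print(len(m_obs))   # roughly half the cluster survives the cuts
```

Running ABC then amounts to calling this simulator with α drawn from the U[0, 6] prior and comparing the output to the observed masses.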
Simulation Study: summary statistics

We want to account for the following with our summary statistics and distance functions:

1. Shape of the observed Mass Function

ρ1(msim, mobs) = [ ∫ { f_{log msim}(x) − f_{log mobs}(x) }² dx ]^(1/2)
2. Number of stars observed

ρ2(msim, mobs) = |1 − nsim/nobs|

msim = masses of the stars simulated from the forward model
mobs = masses of the observed stars
nsim = number of stars simulated from the forward model
nobs = number of observed stars
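The two distances can be sketched directly, estimating the log-mass densities f_{log m} with a kernel density estimate (scipy's gaussian_kde is an assumption here; the slides don't specify the density estimator, and the grid and test samples are illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

def rho1(m_sim, m_obs):
    """Integrated squared difference between KDEs of the log masses."""
    f_sim = gaussian_kde(np.log10(m_sim))
    f_obs = gaussian_kde(np.log10(m_obs))
    x = np.linspace(-0.5, 1.5, 400)              # grid covering both samples
    diff2 = (f_sim(x) - f_obs(x)) ** 2
    return np.sqrt(np.sum(diff2) * (x[1] - x[0]))  # grid approximation of the integral

def rho2(m_sim, m_obs):
    """Penalty for simulating the wrong number of observed stars."""
    return abs(1.0 - len(m_sim) / len(m_obs))

rng = np.random.default_rng(6)
a = rng.lognormal(0.5, 0.3, size=500)
b = rng.lognormal(0.5, 0.3, size=480)   # same distribution as a
c = rng.lognormal(1.0, 0.3, size=500)   # shifted distribution
print(rho1(a, b), rho1(a, c))           # same distribution gives the smaller rho1
print(rho2(b, a))                       # |1 - 480/500| = 0.04
```

ρ1 targets the shape of the mass function and ρ2 the number of survivors, so together they penalize proposals that get either the slope or the observational cuts wrong.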
Simulation Study

1. Draw n = 10³ stars
2. IMF slope α = 2.35 with Mmin = 2 and Mmax = 60
3. N = 10³ particles
4. T = 30 sequential time steps

[Figure: histogram of the observed Mass Function (density vs. mass).]
[Figure, two panels: kernel density estimates of the full IMF (N = 1000, bandwidth = 0.4971) and the observed MF (N = 510, bandwidth = 0.5625). Sample size = 1000 stars, [Cmin, Cmax] = [2, 4], σ = 0.25]
Simulation Study results

1. Draw n = 10³ stars
2. IMF slope α = 2.35 with Mmin = 2 and Mmax = 60
3. N = 10³ particles
4. T = 30 sequential time steps

[Figure, two panels. (A) Observed Mass Function: Log10(fM) vs. Log10(M), showing the observed MF, 95% credible band, and posterior median. (B) Posteriors for α: initial, intermediate, and final ABC posteriors approaching the true posterior, with the input value marked.]
Summary

ABC can be a useful tool when data are too complex to define a reasonable likelihood, e.g. the Stellar Initial Mass Function.

Selection of good summary statistics is crucial for the ABC posterior to be meaningful.

THANK YOU!
Bibliography

Akeret, J., Refregier, A., Amara, A., Seehars, S., and Hasner, C. (2015), "Approximate Bayesian computation for forward modeling in cosmology," Journal of Cosmology and Astroparticle Physics, 2015, 043.
Bate, M. R. (2012), "Stellar, brown dwarf and multiple star properties from a radiation hydrodynamical simulation of star cluster formation," Monthly Notices of the Royal Astronomical Society, 419, 3115–3146.
Beaumont, M. A., Cornuet, J.-M., Marin, J.-M., and Robert, C. P. (2009), "Adaptive approximate Bayesian computation," Biometrika, 96, 983–990.
Bonassi, F. V. and West, M. (2004), "Sequential Monte Carlo with Adaptive Weights for Approximate Bayesian Computation," Bayesian Analysis, 1, 1–19.
Cameron, E. and Pettitt, A. N. (2012), "Approximate Bayesian Computation for Astronomical Model Analysis: A Case Study in Galaxy Demographics and Morphological Transformation at High Redshift," Monthly Notices of the Royal Astronomical Society, 425, 44–65.
Chabrier, G. (2003), "The galactic disk mass function: Reconciliation of the Hubble Space Telescope and nearby determinations," The Astrophysical Journal Letters, 586, L133.
Ishida, E. E. O., Vitenti, S. D. P., Penna-Lima, M., Cisewski, J., de Souza, R. S., Trindade, A. M. M., Cameron, E., and Busti, V. C. (2015), "cosmoabc: Likelihood-free inference via Population Monte Carlo Approximate Bayesian Computation," Astronomy and Computing, 13, 1–11.
Kroupa, P. (2001), "On the variation of the initial mass function," Monthly Notices of the Royal Astronomical Society, 322, 231–246.
Moral, P. D., Doucet, A., and Jasra, A. (2011), "An adaptive sequential Monte Carlo method for approximate Bayesian computation," Statistics and Computing, 22, 1009–1020.
Pritchard, J. K., Seielstad, M. T., and Perez-Lezaun, A. (1999), "Population Growth of Human Y Chromosomes: A Study of Y Chromosome Microsatellites," Molecular Biology and Evolution, 16, 1791–1798.
Salpeter, E. E. (1955), "The luminosity function and stellar evolution," The Astrophysical Journal, 121, 161.
Schafer, C. M. and Freeman, P. E. (2012), Statistical Challenges in Modern Astronomy V, Springer, chap. 1, Lecture Notes in Statistics, pp. 3–19.
Weyant, A., Schafer, C., and Wood-Vasey, W. M. (2013), "Likelihood-free cosmological inference with type Ia supernovae: approximate Bayesian computation for a complete treatment of uncertainty," The Astrophysical Journal, 764, 116.