TRANSCRIPT
Harrison B. Prosper Workshop on Top Physics, Grenoble
Bayesian Statistics in Analysis
Harrison B. Prosper, Florida State University
Workshop on Top Physics: from the TeVatron to the LHC
October 19, 2007
Outline
Introduction
Inference
Model Selection
Summary
Introduction

Blaise Pascal (1670)
Thomas Bayes (1763)
Pierre Simon de Laplace (1812)
Introduction

Let P(A) and P(B) be probabilities, assigned to statements, or events, A and B, and let P(AB) be the probability assigned to the joint statement AB. Then the conditional probability of A given B is defined by

P(A | B) = P(AB) / P(B)

P(A) is the probability of A without the restriction specified by B.
P(A | B) is the probability of A when we restrict to the conditions specified by statement B.

Likewise, the conditional probability of B given A is

P(B | A) = P(AB) / P(A)
Introduction

From P(A | B) P(B) = P(AB) = P(B | A) P(A) we deduce immediately Bayes' Theorem:

P(A | B) = P(B | A) P(A) / P(B)

Bayesian statistics is the application of Bayes' theorem to problems of inference.
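As a quick illustration (the two hypotheses and all probability values below are invented, not from the talk), Bayes' theorem for a discrete pair of models can be computed directly:

```python
# Hypothetical two-model example: applying Bayes' theorem to get
# Pr(Model | Data) from Pr(Data | Model) and Pr(Model).

prior = {"signal": 0.1, "background": 0.9}        # Pr(Model)
likelihood = {"signal": 0.8, "background": 0.2}   # Pr(Data | Model)

# Pr(Data) = sum over models of Pr(Data | Model) Pr(Model)
p_data = sum(likelihood[m] * prior[m] for m in prior)

# Pr(Model | Data) = Pr(Data | Model) Pr(Model) / Pr(Data)
posterior = {m: likelihood[m] * prior[m] / p_data for m in prior}
print(posterior)
```

The posterior probabilities sum to one by construction, since Pr(Data) is exactly the normalization over the two models.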
Inference

The Bayesian approach to inference is conceptually simple and always the same:

Compute Pr(Data | Model)
Compute Pr(Model | Data) = Pr(Data | Model) Pr(Model) / Pr(Data)

Pr(Model) is called the prior. It is the probability assigned to the Model irrespective of the Data.
Pr(Data | Model) is called the likelihood.
Pr(Model | Data) is called the posterior probability.
Inference

In practice, inference is done using the continuous form of Bayes' theorem:

p(θ, λ | D) = p(D | θ, λ) π(θ, λ) / ∫ p(D | θ, λ) π(θ, λ) dθ dλ

where p(D | θ, λ) is the likelihood, π(θ, λ) is the prior density, and p(θ, λ | D) is the posterior density. Here θ are the parameters of interest; λ denote all other parameters of the problem, which are referred to as nuisance parameters. The nuisance parameters are removed by marginalization:

p(θ | D) = ∫ p(θ, λ | D) dλ
Example – 1

Model: s is the mean signal count, b is the mean background count
Datum: D = {N}
Likelihood: P(D | s, b) = Poisson(N, s + b)
Prior information: 0 ≤ s ≤ s_max, together with an estimate b̂ of the background mean b
Task: Infer s, given N
Example – 1

Apply Bayes' theorem:

p(s, b | D) = P(D | s, b) π(s, b) / ∫∫ P(D | s, b) π(s, b) ds db

π(s, b) is the prior density for s and b, which encodes our prior knowledge of the signal and background means. The encoding is often difficult and can be controversial.
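This posterior can be evaluated numerically on a grid. The sketch below assumes an illustrative observed count N, a flat prior, and finite grid ranges; none of these values come from the talk:

```python
import math

# Grid evaluation of p(s, b | D) = P(D | s, b) pi(s, b) / normalization,
# followed by marginalization over b. N, the flat prior, and the grid
# ranges are illustrative assumptions.

N = 5                                      # observed count
ds, db = 0.05, 0.05
s_grid = [i * ds for i in range(401)]      # 0 <= s <= 20
b_grid = [j * db for j in range(401)]      # 0 <= b <= 20

def poisson(n, mu):
    return math.exp(-mu) * mu**n / math.factorial(n)

# With a flat prior pi(s, b) = constant, the prior cancels between the
# numerator and the normalization integral.
joint = [[poisson(N, s + b) for b in b_grid] for s in s_grid]
norm = sum(sum(row) for row in joint) * ds * db

# p(s | D): marginalize the posterior over the nuisance parameter b
p_s = [sum(row) * db / norm for row in joint]
print(sum(p * ds for p in p_s))            # ~1: properly normalized
```

The same grid-and-marginalize pattern carries over directly when π(s, b) is not flat: multiply each grid cell by the prior before normalizing.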
Example – 1

First factor the prior,

π(s, b) = π(b | s) π(s) = π(b) π(s)

(the last step taking s and b to be independent). Define the marginal likelihood

l(D | s) = ∫ P(D | s, b) π(b) db

and write the posterior density for the signal as

p(s | D) = l(D | s) π(s) / ∫ l(D | s) π(s) ds
Example – 1

The Background Prior Density

Suppose that the background has been estimated from a Monte Carlo simulation of the background process, yielding B events that pass the cuts. Assume that the probability for the count B is given by P(B | λ) = Poisson(B, λ), where λ is the (unknown) mean count of the Monte Carlo sample. We can infer the value of λ by applying Bayes' theorem to the Monte Carlo background experiment:

p(λ | B) = P(B | λ) π(λ) / ∫ P(B | λ) π(λ) dλ
Example – 1

The Background Prior Density

Assuming a flat prior π(λ) = constant, we find

p(λ | B) = Gamma(λ; 1, B+1) = λ^B exp(−λ) / B!

Often the mean background count b in the real experiment is related to the mean count λ in the Monte Carlo experiment linearly, b = kλ, where k is an accurately known scale factor, for example, the ratio of the data to Monte Carlo integrated luminosities. The background can then be estimated as

b̂ = kB,  δb = k√B
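A small numerical check of this result (the values of B and k below are invented): the Gamma posterior has mean B + 1 and standard deviation √(B + 1), consistent with b̂ = kB and δb = k√B for large B.

```python
import math

# Check the Gamma posterior p(lambda | B) = lambda^B exp(-lambda) / B!
# numerically. B and k are illustrative assumptions.

B = 100       # Monte Carlo events passing the cuts
k = 0.25      # hypothetical data/MC luminosity ratio

def gamma_post(lam):
    return lam**B * math.exp(-lam) / math.factorial(B)

# posterior mean by direct numerical integration; should be ~ B + 1
dlam = 0.01
mean = sum(lam * gamma_post(lam) * dlam
           for lam in (i * dlam for i in range(1, 30000)))
print(mean)                          # ~ B + 1 = 101
print(k * B, k * math.sqrt(B))       # background estimate and uncertainty
```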
Example – 1

The Background Prior Density

The posterior density p(λ | B) now serves as the prior density for the background b in the real experiment:

π(b) = p(λ | B), where b = kλ.

We can write

l(D | s) = ∫ P(D | s, kλ) p(λ | B) dλ

and

p(s | D) = l(D | s) π(s) / ∫ l(D | s) π(s) ds
Example – 1

The calculation of the marginal likelihood yields:

l(D | s) = ∫_0^∞ [e^−(s+kλ) (s + kλ)^N / N!] [λ^B e^−λ / B!] dλ
         = e^−s Σ_{r=0}^{N} (s^r / r!) [k^(N−r) / (1 + k)^(N−r+B+1)] [(N − r + B)! / ((N − r)! B!)]
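The closed-form sum can be checked against direct numerical integration of the integrand above; in this sketch the values of s, N, B, and k are arbitrary illustrations:

```python
import math

# Verify the closed-form marginal likelihood against numerical
# integration. The values passed in at the bottom are illustrative.

def l_closed(s, N, B, k):
    # l(D|s) = exp(-s) * sum_r (s^r/r!) (k^(N-r)/(1+k)^(N-r+B+1))
    #                          ((N-r+B)! / ((N-r)! B!))
    total = 0.0
    for r in range(N + 1):
        total += (s**r / math.factorial(r)
                  * k**(N - r) / (1 + k)**(N - r + B + 1)
                  * math.factorial(N - r + B)
                  / (math.factorial(N - r) * math.factorial(B)))
    return math.exp(-s) * total

def l_numeric(s, N, B, k, dlam=0.001, lam_max=100.0):
    # midpoint integration of Poisson(N, s + k*lam) * lam^B e^-lam / B!
    total = 0.0
    lam = dlam / 2
    while lam < lam_max:
        mu = s + k * lam
        total += (math.exp(-mu) * mu**N / math.factorial(N)
                  * lam**B * math.exp(-lam) / math.factorial(B)) * dlam
        lam += dlam
    return total

print(l_closed(3.0, N=5, B=10, k=0.5), l_numeric(3.0, N=5, B=10, k=0.5))
```

The two results agree to high precision, which is a useful sanity check whenever such sums are coded by hand.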
Example – 2: Top Mass – Run I

Data are partitioned into K bins and modeled by a sum of N sources of strength p. The numbers A are the source distributions for model M. Each M corresponds to a different top signal + background model.

Model: d_i = Σ_{j=1}^{N} a_ji p_j

Likelihood: P(D | a, p, M) = Π_{i=1}^{K} exp(−d_i) d_i^{D_i} / D_i!

Prior: π(a, p | M) = π(p) Π_{j,i} exp(−a_ji) a_ji^{A_ji} / A_ji!

Posterior: P(M | D) ∝ P(D | M) = ∫∫ P(D | a, p, M) π(a, p | M) da dp
Example – 2: Top Mass – Run I

[Figure: probability of model M, P(M | D), versus the top quark mass (GeV/c²)]

m_top = 173.5 ± 4.5 GeV
s = 33 ± 8 events
b = 50.8 ± 8.3 events
To Bin Or Not To Bin

Binned – Pros
- Likelihood can be modeled accurately
- Bins with low counts can be handled exactly
- Statistical uncertainties handled exactly

Binned – Cons
- Information loss can be severe
- Suffers from the curse of dimensionality
December 8, 2006 - Binned likelihoods do work!
To Bin Or Not To Bin
To Bin Or Not To Bin

Un-Binned – Pros
- No loss of information (in principle)

Un-Binned – Cons
- Can be difficult to model the likelihood accurately; requires fitting (either parametric or KDE)
- Error in the likelihood grows approximately linearly with the sample size, so at the LHC large sample sizes could become an issue
Un-binned Likelihood Functions

Start with the standard binned likelihood over K bins,

P(D | a, b) = Π_{i=1}^{K} exp(−d_i) d_i^{D_i} / D_i!
            = exp(−Σ_{i=1}^{K} d_i) Π_{i=1}^{K} d_i^{D_i} / D_i!

with the model d_i = a_i + b_i.
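The second equality is just the factorization of a product of independent per-bin Poisson probabilities; a quick numerical check (all counts and means below are invented):

```python
import math

# The binned likelihood is a product of independent Poisson terms, one
# per bin. Counts D and means a, b are illustrative.

D = [3, 7, 2, 0, 5]               # observed counts per bin
a = [1.0, 4.0, 1.5, 0.2, 3.0]     # signal means a_i
b = [1.5, 2.0, 0.5, 0.4, 1.0]     # background means b_i
d = [ai + bi for ai, bi in zip(a, b)]   # model: d_i = a_i + b_i

product_form = math.prod(
    math.exp(-di) * di**Di / math.factorial(Di) for di, Di in zip(d, D))

factored_form = (math.exp(-sum(d))
                 * math.prod(di**Di / math.factorial(Di)
                             for di, Di in zip(d, D)))

print(product_form, factored_form)   # identical up to rounding
```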
Un-binned Likelihood Functions

Make the bins smaller and smaller,

d_i = d(x_i) Δx_i = [σ a(x_i) + b(x_i)] Δx_i

and the likelihood becomes

P(D | σ, A, B) = exp[−∫ (σ a(x) + b(x)) dx] Π_{i=1}^{K} [σ a(x_i) + b(x_i)] Δx_i
               ∝ exp[−(σA + B)] Π_{i=1}^{K} [σ a(x_i) + b(x_i)]

where K is now the number of events, a(x) and b(x) are the effective luminosity and background densities, respectively, and A and B are their integrals.
Un-binned Likelihood Functions

The un-binned likelihood function

p(D | σ, A, B) = exp[−(σA + B)] Π_{i=1}^{K} [σ a(x_i) + b(x_i)]

is an example of a marked Poisson likelihood. Each event is marked by the discriminating variable x_i, which could be multi-dimensional.

The various methods for measuring the top cross section and mass differ in the choice of discriminating variables x.
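A marked-Poisson log-likelihood is easy to evaluate once the densities are specified. In the sketch below, σ, a(x), b(x), and the event marks are toy choices, not anything from the analysis:

```python
import math

# Toy marked-Poisson log-likelihood:
# log p(D | sigma, A, B) = -(sigma*A + B) + sum_i ln[sigma*a(x_i) + b(x_i)]
# All densities and event values are illustrative assumptions.

sigma = 2.0
A, Bint = 1.0, 10.0             # integrals of a(x) and b(x) over [0, 1]

def a(x):                        # effective-luminosity density on [0, 1]
    return 2.0 * x               # integrates to A = 1

def b(x):                        # background density on [0, 1]
    return 20.0 * (1.0 - x)      # integrates to Bint = 10

events = [0.9, 0.8, 0.7, 0.3, 0.75, 0.6, 0.2, 0.85]   # marks x_i

log_like = -(sigma * A + Bint) + sum(
    math.log(sigma * a(x) + b(x)) for x in events)
print(log_like)
```

In practice one would scan σ (and any other parameters) over this function, or feed it to a posterior-sampling step as in the slides.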
Un-binned Likelihood Functions

Note: Since the functions a(x) and b(x) have to be modeled, they will depend on sets of modeling parameters α and β, respectively. Therefore, in general, the un-binned likelihood function is

p(D | σ, A, B, α, β) = exp[−(σA + B)] Π_{i=1}^{K} [σ a(x_i, α) + b(x_i, β)]

which must be combined with a prior density π(σ, A, B, α, β) to compute the posterior density for the cross section:

p(σ | D) ∝ ∫ p(D | σ, A, B, α, β) π(σ, A, B, α, β) dA dB dα dβ
Computing the Un-binned Likelihood Function

If we write s(x) = σ a(x) and S = σA, we can re-write the un-binned likelihood function as

p(D | S, B) = exp[−(S + B)] Π_{i=1}^{K} [s(x_i) + b(x_i)]

Since a likelihood function is defined only to within a scaling by a parameter-independent quantity, we are free to scale it by, for example, the observed distribution d(x):

p(D | S, B) = exp[−(S + B)] Π_{i=1}^{K} [s(x_i) + b(x_i)] / d(x_i)
Computing the Un-binned Likelihood Function

One way to approximate the ratio [s(x) + b(x)] / d(x) is with a neural network function trained with an admixture of data, signal, and background in the ratio 2:1:1. If the training can be done accurately enough, the network will approximate

n(x) = [s(x) + b(x)] / [s(x) + b(x) + d(x)]

in which case we can then write

p(D | S, B) = exp[−(S + B)] Π_{i=1}^{K} n(x_i) / [1 − n(x_i)]
Model Selection

Model selection can also be addressed using Bayes' theorem. It requires computing

P(M | D) = p(D | M) P(M) / p(D)

where the evidence for model M is defined by

p(D | M) = ∫∫ p(D | θ_M, λ_M, M) π(θ_M, λ_M | M) dθ_M dλ_M
Model Selection

posterior odds = Bayes factor × prior odds:

P(M | D) / P(N | D) = [p(D | M) / p(D | N)] × [P(M) / P(N)]

The Bayes factor, B_MN = p(D | M) / p(D | N), or any one-to-one function thereof, can be used to choose between two competing models M and N, e.g., signal + background versus background only.
However, one must be careful to use proper priors.
Model Selection – Example

Consider the following two prototypical models:

Model 1: P(D | s, b, 1) = Poisson(N, s + b), with prior π(s, b)
Model 2: P(D | b, 2) = Poisson(N, b), with prior π(b)

The Bayes factor for these models is given by

B_12 = ∫∫ Poisson(N, s + b) π(s, b) ds db / ∫ Poisson(N, b) π(b) db
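This Bayes factor is simple to compute numerically for simple priors. In the sketch below, b is taken as known exactly (a delta-function prior) and s is given a flat, proper prior on [0, s_max]; N, b, and s_max are all invented values:

```python
import math

# Bayes factor for signal+background (Model 1) vs background-only
# (Model 2), with b known exactly and a flat proper prior for s on
# [0, s_max]. All numerical values are illustrative.

N, b, s_max = 15, 5.0, 30.0

def poisson(n, mu):
    return math.exp(-mu) * mu**n / math.factorial(n)

ds = 0.001
# numerator: integral over s of Poisson(N, s + b) * (1 / s_max)
num = sum(poisson(N, s + b) / s_max
          for s in (i * ds + ds / 2 for i in range(int(s_max / ds)))) * ds
den = poisson(N, b)          # Model 2 likelihood at the known b

B12 = num / den
print(B12)                   # > 1: these data favor the signal model
```

Because the flat prior here is proper (it integrates to one over [0, s_max]), the resulting Bayes factor is well defined, in line with the warning about proper priors above.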
Model Selection – Example

Calibration of Bayes Factors

Consider the quantity (called the Kullback-Leibler divergence)

k(2 || 1) = 2 ∫ P(D | 2) ln[P(D | 2) / P(D | 1)] dD

For the simple Poisson models with known signal and background, it is easy to show that

k(2 || 1) = 2 [s − b ln(1 + s/b)]

For s << b, we get √k(2||1) ≈ s/√b. That is, roughly speaking, for s << b, √(2 ln B_12) ≈ s/√b.
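Both the exact formula and the s/√b limit can be verified numerically; the s and b values below are illustrative:

```python
import math

# Verify the calibration: k(2||1) computed as an explicit expectation
# over the background-only Poisson distribution should match
# 2*[s - b*ln(1 + s/b)], and sqrt(k) should approach s/sqrt(b) for s << b.

def log_poisson(n, mu):
    # ln Poisson(n, mu), in log space to avoid overflow at large n
    return -mu + n * math.log(mu) - math.lgamma(n + 1)

def k_direct(s, b, n_max=200):
    return 2.0 * sum(math.exp(log_poisson(n, b))
                     * (log_poisson(n, b) - log_poisson(n, s + b))
                     for n in range(n_max))

def k_formula(s, b):
    return 2.0 * (s - b * math.log(1.0 + s / b))

s, b = 1.0, 50.0
print(k_direct(s, b), k_formula(s, b))                 # agree
print(math.sqrt(k_formula(s, b)), s / math.sqrt(b))    # nearly equal
```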
Summary

Bayesian statistics is a well-founded and general framework for thinking about and solving analysis problems, including:
- Analysis design
- Modeling uncertainty
- Parameter estimation
- Interval estimation (limit setting)
- Model selection
- Signal/background discrimination
- etc.

It is well worth learning how to think this way!