Wouter Verkerke, NIKHEF

RooFit: A tool kit for data modeling in ROOT

Wouter Verkerke (NIKHEF), David Kirkby (UC Irvine)

Focus: coding a probability density function

• Focus on one practical aspect of many data analyses in HEP: how do you formulate your p.d.f. in ROOT?
– For ‘simple’ problems (Gaussian, polynomial), ROOT's built-in models are sufficient
– But if you want to do unbinned ML fits, use non-trivial functions, or work with multidimensional functions, you quickly run into trouble

A recent example

• BaBar experiment at SLAC: extract sin(2β) from time-dependent CP violation of B decay: e+e− → Υ(4S) → BB̄
– Reconstruct both Bs, measure decay time difference
– Physics of interest is in the decay-time-dependent oscillation

• Many issues arise
– Standard ROOT function framework clearly insufficient to handle such complicated functions → must develop new framework
– Normalization of p.d.f. not always trivial to calculate → may need numeric integration techniques
– Unbinned fit, >2 dimensions, many events → computation performance important → must try to optimize code for acceptable performance
– Simultaneous fit to control samples to account for detector performance

$$
f_{sig} \cdot \mathrm{SigSel}(m; p_{sig}) \cdot \left( \mathrm{SigDecay}(t; q_{sig}, \sin(2\beta)) \otimes \mathrm{SigResol}(t \,|\, dt; r_{sig}) \right)
+ (1 - f_{sig}) \cdot \mathrm{BkgSel}(m; p_{bkg}) \cdot \left( \mathrm{BkgDecay}(t; q_{bkg}) \otimes \mathrm{BkgResol}(t \,|\, dt; r_{bkg}) \right)
$$

A recent example

• Initial approach at BaBar: write it from scratch (in FORTRAN!)
– Does its designated job quite well, but took a long time to develop
– Possible because the sin(2β) effort was supported by O(50) people
– Optimization of ML calculations hand-coded → error prone and not easy to maintain
– Difficult to transfer knowledge/code from one analysis to another

• A better solution: a modeling language in C++ that integrates seamlessly into ROOT
– Recycle code and knowledge

• Development of RooFit package
– Started 5 years ago
– Very successful: virtually everybody in BaBar uses it → now in standard ROOT distribution

RooFit – a data modeling language for ROOT

Extension to ROOT – (Almost) no overlap with existing functionality

[Diagram, ROOT side: C++ command line interface & macros; Data management & histogramming; Graphics interface; I/O support; MINUIT]

[Diagram, RooFit side: Data modeling; Toy MC data generation; Data/model fitting; Model visualization]

Data modeling - Desired functionality

Building/Adjusting Models
– Easy to write basic PDFs (→ normalization)
– Easy to compose complex models (modular design)
– Reuse of existing functions
– Flexibility: no arbitrary implementation-related restrictions

Using Models
– Fitting: binned/unbinned (extended) MLL fits, Chi2 fits
– Toy MC generation: generate MC datasets from any model
– Visualization: slice/project model & data in any possible way
– Speed: should be as fast or faster than hand-coded model

[Side label: Analysis work cycle]

Data modeling – OO representation

• Idea: represent math symbols as C++ objects
– Result: 1 line of code per symbol in a function (the C++ constructor) rather than 1 line of code per function

| Mathematical concept | Symbol | RooFit class |
|---|---|---|
| variable | $x, p$ | RooRealVar |
| function | $f(x)$ | RooAbsReal |
| PDF | $F(x; p, q)$ | RooAbsPdf |
| space point | $x$ | RooArgSet |
| list of space points | $x_k$ | RooAbsData |
| integral | $\int_{x_{min}}^{x_{max}} f(x)\,dx$ | RooRealIntegral |

Data modeling – Constructing composite objects

• Straightforward correlation between mathematical representation of formula and RooFit code

Math: gauss(x, m, √s)

RooFit diagram: RooGaussian g takes RooRealVar x, RooRealVar m and RooFormulaVar sqrts as inputs; RooFormulaVar sqrts takes RooRealVar s as input

RooFit code:
RooRealVar x("x","x",-10,10) ;
RooRealVar m("m","mean",0) ;
RooRealVar s("s","sigma",2,0,10) ;
RooFormulaVar sqrts("sqrts","sqrt(s)",s) ;
RooGaussian g("g","gauss",x,m,sqrts) ;
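To make the "math symbols as C++ objects" idea concrete, here is a minimal standalone sketch of how a composite object can evaluate itself through its inputs. It does not use RooFit and is not RooFit's implementation; the class names `AbsReal`, `RealVar`, `SqrtOf` and `GaussShape` are hypothetical stand-ins chosen for illustration, and the Gaussian is left unnormalized.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <utility>

// Base class for anything that yields a real value (stand-in for the "function" concept).
struct AbsReal {
    virtual ~AbsReal() = default;
    virtual double value() const = 0;
};

// A named variable holding a number (stand-in for the "variable" concept).
struct RealVar : AbsReal {
    std::string name;
    double current;
    RealVar(std::string n, double v) : name(std::move(n)), current(v) {}
    double value() const override { return current; }
};

// sqrt(arg), recomputed from its input every time it is asked.
struct SqrtOf : AbsReal {
    const AbsReal& arg;
    explicit SqrtOf(const AbsReal& a) : arg(a) {}
    double value() const override {
        assert(arg.value() >= 0.0);
        return std::sqrt(arg.value());
    }
};

// Unnormalized Gaussian shape exp(-(x-mean)^2 / (2 sigma^2)) of three inputs.
struct GaussShape : AbsReal {
    const AbsReal& x;
    const AbsReal& mean;
    const AbsReal& sigma;
    GaussShape(const AbsReal& x_, const AbsReal& m_, const AbsReal& s_)
        : x(x_), mean(m_), sigma(s_) {}
    double value() const override {
        const double z = (x.value() - mean.value()) / sigma.value();
        return std::exp(-0.5 * z * z);
    }
};
```

With these definitions, `RealVar s("s", 4.0); SqrtOf sqrts(s); GaussShape g(x, m, sqrts);` mirrors the RooFit lines one constructor per symbol, and changing `s.current` changes `g.value()` without touching `g` itself.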

Model building – (Re)using standard components

• RooFit provides a collection of compiled standard PDF classes
– Basic: Gaussian, Exponential, Polynomial, Chebychev polynomial, …
– Physics inspired: ARGUS, Crystal Ball, Breit-Wigner, Voigtian, B/D-Decay, …
– Non-parametric: Histogram, KEYS

[Figure: example shapes for RooArgusBG, RooPolynomial, RooBMixDecay, RooHistPdf, RooGaussian]

Easy to extend the library: each p.d.f. is a separate C++ class

Importance of a good and extendable model library

• If ‘new’ p.d.f.s are made available in an easy-to-use form, more people will use them

• Example 1: non-parametric kernel estimation technique
– See article by Kyle Cranmer
– People like it, but are put off by the prospect of thoroughly reading the entire article, coding the concept from scratch, and then using it
– In RooFit, switching from a histogram-based shape to a KEYS shape is a one-line code change (RooHistPdf → RooKeysPdf)

• Example 2: Chebychev polynomials
– It's easy to convince people to try them if all they need to do is a one-line code change (RooPolynomial → RooChebychev)

Model building – (Re)using standard components

• Library p.d.f.s can be adjusted on the fly
– Just plug in any function expression you like as input variable
– Works universally, even for classes you write yourself

• Maximum flexibility of library shapes keeps library small

g(x; m, s) with m(y; a0, a1) → g(x, y; a0, a1, s)

RooPolyVar m("m","m",y,RooArgList(a0,a1)) ;  // title argument added: the RooPolyVar constructor takes (name, title, x, coefficients)
RooGaussian g("g","gauss",x,m,s) ;

Handling of p.d.f. normalization

• Normalization of (component) p.d.f.s to unity is often a good part of the work of writing a p.d.f.

• RooFit handles most normalization issues transparently to the user
– A p.d.f. can advertise (multiple) analytical expressions for integrals
– When no analytical expression is provided, RooFit will automatically perform numeric integration to obtain normalization
– More complicated than it seems: even if gauss(x,m,s) can be integrated analytically over x, gauss(f(x),m,s) cannot. Such use cases are automatically recognized
– Multi-dimensional integrals can be a combination of numeric and p.d.f.-provided analytical partial integrals

• Variety of numeric integration techniques is interfaced
– Adaptive trapezoid, Gauss-Kronrod, VEGAS MC, …
– User can override parameters globally or per p.d.f. as necessary
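The numeric fallback described above can be illustrated with a small standalone sketch: a shape without a known analytical integral is divided by its numerically computed integral over the observable range. This is not RooFit code. It uses a fixed-step trapezoid rule instead of the adaptive integrators named on the slide, and the function names (`integrateTrapezoid`, `gaussShape`, `normalizedGauss`) are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Composite trapezoid rule for the integral of f over [lo, hi] with n equal steps.
double integrateTrapezoid(const std::function<double(double)>& f,
                          double lo, double hi, int n = 10000) {
    assert(n > 0 && hi > lo);
    const double h = (hi - lo) / n;
    double sum = 0.5 * (f(lo) + f(hi));
    for (int i = 1; i < n; ++i) sum += f(lo + i * h);
    return sum * h;
}

// Unnormalized Gaussian shape: no 1/(sigma*sqrt(2*pi)) prefactor.
double gaussShape(double x, double mean, double sigma) {
    const double z = (x - mean) / sigma;
    return std::exp(-0.5 * z * z);
}

// Value of the p.d.f. at x, normalized to unity over the observable range [lo, hi].
// The normalization integral is computed numerically, so the same code would work
// if gaussShape were replaced by a shape with no closed-form integral.
double normalizedGauss(double x, double mean, double sigma, double lo, double hi) {
    const double norm = integrateTrapezoid(
        [=](double u) { return gaussShape(u, mean, sigma); }, lo, hi);
    return gaussShape(x, mean, sigma) / norm;
}
```

Note that the normalization depends on the range of the observable: the same shape restricted to a narrower range gets a larger prefactor. Recomputing the integral on every call, as done here for brevity, is exactly the kind of cost that the caching optimizations discussed later in the talk avoid.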

Model building – (Re)using standard components

• Most physics models can be composed from ‘basic’ shapes

[Figure: RooAddPdf (+) sums component shapes such as RooBMixDecay, RooPolynomial, RooHistPdf, RooArgusBG and RooGaussian]

Model building – (Re)using standard components

• Most physics models can be composed from ‘basic’ shapes

[Figure: RooProdPdf (*) multiplies component shapes such as RooBMixDecay, RooPolynomial, RooHistPdf, RooArgusBG and RooGaussian]

$$k(x, y) = f(x \,|\, y) \cdot g(y)$$

RooProdPdf k("k","k",g, Conditional(f,x)) ;

$$h(x, y) = f(x) \cdot g(y)$$

RooProdPdf h("h","h", RooArgSet(f,g)) ;

Using models - Overview

• All RooFit models provide universal and complete fitting and toy Monte Carlo generating functionality
– Model complexity only limited by available memory and CPU power
– Fitting/plotting a 5-D model is as easy as using a 1-D model
– Most operations are one-liners

Fitting (RooAbsData → RooAbsPdf): gauss.fitTo(data)

Generating (RooAbsPdf → RooDataSet): data = gauss.generate(x,1000)

Fitting functionality

• Binned or unbinned ML fit?
– For RooFit this is essentially the same!
– Nice example of C++ abstraction through inheritance

• Fitting interface takes abstract RooAbsData object
– Supply unbinned data object (created from TTree) → unbinned fit
– Supply histogram object (created from THx) → binned fit

[Figure: unbinned data as a table of x, y, z values (RooDataSet) and binned data (RooDataHist); both derive from RooAbsData]

pdf.fitTo(data)
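Why binned and unbinned fits can share one interface is easiest to see in a sketch: if every dataset exposes points with weights, a single likelihood routine serves both. The code below is a standalone illustration, not RooFit's implementation. The names `AbsData`, `UnbinnedData`, `BinnedData` and `negLogLikelihood` are hypothetical, and evaluating the p.d.f. at the bin center is a simplifying assumption (a more careful binned likelihood would integrate the p.d.f. over each bin).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Abstract dataset: a list of points, each carrying a weight.
struct AbsData {
    virtual ~AbsData() = default;
    virtual std::size_t size() const = 0;
    virtual double x(std::size_t i) const = 0;       // coordinate of entry i
    virtual double weight(std::size_t i) const = 0;  // number of events entry i stands for
};

// Unbinned data: every event is stored and has weight 1.
struct UnbinnedData : AbsData {
    std::vector<double> events;
    explicit UnbinnedData(std::vector<double> e) : events(std::move(e)) {}
    std::size_t size() const override { return events.size(); }
    double x(std::size_t i) const override { return events[i]; }
    double weight(std::size_t) const override { return 1.0; }
};

// Binned data: one entry per bin, placed at the bin center, weighted by the bin content.
struct BinnedData : AbsData {
    double lo, width;
    std::vector<double> counts;
    BinnedData(double lo_, double width_, std::vector<double> c)
        : lo(lo_), width(width_), counts(std::move(c)) { assert(width > 0.0); }
    std::size_t size() const override { return counts.size(); }
    double x(std::size_t i) const override { return lo + (i + 0.5) * width; }
    double weight(std::size_t i) const override { return counts[i]; }
};

// One likelihood for both cases: -sum_i w_i * log(pdf(x_i)).
double negLogLikelihood(const std::function<double(double)>& pdf, const AbsData& data) {
    double nll = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        const double w = data.weight(i);
        if (w > 0.0) nll -= w * std::log(pdf(data.x(i)));  // empty bins contribute nothing
    }
    return nll;
}
```

A minimizer only ever sees `negLogLikelihood`; whether the fit is binned or unbinned is decided by which concrete data object is passed in.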

Automated vs. hand-coded optimization of p.d.f.

• Automatic pre-fit PDF optimization
– Prior to each fit, the PDF is analyzed for possible optimizations
– Optimization algorithms:
  • Detection and precalculation of constant terms in any PDF expression
  • Function caching and lazy evaluation
  • Factorization of multi-dimensional problems wherever possible
– Optimizations are always tailored to the specific use in each fit
– Possible because OO structure of p.d.f. allows automated analysis of structure

• No need for users to hard-code optimizations
– Keeps your code understandable, maintainable and flexible without sacrificing performance
– Optimization concepts implemented by RooFit are applied consistently and completely to all PDFs
– Speedup of factor 3-10 reported in realistic complex fits

• Fit parallelization on multi-CPU hosts
– Option for automatic parallelization of fit function on multi-CPU hosts (no explicit or implicit support from user PDFs needed)
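The "function caching and lazy evaluation" item can be sketched with a dirty flag: a node recomputes only when it is asked for its value and one of its inputs has actually changed. This standalone example only illustrates the general technique; it is an assumption that RooFit's internal mechanism resembles it, and the class names `Cacheable`, `Param` and `CachedNode` are invented for the sketch.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Anything that caches a result and can be told that its inputs changed.
struct Cacheable {
    virtual ~Cacheable() = default;
    virtual void invalidate() = 0;
};

// A parameter that notifies dependent nodes only when its value really changes.
class Param {
public:
    explicit Param(double v) : value_(v) {}
    double value() const { return value_; }
    void addClient(Cacheable* c) {
        assert(c != nullptr);
        clients_.push_back(c);
    }
    void setValue(double v) {
        if (v == value_) return;  // unchanged (e.g. held constant in the fit): caches stay valid
        value_ = v;
        for (Cacheable* c : clients_) c->invalidate();
    }
private:
    double value_;
    std::vector<Cacheable*> clients_;
};

// A function node that recomputes lazily: only when asked, and only if marked dirty.
class CachedNode : public Cacheable {
public:
    explicit CachedNode(std::function<double()> f) : compute_(std::move(f)) {}
    void invalidate() override { dirty_ = true; }
    double value() {
        if (dirty_) {
            cache_ = compute_();
            dirty_ = false;
            ++evaluations_;
        }
        return cache_;
    }
    int evaluations() const { return evaluations_; }  // how often compute_ actually ran
private:
    std::function<double()> compute_;
    double cache_ = 0.0;
    bool dirty_ = true;
    int evaluations_ = 0;
};
```

A term whose parameters are all constant during a fit is computed once and then served from the cache on every later likelihood call, which is the effect the "precalculation of constant terms" bullet describes.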

MC Event generation

• Generate “toy Monte Carlo” samples from any PDF
– Accept/reject sampling method used by default

• MC generation method automatically optimized
– PDF can advertise internal generator if a more efficient method exists (e.g. Gaussian)
– Each generator request will use the most efficient accept/reject / internal generator combination available
– Operator PDFs (sum, product, …) distribute generation over components whenever possible
– Generation order for products with conditional PDFs is sorted out automatically
– Possible because OO structure of p.d.f. allows automated analysis of structure

pdf.generate(vars,nevt)

• Can also specify template data as input to generator

pdf.generate(vars,data)
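The default accept/reject method mentioned above works for any shape with a known upper bound, which is why it can serve as a universal fallback. Below is a minimal one-dimensional sketch in standalone C++. It is not RooFit's generator; the function name `generateAcceptReject` and the requirement that the caller supplies the maximum `shapeMax` are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Draw nEvents values distributed according to `shape` on [lo, hi] by accept/reject sampling.
// `shape` need not be normalized, but must satisfy 0 <= shape(x) <= shapeMax on [lo, hi].
std::vector<double> generateAcceptReject(const std::function<double(double)>& shape,
                                         double lo, double hi, double shapeMax,
                                         std::size_t nEvents, unsigned seed = 12345) {
    assert(hi > lo && shapeMax > 0.0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> pickX(lo, hi);
    std::uniform_real_distribution<double> pickY(0.0, shapeMax);

    std::vector<double> events;
    events.reserve(nEvents);
    while (events.size() < nEvents) {
        const double x = pickX(rng);                     // propose a point uniformly in the range
        if (pickY(rng) < shape(x)) events.push_back(x);  // keep it with probability shape(x)/shapeMax
    }
    return events;
}
```

The efficiency is the ratio of the area under the shape to the area of the bounding box, so sharply peaked shapes waste most proposals. That is the motivation for letting a p.d.f. advertise a dedicated internal generator where one exists.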

Using models – Plotting

• Model visualization geared towards ‘publication plots’, not interactive browsing → emphasis on 1-dimensional plots

• Simplest case: plotting a 1-D model over data
– Modular structure of composite p.d.f.s allows easy access to components for plotting
– Can show Poisson confidence intervals instead of sqrt(N) errors
– Can store plot with data and all curves as a single object
– Adaptive spacing of curve points to achieve 1‰ precision

RooPlot* frame = mes.frame() ;
data->plotOn(frame) ;
pdf->plotOn(frame) ;
pdf->plotOn(frame,Components("bkg")) ;
frame->Draw() ;

Visualizing multi-dimensional models

• IMHO: visualization of multi-dim. models is much more complicated than visualization of multi-dim. data

• Simplest scenario: projection plots
– Simple example: M(m,t) = f[S(m)*T(t)] + (1-f)[B(m)*U(t)]
– Provide intuitive behavior: projection of p.d.f. automatically follows projection of data. A 1-D plot frame remembers ‘hidden’ variables in the dataset; subsequent p.d.f.s in the same frame are automatically projected
– Works identically for non-factorizable p.d.f.s! Necessary integrals are calculated automatically, either analytically or numerically

RooPlot* frame = dt.frame() ;
data.plotOn(frame) ;
model.plotOn(frame) ;
model.plotOn(frame,Components("bkg")) ;

[Figure: mB distribution with projected curve $c(m) = \int M(m,t)\,dt$; t distribution with projected curve $c(t) = \int M(m,t)\,dm$]

Using models – plotting multi-dimensional models

• A projection plot usually doesn't capture the full information content of your p.d.f. → equally easy to plot a ‘slice’

Projection:
RooPlot* frame = dt.frame() ;
data.plotOn(frame) ;
model.plotOn(frame) ;
model.plotOn(frame,Components("bkg")) ;

Slice:
RooPlot* frame = dt.frame() ;
mes.setRange("sel",5.27,5.30) ;  // the slide reads dt.setRange; the range belongs on the mass observable (assumed to be named mes, as on the 1-D plotting slide)
data.plotOn(frame,CutRange("sel")) ;
model.plotOn(frame,ProjectionRange("sel")) ;
model.plotOn(frame,ProjectionRange("sel"),Components("bkg")) ;

[Figure: mB distribution; t distribution of the slice, with curve $c(t) = \int_{5.27}^{5.30} M(m,t)\,dm$]

More complicated ‘slice’ views such as a likelihood ratio plot only take O(3) lines of extra code

Using models – plotting

• Can also use template data to show ‘average’ curve
– Convenient for visualization of conditional p.d.f.s
– Example: plot the t distribution of the following model over data, averaging over the dt values of the data

$$M(m, t \,|\, dt)$$

$$\mathrm{Curve}(t) = \frac{1}{N_D} \sum_{i=1}^{N_D} \int M(m, t \,|\, dt_i)\, dm$$

model.plotOn(frame,ProjWData(dtData)) ;
model.plotOn(frame,Components("bkg"),ProjWData(dtData),LineStyle(kDashed)) ;

Advanced features – Task automation

• Support for routine task automation, e.g. useful in finding confidence intervals

Input model → Generate toy MC → Fit model (repeat N times) → Accumulate fit statistics → Distribution of
- parameter values
- parameter errors
- parameter pulls

// Instantiate MC study manager
RooMCStudy mgr(inputModel) ;

// Generate and fit 100 samples of 1000 events
mgr.generateAndFit(100,1000) ;

// Plot distribution of sigma parameter
mgr.plotParam(sigma)->Draw() ;
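The generate/fit/accumulate loop can be written out by hand for a toy case, which also makes explicit what a "pull" is: (fitted value − true value) / fitted error. The sketch below is standalone C++, not RooMCStudy. To stay self-contained it replaces the MINUIT fit with the closed-form ML estimate of a Gaussian mean with known width, and the names `ToyResult` and `runToyStudy` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Outcome of one toy experiment.
struct ToyResult {
    double fitted;  // fitted parameter value
    double error;   // estimated uncertainty on the fitted value
    double pull;    // (fitted - true) / error
};

// Generate nToys samples of nEvents each from Gauss(trueMean, sigma), estimate the
// mean of each sample, and record value, error and pull.
// Simplification: the "fit" is the sample mean, which is the ML estimate of the mean
// of a Gaussian with known sigma; its standard error is sigma / sqrt(nEvents).
std::vector<ToyResult> runToyStudy(int nToys, int nEvents, double trueMean,
                                   double sigma, unsigned seed = 42) {
    assert(nToys > 0 && nEvents > 1 && sigma > 0.0);
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(trueMean, sigma);

    std::vector<ToyResult> results;
    results.reserve(nToys);
    for (int t = 0; t < nToys; ++t) {
        double sum = 0.0;
        for (int i = 0; i < nEvents; ++i) sum += gauss(rng);  // generate one toy sample
        const double fitted = sum / nEvents;                   // "fit" it
        const double error = sigma / std::sqrt(static_cast<double>(nEvents));
        results.push_back({fitted, error, (fitted - trueMean) / error});
    }
    return results;
}
```

For an unbiased estimator with correctly estimated errors, the pull distribution should be centered at 0 with unit width; a shifted mean points to bias and a width different from 1 to over- or underestimated errors.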

RooFit at SourceForge - roofit.sourceforge.net

RooFit moved to SourceForge to facilitate access and communication with non-BaBar users

Code access
– CVS repository via pserver
– File distribution sets for production versions

RooFit at SourceForge - Documentation

Documentation
– Comprehensive set of tutorials (PPT slide show + example macros)
– Five separate tutorials
– More than 250 slides and 20 macros in total
– Class reference in THtml style
– Also available by end of 2005: Pedagogical User Guide + Reference Guide (~100-page document)

Selection of BaBar publications in 2004 using RooFit

• Improved Measurement of the CKM angle alpha using B0+-

• Measurement of branching fractions and charge asymmetries in B+ decays to +, K+, + and '+, and search for B0 decays to K0 and

• Branching Fraction and CP Asymmetries in B0 → KSKSKS

• Measurement of CP Asymmetries in B0K0 and B0K+K-K0S

• Measurement of Branching Fractions and Time-Dependent CP-Violating Asymmetries in B' K Decays

• Improved Measurement of Time-Dependent CP Violation in B0 to (ccbar)K0 Decays (‘sin2β’)

• Measurements of the Branching Fraction and CP-Violating Asymmetries in B0 → f0(980)KS Decays

• Measurement of Time-dependent CP-Violating Asymmetries in B0K*, K*KS0 Decays

• Study of the decay B0+- and constraints on the CKM angle alpha.

• Measurement of CP-violating Asymmetries in B0K0S0 Decays

• Measurement of Time-Dependent CP Asymmetries in B0K0

• Search for B+/- [K-/+ +/-]D K+/- and upper limit on the bu amplitude in B+/- DK+/-

• Limits on the Decay-Rate Difference of Neutral B Mesons and on CP, T, and CPT Violation in B0B0bar Oscillations

Summary

• RooFit adds a powerful data modeling language to ROOT
– Mathematical objects (variables, functions, …) represented as C++ objects
– Complex models are easily composed from library of standard components
  • New fundamental components can be written easily
– Powerful tools for fitting, toy MC generation and visualization are easy to use
  • Compact syntax
  • Universal functionality (almost no arbitrary / implementation-related restrictions)
– Automated function optimization analysis and implementation delivers industrial-strength performance

• RooFit makes the road from a ‘simple’ to a really elaborate/complex p.d.f. less steep

• RooFit is distributed with ROOT starting with ROOT version v5.02-00