outline
Post on 23-Jan-2016
RooStatsCms: a tool for analyses modelling, combination and statistical studies
D. Piparo, G. Schott, G. Quast
Institut für Experimentelle Kernphysik, Universität Karlsruhe
D. Piparo 2
Outline
• The need for a tool
• RooStatsCms (RSC)
• A RooFit interlude
• The three parts
  – Modelling
    • The datacard
    • Inspect your model
  – Statistical studies and limits
    • Profile Likelihood
    • Hypothesis separation and “modified frequentist approach”
  – Exclusion
  – Plotting classes
19.11.08
The need for a tool
• No pre-existing structured statistics software framework in CMS: G. Quast, G. Schott and DP developed RooStatsCms
NEEDS:
• Reliable implementation of multiple statistical methods
• Combine analyses:
– Stronger limits on quantities like Higgs production cross section, mass ...
• Do not replace existing analyses but complement their results
• Easy user interface
• Satisfactory documentation (no black boxes)
• Examples and tutorials
RooStatsCms
• Originally conceived for the CMS Higgs Working Group and as a CMS (EKP) exclusive product
• Based on RooFit (part of the ROOT distribution)
• Three parts:
  – Modelling and combination
  – Statistical methods
  – Advanced graphic routines
• It comes with CINT dictionaries (macros, interactive ROOT).
• Available to CMS and EKP at: www-ekp.physik.uni-karlsruhe.de/~RooStatsCms
  – Visit our wiki for username and password
  – Statistical methods and graphic routines public: www-ekp.physik.uni-karlsruhe.de/~RooStatsKarlsruhe
• Big effort for documentation:
1. RSC website and Doxygen of every class, method and member
2. Wikipages with links to RSC presentations (~15) and workshop
  • https://twiki.cern.ch/twiki/bin/view/CMS/HiggsWGRooStatsCms
  • http://www-ekp.physik.uni-karlsruhe.de/~twiki/bin/view/EkpCms/RooStatsCms
3. An internal CMS note in preparation
RooStatsCms - structure 1/2
• Already 33 classes!
• All of them inherit from TObject: persistency and reflection
• Moreover:
  – Programs to compile
  – Macros for the interpreter
  – Various utilities in the Rsc namespace (TH1F median, ...)
• Class design-wise structure
RooStatsCms - structure 2/2
Directory   Description
doc         Links to the documentation
bin         Executables produced by the make exe command (see progs dir)
interface   Header files
lib         The library produced by the make command: libRooStatsCms.so
macros      The macros for CINT
progs       C++ programs to be compiled and linked against the library
scripts     Utility scripts: python card maker, doxy, environment
src         The sources
test        …well the directory for the tests!
Directory-wise structure
• Structure “À la CMSSW”: ready to compile in the CMS framework with a newer RooFit
RooFit interlude: ouverture
• Toolkit for data modeling
• Model distribution of observable x in terms of
  – parameter of interest p
  – other parameters q to describe detector effects (resolution, efficiency)
  – Probability density function (pdf) F(x;p,q)
  – normalized over range of observable x w.r.t. the parameters p and q
• RooFit provides the functionality for
  – building these probability density functions
    • scalable to complex models
  – maximum likelihood fitting (binned and unbinned)
  – visualization of the pdf
  – toy MC generator
RooFit interlude: functionality
• Package originally developed for BaBar analyses (by W. Verkerke and D. Kirkby)
  – actively maintained by W. Verkerke in view of LHC analyses
  – Web site: http://roofit.sourceforge.net
  – Much material shown taken from Wouter’s presentations
    • see 200 slides presented at French statistics school (http://sos.in2p3.fr)
• Users Manual in the ROOT site: ftp://root.cern.ch/root/doc/RooFit_Users_Manual_2.91-33.pdf
RooFit interlude: design
• Mathematical entities are represented as C++ objects
RooFit interlude: an example
• Gaussian Pdf
• MC data generation
• Maximum likelihood fit on data
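The code shown on the original slide is not preserved in this transcript. As a rough stand-in, the three steps above (Gaussian pdf, toy MC generation, maximum likelihood fit) can be sketched with the C++ standard library alone. This is not RooFit code: the names GaussFit, generate_toy and fit_gaussian are hypothetical, and the fit uses the closed-form Gaussian likelihood maximum instead of a numerical minimiser.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Result of a Gaussian maximum likelihood fit (hypothetical helper type).
struct GaussFit {
    double mean;
    double sigma;
};

// Toy MC generation: draw n_events values of the observable x from a
// Gaussian pdf with the given true parameters.
std::vector<double> generate_toy(double true_mean, double true_sigma,
                                 std::size_t n_events, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(true_mean, true_sigma);
    std::vector<double> data(n_events);
    for (double& x : data) x = gauss(rng);
    return data;
}

// Unbinned maximum likelihood fit. For a Gaussian the likelihood maximum is
// known in closed form: the sample mean and the (biased) RMS around it.
GaussFit fit_gaussian(const std::vector<double>& data) {
    double sum = 0.0;
    for (double x : data) sum += x;
    const double mean = sum / data.size();
    double sq = 0.0;
    for (double x : data) sq += (x - mean) * (x - mean);
    return {mean, std::sqrt(sq / data.size())};
}
```

Calling fit_gaussian(generate_toy(1.0, 2.0, 100000, 42)) should return parameters close to the generated ones; in RooFit the same workflow is expressed with pdf and dataset objects rather than free functions.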
RSC: A solid tool
• RSC is in “production phase”:
  – Around since the beginning of the year 2008
  – Workshop at CERN in June
  – Approved results: http://cms-physics.web.cern.ch/cms-physics/public/HIG-08-008-pas.pdf
  – Coming soon results: HIG-008-06 HWW
  – CMS statistics committee blessed the tool (internal note in preparation)
    • Grégory in permanent contact with them
  – Interest of other working groups
  – Negotiations for integration in CMS Software framework (CMSSW)
  – Base of a common tool with ATLAS
    • Work in progress: first commits in ROOT are taking place
  – New manpower: Mario Pelliccioni (former BaBar) from Università di Torino
• Made in EKP (Quast, Schott, Piparo):
  – Personal assistance on the 8th floor!
RSC: Is it hard to try?
Straightforward to get started on ekpcms3:
wget -O RooStatsCms.tar.gz http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/CMSSW/HiggsAnalysis/RooStatsCms.tar.gz?view=tar\&pathrev=V00-04-00
tar -zxf RooStatsCms.tar.gz
cd RooStatsCms
source /home/piparo/set_root_RSC_environment.sh
source scripts/RSCenv.sh
make
make exe
cd macros/examples/
root profilelikelihood_htt.cxx
root qqhtt_-2lnQ_distributions.cxx
See also: www-ekp.physik.uni-karlsruhe.de/~RooStatsCms for detailed instructions
RSC in one slide
Statisticians
A priori, I frequently
believe I am in between ...
... RooStatsCms tries to put you somehow “in between”...
The Three Parts
• Analyses modeling and combination
• Statistical Methods and limits
• Graphics routines
Analyses modeling and combination
• Modeling based on the datacard concept
• Build a complete combined analysis model from ASCII datacards
– Background and signal components of each analysis
– Shapes from parametrisation or histos
– Constraints and their correlations
– Basic syntax: include, if ...
– Two lines of C++ to produce the RooFit Pdf
• Datacard advantages:
– Automatic bookkeeping of what is done
– Factorise model from C++ code
– Easy to share
RscCombinedModel mymodel("hzz4l");
RooAbsPdf* sb_pdf = mymodel.getPdf();
ASCII Card2 analyses
RSC – Modelling 2/2
• Yields can be expressed as products of different terms. For example:
  – Branching Ratios
  – Efficiencies
  – Cross section
  – Luminosity
• Each term: systematics can be included
• Relate terms from one analysis to the other with correlations
Yield = BR · ε · σH · Lumi
An example datacard: counting
#################################
# The combined model
#################################
// Here we specify the names of the models
// built down in the card that we want
// to be combined
include HZZ_4mu.rsc
include HZZ_4e.rsc
include HZZ_2mu2e.rsc
[hzz4l]
  model = combined
  components = hzz_4mu, hzz_4e, hzz_2mu2e
#################################
# H -> ZZ -> 4mu
#################################
[hzz_4mu]
  variables = x
  x = 0 L(0 - 1)
[hzz_4mu_sig]
  hzz_4mu_sig_yield = 62.78 L(0 - 200)
[hzz_4mu_sig_x]
  model = yieldonly
[hzz_4mu_bkg]
  yield_factors_number = 2
  yield_factor_1 = scale
  scale = 1 L (0 - 3)
  scale_constraint = Gaussian,1,0.041
  yield_factor_2 = bkg_4mu
  bkg_4mu = 19.93 C
[hzz_4mu_bkg_x]
  model = yieldonly
The combined model
The variable
Signal component description:
- Yield
- Model
Background component description: yield made of different terms.
See RscBaseModel and RscCombinedModel documentation for a complete description
Constraints syntax: <type>,par1,par2
Basic syntax
Comment
An example datacard: shapes
// The combined model of HZZ and Hgg
include hzz_combined.rsc
include hgg_12_categories.rsc
[hgg_hzz_combined]
  model = combined
  components = hzz, hgg_cat0, hgg_cat1,..., hgg_cat11
[hgg_cat0]
  variables = mh
  mh = 115 L(90 - 180) // [GeV/c^{2}]
[hgg_cat0_sig]
  yield_factors_number = 3
  yield_factor_1 = lumi
  lumi = 1 C
  yield_factor_2 = n_events_hgg_115_cat0_sig
  n_events_hgg_cat0_sig = 3.9577
  yield_factor_3 = scale_sig
  scale_sig = 1 L (0 - 5)
[hgg_cat0_sig_mh]
  model = fourGaussians
  hgg_115_cat0_sig_mh_mean1 = 114.654 +/- 0.107106 C
  hgg_115_cat0_sig_mh_mean2 = 115.146 +/- 2.37687 C
  hgg_115_cat0_sig_mh_mean3 = 114.12 +/- 0.581539 C
  hgg_115_cat0_sig_mh_mean4 = 109.979 +/- 11.036 C
  hgg_115_cat0_sig_mh_sigma1 = 0.6075 +/- 0.0888951 C
  hgg_115_cat0_sig_mh_sigma2 = 0.601995 +/- 129.141 C
  hgg_115_cat0_sig_mh_sigma3 = 2.1119 +/- 0.526549 C
  hgg_115_cat0_sig_mh_sigma4 = 8.16619 +/- 7.75118 C
  hgg_115_cat0_sig_mh_frac1 = 0.999893 +/- 0.500053 C
  hgg_115_cat0_sig_mh_frac2 = 0.762761 +/- 0.0870296 C
  hgg_115_cat0_sig_mh_frac3 = 0.98815 +/- 0.0207781 C
1. Combination of combined models
2. Counting combined with shape analyses
[hgg_cat0_bkg]
  number_components = 2
  yield_factors_number = 3
  yield_factor_1 = lumi
  lumi = 1 C
  yield_factor_2 = n_events_hgg_115_cat0_bkg
  n_events_hgg_cat0_bkg = 988.389
  yield_factor_3 = scale_bkg
  scale_bkg = 1 L (0 - 5)
[hgg_cat0_bkg1]
  qqhtt_bkg1_yield = 1 C
[hgg_cat0_bkg2]
  qqhtt_bkg2_yield = 1.35 C
[hgg_cat0_bkg1_mh]
  model = doubleGaussian
  hgg_cat0_bkg_mh_mean1 = 52.3484 +/- 14.1593 C
  hgg_cat0_bkg_mh_mean2 = 158.962 +/- 3.21153 C
  hgg_cat0_bkg_mh_sigma1 = 27.1791 +/- 2.37455 C
  hgg_cat0_bkg_mh_sigma2 = 74.9328 +/- 70.6298 C
  hgg_cat0_bkg_mh_frac = 0.924937 +/- 0.0347411 C
[hgg_cat0_bkg2_mh]
  model = histo
  hgg_cat0_bkg2_mh _fileName = htt_inputs.root
  hgg_cat0_bkg2_mh name = background
Comment
Multiple components
Histogram and parametric models mixed
A combination
• Combination of CMS H→γγ, H→ZZ (3 modes), 30 fb⁻¹
• Perform a simultaneous analysis of Higgs channels:
  – for each analysis: each data sample is fitted simultaneously with its own signal and background model
  – combination of number counting and distribution based analyses
• Significance: sqrt(2lnQ)
• Various analyses
• Comparison between PTDR and RSC
More on constraints
[combined_120_constraints_block_1]
  correlation_variable1 = hww_mm_120_bkg_yield
  correlation_variable2 = hww_ee_120_bkg_yield
  correlation_variable3 = hww_em_120_bkg_yield
  correlation_value1 = 0.80 C
  correlation_value2 = 0.72 C
  correlation_value3 = 0.15 C
[combined_120_constraints_block_2] ............
• “Same name, same pointer” principle (100% correlation)
  – Same name in the card → Same object in the model
  – Common Luminosity, cross-sections
• Partial correlation among Gaussian constraints: constraints block
Correlated Variables
Correlation Coefficients
As many blocks as needed!
Analyses model structure
RscBaseModel: basic distributions (Histo, Gauss, Poly, My model)
RscMultiModel: model for each discriminating variable (Variable 1, Variable 2)
RscCompModel: different components for signal(s) and background(s) (Signal, Bkg1, Bkg2, Bkg3)
RscTotModel: the full analysis (Analysis 1)
Statistical Methods
RscCombinedModel: analysis combination (Combination)
Inspect your model
Two programs to use:
• Model Diagram: creates a simple graph of the combined model
  – model_diagram.exe <cardname> <modelname>
• Model Html: creates a website to browse your combined model
  – model_html.exe <cardname> <modelname>
The Three Parts
• Analyses modeling and combination
• Statistical Methods and limits
• Graphics routines
Profile Likelihood - 1/2
Profile Likelihood – 2/2
• Intersection with horizontal lines gives upper limits / two-sided intervals
– W.J. Metzger “Statistical Methods in Data Analysis”, Katholieke Universiteit Nijmegen, 2002.
• Systematics taken into account with penalty terms in the Likelihoods (profiling)
Likelihood scan: l maximised for each point
Interpolated scan minimum
Horizontal cuts
See PLCalcuator, PLResults, PLPlot documentation
• Minuit uses the same technique to obtain the errors on the fitted parameters
• Significance estimator: S=sqrt(2ln(Lsb/Lb))
→ if θ0 is N signal, the scan value at 0 is directly related to S !
θ0 at minimum: 7.16 (+8.1, −5.37)
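The significance estimator S = sqrt(2ln(Lsb/Lb)) quoted above has a closed form in the simplest case. The sketch below assumes a single-bin counting experiment with Poisson likelihoods, fixed signal and background yields s and b, and no nuisance parameters; the function name counting_significance is hypothetical and this is not the RSC implementation. Under those assumptions ln(Lsb/Lb) = n·ln(1 + s/b) − s for n observed events.

```cpp
#include <cmath>

// Significance estimator S = sqrt(2 ln(Lsb/Lb)) for a single-bin counting
// experiment (assumption: Poisson likelihoods, fixed yields, no systematics).
// n_obs: observed events, s: expected signal, b: expected background (b > 0).
double counting_significance(double n_obs, double s, double b) {
    // 2 ln(Lsb/Lb) = 2 * (n ln(1 + s/b) - s); the n! terms cancel in the ratio.
    const double two_ln_q = 2.0 * (n_obs * std::log1p(s / b) - s);
    // A non-positive value means the data are not more s+b-like than b-like;
    // returning 0 in that case is a choice made for this sketch.
    return two_ln_q > 0.0 ? std::sqrt(two_ln_q) : 0.0;
}
```

For small s/b and n_obs = s + b the result approaches the familiar s/sqrt(b).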
Systematics - 1/2
Systematics - 2/2
A PL prototype study
• A prototype study: distribution of upper limits using PL and a coverage study
• Many pseudo experiments performed for each mass hypothesis
– Distribution of upper limits obtained
– Coverage: fraction of experiments in which the upper limit is indeed greater than the parameter nominal value
– Easy to do: store PLResults objects in a TTree and loop on it.
Overcoverage for low yields:
• Well known feature of the method (Cramér-Fréchet Bound)
• “Calibrate” the Likelihood
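The coverage definition used in this study (fraction of pseudo-experiments whose upper limit lies above the nominal parameter value) is a simple counting loop. A minimal sketch follows, assuming the upper limits have already been extracted into a plain vector; in RSC they would be read from stored PLResults objects, which is not reproduced here, and the function name coverage is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Coverage: fraction of pseudo-experiments in which the upper limit is
// greater than the nominal (true) value of the parameter.
double coverage(const std::vector<double>& upper_limits, double true_value) {
    if (upper_limits.empty()) return 0.0;  // no experiments: define as 0
    std::size_t covered = 0;
    for (double ul : upper_limits)
        if (ul > true_value) ++covered;
    return static_cast<double>(covered) / upper_limits.size();
}
```

A result above the nominal confidence level of the limits indicates overcoverage, as observed for low yields.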
Separation of Hypotheses
• Analysis of search results can be formulated as separation of hypotheses:
  – Identify observable which comprises the result
  – Specify a test statistic
  – Define rules for discovery and exclusion
• Use the likelihood ratio Q = Lsb/Lb of the signal+background (“s+b”) and background-only (“b”) hypotheses as test statistic.
• Consider “P-values” (also called CLS+B, 1-CLB) of -2lnQ distributions obtained from s+b and b samples
Bayesian pseudo-integration of systematics:
For every toy MC experiment, before the generation of the toy dataset, parameters affected by systematics are properly fluctuated once.
Distributions built with toy MC experiments
(LimitCalculator-HybridCalculator Class)
CLsb
1-CLb
See:
• progs/m2lnq_creator.cpp
• qqhtt_-2lnQ_distributions.cxx in macros/examples/
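The toy MC procedure described on this slide can be illustrated for a single-bin counting experiment. The sketch below is not the LimitCalculator code: the function names are hypothetical, the only systematic is a Gaussian relative uncertainty on the background yield (fluctuated once before each toy is generated), and -2lnQ is taken as 2s − 2n·ln(1 + s/b), which follows from Poisson likelihoods.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// -2lnQ for a single-bin counting experiment with Q = Lsb/Lb.
double m2lnq(double n, double s, double b) {
    return 2.0 * s - 2.0 * n * std::log1p(s / b);
}

// Build a -2lnQ distribution from toy experiments. Before each toy the
// background yield is fluctuated once (Gaussian, relative width rel_sys > 0).
// with_signal selects the s+b hypothesis, otherwise background only.
std::vector<double> toy_m2lnq(double s, double b, double rel_sys,
                              bool with_signal, std::size_t n_toys,
                              std::mt19937& rng) {
    std::normal_distribution<double> nuisance(1.0, rel_sys);
    std::vector<double> values;
    values.reserve(n_toys);
    for (std::size_t i = 0; i < n_toys; ++i) {
        // Clamp so the Poisson mean stays strictly positive (sketch choice).
        const double b_toy = std::max(b * nuisance(rng), 1e-9);
        const double mean = with_signal ? s + b_toy : b_toy;
        std::poisson_distribution<long> counts(mean);
        values.push_back(m2lnq(static_cast<double>(counts(rng)), s, b));
    }
    return values;
}

// Fraction of toys at least as background-like as the observation, i.e.
// with -2lnQ >= observed. On s+b toys this gives CLsb, on b toys CLb.
double confidence_level(const std::vector<double>& toys, double observed) {
    if (toys.empty()) return 0.0;
    const auto n = std::count_if(toys.begin(), toys.end(),
                                 [observed](double v) { return v >= observed; });
    return static_cast<double>(n) / toys.size();
}
```

Generating one set of toys per hypothesis and evaluating confidence_level at the observed -2lnQ yields CLsb and CLb; because each toy is independent, the generation can be split across batch jobs and the resulting vectors concatenated.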
Modified frequentist method – Significance
• CLB: background CL, measure of the compatibility of the experiment with the B-only hypothesis
• 1 – CLB: probability for a B-only experiment to give a more S+B-like likelihood ratio than the observed one
• Correspondence between CLB and the resulting significance (Gaussian approximation):
  – number of standard deviations of an (assumed) Gaussian distribution of the background
  – take CLB assuming the expected s+b yield (i.e. median -2lnQ for s+b distribution)
• CLS+B: measure of the compatibility of the experiment with the S+B hypothesis. If CLS+B is small (< 5%), the S+B hypothesis can be excluded at more than 95% CL, but this does not mean that the signal hypothesis is excluded at that level
Modified frequentist approach: define the signal confidence level CLS as CLS ≡ CLS+B / CLB (heavily used by LEP, HERA and TEVATRON experiments)
$n = \sqrt{2}\,\mathrm{Erf}^{-1}(2\,CL_B - 1)$
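The two quantities on this slide are easy to evaluate once CLS+B and CLB are known. The sketch below assumes the one-sided Gaussian convention, in which the number of standard deviations is n = sqrt(2)·Erf⁻¹(2·CLB − 1). The C++ standard library has no inverse error function, so a bisection on std::erf is used as a stand-in; all function names are hypothetical and this is not RSC code.

```cpp
#include <cmath>

// Modified frequentist confidence level: CLs = CLsb / CLb (cl_b > 0).
double cls(double cl_sb, double cl_b) { return cl_sb / cl_b; }

// Inverse error function by bisection on std::erf, valid for -1 < y < 1.
double erf_inverse(double y) {
    double lo = -6.0, hi = 6.0;  // erf(+-6) equals +-1 to double precision
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (std::erf(mid) < y) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

// Number of Gaussian standard deviations corresponding to CLb
// (assumed one-sided convention: CLb = 0.5 gives n = 0).
double significance_from_clb(double cl_b) {
    return std::sqrt(2.0) * erf_inverse(2.0 * cl_b - 1.0);
}
```

With this convention a CLB of about 0.977 corresponds to roughly two standard deviations.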
The benchmark analysis: H→ττ
• Used as benchmark for the tool
• Results approved by the CMS collaboration
• Vector boson fusion H→ττ at 1 fb⁻¹
• Small signal on a significant background
• No discovery expected with this lumi
• Four mass hypotheses:
  – 115, 125, 135, 145 GeV
Mass [GeV]   N Sig (12% sys)   N Bkg (30% sys)
115          1.6               45.2
125          1.4               45.2
135          1.1               45.2
145          0.6               45.2
19.11.08
D. PiparoCMS Week 32
H→ττ: Significance
• Significance calculated for the H→ττ analysis using CLb
• In this case significance does not tell us much.
• The question becomes:
“Which production cross section can I exclude with the data I have?”
Modified Frequentist method – Exclusion
• Assume that the expected background (i.e. the median of the background distribution) and no signal is observed
• Amplify the SM production cross section by the factor necessary to obtain CLs=0.05
→ “95% exclusion”
Bands:
• Assume that Nb + n · sqrt(Nb) is observed, where n=2,1,-1,-2 for the -2,-1,1,2 sigma band border respectively
• Systematics taken into account in distributions of -2lnQ (marginalisation)
Obtained with real data
Less exclusion power than expected
More exclusion power than expected
~ 80 h on one CPU
ExclusionBandPlot Class
How do I find the right ratio?
RSC provides help:
• RatioFinder
• RatioFinderResults
• RatioFinderPlot
Just compile and launch the job(s)!
CLs = 0.05
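The search for the ratio that gives CLs = 0.05 can be pictured as a one-dimensional root finding problem. The sketch below is not the RatioFinder implementation: it assumes a single-bin counting experiment without systematics, so that CLsb and CLb are Poisson cumulative probabilities instead of toy MC distributions, and it uses plain bisection. All function names and default arguments are hypothetical.

```cpp
#include <cmath>

// Poisson cumulative probability P(n <= n_obs | mean).
// Note: exp(-mean) underflows for very large means; fine for small yields.
double poisson_cdf(int n_obs, double mean) {
    double term = std::exp(-mean);  // k = 0 term
    double sum = term;
    for (int k = 1; k <= n_obs; ++k) {
        term *= mean / k;  // recurrence: P(k) = P(k-1) * mean / k
        sum += term;
    }
    return sum;
}

// CLs of the counting experiment when the signal yield s is scaled by ratio.
double cls_for_ratio(double ratio, double s, double b, int n_obs) {
    return poisson_cdf(n_obs, ratio * s + b) / poisson_cdf(n_obs, b);
}

// Bisection on the ratio: CLs decreases monotonically as the signal is
// amplified, so the target value is bracketed between lo and hi.
double find_ratio(double s, double b, int n_obs, double target = 0.05,
                  double lo = 0.0, double hi = 1000.0, double tol = 1e-6) {
    while (hi - lo > tol) {
        const double mid = 0.5 * (lo + hi);
        if (cls_for_ratio(mid, s, b, n_obs) > target) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

For example, find_ratio(1.6, 45.2, 45) scans the signal amplification for the yields of the 115 GeV benchmark point with 45 observed events. With toy-based CLs each evaluation is expensive, which is why distributing the jobs matters in practice.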
Another representation of the information
• Use the distributions of the test statistic.
• At a glance, see how the hypotheses are separated.
• For each mH: projection of the -2lnQ distribution in the B-only hypothesis.
Statistical Methods: class structures
• Organisation of the classes of statistical methods:
Statistical Methods – Mother: StatisticalMethod
  LimitCalculator (aka HybridCalculator), PLScan, FCCalculator
Statistical Methods Results – Mother: StatisticalResult
  LimitResults (aka HybridResults), PLScanResults, FCResults
  “Sum” (+) the results: batch/GRID jobs submission easier
Statistical Plot – Mother: StatisticalPlot
  LimitPlot (aka HybridPlot), PLScanPlot (add also FC curves)
Further plot classes: LEPBandPlot, ExclusionBandPlot
Constraints – Mother: NLLPenalty
  Constraint, ConstrBlock2, ConstrBlock3, ConstrBlockArray
The Three Parts
• Analyses modeling and combination
• Statistical Methods and limits
• Graphics routines
Plots collection
Troubleshooting
Q: I want to start now. Where do I find the examples?
A: The macros directory contains the macros for the interpreter, while the progs directory contains the programs to compile with the make exe command.
Q: I do not know how to write a datacard. What can I do?
A: The macros directory contains some datacards for inspiration. Also check the scripts directory: create_card_skeleton.py queries for templated card components, and TDR_HZZ_card_maker.py creates the CMS PTDR H→ZZ→4l cards.
Q: I compiled RSC but ROOT does not see the dynamic library libRooStatsCms.so. What do I do?
A: Add the /RooStatsCms/lib directory to your LD_LIBRARY_PATH environment variable. The scripts directory contains the RSCenv.sh script to set up your environment. Then, in the interpreter, use the command gSystem->Load("libRooStatsCms.so").
“Q”: Still... I cannot get it to work!
A: Come down to the eighth floor for support!
Conclusions
• Intuitive “model factory”
  – Build the analysis model from an ASCII configuration file, the datacard
  – Datacard also describes nuisance parameters (and correlations)
  – Building of a combined model for a combined analysis
• Implementation of nuisance parameters and correlations
  – Can be marginalised or profiled
• Statistical methods
  – LimitCalculator (CLb, CLsb, CLs): Complete*
  – PLScan (Profile Likelihood): Complete*
  – FCCalculator (fully frequentist approach): Validation to complete
  – Bayesian approach and Markov chains: Being investigated
* Strong implementation, tested and used by CMS analyses
• Batch friendly: decomposition in sub-jobs; results stored in ROOT files
  – Results can be merged and exploited by results classes
• Plots in a “presentation ready” form easily obtainable