CHEP 2010, Taipei, Taiwan 2010 Lorenzo Moneta, CERN/PH-SFT
Outline
  Recent developments in ROOT libMathCore
    numerical algorithms interfaces
  Fitting improvements
    TFitResult
  New classes in Hist library
    TEfficiency class
    TKDE class for density estimation
  Goodness of Fit
    new GoFTest class
  Conclusions
Recent MathCore Developments
libMathCore provides the basic Math functionality
  Mathematical and statistical functions in TMath or in the ROOT::Math namespace
  Random number generators
  Implementation of basic algorithms (integration, differentiation, root finders, etc.)
  Interfaces for function evaluations and for numerical algorithms
    Additional implementations provided in other libraries (e.g. libMathMore)
    transparent mechanism to use them via the plug-in manager
    see the Integrator or Minimizer interfaces
  Fitting classes (in namespace ROOT::Fit)
    Fitter, FitResult, etc., using the function and Minimizer interfaces
Numerical Integration
Single entry point for multiple implementations: ROOT::Math::Integrator
Different implementations can be selected. Example of usage: RooStats::BayesianCalculator

    using namespace ROOT::Math;
    // multi-dimensional integrand function
    double func(const double *x, const double *p);
    ....
    // Functor class to wrap the user function in the interface
    Functor f(func, dimension);
    // adaptive cubature method
    IntegratorMultiDim ig(IntegrationMultiDim::kADAPTIVE);
    double v1 = ig.Integral(f, xmin, xmax);
    // MC method (VEGAS) loaded from the MathMore library
    IntegratorMultiDim ig2(IntegrationMultiDim::kVEGAS);
    double v2 = ig2.Integral(f, xmin, xmax);
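The "single entry point, several implementations" idea can be illustrated without ROOT. The following is only a minimal sketch under stated assumptions: all names (Integrand, Method, Integrate, ...) are hypothetical and are not the ROOT::Math API, and a midpoint grid and plain Monte Carlo stand in for the adaptive cubature and VEGAS methods named on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Hypothetical names: this mimics the idea only, it is not the ROOT::Math API.
using Integrand = std::function<double(const std::vector<double>&)>;
enum class Method { kMidpointGrid, kPlainMC };

// Midpoint rule on a regular grid with n cells per dimension.
double IntegrateGrid(const Integrand& f, const std::vector<double>& lo,
                     const std::vector<double>& hi, int n) {
  const std::size_t dim = lo.size();
  std::vector<int> idx(dim, 0);
  std::vector<double> x(dim);
  double cellVolume = 1.0;
  for (std::size_t d = 0; d < dim; ++d) cellVolume *= (hi[d] - lo[d]) / n;
  double sum = 0.0;
  for (;;) {
    for (std::size_t d = 0; d < dim; ++d)
      x[d] = lo[d] + (idx[d] + 0.5) * (hi[d] - lo[d]) / n;
    sum += f(x);
    // Advance the multi-index like an odometer; stop after the last cell.
    std::size_t d = 0;
    while (d < dim && ++idx[d] == n) { idx[d] = 0; ++d; }
    if (d == dim) break;
  }
  return sum * cellVolume;
}

// Plain Monte Carlo: mean of f over uniformly drawn points times the volume.
double IntegrateMC(const Integrand& f, const std::vector<double>& lo,
                   const std::vector<double>& hi, int nSamples, unsigned seed) {
  std::mt19937 gen(seed);
  std::uniform_real_distribution<double> u(0.0, 1.0);
  std::vector<double> x(lo.size());
  double volume = 1.0;
  for (std::size_t d = 0; d < lo.size(); ++d) volume *= hi[d] - lo[d];
  double sum = 0.0;
  for (int i = 0; i < nSamples; ++i) {
    for (std::size_t d = 0; d < lo.size(); ++d)
      x[d] = lo[d] + u(gen) * (hi[d] - lo[d]);
    sum += f(x);
  }
  return volume * sum / nSamples;
}

// Single entry point: the caller only picks the method.
double Integrate(Method m, const Integrand& f, const std::vector<double>& lo,
                 const std::vector<double>& hi) {
  assert(!lo.empty() && lo.size() == hi.size());
  switch (m) {
    case Method::kMidpointGrid: return IntegrateGrid(f, lo, hi, 200);
    case Method::kPlainMC:      return IntegrateMC(f, lo, hi, 200000, 12345u);
  }
  return 0.0;  // unreachable for valid enum values
}
```

Calling Integrate with either Method value on the same integrand and bounds exercises both back ends through one call site, which is the design point the slide makes.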
Function Minimization
Common interface class (ROOT::Math::Minimizer) for all ROOT minimizer implementations.
Existing plug-ins:
  Minuit (based on class TMinuit, direct translation from the Fortran code)
  Minuit2 (new C++ implementation with OO design)
  Fumili (only for least-square or log-likelihood minimizations)
  GSL minimizers: conjugate gradient algorithms (Fletcher-Reeves, BFGS) and Levenberg-Marquardt (for minimizing least-square functions)
  Linear, for least-square functions (direct solution, non-iterative method)
  Genetic minimizer (based on the algorithm implemented in TMVA)
Easy to extend and plug in new minimizers: NagC, Opt++, ....?
Possible to combine minimizers, e.g. Minuit + Genetic minimizer
Control via the MinimizerOptions class: MinimizerOptions::SetDefaultMinimizer("Minuit2");
A RooFit interface (RooMinimizer) also exists (from A. Lazzaro) see
DistSampler class
New interface class in version 5.28 for random generation of data according to a generic distribution
  currently implemented using UNU.RAN
  an implementation based on Foam is coming
  can also directly generate data sets (binned or unbinned)
  plan to use it in RooFit for implementing RooAbsPdf::generate

    using namespace ROOT::Math;
    ....
    DistSampler *sampler = Factory::CreateDistSampler("Unuran");
    // set the sampling distribution
    sampler->SetFunction(user_function);
    // init with algorithm name
    sampler->Init("TDR");
    for (int i = 0; i < n; ++i) {
       // sample 1D data
       double x = sampler->Sample1D();
       // sample for multi-dimensional data
       const double *xx = sampler->Sample();
       ....
    }
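To make "sampling from a generic distribution" concrete, here is a minimal sketch of plain accept-reject sampling in standard C++. It is an assumption-laden illustration only: SampleRejection is a hypothetical helper, not part of ROOT or UNU.RAN, and plain rejection is a much simpler technique than the TDR algorithm selected on the slide.

```cpp
#include <cassert>
#include <functional>
#include <random>

// Hypothetical helper, not part of ROOT: draws one value from an
// unnormalised density `pdf` on [xmin, xmax] by plain accept-reject.
// `pdfMax` must be an upper bound of `pdf` on that range.
double SampleRejection(const std::function<double(double)>& pdf,
                       double xmin, double xmax, double pdfMax,
                       std::mt19937& gen) {
  assert(xmax > xmin && pdfMax > 0.0);
  std::uniform_real_distribution<double> ux(xmin, xmax);
  std::uniform_real_distribution<double> uy(0.0, pdfMax);
  for (;;) {
    const double x = ux(gen);
    // Accept x with probability pdf(x) / pdfMax, otherwise try again.
    if (uy(gen) < pdf(x)) return x;
  }
}
```

For example, with pdf(x) = 2x on [0, 1] and pdfMax = 2, repeated calls should produce values whose mean approaches 2/3. The efficiency of this scheme drops when pdfMax is a loose bound, which is one reason dedicated packages use more refined methods.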
Fitting improvements
New fitting classes already presented in past conferences
New since 5.26: TFitResult class
  returned from TH1::Fit or TGraph::Fit using TFitResultPtr
  need to use option "S", otherwise just the status (int) is returned
TFitResult contains all fit result information
  parameters, errors, covariance matrix, Minos errors, minimizer status, etc.

    // return a smart pointer to TFitResult using option "S"
    TFitResultPtr r = h1->Fit("gaus", "S");
    double chi2 = r->Chi2();            // chi2 of fit
    double fmin = r->MinFcnValue();     // minimum of fcn function
    const double *par = r->GetParams(); // get fit parameters
    const double *err = r->GetErrors(); // get fit errors
    TMatrixDSym covMat = r->GetCovarianceMatrix();
    TMatrixDSym corMat = r->GetCorrelationMatrix();
    r->Print("V");                      // full printout of result
Histogram division
New class in 5.28 for efficiencies and binomial errors
  common problem in HEP analysis (trigger, selection cuts, etc.)
Histogram bin counts are described by Poisson statistics:
  $n_i \sim \mathrm{Poisson}(n_i \mid \mu_i)$
Division of two histograms is described by binomial statistics if they are correlated:
  $k_i / n_i$: $\mathrm{Binomial}(k \mid n, \epsilon)$
If k and n are uncorrelated, the ratio of Poisson counts can still be written as
  $\frac{n_{i1}}{n_{i2}} = \frac{n_{i1} + n_{i2}}{n_{i2}} - 1 \to \frac{1}{\epsilon} - 1$
Histogram class cannot fully describe binomial statistics
  need both $k_i$ and $n_i$ for further analysis (combination, fitting, etc.)
Motivation for TEfficiency
What did we have in ROOT?
  TH1::Divide uses the normal approximation for binomial errors
    $\hat{\epsilon} = \frac{k}{n} \pm \sqrt{\frac{\hat{\epsilon}(1 - \hat{\epsilon})}{n}}$
    fails for $\epsilon \to 0$ or $1$
  TGraphAsymErrors::BayesDivide: binomial intervals with Bayesian statistics, assuming a uniform prior
TEfficiency class now provides several statistical methods for computing binomial confidence intervals
  frequentist interval (Clopper-Pearson), as described in the PDG
  approximate methods (Agresti-Coull, Wilson)
  Bayesian interval based on a Beta prior distribution
    includes the uniform, Beta(1,1), and Jeffreys, Beta(1/2,1/2), priors
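The failure of the normal approximation at the boundaries can be seen numerically. The sketch below, in plain C++ and independent of ROOT, compares it with the Wilson score interval, one of the approximate methods listed for TEfficiency. The function names are hypothetical and this is not TEfficiency's implementation; z = 1 is assumed to correspond to a 68.3% central interval.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>

// Normal (Wald) approximation: k/n +- z * sqrt(p(1-p)/n).
std::pair<double, double> NormalInterval(int k, int n, double z = 1.0) {
  assert(n > 0 && k >= 0 && k <= n);
  const double p = static_cast<double>(k) / n;
  const double half = z * std::sqrt(p * (1.0 - p) / n);
  return {p - half, p + half};
}

// Wilson score interval, clamped to the physical range [0, 1].
std::pair<double, double> WilsonInterval(int k, int n, double z = 1.0) {
  assert(n > 0 && k >= 0 && k <= n);
  const double p = static_cast<double>(k) / n;
  const double z2n = z * z / n;
  const double centre = (p + 0.5 * z2n) / (1.0 + z2n);
  const double half =
      z / (1.0 + z2n) * std::sqrt(p * (1.0 - p) / n + z2n / (4.0 * n));
  return {std::max(0.0, centre - half), std::min(1.0, centre + half)};
}
```

For k = 0 and n = 10 the normal approximation collapses to a zero-width interval at 0, while the Wilson interval still extends to a non-zero upper limit.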
Binomial intervals
Coverage probabilities for the binomial interval
TEfficiency
TEfficiency provides the possibility to estimate and draw intervals at different confidence levels and with different statistics options
Support also for 2D and 3D objects
Possible to fill a TEfficiency directly
  eff.Fill(true, x);   for the events passing a selection
  eff.Fill(false, x);  for the events failing the selection

    TEfficiency ef(*h1, *h2);                  // h1: passed histogram, h2: total histogram
    ef.SetStatisticOption(TEfficiency::kFCP);  // Clopper-Pearson
    ef.SetStatisticOption(TEfficiency::kFAC);  // or Agresti-Coull
    ef.SetConfidenceLevel(0.683);
    ef.Draw("A4");
Fitting TEfficiency
TEfficiency::Fit: binned maximum likelihood fit using a binomial probability for each bin, using the class TBinomialEfficiencyFitter
  maximize $L(k_i \mid n_i, f_i) = \prod_i \frac{n_i!}{(n_i - k_i)!\,k_i!} f_i^{k_i} (1 - f_i)^{n_i - k_i}$ with $f_i = f(\vec{x}_i, \vec{p})$
Least-square ($\chi^2$) fit is not statistically correct for $\epsilon \simeq 0$ or $\epsilon \simeq 1$
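The binomial likelihood used in the fit can be written down directly. The following is a minimal sketch of the negative log-likelihood in plain C++, assuming a hypothetical Bin struct and a user-supplied model function; it is not the code of TEfficiency::Fit or of the ROOT fitter class, and the minimization step itself is left out.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Hypothetical container for one bin: centre x, passed count k, total count n.
struct Bin {
  double x;
  int passed;
  int total;
};

// Negative log of the binomial likelihood for a model f(x) with values in (0, 1).
double BinomialNLL(const std::vector<Bin>& bins,
                   const std::function<double(double)>& model) {
  double nll = 0.0;
  for (const Bin& b : bins) {
    assert(b.passed >= 0 && b.passed <= b.total);
    const double f = model(b.x);
    assert(f > 0.0 && f < 1.0);
    // log of the binomial coefficient n! / ((n-k)! k!) via lgamma
    const double logBinom = std::lgamma(b.total + 1.0) -
                            std::lgamma(b.passed + 1.0) -
                            std::lgamma(b.total - b.passed + 1.0);
    nll -= logBinom + b.passed * std::log(f) +
           (b.total - b.passed) * std::log(1.0 - f);
  }
  return nll;
}
```

For a constant model the minimum of this function sits at the pooled ratio of passed to total counts, which gives a quick sanity check before handing the function to a minimizer.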
TEfficiency Combinations
Possible to combine and merge different TEfficiency objects
  support combinations from a list of objects with different weights
    e.g. combination of efficiencies generated by different processes
  Use Bayesian statistics for the combination
  support generalization for weighted events
    e.g. in the combination of different MC samples

  $P_{comb}(\epsilon \mid w_i, k_i, n_i) \propto \prod_i L(k_i \mid n_i, \epsilon)^{w_i} \, \Pi(\epsilon)$
  $L(k_i \mid n_i, \epsilon)$: the likelihood function
  $\Pi(\epsilon) = B(\epsilon, \alpha, \beta)$: prior (beta distribution)
  $w_i$: weights renormalized to $w_i' = w_i \frac{\sum_j w_j}{\sum_j w_j^2}$
the combined posterior is then:
  $P_{comb}(\epsilon \mid w_i, k_i, n_i) = B\left(\epsilon, \sum_i w_i k_i + \alpha, \sum_i w_i (n_i - k_i) + \beta\right)$
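The combined posterior above reduces to simple bookkeeping on the two Beta parameters. The sketch below shows that arithmetic in plain C++, assuming the renormalized weights are the ones entering the sums; Sample, BetaParams and CombinedPosterior are hypothetical names and this is not the TEfficiency code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical inputs: one entry per sample with its weight w, passed k, total n.
struct Sample {
  double weight;
  double passed;
  double total;
};

// Parameters (a, b) of the resulting Beta(a, b) posterior.
struct BetaParams {
  double a;
  double b;
};

BetaParams CombinedPosterior(const std::vector<Sample>& samples,
                             double alpha = 1.0, double beta = 1.0) {
  assert(!samples.empty());
  double sumW = 0.0, sumW2 = 0.0;
  for (const Sample& s : samples) {
    assert(s.weight > 0.0 && s.passed >= 0.0 && s.passed <= s.total);
    sumW += s.weight;
    sumW2 += s.weight * s.weight;
  }
  const double norm = sumW / sumW2;  // w_i' = w_i * sum(w) / sum(w^2)
  BetaParams post{alpha, beta};      // start from the Beta(alpha, beta) prior
  for (const Sample& s : samples) {
    const double w = s.weight * norm;
    post.a += w * s.passed;
    post.b += w * (s.total - s.passed);
  }
  return post;
}

// Mean of a Beta(a, b) distribution.
double PosteriorMean(const BetaParams& p) { return p.a / (p.a + p.b); }
```

With two equally weighted samples, 8 of 10 and 15 of 20 passed, and a uniform prior, this gives Beta(24, 8). Scaling all weights by a common factor leaves the result unchanged because of the renormalization.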
Non Parametric Density Estimator
Estimate of the underlying probability density function from the data
  non-parametric: does not assume any model for the data, in contrast to parametric estimators, which require a data model
    e.g. fitting is instead a parametric estimation
Histogram is a non-parametric density estimator
  simplest and computationally most efficient
  drawbacks: discontinuities and dependence on bin width and origin
Kernel density estimators are an alternative method
  $\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$
  the bandwidth h is a smoothing parameter influencing both the bias and the variance of the estimator
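The kernel estimator formula translates almost literally into code. Below is a minimal sketch with a fixed bandwidth and a Gaussian kernel in plain C++; the function names are hypothetical, and none of the TKDE features (adaptive bandwidth, binning, interpolation) are attempted.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Standard normal density, used as the kernel K(u).
double GaussianKernel(double u) {
  static const double kInvSqrt2Pi = 0.3989422804014327;
  return kInvSqrt2Pi * std::exp(-0.5 * u * u);
}

// f_h(x) = 1/(n h) * sum_i K((x - x_i) / h) for a fixed bandwidth h.
double KernelDensity(const std::vector<double>& data, double h, double x) {
  assert(!data.empty() && h > 0.0);
  double sum = 0.0;
  for (double xi : data) sum += GaussianKernel((x - xi) / h);
  return sum / (data.size() * h);
}
```

Each evaluation costs one pass over the data, which is why interpolated or binned variants are attractive for large samples. A smaller h follows the data more closely (lower bias, higher variance), a larger h smooths more.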
New TKDE Class
Kernel density estimator classes exist in both RooFit and TMVA
  no real kernel density estimator in core ROOT
  TH1K class is based on nearest-neighbor (uniform kernel)
New class TKDE will be available in 5.28 (in libHist)
  support for various kernels: default is Gaussian, but also Epanechnikov, Bi-weight and Arc-cosine kernels
  support for adaptive bandwidth (better for multi-modal distributions and for describing peaks and tails)
  can provide either the full result or an interpolated one for fast evaluation
  can support data binning for efficient bandwidth computation in the adaptive case
Working on a multi-dimensional class using a kd-tree as data storage (TKDTree)
  plan to use also Foam for optimal multi-dimensional binning
Examples of TKDE
Example: Gaussian, bi-Gaussian and log-normal distributions
[Figure: TKDE examples; panels: Gaussian, Log-normal, Log-normal (log-scale), Bi-Gaussian]
Errors from TKDE
Can also draw the error (confidence interval at the desired level), as well as the bias and the RMS (root mean square)
New GoF Test Class
New class for goodness of fit tests: ROOT::Math::GoFTest in libMathCore
  1-sample test
    test if data are compatible with a reference distribution
    user-provided distributions or standard ones (normal, log-normal, etc.)
  2-sample test
    test if two data sets are compatible
  works on unbinned data sets
    we already have the Pearson $\chi^2$ test on binned data sets (histograms)
Kolmogorov-Smirnov test already existed in ROOT for the 2-sample and binned data cases
  add the 1-sample test
Anderson-Darling test
  much more sensitive in detecting variations in the tails
Example of using GoFTest
1 sample test

    using namespace ROOT::Math;
    // create gof test class on data x[n] = {....}
    GoFTest gof(n, x, GoFTest::kLogNormal);
    // set a user distribution object
    // which must implement operator()(x)
    gof.SetUserDistribution(user_dist);
    double pValueAD = gof.AndersonDarlingTest();
    double pValueKS = gof.KolmogorovSmirnovTest();

2 sample test

    // create GoF test for data x1[n1] and x2[n2]
    GoFTest gof2(n1, x1, n2, x2);
    double pValueAD2 = gof2.AndersonDarling2SamplesTest();
    double pValueKS2 = gof2.KolmogorovSmirnov2SamplesTest();
[Figures: quantile-quantile plots; axes: data 1 quantiles vs data 2 quantiles, and data quantiles vs theoretical quantiles]
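To show what a 1-sample test on unbinned data computes, here is a minimal sketch of the Kolmogorov-Smirnov distance in plain C++. It only evaluates the test statistic, the largest gap between the empirical and the reference cumulative distribution; the conversion to a p-value, which GoFTest returns, is not attempted. KolmogorovDistance is a hypothetical name, not a ROOT function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Largest distance between the empirical CDF of `data` and the reference `cdf`.
double KolmogorovDistance(std::vector<double> data,
                          const std::function<double(double)>& cdf) {
  assert(!data.empty());
  std::sort(data.begin(), data.end());
  const double n = static_cast<double>(data.size());
  double d = 0.0;
  for (std::size_t i = 0; i < data.size(); ++i) {
    const double F = cdf(data[i]);
    // The empirical CDF jumps from i/n to (i+1)/n at the i-th sorted point,
    // so both sides of the step have to be compared with F.
    d = std::max({d, (i + 1) / n - F, F - i / n});
  }
  return d;
}
```

For the three points 0.1, 0.4 and 0.7 tested against a uniform distribution on [0, 1], the distance is 0.3, reached at the last point where the empirical CDF jumps to 1.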
Conclusions
Large collection of math and statistical tools available in ROOT
  working recently on improving the overall quality
    more tests; studied and improved performance whenever possible
    fixed several issues found by a static code checker (Coverity)
  improving modularity
    common interfaces for functions and algorithms
  improving usability (e.g. new classes like TFitResult)
New classes useful for LHC data analysis will be available in 5.28
  TEfficiency to compute and display binomial intervals
  TKDE for kernel density estimation
  ROOT::Math::GoFTest for goodness of fit tests
Developing advanced tools for physics analysis
  complex fitting (RooFit)
  multivariate analysis (TMVA) (see poster 081)
  new statistical framework (RooStats)
    see separate presentation next Thursday in Event Processing session
Documentation
Online reference documentation (most up to date)
  class description with THtml (and also Doxygen)
    see http://root.cern.ch/root/htmldoc/MATH_Index.html
    see TEfficiency doc as an example of a very well documented class
Math library documentation on Drupal
  see http://root.cern.ch/drupal/content/mathematical-libraries
  documents most of the recent developments (numerical algorithms, fitting, etc.)
ROOT User guides: see http://root.cern.ch/root/doc/RootDoc.html
  not yet updated with the latest developments
  TMVA, RooFit and RooStats (in preparation) user guides
ROOT Talk Forum (for support, requests and discussions)
  ✦ a thread is available dedicated to Math and Statistical topics
  ✦ bugs should be reported to Savannah