Pushpa Bhat, Fermilab ACAT 2007 Apr 23-27, Amsterdam 1
Multivariate Methods in HEP
Pushpa Bhat Fermilab
Outline
• Introduction/History
• Physics Analysis Examples
• Popular Methods
  • Likelihood Discriminants
  • Neural Networks
  • Bayesian Learning
  • Decision Trees
• Future
• Issues and Concerns
• Summary
Some History
• In 1990 most of the HEP community was skeptical of multivariate methods, particularly neural networks (NN)
• NN were seen as a "black box"
  • Can't understand the weights
  • Nonlinear mapping; higher-order correlations
  • Though a mathematical function, can't be explained in terms of physics
  • Can't calculate systematic errors reliably
• Univariate or "cut-based" analysis was the norm
• Some were pursuing application of neural network methods to HEP around 1990
  • Peterson, Lonnblad, Denby, Becks, Seixas, Lindsey, etc.
• First AIHENP (Artificial Intelligence in High Energy & Nuclear Physics) workshop was in 1990.
  • Organizers included D. Perret-Gallix, K.H. Becks, R. Brun, J. Vermaseren. AIHENP metamorphosed into ACAT ten years later, in 2000
• Multivariate methods such as Fisher discriminants were in limited use.
• In 1990, I began to pursue the use of multivariate methods, especially NN, in top quark searches at Dzero.
Mid-1990’s
• LEP experiments had been using NN and likelihood discriminants for particle-ID applications and eventually for signal searches (Steinberger; tau-ID)
• H1 at HERA successfully implemented and used NN for triggering (Kiesling).
• Hardware NN was attempted at Fermilab at CDF
• Fermilab Advanced Analysis Methods Group brought CDF and DØ together for discussion of these methods and applications in physics analyses.
The Top Quark
Post-Evidence, Pre-Discovery!
Fisher analysis of tt e channel: one candidate event, S/B (mt = 180 GeV) = 18 w.r.t. Z, = 10 w.r.t. WW
NN analysis of tt e+jets channel
[Figure: NN analysis distributions for tt (tt160), W+jets, and data]
P. Bhat, DPF94
Cut Optimization for Top Discovery
[Figure, Feb. '95: signal and background distributions with the Jan. '95 (Aspen) cut, the Mar. '95 discovery cut, and contours of possible NN cuts]
[Figure: signal efficiency vs. S/B, showing the conventional cut of Jan. '95 (Aspen meeting), the conventional cut of Feb.-Mar. '95 (discovery), and the S/B reach of the 2-variable NN analysis for similar efficiency]
Neural network equi-probability contour cuts from a 2-variable analysis compared with conventional cuts used in Jan. '95 and in the Observation paper
P. Bhat, H. Prosper, E. Amidi, D0 Top Marathon, Feb. '95
Measurement of the Top Quark Mass
DØ Lepton+jets
Discriminant variables
The Discriminants
Fit performed in 2-D: (DLB/NN, mfit)
Run I (1996) result with NN and likelihood: mt = 173.3 ± 5.6 (stat.) ± 6.2 (syst.) GeV/c²
Recent (CDF+D0) mt measurement: mt = 171.4 ± 2.1 GeV/c²
First significant physics result using multivariate methods
Higgs, the Holy Grail of HEP
Discovery Reach at the Tevatron
• The challenges are daunting! But using NN provides same reach with a factor of 2 less luminosity w.r.t. conventional analysis
• Improved bb mass resolution & b-tag efficiency crucial
Run II Higgs study, hep-ph/0010338 (Oct. 2000)
P.C. Bhat, R. Gilmartin, H. Prosper, Phys. Rev. D 62 (2000) 074022
Then, it got easier
• One of the important steps in getting the NN accepted at the Tevatron experiments was to make the Bayesian connection.
• Another important message to drive home was “the maximal use of information in the event” for the job at hand
• Developed a random grid search technique that can be used as baseline for comparison
• Neural network methods now have become popular due to the ease of use, power and many successful applications
Maybe too easy??
Optimal Event Selection
[Figure: signal (S) and background (B) events in the (x, y) feature space, with conventional cuts at x0 and y0]
r(x,y) = constant defines an optimal decision boundary

$$ r(x,y) = \frac{p(x,y|s)\,p(s)}{p(x,y|b)\,p(b)} = \frac{p(s|x,y)}{p(b|x,y)} $$
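The ratio r(x,y) can be illustrated with a minimal sketch. It assumes, purely for illustration, that signal and background are uncorrelated unit-width 2-D Gaussians with equal priors; the means, the cut values x0 and y0, the direction of the box cut, and all function names are hypothetical and not taken from the talk.

```python
import math

def gauss2d(x, y, mx, my, sigma=1.0):
    """Density of an uncorrelated 2-D Gaussian with equal widths (toy assumption)."""
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return norm * math.exp(-((x - mx) ** 2 + (y - my) ** 2) / (2.0 * sigma ** 2))

# Hypothetical toy model: means and priors are illustrative only.
SIG_MEAN, BKG_MEAN = (2.0, 2.0), (0.0, 0.0)
P_S, P_B = 0.5, 0.5

def r(x, y):
    """Discriminant r(x,y) = p(x,y|s) p(s) / (p(x,y|b) p(b))."""
    return (gauss2d(x, y, *SIG_MEAN) * P_S) / (gauss2d(x, y, *BKG_MEAN) * P_B)

def posterior_signal(x, y):
    """p(s|x,y) = r / (1 + r)."""
    ratio = r(x, y)
    return ratio / (1.0 + ratio)

def passes_box_cut(x, y, x0=1.0, y0=1.0):
    """Conventional cut, assumed here to keep events with x > x0 and y > y0."""
    return x > x0 and y > y0

def passes_ratio_cut(x, y, r_min=1.0):
    """Cut on the contour r(x,y) = r_min."""
    return r(x, y) > r_min

# An event that the box cut rejects but the ratio cut keeps.
event = (3.0, 0.5)
print(passes_box_cut(*event), passes_ratio_cut(*event), posterior_signal(*event))
```

For this toy model the contour r = 1 is the straight line x + y = 2, so the ratio cut recovers signal-like events that fail one of the two rectangular cuts.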
The NN-Bayesian Connection
Output of a feed forward neural network can approximate the posterior probability P(s|x1,x2).
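[Diagram: network with inputs x_1, x_2 and output y(x_1, x_2, \hat{w})]

$$ y(x, \hat{w}) \approx p(s|x) = \frac{r}{1+r}, \qquad r = \frac{p(x|s)\,p(s)}{p(x|b)\,p(b)} $$

$$ P(C_1|x) = \frac{P(x|C_1)\,P(C_1)}{\sum_i P(x|C_i)\,P(C_i)} $$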
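A small numerical check of this connection is sketched below. It is a toy under stated assumptions, not the analysis from the talk: the "network" is a single sigmoid unit with no hidden layer, trained with 0/1 targets by gradient descent on the cross-entropy, and the data are 1-D Gaussians (signal at +1, background at -1, unit width, equal priors) for which the exact posterior is known. All names and settings are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_sample(n_per_class, rng):
    """Toy data: signal x ~ N(+1, 1) with target 1, background x ~ N(-1, 1) with target 0."""
    data = []
    for _ in range(n_per_class):
        data.append((rng.gauss(1.0, 1.0), 1.0))
        data.append((rng.gauss(-1.0, 1.0), 0.0))
    return data

def train(data, steps=300, lr=0.5):
    """Full-batch gradient descent on the mean cross-entropy of y = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, t in data:
            err = sigmoid(w * x + b) - t  # derivative of the loss w.r.t. (w*x + b)
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def exact_posterior(x):
    """For these Gaussians r = p(x|s)/p(x|b) = exp(2x), so p(s|x) = r / (1 + r)."""
    r = math.exp(2.0 * x)
    return r / (1.0 + r)

rng = random.Random(1)
w, b = train(make_sample(2000, rng))

def network_output(x):
    return sigmoid(w * x + b)

for x in (-1.0, 0.0, 0.5, 1.0):
    print(x, round(network_output(x), 3), round(exact_posterior(x), 3))
```

The trained output should track r/(1+r) to within the statistical precision of the training sample.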
Limitations of “Conventional NN”
• The training yields one set of weights or network parameters
  • Need to look for "best" network, but avoid overfitting
• Heuristic decisions on network architecture
  • Inputs, number of hidden nodes, etc.
• No direct way to compute uncertainties
Ensembles of Networks
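[Diagram: input X feeds networks NN1, NN2, NN3, ..., NNM, with outputs y1, y2, y3, ..., yM]

$$ y(x) = \sum_{i} a_i\, y_i(x) $$

Decision by averaging over many networks (a committee of networks) typically has lower error than the individual networks; for a simple average, its squared error never exceeds the mean squared error of its members.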
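The committee average can be sketched as follows. The "networks" here are hypothetical stand-ins (the true target plus independent Gaussian noise), chosen only to show how averaging reduces the squared error; real networks have correlated errors, so the gain is usually smaller.

```python
import random

def committee_output(outputs, weights=None):
    """y(x) = sum_i a_i y_i(x); uniform a_i = 1/M if no weights are given."""
    m = len(outputs)
    if weights is None:
        weights = [1.0 / m] * m
    return sum(a * y for a, y in zip(weights, outputs))

rng = random.Random(7)
M, N_EVENTS = 10, 2000
member_sq_err = [0.0] * M
committee_sq_err = 0.0

for _ in range(N_EVENTS):
    target = rng.random()  # stand-in for the true p(s|x) of one event
    # Hypothetical member networks: truth plus independent noise.
    outputs = [target + rng.gauss(0.0, 0.1) for _ in range(M)]
    for i, y in enumerate(outputs):
        member_sq_err[i] += (y - target) ** 2
    committee_sq_err += (committee_output(outputs) - target) ** 2

mean_member_mse = sum(member_sq_err) / (M * N_EVENTS)
committee_mse = committee_sq_err / N_EVENTS
print(mean_member_mse, committee_mse)
```

With independent errors the committee error falls roughly as 1/M; the inequality committee_mse <= mean_member_mse holds for any set of members.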
Bayesian Learning
• The result of Bayesian training is a posterior density of the network weights, P(w | training data)
• Generate a sequence of weights (network parameters) in the network parameter space, i.e., a sequence of networks. The optimal network is approximated by averaging over the last K points:

$$ \bar{y}(x_{\text{new}}) = \frac{1}{K} \sum_{k=1}^{K} y(x_{\text{new}}, w_k) $$
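The averaging over the last K sampled networks can be sketched with a deliberately small stand-in. The assumptions are loud ones: the model is a single sigmoid unit with two parameters (w, b), the sampler is plain random-walk Metropolis rather than whatever sampler the talk's analysis used, and the Gaussian prior width, step size, and toy data are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_posterior(params, data, prior_sigma=10.0):
    """log P(w | training data) up to a constant: Gaussian prior plus Bernoulli likelihood."""
    w, b = params
    logp = -(w * w + b * b) / (2.0 * prior_sigma ** 2)
    for x, t in data:
        y = min(max(sigmoid(w * x + b), 1e-12), 1.0 - 1e-12)  # guard the logs
        logp += t * math.log(y) + (1.0 - t) * math.log(1.0 - y)
    return logp

def sample_networks(data, n_iter, rng, step=0.3):
    """Random-walk Metropolis; the parameters are stored after each iteration."""
    current = (0.0, 0.0)
    current_lp = log_posterior(current, data)
    chain = []
    for _ in range(n_iter):
        proposal = (current[0] + rng.gauss(0.0, step), current[1] + rng.gauss(0.0, step))
        lp = log_posterior(proposal, data)
        if math.log(1.0 - rng.random()) < lp - current_lp:  # accept/reject
            current, current_lp = proposal, lp
        chain.append(current)
    return chain

def bayesian_output(x_new, chain, k):
    """Average y(x_new, w_k) over the last k stored networks."""
    last = chain[-k:]
    return sum(sigmoid(w * x_new + b) for w, b in last) / len(last)

rng = random.Random(3)
data = [(rng.gauss(1.0, 1.0), 1.0) for _ in range(100)]
data += [(rng.gauss(-1.0, 1.0), 0.0) for _ in range(100)]
chain = sample_networks(data, n_iter=2000, rng=rng)
print(bayesian_output(2.0, chain, k=500), bayesian_output(-2.0, chain, k=500))
```

Discarding the early part of the chain and averaging only the last K points mirrors the "use last 100 iterations" choice in the single top example.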
Bayesian Learning – 2
• Advantages
  • Less prone to over-fitting
  • Less need to optimize the size of the network. Can use a large network! Indeed, number of weights can be greater than number of training events!
  • In principle, provides best estimate of p(t|x)
• Disadvantages
  • Computationally demanding!
    • The dimensionality of the parameter space is, typically, large
    • There could be multiple maxima in the likelihood function p(t|x,w), or, equivalently, multiple minima in the error function E(x,w).
Example: Single Top Search
• Training Data
  • 2000 events (1000 tqb + 1000 Wbb)
  • Standard set of 11 variables
• Network
  • (11, 30, 1) network (391 parameters!)
• Markov Chain Monte Carlo (MCMC)
  • 500 iterations, but use last 100 iterations
  • 20 MCMC steps per iteration
  • NN-parameters stored after each iteration
  • 10,000 steps
  • ~1000 steps/hour (on 1 GHz Pentium III laptop)
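The numbers quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes a fully connected network with one bias per hidden and output node; the helper name is hypothetical.

```python
def mlp_parameter_count(layers):
    """Weights plus biases of a fully connected feed-forward network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

n_params = mlp_parameter_count((11, 30, 1))  # 11*30 + 30 + 30*1 + 1
total_steps = 500 * 20                       # iterations x MCMC steps per iteration
hours = total_steps / 1000                   # at ~1000 steps/hour
print(n_params, total_steps, hours)          # 391 10000 10.0
```

So the 391 parameters and the 10,000 steps are consistent with the stated architecture and sampling schedule, and the full chain corresponds to roughly ten hours on the quoted laptop.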
Example: Single Top Search
[Figure: distributions for signal (tqb) and background (Wbb)]
Decision Trees
• Recover events that fail criteria in cut-based analyses
• Start at first "node" with a fraction of the "training sample"
• Select best variable and cut with best separation to produce two "branches" of events, (F)ailed and (P)assed cut
• Repeat recursively on successive nodes
• Stop when improvement stops or when too few events are left
• Terminal node is called a "leaf" with purity = Ns/(Ns+Nb)
• Run remaining events and data through the tree to derive results
• Boosting DT:
  • Boosting is a recently developed technique that can improve a weak classifier (decision tree, neural network, etc.)
  • Boosting averages the results of many trees, dilutes the discrete nature of the output, improves the performance
[Figure: DØ single top analysis]
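The node-splitting step can be sketched for a single variable. The separation measure used here, the Gini index, is one common choice and an assumption on my part; the transcript does not say which measure the DØ analysis used. The toy events and function names are hypothetical.

```python
def gini(n_s, n_b):
    """Gini index of a node, n * p * (1 - p), with purity p = Ns / (Ns + Nb)."""
    n = n_s + n_b
    if n == 0:
        return 0.0
    p = n_s / n
    return n * p * (1.0 - p)

def purity(events):
    """Leaf purity Ns / (Ns + Nb) for a list of (x, is_signal) events."""
    n_s = sum(1 for _, is_signal in events if is_signal)
    return n_s / len(events)

def best_cut(events):
    """Scan cuts x > c and return (c, gain) with the largest drop in the Gini index."""
    n_s = sum(1 for _, is_signal in events if is_signal)
    n_b = len(events) - n_s
    parent = gini(n_s, n_b)
    best = (None, 0.0)
    for cut in sorted({x for x, _ in events}):
        passed_s = sum(1 for x, s in events if x > cut and s)
        passed_b = sum(1 for x, s in events if x > cut and not s)
        gain = parent - gini(passed_s, passed_b) - gini(n_s - passed_s, n_b - passed_b)
        if gain > best[1]:
            best = (cut, gain)
    return best

# Hypothetical training sample: (variable value, is_signal)
events = [(0.1, False), (0.3, False), (0.4, True), (0.6, False),
          (0.7, True), (0.8, True), (0.9, True)]
cut, gain = best_cut(events)
passed = [e for e in events if e[0] > cut]   # (P)assed branch
failed = [e for e in events if e[0] <= cut]  # (F)ailed branch
print(cut, purity(passed), purity(failed))
```

A full tree would apply best_cut over all variables and recurse on both branches until the gain vanishes or too few events remain.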
Matrix Element Method
Example: Top mass measurement
• Maximal use of information in each event by calculating event-by-event signal and background probabilities based on the respective matrix element
  x: reconstructed kinematic variables of final state objects
  JES: jet energy scale from MW constraint
• Signal and background probabilities from differential cross sections
• Write combined likelihood for all events
• Maximize likelihood w.r.t. mtop, JES
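The equations on this slide did not survive extraction. As a hedged sketch of the general form such a likelihood usually takes, and not necessarily the exact expressions shown in the talk, the per-event probability can be written as a mixture of signal and background terms. The signal fraction f and the symbols P_sig, P_bkg, and N are notation assumed here for illustration.

```latex
% Per-event probability: mixture of signal and background (f = assumed signal fraction)
P_{\mathrm{evt}}(x;\, m_{\mathrm{top}}, \mathrm{JES}, f)
  = f\, P_{\mathrm{sig}}(x;\, m_{\mathrm{top}}, \mathrm{JES})
  + (1 - f)\, P_{\mathrm{bkg}}(x;\, \mathrm{JES})

% Combined likelihood for N selected events
L(m_{\mathrm{top}}, \mathrm{JES}, f)
  = \prod_{i=1}^{N} P_{\mathrm{evt}}(x_i;\, m_{\mathrm{top}}, \mathrm{JES}, f)

% Estimates obtained by maximizing L, equivalently minimizing -ln L
(\hat{m}_{\mathrm{top}}, \widehat{\mathrm{JES}})
  = \arg\min_{m_{\mathrm{top}},\, \mathrm{JES}} \bigl[ -\ln L(m_{\mathrm{top}}, \mathrm{JES}, f) \bigr]
```

Each P is built from the corresponding differential cross section, normalized so that it integrates to one over the accepted phase space.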
Summary
• Multivariate methods are now used extensively in HEP data analysis
• Neural networks, because of their ease of use and power, are favorites for particle-ID and signal/background discrimination
• Bayesian neural networks take us one step closer to optimization
• Likelihood discriminants and Decision trees are becoming popular because they are easier to “defend” (no “black-box” stigma)
• Many issues remain to be addressed as we get ready to deploy the multivariate methods for discoveries in HEP
Nothing tends so much to the advancement of knowledge as the application of a new instrument. - Humphry Davy
No amount of experimentation can ever prove me right; a single experiment can prove me wrong. - Albert Einstein
World's Highest Energy Laboratory (for now)
[Figure labels: CDF, DØ, Booster]
Our Fancy New Toys
The Large Hadron Collider
• Circumference = 27 km
• Beam Energy = 7.7 TeV
• Luminosity = 1.65 × 10^34 cm^-2 sec^-1
• Startup date: 2007
[Figure labels: LHC Ring, SPS Ring, PS, TI 2, TI 8, p p, LHC Magnet, LHC Tunnel, CMS]
LHC Environment
14 TeV proton-proton colliding beams

Parameter                             Value
Bunch-crossing frequency              40 MHz
Average # of collisions / crossing    20
"Interaction rate"                    ~10^9
Average # of charged tracks           1000
Radiation field                       severe

CMS Parameter                         Value
Level-1 trigger rate                  100 kHz
Mean time between triggers            10 μs
Trigger latency                       3.2 μs
Solenoid field                        4 T
CMS Silicon Tracker
Challenges
CMS Si Tracker
[Figure: tracker layout, 5.4 m × 2.4 m, showing Pixels, Inner Barrel & Disks (TIB & TID), and Outer Barrel (TOB)]
Lots of Silicon
• 214 m² of silicon sensors
• 11.4 million silicon strips
• 66 million pixels!
Si Tracker Challenges
• Large and complex system
  • 77.4 million total channels (out of a total of 78.2 M for experiment)
  • Detector monitoring, data organization, data quality monitoring, analysis, visualization, interpretation all daunting!
• Need to monitor every channel and make sure most of the detector is working at all times (live fraction of the detector and efficiencies bound to decrease with time)
• Need to verify data integrity and data quality for physics
• Diagnose and fix problems ASAP
• Keep calibration and alignment parameters current
Detector/Data Monitoring
• Monitor
  • Environmental variables
    • Temperatures, coolant flow rates, interlocks, radiation doses
  • Hardware status
    • Voltages, currents
  • Channel Data
    • Readout states, errors, missing data/channels, bad ID for channel/module
    • Many kinds of problems to be categorized, tracked, and displayed
    • Should be able to find rare problems/errors (with low occurrence rate) that may corrupt data (rare problems may indicate a developing failure mode or hidden bad behavior)
    • Correlate problem/noisy channels with history, temperature, currents, etc.
Data Quality Monitoring
• Monitor
  • Raw Data
    • Pedestals, noise, ADC counts, occupancies, efficiencies
  • Processed high level objects
    • Clusters, tracks, etc.
• Evaluate thousands of histograms
  • Can't visually examine all
  • Automatically evaluate histograms by comparing to reference histograms
  • Adaptive, efficient, find evolving patterns over time
  • Quantiles? q-q plots/comparison instead of KS test?
• A variety of 2D "heat" maps
  • Occupancies, # of bad channels/module, # of errors/module, etc.
• Typical occupancy ~ 2% in strip tracker
  • 200,000 channels written out 100 times/sec
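The two comparison ideas raised above (KS test versus quantile comparison) can be sketched side by side. This is an illustration only: the noise values are invented Gaussian toy data, the flagging threshold of 0.1 is hypothetical, and the functions work on unbinned samples rather than on the histograms an actual monitoring system would hold.

```python
import random

def ks_statistic(sample, reference):
    """Two-sample KS statistic: maximum distance between the empirical CDFs."""
    a, b = sorted(sample), sorted(reference)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def quantile(sorted_values, q):
    """Simple empirical quantile (no interpolation)."""
    idx = min(int(q * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

def qq_shift(sample, reference, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Differences between matching quantiles, as read off a q-q plot."""
    a, b = sorted(sample), sorted(reference)
    return [quantile(a, q) - quantile(b, q) for q in probs]

rng = random.Random(11)
reference = [rng.gauss(3.0, 0.5) for _ in range(2000)]  # toy reference noise
good_run = [rng.gauss(3.0, 0.5) for _ in range(2000)]   # same conditions
bad_run = [rng.gauss(3.5, 0.5) for _ in range(2000)]    # noise shifted upward

KS_THRESHOLD = 0.1  # hypothetical alarm level
ks_good = ks_statistic(good_run, reference)
ks_bad = ks_statistic(bad_run, reference)
shifts_bad = qq_shift(bad_run, reference)
print(ks_good > KS_THRESHOLD, ks_bad > KS_THRESHOLD, shifts_bad)
```

The KS statistic gives a single yes/no number per histogram, while the quantile differences also say how the distribution moved (here, a uniform upward shift), which is the kind of information that helps track evolving patterns.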
Module Assembly Precision
Example of a "Heat" map
Need smart approaches
• What are the best techniques for data-mining?
  • To organize data for analysis and data visualization
    • Complex geometry/addressing makes visualization difficult
  • For finding problematic channels quickly, efficiently
    • Clustering, exploratory data-mining
  • For finding anomalies, corrupt data, patterns of behavior
    • Feature-finding algorithms, superpose many events, time evolution, spatial and temporal correlations
• Noise Correlations
  • Via correlation coefficients of defined groups
  • Correlate to history (time variations), environmental variables
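Correlating channel noise with an environmental variable can be sketched with a plain Pearson correlation coefficient. The temperature series, the two channels, and the size of the temperature dependence are all invented for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(5)
# Hypothetical history: 500 readings of sensor temperature (deg C).
temperature = [rng.uniform(-10.0, 0.0) for _ in range(500)]
# Channel A: noise rises with temperature; channel B: noise unrelated to it.
noise_a = [2.0 + 0.1 * t + rng.gauss(0.0, 0.1) for t in temperature]
noise_b = [2.0 + rng.gauss(0.0, 0.1) for _ in temperature]

corr_a = pearson(noise_a, temperature)
corr_b = pearson(noise_b, temperature)
print(round(corr_a, 2), round(corr_b, 2))
```

Applied per channel or per defined group of channels, a large coefficient singles out the ones whose noise follows temperature, current, or time history.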
Data Visualization
• Based on hierarchical/geometrical structure of the tracker
• Display every channel, attach objects/info to each
  • Sub-structures
  • Layers/rings
  • Modules
  • Readout Chips
Multivariate Analysis Issues
• Dimensionality Reduction
  • Choosing variables optimally without losing information
• Choosing the right method for the problem
• Controlling Model Complexity
• Testing Convergence
• Validation
  • Given a limited sample what is the best way?
• Computational Efficiency
Multivariate Analysis Issues
• Correctness of modeling
  • How do we make sure the multivariate modeling is correct?
    • The data used for training or building PDEs represent reality.
    • Is it sufficient to check the modeling in the mapped variable? Pair-wise correlations? Higher order correlations?
  • How do we show that the background is modeled well? How do we quantify the correctness of modeling?
    • In conventional analysis, we normally look for variables that are well modeled in order to apply cuts
  • How well is the background modeled in the signal region?
• Worries about hidden bias
• Worries about underestimating errors
Sociological Issues
• We have been conservative in the use of MV methods for discovery.
• We have been more aggressive in the use of MV methods for setting limits.
• But discovery is more important and needs all the power you can muster!
• This is expected to change at LHC.
Summary
• The next generation of experiments will need to adopt advanced data mining and data analysis techniques
• Conventional/routine tasks such as alignment, detector performance and data quality monitoring and data visualization will be challenging and require new approaches
• Many issues regarding use of multivariate methods of data analysis for discoveries and measurements need to be addressed to make optimal use of data
MV: Where can we use them?
• Almost everywhere since HEP events are multivariate
• Improve several aspects of analysis
  • Event selection
    • Triggering, Real-time Filters, Data Streaming
  • Event reconstruction
    • Tracking/vertexing, particle ID
  • Signal/Background Discrimination
    • Higgs discovery, SUSY discovery, Single top, …
  • Functional Approximation
    • Jet energy corrections, tag rates, fake rates
  • Parameter estimation
    • Top quark mass, Higgs mass, SUSY model parameters
  • Data Exploration
    • Knowledge Discovery via data-mining
    • Data-driven extraction of information, latent structure analysis